Abstract

Robust long-term visual localization is one of the main challenges of long-term visual navigation for mobile robots. Because of factors such as illumination, weather, and season, a mobile robot that navigates continuously with visual information in a complex scene can fail to localize within a few hours. Semantic segmentation images, however, are more stable than the original images under drastic environmental change. To exploit the advantages of both the semantic segmentation image and its original image, this paper builds on recent work in semantic segmentation and proposes a novel hybrid descriptor for long-term visual localization, generated by combining a semantic image descriptor extracted from segmentation images and an image descriptor extracted from RGB images with a certain weight, and then trained by a convolutional neural network. Our experiments show that, by combining the advantages of the semantic image descriptor and the image descriptor, our method outperforms long-term visual localization methods that use only an image descriptor or only a semantic image descriptor. Finally, under various challenging environmental conditions in the Extended CMU Seasons and RobotCar Seasons datasets, our results mostly exceed those of state-of-the-art 2D image-based localization methods at specific precision metrics.

1. Introduction

Visual localization is a key component of SLAM for mobile robots, helping the robot determine its approximate position and orientation. In GPS-constrained environments, it plays a vital role in navigation for mobile robots [1]. When a robot performs visual navigation, it usually builds an environmental map from scene representations acquired under particular environmental conditions. However, because of weather, illumination, and seasonal changes, when the robot moves over a large range for a long time, the environmental conditions of the currently observed scene image may differ greatly from those in the map. Therefore, long-term visual localization methods need to deal with all these appearance variations [2, 3]. Visual localization under such challenging environments has attracted extensive attention from researchers [4–8]. This paper therefore focuses on the long-term visual localization problem for robots in complex environments, namely, finding the pose of the image in the currently constructed map that is most similar to the currently observed image. With this coarse global localization, an initial pose can be provided for the regression of a high-precision 6-DOF camera pose by hierarchical localization methods [9–11].

Traditional methods such as SIFT, SURF, and ORB rely on point descriptors for visual localization. Recently, global image descriptors [12] extracted by deep convolutional neural networks have outperformed these traditional point descriptors. However, such methods only aggregate features over image regions without considering the semantic information contained in them.

Intuitively, one of the main challenges for mobile robots performing long-term work is still to obtain a representation of images that is stable under changing conditions. Semantic information about objects in the scene, extracted by semantic segmentation or object detection, can provide such an invariant representation. For example, the semantic label of a tree does not change whether or not the tree is covered with snow, so visual localization methods that use semantic information have attracted the attention of researchers [4–7, 13].

In summary, to improve the accuracy of long-term visual localization for mobile robots in complex and changing environments, we propose a novel long-term visual localization method based on a hybrid descriptor, generated by combining a semantic image descriptor extracted from segmentation images with an image descriptor extracted from RGB images. However, the performance of CNN-based semantic segmentation depends heavily on semantic labels, which are expensive and time-consuming to obtain. Therefore, to reduce the large cost of manual labeling, this paper introduces 3D geometric consistency supervision into the training of the segmentation network PSPNet [14], so that the segmentation results for the same scene under changing environmental conditions are generally consistent. Finally, the effectiveness of the method is verified on the Extended CMU Seasons and RobotCar Seasons datasets. The contributions of this paper are as follows:
(i) A new long-term visual localization method based on a hybrid descriptor is proposed; the compact hybrid descriptor is generated by concatenating a semantic image descriptor extracted from semantic segmentation images and an image descriptor extracted from RGB images with a certain weight and is then trained by a convolutional neural network.
(ii) 3D geometric consistency supervision is introduced into the training of the segmentation network PSPNet to obtain the semantic labels of the training and testing datasets with little labor cost.
(iii) We show that the localization performance of our method, which combines the advantages of the semantic image descriptor and the image descriptor, is superior to that of long-term visual localization methods using only an image descriptor or only a semantic image descriptor. In addition, our method is comparable to state-of-the-art 2D image-based localization methods under various challenging environmental conditions in the Extended CMU Seasons and RobotCar Seasons datasets at specific precision metrics.

The organizational structure of this paper is as follows. Section 1 introduces the research background and defines the challenges addressed and the contributions of our method. Section 2 reviews related work, mainly covering semantic segmentation, domain adaptation, and long-term visual localization in changing environments. Section 3 describes the network architecture and the loss functions of our method in detail. Section 4 presents the experimental protocol, including datasets, evaluation metrics, experimental results, and the analysis of those results. Section 5 summarizes the work of this paper and outlines future work.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation is the task of assigning a category label to each pixel of an input image and is an important task for the visual perception of mobile robots. Early work mainly used manually designed descriptors or probabilistic graphical models to perform semantic segmentation. In recent years, semantic segmentation based on deep convolutional neural networks has been shown to outperform these traditional methods. The pioneering work of Long et al. [15] showed that convolutional neural networks (CNNs) originally used for classification, such as AlexNet or VGG, can be transformed into fully convolutional networks (FCNs) for semantic segmentation. Subsequent work improved on the network structure of [15], for example, by expanding the receptive field [16, 17], making full use of global context information [14], or fusing multiscale features [18, 19]. In addition, some work combined FCNs [15] with probabilistic graphical models such as conditional random fields as a postprocessing step [16].

However, the performance of CNN-based semantic segmentation depends heavily on semantic labels, which are expensive and time-consuming to obtain. To address this, many weakly supervised methods have been proposed that use labels in forms such as bounding boxes [20], image-level tags [21], or points [22]. In addition, [23] obtains semantic labels in a semiautomatic way that requires far lower manual cost than pixel-level annotation while improving segmentation performance. This paper adopts a method similar to [23] to obtain the segmentation maps of Mapillary Street Level Sequences [24].

2.2. Domain Adaptation

Training deep learning models requires a large amount of labeled data, but manually labeling such data is time-consuming and laborious. When pixel-level annotation is available in a source task, domain adaptation methods aim to transfer the knowledge of the source task so that the model also performs well on the target task. Early work includes [25, 26], which transforms the features of the target domain into the source feature domain [25] or into a domain-invariant feature space [26]. Other researchers focus on domain adaptation for CNN models [27, 28]. These methods mainly aim to make the learned model produce domain-invariant features, either by training the network with an adversarial loss to promote confusion between the source and target domains [27] or by keeping the feature distributions of the source and target domains consistent [28]. Recently, several domain adaptation methods have been proposed for semantic segmentation [29–33]. Most of them [29–31] use synthetic datasets, such as [34], which can automatically generate a large number of annotated synthetic images. The method proposed in [31] uses an image-translation technique that converts images from the source domain into the target domain before performing segmentation. Another common approach is to train the network with an adversarial loss, as in [32], where the network learns to fool a domain discriminator so that the feature distributions of the two domains become roughly the same.

Although domain adaptation methods can also produce semantic labels for the training and test datasets, their performance is limited. We therefore introduce 3D geometric consistency as a supervision signal in the training of the segmentation network PSPNet to obtain the semantic labels of the training and test datasets used in this paper, so that the semantic labels of images of the same scene under different environmental conditions are generally consistent.

2.3. Long-Term Visual Localization

Because the benchmark datasets proposed in [2, 3] are challenging and the visual localization evaluation metrics they provide are convincing, they have greatly promoted the research and development of long-term visual localization. Current long-term visual localization methods generally include sequence-based image retrieval methods [35], learning-based local feature localization methods [36, 37], 3D structure-based localization methods [38–40], 2D image-based localization methods [5, 12, 41–45], and hierarchical localization methods [9–11].

2D image-based localization methods have great advantages in robustness and efficiency, so this paper focuses on them. These methods do not use any form of 3D reasoning to compute the pose of the query image and are usually used for place recognition or loop closure detection in visual SLAM. Given a set of map images with known camera poses, a 2D image-based localization method approximates the pose of the currently observed image (i.e., the query image) by the pose of the map image whose visual appearance is most similar. Since 2D image-based localization methods generally perform well only at coarse precision, hierarchical localization methods use the initial pose they provide to further regress a high-precision 6-DOF camera pose.

VLAD [46] is a classic method for 2D image-based localization or place recognition under ideal conditions, but it is not robust enough for long-term visual localization under dramatically changing conditions. Building on it, DenseVLAD [41] uses VLAD to aggregate RootSIFT descriptors for matching in visual localization tasks. Subsequently, 2D image-based long-term visual localization has advanced considerably with the help of CNN models; for example, NetVLAD [12] integrates the traditional VLAD algorithm into a CNN architecture to achieve end-to-end visual localization.

To improve visual localization in complex environments, many works exploit semantic information [5, 13, 23, 44], context information [45], and depth information [5, 47] within a convolutional neural network architecture to learn scene descriptors that are invariant to environmental conditions. However, such auxiliary information usually requires large labeling costs to obtain. This paper also makes full use of semantic information as auxiliary information to overcome the impact of illumination, weather, and season on visual localization. To reduce the manual labeling cost of obtaining this auxiliary information, we introduce 3D geometric consistency as a supervision signal in the training of the segmentation network PSPNet to obtain the semantic labels of the training and testing datasets used in this paper.

3. Proposed Method

3.1. Network Model Structure

Figure 1 shows the network model structure of the long-term visual localization method designed in this paper. First, this paper adopts a method similar to [23] to obtain the segmentation images. The specific steps are as follows: (1) the 2D-2D matches provided by [23], composed of pairs of images of the same scene taken under different conditions, constrain the training of the segmentation network PSPNet, i.e., the segmentation maps of two images of the same scene should be consistent, and (2) using the cross-season correspondence dataset of [23], a small set of coarsely annotated images, and the correspondence loss, the segmentation results for images of the same scene become roughly consistent under changing environmental conditions. For more details, please refer to [23].

The network model structure consists of four parts: (1) training a VGG16 network to extract a 16K (1-dimensional) semantic image descriptor from segmentation images, (2) training another VGG16 network to extract a 16K (1-dimensional) image descriptor from RGB images, (3) concatenating the semantic image descriptor and the image descriptor with weights w_s and w_i, respectively, to obtain a 16K (1-dimensional) hybrid descriptor, and (4) training a convolutional neural network composed of three convolutional layers and two fully connected layers, which converts the 16K (1-dimensional) hybrid descriptor into a 1024 (1-dimensional) learning hybrid descriptor for visual localization.
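To make the structure concrete, the following PyTorch sketch illustrates one plausible implementation of this pipeline under stated assumptions: the global pooling step, the 1-D convolutional compression network, the layer sizes other than the 16K/1024 dimensions given in the text, and the way the two 16K descriptors are combined with weights w_s and w_i (shown here as a weighted sum, one reading of the weighted combination described above) are illustrative and not the authors' exact architecture.

import torch
import torch.nn as nn
import torchvision.models as models


class DescriptorBranch(nn.Module):
    """VGG16 backbone mapping an input image to a 16K-dimensional descriptor."""

    def __init__(self, out_dim=16384):
        super().__init__()
        vgg = models.vgg16(weights=None)      # one branch for RGB images, one for segmentation images
        self.features = vgg.features          # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)    # global pooling (assumed aggregation step)
        self.fc = nn.Linear(512, out_dim)      # project to the 16K descriptor

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)


class CompressionNet(nn.Module):
    """Three conv layers + two FC layers mapping the 16K hybrid descriptor to 1024 dims."""

    def __init__(self, in_dim=16384, out_dim=1024):
        super().__init__()
        self.convs = nn.Sequential(            # 1-D convolutions over the descriptor (assumed)
            nn.Conv1d(1, 8, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(8, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 16, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.fcs = nn.Sequential(
            nn.Linear(16 * (in_dim // 8), 4096), nn.ReLU(),
            nn.Linear(4096, out_dim),
        )

    def forward(self, d):
        d = self.convs(d.unsqueeze(1)).flatten(1)
        return self.fcs(d)


def learning_hybrid_descriptor(rgb, seg, rgb_branch, seg_branch, compressor, w_i=0.7, w_s=0.3):
    """Weighted combination of the two 16K descriptors, then compression to 1024 dims."""
    d_img = rgb_branch(rgb)                    # image descriptor from the RGB image
    d_sem = seg_branch(seg)                    # semantic image descriptor from the segmentation image
    hybrid = w_i * d_img + w_s * d_sem         # one reading of "combined with a certain weight"
    return compressor(hybrid)                  # 1024-D learning hybrid descriptor


# Usage with dummy inputs (the weight values here are illustrative, not the paper's):
rgb_branch, seg_branch, compressor = DescriptorBranch(), DescriptorBranch(), CompressionNet()
desc = learning_hybrid_descriptor(torch.randn(1, 3, 480, 640), torch.randn(1, 3, 480, 640),
                                  rgb_branch, seg_branch, compressor)
print(desc.shape)  # torch.Size([1, 1024])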

When the mobile robot builds the environment map incrementally, each observed image is processed with our method to generate a 1024 (1-dimensional) learning hybrid descriptor for the visual localization module. This descriptor is feature data containing an invariant representation of the image, so the environment map built by the robot exists in the form of a feature database. The main task of the visual localization module is to continuously measure the L1 distance between the feature data generated from the currently observed image by our method and the entries of the feature database, and the pose of the currently observed image is approximated by the known pose of the candidate image with the smallest distance.
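As a minimal illustration of this retrieval step, the sketch below matches a query descriptor against the feature database by L1 distance and returns the pose of the closest candidate; the array layout and function name are our own assumptions.

import numpy as np


def localize(query_desc, db_descs, db_poses):
    """query_desc: (1024,), db_descs: (N, 1024), db_poses: list of N known poses."""
    l1 = np.abs(db_descs - query_desc[None, :]).sum(axis=1)   # L1 distance to every map entry
    best = int(np.argmin(l1))                                 # most similar candidate image
    return db_poses[best], l1[best]


# Usage with random data standing in for the feature database:
poses = [f"pose_{i}" for i in range(100)]
pose, dist = localize(np.random.rand(1024), np.random.rand(100, 1024), poses)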

3.2. Loss Function

As shown in Figure 1, this paper optimizes two types of loss functions: the total loss of the segmentation task, used to obtain semantic segmentation images, and the triplet loss, used for the visual localization task.

3.3. Total Loss for Segmentation Task

In order to obtain the high-quality segmentation images used in this paper with little labor cost, we introduce 3D geometric consistency as a supervision signal in the training of the segmentation network PSPNet and optimize a loss function for the segmentation task that is composed of the standard cross-entropy loss and a correspondence loss. The total loss function for the segmentation task, L_{seg}, is defined as follows:

L_{seg} = L_{ce} + \lambda L_{corr},

where \lambda is the weight, which we set to 1, and the correspondence loss is designed as follows:

L_{corr} = \sum_{(p_r, p_t) \in C} \ell(I_r, I_t, p_r, p_t),

where I_r is the reference traversal image, I_t is the target traversal image, p_r and p_t are the pixel positions of matching points in the reference and target traversal images, respectively, and C is the set of 2D-2D matches. \ell is the hinge loss or the correspondence cross-entropy loss; for more details about \ell, please refer to [23].
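A minimal sketch of how this objective could be computed is given below, assuming the cross-entropy variant of the correspondence loss: the predictions of the reference image at matched pixels act as pseudo-labels for the target image. The tensor layouts, the averaging over matches, and all names are assumptions rather than the exact formulation of [23].

import torch
import torch.nn.functional as F


def correspondence_loss(logits_ref, logits_tgt, pts_ref, pts_tgt):
    """logits_*: (C, H, W) network outputs; pts_*: (M, 2) matched pixel coords (row, col)."""
    # Reference predictions at matched pixels act as pseudo-labels for the target image.
    ref_labels = logits_ref[:, pts_ref[:, 0], pts_ref[:, 1]].argmax(dim=0)   # (M,)
    tgt_logits = logits_tgt[:, pts_tgt[:, 0], pts_tgt[:, 1]].permute(1, 0)   # (M, C)
    return F.cross_entropy(tgt_logits, ref_labels)


def segmentation_loss(logits_ref, labels_ref, logits_tgt, pts_ref, pts_tgt, lam=1.0):
    """Total loss: cross-entropy on annotated pixels plus weighted correspondence loss (lambda = 1)."""
    ce = F.cross_entropy(logits_ref.unsqueeze(0), labels_ref.unsqueeze(0), ignore_index=255)
    corr = correspondence_loss(logits_ref.detach(), logits_tgt, pts_ref, pts_tgt)
    return ce + lam * corr


# Example with dummy data: 19 classes, 64x64 predictions, 50 matched pixel pairs.
C, H, W, M = 19, 64, 64, 50
pts = torch.randint(0, 64, (M, 2))
loss = segmentation_loss(torch.randn(C, H, W), torch.randint(0, C, (H, W)),
                         torch.randn(C, H, W), pts, pts)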

3.4. Triplet Loss

In order to guide the model to learn robust descriptors for visual localization, we construct a triplet loss when training one VGG16 network to extract the 16K (1-dimensional) semantic image descriptor from segmentation images and another VGG16 network to extract the 16K (1-dimensional) image descriptor from RGB images. To enhance the robustness of the descriptor used for visual localization, this paper draws tuples from the training dataset to train the network. Each tuple is composed of an anchor image q_a, a positive sample q_p (i.e., the same scene as the anchor image), and negative samples q_{n_j} (i.e., different scenes from the anchor image). To decrease the distance between positive pairs and increase the distance between negative pairs, the triplet loss used in this paper is as follows:

L_{triplet} = \sum_{j} \max(0, d(q_a, q_p) - d(q_a, q_{n_j}) + m),

where m denotes the margin, which we set to 0.1, d(\cdot, \cdot) is the distance between descriptors, and q_a, q_p, and q_{n_j} refer to the cached embeddings for the anchor, positive, and negative images, respectively.
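The following sketch computes this loss for one tuple of cached embeddings, assuming Euclidean distance between descriptors (the distance used inside the loss is not specified in the text); the names are illustrative.

import torch
import torch.nn.functional as F


def triplet_loss(q_a, q_p, q_ns, margin=0.1):
    """q_a, q_p: (D,) anchor/positive embeddings; q_ns: (N, D) negative embeddings."""
    d_pos = F.pairwise_distance(q_a.unsqueeze(0), q_p.unsqueeze(0))        # (1,)
    d_neg = F.pairwise_distance(q_a.unsqueeze(0).expand_as(q_ns), q_ns)    # (N,)
    return F.relu(d_pos - d_neg + margin).sum()                            # hinge summed over negatives


# Example with one anchor, one positive, and 15 negatives of dimension 1024:
loss = triplet_loss(torch.randn(1024), torch.randn(1024), torch.randn(15, 1024))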

4. Experiment

This section describes the experimental protocol in detail, including the experimental datasets, experimental settings, evaluation metrics, comparison models, experimental results, a visualization experiment, and an ablation experiment.

4.1. Experimental Dataset
4.1.1. Training Dataset

Mapillary Street Level Sequences [24] is currently the most diverse publicly available dataset for long-term place recognition. It covers 30 major cities across six continents, from Tokyo to San Francisco, over more than seven years, and contains more than 1.6 million images collected from the Mapillary platform, exhibiting large perceptual changes due to dynamic objects, seasons, regions, weather, cameras, and illumination.

4.1.2. Testing Dataset

The Extended CMU Seasons dataset [2] is a subset of the CMU Visual Localization dataset [48]. It records scene images in a variety of challenging environments, including suburban, urban, and park areas of Pittsburgh in the United States, over more than a year. The dataset contains 1 reference traversal and 11 query traversals; the environmental condition of the reference traversal images is Sunny + No Foliage. The 11 query traversals cover different regional environments (suburban, urban, and park), different vegetation conditions (no foliage, mixed foliage, and foliage), and different weather conditions (sunny, cloudy, low sun, overcast, and snow).

The RobotCar Seasons dataset [2] is derived from the publicly available Oxford RobotCar dataset [49], which records scene images under various changing conditions in Oxford, UK, over one year. The dataset contains 1 reference traversal (recorded under overcast conditions) and 9 query traversals covering 5 weather conditions (snow, dusk, sun, dawn, and rain), 2 seasons (overcast winter and overcast summer), and 2 night conditions (night and night rain). The latter two query traversals constitute the Night All condition, which contrasts in illumination with the Day All condition formed by the previous seven query traversals.

4.1.3. Evaluation Metrics

Reference [2] hosts a performance evaluation server for comparing different visual localization methods, which has attracted extensive attention from researchers. Our experiments therefore use the metrics of [2] to test the localization performance of the proposed method. We upload the 6-DOF pose files of the query images obtained by our method to this server and obtain the performance results and rankings on the public evaluation website. The website uses three precision metrics: high precision (0.25 m, 2°), medium precision (0.5 m, 5°), and coarse precision (5 m, 10°). It reports the percentage of query poses whose error falls within each threshold to evaluate the performance of different visual localization methods.
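For reference, the sketch below shows how such threshold-based recall can be computed from estimated and ground-truth poses; the rotation-error formula is the standard angle of the relative rotation, but the helper names and structure are our own illustration, not the benchmark's evaluation code.

import numpy as np

THRESHOLDS = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]   # high / medium / coarse precision


def pose_errors(R_est, t_est, R_gt, t_gt):
    """Translation error in metres and rotation error in degrees between two poses."""
    t_err = np.linalg.norm(t_est - t_gt)
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos))
    return t_err, r_err


def recall_at_thresholds(errors):
    """errors: list of (t_err, r_err); returns the percentage within each threshold."""
    errors = np.asarray(errors)
    return [100.0 * np.mean((errors[:, 0] <= t) & (errors[:, 1] <= r))
            for t, r in THRESHOLDS]


# Example: identity rotations, 1 cm apart -> counted in all three precision bins.
errs = [pose_errors(np.eye(3), np.zeros(3), np.eye(3), np.array([0.01, 0.0, 0.0]))]
print(recall_at_thresholds(errs))   # [100.0, 100.0, 100.0]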

4.2. Comparison Models

In the experiment, four typical state-of-the-art methods are selected as comparison models, described as follows. NetVLAD [12] realizes 2D image-based localization by integrating the classic VLAD algorithm into a CNN model. DenseVLAD [41] realizes 2D image-based localization by using VLAD to aggregate RootSIFT descriptors. WASABI [32] proposes a global image descriptor that integrates semantic and topological information, constructed by a wavelet transform of semantic edges, and realizes 2D image-based localization by matching these semantic-edge representations. DIFL [29] introduces a feature consistency loss to train an encoder to generate domain-invariant features in a self-supervised manner for 2D image-based localization.

4.3. Experimental Setting

In this paper, the Mapillary Street Level Sequences dataset and the segmentation maps obtained by the method similar to [23] are used for model training. We adopt the Extended CMU Seasons and RobotCar Seasons datasets to test the performance of our method, and the test results are uploaded to the long-term visual localization evaluation website provided by [2, 3] (https://www.visuallocalization.net/).

We implemented the proposed method in PyTorch on a computer with two 2080 Ti GPUs. The training images were resized to 640 × 480, and we trained on Mapillary Street Level Sequences using the ADAM optimizer with a batch size of 8 tuples (each containing no more than 15 negative samples). The initial learning rate was set to 0.0002, the margin to 0.1, and the number of epochs to 30.
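The following self-contained sketch mirrors these settings (ADAM, learning rate 0.0002, margin 0.1, 30 epochs, batches of 8 tuples with up to 15 negatives); the tiny linear stand-in model and random tensors replace the real descriptor network and the Mapillary data loader.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(512, 1024)                        # placeholder for the descriptor network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
margin, epochs = 0.1, 30

for epoch in range(epochs):
    for _ in range(10):                             # placeholder for iterating batches of tuples
        anchor = model(torch.randn(8, 512))         # 8 tuples per batch
        positive = model(torch.randn(8, 512))
        negatives = model(torch.randn(8, 15, 512))  # up to 15 negatives per tuple
        d_pos = (anchor - positive).norm(dim=1, keepdim=True)    # (8, 1)
        d_neg = (anchor.unsqueeze(1) - negatives).norm(dim=2)    # (8, 15)
        loss = F.relu(d_pos - d_neg + margin).sum() # triplet loss over all negatives
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()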

4.4. Testing Experiment on the Extended CMU Seasons Dataset

The testing files produced by the proposed method were uploaded to the visual localization evaluation website described above. Several state-of-the-art 2D image-based localization methods on this website were selected to compare with our method under different regional environments, vegetation conditions, and weather conditions.

4.5. Testing under Different Regional Environments

Table 1 shows the localization performance of the proposed method and the selected comparison models under different regional environments in the Extended CMU Seasons dataset. When a mobile robot moves over a large range, the environmental conditions of the currently observed scene may differ significantly from those seen previously, so a long-term visual localization method needs to cover as many regional environments as possible. The testing environments selected in the experiment therefore include three typical regional environments: urban, suburban, and park. According to Table 1, for the suburban and park environments, the proposed model is 14.59% and 14.62% higher, respectively, than the best state-of-the-art baselines under the coarse precision metric. In the urban environment, our model ranks second under the coarse precision metric and performs best in all other cases.

The proposed method is clearly strongest in the park and suburban environments; its advantage is weaker in the urban environment, but it is still more competitive than the other state-of-the-art baselines. This is mainly because the park and suburban environments contain a large number of trees and other static objects, whose semantic information the proposed model can fully exploit to improve localization accuracy, which is why the largest improvements appear under the coarse precision metric in these two regional environments. In the urban environment, however, the semantic information of the same scene changes because of the large number of dynamic objects such as pedestrians and cars, which affects the performance of our model.

In summary, the performance of the proposed method in different regional environments is significantly better than that of the selected representative existing methods. The proposed model therefore plays a positive role in long-term localization for mobile robots, especially in park and suburban environments.

4.6. Testing under Different Vegetation Conditions

Table 2 shows the results for environments with different vegetation conditions in the Extended CMU Seasons dataset. For the two complex vegetation conditions, mixed foliage and foliage, the proposed method is the most robust among all state-of-the-art baselines at all three precision metrics.

This result is particularly encouraging because the varying vegetation conditions, with changes in the type, number, and position of leaves, are among the most challenging conditions in the Extended CMU Seasons dataset. The strong localization performance of the proposed method is mainly due to the segmentation images used in this paper, which give the extracted environmental features and the constructed scene descriptor a more invariant representation; this has practical value for mobile robots performing long-term outdoor navigation.

4.7. Testing under Different Weather Conditions

Mobile robots inevitably face weather variations when they work for a long time. Therefore, in addition to testing under different regional environments and vegetation conditions in the Extended CMU Seasons dataset, we also test under different weather conditions. The experimental results are shown in Table 3. Our model outperforms the other state-of-the-art baselines under most weather conditions, and it performs especially well under the overcast and low sun conditions at all three precision metrics.

4.8. Testing Experiment on RobotCar Seasons Dataset

The experiments in this section use the same procedure as the Extended CMU Seasons experiments in Section 4.4 to verify the effectiveness of our method, and we select several state-of-the-art 2D image-based localization methods from the evaluation website for performance comparison. The experimental environments cover two kinds of changing conditions: different weather conditions and different illumination conditions.

4.9. Testing under Different Weather Conditions

Table 4 compares the robustness of the proposed model and three existing comparison models under different weather conditions in the RobotCar Seasons dataset. The proposed method achieves the best localization performance at the medium and high precision metrics under the snow condition. In addition, although the RobotCar Seasons dataset contains a large number of dynamic objects such as pedestrians and cars, which change the semantic information of the same scene, our model still achieves decent results.

4.10. Testing under Different Illumination Conditions

We also conducted experiments under different illumination conditions; the results are shown in Table 5. Compared with the other three comparison models, the proposed model improves by 24.00% and 24.62% at the high and medium precision metrics, respectively, under night conditions. This is particularly encouraging because night conditions are the most challenging environmental conditions in the RobotCar Seasons dataset.

4.11. Visualization Experiment

For the Extended CMU Seasons and RobotCar Seasons datasets, we obtained segmentation images using the self-supervised method similar to [23], as shown in Figures 2 and 3, respectively. Because the ultimate purpose of this paper is to take the obtained segmentation images as input and train the model to generate a semantic image descriptor that improves visual localization, the predicted segmentations of the same scene under different environments in the two datasets should be consistent. As can be seen from Figures 2 and 3, the predicted segmentations of the same scene under different environments in the Extended CMU Seasons and RobotCar Seasons datasets are indeed generally consistent.

4.12. Computational Efficiency of the Proposed Algorithm

Considering computational efficiency, we also measure the average time needed to retrieve a query from databases of different sizes, focusing on the time to match the query image against the database images. The RobotCar Seasons and Extended CMU Seasons datasets are used in this experiment, with 20,862 and 10,338 database images, respectively. As shown in Table 6, querying each frame takes 31.42 ms on the Extended CMU Seasons dataset (about 10k database images), i.e., the algorithm processes about 32 frames per second on average, while querying each frame takes 52.17 ms on the RobotCar Seasons dataset (about 20k database images), i.e., about 19 frames per second on average. Since mobile robots typically acquire images at 15 frames per second and a large number of invalid frames must be removed, our algorithm has practical value for mobile robots.
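A simple way to reproduce this kind of measurement is sketched below: it times L1 nearest-neighbour retrieval of random 1024-D descriptors over databases of roughly the two sizes above. Absolute timings will of course depend on hardware and implementation, so the numbers will differ from Table 6.

import time
import numpy as np


def average_query_time(n_db, dim=1024, n_queries=100):
    """Average wall-clock time (ms) to match one query descriptor against n_db database entries."""
    db = np.random.rand(n_db, dim).astype(np.float32)
    queries = np.random.rand(n_queries, dim).astype(np.float32)
    start = time.perf_counter()
    for q in queries:
        np.argmin(np.abs(db - q).sum(axis=1))     # L1 nearest neighbour
    return 1000.0 * (time.perf_counter() - start) / n_queries


print(average_query_time(10338))   # Extended CMU Seasons database size
print(average_query_time(20862))   # RobotCar Seasons database size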

4.13. Ablation Study

A VGG16 network is trained on semantic segmentation images to extract a 16K (1-dimensional) semantic image descriptor, and another VGG16 network is trained on RGB images to extract a 16K (1-dimensional) image descriptor; the two are concatenated with weights w_s and w_i, respectively. This section therefore discusses the influence of different weights w_s and w_i on the performance of the proposed method. The experimental results are shown in Table 7.

Setting (w_s, w_i) = (1, 0) means that only the VGG16 network trained on semantic segmentation images is used, so the 16K (1-dimensional) semantic image descriptor alone is used for visual localization, while (w_s, w_i) = (0, 1) means that only the VGG16 network trained on RGB images is used, so the 16K (1-dimensional) image descriptor alone is used. Park, suburban, and urban in Table 7 are the regional environmental conditions of the Extended CMU Seasons dataset, while Day All and Night All are the illumination conditions of the RobotCar Seasons dataset. As can be seen from Table 7, the best localization performance for the regional conditions of the Extended CMU Seasons dataset and for the illumination conditions of the RobotCar Seasons dataset is obtained at different weight settings, but in both cases with a larger weight on the image descriptor than on the semantic image descriptor; that is, trusting the image descriptor more than the semantic image descriptor yields the best results on both datasets, and the semantic image descriptor contributes more under the illumination conditions of the RobotCar Seasons dataset than under the regional conditions of the Extended CMU Seasons dataset. Table 7 also shows that, for both sets of conditions, localization is not ideal when solely using the semantic image descriptor trained on semantic segmentation images or solely using the image descriptor trained on RGB images.

5. Conclusion

To address the robustness challenges faced by mobile robots performing long-term work under complex changing conditions, a new long-term visual localization method based on a hybrid descriptor has been proposed. The compact hybrid descriptor is generated by concatenating, with a certain weight, a semantic image descriptor extracted from semantic segmentation images and an image descriptor extracted from RGB images, and is then trained by a convolutional neural network. We verify that the localization performance obtained by solely using a semantic image descriptor trained on semantically segmented images, or solely using an image descriptor trained on RGB images, is worse than that obtained with the hybrid descriptor that combines both with a certain weight. The model was trained on the Mapillary Street Level Sequences dataset and tested on the Extended CMU Seasons and RobotCar Seasons datasets. The experimental results verify that the localization performance of the proposed method is significantly better than that of other state-of-the-art baselines on these datasets under different regional, vegetation, weather, and illumination conditions, and it can meet the requirements of long-term visual localization for mobile robots in a variety of complex environments.

The performance of the proposed visual localization method depends on the performance of the chosen semantic segmentation method. In addition, the depth information of objects in the same scene has been shown to remain stable under changing environmental conditions. In future work, we will therefore integrate depth information to handle visual variation between images and explore the impact of different semantic segmentation methods on the performance of the proposed method.

Data Availability

All data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Key Area Research Projects of Universities of Guangdong Province under Grant 2019KZDZX1026, in part by the Natural Science Foundation of Guangdong Province under Grant 501100003453, in part by the Innovation Team Project of Universities of Guangdong Province under Grant 2020KCXTD015, and in part by the Free Exploration Foundation of Foshan University under Grant 2020ZYTS11.