Abstract

Visual place recognition (VPR) is considered one of the most challenging problems in robotics due to extreme variations in appearance and viewpoint. Essentially, appearance-based VPR can be treated as an image retrieval task, so the key is to describe images accurately and efficiently. Recently, global descriptor methods have attracted substantial attention from the VPR community and have contributed to numerous important outcomes. Despite the growing number of global descriptors, little attention has been paid to comparing and evaluating these methods, so it remains difficult for researchers to disentangle the factors that lead to better performance. This study provides comprehensive insight into global descriptors from a practical application perspective. We present a systematic evaluation that integrates 15 commonly used global descriptors, 6 benchmark datasets, and 5 evaluation metrics, and we extend this evaluation to discuss the key factors affecting matching performance and computational efficiency. Based on the experimental conclusions, we also report practical suggestions for constructing promising CNN descriptors. Our analysis reveals the advantages and limitations of three types of global descriptors: handcrafted feature-based, off-the-shelf CNN-based, and customized CNN-based. Finally, we evaluate the practicality of the reported global descriptors in mediating the trade-off between matching performance and computational efficiency.

1. Introduction

Over the past few decades, visual simultaneous localization and mapping (SLAM) [1] has advanced considerably in the robotics research community. As one of the essential components of visual SLAM, visual place recognition (VPR) denotes the task of ascertaining whether or not the current place has already been visited [2, 3]. In this manner, the system can impose additional constraints for map building and trajectory optimization, thereby eliminating incremental drift [4–6]. For robots that must operate autonomously over extended periods, the appearance of the surrounding environment may change drastically over time. Severe appearance changes, together with the adverse effects of occlusions, dynamic scenes, and perceptual aliasing [4] (see Figure 1), keep VPR a daunting task.

VPR is essentially a data association task; when the data are images, it is known as the appearance-based approach. In most cases, appearance-based methods operate within an image retrieval framework. Specifically, the query image of the current place is compared against images of previously visited places stored in a historical database, and their similarity is the key factor in deciding whether they depict the same place. It is therefore crucial for appearance-based methods to generate appropriate and accurate image descriptions [5].

A feature descriptor provides a compact and efficient representation of the distinctive characteristics of an image [7]; ideally, the descriptor remains invariant under image transformations. Correspondences between the query image and database images can be established by measuring the similarity of their descriptors, which helps distinguish one place (image) from another. VPR approaches are commonly classified into three major divisions: local descriptor-based, global descriptor-based, and local region descriptor-based. Since global descriptor methods include the aggregation of local descriptors, and local region descriptors describe image patches with a global descriptor, we focus on global descriptor methods to limit the scope of this study.

Global descriptors describe the image with a single compact feature vector. On the one hand, a global descriptor can be constructed directly by extracting global features of the image, for instance, the histogram of oriented gradients (HOG) [8] and Gist [9]. On the other hand, it can be aggregated from multiple local descriptors. A typical example is the bag of visual words (BoVW) [10], which clusters local features (e.g., SIFT [11], ORB [12]) and then generates a frequency histogram vector as the global description. Recently, many deep learning-based methods have been introduced into VPR to extract global features [13–18], using both off-the-shelf and customized convolutional neural network (CNN) models. Although global descriptor methods in VPR have attracted extensive attention, few papers have compared them from a comprehensive perspective, and it remains difficult for researchers, especially novices, to comprehend this research topic thoroughly. Therefore, one of the main focuses of our work is to identify the mechanisms that contribute to improved performance. In particular, considering the great potential of deep learning techniques, we also integrate recent advances into our VPR evaluation framework.

A drastically changing environment can cause false place recognition, which disrupts the global consistency of the map and leads to wrong localization [19, 20]. Therefore, high hopes have been pinned on VPR methods achieving high, or even 100%, recognition precision [21]. Although false-positive results can be filtered out by temporal [4, 22–24] or geometric consistency checks [25–27], it is more important to develop novel descriptors that are more robust and discriminative. Invariance to appearance and viewpoint changes is essential, but it is not the only property to consider in the VPR task: descriptors must also be computationally efficient enough for real-time operation. Therefore, another key contribution of this research is to assess global descriptors from a practical application perspective.

To sum up, this work has three main contributions:
(i) We present a comprehensive assessment of the global descriptor methods commonly used in VPR tasks, thereby identifying the factors that drive improved performance. Our work covers 15 global descriptor methods, 6 benchmark datasets, and 5 metrics.
(ii) We give practical advice for designing better global descriptors for VPR tasks, based on quantitative and qualitative analysis. A specific analysis of hierarchical features and backbone networks was conducted for this purpose.
(iii) We provide valuable information regarding VPR performance from the point of view of practical applications, offering trade-offs between matching performance and computational efficiency.

The remainder of this paper is structured as follows. Section 2 reviews the typical global descriptor methods used in VPR tasks. Section 3 describes the implementation details of the evaluation experiments. Section 4 presents the experimental results, comparison, and analysis. Finally, Section 5 concludes this study.

2. Literature Review

The similarity between two global descriptors can be readily measured by cosine similarity or Euclidean distance, which makes such methods easy to implement and maintain. Global descriptor methods describe the image as a whole and extract handcrafted or learned features through specified approaches [7]. Accordingly, existing global descriptor methods can be divided into two categories: (1) handcrafted feature-based methods and (2) deep learning-based methods. To facilitate the understanding of the latter, we also give a brief introduction to recent advances in deep learning techniques, especially CNN models.
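To make the retrieval step concrete, the following minimal sketch (NumPy; the random descriptors and the threshold value are illustrative placeholders, not taken from our pipeline) shows how two global descriptors are compared with cosine similarity:

```python
# Minimal sketch: comparing two global descriptors with cosine similarity.
import numpy as np

def cosine_similarity(desc_a: np.ndarray, desc_b: np.ndarray) -> float:
    """Cosine of the angle between two descriptor vectors."""
    denom = np.linalg.norm(desc_a) * np.linalg.norm(desc_b) + 1e-12
    return float(np.dot(desc_a, desc_b) / denom)

# Two places are declared a match when the similarity of their global
# descriptors exceeds a task-specific threshold (0.9 here is illustrative).
query_desc, db_desc = np.random.rand(960), np.random.rand(960)
if cosine_similarity(query_desc, db_desc) > 0.9:
    print("candidate match / loop closure")
```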

2.1. Handcrafted-Based Global Descriptor Methods

Such methods focus on the information present in the image itself, such as changes in pixel intensity. The histogram of oriented gradients (HOG) [8] is one of the most commonly used handcrafted global descriptors. It extracts the structural information of an image by computing the gradient magnitude and orientation at each pixel. McManus et al. [28] extracted HOG features from image patches containing unique visual elements, improving robustness to extreme appearance changes. An impressive work is CoHOG [29], which uses HOG to represent salient image regions for convolutional matching. Another prominent invariant global descriptor is Gist [9]. Murillo and Kosecka [30] presented a Gist-based panorama matching approach for recognizing revisited places, promoting the application of Gist in VPR tasks. Shortly thereafter, Singh and Kosecka [31] conducted extensive experiments in a 13-mile urban area, demonstrating that the Gabor-Gist descriptor is competent for large-scale scenes.

Instead of using one global feature for the entire image, another class of global descriptors is based on aggregated local descriptors. BoVW was initially used for image retrieval but has since been demonstrated to be an effective model for place recognition [3, 4, 22–24, 32, 33]. Importantly, BoVW also confers scalability on the map, which is very important for place recognition in large-scale environments and long-term autonomy. Benefiting from the tree-structured vocabulary [10] and an inverted index, previously visited places (images) can be stored and retrieved highly efficiently. An early application of BoVW to VPR was presented by Schindler et al. [34], who stored more than 100 million SIFT features in a vocabulary tree and successfully performed place recognition over 20 km of urban roads. Gálvez-López and Tardos [4] proposed an enhanced BoVW method and open-sourced their C++ library DBoW for converting images into bag-of-words representations and constructing the visual dictionary [35]. Owing to its convenience and efficiency, the improved versions DBoW2/3 have been utilized in many excellent SLAM systems [23, 24]. Similar approaches have been presented with Fisher vectors (FV) [36] and the vector of locally aggregated descriptors (VLAD) [37].
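To illustrate the aggregation idea, the sketch below quantizes local ORB descriptors against a k-means vocabulary and represents an image by its visual-word histogram. This is a generic BoVW sketch, not the DBoW implementation; the vocabulary size and function names are assumptions:

```python
# Generic BoVW sketch: quantize local descriptors into visual words and
# describe an image by its (normalized) word-frequency histogram.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(training_descs: np.ndarray, k: int = 1000) -> MiniBatchKMeans:
    """Cluster local descriptors pooled from many training images into k words."""
    return MiniBatchKMeans(n_clusters=k).fit(training_descs)

def bovw_descriptor(gray: np.ndarray, vocab: MiniBatchKMeans) -> np.ndarray:
    """One L2-normalized visual-word histogram per image."""
    _, descs = cv2.ORB_create().detectAndCompute(gray, None)
    words = vocab.predict(descs.astype(np.float32))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)
```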

2.2. Deep Learning-Based Global Descriptor Methods

Given the rapid development of deep learning, the novel properties of learning-based techniques have inspired researchers to leverage them to remedy the shortcomings of handcrafted descriptors. After the seminal work of Chen et al. [38], research has increasingly focused on learning-based descriptors, which are mainly built on CNN features [13, 39–41]. Sünderhauf et al. [13] used a pretrained AlexNet [42] as a descriptor extractor and concluded that mid-level features have better robustness to appearance variations. Hou et al. [39] reported a similar finding with a CNN model pretrained on the scene-centric Places365 database [43]. Zhang et al. [40] constructed a graph-based VPR method by integrating visual features extracted from VGG16 [44] with temporal information from image sequences. Wang et al. [41] used a pretrained ResNet [45] as the image descriptor to realize place recognition in dynamic environments. Furthermore, researchers have developed specialized neural network architectures for VPR tasks. Reconstructing traditional approaches with deep learning techniques motivated the emergence of novel VPR methods, such as CALC [46], NetVLAD [14], and E2BoWs [47]. These methods combine the complementary strengths of handcrafted and learning-based descriptors to achieve remarkable performance. Additionally, the autoencoder and its variants have been introduced into the VPR domain [16]; the advantage of these unsupervised learning methods is that they require less manual data preparation.

In theory, the performance of CNN-based descriptors, especially supervised ones, depends on high-quality, large-scale training datasets. Driven by the booming VPR research community, more relevant datasets have been constructed, further promoting the development of global descriptors. A typical example is the specific place dataset (SPED), developed by Merrill and Huang [46] in 2017. That study highlighted the differences between networks trained on SPED versus ImageNet and indicated that CNN-based descriptors trained on tailored datasets tend to achieve better performance. It has also been found that adaptability and generalization can be improved by fine-tuning on a targeted dataset [47–49].

2.3. Popular Deep Learning Models

Deep learning is particularly adept at extracting high-level abstract features from raw images, and the CNN is one of the most popular deep learning networks. The CNN model was first proposed by LeCun et al. [50] for recognizing handwritten digits. The subsequent AlexNet [42] triggered a wave of interest in the computer vision community, and several important backbone networks were proposed in the following years, such as VGG [44], GoogLeNet [51], ResNet [45], Xception [52], and DenseNet [53]. With the reorganization of processing units and the emergence of new modules, a wide variety of CNN architectures continue to be presented for different applications. Overall, the trend has been towards deeper and more complicated architectures to obtain better performance. However, going deeper means more sequential processing and higher latency. For robots with computationally limited platforms, it is critical to develop lightweight, low-latency models.

To expand applications to mobile and embedded devices, several attempts have been made to reduce the parameter count of CNN models. SqueezeNet [54] employs 1 × 1 convolutions to compress the model to less than 0.5 MB. MobileNet [55, 56] builds on depthwise separable convolutions and efficiently balances latency and accuracy. The core components of ShuffleNet [57] are pointwise group convolution and channel shuffle, which significantly reduce computation costs. Benefiting from neural architecture search, EfficientNet [58] offers adjustable depth-width-resolution trade-offs, leading to better accuracy and efficiency.

3. Implementation Details

Based on the significant work reviewed above, this paper selects 15 typical global feature descriptors for evaluation (see Table 1); the motivations for these choices and the implementation details are briefly presented below. The datasets and evaluation metrics used in our experiments are then described.

3.1. Global Descriptor Methods
3.1.1. Handcrafted Feature-Based Descriptors

(1) HOG. HOG is a simple but effective descriptor with good invariance to illumination changes. We use a window size of 16 × 32, a block size of 16 × 16, and a cell size of 8 × 8. The number of orientation bins is set to 9. As a result, a 160 × 120 input image is represented by a 16416-dimensional HOG descriptor.
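The sketch below shows this configuration with OpenCV's HOGDescriptor; since the final length also depends on the window stride, the code illustrates the parameterization rather than guaranteeing the exact 16416-dimensional output:

```python
# HOG global descriptor with the parameters listed above (OpenCV).
import cv2

hog = cv2.HOGDescriptor(
    _winSize=(16, 32),     # (width, height)
    _blockSize=(16, 16),
    _blockStride=(8, 8),   # stride is an assumption of this sketch
    _cellSize=(8, 8),
    _nbins=9,
)

# "query.png" is a placeholder path; images are resized to 160 x 120.
image = cv2.resize(cv2.imread("query.png", cv2.IMREAD_GRAYSCALE), (160, 120))
descriptor = hog.compute(image).ravel()  # one flat vector per image
```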

(2) Gist. The essential idea behind Gist is that an image can be described by the responses of Gabor filters at diverse scales and orientations. In our work, input images are convolved with 20 Gabor filters at 3 scales (with 8, 8, and 4 orientations, respectively). Each feature map is divided into 16 regions by a 4 × 4 grid, so the output descriptor has 960 dimensions.
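For illustration, the following sketch approximates the Gist pipeline on a single channel with OpenCV Gabor kernels and 4 × 4 grid averaging; the kernel parameters are assumptions rather than the exact filter bank we used:

```python
# Rough Gist-style descriptor: Gabor filter bank + 4x4 grid pooling.
import cv2
import numpy as np

def gist_descriptor(gray: np.ndarray) -> np.ndarray:
    feats = []
    for wavelength, n_orient in [(8, 8), (16, 8), (32, 4)]:  # 20 filters total
        for i in range(n_orient):
            kernel = cv2.getGaborKernel((31, 31), sigma=wavelength / 2,
                                        theta=np.pi * i / n_orient,
                                        lambd=wavelength, gamma=1.0, psi=0)
            resp = np.abs(cv2.filter2D(gray.astype(np.float32), -1, kernel))
            h, w = (resp.shape[0] // 4) * 4, (resp.shape[1] // 4) * 4
            cells = resp[:h, :w].reshape(4, h // 4, 4, w // 4)
            feats.append(cells.mean(axis=(1, 3)).ravel())  # 16 values per filter
    return np.concatenate(feats)  # 20 filters x 16 regions = 320-D per channel
```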

(3) DBoW3-ORB. As mentioned previously, DBoW3 is one of the most widely used methods in VPR. We utilize the vocabulary file ORBvoc provided by ORB-SLAM2, which was trained on a large-scale dataset and offers good adaptability and generalization.

3.1.2. Off-the-Shelf CNN-Based Descriptors

In this paper, we select six off-the-shelf CNN-based descriptor methods: AlexNet, VGG16, ResNet50, MobileNet v3, ShuffleNet v2, and EfficientNet B0. They are described in detail in their respective papers, so we do not repeat those descriptions here; instead, we provide the implementation details and our motivation.

To a certain extent, the performance of CNN models depends on both the scale and the richness of the training dataset. For all experiments, six off-the-shelf CNN models pretrained on ImageNet are used to generate global feature descriptors. For AlexNet, VGG16, and ResNet50, since the performance of hierarchical features has attracted considerable attention, features extracted from different layers are used as global descriptor vectors; the analysis results are discussed in Section 4. To survey global descriptors more comprehensively, we also selected three recent advances to generate holistic image features: MobileNet v3, ShuffleNet v2, and EfficientNet B0. These were chosen for their lightweight architectures and practicality: larger models are less suitable for resource-constrained robots and mobile devices.
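For reference, a minimal sketch of this hierarchical readout with torchvision's feature-extraction utility is shown below; the layer names are examples for ResNet50, not the full set of layers we evaluated:

```python
# Reading out hierarchical features from a pretrained CNN as descriptors.
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

model = resnet50(weights="IMAGENET1K_V1").eval()
extractor = create_feature_extractor(model, return_nodes=["layer2", "layer3", "layer4"])

with torch.no_grad():
    feats = extractor(torch.randn(1, 3, 224, 224))  # stand-in for a real image
# Flatten each activation map into one global descriptor vector per layer.
descriptors = {name: f.flatten(1) for name, f in feats.items()}
```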

3.1.3. Customized CNN-Based Descriptors

(1) CALC. CALC is a lightweight convolutional autoencoder model proposed by Merrill and Huang [46] to address HOG's lack of robustness to viewpoint changes. Our implementation of CALC retains the last three fully connected layers, but the original backbone network is replaced with a pretrained ResNet18; note that the last pooling layer and the fully connected layer of ResNet18 are removed. For training, we fine-tuned the modified model on the Places365 dataset to better focus on the VPR task, following the training setup of CALC's open-source implementation; the output descriptors are likewise 3648-dimensional.
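A sketch of this backbone substitution is given below; the fully connected layer sizes are placeholders, not the exact trained configuration:

```python
# CALC-like encoder: truncated ResNet18 backbone + three FC layers.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CalcLikeEncoder(nn.Module):
    def __init__(self, out_dim: int = 3648):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Drop the final average pooling and classification layers.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Sequential(                 # sizes assume 224x224 input
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))  # one 3648-D descriptor per image
```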

(2) NetVLAD. NetVLAD reformulates VLAD with CNN-based techniques, providing a differentiable pooling mechanism with trainable parameters. The proposed NetVLAD layer serves as a plug-and-play module and produces a rich yet compact image representation. We implemented NetVLAD in PyTorch and trained it on the Pittsburgh dataset.
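A compact sketch of the NetVLAD pooling mechanism, following the published formulation (initialization and training details omitted), is as follows:

```python
# NetVLAD pooling: soft-assign local CNN features to K centers, sum residuals.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADLayer(nn.Module):
    def __init__(self, dim: int = 512, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)  # soft assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, d, h, w = x.shape
        a = F.softmax(self.assign(x), dim=1).flatten(2)                # N x K x HW
        feats = x.flatten(2)                                           # N x D x HW
        # Residuals between every local feature and every cluster center.
        resid = feats.unsqueeze(1) - self.centroids[None, :, :, None]  # N x K x D x HW
        vlad = (a.unsqueeze(2) * resid).sum(dim=-1)                    # N x K x D
        vlad = F.normalize(vlad, dim=2).flatten(1)                     # intra-normalization
        return F.normalize(vlad, dim=1)                                # final L2 norm
```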

(3) MobileNetVLAD. MobileNetVLAD was initially proposed for 6-DoF pose estimation; we integrated it into our evaluation as a reference for comparison. Interestingly, MobileNetVLAD is trained in a self-supervised manner: under the supervision of a well-trained NetVLAD model (the teacher), the network (the student) uses knowledge distillation to extract NetVLAD-like descriptors with a more lightweight architecture.
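Conceptually, the distillation objective can be as simple as regressing the teacher's descriptor; the sketch below is our reading of the idea, not necessarily the exact loss used by MobileNetVLAD:

```python
# Descriptor distillation: the student mimics the frozen teacher's output.
import torch
import torch.nn.functional as F

def distillation_loss(student_desc: torch.Tensor,
                      teacher_desc: torch.Tensor) -> torch.Tensor:
    """MSE between L2-normalized student and teacher global descriptors."""
    return F.mse_loss(F.normalize(student_desc, dim=1),
                      F.normalize(teacher_desc, dim=1))

# Per batch: with torch.no_grad(): t = teacher(images)
#            loss = distillation_loss(student(images), t); loss.backward()
```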

(4) DBoW3-SuperPoint. SuperPoint is a CNN-based interest point detector and descriptor. Compared with handcrafted local descriptors, it achieves superior homography estimation results while remaining real-time. To adapt it to the VPR task, we incorporated this local descriptor into the BoVW model, yielding a novel global descriptor. Our implementation is also based on the C++ DBoW3 library.

(5) Autoencoder (AE). This type of neural network learns efficient image representations in an unsupervised manner and is thus well suited to VPR tasks that lack high-quality labeled data. It is a representative customized CNN-based descriptor, given the strong performance reported by Gao and Zhang [16] and Park et al. [81]. In our work, the encoder of our AE is similar to the implementation of [46], and the decoder is composed of deconvolution and unpooling layers. The network was trained on the Places365 dataset, and the output of the trained encoder is taken as the global descriptor.
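For concreteness, a minimal convolutional autoencoder is sketched below; the channel counts are illustrative, and after training only the encoder output is retained as the descriptor:

```python
# Minimal convolutional autoencoder; the flattened bottleneck is the descriptor.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 8, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(  # mirrors the encoder for reconstruction
            nn.ConvTranspose2d(8, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        return self.decoder(z), z.flatten(1)  # reconstruction + global descriptor
```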

(6) Variational Autoencoder (VAE). The VAE was proposed by Kingma et al. and introduces explicit regularization to ensure good properties of the latent space. An interesting VPR method built upon a VAE was proposed by Merrill and Huang [83]. We designed a network structure similar to the autoencoder described above and also trained it on the Places365 dataset. We use the VAE for dimensionality reduction rather than image generation, so the decoder is removed during inference.
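The inference-time readout can be sketched as follows, assuming a generic encoder backbone; sampling via the reparameterization trick is only needed during training, and the latent mean serves as the descriptor:

```python
# VAE latent head: sample during training, return the mean at inference.
import torch
import torch.nn as nn

class VAELatentHead(nn.Module):
    def __init__(self, feat_dim: int, latent_dim: int = 256):
        super().__init__()
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, feats: torch.Tensor, training: bool = False):
        mu, logvar = self.fc_mu(feats), self.fc_logvar(feats)
        if training:  # reparameterization trick
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return z, mu, logvar
        return mu  # the global descriptor at inference time
```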

3.2. Evaluation Metrics

A descriptor with state-of-the-art place matching performance but an unacceptably long place retrieval time will fail to meet the rigid real-time demands of localization systems. For practical reasons, we integrated multiple evaluation metrics in this work to comprehensively evaluate these global descriptors in terms of matching performance and computational efficiency. Details of each metric are presented as follows.

3.2.1. Matching Performance

For VPR tasks, true positives (TP) denote correct image/place matching results, false positives (FP) refer to situations where incorrectly matched images are judged to be the same place, and false negatives (FN) represent situations where true matches are not retrieved. It should be pointed out that, for most VPR datasets, every query image has a ground-truth match in the database, so there are usually no true negatives [13]. Precision and recall are computed as follows:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}. \]

We present the evaluation metrics used in this paper as follows:

(1) AUC. Ideally, a VPR method would achieve 100% precision and 100% recall; in practice, however, precision and recall are negatively correlated, and increasing precision frequently reduces recall. Therefore, many works [3, 4, 32, 46, 83–85] have adopted the area under the precision-recall curve (AUC) for a comprehensive evaluation, and we also introduce it into our assessment. The precision-recall (PR) curve reflects how precision changes as recall rises, helping to make an informed decision when facing precision/recall dilemmas, while AUC summarizes the curve in a single value. Note that only a single correct match result is taken into account when calculating AUC.
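In practice, AUC can be computed from per-query best-match similarities with scikit-learn, as in the sketch below (array names are hypothetical):

```python
# AUC from a PR curve: `sims` holds the best-match similarity score for each
# query; `labels` marks whether that best match is a true positive.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc(sims: np.ndarray, labels: np.ndarray) -> float:
    precision, recall, _ = precision_recall_curve(labels, sims)
    return auc(recall, precision)
```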

(2) Recall at 100% Precision. Another metric applied in this study is recall at 100% precision (RP100 for short), which is also commonly used to evaluate VPR methods [16, 40, 46, 68]. The motivation for using this metric is the priority given to precision in VPR systems: 100% precision is very important because false positives are extremely disruptive and unacceptable.

(3) Recall@1. Recall@1 requires that the best-matched database image for a query image be a true positive. Although Recall@N is widely used in image retrieval tasks, where a correct retrieval only needs to appear among the top-N candidates, the allowable range for VPR or loop closure detection is more stringent from a practical standpoint. Moreover, Recall@1 directly reflects the percentage of correctly matched query images.
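Recall@1 then reduces to checking whether each query's best-scoring database image lies within the ground-truth tolerance, as in the hypothetical sketch below:

```python
# Recall@1: fraction of queries whose top-1 retrieval is a true positive.
import numpy as np

def recall_at_1(similarity: np.ndarray, gt_index: np.ndarray, tol: int = 0) -> float:
    """`similarity` is (num_queries x num_database); `gt_index[i]` is the
    ground-truth database index for query i; `tol` is the frame tolerance."""
    best = similarity.argmax(axis=1)
    return float(np.mean(np.abs(best - gt_index) <= tol))
```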

3.2.2. Computational Efficiency

Despite limited computing resources, the place recognition module in a mobile robot must perform in real-time to maintain good localization accuracy. In this case, the computational efficiency of the descriptors is another major consideration, including feature encoding time and descriptor matching time.

(1) Feature Encoding Time. The feature encoding process of most global descriptors is relatively time-efficient because no keypoint detection is involved; here, feature encoding time denotes the time spent extracting the global features. BoVW-based descriptors (i.e., DBoW3-ORB and DBoW3-SuperPoint in this article) are exceptions: their encoding time includes keypoint detection, description, and mapping into the bag-of-words space. For statistical reliability, the reported encoding time is averaged over more than 200 runs.

(2) Descriptor Matching Time. The total time consumed by descriptor matching is proportional to the scale of the map (the number of database images). For a fair comparison, descriptor matching time here refers to the time required to match two global descriptors, again reported as a statistical mean. The choice of similarity measure also influences matching time; cosine similarity was used for all global descriptors evaluated in this study.

3.3. Evaluation Datasets

We integrated 6 benchmark datasets to evaluate the performance of the above 15 global descriptors. These datasets feature images from diverse scenarios, including indoors, urban roads, suburbs, and natural scenery. Each dataset has two separate folders, one for organizing query images and one for database images. The ground-truth information is provided by the filename, that is, the images with the same filename indicate the same place. For each query image, there is a database image that was taken at the same place but has undergone changes in appearance and/or viewpoint. Table 2 provides a summary of the datasets and their major challenges; Figure 2 gives the sample images.

Due to differences in shooting frequency or traveling speed, the image sequences in some datasets may be consecutive, with adjacent images sharing overlapping visual content. Therefore, a ground-truth tolerance is commonly adopted in VPR tasks, though it is generally stricter than in other computer vision tasks. The ground-truth tolerance used in our work is presented in Table 2.

Here, we provide a brief introduction to these datasets to facilitate analysis of each descriptor method's performance. The download links of all datasets are given below.
(1) Nordland Dataset (https://nrkbeta.no/2013/01/15/). The Nordland dataset is extracted from video footage recording four 729 km journeys along the same route. It covers both natural and urban landscapes across four seasons. Severe cross-seasonal changes lead to strong appearance changes, but no viewpoint changes are involved due to the fixed route. We choose the Spring versus Winter image sequences for experimental analysis.
(2) SPEDTest Dataset (https://goo.gl/OXeL2X). The SPEDTest dataset is a subset of the original SPED dataset. It was captured with outdoor cameras that record long-term scene changes and hence contains extreme appearance variations across seasons and illumination conditions. Owing to each camera's fixed view angle, this dataset exhibits no viewpoint changes.
(3) Campus Loop Dataset (https://github.com/rpng/calc/tree/master/TrainAndTest/test_data). This dataset consists of two image sequences captured in indoor and outdoor environments. To cover multiple challenges, the first image sequence was taken on a cloudy, snowy day with buildings and roads covered in snow, whereas the second was taken on a sunny day.
(4) Gardens Point Dataset (https://zenodo.org/record/4561862). This dataset was collected in campus scenes with a handheld mobile phone. In this study, we used two daytime image sequences recorded under different illumination conditions along left/right walking paths. These factors give the dataset strong lateral viewpoint changes and modest appearance changes.
(5) Cross-Seasons Dataset (https://www.visuallocalization.net/datasets/). The cross-seasons dataset used in our work contains two image sequences taken under different illumination, season, or weather conditions. Interference from dynamic objects and perceptual aliasing makes place recognition even more challenging.
(6) Alderley Day/Night Dataset (https://www.dropbox.com/s/ejmnz9vfp4n7o7s/alderley.zip?dl=0). This dataset was created by Milford et al.; its two image sequences were captured on a bright sunny day and during an extremely heavy rainy night, respectively. The night storm causes extreme appearance changes and blurring, making it difficult even for humans to recognize places successfully.

4. Results and Discussion

In the following, we present the experimental evaluation of the 15 global descriptor methods and discuss the driving forces behind the results. The analysis covers two aspects, matching performance and computational efficiency, with an emphasis on practicality.

4.1. Matching Performance Analysis

The PR curves for all 15 global descriptors are presented in Figure 3, and the values of AUC and RP100 for each descriptor are presented in Tables 3–5.

For the three handcrafted feature-based descriptors, Figure 3 shows that their place matching precision either remains at a relatively low level or degrades substantially with increasing recall when encountering drastic visual changes. HOG achieves good performance on the SPEDTest and Nordland datasets, which do not involve any viewpoint changes; however, its performance degrades significantly under the dual challenge of viewpoint and appearance variations. DBoW3-ORB and Gist only yield acceptable matching performance on the less-challenging datasets, such as Gardens Point and Cross-Seasons. Although aggregated into a global descriptor through DBoW3, ORB as a local descriptor is not robust to appearance changes, which leads to its bottom-ranked AUC on the Nordland and Alderley datasets.

Although the choice of datasets has a moderate influence on assessment results, we observe that CNN-based descriptors perform better overall than non-CNN-based descriptors, particularly on datasets with extreme variations in viewpoint and appearance. For instance, while all descriptors suffer from multiple challenges, the PR curves of CNN-based descriptors decline relatively gently, whereas those of handcrafted descriptors drop rapidly. The quantitative comparisons are presented in Tables 3–5. The AUC results show that CNN-based descriptors almost always outperform handcrafted feature-based descriptors, and a similar picture is seen for the RP100 metric. One exception is DBoW3-SuperPoint, which cannot reach the same level as the other CNN-based methods. Like DBoW3-ORB, DBoW3-SuperPoint performs well only on the Gardens Point dataset with its slight appearance changes. This further shows that the aggregation of local descriptors cannot cope with strong appearance changes as well as global descriptors that represent the image as a whole. In most cases, CALC achieves better PR performance than HOG, demonstrating that better matching results can be delivered by integrating the advantages of CNNs and traditional methods. Despite their lightweight networks, the matching performance of ShuffleNet, MobileNet, and EfficientNet is marginally better than that of the other three off-the-shelf CNN-based descriptors.

In addition, the customized CNN-based descriptors generally exhibit better robustness than off-the-shelf CNN-based ones, maintaining good adaptability and generalization under common challenges. In terms of PR curves, the former's precision decays relatively slowly, meaning that customized CNN-based descriptors generally achieve higher precision at the same recall. An impressive method is NetVLAD, which attains state-of-the-art performance in most cases. MobileNetVLAD can achieve (and sometimes even surpass) NetVLAD-level place matching performance, demonstrating the potential of lightweight CNNs in VPR tasks.

4.2. Computational Efficiency Analysis

We now discuss the computational efficiency of the 15 global descriptor methods. In this experiment, we use the Gardens Point dataset with an image resolution of 960 × 540, and the values of feature encoding time and descriptor matching time are listed in Table 6. Note that a unified CPU-only platform was used for both conventional and CNN-based descriptors, whereas CNN-based ones generally require more computational resources. This experiment was performed on an Ubuntu 18.04 LTS operating system running on an Intel Xeon E5-2678 V3 CPU @ 2.5 GHz and RTX 2080Ti GPU.

It can be seen that matching time and descriptor dimension are positively associated under the same similarity measure. For instance, the HOG descriptor achieves the fastest feature encoding at only 1.46 ms, but its descriptor matching time is significantly higher because of its high dimensionality. Similarly, the six off-the-shelf descriptors all have 1000 dimensions, so their matching times are nearly identical.

We now turn to feature encoding time. As illustrated in Table 6, CNN-based global descriptors are computationally intensive and thus commonly spend more time on feature encoding, with a few exceptions: the lightweight CNN-based descriptors reach the same encoding efficiency as non-CNN-based descriptors on a CPU-only platform. Such lightweight networks are important for meeting real-time requirements. A typical example is MobileNetVLAD (a lightweight version of NetVLAD), which achieves a significant speed boost: encoding speed roughly doubled after replacing the original VGG16 backbone with MobileNet.

We also report the encoding times of the 12 CNN-based descriptors accelerated with a GPU in Table 7. As can be seen, all the CNN-based global descriptors achieve real-time performance under GPU acceleration. The three lightweight networks, ShuffleNet, MobileNet, and EfficientNet, stand out in terms of the number of parameters (#Params) and floating-point operations (FLOPs), but their inference speedups are less prominent than on the CPU platform because they are specifically designed for mobile and embedded devices. Apart from these three descriptors, it is apparent that the major factor affecting descriptor encoding time is the number of required floating-point operations.

4.3. The Inquiry in Constructing Better CNN Descriptors

Based on the abovementioned results and discussion, we attempted to determine which aspects benefit a descriptor. As discussed earlier, the contradiction is that VPR methods generally run on computationally limited platforms yet must meet real-time requirements. Therefore, a better global descriptor should be sufficiently lightweight and efficient to compute. Notably, compressing a CNN-based descriptor does not necessarily sacrifice matching performance: comparing MobileNetVLAD and NetVLAD, the former achieves better overall performance, even under complex challenges.

Sünderhauf et al. [13] reported that the mid-level layers of AlexNet yield outstanding performance even under strong appearance changes. Broader in scope than that work, our study uses more models to verify the correlation between matching performance and feature level. The results are shown in Table 8, with the optimal value for each indicator in bold. Our results agree with the conclusion drawn in [13], and a similar phenomenon is also observed for VGG16 and ResNet18/50. Features from shallower layers, particularly the outputs of the first pooling layer, cannot match the performance of features from higher layers, demonstrating that low-level features (i.e., edges, lines, and blobs) are sensitive to variations in viewpoint and appearance. Contrary to popular belief in the computer vision community, performance also degrades for features from the highest layers. One possible reason is that high-level features are more semantically meaningful and thus suffer significantly from perceptual aliasing. Given this analysis, we consider that the depth of a CNN descriptor should be moderate and that a fully connected layer should be avoided. Extracting features from higher layers also introduces more sequential processing, so reducing the number of CNN layers benefits computational efficiency as well.

Another finding from the PR curves is that the place matching performance of CNN-based descriptors is affected by the relevance of the training dataset. Despite their simple structures, AE and VAE yield promising results because they are trained on the scene-centric Places365 dataset. Compared with the off-the-shelf models trained on ImageNet, they can learn features more relevant to place representation.

4.4. Practicality Analysis

From the perspective of practicality, a system primarily cares about the proportion of true positives successfully retrieved at acceptable computational cost. Taking loop closure detection in SLAM as an example, the more correct loop closures detected, the more likely reliable localization accuracy will be maintained. Therefore, we also evaluated the performance of the 15 reported descriptors under the Recall@1 metric, as shown in Figure 4. This metric reflects the success rate of place recognition, and Figure 4 visually presents the percentage of correct matches for each descriptor.

Although our results are preliminary due to the limited scope of the survey, we recommend the following:
(1) In the presence of no or only slight viewpoint changes (e.g., a robot whose routes barely change), the HOG descriptor is a very suitable candidate because of its computational efficiency and effectiveness. DBoW3 is a cost-effective and convenient alternative for less-challenging scenes.
(2) For more complex and changeable environments, CNN-based global descriptors can retrieve more correct matches, albeit at the expense of considerable computing resources. Therefore, most CNN-based descriptors are not suited to a CPU-only platform unless their architectures are lightweight enough to keep computation low.
(3) For an effective but computation-heavy CNN-based descriptor, compressing it into a smaller network is a worthwhile attempt. We have verified that replacing the backbone network with a lightweight one and applying knowledge distillation are feasible solutions.

5. Conclusion

This article presents a comprehensive evaluation of global descriptor methods for appearance-based visual place recognition. The experiments were conducted on six benchmark datasets, covering diverse scenarios, and utilized five commonly used metrics to assess the matching performance and computational efficiency of 15 global descriptor methods.

Our analysis revealed that each type of descriptor has its own strengths and weaknesses, and we provided valuable insights regarding practicality. Specifically, CNN descriptors generally exhibit strong matching performance, albeit at a higher computational cost, indicating the potential of lightweight CNN descriptors. On the other hand, descriptor methods based on traditional features also have their own utility; non-CNN-based descriptors are particularly useful in less-challenging scenarios owing to their training-free nature and computational efficiency. In addition, our evaluation identified the factors that contribute to improved VPR performance, centering on hierarchical features, backbone network designs, model compression, and the choice of training datasets. Our experimental results suggest that an overly deep network architecture may not be necessary for optimal VPR performance, given that mid-level features are more robust. Additionally, the network structure should not be too cumbersome to deploy on a resource-constrained robot platform; model compression techniques, such as knowledge distillation, may provide feasible solutions. It is also critical to emphasize the importance of using relevant datasets to train the CNN model.

We anticipate that this study could assist researchers in gaining a more comprehensive understanding of appearance-based VPR and global descriptor methods, especially for novice learners. As a future direction, we will focus on novel methods used in VPR tasks, such as generative adversarial networks and deep multimodal learning. Consequently, this assessment could be extended to integrate additional descriptors, datasets, and metrics, thereby enhancing our understanding of this field.

Data Availability

The visual place recognition data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Special Project of the National Key Research and Development Program of China under grant no. 2020YFB1313304.