Abstract

In this study, we describe a new appearance-based loop-closure detection method for online incremental simultaneous localization and mapping (SLAM) using affine-invariant geometric constraints. Unlike other pure bag-of-words-based approaches, our proposed method uses geometric constraints as a supplement to improve accuracy. By establishing an affine-invariant hypothesis, the proposed method excludes incorrectly matched visual words and calculates the dispersion of correctly matched visual words to improve the accuracy of the likelihood calculation. In addition, the camera's intrinsic parameters and distortion coefficients are sufficient for this method; no 3D measurement is necessary. We use a long-term memory (LTM) and working memory (WM) mechanism to manage memory, and only a limited-size WM is used for loop-closure detection; the proposed method is therefore suitable for large-scale real-time SLAM. We tested our method on the CityCenter and Lip6Indoor datasets. The proposed method effectively corrects the typical false-positive localizations of previous methods, thus achieving better recall and precision.

1. Introduction

Simultaneous localization and mapping (SLAM) is widely used to generate maps for localization or autonomous robotic navigation.

The appearance-based SLAM type is characterized by low-cost solutions. Moreover, SLAM based on visual features provides abundant information for use in matching and recognition.

Almost all appearance-based SLAM systems are pure bag-of-words approaches that extract SIFT [1] or SURF [2] descriptors from images and then match descriptors by brute-force search, NNDR [3], or similar methods to calculate the likelihood between two locations.

The biggest challenge in improving the precision and recall ratio of loop-closure detection is that a false-positive localization may receive a higher loop-closure hypothesis selection score than a true-positive one. This results in acceptance of false-positive localizations and rejection of true-positive localizations.

Likelihood calculations between two places are the most decisive factor for establishing a loop-closure hypothesis. But in many conditions a pure bag-of-words approach cannot effectively calculate the likelihood between places.

Our proposed method attempts to improve the likelihood calculation by appending geometric constraints on the visual words to the classic pure bag-of-words likelihood. The geometric constraints comprise an order constraint and an acreage constraint, both designed to be affine invariant; the proposed method therefore works well even when the viewpoint changes significantly. The method uses a memory management approach similar to those in [4, 5] for real-time processing, uses SURF visual descriptors, and matches descriptors by NNDR [3].

In this paper, we describe the more accurate likelihood calculation to improve the loop-closure detection performance. By establishing an affine-invariant hypothesis, the proposed method excludes incorrectly matched visual words and calculates the dispersion of correctly matched visual words to improve the accuracy of the likelihood calculation. Section 2 reviews some previous pure bag-of-words-based approaches and their typical problems. In Section 3, we describe the proposed method. Section 4 presents our experimental results. In Section 5, we discuss our proposed method’s advantages, disadvantages, and outlook.

2. Related Work

References [4-8] present some typical pure bag-of-words approaches.

Cummins and Newman [6] proposed a rapid method based on the probabilistic bailout condition for appearance-only SLAM. But this approach’s precision and recall ratio are not satisfactory.

Kawewong et al. [9] proposed a method that tracks robust features across a sequence of images, called position-invariant robust features (PIRF). Based on PIRF, they also proposed two online incremental appearance-only SLAM methods, PIRF-nav [7] and PIRF-nav2 [8]. Owing to PIRF's robustness, PIRF-nav and PIRF-nav2 perform satisfactorily in dynamic environments. Compared with the method in [6], their precision and recall ratios also improved significantly.

However, the processing time of PIRF-nav and PIRF-nav2 for loop-closure detection cannot be controlled very well: it increases as the map's scale increases. In addition, because PIRF extracts only features that persist across a sequence of images, many useful features are ignored. This can cause a significant loss of visual features, particularly in low-resolution indoor datasets such as Lip6Indoor [10]. It is therefore difficult to improve the performance of PIRF-nav and PIRF-nav2.

Labbé and Michaud [4, 5] proposed a method called RTAB-map, based on a short-term memory (STM) and long-term memory (LTM) mechanism. It keeps the processing time of SLAM bounded, controlling it effectively even as the map's scale increases.

However, because of the problems shown in Figure 1, it is difficult to improve RTAB-map’s recall ratio.

RTAB-map is among the best vision-only SLAM methods currently available and probably represents the performance limit of pure bag-of-words approaches.

FAB-MAP3D [11] is a SLAM method that combines a pure bag-of-words approach with 3D geometric constraints. It outperforms FAB-MAP [6], but it requires 3D measurement information for each visual word.

The proposed method attempts to design geometric constraints for appearance-only SLAM without any 3D measurement.

Unlike RANSAC [12] and PROSAC [13], the proposed method estimates an affine-invariant hypothesis and calculates the likelihood between two places without any random elements. The proposed method is therefore more stable and better suited to situations in which only a few words match.

3. Proposed Method

This section presents our new likelihood calculation method having geometric constraints. We also include a brief explanation of the loop-closure hypothesis selection. Figure 2 shows the likelihood calculation of the proposed method.

3.1. Image Undistortion

A camera lens can introduce significant distortion, and undistorted images are necessary to establish an affine-invariant hypothesis.

To produce undistorted images, we must obtain the camera's intrinsic parameters, radial distortion coefficients, and tangential distortion coefficients by calibration. Calibration and undistortion are straightforward with OpenCV [15]. The intrinsic parameters and distortion coefficients are stable for a given camera. More details are available in the OpenCV documentation.
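As a concrete illustration, the radial-tangential (Brown-Conrady) distortion model used by OpenCV can be applied and inverted on normalized image coordinates. The coefficient values below are hypothetical stand-ins for real calibration output:

```python
# Hypothetical coefficients for illustration; real values come from camera
# calibration (e.g. OpenCV's calibrateCamera).
K1, K2 = -0.28, 0.07      # radial distortion coefficients
P1, P2 = 1e-3, -5e-4      # tangential distortion coefficients

def distort(x, y):
    """Apply the radial-tangential (Brown-Conrady) model to a normalized point."""
    r2 = x * x + y * y
    radial = 1.0 + K1 * r2 + K2 * r2 * r2
    xd = x * radial + 2.0 * P1 * x * y + P2 * (r2 + 2.0 * x * x)
    yd = y * radial + P1 * (r2 + 2.0 * y * y) + 2.0 * P2 * x * y
    return xd, yd

def undistort(xd, yd, iters=20):
    """Invert the distortion by fixed-point iteration, as OpenCV's
    undistortPoints does internally."""
    x, y = xd, yd
    for _ in range(iters):
        r2 = x * x + y * y
        radial = 1.0 + K1 * r2 + K2 * r2 * r2
        dx = 2.0 * P1 * x * y + P2 * (r2 + 2.0 * x * x)
        dy = P1 * (r2 + 2.0 * y * y) + 2.0 * P2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y
```

In practice one would call OpenCV's undistort functions on whole images; the sketch only makes the underlying model explicit.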

Since the real world is not flat, real-world images do not strictly obey the affine-invariant constraint. For the most part, however, landmarks in images can be treated as lying in a locally flat environment.

3.2. Order Constraint

We designed a distance order constraint to exclude incorrectly matched visual words.

As illustrated in Figure 3, some matched pairs are incorrect. For each matched visual word, we first compute its relative distance vector: the list of the other matched words in the same image, sorted from nearest to farthest. For a correctly matched pair, the two words' relative distance vectors have nearly identical orders in the two images; for an incorrectly matched pair, such as the example in Figure 3, the orders differ significantly.

We therefore designed an offset-based linear score, defined in (1) and illustrated in Figure 5, that measures the disagreement between the orders of a matched pair's two relative distance vectors. As Figure 4 shows, a higher score indicates a higher probability of incorrect matching, so the score can be used to distinguish correctly from incorrectly matched visual words.

Note that this score is not affine invariant and is sensitive to the noise percentage, so no fixed threshold can eliminate incorrectly matched visual words in large-scale SLAM. One remedy is normalization: we normalize the scores using their mean and standard deviation.

Our proposed method uses kd-tree-based [16] FLANN [17] to establish the relative distance vectors when descriptors are extracted. All extracted words are used to establish these vectors, which are retained for later queries. When required, mismatched words are eliminated by evaluating expression (1), and new vectors are built for the order-constraint processing; the original vectors are unchanged.

In Figure 4, the incorrectly matched words are excluded, and we obtain a corrected set of matched words.
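Since formula (1) is not reproduced here, the following sketch substitutes a simple rank-offset disagreement score for it; the function names and the z-score cutoff are our own illustrative choices:

```python
import numpy as np

def rank_vector(points, i):
    """Indices of the other matched words, sorted nearest to farthest from word i."""
    d = np.linalg.norm(points - points[i], axis=1)
    return [j for j in np.argsort(d, kind="stable") if j != i]

def offset_score(points_a, points_b, i):
    """Disagreement between word i's nearest-neighbor orders in the two images;
    a stand-in for the paper's offset-based linear formula (1)."""
    ra, rb = rank_vector(points_a, i), rank_vector(points_b, i)
    pos_b = {j: k for k, j in enumerate(rb)}
    return sum(abs(k - pos_b[j]) for k, j in enumerate(ra))

def filter_by_order(points_a, points_b, z_max=1.0):
    """Drop matches whose mean-and-std-normalized score is z_max or higher."""
    s = np.array([offset_score(points_a, points_b, i) for i in range(len(points_a))])
    z = (s - s.mean()) / (s.std() + 1e-9)
    keep = z < z_max
    return points_a[keep], points_b[keep]
```

A word matched to the wrong location disagrees with its neighbors' distance order and receives a high score, so normalization by mean and standard deviation singles it out without a dataset-specific absolute threshold.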

However, the order constraint alone is not strict enough for a highly accurate likelihood calculation. We therefore designed an acreage constraint to establish an affine-invariant hypothesis based on the order-constrained matches.

3.3. Acreage Constraint

An example of an affine invariant is illustrated in Figure 6. Although the coordinates of the visual words change significantly between the two images, the proportional relationships among the areas illustrated in the figure do not: the share of the total area contributed by each triangle is preserved under an affine transformation.

Therefore, when the affine-invariant proportional relationship of acreage has been found, an affine-invariant hypothesis can be established.

We propose a method to establish an affine-invariant hypothesis based on the results of the order constraint. First, we calculate a total area, summing the areas of the triangles formed by the center of gravity of the matched words and each pair of neighboring words.

We then define the deviation of two pairs of visual words under an affine-invariant hypothesis, comparing the shares that their triangle areas contribute to the total area in each image.

If the deviation stays below a threshold, an affine-invariant hypothesis is established. Because this threshold operates on an affine-invariant quantity, it is robust: a single fixed value is suitable for large-scale SLAM.

In fact, the total area is central to establishing the affine-invariant hypothesis, and it is meaningful only when it is built from visual words that obey the affine-invariant constraint. The order constraint has already eliminated the incorrectly matched words, but noise remains.
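A minimal sketch of the acreage idea, assuming centroid-based triangles over the matched words (the exact construction in the paper may differ):

```python
import numpy as np

def tri_area(a, b, c):
    """Unsigned area of triangle abc (shoelace formula)."""
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def area_ratios(points):
    """Share of the total area taken by each triangle formed by the
    center of gravity and consecutive matched words."""
    g = points.mean(axis=0)
    n = len(points)
    areas = np.array([tri_area(g, points[i], points[(i + 1) % n]) for i in range(n)])
    return areas / areas.sum()

def affine_deviation(points_a, points_b):
    """Maximum deviation between corresponding area shares; near zero when
    the two point sets are related by an affine transform."""
    return float(np.max(np.abs(area_ratios(points_a) - area_ratios(points_b))))
```

Because an affine map scales every area by the same factor (the determinant of its linear part) and maps the centroid to the centroid, the area shares cancel that factor out, which is what makes the quantity affine invariant.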

3.4. Likelihood Calculation

After the above processing, only correctly matched words remain, and a geometric-constraints-based likelihood between the testing place and the current place can be calculated.

One factor of this likelihood is the proportion of word pairs that survive the geometric constraints among all matched word pairs between the two places.

The other factor is the dispersion of the affine-invariant words: correctly matched words spread widely over the image support the hypothesis more strongly than a tight cluster does.

Clearly, both factors lie in [0, 1].

In [4, 5], the likelihood is calculated from the number of matched word pairs relative to the total numbers of words in the signature and the compared signature. However, since this formulation tends to produce low likelihoods, it may cause false-negative localizations. For pure bag-of-words-based approaches, precision is hard to control, so there is no alternative but to accept low likelihoods.

We propose a new likelihood calculation that combines the classic likelihood of [4, 5] with the geometric-constraint factors defined above.

This likelihood calculation is less biased toward low values than that of [4, 5]. In addition, because geometric constraints enter the calculation, the proposed method achieves better accuracy.
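The paper's exact formulas are not reproduced above, so the following is an illustrative stand-in, not the authors' equations: a match-proportion factor weighted by a dispersion factor computed from the spread of the surviving word coordinates.

```python
import numpy as np

def match_proportion(n_affine, n_matched):
    """Fraction of matched word pairs surviving the geometric constraints."""
    return n_affine / n_matched if n_matched else 0.0

def dispersion(points, img_w, img_h):
    """Spread of the surviving words over the image, normalized to [0, 1]
    against the standard deviation of a uniform spread."""
    if len(points) < 2:
        return 0.0
    uniform_std = np.array([img_w, img_h]) / np.sqrt(12.0)
    return float(np.clip((points.std(axis=0) / uniform_std).mean(), 0.0, 1.0))

def geo_likelihood(points, n_matched, img_w, img_h):
    """Illustrative combined score: proportion weighted by dispersion."""
    return match_proportion(len(points), n_matched) * dispersion(points, img_w, img_h)
```

Both factors lie in [0, 1], so the combined score does too; widely spread geometric support scores higher than a tight cluster with the same number of surviving matches.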

3.5. Brief Summary of Loop-Closure Hypothesis Selection

The proposed method uses a loop-closure hypothesis selection similar to that in [5]. We update a Bayesian filter with a recursion over the probability that the current image closes a loop with a past location and the probability that the current place in the STM is a new place. The likelihood entering this recursion, normalized by its mean and standard deviation, is the quantity the proposed method significantly improves. Due to space limitations, we cannot describe the selection in detail; please refer to [4, 5].

When the loop-closure hypothesis probability exceeds the loop-closure threshold, the hypothesis is accepted.

Note that if the threshold is set too high, a hypothesis can be rejected even when its probability is very high; this may cause false-negative localizations.
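The recursion can be illustrated with a generic discrete Bayesian filter; the transition model below is a simplified stand-in for the one in [4, 5], which also spreads probability to temporal neighbors:

```python
import numpy as np

def bayes_update(prior, likelihoods, p_stay=0.9):
    """One recursive update of a discrete Bayesian loop-closure filter.

    prior       -- posterior over hypotheses from the previous image
                   (index 0 = "new place", indices 1.. = past locations)
    likelihoods -- normalized likelihood of the current image under each hypothesis
    p_stay      -- mass the simplified transition model keeps on each hypothesis;
                   the remainder is spread uniformly
    """
    n = len(prior)
    transition = p_stay * np.eye(n) + (1.0 - p_stay) / n
    predicted = transition @ prior        # propagate the previous posterior
    posterior = likelihoods * predicted   # weight by the likelihood
    return posterior / posterior.sum()    # normalize (the eta term)
```

The loop-closure decision then compares the highest posterior entry against the acceptance threshold.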

4. Experiments

We performed our calculations on a MacBook Pro (i7, 16 GB RAM). The application is written in C++. We tested our method on two well-known datasets: Lip6Indoor and CityCenter.

4.1. Lip6Indoor

Figure 7 shows a typical false-positive localization produced by pure bag-of-words approaches such as [5, 6, 8]. After the proposed method applies its geometric constraints to the two places, the false-positive loop-closure hypothesis is rejected. The dataset contains 388 images at a resolution of 240 × 192. Compared with [5], the proposed method improved the recall ratio by only 1.55%; however, as the recall ratio of [5] increases, its precision drops rapidly. At 100% recall, the precision of [5] is 63%, whereas that of the proposed method is 87.5%. Table 1 shows the results for the Lip6Indoor dataset.

References [6, 8] are faster than [5], but their recall ratio is low. Compared with [5], the proposed method requires an average of 53.5 ms of additional processing time per frame. The maximum processing time for one frame with the proposed method is 825.3 ms. Because this dataset was captured at 1 Hz, the proposed method runs in real time.

4.2. CityCenter

In the CityCenter dataset, because our method effectively controls false positives, we obtained a higher recall ratio. The dataset contains 2474 images at a resolution of 640 × 480; images were captured in pairs, two at each location.

The recall ratio cannot be increased further because some scenes (such as jungle-like vegetation) contain too many similar words. The proposed method fails in such scenes: with too many incorrectly matched pairs, a bad affine-invariant hypothesis is established. Table 2 shows the results for the CityCenter dataset.

The maximum processing time for one frame with the proposed method is 1780.7 ms. Because the dataset was captured at approximately 0.5 Hz, the proposed method also runs in real time on this dataset.

5. Conclusion and Future Studies

These experiments showed that our proposed method outperforms pure bag-of-words-based SLAM approaches. We showed that 2D geometric constraints are an effective way to break through the accuracy bottleneck of appearance-based SLAM.

Although the proposed method works well in most cases, some problems remain. One typical problem is too many similar words in the same image. We are considering methods to solve it. One possible solution is to adjust the NNDR [3] threshold to reject ambiguous, repeated features more aggressively; this reduces the false-positive ratio of descriptor matching but rejects more correct matches. The proposed method would then construct the affine-invariant hypothesis from the features matched under the stricter threshold and, finally, re-test the rejected features against the hypothesis to retrieve potentially correct matches.
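The two-threshold matching idea could be sketched as follows; the threshold values and function name are hypothetical, and the candidate set would be re-tested against the affine-invariant hypothesis built from the confident set:

```python
import numpy as np

def nndr_match(desc_a, desc_b, strict=0.7, loose=0.9):
    """NNDR ratio test with two thresholds.

    Returns (confident, candidates): pairs passing the strict ratio, plus
    pairs passing only the loose ratio, which would later be re-tested
    against an affine-invariant hypothesis built from the confident set.
    """
    confident, candidates = [], []
    for i, d in enumerate(desc_a):
        dist = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dist)[:2]
        ratio = dist[j1] / (dist[j2] + 1e-12)
        if ratio < strict:
            confident.append((i, int(j1)))
        elif ratio < loose:
            candidates.append((i, int(j1)))
    return confident, candidates
```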

Today, high-performance handheld smartphones are very popular. Because the proposed method requires no 3D measurement to achieve high robustness on handheld devices, it may be applied to many types of platforms, for example, pedestrian navigation.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this article.

Acknowledgments

This research is supported by the Key Project of National Natural Science Foundation of China (Grant no. 51538007) and the Project of National Natural Science Foundation of China (Grant no. 71101096).