Abstract

Visual localization is widely used in autonomous navigation systems and Advanced Driver Assistance Systems (ADAS). This paper presents a visual localization method based on multifeature fusion and disparity information using stereo images. We integrate disparity information into center-symmetric local binary patterns (CSLBP) to obtain a robust global image description (D-CSLBP). To represent the scene in depth, the multifeature fusion of D-CSLBP and HOG features provides valuable information and helps decrease the effect of typical place recognition problems such as perceptual aliasing. It improves visual recognition performance by taking advantage of depth, texture, and shape information. In addition, for real-time visual localization, locality sensitive hashing (LSH) is used to compress the high-dimensional multifeature into binary vectors, which speeds up the image matching process. To show its effectiveness, the proposed method is tested and evaluated on real datasets acquired in outdoor environments. Given the obtained results, our approach allows more effective visual localization than the state-of-the-art FAB-MAP method.

1. Introduction

One of the prerequisites of navigation is that the vehicle or robot be able to reliably determine its position within its environment. With the wide use of cameras, a variety of approaches has been proposed to address the challenges of place recognition based visual localization [1, 2].

The FAB-MAP (Fast Appearance Based Mapping) method [3] can be considered a milestone in the field of visual localization. The FAB-MAP approach matches the appearance of the current scene to a similar previously visited place by converting images into bag-of-words representations built on local features such as SIFT or SURF.

Recently, binary image descriptors that encode patch appearance using compact binary strings with low memory requirements have been widely used in image description and visual recognition [4]. In local feature based place recognition approaches, the image representation is defined as a collection of local features; their robustness comes from their tolerance to local image variations as well as from the discriminative power of their descriptors. Nevertheless, most of these works exhibit a high computation cost or require complex feature extraction for image matching [5, 6]. Also, few works pay attention to depth information for visual place recognition.

One typical binary descriptor is LBP (local binary pattern) [7]. Since it was first proposed in 1996, several new variants of binary descriptors have been proposed [8, 9]. Their advantages are that they are invariant to monotonic gray-scale changes and fast to calculate; they show great invariance to monotonic illumination changes, do not require many parameters to be set, and have high discriminative power. However, most of them are unfortunately not efficient for background modeling or place description because of their sensitivity to noise or illumination. In this paper, the most relevant binary descriptors for visual place recognition, which will be tested and compared in our approach, are LBP, CLBP (complete local binary pattern) [10], CSLBP (center-symmetric local binary patterns) [11], CSLDP (center-symmetric local derivative pattern) [12], and XCSLBP (extended CSLBP) [13]. In the following, these different local binary descriptors are collectively denoted LBP.

Besides local binary features, histograms of oriented gradients (HOG) features have also been successfully used in various vision tasks such as object classification, image search, and scene classification [14]. Wang et al. [15] combined histograms of oriented gradients (HOG) and local binary patterns (LBP) and proposed a novel human detection approach capable of handling partial occlusion. For such applications, HOG is one of the best features to capture edge or local shape information, which provides a rough description (shape information) of the scene.

Considering the robust and strong image representation ability of binary descriptors and the HOG feature, we expect that their combination provides more useful information and should therefore improve place recognition performance. In this paper, stereo images are used for visual place recognition. A novel localization approach is proposed which uses multifeature fusion by combining HOG and binary features (LBP), as shown in Figure 1. HOG features are obtained from the gray-scale image, while LBP features are built from both the gray-scale image and the disparity map. We note that the features are first extracted from the blocks composing the gray-scale image and the disparity map and then concatenated. We extend the application of the LBP descriptor to the disparity map in order to incorporate disparity information into the image representation, by simply concatenating the two descriptors (LBP from the gray-scale image and LBP from the disparity map). This produces a new descriptor named D-LBP. The integration of disparity information into the image representation provides depth information which is helpful for place recognition, especially in complex environments. Indeed, describing images with the LBP and HOG features plus depth information helps reduce the perceptual aliasing problems inherent to visual place recognition. As will be shown in our experiments, the feature combination achieves better recognition performance than any single feature. The place recognition performance is also compared with the state-of-the-art FAB-MAP algorithm: the scores achieved by our approach on the four tested datasets are better than those of FAB-MAP. Furthermore, considering that comparing high-dimensional multifeatures is time-consuming, locality sensitive hashing is applied to the multifeatures to speed up feature comparison and image matching.

The most important contributions introduced in this paper are the following:
(i) An innovative method for place recognition based visual localization using a multifeature descriptor (D-LBP++HOG) extracted from the gray-scale image and the disparity map. The proposed multifeature descriptor takes advantage of texture, depth, and shape information and hence performs better than any single feature (see Section 5.2).
(ii) A study of the impact of image block size on the binary descriptors. A binary descriptor extracted from small blocks better discriminates the local details of different locations, while a large block size may cause the loss of some discriminative information (see Section 5.1).
(iii) A speeding-up of the place recognition method, achieved by approximating the Euclidean distance between features with the Hamming distance over bit vectors obtained by locality sensitive hashing (see Section 5.3).

The rest of this paper is organized as follows. First, the LBP descriptor and several of its variants as well as the HOG feature are introduced in Section 2. Then, in Section 3, the proposed approach is described in detail. Section 4 presents the tested databases and the performance evaluation measures. The obtained results are presented and discussed in Section 5. Finally, conclusions and future works close this paper (Section 6).

2. Overview of Used Image Descriptors

In this part, some of the state-of-the-art image descriptors used and compared in the proposed approach are described.

2.1. LBP (Local Binary Pattern)

LBP is a texture descriptor that codifies local primitives (such as curved edges, spots, and flat areas) into a feature histogram. The original LBP operator labels the pixels of an image with decimal numbers, called local binary patterns or LBP codes, which encode the local structure around each pixel [8].

As illustrated in Figure 2, each pixel gray level value is compared with its eight neighbors in a region by subtracting the center pixel value. The resulting strictly negative values are encoded with 0 and the others with 1. A binary number is obtained by concatenating all these binary codes, and its corresponding decimal value is used for labeling the central pixel. In Figure 3, examples of neighborhoods used for the LBP operator are illustrated. The generalized LBP definition uses $P$ sample points evenly distributed on a radius $R$ around a center pixel located at $(x_c, y_c)$. The position $(x_p, y_p)$ of the $p$-th neighboring point is given by

$$x_p = x_c + R\cos\left(\frac{2\pi p}{P}\right), \qquad y_p = y_c - R\sin\left(\frac{2\pi p}{P}\right).$$

The local binary code for the position $(x_c, y_c)$ is computed by comparing the gray-scale value $g_c$ of the center pixel with the gray-scale values $g_p$ of its neighbor pixels located at $(x_p, y_p)$, where $p = 0, \dots, P-1$. The LBP code of the center pixel at position $(x_c, y_c)$ is given by

$$LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p,$$

where $s$ is the Heaviside function:

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0. \end{cases}$$

The operator produces $2^P$ different output values, corresponding to the different binary patterns formed by the $P$ pixels in the neighborhood. Although this method can capture the relations of nearby and adjacent pixels, it leads to a large data dimension.

Ojala et al. [7] further proposed "uniform patterns" to reduce the dimension of the LBP feature while keeping its discriminative power. For this, a uniformity measure $U(\text{pattern})$ is used: $U$ is the number of bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circular. The uniformity of an LBP pattern can be computed by

$$U(LBP_{P,R}) = \big|s(g_{P-1} - g_c) - s(g_0 - g_c)\big| + \sum_{p=1}^{P-1} \big|s(g_p - g_c) - s(g_{p-1} - g_c)\big|.$$

Uniform LBP patterns refer to the patterns which have at most two transitions or discontinuities ($U \le 2$) in the circular binary representation. For instance, 11111111 (0 transitions) and 01110000 (2 transitions) are both uniform whereas 11001001 (4 transitions) and 01010010 (6 transitions) are not. Thus, for $P$ neighborhood pixels, a uniform operator produces $P(P-1) + 2$ possible distinct uniform LBP patterns. After the uniform LBP patterns are identified, for an image of size $I \times J$, a histogram is built which can be used as the feature representing the image texture:

$$H(k) = \sum_{i=1}^{I}\sum_{j=1}^{J} f\big(LBP_{P,R}(i, j), k\big), \quad k \in [0, K], \qquad f(x, k) = \begin{cases} 1, & x = k \\ 0, & \text{otherwise,} \end{cases}$$

where $K$ is the maximal LBP pattern value. The length of the histogram is $K + 1$.
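To make the operator concrete, here is a minimal NumPy sketch of the basic (nonuniform) LBP operator with $P = 8$, $R = 1$, using nearest-pixel neighbors instead of interpolated circular sampling; the function names are illustrative, not from a library, and the uniform-pattern mapping is omitted for brevity.

```python
# Minimal LBP sketch (P=8, R=1), assuming nearest-pixel sampling.
import numpy as np

def lbp_8_1(img):
    """Compute LBP codes for the interior pixels of a gray-scale image."""
    img = img.astype(np.int32)
    center = img[1:-1, 1:-1]
    # Offsets of the 8 neighbors, ordered circularly around the center.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy: img.shape[0] - 1 + dy,
                       1 + dx: img.shape[1] - 1 + dx]
        codes += ((neighbor - center) >= 0).astype(np.int32) << p
    return codes

def lbp_histogram(codes, n_bins=256):
    """Normalized histogram of LBP codes, used as the texture feature."""
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)
```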

2.2. CLBP (Complete Local Binary Pattern)

The LBP feature considers only the signs of local differences (i.e., the difference of each pixel with its neighbors), whereas the CLBP feature [10] considers both the magnitude (M) and sign (S) of local differences as well as the original center gray level value (C). Consequently, three operators, namely, CLBP_M, CLBP_S, and CLBP_C, are used to code the magnitude, sign, and center gray level, respectively.

Given the gray-scale value $g_c$ of the center pixel and its $P$ circularly and evenly spaced neighbors with gray-scale values $g_p$, $p = 0, \dots, P-1$, the difference between $g_p$ and $g_c$ can simply be calculated as $d_p = g_p - g_c$. The local difference vector $[d_0, \dots, d_{P-1}]$ characterizes the image local structure at the center pixel. Because the central gray level is removed from the local difference vector, it is robust to illumination changes and more efficient in pattern matching. Each $d_p$ can be further decomposed into two components:

$$d_p = s_p \cdot m_p, \qquad s_p = \operatorname{sign}(d_p), \qquad m_p = |d_p|,$$

where $s_p$ is the sign component of $d_p$ and $m_p$ is the magnitude component of $d_p$.

CLBP_M is used to code the magnitude information of local differences:

$$CLBP\_M_{P,R} = \sum_{p=0}^{P-1} t(m_p, c)\, 2^p, \qquad t(x, c) = \begin{cases} 1, & x \ge c \\ 0, & x < c, \end{cases}$$

where $c$ is a threshold set to the mean value of $m_p$ over the whole image.

CLBP_S is the same as the original LBP operator and is used to code the sign information of local differences:

$$CLBP\_S_{P,R} = \sum_{p=0}^{P-1} s(d_p)\, 2^p.$$

CLBP_C is used to code the information of the original center gray level value:

$$CLBP\_C_{P,R} = t(g_c, c_I),$$

where the threshold $c_I$ is set to the average gray level of the input image.

The dimension of the histograms corresponding to CLBP_S and CLBP_M is $2^P$ each, while the dimension of CLBP_C is 2. CLBP_C only uses the center gray level value, which can easily be affected by viewpoint or illumination changes. Therefore, in our work, only the histograms of the CLBP_S and CLBP_M codes are computed and then concatenated to construct the CLBP feature. Thus, the final dimension of the CLBP feature is $2 \times 2^P$.
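For illustration, the sign and magnitude operators can be sketched as follows; this is a minimal NumPy version, assuming $P = 8$, $R = 1$ and nearest-pixel sampling, that returns the concatenated 512-bin CLBP feature described above.

```python
# Minimal CLBP_S / CLBP_M sketch (P=8, R=1), nearest-pixel sampling assumed.
import numpy as np

def clbp_s_m(img):
    img = img.astype(np.float64)
    center = img[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    diffs = [img[1 + dy: img.shape[0] - 1 + dy,
                 1 + dx: img.shape[1] - 1 + dx] - center
             for dy, dx in offsets]
    mags = [np.abs(d) for d in diffs]
    c = np.mean(mags)                      # threshold: mean magnitude over the image
    s_code = np.zeros(center.shape, dtype=np.int32)
    m_code = np.zeros(center.shape, dtype=np.int32)
    for p, (d, m) in enumerate(zip(diffs, mags)):
        s_code += (d >= 0).astype(np.int32) << p   # CLBP_S: sign component
        m_code += (m >= c).astype(np.int32) << p   # CLBP_M: magnitude component
    return s_code, m_code

def clbp_feature(img):
    s_code, m_code = clbp_s_m(img)
    h_s, _ = np.histogram(s_code, bins=256, range=(0, 256))
    h_m, _ = np.histogram(m_code, bins=256, range=(0, 256))
    return np.concatenate([h_s, h_m]).astype(np.float64)   # 512-dim CLBP feature
```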

2.3. CSLBP (Center-Symmetric Local Binary Patterns)

CSLBP [11] is another modified version of LBP. CSLBP produces a shorter feature set than LBP; it is a first-order local pattern in the center-symmetric directions and ignores the central pixel information. CSLBP is closely related to the gradient operator because it compares the gray levels of pairs of pixels in center-symmetric directions instead of comparing the central pixel to its neighbors. In this way, the CSLBP feature takes advantage of the properties of both LBP and gradient based features.

For an even number $N$ of neighboring pixels distributed on a radius $R$, the CSLBP operator produces $2^{N/2}$ possible distinct patterns. The operator is given by

$$CSLBP_{R,N,T}(x, y) = \sum_{i=0}^{(N/2)-1} s(n_i - n_{i+N/2})\, 2^i, \qquad s(x) = \begin{cases} 1, & x > T \\ 0, & \text{otherwise,} \end{cases}$$

where $n_i$ and $n_{i+N/2}$ are the gray values of center-symmetric pairs of pixels. $T$ is used to threshold the gray level difference so as to increase the robustness of the CSLBP feature on flat image regions. Since the gray levels are normalized in $[0, 1]$, the authors of [11] recommend using a small value for $T$.

It should be noticed that CSLBP is closely related to the gradient operator because, like some gradient operators, it considers gray level differences between opposite pixels in a neighborhood.

Given an image of size $I \times J$, after the computation of the CSLBP patterns, a histogram is built to represent the texture image:

$$H(k) = \sum_{i=1}^{I}\sum_{j=1}^{J} f\big(CSLBP_{R,N,T}(i, j), k\big), \quad k \in [0, 2^{N/2} - 1], \qquad f(x, k) = \begin{cases} 1, & x = k \\ 0, & \text{otherwise.} \end{cases}$$

By construction, the length of the histogram resulting from the CSLBP feature is $2^{N/2}$.
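A minimal NumPy sketch of the CSLBP operator and its 16-bin histogram follows, assuming $N = 8$, $R = 1$, nearest-pixel sampling, and gray levels normalized to $[0, 1]$; the threshold value `T` below is an assumption, not necessarily the paper's setting.

```python
# Minimal CSLBP sketch (N=8, R=1): 4 center-symmetric pairs -> 16 patterns.
import numpy as np

def cslbp(img, T=0.01):
    img = img.astype(np.float64) / 255.0   # normalize gray levels to [0, 1]
    H, W = img.shape
    code = np.zeros(img[1:-1, 1:-1].shape, dtype=np.int32)
    # The 4 center-symmetric pairs of the 8-neighborhood.
    pairs = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)),
             ((-1, 1), (1, -1)), ((0, 1), (0, -1))]
    for i, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        n_i = img[1 + dy1: H - 1 + dy1, 1 + dx1: W - 1 + dx1]
        n_j = img[1 + dy2: H - 1 + dy2, 1 + dx2: W - 1 + dx2]
        code += ((n_i - n_j) > T).astype(np.int32) << i
    hist, _ = np.histogram(code, bins=16, range=(0, 16))
    return hist / max(hist.sum(), 1)       # normalized 16-bin histogram
```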

2.4. CSLDP (Center-Symmetric Local Derivative Pattern)

The CSLDP operator [12] is a second-order derivative pattern in the center-symmetric directions. CSLDP captures more information by encoding the relationship between the central pixel and its center-symmetric neighbors. Moreover, CSLDP has a shorter length than LBP.

For an even number $N$ of neighboring pixels distributed on a radius $R$, the CSLDP operator produces $2^{N/2}$ possible distinct patterns and is defined as

$$CSLDP_{N,R}(x_c, y_c) = \sum_{i=0}^{(N/2)-1} s\big((n_i - n_c)(n_c - n_{i+N/2})\big)\, 2^i,$$

where $n_i$ and $n_{i+N/2}$ are the gray-scale values of neighboring pixels in center-symmetric directions and $n_c$ is the gray value of the central pixel located at $(x_c, y_c)$. The threshold function $s$, used to determine the type of local pattern transition, is defined as

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0. \end{cases}$$

For $N = 8$, a CSLDP pattern encodes the second-order center-symmetric derivatives at the pixel along the 0°, 45°, 90°, and 135° directions. The CSLDP histogram is constructed in the same way as for CSLBP, and its length is also $2^{N/2}$.

2.5. XCSLBP (Extended CSLBP)

The work in [13] proposes a new LBP variant called XCSLBP (eXtended CSLBP), which compares the gray values of pairs of center-symmetric pixels while also considering the central pixel, without increasing the histogram length. This combination makes the resulting descriptor robust to illumination changes and noise. For an even number $N$ of neighboring pixels distributed on a radius $R$, XCSLBP is expressed as

$$XCSLBP_{R,N}(c) = \sum_{i=0}^{(N/2)-1} s\big(g_1(i, c) + g_2(i, c)\big)\, 2^i,$$

with

$$g_1(i, c) = (n_i - n_{i+N/2}) + n_c, \qquad g_2(i, c) = (n_i - n_c)(n_{i+N/2} - n_c),$$

where $n_i$ and $n_{i+N/2}$ are the gray values of center-symmetric pairs of pixels and $n_c$ is the gray value of the central pixel $c$. The threshold function $s(x)$, which is used to determine the types of local pattern transition, equals 1 if $x \ge 0$ and 0 otherwise. The XCSLBP operator produces histograms with a length of $2^{N/2}$.
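Following the same pattern, here is a short sketch of the XCSLBP operator under the $g_1 + g_2$ formulation reconstructed above (again $N = 8$, $R = 1$, nearest-pixel sampling; the sign convention is an assumption of this sketch).

```python
# Minimal XCSLBP sketch (N=8, R=1); yields 16-bin histograms like CSLBP.
import numpy as np

def xcslbp(img):
    img = img.astype(np.float64) / 255.0
    H, W = img.shape
    c = img[1:-1, 1:-1]                    # central pixels
    pairs = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)),
             ((-1, 1), (1, -1)), ((0, 1), (0, -1))]
    code = np.zeros(c.shape, dtype=np.int32)
    for i, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        n_i = img[1 + dy1: H - 1 + dy1, 1 + dx1: W - 1 + dx1]
        n_j = img[1 + dy2: H - 1 + dy2, 1 + dx2: W - 1 + dx2]
        g1 = (n_i - n_j) + c               # pair difference plus center value
        g2 = (n_i - c) * (n_j - c)         # product of derivatives w.r.t. center
        code += ((g1 + g2) >= 0).astype(np.int32) << i
    hist, _ = np.histogram(code, bins=16, range=(0, 16))
    return hist / max(hist.sum(), 1)
```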

2.6. HOG (Histograms of Oriented Gradients)

Besides LBP and its variants, another histogram feature named HOG has also been widely accepted as one of the best features to capture edge or local shape information. The HOG feature was proposed by Dalal and Triggs [16] and is widely used for object detection in computer vision. The essential idea of the HOG feature is that the shape or appearance of a local object can be described by the distribution of intensity gradients and edge directions [17]. The HOG descriptor is a one-dimensional histogram of gradient orientations of intensity in local regions that can represent object shape.

3. Overview of Proposed Approach

In this section, a robust visual localization method based on multifeature combination is developed. The general principle is to find, among a set of previously acquired GPS-tagged training images, the one that best matches the currently acquired image.

As shown in Figure 4, HOG divides the image into small connected blocks, and, for each block, a histogram of the gradient directions of the pixels within the block is computed. The combination of these cell histograms represents the feature vector. At each pixel, the gradient is a 2D vector with a real-valued magnitude and a discretized direction (9 possible directions uniformly distributed over $[0°, 180°)$). During the construction of the integral image of HOG, the feature value at each pixel is treated as a 9D vector, and the value at each dimension is the interpolated magnitude value at the corresponding direction. Since HOG uses adjacent pixel gradient information as the basis for feature extraction, it is robust to geometric changes and is not easily affected by local lighting conditions.
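As an illustration, HOG descriptors of this kind can be computed with scikit-image's `hog()` function; this is a stand-in for the paper's implementation, and the cell and block parameters below are assumptions chosen to echo the settings described later (Section 4.2).

```python
# Block based HOG sketch using scikit-image (illustrative parameters).
from skimage.feature import hog

def hog_feature(gray, cell=32):
    return hog(gray,
               orientations=9,              # 9 directions in [0 deg, 180 deg)
               pixels_per_cell=(cell, cell),
               cells_per_block=(2, 2),      # 2x2 cells per normalization block
               block_norm='L2-Hys',
               feature_vector=True)         # concatenated 1D descriptor
```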

The whole system includes an offline phase and an online phase. In the offline phase, a set of $N$ GPS-tagged training image pairs (left and right images) is first acquired, where $N$ is the number of training image pairs. After image preprocessing (see Section 3.1), a multifeature set $F = \{f_1, f_2, \dots, f_N\}$ is extracted from the training database (see Sections 3.2 and 3.3), where $f_i$ is the multifeature extracted from the $i$-th training image pair. In the online phase, a multifeature $f_c$ is extracted from the current image pair and then compared with each multifeature $f_i$ of $F$ based on the Euclidean distance. The computed distances are then used to select the best candidate (see Section 3.4); the smaller the distance, the higher the similarity between the images. A distance ratio $r$ between the two best candidates (i.e., those corresponding to the two minimum computed distances) is considered for matching validation (see Section 3.5). If the ratio is lower than or equal to a threshold $t$, the first best candidate (the one with the lower matching distance) is confirmed as positive; otherwise, it is regarded as negative (in this case, no matching result is kept). When a matching is confirmed as positive, the current position is obtained from the matched GPS-tagged training image (see Section 3.6).

As illustrated in Figure 5, the overall approach comprises six stages:
(1) Image preprocessing: this step consists of downsampling and contrast-limited adaptive histogram equalization (detailed in Section 3.1).
(2) Block based feature extraction: the LBP feature is extracted from the gray-scale image and the disparity map; the HOG feature is extracted from the gray-scale image (detailed in Section 3.2).
(3) Multifeature concatenation: the final multifeature D-LBP++HOG is obtained by concatenating the LBP and HOG features (detailed in Section 3.3).
(4) Feature comparison and image matching: based on the extracted multifeature descriptors, image matching is conducted through multifeature comparison using the Euclidean distance (detailed in Section 3.4).
(5) Final matching validation: according to the distance ratio of the top two candidates, the image matching result is validated (detailed in Section 3.5).
(6) Visual localization: the vehicle's current position is obtained through the matched GPS-tagged training image (detailed in Section 3.6).

3.1. Image Preprocessing

Image preprocessing is composed of two parts: downsampling and contrast-limited adaptive histogram equalization (CLAHE).

Downsampling reduces the original image size, which makes feature extraction faster. In fact, it has already been shown in [18] that high resolution images are not more helpful than lower resolution ones. Therefore, downsampling is the first step before feature extraction. As is well known, illumination has a significant influence on outdoor image appearance. Therefore, the second preprocessing step is contrast-limited adaptive histogram equalization (CLAHE), which enhances the contrast of the gray-scale image [19]. Through this adjustment, the intensities are better distributed over the histogram, allowing areas of lower local contrast to gain higher contrast. The contrast amplification, especially in homogeneous areas, is limited to avoid amplifying any noise that might be present in the image. At the same time, it also decreases the influence of shadows. An image example after applying CLAHE can be seen in Figure 6. CLAHE preprocessing clearly improves the image contrast and brightens the image (especially in some dark parts).
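A minimal OpenCV sketch of this preprocessing chain (half-scale downsampling followed by CLAHE) is given below; the `clipLimit` and `tileGridSize` values are illustrative, not the paper's exact settings.

```python
# Preprocessing sketch: downsample, convert to gray, apply CLAHE.
import cv2

def preprocess(bgr_image):
    half = cv2.resize(bgr_image, None, fx=0.5, fy=0.5,
                      interpolation=cv2.INTER_AREA)      # downsample to half scale
    gray = cv2.cvtColor(half, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)                             # contrast-limited equalization
```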

3.2. Block Based Feature Extraction
3.2.1. Concept of Block Based Approach

Traditionally, local descriptors are calculated on full images, which keeps the size of the feature database reasonably low. However, local image areas of interest may be ignored, as full-image feature extraction does not capture enough local discriminative information.

To exploit local properties and enhance the image representation ability, image features are extracted from small image blocks (subimage areas) without any segmentation, and these independent feature descriptors are then concatenated to obtain the final image feature. To illustrate the block based feature extraction process, it is applied to an example in Figure 7. The block based approach captures the spatial properties of images and can be used with any histogram descriptor.

3.2.2. Block Based Feature Extraction

After image preprocessing, features are extracted, as illustrated in Figure 8. The LBP feature is extracted from the gray-scale image and the disparity map independently, while the HOG feature is extracted from the gray-scale image. For both LBP and HOG, the features are extracted from image blocks. To simplify block based feature extraction, all image blocks in a full image have the same size. The influence of different block sizes is studied in Section 5.1. Image parts that cannot fill a whole block are ignored.

(i) LBP Feature Extraction. The LBP features from the gray-scale image and the disparity map are obtained using the following equations:

$$f_{LBP}^{G} = [h_1^{G}, h_2^{G}, \dots, h_{M_1}^{G}], \qquad f_{LBP}^{D} = [h_1^{D}, h_2^{D}, \dots, h_{M_2}^{D}],$$

where $f_{LBP}^{G}$ is a vector which stores the LBP feature obtained from the gray-scale image and $f_{LBP}^{D}$ stores the LBP feature obtained from the disparity map. $M_1$ and $M_2$ are the numbers of image blocks in the gray-scale image and the disparity map, respectively. $h_k^{G}$ is the LBP histogram of the $k$-th block of the gray-scale image and $h_k^{D}$ is the LBP histogram of the $k$-th block of the disparity map. In our work, the disparity map is calculated using the SGBM (Semiglobal Block Matching) algorithm [20]. With this method, there are some useless parts ("black areas"), for which no depth information is computed, especially on the left and right sides of the disparity map. In these "black areas," the LBP operator is not applied; these useless parts are simply removed. Thus, due to the removal of the "black areas" in the disparity map, $M_1$ and $M_2$ are not identical.

By using the block based approach, the features $f_{LBP}^{G}$ and $f_{LBP}^{D}$ are extracted from the gray-scale image and the disparity map, respectively. Then, the D-LBP feature is computed by concatenating $f_{LBP}^{G}$ and $f_{LBP}^{D}$:

$$f_{D\text{-}LBP} = [f_{LBP}^{G}, f_{LBP}^{D}].$$
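The block based extraction and D-LBP concatenation can be sketched as follows (assuming any per-block histogram descriptor, e.g., the `cslbp()` sketch above; `valid_disparity` is a hypothetical helper that crops the SGBM "black areas" from the disparity map).

```python
# Sketch of block based feature extraction and D-LBP concatenation.
import numpy as np

def block_feature(image, block_size, descriptor):
    """Split the image into square blocks and concatenate their histograms."""
    H, W = image.shape
    feats = []
    for y in range(0, H - block_size + 1, block_size):   # leftover parts ignored
        for x in range(0, W - block_size + 1, block_size):
            block = image[y:y + block_size, x:x + block_size]
            feats.append(descriptor(block))
    return np.concatenate(feats)

def d_lbp(gray, disparity, valid_disparity, descriptor, block_size=32):
    f_gray = block_feature(gray, block_size, descriptor)
    f_disp = block_feature(valid_disparity(disparity), block_size, descriptor)
    return np.concatenate([f_gray, f_disp])              # D-LBP = [gray | disparity]
```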

(ii) HOG Feature Extraction. The HOG feature is also computed for each image block of the gray-scale image. The HOG features obtained from all the image blocks are then concatenated:

$$f_{HOG} = [h_1^{HOG}, h_2^{HOG}, \dots, h_{M_1}^{HOG}],$$

where $h_k^{HOG}$ is the HOG feature extracted from the $k$-th image block. It should be noted that HOG feature extraction adopts the same image block size as LBP feature extraction from the gray-scale image; therefore, the number of image blocks $M_1$ is the same.

3.3. Multifeature Concatenation

In order to take advantage of the different features, D-LBP and HOG are combined to represent the image. Since D-LBP and HOG are two independent features, we simply consider that they have the same weight in place recognition. The final multifeature is obtained through concatenation:

$$f = [f_{D\text{-}LBP}, f_{HOG}].$$

Using this method, the multifeature set $F = \{f_1, \dots, f_N\}$ of all training image pairs is obtained. For a current testing image pair, a multifeature $f_c$ is also obtained. Image matching is then conducted based on the Euclidean distance between the multifeature of the current testing image and all training image multifeatures from the training dataset.

3.4. Feature Comparison and Image Matching

Feature comparison is performed based on the Euclidean distance between features. Each testing image multifeature is compared with all the training image multifeatures of the training database.

The distance between the multifeature $f_c$ of the testing image and the multifeature $f_i$ of a training image is computed as follows:

$$d(f_c, f_i) = \| f_c - f_i \|_2,$$

where $\| \cdot \|_2$ denotes the Euclidean norm.

In fact, a small distance means high similarity. Based on the Euclidean distance, image matching candidates are searched. After distance computation, for the testing image, the two minimum distances ($d_1$ and $d_2$) and their corresponding training images (the two best candidates) are kept.

3.5. Final Matching Validation

For a given current image pair, the validation of a matching candidate from the training database is based on the ratio $r$, calculated as follows:

$$r = \frac{d_1}{d_2},$$

where $d_1$ and $d_2$ are, respectively, the first and second minimum distances between the current image multifeature and the multifeatures of all the training images:

$$d_1 = \min_{i} d(f_c, f_i), \qquad d_2 = \min_{i \ne i_1} d(f_c, f_i), \qquad i_1 = \arg\min_{i} d(f_c, f_i).$$

As said before, the lower the distance, the more similar the images. The potential matching candidate is the image giving the lowest distance with the testing image. However, if the second best matching candidate provides a distance very close to the first one, the matching algorithm yields two confusable solutions. In this case, we propose to ignore the matching result and consider that the testing image has no matching image. For that, a threshold $t$ is applied to the ratio $r$, which takes its values in the range $[0, 1]$.

The final decision is as follows: if $r$ is lower than or equal to the threshold $t$, then the matching is considered positive and the pair is matched. Otherwise, the matching is considered negative and the pair is ignored.
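Sections 3.4 and 3.5 together amount to a nearest-neighbor search with a ratio test, sketched below; the training multifeatures are assumed stacked row-wise in a NumPy array, and the default `t = 0.88` is the UTBM-1 threshold reported at 100% precision in Section 5.4.

```python
# Sketch of matching (Section 3.4) and ratio-test validation (Section 3.5).
import numpy as np

def match_with_ratio_test(f_c, training_features, t=0.88):
    dists = np.linalg.norm(training_features - f_c, axis=1)  # Euclidean distances
    order = np.argsort(dists)
    d1, d2 = dists[order[0]], dists[order[1]]
    r = d1 / d2 if d2 > 0 else 1.0              # ratio of the two best candidates
    if r <= t:
        return order[0]                          # positive: matched training image
    return None                                  # negative: ambiguous match, ignored
```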

3.6. Visual Localization

After the image matching result is successfully validated, the vehicle can localize itself through the matched training image position. Since the training images are tagged with GPS or pose information, the vehicle obtains its position by assimilating it to the GPS position of the training image matched with the current testing image. This is a topological-level localization; that is, the system simply identifies the most likely location. Therefore, it is not a very accurate metric localization, because the training and testing trajectories are not exactly the same.

It should be noted that some places cannot be localized in the case of validation failure (negative matching).

4. Experimental Setup

4.1. Datasets and Ground-Truth

The proposed method is tested on four different datasets (UTBM-1, UTBM-2, KITTI 05, and KITTI 06).

The route taken for the UTBM-1 dataset is shown in Figure 9(a): the experimental vehicle traversed about 4 km in a typical outdoor environment. Three typical areas were traversed: an urban city road (area a), an industrial area with factory buildings (area b), and a natural scene surrounding a lake (area c). The training and testing data were collected at different times, respectively, on 2014/9/11 and 2014/9/5. The training database is composed of 849 images while the testing database is composed of 819 images. The average distance between two successive frames was around 3.5 m. To tag the training images, the GPS position of each image was obtained by an RTK-GPS receiver.

The UTBM-2 dataset (Figure 9(b)) consists of a 2.3 km route in Belfort city downtown acquired on 2014/9/5. The first traversal, performed in the morning, provided the training dataset, and the second, conducted in the afternoon, provided the testing dataset. Each traversal took approximately 20 minutes. The training database is composed of 540 images while the testing database is composed of 520 images. The GPS information of each image was also collected.

The popular KITTI benchmark dataset is also used to test our proposal. The KITTI Odometry dataset has 22 sequences containing a total of 44182 stereo images (39.2 km). These sequences include environments with different characteristics and challenging situations such as perceptual aliasing and scene changes. Among them, the sequences KITTI 05 and KITTI 06, which contain loop closures, were selected to evaluate our method. There are 2761 and 1101 images in the KITTI 05 and KITTI 06 datasets, respectively.

For UTBM-1 and UTBM-2 datasets, ground-truth was constructed by manually finding pairs of frame correspondences according to the GPS data, while the KITTI dataset ground-truth was built according to the pose information [21].

4.2. Image Preprocessing and Feature Extraction

In our work, for faster feature extraction, the original color images were downsampled into half-scale gray-scale images. That means images in the UTBM-1 and UTBM-2 datasets were resized to 640 × 480 and images in the KITTI 05 and KITTI 06 datasets were resized to 613 × 235.

In order to reduce the illumination influence on the outdoor image appearance, contrast-limited adaptive histogram equalization (CLAHE) method was used (see Section 3.1).

Moreover, as a pair of images is acquired at each instant, a disparity map can be computed easily using the SGBM (Semiglobal Block Matching) algorithm [20].
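A minimal OpenCV sketch of the SGBM disparity computation is given below; the parameter values are illustrative defaults rather than the paper's exact settings.

```python
# SGBM disparity sketch with OpenCV (illustrative parameters).
import cv2
import numpy as np

def compute_disparity(left_gray, right_gray, block=9, num_disp=64):
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disp,        # must be divisible by 16
        blockSize=block,
        P1=8 * block * block,           # smoothness penalty for small jumps
        P2=32 * block * block)          # smoothness penalty for large jumps
    disp = sgbm.compute(left_gray, right_gray)
    return disp.astype(np.float32) / 16.0   # OpenCV returns fixed-point (x16) values
```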

After image preprocessing, the binary descriptors (LBP, CLBP, CSLBP, CSLDP, and XCSLBP) are extracted with the following parameters: 8 sampling points and a radius of 3 pixels. The HOG descriptor is extracted from the gray-scale images. To capture large-scale spatial information, the HOG cell size is set to match the LBP image block size (see Section 3.2.2). The number of cells in each block is specified as a 2-element vector.

An example of the extracted image features can be seen in Figure 10. The local binary features pay more attention to texture information, and it can be noted that CSLBP and XCSLBP perform better than LBP. The HOG feature depicts the object shape information in the image. Therefore, combining LBP and HOG features brings more information and describes places (scenes) better.

4.3. Performance Evaluation

Precision-recall characteristics and the $F_1$ score are widely used to show the effectiveness of image retrieval methods. Therefore, our evaluation methodology is based on precision-recall curves and the $F_1$ score. In our experiments, the number of training images is larger than or equal to the number of testing images; thus each testing image has a ground-truth match. Among the positives, there are true positives (correct results among the successfully validated image matching candidates) and false positives (wrong results among the successfully validated image matching candidates). The sum of the true positives and false positives is the total number of retrieved images.

More specifically, precision is the ratio of true positives over the number of retrieved images (all the successfully validated image matching candidates), and recall is the ratio of true positives over the total number of testing images:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{N_{test}},$$

where $TP$ and $FP$ are the numbers of true and false positives and $N_{test}$ is the number of testing images.

The final curve is computed by varying the threshold $t$ (applied to the ratio $r$) linearly between 0 and 1 and calculating the corresponding values of precision and recall. 100 threshold values are considered to obtain well-defined curves. When the threshold is set to 1, all candidates whose ratio is below or equal to 1 are positive; in this case, the number of retrieved images is identical to the number of testing images. When the threshold is 0, only candidates whose ratio equals 0 are regarded as positive; in this case, there is no retrieved image.

Precision relates the number of correct matches to the number of false matches, whereas recall relates the number of correct matches to the number of missed matches. A perfect system would return a result where both precision and recall have a value of one. The $F_1$ score is a single value that indicates the overall effectiveness of an image retrieval method. Based on the precision and recall, the $F_1$ score is defined as

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$
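The evaluation protocol can be sketched as follows, assuming that, for each testing image, we have the validation ratio of its best candidate and a boolean flag indicating whether that candidate is a correct match.

```python
# Sketch of the precision-recall sweep over 100 threshold values.
import numpy as np

def pr_curve(ratios, correct, n_test, n_steps=100):
    curve = []
    for t in np.linspace(0.0, 1.0, n_steps):
        retrieved = ratios <= t                   # validated matches at threshold t
        tp = np.sum(retrieved & correct)
        fp = np.sum(retrieved & ~correct)
        precision = tp / (tp + fp) if tp + fp > 0 else 1.0
        recall = tp / n_test
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        curve.append((t, precision, recall, f1))
    return curve
```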

5. Experiments and Results

Different aspects of our proposal are evaluated in the following sections. In Section 5.1, the performance of the binary features (LBP and its variants) with and without disparity information is studied, together with the influence of the image block size on the binary feature D-CSLBP. In Section 5.2, the effect of the multifeature fusion proposed in our approach is analyzed. Note that the experimental results in Sections 5.1 and 5.2 are based on the Euclidean distance. In Section 5.3, the efficiency of our LSH based visual recognition is assessed: the execution time and recognition performance of the complete system are evaluated. Finally, visual localization at the 100% precision level is discussed in Section 5.4.

5.1. Comparison of the Different Binary Features and Image Block Sizes
5.1.1. Performance of Different Binary Features

In this section, we compare the performance of the binary features in two situations: with and without the disparity map. Here the features are compared based on the Euclidean distance.

Table 1 gives the $F_1$ scores of the binary descriptors in two cases (without and with disparity information). It can be seen that LBP, CLBP, CSLBP, and CSLDP with disparity information improve the image retrieval ability, as the $F_1$ scores are higher with disparity information than without. Among them, D-CSLBP is the best one; it achieves the highest $F_1$ score.

Figure 11 depicts the precision-recall curves obtained by the different binary features on two typical datasets, UTBM-1 and KITTI 06. It can be seen that D-CSLBP performs better than the features D-LBP, D-CLBP, D-CSLDP, and D-XCSLBP. Also, the maximum recall at 100% precision for D-CSLBP is higher than that of the other features.

5.1.2. Comparison of Different Image Block Sizes

In this section, the influence of block size of block based D-CSLBP feature is studied.

Small block sizes permit discriminating local details, while large block sizes make the representation more robust. Each image block is square in our experiments (a block of 32 × 32 pixels is referred to as block size 32 for short). The place recognition performance of the D-CSLBP feature is evaluated with different block sizes: 32, 64, 128, and 32 + 64 + 128 (multiblock: the three block sizes used together).

According to Figure 12, increasing the block size from 32 to 64 and 128 decreases the place recognition ability. Computing the D-CSLBP feature with the combination of block sizes 32, 64, and 128 achieves only slightly better performance than the D-CSLBP feature with block size 32.

This indicates that the binary descriptor D-CSLBP extracted with a small block size benefits from discriminative local details, while feature extraction with a larger block size tends to make the image representation drop some discriminative information.

However, when the block size is too small, the abundant information does not bring further improvement to the image matching process. At the same time, a smaller image block size increases the computation time of feature extraction. Therefore, in the following experiments, the image block size for D-CSLBP is set to 32.

5.2. Performance of Multifeature Combination

In this section, we compare the performance of the multifeature descriptor (D-CSLBP++HOG) with that of each single feature descriptor.

Figure 13 shows the precision-recall curves obtained with the different tested features: D-CSLBP, HOG, and D-CSLBP++HOG. It can be seen that combining the binary feature D-CSLBP with HOG improves the image retrieval performance: the combination achieves better results than each single feature, which means it is useful for place recognition.

Table 2 compares the $F_1$ scores of the different features with the state-of-the-art FAB-MAP method. It confirms that the multifeature D-CSLBP++HOG achieves better results than any single feature: the $F_1$ score of D-CSLBP++HOG is the highest for all four datasets. Furthermore, the proposed method outperforms the FAB-MAP method.

For a better comprehension of the proposed multifeature, an example of distance matrices for the UTBM-1 dataset is presented in Figure 14. Here, to clearly demonstrate feature performance, the distance matrix is normalized into the [0, 1] range. The distances of the same or similar images are close to 0 (red color), while larger distances correspond to colors close to yellow. As plotted in Figure 14(b), the ground-truth line is red. When perceptual aliasing occurs, some red points (noise) appear outside the ground-truth line. In the distance matrix provided by our method using the D-CSLBP++HOG feature (see Figure 14(c)), the noise appearing around the diagonal (ground-truth line) due to perceptual aliasing is clearly reduced with respect to the other feature approaches (CSLBP, D-CSLBP, and HOG). All the previous statements are supported by the precision-recall curves depicted in Figure 13(a) and the results in Table 2.

We can thus conclude that integrating HOG and disparity information improves the image matching results. The reason why D-CSLBP++HOG achieves better performance than the other features is mainly that the feature combination takes advantage of texture, shape, and depth information, which makes the image representation more robust than each single feature considered independently.

5.3. LSH Based Visual Recognition

Since the block based feature dimension is large in our approach, computing the Euclidean distance between high-dimensional feature vectors is an expensive operation. Therefore, in order to speed up image matching significantly, the locality sensitive hashing (LSH) method, which preserves the Euclidean similarity [22], is used for visual recognition. LSH is arguably the most popular unsupervised hashing method and has been applied to many problems, including information retrieval and computer vision [23]. The work in [23] demonstrates that the Euclidean distance between two high-dimensional vectors can be closely approximated by the Hamming distance between the respective hashed bit vectors; the more hash bits used, the better the approximation.

The LSH method simply uses a random projection matrix to project the high-dimensional data into a low-dimensional binary (Hamming) space; that is, each data point is mapped to a $K$-bit vector, called the hash key. Approximate nearest neighbors can thus be found in sublinear time. A key ingredient of locality sensitive hashing is mapping similar features to the same bucket with high probability.

More precisely, for the multifeature $f_c$ obtained from a testing image and $f_i$ obtained from a training image, the hashing functions $h$ from the LSH family satisfy the following locality preserving property:

$$\Pr\big[h(f_c) = h(f_i)\big] = \operatorname{sim}(f_c, f_i),$$

where the similarity measure $\operatorname{sim}(\cdot, \cdot)$ is directly linked to the Euclidean distance function. Hash keys are constructed by applying $K$ binary-valued hash functions to each image feature. Each binary-valued LSH function consists of a random projection and a threshold:

$$h_{w,b}(f) = \begin{cases} 1, & w^{T} f + b \ge 0 \\ 0, & \text{otherwise,} \end{cases}$$

where $w$ is a $d$-dimensional data-independent random hyperplane, usually drawn from a standard Gaussian distribution [24], and $b$ is a random intercept. For a normalized dataset with zero mean, an approximately balanced partition is obtained with $b = 0$.

By applying the $K$ binary-valued hash functions to each image feature, the high-dimensional multifeatures $f_c$ and $f_i$ are converted into low-dimensional $K$-bit vectors $b_c$ and $b_i$. Since $b_c$ and $b_i$ are binary, they can be compared much more efficiently in the low-dimensional Hamming space than the original features.
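A minimal NumPy sketch of this random-projection hashing is given below (with $b = 0$, assuming zero-mean features; the function names are illustrative). The training multifeatures are hashed once offline, and online queries are then ranked by Hamming distance instead of Euclidean distance.

```python
# Random-projection LSH sketch: K Gaussian hyperplanes -> K-bit hash keys.
import numpy as np

def make_hasher(dim, n_bits=4096, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_bits, dim))        # random hyperplanes w
    return lambda f: (W @ f >= 0)                 # intercept b = 0 (zero-mean data)

def hamming(b1, b2):
    """Hamming distance between two boolean bit vectors."""
    return np.count_nonzero(b1 != b2)
```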

In our experiment, we compare the place recognition performance achieved with hashed multifeatures of different binary lengths ($K$ bits) on the four datasets in Figure 15. Since the image sizes are different, the multifeature dimension is 18696 in the UTBM-1 and UTBM-2 datasets and 6432 in KITTI 05 and KITTI 06. It can be seen that using 4096 or 2048 bits retains more than 86% of the full-feature place recognition performance.

Table 3 shows the $F_1$ scores obtained with different hash bit lengths applied on the multifeature (D-CSLBP++HOG) of our place recognition method. The average matching time is also presented; it does not include the feature extraction time. The experiments were conducted on a laptop with an Intel i7-4700MQ CPU and 32 GB RAM.

As Table 3 shows, the average matching time using 4096 bits is almost half of that using the Euclidean distance over the original full features. Compared with full multifeature matching, hashing the original multifeature into 4096 bits makes the distance computation and comparison easier and faster. For large-scale datasets, the speed-up advantage would be even more significant.

5.4. Visual Localization Results

In the previous section, hashing the original feature into 4096 bits showed good place recognition performance. Therefore, in this section, we describe the visual localization results achieved with 4096 hash bits.

Figure 15 shows the final place recognition results for the different datasets at a precision level of 100%. For the datasets UTBM-1 and UTBM-2, we obtained 23.81% and 11.35% recall at 100% precision, respectively, while on the KITTI 05 and KITTI 06 datasets, recall rates of 17.38% and 32.39%, respectively, were achieved at the same precision level. It should be noted that, at the 100% precision level, the obtained place recognition results are all correct. A correct place recognition means a successful visual localization; therefore, the higher the recognition rate (recall) at 100% precision, the more robust the visual localization system.

When the threshold value $t$ is adjusted, the recognition precision also changes. At the 100% precision level, each recognized place is a true positive and its localization error is small (depending on the ground-truth criterion, 5 m in our case). To achieve the 100% precision level, the threshold value $t$ is set to 0.88 and 0.58 for the UTBM-1 and UTBM-2 datasets, respectively.

When the threshold is set to 1, every image matching result is positive; in this case, the precision level is the lowest and there are many false matches for place recognition, which lead to large localization errors. In general, a small threshold yields few false recognition cases.

In addition, at visual recognition precision levels below 100%, the recognized places are not all correct: some falsely recognized places appear. For these falsely recognized places, the localization error can be very large, because the testing image may be wrongly matched to any image in the training database. That is also the reason why some locations exhibit huge localization errors.

Table 4 gives the average localization error and recall ratio at different precision levels. For all these datasets, at 100% precision, the minimum localization error is 0 while the maximum error is not larger than 5 m. It should be noted that, at the 100% precision level, some places cannot be recognized, and no localization results are obtained at those places. This problem can be addressed by visual odometry techniques or extra sensors (such as LiDAR or an Inertial Measurement Unit (IMU)).

6. Conclusion and Future Works

In this paper, we presented a visual vehicle localization approach that uses a multifeature built from the gray-scale image and the disparity map. The multifeature concatenates the D-CSLBP and HOG features to take advantage of texture, depth, and shape information. Block based feature extraction is used to account for spatial information. Image matching using the proposed multifeature D-CSLBP++HOG based on locality sensitive hashing makes the visual recognition more efficient. Our experimental results demonstrate that this approach provides more effective place recognition based visual localization in outdoor environments than the state-of-the-art FAB-MAP method.

However, in long-term visual localization, place recognition is prone to be influenced by appearance or seasonal changes. The future objective of our research is to achieve robust long-life localization at different times and seasons. Sequence matching will be considered for place recognition in future work.

Abbreviations

FAB-MAP:Fast Appearance Based Mapping.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Yongliang Qiao performed experiments, analyzed the data, and wrote the paper; Zhao Zhang participated in paper preparation and revising.

Acknowledgments

Thanks are due to Mr. Yassine Ruichek and Ms. Cindy Cappelle for the writing guidance and help.