Abstract

The ambiguity resulting from repetitive structures in a scene presents a major challenge for image matching. This paper proposes a matching method based on SIFT feature saliency analysis to achieve robust feature matching between images with repetitive structures. The feature saliency within the reference image is estimated by analyzing feature stability and dissimilarity via Monte-Carlo simulation. In the proposed method, feature matching is performed only within the region of interest to reduce the ambiguity caused by repetitive structures. The experimental results demonstrate the efficiency and robustness of the proposed method, especially in the presence of repetitive structures.

1. Introduction

Image matching is a core operation in many computer vision tasks [1–4]. Various approaches focus on improving the appearance distinctiveness and repeatability of features and design distinctive descriptors to find more reliable matching pairs [5–8]. Locally defined descriptors can provide very accurate information that is invariant to scale changes and scene clutter [9]. However, current research has indicated that existing feature matching methods such as the Ratio-Match [7], Self-Match [10], and Mirror-Match [11] criteria are so restrictive that they fail to match repeated features [1]. All these methods compare the best and second-best matches to obtain reliable matches; they differ only in how the best and second-best matches are found. For repeated or similar features arising from the repetitive structures of a scene, these methods can hardly find unique matches because the best matches are nearly indistinguishable from the second-best matches.

Identifying enough reliable matches from the repetitive structures in a scene is difficult due to the ambiguity resulting from such repetition [12–15]. To reduce the matching ambiguity and find more reliable correspondences, Guo and Cao proposed a 2D-2D image matching method based on the triangle constraint [15]. This method has also been successfully applied to visual attention regions for video copy detection [14]. They apply the Ratio-Match criterion [7] in a bimatching procedure to select initial matches, or seed vertices, and construct triangle constraints to reduce the search space for matching globally similar features. However, the bimatching method is sometimes unable to identify any unique matches owing to the low robustness of the Ratio-Match criterion in cases such as urban scenes where buildings have repetitive structures [1]. The ambiguity resulting from repetitive structures, which are common in visual scenes, remains a challenge for image matching [13].

This paper proposes a robust matching strategy that relies on the saliency analysis of image features. It is based on the computation of the distances between features and the estimation of their standard deviation via a Monte-Carlo approach. To exploit more information from repetitive structures [16], the correspondences of salient features are found first and used to determine the regions of interest for matching the repeated ones; meanwhile, similar features outside each region are excluded, reducing the matching ambiguity. The experimental results demonstrate the efficiency and robustness of the proposed method, especially in the presence of a large portion of repetitive structures.

This paper is organized as follows. After reviewing the related work in this section, Section 2 introduces a region-based image matching strategy based on calculating the feature saliency. Section 3 demonstrates the experimental results. Conclusions are drawn in Section 4.

2. Robust Image Matching Strategy Based on SIFT Saliency

2.1. Saliency Definition for the SIFT Feature

The local saliency of an object is usually defined as the contrast between the object and its neighborhood, as illustrated by biologically plausible models such as the center-surround mechanism [17, 18]. As illustrated in Figure 1, spots of different colors represent different SIFT features, and the solid and dashed circles illustrate different regions of interest (ROIs). The distinctiveness of the same yellow spot differs between the solid and dashed regions: the yellow spot is more salient in the solid circular region than in the dashed region because there are no other yellow spots within the solid circle. As a result, the yellow spot can easily be detected in the solid circular region using the color feature alone. In the dashed region, however, this approach fails since yellow spots occur more than once within the region.

Figure 1 illustrates that the difficulty of feature matching in a region with repetitive structures is strongly related to the repetitiveness of the feature inside the region. Accordingly, this paper defines feature saliency in terms of the differences between features over the whole image region.

Given a feature set $F = \{f_1, f_2, \ldots, f_n\}$ extracted from the image $I$, the saliency of a local feature $f_i$ is defined as
$$S(f_i) = \min_{j \neq i} \frac{\lVert f_i - f_j \rVert}{\sigma_i}, \tag{1}$$
where the saliency of $f_i$ is the minimum difference between the features $f_i$ and $f_j$ at the locations $x_i$ and $x_j$ of image $I$. The feature difference is the ratio between the Euclidean distance of the two features $f_i$ and $f_j$ and the standard deviation $\sigma_i$ of $f_i$. The computation of $\sigma_i$ will be explained in Section 2.2.
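To make the definition concrete, the following minimal NumPy sketch computes (1) for a matrix of descriptors; the array names `F` and `sigma` and the brute-force pairwise distance computation are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def feature_saliency(F, sigma):
    """Eq. (1): saliency of each feature as the minimum, over all other
    features of the same image, of ||f_i - f_j|| / sigma_i.

    F     : (n, d) array of SIFT descriptors from the reference image
    sigma : (n,) array of per-feature standard deviations (Section 2.2)
    """
    dist = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # exclude the j == i case
    return dist.min(axis=1) / sigma  # S(f_i) for every i
```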

Equation (1) can be applied to estimate feature saliency in a ROI [14]. As demonstrated in Figure 1, such estimation is crucial for tackling the ambiguity resulting from repetitive structures. In this paper, we focus on performing the saliency analysis over the whole image $I$. According to (1), if a feature is distinctive in an image, it is likely to be salient when the ROI is the whole image region. In Figure 1, for instance, the blue and green spots are more salient than the yellow and red spots over the whole image since there are no other spots of similar color in the same ROI. Similarly, the central yellow spot within the solid circle becomes salient once the solid circle is taken as the ROI.

2.2. Estimation of Feature Saliency via Monte-Carlo Simulation

The feature saliency defined in (1) can be estimated via a Monte-Carlo method whose degrees of freedom are application-dependent. The independent variables include blur, illumination change, JPEG compression, and the parameters of the camera and its pose. Given a reference image, the Monte-Carlo method for feature saliency computation is straightforward and includes the following steps (see the sketch after this list):
(1) Define a set of independent image variables and their probability distributions according to the application settings.
(2) Generate a set of simulated images randomly from the probability distributions over the independent variables.
(3) Extract local image features from the simulated images.
(4) Conduct a repeatability test to select robust features.
(5) Calculate the standard deviation for each local feature over the simulated images.
(6) Calculate the saliency of each local feature in the reference image by applying (1).
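As one possible realization of steps (1)–(3) for the blur-only setting of Figure 2, the sketch below draws Gaussian blur levels at random and extracts SIFT features with OpenCV; the sample count, blur range, and function names are assumptions made for illustration.

```python
import cv2
import numpy as np

def simulate_and_extract(reference, n_samples=20, sigma_range=(0.5, 4.0), seed=0):
    """Steps (1)-(3): sample a blur level per simulated image, synthesize the
    image from the reference, and extract SIFT keypoints and descriptors."""
    rng = np.random.default_rng(seed)
    sift = cv2.SIFT_create()
    simulated = []
    for _ in range(n_samples):
        s = rng.uniform(*sigma_range)                 # step (1): blur ~ U(a, b)
        img = cv2.GaussianBlur(reference, (0, 0), s)  # step (2): simulated image
        kps, desc = sift.detectAndCompute(img, None)  # step (3): SIFT features
        simulated.append((img, kps, desc))
    return simulated
```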

For example, the reference image $I_0$ shown at the bottom of Figure 2 is taken from the Bikes scene of the INRIA dataset [19], where it is matched against a set of test images with varying blur. In the Monte-Carlo process, the Gaussian blur algorithm is used to blur the reference image and generate the simulated images $I_1, I_2, \ldots, I_N$ shown in the left column of Figure 2. For each image, SIFT [7] features are extracted and shown in the right column. As a result, for each SIFT feature $f_i$ in the reference image, there ideally exists a corresponding SIFT feature $f_i^k$ in each simulated image $I_k$. Owing to the invariance of the SIFT feature, we can assume a normal distribution of the feature $f_i$. Thus, the standard deviation $\sigma_i$ of the feature in the feature space can be calculated as follows:
$$\sigma_i = \sqrt{\frac{1}{N}\sum_{k=1}^{N} \lVert f_i^k - \bar{f}_i \rVert^2}, \qquad \bar{f}_i = \frac{1}{N}\sum_{k=1}^{N} f_i^k. \tag{2}$$

In practice, it is necessary to remove feature outliers when estimating $\sigma_i$ in (2). Techniques from robust statistics can be applied to obtain a reliable estimate. To this end, we perform a repeatability test on the SIFT features to remove the outliers: features whose repeatability rate is not less than a threshold are regarded as stable, and only stable features are used in (2). In this paper, the threshold is chosen to retain a considerable number of feature key points for the matching process.
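A sketch of steps (4)–(5) under these conventions follows; the track representation (one descriptor array per reference feature) and the placeholder repeatability threshold of 0.6 are assumptions, since the paper leaves the exact value unspecified.

```python
import numpy as np

def feature_sigmas(tracks, n_simulated, min_repeatability=0.6):
    """Steps (4)-(5): keep features passing the repeatability test, then
    estimate sigma_i of Eq. (2) as the RMS deviation of the descriptors
    observed for feature i across the simulated images.

    tracks : dict mapping feature index -> (k_i, d) array of descriptors
             matched to that feature in k_i of the n_simulated images
    """
    sigma = {}
    for i, obs in tracks.items():
        if len(obs) / n_simulated < min_repeatability:
            continue                  # unstable feature: treated as an outlier
        mean = obs.mean(axis=0)
        sigma[i] = np.sqrt(np.mean(np.sum((obs - mean) ** 2, axis=1)))
    return sigma
```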

2.3. Saliency-Based Feature Matching

Given two feature sets $F$ and $G$ extracted from the reference image $I_r$ and an observation image $I_o$, respectively, the proposed saliency-based feature matching method includes the following three steps.

2.3.1. Salient Feature Selection

The salient features are selected by the aforementioned repeatability test and by thresholding the feature saliency to ensure the stability and distinctiveness of the features. The salient feature set $F_s$ over the whole image can be determined by
$$F_s = \{ f_i \in F \mid S(f_i) > T_s \}, \tag{3}$$
where $f_i$ is a salient feature and $T_s$ is the threshold for the feature selection. As indicated by statistical hypothesis testing theory, the larger the value of $T_s$, the more salient and reliable the features in $F_s$ for matching.
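In code, the selection of (3) is a single thresholding step on top of the saliency values; the threshold value below is illustrative only.

```python
import numpy as np

def select_salient(F, sigma, T_s=3.0):
    """Eq. (3): indices of features whose saliency exceeds T_s.
    T_s = 3.0 is an illustrative value, not one prescribed by the paper."""
    S = feature_saliency(F, sigma)   # Eq. (1), defined above
    return np.flatnonzero(S > T_s)
```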

2.3.2. Salient Feature Matching

Let $f_i \in F_s$ be a salient feature of the whole image $I_r$. Its matching feature is identified using the following criteria:
$$\frac{\lVert f_i - g_1 \rVert}{\sigma_i} \le T_1 \quad \text{and} \quad \frac{\lVert f_i - g_2 \rVert}{\sigma_i} \ge T_2, \tag{4}$$
where $g_1 \in G$ is the nearest neighbor to the salient feature $f_i$, $g_2$ is the second-nearest neighbor, and $T_1$ and $T_2$ are the matching thresholds: the distances to the best and second-best matches are compared with $T_1$ and $T_2$, respectively. The standard deviation $\sigma_i$ is obtained by the Monte-Carlo simulation described in Section 2.2. To find unique matches of the salient features, the second condition in (4) acts as a constraint verifying whether other similar features exist in the observation image because of occlusion or other unpredictable changes. The larger the value of $T_2$, the greater the matching reliability that can be achieved. Moreover, if a given salient feature can be matched with only one candidate feature, we consider the match unique and reliable since there is no ambiguity caused by repetitive features in either the reference image or the observation image. To guarantee the matching quality of the salient features, the relationship between $T_1$ and $T_2$ should satisfy
$$T_1 < T_2. \tag{5}$$
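A sketch of the criteria in (4) and (5) is given below; the paper fixes $T_1 = 2$ (Section 3.1), while $T_2 = 3$ here is only an illustrative choice satisfying $T_1 < T_2$.

```python
import numpy as np

def s_match(F_sal, sigma_sal, G, T1=2.0, T2=3.0):
    """Eq. (4): accept the nearest neighbour g1 of a salient feature f when
    ||f - g1|| / sigma <= T1 and the second nearest g2 satisfies
    ||f - g2|| / sigma >= T2, with T1 < T2 as required by Eq. (5)."""
    matches = []
    for i, (f, s) in enumerate(zip(F_sal, sigma_sal)):
        d = np.linalg.norm(G - f, axis=1) / s  # normalized distances to all g_j
        j1, j2 = np.argsort(d)[:2]
        if d[j1] <= T1 and d[j2] >= T2:        # unique, reliable match
            matches.append((i, j1))
    return matches
```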

Existing methods [7, 10, 11] share some similarity with the proposed saliency-based approach in finding unique matches. In those methods, the criterion for identifying the matching feature of a feature $f_i$ can be formulated as follows:
$$\frac{\lVert f_i - g_1 \rVert}{\lVert f_i - g_2 \rVert} \le T_{ratio}, \tag{6}$$
where $g_1$ and $g_2$ are the closest and second-closest neighbors of $f_i$.

If the ratio of the closest distance to the second-closest distance is greater than $T_{ratio}$, the match is rejected as an ambiguous false match.
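For comparison, a minimal version of the ratio criterion (6) reads as follows, with $T_{ratio} = 0.8$ as suggested in [7].

```python
import numpy as np

def ratio_match(F, G, T_ratio=0.8):
    """Eq. (6): accept a match only when the closest/second-closest
    distance ratio does not exceed T_ratio."""
    matches = []
    for i, f in enumerate(F):
        d = np.linalg.norm(G - f, axis=1)
        j1, j2 = np.argsort(d)[:2]
        if d[j1] / d[j2] <= T_ratio:
            matches.append((i, j1))
    return matches
```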

According to (4) and (6), both our method and the existing methods [7, 10, 11] perform neighbor ranking to find the best and second-best matches, but the proposed saliency-based matching method is faster than the other approaches owing to its feature selection: if fewer stable features are selected as salient features, there are fewer L2 distances between $f_i$ and $g_j$ to compute. Furthermore, the proposed matching criterion is more theoretically sound since it is based on the principles of statistical hypothesis testing under the assumption that SIFT features are normally distributed owing to their invariance properties. Thus, to achieve an approximately 95% prediction interval, we need to set $T_1 \approx 2$.

2.3.3. Region-Based Feature Matching

The features generated from repetitive visual structures are usually similar, resulting in a high mismatch rate if the spatial locations of these features are not taken into consideration. Each repetitive feature has its local saliency region, as illustrated in Figure 1. If such an ROI can be determined and no other similar features lie within this small region, the repetitive feature becomes salient there and the matching robustness is expected to improve.

In visual search, the human visual system directs its attention from salient objects to less salient ones [20]. Inspired by this, we propose identifying the robust matches of salient features first and then applying these robust matches to determine the ROIs of less salient features, especially repetitive ones. We assume that reliable matches of repetitive structures can be found if the saliency region of each feature can be determined so that unique matches exist within it. In this paper, we use the robust matches of the salient features as initial matches, or seeds, in the triangle constraint method (T-CM) [15] to determine a small ROI for each repetitive feature.

The quality of the seed features greatly affects the final matching results since the region-based matching process starts by using these seeds to determine ROIs for the other features. This quality comprises not only the robustness of the seeds but also their number and spatial distribution. In the seed selection, the matching threshold of each method can be adjusted to obtain different numbers of seeds. Normally, the larger the number of seeds, the denser their spatial distribution and the smaller the area of each ROI or triangle region. According to Section 2.1, a small ROI is desirable for each repetitive feature because the feature is more likely to differ from the fewer features inside the ROI, which helps find a reliable correspondence. Obtaining a large number of seeds, however, requires relaxing the matching threshold at the sacrifice of matching robustness. Balancing the matching robustness and the number of seeds remains an open issue; in most cases, the matching threshold is set by experience [7, 10, 11].

In this paper, we focus on evaluating the distribution quality of a small number of seeds so as to obtain high matching robustness. When the number of seeds is small, the area coverage of the seed features plays an important role in finding more candidate matches. We define the area coverage as follows:
$$C = \frac{A(F_{seed})}{A(F)}, \tag{7}$$
where $A(\cdot)$ is the area bounded by the convex hull of a feature set, and $F_{seed}$ and $F$ are the sets of the seed features and all the extracted features, respectively.

Figure 3 illustrates the area coverage of a small number of robust seed features. According to (7), the value of $C$ lies between 0 and 1, and the larger the value of $C$, the better the area coverage of the seed features. Figure 3(a) illustrates seed features with an ideal distribution quality: the matching ambiguity of each repetitive feature is expected to be eliminated when $C$ equals 1 and each repetitive feature is contained in only one triangle region constructed by the seeds. Figure 3(b) illustrates seed features with an area coverage smaller than 1.0, which exhibits two problems. First, multiple similar features lie in the same ROI, such as the four repetitive features in the dotted triangle region, so the ambiguity is not resolved. Second, not all the repetitive features are covered by the convex hull of the seed features; the ROIs for these uncovered features cannot be determined, which decreases the chance of finding more candidate matches. Moreover, Figure 3(c) illustrates seed features with the same area coverage as in Figure 3(b) but a different spatial distribution. Compared with Figure 3(b), the seed features in Figure 3(c) are distributed more evenly, indicating that the spatial distribution of the seed features is closely related to the ROI of each repetitive feature. To quantify the difference between the spatial distributions shown in Figures 3(b) and 3(c), we adopt the following descriptor, as proposed in [21]:
$$D = \frac{1}{\bar{A}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(A_i - \bar{A}\right)^2}, \qquad \bar{A} = \frac{1}{n}\sum_{i=1}^{n} A_i, \tag{8}$$
where $n$ is the number of ROIs constructed by the seed features and $A_i$ is the area of each ROI.

Considering both the area coverage and the spatial distribution, we propose an overall measure to describe the quality of the seed distribution:
$$Q = \frac{D}{C}. \tag{9}$$

The smaller the value of $Q$, the better the quality of the seed distribution.
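The three measures can be computed from the seed coordinates alone. In the sketch below, the Delaunay triangles of the seeds stand in for the ROIs, and the concrete forms of (8) and (9) follow the reconstructions given above rather than the exact formulas of [21].

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def seed_distribution_quality(seed_xy, all_xy):
    """Area coverage C (Eq. (7)), spread descriptor D (Eq. (8)), and the
    combined measure Q = D / C (Eq. (9)); smaller Q means a better seed
    distribution. In 2-D, ConvexHull.volume is the enclosed area."""
    C = ConvexHull(seed_xy).volume / ConvexHull(all_xy).volume
    areas = []
    for simplex in Delaunay(seed_xy).simplices:   # each triangular ROI
        a, b, c = seed_xy[simplex]
        areas.append(0.5 * abs((b[0] - a[0]) * (c[1] - a[1])
                               - (b[1] - a[1]) * (c[0] - a[0])))
    areas = np.asarray(areas)
    D = areas.std() / areas.mean()                # relative spread of ROI areas
    return C, D, D / C
```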

3. Experimental Results

3.1. Datasets and Procedures

Few public resources exist for testing our matching method, since each evaluation requires not only an image pair but also a large number of training images for computing the proposed feature saliency and selecting the robust features. Thus, in the evaluation, we generate the training set for each image pair by applying scale and rotation transformations, or by adding noise, to the given reference image from the INRIA dataset [19]. To generate a large number of test sets, we adopt a scheme similar to the existing literature [11] and generate a large number of cropped pairs with repetitive structures.

The INRIA dataset [19] contains 48 images of 8 scenes, taken under different imaging conditions including varying light, blur, viewpoint, and JPEG compression. The dataset provides homography functions as the ground truth for feature matching. Figure 4 shows a matching pair of the Wall scene, where repetitive structures are extensively present.

In the experiment, the matching threshold $T_1$ is set to 2 for the proposed method to achieve an approximately 95.45% prediction interval, as discussed in Section 2.3.2. The threshold $T_{ratio}$ is set to 0.8 for the other methods to eliminate 90% of the false matches at the sacrifice of less than 5% of the correct matches, as suggested by [7].

Both the matching precision and the recall rate are used to express the performance of SIFT feature matching:
$$\text{precision} = \frac{N_c}{N_c + N_f}, \qquad \text{recall} = \frac{N_c}{N_g},$$
where $N_c$, $N_g$, and $N_f$ are the numbers of correct matches, ground-truth matches, and incorrect matches, respectively. The ground-truth matches can be obtained using the homography function $H$ provided by the database. If the positions of two matched features are $x_i$ and $x_j$, they should satisfy
$$\lVert H(x_i) - x_j \rVert < \varepsilon.$$

The value of $\varepsilon$ is the distance threshold, which is set to 4 pixels.
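Under these definitions, precision and recall can be computed as in the following sketch; the homogeneous-coordinate projection and the counting of ground-truth matches (reference features with at least one admissible correspondence) are assumed readings of the protocol.

```python
import numpy as np

def precision_recall(matches, xy_ref, xy_obs, H, eps=4.0):
    """Precision = N_c / (N_c + N_f) and recall = N_c / N_g, where a match
    (i, j) is correct when ||H(x_i) - x_j|| < eps pixels."""
    def project(p):
        q = H @ np.array([p[0], p[1], 1.0])
        return q[:2] / q[2]
    n_c = sum(np.linalg.norm(project(xy_ref[i]) - xy_obs[j]) < eps
              for i, j in matches)
    n_f = len(matches) - n_c
    n_g = sum((np.linalg.norm(xy_obs - project(x), axis=1) < eps).any()
              for x in xy_ref)   # assumed definition of ground-truth matches
    return n_c / max(n_c + n_f, 1), n_c / max(n_g, 1)
```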

In this paper, S-Match denotes the proposed saliency-based feature matching method. It consists of three main steps: feature saliency computation, seed selection, and region-based image matching. When imaging conditions such as the parameters of the camera and its pose are unknown, it is difficult to generate simulated images under the same conditions to compute the saliency of SIFT features. To avoid estimating these parameters, we use both simulated images and the images provided by the INRIA dataset [19] for the saliency computation in the Monte-Carlo method described in Section 2.2. Thus, the Leave-One-Out Cross-Validation (LOOCV) [22] test shown in Algorithm 1 is designed. To reduce variability, all possible rounds of cross-validation are performed for each scene in the INRIA dataset [19], and the validation results are averaged over the rounds.

Input. An image group $G = \{I_k \mid k = 0, 1, \ldots, n\}$ of one scene, where the reference image is $I_0$.
(1) Select one image $I_k$, $k \neq 0$, that has not yet been used as the observation image;
(2) Compute the SIFT feature saliency based on the image training set according to Section 2;
(3) Perform the feature matching between $I_0$ and $I_k$;
(4) If the reference image has been matched with all the other images from $G$, go to (5); else go to (1);
(5) Compute the average matching results according to (11)–(13).
Output. The average matching results.
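The loop structure of Algorithm 1 can be sketched as follows; `compute_saliency`, `match_images`, and `evaluate` are hypothetical helpers standing in for Sections 2.2, 2.3, and the evaluation measures (11)–(13).

```python
import numpy as np

def loocv_scene(images, ref_index=0):
    """Algorithm 1: every non-reference image of the scene serves once as the
    observation image; the remaining images (plus simulated ones) act as the
    training set for the saliency computation."""
    results = []
    for k, obs in enumerate(images):
        if k == ref_index:
            continue
        train = [im for j, im in enumerate(images) if j not in (ref_index, k)]
        sigma = compute_saliency(images[ref_index], train)     # Section 2 (assumed helper)
        matches = match_images(images[ref_index], obs, sigma)  # Section 2.3 (assumed helper)
        results.append(evaluate(matches))                      # Eqs. (11)-(13) (assumed helper)
    return np.mean(results, axis=0)  # average over all LOOCV rounds
```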
3.2. Feature Matching with Repetitive Structures

To show the robustness of the proposed S-Match method in the presence of repetitive structures, we compare our approach with the state-of-the-art methods including Ratio-Match [7], Self-Match [10], and Mirror-Match [11]. These methods rely on evaluating the similarity between the best and second-best matches to find reliable matches and reduce the ambiguities caused by repetitive structures [10].

Figure 5 illustrates the results of matching all the extracted features in the Wall scene shown in Figure 4. There are 2305 SIFT features extracted from the reference image to be matched. The numbers of feature matches found by the S-Match, Ratio-Match, Self-Match, and Mirror-Match methods are shown in Table 1.

Table 1 provides the matching precision and recall rates for the matches shown in Figure 5. Coincidentally, the S-Match, Self-Match, and Mirror-Match methods find the same number of matches. Among these three methods, the proposed S-Match method achieves the highest matching precision. The Ratio-Match method obtains the maximum number of matches with the best recall rate but the worst matching precision. Since the matching criteria of the other three methods are stricter than that of Ratio-Match, they identify fewer matches with higher matching precision but a lower recall rate.

We use precision-recall curves for a comprehensive comparison of the different matching methods. Figure 6 shows the precision-recall curves for 3 scenes from the INRIA dataset [19]. The results of the other 5 scenes in the dataset are not illustrated because all the matching methods achieve similar performance on them. In Figure 6, the curves are obtained by varying the matching thresholds in (4) and (6). Each curve represents the average matching results of one method obtained from the LOOCV test. For each scene, we compare the average matching precision and recall rates of the different methods. The results indicate that the proposed method achieves better matching precision for the Wall, Trees, and Graffiti scenes. As the threshold changes, the precision of the proposed method declines more gradually than that of the other methods. Such high precision is preferred for region-based matching methods when selecting the seeds and will be exploited in the following section.

3.3. Seed Selection for Region-Based Feature Matching

Robust identification of reliable seeds is the foundation of region-based feature matching. In this paper, the S-Match method is used to determine the seeds. Bimatching methods using different matching criteria, including Ratio-Match, Self-Match, and Mirror-Match, are applied for comparison; they are denoted biRatio-Match, biSelf-Match, and biMirror-Match, respectively. Figure 7 shows seed selection for the Trees scene. Taking the 51 seeds determined by the different methods as an example, the qualities of the seed distributions differ. We apply (9) to compute $Q$ and analyze the quality of each seed distribution. Table 2 shows the results, which indicate that the proposed S-Match method achieves the highest quality.

We use the average precision, recall rates, and seed distributions to compare the quality of the seeds comprehensively. The experimental procedure for computing these quality measures for each scene is shown in Algorithm 2.

Input. All the matching pairs of each scene for the LOOCV test.
(1) Set a strict threshold for each method, as stated in Section 3.1, to find the robust seeds of each matching pair;
(2) Determine the minimum number of seeds among all the methods;
(3) If a method finds more seeds than the minimum number, go to (4); else go to (6);
(4) Sort the seeds in ascending order of their similarity values, computed according to the matching criteria (4) and (6);
(5) Select the determined number of seeds from the top of the sequence sorted in step (4);
(6) Compute the precision, recall, and $Q$ of the seeds for each method according to (11)–(13);
(7) If the quality computation in step (6) is finished for all the matching pairs, go to (8); else go to (1);
(8) Compute the average quality of the seeds for each scene.
Output. The average quality results for the seeds of each scene.

Figure 8 shows representative quality results of the seeds for comparing the different matching methods. In Figure 8, each data point represents the average quality results of the seeds over a whole scene excluding the cropped pairs. For each scene, we compare the average precision, recall rate, and seed distribution of the different methods. Figure 8(a) demonstrates that the average precision of the seeds found by the S-Match method is approximately 100%, while the recall rates shown in Figure 8(b) are similar to those of the other methods. In most cases, the proposed method achieves a better average quality of seed distribution, except for the Graffiti scene, as shown in Figure 8(c).

We use the determined seeds to perform the triangle constraint for the region-based feature matching. Following the method proposed in [15], a triangle region for each unmatched feature is determined by three seed features in its neighborhood. The feature difference between each unmatched feature $f_i$ located at $x_i$ and a candidate feature $g_j$ located at $x_j$ is computed as follows [15]:
$$d_{ij} = \lVert f_i - g_j \rVert, \qquad j = 1, 2, \ldots, m,$$
where $m$ is the number of candidate features extracted from the observation image. The best match for each unmatched feature is determined by the minimum value of $d_{ij}$.
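A simplified stand-in for this step is sketched below: matched seed locations define corresponding Delaunay triangles in the two images, and each unmatched reference feature is compared only against observation features inside its triangle. The details of the original T-CM [15], such as its exact difference measure, are not reproduced here.

```python
import numpy as np
from scipy.spatial import Delaunay

def region_match(unmatched, xy_ref, desc_obs, xy_obs, seed_xy_ref, seed_xy_obs):
    """Triangle-constrained matching: restrict the candidates of each
    unmatched reference feature (index i, descriptor f) to the observation
    features lying inside the triangle corresponding to the one that
    contains x_i in the reference image."""
    tri = Delaunay(seed_xy_ref)
    matches = []
    for i, f in unmatched:
        t = tri.find_simplex(xy_ref[i])
        if t < 0:
            continue                             # outside the seed convex hull
        corners = seed_xy_obs[tri.simplices[t]]  # corresponding ROI in the observation
        inside = Delaunay(corners).find_simplex(xy_obs) >= 0
        cand = np.flatnonzero(inside)
        if cand.size == 0:
            continue
        d = np.linalg.norm(desc_obs[cand] - f, axis=1)
        matches.append((i, cand[np.argmin(d)]))  # minimum feature difference d_ij
    return matches
```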

To validate the effectiveness of the proposed region-based feature matching method for finding more reliable matches from repetitive structures, both the average precision and the average recall rate of the total matches are given in Figure 9. In the figure, the methods combined with the triangle constraint are denoted S-Match + TC, biRatio-Match + TC, biSelf-Match + TC, and biMirror-Match + TC, respectively. The S-Match + TC method outperforms the other methods: in most cases, it achieves both a higher precision and a higher recall rate. In the case of the Bark scene, the average precisions of S-Match + TC, biRatio-Match + TC, biSelf-Match + TC, and biMirror-Match + TC are 84.44%, 86.16%, 86.99%, and 86.91%, as shown in Figure 9(a), while the average recall rates are 45.78%, 29.84%, 29.29%, and 30.04%, respectively, in Figure 9(b). Here the precision of the S-Match + TC method is slightly lower, but a significantly higher recall rate is obtained. The results shown in Figure 9 indicate that the proposed S-Match + TC identifies more correct matches. For the Graffiti scene in particular, the proposed method achieves the best final results among all the methods even though its average quality of seed distribution is not the best according to Figure 8(c), indicating that the precision of the seeds also plays a critical role in region-based feature matching.

3.4. Computational Complexity

As shown in Section 2.2, the Monte-Carlo simulation for computing the feature saliency is time-consuming. However, it can be performed offline, before pairwise matching begins, in practical applications such as landmark-based navigation [23, 24]. In these applications, a detailed understanding of the environment is usually required for selecting a small number of features.

Given the feature saliency, the proposed S-Match method has a lower computational complexity than the existing state-of-the-art methods because it preselects a small set of salient and stable features as matching candidates in the reference image by adjusting the saliency threshold $T_s$ and the repeatability threshold. When a scene contains a large number of repetitive structures, reducing the computational cost in this way has important practical significance. Table 3 lists the computational complexity of both the proposed method and the other single-direction matching methods [7, 10, 11]. Given $n$ features in the reference image and $m$ features in the observation image, our method can be implemented in $O(sm)$, where $s$ is the number of preselected salient features from the reference image. When the selected feature number $s$ is far less than $n$ or $m$, the proposed method performs more efficiently.

4. Conclusions

In this paper, we have presented a robust image matching strategy based on SIFT saliency. The feature saliency analysis reveals the close relationship between feature distinctiveness and the region of interest. We first find robust matches for the salient features and then apply the triangle constraint to reduce the ambiguities caused by repetitive features. Experimental results show that the proposed saliency-based feature matching method outperforms the state-of-the-art methods, and more correct matches of repetitive features can be found when it is combined with the triangle constraint method.

In the future, we will focus on estimating the ROI for each repetitive feature and further improving the region-based matching performance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by National Natural Science Foundation of China (no. 61272523), the National Key Project of Science and Technology of China (no. 2011ZX05039-003-4), General Project of Education Department of Liaoning Province (no. L201683682), and the Fundamental Research Funds for the Central Universities (no. DUT15QY33).