Abstract

Data structures such as $k$-D trees and hierarchical $k$-means trees perform very well in approximate nearest neighbour matching, but are only marginally more effective than linear search when performing exact matching in high-dimensional image descriptor data. This paper presents several improvements to linear search that allow it to outperform existing methods and recommends two approaches to exact matching. The first method reduces the number of operations by evaluating the distance measure in order of significance of the query dimensions and terminating when the partial distance exceeds the search threshold. This method does not require preprocessing and significantly outperforms existing methods. The second method improves query speed further by presorting the data using a data structure called $d$-D sort. The order information is used as a priority queue to reduce the time taken to find the exact match and to restrict the range of data searched. The $d$-D sort structure is very simple to construct, requires no parameter tuning, takes significantly less time to build than the best-performing tree structure, and allows data to be added relatively efficiently.

1. Introduction

The nearest neighbour (NN) matching problem is encountered in many applications of computer science. It is the problem of finding the points in a database nearest to a given query point. The complexity of a simple linear search is proportional to $nd$, where $n$ is the number of database entries and $d$ is the number of data dimensions. Many attempts have been made to reduce the search time by implementing data storage and indexing structures so that the minimum number of data points has to be compared to the query point. Unfortunately, these methods are only effective in low dimensions or when using approximate nearest neighbour matching [1, 2]. Where an exact solution is required in more than 20 dimensions, linear search is only a fraction slower than the best existing search structure.

This paper proposes methods for improving the performance of linear search for the purpose of exact nearest neighbour matching using typical visual descriptors, such as the scale invariant feature transform (SIFT) [3, 4]. Many modern visual descriptors, such as the gradient location and orientation histogram (GLOH) [5], are based on SIFT and have very similar characteristics from a search perspective. First, it is shown that simple modifications to the linear algorithm allow it to outperform all existing search structures, without the need for any data preprocessing. Secondly, presorting the data reduces search time further. A multidimensional sorting data structure called $d$-D Sort is introduced that is trivially simple to construct, does not require any parameter tuning, and supports an efficient data addition operation. Thirdly, the search space is reduced by intersecting the search sphere with the data unit sphere.

The proposed methods are compared to leading methods based on $k$-D trees and $k$-means trees [1, 2]. A set of SIFT descriptors was extracted from the MIRFLICKR-1M image collection [6] for this purpose. Two approaches are recommended based on the results. The first requires no preprocessing and significantly outperforms leading tree structures that require a large amount of preprocessing. The second improves results further at the cost of additional memory and preprocessing. Which method is best depends on the application.

The most popular indexing and search methods are based on $k$-D trees [7] and hierarchical $k$-means trees [8, 9]. Due to the so-called curse of dimensionality [10], these methods become less effective as the data dimensionality increases. In the exact matching case, linear search is the best method for data with more than 20 dimensions. In many applications, such as where feature vocabularies are used instead of original features [11], an exact match is not essential. Many methods have been proposed to reduce computation time by finding approximate matches.

A probabilistic best-bin-first search algorithm for $k$-D trees was proposed in [12], where the total number of data points evaluated is fixed to limit computation time. This method does not enforce explicit bounds on the accuracy of the result. In [1] a method is proposed which ensures that, given tolerance $\epsilon$, the distance between a query and a selected nearest neighbour is within a factor of $(1 + \epsilon)$ of the distance between the query and the true nearest neighbour. The construction process does not depend on $\epsilon$. A priority queue search is used, where the priority is determined by the distance of cells in the tree from the query point. Cells further than $\mathrm{dist}(\mathbf{q}, \mathbf{p})/(1 + \epsilon)$ are not evaluated, where $\mathbf{q}$ is the query and $\mathbf{p}$ is the current nearest neighbour. These best-bin-first and approximate matching methods can also be applied to $k$-means trees in a very similar fashion.

While $k$-D trees and hierarchical $k$-means trees yield similar results on average, their relative performance depends on the data and the type of search query. A procedure for selecting between these algorithms and automatically tuning their parameters is presented in [2]. This procedure constructs a set of sample structures with various parameters from a subset of the data. The parameters of the best-performing structure are then used to build the complete data structure. The parameter selection process can significantly increase the construction time.

More recently, locality sensitive hashing (LSH) has become a very popular approach to approximate matching. Hashing functions are used to map high-dimensional vectors to buckets. The hashing functions are designed to maximise the likelihood that similar descriptors are mapped to the same bucket. The query point is also hashed multiple times to select target buckets. The descriptors in the target buckets are then compared to the query using the original descriptors in a post-verification stage. The choice of hashing functions needs to be made with the aid of training data that is very similar to the final application data. This method is highly effective for approximate matching, but a large number of buckets need to be selected to ensure that the exact match is found, leading to unremarkable performance in exact matching [13, 14].

This paper focuses on the exact matching case, where the true nearest neighbour is required. Systems based on $k$-D trees and $k$-means trees support guaranteed exact search, but at best outperform linear search by only a fraction. Instead of modifying these structures, this paper investigates methods for improving linear search and shows that it can outperform the above methods.

The objective of visual descriptor extraction is usually to capture some structure. Consequently, the elements of the resulting descriptor vector are not uniformly distributed, but exhibit some structure as well. Additionally, these feature vectors are frequently normalised to have unit length and contain only non-negative values. Figure 1 shows the distribution of descriptor element values for a 128-dimensional SIFT descriptor computed from a large set of maximally stable extremal region (MSER) features [15], extracted from a natural scene. It can be seen that this distribution is approximately exponential. For an evenly distributed unit vector descriptor, one would expect the median value to be near $1/\sqrt{d}$ (about 0.088 for $d = 128$); however, only 15% of the elements of the SIFT descriptor are above this value. It can therefore be said that most descriptors consist of a small set of relatively large values and a majority of small values (and often many zeros). Since the larger values of any descriptor are both rare and responsible for the majority of the distance between unit vector descriptors, it is possible to evaluate a large component of the distance by considering only a small number of dimensions.

3.1. Partial Distance Evaluation

While the exact distance between a query descriptor and the $k$ nearest descriptors is usually of interest, the distance to other descriptors is not, except to determine that all other descriptors are more distant than the $k$th descriptor. Evaluation of the distance measure can therefore be aborted once it exceeds the distance to the current best candidate for the $k$th descriptor (used as a threshold). In the case of normalised vectors, the number of evaluated dimensions can be further reduced by evaluating the distance measure in order of the significance of the query dimensions. This does not generalise to unnormalised, uniformly distributed data. While partial distances have been used before [16], the computation in order of significance is novel. All that is required for ordered partial distance computation is to find the order of significance of the query dimensions. This can be done using a sorting operation in $O(d \log d)$ time.

The effectiveness of this method depends on how long it takes the search algorithm to encounter the nearest neighbour. At the start of the search, a relatively large search threshold is used (tight search thresholds have been shown to be ineffective [4]), resulting in complete distance computation for at least the first descriptor. As descriptors closer to the query are encountered, the search threshold is decreased and the partial evaluation becomes more effective.

Implementing the above improvements is trivial and requires only the following three modifications. (1) Sort the query dimensions according to significance. (2) Modify the distance function so that the distance measure is computed in order of query dimension significance. (3) Terminate the distance function as soon as the partially computed distance exceeds a threshold.
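The three modifications can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation evaluated in this paper: the function names (`significance_order`, `ordered_partial_sq_dist`, `linear_search_1nn`) are hypothetical, squared Euclidean distance is assumed as the distance measure, and the descriptors are assumed to be plain sequences of floats.

```python
import math


def significance_order(query):
    """Indices of the query dimensions, most significant (largest) first."""
    return sorted(range(len(query)), key=lambda i: query[i], reverse=True)


def ordered_partial_sq_dist(x, query, order, sq_threshold):
    """Accumulate the squared distance over dimensions in `order`.

    Returns None as soon as the partial sum exceeds sq_threshold,
    otherwise the complete squared distance.
    """
    total = 0.0
    for i in order:
        diff = x[i] - query[i]
        total += diff * diff
        if total > sq_threshold:
            return None  # cannot beat the current threshold, stop early
    return total


def linear_search_1nn(data, query, threshold=2.0):
    """Exact 1NN by linear search; returns (index, distance) or (None, None)."""
    order = significance_order(query)
    best_index, best_sq = None, threshold * threshold
    for index, x in enumerate(data):
        sq = ordered_partial_sq_dist(x, query, order, best_sq)
        if sq is not None and (best_index is None or sq < best_sq):
            best_index, best_sq = index, sq  # tighter threshold from now on
    if best_index is None:
        return None, None
    return best_index, math.sqrt(best_sq)
```

Because the threshold shrinks every time a closer descriptor is found, later descriptors tend to be rejected after only a few of the most significant dimensions have been evaluated.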

3.2. Searching on Sorted Arrays

Presorting the data according to one of the dimensions can be used to improve performance in two ways. Firstly, the sorted dimension can be used as a search priority queue. Binary search can be used to find a starting point that is close to the query point in at least one dimension, from where the search progresses outwards. If the sorted dimension is one of the more significant query dimensions, then the linear search is more likely to encounter the nearest neighbour early in the search, thereby improving the effectiveness of partial distance evaluation. The opposite is true when the sorted dimension is one of the least significant query dimensions (this is demonstrated experimentally in Section 5). The best results are achieved by sorting the data on every dimension and selecting the order associated with the query’s dominant dimension.

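Secondly, the sorted dimension can be used to restrict the search range. A naive approach to defining the search limits would be to simply search in the range $[q_p - t, q_p + t]$, where $\mathbf{q}$ is the query, $t$ is the current search threshold (distance to nearest neighbour), and $p$ is the query’s dominant dimension. This is equivalent to stopping the search when the ordered partial distance exceeds the threshold on the first dimension. Because the data is normalised to unit vectors, much tighter constraints are possible. The search limits on $x_p$ are set to the minimum and maximum values that satisfy the intersection of the data unit hypersphere and the search radius hypersphere.

The data space can be defined as

$$\{\mathbf{x} : \|\mathbf{x}\| = 1,\; x_i \geq 0 \;\; \forall i\}. \tag{1}$$

The boundary, $B$, of the search space around query point $\mathbf{q}$ additionally satisfies the constraint:

$$\|\mathbf{x} - \mathbf{q}\| = t. \tag{2}$$

Let $\mathbf{x} = \mathbf{a} + \mathbf{b}$, where $\mathbf{a}$ is the component of $\mathbf{x}$ parallel to $\mathbf{q}$ and $\mathbf{b}$ is the orthogonal component such that $\mathbf{a} \cdot \mathbf{b} = 0$. Since $\|\mathbf{x}\| = \|\mathbf{q}\| = 1$, (2) gives $\mathbf{x} \cdot \mathbf{q} = 1 - t^2/2$, so that $\mathbf{a} = (1 - t^2/2)\,\mathbf{q}$. The length of $\mathbf{b}$ is found as

$$\|\mathbf{b}\|^2 = 1 - \|\mathbf{a}\|^2 = t^2 - \frac{t^4}{4}. \tag{3}$$

In order to maximise $x_p$, choose $\mathbf{b}$ so that $b_i$ is minimised for all $i \neq p$, that is, $b_i \propto -q_i$. Add scale factor $s$ to correct the length and select $b_p$ to satisfy the orthogonality constraint: Let $b_i = -s q_i$ for $i \neq p$ and $b_p = s(1 - q_p^2)/q_p$, and solve for $s$ using the known length of $\mathbf{b}$:

$$\|\mathbf{b}\|^2 = \frac{s^2 (1 - q_p^2)}{q_p^2}, \qquad s = \frac{q_p \|\mathbf{b}\|}{\sqrt{1 - q_p^2}}, \qquad b_p = \|\mathbf{b}\| \sqrt{1 - q_p^2}. \tag{4}$$

Similarly, $x_p$ can be minimised, resulting in

$$x_p^{\pm} = \left(1 - \frac{t^2}{2}\right) q_p \pm \|\mathbf{b}\| \sqrt{1 - q_p^2}. \tag{5}$$

Note that $\sqrt{1 - q_p^2}$ depends only on $\mathbf{q}$ and can be computed once per query.

The above equations only find $\mathbf{x}$ with minimum or maximum value of $x_p$ if the search area is not bounded by one of the coordinate axes. If the $p$-axis intersects the search area, then the maximum value of $x_p$ is 1, and if any other axis intersects the search area, then the minimum value of $x_p$ is 0. Let $d_p$ be the distance between $\mathbf{q}$ and the unit vector along dimension $p$, and let $d_{\min}$ be the distance between $\mathbf{q}$ and the unit vector along the minimum dimension of $\mathbf{q}$. The upper and lower search limits along dimension $p$ are then,

$$u = \begin{cases} 1, & t \geq d_p, \\ x_p^{+}, & \text{otherwise}, \end{cases} \qquad l = \begin{cases} 0, & t \geq d_{\min}, \\ x_p^{-}, & \text{otherwise}. \end{cases} \tag{6}$$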
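The search limits can be illustrated numerically. The sketch below assumes unit-length, non-negative data and query vectors, a query value `q_p` in the sorted dimension, and a threshold `t`; the function name `search_limits` is hypothetical. For unit vectors, a distance of at most `t` is equivalent to a dot product with the query of at least `1 - t*t/2`, which is the only geometric fact the code relies on. As a simplification, the lower limit is simply clamped at 0 instead of testing the distance to a coordinate axis, which can only make the range slightly wider, never too narrow.

```python
import math


def search_limits(q_p, t):
    """Bounds on coordinate p for unit vectors within distance t of the query.

    q_p: the (non-negative) query value in the sorted dimension p.
    t:   the current search threshold (Euclidean distance).
    """
    if t >= 2.0:
        return 0.0, 1.0  # the search sphere covers the whole unit sphere
    c = 1.0 - 0.5 * t * t                    # minimum dot product x.q
    b_len = math.sqrt(max(0.0, 1.0 - c * c))  # length of the orthogonal part
    spread = b_len * math.sqrt(max(0.0, 1.0 - q_p * q_p))
    # If the unit vector along p lies inside the search sphere, x_p can reach 1.
    upper = 1.0 if q_p >= c else c * q_p + spread
    # Data values are non-negative, so the lower limit never drops below 0.
    lower = max(0.0, c * q_p - spread)
    return lower, upper
```

For example, with `q_p = 0.8` and `t = 0.2` the naive range is 0.6 to 1.0, while the sphere intersection gives roughly 0.665 to 0.903.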

4. The $d$-D Sort System

A search structure named $d$-D Sort is used to implement the sorted linear search improvements presented in the previous section.

4.1. Construction

The $d$-D Sort structure consists of $d$ index arrays, $\mathbf{o}_i$, with each $\mathbf{o}_i$ indexing the descriptor data, $\mathbf{x}$, in order of dimension $i$. It requires $O(nd)$ memory in addition to the data. Construction is extremely simple: sort the descriptor data according to each dimension $i$ and write the resulting order to $\mathbf{o}_i$ (instead of actually reordering the data), requiring $O(dn \log n)$ time.
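A minimal construction sketch, assuming the data is held as a list of equal-length rows; `build_order_arrays` is a hypothetical name, and a production implementation would more likely use a typed array library than Python lists.

```python
def build_order_arrays(data):
    """Return one index array per dimension.

    orders[i] lists the row indices of `data` sorted by the value in
    dimension i; the data itself is left in its original order.
    """
    n = len(data)
    d = len(data[0]) if data else 0
    return [
        sorted(range(n), key=lambda row, i=i: data[row][i])
        for i in range(d)
    ]
```

Each of the sorts is independent of the others, so the construction parallelises trivially over dimensions.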

4.2. Adding Data

Adding data to the $d$-D Sort structure is relatively simple as well. The new data, $\mathbf{x}'$, is appended to the existing data, $\mathbf{x}$. The ordering arrays are then updated by finding the position of each new descriptor. This can be achieved efficiently using binary search in $O(\log n)$ time per order array, followed by an array insertion in linear time. For a large batch of $m$ new descriptors, a more efficient solution is to sort the new data first ($O(m \log m)$ time per order array) and then merge the result with the existing order arrays ($O(n + m)$ time per order array), rather than performing multiple separate insertions. This addition operation is faster than a complete reconstruction where the new data contains fewer descriptors than the existing data.
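The batch variant (sort the new rows, then merge) can be sketched as follows. The name `add_batch` and the list-of-lists layout are assumptions made for illustration; the merge uses the standard library's `heapq.merge`, which combines already-sorted inputs in a single linear pass.

```python
import heapq


def add_batch(data, orders, new_rows):
    """Append new_rows to data and merge them into every order array in place."""
    first_new = len(data)
    data.extend(new_rows)
    new_ids = range(first_new, len(data))
    for i in range(len(orders)):
        key = lambda row, i=i: data[row][i]
        new_order = sorted(new_ids, key=key)  # sort only the new rows
        # Linear-time merge of two runs that are already sorted on dimension i.
        orders[i] = list(heapq.merge(orders[i], new_order, key=key))
```

Existing row indices are never renumbered, so any references to previously stored descriptors remain valid after the update.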

4.3. Queries

All matching queries are variants of a $k$ nearest neighbours ($k$NN) query with a maximum distance threshold. The algorithm is listed in Algorithm 1.

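  Input: data $\mathbf{x}$, order arrays $\mathbf{o}$, query $\mathbf{q}$, number of neighbours $k$, threshold $t$, tolerance $\epsilon$
  Output: result indices $R$, result distances $D$, number of results $c$
(1.1) begin
(1.2)   Sort query, $\mathbf{q}$, in ascending order, yielding the order of significance of its dimensions.
(1.3)   Select prime dimension, $p$.
(1.4)   Select data order vector $\mathbf{o}_p$ as the search order.
(1.5)   Compute $d_p$ and $d_{\min}$.
(1.6)   Compute $\sqrt{1 - q_p^2}$ according to (5).
(1.7)   Compute initial search range, $[l, u]$, according to (6).
(1.8)   Use binary search to find the start position $j$ in $\mathbf{o}_p$ nearest to $q_p$.
(1.9)   foreach $\mathbf{x}_i$ in order of increasing distance from $j$ do
(1.10)    If $x_{i,p}$ is out of search range $[l, u]$, terminate.
(1.11)    Compute distance = dist($\mathbf{x}_i$, $\mathbf{q}$) using the ordered partial distance method.
(1.12)    if distance $< t$ then
(1.13)      Insert $i$, distance in results list, $R$, $D$.
(1.14)      Increment $c$ (up to $k$).
(1.15)      if $c = k$ then
(1.16)        Reduce the search range, $t = \max(D)/(1 + \epsilon)$.
(1.17)        Recompute $[l, u]$ according to (6).
(1.18)      end
(1.19)    end
(1.20)  end
(1.21) end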
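The query procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 under stated assumptions, not the evaluated implementation: the data and query are unit-length, non-negative vectors stored as lists, `orders` holds one sorted index array per dimension, exact matching is used (no tolerance), and the lower range limit is simply clamped at 0. All names (`coordinate_limits`, `partial_sq_dist`, `knn_query`, `SLACK`) are hypothetical.

```python
import heapq
import math

SLACK = 1e-12  # guards the range test against floating-point rounding


def coordinate_limits(q_p, t):
    """Range of coordinate p over unit vectors within distance t of the query."""
    if t >= 2.0:
        return 0.0, 1.0
    c = 1.0 - 0.5 * t * t  # minimum dot product with the query
    spread = math.sqrt(max(0.0, (1.0 - c * c) * (1.0 - q_p * q_p)))
    lower = max(0.0, c * q_p - spread)
    upper = 1.0 if q_p >= c else c * q_p + spread
    return lower - SLACK, upper + SLACK


def partial_sq_dist(x, q, dims, sq_limit):
    """Squared distance summed over dims; None once it exceeds sq_limit."""
    total = 0.0
    for i in dims:
        total += (x[i] - q[i]) ** 2
        if total > sq_limit:
            return None
    return total


def knn_query(data, orders, q, k=1, t=2.0):
    """Return up to k (distance, row) pairs within distance t, nearest first."""
    dims = sorted(range(len(q)), key=lambda i: q[i], reverse=True)
    p = dims[0]                # dominant query dimension
    order = orders[p]          # rows sorted by their value in dimension p
    lo, hi = 0, len(order)     # binary search for the start position
    while lo < hi:
        mid = (lo + hi) // 2
        if data[order[mid]][p] < q[p]:
            lo = mid + 1
        else:
            hi = mid
    left, right = lo - 1, lo
    sq_t = t * t
    lower, upper = coordinate_limits(q[p], t)
    heap = []                  # max-heap on distance, via negated keys
    while True:
        left_ok = left >= 0 and data[order[left]][p] >= lower
        right_ok = right < len(order) and data[order[right]][p] <= upper
        if not (left_ok or right_ok):
            break              # both directions have left the search range
        take_left = left_ok and (
            not right_ok
            or q[p] - data[order[left]][p] <= data[order[right]][p] - q[p])
        if take_left:
            row, left = order[left], left - 1
        else:
            row, right = order[right], right + 1
        sq = partial_sq_dist(data[row], q, dims, sq_t)
        if sq is None:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-sq, row))
        else:
            heapq.heapreplace(heap, (-sq, row))  # drop the current k-th match
        if len(heap) == k:     # tighten the threshold and the search range
            sq_t = -heap[0][0]
            lower, upper = coordinate_limits(q[p], math.sqrt(sq_t))
    return sorted((math.sqrt(-neg), row) for neg, row in heap)
```

The two cursors `left` and `right` walk outwards from the binary search position, always taking the row whose value in dimension `p` is closer to the query, so the sorted order acts as the priority queue described above.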

Other types of queries are essentially specialisations of the above algorithm. $k$NN without a threshold is achieved by setting $t$ to a very large initial value. Nearest neighbour ratio matching [4] is equivalent to 1NN matching with the additional requirement that no second match is found in the radius $d_1/\rho$, where $d_1$ is the distance to the best match and $\rho$ is the ratio threshold. In this case $t$ is updated as in line (1.16), and the final match is invalidated if any second nearest neighbour is found to be within $d_1/\rho$.

Two methods can be applied to further speed up the query process at the cost of exact results. Firstly, the number of descriptors visited can be limited. It is possible to compute the worst-case accuracy of this approach for each individual query, but it is not possible to enforce a lower accuracy bound on all queries. Alternatively, the distance threshold can be reduced according to an accuracy requirement, to reduce the number of descriptors visited. The convention used in [1] for describing approximation tolerance is used: given tolerance $\epsilon$, the distance between a query and a selected nearest neighbour must be within a factor of $(1 + \epsilon)$ of the distance between the query and the true nearest neighbour. This is implemented in the $d$-D Sort query algorithm in line (1.16). As is shown in the experimental evaluation, this method can reduce search time, but the search accuracy degrades more rapidly than with other search structures.
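The threshold reduction can be illustrated on a plain linear scan. The sketch below is an assumption-laden illustration, not the evaluated implementation: `approx_nn` is a hypothetical name, squared Euclidean distance is used, and the tolerance is applied by dividing the best distance found so far by $(1 + \epsilon)$ before using it as the early-termination threshold.

```python
import math


def approx_nn(data, query, eps=0.0):
    """Linear scan returning a (1 + eps)-approximate nearest neighbour.

    A candidate is only accepted if it is closer than best / (1 + eps),
    so every rejected point is at most a factor (1 + eps) closer than
    the returned match. With eps = 0 the result is exact.
    """
    order = sorted(range(len(query)), key=lambda i: query[i], reverse=True)
    best_row, best_sq = None, math.inf
    sq_limit = math.inf
    for row, x in enumerate(data):
        total = 0.0
        for i in order:
            total += (x[i] - query[i]) ** 2
            if total > sq_limit:
                break  # partial distance already exceeds the reduced threshold
        else:
            best_row, best_sq = row, total
            sq_limit = best_sq / (1.0 + eps) ** 2
    if best_row is None:
        return None, None
    return best_row, math.sqrt(best_sq)
```

A larger tolerance makes the threshold tighter, so more candidates are rejected after only a few dimensions, at the cost of possibly returning a slightly more distant match.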

The computation time of the query operation is highly dependent on the data distribution. Sorting the query requires $O(d \log d)$ operations and the initial binary search requires $O(\log n)$ operations. Searching for exact matches requires at best $O(d)$ operations and at worst $O(nd)$ operations (requiring an invalid query or data spherically symmetric around $\mathbf{q}$). In practice, the total number of data points visited is much smaller than $n$, and far fewer than $d$ dimensions are evaluated for most visited points.

5. Experimental Evaluation

The following eight search methods were compared experimentally: (1) linear search, (2) linear search using partial distance computation, (3) linear search using significance ordered partial distance computation, (4) $d$-D Sort structured search, (5) 1-D Sort search, (6) the $k$-D tree from the approximate nearest neighbor (ANN) library [1], (7) the bd-tree from the approximate nearest neighbor (ANN) library [1], (8) the fast library for approximate nearest neighbors (FLANN) [2].

Publicly available C/C++ implementations by the original authors were used for ANN and FLANN. The FLANN method was set to use 1000 points in its automatic parameter selection process. The 1-D Sort method sorts the data on the first dimension only. This method is added to the evaluation to demonstrate the importance of using the order of the most significant query dimension as the search priority queue.

A set of data was generated using a characteristic scale determinant of Hessian feature detector [17, 18] and SIFT descriptor [3, 4] to extract 1 million descriptors from a subset of the MIRFLICKR-1M image collection [6]. These descriptors are normalised to have an $L_2$ norm of 1 and consist of non-negative values.

Figure 2 plots construction times and average query time results for a set of queries produced by extracting features for an image not included in the dataset. This is the most difficult test case, since the queries are rarely very close to any points in the dataset. Query time results are expressed in terms of a relative increase in speed over the time taken to do a simple linear search (). Table 1 lists 1NN performance for queries from an image that is not in the database (same as Figure 2(b)), a rotated version of an image that is in the database, and an exact copy of an image in the database.

Data structure construction times are plotted in Figure 2(a) for those methods that require preprocessing. The $d$-D Sort structure construction time is comparable to that of the $k$-D tree on average, while its construction time increases more slowly with database size than that of the $k$-D tree. FLANN takes the longest to construct due to the simulation process it uses to select parameters. Figure 3 shows the time taken to append 100 descriptors to a $d$-D Sort structure of increasing size. It can be seen that the addition method is more efficient than rebuilding the structure completely when adding a relatively small number of points to the database.

Figure 2(b) plots the relative exact 1NN query performance against the size of the database. A matching threshold of 2.0 was used. It can be seen that using only partial distance computation is sufficient to bring the speed of linear search up to that of the best tree structure method (FLANN). Ordered partial distance adds a further improvement, all without any data preprocessing. The $d$-D Sort structure achieves the best performance by a significant margin, demonstrating the benefits of presorting the data. In contrast, 1-D Sort performs worse than the ordered partial distance method. This shows that sorting on an arbitrary dimension is not beneficial for all queries and that it is necessary to sort on the query’s most significant dimension.

The effect of matching threshold on query performance is examined in Figure 2(c). Smaller thresholds reduce the search space and improve matching speed in general. Of the linear methods, the -D Sort structure is best able to take advantage of smaller thresholds. The -D Tree and -D Tree structures benefit the most from smaller thresholds, with -D Tree surpassing -D Sort at a relatively small threshold of 0.2.

The number of nearest neighbours selected, $k$, impacts the partial distance computation, since the greater distance to the $k$th match leads to a greater search threshold. Figure 2(d) plots the query performance against $k$. A decrease in performance can be seen for $d$-D Sort and the modified linear methods, while the ANN and FLANN methods are unaffected. The linear methods remain the most efficient.

The $d$-D Sort structure supports approximate matching by reducing the search threshold below what is needed to prove an exact match. Figure 2(b) plots the match accuracy against the relative query performance for approximate nearest neighbour matching. Match accuracy is measured as the proportion of returned matches that are the true matches. It can be seen that approximate matching leads to more than an order of magnitude improvement in performance at the cost of a slight reduction in accuracy. The accuracy of the $d$-D Sort structure drops more quickly than that of the ANN $k$-D tree or FLANN. While $d$-D Sort consistently outperforms the other methods in guaranteed exact matching, it is not the best choice for approximate matching.

The effect of the similarity between queries and the data is demonstrated by Table 1. The first column shows results from an image not in the database, which is the same as one of the points in Figure 2(b). The second column lists results of matching a rotated version of an image in the database and the third column is for an exact copy of a database image. The absolute time performance of linear search is the same for all three cases. The ANN methods perform the same for all three cases as well, which is unexpected. FLANN shows a significant increase in performance. This is the expected behaviour, since the closer true match allows for a reduced search space. Both modified linear methods show an improvement in performance, though the improvement is less pronounced than with FLANN. In the exact copy case the modified linear methods practically become equivalent, since any nonzero dimension of the query is sufficient to exclude a potential match after the true nearest neighbour has been found. The $d$-D Sort method shows the best improvement in performance, with performance more than doubling between the outlier case and the rotated image case. In the exact case, the initial binary search always delivers the exact match, so the search terminates almost immediately, resulting in extremely fast performance. Unfortunately the exact match case does not have many applications. This last result shows not only that exact matching is possible using a one-dimensional feature, but also that $d$-D Sort is able to take advantage of close similarity.

6. Conclusion

Data structures such as $k$-D trees and hierarchical $k$-means trees are only marginally more effective than linear search when performing exact nearest neighbour matching in high-dimensional local image descriptor data. Of these, the best-performing method is the FLANN approach [2], which improves performance by a factor of 2.5 at best, but typically yields an improvement of less than 1.5-fold. At the same time, this data structure requires more than a minute to construct for even small datasets.

This paper presents several performance improvements to the linear search method for exact nearest neighbour matching. It is shown that evaluating the distance measure in order of the significance of query dimensions and terminating when the search threshold is reached can improve linear search time by 1.7–4.4-fold (usually at least 2.6-fold). These modifications are simple to implement and do not require any data preprocessing. Secondly, the $d$-D Sort structure is introduced. This structure essentially presorts the data according to every dimension. No parameter tuning is required and data can be added efficiently. Using the sort order associated with the query’s most significant dimension as a priority queue and to limit the search range improves results further. The $d$-D Sort-based search showed an improvement over linear search of 2–7-fold (usually at least 3.2-fold), while the preprocessing time can be several orders of magnitude less than that of FLANN. While it is possible to implement approximate nearest neighbour search using the $d$-D Sort structure, results show that the accuracy of this method decreases more rapidly than that of ANN and FLANN and does not yield the same performance-accuracy ratio.

In summary, this paper proposes two approaches for exact nearest neighbour search in normalised high-dimensional descriptor data. The first is the use of partial distance computation in order of significance of the query dimensions. This does not require any data preprocessing and yields the best results when preprocessing time would be a significant factor, for example, when matching between a small set of images. The second approach is to use the $d$-D Sort structure and the proposed query mechanism, which yields the best query time performance without the need for any parameter tuning. This yields the best overall performance where the preprocessing time is small compared to the total time of the queries that will be performed. The $d$-D Sort structure also supports data addition for problems where the matching database grows progressively.

Acknowledgment

This project was supported by Australian Research Council Grant no. LP0990135.