EL: Local Image Descriptor Based on Extreme Responses to Partial Derivatives of 2D Gaussian Function
We propose a two-part local image descriptor EL (Edges and Lines), based on the strongest image responses to the first- and second-order partial derivatives of the two-dimensional Gaussian function. Using the steering theorems, the proposed method finds the filter orientations giving the strongest image responses. The orientations are quantized, and the magnitudes of the image responses are histogrammed. Iterative adaptive thresholding of histogram values is then applied to normalize the histogram, thereby making the descriptor robust to nonlinear illumination changes. The two-part descriptor is empirically evaluated on the HPatches benchmark for three different tasks, namely, patch verification, image matching, and patch retrieval. The proposed EL descriptor outperforms the traditional descriptors such as SIFT and RootSIFT on all three evaluation tasks and the deep-learning-based descriptors DeepCompare, DeepDesc, and TFeat on the tasks of image matching and patch retrieval.
1. Introduction
Local image descriptors represent an important area of research in computer vision. Reliable local feature matching is required in numerous applications, for example, in emerging mobile visual search (MVS), panorama stitching, image mosaicing, texture classification, partial-duplicate web image retrieval, wide-baseline stereo, and object recognition [7, 8]. Computer vision researchers have proposed many types of descriptors. We can divide them into handcrafted descriptors (SIFT, GLOH, SURF, BRIEF, KAZE, AG, Max-SIFT, and FBRK) and those based on learning (BestDaisy, DeepCompare, DeepDesc, and TFeat).
Various benchmarks and different measures and protocols are available for local image descriptor evaluation [10, 21–25]. Recently, Balntas et al. introduced HPatches, a new public benchmark for the evaluation of local image descriptors. It includes a large body of patches obtained from image sequences of different scenes, captured under different lighting conditions and with large changes in viewpoint. The benchmark offers an open-source implementation of protocols for evaluating local image descriptors on three different tasks: patch verification, image matching, and patch retrieval. In the same paper, the authors also show that a simple normalization of the handcrafted RootSIFT descriptor can boost its performance to the level of deep-learning-based descriptors. The RootSIFT descriptor achieved the best result for the task of image matching and the second best for the task of patch retrieval. These results encouraged us to study the use of first- and second-order partial derivatives of the two-dimensional Gaussian function in defining a local image descriptor.
Local image descriptors based on higher-order image differentials, for example, the local jet, differential invariants, and steerable filters, have been studied before as low-dimensional point descriptors and gave poor results in comparative tests. Using higher-order image differentials in a different way, as histogrammed feature elements, proved to be much more successful. One very simple algorithm, based on the responses of a second-order bank of six Gaussian derivatives, classifies an image location into one of seven Basic Image Features (BIFs): near-flat locations, slope-like points, blob-like points (dark and light), lines (dark and light), and saddle-like points. Jaccard et al. applied BIFs to phase-contrast microscopy image segmentation. BIFs have also been extended to oBIFs by adding local image orientation to slope-, line-, and saddle-like points. Slope-like points are assigned a gradient orientation, and line- and saddle-like points an orientation perpendicular to the eigenvector belonging to the largest eigenvalue of the Hessian. Experiments demonstrate that a larger feature alphabet can lead to better performance and simpler encoding of visual words. Winder and Brown break up the descriptor extraction process into a number of modules and put these together in different combinations. The best descriptors are those with log-polar pooling regions and feature vectors constructed from rectified outputs of steerable quadrature filters. Unfortunately, these descriptors are of large dimensions; therefore, Winder et al. add modules for dimension and dynamic range reduction.
In our approach, we use the first- and second-order partial derivatives of the two-dimensional Gaussian function, i.e., the edge and line detection filters. The proposed method forms a descriptor by histogramming and pooling magnitudes of the strongest responses to the steerable filters and uses an iterative adaptive thresholding to normalize the histogram values. The proposed descriptor achieves high mAP scores on the tasks of image matching and patch retrieval, and as such, it represents an attractive alternative to popular local descriptors.
The proposed approach is described in detail in the following sections. In the next section, we first describe how the optimal filter orientations are found and the corresponding magnitudes are computed. In the following section, we describe how the descriptor is formed. This is followed by the presentation of the experimental results and the final concluding section.
The proposed descriptor is publicly available at https://github.com/REVAMJ/ELdescriptor.
2. Extreme Responses to Edge and Line Filters
The proposed EL image descriptor is based on the first- and second-order partial derivatives of the two-dimensional Gaussian function. The two partials have the useful property that they are orthogonal to each other.
2.1. Theory of Steerable Filters
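Our algorithm is rooted in the theory of steerable filters described by Freeman and Adelson. Let $G(x, y)$ be the two-dimensional Gaussian function and $G_1^{0^\circ} = \partial G / \partial x$ be its first partial derivative in the x direction (Figure 1). The same function rotated by $90^\circ$ counterclockwise is
$$G_1^{90^\circ} = \frac{\partial G}{\partial y}.$$
Let $(\cdot)^{\theta}$ represent the rotation operator such that, for any function $f(x, y)$, $f^{\theta}(x, y)$ is $f(x, y)$ rotated through an angle θ about the origin. According to the theory of steerable filters, $G_1^{\theta}$ can be synthesized by taking a linear combination of the two basis filters $G_1^{0^\circ}$ and $G_1^{90^\circ}$:
$$G_1^{\theta} = \cos(\theta)\, G_1^{0^\circ} + \sin(\theta)\, G_1^{90^\circ}.$$
Let $R_1^{0^\circ} = G_1^{0^\circ} * I$ and $R_1^{90^\circ} = G_1^{90^\circ} * I$, where $*$ is the convolution operation and I is an intensity image. An image response to $G_1^{\theta}$ can simply be computed as
$$R_1^{\theta} = \cos(\theta)\, R_1^{0^\circ} + \sin(\theta)\, R_1^{90^\circ}.$$
The second partial derivative of $G$ in the x direction is equal to
$$G_2^{0^\circ} = \frac{\partial^2 G}{\partial x^2}. \qquad (6)$$
Computation of filter (6) in an arbitrary orientation requires two additional basis filters. We have chosen $G_2^{60^\circ}$ and $G_2^{120^\circ}$ (Figure 1). Let $R_2^{0^\circ} = G_2^{0^\circ} * I$, $R_2^{60^\circ} = G_2^{60^\circ} * I$, and $R_2^{120^\circ} = G_2^{120^\circ} * I$; then, according to the same theory, an image response to $G_2^{\theta}$ is computed as
$$R_2^{\theta} = k_1(\theta)\, R_2^{0^\circ} + k_2(\theta)\, R_2^{60^\circ} + k_3(\theta)\, R_2^{120^\circ}, \qquad (7)$$
with interpolation functions $k_j(\theta)$, $j = 1, 2, 3$, equal to
$$k_j(\theta) = \frac{1}{3}\left[1 + 2\cos\bigl(2(\theta - \theta_j)\bigr)\right]$$
and with $\theta_1$, $\theta_2$, and $\theta_3$ being equal to $0^\circ$, $60^\circ$, and $120^\circ$, respectively.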
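The steering relations can be checked numerically. The following is a minimal sketch, not the authors' Matlab implementation: it assumes a Gaussian with a hypothetical scale `sigma`, a hypothetical filter radius, the standard interpolation functions $k_j(\theta) = \frac{1}{3}[1 + 2\cos(2(\theta - \theta_j))]$ with basis angles 0°, 60°, and 120°, and angles measured in array coordinates. All function names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_filters(radius=6, sigma=1.5):
    """Return g1(theta_deg) and g2(theta_deg), which sample the oriented filters.

    radius and sigma are hypothetical values chosen for this sketch.
    """
    ax = np.arange(-radius, radius + 1, dtype=float)
    x, y = np.meshgrid(ax, ax)  # x varies along columns, y along rows
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))

    def along(theta_deg):
        t = np.deg2rad(theta_deg)
        return x * np.cos(t) + y * np.sin(t)  # coordinate along direction theta

    def g1(theta_deg):
        return -along(theta_deg) / sigma**2 * g  # first directional derivative

    def g2(theta_deg):
        u = along(theta_deg)
        return (u**2 / sigma**4 - 1.0 / sigma**2) * g  # second directional derivative

    return g1, g2


def convolve_valid(image, kernel):
    """Plain 2D convolution restricted to positions where the kernel fits."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel[::-1, ::-1])


def k(theta_deg, theta_j_deg):
    """Interpolation function for the second-order filter."""
    return (1.0 + 2.0 * np.cos(2.0 * np.deg2rad(theta_deg - theta_j_deg))) / 3.0


def steer_first(r0, r90, theta_deg):
    t = np.deg2rad(theta_deg)
    return np.cos(t) * r0 + np.sin(t) * r90


def steer_second(r0, r60, r120, theta_deg):
    return k(theta_deg, 0.0) * r0 + k(theta_deg, 60.0) * r60 + k(theta_deg, 120.0) * r120


def steering_errors(theta_deg=37.0, seed=0):
    """Compare steered responses with responses to directly oriented filters."""
    image = np.random.default_rng(seed).random((32, 32))
    g1, g2 = make_filters()
    r1 = [convolve_valid(image, g1(a)) for a in (0.0, 90.0)]
    r2 = [convolve_valid(image, g2(a)) for a in (0.0, 60.0, 120.0)]
    e1 = np.abs(steer_first(*r1, theta_deg) - convolve_valid(image, g1(theta_deg))).max()
    e2 = np.abs(steer_second(*r2, theta_deg) - convolve_valid(image, g2(theta_deg))).max()
    return e1, e2
```

Calling `steering_errors()` returns the largest absolute differences between the steered and the directly computed responses; both should be at the level of floating-point round-off, because convolution is linear and the steering identities hold exactly for these filters.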
2.2. Filter Best Orientation
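At a given image location, the best filter orientation is the one that gives the strongest image response. The proposed descriptor is composed of such responses. For the first-order partial, we solve $dR_1^{\theta}/d\theta = 0$, where $R_1^{0^\circ}$ and $R_1^{90^\circ}$ are the image responses to the two first-order basis filters, and obtain two filter orientations, one minimum and the other maximum:
$$\theta_1 = \operatorname{atan2}\left(R_1^{90^\circ}, R_1^{0^\circ}\right) \ \text{(maximum)}, \qquad \theta_1 + 180^\circ \ \text{(minimum)}.$$
The response at the maximum is denoted by $R_1 = R_1^{\theta_1}$.
In the case of the second-order partial (7), solving $dR_2^{\theta}/d\theta = 0$ gives two extrema within the basic period of the function. In terms of the image responses $R_2^{0^\circ}$, $R_2^{60^\circ}$, and $R_2^{120^\circ}$ to the three second-order basis filters, the minimum is
$$\theta_{\min} = \frac{1}{2}\operatorname{atan2}\left(-\sqrt{3}\,\bigl(R_2^{60^\circ} - R_2^{120^\circ}\bigr),\ -\bigl(2R_2^{0^\circ} - R_2^{60^\circ} - R_2^{120^\circ}\bigr)\right) \qquad (9)$$
and the maximum is
$$\theta_{\max} = \frac{1}{2}\operatorname{atan2}\left(\sqrt{3}\,\bigl(R_2^{60^\circ} - R_2^{120^\circ}\bigr),\ 2R_2^{0^\circ} - R_2^{60^\circ} - R_2^{120^\circ}\right). \qquad (10)$$
In the Appendix, we explain how they are computed. The minimal and maximal image responses are different in magnitude (see Figure 2(c)). Both can be positive or negative, or they can be of opposite signs. If our goal is to record both values, $R_2^{\theta_{\min}}$ and $R_2^{\theta_{\max}}$, then we have to distinguish between positive $R_2^{\theta_{\min}}$, negative $R_2^{\theta_{\min}}$, positive $R_2^{\theta_{\max}}$, and negative $R_2^{\theta_{\max}}$. Each of the four options requires its own representation (for example, in the case of orientation binning, a histogram), which results in a long descriptor. However, we can choose to discard a piece of information. At the minimum $\theta_{\min}$, we compute, rather, the image response to the negative basis filters $-G_2^{0^\circ}$, $-G_2^{60^\circ}$, and $-G_2^{120^\circ}$, which gives $-R_2^{\theta_{\min}}$. We can interpret $G_2^{\theta}$ as a dark-line detection filter while $-G_2^{\theta}$ is a light-line detection filter. Figure 2(c) shows image responses obtained by the positive and negative basis filters. The procedure is as follows. From equations (9) and (10), we compute $\theta_{\min}$ and $\theta_{\max}$ and then from equation (7) the image responses $R_2^{\theta_{\min}}$ and $R_2^{\theta_{\max}}$. The two image responses $R_2^{\theta_{\max}}$ and $-R_2^{\theta_{\min}}$ are compared (see Figure 2(d)) and only the larger is considered further, together with the corresponding filter orientation:
$$(R_2, \theta_2) = \begin{cases} \bigl(R_2^{\theta_{\max}}, \theta_{\max}\bigr), & R_2^{\theta_{\max}} \ge -R_2^{\theta_{\min}}, \\ \bigl(-R_2^{\theta_{\min}}, \theta_{\min}\bigr), & \text{otherwise.} \end{cases}$$
Notice that $R_2$ is never negative.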
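The selection of the strongest responses can be sketched in a few lines. This is an illustrative sketch under the assumption that equation (7) uses the basis angles 0°, 60°, and 120° with the standard interpolation functions, in which case the steered response can be written as $m + p\cos 2\theta + q\sin 2\theta$; the names `m`, `p`, `q`, and all function names are introduced here and do not come from the original implementation.

```python
import numpy as np


def best_first_order(r0, r90):
    """Strongest first-order response and its orientation in [0, 360) degrees."""
    theta1 = np.rad2deg(np.arctan2(r90, r0)) % 360.0
    return np.hypot(r0, r90), theta1


def second_order_extrema(r0, r60, r120):
    """Extrema of m + p*cos(2*theta) + q*sin(2*theta) over one 180-degree period."""
    m = (r0 + r60 + r120) / 3.0
    p = (2.0 * r0 - r60 - r120) / 3.0
    q = (r60 - r120) / np.sqrt(3.0)
    amp = np.hypot(p, q)
    theta_max = (0.5 * np.rad2deg(np.arctan2(q, p))) % 180.0
    theta_min = (theta_max + 90.0) % 180.0
    return (m - amp, theta_min), (m + amp, theta_max)


def best_second_order(r0, r60, r120):
    """Keep the larger of R_max and -R_min together with its orientation."""
    (r_min, theta_min), (r_max, theta_max) = second_order_extrema(r0, r60, r120)
    from_max = r_max >= -r_min  # True: response to the positive filter is kept
    r2 = np.where(from_max, r_max, -r_min)
    theta2 = np.where(from_max, theta_max, theta_min)
    return r2, theta2, from_max
```

The functions accept scalars or whole response maps of equal shape. The flag `from_max` records which of the two filters (positive or negative) produced the retained response, which is the information needed later to choose between the two second-order histograms.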
3. Descriptor Representation
The proposed descriptor is composed of two parts. The first part uses the strongest first-order response $R_1$ and its orientation $\theta_1$, and the second part uses the strongest second-order response $R_2$ and its orientation $\theta_2$. For simplicity, we name the new descriptor EL (Edges and Lines), because the filters used actually detect edges and lines. Descriptor construction requires the following three steps: orientation binning, orientation pooling, and histogram normalization.
3.1. Orientation Binning
The orientation binning for the first-order response $R_1$ is performed in the same way as in SIFT. We quantize the orientation $\theta_1$ into eight histogram bins corresponding to angles $0^\circ$, $45^\circ$, $90^\circ$, $135^\circ$, $180^\circ$, $225^\circ$, $270^\circ$, and $315^\circ$. At each sample location, we construct a histogram vector of length eight. A filter response $R_1$ contributes to the two orientation bins adjacent to $\theta_1$, whereby the value is distributed linearly between the two bins. Thus, if $\theta_1$ lies between the two bins with angles $\alpha_i$ and $\alpha_{i+1}$, then bin $i$ receives a contribution $R_1\,(1 - |\theta_1 - \alpha_i| / 45^\circ)$ and bin $i+1$ receives the remainder. Here, $|\cdot|$ means the angular distance between the two angles (with wrap-around).
The orientation binning of the second-order response $R_2$ is slightly different from the orientation binning of $R_1$. Due to the symmetry of the filter $G_2^{\theta}$, we quantize the orientation $\theta_2$ only into four orientation bins corresponding to angles $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$. At each sample location, we construct two histogram vectors of length four. One vector is intended for $R_2$ equal to $R_2^{\theta_{\max}}$ and one for $R_2$ equal to $-R_2^{\theta_{\min}}$. Values of the vector components are determined by distributing $R_2$ to the two bins adjacent to $\theta_2$ in the same way as described above for $R_1$.
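The linear distribution of a response between the two adjacent orientation bins can be sketched as follows. This is a minimal illustration assuming eight bins over 360° for the first-order part and four bins over 180° for the second-order part; the function names and the `from_max` flag are hypothetical.

```python
import numpy as np


def soft_bin(value, theta_deg, n_bins, period_deg):
    """Split `value` linearly between the two bins adjacent to `theta_deg`."""
    width = period_deg / n_bins
    pos = (theta_deg % period_deg) / width  # position in units of bins
    base = np.floor(pos)
    lower = int(base) % n_bins
    upper = (lower + 1) % n_bins  # wrap-around at the end of the period
    frac = pos - base
    hist = np.zeros(n_bins)
    hist[lower] += value * (1.0 - frac)
    hist[upper] += value * frac
    return hist


def first_order_histogram(r1, theta1_deg):
    """Eight bins at 0, 45, ..., 315 degrees."""
    return soft_bin(r1, theta1_deg, n_bins=8, period_deg=360.0)


def second_order_histograms(r2, theta2_deg, from_max):
    """Two four-bin histograms (0, 45, 90, 135 degrees), concatenated.

    The first holds responses kept from the maximum, the second those kept
    from the (negated) minimum.
    """
    filled = soft_bin(r2, theta2_deg, n_bins=4, period_deg=180.0)
    empty = np.zeros(4)
    return np.concatenate([filled, empty] if from_max else [empty, filled])
```

For example, a response of 2.0 at 60° lies one third of the way from the 45° bin to the 90° bin, so `first_order_histogram(2.0, 60.0)` assigns 4/3 to the 45° bin and 2/3 to the 90° bin.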
3.2. Orientation Pooling
The vectors from the previous stage are summed together spatially, weighted with Gaussian weights according to their distance from the pooling centers. As recommended by Winder and Brown, we use 17 pooling centers and three different Gaussian weighting functions, as illustrated in Figure 3.
After pooling, each pooling center is represented by one vector of length eight, representing image responses to the first-order partial derivative, and two vectors of length four, representing image responses to the second-order partial derivative. Vectors from all pooling centers are then concatenated into a common vector, i.e., a local image descriptor with $17 \times (8 + 4 + 4) = 272$ components.
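The pooling step can be sketched as a Gaussian-weighted sum of per-pixel histogram vectors around each pooling center. The layout below (one central point plus two rings of eight points, with one Gaussian width per group) is only an assumption consistent with 17 centers and three weighting functions; the ring radii, the widths, and the 65 × 65 patch size are hypothetical placeholders, not values taken from Figure 3.

```python
import numpy as np


def pooling_layout(patch_size=65, ring_radii=(12.0, 24.0), sigmas=(5.0, 7.0, 10.0)):
    """Return 17 pooling centers (x, y) and the Gaussian width of each one."""
    c = (patch_size - 1) / 2.0
    centers = [(c, c)]
    widths = [sigmas[0]]
    for radius, sigma in zip(ring_radii, sigmas[1:]):
        for i in range(8):
            a = 2.0 * np.pi * i / 8.0
            centers.append((c + radius * np.cos(a), c + radius * np.sin(a)))
            widths.append(sigma)
    return np.array(centers), np.array(widths)


def pool(per_pixel_hist, centers, widths):
    """Sum per-pixel vectors of shape (H, W, n) around every pooling center."""
    h, w, _ = per_pixel_hist.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pooled = []
    for (cx, cy), s in zip(centers, widths):
        weight = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * s**2))
        # Weighted sum over both spatial axes leaves a vector of length n.
        pooled.append(np.tensordot(weight, per_pixel_hist, axes=([0, 1], [0, 1])))
    return np.concatenate(pooled)
```

With 16 values per pixel (eight first-order bins and two sets of four second-order bins), `pool` returns a vector of 17 × 16 = 272 components, which matches the two 136-value parts E and L.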
3.3. Descriptor Normalization
The descriptor is normalized to reduce the effects of linear and nonlinear illumination changes. The normalization used by EL starts with an adaptive iterative thresholding of descriptor components. It repeats the following three steps ten times:
(1) The average value of the descriptor components is calculated.
(2) The threshold T is calculated.
(3) Each descriptor component is thresholded to be no larger than T.
The constant in step 2 is determined experimentally. Then, we follow the approach used by RootSIFT, which uses a square root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SIFT descriptors:
(4) The descriptor is normalized to have unit $L_1$ norm.
(5) Each descriptor component is replaced by its square root.
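The five normalization steps can be sketched as follows. The exact threshold formula and its constant are not reproduced here, so this sketch assumes, purely for illustration, a threshold proportional to the current average, $T = k \cdot \text{mean}$, with a hypothetical constant `k = 2.0`; it also assumes that all descriptor components are nonnegative, as is the case for pooled response magnitudes.

```python
import numpy as np


def normalize_el(desc, k=2.0, n_iter=10):
    """Iterative adaptive thresholding followed by L1 normalization and square root.

    k is a hypothetical constant; T = k * mean is an assumed threshold rule.
    """
    d = np.asarray(desc, dtype=float).copy()
    for _ in range(n_iter):
        mean = d.mean()            # step 1: average of the components
        threshold = k * mean       # step 2: assumed form of the threshold
        d = np.minimum(d, threshold)  # step 3: clip components at T
    total = d.sum()
    if total > 0:
        d = d / total              # step 4: unit L1 norm
    return np.sqrt(d)              # step 5: square root of each component
```

Because the clipped vector has unit $L_1$ norm before the square root, the result has unit Euclidean norm, so Euclidean distances between normalized descriptors correspond to the Hellinger kernel.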
The graph in Figure 4 demonstrates the effect of the proposed iterative adaptive thresholding. The truncated descriptor components of the corresponding patches are significantly more similar than those in the case of normalization used by RootSIFT, indicated by the circles on the ordinate, or by the single-step adaptive thresholding, indicated by the results in iteration no. 1.
4. Results and Discussion
The proposed approach is evaluated on the HPatches dataset (https://github.com/hpatches/hpatches-dataset) described by Balntas et al., which provides more than 2.5 million preextracted patches of size 65 × 65 pixels from 116 image sequences captured from different scenes. Changes in the images are due to changing scene lighting conditions and varying camera viewpoints (Figure 5). For each image sequence, patches are detected in the reference image and projected on the target images using the ground truth homographies. Detections are perturbed by increasing amounts of geometric noise, resulting in three patch sets of increasing difficulty: easy, hard, and tough.
The authors also define the evaluation protocol and present its open-source implementation for fair comparison of local image descriptors on three different tasks: patch verification, image matching, and patch retrieval. In this work, we strictly follow the proposed protocol and use the provided implementation for evaluation of our approach and comparison with the related work.
4.1. Evaluation of the Proposed Descriptor
First, we evaluate the proposed descriptor denoted by EL and its individual parts denoted by E and L. Figure 6 shows the results in terms of the mean Average Precision (mAP(%)) for three different tasks (patch verification, image matching, and patch retrieval) on three different sets (easy, hard, and tough). Results are also shown for the postprocessed variants of descriptors +EL, +E, and +L obtained by applying ZCA whitening with clipped eigenvalues, followed by power law normalization and $L_2$ normalization.
We can observe that the combined EL descriptor produces higher scores than both single-component descriptors E and L. Descriptor L gives lower mAP scores than E. We see at least two reasons for this. In the case of a blob, the filter $G_2^{\theta}$, used by L, gives an equal response for all filter orientations θ. This means that the best filter orientation is not well defined. In the case of a saddle, the minimal and maximal responses of $G_2^{\theta}$ have the same magnitude. Due to different image deformations, the algorithm might choose the minimum in one situation and the maximum in another. These situations increase errors in descriptor components. The differences in scores between the two-part descriptor +EL and the one-part descriptor +E for the tasks of patch verification, image matching, and patch retrieval are 2.15, 3.59, and 4.52 percentage points (pp), respectively.
We also wanted to verify the contribution of the iterative adaptive thresholding in the normalization step. We compared the proposed approach with the fixed thresholding used by the SIFT and RootSIFT descriptors. We therefore replaced our normalization approach, described by steps 1–5 in Section 3.3, with the normalization used by SIFT and RootSIFT (using the fixed threshold 0.2). We denote the obtained variants of the descriptor as EL-s and EL-r, respectively. For the task of image matching, applying the Hellinger kernel, as in EL-r, improves mAP by 8.02 pp, and the proposed iterative adaptive thresholding, used by EL, adds a further 1.51 pp. For the postprocessed versions, the improvements are 5.51 and 1.88 pp. The iterative adaptive thresholding also improves scores for the task of patch retrieval. We can conclude that the iterative adaptive thresholding used by EL is beneficial for the tasks of image matching and patch retrieval.
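For comparison, the two fixed-threshold baselines can be sketched as follows. This is one plausible reading of the EL-s and EL-r variants, not a description of the evaluated code: the SIFT-style variant normalizes to unit Euclidean norm, clips at 0.2, and renormalizes, and the RootSIFT-style variant then applies $L_1$ normalization and a square root. Nonnegative descriptor components are assumed, and the function names are illustrative.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for empty descriptors


def sift_normalize(desc, clip=0.2):
    """Unit L2 norm, clip at a fixed threshold, renormalize (SIFT style)."""
    d = np.asarray(desc, dtype=float)
    d = d / max(np.linalg.norm(d), EPS)
    d = np.minimum(d, clip)
    return d / max(np.linalg.norm(d), EPS)


def rootsift_normalize(desc, clip=0.2):
    """SIFT-style normalization followed by L1 normalization and square root."""
    d = sift_normalize(desc, clip)
    d = d / max(d.sum(), EPS)
    return np.sqrt(d)


def hellinger_kernel(x, y):
    """Hellinger kernel between two L1-normalized nonnegative vectors."""
    return float(np.sum(np.sqrt(x * y)))
```

For two square-rooted vectors $\sqrt{x}$ and $\sqrt{y}$ obtained from $L_1$-normalized $x$ and $y$, the squared Euclidean distance equals $2 - 2H(x, y)$, where $H$ is the Hellinger kernel; this is why comparing square-rooted descriptors with the Euclidean distance is equivalent to using the Hellinger kernel on the original ones.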
4.2. Comparison to Related Work
We compare the performance of the proposed descriptor with the previously published results on HPatches. Table 1 shows scores obtained with two established handcrafted descriptors, SIFT and RootSIFT, and the deep-learning-based descriptors DC-S and DC-S2S, DDESC, and TF-M and TF-R.
For all three tasks, EL and its postprocessed variant improve the performance compared to SIFT and RootSIFT. For the tasks of image matching and patch retrieval, EL also achieves better scores than the deep-learning-based descriptors. The EL descriptor is particularly suitable for the task of image matching. Notice that the score of 36.92% achieved by EL is higher even than the scores obtained by the postprocessed variants of all other descriptors. Its postprocessed variant +EL achieves an even better score, outperforming all other descriptors by more than 6 pp. Similar to SIFT and RootSIFT, the EL descriptor is less appropriate for the task of patch verification.
We can expect that an approach based on pooling across different scales, in addition to spatial locations, such as the approaches proposed in [34, 35], would improve the obtained scores even further. However, here we limit our evaluation to one scale only.
Table 2 shows the dimensionality, size of the measurement region in pixels, and extraction time of each descriptor. Note that EL, E, and L are implemented in Matlab; therefore, their time efficiency should be interpreted with caution; a more efficient implementation and code optimization could further speed up the descriptor extraction process.
5. Conclusions
We propose a two-part descriptor, named EL (Edges and Lines), based on the maximal image responses to the first- and second-order partial derivatives of the two-dimensional Gaussian function. The maximal image responses are calculated by using the steering theorems. In this way, we complement the understanding of oBIFs. The two parts of the proposed descriptor, E and L, are of equal size; each one contains 136 values.
To increase the descriptor robustness to nonlinear illumination changes and to increase the impact of less contrasting regions, the fixed thresholding of descriptor components, as used by SIFT and RootSIFT, is replaced by an iterative adaptive thresholding.
The proposed descriptor was evaluated on the HPatches benchmark for three different tasks, namely, patch verification, image matching, and patch retrieval. The postprocessed variant of the EL descriptor, obtained by applying ZCA whitening with clipped eigenvalues followed by power law normalization and $L_2$ normalization, outperforms the postprocessed variant of the RootSIFT descriptor for all three tested tasks and the tested deep-learning-based descriptors for the tasks of image matching and patch retrieval.
One of our goals was also to explore the contribution of the second-order partial derivatives of the two-dimensional Gaussian function. Experimental results show improvements in scores for all three tasks. The largest improvement was obtained for the task of patch retrieval, where mAP is improved by 4.52 percentage points. This is an important improvement, and we therefore recommend including both parts in a local image descriptor.
Overall, the results are very favorable especially for the tasks of image matching and patch retrieval, which are commonly required in many computer vision applications. What is also worth noting is that the proposed approach is very clear and is based on solid mathematical foundations of the theory of steerable filters. The EL descriptor is publicly available at https://github.com/REVAMJ/ELdescriptor.
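Appendix
Equation (7) uses three interpolation functions $k_j(\theta) = \frac{1}{3}\left[1 + 2\cos\bigl(2(\theta - \theta_j)\bigr)\right]$, $j = 1, 2, 3$, with $\theta_1$, $\theta_2$, and $\theta_3$ being equal to $0^\circ$, $60^\circ$, and $120^\circ$, respectively. The three interpolation functions can be expressed in the following forms:
$$k_1(\theta) = \frac{1}{3}\left(1 + 2\cos 2\theta\right),$$
$$k_2(\theta) = \frac{1}{3}\left(1 - \cos 2\theta + \sqrt{3}\sin 2\theta\right),$$
$$k_3(\theta) = \frac{1}{3}\left(1 - \cos 2\theta - \sqrt{3}\sin 2\theta\right).$$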
A filter orientation, which gives the strongest image response, is found by computing the first derivative of (A.4):with A and B being equal to
To find the extrema, we solve $dR_2^{\theta}/d\theta = 0$, which gives us two solutions within the basic period of the function:
To determine whether an extremum is a minimum or a maximum, we examine the second derivative. For the minimum, the valid condition is $d^2R_2^{\theta}/d\theta^2 > 0$, while for the maximum, $d^2R_2^{\theta}/d\theta^2 < 0$. Let us compute $d^2R_2^{\theta}/d\theta^2$:
For the minimum, we obtain the condition or equivalentlyand for the maximum
Before we continue, we note that , with given by equation (A.8), is positive when A is positive and is negative when A is negative. On the contrary, , with given by equation (A.10), is positive when A is negative and is negative when A is positive. Solution (A.8) satisfies equations (A.12), and (A.13); therefore, represents a minimum while the solution (A.10) satisfies equations (A.14) and (A.15); therefore, is a maximum.
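The same result can be cross-checked with a compact worked derivation. It assumes the interpolation functions $k_j(\theta) = \frac{1}{3}[1 + 2\cos(2(\theta - \theta_j))]$ with $\theta_j = 0^\circ, 60^\circ, 120^\circ$; the symbols $m$, $p$, $q$, and $\varphi$ are introduced only for this sketch and are not the $A$ and $B$ used above.

```latex
\begin{align*}
R_2^{\theta} &= \sum_{j=1}^{3} k_j(\theta)\, R_2^{\theta_j}
              = m + p\cos 2\theta + q\sin 2\theta, \\
m &= \tfrac{1}{3}\bigl(R_2^{0^\circ} + R_2^{60^\circ} + R_2^{120^\circ}\bigr), \quad
p = \tfrac{1}{3}\bigl(2R_2^{0^\circ} - R_2^{60^\circ} - R_2^{120^\circ}\bigr), \quad
q = \tfrac{1}{\sqrt{3}}\bigl(R_2^{60^\circ} - R_2^{120^\circ}\bigr), \\
\frac{dR_2^{\theta}}{d\theta} &= 2\,(q\cos 2\theta - p\sin 2\theta) = 0
  \;\Longrightarrow\; \tan 2\theta = \frac{q}{p}, \\
\frac{d^2R_2^{\theta}}{d\theta^2} &= -4\,(p\cos 2\theta + q\sin 2\theta). \\
\intertext{Writing $p = r\cos\varphi$ and $q = r\sin\varphi$ with $r = \sqrt{p^2 + q^2}$ and
$\varphi = \operatorname{atan2}(q, p)$:}
\theta_{\max} &= \tfrac{1}{2}\varphi: \quad
  \frac{d^2R_2^{\theta}}{d\theta^2} = -4r \le 0, \quad R_2^{\theta_{\max}} = m + r, \\
\theta_{\min} &= \tfrac{1}{2}\varphi + 90^\circ: \quad
  \frac{d^2R_2^{\theta}}{d\theta^2} = 4r \ge 0, \quad R_2^{\theta_{\min}} = m - r.
\end{align*}
```

In this form, the two extrema are always $90^\circ$ apart, and the larger of $R_2^{\theta_{\max}}$ and $-R_2^{\theta_{\min}}$ equals $|m| + r$, which is never negative.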
Data Availability
The HPatches dataset used to support the findings of this study is publicly available at https://github.com/hpatches/hpatches-dataset and described in the paper: HPatches: A benchmark and evaluation of handcrafted and learned local descriptors (DOI:10.1109/CVPR.2017.410). The code we wrote is also publicly available at https://github.com/REVAMJ/ELdescriptor.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the Slovenian Research Agency (grant number P2-0214).
References
M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” International Journal of Computer Vision, vol. 74, no. 1, pp. 59–73, 2007.
X. Yu, Y. Zhang, and H. Wang, “A novel local human visual perceptual texture description with key feature selection for texture classification,” Mathematical Problems in Engineering, vol. 2019, Article ID 3756048, p. 20, 2019.
W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, “Spatial coding for large scale partial duplicate web image search,” in Proceedings of the 18th ACM International Conference on Multimedia, pp. 511–520, New York, NY, USA, October 2010.
Y. Dou, K. Hao, Y. Ding, and M. Mao, “A mean-shift-based feature descriptor for wide baseline stereo matching,” Mathematical Problems in Engineering, vol. 2015, Article ID 398756, p. 14, 2015.
P. Alcantarilla, A. F. Bartoli, and A. J. Davison, “KAZE features,” in Proceedings of the 12th ECCV, pp. 214–227, Florence, Italy, October 2012.
L. Xie, Q. Tian, and B. Zhang, “Max-sift: flipping invariant descriptors for web logo search,” in Proceedings of the 18th ACM International Conference on Multimedia, pp. 5716–5720, Mountain View, CA, USA, June 2014.
L. Yang and Z. Lu, “A new scheme for keypoint detection and description,” Mathematical Problems in Engineering, vol. 2015, Article ID 310704, p. 10, 2015.
S. Winder and M. Brown, “Learning local image descriptors,” in Proceedings of the IEEE Conference CVPR, pp. 1–8, Rio de Janeiro, Brazil, July 2007.
S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proceedings of the IEEE Conference CVPR, pp. 4353–4361, Boston, MA, USA, June 2015.
E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, “Discriminative learning of deep convolutional feature point descriptors,” in Proceedings of the IEEE ICCV, pp. 118–126, Santiago, Chile, December 2015.
V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, “Learning local feature descriptors with triplets and shallow convolutional neural networks,” in Proceedings of the British Machine Vision Conference, vol. 11, pp. 1–119, York, UK, September 2016.
H. Aanæs, A. L. Dahl, and K. S. Pedersen, “Interesting interest points,” International Journal of Computer Vision, vol. 97, pp. 18–35, 2012.
J. Heinly, E. Dunn, and J.-M. Frahm, “Comparative evaluation of binary features,” in Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds., vol. 7573 of Lecture Notes in Computer Science, Springer, Florence, Italy, 2012.
J. L. Schoenberger, H. Hardmeier, T. Sattler, and M. Pollefeys, “Comparative evaluation of hand-crafted and learned local features,” in Proceedings of the Conference Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 2017.
V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk, “HPatches: A benchmark and evaluation of handcrafted and learned local descriptors,” in Proceedings of the IEEE Conference CVPR, Honolulu, Hawaii, July 2017.
R. Arandjelović and A. Zisserman, “Three things everyone should know to improve object retrieval,” in Proceedings of the IEEE Conference CVPR, pp. 2911–2918, Providence, RI, USA, June 2012.
N. Jaccard, N. Szita, and L. D. Griffin, “Segmentation of phase contrast microscopy images based on multi-scale local basic image features histograms,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 5, no. 5, pp. 359–367, 2017.
M. Lillholm and L. Griffin, “Novel image feature alphabets for object recognition,” in Proceedings of the 19th IEEE Conference ICPR, pp. 1–4, Tampa, FL, USA, December 2008.
S. Winder, G. Hua, and M. Brown, “Picking the best daisy,” in Proceedings of the IEEE Conference CVPR, pp. 178–185, Miami, FL, USA, August 2009.
J. Dong and S. Soatto, “Domain-size pooling in local descriptors: Dsp-sift,” in Proceedings of the IEEE Conference CVPR, pp. 5097–5106, Boston, MA, USA, June 2015.