Abstract

An illumination-invariant method for computing local feature points and descriptors, referred to as the LUminance Invariant Feature Transform (LUIFT), is proposed. The method helps us to extract the most significant local features in images degraded by nonuniform illumination, geometric distortions, and heavy scene noise. The proposed method utilizes image phase information rather than the intensity variations on which most state-of-the-art descriptors rely. Thus, the proposed method is robust to nonuniform illumination and noise degradations. In this work, we first use the monogenic scale-space framework to compute the local phase, orientation, energy, and phase congruency of the image at different scales. Then, a modified Harris corner detector is applied to compute the feature points of the image using the monogenic signal components. The final descriptor is created from histograms of oriented gradients of phase congruency. Computer simulation results show that the proposed method yields superior feature detection and matching performance under illumination changes, noise degradation, and slight geometric distortions compared with that of the state-of-the-art descriptors.

1. Introduction

Feature detection and description are low-level tasks used in many computer vision and pattern recognition applications such as image classification and retrieval [1, 2], optical flow estimation [3], tracking [4], biometric systems [5], image registration [6], and 3D reconstruction [7].

The local feature detection task consists of finding “feature points” (points, lines, blobs, etc.) in the image. The points should satisfy certain properties such as distinctiveness, quantity, locality, accuracy, and, most importantly, repeatability [8]. To represent each feature point in a distinctive way, a neighborhood around each feature is considered and encoded into a vector, known as a “feature descriptor.” The feature descriptors of different images are “matched” using either Euclidean or Mahalanobis distances.

It is desirable that feature descriptors be invariant to viewpoint changes, blur, and affine transformations [9–13]; they also need to be robust to noise and nonuniform illumination degradations. However, these last two conditions have not been completely solved, even though they are common issues in real-world applications. Thus, nonuniform illumination variations and noise degradations are still challenges that decrease the performance of existing state-of-the-art methods.

Since Attneave's research [14] on the importance of image shape information, several techniques for feature detection have been developed [8, 15–17]. Many of the existing works are robust to affine transformations (scale and rotation), but they are not designed to work with complex illumination changes. Recently, to address the nonuniform illumination problem, different methods based on the order of the intensity values have been proposed [18–21]. However, these methods are only robust to monotonic intensity variations and are sensitive to heavy noise degradations.

On the other hand, the human visual system is able to recognize objects under different illumination conditions. The human eye perceives the amount of light energy passing through, reflected from, or emitted by an object surface, known as luminance. The light energy is converted into nerve impulses by the photoreceptor cells in the retina, where the information is encoded and sent to the primary visual cortex (V1) [22]. Psychophysical evidence suggests that the human visual system decomposes the visual information into border and line components by using phase information. Besides, it is known that different groups of cells in V1 extract particular image features such as frequency, orientation, and phase information [23].

In this work, inspired by the human visual system, a phase-based method for computing local feature points and descriptors, referred to as the LUminance Invariant Feature Transform (LUIFT), is proposed to overcome the luminance variation problem. The LUIFT method helps us to extract the most significant local features in images degraded by nonuniform illumination, geometric distortions, and heavy scene noise. The proposed technique is suitable for the recognition of rigid objects under real conditions. The LUIFT algorithm was extensively tested on common databases. The proposed method yields matching performance under slight scaling and in-plane rotation competitive with that of the state-of-the-art algorithms. The LUIFT method shows improved performance regarding feature point repeatability as well as the number of detected and matched feature descriptors under illumination changes and noise degradations.

The rest of this paper is organized as follows. In Section 2, the related works are recalled. In Section 3, the phase-based approach is described. In Section 4, the proposed LUIFT detector and descriptor are presented. In Section 5, computer simulation results are provided and discussed. Finally, Section 6 summarizes our conclusions.

2. Related Works

Early works on image feature points began with the research of Attneave [14], showing that the most important shape information of an image is concentrated at the contour points with high curvature values, such as corners and junctions. Since then, several techniques for feature detection have been developed, such as contour-curvature-based methods [8, 24], blob-like detector techniques [16], differential approaches [8, 17], intensity-variation-based techniques [25, 26], and, recently, learning-based methods [27–29].

The Harris corner detector [30], an improvement of the Moravec approach [31], is one of the first and most widely used corner detectors; it describes the gradient distribution in a local neighborhood of a point based on the second-moment matrix. The feature points are obtained at the points where the local gradient varies significantly in two directions. Similarly to the Harris matrix, the Hessian matrix [32] is constructed from the second-order Taylor expansion of the intensity surface and encodes the shape information of the image. Recently, a Harris-based (HarrisZ) corner detector was proposed [33]. The HarrisZ corner detector uses a z-score to adapt the corner response function, searching for corners near edges with a coarse gradient mask.

SUSAN (Smallest Univalue Segment Assimilating Nucleus) [25] and, more recently, FAST (Features from Accelerated Segment Test) [26] corner detectors are also intensity-based techniques. They rapidly obtain feature points by associating image points in a local area with similar brightness. The FAST detector is based on the SUSAN detector, but it uses more efficient decision trees to evaluate pixel intensity values.

The SIFT (Scale Invariant Feature Transform) descriptor [9, 34] utilizes an approximation of the LoG (Laplacian of Gaussian) and HOG (Histograms of Oriented Gradients) [35] for scale and rotation invariance, respectively. To date, the SIFT descriptor remains the most popular state-of-the-art descriptor due to its effectiveness in feature detection and matching under scale and rotation changes, which is why different variations of the SIFT descriptor have been proposed. The SURF (Speeded Up Robust Features) [11, 36] and KAZE [12] descriptors are two examples. Unlike the SIFT method, the SURF descriptor uses Haar-like filters and integral images to improve the processing time at the expense of performance, while the KAZE descriptor builds a nonlinear scale space with locally adaptive blurring. The CenSurE (Center Surround Extremas) feature detector [37] is based on an estimation of the LoG using simple center-surround filters and integral images for real-time tasks. The Daisy descriptor [10] is inspired by the SIFT and GLOH [17] descriptors but is computed more efficiently by replacing weighted sums with sums of convolutions.

Binary descriptors have also been suggested. FREAK (Fast Retina Keypoints) [38], BRIEF (Binary Robust Independent Elementary Features) [39], and BRISK (Binary Robust Invariant Scalable Keypoints) [40] are some of them. Basically, they carry out pairwise intensity comparisons within an image patch and use the Hamming distance for fast feature matching.

Although all of the mentioned methods provide satisfactory results for affine image transformations (rotation and scale), they are usually constructed on the basis of differences between the pixel intensities of the image, which makes them sensitive to nonuniform illumination variations and noise degradations. To obtain descriptors robust to intensity variations, new methods have been proposed. The DaLI (Deformation and Light Invariant) descriptor [27] was developed for nonrigid transformations and illumination changes. The 2D image patches are considered as 3D surfaces and described in terms of a heat kernel signature; then, Principal Component Analysis (PCA) is applied to reduce the descriptor dimensionality. However, the DaLI descriptor is not invariant to scale and rotation distortions and has a high complexity due to the computation of eigenvalues for the heat diffusion equation. The TILDE (Temporally Invariant Learned DEtector) [13] and LIFT (Learned Invariant Feature Transform) [28] methods use learning for feature detection and description. Basically, the detector is trained to retain those features that remain stable under different conditions. However, a training stage and a collection of image patches are needed. The LIOP (Local Intensity Order Pattern) descriptor [21] is based on the order of intensity values, under the assumption that the relative order of pixel intensities remains unchanged with monotonic intensity changes. However, nonuniform illumination variations are not considered.

In this work, we propose a phase-based feature detector and descriptor. Unlike the methods mentioned above, the proposed technique utilizes the image local phase information instead of relying on changes in image pixel intensities. So, there are two main contributions of the proposed work: first, since the local phase contains the most important image information and is invariant to changes in pixel intensity [41], the proposed method is robust to nonuniform illumination variations; second, since the proposed method utilizes the local phase congruency approach rather than only image gradients, it is robust to heavy noise degradations.

3. Phase-Based Signal Model

Ever since the work of Hubel and Wiesel [42], it has been known that different groups of neurons in the biological visual cortex, called simple cells, respond selectively to bars and edges at particular orientations and locations. Furthermore, psychophysical evidence suggests the existence of frequency-selective V1 neurons operating as bandpass filters and the computation of complex cell energies as a sum of squared responses of simple cells (see [23]).

Morrone and Owens proposed a model of the perception of features such as edges, lines, and shadows, called the local energy model [43–45]. According to this model, the human visual system is able to distinguish between a square waveform and a trapezoid by using phase information, and it can be proved that the maximum of the energy function occurs at the points of maximum phase congruency [46]. Continuing with this approach, Kovesi [47–49] proposed a dimensionless measure of phase congruency at each point of an image, where the phase congruency value indicates the significance of the current feature; that is, unity means the most significant feature, and zero indicates the lowest significance.

Felsberg et al. [50] provided a framework to obtain features based on the phase of an image. Unlike other works, they did not use steerable filters, such as Gabor filters, to get the image features. Instead, they proposed a new concept of a two-dimensional analytic signal, referred to as the monogenic signal [51].

3.1. Local Energy Model and Phase Congruency Approach

The local energy model [44, 45] postulates that the visual system locates features by searching for maxima of the local energy and identifies the feature type by evaluating the argument (local phase) at those points.

Formally, let the pair of filters $f$ and $f_H$ be the basic operators of the model, with equal magnitude spectra but orthogonal phases (here $f_H$ denotes the Hilbert transform of $f$). The local energy function is defined as
$$E(x)=\sqrt{\big(I\ast f\big)^{2}(x)+\big(I\ast f_{H}\big)^{2}(x)},$$
where $I(x)$ is a periodic signal and $\ast$ is the convolution operator.
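To make the definition concrete, here is a minimal 1D sketch in Python (not part of the original method), using scipy.signal.hilbert to form the quadrature pair; band-limiting by the filter $f$ is omitted for brevity:

```python
import numpy as np
from scipy.signal import hilbert

def local_energy(signal):
    """1D local energy E(x) = sqrt((I*f)^2(x) + (I*f_H)^2(x)).

    scipy.signal.hilbert returns the analytic signal I + j*(H I), whose
    magnitude is the local energy; a practical implementation would
    first band-limit the signal with the filter f.
    """
    analytic = hilbert(signal - np.mean(signal))
    return np.abs(analytic)

# The energy peaks at the feature: here, the center of a "line" (bump)
t = np.arange(128)
line = np.exp(-0.5 * ((t - 64) / 3.0) ** 2)
print(np.argmax(local_energy(line)))  # 64
```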

The local energy function locates the position of image features, but it carries no information about the feature type. To determine the feature type, it is necessary to consider the argument defined as follows:
$$\Phi(x)=\operatorname{atan2}\big((I\ast f_{H})(x),\,(I\ast f)(x)\big).$$

On the other hand, a periodic function $I(x)$ can be expanded in its Fourier components as follows:
$$I(x)=\sum_{n}A_{n}\cos\big(\phi_{n}(x)\big),$$
where $A_{n}$ and $\phi_{n}(x)$ represent the magnitude and the local phase of the $n$th Fourier component, respectively. The phase congruency function is defined as follows [44]:
$$PC_{1}(x)=\max_{\bar{\phi}(x)\in[0,2\pi]}\frac{\sum_{n}A_{n}\cos\big(\phi_{n}(x)-\bar{\phi}(x)\big)}{\sum_{n}A_{n}},$$
where $\bar{\phi}(x)$ is the weighted mean local phase angle of all Fourier components at the point $x$. The congruency of phase at any angle produces a local feature. A phase congruency value of one means that most of the Fourier component phases are similar and, therefore, there exists a local feature (edge or line), while a phase congruency value of zero indicates the lack of structure. Besides, the value of $\bar{\phi}(x)$ determines the nature of the feature: values near $0$ and $\pi$ correspond to a line feature, and values near $\pi/2$ and $3\pi/2$ correspond to an edge feature.

Unfortunately, the function $PC_{1}(x)$ is highly sensitive to noise and frequency spread. To overcome this problem, the following definition of the phase congruency function was proposed [47]:
$$PC_{2}(x)=\frac{W(x)\,\big\lfloor E(x)-T\big\rfloor}{\sum_{n}A_{n}(x)+\varepsilon},$$
where $W(x)$ is a weight for the frequency spread, $E(x)$ represents the signal energy, $T$ is a noise threshold parameter, $\varepsilon$ is a small constant to avoid division by zero, and $\lfloor\cdot\rfloor$ denotes that the enclosed quantity equals itself when positive and zero otherwise. We refer to the papers [47–49] for more details.
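The following 1D sketch (an illustration, not the reference implementation) computes this measure; as simplifications, plain Fourier harmonics replace the log-Gabor bank of [47], the weight $W$ is fixed to 1, and $T$ is a constant:

```python
import numpy as np
from scipy.signal import hilbert

def phase_congruency_1d(x, n_bands=16, T=0.1, eps=1e-6):
    """PC2 = W * max(E - T, 0) / (sum_n A_n + eps).

    Each Fourier harmonic is isolated and converted to its analytic
    form A_n * exp(j*phi_n); their sum gives the local energy vector.
    """
    N = len(x)
    F = np.fft.rfft(x - x.mean())
    energy_vec = np.zeros(N, dtype=complex)   # sum_n A_n * exp(j*phi_n)
    amp_sum = np.zeros(N)                     # sum_n A_n
    for n in range(1, min(n_bands, len(F) - 1) + 1):
        band = np.zeros_like(F)
        band[n] = F[n]
        comp = np.fft.irfft(band, N)          # n-th Fourier component
        analytic = hilbert(comp)
        energy_vec += analytic
        amp_sum += np.abs(analytic)
    E = np.abs(energy_vec)                    # local energy E(x)
    W = 1.0                                   # frequency-spread weight (simplified)
    return W * np.maximum(E - T, 0.0) / (amp_sum + eps)
```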

In practice, local frequency information is obtained via banks of oriented 2D Gabor filters, but this procedure is computationally expensive. Felsberg and Sommer [52] proposed the monogenic signal, a generalization of the 1D analytic signal that provides a theoretical framework for obtaining local frequency information.

3.2. The Monogenic Signal

The monogenic signal [52] is the combination of a 2D signal with its first-order Riesz transform, defined as follows.

Let $H_{1}(u,v)$ and $H_{2}(u,v)$ be the transfer functions of the first-order Riesz transform in the frequency domain:
$$H_{1}(u,v)=i\frac{u}{\sqrt{u^{2}+v^{2}}},\qquad H_{2}(u,v)=i\frac{v}{\sqrt{u^{2}+v^{2}}}.$$

The monogenic signal in the frequency domain is defined as follows:
$$F_{M}(u,v)=\big(F(u,v),\;H_{1}(u,v)F(u,v),\;H_{2}(u,v)F(u,v)\big),$$
where $F(u,v)$ is the Fourier transform of the image $f(x,y)$ and $(u,v)$ are the frequency coordinates.
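A minimal frequency-domain sketch of the (full-band) monogenic signal, assuming NumPy FFT conventions, is given below:

```python
import numpy as np

def monogenic_signal(img):
    """Monogenic signal (f, r1, r2) of a 2D image via the Riesz
    transfer functions H1 = j*u/|(u,v)| and H2 = j*v/|(u,v)|."""
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)[None, :]
    v = np.fft.fftfreq(rows)[:, None]
    rho = np.sqrt(u**2 + v**2)
    rho[0, 0] = 1.0                         # avoid division by zero at DC
    H1, H2 = 1j * u / rho, 1j * v / rho
    F = np.fft.fft2(img.astype(float))
    r1 = np.real(np.fft.ifft2(H1 * F))      # first Riesz component
    r2 = np.real(np.fft.ifft2(H2 * F))      # second Riesz component
    return img.astype(float), r1, r2
```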

In order to perform a scale decomposition of a signal into a set of partial signals, it is necessary to calculate the monogenic signal for narrow bandwidths. A good approximation of the scale decomposition can be achieved by using appropriate bandpass filters to obtain localization in both the spatial and frequency domains.

3.3. Scale-Space Monogenic Signal

Felsberg and Sommer [53] defined the linear Poisson scale-space representation as an alternative to the well-known Gaussian scale-space, because it is closely related to the monogenic signal. The Poisson scale-space is defined as the convolution of the image with the Poisson kernel as follows:
$$I(x,y;s)=\big(I\ast p(\cdot;s)\big)(x,y),\qquad p(x,y;s)=\frac{s}{2\pi\big(x^{2}+y^{2}+s^{2}\big)^{3/2}},$$
where $s>0$ is the scale parameter that controls the degree of image resolution. The combination of two lowpass filters with a fixed ratio of scale parameters gives a family of bandpass filters with a constant relative bandwidth, defined in the frequency domain as
$$B_{k}(u,v)=e^{-2\pi\rho s_{0}\lambda^{k}}-e^{-2\pi\rho s_{0}\lambda^{k-1}},\qquad \rho=\sqrt{u^{2}+v^{2}},\quad 0<\lambda<1,$$
where $\lambda$ indicates the relative bandwidth, $s_{0}$ is the coarsest scale, and $k$ denotes the bandpass number [54]. The Poisson scale-space representation in the frequency domain of the image filtered by the $k$th bandpass filter is given by
$$F_{k}(u,v)=B_{k}(u,v)\,F(u,v).$$
Then, the Poisson scale-space monogenic signal representation is formed by
$$f_{M,k}=\big(f_{k},\,r_{1,k},\,r_{2,k}\big),$$
where $r_{1,k}=\mathcal{F}^{-1}\{H_{1}F_{k}\}$ and $r_{2,k}=\mathcal{F}^{-1}\{H_{2}F_{k}\}$ in the spatial domain.

Therefore, the local energy $e_{k}$, local amplitude $A_{k}$, local orientation $\theta_{k}$, local direction, and local phase $\varphi_{k}$ (note that $\operatorname{atan2}(y,x)=\operatorname{sign}(y)\arccos\big(x/\sqrt{x^{2}+y^{2}}\big)$, where the factor $\operatorname{sign}(y)$ indicates the direction of rotation) can be computed as follows:
$$e_{k}=f_{k}^{2}+r_{1,k}^{2}+r_{2,k}^{2},\qquad A_{k}=\sqrt{e_{k}},$$
$$\theta_{k}=\arctan\!\left(\frac{r_{2,k}}{r_{1,k}}\right),\qquad \varphi_{k}=\operatorname{atan2}\!\left(\sqrt{r_{1,k}^{2}+r_{2,k}^{2}},\,f_{k}\right),$$
with the local direction extending the orientation $\theta_{k}\in[0,\pi)$ to the full $[0,2\pi)$ range by using the sign of the odd (Riesz) part.

Figure 1 shows a block diagram for computing the monogenic scale-space signal.
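The following sketch mirrors that block diagram; the parameter values s0 (coarsest scale) and lam (relative bandwidth) are illustrative choices, not the paper's settings:

```python
import numpy as np

def monogenic_scale_space(img, n_scales=4, s0=20.0, lam=0.5):
    """Scale-space monogenic signal with difference-of-Poisson bandpass
    filters B_k = exp(-2*pi*rho*s0*lam^k) - exp(-2*pi*rho*s0*lam^(k-1)).
    Returns (f_k, r1_k, r2_k, amplitude, phase, direction) per scale.
    """
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)[None, :]
    v = np.fft.fftfreq(rows)[:, None]
    rho = np.sqrt(u**2 + v**2)                   # radial frequency
    rho_safe = rho.copy()
    rho_safe[0, 0] = 1.0                         # avoid 0/0 at DC
    H1, H2 = 1j * u / rho_safe, 1j * v / rho_safe
    F = np.fft.fft2(img.astype(float))
    scales = []
    for k in range(1, n_scales + 1):
        Bk = np.exp(-2 * np.pi * rho * s0 * lam**k) \
           - np.exp(-2 * np.pi * rho * s0 * lam**(k - 1))
        Fk = Bk * F
        f_k = np.real(np.fft.ifft2(Fk))          # even (bandpass) part
        r1 = np.real(np.fft.ifft2(H1 * Fk))      # odd parts (Riesz pair)
        r2 = np.real(np.fft.ifft2(H2 * Fk))
        amp = np.sqrt(f_k**2 + r1**2 + r2**2)            # A_k
        phase = np.arctan2(np.sqrt(r1**2 + r2**2), f_k)  # local phase
        direction = np.arctan2(r2, r1)                   # local direction
        scales.append((f_k, r1, r2, amp, phase, direction))
    return scales
```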

4. Proposed Feature Detector and Descriptor

In this section, the proposed LUIFT feature detector and descriptor are described. The feature detector is constructed using a modified Harris corner detector and the phase congruency approach, while the feature descriptor is constructed using a modified HOG-based method.

4.1. Feature Detector

First, using the monogenic scale-space framework (see Figure 1) with a set of bandpass filters $\{B_{k}\}$, the scale-space monogenic signal and the sum of amplitudes $\sum_{k}A_{k}$ are computed. Note that, by increasing the bandpass number $k$, finer-scale features are revealed. The phase congruency function $PC_{2}$ of Section 3.1 can be calculated for each point $z=(x,y)$ of the image as follows:
$$PC(z)=\frac{W(z)\,\big\lfloor E(z)-T\big\rfloor}{\sum_{k}A_{k}(z)+\varepsilon},\qquad E(z)=\sqrt{\Big(\sum_{k}f_{k}\Big)^{2}+\Big(\sum_{k}r_{1,k}\Big)^{2}+\Big(\sum_{k}r_{2,k}\Big)^{2}},$$
where the energy $E(z)$ and the sum of the amplitudes are obtained from the scale-space monogenic signal. The frequency-spread weight $W(z)$ and the noise threshold $T$ are calculated as in [47].
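Continuing the scale-space sketch above, the PC map can be computed as follows; fixing W = 1 and a constant T is a simplification of the estimation procedure of [47]:

```python
import numpy as np

def phase_congruency_2d(scales, T=0.1, eps=1e-6):
    """Phase congruency from a list of (f_k, r1_k, r2_k, A_k, ...) scale
    tuples, as returned by monogenic_scale_space() above."""
    f_sum = sum(s[0] for s in scales)
    r1_sum = sum(s[1] for s in scales)
    r2_sum = sum(s[2] for s in scales)
    amp_sum = sum(s[3] for s in scales)
    E = np.sqrt(f_sum**2 + r1_sum**2 + r2_sum**2)    # local energy E(z)
    W = 1.0                                          # simplified weight
    return W * np.maximum(E - T, 0.0) / (amp_sum + eps)
```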

Next, in order to obtain the feature point candidates, a modified Harris corner detector is utilized.

Let $M$ be the Harris matrix defined by
$$M=\sum_{(x,y)\in W}\begin{pmatrix} I_{x}^{2} & I_{x}I_{y}\\ I_{x}I_{y} & I_{y}^{2}\end{pmatrix},$$
where $I_{x}$ and $I_{y}$ are the partial derivatives of the image $I$ accumulated over a local window $W$. Considering the scale-space monogenic signal, the derivatives of the Harris matrix ($I_{x}$, $I_{y}$) are replaced by the monogenic signal components ($r_{1}$, $r_{2}$) as follows:
$$M=\sum_{(x,y)\in W}\begin{pmatrix} r_{1}^{2} & r_{1}r_{2}\\ r_{1}r_{2} & r_{2}^{2}\end{pmatrix},$$
where $r_{1}$ and $r_{2}$ are normalized. Then, the corner detector function defined in [30] is utilized to obtain corner feature candidates,
$$R=\det(M)-\kappa\operatorname{tr}^{2}(M),$$
where $\kappa$ is a sensitivity parameter, commonly set within $[0.04,0.06]$.

The obtained candidate features are weighted by their corresponding phase congruency values in order to extract feature points with high phase congruency; that is,
$$\tilde{R}(z)=R(z)\,PC(z).$$
Then, thresholding followed by a nonmaximum suppression algorithm is applied to obtain the final feature points. Since the PC value indicates the significance of the detected features (see Section 3.1), the threshold value controls the number of features to be preserved or eliminated. A threshold close to one keeps only those features that belong to sharp lines or borders in the image. By lowering the threshold value, important features belonging to borders and lines with low contrast, high brightness, or blur degradations can be preserved. For our experiments, a threshold of 0.3 was experimentally chosen. Figure 2 illustrates the performance of the proposed feature detector.
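A sketch of the whole detection step is given below, assuming pc is the phase congruency map and r1, r2 are Riesz components summed over scales; the smoothing sigma and the suppression window size are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def luift_detect(pc, r1, r2, kappa=0.04, pc_thresh=0.3, nms_size=5):
    """Modified Harris detector sketch: image derivatives are replaced
    by normalized Riesz components; the response is weighted by phase
    congruency, thresholded, and non-max suppressed."""
    r1 = r1 / (np.abs(r1).max() + 1e-8)
    r2 = r2 / (np.abs(r2).max() + 1e-8)
    # Second-moment matrix entries accumulated over a local window
    A = gaussian_filter(r1 * r1, sigma=2.0)
    B = gaussian_filter(r2 * r2, sigma=2.0)
    C = gaussian_filter(r1 * r2, sigma=2.0)
    R = (A * B - C * C) - kappa * (A + B) ** 2   # det(M) - kappa*tr(M)^2
    R = R * pc                                    # weight candidates by PC
    peaks = (R == maximum_filter(R, size=nms_size)) & (R > 0)
    return np.argwhere(peaks & (pc > pc_thresh))  # (row, col) feature points
```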

4.2. Feature Descriptor

Because histograms of oriented gradients [35] are robust to small deformations such as scale changes and rotations, a modified HOG-based descriptor is constructed. For each detected feature point, a spatial neighborhood around the feature is extracted and weighted by a Gaussian kernel. Next, the neighborhood is split into sixteen subneighborhoods. For each subneighborhood, a Histogram of Oriented Phase Congruency (HOPC) is computed using the local direction $\vartheta$ (see Section 3.3) between 0 and 360 degrees, in such a manner that the amount added to each bin is the phase congruency value of each point:
$$h(b)=\sum_{z:\,\vartheta(z)\in\mathrm{bin}_{b}}PC(z).$$
Figure 3 illustrates the formation of the proposed feature descriptor.

Now, let $r$ be the remainder of the modulo operation $r=\vartheta(z)\bmod \Delta$, where $\Delta$ is the bin width. If either $r$ or $\Delta-r$ is near zero, the direction $\vartheta(z)$ is near the border between two adjacent bins. Therefore, $PC(z)$ could be assigned to one of the bins or divided between them; so, we assign half of the value to each of the adjacent bins.

Besides, to provide invariance to rotation, each histogram is normalized using the prominent orientation obtained as in [34], but taking the local direction $\vartheta$ into account. Then, the sixteen histograms are concatenated and normalized to form the final descriptor.
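The sketch below assembles the descriptor under these rules; the Gaussian weighting and the rotation normalization are omitted, the border-splitting tolerance (10% of the bin width) is an assumed value, and keypoints near the image boundary are assumed to be filtered out beforehand:

```python
import numpy as np

def hopc_descriptor(pc, direction, kp, patch=16, grid=4, n_bins=8):
    """HOPC descriptor sketch: a patch x patch neighborhood around the
    keypoint kp is split into grid x grid subregions; each accumulates
    a histogram of the local direction weighted by phase congruency,
    splitting values near a bin border in half between adjacent bins.
    """
    r, c = kp
    half = patch // 2
    win_pc = pc[r - half:r + half, c - half:c + half]
    win_dir = np.mod(direction[r - half:r + half, c - half:c + half],
                     2 * np.pi)
    bin_w = 2 * np.pi / n_bins
    cell = patch // grid
    hists = []
    for i in range(grid):
        for j in range(grid):
            h = np.zeros(n_bins)
            sub_pc = win_pc[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            sub_dir = win_dir[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            for w, d in zip(sub_pc, sub_dir):
                q = d // bin_w
                b = int(q) % n_bins
                rem = d - q * bin_w                      # offset inside bin b
                if min(rem, bin_w - rem) < 0.1 * bin_w:  # near a bin border
                    nb = (b - 1) % n_bins if rem < bin_w / 2 else (b + 1) % n_bins
                    h[b] += w / 2.0
                    h[nb] += w / 2.0
                else:
                    h[b] += w
            hists.append(h)
    desc = np.concatenate(hists)              # 16 concatenated histograms
    return desc / (np.linalg.norm(desc) + 1e-8)
```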

5. Experimental Results

In this section, the performance of the proposed LUIFT algorithm is experimentally presented and analyzed. Three versions of the LUIFT descriptor are evaluated: LUIFT_8, LUIFT_36, and LUIFT_64, which utilize 8, 36, and 64 bins, respectively. The performance of the proposed LUIFT method is compared with that of the FAST [26], STAR [37], SIFT [9], SURF [11], KAZE [12], HarrisZ [33], DAISY [10], and LIOP [21] detectors and descriptors. All simulations were performed using the OpenCV (http://opencv.org/) library, with the exception of the LIOP descriptor, which was evaluated in Matlab using the VLFeat (http://www.vlfeat.org/) library.

5.1. Evaluation Setup

To evaluate the performance of the tested methods, the repeatability score, the matching score, and the overlap error are considered.

Let $P=\{p_{i}\}$ be the set of feature points detected in the original image $I$, $H$ be a transformation matrix, and $Q=\{q_{j}\}$ be the set of feature points detected in the test image $I'$. A correspondence is declared if $d(Hp_{i},q_{j})\le\varepsilon$, where $d(\cdot,\cdot)$ denotes the Euclidean distance and $\varepsilon$ is a tolerance of a few pixels [55]. The feature detector performance is evaluated using the repeatability score [15], defined as the ratio between the number of point-to-point correspondences and the minimum number of points detected in both images.
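A compact sketch of the repeatability computation, with the pixel tolerance eps as an assumed value:

```python
import numpy as np

def repeatability(pts_ref, pts_test, H, eps=1.5):
    """Repeatability = #correspondences / min(#points in either image).
    pts_* are (N, 2) arrays of (x, y) points; H is a 3x3 homography.
    """
    ones = np.ones((len(pts_ref), 1))
    proj = (H @ np.hstack([pts_ref, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]                 # back to Cartesian
    # Distance from each projected reference point to every test point
    d = np.linalg.norm(proj[:, None, :] - pts_test[None, :, :], axis=2)
    corr = np.sum(d.min(axis=1) <= eps)               # matched points
    return corr / min(len(pts_ref), len(pts_test))
```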

For the descriptor matching performance, two descriptors are matched if the distance between them is below a threshold $t$. According to [34], a correspondence is accepted if the ratio between the distances to the first and second nearest neighbors is less than or equal to 0.9. To find the nearest neighbors, the Fast Library for Approximate Nearest Neighbors (FLANN) [56] is exploited.

The results are presented by the recall vs 1-precision curve. Recall and 1-precision are defined as follows [17]:
$$\mathrm{recall}=\frac{\#\,\text{correct matches}}{\#\,\text{correspondences}},\qquad 1-\mathrm{precision}=\frac{\#\,\text{false matches}}{\#\,\text{correct matches}+\#\,\text{false matches}}.$$
The correct matches are determined with the overlap error [15]. Basically, the overlap error (also called surface error) indicates how well two detected feature regions intersect. It is defined through the ratio of the intersection of the regions and their union as follows:
$$\epsilon_{O}=1-\frac{R_{\mu_{a}}\cap R_{H^{\top}\mu_{b}H}}{R_{\mu_{a}}\cup R_{H^{\top}\mu_{b}H}},$$
where $R_{\mu_{a}}$ and $R_{\mu_{b}}$ are the elliptic regions defined by the second-moment matrices $\mu_{a}$ and $\mu_{b}$, satisfying $x^{\top}\mu\,x=1$, and $H$ is the locally linearized homography at the point.

Finally, the matching score is computed as
$$\text{matching score}=\frac{\#\,\text{correct matches}}{\min\big(\#\,\text{points detected in }I,\ \#\,\text{points detected in }I'\big)}.$$
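The matching protocol can be sketched as follows; brute-force search stands in for FLANN, and both the correctness test (e.g., based on the overlap error) and the number of ground-truth correspondences are assumed to be provided externally:

```python
import numpy as np

def match_and_score(desc_a, desc_b, is_correct, n_corr, ratio=0.9):
    """Distance-ratio matching plus recall, 1-precision, and matching
    score. desc_a, desc_b are (N, D) and (M, D) descriptor arrays;
    is_correct(i, j) decides whether a match is correct."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    correct = false = 0
    for i in range(len(desc_a)):
        j1, j2 = np.argsort(d[i])[:2]          # two nearest neighbors
        if d[i, j1] <= ratio * d[i, j2]:       # ratio test (<= 0.9)
            if is_correct(i, j1):
                correct += 1
            else:
                false += 1
    recall = correct / max(n_corr, 1)
    one_minus_precision = false / max(correct + false, 1)
    matching_score = correct / min(len(desc_a), len(desc_b))
    return recall, one_minus_precision, matching_score
```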

5.2. Synthetic Dataset Evaluation

In order to evaluate the performance of the proposed LUIFT detector and descriptor, a synthetic grayscale (range 0 to 255) dataset was created. The dataset contains 7,254 images: 2,106 correspond to three different scenes (butterfly, gogh, and graffiti) scaled (6 scales) and rotated (13 rotations) under nonuniform illumination (9 variations); 2,106 correspond to the three scenes scaled and rotated under additive Gaussian noise (9 variations); and 3,042 correspond to the three scenes scaled and rotated under brightness and contrast changes (13 variations). Figure 4 shows examples of the synthetic dataset images.

The test images are corrupted by zero-mean additive white Gaussian noise, varying the standard deviation $\sigma$ (see Table 1).

Nonuniform illumination is simulated using the Lambertian model [57], in which the test image is multiplied pointwise by an illumination surface:
$$I_{L}(x,y)=I(x,y)\,L(x,y).$$

The multiplicative function $L(x,y)$ depends on the parameter $l$, which is the distance between a point on the surface and the light source, and on the parameters $\tau$ and $\sigma$, which are the tilt and slant angles, respectively. In our experiments, the tilt and slant angles were fixed (see Table 1), varying the distance parameter $l$.

Brightness and contrast changes are simulated by
$$I_{bc}(x,y)=c\,I(x,y)+b,$$
where $b$ and $c$ represent the brightness and contrast parameters, respectively.
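A small sketch of two of these degradations (the Lambertian surface is omitted since its exact form follows [57]; the parameter values are illustrative, not those of Table 1):

```python
import numpy as np

def degrade(img, sigma=10.0, b=20.0, c=0.8, seed=None):
    """Zero-mean additive white Gaussian noise and the linear
    brightness/contrast change I' = c*I + b, for a [0, 255] image."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, img.shape)   # AWGN, std sigma
    bright = c * img + b                              # brightness/contrast
    return np.clip(noisy, 0, 255), np.clip(bright, 0, 255)
```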

Table 1 summarizes the parameters used to generate the synthetic images.

5.2.1. Simulation Results

Using the synthetic dataset (Section 5.2), four experiments were conducted to evaluate the performance of the proposed LUIFT method under nonuniform illumination, noise, brightness, and contrast variations. The performance of the proposed LUIFT method is compared with that of the common SIFT [34] and SURF [36] methods in terms of the repeatability and matching scores.

Our first experiment, for nonuniform illumination conditions, is carried out by varying the distance parameter $l$ in the test images (rotated and scaled scenes). Figure 5 shows the obtained simulation results for nonuniform illumination in terms of the repeatability and matching scores. It can be observed that all the tested methods are capable of detecting and matching feature points of the synthetic test images. However, the feature detection performance, as well as the matching performance, of the SIFT and SURF methods decreases considerably when the illumination becomes more nonuniform. Note that the proposed method significantly outperforms the tested methods on low-illuminated scenes, reaching up to a 50% improvement.

The next experiment tests the method performance under Gaussian noise degradations, carried out by varying the standard deviation $\sigma$ in the test images (rotated and scaled scenes). Figure 6 shows the simulation results for noise degradation in terms of the repeatability and matching scores.

The performance of the SIFT method decreases as the noise variance increases, whereas the performance of the SURF detector remains stable. In terms of the repeatability score, the SIFT and SURF detectors perform almost 20% worse than the proposed LUIFT method, and the SURF method shows the worst matching score among all tested descriptors.

The final experiments, for brightness and contrast variations, are carried out by varying the $b$ and $c$ parameters in the test images (rotated and scaled scenes). Figures 7 and 8 show the simulation results for contrast and brightness variations in terms of the repeatability and matching scores, respectively. The obtained results show that the SIFT method is less sensitive to monotonic illumination changes. However, the proposed method yields the best performance in terms of both the repeatability and matching scores.

Next, in order to compare the performance of the proposed detector and descriptor to that of the state-of-the-art methods in real scene images, the OFFICE (http://www.zhwang.me/datasets.html) and the PHOS (http://www.computervisiononline.com/dataset/1105138614) datasets were utilized.

5.3. Real Dataset Experiments

The OFFICE dataset, proposed in [21], contains two different scenes called corridor and desktop. Each scene set contains 5 images with monotonic illumination variations (see Figure 9). For each image set, the performance of the proposed descriptor and of the state-of-the-art methods is evaluated.

Figure 10 shows the performance of the tested methods in terms of repeatability for the feature detector and the recall vs 1-precision curve for the feature descriptor. It can be observed that the proposed descriptor obtains superior performance compared with that of the evaluated state-of-the-art methods. Although the performance of the FAST feature detector looks similar to that of the proposed LUIFT detector for the corridor scene in terms of repeatability (Figure 10(a)), the number of correct feature points detected in all the images by the proposed detector is greater than that of the FAST detector (Figure 10(b)). Furthermore, the number of features detected in the original image using the FAST detector decreases by more than 50% as the corridor scene is degraded (Figure 10(b)), and by almost 75% for the desktop scene (Figure 10(e)). The main drawback of the FAST detector is that the desired number of features detected by the method needs to be adjusted for each type of scene or task. Note that it is important for detector methods not only to have a high repeatability score, but also to obtain a high number of correct points.

The PHOS dataset [58] was also used. The PHOS dataset contains 15 different scenes (see Figure 11) captured under different illumination conditions. Every scene of the dataset contains 15 different images: 9 images captured under different uniform illumination, varying the camera exposure between -4 and +4 from the original correctly exposed image (see Figure 12(a)), and 6 images under different degrees of nonuniform illumination, obtained by adding a strong directional light source to the uniform diffusive lights located around the objects (see Figure 12(b)).

Figure 13 shows the performance of the proposed LUIFT descriptor and the state-of-the-art methods on the PHOS dataset in terms of repeatability and the recall vs 1-precision curve. For the case of exposure variations, Figure 13(a) shows the average feature detector performance in terms of repeatability, and Figure 13(b) shows the average feature descriptor performance in terms of the recall vs 1-precision curve; Figures 13(c) and 13(d) show the corresponding results for nonuniform illumination variations. The performance of the proposed LUIFT detector and descriptor is superior to that of all the tested methods, even for the LUIFT_8 descriptor.

The performance of the tested methods for each scene set (including exposure and nonuniform illumination variations) is shown in Figure 14 in terms of the recall vs 1-precision curve. The proposed method outperforms the other descriptors in all cases.

Finally, Table 2 shows the computation time (ms) required by the tested methods for processing the graffiti image. As expected, the SURF descriptor is faster than all other methods because of its use of Haar-like filters and integral images, which improve the processing time at the expense of performance. On the other hand, the SIFT descriptor is based on Laplacian of Gaussian approximations instead of computing second-order derivatives, which are more computationally expensive. However, since the Laplacian of Gaussian is approximated by differences of Gaussian images, errors in feature location or lost features may occur. Besides, the SIFT descriptor duplicates feature points when they have two prominent orientations, collecting more features than the proposed method. All the experiments were performed on a standard PC with an Intel Xeon E5-1603 processor at 2.8 GHz and 16 GB of RAM.

6. Conclusions

In this work, a robust phase-based detector and descriptor for pattern recognition in degraded images, based on the scale-space monogenic signal and the phase congruency approach, was presented. With the help of computer simulations, the performance of the proposed method was compared with that of the state-of-the-art methods. The proposed method shows superior performance under illumination variations and noise degradations. Besides, the results obtained on typical datasets for the evaluation of feature detection and matching performance are competitive with those of the state-of-the-art descriptors. The performance of the proposed method can be further improved by including a pyramidal scale decomposition into the design. Since the proposed method is inherently local, a fast GPU implementation is straightforward.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The work was supported by the Ministry of Education and Science of Russian Federation (Grant 2.1743.2017) and by the RFBR (Grant no. 18-08-00782).