Hierarchical Feature Extraction Assisted with Visual Saliency for Image Quality Assessment
Image quality assessment (IQA) aims to evaluate the perceptual quality of an image in a manner consistent with subjective ratings. Considering the characteristics of the hierarchical visual cortex, a novel full reference IQA method is proposed in this paper. Quality-aware features to which the human visual system is sensitive are extracted to describe image quality comprehensively. Concretely, log Gabor filters and local tetra patterns are employed to capture spatial frequency and local texture features, which are attractive to the primary and secondary visual cortex, respectively. Moreover, images are enhanced before feature extraction with the assistance of visual saliency maps, since visual attention affects human evaluation of image quality. The similarities between the features extracted from the distorted image and the corresponding reference image are synthesized and mapped into an objective quality score by support vector regression. Experiments conducted on four public IQA databases show that the proposed method outperforms other state-of-the-art methods in terms of both accuracy and robustness; that is, it is highly consistent with subjective evaluation and is robust across different databases.
1. Introduction
Image processing plays an indispensable role in our daily lives and in numerous professional fields. However, many factors can degrade image quality during image acquisition, compression, transmission, restoration, and other procedures. Therefore, image quality assessment (IQA), which aims to automatically estimate the quality loss due to distortions, is crucial for performance estimation and optimization in various image processing systems. Since the human visual system (HVS) is the ultimate receiver of images, subjective IQA completed by human observers reflects the perceptual quality of images faithfully, yet it is cumbersome, time-consuming, and unstable, which makes it impractical for real-time systems. Thus, accurate and robust objective IQA methods that automatically evaluate image quality are urgently needed. Generally, objective IQA methods can be divided into three classes according to the availability of the corresponding undistorted reference image, i.e., full reference (FR), reduced reference (RR), and no reference (NR) [1–3]. FR methods require full access to the distortion-free reference image; conversely, NR methods have no access to the reference image, while RR methods make use of partial information about the reference image. In this paper, our work is confined to FR methods, which evaluate image quality by measuring the disparity between the reference and distorted images.
The earliest FR methods, such as mean squared error and peak signal-to-noise ratio, are calculated simply from pixel-wise intensity differences. Because they disregard the properties of the HVS, they are widely criticized for not correlating well with subjective ratings. Later methods that try to simulate the function of the HVS, such as the noise quality measure, are criticized for high computational complexity with only a small performance gain. The popular structural similarity method (SSIM), along with its improved versions [7, 8], calculates the fidelity of structural information to measure quality degradation. Moreover, natural image statistics have also been introduced for feature extraction. Great efforts have been made to develop more comprehensive features that accurately quantify human subjective perception of image quality. More complicated feature extraction schemes can be found in more recent works [10, 11].
Numerous studies have confirmed that desirable image features for IQA should be closely relevant to image quality and correlate well with human subjective sensation. Since each quality index reveals image quality from only a certain aspect, a constructive solution is to employ a comprehensive description that includes features representing quality in different aspects, which is exactly what we attempt to achieve based on knowledge of the hierarchical properties of the HVS. Specifically, different areas in the visual cortex respond to different image content. The primary and secondary visual cortex (areas V1 and V2), which occupy the two largest parts of the visual cortex and are most responsible for the generation of vision, are sensitive to the spatial frequency information and the texture information of images, respectively [12, 13]. Based on this fact, log Gabor filters and local pattern analysis are employed in this paper to extract the two kinds of features, respectively. Moreover, the feature extraction and subsequent stages are conducted with the assistance of visual saliency information.
The contributions of this paper are as follows: (1) a comprehensive indicator of image quality is proposed based on extracting hierarchical quality-aware features that are attractive to the HVS with the assistance of visual saliency maps; (2) by statistically quantifying the difference between the quality indicators of distorted images and the corresponding reference images, an effective quality assessment method is developed, which is shown to be accurate and robust on multiple databases.
2. Proposed Method
The proposed method follows a three-step framework as shown in Figure 1. Firstly, features assumed to represent image quality well are extracted from the tested and reference images. Then, the similarities between these features are quantified and regarded as indices that reveal the quality of the tested image. The final step is to build a function that synthesizes the quality indices into an objective quality score.
The framework of feature extraction, shown in Figure 2, can be divided into two stages. In the first stage, the visually important regions of the test and reference images are enhanced with the assistance of a saliency map. The second stage is hierarchical feature extraction. As shown in Figure 2, the spatial frequency and local texture information are captured separately as the lower- and higher-level features. Each procedure involved is elaborated in the sections below.
2.1. Saliency-Assisted Visual Regions Highlight
Visual attention is usually represented by visual saliency in computer vision; a higher saliency value denotes that a region receives more attention. Distortions in regions that receive more attention have a larger impact on subjective sensation than those in less attractive regions. Intuitively, visual saliency is intrinsically related to IQA, since both depend on the behavior of the HVS. Thus, researchers have been trying to integrate saliency models into IQA and have made significant progress [15, 16]. In this paper, we use visual saliency to enhance the images by highlighting the more important regions in an image.
The saliency map is computed from the reference image, since the information needed to capture saliency maps may be damaged in a distorted image. Among various approaches to constructing saliency maps, the spectral residual (SR) visual saliency model is adopted in this paper owing to its robustness and low complexity. The obtained saliency map is then combined with the images by pixel-wise multiplication,
$$I_e(x, y) = SM(x, y) \cdot I(x, y),$$
where $SM$ is the saliency map, $I$ is the original image, and $I_e$ is the enhanced image. Subsequent feature extraction is operated on $I_e$. Figure 3 shows the effect of saliency-assisted enhancement: (a) is the reference image “bikes”; (b) is its distorted version contaminated by JPEG 2000 compression (JP2K), both taken from the LIVE database; (c) is the saliency map computed from the intensity of the reference image (a); and (d) and (e) are the enhanced reference and distorted images. It can be observed that the more salient regions are relatively brighter than other regions.
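The enhancement step can be illustrated with a minimal NumPy sketch. It follows the general idea of the spectral residual model (log amplitude spectrum minus its local average, recombined with the original phase), but several details are assumptions made for brevity rather than the authors' settings: the helper names `box_blur`, `spectral_residual_saliency`, and `enhance` are hypothetical, a box filter stands in for the Gaussian smoothing usually applied to the saliency map, borders wrap around, and the image is not downsampled first.

```python
import numpy as np

def box_blur(a, k=3):
    """Mean filter with a k x k window and wrap-around borders (an assumption)."""
    r = k // 2
    out = np.zeros_like(a, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
    return out / (k * k)

def spectral_residual_saliency(img):
    """Spectral-residual-style saliency map, rescaled to [0, 1]."""
    spectrum = np.fft.fft2(np.asarray(img, dtype=float))
    log_amp = np.log(np.abs(spectrum) + 1e-8)   # log amplitude spectrum
    phase = np.angle(spectrum)
    residual = log_amp - box_blur(log_amp, 3)   # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = box_blur(sal, 5)                      # stand-in for Gaussian smoothing
    sal -= sal.min()
    peak = sal.max()
    return sal / peak if peak > 0 else sal

def enhance(img, sm):
    """Pixel-wise multiplication of an image with the saliency map."""
    return sm * np.asarray(img, dtype=float)

# The saliency map comes from the reference image and is applied to both images.
rng = np.random.default_rng(0)
reference = rng.random((32, 32))
distorted = np.clip(reference + 0.1 * rng.standard_normal((32, 32)), 0.0, 1.0)
sm = spectral_residual_saliency(reference)
enhanced_ref, enhanced_dist = enhance(reference, sm), enhance(distorted, sm)
```

Because the map is rescaled to [0, 1], multiplication can only attenuate a non-negative image, so less salient regions are darkened relative to salient ones.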
2.2. Hierarchical Quality-Aware Feature Extraction
The hierarchical properties of the visual cortex are known to be very complex. Each level of the hierarchy responds to different kinds of visual features; among them, the primary and secondary visual cortex areas (V1 and V2) are the first receivers of visual signals from the eyes, occupy the two largest parts of the visual cortex, and are most responsible for the generation of early vision. Therefore, we emphasize the importance of V1 and V2 in the IQA problem. Since V1 is sensitive to simple features such as edges, bars, and local frequency, while V2 tends to be attracted to higher-level features such as local texture and shape information, the spatial frequency features and local texture features are captured and integrated in this paper for a comprehensive description of image quality. Concretely, the spatial frequency features are represented by the energy maps derived from a log Gabor filter bank, and the local texture features are given by the coding results of local directional texture analysis.
To begin with, Gabor filters are widely used to capture spatial frequency information from images because their multiscale and multidirection properties are similar to those of the HVS. However, Gabor filters have apparent flaws for the purpose of IQA. Firstly, the DC component makes the filter response depend largely on the gray level of smooth regions, which is undesirable because stronger responses are expected to occur in complicated regions rather than in smooth bright regions. Secondly, the maximum bandwidth of Gabor filters is limited to approximately 1 octave. Thirdly, they are insufficient to cover broad spectral information with maximal spatial localization. As an alternative to Gabor filters, log Gabor filters remove the DC component and have a Gaussian transfer function on a logarithmic axis, enabling them to capture information over a broader bandwidth. The filters are defined in polar coordinates; their radial and angular responses are defined as
$$G_r(r) = \exp\left(-\frac{\left(\log(r/f_0)\right)^2}{2\sigma_r^2}\right), \qquad G_\theta(\theta) = \exp\left(-\frac{(\theta - \theta_0)^2}{2\sigma_\theta^2}\right),$$
where $(r, \theta)$ represent the polar coordinates, $f_0$ and $\theta_0$ denote the center frequency and orientation angle of the filter, and $\sigma_r$ and $\sigma_\theta$ denote the scale and angular bandwidths, respectively. The overall frequency response is calculated as the product of the two components. With different $f_0$ and $\theta_0$, a log Gabor filter can be defined at various scales and orientations. In this paper, a filter bank with 6 scales and 4 directions is used for multiscale and multidirection feature extraction. Moreover, a fast Fourier transform (FFT) is applied to speed up the process. After the redundant information is filtered out in the frequency domain, an inverse FFT projects the responses back to the spatial domain. For each filter of a specific scale and orientation, the response is a complex matrix, the magnitude of which is regarded as the extracted spatial frequency feature.
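As a rough illustration of this filtering pipeline, the following NumPy sketch builds a log Gabor transfer function in the frequency domain, applies it with an FFT, and keeps the response magnitude. The function names and all numeric defaults (`f_max`, `mult`, `sigma_r`, `sigma_theta`) are illustrative assumptions; the text fixes only the bank size of 6 scales and 4 directions.

```python
import numpy as np

def log_gabor_filter(shape, f0, theta0, sigma_r, sigma_theta):
    """Frequency-domain log Gabor filter: radial term times angular term."""
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    radius = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)
    radius[0, 0] = 1.0                           # placeholder to avoid log(0)
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * sigma_r ** 2))
    radial[0, 0] = 0.0                           # no DC component
    # Wrap the angular difference into [-pi, pi] before the Gaussian.
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

def log_gabor_magnitude(img, filt):
    """Filter in the frequency domain, return the spatial response magnitude."""
    return np.abs(np.fft.ifft2(np.fft.fft2(img) * filt))

def log_gabor_bank(img, n_scales=6, n_orients=4, f_max=0.25, mult=2.0,
                   sigma_r=0.55, sigma_theta=0.6):
    """Magnitude maps for every (scale, orientation) pair; defaults are assumed."""
    img = np.asarray(img, dtype=float)
    feats = {}
    for s in range(n_scales):
        f0 = f_max / (mult ** s)                 # coarser scale, lower frequency
        for o in range(n_orients):
            theta0 = o * np.pi / n_orients       # 0, 45, 90, 135 degrees
            filt = log_gabor_filter(img.shape, f0, theta0, sigma_r, sigma_theta)
            feats[(s, o)] = log_gabor_magnitude(img, filt)
    return feats
```

Since the filter has no DC term, a perfectly flat image produces an essentially zero response, which matches the motivation given above for preferring log Gabor filters over Gabor filters.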
Given the enhanced reference image and its distorted version, shown in Figures 3(d) and 3(e), Figure 4 shows their log Gabor filter responses at a certain scale in four directions, that is, 0°, 45°, 90°, and 135°; (a–d) are the response magnitude maps of the enhanced reference image and (e–h) are those of the distorted image. The figure shows that the distortion causes obvious damage to the log Gabor filter responses, indicating that log Gabor filters can act as effective indicators of image quality.
The log Gabor filter bank gives a quality description from the perspective of low-level spatial frequency information; however, it is unable to capture the higher-level texture and shape information, which is of great significance for V2. To compensate, local texture features are extracted so that image quality is described comprehensively. Specifically, the first-order local tetra pattern analysis from a directional perspective is adopted, because it captures more detailed and discriminative information than simple local pattern analysis tools such as the local binary pattern and is only moderately complex in computation.
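Given a pixel $g_c$ in an image $I$, let $I(g_c)$, $I(g_h)$, and $I(g_v)$ denote the intensity of $g_c$ and of its right and below adjacent pixels, respectively; then the direction of the pixel is defined as 1, 2, 3, or 4 according to the relationship between the intensities of the target pixel and its horizontal and vertical neighbors,
$$Dir(g_c) = \begin{cases} 1, & I(g_h) - I(g_c) \ge 0 \text{ and } I(g_v) - I(g_c) \ge 0, \\ 2, & I(g_h) - I(g_c) < 0 \text{ and } I(g_v) - I(g_c) \ge 0, \\ 3, & I(g_h) - I(g_c) < 0 \text{ and } I(g_v) - I(g_c) < 0, \\ 4, & I(g_h) - I(g_c) \ge 0 \text{ and } I(g_v) - I(g_c) < 0. \end{cases}$$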
In this way, the image is coded into a direction map in which each item is an integer ranging from 1 to 4. Finally, based on the resulting direction values, four texture maps (TM), one for each direction value, are calculated as the texture features; the value of a texture map at a pixel is determined by the direction values of its eight neighboring pixels (indexed j = 0 to 7).
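The two coding steps can be sketched as follows. The direction rule uses the sign convention of the standard first-order local tetra pattern (right and below neighbor minus center), and the texture-map encoding is an assumption: bit j of map i is set when the jth neighbor has direction i, in the manner of a local binary pattern. The function names, the neighbor ordering in `OFFSETS`, and the edge-replicating border handling are all hypothetical choices, not details recoverable from the text.

```python
import numpy as np

# Assumed ordering of the eight neighbors (j = 0..7), as (row, col) offsets.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def direction_map(img):
    """Code each pixel as 1..4 from the signs of its two first-order differences."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, ((0, 1), (0, 1)), mode="edge")
    d_h = padded[:-1, 1:] - img        # right neighbor minus center
    d_v = padded[1:, :-1] - img        # below neighbor minus center
    dirs = np.empty(img.shape, dtype=int)
    dirs[(d_h >= 0) & (d_v >= 0)] = 1
    dirs[(d_h < 0) & (d_v >= 0)] = 2
    dirs[(d_h < 0) & (d_v < 0)] = 3
    dirs[(d_h >= 0) & (d_v < 0)] = 4
    return dirs

def texture_maps(dirs):
    """Four 8-bit maps; bit j of map i is set if neighbor j has direction i."""
    h, w = dirs.shape
    padded = np.pad(dirs, 1, mode="edge")
    maps = []
    for i in (1, 2, 3, 4):
        code = np.zeros((h, w), dtype=int)
        for j, (dy, dx) in enumerate(OFFSETS):
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            code += (neighbor == i).astype(int) << j
        maps.append(code)
    return maps
```

Under this assumed encoding every neighbor contributes its bit to exactly one of the four maps, so the maps sum to 255 at each pixel, which gives a convenient sanity check.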
The distortion that an image suffers also affects its texture patterns, as shown in Figure 5, where (a–d) are the patterns calculated from the enhanced reference image, (e–h) are calculated from the distorted image, and the reference and distorted images are the same as those used in Figure 4. The extracted patterns (a–d) show clear texture, while the texture information in (e–h) is visibly disrupted.
2.3. Difference Measurement of Extracted Features
The inherently quality-aware information contained in the extracted features is damaged by distortion, so the difference between the features extracted from the enhanced reference and distorted images is taken as a quality index. In this paper, the chi-square distance (CD) is used to measure the difference between the spatial frequency features extracted by log Gabor filters. For each frequency (scale) and orientation, the CD is computed between the log Gabor response magnitude maps of the distorted and reference images, accumulated over all N items in a magnitude map.
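A minimal sketch of such a distance is given below. It uses one common form of the chi-square distance, the sum of squared differences divided by the sums of the paired items; the exact normalization used by the authors is not recoverable from the text, so this form, the function name, and the small `eps` guard against division by zero are assumptions.

```python
def chi_square_distance(dist_map, ref_map, eps=1e-12):
    """Chi-square distance between two flattened magnitude maps (assumed form)."""
    total = 0.0
    for d, r in zip(dist_map, ref_map):
        total += (d - r) ** 2 / (d + r + eps)
    return total

# One quality index per (scale, orientation) pair of the filter bank.
reference_response = [0.8, 0.1, 0.4, 0.0]
distorted_response = [0.5, 0.2, 0.4, 0.1]
cd = chi_square_distance(distorted_response, reference_response)
```

The distance is zero for identical maps and grows as the responses diverge, in line with the trend described for Figures 6 and 7.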
To validate the effectiveness of the CD measurement as a quality index, Figures 6 and 7 show the relation between the distortion level and the CD in different directions and scales. The reference image in Figure 3(a) and its distorted versions contaminated by JP2K at six levels are used to calculate the CD. The vertical axis represents the CD, and the horizontal axis represents the distortion level, which is determined by the differential mean opinion score (DMOS). Note that, for an image in a database, its mean opinion score (MOS) or DMOS is the subjective quality score assigned to it in experiments conducted on human observers. The features involved in Figure 6 are the responses of log Gabor filters at the first scale in four directions, that is, 0°, 45°, 90°, and 135°, marked with different colors. The features involved in Figure 7 are the log Gabor filter responses at six scales in direction 0°, to demonstrate the distinction among different scales. As the filter scale increases, the relative CD decreases rapidly, mainly because responses at higher scales discard much of the detailed information, so the difference between the responses of the reference and distorted images becomes relatively subtler. Figures 6 and 7 clearly show that more severe distortion always results in a larger CD, which agrees with human visual perception.
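In addition, cosine similarity (CS) is utilized to quantify the difference between the local texture features of the reference and distorted images,
$$CS_i = \frac{\sum_{n=1}^{N} TM_i^{d}(n)\, TM_i^{r}(n)}{\sqrt{\sum_{n=1}^{N} \left(TM_i^{d}(n)\right)^2}\, \sqrt{\sum_{n=1}^{N} \left(TM_i^{r}(n)\right)^2}},$$
where the inputs $TM_i$ denote the pattern of direction $i$, the superscripts $d$ and $r$ denote the distorted and reference images, respectively, and $N$ is the total number of items in TM.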
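For reference, cosine similarity between two flattened texture maps can be computed as in the short sketch below. The function name is hypothetical, and returning 0.0 when either map is all zeros is an assumed convention for a case the text does not discuss.

```python
import math

def cosine_similarity(dist_map, ref_map):
    """Cosine similarity between two flattened texture maps."""
    dot = sum(d * r for d, r in zip(dist_map, ref_map))
    norm_d = math.sqrt(sum(d * d for d in dist_map))
    norm_r = math.sqrt(sum(r * r for r in ref_map))
    if norm_d == 0.0 or norm_r == 0.0:
        return 0.0                      # assumed convention for empty maps
    return dot / (norm_d * norm_r)

# One quality index per direction value (four in total).
cs = cosine_similarity([3, 0, 12, 255], [3, 1, 8, 255])
```

Unlike the chi-square distance, this index decreases as distortion grows, since it measures agreement rather than disagreement between the two maps.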
Similarly, Figure 8 illustrates the validity of the CS measurement as a quality index. It shows a monotonic relationship between the distortion level and the CS: more severe distortion always results in a smaller CS, which demonstrates that the extracted local texture features indicate image quality effectively and that the CS is an efficient quality index.
2.4. Objective Quality Mapping by Support Vector Regression
For methods based on multifeature extraction, a necessary operation is to construct a regression function that projects the calculated quality indices onto an objective quality score. Different regression techniques, such as general regression neural networks (GRNN), multiple kernel learning (MKL), and support vector regression (SVR), can be used to learn the regression model. In this paper, the SVR technique is adopted for its high performance on high-dimensional regression. Specifically, ε-SVR is employed, and the LIBSVM package is used for the implementation.
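Given a training set $\{(x_1, y_1), \ldots, (x_k, y_k)\}$, where $x_i$, $i = 1, \ldots, k$, is the feature vector of the $i$th image in the training set of size $k$ and $y_i$ is its subjective quality score (MOS/DMOS), we try to find a function $f(x)$ that deviates by at most $\varepsilon$ from $y_i$ for all the training data under the constraint of flatness, that is, seeking a small $\omega$. The regression function can be represented as
$$f(x) = \omega^{T}\varphi(x) + b, \quad (7)$$
where $\varphi(\cdot)$ is a nonlinear function used to map the input feature vector into a high-dimensional space, $\omega$ is the weight vector, and $b$ is the bias term.
Appropriate $\omega$ and $b$ should be found to satisfy the following constraint:
$$\left| y_i - \omega^{T}\varphi(x_i) - b \right| \le \varepsilon. \quad (8)$$
By introducing the slack variables $\xi_i$ and $\xi_i^{*}$, $\omega$ and $b$ can be obtained by solving the following optimization problem:
$$\min_{\omega, b, \xi, \xi^{*}} \ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{k}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - \omega^{T}\varphi(x_i) - b \le \varepsilon + \xi_i, \\ \omega^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\ \xi_i, \xi_i^{*} \ge 0, \end{cases} \quad (9)$$
and the constant $C$ determines the trade-off between the flatness of $f$ and the slack variables.
As shown in the literature, $\omega$ can be calculated by
$$\omega = \sum_{i=1}^{n_{sv}} \left(\alpha_i - \alpha_i^{*}\right)\varphi(x_i), \quad (10)$$
where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers used in the Lagrange function optimization and $n_{sv}$ is the number of support vectors. For data points satisfying (8), the corresponding $\alpha_i$ and $\alpha_i^{*}$ will be zero, and the training data with nonzero $\alpha_i$ and $\alpha_i^{*}$ are the support vectors used to find $\omega$. Combining (7) and (10), the regression function can be written as
$$f(x) = \sum_{i=1}^{n_{sv}} \left(\alpha_i - \alpha_i^{*}\right)\varphi(x_i)^{T}\varphi(x) + b = \sum_{i=1}^{n_{sv}} \left(\alpha_i - \alpha_i^{*}\right)K(x_i, x) + b, \quad (11)$$
and then the radial basis kernel function can be defined as
$$K(x_i, x) = \exp\left(-\gamma \|x_i - x\|^2\right), \quad (12)$$
where $\gamma$ defines the width of the kernel.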
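Once training has produced the support vectors and multipliers, prediction reduces to evaluating the kernel form of the regression function. The sketch below shows that evaluation only; it does not solve the optimization problem, which the paper delegates to LIBSVM. The function names are hypothetical, `coeffs` is assumed to hold the differences of the two Lagrange multipliers for each support vector, and the kernel is parameterized with a width `gamma` in the LIBSVM style.

```python
import math

def rbf_kernel(x, z, gamma):
    """Radial basis kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(x, support_vectors, coeffs, bias, gamma):
    """Kernel expansion over the support vectors plus the bias term."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coeffs)) + bias

# Toy model: two support vectors in a two-dimensional feature space.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
coeffs = [0.5, -0.25]            # alpha_i - alpha_i^*, illustrative values
score = svr_predict((0.2, 0.1), support_vectors, coeffs, bias=3.0, gamma=2.0)
```

Only the support vectors enter the sum, which is why training points inside the ε-tube do not affect the predicted score.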
In addition, a fivefold cross-validation scheme is adopted for the training and testing procedure. Concretely, the images in a database are divided randomly into five nonoverlapping sets; four of them are used for training and the remaining one for testing, that is, 80% for training and 20% for testing. The cross-validation procedure is repeated 1,000 times, and the experimental results presented in this paper are the averaged values.
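The random split described above might be organized as in the following sketch, which uses only the standard library. The generator name `five_fold_splits` and the seeding scheme are assumptions; repeating the procedure with different seeds and averaging the resulting metrics would mirror the repeated evaluation in the text.

```python
import random

def five_fold_splits(n_images, seed=0, n_folds=5):
    """Yield (train, test) index lists for a random nonoverlapping split."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Each image index appears in the test set of exactly one fold.
for train_idx, test_idx in five_fold_splits(10, seed=42):
    assert len(train_idx) == 8 and len(test_idx) == 2
```

In practice the loop body would train the regressor on the training indices, predict scores for the test indices, and accumulate the correlation metrics.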
3. Experimental Results and Discussions
3.1. Experiment Setup
The performance of the proposed method is examined on four large-scale databases: LIVE, TID2008, TID2013, and CSIQ. Each database contains distorted images contaminated by various types of distortions at different levels, and each distorted image is assigned a subjective quality score (MOS/DMOS). The basic information of the four public databases is given in Table 1.
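To evaluate the performance of IQA methods in terms of accuracy, monotonicity, and consistency, four commonly used metrics are calculated: the Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank-Order Correlation Coefficient (SROCC), the Kendall Rank-Order Correlation Coefficient (KROCC), and the Root Mean Squared Error (RMSE). Note that a better IQA method is supposed to achieve higher PLCC, SROCC, and KROCC values and a lower RMSE value. PLCC indicates the linear coherency between the objective and subjective scores. For the $i$th image in a database of size $N$, given its subjective quality score $s_i$ and predicted objective quality score $o_i$, PLCC and RMSE can be computed as
$$PLCC = \frac{\sum_{i=1}^{N} (s_i - \bar{s})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{N} (s_i - \bar{s})^2 \sum_{i=1}^{N} (o_i - \bar{o})^2}}, \qquad RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (s_i - o_i)^2},$$
where $\bar{s}$ and $\bar{o}$ are the mean subjective and objective scores.
The nonparametric rank-based correlation metrics, SROCC and KROCC, which test the coherency between the rank orders of the scores to measure the prediction monotonicity, are given by
$$SROCC = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}, \qquad KROCC = \frac{N_c - N_d}{\frac{1}{2}N(N - 1)},$$
where $d_i$ is the difference between the ranks of the $i$th image in the subjective and objective scores, and $N_c$ and $N_d$ are the numbers of concordant and discordant pairs in the dataset, respectively.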
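The four criteria can be computed directly from their textbook definitions, as in the sketch below. It is written for clarity rather than speed (the Kendall loop is quadratic), the function names are hypothetical, and the rank helper assumes there are no tied scores, which is a simplification.

```python
import math

def plcc(s, o):
    """Pearson linear correlation between subjective and objective scores."""
    n = len(s)
    ms, mo = sum(s) / n, sum(o) / n
    cov = sum((a - ms) * (b - mo) for a, b in zip(s, o))
    var_s = sum((a - ms) ** 2 for a in s)
    var_o = sum((b - mo) ** 2 for b in o)
    return cov / math.sqrt(var_s * var_o)

def rmse(s, o):
    """Root mean squared error between the two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, o)) / len(s))

def ranks(values):
    """Ranks starting at 1; ties are not handled (simplifying assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def srocc(s, o):
    """Spearman rank-order correlation from squared rank differences."""
    n = len(s)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(s), ranks(o)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def krocc(s, o):
    """Kendall rank-order correlation from concordant and discordant pairs."""
    n = len(s)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (s[i] - s[j]) * (o[i] - o[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (0.5 * n * (n - 1))
```

For real experiments a statistics library with tie-aware rank correlations would be a safer choice than these hand-written versions.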
Since the numerical ranges of the objective and subjective scores differ, and the relationship between them may not be linear owing to the nonlinear quality rating of human observers, a nonlinear mapping function should be applied before the linear correlation metrics PLCC and RMSE are calculated, so that IQA methods are compared fairly, as suggested in prior work. The five-parameter logistic regression function is given by
$$q(x) = \beta_1\left(\frac{1}{2} - \frac{1}{1 + \exp\left(\beta_2 (x - \beta_3)\right)}\right) + \beta_4 x + \beta_5,$$
where $x$ and $q(x)$ are the objective scores before and after the mapping and $\beta_1$ to $\beta_5$ are parameters obtained numerically by a nonlinear regression process in the MATLAB optimization toolbox to maximize the correlation between subjective and objective scores. Since the four databases adopt different schemes to quantify subjective quality scores, the values of $\beta_1$ to $\beta_5$ for the different databases are shown in Table 2.
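The mapping itself is a one-line function, sketched below in the five-parameter logistic form that is widely used in IQA evaluation; that the authors used exactly this form is an assumption. The parameter values in the example are illustrative only and are not the fitted values reported in Table 2, and the fitting step (a nonlinear least-squares search over the five parameters) is omitted.

```python
import math

def logistic5(x, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping applied to a raw objective score."""
    return b1 * (0.5 - 1.0 / (1.0 + math.exp(b2 * (x - b3)))) + b4 * x + b5

# Map raw objective scores before computing PLCC and RMSE.
raw_scores = [0.1, 0.4, 0.7, 0.9]
mapped = [logistic5(x, 10.0, 5.0, 0.5, 1.0, 2.0) for x in raw_scores]
```

At x equal to the third parameter the logistic term vanishes, so the output there is just the linear part, which is a quick way to check an implementation.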
3.2. Performance Comparison
In this section, we present the performance of the proposed method in comparison with existing methods, including MAD, SSIM, VIF, LLM, MCSD, VSI, FSIMvs, GMSD, IFS, LCSIM, and GLD, on four public databases. The overall performance comparison on the LIVE database is presented in Table 3, with the best performance for each metric highlighted in boldface; it shows that the proposed method outperforms the other listed methods on almost every performance metric and leads by a sizable margin. The performance comparison on the TID2008 and TID2013 databases is listed in Table 4, where the best result is also highlighted in boldface. The proposed method retains an advantage on these databases, with two exceptions in SROCC, where the state-of-the-art method LLM performs slightly better than our method, yet the disparity between the two methods is quite small. Table 5 shows the performance comparison on the CSIQ database, where the proposed method exhibits the best and most stable performance among all listed methods.
Overall, the proposed method is superior on most metrics and maintains a high level of performance across all databases. By contrast, the previous methods are either generally inaccurate in predicting subjective evaluation or unable to maintain a high level on all databases. For example, the prediction accuracy of the classical methods SSIM and VIF is worse than that of most state-of-the-art methods, while the more recent methods LCSIM and LLM perform very well on certain databases but are less competitive on others. These results support the claim that the proposed method is accurate and robust.
In addition, since the performance of the proposed training-based method varies slightly between runs, a t-test at the 10% significance level is carried out on PLCC and SROCC to show whether a performance disparity is significant. The results are shown in Table 6, where 0 denotes that there is no significant difference between the compared method and the proposed one, and 1 or −1 represents a significant superiority or inferiority of the proposed method. The results demonstrate that, except for its inferiority to LLM in SROCC on TID2008 and TID2013 and the indistinguishable difference from LCSIM in PLCC on CSIQ, the proposed method shows significant superiority over the other methods, indicating that its prediction accuracy maintains a relatively high level on these databases.
To test the computational cost, the running time of each IQA method on a test image of size 512 × 512 is listed in Table 7. All experiments are performed on a PC with an Intel i5-6500 3.2 GHz CPU and 8 GB of RAM; the operating system is Windows 10 and the software platform is MATLAB R2016b. Since the source codes of some methods are not openly accessible, only five of the compared methods are included in the efficiency comparison. As indicated in Table 7, SSIM is the most efficient method and runs much faster than the others. Because MCSD and GMSD are based on multiscale contrast similarity deviation and gradient magnitude similarity deviation, respectively, their computational complexities are low, so they are efficient compared with other IQA methods. The proposed method exploits properties of the HVS to extract hierarchical features that comprehensively indicate image quality. Thus, although it is more efficient than VIF and MAD, the proposed method has a relatively high complexity. It can be improved further through code optimization and parallelization in future work.
To further demonstrate the effectiveness of the proposed method, scatter plots of subjective ratings versus objective scores on the LIVE and TID2013 databases are given in Figure 9, where each point represents an image in the database. The fitted curves show that the subjective scores correlate substantially with the objective scores, and the points cluster closely around the fitted curves, which illustrates that the proposed method is quite consistent with human perceptual rating.
4. Conclusion
Based on the hierarchical properties of the visual cortex, a novel full reference IQA method is proposed in this paper. Specifically, log Gabor filters and local pattern analysis are employed to extract hierarchical features that reflect image quality well. Moreover, the feature extraction is assisted by visual saliency maps, since visual attention has a great impact on human evaluation of image quality. The experimental results show that the proposed method achieves outstanding performance in terms of prediction accuracy as well as robustness across different databases.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multi-scale structural similarity for image quality assessment,” in Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, pp. 1398–1402, Pacific Grove, Calif, USA, November 2003.
J. Y. Lin, T. J. Liu, W. Lin, and C.-C. J. Kuo, “Visual-saliency-enhanced image quality assessment indices,” in Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2013, pp. 1–4, Kaohsiung, Taiwan, November 2013.
N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti, “TID2008 - a database for evaluation of full-reference visual quality assessment metrics,” Advances of Modern Radioelectronics, vol. 10, pp. 30–45, 2008.