Abstract

Image quality assessment (IQA) has been a topic of intense research over the last several decades. With each year comes an increasing number of new IQA algorithms, extensions of existing IQA algorithms, and applications of IQA to other disciplines. In this article, I first provide an up-to-date review of research in IQA, and then I highlight several open challenges in this field. The first half of this article discusses key properties of visual perception, image quality databases, and existing full-reference, no-reference, and reduced-reference IQA algorithms. Yet, despite the remarkable progress that has been made in IQA, many fundamental challenges remain largely unsolved. The second half of this article highlights some of these challenges. I specifically discuss challenges related to the lack of complete perceptual models for natural images, for compound and suprathreshold distortions, and for multiple distortions and their interactive effects on images. I also discuss challenges related to IQA of images containing nontraditional distortions and challenges related to computational efficiency. The goal of this article is not only to help practitioners and researchers keep abreast of the recent advances in IQA, but also to raise awareness of the key limitations of current IQA knowledge.

1. Introduction

Digital imaging and image-processing technologies have revolutionized the way in which we capture, store, receive, view, utilize, and share images. Today, we have come to expect the ability to instantly share photos online, to send and receive multimedia (MMS) messages at a moment's notice, and to stream live video across the globe. These conveniences are possible because the digital cameras and photo-editing systems used by photographers and artists, the compression and transmission systems used by distributors and network engineers, and the various multimedia and display technologies enjoyed by consumers all have the ability to process images in ways that were unthinkable just 20 years ago.

But despite the innovation and rapid advances in technology and despite the prevalence of higher-definition and more immersive content, one thing has remained constant throughout the digital imaging revolution: the biological hardware used by consumers—the human visual system. Although personal preferences can and do change over time and can and do vary from person to person, the underlying neural circuitry and biological processing strategies have changed very little over measurable human history. As a result, digital processing can alter an image's appearance in ways that humans can reliably and consistently judge to be either detrimental or beneficial to the image's visual quality.

Because of the prevalence of these alterations, a crucial requirement for any system that processes images is a means of assessing the impacts of such alterations on the resulting visual quality. To meet this need, numerous algorithms for image quality assessment (IQA) have been researched and developed over the last several decades. Today, IQA research has emerged as an active subdiscipline of image processing, and many of the resulting techniques and algorithms have begun to benefit a wide variety of applications. Variations of IQA algorithms have proved useful for applications such as image and video coding (e.g., [13]), digital watermarking (e.g., [48]), unequal error protection (e.g., [9]), denoising (e.g., [10]), image synthesis (e.g., [11, 12]), and various other areas (e.g., for predicting intelligibility in sign language video [13]).

Many of the techniques employed by modern IQA algorithms are founded in the early research on quality evaluation of optical systems and analog television broadcast and display systems (e.g., [14–25]). For example, in their 1940 paper titled “Quality in Television Pictures,” Goldmark and Dyer [16] stated that

“The factors which chiefly determine the quality of a television picture are (1) definition, (2) contrast range, (3) gradation, (4) brilliance, (5) flicker, (6) geometric distortion, (7) size, (8) color, and (9) noise.” [16].

Although no objective quality assessment formulae were presented in [16], many of today's IQA algorithms do indeed employ measures of one or more of these factors. Later work by Winch [17], on the topic of color TV quality, further pushed toward objective quality assessment by providing guidelines for how photometric and colorimetric properties could be used to derive “characteristic data for correlation with the subjective preferences”; such properties are now commonly employed in modern IQA algorithms. On the optics front, in their 1955 paper titled “On the Assessment of Optical Images,” Fellgett and Linfoot proposed two key strategies and associated numerical measures of image quality: “assessment by similarity” and “assessment by information content” [19]. Indeed, variations of these ideas have been used by many of today's IQA algorithms.

It is interesting to note that nearly all of these early research efforts up through the 1960s mentioned the need to take into account the characteristics of human vision during the quality assessment process. Five of the earliest efforts to explicitly model properties of the human visual system (HVS) for IQA were published in the early 1970s by Sakrison and Algazi [22], by Budrikis [23], by Stockham [24], by Mannos and Sakrison [26], and by Schade [25]. Although no extensive IQA algorithms were presented in these early papers, many of the properties which are used in modern HVS-based IQA algorithms—such as luminance and contrast sensitivity and visual masking—were also suggested in these papers. At that time, Budrikis forecasted that

“Full evaluations are as yet impossible but seem very likely for the foreseeable future, although probably entailing considerable computational tasks.” [23].

Today, 40 years later, we still have yet to achieve full evaluations of quality, though remarkable progress has been made—as I will point out in this paper.

At a glance, the IQA problem for digital images may not seem as difficult a task as the literature suggests. After all, digital processing alters an image's pixel values, and the task of estimating quality would seem to require merely mapping these numerical changes to corresponding visual preferences. However, anything that involves the human visual system is rarely straightforward. Humans do not see images as collections of pixels, and consequently, the appropriate mapping varies depending on the image, on the type of processing, on the numerical and psychological interaction between these two, and on numerous additional factors. As an example, Figure 1 shows an original image and 11 altered versions of that image, each with the same peak signal-to-noise ratio in comparison to the reference. Clearly, a mapping based only on the energy of the differences in pixel values cannot capture the wide range of visual qualities exhibited by these images.

The task of judging quality in Figure 1 is facilitated by the presence of an original, undistorted reference image. In his seminal 1975 collection titled “Image Quality: A Comparison of Photographic and Television Systems,” Schade [25] stated that

“Image quality is a subjective judgment made by a mental comparison of an external image with image impressions stored and remembered more or less distinctly by the observer…. Moreover, the rating of a given image may be greatly influenced by the availability of a much better image for comparison purposes.” [25].

Most IQA algorithms operate in this relative-to-a-reference fashion; these are so-called full-reference algorithms, which take as input a reference image and a processed (usually distorted) image and yield as output either a scalar value denoting the overall visual quality or a spatial map denoting the local quality of each image region (see Section 3). More recently, researchers have begun to develop no-reference and reduced-reference algorithms, which attempt to yield the same quality estimates either by using only the processed/distorted image (no-reference IQA; see Section 4.1) or by using the processed/distorted image and only partial information about the reference image (reduced-reference IQA; see Section 4.2).

All three types of IQA algorithms can perform quite well at predicting quality. Some of today's best-performing full-reference algorithms have been shown to generate estimates of quality that correlate highly with human ratings of quality, typically yielding Spearman’s and Pearson’s correlation coefficients in excess of 0.9. Research in no-reference and reduced-reference IQA is much less mature; however, recent methods have been shown to yield quality estimates which also correlate highly with human ratings of quality, sometimes yielding correlation coefficients which rival the most competitive full-reference methods.

The field of image quality assessment is rapidly advancing. With each year comes an increasing number of papers on new IQA algorithms, extensions of existing IQA algorithms, and applications of these IQA techniques to other disciplines. The objective of this paper is not only to provide an overview of the strategies used in IQA algorithms, but also—and more so—to highlight the current challenges in this field. This paper is meant to complement previous reviews and chapters on IQA [28–36] (see also [37, 38] for related reviews on video quality assessment). Here, I first provide a more recent survey of research in IQA to help practitioners and researchers keep abreast of the recent advances in IQA. Next, I discuss several open research challenges that must be addressed to further push IQA algorithms toward achieving the “full evaluations” envisioned by Budrikis.

In the first three sections of this paper, I provide an up-to-date survey of research in IQA. As in previous reviews, Section 2 summarizes several important properties of human visual perception which are used, at least to some extent, directly or indirectly by the vast majority of IQA algorithms. However, for each of these properties, I also discuss some of the early experiments in vision science that were performed to uncover it; the goal of this discussion is to provide a context which can help define bounds on the applicability of each property. This section also provides an up-to-date survey of the publicly available ground-truth datasets (image quality databases) that can be used to quantify the performances of IQA algorithms in predicting quality. (Note that research on color perception and the specific effects of color on image quality are not covered in this paper. Color-perception research has its own long history, most of which predates research in IQA. The reader is referred to [39–41] for discussions on the influences of color on image quality.)

Sections 3 and 4 provide concise surveys of previous and recent IQA algorithms. Again, the primary objective of these surveys is to help the reader keep abreast of the latest IQA techniques. Section 3 surveys full-reference IQA algorithms. Section 4 surveys no-reference and reduced-reference IQA algorithms. For a more specific and thorough discussion of the use of natural-scene statistics for image and video quality assessment, I refer the reader to the recent review by Bovik [36].

One point should become evident after reading previous reviews and the reviews provided in Sections 2, 3, and 4 of this paper: remarkable progress has been made since the pioneering IQA work of Budrikis, Goldmark, Sakrison, Schade, Stockham, Winch, and others. Today's IQA algorithms can perform extremely well at predicting quality for a variety of images and distortion types.

Yet, beneath the surface of this seemingly orderly picture, behind the scenes of this wealth of IQA knowledge that we have gained lies a more cloudy portrait fueled by a growing number of counterexamples—images, distortions, and other alterations—which modern IQA algorithms are ill-equipped to handle. Under the covers of the numerous successes in IQA research lies a long list of unanswered questions and unsolved challenges.

In Section 5, I discuss seven of these challenges. Some of the challenges are fundamental; some are more application-specific; most of the challenges have been or are actively being researched. But all remain largely unsolved.
(1) Section 5.1 discusses the challenges IQA researchers face when designing a model of human visual processing which can cope with natural images. This section highlights the need for improved models of primary visual cortex, the need for more ground-truth data on natural images, and the need for models which incorporate processing by higher-level visual areas.
(2) Section 5.2 discusses the challenges researchers face when designing an algorithm that can cope with the variety of distortions that IQA algorithms can encounter. This section discusses the need for improved visual summation models which can handle the broadband nature of distortions, and the need for more research on the perception of suprathreshold distortions.
(3) Section 5.3 discusses the challenges researchers face when designing an IQA algorithm that can model the influence of the distortion on the image's appearance. This section discusses the differences between distortions which are perceived as additive and distortions which affect the image's objects. This section also highlights the need to consider the adaptive visual strategies and other higher-level effects that humans use when judging quality.
(4) Section 5.4 discusses the challenges researchers face when designing an IQA algorithm that can cope with images which are simultaneously distorted by multiple types of distortion. This section reviews previous work on the effects of multiple distortions on image quality, and it discusses the potential perceptual interactions between the distortions and their joint effects on images.
(5) Section 5.5 discusses the challenges researchers face when designing an IQA algorithm that can deal with geometric changes to images. This section reviews existing IQA algorithms which have been designed to handle basic geometric changes, and it discusses research efforts on IQA of textures, which can contain more radical geometric and photometric changes.
(6) Section 5.6 discusses the challenges researchers face when designing an IQA algorithm that can perform IQA of enhanced images. This section describes efforts to model the perceptual effects of enhancement on quality, and it discusses the need for more thorough image quality databases which contain enhanced images.
(7) Section 5.7 discusses the challenges surrounding run-time performance and memory requirements of IQA algorithms. This section reviews previous efforts to accelerate existing IQA algorithms, and it discusses the need for further related performance analyses and accelerations.

It is important to note that these seven challenges are by no means an exhaustive list of research topics in IQA that require further investigation. Rather, I have selected these particular challenges to highlight some key limitations of current IQA knowledge and to point out areas which can begin to answer broader questions on IQA. Additional important open challenges can be garnered from the Proceedings of SPIE “Image Quality and System Performance” and “Human Vision and Electronic Imaging,” among others.

2. Image Quality Assessment by Humans

A common approach toward designing an IQA algorithm is to first consider the physical attributes of images that humans find pleasing or displeasing. By understanding how these physical changes give rise to perceptual changes, one can begin to develop an estimate of image quality based on measures of the physical changes. Numerous studies in the fields of visual psychophysics and visual neuroscience have quantified relationships between the physical attributes of visual stimuli and the corresponding psychological and neurophysiological responses. The results of these studies have provided important insights into the goals and functions of the HVS, and many of these findings have been used in IQA algorithms. In Section 2.1, I provide a brief review of the basic properties of the HVS that are commonly taken into account—either explicitly or implicitly—in the vast majority of IQA algorithms.

Another approach toward gaining insight into how humans judge quality is to directly collect quality ratings from a representative pool of human subjects on a database of altered images. Several such quality-rating studies have been conducted, and the results of these studies are commonly released in the form of the so-called image quality databases. These databases generally contain the set of reference and altered images used in the study, along with corresponding average quality ratings for each altered image. In Section 2.2, I provide a survey of the various publicly available image quality databases, including a brief discussion of how the data are used for quantifying the predictive performances of IQA algorithms.

2.1. Psychophysical Underpinnings of Image Quality

Research in visual psychophysics aims to provide a better understanding of the human visual system (HVS) by linking changes in the physical attributes of a visual stimulus to the corresponding changes in psychological responses (visual perception and cognition). These studies generally entail carefully designed experiments on human subjects using highly controlled visual stimuli and viewing conditions. Many of the most fundamental properties of visual perception which are used for IQA have been obtained from the results of such studies; the most commonly used of these properties are summarized in this section.

It must be stressed that the primary goal of the vast majority of research in visual psychophysics is to gain knowledge of how the HVS operates; any relations to image quality are usually secondary and are usually not extensively discussed in such studies. Consequently, it is often up to the designer of an IQA algorithm to decide how the psychophysical findings relate to image quality. Nonetheless, due in part to the increasing popularity of IQA algorithms, an increasing number of psychophysical studies have been devoted specifically toward image quality (e.g., [42–68]).

2.1.1. Contrast Sensitivity Function

Psychophysical studies have shown that the minimum contrast needed to detect a visual target (e.g., distortions) depends on the spatial frequency of the target [69, 70]. This minimum contrast is called the contrast detection threshold, and the inverse of this threshold is called contrast sensitivity. When contrast sensitivity is plotted as a function of the spatial frequency of the target, the resulting profile is the contrast sensitivity function (CSF).

Contrast thresholds for sine waves were first measured by Schade [71] in an experiment that presented human observers with achromatic sine-wave gratings of various spatial frequencies. The key result of Schade's experiment was the discovery that contrast sensitivity varies with the spatial frequency of the grating; the resulting CSF is bandpass, indicating that we are least sensitive to very-low-frequency and very-high-frequency targets, with a peak in sensitivity near 4–6 cycles per degree of visual angle (c/deg).

The reduction in sensitivity at high frequencies has been attributed to the optics of the eye, to receptor spacing, and to quantum noise. The reduced sensitivity at low spatial frequencies is believed to be caused, in part, by limited receptive field sizes and by masking effects imposed by the target's DC component. However, when contrast sensitivity is measured using Gabor functions, the CSF tends to be much more low-pass [73], and such low-pass-type CSFs are most commonly utilized in IQA algorithms. The CSF has also been measured as a function of the orientation of the sine-wave grating, commonly revealing reduced sensitivity to diagonal orientations as compared to horizontal and vertical orientations (the oblique effect [74, 75]). Alternative theories of the neural underpinnings of the CSF have also been proposed based on the statistical properties of natural scenes [76, 77].

In IQA algorithms, the CSF is commonly taken into account by prefiltering the images with a 2D spatial filter designed based on the psychophysical results. One popular CSF filter, which is shown in Figure 2, was proposed by Mannos and Sakrison [26] and further adjusted by Daly [72]; its frequency response, $H(f,\theta)$, is given by
$$H(f,\theta) = 2.6\,\bigl(0.0192 + \lambda f_{\theta}\bigr)\,e^{-(\lambda f_{\theta})^{1.1}}, \quad (1)$$
where $f$ denotes the radial spatial frequency in c/deg, $\theta$ denotes the orientation, and $f_{\theta} = f/\bigl[0.15\cos(4\theta) + 0.85\bigr]$ accounts for the oblique effect (see [72]). In Figure 2, the parameter $\lambda$ was set to 0.114, resulting in the CSF taking on its maximum value of 0.981 at approximately 8 c/deg (and forced to be this value for frequencies below the peak) when $\theta = 0$ or $\pi/2$. A very thorough treatment of the use of the CSF in IQA has been published by Barten [78].
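As an illustration of how such a CSF prefilter can be constructed, the following Python sketch builds the frequency response of (1) on an FFT grid and applies it as a linear filter. The function names, the assumed pixels-per-degree value, and the clamping of frequencies below the peak are illustrative choices, not part of any particular published implementation.

```python
import numpy as np

def csf_mannos_daly(rows, cols, pixels_per_degree=32.0, lam=0.114):
    """Frequency response of the Mannos-Sakrison CSF with Daly's
    oblique-effect adjustment, sampled on an FFT grid.
    `pixels_per_degree` depends on the display and viewing distance."""
    # Spatial-frequency coordinates in cycles/degree
    fy = np.fft.fftfreq(rows) * pixels_per_degree
    fx = np.fft.fftfreq(cols) * pixels_per_degree
    fx, fy = np.meshgrid(fx, fy)
    f = np.sqrt(fx**2 + fy**2)
    theta = np.arctan2(fy, fx)

    # Oblique effect: effective radial frequency is higher at diagonal orientations
    f_theta = f / (0.15 * np.cos(4.0 * theta) + 0.85)

    # H = 2.6 (0.0192 + lam*f_theta) exp(-(lam*f_theta)^1.1), as in (1)
    H = 2.6 * (0.0192 + lam * f_theta) * np.exp(-(lam * f_theta) ** 1.1)

    # Low-pass variant: hold frequencies below the peak at the peak value
    peak_f = f_theta.flat[np.argmax(H)]
    return np.where(f_theta < peak_f, H.max(), H)

def csf_prefilter(image, ppd=32.0):
    """Apply the CSF as a linear prefilter in the frequency domain."""
    H = csf_mannos_daly(*image.shape, pixels_per_degree=ppd)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * H))
```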

2.1.2. Visual Masking

Another finding from the visual perception research which is commonly taken into account in IQA algorithms is the fact that certain regions of an image can hide distortions better than other regions, a finding that can be attributed to visual masking [79]. Visual masking is a general term that refers to the perceptual phenomenon in which the presence of a masking signal (the mask) reduces a subject's ability to detect a given target signal. The task of detection then becomes masked detection, and contrast thresholds denote masked detection thresholds. In IQA, it is commonly assumed that the image serves as the mask and the distortions serve as the target of detection.

Luminance masking and pattern masking are the two most common forms of masking employed in IQA algorithms. Detection thresholds tend to increase with an increase in the luminance of the background (mask) upon which the target is placed (luminance masking, [70, 80]), a process which is believed to be mediated by retinal adaptation [81]. For masks consisting of spatial patterns, detection thresholds also tend to increase when the contrast of the mask is increased [69, 79, 82], a postretinal process believed to be attributable to cortical processing [81]. Current explanations of pattern masking can generally be divided into three paradigms:
(1) noise masking, which attributes the reduction in sensitivity to the corruptive effects of the mask on internal decision variables [83];
(2) contrast masking, which attributes the reduction in sensitivity to contrast gain control [79] (discussed later);
(3) entropy masking, which attributes the reduction in sensitivity to an observer's unfamiliarity with the mask [44].

Because a mask's contrast is readily computable, contrast masking has been exploited in a variety of IQA and image processing applications (e.g., [8488]; see Section 3.1). The extent to which a mask constitutes visual noise and the extent to which an observer is unfamiliar with a mask are phenomena which are more difficult to quantify; accordingly, noise and entropy masking are less commonly used in IQA (though, see [89]).

Contrast masking results are commonly reported in the form of threshold-versus-contrast (TvC) curves, in which masked detection thresholds are plotted as a function of the contrast of the mask. Figure 3 depicts TvC curves for the detection of a sine-wave grating presented against noise and sine-wave-grating masks (after [79]). Masked detection thresholds generally increase as the contrast of the mask is increased and often demonstrate a region of facilitation (i.e., a decrease in threshold; “dipper effect”) at lower mask contrasts, depending on the dimensional relationships between the target and the mask (e.g., differences in spatial frequency, orientation, and phase). Note that learning effects have been shown to lower the slopes of the TvC curves [87, 90].

In IQA, a variety of methods have been used to account for masking, particularly in full-reference IQA (discussed later in Section 3). A common approach to explicitly account for masking is to measure the local luminance and contrast in the reference image and then attenuate the estimate of the visibility of the distortions in the distorted image based on these measures (e.g., using a power-function relationship between attenuation and contrast). Other IQA algorithms implicitly incorporate masking either by using local statistical measures which take into account the local contrast or by adjusting the simulated neural responses in the context of a computational neural model of the HVS (see Section 3).
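As a concrete illustration of the explicit approach described above, the following sketch attenuates a local distortion-energy map by a power function of the local RMS contrast of the reference. The block size, exponent, and functional form are illustrative assumptions rather than any particular published masking model.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def masked_visibility(reference, distorted, block=16, exponent=0.7, eps=1e-8):
    """Illustrative contrast-masking adjustment: local distortion energy is
    attenuated by a power function of the local RMS contrast of the reference."""
    ref = reference.astype(np.float64)
    err = (distorted.astype(np.float64) - ref) ** 2

    # Local mean luminance and RMS contrast of the reference (the mask)
    mu = uniform_filter(ref, block)
    var = uniform_filter(ref ** 2, block) - mu ** 2
    rms_contrast = np.sqrt(np.maximum(var, 0.0)) / (mu + eps)

    # Local distortion energy (root of the locally averaged squared error)
    err_energy = np.sqrt(uniform_filter(err, block))

    # Power-function attenuation: high-contrast (busy) regions mask more
    return err_energy / (1.0 + rms_contrast ** exponent)
```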

2.1.3. Multichannel Model of the HVS

Schade used sine-wave gratings in his CSF study based on the notion that any stimulus can be described as a superposition of sine-waves. Campbell and Robson [91] extended this idea by measuring detection thresholds for both sine-wave and square-wave gratings. Because a square wave is composed of numerous sine waves, the peak-to-peak contrast of a square wave will always be lower than the peak-to-peak contrast of its fundamental sine wave (by a factor of approximately 1.3 in [91]). The results from Campbell and Robson's experiment revealed that the thresholds for the square-wave gratings were indeed approximately 1.3 times lower than those found for the sine-wave gratings. They concluded from this finding that the HVS performs a local spatial-frequency decomposition of a stimulus in which the frequency components are detected independently via multiple spatial-frequency channels. This paradigm is known as the multichannel model of human vision [82].

Further evidence in support of the multichannel model has been provided by visual adaptation and summation experiments [69, 82]. The CSF measured for a subject adapted to a sine-wave grating of a particular spatial frequency or orientation shows attenuation only within a narrow band of frequencies/orientations around the frequency/orientation of the grating (approximately 1-2 octaves, 15–30 degrees) [70, 82]. Visual summation experiments have revealed that a compound target (e.g., a plaid composed of two sine waves) is detectable only when one of its components reaches its own detection threshold, a finding which is consistent with a multichannel model with independent channels [82, 9297] (the components of the compound target must be separated in spatial frequency by at least one octave or in orientation by at least 30°–45°; see [82]). Similar experiments have shown channels tuned to other dimensions such as color and direction of motion [69, 82].

The multichannel model has also been used to explain the shape of the CSF. Brady and Field [76] and Graham et al. [77] predicted the shape of the CSF via a model with equally sensitive spatial-frequency channels; reduction in detection performance for high spatial frequencies was attributed to extrinsic noise that dominates the response of channels tuned to high frequencies, thus resulting in decreased signal-to-noise ratios for these higher-frequency channels.

2.1.4. Computational Neural Models of V1

The multichannel model has inspired several related computational neural models of primary visual cortex (V1). These computational models have been used both to predict masking results and for IQA [87, 88, 98101]. Models of this type first compute modeled neural responses to the reference image (mask), then compute modeled neural responses to the distorted image (mask + target), and then deem the distortions (target) detectable if the two sets of neural responses sufficiently differ. Quality can be estimated based on the predicted masked thresholds and/or the difference in simulated neural responses.

Figures 4 and 5 show block diagrams of the stages used in a typical computational neural model of V1 used to predict masked detection thresholds (Figure 4) or used to estimate quality (Figure 5). The pixel values of the reference and distorted images are first converted to either luminance or lightness values, and then both images are filtered with a 2D spatial filter designed to mimic the CSF. Alternatively, the CSF can be accounted for by scaling the coefficients of the frequency-based decomposition used to mimic the neural array. Next, two sets of simulated neural-array responses (one set for the reference image, one set for the distorted image) are computed via a filterbank. Further adjustments are made to account for neural nonlinearities and interactions (gain control) [99, 102104]. The adjusted neural responses are then compared and collapsed across space, frequency, and orientation. The resulting threshold prediction or quality estimate is determined based on the comparison, that is, based on the extent to which the simulated neural responses to the reference image (mask) differ from the simulated neural responses to the distorted image (mask + target).

Frequency-Based Decomposition
To simulate an array of visual neurons in primary visual cortex (V1), and to account for the multichannel analysis performed by the HVS, the computational models employ some form of local frequency-based decomposition. Standard approaches to this decomposition include a steerable pyramid (e.g., [88]), a Gaussian pyramid (e.g., [105]), an overcomplete wavelet decomposition (e.g., [106]), radial filters (e.g., [107]), and cortex filters (e.g., [87, 99, 108]).
As shown in Figure 6, at a particular scale/orientation of this local frequency-based decomposition, the resulting matrix of transform coefficients represents the initially linear responses of a simulated array of neurons located at each spatial position in the image. Although real neurons cannot yield negative responses, negative coefficients are permitted and assumed to model the responses of co-located neurons that are tuned out-of-phase (i.e., with an inhibitory central region and excitatory flanking regions).
The parameters of the decomposition are often tuned based on psychophysical and neurophysiological data (e.g., five or more radial frequency bands with 1-2 octave bandwidths, 4–12 orientations with 15°–30° bandwidths). In an IQA setting, the spatial-frequency decomposition is applied to both the reference image and the distorted image, yielding two sets of coefficients. The resulting coefficients are meant to simulate the initial linear responses of the neurons; they must be further adjusted to account for the neurons' nonlinear response properties.
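To make the decomposition stage concrete, below is a minimal log-Gabor filterbank sketch in Python. The choice of log-Gabor filters and all parameter values (number of scales and orientations, bandwidths) are illustrative assumptions; published models use steerable pyramids, wavelets, radial filters, or cortex filters, as noted above.

```python
import numpy as np

def log_gabor_bank(rows, cols, n_scales=5, n_orients=4,
                   min_wavelength=3.0, mult=2.0, sigma_f=0.55, sigma_theta=0.4):
    """Build a bank of log-Gabor frequency responses, one per scale/orientation.
    Parameter values are illustrative, not taken from any particular model."""
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = 1.0                                   # avoid log(0) at DC
    theta = np.arctan2(fy, fx)

    filters = []
    for s in range(n_scales):
        f0 = 1.0 / (min_wavelength * mult ** s)     # center frequency of scale s
        radial = np.exp(-(np.log(f / f0) ** 2) / (2 * np.log(sigma_f) ** 2))
        radial[0, 0] = 0.0                          # zero DC response
        for o in range(n_orients):
            angle = o * np.pi / n_orients
            # Angular distance wrapped to [-pi, pi]
            d_theta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(d_theta ** 2) / (2 * sigma_theta ** 2))
            filters.append(radial * angular)
    return filters

def decompose(image, filters):
    """Simulated initial linear neural responses: filter the image with each band."""
    F = np.fft.fft2(image.astype(np.float64))
    return [np.real(np.fft.ifft2(F * h)) for h in filters]
```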

Gain Control
Numerous studies have shown that the responses of neurons in V1 are nonlinearly related to the contrast of the stimulus to which the neurons are exposed (see [69, 109]). In the low-contrast regime, the neurons exhibit a threshold-type behavior in which a minimum contrast is required in order to yield any response. In the high-contrast regime, the neurons exhibit a saturation-type behavior in which further increases in contrast yield no corresponding increases in response. Studies have also shown that responses of V1 neurons can be inhibited by neighboring neurons in space, frequency, and orientation (the inhibitory pool). This inhibition from the neighboring neurons is commonly attributed to a gain control mechanism which is designed to keep the neuron operating in its linear regime and thus prevent saturation.
To account for these response properties, neural models apply a divisive normalization to the coefficients of the local frequency-based decomposition. Let $c_{x,f,\theta}$ correspond to the coefficient at location $x$, center frequency $f$, and orientation $\theta$. The (nonlinear) response of a neuron tuned to these parameters, $r_{x,f,\theta}$, is most often simulated via
$$r_{x,f,\theta} = g\,\frac{\bigl|w_{f}\,c_{x,f,\theta}\bigr|^{p}}{b^{q} + \sum_{(x',f',\theta') \in S}\bigl|w_{f'}\,c_{x',f',\theta'}\bigr|^{q}}, \quad (3)$$
where $g$ is a gain factor, $w_{f}$ represents an optional weight designed to take into account the CSF, $b$ represents a saturation constant, $p$ provides the pointwise nonlinearity to the current neuron, $q$ provides the pointwise nonlinearity to the neurons in the inhibitory pool, and the set $S$ indicates which other neurons are included in the inhibitory pool. The parameters $g$, $b$, $p$, and $q$ are commonly adjusted to fit the experimental masking data. For example, model parameters have been optimized for detection thresholds measured using simple sinusoidal gratings [101], for filtered white noise [100], and for TvC curves of target Gabor patterns with sinusoidal masks [88, 99]. Typically, $p$ and $q$ take values near 2, and the inhibitory pool consists of neural responses in the same spatial-frequency band ($f' = f$), at orientations near $\theta$, and within a local spatial neighborhood (e.g., 8-connected neighbors).
Equation (3) is applied to each coefficient of the decomposition of the reference image and to each coefficient of the decomposition of the distorted image. This operation results in two sets of simulated neural responses: (1) a set of neural responses to the reference image, $\{r_{x,f,\theta}\}$, and (2) a set of neural responses to the distorted image, $\{\hat{r}_{x,f,\theta}\}$.
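A minimal numpy sketch of the divisive normalization in (3) is given below, assuming the subband coefficients come from a filterbank such as the one sketched earlier. Restricting the inhibitory pool to the 3x3 spatial neighborhood within the same band, and the parameter values, are simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def divisive_normalization(coeffs, csf_weights=None, g=1.0, b=0.05, p=2.2, q=2.0):
    """Apply the gain-control model of (3) to a list of subband coefficient
    arrays.  For simplicity the inhibitory pool is only the 3x3 spatial
    neighborhood within the same band; parameter values are illustrative.
    (Positive and negative half-wave responses are not split into separate
    neurons here, as a full model would do.)"""
    responses = []
    for k, c in enumerate(coeffs):
        w = 1.0 if csf_weights is None else csf_weights[k]
        wc = np.abs(w * c)
        # Sum of |w*c|^q over the local 3x3 inhibitory pool
        pool = uniform_filter(wc ** q, size=3) * 9.0
        responses.append(g * (wc ** p) / (b ** q + pool))
    return responses
```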

Summation of Responses
The final stage used in most V1 models entails comparing the two sets of simulated neural responses $\{r_{x,f,\theta}\}$ and $\{\hat{r}_{x,f,\theta}\}$. When used as a masking model (Figure 4), to generate a map indicating the local visibility of the target, the responses at each location $x$ are compared and pooled across frequency and orientation as follows:
$$d(x) = \Biggl(\sum_{f,\theta}\bigl|r_{x,f,\theta} - \hat{r}_{x,f,\theta}\bigr|^{\beta}\Biggr)^{1/\beta} \gtrless T, \quad (4)$$
where $T$ is a predefined threshold which is typically held constant across images and where the summation exponent $\beta$ is either chosen to match published results from summation studies or adjusted to fit published masking data. In an IQA setting (Figure 5), the comparison with $T$ is often replaced with a sigmoid or logistic nonlinearity that maps the $\beta$-norm to an estimate of quality.
Numerous variations of (4) have been proposed in the literature; often the models are tuned to fit specific psychophysical data. The summation can also be applied across a local spatial neighborhood around $x$ to determine a regional rather than a pointwise visibility. However, it is important to emphasize that the neural model is designed to mimic an array of visual neurons. This neurophysiological underpinning limits the choice of model parameters and operations to those which are biologically plausible.
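The following sketch illustrates the Minkowski pooling of (4) and the two uses described above (thresholding for a masking model, logistic mapping for an IQA estimate). The exponent, threshold, and logistic parameters are illustrative.

```python
import numpy as np

def visibility_map(resp_ref, resp_dst, beta=3.0):
    """Minkowski (beta-norm) pooling of response differences across
    frequency and orientation, as in (4)."""
    diff = np.stack([np.abs(r - rh) ** beta
                     for r, rh in zip(resp_ref, resp_dst)], axis=0)
    return np.sum(diff, axis=0) ** (1.0 / beta)

def predict_quality(d_map, threshold=None, a=1.0, c=0.5):
    """Masking-model use: threshold the map to flag locally visible targets.
    IQA use: collapse the map across space and pass it through a logistic
    nonlinearity (a and c are illustrative constants)."""
    if threshold is not None:
        return d_map > threshold                  # local target visibility
    d = np.mean(d_map)                            # collapse across space
    return 1.0 / (1.0 + np.exp(a * (d - c)))      # higher score = higher quality
```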
In Section 3.1, I review several IQA algorithms which have employed variants of this V1-based model. It is also important to note that the vast majority of masking data have been obtained using simplistic, highly controlled targets (e.g., sine waves or Gabor patches) presented against unnatural masks (e.g., sine waves, Gabor patches, and noise). Consequently, most computational V1 models employ parameters which have been selected for such targets and masks. As I discuss later in Section 5.1, images can impose unique perceptual effects which cannot be fully captured by current V1 models.

2.2. Image Quality Databases

Another approach toward gaining insight into how humans judge quality is to directly collect quality ratings from a representative pool of human subjects on a database of altered images. Such ratings can also be used to evaluate and refine IQA algorithms. Image quality databases provide this crucial ground-truth information. These databases typically contain a set of reference and altered images and average ratings of quality for each distorted image. The averages are generally taken across subjects, typically after z-score normalization and other adjustments (e.g., outlier tests) to attempt to account for individual biases; see [110]. The resulting averages are almost always reported in the form of mean opinion scores (MOS values) or differential mean opinion scores (DMOS values). For databases containing distorted images, a larger MOS (smaller DMOS) denotes greater quality, whereas a smaller MOS (larger DMOS) denotes lesser quality. Some databases further provide the standard deviations of the ratings across subjects.
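For illustration, a minimal sketch of the typical per-subject z-score normalization and averaging used to produce (normalized) MOS values is shown below. Real studies additionally apply subject screening and outlier rejection, for example following ITU-R BT.500-style procedures, which are omitted here.

```python
import numpy as np

def mos_from_raw(ratings):
    """ratings: 2-D array of raw opinion scores, shape (n_subjects, n_images).
    Per-subject z-score normalization followed by averaging across subjects."""
    z = (ratings - ratings.mean(axis=1, keepdims=True)) \
        / ratings.std(axis=1, keepdims=True)
    return z.mean(axis=0)        # one (normalized) mean opinion score per image
```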

Here, I first briefly summarize the existing publicly available image quality databases, and then I discuss techniques which are used to evaluate the performances of IQA algorithms on the databases.

2.2.1. List of Image Quality Databases

There are over 20 publicly available image quality databases, the details of which are described below and summarized in Table 1 (ordered by year of release). Many of these databases are listed as part of the extensive list of multimedia databases provided by the QUALINET consortium (European Network on Quality of Experience in Multimedia Systems and Services) [118]. Both Sheikh et al. [119] and Lin and Kuo [35] have provided analyses of the performances of various IQA algorithms on some of these databases. In addition, in [120], Winkler has provided quantitative comparisons of various aspects (source content, test conditions, and subjective ratings) of some of these databases. Note that 3D image quality databases are not listed here; see [118].
(i) IRCCyN/IVC Image Quality Database (IVC). The IRCCyN/IVC database [121, 122], developed at the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN), France, contains reference images and distorted images in 24-bpp color BMP format at an image resolution of pixels. There are five types of distortions in this database: JPEG compression ( distorted images), JPEG compression of only the luminance component ( distorted images), JPEG2000 compression ( distorted images), locally adaptive-resolution coding ( distorted images), and Gaussian blurring ( distorted images). Each type of distortion was generated at five different amounts of distortion. The ratings were collected from 15 subjects.
(ii) LIVE Image Quality Database. The LIVE database [119, 123, 124], developed at the University of Texas at Austin, USA, contains reference images and distorted images in 24-bpp color BMP format at different image resolutions ranging from 634 × 438 to 768 × 512 pixels. There are five distortion types in this database: JPEG compression (169 distorted images), JPEG2000 compression (175 distorted images), additive Gaussian white noise (145 distorted images), Gaussian blurring (145 distorted images), and JPEG2000 with bit errors via a simulated Rayleigh fading channel (145 distorted images). Each type of distortion was generated at 5-6 different amounts of distortion. The ratings were collected from 29 subjects.
(iii) A57 Image Quality Database. The A57 database [125], developed at Cornell University, USA, contains three reference images and distorted images in 8-bpp grayscale BMP format at a resolution of pixels. The database contains six types of distortion: uniform quantization of the LH subbands of a 5-level discrete wavelet transform (DWT), additive Gaussian white noise, JPEG compression, JPEG2000 compression, custom JPEG2000 compression via the Dynamic Contrast-Based Quantization algorithm [126], and Gaussian blurring. Each type of distortion was generated at three different amounts, yielding nine distorted images for each distortion type. The ratings were collected from seven subjects.
(iv) Tampere Image Quality (TID2008) Database. The Tampere database [127, 128], developed at the Tampere University of Technology, Finland, contains distorted images generated from reference images. The reference images were obtained from the Kodak Lossless True Color Image Suite. All of the images are stored in 24-bpp BMP format at a resolution of pixels. There are distortion types in the database (e.g., different types of noise, blur, denoising, JPEG and JPEG2000 compression, transmission of JPEG and JPEG2000 images with errors, local distortions, and luminance and contrast changes). Each type of distortion was generated at four different amounts. The ratings were obtained from 838 subjects.
(v) Toyama Image Quality (MICT) Database. The MICT database [129], developed at the University of Toyama, Japan, contains reference images and distorted images in 24-bpp color BMP format at a resolution of pixels. There are two types of distortion in this database: JPEG compression ( distorted images) and JPEG2000 compression ( distorted images). Both types of distortion were generated at seven different amounts. The ratings were obtained from 16 subjects.
(vi) IRCCyN/IVC Scores on the MICT Database. Additional subjective ratings of the quality of the images from the MICT database were obtained at the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN). The IRCCyN ratings were collected by using a different testing protocol, a different type of display, and different populations of subjects [130, 131]. The ratings were collected from 27 subjects.
(vii) The Real Blur Image Database (RBID). The RBID [111], developed at the Universidade Federal do Rio de Janeiro, Brazil, contains blurred images in 24-bpp BMP format at resolutions ranging from to pixels. The images in this database are categorized into five different blur classes: unblurred ( images), out of focus ( images), simple motion ( images), complex motion ( images), and others ( images). The ratings were collected from 20 subjects.
(viii) IRCCyN/IVC Watermarking Databases. Four separate watermarking databases were developed by the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN), France. The images were created by embedding watermarks with different algorithms: Enrico, Broken Arrows (BA), Fourier Subband (FSB), and Meerwald (MW).
(a) IRCCyN/IVC Watermarking—Enrico Database. This database [132] contains five reference images and distorted images generated from watermarking algorithms with two embedding strengths. All of the images are in 8-bpp grayscale BMP format at a resolution of pixels. The ratings were obtained from 16 subjects.
(b) IRCCyN/IVC Watermarking—Broken Arrows Database. This database [133] contains reference images and distorted images with six different embedding strengths and either with or without CSF weighting. All of the images are in 8-bpp grayscale PPM/PGM format at a resolution of pixels. The ratings were obtained from 17 subjects.
(c) IRCCyN/IVC Watermarking—Fourier Subband Database. This database [134] contains five reference images and distorted images containing watermarks in six frequency subbands at seven embedding strengths for each subband. All of the images are in 8-bpp grayscale BMP format at a resolution of pixels. The ratings were obtained from 7 subjects.
(d) IRCCyN/IVC Watermarking—Meerwald Database. This database [135] contains reference images and distorted images generated from five embedding strengths either in the DWT domain or in the dual-tree complex wavelet transform domain. All of the images are in 8-bpp grayscale BMP format at a resolution of pixels. The ratings were obtained from 14 subjects.
(ix) Wireless Imaging Quality (WIQ) Database. The WIQ database [136, 137] was developed at the Radio Communication Group at the Blekinge Institute of Technology, Sweden. This database contains seven reference images and images distorted via loss of JPEG data over a simulated wireless channel. The images are stored in 8-bpp BMP format at a resolution of pixels. The ratings were obtained in two separate experiments from 30 subjects.
(x) The Visual Attention Image Quality (VAIQ) Database. The VAIQ database [112, 138], developed at the University of Western Sydney, Australia, contains ground-truth visual gaze patterns of reference images taken from the LIVE, IVC, and MICT image databases. Although this database is not strictly an image quality database, the visual gaze patterns can be useful for examining the effects of visual attention on quality. The visual gaze patterns were obtained from 15 subjects.
(xi) Categorical Subjective Image Quality (CSIQ) Database. The CSIQ database [27, 139], developed at Oklahoma State University, USA, contains reference images and distorted images in 24-bpp PNG format at a resolution of pixels. There are six distortion types in this database: JPEG compression ( distorted images), JPEG2000 compression ( distorted images), additive Gaussian white noise ( distorted images), additive Gaussian pink noise ( distorted images), Gaussian blurring ( distorted images), and global contrast decrements ( distorted images). Each type of distortion was generated at 4-5 different amounts. The ratings were obtained from 35 subjects.
(xii) TU Delft Perceived Ringing (TUD1 and TUD2) Datasets. The TUD1 and TUD2 databases were developed at the Delft University of Technology, The Netherlands. The subjective ratings were collected from two experiments: (1) a ringing region experiment (TUD1) and (2) a ringing annoyance experiment (TUD2). In the ringing region experiment, JPEG-compressed images were generated from eight reference images with two levels of compression. The results were collected from subjects and are presented in the form of subjective ringing region maps. In the ringing annoyance experiment, JPEG-compressed images were generated from 11 reference images with four different levels of compression. The ratings were obtained from subjects.
(xiii) IRCCyN/IVC DIBR Image Quality Database. The DIBR database [114], developed by the Institut de Recherche en Communications et Cybernétique de Nantes (IRCCyN), France, contains still images extracted from three different multiview-plus-depth sequences. All sequences have the same resolution of pixels but were captured with a variable number of cameras at different camera spacings (6.5 cm, 3.5 cm, and 5.0 cm). Each sequence was processed by seven depth-image-based rendering algorithms to generate four new viewpoints of each sequence. The ratings were obtained from 43 subjects.
(xiv) MMSPG JPEG XR Image Compression Database. The MMSPG JPEG XR image compression database [115], developed at the Swiss Federal Institute of Technology (EPFL), Switzerland, contains compressed images generated from reference images using JPEG XR compression. The images are stored in 24-bpp color BMP format at a resolution of pixels. Six coding bitrates ranging from to bpp were used to generate the distorted images. The ratings were obtained from 16 subjects.
(xv) HTI and IBBI Databases. The HTI and IBBI databases [116], jointly developed at TU Delft and IRCCyN, are designed to test the performance of blurriness metrics. The highly textured images (HTI) database contains reference images with highly textured content and blurred versions of the images; the images are stored at a resolution of pixels. The intentionally blurred background images (IBBI) database contains reference images and blurred versions in which the background was intentionally blurred; the images are stored at a resolution of pixels. In both databases, the blurring was performed via Gaussian filtering with five different levels of blur, and the ratings were obtained from 18 subjects.
(xvi) VCL@FER Image Quality Database. The VCL@FER database [117], developed at the University of Zagreb, Croatia, contains reference images and distorted images. There are four types of distortion in this database ( distorted images for each type): additive Gaussian white noise, Gaussian blurring, JPEG compression, and JPEG2000 compression. Each distortion type was generated at six different amounts of distortion. The ratings were obtained from 118 subjects.
(xvii) Digitally Retouched Image Quality (DRIQ) Database. The DRIQ image quality database [140] is a full-reference enhanced-image database developed at Oklahoma State University, USA. This database contains 26 reference images and 78 enhanced images obtained via manual digital retouching. The images are stored in 24-bpp color PNG format. The ratings were obtained from 9 subjects. (Some images from DRIQ and additional details of the database are provided in Section 5.6.)

2.2.2. Quantifying the Predictive Performance

The image quality databases described in the previous section serve as crucial ground-truth information for evaluating IQA algorithms. Specifically, to quantify how well an IQA algorithm can predict the MOS or DMOS values from a particular database, it is customary to evaluate the algorithm in terms of three performance criteria recommended by the Video Quality Experts Group (VQEG) [141]: (1) prediction accuracy, (2) prediction monotonicity, and (3) prediction consistency.

The prediction accuracy can be quantified either by measuring how well an algorithm's predictions correlate with the MOS/DMOS values or by measuring the average error between the algorithm's predictions and the MOS/DMOS values. The Pearson correlation coefficient (CC) and the root-mean-squared error (RMSE), both recommended in [141], are most commonly used for quantifying correlation and average error, respectively.

Before computing CC or RMSE, it is customary to apply a nonlinear transformation to the predicted scores so as to bring the predictions onto the same scale as the MOS/DMOS values and to attempt to obtain a linear relationship between the predictions and opinion scores. Let $s_i$ denote the DMOS or MOS value for the $i$th image, let $x_i$ denote the corresponding predicted score from an IQA algorithm, and let $\hat{s}_i = f(x_i)$ denote the corresponding transformed predicted score. In [141], three suggested forms for the transform $f$ are provided. The parameters of $f$ are chosen to minimize the MSE between the set of DMOS/MOS values $\{s_i\}$ (e.g., all DMOS/MOS values in a particular database) and the corresponding set of transformed predicted values $\{\hat{s}_i\}$. The minimization is conducted under the constraint that $f$ must be a monotonic function over the range of predicted values.

It is important to note that because there is inherent variability across subjects and across different trials of the same subject, if the scores for a particular image demonstrate a large variability across subjects/trials, then the mean score (i.e., the MOS or DMOS) is not necessarily a good indication of what should be predicted. Instead, some leeway, determined based on the variability, should be given around the MOS/DMOS values. Some databases provide the standard deviation associated with each MOS/DMOS value, $\sigma_i$, which provides a measure of the variability across subjects/trials for the $i$th image. These standard deviations can be taken into account during the fitting procedure used to determine the parameters of $f$.
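The following sketch illustrates the customary mapping-then-correlation procedure using one possible monotonic logistic form; the exact functional forms suggested in [141] differ, and the parameterization, initial guesses, and function names below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

def logistic(x, b1, b2, b3):
    """One example of a monotonic mapping from raw predictions to the
    MOS/DMOS scale; [141] suggests several alternative forms."""
    return b1 / (1.0 + np.exp(-b2 * (x - b3)))

def accuracy_metrics(predictions, mos):
    """Fit the mapping, then report Pearson CC and RMSE on the mapped scores."""
    p0 = [mos.max(), 1.0, np.median(predictions)]      # rough initial guess
    params, _ = curve_fit(logistic, predictions, mos, p0=p0, maxfev=10000)
    fitted = logistic(predictions, *params)
    cc, _ = pearsonr(fitted, mos)
    rmse = np.sqrt(np.mean((fitted - mos) ** 2))
    return cc, rmse
```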

The prediction monotonicity specifies how well an algorithm predicts the rank-ordering of the opinion scores. Various rank-order correlation coefficients can be used to quantify the monotonicity (e.g., Spearman's $\rho$, variants of Kendall's $\tau$, Goodman and Kruskal's $\gamma$, and Somers' $d$; see [142]). The Spearman rank-order correlation coefficient (Spearman's $\rho$, SROCC) is recommended in [141] and is thus most commonly used.

However, it should be noted that the standard formula for computing SROCC must be adjusted for ties in the ranks [142], an adjustment which is rarely used in the IQA literature. The difficulty in accounting for ties stems from the fact that, in order to determine ties, the variability (e.g., $\sigma_i$) associated with each score must be known. As I mentioned, not all databases provide such information, and thus ties are rarely taken into account.
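For completeness, a short sketch of the monotonicity measures using scipy is shown below; note that scipy's implementations handle tied ranks via average ranks, which is distinct from the variability-based notion of ties discussed above.

```python
from scipy.stats import spearmanr, kendalltau

def monotonicity_metrics(predictions, mos):
    """Rank-order agreement between predictions and MOS/DMOS values.
    No nonlinear mapping is needed: rank correlations are invariant to
    any monotonic transform of the predictions."""
    srocc, _ = spearmanr(predictions, mos)
    krocc, _ = kendalltau(predictions, mos)
    return srocc, krocc
```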

The prediction consistency specifies how consistent an IQA algorithm's quality predictions are across the range of content provided in a database—for example, across different images, different distortion types (or other alteration types), and different amounts of each distortion/alteration type. Two measures of prediction consistency are the outlier ratio [141] and the outlier distance [139], both of which require the aforementioned standard deviations.

The most commonly used measure of prediction consistency, which is recommended in [141], is the outlier ratio. The outlier ratio, $OR$, is defined as
$$OR = \frac{N_{\mathrm{out}}}{N},$$
where $N_{\mathrm{out}}$ is the number of predictions that fall outside two standard deviations of the corresponding opinion scores (i.e., outside $s_i \pm 2\sigma_i$) and $N$ is the total number of scores. The range of $\pm 2\sigma_i$ was chosen in [141] because it contains 95% of all the subjective quality scores for a given image.

In addition to knowing whether a predicted score is an outlier, it is also informative to know how far outside the error bars ($s_i \pm 2\sigma_i$) the outlier falls. To quantify this, we proposed in [139] a new measure, termed the outlier distance. The outlier distance, $OD$, measures the distance from each outlier to the closest error bar; it is defined as
$$OD = \sum_{i \in S_{\mathrm{out}}} \min\bigl(\lvert\hat{s}_i - (s_i - 2\sigma_i)\rvert,\; \lvert\hat{s}_i - (s_i + 2\sigma_i)\rvert\bigr),$$
where $S_{\mathrm{out}}$ is the set of indices of all predicted scores falling outside $s_i \pm 2\sigma_i$. Note that because $OD$ is dependent on the dynamic range of the MOS/DMOS values, it cannot be used to compare across databases.
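A minimal sketch of both consistency measures is given below, assuming the per-image standard deviations are available; the summation used for the outlier distance follows the reconstruction given above.

```python
import numpy as np

def outlier_metrics(pred_mos, mos, sigma):
    """Outlier ratio and outlier distance.  `pred_mos` are the (nonlinearly
    mapped) predicted scores; `sigma` holds the per-image standard deviations."""
    lo, hi = mos - 2.0 * sigma, mos + 2.0 * sigma
    outliers = (pred_mos < lo) | (pred_mos > hi)
    outlier_ratio = np.count_nonzero(outliers) / pred_mos.size
    # Distance from each outlying prediction to the nearest error bar
    dist = np.minimum(np.abs(pred_mos - lo), np.abs(pred_mos - hi))
    outlier_distance = np.sum(dist[outliers])
    return outlier_ratio, outlier_distance
```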

3. Full-Reference Image Quality Assessment Algorithms

The vast majority of IQA algorithms are so-called full-reference algorithms, which take as input both a distorted image and a reference image and yield as output an estimate of the quality of the distorted image relative to the reference. The simplest approach to full-reference (FR) IQA is to measure local pixelwise differences and then to collapse these local measurements into a scalar which represents the overall quality difference, for example, the mean-squared error (MSE) or peak signal-to-noise ratio (PSNR), often measured in different domains, for example, [143]. More complete FR IQA algorithms have employed a wide variety of approaches ranging from estimating quality based on models of the HVS (see Section 3.1), to estimating quality based on image structure (see Section 3.2), to estimating quality by using various statistical and information-theoretic-based approaches (see Section 3.3) and many other techniques (see Section 3.4). Here, I provide a brief survey of these FR IQA algorithms.
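As a concrete baseline, the simplest pixelwise measures mentioned above can be computed as follows (assuming 8-bit intensity images).

```python
import numpy as np

def mse_psnr(reference, distorted, max_value=255.0):
    """Pixelwise MSE and PSNR between a reference and a distorted image."""
    err = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(err ** 2)
    psnr = 10.0 * np.log10(max_value ** 2 / mse) if mse > 0 else np.inf
    return mse, psnr
```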

3.1. Methods Based on HVS Models

Given a distorted image, a human can readily rate the quality of the image relative to the original image and relative to other distorted images. Accordingly, numerous IQA methods have been developed which employ computational models of the HVS [26, 30, 87, 88, 98, 100, 101, 105107, 125, 130, 139, 144157].

Most HVS-based IQA algorithms employ a variant of the V1 model described previously in Section 2.1. The images are typically processed through a set of spatial filters to obtain oriented, spatial-frequency decompositions of the images designed to mimic the initially linear responses of neurons in V1. The CSF is taken into account either by adjusting the simulated linear neural responses based on the passbands of the filters or by using a prefiltering stage with a 2D CSF-based filter (see (1)). Masking is commonly taken into account by further adjusting the simulated neural responses via a divisive normalization (see (3)). Finally, the quality of the distorted image is estimated based on the extent to which the adjusted responses to the reference image differ from the adjusted responses to the distorted image. Typically, this final stage is performed by computing pointwise absolute differences between the original and distorted responses, and then collapsing these differences via an norm (see, e.g., [87, 88, 99]).

Many HVS-based methods were originally designed to operate as predictors of visible image differences; that is, they have been designed to determine if changes are visible and accordingly operate best when the distorted images contain artifacts near the threshold of detection. Researchers have previously argued that the underlying V1 models need to be extended to take into account higher-level properties of human vision [85, 126, 156]. Unfortunately, although our current understanding of near-threshold vision for controlled stimuli is relatively mature from a modeling perspective, much less is known about how the HVS operates when the distortions are more complex and in the suprathreshold regime (which may invoke areas of visual cortex beyond V1). Nonetheless, recent HVS-based methods have begun to use improved models and/or models of mid- and higher-level vision [107, 125, 139, 151, 157, 158], and many of these methods have been shown to perform extremely well as general IQA algorithms.

For example, in [107], Damera-Venkata et al. augmented traditional models of contrast sensitivity and luminance and contrast masking with models of suprathreshold contrast perception. In [125], Chandler and Hemami presented a visual signal-to-noise ratio (VSNR), in which a wavelet-based model of low-level vision is combined with a model of how the HVS adaptively prefers different spatial frequencies depending on the amount of degradation. In [157], Laparra et al. presented an IQA algorithm which employs an improved divisive-normalization-based masking model. In [159], Cheng et al. supplemented a wavelet-based HVS model with measures of directional structural distortion and structural similarity [123]. Alternative image transforms have also been used to integrate properties of the HVS into IQA algorithms [160, 161].

In [139], Larson and Chandler presented an IQA algorithm, MAD (most apparent distortion), which explicitly models the adaptive nature of the HVS. MAD was one of the first algorithms to demonstrate that quality can be predicted by modeling two strategies employed by the HVS and by adapting these strategies based on the amount of distortion. For high-quality images, in which the distortion is less noticeable, the image is most apparent, and thus the HVS attempts to look past the image and look for the distortion—a detection-based strategy. For low-quality images, the distortion is most apparent, and thus the HVS attempts to look past the distortion and look for the image's subject matter—an appearance-based strategy. In MAD, local luminance and contrast masking are used to model the detection-based strategy for high-quality images, whereas changes in the local statistics of log-Gabor coefficients are used to model the appearance-based strategy for low-quality images.

Another recent trend in HVS-based IQA algorithms has aimed at incorporating aspects of visual attention and regions of interest (ROIs) during quality assessment (see, e.g., [60–62, 65, 162–164] for related psychophysical studies). Wang and Bovik [165] developed a model for adjusting contrast sensitivity based on foveation. Osberger et al. [152] developed a vision-based metric by using the CSF and masking, into which they incorporated a visual importance map by selecting effective ROIs based on higher-level visual properties such as size, shape, foreground/background, and the presence of people.

Le Callet et al. [155] presented an IQA algorithm which operates on ten regions of fixed angular size that contain the maximum perceived distortion; the implicit ROI assumption is that the eye is drawn to regions of maximum error, and quality is estimated by spatially pooling the estimates of maximum perceived distortion. In [158], Carnec et al. combined low-level HVS properties with a measure of the structural information obtained via a stick-growing algorithm and estimates of visual fixation points. In [166], Moorthy and Bovik incorporated both visual-fixation-based weighting and quality-based weighting into an IQA algorithm.

In [167], Tong et al. employed saliency maps for IQA based on the observation that salient regions contribute more to the perceived image quality. Salient region information generated by the model of Itti and Koch [168] and a face detection model were used to generate weights in [167] to improve the performance of previous IQA algorithms. Similarly, in [169], Guo et al. assumed that humans often pay more attention to image regions with important content. The authors incorporated saliency-based visual attention and visual-importance-based visual attention into the SSIM algorithm [123].

In [170], Wu et al. incorporated an “internal generative mechanism” (IGM) into existing IQA algorithms. Their IGM posits that the HVS actively predicts sensory information and attempts to discount the residual uncertainty during image perception and understanding. In [170], the images are decomposed into two parts: the predicted portion, consisting of the predicted visual content, and the disorderly portion, consisting of the residual content. SSIM [123] is employed to measure distortions in the predicted portion, PSNR is employed to measure distortions in the disorderly portion, and then the two results are adaptively combined to yield the overall quality prediction.

3.2. Methods Based on Image Structure

A recent thrust in image quality assessment has focused on measuring changes in an image's structure as a proxy for measuring image quality. The central assumption in this approach is that the HVS has evolved to extract structure from the natural environment. Consequently, a higher-quality image is one whose structure closely matches that of the original image, whereas a lower-quality image exhibits less structural similarity to the original. Although a precise definition of “image structure” remains an open question, methods of this type have been shown to correlate quite highly with subjective ratings of quality.

The effects of distortion on image structure and the corresponding effects on image quality have been mentioned in the optics and engineering literature since the 1970s in the context of television pictures (e.g., [25, 171]). Eskicioglu and Fisher [172] were among the first to apply explicit measures of “structural content” and “correlation quality” (based on correlation and normalized cross-correlation) to IQA. In [173], Fränti presented a block-based IQA algorithm which included separate measures for structural errors (based on edge detection), quantization errors, and contrast errors. In [174], Wang and Bovik presented the universal image quality index (UQI), which employed cross-correlation and measures of luminance and contrast differences to estimate quality.

The use of cross-correlation-based measures of structural similarity was made popular by Wang et al., who proposed the Structural Similarity Index (SSIM) [123]. SSIM is an extended version of UQI in which the correlation, luminance, and contrast measures were modified by adding small stabilizing constants to the numerator and the denominator of each measure. In [175], Wang et al. presented a multiscale version of SSIM (MS-SSIM) in which the correlation, luminance, and contrast measures are also applied to filtered and downsampled versions of the images. In [176], Sampat et al. presented a complex wavelet version of SSIM which adds robustness to small affine transformations of the distorted image.
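
Because many of the methods surveyed below build directly on SSIM, a minimal single-scale sketch is given here (Python/SciPy). An 8x8 uniform window is used for brevity, whereas the reference implementation of [123] uses an 11x11 Gaussian window; setting the constants c1 and c2 to zero recovers UQI.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(ref, dst, L=255.0, k1=0.01, k2=0.03, win=8):
    """Local SSIM map between two grayscale images of equal size."""
    ref = ref.astype(float); dst = dst.astype(float)
    c1, c2 = (k1 * L)**2, (k2 * L)**2
    mu_x = uniform_filter(ref, win); mu_y = uniform_filter(dst, win)
    var_x = uniform_filter(ref**2, win) - mu_x**2
    var_y = uniform_filter(dst**2, win) - mu_y**2
    cov_xy = uniform_filter(ref * dst, win) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return num / den

def ssim(ref, dst):
    """Overall SSIM index: mean of the local SSIM map."""
    return float(np.mean(ssim_map(ref, dst)))
```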

Over the last eight years, numerous variations of SSIM and other IQA algorithms which estimate quality based on structural similarity and/or structural degradation have been proposed.

3.2.1. SSIM-Based Methods

Various IQA algorithms have been developed which directly employ SSIM or MS-SSIM as a part of the IQA process.

(i) In [177], Yang et al. presented a modified version of MS-SSIM that operates by using the 9/7 DWT filters.
(ii) In [178], Ji et al. presented a modified version of SSIM using a discrete Haar wavelet transform (HWSSIM). A multiresolution version of HWSSIM was also defined in [178] by weighting/combining four HWSSIM values evaluated at four Haar wavelet levels with the CSF.
(iii) In [179], Cao et al. presented an IQA method which estimates quality based on the influence of both global and local distortions. The global distortion is measured via a rectified mean absolute difference; the local distortion is measured via SSIM. These two measures are combined using a weighting strategy to yield a final image quality estimate.
(iv) In [180], Shi et al. presented an IQA method for color images based on structural and color similarity indices. SSIM is applied to the luminance and the hue component of the image to yield two similarity indices which are combined into an overall structural similarity index.
(v) In [181], Rao and Reddy presented an IQA method in which the SSIM indices of local image regions are adjusted by perceptual weights defined from the regions' contrasts. A measure of the image's overall perceptual SSIM index is calculated as the average of these weighted indices (a generic weighted-pooling sketch is given after this list).
(vi) In [182], Zhang et al. employed image features from the coefficients of the 1st-order and 2nd-order Riesz transform. These two feature maps are masked by the image's edge locations before applying SSIM to compute the structural similarity.
(vii) As mentioned in Section 3.1, in [169], Guo et al. presented an IQA algorithm which incorporates saliency-based visual attention and visual-importance-based visual attention into SSIM.
(viii) As also mentioned in Section 3.1, Wu et al. [170] employ SSIM [123] to measure distortions in the predicted portion of their internal-generative-mechanism-based IQA model.
(ix) In [183], Chebbi et al. estimate the quality of blurred images based on a combination of a perceptual blur measure and SSIM; the perceptual blur measure is derived from edge maps obtained via a Haar DWT.
(x) In [184], Fei et al. estimate quality based on SSIM and visual masking. The contrast comparison in SSIM is augmented by measures of masking, and the structural comparison is modified by using the image's structure tensor.
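
Several of the variants above (e.g., items (iii), (v), and (vii)) share a common step: a local SSIM (or similar quality) map is pooled with a spatial weight map derived from contrast, saliency, or visual importance. The sketch below (Python/SciPy) illustrates only that generic pooling step; the local standard deviation of the reference image is an assumed stand-in for whatever weight map a particular method actually uses.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def weighted_pool(quality_map, ref, win=8, eps=1e-6):
    """Pool a local quality map (e.g., an SSIM map) with a spatial weight map.
    Here the weight is the local standard deviation of the reference image,
    a crude contrast/importance proxy; a saliency map could be substituted."""
    ref = ref.astype(float)
    mu = uniform_filter(ref, win)
    var = np.maximum(uniform_filter(ref**2, win) - mu**2, 0.0)
    w = np.sqrt(var) + eps
    return float(np.sum(w * quality_map) / np.sum(w))
```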

3.2.2. Gradient-Based Methods

Another way to measure changes in structure is to compute changes in local image gradients. Several methods have been developed which take this approach (a sketch of the gradient-similarity core shared by such methods is given after this list).

(i) In [185], Kim and Park presented an IQA algorithm based on the Harris response. The authors observed that when an image is degraded or distorted, the image's gradient information is changed, causing the Harris response to change. Thus, in [185], the changes in the Harris response of the image, which is computed from the gradient information matrix and its eigenvalues, are used to measure image quality.
(ii) In [186], Zhu and Wang presented an IQA algorithm based on a three-stage multiscale visual gradient similarity (VGS) index. First, global contrast registration is applied for each scale. Second, the similarity of gradient directions and gradient magnitudes are combined to yield comparison maps. Finally, quality is estimated via intrascale and interscale pooling of the maps. Some parameters of VGS are trained on existing image quality databases to optimize performance.
(iii) In [187], Chen et al. argued that the structural information can be computed from the distribution of gradient magnitudes and edge directions. Accordingly, edge-direction-histogram (EDH) descriptors are extracted, and then quality is estimated based on the structural similarity between EDH descriptors of the reference and the distorted images.
(iv) In [188], Liu et al. compared gradient similarity between the reference and distorted images to evaluate quality. Their method first computes both luminance-based and contrast-based structural changes. Then, these changes are weighted by estimates of visibility thresholds. Finally, the weighted changes are adaptively integrated to yield the image's overall quality estimate.
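
Although the methods above differ in their details, most of them compare local gradient magnitudes of the reference and distorted images through a stabilized similarity ratio. The sketch below (Python/SciPy) shows that shared core; the Sobel operator, the constant c, and mean pooling are illustrative assumptions rather than the choices of any one of the cited methods.

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_magnitude(img):
    """Local gradient magnitude via horizontal and vertical Sobel responses."""
    gx = sobel(img.astype(float), axis=1)
    gy = sobel(img.astype(float), axis=0)
    return np.sqrt(gx**2 + gy**2)

def gradient_similarity(ref, dst, c=170.0):
    """Pointwise similarity of gradient magnitudes, pooled by the mean."""
    g_r = gradient_magnitude(ref)
    g_d = gradient_magnitude(dst)
    sim = (2 * g_r * g_d + c) / (g_r**2 + g_d**2 + c)
    return float(np.mean(sim))
```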

3.2.3. Methods Based on Other Measures of Structure

Various alternative measures of structural similarity/degradation have also been employed in several IQA algorithms.

(i) In [189], Zhai et al. presented an IQA algorithm which operates based on the notion of a “multiscale edge presentation.” Their method measures structure via correspondences in wavelet magnitudes across spatial scales.
(ii) In [190], Zhang and Mou combined PSNR with a measure of structure based on differences in wavelet modulus maxima corresponding to low- and high-frequency bands.
(iii) In [191], Jin et al. presented a DCT-based IQA algorithm that considers contrast and brightness degradations as well as block-based structural similarity.
(iv) In [192], Chou and Hsu proposed an IQA algorithm which uses moment-preserving quantization for extracting geometric structural information. SSIM values of luminance and contrast are combined with a similarity measure of geometric structure to yield the final quality estimate.
(v) In [193], Zhang et al. proposed a feature-based similarity measure for IQA which operates based on phase congruency as the primary feature and the image gradient magnitude as a secondary feature. The phase congruency is also used as a weighting function to derive an overall image quality score from the local quality map obtained from the two features.
(vi) In [194], Narwaria et al. designed an IQA algorithm based on the phase and magnitude of the discrete Fourier transform. The algorithm compares the phase and magnitude of the Fourier coefficients of the reference and distorted images to compute image quality. Nonuniform binning of the frequency components and linear regression are employed to integrate the effects of the changes in phase and magnitude.

3.3. Methods Based on Image Statistics and/or Machine Learning

Other measures of image quality have been proposed which operate based primarily on statistical/information-theoretic measures, often supplemented by machine-learning techniques. These methods have also demonstrated considerable success at predicting quality. See [36] for a recent, thorough discussion of the use of natural-scene statistics for IQA.

In [195], Sheikh and Bovik presented the VIF (visual information fidelity) algorithm, which estimates image quality based on natural-scene statistics. VIF operates under the premise that the HVS has evolved based on the statistical properties of the natural environment. Accordingly, the quality of the distorted image can be quantified based on the amount of information the distorted image provides about the reference image. VIF models wavelet subband coefficients as realizations of Gaussian scale mixtures, and quality is then determined based on the mutual information between the subband coefficients of the reference and distorted images.

In [196], Shnayderman et al. measured image quality based on a singular value decomposition (SVD). The Euclidean distance is measured between the singular values of an original image block and the singular values of the corresponding distorted image block; the collection of block-wise distances constitutes a local distortion map. An overall scalar value of image quality is computed as the average absolute difference between each block's distance and the median distance over all blocks.
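
A minimal sketch of this block-wise SVD measure, following the description above (Python/NumPy; an 8x8 block size is assumed), is:

```python
import numpy as np

def msvd_quality(ref, dst, block=8):
    """Block-wise Euclidean distance between singular values, summarized as
    the mean absolute deviation from the median distance (lower = better)."""
    h, w = ref.shape
    h -= h % block; w -= w % block          # crop to a whole number of blocks
    dists = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            s_r = np.linalg.svd(ref[i:i+block, j:j+block].astype(float),
                                compute_uv=False)
            s_d = np.linalg.svd(dst[i:i+block, j:j+block].astype(float),
                                compute_uv=False)
            dists.append(np.linalg.norm(s_r - s_d))
    dists = np.array(dists)
    return float(np.mean(np.abs(dists - np.median(dists))))
```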

Other SVD-based IQA algorithms have also been proposed. In [197], Mansouri et al. presented an IQA algorithm (RSVD) which estimates quality by factoring the image's SVD matrix into a matrix which captures luminance changes and two matrices which capture structural changes. In [198], Narwaria and Lin presented an IQA algorithm which uses SVD-based visual features and feature pooling via machine learning. In [199], Saha et al. presented an IQA algorithm in which approximate descriptions of the reference and distorted images are obtained at different scales by using an SVD filter; quality is estimated by computing the similarity between two pyramidal structures.

More direct machine-learning-based techniques have also been applied to IQA. In [7], Liu and Yang applied supervised learning to derive a measure of image quality based on decision fusion. A training step is used to determine an optimal linear combination of four IQA methods: PSNR, SSIM, VIF, and VSNR. Training is performed via canonical correlation analysis and images/subjective ratings from the LIVE [124] and A57 [125] image databases.
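
As a simplified illustration of decision fusion (ordinary least squares is used here in place of the canonical correlation analysis of [7]), a linear combination of several IQA scores can be fit to subjective ratings as follows:

```python
import numpy as np

def fit_fusion_weights(scores, mos):
    """scores: (n_images, n_algorithms) matrix of IQA outputs (e.g., PSNR,
    SSIM, VIF, VSNR); mos: (n_images,) vector of subjective ratings.
    Returns the weights (last entry = intercept) minimizing squared error."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])   # add intercept column
    w, *_ = np.linalg.lstsq(X, mos, rcond=None)
    return w

def fused_quality(scores, w):
    """Apply the learned linear combination to new IQA outputs."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return X @ w
```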

In [200], Peng and Li argued that IQA algorithms which operate based on individual features cannot accurately predict quality across different distortion types. To overcome this limitation, the authors proposed a two-stage scheme. First, the image distortion type is predicted by support-vector classifiers. Second, decision-level fusion of three existing algorithms (SSIM, VSNR, and VIF) is performed via k-nearest-neighbor regression in which the acquired distortion-type knowledge is employed. More recent related work by Peng and Li can be found in [201].

In [202], Charrier et al. presented the Machine Learning-Based Image Quality Measure (MLIQM) which employs a learned classification process. The MLIQM method first constructs a feature vector consisting of various measured image attributes. Then, a classification process is performed to assign the distorted image into a quality class. Finally, support-vector regression is performed based on the quality class to yield the estimate of quality.

Other statistical models of images have also been used for IQA. In [203], Wang and Li argued that the optimal perceptual weights for pooling the output of an IQA algorithm across space should be proportional to the local information content, which is estimated in [203] by using statistical models of natural scenes. In [204], Chang and Wang presented an IQA algorithm that relates image quality with the correlation between the sparse codes formed from the reference and distorted images. In [205], Pinto and Hemami provided an upper bound for the performances of IQA algorithms in the low-quality regime by using a family of IQA techniques based on VIF.

3.4. Methods Based on Other Techniques

In addition to the methods mentioned in the previous sections, numerous other techniques have been employed in IQA algorithms. For example, IQA algorithms have been developed to assess quality based on different color spaces [206], based on image segmentation and/or region-based analysis [207–209], and based on the use of additional features [210–217].

In [207], Xu and Hauske presented an IQA algorithm which operates based on error segmentation. The errors are divided into three types: those that affect objects' edges, those that affect other edges, and those that are most visible in smooth regions (e.g., blocking and noise). The segmented errors are used to compute distortion factors, which are combined into an overall estimate of quality via multiple linear regression.

In [210], Bianco et al. presented different computational strategies to improve the robustness and accuracy of SSIM and the S-CIELAB spatial-color model.

In [211], Okarma proposed an IQA algorithm which employs a combination of three previous methods: MS-SSIM [175], VIF [195], and RSVD [197].

In [212], Lahouhou et al. conducted an empirical study of several quality indicators (including PSNR, SSIM, and wavelet-based quality measures) and proposed a regularized regression model for combining the indicators to predict quality.

In [213], Xue and Mou proposed an IQA algorithm based on a ratio of non-shift edges. First, the distorted image is filtered by a Laplacian of Gaussian, and then edge points are detected from the filtered image. Next, a binary “non-shift edge” map is computed to represent the strong edge structure present in the distorted image. Finally, quality is estimated based on the map.

In [214], Li et al. demonstrated that adaptively combining two quality measurements could improve quality predictions. The authors proposed an IQA algorithm that separately evaluates detail losses and additive impairments. The detail loss refers to the loss of useful visual information which affects the content visibility, and the additive impairment represents the redundant visual information which distracts attention from the useful content. Two quality measures corresponding to detail losses and additive impairments are computed, and then the outputs of the two quality measures are adaptively combined to yield the overall quality prediction.

In [215], Attar et al. presented the edge-based image quality assessment (EBIQA) algorithm. EBIQA employs four edge features computed from the reference and distorted images: edge orientation, average length of edges, primitive length of edges, and number of edge pixels.

In [216], Ponomarenko et al. presented an IQA algorithm which employs a parameter map that denotes the image's local self-similarity. Quality is estimated based on the mean-squared difference between the parameter maps for the reference and distorted images.

In [217], Solh and AlRegib developed an IQA method for multicamera systems by identifying and quantifying two types of visual distortions: photometric distortions and geometric distortions. Such distortions can be quantified by using three different indices: a luminance and contrast index, a spatial motion index, and an edge-based structure index. These indices are combined into one multicamera image quality measure (MIQM).

4. No-Reference and Reduced-Reference Image Quality Assessment

Although FR IQA provides a useful and effective way to evaluate quality differences, in many applications the reference image is not available. Although humans can often effortlessly judge the quality of a distorted image in the absence of a reference image, this task has proven to be quite challenging from a computational perspective. No-reference (NR) and reduced-reference (RR) IQA algorithms attempt to perform IQA with either no information (NR IQA) or only limited information (RR IQA) about the reference image. Here, I briefly survey existing NR and RR IQA algorithms.

4.1. No-Reference IQA

The vast majority of NR IQA algorithms attempt to detect specific types of distortion such as blurring, blocking, ringing, or various forms of noise. For example, algorithms for sharpness/blurriness estimation have been shown to perform well for NR IQA of blurred images. NR IQA algorithms have also been designed specifically for JPEG or JPEG2000 compression artifacts. Some NR algorithms have employed combinations of these aforementioned measures and/or other measures. Other NR IQA algorithms have taken a more distortion-agnostic approach.

4.1.1. Methods for Blurriness/Sharpness

Numerous algorithms have been developed to estimate the perceived sharpness or blurriness of images. Although the majority of these algorithms were not designed specifically for NR IQA, they have shown success at IQA for blurred images. Modern methods of sharpness/blurriness estimation generally fall into one of four categories: (1) those which operate via edge-appearance models, (2) those which operate in the spatial domain without any assumptions regarding edges, (3) those which operate by using transform-based methods, and (4) hybrid techniques which employ two or more of these methods.

A common technique of sharpness/blurriness estimation involves the use of edge-appearance models. Methods of this type operate under the assumption that the appearance of edges is affected by blur, and accordingly these methods estimate sharpness/blurriness by extracting various properties of the image edges. For example, Marziliano et al. [218] estimate blurriness based on average edge widths. Ong et al. [219] estimate blurriness based on edge widths in both the edge direction and its gradient direction. Dijk et al. [220] model the widths and amplitudes of lines and edges as Gaussian profiles and then estimate sharpness based on the amplitudes corresponding to the narrowest profiles. Chung et al. [221] estimate sharpness based on a combination of the standard deviation and weighted mean of the edge gradient magnitude profile. Wu et al. [222] estimate blurriness based on the image's estimated point spread function. Zhong et al. [223] estimate sharpness based on both edges and information from a saliency map. Ferzli and Karam [224] estimate sharpness based on an HVS-based model which predicts thresholds for just noticeable blur (JNB); the JNB for each edge block is used to estimate the block's perceived blur distortion, and the final sharpness estimate is obtained via a probabilistic combination of these distortions. A related JNB-based method can be found in [225].
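
As an illustration of the edge-appearance idea (closest in spirit to the average-edge-width measure of [218]), the sketch below (Python/NumPy) measures, for each strong horizontal gradient, the distance between the local extrema that bracket the edge; the fixed gradient threshold and the restriction to vertical edges scanned row by row are simplifying assumptions.

```python
import numpy as np

def average_edge_width(img, grad_thresh=20.0):
    """Blurriness proxy: mean horizontal width of vertical edges, where the
    width is the distance between the local extrema bracketing each edge."""
    img = img.astype(float)
    H, W = img.shape
    gx = np.diff(img, axis=1)               # gx[r, c] = img[r, c+1] - img[r, c]
    widths = []
    for r in range(H):
        row = img[r]
        for c in range(W - 1):
            g = gx[r, c]
            if abs(g) < grad_thresh:
                continue                    # not treated as an edge pixel
            # walk outward until the intensity profile stops rising/falling
            left = c
            while left > 0 and (row[left] - row[left - 1]) * g > 0:
                left -= 1
            right = c + 1
            while right < W - 1 and (row[right + 1] - row[right]) * g > 0:
                right += 1
            widths.append(right - left)
    return float(np.mean(widths)) if widths else 0.0
```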

Other sharpness/blurriness estimators work in the spatial domain but do not attempt to locate edges. Wee and Paramesran [226] estimate sharpness based on the dominant eigenvalues of the covariance matrix of the image pixels. Zhu and Milanfar [227] estimate sharpness based on the SVD of the local image-gradient matrix. Roffet et al. [228] generate a blurred version of the input image and then estimate blurriness based on the variation between neighboring pixels in the input versus blurred images. Tsomko and Kim [229] estimate blurriness by using the variance of the prediction residue, which is computed as the difference between adjacent pixels. Debing et al. [230] measure blurring artifacts from H.264/AVC compression by averaging local blur values calculated at the boundaries of macroblocks.

A number of sharpness/blurriness estimators have also been developed based on transform-domain techniques. Marichal et al. [231] estimate sharpness based on the histogram of nonzero DCT coefficients among all blocks of the transformed image. Caviedes and Gurbuz [232] estimate sharpness based on the kurtosis of DCT coefficients corresponding to edge profiles. Zhang et al. [233] estimate sharpness based on the peakedness of the image's energy spectrum. Shaked and Tastl [234] estimate sharpness based on the ratio of high-pass to low-pass frequency energy of the spatial derivative of each line/column. Kristan et al. [235] estimate sharpness based on the uniformity of the image spectrum. Hassen et al. [236] estimate sharpness based on the local phase coherence in the complex wavelet domain. Vu and Chandler [237] estimate sharpness based on a weighted average of the log energies of the image's DWT subbands.

Hybrid approaches have also been developed which employ a combination of edge-/pixel-based and transform-based methods. Hybrid approaches have generally proven to perform better than edge-only-based or transform-only-based methods, though at the expense of added computational complexity. Chen and Bovik [238] estimate blurriness based on the image gradient histogram and a wavelet-based edge map. Vu and Chandler [239] estimate sharpness based on a combination of spectral and spatial measures. The spectral measure uses the slope of the local magnitude spectrum, and the spatial measure uses the local total variation of pixel values; these two measures are then combined to generate an image sharpness map, which can be collapsed into a scalar indicating overall perceived sharpness.
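
A much-simplified, global (rather than block-wise) sketch of combining a spectral cue with a spatial cue, loosely in the spirit of [239], is given below (Python/NumPy); the logistic squashing and the geometric-mean combination are ad hoc assumptions made purely for illustration.

```python
import numpy as np

def spectral_slope(img):
    """Slope of the log magnitude spectrum versus log radial frequency;
    blurrier images tend to have steeper (more negative) slopes."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(img.astype(float))))
    h, w = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    keep = (radius > 0.01) & (radius < 0.5)
    slope, _ = np.polyfit(np.log(radius[keep]), np.log(F[keep] + 1e-12), 1)
    return slope

def total_variation(img):
    """Mean absolute difference between neighboring pixels (a spatial cue)."""
    img = img.astype(float)
    return float(np.mean(np.abs(np.diff(img, axis=0))) +
                 np.mean(np.abs(np.diff(img, axis=1))))

def spectral_spatial_sharpness(img):
    """Combine the two cues; larger values suggest a sharper image."""
    s_spectral = 1.0 / (1.0 + np.exp(-(spectral_slope(img) + 3.0)))  # squash to (0, 1)
    s_spatial = total_variation(img) / 255.0
    return float(np.sqrt(s_spectral * s_spatial))                    # geometric mean
```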

4.1.2. Methods for JPEG Compression Artifacts

Numerous NR IQA algorithms have also been developed specifically for JPEG images. The general approach involves measuring the edge strength at block boundaries and then using this measure to estimate the visibility of the blocking, often based on masking. Quality is then determined based on this estimate of perceived blockiness.

In [240], Wang et al. presented an NR measure of blockiness which models blocky images as a nonblocky image corrupted by a pure blocky signal. Quality is estimated based on the energy of the blocky signal. In [241], Wang et al. proposed a more efficient method that estimates image blockiness based on the average difference across block boundaries and the activity of the image signal.
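
A minimal sketch in the spirit of the boundary-difference approach of [241] is given below (Python/NumPy); an 8x8 block grid aligned with the image origin and an unweighted boundary-to-interior ratio are simplifying assumptions.

```python
import numpy as np

def blockiness(img, block=8):
    """Ratio of the average absolute difference across block boundaries to the
    average absolute difference elsewhere (larger = more blocky)."""
    img = img.astype(float)
    dh = np.abs(np.diff(img, axis=1))            # horizontal neighbor differences
    dv = np.abs(np.diff(img, axis=0))            # vertical neighbor differences
    cols = np.arange(dh.shape[1])
    rows = np.arange(dv.shape[0])
    h_boundary = dh[:, (cols + 1) % block == 0]  # differences straddling column boundaries
    v_boundary = dv[(rows + 1) % block == 0, :]
    h_inside = dh[:, (cols + 1) % block != 0]
    v_inside = dv[(rows + 1) % block != 0, :]
    boundary = np.concatenate([h_boundary.ravel(), v_boundary.ravel()]).mean()
    inside = np.concatenate([h_inside.ravel(), v_inside.ravel()]).mean()
    return float(boundary / (inside + 1e-12))
```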

In [242], Bovik and Liu presented an NR measure of blockiness which operates in the DCT domain. Blocking artifacts are first located via detection of 2D step functions, and then an HVS-based measurement of blocking impairment is employed.

In [243], Meesters and Martens presented an NR measure of blockiness which operates by detecting low-amplitude step edges and by estimating various edge parameters using a Hermite transform.

In [244], Pan et al. presented an NR measure of blockiness in images/video coded via the block discrete cosine transform. Quality is estimated based on directional information measured for edges. The authors demonstrate that their method does not require the exact location of the block boundary and is thus invariant to displacements, rotations, and scalings of the images.

In [245], Perra et al. exploited properties of the Sobel operator to generate an NR blockiness index based on two measures: one which quantifies the luminance variation of block boundary pixels and one which quantifies the luminance variation of the remaining pixels. Similarly, in [246], Zhang et al. presented an NR blockiness measure which calculates the image's luminance gradient matrix by using the Sobel operator. This matrix is used with HVS-based adjustments (luminance adaptation and texture masking) to estimate the severity of blocking artifacts and the annoyance of overly flat regions in low-bit-rate images.

Park et al. [247] presented an NR measure for blocking artifacts by modeling abrupt changes between adjacent blocks in both the pixel domain and the DCT domain. Similarly, in [248], Chen et al. presented an NR measure of JPEG image quality by using selective gradient and plainness measures followed by a boundary selection process that distinguishes the blocking boundaries from the true edge boundaries.

In [249], Suresh et al. presented a machine-learning-based NR approach for JPEG images. Their algorithm operates by estimating the functional relationship between several visual features (such as edge amplitude, edge length, background activity, and background luminance) and subjective scores. The problem of quality assessment is then transformed into a classification problem and solved via machine learning.

In [250], Suthaharan presented an NR technique for quantifying blocking artifacts via two units. The first unit measures the visibility of distortions as a combination of blocking artifacts and undistorted image edges. The second unit uses patterns of the least significant bits to identify image regions that are affected by JPEG compression. Both units are combined to form a normalized visually significant blocking artifact measure.

In [251], Chen and Bloom presented an NR DFT-based measure of blockiness. Given an image, the absolute difference between horizontally adjacent pixels is calculated, normalized, and averaged along each column. Then, a 1D DFT is employed to derive separate measures of horizontal and vertical blockiness. The overall blockiness is estimated based on a combination of these two directional measures.
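
A rough sketch of this DFT-based idea, for the horizontal direction only (the vertical direction is analogous), is given below (Python/NumPy); reading out the spectral energy at the harmonics of an assumed 8-pixel block period is a simplification of the procedure in [251].

```python
import numpy as np

def dft_blockiness_horizontal(img, period=8):
    """Estimate horizontal blockiness from the strength of the block-period
    harmonics in the column-averaged absolute pixel differences."""
    img = img.astype(float)
    d = np.abs(np.diff(img, axis=1)).mean(axis=0)     # 1D profile over columns
    d = d / (d.mean() + 1e-12)                        # normalize the profile
    spectrum = np.abs(np.fft.rfft(d))
    n = d.size
    harmonics = [round(k * n / period) for k in range(1, period // 2 + 1)]
    harmonics = [h for h in harmonics if h < spectrum.size]
    # fraction of (non-DC) spectral energy concentrated at the block harmonics
    return float(np.sum(spectrum[harmonics]) / (np.sum(spectrum[1:]) + 1e-12))
```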

4.1.3. Methods for JPEG2000 Compression Artifacts

Several NR IQA algorithms for JPEG2000 images have also been developed. The general approach involves measuring the amount of blurring or edge-spread by using edge-detection techniques. Other methods have also been developed based on natural-scene statistics.

In [252], Ong et al. presented an NR algorithm for JPEG2000 blurring which operates via four steps: (1) computing a gradient direction for each pixel; (2) edge detection by using a Canny edge detector; (3) measuring the edge-spread, that is, the extent of the slope of each edge along and perpendicular to the gradient; and (4) estimating quality based on the results from the previous steps.

In [253], Li et al. presented a principal-components-analysis-based NR method for JPEG2000 images. First, by viewing all edge points in JPEG2000 images as distorted or undistorted, local features are extracted at each of the detected edge points to indicate blurring and ringing. A model is then employed to map these local features to local distortion estimates through the probabilities of the edge points being distorted or undistorted. Quality is estimated based on the local distortion estimates. A similar method can also be seen in [254].

In [255, 256], Sazzad et al. presented an approach which uses pixel distortion and edge information for NR IQA of JPEG2000 images. Their technique operates under the assumption that human visual perception is very sensitive to edge information in images. Visual artifacts manifest as pixel distortions around these edges, and thus quality is estimated by measuring these pixel distortions.

Other algorithms for NR IQA of JPEG2000 images have used measures of the changes in the statistical regularities of DWT/DCT coefficients to estimate quality. For example, Sheikh et al. [257] reported that when JPEG2000 images are decomposed through a wavelet transform, the subband probabilities can indicate the loss of visual quality. Quality is estimated in [257] by first computing features based on these probabilities from all wavelet subbands and then applying a nonlinear combination of the features.

Zhou et al. [258] presented an NR algorithm to evaluate JPEG2000 images which employs three steps: (1) dividing the image into blocks, among which textured blocks are employed for quality prediction based on natural-scene statistics; (2) measuring positional similarity via projections of wavelet coefficients between adjacent scales of the same orientation; and (3) using a general regression neural network to estimate quality based on the features from the previous two steps.

Zhang et al. [259] utilized kurtosis in the DCT domain for NR IQA of JPEG2000 images. Three NR quality measures are proposed: (1) frequency-based 1D kurtosis, (2) basis-function-based 1D kurtosis, and (3) 2D kurtosis. The proposed measures were argued to be advantageous in terms of their parameter-free operation and their computational efficiency (they do not require edge/feature extraction).

4.1.4. Methods for Other/Multiple Artifacts

Researchers have also proposed NR methods for other distortion types/combinations, most commonly noise, blurring, blocking, and/or ringing.

In [260], Li presented an NR IQA method which combines NR measures for three different image distortion types (blur, noise, and block/ringing artifacts). Blur is characterized by a 2D parameterized edge model [261]. Impulse noise is measured by the percentage of noisy pixels, and Gaussian noise is estimated as the energy of noise after the application of a denoising algorithm [262]. Blocking is measured by the likelihood of detecting artificial horizontal or vertical edges, and ringing is measured by a ratio indicating the deviation of the spectrum of noise removed by an anisotropic diffusion filter.

In [263], Corner et al. presented an NR noise estimation technique based on data masking. Their method calculates a histogram of the local standard deviations over blocks after filtering and edge suppression using a gradient mask. Based on experimental data, they reported that the histogram median value supplied the most accurate final noise estimate.

In [56], Süsstrunk and Winkler presented an NR method for color images degraded by compression or transmission loss. The authors conducted a psychophysical experiment to obtain quality ratings, and they compared their proposed method with the ratings. Three measures for blockiness, blurriness, and colorfulness are proposed and shown to successfully predict the subjective ratings.

In [264, 265], Gabarda and Cristóbal presented an entropy-based NR IQA algorithm. Their method is based on a multiresolution analysis of images for which they conclude that the entropy per pixel is strictly decreasing with respect to decreasing resolution. Quality is estimated based on anisotropy measures obtained via directional entropy estimates.

In [266], Brandão and Queluz presented an NR algorithm for estimating quantization artifacts due to lossy encoding such as JPEG or MPEG. Their method is based on the statistics of DCT coefficients whose distribution can be modeled by a Laplacian probability density function. The resulting coefficient distributions are used to estimate local errors, and these local error estimates are then used to estimate quality.

In [267], Cohen and Yitzhaky presented an NR method to identify and quantify the impact of noise and blur on quality. Their method operates by measuring common statistics of images obtained from their power spectra. The authors reported that manipulations of the distorted image's spectrum enhance the appearance of the distortion. Accordingly, their resulting method estimates the visual impact of the distortion based on deviations from the expected power spectrum.

4.1.5. Non-Distortion-Specific Methods

Researchers have also developed more general-purpose NR IQA algorithms which do not attempt to detect specific types of distortions. Methods of this type typically reformulate the IQA problem into a classification and regression problem in which the regressors/classifiers are trained using specific features. The relevant features are either discovered via machine learning or specified by using natural-scene statistics.

In [268], Tong et al. presented a learning-based NR algorithm which attempts to directly estimate quality via machine learning. First, some training examples are prepared for both high-quality and low-quality classes. Next, a binary classifier is built on the training set. Finally, the quality estimate of an unlabeled example is denoted by the extent to which it belongs to these two classes.

In [269], Tang et al. presented the LBIQ algorithm, another learning-based NR IQA method. LBIQ measures various low-level quality features derived from natural-scene and texture statistics. LBIQ estimates quality via a regression-based combination of the features.

In [270], Li et al. presented an NR algorithm which operates based on a general regression neural network (GRNN). Various features such as the mean value of a phase congruency map, the entropy of the phase congruency map, and the entropy and gradient of the distorted image are extracted. Quality is estimated by approximating the functional relationship between these features and subjective scores using a GRNN.

In [271, 272], Ye and Doermann presented the CBIQ-I and CBIQ-II algorithms which operate based on visual codebooks. The codebooks consist of Gabor-based features extracted from local image patches. The codebooks form a feature space, which is quantized, and then used to yield an estimate of quality via either example-based regression or support-vector regression.

Another popular approach to NR IQA is to use natural-scene statistics. The main idea in this approach is that natural images demonstrate certain statistical regularities that can be affected in the presence of distortion. Thus, quality can be estimated by extracting features which indicate the extent to which these statistics deviate in the distorted image. See [36] for a more thorough discussion of the use of natural-scene statistics for NR IQA.

Methods of this type usually contain two stages: (1) distortion identification and (2) distortion-specific quality assessment. Both stages require training: the classifier used to measure the probability that each distortion type exists in the distorted image requires training, and the regression model for each distortion type used to map the measured features to an associated quality score must also be trained.

In [273], Moorthy and Bovik presented the BIQI algorithm which estimates quality based on statistical features extracted using the 9/7 DWT. The subband coefficients obtained are modeled by a generalized Gaussian distribution, from which two parameters are estimated and used as features. The resulting 18-dimensional feature vectors (3 scales × 3 orientations × 2 parameters) are used to characterize the distortion and estimate quality via the aforementioned two-stage classification/regression framework.
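
A sketch of this per-subband feature extraction is given below (Python, using PyWavelets' 'bior4.4' filters as a stand-in for the 9/7 DWT and a simple moment-matching fit of the generalized Gaussian; the classification/regression stages are omitted).

```python
import numpy as np
import pywt
from scipy.special import gamma

def fit_ggd(coeffs):
    """Moment-matching estimate of the generalized-Gaussian shape (alpha) and
    scale (sigma) for a vector of subband coefficients."""
    coeffs = coeffs.ravel()
    sigma = np.std(coeffs) + 1e-12
    rho = (np.mean(np.abs(coeffs)) ** 2) / (np.mean(coeffs ** 2) + 1e-12)
    alphas = np.arange(0.2, 10.0, 0.001)
    rho_alpha = gamma(2.0 / alphas) ** 2 / (gamma(1.0 / alphas) * gamma(3.0 / alphas))
    alpha = alphas[np.argmin(np.abs(rho_alpha - rho))]
    return alpha, sigma

def biqi_style_features(img, levels=3, wavelet='bior4.4'):
    """18 features: (shape, scale) for each of 3 scales x 3 orientations."""
    coeffs = pywt.wavedec2(img.astype(float), wavelet, level=levels)
    feats = []
    for detail in coeffs[1:]:                # skip the approximation band
        for band in detail:                  # horizontal, vertical, diagonal
            feats.extend(fit_ggd(band))
    return np.array(feats)
```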

In [274], Moorthy and Bovik presented the DIIVINE algorithm, which improves upon BIQI by using a steerable pyramid transform with two scales and six orientations. The features extracted in DIIVINE are based on statistical properties of the subband coefficients. A total of 88 features are extracted and used to estimate quality via the same two-stage classification/regression framework.

In [275, 276], Saad et al. presented the BLIINDS-I and BLIINDS-II algorithms which estimate quality based on DCT statistics. BLIINDS-I operates on 17 × 17 image patches and extracts DCT-based contrast and DCT-based structural features. DCT-based contrast is defined as the average of the ratio of the non-DC DCT coefficient magnitudes in the local patch normalized by the DC coefficient of that patch. The DCT-based structure is defined based on the kurtosis and anisotropy of each DCT patch. BLIINDS-II improves upon BLIINDS-I by employing a generalized statistical model of local DCT coefficients; the model parameters are used as features, which are combined to form the quality estimate.

In [277], Mittal et al. presented the BRISQUE algorithm, a fast NR IQA algorithm which employs statistics measured in the spatial domain. BRISQUE operates on two image scales; for each scale, 18 statistical features are extracted. The 36 features are used to perform distortion identification and quality assessment via the aforementioned two-stage classification/regression framework. Related work on the use of BRISQUE features and discriminatory latent characteristics for NR IQA can be found in [278].
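
The spatial features in BRISQUE are built on mean-subtracted, contrast-normalized (MSCN) coefficients. The sketch below (Python/SciPy) computes the MSCN map and a few illustrative summary statistics; the full method fits generalized Gaussian and asymmetric generalized Gaussian models to these quantities at two scales, which is omitted here for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(img, sigma=7.0/6.0, c=1.0):
    """Mean-subtracted, contrast-normalized coefficients of a grayscale image."""
    img = img.astype(float)
    mu = gaussian_filter(img, sigma)
    var = gaussian_filter(img ** 2, sigma) - mu ** 2
    return (img - mu) / (np.sqrt(np.maximum(var, 0.0)) + c)

def brisque_style_features(img):
    """A few illustrative statistics of the MSCN map and of products of
    horizontally neighboring MSCN values (stand-ins for the GGD/AGGD fits)."""
    m = mscn(img)
    pair_h = m[:, :-1] * m[:, 1:]
    return np.array([m.var(), np.mean(np.abs(m)),
                     pair_h.mean(), pair_h.var()])
```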

4.2. Reduced-Reference IQA

Reduced-reference (RR) IQA methods provide a solution for cases in which the reference image is not fully accessible. Methods of this type generally operate by extracting a minimal set of parameters from the reference image, parameters which are later used with the distorted image to estimate quality.

An important question in RR research is how to determine effective parameters for the IQA task. In [279], Wang and Simoncelli argued that the appropriate RR features should (1) provide an efficient summary of the reference images, (2) be sensitive to a variety of image distortions, and (3) be relevant to visual perception of image quality.

In [280], Maalouf et al. presented an RR algorithm based on the grouplet transform. Given a reference image and its distorted version, the grouplet transform is applied to both images in order to extract information regarding textures and gradients of the images. This information is then used with CSF filtering and thresholding to obtain sensitivity coefficients. Quality is estimated by comparing the sensitivity coefficients of the distorted image with the sensitivity coefficients of the reference image.

In [281], Gunawan and Ghanbari presented an RR algorithm which operates based on a local harmonic analysis for images degraded with blocking or blurring. Local harmonic amplitude information is computed from an edge-detected image, and this information is then used with the distorted image to estimate quality.

In [282], Rehman and Wang presented an RR version of SSIM [123]. Instead of directly constructing an RR algorithm to predict subjective quality, this method extracts statistical features from a multiscale, multiorientation divisive normalization transform. The authors construct a distortion measure by following a philosophy analogous to that used in the construction of SSIM. Based on the linear relationship between RR SSIM and FR SSIM given a fixed distortion type, a regression-by-discretization method is used to estimate quality. Related work can also be seen in [283].

In [284], Chono et al. presented an RR algorithm using distributed source coding for remotely monitoring image quality. In this scheme, an image server extracts a feature vector from the reference image and then transmits its Slepian-Wolf syndrome by using a low-density parity-check code. At the decoder, the feature vector and the received (distorted) image are used to estimate quality.

Other types of RR IQA methods operate based on natural-scene statistics. For example, in [279], Wang and Simoncelli presented an RR IQA method which operates based on a wavelet-domain statistical model of images. Quality is estimated based on the Kullback-Leibler divergence between the marginal probability distribution of wavelet coefficients of the reference and distorted images. Similar work can also be found in [285].
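
A rough, histogram-based sketch of this flavor of RR IQA is given below (Python/NumPy); plain coefficient histograms and a fixed bin range stand in for the parametric marginal model and fitting procedure actually used in [279], and the subband decomposition is assumed to be supplied by the caller.

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    p = p / (p.sum() + eps); q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rr_features(subband, bins=64, rng=(-50, 50)):
    """RR 'feature': a coarse histogram of subband coefficients (this is the
    small amount of side information sent for the reference image)."""
    hist, _ = np.histogram(subband.ravel(), bins=bins, range=rng)
    return hist.astype(float)

def rr_distortion(ref_subbands, dst_subbands):
    """Sum of KL divergences between reference and distorted subband histograms."""
    return sum(kld(rr_features(r), rr_features(d))
               for r, d in zip(ref_subbands, dst_subbands))
```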

In [286], Xue and Mou presented an RR algorithm based on the steerable pyramid. At each pyramid scale, a strongest component map (SCM) is constructed by assembling coefficients with maximum amplitudes among different orientations. Several statistics of the SCM serve as the RR features, which are used with the distorted image to estimate quality.

In [287], Avanaki et al. presented an RR algorithm which operates by using watermarking to embed RR features into an image; these features can be extracted and used for IQA in the event that the image is distorted. The RR features used in [287] consist of approximation coefficients of a parameterized DWT of the image. At the receiver, the embedded features are extracted and compared to the corresponding features of the distorted image to estimate quality.

In [288], Li and Wang presented an RR algorithm based on a divisive normalization image representation. By using a Gaussian-scale mixture-based statistical model of wavelet coefficients, a divisive normalization transform (DNT) is applied to the images. Quality is estimated by comparing a set of RR statistical features extracted from DNT-domain representations of the reference and distorted images.

In [289], Ma et al. presented an RR algorithm based on DCT coefficient statistics. First, the DCT coefficients of image blocks are grouped into several representative subbands. Next, a generalized Gaussian distribution is employed to model the distribution of coefficients within each subband. Quality is then estimated based on the distance between distributions of the reference and distorted images. Similar work can also be seen in [290].

In [291], Soundararajan and Bovik presented a framework for RR IQA based on information-theoretic measures of differences between the reference and distorted images by using the entropies of wavelet coefficients. This algorithm differs from other approaches in terms of the amount of data needed for the entropy-difference calculations and in terms of the scalability in the amount of information that is needed from the reference image.

5. Seven Challenges in Image Quality Assessment

The previous sections of this paper have focused on reviewing the current knowledge and accomplishments in IQA research. In this section, the focus now shifts toward unsolved aspects of IQA. Here, I specifically discuss seven open challenges, all of which are critical for furthering IQA research and facilitating deployment and integration of IQA into existing and forthcoming applications.

It is important to stress again that these seven challenges do not represent an exhaustive list of important research topics in IQA. Other notable areas such as IQA of stereoscopic images, IQA of computer graphics, and video quality assessment are not discussed. The seven challenges described in this section were chosen to highlight some key limitations of current IQA knowledge and to point out areas which can begin to answer broader questions on IQA.

5.1. Challenge 1: HVS Models and Natural Images

As discussed in Section 2, knowledge of how the HVS analyzes visual input has played a pivotal role in IQA research. However, it must be stressed that our current understanding of the HVS, and thus the computational HVS modeling used in IQA, is far from complete. The vast majority of computational models do not model beyond primary visual cortex (V1), and many researchers have argued that even current V1 models are still incomplete. Moreover, how visual stimuli are analyzed in V1 is only one contributor to visual perception, let alone to judgments of image quality.

A key difficulty in developing a more complete computational HVS model is the fact that visual neurons often respond quite differently to naturally occurring stimuli than they do to simple, controlled stimuli. Because neural responses are highly nonlinear, it has proved difficult to predict neural responses to natural images based on responses to more simple stimuli. The same difficulty carries over to visual perception; psychophysical data collected using natural scenes can be difficult to predict based on data collected using more simple, controlled stimuli.

5.1.1. The Need for Improved V1 Models

While the characterization of V1 based on its responses to simple stimuli has proved useful, other researchers have suggested that in order to fully understand the response properties of visual cortex, one must first understand the signal that is to be encoded, that is, natural scenes. (As defined by Field [292], natural scenes refer to images from the natural environment which are devoid of man-made objects. However, the exclusion of man-made objects has since been relaxed in the visual psychology literature due, in part, to the presence of such objects in several popular natural-scene databases [293, 294]. In the image-processing literature, the term natural images is more commonly used and defined to be photographic images containing any naturally occurring subject matter that may occur during normal photopic or scotopic vision.) Field [292, 295, 296] postulated that cortical neurons are tuned to encode natural scenes in an efficient manner, and thus this special class of input has the potential to reveal properties of visual cortex beyond those invoked by using simple stimuli. Indeed, several studies have shown that neural networks trained with natural scenes under various sparse-coding objectives yield bases which possess striking similarities to simple and complex cortical cells [297, 298]. Other studies, focused on modeling the statistics of natural scenes, have revealed properties such as 1/f amplitude spectra [292, 299] (where f denotes spatial frequency) and the importance of phase [300, 301] and edge cooccurrences [109] in perception. More recently, natural images have been used in psychophysical studies [55, 302–305], revealing both supportive and confounding evidence for previous theories of V1.

In a recent, controversial essay, Olshausen and Field argued that as much as 85% of V1 has yet to be explained [306]. Much of what is known about the response properties of V1 neurons has come from neurophysiological studies employing single-cell recordings. During such recordings, many V1 neurons are not tested due to the fact that they yield weak extracellular action potentials, they yield low firing rates, or they are otherwise visually unresponsive. Olshausen and Field estimated that only 40% of V1 neurons have actually been tested. They further reported that, of these V1 neurons that have been tested, current computational models can explain only 30–40% of the response variance when the neurons are presented with natural images.

The lack of knowledge about how V1 operates when presented with naturally occurring stimuli is even more troubling for IQA because there is always an image present (usually a natural image), and thus our models must necessarily be equipped to handle such stimuli. In terms of visual masking, computational models have not been extensively tested on thresholds measured using natural images as masks. Instead, as mentioned in Section 2.1, many of the models employed in IQA algorithms use parameters which have been optimized to fit thresholds measured for traditional targets placed upon unnatural masks (e.g., sine waves, Gabor patterns, or noise).

As an example of how natural images can affect masking results, Figure 7 shows relative detection thresholds measured for wavelet distortions presented against various image patches [64]. In this plot, the image patches, which are shown along the horizontal axis, have been ordered by eye based on the ability to learn and recognize changes to the content—from simplistic edges to complex textures (from left to right). All images were matched in RMS contrast (values of 0.32 and 0.64 were tested). The data points denote relative threshold elevation, given by CT/CT_edge, where CT denotes the contrast threshold for detecting the distortion and CT_edge denotes the average contrast threshold for detecting the distortion in images from the edge category. The dashed lines denote average relative threshold elevations for each of the three image types.

Notice from Figure 7 that the thresholds generally increase as the images become more visually complex and thus become harder to learn and recognize, despite the fact that all images have the same RMS contrast. In [64], we reported that a computational model of masking (described in Section 2.1) performed well in predicting only the thresholds for the texture patches; the model generally failed for patches in the edges and structures categories. In fact, even after optimizing and extending the parameters of the model based on the actual thresholds measured for the edges and structures, the predictions remained quite subpar.

5.1.2. Ground-Truth Data for Natural Images

For IQA, one of the primary limitations when designing a computational neural model that can handle natural images is the lack of ground-truth data. For masking, which is crucial for IQA in the high-quality regime, there exists no database of local contrast detection thresholds for natural images (though, see [49, 58] for related studies). For lower quality images, there exists no database of ground-truth quality maps denoting local quality ratings for natural images. These types of local ground-truth data would be especially useful for training and testing purposes since they can provide insights into whether the local processing is correctly modeled or whether further adjustments are warranted.

Of course, the main difficulty in creating local masking and quality-rating data is the enormous time commitment required for the experiments. Nonetheless, even coarse maps would be a useful first step. For masking, we have begun to address this issue by measuring local masking maps for detection of vertically oriented 3.8 c/deg log-Gabor distortions in images from the CSIQ database [27]. Figure 8 shows preliminary results for 10 of the images.

In Figure 8, the first row shows the original images (masks), and the second row shows maps of local detection thresholds in which brighter values denote greater thresholds (greater masking); these maps were averaged across subjects and trials. The third row shows the predicted masking maps obtained from the computational V1 model of masking used in [64], which is based on the standard model described in Section 2.1. For comparison, the fourth and fifth rows show the predicted masking maps obtained from MS-SSIM [175] and the detection-based stage of MAD [139]. Because neither of these IQA algorithms was designed to predict thresholds, we selected a fixed index for each algorithm corresponding to an at-threshold amount of distortion (an MS-SSIM index of 0.995, a MAD index of 15). Then, for each patch of the image, we successively added log-Gabor distortions until the algorithm yielded that at-threshold index for the patch. Finally, the resulting masking map was computed as the RMS contrast of the distortion in each of these distorted patches.

Overall, the neural masking model yields the best predictions of the masking maps with an average Pearson correlation coefficient (CC) between the actual and predicted thresholds of 0.70. The best prediction from this model is on image cactus for which the CC is 0.95. However, there are many notable failure cases, particularly on more structured images. The worst prediction from this model is on image couple for which the CC is 0.42. Both MS-SSIM and MAD perform considerably worse. MS-SSIM yields an average CC of 0.44, with the best and worst predictions on, respectively, images foxy (CC = 0.67) and native American (CC = 0.13). MAD yields an average CC of 0.64, with the best and worst predictions on, respectively, images bridge (CC = 0.86) and shroom (CC = 0.28).

Of course, MS-SSIM was never designed to estimate masking, and the masking model employed in MAD is a simplistic, spatial-domain-only local contrast measure. However, the performance of the neural masking model highlights the need for further research in this area.

5.1.3. Models of Areas Beyond V1

One possible explanation for the failure cases of V1 models in predicting masking in natural images is the fact that such masking is attributable to visual processing in areas beyond V1 [81, 307]. For example, by comparing EEG recordings obtained during unmasked versus masked detection, Fahrenfort et al. concluded that “masking derives its effectiveness, at least partly, from disrupting reentrant processing, thereby interfering with the neural mechanisms of figure-ground segmentation and visual awareness itself” [307]. Thus, even if a complete model of V1 for natural images was available, there still remains the question of how masking and image quality are influenced by visual processing in areas beyond V1.

Unfortunately, much less is known about the mechanisms and objectives of visual processing beyond V1 and the influences such processing might have on V1 itself. It is important to note that approximately half of V1's innervations come in the form of corticocortical feedback from higher levels [307]. Lee et al. [308] have proposed that the higher levels work in conjunction with V1 to perform complex tasks such as pattern analysis and object recognition. Rao and Ballard [309] have suggested that the higher levels function as predictive coders whose feedback connections to V1 carry the prediction and whose feedforward connections from V1 convey the prediction's error.

One generally accepted belief is that higher levels serve to efficiently encode the joint activity of V1 neurons [310–312]. Based on single-cell recordings in V2, Willmore et al. argued that V2 neurons integrate the outputs of V1 neurons across spatial frequency to enhance the representation of edges [313]. Related earlier work in visual psychophysics postulated similar theories that an image's features are integrated temporally across scale-space in a coarse-to-fine (global-to-local) fashion [314–317]. In terms of IQA, any distortions which disrupt this integration of neural responses/features could potentially lead to severe degradations in quality.

As an example, the global precedence theory of Navon [314] and the related scale-space integration theory of Hayes [317] both advocate that an image's edges are visually processed by combining information across spatial scales, beginning with the coarsest scale and ending with the finest available scale. Under this theory, eliminating or distorting content at an intermediate spatial scale should disrupt the HVS's ability to integrate coarse and fine information into a single percept. Instead, the result would be two percepts: a blurred version of the object and a separate, erroneous high-frequency pattern. This disruption of visual integration and its effects on image quality can be readily demonstrated, as shown in Figure 9. In this figure, the original image is shown on the left, and the distorted image is shown on the right. The distorted image was generated via notch filtering (a band of intermediate discrete-space radial frequencies has been eliminated for all orientations). Because this filtering disrupts the HVS's ability to integrate information at different scales, the high-frequency content is perceived almost as additive noise on top of a second percept of a blurry image in the background.

The demonstration in Figure 9 highlights the need for further research on the role of higher-level visual processing in image quality. Although the mechanisms and functional objectives of these higher-level areas remain largely unknown, it is still possible to incorporate general principles of object perception and cognition derived from experiments or heuristics. For example, in the VSNR algorithm [125] (described previously in Section 3), we developed a basic model of global precedence specifically for IQA; although this model was based more on empirical observations than on psychophysical data, it has proved quite effective for certain distortion types. Nonetheless, further research in this area, both in vision science and in IQA, is needed to help move IQA beyond its current capabilities to a level that can begin to capitalize on the properties of higher-level visual processing.

5.2. Challenge 2: Compound and Suprathreshold Distortions

Another challenge which designers face when incorporating psychophysical findings into IQA algorithms is the fact that distortions can be both compound and suprathreshold. The term compound is used to describe a visual target that stimulates more than one channel in the HVS's multichannel analysis. Suprathreshold refers to clearly visible targets that are at a contrast beyond the threshold of detection. In IQA, many of the distortions that are encountered meet both of these criteria.

5.2.1. Simple Targets versus Compound Distortions

In image processing applications, a wide variety of distortions are possible, and an IQA algorithm should ideally be able to handle such distortions. However, in visual psychophysics, the visual targets are generally much more simplistic, often consisting of sine-wave gratings, Gabor functions, or other highly controlled spatial patterns that are localized in space, frequency, and/or orientation. Highly controlled and localized targets are preferable in such experiments because they can be designed to stimulate only one channel of the HVS's multichannel analysis. However, this difference between simple targets and actual distortions poses a difficulty to designers who wish to incorporate psychophysical findings into an IQA algorithm.

In visual psychophysics, distortions would be considered compound targets that consist of multiple simple targets (e.g., multiple sine waves, Gabor functions, or wavelets). Due to the nonlinear behavior of the HVS, it is difficult to apply knowledge about the visibility of simple targets to predict how the HVS will respond to compound distortions. Visual summation studies begin to address this issue by measuring thresholds for detecting compound targets and comparing them with thresholds measured for the individual components of the compound (the individual simple targets).

Traditionally, summation has been tested by using a visual detection paradigm, in which the detection threshold measured for a compound target (e.g., a plaid composed of two sine waves) is compared with detection thresholds measured separately for its individual components (the sine waves). Detection of the compound should be an easier task because the HVS has a greater chance of detecting the target now that it contains two components instead of just one. Further, assuming that the components within the compound are detected by separate HVS channels, any changes in detection thresholds for compound versus simple targets would point toward interchannel cooperation, that is, summation of channel responses.

As mentioned in Section 2.1, summation is typically modeled via a Minkowski sum in which the amount of summation is controlled by the summation exponent β (see (4)). A value of β = 1 denotes complete summation, whereas a value of β → ∞ denotes no summation. The majority of psychophysical studies have found β to be in the range 3–5; that is, the compound target is only slightly more detectable than either of its components, given that the components are analyzed by separate HVS channels.

However, as mentioned in Section 5.1, the presence of an image in the background can significantly change neural responses and psychophysical results, and visual summation is no exception. In [55], we measured detection thresholds for simple and compound distortions generated via quantization of individual DWT subbands (yielding simple distortions) and pairs of DWT subbands (yielding compound distortions). When the distortions were presented against a gray background (i.e., unmasked detection, as used in previous studies), we found results which were very consistent with previous studies: β in the range 3–5. However, when the distortions were presented against either of the two natural images tested in [55], much greater summation was found: β in the range 1.3–1.6. One possible explanation for our finding is the presence of intra-channel summation, wherein a single HVS channel is used to detect both components of the compound target. Intra-channel summation may result from the spatial correlations that exist between the distortions and the images and/or from off-frequency looking [83, 318].
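To get a feel for what these exponents imply, the following sketch evaluates the Minkowski pooling rule for a two-component compound whose components are assumed to be equally detectable in isolation; this is a simplification for illustration only, not the exact model or stimuli used in [55].

```python
import numpy as np

def minkowski_pool(responses, beta):
    """Pooled response R = (sum_k |r_k|**beta)**(1/beta)."""
    r = np.abs(np.asarray(responses, dtype=float))
    return (r**beta).sum() ** (1.0 / beta)

# For two components that are equally detectable in isolation, the pooled
# response grows by a factor 2**(1/beta), so the predicted compound-detection
# threshold drops by 2**(-1/beta) relative to a single component.
for beta in (4.0, 1.5, 1.0):
    print(f"beta={beta}: pooled gain={minkowski_pool([1, 1], beta):.2f}, "
          f"threshold ratio={2 ** (-1 / beta):.2f}")
# beta in the 3-5 range   -> only a small threshold reduction (about 0.79-0.87)
# beta in the 1.3-1.6 range -> a much larger reduction (about 0.59-0.65)
```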

It is also important to note that summation studies have tested compound targets consisting of relatively few components. However, in IQA, the distortions can be broadband in terms of radial frequency, orientation, and other dimensions. It remains unclear how summation is affected when such distortions serve as the targets. Furthermore, the distortions encountered in IQA may contain components which constructively interact to form salient visual patterns that might further affect summation. For example, if false contours or borders are formed, visual processing could be mediated by areas beyond V1 which attempt to integrate such contours [319], thus giving rise to different summation rules. Clearly, further research is needed in order to develop improved models of summation in the presence of natural images.

5.2.2. Perceived Contrast of Suprathreshold Distortions

Much of our current understanding of visual perception has resulted from research in visual detection in which the task is to gauge whether the distortions are visible. However, many applications generate images containing suprathreshold distortions whose contrasts are well beyond the threshold of detection. Thus, another challenge which IQA designers face is how to adjust existing models to handle suprathreshold distortions.

One key finding from studies employing suprathreshold stimuli is the fact that the perceived (or “apparent”) contrast of a suprathreshold target depends much less on the target's spatial frequency than what is predicted by the CSF; that is, visual sensitivity at suprathreshold contrasts is relatively frequency independent. This finding, termed contrast constancy [320], was first reported by Georgeson and Sullivan [320] who attributed the effect to an intrachannel gain control mechanism that compensates for the CSF at suprathreshold contrasts.

In a similar study, Brady and Field [76] found contrast constancy using both Gabor patches and broadband noise patterns. Their data were successfully predicted via a model with equally sensitive octave-bandwidth spatial-frequency channels, which was reported to yield a constant response to the spatial scales of natural scenes. Brady and Field's study and a later study by Graham et al. [77] were the first to provide a theoretical account of why white noise is perceived as containing mostly high-frequency content despite the fact that the CSF peaks near 4–6 c/deg.

For IQA, not only can the distortions be suprathreshold, but the distortions are necessarily presented against an image. Thus, the perceived contrast of the distortions can be influenced by the image. In [54], we investigated whether contrast constancy is also observed in the presence of natural images by repeating Brady and Field's experiment using suprathreshold octave-bandwidth wavelet distortions presented against either a solid gray background or one of three natural images. In the experiment, subjects were asked to adjust the contrasts of 1.15, 2.3, 4.6, and 9.2 c/deg wavelet distortions such that they appeared to have the same contrast as 18.4 c/deg distortions, the latter of which was fixed in contrast at various suprathreshold values. Figure 10 shows the results of the experiment.

In Figure 10, the horizontal axis of each graph corresponds to the center spatial frequency of the wavelet distortions. The vertical axis of each graph denotes the physical RMS contrast of the distortions increasing in the downward direction. The topmost curve in each graph (square symbols) corresponds to average detection threshold values from a previous experiment [53]; this curve can be interpreted as a CSF for wavelet distortions. The lower four curves in each graph indicate how perceived contrast changes for increasingly suprathreshold distortions.

Notice in Figure 10 that the lower curves in each graph (the perceived contrast curves) are much flatter than the top curve (the CSF curve). These data indicate that, at increasingly suprathreshold contrasts, the perceived contrast becomes increasingly invariant with frequency; that is, contrast constancy is observed. The curves obtained when using the image backgrounds demonstrate a slight reduction in the perceived contrast for lower-frequency distortions, but the contrasts still demonstrate a significantly lesser dependence on spatial frequency than the top (CSF) curve.

However, although contrast constancy can be used to estimate the perceived contrast of the distortion, contrast constancy has found little use in IQA. To demonstrate why, Figure 11 depicts images to which horizontally oriented 1.15–18.4 c/deg wavelet distortions have been added (these distortions were not generated via quantization and are thus spatially uncorrelated with the image). The RMS contrasts of the distortions in these images have been allocated in two different ways: For the image on the left, the contrasts have been proportioned according to the CSF (specified by the top curve in Figure 10); for the image on the right, the contrasts have been proportioned as specified by the middle curve (solid circles) of Figure 10 for the image lena. The distortions in both of these images exhibit a total RMS contrast of approximately 0.18.
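To make the two allocation schemes concrete, the sketch below scales a set of per-band RMS contrasts so that, assuming the bandlimited distortions are mutually uncorrelated (so their contrasts add in quadrature), the total RMS contrast equals 0.18; the relative proportions used here are illustrative placeholders rather than the measured curves of Figure 10.

```python
import numpy as np

def allocate_band_contrasts(relative_proportions, total_rms):
    """Scale per-band RMS contrasts so that mutually uncorrelated bands
    combine in quadrature to the requested total RMS contrast."""
    p = np.asarray(relative_proportions, dtype=float)
    return p * (total_rms / np.sqrt((p**2).sum()))

# Illustrative proportions for the 1.15, 2.3, 4.6, 9.2, and 18.4 c/deg bands
# (placeholders, not the measured curves of Figure 10).
csf_like  = np.array([1.0, 0.6, 0.5, 0.8, 1.5])   # shaped like per-band thresholds
flat_like = np.ones(5)                             # roughly flat, as in contrast matching

for name, p in [("CSF-proportioned", csf_like), ("matching-proportioned", flat_like)]:
    c = allocate_band_contrasts(p, total_rms=0.18)
    print(name, np.round(c, 3), "total RMS:", round(float(np.sqrt((c**2).sum())), 3))
```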

Whereas the results of the contrast-matching experiments suggest that, when distortions are suprathreshold, physical contrast is a better indicator of perceived contrast than predictions based on the CSF, Figure 11 clearly demonstrates that image quality is much better preserved when the contrasts of the distortions are proportioned according to the ratios specified by the CSF. One possible explanation for this effect involves the total perceived contrast of the compound suprathreshold distortions. Specifically, by examining just the distortions in Figure 12, it is clear that the perceived contrast of the distortions when using contrast-matching-based proportions is much greater than the perceived contrast of the distortions when using CSF-based proportions. Thus, there appears to be an unexpected visual summation effect: although the perceived contrasts of the individual bandlimited distortions are relatively constant, when these distortions are combined and viewed together, they appear to visually interact in a way that affects the total perceived contrast.

Unfortunately, whereas visual summation at near-threshold contrasts has been extensively studied, visual summation of perceived contrast has received much less attention (see [321]). For IQA, visual summation of perceived contrast would further need to be tested in the presence of various images. Clearly, there is a need for more research in this area. Such research could prove particularly useful for IQA of images containing distortions which are perceived to be overlaid on top of the image as opposed to distortions which interact with the image's subject matter. In the following section, I delineate between these two types of distortions.

5.3. Challenge 3: Effects of Distortions on Image Appearance

Although the perceived contrast of the distortions can be used to estimate quality, the inherent assumption in this approach is that the viewer is looking for the distortion in the presence of the image. When the distortions are severe and spatially correlated with the image, viewers tend to base quality judgments on the interaction between the distortions and the image's objects. Properly determining and modeling the perceptual effects of this interaction is yet another challenge in IQA research.

5.3.1. Capture and Transparency in IQA

For IQA, an important consideration is whether the distorted image is perceived as a single distorted image or whether it is perceived as distortions with an image in the background. In the visual psychophysics literature, these two scenarios would fall under the aegis of capture and transparency [322], which describe whether a target + background are perceived as one combined stimulus (captured) or whether they are perceived as two separate stimuli (transparent).

One of the earliest studies pointing to the need to distinguish between capture and transparency in IQA was performed by Goodman and Pearson [171], who used a multidimensional scaling (MDS) experiment. For image quality, an MDS experiment can be used to determine the number of, type of, and interactions between the perceptual attributes that underlie quality ratings. Goodman and Pearson used MDS to specifically investigate the quality of TV pictures impaired both by additive-type distortions (e.g., noise, echo) and by coding- and transmission-type distortions (DPCM quantization artifacts and blurring). Based on their MDS analysis, they reported that one of the multiple perceptual dimensions “appears to be separating those impairments which cause the integrity of the objects in the picture to be destroyed from overlay types of impairment.”

To demonstrate the effects of capture and transparency on image quality, Figure 13 shows two distorted versions of lena. The image on the left is repeated from Figure 11; this image contains additive wavelet distortions which are spatially uncorrelated with the image. The image on the right was generated via actual quantization of the wavelet subbands, thus resulting in distortions which are spatially correlated with the image (distortions which interact with the image's objects). For both images, the contrasts of the per-subband distortions have been proportioned as specified by the middle curve (solid circles) of Figure 10 for the image lena. The distortions in both images are at a total RMS contrast of approximately 0.18. Figure 14 shows the distortions in these images presented against a solid gray background.

In the experiment in [54], subjects were instructed to match the contrasts of the wavelet distortions, a task which involves examining the distortions. However, when the distortions are severe and spatially correlated with the image, judging the quality of the image involves attending to and looking at (capturing) the image. Indeed, the images in Figures 13 and 14 suggest that it is not just the perceived contrast of the distortions that determines the image's quality; notice from Figure 14 that the overall perceived contrasts of just the distortions are quite similar for both images. Rather, quality is also determined by visual interaction between the distortions and the image's subject matter. Nachmias [323] reported a similar observation in the context of masked detection of sine-wave gratings. Namely, when a target is presented against a suprathreshold and spatially coherent mask, it is often easier to detect the target by examining its effect on the phenomenal appearance of the mask.

Thus, in addition to considering the perceived contrast of the distortions, for IQA, it is also important to take into account the effects these distortions impose on the appearance of the image. In particular, it would seem necessary to distinguish between the additive or “overlay” types of distortion and those which visually interact with the image's subject matter.

5.3.2. The Role of Visual Strategy in IQA

The effects of capture and transparency in IQA begin to address the broader issue of the adaptive nature of the HVS. Namely, the visual strategy that the HVS uses when judging image quality can change depending on both the amount of distortion and whether the distortion affects the phenomenal appearance of the image's objects. Numerous studies have shown the HVS to be a highly adaptive system, with adaptation occurring at multiple levels ranging from single neurons [69] to entire cognitive processes [324]. It seems logical to assume that the visual strategy adapts based on many other factors related to the interaction between the distortion and the image.

In [139], we asked whether IQA could be improved by modeling this adaptive nature of the HVS via two separate computational models. For images containing near-threshold distortion (high-quality images), we assumed that transparency was in effect, and thus the HVS employs a detection-based strategy in an attempt to look for the distortions. For images containing suprathreshold distortion (low-quality images), we assumed that capture was in effect, and thus the HVS employs an appearance-based strategy in an attempt to recognize the image's content.

Figures 15 and 16 demonstrate the need to explicitly model these two separate strategies. As shown in Figure 15, which contains high-quality images, when the distortions are not readily visible, our visual system seems to employ a detection-based strategy in an attempt to locate any visible differences. However, for the low-quality images shown in Figure 16, which contain suprathreshold distortions, the distortions dominate the overall appearance of each image, and thus visual detection is less applicable. Instead, for these latter images, quality is determined based primarily on our ability to recognize image content. We demonstrated in [139] that by using separate computational models for these two fundamentally different strategies and by estimating quality based on an adaptive combination of these modeled outputs, significant improvements in IQA could be achieved.

In [67], Rouse et al. performed a more specific study designed to investigate the role of recognition in determining image quality and image utility (usefulness). For high-quality images, Rouse et al. reported that the perceived utility scores do not correlate with the perceived quality scores. However, for low-quality images, a linear relationship between perceived utility and perceived quality was reported. These results and the later work in [205] suggest that the ability to recognize and utilize an image's content can play a crucial role in determining quality.

5.3.3. Higher-Level Effects of Distortion on Image Appearance

Although the efforts in [67, 139, 205] and the related work in [201] begin to address the issue of adaptive visual strategies, more generally IQA research could benefit from a better understanding of the interaction between the distortions and images. As an example, consider the two images shown in Figure 17, one of which has been compressed with JPEG and the other with JPEG2000, both at the same low bit-rate. Both images contain compression distortions which are clearly visible, since, at this low rate, there is no chance of hiding the distortions in the traditional sense. However, most people clearly prefer the JPEG2000 image over the JPEG image.

One explanation for the lower quality of the JPEG image is due to JPEG's blocking artifacts. However, there are several other, higher-level aspects which come into play: (1) the facial expressions are better preserved in the JPEG2000 image because the wavelet basis functions better capture the curvature of the eyes; (2) the objects in the image—water, skin, and hair—happen to be physically smooth in the real world, so JPEG2000's blurring is acceptable for these objects; (3) the object boundaries are less degraded in the JPEG2000 image, and therefore it is easier to recognize the image's subject matter. None of these higher-level perceptual aspects are considered in the current coding and IQA algorithms; they just happen to work in JPEG2000's favor for this particular image. An improved IQA algorithm which explicitly models higher-level perception can potentially lead to better quality estimates and thereby benefit compression and other image-processing applications.

5.4. Challenge 4: Multiple Types of Distortions

Although IQA algorithms have been tested on images containing individual types of distortion, some applications can give rise to images which simultaneously contain multiple types of distortions. This scenario adds yet another level of difficulty for IQA. An IQA algorithm must not only consider the joint effects of these distortions on the image, but also consider the effects of these distortions on each other.

Consider, for example, a traditional model in signal processing in which the output signal is given by y = T(x) + n, where T(x) is a transformation of the input signal x (e.g., blurring or JPEG2000 coding-decoding) and n is additive noise. Here, the output image may contain at least two distinct types of distortions: the distortion induced by T, which most often represents distortions that are spatially correlated with the image (e.g., blurring, ringing artifacts), and n, which is most often modeled as noise that is spatially uncorrelated with the image. Below, I describe several studies which have investigated the joint effects of these distortions on image quality.
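Before turning to those studies, the following sketch illustrates the two components of this model, with a Gaussian blur standing in for T and white Gaussian noise for n; these particular choices are illustrative assumptions, not requirements of the model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(x, blur_sigma, noise_sigma, rng=None):
    """y = T(x) + n, with T a Gaussian blur standing in for an
    image-correlated transformation and n additive white Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    t_of_x = gaussian_filter(x, sigma=blur_sigma)      # spatially correlated component
    n = rng.normal(0.0, noise_sigma, size=x.shape)     # spatially uncorrelated component
    return t_of_x + n

x = np.random.default_rng(1).random((128, 128))        # stand-in for a natural image
y = degrade(x, blur_sigma=1.5, noise_sigma=0.05)
```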

5.4.1. Joint Effects of Blur and Noise

Much of the early work in investigating the effects of multiple types of distortions on image quality involved the use of multidimensional scaling (MDS) experiments. As described in Section 5.3, Goodman and Pearson [171] used MDS to investigate the joint effects of additive-type distortions (e.g., noise, echo) and coding- and transmission-type distortions (DPCM quantization artifacts and blurring). They reported that four perceptual dimensions were used by subjects to rate quality: “(1) overall picture clarity, (2) a distinction between overlay impairment and object impairment, (3) the amount of purely spatial or stationary overlay patterning, and (4) the amount of spatiotemporal or moving overlay patterning” [171].

In [325], Linde employed MDS to investigate the interactive effects of blur and noise. The images used in [325] were blurred in varying amounts, and then varying amounts of noise were added to these blurred images. Linde reported two key findings. First, when a fixed amount of noise was added to images subjected to varying amounts of blur, the perceived strength of the noise increased for the increasing amount of blur; that is, the noise is more pronounced when it is added to a more blurry image than when the noise is added to a less blurry image. This result would seem attributable, at least in part, to masking; that is, blurring an image reduces its ability to mask noise. Second, for a fixed amount of blur, the addition of increasing amounts of noise gave rise to images which appeared progressively less blurry. This latter result might be attributable to cross-masking whereby the noise serves to mask the blurring.

In a similar study, Kayargadde and Martens [326] employed MDS to investigate not just the potential interactions between blur and noise, but also their effects on overall image quality. As in [325], Kayargadde and Martens reported that when a fixed amount of noise was added to images subjected to varying amounts of blur, the perceived strength of the noise was greater for the more blurry images. Similarly, they found that for highly blurred images, increasing the amount of noise served to make the images appear sharper. However, in contrast to [325], Kayargadde and Martens found that the opposite trend held for small amounts of blur: for mildly blurred images, increasing the amount of noise served to make the images appear less sharp. And, for intermediate amounts of blur, the addition of noise had no effect on perceived sharpness. In terms of the effects of the blur and noise on image quality, Kayargadde and Martens reported that quality decreased with increasing amounts of blur and noise. For all of the images tested in [326], blur had a greater impact on quality than noise.

5.4.2. Joint Effects of Wavelet Distortions and Noise

As mentioned in Section 5.3, some distortions are perceived as more additive or “overlay” type distortions, whereas other distortions are perceived more indirectly based on how they affect the image's objects. In [59], we investigated how quality is affected when images were simultaneously subjected to both types of distortions. We specifically considered the case in which T corresponds to quantization of wavelet subbands, which distorts the image's structure by disrupting the global precedence effect [54], and in which n is spatially uncorrelated additive white noise.

Figure 18 shows distorted versions of one of the images tested in [59]. Images in the top row contain only one of the two distortion types, and images in the leftmost column contain only the other; in each case, three of the six tested contrasts are shown, increasing from left to right along the top row and from top to bottom along the leftmost column. The remaining images contain combinations of the two distortions at the contrasts specified by their row and column headings. These combinations of distortions were tested on three natural images and, as a control condition, a solid gray image in which the distortions were generated from the image horse but were presented against a solid gray background (see Figure 19). For each of the distorted images, subjects were asked to rate the degradation in quality relative to the corresponding original image (resulting in DMOS values).

Figure 20 depicts the results for each of the four images. In each graph, the horizontal axis denotes the total RMS contrast of the combined distortions and the vertical axis denotes DMOS. The data represented by the black circles in Figure 20 correspond to the condition in which the images contained only the structural (wavelet-quantization) distortion, that is, zero noise contrast. The white circles in Figure 20 correspond to the condition in which the images contained only the additive white noise, that is, zero structural-distortion contrast. The other symbols represent the conditions in which the images contained both types of distortion.

Notice from the trends in Figure 20 corresponding to the conditions in which the images contained just the structural distortion (black circles) or just the noise (white circles) that, when an image contains just one type of distortion, increasing the contrast of the distortion also serves to increase the perceived distortion (decrease quality). Moreover, for the range of contrasts tested, the structural distortion gave rise to much greater perceived distortion than the noise; for the three natural images, the perceived distortion induced by the structural distortion was on average 4-5 times greater than the perceived distortion induced by the noise. These data provide further evidence that spatially correlated distortions which disrupt visual processing of the image's objects increase perceived distortion to a much greater extent than spatially uncorrelated additive white noise.

However, whereas RMS contrast and perceived distortion were monotonically related for images containing just one type of distortion, such a trend does not hold when both distortion types were present. Rather, as reported in [325, 326], the trends in Figure 20 reveal that adding low-contrast noise to an image which already contains low- to mid-contrast structural distortion actually serves to decrease perceived distortion; that is, adding small amounts of noise to a structurally distorted image serves to increase quality.

5.4.3. Modeling the Joint Effects of Multiple Distortions for IQA

One possible explanation for the interactive effects of the distortions observed in the above-mentioned studies is cross-masking. Specifically, the noise might mask the structural distortion (or the blur/echo), and thus DMOS is decreased (quality increased) by adding noise to an image which also contains structural distortion (see Figure 21). However, such cross-masking should also be present for the solid gray image tested in [59], yet the DMOS values for this image were largely unaffected by adding noise on top of the structural distortion. Another possible explanation is that the addition of noise serves to synthesize textures which were destroyed by the structural distortion (see Figure 22). This could explain why the greatest improvement in quality in [59] was observed for the image horse, a significant portion of which is textured.

The results of the above-mentioned studies point toward an important implication for IQA: when multiple types of distortions are added to an image, the distortions can perceptually interact with each other, and with the image, in ways that may not be easily predicted based on their physical combinations. IQA algorithms could certainly benefit from further research on these effects.

5.5. Challenge 5: Geometric Changes

One well-known shortcoming of the vast majority of IQA algorithms is their inability to handle geometric distortions such as translation, scaling, rotation, shearing, or changes in viewpoint. Such geometric changes, if they are not too drastic, usually have a minimal impact on visual quality. However, even slight geometric changes can give rise to massive pointwise changes in pixel intensities, and consequently, many IQA algorithms predict these geometrically modified images to be of much lower quality than indicated by the actual subjective ratings.

As an example, Figure 23 depicts five versions of an image which have been distorted via either geometric changes (spatial shift, rotation) or photometric changes (additive Gaussian white noise, change in brightness, change in contrast). All of the images have approximately the same MSE relative to the original. Clearly, the geometric changes have a lesser impact on quality compared to the photometric changes, and the addition of white noise appears to have the greatest impact on quality.
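The mismatch between MSE and perceived quality under geometric changes is easy to reproduce; in the sketch below, a two-pixel translation and an additive white noise field are matched in MSE (the random array is only a stand-in for a natural image, for which the perceptual difference between the two cases would be striking).

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
img = rng.random((256, 256))                    # stand-in for a natural image

shifted = np.roll(img, shift=2, axis=1)         # 2-pixel horizontal translation
sigma = np.sqrt(mse(img, shifted))              # noise level chosen to match that MSE
noisy = img + rng.normal(0.0, sigma, img.shape)

print(mse(img, shifted), mse(img, noisy))       # comparable MSE, very different quality
```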

Of course, one way to address geometric changes would be to employ a front-end stage which attempts to undo the geometric changes (e.g., via image registration). However, such an approach will not work in general, particularly when the type of geometric change is unknown, when multiple geometric changes are imposed, and when geometric changes are compounded with more traditional distortions such as noise or JPEG compression artifacts.

As IQA moves into more mainstream applications, the need to handle geometric changes has become increasingly apparent. One prime example is the comparison of photographs of the same scene taken with different cameras at slightly different viewpoints. Another example is when applying IQA on a frame-by-frame basis to assess the quality of video. If the frames of the distorted video become temporally unsynchronized with the frames of the reference video (e.g., delayed by a few frames), then the difference between the reference and distorted frames will typically manifest in the form of geometric changes due to movement of the subject matter and/or panning, zooming, and other viewpoint changes imposed by the camera.

5.5.1. Perception and IQA of Geometric Changes

One theory of why geometric changes have minimal impact on visual quality is that geometric changes are quite prevalent during normal vision, and thus the human visual system has adapted to become relatively insensitive to such changes. This theory was proposed in [327] by Kingdom et al. based on the results of a psychophysical experiment in which discrimination thresholds were measured for images containing various geometric distortions (affine transforms) and for images containing photometric distortions (luminance changes, contrast changes, and various forms of noise). Kingdom et al. found that subjects were 11–14x more sensitive to noise than they were to geometric distortions and were 2-3x more sensitive to brightness/contrast changes than they were to geometric distortions. Based on these findings, Kingdom et al. suggested that “observers are least sensitive to those transformations most commonly experienced in the natural world.” [327].

Some work in visual perception and IQA has specifically focused on addressing geometric distortion. For example, in [328], Chow et al. compared the detectability of local warping distortions in computer-generated scenes on monitors versus head-mounted displays. In [329], Rovamo et al. measured thresholds for detecting geometric distortion in faces as a function of retinal eccentricity. For IQA of watermarked images, Setyawan et al. [330] investigated the impact of particular forms of geometric distortion on the perceived quality of watermarked images. Setyawan et al. also presented in [330] an FR IQA algorithm which uses local estimates of affine transformations to estimate quality.

In [331], Wang and Simoncelli presented an FR IQA algorithm which can handle minor translations, rotations, and scalings. Their algorithm, CW-SSIM, is an extension of SSIM [123] which uses complex wavelets to achieve its invariance to these geometric changes. Quality is assessed via SSIM-type measures applied separately to the magnitudes and phases of the coefficients. Invariance to translation is afforded by the fact that translation in the spatial domain manifests in the coefficients as concerted shifts in phase. Wang and Simoncelli demonstrated the utility of CW-SSIM both for IQA of geometrically distorted images and for character recognition. In [176], CW-SSIM was also shown to perform well on more general similarity tasks such as comparing segmentations and quantifying similarities between 3D facial surfaces.
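The shift-to-phase relationship that CW-SSIM exploits can be illustrated with the ordinary DFT; CW-SSIM itself operates on a complex steerable pyramid, so the following is only a sketch of the underlying principle, not of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(256)                      # 1D stand-in signal
x_shifted = np.roll(x, 3)                # translate by 3 samples

X, Xs = np.fft.fft(x), np.fft.fft(x_shifted)

print(np.allclose(np.abs(X), np.abs(Xs)))        # True: magnitudes are unchanged
phase_diff = np.angle(Xs * np.conj(X))           # concerted, linear-in-k phase shift
print(phase_diff[1:5])                           # approx. -2*pi*3*k/256 (wrapped)
```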

In [332], D'Angelo et al. presented an FR IQA algorithm which uses an HVS model and displacement fields for QA of geometrically distorted images. Their technique applies a single-level, multiorientation Gabor decomposition with both even- and odd-symmetric filters to the reference and distorted images. The even- and odd-symmetric responses are combined via an L2 (energy) norm to mimic the responses of complex cells in V1. These latter responses are then combined with local gradient information obtained from a displacement field, and the collection of modified responses is collapsed across orientation and space to arrive at a scalar estimate of quality. D'Angelo et al. demonstrated that, on a database of geometrically distorted images, their technique can yield quality predictions that correlate well with subjective ratings; Spearman’s and Pearson’s correlation coefficients of approximately 0.8 were reported.

5.5.2. More Radical Geometric Changes

Beyond basic affine transformations, it is also possible to generate images with more radical geometric changes and combinations of geometric and photometric changes. One particular area which researchers have begun to explore is IQA of textures. Given two samples of textures, humans can readily determine whether the two samples were taken from the same source texture (the same material). Or, given a database of actual and synthesized textures, humans can assign consistent quality ratings to each synthesized version relative to its corresponding original. From a computational standpoint, however, this task is extremely challenging. A major factor which complicates IQA of textures is the fact that point-by-point comparisons, a common approach used to some extent by most IQA algorithms, cannot be used to compare the visual similarity of two textures.

Although a great deal of human vision research has been conducted to investigate the perceptual and neural mechanisms which underlie the visual appearance of texture (see [333] for a review), further research is needed on how to actually apply these findings to IQA of textures. In [334], Bénard et al. investigated the effects of fractalization on the visual quality of synthesized textures; they reported that the average cooccurrence error between gray-level cooccurrence matrices measured for the original and fractalized textures can perform well in predicting the subjective ratings. In [335], Zujovic et al. addressed IQA of textures by designing a structural similarity index for texture retrieval. In [336], Zujovic et al. also demonstrated the utility of their index for texture-synthesis-based image compression.

Also on the topic of IQA for synthesized textures, in [337] we presented a preliminary database of original and synthesized textures and associated DMOS values. Forty-two textures from the Brodatz database [338] served as originals, and various texture-synthesis algorithms were used to generate the synthesized versions. Figure 24 shows some of the original and synthesized textures used in [337]. An examination of the quality ratings revealed that the most detrimental artifacts were (1) lack of structural details, (2) misalignment of the texture patterns, (3) blurring introduced in the texture patterns, and (4) tiling of the same pattern. In [337], we also demonstrated that a weighted geometric combination of KLD and parameters from Portilla and Simoncelli's parametric texture-synthesis algorithm [339] showed promise in predicting the ratings. However, there were many notable failure cases, particularly on textures containing more structured objects (e.g., flowers, stones).

With the emergence of applications which can potentially give rise to images containing such radical changes in geometric properties (e.g., texture-synthesis coding and object-based coding), there is clearly a need for further research in this area, both in terms of visual perception and in terms of associated IQA algorithms and databases.

5.6. Challenge 6: Enhanced Images and Aesthetic Quality

Today, the photo-editing software used in digital photography has replaced the role of the darkroom used in traditional photography. Through digital enhancement or “retouching,” it is possible for an altered image to surpass the visual quality of the original image. For video, enhancements such as motion sharpening and anti-judder processing found in newer displays are known to make profound differences in quality.

Yet, the vast majority of IQA algorithms have been designed for distorted images, often operating under the assumption that a high-quality image is one which is most visually similar to the original (reference) image. However, for enhanced images, the notion of similarity is less applicable and a different QA tactic is needed. IQA of enhanced images remains an open research challenge.

5.6.1. Image Enhancement and Other Applications

Image enhancement is one of the most fundamental operations in image processing and digital photography. Enhancement/retouching is a step that professional photographers routinely perform after capturing photos. Although there is no standard rule to follow when editing a scene, most photographers perform several steps such as (1) cropping the images for recomposition; (2) removing obstructions or unwanted objects; (3) applying noise-reduction techniques, if needed; (4) adjusting brightness and contrast; (5) white-balancing and color-correction; (6) sharpening, usually as the final step (a minimal sketch of such a pipeline is given below). These and other forms of processing (e.g., demosaicing, superresolution, computational photography) yield images that are dissimilar to, but most often of superior visual quality compared to, the original images.
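The pipeline sketch referenced above follows; it is hypothetical, uses the Pillow library, relies on purely illustrative parameter values, and omits step (2), object removal.

```python
from PIL import Image, ImageEnhance, ImageFilter

def retouch(path):
    """Hypothetical retouching pipeline following the typical ordering above;
    parameter values are illustrative only."""
    img = Image.open(path).convert("RGB")
    img = img.crop((50, 50, img.width - 50, img.height - 50))         # (1) recompose
    img = img.filter(ImageFilter.MedianFilter(size=3))                # (3) mild noise reduction
    img = ImageEnhance.Brightness(img).enhance(1.05)                  # (4) brightness
    img = ImageEnhance.Contrast(img).enhance(1.15)                    # (4) contrast
    img = ImageEnhance.Color(img).enhance(1.10)                       # (5) saturation tweak
    img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=120))  # (6) sharpen last
    return img
```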

As mentioned previously, the vast majority of IQA methods have been designed for distorted images. IQA of enhanced images is challenging due to the fact that the changes can often be subtle and can affect the artistic impression of the image. Nonetheless, it is still possible to perform QA of enhanced images based on changes in low-level attributes. The work of Fairchild and Johnson [340] begins to address this issue by using contrast- and color-appearance models. In addition, VIF [195] can yield a value larger than unity (denoting quality greater than the original) when the “distorted” image contains linear contrast enhancement.

Besides image enhancement, other image-processing applications can give rise to images of greater visual quality than the unprocessed images. For example, in denoising and/or artifact reduction, the primary objective is to yield an image of superior visual quality to the input (distorted) image. The performance of such algorithms is typically quantified by comparing the denoised/artifact-reduced image to the original, pristine image, where the comparison is made via PSNR or some FR IQA algorithm. However, such a comparison is not always valid because the denoised/artifact-reduced image often undergoes processing such as sharpening, contrast-stretching, and inpainting. Again, our current IQA algorithms have not been designed to handle such changes. A similar argument applies to dithering algorithms; see [107]. These and related applications could certainly benefit from IQA algorithms designed for such changes; proper IQA algorithms could not only be used to quantify performance, but could also provide criteria for parameter optimization.

5.6.2. A Database for IQA of Enhanced Images

One of the main roadblocks in IQA of enhanced images is the lack of a database containing enhanced images and associated quality ratings. To address this issue, in [341], we presented such a database, the DRIQ (digitally retouched image quality) database. DRIQ contains 26 reference images of size 512 × 512 pixels obtained from the Kodak and CSIQ databases. Each of these 26 images was manually retouched to generate three enhanced versions spanning varying amounts of quality. The enhancements were made by editing either contrast, sharpness, brightness, color saturation, or combinations of these properties, both globally and locally. In total, the database contains 104 images (26 original images and 26 × 3 = 78 enhanced versions of these originals).

Obtaining reliable ratings of quality for enhanced images is more difficult than for degraded images due to the fact that the changes are often subtle. Based on several pilot experiments, we employed a three-step procedure: (1) intraimage ranking via a pairwise-comparison procedure, then (2) intraimage rating constrained by the ranks via a multiple-stimulus continuous quality evaluation (MSCQE) procedure, and then (3) across-image ratings constrained by the within-image ratings, again via an MSCQE paradigm. See [341] for further details of the experiments. Figure 25 shows some examples of the enhanced images and associated DMOS values; the entire database and ratings are available online [140].

Figure 26 shows the absolute best- and worst-rated images in the database along with their corresponding original reference images. The original image flower appears low in contrast, sharpness, and colorfulness. Its enhanced version, which received the highest rating, was enhanced in contrast, sharpened, and locally color-corrected (for the stamen of the flower and the leaves). However, the image redwood in the second row has little room for enhancement, and thus this image was enhanced only in terms of contrast; other enhanced versions of this image received similarly low DMOS values.

One very consistent finding which we observed when viewing an original image and its enhanced version is that the enhanced image makes the original image appear degraded (as long as the enhanced image is not overenhanced such that it appears artificial). Given this fact, in [341], we asked whether it is possible to perform IQA of enhanced images by operating existing distortion-based FR IQA methods in reverse and then reinterpreting the results. Specifically, given a reference image and its enhanced version, the reference image can be thought of as a distorted version of the enhanced image. Thus, existing FR IQA algorithms may be able to perform IQA of enhanced images by specifying the enhanced image as the reference image and by specifying the original image as the distorted image.
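A minimal sketch of this reverse-mode idea is shown below; fr_iqa is a hypothetical placeholder for any existing full-reference metric and is not a specific library API.

```python
def enhancement_score(original, enhanced, fr_iqa):
    """Run a full-reference metric 'in reverse': treat the enhanced image as
    the reference and the original as the degraded version. `fr_iqa` is any
    callable of the form fr_iqa(reference, test) -> score (hypothetical)."""
    return fr_iqa(enhanced, original)

# Usage (hypothetical): score = enhancement_score(orig_img, retouched_img, my_ssim)
```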

In [341], we showed that this reverse-mode-based approach, when supplemented with global measures of sharpness, contrast, and saturation, can yield quality estimates which correlate decently with the DMOS values (Pearson’s CC of approximately 0.85). These preliminary findings suggest that a future IQA algorithm designed specifically for enhanced images may benefit from a strategy that involves both enhanced-feature measurements and statistical comparisons of local frequency coefficients.

Although the DRIQ database represents an important first step, it is also important to note that a correlation of approximately 0.85 still leaves much room for improvement. Furthermore, none of the images in DRIQ contain both distortions and enhancement, none of the images contain overenhancement (which will certainly result in lower quality ratings), and none of the images were enhanced via more sophisticated recomposition, superresolution, or computational photography techniques.

5.6.3. Image Enhancement and Aesthetic Quality

At the 2008 SPIE Human Vision and Electronic Imaging Conference, Scott Daly provided very insightful perspectives on IQA of enhanced images in his talk “On the Role of Artistic Intent of Image Quality.” In his presentation, Daly clearly demonstrated how many standard enhancement techniques, which would normally improve image quality, can severely degrade aesthetic quality. Daly's main argument was based on the fact that artists and photographers often produce images which capture and convey specific visual impressions, images which make specific visual statements. Because traditional forms of enhancement can destroy the artist's intentions, even an IQA algorithm that can handle normal enhancement still has a long way to go in order to be successful in predicting the aesthetic quality.

Developing an IQA algorithm which can handle enhanced images, while considering the effects on artistic intent and aesthetic quality, certainly remains an unsolved research challenge.

5.7. Challenge 7: Runtime Performance

Although a great deal of research on IQA has focused on improving prediction accuracy, much less research has addressed algorithmic and microarchitectural efficiency. As IQA algorithms move from the research environment into more mainstream applications, issues surrounding efficiency—such as execution speed and memory bandwidth requirements—begin to emerge as equally important performance criteria. Many IQA algorithms which excel in terms of prediction accuracy fall short in terms of efficiency, often requiring relatively large memory footprints and runtimes on the order of seconds for even modest-sized images (e.g., <1 MPixels). As these algorithms are adapted to process frames of video (e.g., [342, 343]) or are used during optimization procedures (e.g., during RD optimization in a coding context), efficiency becomes of even greater importance.

From a signal-processing viewpoint, it would seem that the bulk of computation and runtime are likely to occur in two key stages employed by many IQA algorithms: (1) local frequency-based decompositions of the input image(s) and (2) local statistical computations on the coefficients. The first of these two stages can potentially require a considerable amount of computation and memory bandwidth, particularly when a large number of frequency bands are analyzed and when the decomposition must be applied to each image as a whole. The latter of these two stages would seem to require more computation, particularly when multiple statistical computations are required for each local region of coefficients. For example, in VIF [195], wavelet subband covariances can be computed via a block-based or overlapping block-based approach. In MAD [139], variances, skewness, and kurtoses of log-Gabor coefficients are also computed for overlapping blocks in each subband. As described in Section 3, many other HVS-based IQA methods have been designed to mimic the cortical processing in the HVS in which the local responses of neurons in V1 (modeled as coefficients) are computed and compared. Yet, unlike the HVS, most modern computing machines lack dedicated hardware for such computation.

5.7.1. Acceleration of Image Transforms and Local Statistics

Due to their extensive use in image compression and computer vision, a considerable amount of research has focused on accelerating two-dimensional image transforms which provide local frequency-based decompositions. For example, the discrete cosine transform (DCT) has been accelerated at the algorithm level by using variations of the same techniques used in the FFT (e.g., [344]) and by exploiting various other algebraic and structural properties of the transform, for example, via recursion [345], lifting [346], matrix factorization [347], cyclic convolution [348], and many other techniques (see [349] for a review). Numerous techniques for the hardware-based acceleration of the DCT have also been proposed using general-purpose GPU (GPGPU) and FPGA implementations (e.g., [350–353]). Algorithm- and hardware-based acceleration has also been researched for the discrete wavelet transform (e.g., [354–356]) and the Gabor transform (e.g., [357–360]).

Techniques for accelerating the computation of local statistics in images have also been researched, though to a much lesser extent than the transforms. One technique, called integral images, which was originally developed in the context of computer graphics [361], has emerged as a popular approach for computing block-based sums of any two-dimensional matrix of values (e.g., a matrix of pixels or coefficients). The integral image, also known as the summed area table, requires first computing a table which has the same dimensions as the input matrix and in which each value in the table represents the sum of all matrix values above and to the left of the current position. Thereafter, the sum of values within any block of the matrix can be rapidly computed from four values in the table via three additions/subtractions. A similar technique can be used to compute higher-order moments such as the variance, skewness, and kurtosis (see, e.g., [362, 363]).
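The following NumPy sketch illustrates the summed-area-table idea described above; the extension to local variances via a second table of squared values is a standard technique and is not a claim about any particular IQA implementation.

```python
import numpy as np

def integral_image(x):
    """Summed area table S with a zero border: S[i, j] = x[:i, :j].sum()."""
    return np.pad(np.asarray(x, dtype=float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def block_sum(S, top, left, height, width):
    """Sum over x[top:top+height, left:left+width] from four table values."""
    b, r = top + height, left + width
    return S[b, r] - S[top, r] - S[b, left] + S[top, left]

x = np.arange(16.0).reshape(4, 4)
S, S2 = integral_image(x), integral_image(x**2)    # tables for sums and sums of squares
assert block_sum(S, 1, 1, 2, 2) == x[1:3, 1:3].sum()

n = 2 * 2
mean = block_sum(S, 1, 1, 2, 2) / n                # local mean of the 2x2 block
var = block_sum(S2, 1, 1, 2, 2) / n - mean**2      # local variance from the two tables
```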

5.7.2. Acceleration of Specific IQA Algorithms

Other researchers have investigated techniques for accelerating specific IQA algorithms. For example, in [364], Gordon et al. investigated the acceleration of PSNR by using GPGPU implementations in both CUDA and OpenGL. Via a performance analysis, they specifically investigated how the application and system performance is affected by utilizing GPGPU acceleration of PSNR in a model-based coding application (the primary bottleneck in model-based coding stems from the optimization procedure used to determine the model parameters from the input image). Gordon et al. concluded that because CUDA uses the CPU's store units to copy data between the graphics card and system memory and because data exchanged between GPU APIs travel through the main processor, a non-GPGPU implementation of the PSNR computation runs faster than the same implementation using GPGPU programming methods.

In [365], Chen and Bovik presented the Fast SSIM and Fast MS-SSIM algorithms, which are accelerated versions of SSIM and MS-SSIM, respectively. Three modifications were used for Fast SSIM: (1) the luminance component of each block was computed by using an integral image, (2) the contrast and structure components of each block were computed based on 2 × 2 Roberts gradient operators, (3) the Gaussian-weighting window used in the contrast and structure components was replaced with an integer approximation. For Fast MS-SSIM, a further algorithm-level modification of skipping the contrast and structure computations at the finest scale was proposed. By using these modifications, Fast SSIM and Fast MS-SSIM were shown to be, respectively, 2.7x and 10x faster than their original counterparts on frames from videos of the LIVE Video Quality database. Although algorithm-level modifications were used, the authors demonstrated that these modifications had negligible impact on predictive performance; testing on the LIVE Image Quality and Video Quality databases revealed effectively no impact on SROCC, CC, and RMSE. By further implementing the calculations of the contrast and structure components via Intel SSE2 (SIMD) instructions, speedups of approximately 5x for Fast SSIM and 14x for Fast MS-SSIM were reported. In addition, speedups of approximately 17x for Fast SSIM and 50x for Fast MS-SSIM were reported by further employing parallelization via a multithreaded implementation.
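As an illustration of the gradient-based approximation, the sketch below computes a per-pixel Roberts-operator gradient magnitude that can stand in for the local standard deviation used in SSIM's contrast and structure terms; this is written in the spirit of Fast SSIM rather than as its exact implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# 2 x 2 Roberts cross kernels (one common convention).
ROBERTS_A = np.array([[1.0, 0.0], [0.0, -1.0]])
ROBERTS_B = np.array([[0.0, 1.0], [-1.0, 0.0]])

def gradient_magnitude(x):
    """Per-pixel gradient magnitude used as a cheap surrogate for the local
    standard deviation in SSIM-style contrast/structure terms."""
    ga = convolve(x, ROBERTS_A, mode="nearest")
    gb = convolve(x, ROBERTS_B, mode="nearest")
    return np.sqrt(ga**2 + gb**2)
```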

In [366], Okarma and Mazurek presented GPGPU techniques for accelerating SSIM, MS-SSIM, and CVQM (a video quality assessment algorithm developed previously by Okarma, which uses SSIM, MS-SSIM, and VIF to estimate quality). To accelerate the computation of both SSIM and MS-SSIM, the authors described a CUDA-based implementation in which separate GPU threads were used to compute SSIM or MS-SSIM on strategically sized fragments of the image. To overcome CUDA's memory-bandwidth limitations, the computed quality estimates for the fragments were stored in GPU registers and transferred only once to the system memory. Okarma and Mazurek reported that their GPGPU-based implementations resulted in 150x and 35x speedups of SSIM and MS-SSIM, respectively.

In [363], Phan et al. presented the results of a performance analysis and techniques for accelerating the MAD algorithm [139]. Although MAD is among the best in predictive performance, it is currently one of the slowest IQA algorithms, requiring over 55 seconds per image when tested on several modern computers (Intel Core 2 and Xeon CPUs; see [363]). A performance analysis revealed that the main bottleneck in MAD stemmed from its appearance-based stage, which accounted for 98% of the total runtime. Within this appearance-based stage, the computation of the local statistical differences accounted for most of the runtime, and computation of the log-Gabor decomposition accounted for the bulk of the remainder. Phan et al. proposed four techniques of acceleration: (1) using integral images for the local statistical computations, (2) using procedure expansion and strength reduction, (3) using a GPGPU implementation of the log-Gabor decomposition, and (4) precomputation and caching of the log-Gabor filters. The first two of these modifications resulted in an approximately 17x speedup over the original MAD implementation. The latter two resulted in an approximately 47x speedup over the original MAD implementation.

5.7.3. The Need for Broader Performance Analyses

Although the aforementioned studies have successfully yielded more efficient versions of their respective algorithms, several larger questions remain unanswered. To what extent are the bottlenecks in IQA algorithms attributable to the decomposition and statistical computation stages versus more algorithm-specific auxiliary computations? To what extent are the bottlenecks attributable to computational complexity versus limitations in memory bandwidth? Are there generic implementation techniques or microarchitectural modifications that can be used to accelerate all or at least several IQA algorithms?

The answers to these questions can not only provide important insights for furthering IQA research, but they can also facilitate deployment and integration of IQA algorithms into existing and forthcoming applications and platforms. Further research in this area can help guide (1) the design of new IQA algorithms, which are likely to draw on multiple approaches used in several existing IQA algorithms, (2) efficient implementations of multiple IQA algorithms on a given hardware platform, (3) efficient integration of multiple IQA algorithms in specific applications, and (4) the selection and/or design of specific hardware which can efficiently execute multiple IQA algorithms.

6. Conclusion

In this paper, I have discussed the current state of the art in IQA research, summarized the IQA-related knowledge that has been gained from studies in visual psychophysics, and surveyed the progress that has been made through HVS modeling, non-HVS-based approaches, and other statistical modeling approaches. One main conclusion which can be drawn from this paper is that today's IQA algorithms can perform remarkably well at predicting human judgments of quality.

However, it should also be evident that the IQA problem is far from being solved; we have yet to reach the summit of this investigative ascent. Rather, our current accomplishments suggest that we can design IQA algorithms to handle images generated by the more traditional applications (compression, watermarking, transmission errors, and camera/display processing) and other applications which have served as catalysts for IQA research. But further research is needed to improve IQA based on the current challenges and to prepare IQA for future challenges.

Here, I have identified seven open challenges in IQA. The objective of this discussion was not only to highlight the limitations in our current knowledge of image quality, but to also emphasize the fact that there is substantial room for alternative theories and techniques beyond those surveyed here. Budrikis' original 1972 statement that “full evaluations are as yet impossible” [23] still holds true today. However, IQA research continues to grow at an astounding rate, and these efforts will undoubtedly lead to improved evaluation techniques and further advancements in our understanding of image quality.

Acknowledgments

This material is based upon work supported by, or in part by, the National Science Foundation Awards 0917014 and 1054612 and by the U.S. Army Research Laboratory (USARL) and the U.S. Army Research Office (USARO) under Contract/Grant number W911NF-10-1-0015.