Abstract

Human gaze is not directed to the same part of an image when lighting conditions change. Current saliency models do not consider light level analysis in their bottom-up processes. In this paper, we introduce a new saliency model which better mimics the physiological and psychological processes of visual attention in the free-viewing task (bottom-up process). This model analyzes lighting conditions with the aim of giving different weights to color wavelengths. The resulting saliency measure outperforms several popular cognitive approaches.

1. Introduction

Saliency models are increasingly important in computer vision owing to their fundamental contribution to intelligent systems (robotics [1], serious games [2], intelligent video surveillance [3], etc.). Indeed, the mechanisms of visual attention they are supposed to mimic enable the selection of relevant contextual information [4, 5] and lead to more autonomous systems.

While it is known that human visual attention can be deployed unintentionally over a scene (bottom-up mechanism) or subjectively (top-down mechanism), the cognitive and biological underpinnings of this process remain the subject of scientific investigation.

In recent decades, much scientific effort has been devoted to answering this question. An early psychological model, which strongly influenced the field of visual attention, is the Feature Integration Theory (FIT) [6]. Treisman and Gelade suggested that some image features (colors, orientations, etc.) are first processed in parallel to build a master map of locations that draws our attention to an area (the "where") of the scene. Object recognition takes place after attention has been focused on this location and requires inhibition of the feature maps that do not describe the searched-for target. The theory also assumes that our attention is deployed sequentially on each stimulus present in the scene. This assumption has been refuted by several studies showing that, during a task (top-down case), attention can be deployed to 4 or 5 regions simultaneously [7, 8].

An improved version of the FIT is the Guided Search Model of Wolfe [9]. In addition to selecting the features that best describe the target, top-down biases are introduced to highlight features that better discriminate the target from its distractors (similar objects).

Other approaches rely on connectionist strategies in their visual attention processes [10–14]. In these models, a neural network describes our visual attention through inhibition and excitation mechanisms that allow one area of the scene to emerge.

Other models of visual attention rely on different kinds of processes, such as the Gestalt theory [15, 16] proposed by Christian von Ehrenfels in 1890 [17]. Gestalt theory is both a psychological and a philosophical theory which maintains that perception and mental representation spontaneously treat phenomena as structured wholes (forms) and not as a simple addition or juxtaposition of elements (features).

In this paper we propose a bottom-up saliency model which accounts for both the biological and the cognitive mechanisms of visual attention. Section 2 provides a state of the art in cognitive saliency models (by cognitive we mean a biologically plausible model that describes not only the psychological but also the physiological processes of visual attention [18]). We focus only on the bottom-up axis and on computational models. Section 3 introduces a new biologically inspired bottom-up model. Section 4 tests the correlation between cognitive models and human visual attention and analyzes the results. Section 5 provides discussion and conclusion.

2. State of the Art in Cognitive Models of Bottom-Up Visual Attention

Cognitive models have the advantage of expanding our view of biological underpinnings of visual attention [18].

One of the first computational attention models was proposed by Koch and Ullman [19]. This model is based on the FIT. Different features are filtered and combined to form a final saliency map, where a neural network (winner-takes-all, or WTA) indicates the most salient region. This model is the basis for several implementations, such as Clark and Ferrier [20], where feature maps are summed and weighted according to their saliency.

One of the most popular models inspired by the architecture of [19] is that of Itti et al. [21]. It adds Gaussian pyramid filtering of each feature (color, intensity, and orientation). Other retinal mechanisms, such as center-periphery antagonism [22], are introduced before the resulting filtered maps are normalized and combined into a final saliency map.

The model of [21] is very popular because it is well documented and freely available online. It was also the first computational model to yield interesting results in the free-viewing task.

Other models, like VOCUS [23], use the architecture of [21]. In fact, VOCUS uses the LAB color space, in which the attention process is described in the same manner as in Itti et al.'s architecture.

Another computational model based on the FIT was proposed by Le Meur et al. [24]. Contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions are some of the features implemented in this model. In [24], three aspects of vision are covered: visibility, perception, and perceptual grouping. The visibility part simulates the limited sensitivity of the human eye and takes into account the major properties of the retinal cells. The perception part suppresses redundant visual information by simulating the behavior of cortical cells. Finally, the saliency map is enhanced by perceptual grouping.

Later, Le Meur et al. extended this model to the spatio-temporal domain [25] by fusing achromatic, chromatic, and temporal information. The architecture of Koch and Ullman also allowed the implementation of models such as Guironnet et al. [26], whose static component extracts common orientations with Gabor filters while a temporal component based on frame differencing detects moving objects.

Gestalt principles were first implemented by Zou et al. [27] and then by Kootstra et al. [15, 16, 28].

In the last decade, connectionist theories have not been implemented computationally for static images; some researchers have implemented them with dynamic neural fields [29, 30] on video. Connectionist approaches use a microscopic description of the cells involved in visual attention, in particular by modeling visual attention with a neural network. FIT-based approaches (which typically use filter banks) are macroscopic, because they describe visual attention without considering the details of the neural topography of human cortical cells.

We prefer the FIT-based approach to the connectionist ones because the functioning of cortical cells described by macroscopic models is better understood than the exact mechanisms governing the interactions of the neurons modeled by connectionist approaches. Feature Integration Theory also provides a better understanding of the physiological process of visual attention than Gestalt theory, which gives little information about the physiological side of visual attention.

In this paper we therefore propose to integrate light level analysis and a better description of the visual cortex into the attention process. Our model is based on the FIT, but its main advantage is that its visual attention mechanisms adapt to lighting conditions. Another level of vision, which captures the essence of a scene (hereafter called the Perceptual mode), is also introduced.

3. Pop-Out: A New Cognitive Model of Visual Attention That Uses Light Level Analysis to Better Mimic the Free-Viewing Task of Static Images

Our vision is impaired in the dark, and we are highly sensitive to the light intensity emitted by color hues (see Figure 2). Luminance is the feature that carries most of the visual information, because the finest area of our eye, the fovea, is mainly photosensitive [32, 33]. The luminous efficiency (the magnitude of the intensity perceived for a given wavelength) is strongly affected by the color of the light source [34, 35]. In photopic vision, our fovea is more sensitive to the red wavelength, while the blue wavelength dominates scotopic vision. Between daytime and nighttime illumination, there is another light level, called mesopic vision, where red and blue hues are competitive [36] (see Figure 3).

To model these findings (the light and color intensity sensitivities), we introduce an improved $YC_bC_r$ space which recovers the whole luminance of the scene and remains selective to color sensitivity. This improved space allows us to make a suitable combination of the $C_r$ and $C_b$ components according to the result of the light level analysis (this analysis tells us whether the picture was taken in photopic, scotopic, or mesopic conditions). All components are filtered by Log-Gabor functions (see Figure 1). In fact, Log-Gabor wavelets better describe the receptive fields (RF), or the simple cell responses, of the V1 area of our visual cortex [40–43]. A light level analysis module detects the luminosity of the scene and gives different weights to the $C_r$ or $C_b$ component after Log-Gabor filtering. The weighted $C_r$ and $C_b$ maps are then used to enlighten a specific region of the filtered $Y$ map.

The proposed system is called "pop-out." Pop-out is an effect in human vision that only occurs when a single target differs from its surroundings while all distractors (the rest of the scene) are homogeneous [9]. This mainly refers to a search task rather than to the bottom-up axis of visual attention. Although we propose a bottom-up architecture in this paper, we chose the name "pop-out" because a salient area of an image seems to pop out of the scene to our eyes. The prospect of integrating top-down cues into our architecture is another reason for this name.

3.1. Image Preprocessing

The image preprocessing addresses a major need: contrast enhancement. Both psychological and physiological experiments give evidence for the theory of an early transformation, in the human visual system (HVS), of the L (long wavelength, sensitive to the red part of the visible spectrum), M (medium wavelength, sensitive to the green part), and S (short wavelength, sensitive to the blue part) signals issued from cone absorption. L-cones, M-cones, and S-cones are mainly located in the central part of the retina, called the fovea [24, 36]. The HVS thus works like one of the many color spaces that separate colors from their intensity: this transformation provides an opponent color space in which the signals are less correlated [24].

There is a variety of opponent color spaces which differ in the way they combine the different cone responses. Here, we use the $YC_bC_r$ space [44, 45] because it is an additive space that allows us to intuitively combine the blue and red wavelengths to produce all kinds of colors. $YC_bC_r$ is also currently recommended as a standard definition for digital and high-definition television systems [46].

The first step consists of extracting the image intensity $Y$. Since we empirically found that the traditional value of $Y$ in the standards (ITU-R BT.601-5 and ITU-R BT.709-5) does not capture all the luminance of the scene, we improve it by using the mean of the $R$, $G$, and $B$ components:

$Y = (R + G + B)/3$. (1)

This change is very important because, in the conventional $YC_bC_r$ space, $Y$ is more sensitive to the green component of the $RGB$ space. In accordance with [36] (see Figure 3), this kind of space (where the green wavelength dominates) does not fit our study, since we want to adapt to the viewing conditions (photopic, mesopic, and scotopic). In fact, keeping the traditional $Y$ amounts to always assuming mesopic conditions, where the green component can be more important than any other wavelength.

We apply a fuzzy mask [47] to enhance the contrast of $Y$ and of the initial image $I$ (we obtain $I'$ from the enhancement of $I$). We therefore have $Y'$ (the intensity extracted from the initial image and enhanced by the fuzzy mask) and the $C_r$ and $C_b$ components (retrieved from $I'$). This step highlights image contrast and provides less correlated features as input to our RF modeling (Log-Gabor functions).
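As an illustration, this preprocessing stage can be sketched in a few lines of NumPy. This is only a sketch under simplifying assumptions: the fuzzy mask of [47] is replaced by a plain percentile contrast stretch, and the chrominance planes use the standard BT.601 definitions rather than the improved space.

```python
import numpy as np

def preprocess(rgb):
    """Sketch of the preprocessing stage (Section 3.1).

    `rgb` is an H x W x 3 float array in [0, 1]. The fuzzy contrast mask
    of [47] is approximated here by a plain percentile contrast stretch;
    the actual model uses the fuzzy enhancement instead.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Modified intensity: plain mean of the components (equation (1)),
    # instead of the BT.601/709 luma weights.
    y = (r + g + b) / 3.0

    # Placeholder for the fuzzy-mask enhancement: a simple percentile
    # contrast stretch applied to the intensity and to each channel.
    def stretch(x, lo=1, hi=99):
        low, high = np.percentile(x, [lo, hi])
        return np.clip((x - low) / (high - low + 1e-8), 0.0, 1.0)

    y_enh = stretch(y)                        # enhanced intensity Y'
    rgb_enh = np.stack([stretch(c) for c in (r, g, b)], axis=-1)

    # Chrominance planes from the enhanced image I', using the standard
    # BT.601 chroma definitions (an assumption; the paper's improved
    # space may differ).
    re, ge, be = rgb_enh[..., 0], rgb_enh[..., 1], rgb_enh[..., 2]
    y601 = 0.299 * re + 0.587 * ge + 0.114 * be
    cb = (be - y601) * 0.564 + 0.5            # blue-difference chroma
    cr = (re - y601) * 0.713 + 0.5            # red-difference chroma
    return y_enh, cr, cb
```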

3.2. Log-Gabor Filtering and the Attentive Unit in a New HVS Space

After preprocessing, the image is fed to the input of the Log-Gabor filters. The input of the filters is composed of three elements: the enhanced luminance $Y'$ (taken from $Y$) and the $C_r$ and $C_b$ components (taken from $I'$).

The Log-Gabor filter is defined in polar coordinates by the following equation:

$G_{f_0,\theta_0}(\rho,\theta) = \exp\!\left(-\dfrac{(\log(\rho/f_0))^2}{2(\log(\sigma_\rho/f_0))^2}\right)\exp\!\left(-\dfrac{(\theta-\theta_0)^2}{2\sigma_\theta^2}\right)$. (2)

$G_{f_0,\theta_0}(\rho,\theta)$ describes the Log-Gabor function (of radial component $\rho$ and angular component $\theta$) for a central frequency $f_0$ and an orientation $\theta_0$. Our implementation of the Log-Gabor filter is based on Peter Kovesi's work [41], and all parameters (central frequency $f_0$, initial orientation $\theta_0$, radial bandwidth $\sigma_\rho$, and angular bandwidth $\sigma_\theta$) are set to obtain the finest bandwidth (that is, a very precise RF modeling). These parameters remain the same for all images. They allow us to avoid any overlap between the different Log-Gabor wavelets while providing very fine textures. We use 4 scales and 8 orientations.
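The following sketch builds such a Log-Gabor filter bank in the frequency domain, in the spirit of Kovesi's implementation [41]; the bandwidth constants and the image size are illustrative assumptions, not the tuned values used in the model.

```python
import numpy as np

def log_gabor(shape, f0, theta0, sigma_f=0.55, sigma_theta=np.pi / 8):
    """Single Log-Gabor filter in the frequency domain (see (2)).

    shape       : (rows, cols) of the image
    f0          : central frequency (cycles / pixel, 0 < f0 < 0.5)
    theta0      : preferred orientation in radians
    sigma_f     : radial bandwidth ratio (sigma_rho / f0)
    sigma_theta : angular bandwidth in radians
    The default constants are illustrative, not the paper's values.
    """
    rows, cols = shape
    fy = np.fft.fftshift(np.fft.fftfreq(rows))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(cols))[None, :]
    radius = np.hypot(fx, fy)
    radius[rows // 2, cols // 2] = 1.0        # avoid log(0) at DC
    theta = np.arctan2(-fy, fx)

    # Radial component: Gaussian on a logarithmic frequency axis.
    radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_f) ** 2))
    radial[rows // 2, cols // 2] = 0.0        # zero DC response

    # Angular component: Gaussian around the preferred orientation,
    # with the angle difference wrapped to (-pi, pi].
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular

# Filter bank with 4 scales and 8 orientations, as stated in the paper.
bank = [log_gabor((256, 256), f0=0.25 / (2 ** s), theta0=o * np.pi / 8)
        for s in range(4) for o in range(8)]
```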

Assuming that we are more sensitive to the magnitude of the perceived luminosity (intensity wavelength sensitivity), we take the amplitude of the filtered result, which is nothing else but the texture related to a given component:

$T_c^{s,o} = \left|\,\mathrm{IFFT2}\!\left(\mathrm{FFT2}(c)\cdot G_{s,o}\right)\right|$, (3)

where $c \in \{Y', C_r, C_b\}$. FFT2 and IFFT2 are, respectively, the fast Fourier transform and the inverse fast Fourier transform of a 2D image. $s$ and $o$ are, respectively, the scale and the orientation of the filter, and $G_{s,o}$ is the corresponding Log-Gabor filter. $T_c^{s,o}$ represents the textures extracted (see Figure 4).
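Applying each filter of the bank in the Fourier domain and keeping the amplitude of the response yields the texture maps of (3); reusing the `log_gabor` sketch above, a minimal version could be:

```python
import numpy as np

def texture_map(channel, bank):
    """Amplitude of the Log-Gabor responses for one component (see (3)).

    `channel` is a 2-D array (Y', Cr, or Cb plane); `bank` is a list of
    frequency-domain filters of the same shape (previous sketch).
    The per-filter amplitudes are summed over all scales and orientations
    to give one texture map T_c per component, as done later in the paper.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(channel))
    responses = [np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum * g)))
                 for g in bank]
    return np.sum(responses, axis=0)
```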

Based on the curves in [34, 38, 48], we analyze the shape of $Y$ (the intensity of the initial image). According to this shape, we can tell whether the picture was taken in photopic, scotopic, or mesopic conditions. In the photopic case, we give more weight to $C_r$. In the scotopic case, $C_b$ is more important than $C_r$. In the mesopic condition, $C_r$ and $C_b$ are competitive. The weighted sum of the $C_r$ and $C_b$ textures is adjusted by the $\alpha$ and $\beta$ variables (with $\alpha, \beta \in [0, 1]$ and $\alpha + \beta = 1$).
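The criterion used to analyze the shape of $Y$ is not detailed here, so the sketch below is only a crude placeholder that derives $(\alpha, \beta)$ from the mean intensity; the thresholds and weight values are assumptions for illustration.

```python
import numpy as np

def chroma_weights(y, dark=0.25, bright=0.55):
    """Return (alpha, beta) weighting the Cr and Cb textures.

    The thresholds on the mean of Y and the weight values are
    illustrative placeholders; the model actually analyses the shape
    of the intensity distribution of the image.
    """
    m = float(np.mean(y))
    if m >= bright:        # photopic: favour the red-difference plane
        return 0.8, 0.2
    if m <= dark:          # scotopic: favour the blue-difference plane
        return 0.2, 0.8
    return 0.5, 0.5        # mesopic: Cr and Cb are competitive
```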

The bottom-up saliency map $S_{BU}$ (see Figure 5) is then obtained by combining the luminance texture with the $\alpha$- and $\beta$-weighted chrominance textures (see (4)).

One finding of our study is that the perceived luminous efficiency can lead us to the most important part of the scene, which can be considered the essence of the image (or what we really retain from a visual scene). This sensitivity to edges is reminiscent not only of the ganglion cells [24] but also of the magnitudes (or textures) extracted from the receptive fields (modeled by Log-Gabor filters in our method). However, we cannot establish a real connection between "edge sensitivity" and "bottom-up attention processes," since all chromatic information is less important than the $Y'$ component in this mode. The perceptual map $S_P$ (see Figure 6) is computed by an appropriate combination of $T_{Y'}$, $T_{C_r}$, and $T_{C_b}$ (the same weights are given to $C_r$ and $C_b$ in this case).
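Since the exact combination formulas are not reproduced above, the sketch below shows one plausible reading under stated assumptions: the weighted chrominance textures modulate (enlighten) the luminance texture through a pointwise product for the bottom-up map, and the perceptual map is a plain combination with equal chroma weights.

```python
import numpy as np

def normalize(x):
    """Rescale a map to [0, 1]."""
    x = x - x.min()
    return x / (x.max() + 1e-8)

def saliency_maps(t_y, t_cr, t_cb, alpha, beta):
    """One plausible form of the two output maps (an assumption, not the
    paper's exact equations):
      - bottom-up map: the weighted chroma textures enhance ("enlighten")
        the luminance texture through a pointwise product;
      - perceptual map: plain combination with equal chroma weights.
    """
    s_bu = normalize(t_y * (alpha * t_cr + beta * t_cb))
    s_perceptual = normalize(t_y + 0.5 * (t_cr + t_cb))
    return s_bu, s_perceptual
```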

4. Cognitive Models versus Eye-Tracking Experiences: Assessment after Free-Viewing Task

In this section, we compare our method with three cognitive saliency models on the Toronto dataset [39]. The Toronto dataset contains data from 11 subjects free-viewing 120 color images of outdoor and indoor scenes. Each image was freely viewed by the participants for 4 seconds. A particularity of this database is that a large subset of the images does not contain any semantic objects or faces. Owing to the free-viewing task and the nature of its images, the Toronto dataset is thus very suitable for validating bottom-up models. All images in this dataset were taken in photopic conditions.

We ran our model in three modes: automatic (where different weights are automatically given to each color component according to the light level analysis), mesopic (where the same weight is given to both the $C_r$ and $C_b$ components), and perceptual (edge sensitivity, as described in the previous section). One goal of this step was to study the contribution of the light level analysis module, by running our model in automatic mode and comparing the results with those obtained when the same weights are given to the $C_r$ and $C_b$ components (the mesopic case). Our light level analysis module achieves an accuracy of 99.17%; just one photopic image is misclassified.

We compared our model with the most popular cognitive saliency measures: Itti-Koch-Niebur [21], VOCUS [23], and Le Meur et al. [24]. Since the Bruce-Tsotsos saliency measure [39] is considered not a cognitive approach but an information-theoretic model [18], we do not compare our model with it.

Two comparison metrics are used in the analysis: the Area Under the Receiver Operating Characteristic curve (AUROC) and the Earth Mover's Distance (EMD). For the AUROC score, human fixations are taken as the positive set, and points sampled from the image (either uniformly, or nonuniformly to account for center bias) form the negative set. The saliency map is then treated as a binary classifier separating the positive samples from the negative ones. Perfect prediction corresponds to a score of 1, while a score of 0.5 indicates chance level. While an ROC analysis is useful, it is insufficient to describe the spatial deviation of the predicted saliency map from the actual fixation map [49]: whether a predicted salient location is misplaced close to or far away from the actual salient location should make a difference in the measured performance. To conduct a more representative and selective evaluation, we also use the EMD, which measures the distance between two probability distributions (human gaze versus saliency map) over a region (lower is better).
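For concreteness, the AUROC protocol described above (fixated pixels as positives, sampled pixels as negatives, saliency values as classification scores) can be computed as follows with scikit-learn; the uniform negative sampling shown here is the simplest variant and ignores center-bias correction.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(saliency, fixation_points, n_negatives=1000, seed=None):
    """AUROC of a saliency map against human fixations.

    saliency        : 2-D saliency map
    fixation_points : iterable of (row, col) fixation coordinates
    Negatives are drawn uniformly over the image (the simplest,
    non-center-bias-corrected variant of the metric).
    """
    rng = np.random.default_rng(seed)
    pos = np.array([saliency[r, c] for r, c in fixation_points])
    rows = rng.integers(0, saliency.shape[0], n_negatives)
    cols = rng.integers(0, saliency.shape[1], n_negatives)
    neg = saliency[rows, cols]
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return roc_auc_score(labels, scores)
```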

The model of Itti-Koch-Niebur [21] uses 9 scales and 4 preferred orientations (in total, 42 feature maps are computed: six for intensity, 12 for color, and 24 for orientation).

Pop-out uses 8 scales and 4 orientations: 32 feature maps are thus computed from each component (32 feature maps from the $Y'$ component, 32 from the $C_r$ component, and 32 from the $C_b$ component).

We thus use more feature maps than Itti-Koch-Niebur when color information is taken into account (the $C_r$ and $C_b$ components): each set of 32 feature maps is summed to give the $T_{Y'}$, $T_{C_r}$, and $T_{C_b}$ maps.

Our fusion strategy (see (4)) highlights the color wavelength contained in the luminance texture. This strategy is completely different from the ones used in [21, 23], which resort to a simple linear combination: the fusion in [21] is a linear combination of the different feature maps, and, like Itti-Koch-Niebur [21], VOCUS [23] also uses a sum of weighted feature maps.

As shown in Figure 7, our model performs better than several popular cognitive models when used in automatic mode (AUROC = 0.73; EMD = 2.94). It is more selective than [24] (see the EMD results in Figure 7), because the most enlightened color wavelength of the image is selected without perceptual grouping of higher-level structures of the scene. Indeed, the model of [24] uses perceptual grouping and fusion strategies that lead to fuzzy and less selective maps (see Figure 8). We also note the contribution of our model to the FIT when we compare it to the Itti saliency measure. This is mainly due to the Log-Gabor filters, which are more biologically plausible [43] than the Gaussian pyramids and Gabor filters used in [21]. The HVS space used also leads to a more accurate architecture than that of [21]. The bottom-up part of VOCUS [23] is an improvement of the architecture of [21]: features are weighted more heavily when they are unique in the scene, so salient objects are highlighted. However, its performance is close to that of [21], and it does not provide a real improvement of the FIT in the way pop-out in automatic mode does.

Both the mesopic and the automatic mode achieve roughly the same performance. Since the database comprises only photopic images, it is very difficult to observe a difference between the two modes, because the mesopic vision curve encompasses a large portion of the photopic curve (see Figure 3). Moreover, the light level analysis module detects only one mesopic image. Therefore, it is not easy to separate the mesopic mode from pop-out in automatic mode using photopic images. Concerning the perceptual mode, its edge sensitivity is clearly further from the eye-tracking data in the free-viewing task (bottom-up attention).

5. Conclusion and Discussion

We introduced an improved physiological model of the FIT that uses light level analysis to give different weights to the chrominance components in an enhanced $YC_bC_r$ space. Our saliency model (pop-out in automatic mode) outperforms several popular cognitive approaches [21, 23, 24].

However, as shown in Figure 7, its main advantage over traditional FIT-based models is not visible, because the mesopic curve encompasses the photopic curve. Hence, the results on the Toronto dataset (which consists mainly of photopic images) do not show the real advantage of such a light level analysis (indeed, the difference between the performance results in photopic and mesopic mode is not statistically significant). Nevertheless, there are some differences in the saliency maps (see Figures 5 and 8), and our approach challenges eye-tracking experiments, which are often conducted with photopic, mesopic, and scotopic images without ensuring that the same lighting conditions hold during the viewing task.

This latter point has never been considered before. For instance, when we show a scotopic image (captured at night) under photopic conditions during an eye-tracking experiment, we do not reproduce the lighting conditions of a person who saw the same scene at night, which amounts to answering the question: where do people look when it is dark? There is thus a real limitation of current eye-tracking databases, which should be thoroughly revisited.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Part of this work is funded by Sebastien Makiesse family and Emmanuel Betukumesu. The author thanks Sophie for being involved in improving the style of the paper and thanks are due to Nathan Salabiaku for his involvement in the validation of the pop-out model.