Abstract

Saliency can be described as the ability of an item to stand out from its background in a particular scene, and saliency detection aims to estimate the probable locations of salient objects. Because a salient map computed from local contrast features can extract and highlight edge parts, including the painting lines of Flying Apsaras, this paper proposes an improved approach based on a frequency-tuned method for visual saliency detection of Flying Apsaras in the Dunhuang Grotto Murals, China. The improved saliency detection approach comprises three important steps: (1) image color and gray channel decomposition; (2) gray feature value computation and color channel convolution; (3) visual saliency definition based on normalization of the preceding visual saliency and a spatial attention function. Unlike existing approaches that rely on many complex image features, the proposed approach uses only local contrast and spatial attention information to simulate human visual attention stimuli, which makes the computation of the salient map much more efficient. Furthermore, experimental results on a dataset of Flying Apsaras in the Dunhuang Grotto Murals show that the proposed visual saliency detection approach is very effective when compared with five other state-of-the-art approaches.

1. Introduction

Buddhism, particularly in western China, has played a prominent role in traditional Chinese culture. This link is conveyed through the Dunhuang Grotto Murals, which have been of significant value to cultural studies across various Chinese dynasties. Protecting and maintaining the cultural integrity of the Dunhuang Grotto Murals has been a tremendous challenge for researchers. Digitization, comprehensive preservation, and virtual interaction have been adopted as the key approaches to protecting and propagating the culture embodied in the Dunhuang Grotto Murals.

Flying Apsaras are a significant symbol of Dunhuang art and carry a great deal of cultural information and connotation; their images are composed of a wide variety of elements, such as streamers, cloud totems, and so on. In addition, because of the age, diverse colors, and complex composition of the Flying Apsaras images, extracting their outlines accurately is of great significance for analyzing the painting styles of the Dunhuang Grotto Murals in different Chinese dynasties. Furthermore, due to the multioverlap phenomenon in some images of the Dunhuang Grotto Murals, that is, later works painted directly on top of earlier ones, analyzing the cultural meanings of these images is even more difficult. Fortunately, one approach, called saliency detection, can be used to analyze the different painting styles of Flying Apsaras from different historical periods in the Dunhuang Grotto Murals. Flying Apsaras and visual saliency are closely connected in two important respects: on the one hand, visual saliency can represent the initial state of these images before coloring, so saliency detection of Flying Apsaras in the Dunhuang Grotto Murals is of great historical value for historians in helping determine the original state of the images; on the other hand, the final salient map of Flying Apsaras computed from local contrast features can extract and highlight edge parts with striking local contrast, and these parts, including the painting lines, play a critical role in analyzing significant painting styles of Flying Apsaras. In addition, visual saliency detection can be used to extract the salient regions of Flying Apsaras, and these regions can represent a remarkable painting's main idea, design, and plot.

One of the most challenging problems in perception is information overload. This overload presents a challenge when trying to obtain useful information from images because of the semantic gap between low-level features and high-level semantic concepts [1]. Visual attention is one of the primary features of the human visual system (HVS) for deriving important and compact information from natural scenes [2], and its mechanism enables a reduction of redundant data that benefits perception during the selective attention process. In the image processing field, saliency detection is the process of detecting the interesting visual information. Furthermore, the selected stimuli need to be prioritized, with the most relevant processed first and the less important ones later, which leads to a sequential treatment of different parts of the visual scene. Thus, saliency detection has been widely used in discovering regions of interest, predicting human visual attention, image quality assessment, and many other applications [3]. Saliency detection approaches are used in many fields, including object detection, object recognition, image enhancement, image rendering, image compression, and video summarization [4].

Many different approaches to visual saliency detection have been proposed by researchers from a variety of backgrounds. As pioneers, Koch, Ullman, and Poggio [5, 6] proposed a very influential bioinspired model and Winner-Take-All selection mechanism to simulate the human visual system. Itti et al. [7] derived bottom-up and top-down visual saliency using center-surround differences via multiscale feature integration. Harel et al. [8] proposed a new bottom-up visual saliency detection model, graph-based visual saliency (GBVS), which forms activation maps on certain feature channels and normalizes them so as to highlight conspicuity before combining them with other maps. Liu et al. [9] used the conditional random field (CRF) approach to effectively combine a set of novel features, including multiscale contrast, center-surround histograms, and color spatial distribution, to describe salient objects locally, regionally, and globally. Goferman et al. [10] proposed a new type of saliency model, context-aware saliency, based on four principles observed in the psychological literature: local low-level features, global considerations, visual organization rules, and high-level factors. Zhai and Shah [11] developed a fast approach for computing pixel-level salient maps using the color histograms of images, with a dynamic fusion technique applied to combine the temporal and spatial salient maps. Hou and Zhang [12] proposed a spectral residual approach for visual saliency detection, which constructs the corresponding salient maps in the spatial domain and extracts the spectral residual of an image by analyzing its log-spectrum. The spectral residual was replaced by the phase spectrum of the Fourier transform in Gopalakrishnan et al. [13], because the latter is more effective and computationally efficient. Achanta et al. [14] combined low-level features of color and luminance to detect salient regions of images, an approach known as the frequency-tuned method. More recently, Yan et al. [15] employed a multilayer approach to analyze saliency cues, with the final salient map produced by a hierarchical model. Siva et al. [16] used an unsupervised learning approach that annotates objects of interest in images to capture the salient maps.

Because it employs only a few low-level image features, the frequency-tuned method proposed by Achanta et al. [14] does not produce very satisfactory visual saliency detection results. In this paper, an alternative approach, histogram combined with image average and Gaussian blur (HIG), is proposed as an improvement to the frequency-tuned method. HIG combines features of color and luminance with gray histogram information and uses the Euclidean distances between local features and global average features to define visual saliency as local contrast at the pixel level. The final visual saliency definition is based on normalization of the prior visual saliency and a spatial attention function. Unlike existing approaches that focus on multiple image features, the proposed approach focuses only on contrast and spatial attention information derived from a simulation of the HVS, and these local contrast features play a key role in highlighting edge parts, including the painting lines of Flying Apsaras. This allows for more efficient computation of salient maps and very effective performance when compared to five other state-of-the-art approaches: Zhai and Shah [11] (LC), Hou and Zhang [12] (SR), Achanta et al. [14] (IG), and Cheng et al. [17] (HC and RC). These algorithms were chosen based on their frequency of citation in the literature, popularity, and variety.

The rest of this paper is organized as follows: Section 2 provides an overview of the proposed saliency detection approach, followed by a more detailed description; Section 3 presents the experimental validations and comparison of results; finally, Section 4 contains the conclusion and recommendations for future work.

2. The Proposed Saliency Detection Approach

An overview of the improved saliency detection approach proposed in this paper for Flying Apsaras in the Dunhuang Grotto Murals is shown in Figure 1. The improved approach comprises the following three important steps: (1) image color and gray channel decomposition; (2) gray feature value computation and color channel convolution; (3) visual saliency definition based on normalization of the prior visual saliency and a spatial attention function. Using these three steps, the final salient map is computed from the input image, as illustrated by the example in Figure 1.

2.1. Decomposition of Color and Gray Components from the Input Images

It has been widely recognized among researchers that HVS does not equally process all the information available to an observer. Color contrast has a significant impact on the ability to detect salient regions in images and has been highlighted in many previous works. For example, Osberger and Rohaly [18] suggest that a strong influence occurs when the color of a region is distinct from its background and some particular colors (e.g., red, yellow, and blue) attract our attention more than others.

For an RGB image, the three color channels are linearly correlated. In addition, because the value of each color channel is not directly related to the representation of stimulus intensity, it is difficult to compute differences in color intensity. Using a variety of color transformation formulas, other color spaces (e.g., HSI, Lab, and LUV, the latter two specified by the International Commission on Illumination (CIE)) can be derived from the RGB color space. The CIE Lab color space is closest to human visual perception and perceptual uniformity, where $L$ represents luminance information and closely matches the human perception of lightness, while $a$ and $b$ represent chromatic values. The transformation formulas are as follows [19]:

$$L = 116 f\!\left(\frac{Y}{Y_n}\right) - 16, \qquad a = 500 \left[ f\!\left(\frac{X}{X_n}\right) - f\!\left(\frac{Y}{Y_n}\right) \right], \qquad b = 200 \left[ f\!\left(\frac{Y}{Y_n}\right) - f\!\left(\frac{Z}{Z_n}\right) \right], \quad (1)$$

where

$$f(t) = \begin{cases} t^{1/3}, & t > (6/29)^{3}, \\ \dfrac{1}{3}\left(\dfrac{29}{6}\right)^{2} t + \dfrac{4}{29}, & \text{otherwise}, \end{cases} \quad (2)$$

$(X, Y, Z)$ are the tristimulus values obtained from the RGB channels by a linear transformation, and $(X_n, Y_n, Z_n)$ are those of the reference white point. In this paper, we utilize this nonlinear relational mapping of the color channels $L$, $a$, and $b$, together with gray histogram information, to imitate the nonlinear response of the human eye. In addition, CIE Lab is among the most widely used and efficient image features for visual saliency detection algorithms, for example, in the research of Itti et al. [7], Achanta et al. [20], Achanta et al. [14], Goferman et al. [10], Cheng et al. [17], and Yang et al. [21].
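To make the channel decomposition concrete, the following sketch converts an RGB image to CIE Lab by hand using the standard sRGB/D65 matrices and the nonlinearity of (2). It is illustrative only: the helper names are our own, the sRGB gamma linearization is omitted for brevity, and in practice a library routine such as cv2.cvtColor(img, cv2.COLOR_BGR2LAB) or skimage.color.rgb2lab would be used instead.

```python
# Illustrative RGB -> CIE Lab conversion (eqs. (1)-(2)); assumes linear sRGB
# input in [0, 1] and the D65 white point.
import numpy as np

RGB2XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                    [0.2126729, 0.7151522, 0.0721750],
                    [0.0193339, 0.1191920, 0.9503041]])   # linear sRGB -> XYZ
WHITE = np.array([0.95047, 1.0, 1.08883])                 # (Xn, Yn, Zn), D65

def f(t):
    # Nonlinear response of eq. (2), imitating the human eye.
    delta = 6.0 / 29.0
    return np.where(t > delta**3, np.cbrt(t), t / (3 * delta**2) + 4.0 / 29.0)

def rgb_to_lab(rgb):
    # rgb: float array of shape (..., 3), assumed already gamma-linearized.
    xyz = (rgb @ RGB2XYZ.T) / WHITE            # normalized tristimulus values
    L = 116.0 * f(xyz[..., 1]) - 16.0                       # eq. (1)
    a = 500.0 * (f(xyz[..., 0]) - f(xyz[..., 1]))
    b = 200.0 * (f(xyz[..., 1]) - f(xyz[..., 2]))
    return np.stack([L, a, b], axis=-1)
```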

2.2. The Frequency-Tuned Saliency Detection Approach

Visual saliency refers to the ability of a vision system (human or machine) to select from an image a certain subset of visual information for further processing. This mechanism serves as a filter to select only the relevant information in the context of the current behaviors or tasks to be processed while ignoring other irrelevant information [4]. Achanta et al. [14] define the five key functions for a saliency detector as being able to (1) emphasize the largest salient objects; (2) uniformly highlight whole salient regions; (3) establish well-defined boundaries of salient objects; (4) disregard high frequencies arising from texture, noise, and blocking artifacts; and (5) efficiently output full resolution salient maps. These functions can be used when comparing alternative visual saliency detection approaches.

From the perspective of image frequency, the IG approach [14] suggests that the input image can be divided into its low-frequency and high-frequency parts in the frequency domain. The overall structural information, such as contours and the basic composition of the original image, is reflected in the low-frequency parts, while details such as texture and noise are reflected in the high-frequency parts. In saliency detection, the overall structural information contained in the low-frequency parts has been widely used to compute visual saliency.

Let $\omega_{lc}$ and $\omega_{hc}$ represent the minimum (low cut-off) frequency and the maximum (high cut-off) frequency in the frequency domain, respectively. On the one hand, in order to accurately highlight the overall information representing potential salient objects, $\omega_{lc}$ should be low, which helps highlight the entire object consistently and uniformly. On the other hand, in order to retain the rich semantic information contained in any potential salient objects, $\omega_{hc}$ should be high, while remaining cognizant of the fact that the highest frequencies should be discarded, for they can most likely be attributed to texture or noise.

Achanta et al. [14] chose the Difference of Gaussians (DoG) filter (3) for band-pass filtering to capture the frequency limits $\omega_{lc}$ and $\omega_{hc}$. This DoG filter has also been used for interest point detection [22] and saliency detection [7, 8]. The DoG filter is defined as

$$DoG(x, y) = \frac{1}{2\pi}\left[\frac{1}{\sigma_1^{2}} e^{-\frac{x^{2}+y^{2}}{2\sigma_1^{2}}} - \frac{1}{\sigma_2^{2}} e^{-\frac{x^{2}+y^{2}}{2\sigma_2^{2}}}\right] = G(x, y, \sigma_1) - G(x, y, \sigma_2), \quad (3)$$

where $\sigma_1$ and $\sigma_2$ ($\sigma_1 > \sigma_2$) are the standard deviations of the Gaussian functions.

A DoG filter is a simple band-pass filter whose pass-band width is controlled by the ratio $\sigma_1 : \sigma_2$. The DoG filter is widely used in edge detection, since it closely and efficiently approximates the Laplacian of Gaussian (LoG) filter, cited as the most satisfactory function for detecting intensity changes when the standard deviations of the Gaussians are in the ratio $\sigma_1 : \sigma_2 = 1 : 1.6$ [23]. If we define $\sigma_1 = \rho\sigma$ and $\sigma_2 = \sigma$, namely, $\rho = \sigma_1 / \sigma_2 > 1$, then the bandwidth of the DoG is determined by $\rho$. Because a single DoG operation cannot cover the whole bandwidth, several narrow band-pass DoG filters are combined, and it is found that a summation over DoG filters with standard deviations in the ratio $\rho$ covers the bandwidth:

$$\sum_{n=0}^{N-1}\left[G(x, y, \rho^{n+1}\sigma) - G(x, y, \rho^{n}\sigma)\right] = G(x, y, \rho^{N}\sigma) - G(x, y, \sigma) \quad (4)$$

for an integer $N \geq 0$, which is simply the difference of two Gaussians whose standard deviations can have any ratio $K = \rho^{N}$. That is, we can obtain the combined result of applying several band-pass filters by choosing a DoG with a relatively large $K$. In order to make the bandwidth as large as possible, $\sigma_1$ must be large. In the extreme case $\sigma_1 \to \infty$, $G(x, y, \sigma_1)$ is effectively the average vector of the entire image. This process illustrates how the salient regions will be fully covered and not just focused on edges or in the centers of the regions [14].
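The telescoping identity in (4) can be checked numerically: summing $N$ narrow band-pass DoG filters whose standard deviations are in the ratio $\rho$ collapses, term by term, to a single wide DoG. The sketch below verifies this; the kernel size, $\sigma$, $\rho$, and $N$ are arbitrary example values of our own choosing.

```python
# Numerical check of the telescoping DoG sum, eq. (4): the intermediate
# Gaussians cancel, leaving G(rho^N * sigma) - G(sigma).
import numpy as np

def gaussian2d(size, sigma):
    # Normalized 2D Gaussian kernel of the given side length.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

size, sigma, rho, N = 101, 1.0, 1.6, 5
summed = sum(gaussian2d(size, rho**(n + 1) * sigma) - gaussian2d(size, rho**n * sigma)
             for n in range(N))
single = gaussian2d(size, rho**N * sigma) - gaussian2d(size, sigma)
print(np.allclose(summed, single))  # True: the sum equals one wide DoG
```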

The main steps of the IG approach are as follows: (1) the input RGB image is transformed into the CIE Lab color space; (2) the salient map $S$ of the original image is computed by

$$S(x, y) = \left\| I_{\mu} - I_{\omega_{hc}}(x, y) \right\|, \quad (5)$$

where $I_{\mu}$ is the arithmetic mean image feature vector, $I_{\omega_{hc}}$ is a Gaussian-blurred version of the original image using a 5 × 5 separable binomial kernel, $\left\| \cdot \right\|$ is the $L_2$ norm, and $x$ and $y$ are the pixel coordinates.

In conclusion, the IG approach is a very simple visual saliency method to implement, requiring only a Gaussian blur and an image averaging process. From an implementation point of view, the IG approach is a global contrast method, where the global information is the mean feature vector of the original image.
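For reference, a minimal sketch of the IG baseline of (5) follows, assuming NumPy and OpenCV; cv2.GaussianBlur with a 5 × 5 kernel stands in here for the separable binomial kernel of the original paper.

```python
# Minimal sketch of the frequency-tuned (IG) baseline of Achanta et al. [14].
import cv2
import numpy as np

def ig_saliency(bgr_image):
    # Work in CIE Lab, in floating point.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    # I_mu: arithmetic mean feature vector over the whole image.
    mean_vec = lab.reshape(-1, 3).mean(axis=0)
    # I_omega_hc: Gaussian-blurred version of the original image.
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)
    # S(x, y) = || I_mu - I_omega_hc(x, y) ||, per-pixel L2 norm, eq. (5).
    return np.linalg.norm(blurred - mean_vec, axis=2)
```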

2.3. The Improved Saliency Detection Approach

The saliency of an input image can be described as the probability that an item, distinct from its background, will be detected and output to a salient map in the form of an intensity map. A saliency detection algorithm aims to estimate the probable location of the salient object: a larger intensity value at a pixel indicates a higher saliency value, which means the pixel most probably belongs to the salient object. In Achanta et al. [14], features of color and luminance are exploited to define visual saliency, specifically the $L_2$ norm between the corresponding pixel vector in the Gaussian-blurred version and the mean image feature vector of the original image. In many natural scenes, particularly in three widely used datasets, Bruce's eye tracking dataset (BET) [24], Judd's eye tracking dataset (JET) [25], and Achanta's human-annotated dataset (AHA) [14], the frequency-tuned saliency detection algorithm performs excellently. However, like many other saliency detection approaches, it is not effective for images of Flying Apsaras in the Dunhuang Grotto Murals. This is probably due to the rich content of the Dunhuang Grotto Murals, which makes it a big challenge to distinguish foregrounds from the murals' backgrounds.

In order to maintain the complete structural information of salient objects, this paper considers not only features of color and luminance but also the gray histogram information of the original image and human visual attention stimuli, and it proposes an improved saliency detection approach for Flying Apsaras in the Dunhuang Grotto Murals. The gray feature contains rich luminance information and can be used to detect the salient objects of an image. In this paper, after converting the original image into a gray-scale image, we define the gray feature value $g(x, y)$ of the pixel at $(x, y)$ by the gray feature function (6), where $\sigma$ is the normalized variance of the Gaussian function and $H$ and $W$ represent the height and width of the input image $I$, respectively. Thus, the mean value of the gray feature can be expressed by

$$\bar{g} = \frac{1}{N} \sum_{x=1}^{H} \sum_{y=1}^{W} g(x, y), \quad (7)$$

where $N = H \times W$ represents the total number of pixels of the original image.

Meanwhile, after converting the input image from the RGB color space into the Lab color space, we obtain the luminance value $L(x, y)$ and the color feature values $a(x, y)$ and $b(x, y)$ of the pixel at position $(x, y)$. Then, (3) is used to calculate the Gaussian-blurred version $I_{\omega_{hc}}$ of the original image; as expressed in Achanta's research, the salient regions will then be fully covered and not just highlighted on their edges or in the centers of the salient regions. The arithmetic mean image feature vector $I_{\mu}$ of the input image is determined by

$$I_{\mu} = \frac{1}{N} \sum_{x=1}^{H} \sum_{y=1}^{W} \left[ L(x, y), a(x, y), b(x, y) \right]. \quad (8)$$

Combining features of color and luminance with the gray histogram information, this paper exploits the visual contrast principle to define visual saliency as local contrast at the pixel level. We consider pixels with a significant difference from the mean vector to be part of the salient regions and assume that pixels with values close to the mean vector are part of the background. Therefore, the saliency of the pixel at location $(x, y)$ is characterized by its difference from the mean vector; namely,

$$S'(x, y) = \left\| I_{\mu} - I_{\omega_{hc}}(x, y) \right\| + \left| \bar{g} - g(x, y) \right|, \quad (9)$$

where $\left\| \cdot \right\|$ is the $L_2$ norm and $\left| \cdot \right|$ is the absolute value function.

Because the final salient map is actually a gray image, in order to optimize the display of the final salient map, it can be processed by a gray-scale transformation based on a normalization function. In this paper, the normalized salient map is computed by

$$S_N(x, y) = \frac{S'(x, y) - S'_{\min}}{S'_{\max} - S'_{\min}} \times 255, \quad (10)$$

where $S'_{\max}$ and $S'_{\min}$ denote the maximum and minimum values of the salient map $S'$, respectively. Normalization ensures that the values of $S_N$ have a unified range from 0 to 255 across different image locations, which is convenient for visualization of the final output.

Generally, when an observer views an image, he or she has a tendency to stare at the central location without making eye movements; this phenomenon is referred to as the central effect [26]. In Zhang et al. [26], when observers search scenes for a conspicuous target, fixation distributions shift from the image center toward the distributions of image features. Based on this center-shift mechanism, in this paper we define a spatial attention function $w(x, y)$, a weight decay function, to reflect the spatial attention tendency of human eyes as follows:

$$w(x, y) = 1 - \frac{d(x, y)}{D}, \quad (11)$$

where $d(x, y)$ denotes the distance between the pixel at location $(x, y)$ and the center of the given image and $D$ is the length of the diagonal of the input image. For any pixel at location $(x, y)$, $0 \le d(x, y) \le D/2$; thus, $1/2 \le w(x, y) \le 1$, and $w(x, y)$ is a monotonically decreasing function of the variable $d(x, y)$.

In summary, the definition of the spatial attention function is reasonable, since it does not exclude the possibility that pixels located at the boundary of an image will be noticed [26]. In order to integrate the extracted features with the interaction effects of human observation measured by the spatial attention function $w(x, y)$, the final salient map in the proposed approach is evaluated by

$$S_F(x, y) = w(x, y) \cdot S_N(x, y). \quad (12)$$

This improved saliency detection approach, which combines gray histogram information, features of color and luminance, and the human spatial visual attention function to solve the visual saliency detection problem for the Flying Apsaras in the Dunhuang Grotto Murals, is summarized in Algorithm 1.

Algorithm 1. The improved saliency detection approach for Flying Apsaras in the Dunhuang Grotto Murals, China, is as follows.

Input: Images of Flying Apsaras in the Dunhuang Grotto Murals, China.

Output: Visual saliency detection maps of images in the Dunhuang Grotto Murals, China.

Process:

(1) Convert the original image into a gray-scale image, and then compute the gray feature value $g(x, y)$ of each pixel according to the gray feature function (6). Obtain the mean gray feature value $\bar{g}$ simultaneously.

(2) Convert the original RGB color space image into the CIE Lab color space format to obtain the three property values $L$, $a$, and $b$ of the three color channels.

(3) Use a separable binomial kernel to obtain the Gaussian-blurred version $I_{\omega_{hc}}$ of the original image, and obtain the arithmetic mean image feature vector $I_{\mu}$ of the input image simultaneously.

(4) Compute the new saliency definition shown in (9), and use (10) to export the normalized salient map $S_N$.

(5) Apply the spatial attention function $w(x, y)$ to the normalized salient map to obtain the final visual saliency map $S_F$ as in (12).

(6) Export the final visual saliency detection maps of the original images.
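The following sketch assembles Algorithm 1 end to end under the equation forms given above. Since (6) is referenced in the text without its full expression, the gray feature $g(x, y)$ is approximated here by a Gaussian-smoothed gray-scale image; this is an assumption of the sketch, not the paper's exact definition.

```python
# Sketch of the HIG approach (Algorithm 1), under the stated assumptions.
import cv2
import numpy as np

def hig_saliency(bgr_image):
    h, w = bgr_image.shape[:2]
    # Step (1): gray feature g(x, y) (assumed Gaussian-smoothed gray image)
    # and its mean value, eq. (7).
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY).astype(np.float64)
    g = cv2.GaussianBlur(gray, (5, 5), 0)
    g_mean = g.mean()
    # Step (2): CIE Lab decomposition.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    # Step (3): Gaussian-blurred version and mean feature vector, eq. (8).
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)
    mean_vec = lab.reshape(-1, 3).mean(axis=0)
    # Step (4): local contrast saliency, eq. (9), then normalization, eq. (10).
    s = np.linalg.norm(blurred - mean_vec, axis=2) + np.abs(g_mean - g)
    s_norm = (s - s.min()) / (s.max() - s.min() + 1e-12) * 255.0
    # Step (5): spatial attention weighting, eqs. (11)-(12).
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.hypot(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)  # distance to center
    D = np.hypot(h, w)                                    # diagonal length
    return (1.0 - d / D) * s_norm                         # final map S_F
```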

3. Experimental Results

In order to evaluate the performance of this proposed visual saliency detection approach, an image dataset of Flying Apsaras in the Dunhuang Grotto Murals was established from Zheng and Tai’s [27] book. To prove the effectiveness of the proposed approach, this paper compares experimental results obtained using the proposed approach with those obtained when applying five other state-of-the-art algorithms.

3.1. The Dataset of Flying Apsaras in the Dunhuang Grotto Murals

This paper collected the experimental data (300 images) of the Flying Apsaras in the Dunhuang Grotto Murals published in Zheng and Tai’s [27] book. These pictures cover four historical periods: the beginning stage (Northern Liang (421–439 A.D.), Northern Wei (439–535 A.D.), and Western Wei (535–556 A.D.)), the developing stage (Northern Zhou (557–581 A.D.), Sui Dynasty (581–618 A.D.), and Early Tang (618–712 A.D.)), the flourishing stage (High Tang (712–848 A.D.), Late Tang (848–907 A.D.), and Five Dynasties (907–960 A.D.)), and the terminal stage (Song Dynasty (960–1036 A.D.), Western Xia (1036–1227 A.D.), and Yuan Dynasty (1227–1368 A.D.)).

3.2. Saliency Detection Results of Four Historical Periods

All of the algorithms were used to compute salient maps for the same dataset. Our approach was implemented in the MATLAB R2009a software environment; the other algorithms were downloaded from the authors' homepages. All of these algorithms were run on Windows XP with an Intel(R) Core(TM) i3-2130 CPU @ 3.40 GHz processor and 4 GB of memory. Figure 2 shows that the salient maps produced by the proposed algorithm can highlight cultural elements such as streamers, cloud totems, and the clothes of Flying Apsaras; in particular, for images from the flourishing stage, the line structure and spatial layout that reflect painting styles are highlighted for further processing, for example, image enhancement or image rendering. This is beneficial to the objective of preserving the complete structural information of salient objects, and it is useful for researchers analyzing the spatial structure of streamers from different periods.

3.3. Comparison with Five Other State-of-the-Art Approaches

Visual saliency detection aims to produce salient maps of images by simulating the behavior of the HVS. The proposed approach is compared with five state-of-the-art saliency detection approaches: HC [17], LC [11], RC [17], SR [12], and IG [14]. In our experiments, the visual saliency ground truth is constructed from the eye fixation patterns of 35 observers. State-of-the-art performance of visual saliency detection for real-scene images with fixation ground truth of this kind was reported by Judd et al. [25].

Figure 3 illustrates that the salient area can be adequately distinguished from the background area by the proposed approach. One reason for this is that the contrast based on the color components and gray histogram information is more distinguishable than other representations; the other is that a spatial attention function simulating human visual attention stimuli is adopted in our approach. From the perspective of highlighting the painting lines of Flying Apsaras, the final salient maps produced by the proposed approach are more consistent than those of the other approaches, with the ability to highlight the painting lines of the Flying Apsaras' streamers in the Dunhuang Grotto Murals while simultaneously guaranteeing their integrity. Moreover, from Figure 3 it can be concluded that the saliency detection results obtained using the proposed approach (HIG) outperform those obtained using the frequency-tuned method (IG).

3.4. Performance Evaluation of Different Saliency Detection Approaches

In order to evaluate the pros and cons of each approach in a quantitative way, a False Positive Rate versus True Positive Rate (FPR-TPR) curve, a standard technique in the information retrieval community, is adopted in this paper for our Flying Apsaras dataset. The False Positive Rate and True Positive Rate (or Recall) are defined as [28–30]

$$FPR = \frac{FP}{FP + TN}, \qquad TPR = \frac{TP}{TP + FN}, \quad (13)$$

where $FP$ is the number of false positives (pixels in the detected salient regions that do not belong to the ground truth), $TN$ is the number of true negatives (background pixels correctly excluded from the detected salient regions), $TP$ is the number of true positives (pixels in the detected salient regions that belong to the ground truth), and $FN$ is the number of false negatives (ground-truth salient pixels missed by the detection). The False Positive Rate is the probability that a true negative is labeled as a false positive, and the True Positive Rate corresponds to the fraction of ground-truth salient pixels that are correctly detected.

Using MATLAB software, the FPR-TPR curves were plotted for the different saliency detection algorithms, as shown in Figure 4. Our experiment follows the settings in [14, 17], where the salient maps are binarized at each possible threshold within the range $[0, 255]$. Our approach achieves the highest TPR over almost the entire FPR range $[0, 1]$. This is because combining the saliency information from the three cues (color and luminance contrast, gray histogram information, and spatial attention) makes the background generally have low saliency values, so only sufficiently salient objects are detected.

From the FPR-TPR curves in Figure 4, we can see that our approach performs better when the FPR is low; thus, it can be concluded that the improved saliency detection approach (HIG) is superior to the other five state-of-the-art approaches. At the same TPR, the proposed algorithm has the lowest FPR, which means it produces fewer false detections; and at the same FPR, the proposed algorithm has the highest TPR, which means it detects salient regions more accurately.

In addition, Precision, Recall (or True Positive Rate), the $F$-measure, and the value of the area under the FPR-TPR curve (AUC) are also used to evaluate the performance of these saliency detection approaches quantitatively. Recall is defined as above, and Precision is defined as [14]

$$Precision = \frac{TP}{TP + FP}. \quad (14)$$

Precision corresponds to the percentage of detected salient pixels correctly assigned, while Recall corresponds to the fraction of ground-truth salient pixels that are detected. High Recall can be achieved at the expense of reduced Precision, so it is necessary and important to measure them together; therefore, the $F$-measure is defined as [14]

$$F_{\beta} = \frac{(1 + \beta^{2}) \times Precision \times Recall}{\beta^{2} \times Precision + Recall}. \quad (15)$$

We use $\beta^{2} = 0.3$, as suggested in Achanta et al. [14], to weight Precision more than Recall.
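A sketch of this evaluation protocol is given below, assuming a salient map scaled to $[0, 255]$ and a binary ground-truth mask; the fixed operating point for Precision, Recall, and the $F$-measure uses the adaptive threshold of twice the mean saliency from [14], which is an assumption of this sketch.

```python
# Sketch of the quantitative evaluation: FPR-TPR sweep, eq. (13); Precision,
# eq. (14); F-measure, eq. (15); and AUC of the FPR-TPR curve.
import numpy as np

def evaluate(salient_map, ground_truth, beta_sq=0.3):
    gt = ground_truth.astype(bool)
    fprs, tprs = [], []
    for t in range(256):                          # every threshold in [0, 255]
        detected = salient_map >= t
        tp = np.sum(detected & gt)
        fp = np.sum(detected & ~gt)
        fn = np.sum(~detected & gt)
        tn = np.sum(~detected & ~gt)
        fprs.append(fp / max(fp + tn, 1))         # FPR, eq. (13)
        tprs.append(tp / max(tp + fn, 1))         # TPR (Recall), eq. (13)
    # One fixed operating point: adaptive threshold of 2x mean saliency [14].
    detected = salient_map >= 2.0 * salient_map.mean()
    tp = np.sum(detected & gt)
    fp = np.sum(detected & ~gt)
    fn = np.sum(~detected & gt)
    precision = tp / max(tp + fp, 1)              # eq. (14)
    recall = tp / max(tp + fn, 1)
    f_beta = ((1 + beta_sq) * precision * recall
              / max(beta_sq * precision + recall, 1e-12))  # eq. (15)
    # Both FPR and TPR decrease as the threshold grows, so reverse for AUC.
    auc = np.trapz(tprs[::-1], fprs[::-1])
    return fprs, tprs, precision, recall, f_beta, auc
```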

Compared with the state-of-the-art results in Figure 5, we can conclude that the ranking of these approaches is HIG > HC [17] > LC [11] > IG [14] > RC [17] > SR [12]. Our improved approach (HIG) shows the highest Precision (90.13%), Recall (94.59%), $F$-measure (91.12%), and AUC (91.51%) values in a statistical sense, which are somewhat higher than those of HC [17], with Precision (84.52%), Recall (93.82%), $F$-measure (86.50%), and AUC (89.39%) values; this conclusion is also supported by Figure 4.

4. Conclusion and Future Works

In this paper, an improved saliency detection approach based on a frequency-tuned method for Flying Apsaras in the Dunhuang Grotto Murals has been proposed. The proposed approach utilizes CIE Lab color space information and gray histogram information. In addition, a spatial attention function simulating human visual attention stimuli is proposed, which is easy to implement and provides consistent salient maps. Furthermore, it is beneficial for highlighting the painting lines of the Flying Apsaras' streamers in the Dunhuang Grotto Murals while simultaneously guaranteeing their integrity. The final experimental results on the dataset of Flying Apsaras in the Dunhuang Grotto Murals show that the proposed approach is very effective when compared to the five state-of-the-art approaches. In addition, the quantitative comparison results demonstrate that the proposed approach outperforms the five classical approaches in terms of FPR-TPR curves, Precision, Recall, the $F$-measure, and AUC.

Future research will consider the use of deep learning techniques to extract semantic information from the Dunhuang Murals, and, based on the extracted semantic information, visual saliency detection approaches should be enhanced. More recent approaches, including Deep Belief Networks (DBNs), Restricted Boltzmann Machines (RBMs), and Convolutional Neural Networks (CNNs), could be applied to recognize the various activities depicted in the Dunhuang Grotto Murals, for example, musical entertainment scenes. Recognition of these scenes would be highly valuable in ongoing efforts to analyze Chinese cultural backgrounds and their evolution in the dynasties between 421 A.D. and 1368 A.D.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported in part by National Key Basic Research Program of China (no. 2012CB725303), the National Science Foundation of China (no. 61170202, no. 41371420, no. 61202287, and no. 61203154), and the Fundamental Research Funds for the Central Universities (no. 2013-YB-003). The authors would like to thank the anonymous reviewers and the academic editor for the valuable comments.