Abstract

This paper proposes a simple framework for face photo-sketch synthesis. We first describe the shadow details on faces and extract the prominent facial features by two-scale decomposition using bilateral filtering. Then, we enhance the hair and some unapparent facial feature regions by combining the edge map and the hair color similarity map. Finally, we obtain the face photo sketch by adding the results of the two processes. Compared with current methods, the proposed framework requires no feature localization, training, or iteration, creates vivid hair in sketch synthesis, and handles arbitrary lighting conditions in input images, especially complex self-shadows. More importantly, it can easily be extended to natural scenes. The effectiveness of the presented framework is evaluated on a variety of databases.

1. Introduction

Face sketching is a simple yet expressive representation of faces. It depicts a concise sketch of a face that captures the most essential perceptual information with a number of strokes [1]. It has useful applications for both digital entertainment and law enforcement.

In recent years, two kinds of representative methods for computer-based face sketching have been presented: (1) line drawing based [1–5] and (2) eigen-transformation based [6, 7]. Line-drawing-based methods are expressive in conveying 3D shading information, at the cost of losing sketch texture. The performance of these approaches largely depends on the shape extraction and facial feature analysis algorithms, such as the active appearance model [5]. Other line drawing methods use a compositional and-or graph representation [8, 9] or the direct combined model [10] to generate face photo cartoons. Eigen-transformation-based approaches use complex mathematical models to synthesize face sketches, such as PCA, LDA, E-HMM, and MRF. Gao et al. [6] use an embedded hidden Markov model and a selective ensemble strategy to synthesize sketches from photos. However, the hair region is excluded in PCA/LDA/E-HMM-based methods [11]. Wang and Tang [2] use a multiscale Markov random field (MRF) model to synthesize face photo sketches and recognize them. The face region is first divided into overlapping patches for learning, and the size of the patches determines the scale of local face structures to be learned. The joint photo-sketch model is then learned from a training set at multiple scales using a multiscale MRF model. This method models both face shape and texture and can provide more texture information. However, current approaches have three disadvantages: (1) both line-drawing and eigen-transformation-based methods require complex computation; (2) the synthesized face sketch is often unnecessarily exaggerated, depicting facial features with distortion; (3) most existing methods can only sketch human faces and fail to apply to natural scene images.

In this paper, we present a novel and simple face photo-sketch synthesis framework. The hair is synthesized using a two-scale decomposition and a color similarity map. The proposed framework is very simple, without any iteration or facial feature extraction. In particular, the proposed method can easily be applied to sketching natural scenes.

A schematic overview of our framework is shown in Figure 1. First, for an input face image, a two-scale image decomposition by bilateral filtering is used to describe the shading texture and the prominent feature shapes, while color-similarity-map-based hair creation generates the hair texture and the unapparent facial features. Then, an edge map is computed by applying an edge detector to the skin color similarity map. The hair, eye, and mouth regions are then enhanced by multiplying the edge map with the hair color similarity map. Finally, the face photo sketch is synthesized by adding the results of the former two processes, as sketched below.
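For concreteness, the combination stage can be written in a few lines. Below is a minimal Python/NumPy sketch of the final multiply-and-add step only, assuming the intermediate maps (the two detail layers, the edge map, and the hair similarity map) have already been computed as float arrays of equal shape; how each map is obtained is detailed in Sections 2 and 3.

```python
import numpy as np

def combine_sketch(detail_shadow, detail_feature, edge_map, hair_map):
    """Combine the framework's intermediate results into the final sketch:
    a multiplication enhances hair/eye/mouth regions, and an addition
    merges the two processes (see Figure 1)."""
    enhanced = edge_map * hair_map                  # hair/feature enhancement
    sketch = detail_shadow + detail_feature + enhanced
    return np.clip(sketch, 0.0, 1.0)                # keep in display range
```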

2. Bilateral Filter

2.1. Bilateral Filter

The bilateral filter is an edge-preserving filter developed by Tomasi and Manduchi [12]. It is a normalized convolution in which the weighting for each pixel $q$ is determined by its spatial distance from the center pixel $p$, as well as its relative difference in intensity. The spatial and intensity weighting functions $f$ and $g$ are typically Gaussian [13, 14]. The spatial kernel increases the weight of pixels that are spatially close, and the weight in the intensity domain decreases the weight of pixels with large intensity differences. Therefore, the bilateral filter effectively blurs an image while keeping sharp edges intact. For an input image $I$, output image $J$, and a window $\Omega$ neighboring $p$, the bilateral filter is defined as follows:
$$J_p = \frac{1}{k(p)} \sum_{q \in \Omega} f(p - q)\, g(I_p - I_q)\, I_q, \qquad k(p) = \sum_{q \in \Omega} f(p - q)\, g(I_p - I_q),$$
where $\sigma_s$ and $\sigma_r$ are the scales of the spatial kernel and the range kernel, corresponding to the Gaussian functions $f(p - q) = \exp(-\|p - q\|^2 / 2\sigma_s^2)$ and $g(t) = \exp(-t^2 / 2\sigma_r^2)$. When $\sigma_s$ increases, larger features in the image are smoothed; when $\sigma_r$ increases, the bilateral filter becomes closer to a Gaussian blur.
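The normalized convolution above is straightforward to implement directly. The following brute-force Python/NumPy sketch follows the equation literally (the paper instead uses the accelerated piecewise-linear variant of Durand and Dorsey [19]; OpenCV's cv2.bilateralFilter is a practical drop-in):

```python
import numpy as np

def bilateral_filter(I, sigma_s, sigma_r):
    """Brute-force bilateral filter: J_p = (1/k(p)) sum_q f(p-q) g(I_p-I_q) I_q.
    I is a 2-D float array; sigma_s and sigma_r are the spatial/range scales."""
    radius = max(1, int(3 * sigma_s))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    f = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))      # spatial kernel
    Ipad = np.pad(I, radius, mode='edge')
    J = np.zeros_like(I)
    for y in range(I.shape[0]):
        for x in range(I.shape[1]):
            window = Ipad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            g = np.exp(-(window - I[y, x])**2 / (2 * sigma_r**2))  # range kernel
            w = f * g
            J[y, x] = (w * window).sum() / w.sum()       # normalized convolution
    return J
```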

2.2. Facial Feature Detail Detection by Two-Scale Image Decomposition

Multiscale image decomposition (or multiscale retinex, MSR) was developed by Jobson et al. [15–17] in an attempt to bridge the gap between images and the human observation of scenes. It is widely used in HDR image rendering for color reproduction and contrast reduction [18, 19], color enhancement, and color constancy processing. In HDR image rendition, the smallest scale is strong on detail and dynamic range compression but weak on tonal and color rendition; the reverse is true for the largest spatial scale. Multiscale retinex combines the strengths of each scale and mitigates their weaknesses [15].

Durand and Dorsey [19] use only a two-scale decomposition, splitting the input image into a "base" and a "detail" image. The base layer, obtained by bilateral filtering, has its contrast reduced and contains only large-scale intensity variations. The detail layer is the division of the input intensity by the base layer; its magnitude is left unchanged, thus preserving detail.

We now describe how two-scale decomposition can be used for handling lighting conditions and for facial feature detail detection in face photo-sketch synthesis. Following the image decomposition, we obtain the detail image by subtracting the base layer from the input image. The detail image preserves the important details of the input image, such as edges, texture, and shadows, depending on the degree of smoothing: the more heavily the input image is smoothed, the more details are preserved in the detail layer. Figures 2 and 3 illustrate this phenomenon.

On the other hand, facial feature and shadow details are very important to a face photo sketch. The difference between sketches and photos mainly lies in two aspects, texture and shape, which are often exaggerated by the artist in a sketch. The texture contains hair texture and shadow texture [2]. In this paper, we obtain good shading effects near the features of interest from the detail images produced by the two-scale image decomposition proposed by Durand and Dorsey [19]. The shapes of the obvious facial features are also obtained from the detail images.

The two-scale decomposition is performed on the logs of pixel intensities using piecewise-linear bilateral filtering and subsampling. On the one hand, logs of intensities are used because an image can be considered a product of reflectance and illumination components. The decomposition can thus be viewed as separating an image into intrinsic layers of reflectance and illumination [20–22], where the base layer corresponds to the illumination component and the detail layer to reflectance [23]. Therefore, we can obtain the facial features from the detail layer because of the distinct reflectance difference between facial features and skin regions. In fact, human vision is mostly sensitive to reflectance rather than to illumination conditions. Moreover, by the shape of its curve, the logarithm handles low-intensity pixels far better than high-intensity ones. On the other hand, piecewise-linear bilateral filtering in the intensity domain combined with subsampling in the spatial domain efficiently accelerates the bilateral filtering.
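A compact sketch of this log-domain decomposition, assuming an 8-bit grayscale input; OpenCV's standard bilateral filter stands in for the accelerated piecewise-linear version, and the 1e-3 offset is an assumed guard against log(0):

```python
import cv2
import numpy as np

def two_scale_decompose(gray, sigma_s, sigma_r):
    """Two-scale decomposition in the log domain: the base layer (bilateral
    filtering of log intensities) tracks illumination, and the detail layer
    (log intensities minus base) tracks reflectance."""
    log_I = np.log(gray.astype(np.float32) / 255.0 + 1e-3)
    base = cv2.bilateralFilter(log_I, -1, sigma_r, sigma_s)  # edge-preserving smoothing
    detail = log_I - base
    return base, detail
```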

As described in Section 2.1, when the scale of the spatial kernel and/or the scale of the intensity domain increases, the input image is smoothed more. Although the scale of the spatial kernel has little influence on the overall smoothing result, it plays an important role in facial feature detection. Several conclusions can be observed. (1) For the same small intensity-domain scale, increasing the spatial-domain scale smooths more pixels near the edges of the facial features, resulting in heavier shadows in the detail layer without changing the edges. The results obtained with different spatial scales are shown in Figure 2. (2) For the same spatial-domain scale, increasing the intensity-domain scale smooths the input image more directly than in (1), because the larger intensity scale smooths more pixels on the edges [24, 25]. Figure 3 illustrates the different results as the intensity scale changes.

Further research demonstrates that facial features such as hair, eyebrows, eyes, mouth, and nose, which have lower intensity than skin, can be obtained in the detail image by two-scale decomposition using piecewise bilateral filtering. With a small scale in the intensity domain and a larger scale in the spatial domain, we obtain more reflectance-component responses in low-contrast areas, such as the shadows near the nose and the chin, as shown in Figure 2. With a small scale in the spatial domain and a larger scale in the intensity domain, lower-intensity pixels in small regions appear in the detail image, including the eyes and eyebrows, as shown in Figure 3. When the scales in both the intensity and spatial domains are set to their largest values, we obtain the lowest-intensity regions, such as black hair, as shown in Figure 4. However, it is noticeable that this works only for dark, fuscous hair. To create good hair of arbitrary color, we propose a new method in Section 3. In addition, because of highlights on some mouths, especially when a mouth's color and intensity are similar to the facial skin, the mouth may not be extracted at this step; unapparent mouth extraction is dealt with in Section 3.

In this paper, we define the two-scale decomposition of the input image as follows [16]:
$$R_{\mathrm{MSR}_i} = \sum_{n=1}^{N} w_n R_{n_i},$$
where $R_{\mathrm{MSR}_i}$ is the $i$th color component of the MSR output, $i \in \{R, G, B\}$, $N$ is the number of scales, $w_n$ is the weighting factor for the $n$th scale, $I_i$ is the image distribution in the $i$th color band, "$*$" denotes the convolution operation, and $F_n$ is the weighting function in the $n$th bilateral filtering; that is, $R_{n_i}$ is given by
$$R_{n_i}(x, y) = \log I_i(x, y) - \log\big[F_n(x, y) * I_i(x, y)\big].$$
So the base image is the output of the bilateral filtering,
$$B_{n_i}(x, y) = \log\big[F_n(x, y) * I_i(x, y)\big],$$
and the detail image is
$$D_{n_i}(x, y) = \log I_i(x, y) - B_{n_i}(x, y).$$
In conclusion, to balance speed and effect, we set the relative scales in the two-scale decomposition as follows. (1) For creating the shadows near the facial features, the small intensity scale is set to a constant value of 0.05~0.08, and the associated spatial scale is set to a constant value of 5% of the image column size. (2) For creating the clear facial features, including eyes, eyebrows, and sometimes the mouth, the large intensity scale is set to a constant value of 0.35, and the spatial scale is set to a constant value of 2% of the image column size. Experimental results demonstrate that these fixed scale values perform consistently well for all our face images. The results and analysis process are shown in Figure 5.
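To make the two settings concrete, the sketch below maps them onto the two_scale_decompose sketch from Section 2.2; sigma_r = 0.07 is one choice within the stated 0.05~0.08 range:

```python
def face_detail_layers(gray):
    """Run the two fixed-scale decompositions on a grayscale face image,
    reusing two_scale_decompose from the sketch above."""
    cols = gray.shape[1]
    # (1) shadows near facial features: small range scale, 5% spatial scale.
    _, d_shadow = two_scale_decompose(gray, sigma_s=0.05 * cols, sigma_r=0.07)
    # (2) clear facial features (eyes, eyebrows, sometimes mouth):
    #     large range scale, 2% spatial scale.
    _, d_feature = two_scale_decompose(gray, sigma_s=0.02 * cols, sigma_r=0.35)
    return d_shadow, d_feature
```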

3. Hair and Facial Feature Creating

3.1. Color Similarity Map

In this section, we discuss two problems: (1) computing the color similarity map of the input image, which is used to select the skin region and the hair region separately, and (2) creating the hair and the unapparent facial features.

To detect the skin color region, we propose a skin/hair classification method based on color similarity. A Gaussian similarity measure is defined to compare two colors: the similarity of every pixel to a specific skin color is computed directly from their color difference, applied in the same way to each pixel of the image. The larger the difference between the current pixel's color and the specified color, the lower the probability that the pixel belongs to skin (or hair). Let $G$ denote the Gaussian similarity map and $c_s$ the specified (known) color; the color similarity function can be defined as
$$G(x, y) = \exp\!\left(-\frac{\Delta E(x, y)^2}{2T^2}\right),$$
where $c(x, y)$ is the color of pixel $(x, y)$ in CIE Lab color space, $\Delta E(x, y)$ is the color difference between the specified color $c_s$ and the pixel in the input image, whose value is determined by a CIELAB color-difference equation, and $T$ is the threshold of color difference, which determines whether the current pixel belongs to the same class as the known color. Generally, we keep $T$ constant at a value of 30.
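A short sketch of the similarity map, assuming the CIE76 color difference (plain Euclidean distance in Lab) as the CIELAB color-difference equation; the benchmark color is given in standard Lab units:

```python
import cv2
import numpy as np

def color_similarity_map(bgr, benchmark_lab, T=30.0):
    """Gaussian color-similarity map G = exp(-dE^2 / (2 T^2)) in CIE Lab."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Undo OpenCV's 8-bit Lab packing: L is stored in [0,255], a/b offset by 128.
    lab[..., 0] *= 100.0 / 255.0
    lab[..., 1:] -= 128.0
    d_e = np.linalg.norm(lab - np.asarray(benchmark_lab, np.float32), axis=2)
    return np.exp(-d_e**2 / (2.0 * T**2))   # 1 for identical color, -> 0 as dE grows
```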

To compute the similarity of any color to the specified color exactly, the first step is to establish a benchmark color. It can be set in two ways: specified by user interaction, or produced automatically by the program as the average color of a certain percentage of the pixels most similar to a known color, as sketched below. The time complexity of this method for establishing a skin color is lower than that of any face-detection-based algorithm. Hair and skin similarity maps can then be obtained from different benchmark colors.
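The automatic option might look as follows; the 5% figure is an assumed value for the "certain percentage" mentioned above, and the image is expected already in (rescaled) Lab:

```python
import numpy as np

def auto_benchmark(lab, known_color, percent=5.0):
    """Average the `percent` most similar pixels to a known color and return
    the mean as the benchmark color."""
    flat = lab.reshape(-1, 3).astype(np.float32)
    d = np.linalg.norm(flat - np.asarray(known_color, np.float32), axis=1)
    k = max(1, int(len(d) * percent / 100.0))
    nearest = flat[np.argsort(d)[:k]]        # the k most similar pixels
    return nearest.mean(axis=0)
```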

3.2. Hair and Unapparent Facial Feature Creating

After computing the skin color similarity, we obtain the color similarity map shown in Figure 6. Because the color and intensity of skin are distributed uniformly, the gradient is very small in the skin region, while the gradient is large in the hair region due to the irregular hair colors. In the hair color similarity map, the hair region's value is the largest. The product of the gradient of the skin color similarity map and the negative of the hair color similarity map therefore strengthens the hair region by minimizing the hair values.

The hair region enhancement is realized by multiplying the gradient edge map of the skin color similarity by the hair color similarity map proposed in Section 3.1. Since facial feature regions, such as the nose and mouth regions, have obvious color differences, they are also enhanced by this multiplication. On the other hand, the color of the eyes and eyebrows is either similar to the hair color or distinct from the skin color; both cases are enhanced by the above operation. Figure 6 shows the process of hair and facial feature creation. The proposed method can extract good hair texture under different lighting conditions. Notably, the hair creation method can be applied to other scenarios in which a hair region is necessary, for example, image abstraction, cartoon making, and virtual wig fitting.
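A sketch of this enhancement step, reusing color_similarity_map from the sketch in Section 3.1 and using the Sobel gradient magnitude as the edge map:

```python
import cv2
import numpy as np

def hair_feature_map(bgr, skin_lab, hair_lab):
    """Multiply the gradient (edge) map of the skin similarity map by the
    hair similarity map to enhance hair, eyes, eyebrows, and mouth."""
    skin = color_similarity_map(bgr, skin_lab)
    hair = color_similarity_map(bgr, hair_lab)
    gx = cv2.Sobel(skin, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(skin, cv2.CV_32F, 0, 1, ksize=3)
    edge = np.sqrt(gx**2 + gy**2)        # small on uniform skin, large at hair
    edge /= edge.max() + 1e-6
    return edge * hair
```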

4. Experimental Results

We tested the two-scale-decomposition-based face sketch synthesis framework on the CUHK face photo-sketch database, which contains 606 faces in total. All the input images are 1024 × 768 pixels. On average, for an uncropped CUHK student face image, one decomposition of an input image over the RGB components took about 15.5 seconds (24 s with the large spatial scale and 6.9 s with the small spatial scale) using two-scale image decomposition based on piecewise bilateral filtering, and hair and unapparent feature creation based on color similarity computation took about 3 seconds. The total synthesis time is therefore less than 35 seconds. If the decomposition is performed on a gray image, the corresponding time consumption is reduced to about 30 percent. All the above experiments were run on a 2.66 GHz PC. On the cropped CUHK images (URL: http://mmlab.ie.cuhk.edu.hk/facesketch.html), whose size is 200 × 250 pixels, the time cost is 2.4 seconds for the two-scale decomposition and 0.45 seconds for hair and unapparent feature creation. The speed comparison is shown in Table 1. Figure 7 shows the experimental results on the CUHK cropped images.

5. Extended Applications

5.1. Natural Scene Sketch and Line Extraction

The proposed method can be applied to natural scene sketching and line extraction with a little modification. The modified framework is shown in Figure 8. While the details are obtained by two-scale bilateral filtering, the same as in the proposed framework, high-boost filtering [29] in Figure 8 is used to enhance the highlights and shadows in the image, and the Sobel detector gives prominence to the distinct edges. The computational costs of both high-boost filtering and the Sobel detector are small, so natural scene sketching is very fast.
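A minimal sketch of these two cheap operators; the boost factor A and the blur width are assumed settings:

```python
import cv2
import numpy as np

def high_boost(gray, A=1.5, sigma=2.0):
    """High-boost filtering: A * I - lowpass(I), which sharpens detail and
    emphasizes highlights and shadows (A >= 1)."""
    g = gray.astype(np.float32)
    return A * g - cv2.GaussianBlur(g, (0, 0), sigma)

def sobel_edges(gray):
    """Sobel gradient magnitude, giving prominence to the distinct edges."""
    g = gray.astype(np.float32)
    gx = cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3)
    return np.sqrt(gx**2 + gy**2)
```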

Multiplying the Sobel edges by the two detail layers yields the initial natural scene sketch, which has a line look, as shown in Figure 9(b). We can then extract the lines using a difference of Gaussians (DoG). Without human skin similarity computation and hair extraction, the line drawing process is sped up. The results are shown in Figure 9. Our method performs well on both thick and subtle edges in the input images. Although Kang's method depicts the edges with smooth and coherent lines [26, 27], it is very slow because of the line integral convolution (LIC). In addition, it struggles to detect dense edges in some regions, such as the edges of the building in the second image. On the other hand, a DoG operator applied to the input image fails to deal with the edges in shadows and details.
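Line extraction by DoG can be as simple as the following sketch; sigma, the scale ratio k, and the threshold tau are assumed settings:

```python
import cv2
import numpy as np

def dog_lines(gray, sigma=1.0, k=1.6, tau=0.0):
    """Difference of Gaussians: dark lines where the response is negative."""
    g = gray.astype(np.float32)
    dog = cv2.GaussianBlur(g, (0, 0), sigma) - cv2.GaussianBlur(g, (0, 0), k * sigma)
    return np.where(dog < tau, 0.0, 1.0)     # 0 = line pixel, 1 = background
```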

Our line extraction approach takes less than 0.7 seconds per image to synthesize a sketch. We implemented our method in MATLAB and ran the code on a computer with a 2.20 GHz CPU. The speed comparison is shown in Table 2.

To obtain better sketches from arbitrary input images, high-boost filtering is preferred. Our approach operates well on any kind of image, such as outdoor natural scenes, animals, plants, buildings, and human faces. Some of the sketch results are shown in Figure 10. Figure 11 shows the human face sketch results on the CUHK database introduced in Section 4; all the bilateral filtering parameters are the same as in Section 2.2. Our approach also performs well on human faces: it can deal with gray and color images of different races under different lighting conditions. More results on other kinds of images are shown in the appendix.

5.2. Image Stylization

Based on the initial line sketch, we can easily extract the edges with a DoG operator. Combined with the base layer of the two-scale decomposition after color quantization, this yields a simple image abstraction as proposed in [27, 28]. The results are shown in Figure 12.
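A minimal abstraction sketch along these lines: the base layer is approximated here by a single bilateral filtering pass, the quantization level count is an assumed setting, and dog_lines is the sketch from Section 5.1:

```python
import cv2
import numpy as np

def abstract_image(bgr, levels=8):
    """Color-quantize a smoothed base layer and darken it with DoG lines."""
    base = cv2.bilateralFilter(bgr, 9, 40, 7)       # smoothed base layer
    step = 256 // levels
    quant = (base // step) * step + step // 2       # uniform color quantization
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    lines = dog_lines(gray)[..., None]              # 0 on lines, 1 elsewhere
    return (quant.astype(np.float32) * lines).astype(np.uint8)
```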

6. Discussion and Conclusion

We have presented a novel framework for human face photo-sketch synthesis that is very simple, fast, and requires no per-image parameter setting. First, by combining the results of two two-scale image decompositions, the shading details near the facial features and the prominent facial features are obtained. Second, based on the color similarity map, we extract the unapparent facial features and create a vivid hair region. Finally, the framework combines the former results with a simple addition operation and applies to faces of other races, such as white subjects. Moreover, the framework can be expanded to other applications, such as natural scene sketching and line extraction. In conclusion, the proposed framework is very simple in that no feature localization algorithm, complex mathematical model, or iteration is needed. In addition, the sketch synthesis result is more vivid than that of other methods, especially in hair texture creation. Most importantly, the method is easy to apply to other tasks, such as line extraction, natural scene sketches, and image abstraction. Sketch recognition is more and more widely used in sketch-based user interfaces [31]; given the strengths of our sketch method in line extraction, we will try to implement recognition of face photo sketches and other image sketches.

Appendix

See Figures 13, 14, 15, 16, and 17.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant no. 61003200 and by the Program for New Century Excellent Talents in University, the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, the SKL of PMTI, the SKL of CG&CAD, and the Open Projects Program of NLPR.