Abstract

Many types of deep neural networks have been proposed for human biometric identification, especially in the areas of face detection and recognition. Local deep neural networks have recently been used in face-based age and gender classification; despite their improved performance, their model training cost remains rather expensive. In this paper, we propose a local deep neural network for age and gender classification. In our proposed model, local image patches are selected based on detected facial landmarks, and the selected patches are then used for network training. A holistic edge map of the entire image is also used to train a “global” network. The age and gender classification results are obtained by combining the outputs of the “global” and local networks. Our proposed model is tested on two face image benchmark datasets and achieves competitive performance compared with state-of-the-art methods.

1. Introduction

Age estimation and gender classification from face images play important roles in many computer vision applications, such as visual surveillance, security control, and human-computer interaction. Over the last few decades, many methods have been proposed to tackle these tasks.

In early works, pixel intensity values were used directly as input to train a classifier such as a neural network [1, 2] or a support vector machine (SVM) [3]. However, as image resolution increases, directly using intensity values dramatically increases the dimensionality of the image features as well. Therefore, feature reduction techniques such as principal component analysis (PCA) have been applied to reduce the dimensions of image features [4]. Image descriptors that are more powerful for image representation have also been used for age estimation and gender recognition, such as local binary patterns (LBP) [5], the scale-invariant feature transform (SIFT) [6], Gabor filters [7], the histogram of oriented gradients (HOG) [8], and biologically inspired features (BIF) [9]. Although age estimation and gender classification have been widely investigated over the last decades, the results obtained are still far from the requirements of real applications [10, 11].

In recent years, deep learning, especially convolutional neural networks (CNN) [12–15], has become an important tool in computer vision applications. In many vision-based areas, such as image classification, object detection, pose estimation, and visual tracking, CNN have achieved superior results [16]. More recently, CNN have been employed in face image-based age and gender classification tasks [17–19]. However, as face images vary widely under unconstrained conditions (namely, in the wild), the performance of CNN still needs to be improved, especially for age estimation. Moreover, the time cost of training CNN models is quite high in most proposed solutions.

In order to reduce the cost of CNN model training, a local deep neural network (LDNN) was proposed for gender recognition [20]; the LDNN model achieves state-of-the-art performance while considerably reducing the training cost. More recently, a modified version of LDNN was proposed by Liao et al. [21]; this modified LDNN shares the same network architecture as the one used in [20]. In [21], the number of image patches used for network training is further reduced: 9 fixed image patches are selected for network learning. The modified LDNN model can be used for both age and gender recognition. However, in this model, the local image patches used for training are fixed, which may not work well on unconstrained images without careful preprocessing. In [21], the authors find that the eye areas and the mouth area are crucial for age estimation, while only the eye areas are important for gender classification.

The success of LDNNs in age and gender classification and the findings of these earlier LDNN works inspire us to propose an LDNN model for age and gender estimation. In our proposed model, local image patch selection is based on detected facial landmarks; that is, the image patches used for network learning are generated dynamically. Therefore, the number of image patches can be greatly reduced while all the important information in a face image is kept.

In [20], the Sobel edge detector is used for local feature extraction. However, it is shown in [21] that different feature detectors yield different performance. In our proposed model, holistically-nested edge detection (HED) [22] is used for global feature extraction. The age and gender classification results are obtained by combining the outputs of the “global” and local networks.

The remainder of the paper is organized as follows. In Section 2, a brief review of related work on age and gender classification using CNN is given. Section 3 introduces the proposed local deep neural network for age and gender estimation. Section 4 presents the experimental settings, results, and analysis. Conclusions and future work are included in Section 5.

2. Related Work

The successful applications of CNN to many computer vision tasks have revealed that CNN is a powerful tool for image learning. Given enough training data, CNN can learn a compact and discriminative image feature representation. Therefore, many researchers have proposed using CNN for age and gender classification from face images. In this section, the related work on age and gender classification using CNN is briefly reviewed. Previous research on local deep neural networks for age and gender estimation is also introduced.

2.1. CNN for Age and Gender Estimation

An early CNN model for age and gender estimation can be seen in [23], in which a multiscale convolutional neural network is proposed. In [18], the authors propose a convolutional net architecture that can be used even when the amount of learning data is limited. A chained CNN-based age and gender classification scheme is introduced in [24], where separate age classifiers are trained for different genders. The apparent age estimation task is investigated in [25]; the proposed model fuses real-value regression with a Gaussian label distribution-based GoogLeNet and was tested on the LAP dataset. This result was later improved by Antipov et al. [26] by fusing a general model with a children-specific model.

Some researchers suggest using deeper networks for age and gender estimation. Yang et al. introduce deep label distribution learning for apparent age estimation, where distribution-based loss functions are used for training; these can exploit the uncertainty induced by manual labeling to learn a better model than using ages directly as the target [27]. Deep age distribution learning (DADL) is proposed in [28] for age prediction. Hou et al. [29] propose a deep CNN model similar to a VGG-16 net, coupled with smooth adaptive activation functions, for age estimation. Their results were further improved by using the exact squared earth mover’s distance in the loss function [30]. In [31], convolutional neural networks are used to extract deep features, and standard support vector regression is then used for gender and age prediction. Recently, the model in [31] was further improved by adding an expected-value formulation after classification [32]. The directional age-primitive pattern (DAPP), a local face descriptor containing aging cue information, is proposed in [33]; the model obtained state-of-the-art performance on the Adience dataset.

2.2. Local Deep Neural Networks (LDNNs) for Age and Gender Estimation

Compared with CNNs, LDNNs use a different training strategy: small image patches around important regions of faces are extracted and used for network learning. An LDNN model for gender recognition is proposed in [20], where a feed-forward neural network without dropout is used. An edge detector is first applied to obtain edges in face images, and small image patches are then selected around the obtained edges. All the image patches are fed into neural networks for training. The predictions for all patches from an input test image are averaged to produce the final output. Using patches obtained in this way seldom leads to overfitting, since most of the redundant information has been removed during filtering.

Another LDNN model was proposed recently, which aims to further reduce the number of image patches used for training [21]. This model uses the same network architecture as [20]. In this modified version of LDNN, only 9 fixed image patches are used for local network training, as presented in Figure 1. In addition, the authors split an image into five rows and find that the rows containing the eye regions and the mouth region are the most important for age estimation. Despite using fewer training image patches, the model still achieves competitive performance.

3. Methodology

In this section, we describe the proposed architecture for age and gender classification. Our methodology is essentially composed of three steps: (1) face detection and facial landmark localization, (2) image patch selection based on the obtained facial landmarks, and (3) construction of the LDNN model. In the following, these three parts are described in detail.

3.1. Facial Landmark Localization and Patch Selection

The first step of our proposed model is to detect a face in an image and to locate the facial landmarks on it; both are widely investigated areas [34–36]. Currently, global spatial models, which are mainly built on local part detectors, are popularly used for landmark localization. Therefore, it is common to use mixtures of deformable part models or mixtures of trees for face detection and landmark estimation; efficient dynamic programming algorithms can then be applied to find globally optimal solutions. Without loss of generality, a mixture-of-trees model for face detection and landmark localization [37] is used here. A brief introduction of the method is given below.

The model is based on a mixture of trees with a shared pool of parts $V$. Every facial landmark is modeled as a part; global mixtures are used to capture topological changes due to viewpoint as well as deformations such as changes in expression.

Each tree-structured pictorial structure [38] is linearly parameterized and written as $T_m = (V_m, E_m)$, where $m$ represents a mixture and $V_m \subseteq V$. For an image $I$, the location of part $i$ is denoted as $l_i = (x_i, y_i)$. A configuration of parts $L = \{l_i : i \in V\}$ is scored as

$$S(I, L, m) = \mathrm{App}_m(I, L) + \mathrm{Shape}_m(L) + \alpha^m, \quad (1)$$

$$\mathrm{App}_m(I, L) = \sum_{i \in V_m} w_i^m \cdot \phi(I, l_i), \quad (2)$$

$$\mathrm{Shape}_m(L) = \sum_{ij \in E_m} a_{ij}^m\, dx_{ij}^2 + b_{ij}^m\, dx_{ij} + c_{ij}^m\, dy_{ij}^2 + d_{ij}^m\, dy_{ij}. \quad (3)$$

In (2), the appearance scores of placing the template $w_i^m$ for part $i$, tuned for mixture $m$, at location $l_i$ are summed, where $\phi(I, l_i)$ is the feature vector extracted from pixel location $l_i$ of image $I$.

Equation (3) computes the mixture-specific spatial arrangement of the parts $L$, where $dx_{ij} = x_i - x_j$ and $dy_{ij} = y_i - y_j$ represent the displacement of the $i$th part relative to the $j$th part. Each term in the sum can be interpreted as a spring that introduces spatial constraints between a pair of parts [37]. The parameters $(a_{ij}^m, b_{ij}^m, c_{ij}^m, d_{ij}^m)$ specify the rest location and rigidity of each spring.
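To make (1)–(3) concrete, the following is a minimal sketch of the scoring function for one mixture, assuming the appearance templates and spring parameters have already been learned; the container names are illustrative, not the implementation of [37].

```python
import numpy as np

def score_configuration(feat, locs, parts, edges, W, springs, alpha_m):
    """Evaluate S(I, L, m) = App_m + Shape_m + alpha^m for one mixture m.

    feat:    dict mapping (x, y) -> feature vector phi(I, l)
    locs:    dict mapping part index i -> (x_i, y_i)
    parts:   iterable of part indices V_m
    edges:   iterable of (i, j) pairs E_m
    W:       dict mapping part index i -> appearance template w_i^m
    springs: dict mapping (i, j) -> (a, b, c, d) spring parameters
    alpha_m: scalar mixture bias
    """
    # Appearance term (2): template response at each part location.
    app = sum(np.dot(W[i], feat[locs[i]]) for i in parts)
    # Shape term (3): spring deformation cost over tree edges.
    shape = 0.0
    for (i, j) in edges:
        dx = locs[i][0] - locs[j][0]
        dy = locs[i][1] - locs[j][1]
        a, b, c, d = springs[(i, j)]
        shape += a * dx**2 + b * dx + c * dy**2 + d * dy
    return app + shape + alpha_m
```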

The model is trained in a fully supervised scheme, where positive images with landmarks and mixture labels are provided, and negative images without faces are provided as well. The shape and appearance models are learned using a structured prediction framework. The Chow-Liu algorithm [39] is used to find the maximum likelihood tree structure that best describes the landmarks in a given mixture. Figure 2 shows the landmark localization result for a sample image.
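For reference, a minimal sketch of the Chow-Liu step is given below, assuming the landmark variables are stacked into a data matrix X; the discretization and variable names are our own illustration, not the training code of [37].

```python
import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score

def chow_liu_tree(X, bins=10):
    """Maximum-likelihood tree over the columns of X (Chow-Liu [39]).

    X: (n_samples, n_vars) array, e.g., landmark coordinate variables.
    Returns the maximum spanning tree under pairwise mutual information,
    which is the maximum likelihood tree structure.
    """
    n = X.shape[1]
    # Discretize each variable so pairwise mutual information can be estimated.
    Xd = np.stack(
        [np.digitize(X[:, i], np.histogram_bin_edges(X[:, i], bins=bins))
         for i in range(n)], axis=1)
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=mutual_info_score(Xd[:, i], Xd[:, j]))
    return nx.maximum_spanning_tree(G)
```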

Once the facial landmarks are obtained, the local image patches can be determined. As shown in Figure 2, a total of 68 landmark points are detected for the sample image; therefore, 68 image patches centered at the landmarks (blue squares in Figure 2) are selected for network learning. The size of an image patch can vary according to the size of the input images; in our experiments, the performances of different patch sizes are also compared.
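The patch extraction step can be sketched as follows. Here dlib's public 68-point shape predictor is used only as a stand-in for the mixture-of-trees detector of [37], and both the predictor file name and the patch size are assumptions for illustration.

```python
import cv2
import dlib
import numpy as np

# Stand-in landmark detector; the paper uses the mixture-of-trees model [37].
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmark_patches(image, patch_size=13):
    """Return one patch_size x patch_size gray patch per detected landmark."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    patches, half = [], patch_size // 2
    for face in faces:
        shape = predictor(gray, face)
        for k in range(shape.num_parts):        # 68 landmark points
            x, y = shape.part(k).x, shape.part(k).y
            patch = gray[max(y - half, 0): y + half + 1,
                         max(x - half, 0): x + half + 1]
            if patch.shape == (patch_size, patch_size):
                patches.append(patch.astype(np.float32) / 255.0)
    return patches
```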

The image patch selection used here is different from that of the two former LDNN-based methods [20, 21]. In [20], although the authors keep only the image patches whose center pixel is an active pixel in the binary edge mask, hundreds of image patches are still left for network training. In [21], only 9 fixed patches are used (Figure 3); however, in order to improve model performance, an image is divided into 5 rows, and the rows containing the eye regions and the mouth region are used to assist the model output. The patch selection in that scenario is decided rather empirically. Our landmark-based patch selection method keeps the most important information in a face image; moreover, it largely reduces the number of training patches.

3.2. Network Architecture

LDNNs are trained using the image patches extracted from the landmark regions of face images. As most of the redundant information has been discarded in our patch selection process, the remaining training patches are unlikely to lead to overfitting. Therefore, it is reasonable to use a simple feed-forward neural network; the network architecture used in [20, 21] can be directly reused here for our tasks. Figure 4 shows the network architecture.
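A minimal PyTorch sketch of such a patch-level feed-forward network is given below. The hidden-layer width and patch size are assumptions for illustration; the three hidden layers follow the setting evaluated in Section 4.

```python
import torch
import torch.nn as nn

class PatchMLP(nn.Module):
    """Simple feed-forward network over a flattened gray patch.

    The hidden width (512) is illustrative; the paper reuses the
    architecture of [20, 21] with up to three hidden layers.
    """
    def __init__(self, patch_size=13, hidden=512, n_classes=2):
        super().__init__()
        d = patch_size * patch_size
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):           # x: (batch, 1, patch, patch)
        return self.net(x)
```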

The whole procedure of our method is shown in Figure 5. For an input image, its landmark patches are detected and classified by the trained neural network, and the outputs for the patches are averaged. Following the routine of [21], the entire image can be used to improve classification performance as well; therefore, another neural network trained on the entire image is also used here. Moreover, we employ the holistically-nested edge detection (HED) detector [22] to train a third neural network for further performance improvement.
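At inference time, the patch-level predictions are averaged and then combined with the outputs of the two holistic networks. The sketch below assumes softmax outputs and equal-weight averaging as the combination rule; the exact fusion weights are an assumption for illustration.

```python
import torch

@torch.no_grad()
def predict(patches, image, edge_map, patch_net, image_net, hed_net):
    """Combine local (patch) and global (image, HED edge map) networks.

    Equal-weight averaging of the softmax outputs is an illustrative choice.
    """
    patch_probs = torch.softmax(patch_net(patches), dim=1).mean(dim=0)
    image_probs = torch.softmax(image_net(image), dim=1).squeeze(0)
    edge_probs = torch.softmax(hed_net(edge_map), dim=1).squeeze(0)
    fused = (patch_probs + image_probs + edge_probs) / 3.0
    return fused.argmax().item()
```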

HED is a deep learning-based edge detection method; it aims to learn features from which edge maps approaching the ground truth can be produced. HED uses a multiscale and multilevel structure to generate 5 side-outputs, which improve the final fusion result. The architecture of HED is shown in Figure 6.

In Figure 6, the side-output layers are inserted after the convolutional layers. Deep supervision is imposed at each side-output layer to guide the side-outputs toward edge predictions. The outputs of HED are multiscale and multilevel: the side-output plane size becomes smaller while the receptive field size becomes larger. One weighted-fusion layer is added to automatically learn how to combine the outputs from multiple scales. The entire network is trained with multiple error propagation paths (dashed lines in Figure 6). The details of HED can be found in [22].
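The side-output and fusion idea can be sketched as follows. This is a simplified HED-style module on a VGG-16 backbone, not the authors' trained detector; the tap points are the usual VGG-16 stage outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MiniHED(nn.Module):
    """HED-style edge network [22]: 5 deeply supervised side-outputs
    from the VGG-16 stages plus a learned weighted fusion (sketch)."""
    def __init__(self):
        super().__init__()
        self.backbone = vgg16(weights=None).features
        self.stage_ends = [3, 8, 15, 22, 29]       # relu1_2 ... relu5_3
        chans = [64, 128, 256, 512, 512]
        self.side = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1)
                                   for c in chans])
        self.fuse = nn.Conv2d(5, 1, kernel_size=1)  # weighted-fusion layer

    def forward(self, x):
        h, w = x.shape[2:]
        sides, feat = [], x
        for i, layer in enumerate(self.backbone):
            feat = layer(feat)
            if i in self.stage_ends:
                s = self.side[len(sides)](feat)
                sides.append(F.interpolate(s, size=(h, w), mode="bilinear",
                                           align_corners=False))
        fused = self.fuse(torch.cat(sides, dim=1))
        # During training, each side-output and the fusion output receive
        # their own supervision loss (deep supervision).
        return [torch.sigmoid(s) for s in sides], torch.sigmoid(fused)
```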

4. Experiments and Results

A series of experiments has been conducted on two popularly used face image datasets, the LFW database and the Adience database. In this section, the datasets used in our experiments are introduced first; then the parameter settings of the experiments are described. Finally, the experimental results of gender and age estimation are given.

4.1. Face Image Datasets
4.1.1. Labeled Faces in the Wild (LFW)

The labeled faces in the wild (LFW) database contains 13,233 face photographs, each labeled with the name and gender of the person pictured. The face images were collected from the web, with the only constraint being that they were detectable by the Viola-Jones face detector [40]. Sample images from the LFW database are shown in Figure 7.

There are four versions of LFW: the original version, the funneled version, the deep funneled version, and the frontalized version (3D version). LFW is an imbalanced database, including 10,256 images of men and 2977 images of women from 5749 subjects, 1680 of whom have two or more images [40]. The 3D version is used in this work since its images are already cropped, aligned, and frontalized properly.

4.1.2. Adience Dataset

The Adience dataset [41] contains 26,580 face images of 2284 subjects. The images carry age and gender labels; they were collected from Flickr albums and released by their authors under the Creative Commons (CC) license. The images are completely in the wild, as the photos were taken under wide variations in appearance, noise, pose, lighting, and so on.

There are three versions of the Adience database: the original version, the aligned version, and the frontalized version (3D version), with 26,580, 19,487, and 13,044 images, respectively. The 3D version is used in this work since most of its images are already frontalized and aligned to the centre of the image. However, images in the 3D version may be extremely blurry or frontalized incorrectly, as shown in Figure 8. Additionally, people in the images may show emotions. It has therefore been reported that patches extracted from these images may not always contain the same face region, which can result in lower classification rates [21].

There are three subsets of the Adience dataset 3D version, because it is not necessary to label gender together with age groups or vice versa. The first subset contains 12,194 images labeled with gender; the second comprises 12,991 images labeled with age; and the third includes 12,141 images labeled with both gender and age. Our experiments are run on the third subset. The label information can be seen in Table 1.

4.2. Experimental Settings

In order to find appropriate parameters for the proposed method, a series of experiments has been conducted. The parameters listed in Table 2 produced good outcomes.

The experiments were run on a PC with an Intel i7 4-core CPU, 16 GB memory, and an NVIDIA GeForce GTX 1080 GPU (8 GB memory); the time cost for training the proposed model is around 10 hours.

4.3. Experimental Results on LFW

For comparison, we follow the routines in [20, 21] to carry out our experiments. Five-fold cross-validation with the same five folds as in [20, 21] is used. Around 67% of the patches of men are randomly discarded in each fold to balance the data. We first set the image patch size to the same value as described in [20]. Table 3 lists the classification results of our method and the compared methods, where different numbers of hidden layers are also tested.
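The class-balancing step can be sketched as follows; the discard ratio follows the ~67% figure above, and the array names are illustrative.

```python
import numpy as np

def balance_by_discarding(patches, labels, majority=0, ratio=0.67, seed=0):
    """Randomly discard ~67% of majority-class (men) patches per fold."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(labels))
    maj = idx[labels == majority]
    drop = rng.choice(maj, size=int(ratio * len(maj)), replace=False)
    keep = np.setdiff1d(idx, drop)
    return patches[keep], labels[keep]
```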

In Table 3, one can see that for the same model, if the coordinates of the patch centers are appended to the patches to indicate where each patch was extracted (the “LDNN + location” and “proposed + location” columns), the classification performance is improved. The method in [21] uses 9 fixed image patches, so the patch locations are also fixed; this method is denoted LDNN-F in Table 3.

In Table 3, the best gender classification rate, 96.25%, is achieved by [20] with patch location. Due to the huge number of patches and the limited amount of memory, it is not feasible to train a neural network using all four training folds. As in [21], only one fold was used for network training in this work; the smaller training set is one factor leading to a lower performance.

The effect of using different sizes of image patches is also evaluated, with three hidden layers used in the network. The compared results can be seen in Table 4. The best performance among the tested patch sizes was obtained with the same size as in [20], where the authors explain that this size was determined from previous research. The best performance of [21] was obtained with a larger patch size: since only 9 location-fixed patches are used for training there, a larger patch is able to contain more useful information.

In our proposed model, besides the landmark-based image patches, the entire image and the holistic feature map extracted from the entire image are also used to further improve model performance. Table 5 lists the improvement brought by the holistic feature.

Some state-of-the-art methods for gender classification on the LFW dataset are also compared in our experiments. The results are listed in Table 6. Among the compared methods, the best performance, 97.03%, was obtained by the “Compact CNN” method; however, this method needs to construct an ensemble of learning models, which makes model construction considerably more complicated than in our method.

4.4. Experimental Results on the Adience Dataset

Age and gender classification are run on the 3D version of the Adience dataset. We use the same routine as in [21]: the networks are first trained separately for age and gender, and then the gender classification results are used to assist age estimation.

The same parameters listed in Table 2 are also used here for model training. The performance of our model is shown in Table 7. Our proposed model achieves an 80.64% correct classification rate on the dataset, whereas the result in [21] is 78.63%. The main reason is that the Adience images are not frontalized well, so the location-fixed patches used in [21] may not always contain the same region of the face. In our method, by detecting facial landmarks in advance, the obtained landmark-based patches alleviate this problem.

The Adience dataset contains 8 age groups plus another 20 different age labels. Some folds even lack images for certain age groups; therefore, the age labels must be merged. We use the same merging scheme as in [21], merging all the labels into the 8 age groups; please see their paper for details.
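As a purely illustrative sketch (the actual merging scheme is the one described in [21]), a single-age label outside the 8 standard Adience groups can be assigned to the group with the nearest midpoint:

```python
# The 8 standard Adience age groups; the nearest-midpoint assignment
# below is our illustration, not necessarily the merging rule of [21].
GROUPS = [(0, 2), (4, 6), (8, 13), (15, 20), (25, 32),
          (38, 43), (48, 53), (60, 100)]

def merge_age_label(age):
    """Map a single-age label (e.g., 35) to the closest Adience group index."""
    mids = [(lo + hi) / 2.0 for lo, hi in GROUPS]
    return min(range(len(GROUPS)), key=lambda g: abs(age - mids[g]))
```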

Following [18, 21], the one-off classification rate is used for age estimation: due to the apparent similarity of persons in adjacent age groups, images categorized into an adjacent age group are also counted as correctly classified.
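Concretely, the one-off rate counts a prediction as correct if it falls in the true group or an adjacent one; a minimal sketch:

```python
import numpy as np

def one_off_rate(y_true, y_pred):
    """Fraction of predictions within one age group of the truth."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= 1))

# Example with group indices 0..7: predicting 4 for a true 5 counts
# as correct, so this evaluates to 2/3.
print(one_off_rate([5, 2, 0], [4, 2, 7]))
```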

For age estimation, three sets of neural networks are constructed, each containing the model shown in Figure 3. Neural network 1 is used for gender classification, while neural networks 2 and 3 are used for male and female age estimation, respectively. If an input image is recognized as male, network 2 is used for its age estimation; otherwise, network 3 is activated. The whole process is shown in Figure 9.

It should be noted that neural network 2 in Figure 9 is trained using 5740 face images of men, and neural network 3 is trained using 6410 images of women. The individual performances of neural networks 2 and 3 on age estimation are shown in Table 8, and the results of the full model in Figure 9 are given in Table 9.
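The gender-conditioned cascade can be sketched as follows; `predict` is the fusion routine sketched in Section 3.2, and the network names mirror Figure 9.

```python
def estimate_age(inputs, gender_net, male_age_net, female_age_net):
    """Route age estimation through the gender branch (Figure 9).

    `inputs` bundles (patches, image, edge_map) for one face; each
    *_net is a (patch_net, image_net, hed_net) triple as sketched above.
    The class index for "male" is an illustrative assumption.
    """
    MALE = 0
    gender = predict(*inputs, *gender_net)
    age_net = male_age_net if gender == MALE else female_age_net
    return gender, predict(*inputs, *age_net)
```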

5. Conclusion and Future Work

A modified version of local deep neural networks is proposed in this paper. Instead of using location-fixed patches, facial landmarks are detected in advance, and the image patches around the landmarks are then selected for network training, which greatly reduces the training cost. Moreover, the experimental results show that the proposed method achieves competitive performance on the two tested datasets. The performance of the proposed model can still be improved by incorporating other schemes into the current architecture, for example, using a more efficient facial landmark detection method or further optimizing the network structure; these will be investigated in our future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Additional Points

The ownership of Figures 2, 4, and 9 in this paper belongs to the original author, Yungang Zhang. Please do not reprint, duplicate, or use these figures in any form without permission from the author; otherwise, the author will have the right to pursue legal liability.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The project is funded by the Natural Science Foundation of China under Grant no. 61462097 and by the Yunnan Provincial Education Department under research Grant no. 2018JS143.