Multifeature Extreme Ordinal Ranking Machine for Facial Age Estimation
Most recent state-of-the-art facial age estimation methods rely on solving complicated mathematical optimization problems and therefore consume a large amount of time in the training process. To avoid this algorithmic complexity while maintaining high estimation accuracy, we propose a multifeature extreme ordinal ranking machine (MFEORM) for facial age estimation. Experimental results demonstrate that the proposed approach can sharply reduce the runtime (up to nearly one hundred times faster) while achieving comparable or better estimation performance than state-of-the-art approaches. The inner properties of MFEORM are further explored, revealing additional advantages.
1. Introduction
With the rapid development of computer vision, pattern recognition, and biometrics, increasing attention has been paid to computer-based human facial age estimation. It is useful in scenarios where an individual's age needs to be obtained without identifying other, irrelevant personal information, such as electronic customer information management [1, 2], human-computer interaction (HCI), security surveillance monitoring [4, 5], age-based visual advertisement, and even entertainment.
Unlike other face-oriented problems, the difficulties of computer-based facial age estimation are reflected in the following aspects:
(1) Difference of aging process: people differ in living environment, ethnic group, gender, lifestyle, social contact, health condition, and even genes, which together determine the speed of aging.
(2) Shape or texture: different forms of aging emerge at different age levels. For example, from infancy to adolescence, craniofacial growth (shape change) is the main change. From adulthood to old age, however, craniofacial change decreases remarkably and skin transformation (texture change) becomes the most prominent change.
(3) Data insufficiency: only a very limited number of aging datasets are available, and even fewer cover the whole age range.
(4) Disturbance: some people use cosmetics and accessories to look younger, which can substantially interfere with the final estimation results.
Many facial age estimation approaches have been put forward, some of which obtain rather satisfying performance. Most of the traditional approaches formulate facial age estimation as classification [6–9], regression [4, 10–12], or a combination of the two. Suppose we have a dataset of $N$ training samples, $\{(x_i, y_i)\}_{i=1}^{N}$, in which $x_i$ represents the $i$th face image and $y_i$ represents the corresponding age label. In multiclass classification, every age label is regarded as a single independent class for training; as a result, we get a multiclass classifier to estimate a person's age. The trouble is that the age labels are then treated as unrelated to each other; that is, each age label is only a separate entity in the training process while, in essence, human age labels are sequential. So this kind of multiclass classification may omit the implicit information carried by the correlation among different age labels, which together compose the ordered age set. For instance, two images of the same person with adjacent age labels will be more similar than those with far-apart labels. In short, multiclass classification cannot take full advantage of the correlation among ordinal age labels.
In contrast, the regression method aims to find the best mapping from raw images to the corresponding ages and thus obtains a function for age estimation. However, craniofacial and skin changes at different age levels result in an unstable random process in feature space, so the kernels used to assess the similarities among different ages may drift. As for the estimation performance, it has been shown in the literature [4, 11, 13] that, depending on the datasets used for training and testing, the regression method may perform either better or worse than the classification-based method. In addition, Guo et al. [4, 14] proposed a hybrid method that combines classification and regression to make use of the advantages of both. As a result, the actual performance is further improved to some degree.
In order to overcome the aforementioned defects of classification and regression approaches, age estimation based on the ordinal hyperplanes ranker (OHRank) has been proposed. The aging process differs across age levels. For example, the aging process from 22 to 25 has a different tendency from that from 62 to 65. So it is more reliable to compare the relative order of two age labels (smaller or larger) than to compare the differences among labels. In spite of these merits, OHRank only utilizes a single feature set as the feature representation model, so it fails to include the discriminative information from all available feature sets. More specifically, each feature set has its own advantages over others. For example, anthropometry models mathematically model the growth of the human head from babyhood to adulthood, so they reveal information about the face's size and proportion; Active Appearance Models (AAM) can represent both shape and texture information instead of only facial geometry; age manifold learns the age tendency from different face images at every age label. OHRank cannot integrate these pieces of discriminative information from different feature sets. Later, Weng et al. presented a multifeature ordinal ranking (MFOR) method that utilizes multiple feature sets simultaneously, so the discriminative power of the feature information is further reinforced. Experimental results demonstrate that MFOR outperforms other age estimation approaches. However, almost all these approaches, including MFOR, rely heavily on solving complicated mathematical optimization problems. Take OHRank and MFOR as examples: these two ordinal ranking-oriented methods have the top performance so far, but both are constructed under an SVM-based formulation, so the SVM parameters must be computed iteratively by working out complicated and time-consuming optimization problems. As a result, computational complexity is a heavy burden on efficiency.
Recognizing this point, we propose a multifeature extreme ordinal ranking machine (MFEORM) for facial age estimation. Basically, we divide our approach into three stages: (1) representing features using certain feature extraction models, (2) processing the obtained feature sets, and (3) applying certain algorithms to estimate age. In the first stage, we use multiple feature models in parallel to represent our facial image database. In the second stage, since it is more reasonable to decide which of two facial images is older or younger than to directly predict the age from images, the more reliable "larger or smaller than" information is used for one binary classification at each age. In this way, the abstract age estimation problem can be reduced to $K-1$ binary classification subproblems, where $K$ represents the number of total age labels. In the third stage, an ultrafast extreme learning machine (ELM) with kernel function is applied to obtain a series of classifiers. These classifiers are then integrated according to a rule that is described in Section 3. Experimental results demonstrate that our MFEORM is able to notably reduce the runtime (up to nearly one hundred times faster) while achieving similar or even better estimation results than state-of-the-art methods.
To sum up, the following contributions are made in this paper:
(1) A multifeature extreme ordinal ranking machine (MFEORM) for facial age estimation is proposed. It combines the advantages of multifeature space, the natural ordinal characteristics of age, and the rapid learning rate of the extreme learning machine, achieving similar or even better performance in much less time than state-of-the-art methods. Our approach avoids the tedious iterative computation of a mathematical optimization problem and improves efficiency.
(2) Experiments are conducted comprehensively from both internal and external aspects to reveal more of MFEORM's particular characteristics and advantages.
(3) Further properties are explored: (a) the influence of the number of feature models on the final results and (b) the influence of the number of dimensions (after PCA dimension reduction) on the final results.
The rest of this paper is organized as follows. Previous ordinal ranking-oriented age estimation solutions and the extreme learning machine are briefly reviewed in Section 2. Our proposed method is detailed in Section 3. Experimental results and remarks are reported in Section 4. Finally, Section 5 concludes the paper.
2. Extreme Learning Machine (ELM)
Over the years, conventional learning techniques such as support vector machines (SVMs) and neural networks have suffered from the following: (1) slow training and learning speed, (2) the need for human intervention, and (3) unsatisfactory generalization performance. The extreme learning machine (ELM) [20–22], which has recently drawn more and more attention, overcomes these drawbacks to a certain extent. ELM is based on generalized single-hidden layer feedforward neural networks (SLFNs) and has the following advantages:
(1) In ELM, the hidden layer parameters of SLFNs do not need to be tuned and do not rely on training samples. They only need to be randomly generated, which reduces human intervention.
(2) ELM has much faster learning speed and better generalization performance.
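Let us start from the structure of SLFNs. Figure 1 shows the construction of a typical SLFN. Generally speaking, an SLFN with $L$ hidden neurons can be described as

$$f_L(x) = \sum_{i=1}^{L} \beta_i g_i(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x), \tag{1}$$

where $\beta_i$ represents the output weight between the $i$th hidden neuron and the output node, $g_i(x) = G(a_i, b_i, x)$ is the output function of the $i$th hidden neuron, $a_i$ is the weight vector linking the input layer to the $i$th hidden node, and $b_i$ is the bias of the $i$th hidden node. Particularly, we have

$$G(a_i, b_i, x) = g(a_i \cdot x + b_i) \ \text{(additive nodes)}, \qquad G(a_i, b_i, x) = g(b_i \|x - a_i\|) \ \text{(RBF nodes)}. \tag{2}$$

Many researchers have built the theoretical foundation [23–25] that an SLFN is able to learn $N$ arbitrary distinct samples with zero error, provided that it has a bounded nonlinear activation function and at most $N$ hidden neurons. More precisely, suppose we have $N$ training samples $(x_j, t_j)$, $j = 1, \dots, N$, where $x_j \in \mathbb{R}^{n}$ and $t_j \in \mathbb{R}^{m}$. Also, we let

$$o_j = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j), \quad j = 1, \dots, N. \tag{3}$$

Then the abovementioned "zero error" means

$$\sum_{j=1}^{N} \|o_j - t_j\| = 0. \tag{4}$$

In other words, we can find a combination of $\beta_i$ and $(a_i, b_i)$ such that

$$\sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = t_j, \quad j = 1, \dots, N. \tag{5}$$

To make it concise, we formulate (5) as

$$H\beta = T, \tag{6}$$

where

$$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}_{L \times m}, \quad T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}_{N \times m}. \tag{7}$$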
In accordance with the above traditional theories of neural networks, SLFNs can work as universal approximators only when all the hidden layer parameters (e.g., $a_i$ and $b_i$) are adjustable. In order to reduce the workload of tuning these parameters, researchers proposed incremental approaches in which the parameters of existing hidden neurons remain unchanged while those of newly added neurons are tuned and then fixed. In all the above methods, the hidden layer parameters need to be adjusted at least once. This is not the case for the extreme learning machine. For ELM, the hidden layer parameters do not need to be iteratively tuned; instead, they are randomly selected and independent of the training samples. In theory, Huang et al. have proved that, for SLFNs to serve as universal approximators, we can randomly choose the hidden layer parameters and analytically determine the output weight vectors connecting the hidden layer and the output layer. In this case, for additive nodes, the activation function can be any bounded nonconstant piecewise continuous function $g: \mathbb{R} \to \mathbb{R}$; for RBF nodes, the activation function can be any integrable piecewise continuous function $g: \mathbb{R} \to \mathbb{R}$ with $\int_{\mathbb{R}} g(x)\,dx \neq 0$. Furthermore, Huang et al. demonstrate that, as long as the activation function is infinitely differentiable, the hidden layer parameters can be randomly selected. After selection, the hidden neuron parameters are kept unchanged. The only remaining unknown parameters are the output weights connecting the hidden layer and the output layer, so the least-squares method can be used to calculate the output weight vector $\beta$.
In accordance with the theory of feedforward neural networks, ELM can be mathematically formulated by

$$H\beta = T. \tag{8}$$

Therefore, training such an SLFN is the same as seeking a least-squares solution of this linear system. Suppose that this optimal solution is $\hat{\beta}$, so we have

$$\|H\hat{\beta} - T\| = \min_{\beta} \|H\beta - T\|. \tag{9}$$

According to the mathematical theory of matrices, the smallest norm least-squares solution for the above system is $\hat{\beta} = H^{\dagger}T$, where $H^{\dagger}$ represents the Moore-Penrose generalized inverse of $H$. In short, ELM can be summarized by the following steps:
(1) Randomly select the hidden layer parameters, that is, the input weight vectors $a_i$ and the hidden layer biases $b_i$, where $i = 1, \dots, L$ ($L$ is the number of hidden neurons).
(2) Calculate the hidden layer output matrix $H$.
(3) Calculate the output weight vector $\beta$: $\beta = H^{\dagger}T$.
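The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the sigmoid activation, the uniform sampling range, the toy data, and the function names `train_elm` and `predict_elm` are choices made here for illustration and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    """Basic ELM with additive sigmoid nodes (illustrative choices, see lead-in)."""
    n_features = X.shape[1]
    # Step 1: randomly draw input weights and biases; they are never tuned afterwards.
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden layer output matrix H, one row per sample, one column per neuron.
    H = sigmoid(X @ A + b)
    # Step 3: minimum-norm least-squares output weights via the Moore-Penrose pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return A, b, beta

def predict_elm(X, A, b, beta):
    return sigmoid(X @ A + b) @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # toy inputs, not face features
T = np.sin(X.sum(axis=1, keepdims=True))     # toy regression targets
A, b, beta = train_elm(X, T, n_hidden=10, rng=rng)
pred = predict_elm(X, A, b, beta)
```

Note that no iterative optimization appears anywhere: the only fitted quantity, `beta`, comes from a single linear-algebra call.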
When calculating the Moore-Penrose pseudoinverse of the matrix $H$, the orthogonal projection approach can be used efficiently, as follows:
(1) When $H^{T}H$ is nonsingular, $H^{\dagger} = (H^{T}H)^{-1}H^{T}$.
(2) When $HH^{T}$ is nonsingular, $H^{\dagger} = H^{T}(HH^{T})^{-1}$.
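Furthermore, in order to obtain a more stable solution and better generalization performance, Huang et al. added a positive value $1/C$ to the diagonal of $H^{T}H$ or $HH^{T}$ during the calculation of $\beta$. In this way, $\beta = H^{T}\left(\frac{I}{C} + HH^{T}\right)^{-1}T$ (when $HH^{T}$ is nonsingular) or $\beta = \left(\frac{I}{C} + H^{T}H\right)^{-1}H^{T}T$ (when $H^{T}H$ is nonsingular). Consequently, according to $\beta$ and the output function $f(x) = h(x)\beta$, we have

$$f(x) = h(x)H^{T}\left(\frac{I}{C} + HH^{T}\right)^{-1}T \tag{10}$$

with the corresponding $\beta = H^{T}\left(\frac{I}{C} + HH^{T}\right)^{-1}T$ or

$$f(x) = h(x)\left(\frac{I}{C} + H^{T}H\right)^{-1}H^{T}T \tag{11}$$

with the corresponding $\beta = \left(\frac{I}{C} + H^{T}H\right)^{-1}H^{T}T$.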
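As a small numerical illustration of the regularized solution, the sketch below computes the output weights in both forms on random stand-in matrices. The matrix sizes and the value of the regularization coefficient are arbitrary assumptions. For a positive regularization coefficient the two forms give the same weights, so in practice one would choose whichever requires inverting the smaller matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, C = 30, 8, 100.0            # samples, hidden neurons, regularization coefficient (arbitrary)
H = rng.normal(size=(N, L))       # stand-in for a hidden layer output matrix
T = rng.normal(size=(N, 1))       # stand-in targets

# Form with an N x N system: beta = H^T (I/C + H H^T)^{-1} T
beta_n = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

# Form with an L x L system: beta = (I/C + H^T H)^{-1} H^T T
beta_l = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

same = bool(np.allclose(beta_n, beta_l))
```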
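On the basis of the above, a kernel-based ELM is also available, in which the kernel matrix can be formulated as

$$\Omega_{\text{ELM}} = HH^{T}, \qquad \Omega_{\text{ELM}\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j). \tag{12}$$

So, using the kernel-based method, output function (10) becomes

$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^{T} \left(\frac{I}{C} + \Omega_{\text{ELM}}\right)^{-1} T. \tag{13}$$

Formula (13) indicates that the kernel function $K(u, v)$ (e.g., for the RBF kernel, $K(u, v) = \exp(-\gamma \|u - v\|^{2})$) can be directly calculated without knowing the specific hidden layer feature mapping $h(x)$ and the number of hidden nodes $L$.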
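A kernel-based ELM of this kind can be sketched as follows. The function names (`rbf_kernel`, `kelm_fit`, `kelm_predict`), the toy data, and the parametrization of the RBF kernel as exp(-gamma * squared distance) are assumptions made for illustration; the kernel parameter values reported in the experiments may follow a different parametrization.

```python
import numpy as np

def rbf_kernel(U, V, gamma):
    """RBF kernel between all rows of U and V (assumed form: exp(-gamma * ||u - v||^2))."""
    sq = (U ** 2).sum(axis=1)[:, None] + (V ** 2).sum(axis=1)[None, :] - 2.0 * U @ V.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kelm_fit(X, T, C, gamma):
    """Solve (I/C + Omega) alpha = T, where Omega is the training kernel matrix."""
    omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, gamma):
    """Decision values: kernel row vector of each new sample times alpha."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy binary problem with two well-separated clusters (not face data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
T = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = kelm_fit(X, T, C=500.0, gamma=0.5)
scores = kelm_predict(np.array([[0.1, 0.5], [3.1, 3.5]]), X, alpha, gamma=0.5)
```

The hidden layer never appears explicitly: training reduces to one linear solve with the kernel matrix, and the sign of each score gives the binary decision.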
3. Proposed Method: Multifeature Extreme Ordinal Ranking Machine (MFEORM)
Basically, the proposed MFEORM regards facial age estimation as an ordinal ranking problem. More specifically, the database of the $j$th feature model can first be written as

$$D^{j} = \left\{ (x_i^{j}, y_i) \mid i = 1, \dots, N;\ y_i \in \{1, \dots, K\} \right\}, \tag{14}$$

where $K$ is the number of total age labels of the database, $N$ is the number of instances, and $y_i$ are the age labels. Then the problem can be separated into $K-1$ classification subproblems. For each subproblem $k$, the purpose is to find whether the face image is older or younger than $k$. So each subproblem is equivalent to a basic binary classification problem. In each binary classification problem, it is assigned that $x_i \in X_k^{+}$ if $y_i > k$ and $x_i \in X_k^{-}$ if $y_i \le k$. Next, the same rule is applied to all feature models, which are combined as our total training set. Then the RBF-kernel-based ELM is used to train each of the $K-1$ binary classifiers. After traversing $k$ from 1 to $K-1$, $K-1$ binary classifiers are obtained, which capture the ordinal relationships among all age labels. We call these classifiers the "weak classifiers." In the end, these "weak classifiers" are integrated to form the final result according to our preference-collection rule.
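The decomposition into binary subproblems can be illustrated with a short sketch. It assumes the convention used in ordinal hyperplane ranking, where a sample is positive for threshold k when its age is greater than k and negative otherwise, and where thresholds run from 1 up to one less than the largest label. The helper name `binary_labels` and the toy ages are hypothetical.

```python
def binary_labels(ages, k):
    """Labels for subproblem k: +1 if the age is greater than k, otherwise -1."""
    return [1 if y > k else -1 for y in ages]

ages = [1, 2, 2, 4]      # toy age labels; the largest label is 4
K = max(ages)

# One binary labelling per threshold k = 1, ..., K - 1.
subproblems = {k: binary_labels(ages, k) for k in range(1, K)}
```

Every subproblem reuses all training samples and only the labels change, which is what lets each weak classifier encode one "older than k" decision.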
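In age estimation, the most popular performance measurement is the mean absolute error (MAE), which can be described by

$$\text{MAE} = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| \hat{y}_i - y_i \right|, \tag{15}$$

where $\hat{y}_i$ is the estimated age, $y_i$ is the authentic age, and $N_t$ is the number of test images.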
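As a quick worked example of the MAE, the sketch below averages the absolute differences between estimated and authentic ages. The function name and the three toy age pairs are made up for illustration.

```python
def mean_absolute_error(estimated, actual):
    """Average absolute difference between estimated and authentic ages."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)

# Errors are 2, 0, and 3 years, so the MAE is 5/3 of a year.
mae = mean_absolute_error([27, 30, 5], [25, 30, 8])
```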
The steps of the proposed MFEORM algorithm are as follows:
(1) For each target age $k$, apply the following steps:
(a) For each feature model, separate the raw training data into $X_k^{+}$ and $X_k^{-}$.
(b) Integrate all feature models' information together as the total training set.
(c) Use the kernel-based extreme learning algorithm to train each binary classifier (the so-called "weak classifier") and get a decision function $f_k(x)$ according to (13).
(2) Construct a preference-collection rule from all subproblems:

$$r(x) = 1 + \sum_{k=1}^{K-1} [\![ f_k(x) > 0 ]\!], \tag{16}$$

where $[\![\cdot]\!]$ is equal to 1 if the inner part holds and otherwise it is equal to 0. For example, suppose we have trained all the binary classifiers based on our algorithm and now test with one face image $x$ whose authentic age is 25. We substitute $x$ into $f_k(x)$ for all $k$ and get all the $f_k(x)$. Finally we find that $f_k(x) > 0$ holds for $k = 1, \dots, 26$, so the final estimated age is 27.
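The final aggregation step can be sketched as below. The sketch assumes that the preference-collection rule takes the form used in ordinal hyperplane ranking, namely one plus the number of weak classifiers whose decision value is positive. The function name `estimate_age` and the decision values are hypothetical and only mirror the worked example of an image estimated at age 27.

```python
def estimate_age(decision_values):
    """Assumed rule: 1 + number of subproblems whose decision value is positive."""
    return 1 + sum(1 for f in decision_values if f > 0)

# Hypothetical outputs of the weak classifiers for one test image:
# "older than k" holds for the first 26 thresholds and fails for the rest.
decision_values = [0.8] * 26 + [-0.5] * 42
estimated = estimate_age(decision_values)
```

Because the rule only counts positive votes, a single misfiring weak classifier shifts the estimate by at most one year.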
4. Experiments
4.1. Data Sets
A series of experiments were conducted on the popular benchmark aging database FG-NET. FG-NET contains 1,002 grayscale or color facial images of 82 people, covering a wide range of poses, expressions, and lighting conditions, as Figure 2 shows. Ages in FG-NET range from 0 to 69, and the age distribution is displayed in Table 1. For uniform processing, all facial images in FG-NET were converted to grayscale, aligned, and normalized. Finally, histogram equalization was applied to reduce the influence of illumination.
4.2. Experiment Settings
Three feature models were used to extract information from the FG-NET raw images: Active Appearance Model (AAM), local binary patterns (LBP), and Bioinspired feature (BIF), which were combined into a total dataset. AAM can represent both shape and texture information instead of only facial geometry and is widely used by other age estimation methods. LBP is a widely used feature for texture classification in computer vision. BIF was selected because of its high age estimation accuracy. The information extracted by these three feature models is complementary. For AAM features, the feature dimension was set to retain 95% of variability. For BIF features, the number of bands was set at 8 (16 scales in total) with 4 orientations each. To reduce the entire feature space, principal component analysis (PCA) was used for dimension reduction. More specifically, each of the three feature models was reduced to 100 dimensions. Since the AAM model includes both shape and texture information, these two parts were reduced to 50 dimensions each (100 dimensions in total). For the extreme learning machine, we used the RBF kernel-based ELM. Leave-one-person-out (LOPO), a popular test procedure suggested in [4, 7, 11, 31–33], was used as the test strategy. In terms of accuracy, MFEORM's experimental results were compared with MFOR, OHRank, WAS, AGES, RUN1, RUN2, Multitask Warped Gaussian Process (MTWGP), and SVM. For runtime, because the other algorithms are less accurate than MFOR, we only compare the consumed time with that of MFOR. In addition, different combinations of MFEORM's inner parameters were tried in order to analyze its inner characteristics. Furthermore, Section 4.3 shows the impact of the number of feature dimensions on the final estimation result and verifies that the multifeature dataset (three-feature dataset) outperforms the two-feature and single-feature datasets under the proposed MFEORM.
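The leave-one-person-out protocol mentioned above can be sketched as follows: in each fold, all images of one person form the test set and the images of everyone else form the training set. The function name and the toy person identifiers are hypothetical.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out_person, train_indices, test_indices) for each distinct person."""
    for person in sorted(set(person_ids)):
        test = [i for i, p in enumerate(person_ids) if p == person]
        train = [i for i, p in enumerate(person_ids) if p != person]
        yield person, train, test

# Toy example: six images belonging to three people.
person_ids = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_person_out(person_ids))
```

Splitting by person rather than by image keeps the same face from appearing in both the training and test sets of a fold.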
4.3. Experiment Results and Analysis
The biggest advantage of our algorithm is its speed, so the runtime of MFEORM was tested first and compared with that of MFOR.
Table 2 compares MFEORM and MFOR with respect to runtime and accuracy (MAE). As can be seen, MFEORM shortens the runtime by a significant amount while maintaining high accuracy. Previous facial age estimation solutions based on multifeature ordinal ranking usually formulate their models as complex constrained optimization problems with inequality constraints, and solving these convex optimization problems takes a large amount of time. In our algorithm, by contrast, the hidden layer parameters are randomly generated and the output weights are analytically determined, so no iterative tuning is needed as in SVM. As a result, the proposed method can be much faster than the previous multifeature ordinal ranking methods.
Table 3 displays the MAE results of different algorithms derived from the FG-NET database, which demonstrates that our algorithm achieves a similar or even better accuracy compared to other popular facial age estimation algorithms.
Next, the inner properties of MFEORM are explored. Based on the description above, there are two free parameters: the regularization coefficient $C$ and the kernel parameter $\gamma$. Experiments were conducted to explore their influence on the results.
First, the regularization coefficient $C$ was varied while the kernel parameter $\gamma$ was fixed at 50. Table 4 shows the tuning results. The regularization coefficient $C$ was finally set at 500.
Next, the kernel parameter $\gamma$ was varied while the regularization coefficient $C$ was fixed at 500. Table 5 shows the coarse tuning results. Then $\gamma$ was fine-tuned and set at 31, which yielded the optimal MAE and a runtime of 320.6 seconds.
Remark 1. The estimation accuracy and runtime of our MFEORM are insensitive to its two inner parameters, namely, the kernel parameter $\gamma$ and the regularization coefficient $C$.
The influence of the number of feature models on the final results was further explored (with the regularization coefficient set at 500 and the kernel parameter at 31 throughout). As Table 6 indicates, the more feature models are used, the better the results. In the single-feature test, AAM shows better estimation performance in MFEORM than the LBP and BIF features; we conjecture that this is because AAM draws its features from both face shape and texture information separately. The AAM feature is therefore used next to analyze the influence of the number of dimensions on the final results.
Table 7 displays the tests made to explore the influence of the number of dimensions (after PCA dimension reduction) on the final results (with the regularization coefficient set at 500 and the kernel parameter at 31 throughout). Generally speaking, the estimation accuracy and runtime of our MFEORM are insensitive to the number of dimensions, which means that there is no need to spend a large amount of time on tuning the number of dimensions. From a more nuanced perspective, however, when the number of dimensions is too small, the dataset carries insufficient information from the feature models, so the deviation becomes larger. In contrast, when the number of dimensions is too large, the dataset includes not only the main components but also some interfering information, which can also lead to a relatively large deviation.
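The PCA dimension reduction discussed here can be sketched as follows. Two points are assumptions made purely for illustration: PCA is computed through an SVD of the centered data, and the reduced feature sets are combined by column-wise concatenation, which the text does not state explicitly. The random matrices stand in for real feature sets, and 10 components are kept instead of the 100 used in the experiments to keep the example small.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (SVD of centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Random stand-ins for three feature models describing the same 40 images.
feature_sets = [rng.normal(size=(40, d)) for d in (60, 80, 120)]

reduced = [pca_reduce(F, 10) for F in feature_sets]
combined = np.hstack(reduced)     # assumed way of merging the feature models
```

Changing the `n_components` argument is the knob whose effect Table 7 studies: too few components discard useful information, while too many retain interfering components.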
Note that the emphasis and innovation points of this paper and  are different. The target of this paper is to achieve both rapid learning speed and similar or better age estimation performance compared to state-of-the-art methods. Accordingly, for every comparison in the experiment section, both the MAE (accuracy) and the runtime (speed or efficiency) are listed. However,  mainly focuses on improving the ranking rule (preference-collection rule) in the ordinal ranking process so that the estimation performance can be increased. In other words,  only pays attention to further improving the estimation performance (accuracy) and does not consider time or efficiency. Although ELM is also used in , it is used there only to save time; if saving time were not a concern, other classifiers, such as SVM, could replace ELM. Consequently, the method proposed in  is suitable for different classifiers.
5. Conclusion
MFEORM combines multifeature space, the natural ordinal characteristics of age, and the rapid learning rate of the extreme learning machine, achieving similar or even better performance in much less time than state-of-the-art methods. Experimental results demonstrate that the more feature models are used, the better the performance. Further experiments from internal and external aspects show that the performance of MFEORM, including estimation accuracy and runtime, is insensitive to its two inner parameters and to the number of dimensions (after dimension reduction), so there is no need to spend much time on choosing the inner parameters of MFEORM or the number of dataset dimensions. Note that the proposed MFEORM has little to do with the pure ELM ranking problem [35, 36]. Although its name may suggest a pure ranking problem and cause misunderstanding, it is essentially a combination of many classification subproblems.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
Electronic Customer Relationship Management (ECRM), https://en.wikipedia.org/wiki/ECRM.
A. Dix, J. Finlay, G. D. Abowd, and R. Beale, “Human-Computer Interaction,” http://svcognac.nl/wp-content/uploads/2012/04/1_MMI_summary.pdf.
Z. Yang and H. Ai, “Demographic classification with local binary patterns,” in Proceedings of the International Conference on Advances in Biometrics, pp. 464–473, 2007.
B. Xiao, X. Yang, H. Zha, Y. Xu, and T. Huang, “Metric learning for regression problems and human age estimation,” in Advances in Multimedia Information Processing—PCM 2009, vol. 5879 of Lecture Notes in Computer Science, pp. 88–99, Springer, Berlin, Germany, 2009.
G. Huang and C. Siew, “Extreme learning machine with randomly assigned RBF kernels,” International Journal of Information Technology, vol. 11, no. 1, pp. 16–24, 2005.
D. Serre, Matrices: Theory and Application, Springer, New York, NY, USA, 2002.
The FG-NET aging Database, http://www.fgnet.rsunit.com/.