#### Abstract

Face age estimation has been widely used in video surveillance, human-computer interaction, market analysis, image processing analysis, and many fields. There are several problems that need to be solved in image-based face age estimation: (1) redundant information of age characteristics; (2) limitations of age estimation methods in solving age estimation problems; (3) the performance of age estimation models being also affected by gender factors. This paper proposes CA-XTree network. Firstly, features are extracted through the convolution layer and then combined with the local channel attention module to strengthen the ability of age feature information interaction between different channels. Secondly, extracted features are inputted into the recommendation score function to obtain the recommendation score, by combining the recommendation score with the gradient ascending regression tree. The lifting tree processed loss function is the mean square loss function, and the final age value is obtained by the leaf node. This paper improves state of the art for image classification on MORPH and CACD datasets. The advantage of our model is that it is easy to implement and has no excess memory overhead. In the age dataset CACD, the mean absolute error (MAE) has reached 4.535 and cumulative score (CS) has reached 63.53%, respectively.

#### 1. Introduction

With the arrival of the information age and the wide application of various intelligent devices, computer vision technology has become a research hotspot. Age estimation, it provides a picture that automatically identifies the true age of the person in the image [1]. Using this algorithm to estimate age is a difficult task [2]. In recent years, great progress has been made in face recognition, but there are still many challenges in the cross-age recognition and retrieval. First of all, the face has different manifestations at different ages. Age is related to facial biological characteristics in different periods (for example, facial bone growth in children, facial wrinkles in adults). Secondly, age estimation is also affected by external factors such as illumination, posture, and expression. In addition, for individuals of different races [3], even if one’s age is the same as the other, the skin color will age to varying degrees.

At present, there are many difficulties in the field of age estimation. Firstly, due to the spatial heterogeneity of face features, the prediction of real age estimation has always been a nonlinear mapping function. People of the same age have great differences in appearance, and faces have different manifestations at different ages. Secondly, due to insufficient training data and unclear labels, there is too much background information in the image, the accuracy of age estimation uses limited samples [4], and fuzzy labels are affected. Finally, in shopping markets, advertising campaign show to collect customer information and make relevant evaluations according to customers’ age and gender, so as to realize targeted product recommendation services. From a biological point of view, the changes of facial contour and skin texture are affected by many factors such as living environment, race, and genetic differences, resulting in a nonstationary random process of aging pattern. Regression problem of nonstationary learning [5] is very difficult because it does not fit the training process. In addition, most face databases are under certain circumstances, such as on the Internet. In the real scene, due to the background, facial expression, and illumination [6] and partial shade, these [7] will limit the generalization ability of the model.

In the field of image classification, deep convolution neural network model mainly uses supervised learning to classify images, but it is limited by the quality and scale of dataset. The depth network is integrated in an end-to-end multilayer way. The depth of the network can enrich the “hierarchy” of features. Deep features have large acceptance domain and rich semantic information. Deep features are robust to the changes of object attitude, occlusion, and local deformation, but robust to geometric details [8] which are lost due to the reduction of resolution. The extraction of shallow features contains some of its details, and with the deepening of layers, the geometric details of extracted features may completely disappear [9, 10]. Recent evidence shows that network depth is crucial to improve accuracy and has achieved leading results on challenging ImageNet datasets [11] and recent evidence shows that network depth is crucial to improve accuracy and has achieved leading results on challenging ImageNet datasets.

Deep learning is a new research direction in the field of machine learning, and the representative model is deep convolutional neural network (CNN). With the development of convolutional neural network (CNN), designing residual network [12] is a breakthrough that has an impact on CNN. Based on the “Black box” phenomenon, the depth of the convolution of the neural network has been plagued by researchers for convolution neural network decision-making process and cannot intuitively describe network in what to do, such as what features were extracted from each layer, According to Chen et al. [1] the residual network cannot perform strict mathematical analysis, causing the remaining loss function of network learning process visualization. The visualization proves that the residual learning optimization problem is easier. Decision tree model (DT) [13] has a good model interpreter, which can provide good model interpretation. Deep decision tree combines features extracted from deep convolutional neural network with decision tree to improve model accuracy and increase model interpretability. At the same time, because it is combined with decision tree, it can also provide basis for model judgment clearly and step by step. In recent years, soft splitting functions [14–16] have been used to extend traditional decision trees to deep and deep path decision forests, so that decision trees have deep representation learning ability.

Face age in biology is the most intuitive expression in the human face, and face facial aging is a nonsmooth process, mainly divided into three parts, respectively, the juvenile period, middle age, and old age. At present, how to estimate the age of face image is particularly important in age estimation research. Age estimation method is mainly divided into multiple classification method and regression method, using relative order, between age and age tags in [8, 16]. Through the comparison between different individuals, by comparing the results with the change of the age difference, change the present *s* type curve. The CA-NeXt network is composed of local channel attention modules and grouped volumes.

We propose a simple architecture, as shown in Figure 1. Regression tree age classification network based on local channel attention, by not dimension reduction of global average pooling level channel, implements each age’s characteristics of the channel and the interaction between *K* led 7 characteristics, module is made up of a one-dimensional fast convolution [17], the size of the one-dimensional convolution *K* represents the local channel number and the interactive age characteristics of the book, the last type is output in the whole connection layer, and the output results were input into the Softmax function to calculate the scoring function. The scoring function results were input into the regression tree, and the final age estimation results were obtained through the regression tree model. In general, the main contributions of this paper are as follows:(1)We design the residual network with regression tree (RT) to train the CA-XTree structure through tree supervision loss. High experimental results are obtained.(2)Lightweight channel focus modules are added to the remaining modules to effectively capture information between channels, to realize the information interaction between channels with the characteristics of the times.(3)We split the convolution using a compromise between grouping convolution and choosing ordinary convolution and depth, and the number of channels generated for each branch map is *n* (*n* > 1). The balance between the two strategies is achieved by controlling the number of organizations, with fewer hyperparameters.

#### 2. Related Work

Age estimation is generally considered a regression or classification problem. Earlier studies were based on artificial feature extraction and used various classifiers in machine learning to estimate ages. In classification methods, age estimation is mainly based on these features, and classification [18] adopts support vector machines or random forest and other classifiers. In regression methods, support vector regression (SVR) [19] or partial least squares (PLS) [20] are mainly used to predict age values. Guodong and Guowang [21] adopt biological heuristic model to extract face features. Guo et al. [19] propose to treat age as a pattern subspace and build face images based on age. In Geng’s work, face feature vectors are extracted using principal component analysis (PCA), and the extracted feature vectors are combined to describe face age features. Nowadays, more and more people are using deep learning methods, which have greatly improved image classification [11], target detection [22, 23], semantic segmentation [24, 25], time series tasks, and many other visual tasks. Wang [26] used convolutional neural network (CNN) as a feature extractor and then trained SVR to achieve age estimation. Channel attention is of great help in extracting age-related features, such as SENet [27]. Currently ,the latest channel attention networks ,FCAnet, and ECAnet [28, 29], pay more attention to the information interaction between local channels.

Divide-And-Conquer for Age Estimation: Divide-And-Conquer method is to put the problem into smaller subproblems, the smaller the subproblem then built, will have to solve the subproblems to merge; finally it is concluded that, in “mother” problem solution, age estimation regression model has the problem of low accuracy, the traditional research methods the method of the machine. Specifically, the traditional learning method is to manually extract features, input the model and train the model to predict the age value, such as the traditional random forest [30]. Recent studies combine deep learning with random forest (NDF) [16] to obtain higher accuracy and interpretability, and residual random forest is used in [31]. Divide-And-Conquer is similar to mathematical induction. It finds the solution equation formula to solve the problem and then designs the recursive program according to the equation formula [32].

Residuals: the residual network unit module contains two convolution layers, as shown in Figure 2. *F* (*x*) is obtained by convolution calculation of input value *X*, and the calculation result of the residual module is obtained by adding *F* (*x*) and *X*, and *F* (*x*) is the residual. Error [33] is mathematically defined as the difference between the observed value and the actual value, while residual is the difference between the predicted value and the observed value. In mathematical statistics, residual is the difference between the actual observed value and the estimated value (fitting value). Through residual, important information about the basic assumptions of the model can be obtained, and the correctness of the regression model can be judged by minimizing residual.

As for the reason for naming the residual network, the author explains that a layer of network can usually be regarded as *y* = *H*(*x*), and a residual network can be expressed as *H*(*x*) = *f*(*x*) + *X*, *f*(*x*) = *H*(*x*) + *X*. In the unit mapping, *y* = *x* is the observed value, but the expected value is *H*(*x*), so it corresponds to the residual *f*(*x*), so it is called the residual network. Short connection: network depth has a great impact on the effect of CNN, but simply increasing network depth cannot simply improve the effect of the network and may destroy the effect of the model due to gradient divergence. The introduction of shortcuts is a way to solve this problem. Expressway is one of the early methods to introduce the idea of shortcut into the depth model, which aims to solve the problems of gradient divergence and difficult training in the depth network. In ResNet, set *t* and *C* of the highway network to 1 to reduce the degree of freedom of the model (in the depth model, the greater the degree of freedom is not necessarily the better). The greater the degree of freedom, the more difficult it is to train. Shortcuts are not limited to one layer, but can also span two or three layers.

Deep forest is a comprehensive forest model [34], which is the integration of traditional forest models in breadth and depth. Although the actual operation occupies more memory and the effect is not as good as deep learning, it also provides an integrated idea for traditional machine learning. Deep forest is made up of different kinds of forests stacked in width and depth. The author Hou [35] has always believed that only by fully reflecting the differences of learning samples can we improve the learning effect of comprehensive learners. Therefore, the stack of deep forest has two purposes: one is to reflect the difference of input data, and the other is to improve the classification ability of input data. The former is called multigranularity scanning and the latter is called cascade forest.

Group Convolution: group convolution was first proposed in Alexnet [11]. Since the network has two GPUs in the experiment, the author hopes to propose packet convolution by paralleling the two models. Channel convolution refers to the characteristic channel. The number of group convolutions of channel convolution is determined by the number of channels in each group. Channel convolution [36] is a part of separable convolution. It is an effective method to improve the recognition accuracy [11] by training a group of independently trained networks and training the grouped channels separately by bisecting the channels. In [37], packet convolution is not trained separately, but jointly, but this simple aggregation does not take into account the information interaction between channels in the channel level convolution channel. The information interaction of multichannel convolution [38] is shown in the figure below. Each convolution core is applied to the input channel of the previous layer to generate an output channel, which is a convolution core of the process. We convolute all channels, repeat the process to generate multiple groups of channels, and then add each channel together to form a final single channel.

#### 3. Proposed Method

In this paper, we proposed regression tree age estimation based on local channel attention (CA-XTree), and we will introduce it in detail. ResNet network structure is introduced in feature extraction to help optimize decision function. ResNet improved model (CA-XTree) based on regression tree is composed of two parts: one is ResNet network combined with local channel attention mechanism, and the other is regression tree. The network extracted features through ResNet network added one-dimensional convolution module to extract local channel attention, and input the output results into Softmax function to calculate the final recommendation score *S*.

The block convolution module is shown in Figure 3. With a small number of parameters and computation, block convolution can generate a large number of feature graphs and obtain more coded information.

Figure 4 shows the CA-XTree structure. The network extracts feature through ResNet network, adds one-dimensional convolution module to extract local channel attention, and inputs the output results into Softmax function to calculate the final recommendation score S.

##### 3.1. Group Convolution

Group convolutions are adopted in the network in this paper. Grouping convolution is used to improve CNN architecture, in which high-dimensional (low-dimensional) channels include long (short) convolution with a fixed number of packets. In other words, the channel dimension C is proportional to the convolution kernel size *K*.

Assume that the input , and we present aggregated transformations aswhere is the neuron function, for one neuron, projects into an (optional low dimensional) embed, and then transforms it. In (1), *j* is the size of the set of conversions required for aggregation. In this paper, we design a simple method of transformation function; all have the same topological structure, with (2) in the aggregate transformation into the residual function:

##### 3.2. Local Cross-Channel Interaction

Let the output of a convolution block be where , , and are width, height, and channel dimensions as shown in (4), given the aggregated feature.

As shown in Figure 5, the network diagram consists of four parts. Firstly, the aggregation characteristics obtained by global average pool (GAP) are used. Then, the size *k* = 3 is subjected to fast 1D convolution to generate the channel weight, which is multiplied by the feature map to obtain the final result.

without dimensionality reduction; channel attention can be learned by

*W* is the *C* × *C* parameter matrix, extended into a block diagonal matrix. Divide the channels into *G* groups, each group containing channels. Channel attention in each group is learned independently to capture cross-channel interactions in a local form:

Global average pooling is to output a value by global average of the feature graph, that is, to transform a tensor of *W* × *H*∗ × D into a tensor of 1 × 1 × D. The global average pooling operation is carried out so that it has global receptive field and the global information can be used by the lower layer of the network. Global average pooling (GAP) reduces the number of parameters and can reduce the occurrence of overfitting.

This strategy can be implemented by a one-dimensional fast convolution with a kernel size of *k*:where C1D represents one-dimensional convolution, where the method in (5) is called the CA module, which involves only *k* parameters. Then, given the channel dimension *C*, the kernel size *k* can be adaptively determined:

##### 3.3. Neural Decision Forest

Deep neural decision tree (NDF) is a group of deep neural decision trees. For simplicity, each tree is designated as a complete binary tree. We use the integer *I* as the order node.

As shown in Figure 6, the feature map of each channel goes through the global average pooling operation, and the result of this operation is the summary of the global information of each channel, which is calculated by the sigmoid function. The result calculated by sigmoid function is input into node Ri, and depth features are extracted from the input combined with the associated separation node Si, and a recommendation score (routing probability) is given. According to the score value, if it is equal to zero, it is input into the left subtree; if it is equal to 1 it is input into the right subtree. We calculate a unique path from the root to the leaf. In order to obtain the final prediction result, each leaf node contributes its prediction vector according to the probability weight of its calculated path:

The weight of the leaf node is calculated by iteratively adding the weight score of the previous node multiplied by the recommended score. Assuming that the leaf node is on the left subtree of a path, the weight can be expressed aswhere it is assumed that the input left node on the path ; otherwise . Then the weight can be expressed as follows.

If the current formula is valid and the sum of the weights of all leaves is set as 1, it is possible that the final predicted value is a convex function composed of all leaf nodes whose function value converges to 1. Let the above functions be differentiable. Therefore, the loss function defined on the final prediction is made by gradient descent method, and the network weight is updated and the model is trained by minimizing the loss function. We use ResNet network to extract features from inputs, assign values to channels through local channel attention module, input scoring function, combine deep network and regression tree, assign each split node to a neuron in the final full connection layer, and calculate the final recommendation score using S-shaped function. The details are as follows:where is the kth feature mapping function represented by one or more layers in the deep neural network, and *f*_{i} is a linear mapping function related to the neurons allocated in the last full connection layer, wherein we specify the feature mapping function as

Since this problem is a multitask regression task, it has instances; this paper directly uses the square loss function:

The squared loss function means that the output of the model is a Gaussian distribution with the predicted value as the mean value, the loss function is the likelihood of the real value under this predicted distribution, and the Softmax loss means the likelihood of the real tag.

#### 4. Experiments

The research object of this paper is static face image, and individual aging is a slow and gradual process; because the gender, race, and different people will be different in aging rate, aging characterization is not the same.

##### 4.1. Dataset

For the extraction of age features, due to the size of the finite receptive field and the interaction between cross-channels, the sample feature extraction has a great difference. Therefore, this paper adopts OpenCV and Dlib for face detection and alignment and adopts the method based on channel attention to weaken the cross-channel information interaction in feature extraction. In this paper, face markers are used to locate face regions and eliminate intraface rotation. OpenCV and Dlib are used for face detection and comparison. Finally, resize all images to 256 × 256 pixels, normalize the image according to the calculated mean and standard deviation of the three-color channels, and finally input the model. The training data is expanded by horizontal flipping and random cutting, and the final input space is 224 × 224 pixels.

FGNET: the dataset FGNET [39] is shown in Figure 7. The dataset contains 1002 face images of 82 subjects, and 68 faces in each face image in the dataset have manually labeled key points. FGNET dataset has been used in many studies, such as age estimation, cross-age face recognition, age change inference, and other directions [16, 40].

MORPH: the MORPH of the dataset [41] is shown in Figure 8. The dataset contains 13,000 images of faces of different races, skin colors, and genders, of which 55,134 were manually tagged. On the basis of the experiment [16], we select similar test set and training set selection method, randomly select 80% as training set and 20% as test set, and verify the image.

CACD: the dataset CACD [1] is shown in Figure 9. The dataset is a 16 GB dataset containing 166,417 photos collected from the Internet, mostly celebrity images. In this paper, the dataset is divided into three sets, namely, the training set, the test set, and the verification set: the training set contains 145,275 images of 1800 celebrities, the test set contains 10,517 images of 120 celebrities, and the verification set contains the remaining 80 celebrities. We train our models through training sets and report on their performance through self-testing.

Figure 10 shows the iteration of the training model in the CACD dataset after adding the local channel attention module using the same backbone network. On the premise of introducing block convolution, the experimental results are compared.

**(a)**

**(b)**

The experiment in this paper was conducted on FGNET, MORPH, and CACD datasets. The training set and test set of the dataset are shown in Table 1.

Evaluation criteria: on CACD, FGNET, and MORPH2 datasets, we used the mean absolute error (MAE) and cumulative score (CA) to evaluate the performance. MAE and CS reflect the performance superiority of this model.where *f*(*x*_{i}) is the predicted output of the network model, *y*_{i} is the label corresponding to the input of the face image, and *θ* is the allowable age error corresponding to the cumulative accuracy.

The results show that when the grouping cardinality = 16, the result is the best. This method can effectively improve the age recognition ability of the network model, generate continuous information in the adjacent age range, and improve the generalization ability of the model.

As shown in Table 2, on FGNET and CACD datasets, ResNet-50 was used as the backbone network to test the influence of different grouping numbers on the experimental results.

Gradient regression tree: see Figure 11; the local attention module is added, the cardinality is equal to 16, and MAE value is 4.535 which is obtained by tree classification method.

Combine local concerns and grouping convolution as shown in Figure 12. When group *C* is 8, 16, and 32, MAE is 4.6010, 4.567, and 4.6011, respectively. It can be seen from the figure above that *C* = 8, 16 training converges. The information between age features is simply extracted through local channel attention. Then ResNet is introduced into the regression tree by using a simple sigmoid function. Finally, the final result is trained by the regression tree. MAE reached 4.535, which is worthy of further use in future studies. As shown in Table 3, on FGNET and CACD datasets, ResNet-50 as the backbone network and the addition of local channel attention module significantly improved the CACD dataset, MAE value increased by 0.12, and the CS(3) increased by about 1.01%.

**(a)**

**(b)**

**(c)**

All methods extract depth features from the backbone network and then train the generated network in different modules. In this paper, local attention module and grouping volume are combined, and decision tree is used as classification method for training. The experimental results show that 16 groups of 3-tree depth networks have achieved good training results.

##### 4.2. Implementation Details

Preprocessing: we use OpenCV and Dlib for face detection and alignment. We adjusted the size of all the images to 256 × 256; then, the image is normalized according to the calculated mean and standard deviation of the three-color channel, and the training data is expanded by 0.5 probability level flipping and random clipping. Finally, the input image size is 224 × 224.

Model architecture: we select ResNet-50 as the baseline network, add the aggregation attention module, and then use two full connection layers. The final output is activated by sigmoid function and sent to the decision forest to give the final result. Hyperparameter setting: in order to compare with the previous work, we used a forest of 5 trees, each with a depth of 6. Another forest with a depth of 7 was also used. The batch is 64, which is used for back propagation to train network parameters. We add a new leaf node prediction vector to every 50 batches of network parameters and randomly take 500 samples each time.

Training settings: with the SGD optimizer, the initial learning rate was set to 0.5, and when training with the scheduler model provided by PyTorch stalled, the learning rate was halved.

Table 4 shows the experimental results of the CACD dataset. All methods extract depth features from the backbone network and then train the generated network in different modules. In this paper, the local attention module and grouping convolution are combined, and the depth of the tree is tested. The experimental results show that, in ResNet-50 network as the backbone network, the grouping number is *C* = 16 and MAE value reaches 4.535, which is about 0.135 higher than that of the latest method. CS(3) cumulative score reaches 63.53%, which is about 2.13% higher than that of the latest method.

In the experiment, the tree depth was studied. When the depth was 8, the loss of computing resources was too large; when the depth was 5, the tree leaf node was less than 32, affecting the accuracy of age estimation. Table 5 shows the experimental results on dataset CACD when the tree depth is 7.

Figure 13 shows the variation curve of the error over time. As shown in Figure 13, the final experimental performance reaches 4,553.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

**(g)**

**(h)**

In this paper, the size *k* of convolution kernel is adaptively determined by using the global average pooling aggregation of convolution features without dimensionality reduction, and then convolution operation is carried out and sigmoid function is executed to learn channel attention. This paper applies the attention module to the deep convolution neural network based on ResNet-50. This paper tests FGNET dataset and CACD dataset. The mean absolute error (MAE) of CACD was 4.567 and the cumulative score was 61.11%, and the mean absolute error (MAE) of dataset FGNET is 2.953 and the cumulative score is 83.8%.

Table 6 shows the comparison results of CA-XTree with the current advanced methods, with different results on different datasets. In the FGNET dataset, the dataset is 2004, the data image clarity is insufficient, and the accuracy is slightly reduced; CACD dataset was proposed in 2016, and the size was about 14G. Experimental results were obtained on CACD, which was relatively ideal. Compared with the 2021 CORF method, it achieved an improvement of 0.135 in large dataset CACD.

#### 5. Conclusion

In this paper, we use the residual network as the backbone network to repeatedly establish a building block, the aggregation of building blocks, a set of the same topological transformation, and the attention of a channel, do not reduce the dimension module, and simply set some parameters, and the number and memory loss do not increase too much. Finally, we deeply study the soft decision function of neural decision forest. See Table 6 for model accuracy and compare it with previous work. Our model achieves the highest accuracy in all benchmarks and achieves significantly better performance on the largest dataset CACD. The age estimation dataset of CA-XTree network is mainly composed of two-dimensional face images, while outdoor video surveillance is a video of face. The next work uses a multiview spatial attention mechanism to extract and integrate multiview features, so as to better extract local information of age features. Then two fully connected layers were used for age estimation.

#### Data Availability

Contact the authors for the data used in this study if necessary.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The paper was partially supported by the Social Science Foundation of Hebei Province (No. HB17JY069).