In recent years, biometric recognition has attracted the attention of many researchers. The human ear, as a unique and stable biometric feature, has significant advantages for verifying personal identity. In the Internet era, systems with low computing cost and good real-time performance are preferred. Most existing ear recognition methods are based on network models with a large number of parameters, which causes a large memory footprint and computational overhead. This paper proposes an efficient and lightweight human ear recognition method (ELERNet) based on MobileNet V2. Starting from the MobileNet V2 model, dynamic convolution decomposition is introduced to enhance the representation ability of human ear features. Then, combined with the coordinate attention mechanism, the spatial features of human ear images are aggregated to locate the human ear features more accurately. We conducted experiments on the AWE and EarVN1.0 human ear datasets. Compared with the MobileNet V2 model, the recognition accuracy of our method is significantly improved. Using fewer computing hardware resources, the ELERNet model achieves 83.52% and 96.10% Rank-1 (R1) recognition accuracy on the two datasets, respectively, which is better than other models. Finally, we provide a visual interpretation using GradCAM technology, and the results show that our method can learn specific and discriminative features in the ear images.

1. Introduction

Biometric recognition, which identifies individuals by their physiological or behavioral characteristics, has developed rapidly in the last decade. For example, iris [1–3], face [4, 5], fingerprint [6–8], gait [9, 10], electrocardiogram (ECG) and electroencephalogram (EEG) [11–13], and voice [14, 15] are commonly used biometric traits. Human ear recognition has unique advantages over other biometric recognition techniques and has recently received much attention. Unlike face recognition, it is not affected by changes in facial expression, and it can be done without contact. The ear structure of each individual is unique [16, 17]. Ear characteristics change little over time, so the human ear pattern can be used by police as evidence for identification [18]. In modern forensic identification and criminal investigations, multimodal recognition that combines traits such as ear, face, and palm print makes identification more reliable [19].

Many experiments have been conducted on constrained and unconstrained human ear datasets in recent years. Constrained human ear datasets have a single shooting angle, illumination, background, and resolution and are less difficult to recognize. In contrast, unconstrained human ear datasets are more difficult to recognize because of their large inter- and intraclass variation. In the early days of human ear recognition, researchers proposed several methods based on handcrafted features. Most of these methods were not assessed on a benchmark ear database with standard evaluation metrics, and the ear images in the databases they used vary only slightly. When these methods are applied to unconstrained ear databases, the recognition performance degrades significantly and is much lower than that of deep learning-based methods. Currently, the use of deep learning [20, 21] is becoming more and more common in fields such as human ear recognition and human activity recognition (HAR) [22–24]. Researchers have therefore proposed many deep feature learning-based methods for human ear recognition and achieved good recognition performance on unconstrained ear databases. However, as deep neural networks continue to evolve, their drawbacks are increasingly exposed. Most networks have a large number of parameters and high model complexity. Because of their extremely high hardware requirements, they are difficult to apply to embedded devices and mobile terminals and can only be used in individual scenarios. The rapid development of the mobile Internet has led to a growing demand for lightweight networks and real-time performance. In response, Google successively proposed MobileNet V1 [25], MobileNet V2 [26], and MobileNet V3 [27]. They are easy to deploy on embedded devices and mobile terminals, with much-reduced computation and parameters while maintaining recognition performance. Therefore, we use MobileNet V2 as a baseline network for ear recognition and propose a highly efficient and lightweight human ear recognition method based on MobileNet. In addition, our proposed method could be used for medical image analysis. For example, patients with suspected Hepatitis C Virus (HCV) [28] infection could be classified into two categories, healthy and unhealthy, to help clinicians diagnose and treat HCV.

Our contributions can be summarized as follows: (1) We propose an efficient and lightweight human ear recognition method (ELERNet) based on MobileNet. The model consumes fewer computational hardware resources and is easy to apply to mobile and embedded devices. (2) To enhance the ear feature representation capability of the model, dynamic convolution decomposition [29] is introduced to reduce the difficulty of ear feature extraction. (3) To enhance the feature robustness of the model, a coordinate attention mechanism [30] is introduced. The spatial features of the ear image are aggregated to precisely locate the ear features, which improves the recognition performance of the model. (4) We conducted extensive experiments on two representative unconstrained human ear datasets, AWE [31–33] and EARVN1.0 [34], on which ELERNet showed excellent recognition performance. Compared with existing human ear recognition models, ELERNet has significantly higher recognition accuracy with a small memory footprint and computational overhead. (5) We used the Gradient-Weighted Class Activation Mapping (GradCAM) [35] technique to explain the predictions made by MobileNet V2 and by the proposed ELERNet model. The visualization highlights that our model can learn specific and discriminative features in the ear image.

2. Related Work

In the early days of human ear recognition, researchers based their recognition methods primarily on handcrafted features. In [36], the authors proposed ear recognition based on Scale-Invariant Feature Transform (SIFT) features and homography distance. The recognition performance of this method is better than that of Principal Component Analysis (PCA), and it is robust to slight angle changes, background interference, and occlusion. The disadvantage of the method is that it was not evaluated on a standard benchmark database with standard evaluation criteria. In another study, the authors extracted ear boundary features using a wavelet approach [37] and saved the ear features to a database for matching. The disadvantages of this method are that no precise performance evaluation metrics were used and that the experiments were conducted on a small dataset. A 2D orthogonal filter-based human ear recognition method was proposed in [38]. The method first segments the ear and then extracts ear features. The experimental results show that the 2D orthogonal filter has excellent recognition performance. The drawback of the method is that the ear images in the database it uses vary very little. In [39], the authors designed a method for ear feature extraction using local binary patterns (LBP). The results show that LBP outperforms PCA. The drawback is that the experiments were evaluated on a database of images captured indoors. In [40], the authors performed a comparative analysis of human ear recognition based on the average and uniform variants of LBP. The method achieved a desirable recognition performance on constrained databases. However, when the experiments were performed on an unconstrained ear database, the recognition performance decreased significantly. In [41], the authors proposed a pattern recognition method that uses edge ear features to learn local ear features. The method is robust to small changes in illumination and rotation, and its recognition performance is significantly better than that of other descriptor-based methods. The disadvantage is that the recognition performance on unconstrained databases needs to be improved. In [42], the authors first extracted global features using the Gabor-Zernike operator and then local features using the local phase quantization operator. The method was evaluated on three constrained datasets and achieved perfect recognition results. However, its recognition results on unconstrained datasets still fall short of those of deep learning-based methods.

With the emergence of deep learning in recent years, especially the development of deep convolutional neural networks (CNNs), most computer vision problems can now be addressed effectively. Researchers have proposed many methods for human ear recognition based on deep feature learning and achieved good recognition performance. In [43], the authors modified common CNN architectures such as ResNet, VGG-Face, and GoogLeNet and validated them on unconstrained datasets. To enable the network to learn multiscale features, the authors replaced the last pooling layer of the CNN model with a spatial pyramid pooling layer and added center loss during training. In addition, the authors provide a new database of images captured under challenging outdoor conditions, USTB-HelloEar. Experimental results show that the VGG-Face model has the best recognition performance. The disadvantages of this approach are that no performance evaluation metrics are used to evaluate the model and that the model has a large memory footprint and high computational cost. In [44], the authors first used RefineNet for ear detection and then ResNet for ear recognition. The method achieved good recognition performance on an unconstrained database, showing the advantages of deep learning-based methods. The disadvantage of the method is that ear detection is based on existing methods and the ear recognition part offers limited innovation. Moreover, the system consumes more computational hardware resources. In [45], the authors used ensemble learning, feature extraction, and fine-tuning strategies based on models such as Inception, ResNeXt, and VGG. Good recognition results were achieved on publicly available unconstrained databases. The drawback of the method is that the performance is evaluated on only one database, which does not demonstrate the model's generalization ability. Moreover, the large number of model parameters makes it difficult to embed the model into mobile applications for specific ear recognition scenarios.

This paper proposes an efficient and lightweight human ear recognition method (ELERNet) based on MobileNet V2. The model is evaluated on two publicly available unconstrained datasets. The large intra- and interclass variation of unconstrained human ear datasets makes ear feature extraction difficult. We implement dynamic convolution decomposition [29], which introduces a dynamic channel fusion mechanism to reduce the dimensionality of the latent space, and thereby enhance the ear feature representation. Unconstrained human ear datasets also vary significantly in shooting angle, illumination, background, and resolution, which increases recognition difficulty. Therefore, we introduce the coordinate attention mechanism [30]. It aggregates the spatial features of unconstrained human ear images to obtain a coordinate-aware ear feature map. The ear features are then located precisely, which greatly enhances the feature robustness of the model.

3. Method

3.1. MobileNet V2

Since AlexNet [46] won the ImageNet challenge, the deep convolutional neural network craze has been rekindled, and convolutional neural networks are now found everywhere in computer vision tasks. To achieve higher accuracy, researchers have designed increasingly complex convolutional neural network models with more and more parameters, leading to a significant decrease in operational efficiency. In some real-world scenarios, recognition tasks need to be performed promptly on computationally constrained platforms; the human ear recognition task in this paper is one example. To solve this problem, Google proposed MobileNet V1 [25], a model with a small number of parameters and low latency. Its main idea is to replace the standard convolution with Depthwise Separable Convolution (DSC), which dramatically reduces the number of model parameters. The 3 × 3 Depthwise Conv in DSC applies a single filter of depth one to each input channel, sliding over the input tensor channel by channel, and the 1 × 1 Pointwise Conv then adjusts the number of output channels. To reduce the loss of feature information in DSC, MobileNet V2 [26] was proposed; it improves the original DSC, and we call the result Improved DSC (IDSC). Figure 1 compares the ordinary convolution, the depthwise separable convolution, and the improved depthwise separable convolution.
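To make the parameter saving concrete, the following minimal sketch counts the weights of a standard convolution and of a depthwise separable convolution. The channel sizes (32 in, 64 out) are hypothetical values chosen only for illustration, biases are ignored, and the function names are not taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out = 32, 64
std = standard_conv_params(c_in, c_out)
dsc = depthwise_separable_params(c_in, c_out)
print(std, dsc, round(std / dsc, 2))  # prints: 18432 2336 7.89
```

For this example layer the separable form needs roughly one eighth of the weights, which is the kind of reduction that makes the MobileNet family attractive for embedded deployment.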

3.2. Attention Module

Both datasets used in this paper are unconstrained human ear datasets with significant intraclass variation. Interference features such as background and ear ornaments in the ear images can negatively affect the recognition performance. To reduce this negative impact and make the model focus mainly on the ear contour during feature extraction, we insert the coordinate attention module [30] behind the 3 × 3 Depthwise Conv layer in the IDSC module. Its structure is shown in Figure 2. Unlike other attention mechanisms, it embeds location information into channel attention with almost no computational overhead. Coordinate attention decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions. Thus, it enhances the extraction of features of interest in ear images.

The coordinate attention mechanism can be divided into coordinate information embedding and coordinate attention generation. The first part retains the location information that is critical to recognition performance by decomposing global pooling into two 1D feature encodings. In the notation of [30], given an input $X$, pooling kernels of spatial extent $(H, 1)$ and $(1, W)$ are used to encode each channel along the horizontal and vertical coordinates, respectively. The output of the $c$-th channel at height $h$ is

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i). \tag{1}$$

The output of the $c$-th channel at width $w$ is

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w). \tag{2}$$

The second part is the generation of coordinate attention: we concatenate the aggregated feature maps generated by equations (1) and (2) and then obtain equation (3) through $F_1$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right), \tag{3}$$

where $f$ is the intermediate feature map, $\delta$ is the nonlinear activation function, $F_1$ is the 1 × 1 convolution transformation function, and $[\cdot, \cdot]$ denotes concatenation along the spatial dimension. We then split $f$ into two independent tensors, $f^h$ and $f^w$. The 1 × 1 convolutional transformation functions $F_h$ and $F_w$ make the number of channels of the two tensors equal to that of the input $X$. The specific process is

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \tag{4}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right), \tag{5}$$

where $g^h$ and $g^w$ are the attention weights and $\sigma$ is the sigmoid function. The output of the coordinate attention block is

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j). \tag{6}$$
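The two directional pooling steps and the final reweighting can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: feature maps are nested lists indexed as `x[channel][row][column]`, the function names are hypothetical, and the attention gates `g_h` and `g_w` are supplied by hand, whereas in the real module they are produced by the 1 × 1 convolutions and the sigmoid.

```python
def pool_along_width(x):
    """Average each row: one value per (channel, height index)."""
    return [[sum(row) / len(row) for row in channel] for channel in x]


def pool_along_height(x):
    """Average each column: one value per (channel, width index)."""
    return [[sum(col) / len(col) for col in zip(*channel)] for channel in x]


def reweight(x, g_h, g_w):
    """Scale every element by its row gate and its column gate."""
    return [
        [[v * g_h[c][i] * g_w[c][j] for j, v in enumerate(row)]
         for i, row in enumerate(channel)]
        for c, channel in enumerate(x)
    ]


# One channel with a 2 x 3 feature map (toy values).
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]

z_h = pool_along_width(x)    # [[2.0, 5.0]]
z_w = pool_along_height(x)   # [[2.5, 3.5, 4.5]]

# Hand-picked gates standing in for the learned sigmoid outputs.
g_h = [[1.0, 0.5]]
g_w = [[1.0, 0.0, 1.0]]
y = reweight(x, g_h, g_w)    # [[[1.0, 0.0, 3.0], [2.0, 0.0, 3.0]]]
print(z_h, z_w, y)
```

Because each gate depends on only one spatial coordinate, a column gate of zero suppresses an entire column, which is how the block can attenuate a distracting region while keeping the channel active elsewhere.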

3.3. Dynamic Convolution Decomposition

Since the two human ear datasets used in this paper were both collected in the wild, samples of the same subject differ considerably. Most ear images differ significantly in angle, resolution, etc., so it is not easy to extract their features with ordinary convolution. To adaptively extract the ear features of interest, we replace the 1 × 1 Pointwise Conv layer in the IDSC module with a dynamic convolution decomposition [29] module. It applies dynamic channel fusion in a low-dimensional latent space, as shown in Figure 3(a). This strengthens the learning of the corresponding channels of the high-dimensional space, while the reduced dimensionality of the latent space keeps the number of model parameters small and the complexity low, which improves the feature expression of the model. Dynamic channel fusion is mainly achieved with an L × L matrix $\Phi(x)$, which is a function of the input $x$. In the notation of [29], $Q^{T}$ compresses the input into the low-dimensional space, $\Phi(x)$ fuses the channels dynamically, and $P$ is then used to increase the number of output channels:

$$W(x) = \Lambda(x)\, W_0 + P\, \Phi(x)\, Q^{T}, \tag{7}$$

where $W_0$ is the static convolution kernel and $\Lambda(x)$ is the diagonal dynamic channel attention matrix.

The dynamic convolution decomposition layer is shown in Figure 3(b). It uses a dynamic branch to generate the coefficients of the dynamic channel attention $\Lambda(x)$ and the dynamic channel fusion $\Phi(x)$. The input $x$ first passes through average pooling, then through the first fully connected (FC) layer with ReLU6 as the activation, and finally through the second FC layer.
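As a rough illustration of how the dynamic kernel is assembled, the sketch below assumes the decomposition given in the dynamic convolution decomposition paper [29], W(x) = Λ(x) W0 + P Φ(x) Qᵀ, for a 1 × 1 convolution. The helper names and all numeric values are hypothetical. In the real layer, the attention vector and the fusion matrix are produced by the dynamic branch for every input rather than fixed by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def dcd_weight(w0, lam, p, phi, q):
    """Assemble W(x) = diag(lam) @ W0 + P @ Phi @ Q^T.

    w0  : C x C static kernel
    lam : length-C dynamic channel attention (diagonal of Lambda(x))
    p, q: C x L static projection matrices
    phi : L x L dynamic channel fusion matrix Phi(x)
    """
    scaled = [[lam[i] * v for v in row] for i, row in enumerate(w0)]
    residual = matmul(matmul(p, phi), transpose(q))
    return [[s + r for s, r in zip(s_row, r_row)]
            for s_row, r_row in zip(scaled, residual)]


# Toy values with C = 2 channels and latent size L = 1 (illustrative only).
w0 = [[1.0, 0.0], [0.0, 1.0]]
lam = [2.0, 3.0]
p = [[1.0], [0.0]]
q = [[0.0], [1.0]]
phi = [[5.0]]

w = dcd_weight(w0, lam, p, phi, q)
print(w)  # prints: [[2.0, 5.0], [0.0, 3.0]]
```

Only `lam` (C values) and `phi` (L × L values) depend on the input, so a small latent size L keeps the dynamic part cheap compared with generating a full C × C kernel per input.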

3.4. ELERNet Introduction

To improve the feature representation capability of the model and better cope with the considerable intraclass variation of the unconstrained human ear datasets, while adaptively extracting the ear features of interest and filtering out interference features to enhance the model's robustness, we improve the IDSC module and call it the IDSCPlus block, as shown in Figure 4. Specifically, we replace the 1 × 1 Pointwise Conv layer with the DCD module and insert the CA attention module behind the 3 × 3 Depthwise Conv layer.

The structure of ELERNet is shown in Table 1. In this model, the input human ear image is first preliminarily extracted through a 3 × 3 standard convolution layer, a 3 × 3 Depthwise Conv layer, and a 1 × 1 standard convolution layer. Then, 16 IDSCPlus modules and a 1 × 1 standard convolution layer are successively used to extract depth features from ear images. Finally, the distinguishing features are obtained and classified through the AvgPool and DCD-CLS layers. The architecture of ELERNet is shown in Figure 5.

4. Experimental Results and Discussion

4.1. Dataset Introduction

Annotated Web Ears (AWE) [31–33] is an unconstrained human ear dataset produced by the University of Ljubljana. It contains 1000 images of 100 subjects, with 10 images per subject. Some ear images are made challenging by decoration and hair occlusion. EARVN1.0 [34] is a new unconstrained human ear dataset that contains 164 subjects with a total of 28,412 images. These images vary significantly in lighting, scale, resolution, pose, etc. Most of the images also pose challenges such as decoration and background occlusion. Figure 6 shows the ear images of three of the subjects. Since there are many images in the EARVN1.0 dataset, we randomly select 10 images of each subject for display.

4.2. Data Augmentation

During training, too small a sample size can lead to overfitting. To avoid this, we adopt aggressive data augmentation. This way, the model sees different images during training, which can significantly improve its generalization ability. Figure 7 shows the augmented images.

4.3. Parameter Settings

The proposed human ear recognition method is implemented in the PyTorch open-source framework. The experiments were run on a server with an NVIDIA Tesla V100 SXM2 16 GB GPU. We use a cosine scheduler for learning rate decay; the learning rate curve is shown in Figure 8. The model is trained for 300 epochs on the AWE dataset and for 200 epochs on the EARVN1.0 dataset. We choose stochastic gradient descent (SGD) as the optimizer, configured with learning rate decay, Nesterov momentum, and a momentum parameter, and the batch size is set to 16 for all experiments.
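The shape of a cosine learning rate schedule can be sketched with the standard cosine annealing formula. The base learning rate of 0.1 and the minimum of 0 are assumptions made only for this illustration (the text does not state them), no warm-up phase is modelled, and the 300 training rounds reported for AWE are treated here as epochs.

```python
import math


def cosine_lr(epoch: int, total_epochs: int, base_lr: float,
              min_lr: float = 0.0) -> float:
    """Cosine annealing: decay from base_lr to min_lr over total_epochs."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term


BASE_LR = 0.1        # assumed value, not reported in the text
TOTAL_EPOCHS = 300   # training length stated for the AWE dataset

schedule = [cosine_lr(e, TOTAL_EPOCHS, BASE_LR)
            for e in range(TOTAL_EPOCHS + 1)]

# The rate starts at BASE_LR, passes through half of it at the midpoint,
# and reaches min_lr at the final epoch.
print(schedule[0], schedule[150], schedule[300])
```

In PyTorch the same curve is typically obtained by pairing `torch.optim.SGD` (with `momentum` and `nesterov=True`) with `torch.optim.lr_scheduler.CosineAnnealingLR`; whether the authors used exactly this pairing is not stated.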

4.4. Evaluation Metrics

The cumulative matching characteristics (CMC) curve is the most widely used performance evaluation metric in biometric recognition. We plot CMC curves for the recognition experiments and evaluate the recognition models using three quantitative metrics. We briefly describe each below.

Cumulative Matching Characteristics (CMC) Curve: the probability that a recognition model returns the correct identity within the top $k$ ($k \le N$) ranks, $N$ being the number of individuals in the entire gallery.

Rank-1 (R1) Recognition Rate: the percentage of probe images whose best match in the gallery has the correct identity.

Rank-5 (R5) Recognition Rate: the percentage of probe images whose correct identity is among the top five gallery matches.

Area Under the CMC Curve (AUC): the area under the CMC curve. A high AUC indicates strong recognition performance, which makes it a key index for evaluating the model.
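The metrics above can be computed from the rank at which each probe's true identity is retrieved. The sketch below is a minimal illustration with made-up ranks for five probes and a gallery of five identities. The AUC is normalised with the trapezoidal rule so that a perfect model scores 1.0, which is one common convention and an assumption here, since the text does not give the exact formula.

```python
def cmc_curve(ranks, gallery_size):
    """CMC value at rank k = fraction of probes whose true identity
    appears within the top k matches (ranks are 1-based)."""
    n = len(ranks)
    return [sum(1 for r in ranks if r <= k) / n
            for k in range(1, gallery_size + 1)]


def normalised_auc(cmc):
    """Trapezoidal area under the CMC curve, scaled to the range [0, 1]."""
    area = sum((a + b) / 2.0 for a, b in zip(cmc, cmc[1:]))
    return area / (len(cmc) - 1)


# Hypothetical retrieval ranks of the correct identity for five probes.
ranks = [1, 1, 2, 5, 3]
cmc = cmc_curve(ranks, gallery_size=5)   # [0.4, 0.6, 0.8, 0.8, 1.0]

r1 = cmc[0]   # Rank-1 recognition rate
r5 = cmc[4]   # Rank-5 recognition rate
auc = normalised_auc(cmc)
print(cmc, r1, r5, auc)
```

The CMC curve is non-decreasing by construction, so R5 can never be lower than R1.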

4.5. Model Exploration

We selected five advanced lightweight network models, MobileNet V2, MobileNet V3-Large, MobileNet V3-Small, ShuffleNet V1 [47], and ShuffleNet V2 [48], and evaluated them on the AWE and EARVN1.0 ear datasets. Table 2 shows their number of parameters, computational complexity, and R1 recognition rate. The experimental results show that MobileNet V3-Small has a small number of parameters and low computational complexity. However, it performs the worst on the AWE and EARVN1.0 ear datasets, with R1 of 72.81% and 80.62%, respectively, which is 7.70 and 10.47 percentage points lower than MobileNet V2. The R1 of MobileNet V2 is only 0.31 and 0.39 percentage points lower than that of MobileNet V3-Large on the AWE and EARVN1.0 datasets, yet MobileNet V2 has 1.9 M fewer parameters. Models with many parameters are not convenient to deploy on mobile terminals or embedded devices and thus cannot be adapted to specific ear recognition scenarios. ShuffleNet V1 has the smallest number of parameters and moderate computational complexity, with R1 of 77.49% and 88.17%, which is 3.02 and 2.92 percentage points lower than MobileNet V2, respectively. ShuffleNet V2 has 0.6 M more parameters than MobileNet V2 and moderate computational complexity, with R1 of 78.00% and 88.75%, which is 2.51 and 2.34 percentage points lower than MobileNet V2, respectively.

4.6. The Impact of DCD at Different Layers

Table 3 shows the results of inserting DCD into three different layers: (1) Depthwise conv (DW), (2) Pointwise conv (PW), and (3) the fully connected classifier (CLS). According to the experimental results, the model recognition performance can be improved by using DCD in the DW, PW, and CLS layers. On the AWE dataset the gains are DW +1.5%, PW +2.2%, and CLS +0.9%; on the EARVN1.0 dataset they are DW +2.69%, PW +3.95%, and CLS +0.89%. The results show that the best recognition performance is obtained by applying DCD to PW and CLS simultaneously. To show the differences in recognition performance, Figure 9 plots the CMC curves for DCD at different layers.

4.7. The Impact of Reduction Ratio ϒ

We investigate the effect of the reduction ratio on model performance by decreasing it and comparing model performance before and after the change. As shown in Table 4, when we halve the reduction ratio from its original value of 32 to 16, the number of model parameters increases, but the model performance improves. This indicates that the model is robust to changes in the reduction ratio and that the extra parameters introduced by a smaller reduction ratio benefit model performance. When the coordinate attention mechanism was inserted behind the Depthwise Conv layer of the baseline network with the reduction ratio set to its default of 32, the recognition performance improved by 1.01% over the baseline on the AWE dataset and by 1.99% on the EARVN1.0 dataset.

Figure 10 plots the CMC curves for different reduction ratios.

4.8. Ablation Experiment

In this part, we conducted ablation experiments to demonstrate the influence of dynamic convolution decomposition (DCD) and coordinate attention (CA) on model recognition performance. The experimental results are presented in Table 5. According to the results, the recognition performance improves when either the DCD module or the CA module is inserted on its own. Moreover, when both modules are added to the baseline network simultaneously, the model achieves its best performance; the best results are highlighted in the table. Figure 11 plots the CMC curves for the ablation experiments.

4.9. The Impact of Different Training Ratios

In this section, we discuss the robustness of the model to the partitioning of the training and test sets. We split the images into training and test sets at different ratios and then evaluate the recognition performance and partitioning robustness of the baseline network and ELERNet. Table 6 shows the recognition performance on the two human ear databases (AWE and EARVN1.0) under different proportions of training images. For both datasets, the images were randomly divided so that the training set contained 50%, 60%, 70%, or 80% of the ear images. The experimental results show that, as the proportion of training ear images increases, the recognition performance of both the baseline network and ELERNet on the two ear datasets improves significantly. However, ELERNet achieved the better performance at every training ratio. It is worth noting that ELERNet was better at % than baseline network at %. We use a histogram to show the comparison more intuitively, as shown in Figure 12.
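A random split at a given training ratio can be sketched as follows. This is a hypothetical illustration: splitting separately within each subject, the fixed seed, and the file names are all assumptions, since the text only states that the images were divided randomly.

```python
import random


def split_per_subject(images_by_subject, train_ratio, seed=0):
    """Randomly split each subject's images into train and test parts."""
    rng = random.Random(seed)
    train, test = {}, {}
    for subject, images in images_by_subject.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        n_train = int(round(train_ratio * len(shuffled)))
        train[subject] = shuffled[:n_train]
        test[subject] = shuffled[n_train:]
    return train, test


# Toy data shaped like AWE: 10 images per subject (names are made up).
data = {
    "subject_001": [f"s001_{i:02d}.png" for i in range(10)],
    "subject_002": [f"s002_{i:02d}.png" for i in range(10)],
}

for ratio in (0.5, 0.6, 0.7, 0.8):
    train, test = split_per_subject(data, ratio)
    print(ratio, len(train["subject_001"]), len(test["subject_001"]))
```

Splitting within each subject keeps every identity present in both parts, which a closed-set identification protocol needs.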

4.10. Model Parameters and Complexity Comparison

In the introduction, we cited many pieces of literature and discussed many existing ear recognition methods. It is worth noting that most of the methods that emerged in recent years are based on the models listed in Table 7 to build ear recognition models and propose various transfer learning strategies to solve the problem of ear recognition. Table 7 compares the number of parameters and the complexity of different models. It can be seen that the number of model parameters and complexity of the proposed method are the lowest.

4.11. Compared with Other Methods

As shown in Tables 8 and 9, we compared the proposed method with the methods using the AWE and EARVN1.0 human ear data set for human ear recognition in recent years. According to the comparison results, it can be concluded that the proposed method has the best recognition performance.

4.12. Visual Explanations

In this part, we use Gradient-weighted Class Activation Mapping (GradCAM) [35] for visual interpretation. It provides a class-discriminative explanation of the classification by locating the regions of interest in the ear image with class-specific gradient information, and it helps us understand the predictions made by MobileNet V2 and by our method, ELERNet. We provide some cases in which MobileNet V2 predicts the subject incorrectly but ELERNet predicts it correctly. The original image, the MobileNet V2 localization results, and the ELERNet localization results are shown in Figure 13 (AWE) and Figure 14 (EARVN1.0). From the results, we can conclude that an essential prerequisite for a correct prediction is to treat the ear's geometry as the most discriminative region and to ignore distracting factors such as background and hair. First, we analyze the visualization results in Figure 13: (a) MobileNet V2 attends only to a patch of background hair and ignores the ear contour, which leads to a wrong prediction; (b) only the upper half of the ear contour receives attention; (c) the attended region is too large and covers both ear studs and hair; (d) the middle part of the ear contour is ignored; (e) excessive attention is paid to the part occluded by hair; (f) only the earphone pendant receives attention, and the ear contour is ignored; (g) attention falls only on the distracting ear studs; (h) too much attention is paid to background hair features; (i) attention falls only on the earlobe and earplug; (j) too much attention is paid to the earplug, and the ear contour is ignored; (k) attention falls only on hair features; (l) attention falls only on the earlobe. Next, we analyze the visualization results in Figure 14: (a) only the upper half of the ear contour receives attention; (b) the hair-occluded part also receives attention; (c) too much attention is paid to the glasses frame; (d) the ear features in the upper part are ignored; (e) attention is paid to the incomplete ear contour, with the occluder used as an auxiliary cue; (f) too much attention is paid to background features; (g) attention is paid to the facial features in the image; (h) attention falls only on hair features, and the ear contour is ignored; (i) attention falls on background features such as windows; (j) attention falls only on earlobe features; (k) too much attention is paid to background features and the occluding fingers; (l) attention falls on hair features and decorations, and the ear features are ignored.
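The core GradCAM computation can be sketched without any deep learning framework. The example below assumes that the activations of the last convolutional layer and the gradients of the target class score with respect to them have already been extracted (here they are tiny hand-made arrays). Each feature map is weighted by its spatially averaged gradient, the weighted maps are summed, and a ReLU keeps only the regions that support the class. The function name and the values are illustrative.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one class activation map.

    activations[k][i][j]: feature map k at position (i, j)
    gradients[k][i][j]  : d(class score) / d(activations[k][i][j])
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients of each map.
    weights = [sum(sum(row) for row in g) / (height * width)
               for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for a_k, alpha in zip(activations, weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * a_k[i][j]
    # ReLU: keep only positions with a positive influence on the class.
    return [[max(0.0, v) for v in row] for row in cam]


# Two 2 x 2 feature maps with made-up values.
activations = [[[1.0, 2.0], [3.0, 4.0]],
               [[1.0, 0.0], [0.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-4.0, -4.0], [-4.0, -4.0]]]

cam = grad_cam(activations, gradients)
print(cam)  # prints: [[0.0, 2.0], [3.0, 0.0]]
```

In practice the resulting map is upsampled to the input resolution and overlaid on the ear image as a heat map, which is what figures of this kind typically display.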

5. Conclusions

Most existing ear recognition methods are based on network models with a large number of parameters and high model complexity. To address this problem, an efficient and lightweight human ear recognition method (ELERNet) based on MobileNet is proposed in this paper. Unconstrained human ear datasets have substantial intraclass and interclass differences, which makes feature extraction difficult. We therefore introduce dynamic convolution decomposition and the coordinate attention mechanism to enhance the model's feature robustness, learn discriminative ear features, and improve the recognition performance. Our method has been tested on AWE and EARVN1.0, two public unconstrained human ear datasets, and has achieved better recognition performance than existing methods. Finally, we used GradCAM to visualize the image regions that have a decisive impact on the model's predictions. According to the visualization results, the overall ear outline is essential for prediction. At the same time, our model is good at filtering out interference features such as background, earrings, earplugs, and hair. Besides, illumination, angle, contrast, resolution, and similar factors have little influence on model performance, except in extreme cases. We will continue to optimize our approach for subsequent deployment on mobile devices or embedding in small Linux systems. This will aid identity confirmation in financial security, surveillance security, and other fields.

Data Availability

The data are available from the corresponding author upon reasonable request.

Conflicts of Interest

There is no conflict of interest regarding the publication of this paper.

Acknowledgments


This work was supported by the National Natural Science Foundation of China (No. 61673316); by the Scientific Research Project of Education Department of Shaanxi Province (21JK0921); by the Key Research and Development Projects of Shaanxi Province, under Grant No. 2017GY-071; by the Technical Innovation Guidance Special Project of Shaanxi Province, under Grant No. 2017XT-005; and by the Research Program of Xianyang City under Grant No. 2017 K01-25-3. Thanks are due to my teachers and classmates for giving me guidance on my studies.