Abstract

Biometrics is the recognition of a human being from their biometric characteristics, which may be physiological or behavioral. Physiological biometric features include the face, ear, iris, fingerprint, and handprint; behavioral biometrics include signatures, voice, gait patterns, and keystrokes. Numerous systems have been developed to distinguish biometric traits, and these are used in multiple applications, such as forensic investigations and security systems. During the current worldwide pandemic, facial identification has often failed because users wear masks; the human ear, however, has proven more suitable as it remains visible. The main contribution of this work is therefore to present the results of a CNN developed using EfficientNet. This paper presents the performance achieved in this research and shows the efficiency of EfficientNet on ear recognition. The nine EfficientNet variants were fine-tuned and evaluated on multiple publicly available ear datasets. The experiments showed that EfficientNet variant B8 achieved the best accuracy of 98.45%.

1. Introduction

The ear begins to develop in a fetus between the fifth and seventh weeks of pregnancy [1]. At this stage, the face acquires a more distinguishable shape as the mouth, nostrils, and ears begin to form. There is still no exact timeline for when the outer ear is formed, but it is accepted that a cluster of embryonic cells, called the auricular hillocks, connects to establish the ear; these begin growing in the lower portion of the neck. The auricular hillocks broaden and intertwine during the seventh week to produce the ear's shape. By the ninth week, the hillocks have moved toward the ear canal and are more recognizable as the ear [1]. The external anatomy of the ear can be seen in Figure 1. The growth of the ear in the first four months after birth is linear, and the ear then stretches during development between the ages of four months and eight years. After this, the ear's size and shape remain constant until about age seventy, when the ear begins to increase in size again.

Biometrics is the recognition of a human being from their biometric characteristics, which may be physiological or behavioral. Physiological biometric features include DNA, face, ear, iris, fingerprint, hand geometry, hand vein, and palm print, while behavioral biometrics include signatures, gait patterns, and keystrokes. Voice is considered a combination of physiological and behavioral characteristics. Numerous systems have been developed to distinguish biometric traits, and these have been used in multiple applications, such as forensic investigations and security systems. During the current worldwide pandemic, facial identification has often failed because users wear masks. However, the human ear has proven more suitable as it remains visible. Table 1 summarizes an investigation of the performance, distinctiveness, permanence, collectability, and acceptability of each biometric.

Among the physiological biometric traits, the ear has received much attention of late, as it is considered a reliable biometric for human recognition [2]. An ear biometric system is dependable because the ear does not change over time, is of uniform color, and occupies a fixed position at the middle of the side of the face. An individual's ear is also larger than a fingerprint, which makes it simpler to capture an image of the subject without necessarily requiring the subject's cooperation [2]. There are, however, numerous difficulties in correctly gauging the details of the ear, including concealment by clothing, hair, ear ornaments, and jewelry. Another interference is the angle at which the image was taken, which can conceal essential characteristics of the ear's anatomy. These difficulties have relegated ear recognition to a secondary role among the techniques commonly used for identification and verification.

Although several computer-aided detection models have been developed to identify ears, low accuracy and sensitivity are still significant concerns that lead to misidentified ears. Existing models are also computationally complex and expensive. The contributions of this work are summarized as follows:

(1) Implementation of state-of-the-art EfficientNets to develop an effective and inexpensive ear detection system; this is the first time the EfficientNet model has been applied to classify ears.

(2) Improvement of the proposed model's accuracy through EfficientNet.

(3) Evaluation of the model's performance on benchmark datasets.

The remainder of the work is structured as follows: Section 2 presents related works, and Section 3 presents detailed data and methodology explored in this study. The experimental results and discussion are provided in Section 4, and Section 5 concludes the paper.

2. Related Works

This section presents different algorithms that use convolutional neural networks (CNNs) for ear identification; a summary of the related works is shown in Table 2.

Emeršič et al. [3] organized the UERC dataset, which was divided into benchmark, training, and testing sets. In the competition, both handcrafted feature extraction methods, such as LBP [13] and patterns of oriented edge magnitudes (POEM) [14], and CNN-based feature extraction methods were used for ear identification. The challenge was to find methods to remove occlusions such as earrings, hair, other obstacles, and background from the ear image. Occlusion handling was done by creating a binary ear mask, after which recognition was performed using the handcrafted features. Another proposed approach was to compute score matrices from the fused CNN-based and handcrafted features, and a 30% detection rate was achieved.

Tian and Mu [4] applied a CNN to ear recognition; their design consisted of three convolutional layers, a fully connected layer, and a softmax classifier. The database used was the USTB ear database, which comprised 79 subjects with various pose angles; the images used excluded earrings, headsets, and similar occlusions. Chowdhury et al. [15] proposed an ear biometric recognition system that uses local features of the ear and then a neural network to identify the ear. The method estimates where the ear could be in the input image and then extracts edge features from the identified ear. After identifying the ear, a neural network matches the extracted features against a feature database. The databases used in this system were AMI, WPUT, IITD, and UERC, on which accuracies of 70.58%, 67.01%, 81.98%, and 57.75%, respectively, were achieved.

Raveane et al. [5] noted that it is difficult to precisely detect and locate an ear within an image; this challenge increases when working under variable conditions, owing to the irregular shape of human ears, lighting conditions, and the changing profile shape of an ear when photographed [5]. Their ear detection system used multiple CNNs combined with a detection grouping algorithm to identify an ear's presence and location. The proposed method matches other methods' performance when analyzed against clean, purpose-shot photographs, reaching an accuracy upward of 98%. It outperforms them, with a rate of over 86%, when the system is subjected to non-cooperative natural images in which the subject appears in challenging orientations and photographic conditions.

A multiple-scale faster region-based convolutional neural network (Faster R-CNN) to detect ears from 2D profile images was proposed by Zhang and Mu [6]. The method detects three regions of different scales to infer information about the ear's location from its surrounding context in the image, which helps to extract the ear correctly. The system was tested with 200 web images and achieved 98% accuracy. Further experiments were conducted on Collection J2 of the University of Notre Dame Biometrics Database (UND-J2) and the University of Beira Interior Ear dataset (UBEAR); these achieved detection rates of 100% and 98.22%, respectively, even though these datasets contain large occlusions and scale and pose variations.

Kohlakala and Coetzer [7] presented semi-automated and fully automated ear-based biometric verification systems. A CNN and morphological postprocessing identify the ear region by classifying pixels as belonging to either the foreground (ear) or the background of the image. Matching for feature extraction was applied to the binary contour image by implementing a Euclidean distance measure with ranking for authentication. Two databases, the Mathematical Analysis of Images Ear database and the Indian Institute of Technology Delhi Ear database, were used, achieving 99.20% and 96.06%, respectively.

Tomczyk and Szczepaniak [8] presented geometric deep learning (GDL), which generalizes CNNs to non-Euclidean domains. It uses convolutional filters based on a mixture of Gaussian models, chosen so that images can be rotated without interpolation. Their published experimental results show that the approach exploits rotational equivariance to detect rotated structures without requiring labor-intensive training on all rotated and nonrotated images.

Alshazly et al. [9] presented and compared ear recognition models built with handcrafted and CNN features. The paper used seven well-performing handcrafted descriptors to extract discriminative features from ear images and then trained a support vector machine (SVM) on the extracted features to learn a suitable model. CNN-based models using a variant of the AlexNet architecture were also employed. The results obtained on three ear datasets showed the CNN-based models' performance improved by 22%. The paper also investigated whether the left and right ears are symmetric; the results obtained on the two datasets indicate a high degree of symmetry between the ears.

Alkababji and Mohammed [10] presented the use of a deep learning object detector, the faster region-based convolutional neural network (Faster R-CNN), for ear detection. The CNN is used for feature extraction, with principal component analysis (PCA) and a genetic algorithm for feature reduction and selection, and a fully connected artificial neural network as the matcher. The results achieved a 97.8% success rate.

Jamil et al. [11] built and trained a CNN model for ear biometrics under various uniform illuminations measured in lumens. They considered their work the first to test the performance of a CNN on underexposed and overexposed images. The results showed that images with uniform illumination above 25 lux achieved a result of 100%. The CNN model had problems recognizing images when the illuminance was below ten lux but still produced an accuracy of 97%. This shows that the CNN architecture performs as well as the other systems. It was found that rotations present in the dataset affected the results.

Hansley et al. [12] presented an unconstrained ear recognition framework that outperformed the state-of-the-art systems of the time on publicly available databases. They developed CNN-based solutions for ear normalization and description, which were fused with handcrafted descriptors to improve recognition. This was done in two stages: the first stage employed landmark detectors that proved robust even in untrained scenarios, and the second generated a geometrically normalized image to boost performance. The CNN descriptor was shown to be better than other CNN-based works in the literature, and the obtained results were higher than the other reported results for the UERC challenge.

3. Data and Methods

3.1. Dataset

In this study, all experiments were performed on numerous public ear datasets, which are described below. The UBEAR, EarVN1.0, IIT, ITWE, and AWE databases are best suited for ear identification due to their large size. EarVN1.0 has seen the most prominent usage in age estimation using CNN techniques and is an appropriate dataset for ear images taken in a controlled environment, while ITWE is suitable for classifying ears in an uncontrolled environment. A summary of the datasets is shown in Table 3.

3.1.1. Mathematical Analysis of Images (AMI) Ear Database

The AMI Ear database [19] was collected at the University of Las Palmas. The database comprises 700 ear images of 100 distinct Caucasian male and female adults between the ages of 19 and 65. All images in the database were taken under the same illumination and with a fixed camera position, and both the left- and right-hand sides of the ears were captured. The images were cropped so that the ear area covers almost half the image. The pose of the images varies in yaw and slightly in pitch angle. This dataset is publicly available.

3.1.2. The Indian Institute of Technology (IIT) Delhi Ear Database

The IIT database [16] was collected by the Indian Institute of Technology Delhi in New Delhi between October 2006 and June 2007. The database is formed from 421 images of 121 distinct adults, both male and female. All images were taken in an indoor environment with no significant occlusions present, and only the right-hand side of the ear was captured. The dataset provides both raw and normalized images; the normalized images are grayscale and of size 272 × 204 pixels.

3.1.3. The University of Beira Ear (UBEAR) Database

The University of Beira presented the UBEAR database [25]. The database comprises 4429 images of 126 subjects, both male and female. The images were taken under varying lighting conditions and at varying angles, and partial occlusions were present. The images include both the left- and right-hand sides of the ear.

3.1.4. The Annotated Web Ear (AWE) Database

The AWE database [18] is a collection of web images of public figures. The database was formed from 1000 tightly cropped images, varying in size, of 100 different subjects. Both the left- and right-hand sides of the ears were captured.

3.1.5. EarVN1.0

The EarVN1.0 database [22] comprises 28412 images of 164 Asian male and female subjects, with both the left- and right-hand sides of the ear captured. The collection took place during 2018 under unconstrained conditions, including varying camera systems and lighting conditions. The images were cropped from facial images to obtain the ears and exhibit significant variations in pose, scale, and illumination.

3.1.6. The Western Pomeranian University of Technology Ear (WPUTE) Database

The Western Pomeranian University of Technology Ear (WPUTE) database [20] was collected in 2010 to gauge ear recognition performance on images obtained in the wild. The database contains 2071 ear images belonging to 501 subjects. The images are of various sizes, cover both the left- and right-hand sides of the ear, and were taken under different indoor lighting conditions and rotations. Some occlusions are included in the database, namely headsets, earrings, and hearing aids.

3.1.7. The Unconstrained Ear Recognition Challenge (UERC)

The Unconstrained Ear Recognition Challenge (UERC) database [21] was compiled in 2017 and extended in 2019, and is a mix of two existing databases and a newly created one. The database contains 11804 ear images of 3706 subjects and includes both left- and right-hand side images.

3.1.8. In-the-Wild Ear (ITWE) Database

The In-the-Wild Ear (ITWE) database [23] was created for recognition evaluation and has 2058 images in total of 231 male and female subjects. The ear regions were annotated with bounding boxes, whose coordinates were released with the collection. The images contain cluttered backgrounds and are of variable size and resolution. The database includes both the left- and right-hand sides of the ear, but no distinction between the two was provided.

3.1.9. The University of Science and Technology Beijing (USTB) Ear Database

The University of Science and Technology Beijing (USTB) Ear database [17] contains cropped ear and head profile images of male and female subjects split into four sets. Dataset one includes 60 subjects with 180 close-up images of right ears collected during 2002; these images were taken under different lighting conditions and exhibit some shearing and rotation. Dataset two contains 77 subjects with 308 images of the right-hand side ear taken approximately 2 meters away, collected in 2004 under different lighting conditions. Dataset three contains 103 subjects with 1600 images, also taken during 2004; the images include right and left rotations and are of size 768 × 576 pixels. Dataset four contains 25500 images of 500 subjects, obtained from 2007 to 2008 with the subject at the center of a camera circle. The images were taken with the subject looking upward, downward, and at eye level and contain different yaw and pitch poses. The databases are available on request for research purposes.

3.1.10. The Carreira-Perpinan (CP) Ear Database

The Carreira-Perpinan (CP) Ear database [24] is an early dataset used for ear recognition systems. It was created in 1995 and contains 102 images of 17 subjects. The images were captured in a controlled environment and include only minor pose variation.

3.1.11. The Indian Institute of Technology Kanpur (IITK) Ear Database

The Indian Institute of Technology Kanpur (IITK) ear database [26] was compiled by the Indian Institute of Technology Kanpur. The database is split into three sets. The first set consists of 801 profile images of 190 male and female subjects. The second set also contains 801 images, covering a total of 89 subjects, with variations in pitch angle. The third set contains 1070 images of the same 89 subjects, but with variations in yaw angle.

3.1.12. The Forensic Ear Identification Database (FEARID)

The Forensic Ear Identification Database (FEARID) [27] differs from other databases in that it only includes ear prints. These contain no occlusions, variable angles, or illumination changes. Although no other variables are mentioned, influences such as the force with which the ear was pressed against the scanner and the scanner's cleanliness need to be considered. The database comprises 7364 prints from 1229 subjects and was intended for forensic application rather than biometric use.

3.1.13. The University of Notre Dame (UND) Database

The University of Notre Dame (UND) database [28] contains many subsets of 2D and 3D ear images. These images were acquired over the period from 2003 to 2005. The database contains 3480 3D images from 952 male and female subjects and 464 2D images from 114 male and female subjects. The images were taken under different lighting conditions, yaw and pitch poses, and angles, and show only the left-hand side of the ear.

3.1.14. The Face Recognition Technology (FERET) Database

The Face Recognition Technology (FERET) database [29] is a sizeable facial image database collected between 1995 and 1996. It contains 14126 images of 1564 subjects. The images were collected for face recognition and include left- and right-hand profile views, which makes them well suited for 2D ear recognition.

3.1.15. The Pose, Illumination and Expression (PIE) Database

Carnegie Mellon University collected the Pose, Illumination and Expression (PIE) database [30], which contains 40000 images of 68 subjects. The images are facial profiles with different poses, illuminations, and expressions.

3.1.16. The XM2VTS Ear Database

The XM2VTS Ear database [31] consists of frontal and profile facial images from the University of Surrey; it contains 2360 images of 295 subjects captured under controlled conditions. The images are cropped to 720 × 576 pixels and were extracted from video data.

3.1.17. The West Virginia University (WVU) Ear Database

The West Virginia University (WVU) Ear database [32] is a video database formed from 137 subjects. An advanced capturing procedure allowed the ear to be recorded at different angles, and the footage includes subjects wearing earrings and eyeglasses.

3.2. Preprocessing

Image preprocessing is a significant part of any deep learning task. CNN models generally require a large dataset to learn features discriminative enough for making predictions and obtaining good performance. As the images in the datasets are of different sizes, the input images need to be resized to conform to each CNN model's expected input size, while preserving the features during resizing. Examples of the original and the preprocessed images are shown in Figures 2 and 3.
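As an illustration, the following sketch shows one way such resizing could be performed with TensorFlow/Keras; the 224 × 224 target size and the directory layout are assumptions for illustration, not details taken from the paper.

```python
import tensorflow as tf

# Assumed target input size; each EfficientNet variant actually expects its
# own resolution (e.g., B0: 224x224, B7: 600x600).
IMG_SIZE = (224, 224)

# Load images from a hypothetical directory tree (one subfolder per class),
# resizing every image to the model's expected input size.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "ear_dataset/train",       # hypothetical path
    image_size=IMG_SIZE,       # bilinear resize keeps the overall ear shape
    batch_size=32,
    label_mode="categorical",  # one-hot labels for categorical cross-entropy
)
```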

3.3. Transfer Learning

In this study, the concept of transfer learning was adopted: a CNN model pretrained on a large dataset was used to help learn the features of the target (right and left ears). The features learned by deep CNN models on other large datasets are transferred to this dataset. The number of parameters in a deep CNN model increases as the network gets deeper, which is used to achieve improved efficiency.

Hence, a deep CNN requires large datasets for training, making it computationally complex, and applying such models directly to a small, new dataset results in feature extraction bias, overfitting, and poor generalization. Instead, the pretrained CNN's structure was modified and fine-tuned to suit the given dataset. This concept of transfer learning is computationally inexpensive, requires less training time, overcomes the limitations of small datasets, improves performance, and is faster than training a model from scratch. The pretrained CNN models fine-tuned in this work are the EfficientNets. The proposed structure is represented in Figure 4.
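A minimal sketch of this transfer learning setup follows, assuming ImageNet weights and Keras's built-in EfficientNet application; freezing the base initially is an illustrative choice, not a detail stated in the paper.

```python
import tensorflow as tf

# Load EfficientNetB0 pretrained on ImageNet, dropping its original
# 1000-class head so a new ear-specific head can be attached.
base_model = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights="imagenet",
    input_shape=(224, 224, 3),  # assumed input resolution for B0
)

# Freeze the pretrained layers so only the new head is trained first;
# layers can later be unfrozen for fine-tuning at a lower learning rate.
base_model.trainable = False
```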

3.4. EfficientNet Architecture

EfficientNet is a lightweight model family based on an automated machine learning framework that develops a baseline EfficientNet B0 network and then uniformly scales up its depth, width, and resolution using a simple and effective compound coefficient to produce the EfficientNet B1–B8 models. The models perform efficiently and attain superiority over existing CNN models on benchmark datasets. EfficientNets are smaller, require fewer parameters, and are faster and more generalizable, obtaining higher accuracy on other datasets, which makes them popular for transfer learning tasks. The proposed study fine-tuned the EfficientNet B0–B8 models on the dataset to detect ears. In transferring the pretrained EfficientNets to the ear dataset, the models were fine-tuned by adding a global average pooling layer to reduce the number of parameters and curb overfitting. The global average pooling is followed by dense layers with a ReLU activation function and a dropout rate of 0.4 before the final output layer [33]. The output layer uses the softmax activation function to determine the probability that the input represents each ear class, as given in

\[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K, \]

where \(\sigma\) is the softmax activation function, \(\mathbf{z}\) represents the input vector to the output layer, \(e^{z_i}\) is the exponential of element \(z_i\), \(K\) is the number of classes, and the \(e^{z_j}\) terms form the output vector of the exponential function.
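Continuing the earlier sketch, the fine-tuned head described above could be assembled as follows; the hidden layer width and class count are assumptions, while the global average pooling, ReLU dense layer, 0.4 dropout, and softmax output follow the text.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 2  # assumed: left ear vs. right ear

model = models.Sequential([
    base_model,                            # pretrained EfficientNet backbone
    layers.GlobalAveragePooling2D(),       # reduces parameters, curbs overfitting
    layers.Dense(256, activation="relu"),  # assumed width of the dense layer
    layers.Dropout(0.4),                   # dropout rate stated in the text
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
```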

It is known that too many training iterations can lead to model overfitting, while too few can cause underfitting; this study therefore used an early stopping strategy. Training was configured to terminate after approximately 90 iterations, applying early stopping to improve performance and control overfitting, with gradient descent used for optimization. The EfficientNet B0–B8 models were trained for up to 100 iterations (epochs) with a batch size of 32 and a regulated momentum of 0.2, while categorical cross-entropy was the loss function used to update the weights at each iteration. The hyperparameters used were evaluated and found to perform optimally. The weight update can be defined as

\[ w \leftarrow w - \eta \, \nabla_{w} L\big(w; x^{(i)}, y^{(i)}\big), \]

where \(\nabla_{w} L\) is the gradient of the loss with respect to \(w\), \(\eta\) is the defined learning rate, \(w\) is the weight vector, and \(x^{(i)}\) and \(y^{(i)}\) are the respective training sample and label.
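A hedged sketch of this training configuration in Keras is shown below; the learning rate, patience value, and validation split are assumptions, while the momentum of 0.2, batch size of 32 (set when the dataset was built), categorical cross-entropy loss, early stopping, and 100-epoch cap follow the text.

```python
from tensorflow.keras import callbacks, optimizers

model.compile(
    # Momentum from the text; the learning rate is an assumed value.
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.2),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Stop early once validation loss stops improving, keeping the best weights.
early_stop = callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,               # assumed patience
    restore_best_weights=True,
)

history = model.fit(
    train_ds,
    validation_data=val_ds,    # hypothetical validation split
    epochs=100,                # upper bound; early stopping usually ends sooner
    callbacks=[early_stop],
)
```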

4. Results and Discussion

Various EfficientNet variants were fine-tuned on all the ear datasets to detect the ear. Each dataset was split into 20% training and 80% test sets. The experiments were performed entirely in the Keras deep learning framework with the TensorFlow backend. The models were evaluated using the popular evaluation metrics of equations (3)–(7) (accuracy, sensitivity, specificity, and area under the curve). The performance of all experiments is evaluated using a series of confusion matrix-based metrics.

The confusion matrices are used to evaluate the classifiers, with true positives (TPs) representing the ears that are correctly classified as positive, true negatives (TNs) representing the ears that are correctly classified as negative, false positives (FPs) representing the ears that are incorrectly classified as positive, and false negatives (FNs) representing the ears being incorrectly classified as negative.
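For concreteness, the confusion matrix-based metrics defined in the following subsections can be computed as in this sketch; the scikit-learn usage and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity

# Example with hypothetical left (0) / right (1) ear labels:
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(binary_metrics(y_true, y_pred))
```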

4.1. Specificity

It is the ratio of the negative instances correctly classified by a model to the overall number of true-negative instances tested, as given in equation (5):

\[ \text{Specificity} = \frac{TN}{TN + FP}. \quad (5) \]

4.2. Accuracy

It is a measure that indicates the ratio of all correctly recognized cases to the overall number of cases. While this metric generally gives a decent reflection of the classifier, it may not reflect a classifier's true performance when the class distribution is uneven. Accuracy is computed using equation (3):

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \quad (3) \]

4.3. Sensitivity

It is the ratio of the positive instances correctly classified by a model to the overall number of actual positive instances. A low sensitivity indicates that a model suffers from a high number of false negatives. Sensitivity is computed using equation (4):

\[ \text{Sensitivity} = \frac{TP}{TP + FN}. \quad (4) \]

The results obtained are presented in Figures 5 and 6, which show the accuracy and loss across these datasets. Each EfficientNet variant was trained for up to 100 epochs, and the accuracy was determined on the test set. The models performed well at extracting and learning discriminative features from the dataset. EfficientNet B8 attains the best accuracy of 98.45%; the EfficientNet results are reported in Table 4.

An advantage of EfficientNets is that they are smaller, with fewer parameters, faster, and transfer successfully from the pretraining datasets. The worst performing EfficientNet is B2, as shown in Table 4. Even though it has minimal parameters, it may have performed poorly because the images were down-sampled to conform to the model's input size. Performance generally improves as the model gets deeper. EfficientNet B0 started poorly, beginning to converge, with little noise, from roughly the 30th iteration, and then stabilized until about the 50th iteration, when overfitting started. The best performing EfficientNet is B8, as shown in Table 4, owing to its large number of parameters; it began to converge from roughly the 60th iteration and stabilized until about the 90th, when overfitting started. It was found that when a dataset is large with balanced classes, the results achieved were high. Determining the most suitable hyperparameters was one of the challenges faced, as was overfitting, which stemmed from the limited data samples. The results of the proposed method compared with related studies are presented in Figure 7.

5. Conclusion

This study investigated and implemented EfficientNet models to automatically identify ears in the most prominent publicly available datasets. EfficientNets, which have achieved state-of-the-art performance over other architectures in maximizing accuracy and efficiency, were explored and fine-tuned on profile images. The fine-tuning technique is valuable for exploiting the rich generic features learned from large dataset sources such as ImageNet to compensate for the lack of annotated datasets in the ear domain. The experimental results show the effectiveness of EfficientNets in extracting and learning distinctive features from ear images and classifying them into the appropriate left or right class. Of the nine EfficientNet variants explored in this study, EfficientNet B8 outperformed the others, as evident in Table 5 and depicted in Figure 7. One significant limitation of the proposed approach is that the model was trained on small datasets and on images with low resolutions, which can easily result in significant overfitting; overcoming this requires effective image preprocessing techniques. Although the proposed methodology is specified for ear detection, it could be extended to detect other parts of the face, given the right set of datasets.

Data Availability

Datasets used to support the findings of the study are publicly available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.