Abstract

Facial recognition is the basic process underlying an extensive range of security systems functioning in real-time applications. Owing to several factors like low resolution, occlusion, illumination, noise, and pose variation, a satisfactory outcome has not been achieved by many models developed for face recognition (FR). Therefore, by utilizing a reconstruction scheme-centric Viola–Jones algorithm (RVJA) and a shallowest sketch-centered convolutional neural network (SCNN), an effectual face detection and recognition (FDR) system is proposed here that accounts for the aforementioned factors. Initially, face detection (FD) is performed by employing the RVJA, which identifies faces in a provided image by fitting a global facial model across various positions and poses; the proposed RVJA handles unconstrained face images owing to efficient properties such as boundedness and invariance, together with the ability to reconstruct the actual image. After that, the SCNN methodology is utilized for FR, learning the complicated features of the face-detected images. Next, the proposed methodology's experimental outcomes are compared with other prevailing methodologies regarding metrics like area under the curve (AUC), recognition accuracy (RA), and average precision (AP). The experimental outcomes show that the proposed model recognizes facial images with higher accuracy than the other conventional methodologies.

1. Introduction

FR has been deemed a popular research area in the military as well as commercial fields [1]. FR is a technique that verifies or identifies a person's identity by evaluating and relating patterns grounded on the person's facial features [2, 3]. In real-world scenarios, FR algorithms face limitations arising from factors like impersonation via pictures, lighting conditions, and lower-quality image processing [4]. Moreover, FR turns into a complicated task under factors like partial or total occlusion by other objects, the view angle owing to the camera position, or lower-resolution sensors capturing the image [5, 6]. For humans, FR is extremely effortless; however, it is difficult for a machine [7]. The development of FR methodologies has shown extremely swift progress over the last two decades [8, 9]. With the continuous enhancement of science and technology, FDR is applied in numerous fields like the monitoring systems of bank self-service cash machines, the face-brushing payment technology of Alipay, identity verification via application face scanning, and the face unlocking of mobile phones [10]. The FDR methodologies, which have been well studied in the computer vision domain, have been amalgamated with these systems in an attempt to handle certain external issues like computational cost, face capture angle, facial expression, the existence of hair, facial alteration relying on luminosity, time, usage of accessories or ornaments, classifier performance, ethnic variations, and longer distances from the camera [11, 12].

Deep learning can achieve a good approximation of a complex function through increments of hidden layers; hence, it is capable of achieving better results in face recognition [13]. With the recent enhancements in deep learning (DL), automatic FR systems grounded on deep convolutional neural networks (DCNNs) have outshined human performance in recognizing faces under well-constrained conditions like standard illumination and frontal pose [14–16].

1.1. Problem Definition

Better recognition outcomes were obtained by the prevailing research methodologies; however, these models still have some limitations that are not yet completely resolved, listed as follows:

(i) Several FD methodologies have been developed that concentrate mainly on extracting various sorts of features and building cascades of features; however, these systems consume more time to train and are ineffective.
(ii) Prevailing methodologies developed utilizing face-like features along with normalized pixel differences recognize merely frontal face images; thus, they are sensitive to unconstrained images.
(iii) Most systems are available only for known face images, making the conventional FR models ineffective for unknown facial queries.

Therefore, by utilizing the RVJA and the SCNN, an effectual FDR model has been proposed to overcome the aforementioned issues. The proposed technique's major contributions are listed below:

(i) For detecting the face, the RVJA, which handles unconstrained face images, is proposed.
(ii) For face recognition, the SCNN approach, which learns the complicated features of the face-detected images, is proposed.

The paper's remaining parts are structured as follows: Section 2 surveys the conventional research models pertinent to facial image recognition; Section 3 explicates the proposed framework; Section 4 assesses the proposed model's performance; lastly, Section 5 concludes the paper with future directions.

2. Literature Survey

Cheng et al. [17] developed a two-layer CNN to learn higher-level features for FR by means of a sparse representation that sparsely represents the face image with a subset of the training data. This description of the provided input face image considerably enhanced the FR system's performance. The experimental outcomes displayed that, on the given dataset, the presented system achieved better performance when compared with other systems. Nonetheless, a larger dataset was required by the CNN.

Iqbal et al. [18] examined hybrid angularly discriminative features by amalgamating a multiplicative angular margin with an additive cosine margin for enhancing the efficacy of the angular SoftMax loss as well as the large-margin cosine loss. The model was trained utilizing the CASIA-WebFace dataset; subsequently, testing was conducted on YouTube Faces (YTF), Labeled Faces in the Wild (LFW), VGGFace1, and VGGFace2. The experimental outcomes displayed that the model's accuracy was higher than that of the prevailing methodologies. However, more time was consumed by this model.

Zhao et al. [19] presented data augmentation via image brightness alterations, geometric transformations, and the application of varied filter operations. Furthermore, the finest data augmentation methodology was determined through orthogonal experiments. Eventually, the system's FR performance was illustrated in a real classroom setting. With data augmentation, the developed system attained better accuracy than the PCA and LBPH methodologies. Nevertheless, more time was consumed by the VGG-16 network to train its parameters.

Zhao et al. [20] constructed a deep neural network (DNN) to deeply encode the face regions; in this, a face alignment algorithm was deployed for localizing the key points inside the faces. Then, PCA was employed to abate the deep features' dimensionality; similarly, a joint Bayesian model was utilized for evaluating the similarity of feature vectors; thus, highly competitive face classification accuracy was attained. In addition, several FR attacks under different contexts were handled by the FR system. However, when compared with conventional machine learning algorithms, the neural network needed more data.

Alghaili et al. [21] suggested a system that could directly identify an individual under all criteria by extracting the most significant features and utilizing them to recognize a person. A DCNN was trained to extract the most significant features. Then, the significant features were selected by utilizing a filter. After that, the selected features of every single identity in the dataset were subtracted from the actual image's features to find the minimum value that denotes the identity. The outcomes displayed that the presented model recognized faces effectively in varied poses. However, owing to the max-pooling operation, the DNN was slow.

Lin et al. [22] presented a feature extraction methodology that transformed thermal images into features. In addition, the authors utilized DL, Random Forest, and ensemble learning to construct an FR model. In the feature extraction methodology, the facial image was cut into blocks; subsequently, the feature image and the feature matrix were regenerated. The empirical outcomes demonstrated that higher prediction performance was achieved by the feature extraction technique. Nevertheless, this model required more time for prediction.

Lei et al. [23] constructed a hybrid model grounded on DL and visual tracking (RFR-DLVT) to obtain efficient FR. Initially, video sequences were separated into reference frames (RFs) and nonreference frames (NRFs). Next, in RFs, the target face was recognized by means of the DL-centric FR methodology. Meanwhile, in NRFs, the Kernelized Correlation Filters-centered visual tracking model was employed to speed up FR. The model was tested on common datasets and attained better performance. Nevertheless, a larger amount of data was required by this model to attain better performance.

Tabassum et al. [24] amalgamated the coherence of the discrete wavelet transform (DWT) with four varied algorithms, (i) the eigenvector of linear discriminant analysis (LDA), (ii) the error vector of principal component analysis (PCA), (iii) the eigenvector of PCA, and (iv) a CNN, for enhancing the FR accuracy; subsequently, the four outcomes were amalgamated by utilizing the entropy of detection probability along with a fuzzy system. The recognition accuracy was established depending on the image and the diversity of the database. However, the CNN was prone to overfitting.

Teoh et al. [25] structured an FR and identification system by utilizing a DL methodology. Primarily, it spotted the faces in images or videos; then, recognition was performed after training the classifier. The Haar feature-centric cascade classifier was utilized in FD, and a TensorFlow model was employed in the system's classifier section. Experimental outcomes were given to demonstrate the system's accuracy. Nevertheless, false-positive detections occurred whilst utilizing the Haar feature-centric cascade classifier.

3. Proposed Facial Image Recognition System

The input images are first fed to the RVJA for FD. Then, the face-detected images are rescaled and normalized before being passed to the SCNN model for accurate FR. Figure 1 exhibits the block diagram of the proposed framework.

3.1. Face Area Detection

This is the initial step. Here, the face part is segmented from the input images $I_{n}$. The RVJA is utilized for segmenting the face part. Only frontal faces are detected by the traditional VJA; hence, it is ineffective whilst detecting faces turned sideways, upward, or downward. Therefore, iterative closest normalized pixel difference (ICNPD) features are computed for the input images to enhance the detection efficiency; in addition, the face models are reconstructed in various poses and varied directions. Consequently, owing to properties like boundedness, invariance, and enabling actual image reconstruction, FD is enabled by the reconstruction strategy under unconstrained situations. Four phases are enclosed in the RVJA: selecting features, creating an integral image, AdaBoost training, and cascading.

3.1.1. Selecting Features

This is the major process in the generation of a face model with infinite novel poses. Here, the significant properties like boundedness and scale invariance are preserved by utilizing an optimal subset of the NPD features. After that, the optimal transformation is detected by the iterative closest point (ICP) methodology; here, the approximate feature location is detected, and face cropping along with rough alignment is performed until the detected point is close enough to the true location.

At first, the ICNPD feature vector is calculated; then, the algorithm outlines a box with this feature vector. After that, by scanning every single subregion of the image from top to bottom, the outlined box searches for a face in the provided image. The NPD feature between two pixels $x$ and $y$ in the image is measured as follows:

$$f(x, y) = \frac{x - y}{x + y},$$

where $x, y \geq 0$ are the intensity values of the two pixels and $f(0, 0)$ is defined as $0$.

Next, the iterative closest points are computed between the closest point queries in the target set; in addition, the distance between the respective points is minimized to reconstruct the face model. The value of these features is computed by subtracting the sum of pixels in rotation from the sum of pixels in translation, as follows:

$$f_{v} = \sum_{p \in \mathbb{T}} p - \sum_{p \in \mathbb{R}} p,$$

where the features' value is specified as $f_{v}$ and the relative rotation and translation pixels computed in the closest form are signified as $\mathbb{R}$ and $\mathbb{T}$. Various parts of the face can be interpreted by utilizing such features.
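As a minimal illustrative sketch (assuming grayscale intensity arrays and omitting the ICP refinement that the proposed method layers on top), the basic NPD feature can be computed as follows:

import numpy as np

def npd(x, y):
    # Normalized pixel difference between two arrays of pixel intensities.
    # The result is bounded in [-1, 1] and invariant to a common scaling
    # of x and y, which underlies the boundedness and invariance
    # properties mentioned above.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    s = x + y
    out = np.zeros_like(s)
    nonzero = s != 0              # f(0, 0) is defined as 0
    out[nonzero] = (x[nonzero] - y[nonzero]) / s[nonzero]
    return out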

3.1.2. Integral Image Creation

The computation is performed over all the pixels in a specific feature whilst calculating the feature values. The number of pixels in large features is high; thus, for larger features, the computation is highly challenging. To make the computations effective, the integral image, that is to say, an intermediate representation of the image that permits the fast computation of rectangular region sums, is generated; in it, each entry provides the sum of the pixel values to the left of and above the specific pixel, inclusive. The integral image is expressed as follows:

$$II(x, y) = \sum_{x^{\prime} \leq x,\; y^{\prime} \leq y} I\left(x^{\prime}, y^{\prime}\right),$$

where the integral image is notated as $II(x, y)$ and the original image is symbolized as $I\left(x^{\prime}, y^{\prime}\right)$, with $x^{\prime} \leq x$, $y^{\prime} \leq y$. The recursion formula utilized in the integral computation is expressed as follows:

$$s(x, y) = s(x, y - 1) + I(x, y), \qquad II(x, y) = II(x - 1, y) + s(x, y),$$

where the cumulative row sum is indicated as $s(x, y)$. The value of rectangle-like features is computed utilizing the integral image with four values present at the corners of the provided rectangle rather than computing the sum of all pixels.
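A minimal sketch of the integral image and the resulting constant-time rectangle sum (four array lookups per rectangle) is given below:

import numpy as np

def integral_image(img):
    # ii[x, y] = sum of img over all pixels above and to the left, inclusive
    return np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # Sum of img[top:bottom+1, left:right+1] from four corner values of ii,
    # regardless of how large the rectangle is.
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return int(total)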

3.1.3. AdaBoost Training

Several thousand features may be computed when utilizing a base window for analyzing the features; however, only a few features are useful for detecting the face. Therefore, the AdaBoost algorithm is utilized to select the best features. By amalgamating the weighted weak classifiers, a strong classifier is formed by this algorithm. It is expressed as follows:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_{t} h_{t}(x)\right),$$

where the strong classifier is specified as $H(x)$, the $T$ features known as weak classifiers are signified as $h_{t}(x)$, and the classifiers' respective weights are notated as $\alpha_{t}$. To decide whether the image's subregion contains a face or not, the amalgamation of these features is utilized.
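A minimal sketch of combining weighted weak classifiers into the strong decision is shown below; the weak learners and their weights are assumed to come from a prior AdaBoost training run:

def strong_classify(features, weak_classifiers, alphas, threshold=0.0):
    # weak_classifiers: list of callables mapping a feature vector to +1/-1
    # alphas:           AdaBoost weights learned for each weak classifier
    score = sum(a * h(features) for h, a in zip(weak_classifiers, alphas))
    return 1 if score >= threshold else -1   # +1: face, -1: non-face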

3.1.4. Cascading

Here, a series of classifiers, that is to say, a cascaded system comprising numerous stages to detect the face, is applied to the provided subregion. After a subregion enters the cascaded system, the regions with faces are forwarded through all the stages; conversely, the regions devoid of faces are rejected at the particular stage itself. By doing so, the system saves time by avoiding the image's nonface regions; in addition, it detects faces under varied expressions, poses, and illumination along with disguise. The segmented face images are represented as $F_{s}$. The VJA pseudocode is illustrated in Algorithm 1.

Input: Input images $I_{n}$
Output: Segmented face images $F_{s}$
Begin
Initialize the number of images $N$
Set $n = 1$
While $n \leq N$
  Compute ICNPD features
  Compute integral image
    $II(x, y) = II(x - 1, y) + s(x, y)$ //integral image
  Give input to the cascade classifier
  for each shift do
   for each stage do
    for each filter at the stage do
     Collect the filter outputs
    End for
    if the stage predicts negative then
     Discard the input
    End if
   End for
   if the input passed all the stages then
    Accept the input as a face
   End if
  End for
Set $n = n + 1$
End while
Return segmented face images $F_{s}$
End

The fundamental steps undergone by the VJA segmentation algorithm are described in Algorithm 1: (i) feature computation, (ii) feature selection, and (iii) object detection are the steps undergone by the algorithm to detect and segment the face from the input frames.
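For orientation, the classical Viola–Jones cascade (the baseline that the RVJA extends) is available in OpenCV with a pretrained frontal-face model. The following is a minimal sketch of the cascade stage only; the ICNPD features and the reconstruction scheme proposed above are not part of OpenCV and are omitted here:

import cv2

# Pretrained frontal-face Haar cascade shipped with OpenCV
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("input.jpg")            # hypothetical input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Each returned window has passed every cascade stage; windows rejected
# at any stage were discarded early, as in Algorithm 1.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face_crop = gray[y:y + h, x:x + w]     # one segmented face region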

3.2. Rescaling and Normalization

The face-detected images are subjected to preprocessing steps like rescaling and normalization to enhance the overall quality of the feature extraction stage. A face image at a varied scale generates a nonconforming feature representation that hampers the model's generalization; thus, rescaling is performed. Conversely, normalization is performed to normalize the range of the pixel values of the input images with the intention of abating the high variation in the values. The normalization is formulated as follows:

$$N_{m} = \frac{P_{c} - \min\left(P_{a}\right)}{\max\left(P_{a}\right) - \min\left(P_{a}\right)},$$

where the current pixel is specified as $P_{c}$, all other pixel values are notated as $P_{a}$, and the rescaled and normalized images are signified as $N_{m}$.
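A minimal preprocessing sketch is given below; the 224 × 224 target size is an illustrative assumption, as the paper does not state the network's input resolution:

import cv2
import numpy as np

def preprocess(face, size=(224, 224)):
    # Rescale the detected face to a fixed size, then min-max normalize
    # its pixel values into [0, 1] to abate variation across images.
    resized = cv2.resize(face, size, interpolation=cv2.INTER_AREA).astype(np.float64)
    lo, hi = resized.min(), resized.max()
    if hi == lo:                       # flat image: avoid division by zero
        return np.zeros_like(resized)
    return (resized - lo) / (hi - lo)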

3.3. Face Recognition

Here, the normalized images are processed by utilizing the SCNN model to recognize the person. A CNN is a DL algorithm; it comprises two sorts of hidden layers, the convolution layer (CL) and the pooling layer (PL), which are arranged alternately in the neural network. The system provides output by conducting FR to match the face of the user. Here, to realize a novel feature-sharing technique, the shallowest layer is incorporated with the hidden layers, thus achieving higher run-time efficiency. In the shallow layer, initially, all facial landmarks are estimated holistically for preserving the facial structure; in addition, localization-sensitive information is utilized to construct the sketch feature vector. Therefore, the CL along with the PL is fed with the sketch feature built from the landmark-extracted images of the shallow layer. Thus, the recognition accuracy is enhanced even when an unknown facial query has to be identified. In this manner, the FR network, which subsumes three phases, namely, (i) the shape prediction stage, (ii) feature extraction, and (iii) classification, enhances the recognition accuracy considerably. Figure 2 exhibits the architecture of the proposed SCNN.

3.3.1. Shape Prediction Stage

Here, to estimate the face contour termed the sketch, the shallowest layer localizes a set of facial landmarks for the provided input images. The facial region is depicted by the localized landmarks; in addition, they are linked in a fixed order to build the sketch feature vector. Therefore, the extracted landmark features are modeled as follows:

$$\Phi_{s} = \varphi\left(N_{m}, \mathbb{L}\right), \qquad \mathbb{L} = \left\{l_{1}, l_{2}, \ldots, l_{k}\right\},$$

where the shape-indexed feature map respective to the shallowest layer is represented as $\Phi_{s}$, the landmark-indexed mapping is denoted as $\varphi(\cdot)$, and the set of localized landmarks is indicated as $\mathbb{L}$. Along with the extracted landmarks, the sketch feature vector is engendered with the predicted shape. Consequently, the feature vectors are jointly utilized to train the network, which exploits the facial attributes together with geometric relationships between the sketches and the images.
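A minimal sketch of linking localized landmarks in a fixed order into the sketch feature vector is shown below; the landmark detector itself is assumed to be available (any standard pretrained shape predictor could supply the (x, y) points), and the bounding-box normalization is an illustrative choice:

import numpy as np

def sketch_vector(landmarks):
    # landmarks: (k, 2) array of (x, y) points localized on the face.
    # Normalizing by the landmark bounding box keeps the vector
    # localization-sensitive yet invariant to face scale and position.
    pts = np.asarray(landmarks, dtype=np.float64)
    mins = pts.min(axis=0)
    span = pts.max(axis=0) - mins
    span[span == 0] = 1.0              # guard against degenerate boxes
    normalized = (pts - mins) / span
    return normalized.reshape(-1)      # fixed-order concatenation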

3.3.2. Convolution Layer

This is the first layer in the CNN, having a set of feature detectors named kernels; every single kernel has its corresponding bias value. To verify whether a feature is present or not, the kernels execute convolution by moving across the image's receptive fields under the guidance of the shape from the preceding stage. The convolution operation between the input's connected region and the weights is formulated as follows:

$$\mathbb{C} = \sigma\left(\mathbb{W} \ast \Phi_{s} + b\right),$$

where the nonlinear activation function is specified as $\sigma$, the input nodes' weight vector is depicted as $\mathbb{W}$, the bias is denoted as $b$, and the deep feature map acquired from the CL, denoting the relationship between the sketches and the respective images, is symbolized as $\mathbb{C}$.

3.3.3. Pooling Layer

In this layer, downsampling is performed to mitigate the convoluted feature map's size, thus lessening the number of computations needed. By utilizing the max-pooling function, the dimensionality of the input is scaled down by the PL. It is formulated as follows:

$$\mathbb{P} = \max\left(\mathbb{C}_{r}\right),$$

where the max function taken over each pooling region $\mathbb{C}_{r}$ is represented as $\max(\cdot)$ and the pooled feature map is notated as $\mathbb{P}$. A new vector is formed from the feature maps of the CL along with the PL; subsequently, they are flattened to obtain the column vector $V_{c}$.

3.3.4. Fully Connected Layer

The flattened matrix $V_{c}$ is given to the fully connected layer. This layer provides the input to the SoftMax layer (SL). The probabilities of the input being in a specific class are offered by the SL. The SL's output is illustrated as follows:

$$O_{t} = \operatorname{softmax}\left(\mathbb{W}_{f} V_{c} + b_{f}\right),$$

where the layer's weights are denoted as $\mathbb{W}_{f}$, the layer's bias value is defined as $b_{f}$, and the classification output is symbolized as $O_{t}$. (i) Known person and (ii) unknown person are the two classes of outputs encompassed in the SL.
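As a minimal end-to-end sketch of one plausible layout in the spirit of the SCNN, a shallow branch carrying the sketch vector can be fused with the convolution/pooling features before the softmax. All layer sizes, and the choice of 68 two-coordinate landmarks, are illustrative assumptions rather than values specified in this paper:

from tensorflow.keras import Input, Model, layers

image_in = Input(shape=(224, 224, 1), name="normalized_face")
sketch_in = Input(shape=(136,), name="sketch_vector")  # 68 landmarks x 2 (assumed)

# Convolution and pooling branch (Sections 3.3.2 and 3.3.3)
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)

# Shallowest sketch branch (Section 3.3.1), fused with the deep features
s = layers.Dense(64, activation="relu")(sketch_in)
merged = layers.concatenate([x, s])

# Fully connected layer feeding the two-class softmax (Section 3.3.4)
fc = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(2, activation="softmax", name="known_vs_unknown")(fc)

model = Model(inputs=[image_in, sketch_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])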

4. Result and Discussion

The proposed facial image recognition system is experimentally analyzed in this section. The model was executed in MATLAB. In the evaluation process, the proposed system's outcomes are compared with the prevailing DL methodologies to verify the system's efficacy.

4.1. Database Description

To study the problem of unconstrained FR, the LFW database was utilized in the proposed work. The dataset includes more than 13,000 images of faces gathered from the web and comprises a large range of poses, illumination, and expression in face images. It encompasses 5749 identities, of which 1680 people have two or more images. Under the standard LFW evaluation protocol, verification accuracies are reported on face pairs. Every face has been labeled with the pictured person's name. Of the data available in the dataset, 20% are utilized for testing and 80% for training, as shown in Table 1.
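A minimal sketch of the 80/20 split is given below; scikit-learn's splitter and the tiny placeholder arrays are illustrative only, as the paper does not state its splitting tool:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the preprocessed LFW faces and labels
images = np.random.rand(100, 224, 224, 1)
labels = np.random.randint(0, 10, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.20, random_state=42)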

4.2. Performance Analysis

Here, the proposed system is compared with prevailing models, namely, a DNN, an Elman neural network (ENN), an artificial neural network (ANN), and a CNN, regarding AUC, AP, RA, and training time. In addition, the proposed system is compared with the detection rates of several FD algorithms like the Viola–Jones algorithm (VJA), Joint Cascade (JCascade), and aggregate channel features (ACF).
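For reference, the three metrics can be computed with scikit-learn as in the minimal sketch below (the toy labels and scores are illustrative only):

from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]                 # toy ground-truth labels
y_score = [0.9, 0.2, 0.8, 0.6, 0.4]      # toy classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

ra = accuracy_score(y_true, y_pred)            # recognition accuracy (RA)
auc = roc_auc_score(y_true, y_score)           # area under the curve (AUC)
ap = average_precision_score(y_true, y_score)  # average precision (AP)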

The RA of the proposed and prevailing methodologies is exhibited in Table 2. As per the table, an accuracy of 97.14% was attained by the proposed model, whereas the accuracy values attained by the prevailing systems were lower. Thus, it is established that the proposed work attained better performance when compared with the prevailing DL models.

The AUC graph for the proposed and prevailing methodologies is shown in Figure 3. The proposed model obtained an AUC of 94.65%, whereas the conventional DNN, ENN, ANN, and CNN models attained AUC values of 87.56%, 88.27%, 89.92%, and 92.39%, respectively, all lower than that of the proposed model. When compared with the other classifiers, the proposed one achieved the best performance, followed by the CNN, with the DNN last. Thus, it is evident that the proposed model is highly effective for facial image recognition.

The AP of the proposed as well as prevailing models is demonstrated in Figure 4. An AP of 97.29% was attained by the proposed SCNN. Conversely, lower AP values of 90.56%, 91.31%, 93.34%, and 95.83% were attained by the prevailing DNN, ANN, ENN, and CNN systems, correspondingly. The evaluation outcomes proved that the proposed model achieved effectual FR performance even with unconstrained images.

The computation complexity of the proposed and conventional FR methods is assessed regarding training time in Figure 5. A model with a lower training time is considered the better model. The training times taken by the traditional models were 10.13 ms (CNN), 11.24 ms (ANN), 12.04 ms (ENN), and 14.28 ms (DNN), all larger than the 8.42 ms taken by the proposed system. Therefore, it is clear that the proposed system achieved better performance when compared with the prevailing works.

Several FD algorithms are evaluated in terms of detection rate in Table 3. The proposed model attained a higher detection rate than the prevailing methodologies; when compared with the prevailing detection methodologies, the detection rate attained by the proposed model was enhanced by up to 13.16%. The effectual feature selection enhanced the proposed framework's detection efficacy. Thus, from the assessment, it is evident that FD was performed more effectively by the proposed methodology, even for unconstrained images.

The comparison of the proposed model with prevailing CNN-based frameworks regarding recognition accuracy is exhibited in Table 4. The recognition accuracy attained by the proposed SCNN is 97.14%, while the prevailing CNN, MT-CNN, and DWT-CNN attain 86.3%, 96.4%, and 89.56%, correspondingly. On comparing these outcomes, the proposed SCNN achieves higher recognition accuracy than the conventional techniques. Thus, it is concluded that the proposed system is more efficient in facial image recognition.

5. Conclusion

By utilizing the RVJA as well as the SCNN, an effectual FDR system has been proposed here. To detect the face under varied poses, different levels of illumination, and occlusion, the most significant features are selected from the input images by utilizing the RVJA technique. Next, the SCNN was employed for recognition. In the performance evaluation, the proposed SCNN is compared with the prevailing DNN, CNN, ANN, and ENN methodologies regarding performance metrics like RA, AUC, and AP. The evaluation outcomes displayed that the proposed model achieved a higher accuracy of 97.14% than the conventional systems. Therefore, it is concluded that the proposed framework is better and highly effective for facial image recognition. In the future, the work could be enhanced with advanced models to recognize images with imperfect facial data.

Data Availability

The data used to support the findings of this study are available from the first author upon request at any time.

Conflicts of Interest

The authors declare that they have no conflicts of interest.