Abstract

In this paper, we propose a robust and reliable face recognition model that incorporates depth information, such as data from point clouds and depth maps, into RGB image data to avoid false facial verification caused by face spoofing attacks while increasing the model’s performance. The proposed model is driven by the spatially adaptive convolution (SAC) block of SqueezeSegv3; this is the attention block that enables the model to weight features according to their importance at each spatial location. We also utilize large-margin loss instead of softmax loss as a supervision signal for the proposed method, to enforce high discriminatory power. In the experiment, the proposed model, which incorporates depth information, had 99.88% accuracy and an F1 score of 93.45%, outperforming the baseline models, which used RGB data alone.

1. Introduction

LiDAR, short for light detection and ranging, is a remote sensing technology similar to radar. The difference is that radar uses radio waves to detect its surroundings, whereas LiDAR uses laser energy. When a LiDAR sensor directs a laser beam at an object, it can calculate the distance to the object by measuring the delay before the light is reflected back to it, making it possible to extract depth information for an object and display it in the form of a point cloud or depth map. Not only can LiDAR sensors estimate an object’s range, but they can also measure its shape with high accuracy and spatial resolution. Furthermore, LiDAR sensors are robust under various lighting conditions (day or night, with or without glare and shadows), thereby overcoming the disadvantages of other sensor types. Because of these strengths, LiDAR has been widely used in a variety of applications, including autonomous vehicles, river surveys, and pollution modeling. Recently, products launched by technology companies often come equipped with a LiDAR scanner, making it more convenient to obtain depth information for objects in the form of 3D point clouds, as shown in Figure 1.

A face recognition system is a computer-assisted application that automatically determines or verifies an individual’s identity using digital images. In practice, the system verifies the person’s identity by comparing intensity images of the face captured by a camera with prestored images. It can be used for biometric authentication and is emerging as a critical authentication method for information and communications technology (ICT) services. Security-based applications are spreading to various fields; they include employee attendance checks, airport surveillance, and bank transactions. A face recognition system can provide a straightforward yet convenient authentication process, as it can operate using just an RGB image captured from a person’s face. However, this simplicity makes it vulnerable to spoofing attacks [1, 2] because pictures of people’s faces can easily be obtained on social media platforms without their consent, and these can be used by someone with malicious intent to steal a person’s identity. To prevent such face spoofing attacks, we propose a robust face recognition method that uses both RGB images and depth information such as those extracted from point clouds and depth maps produced by a LiDAR scanner.

Face recognition based on RGB images is already widely acknowledged for its promising performance. However, determining whether a face is real or fake, known as liveness detection, cannot be performed at the same time. Distinguishing, in terms of liveness, between RGB images captured directly from a person’s face with a camera and digital images from other sources used for face spoofing attacks remains challenging because both are simply RGB inputs to the recognition system. A point cloud and depth map, however, can be obtained only by capturing a person’s face directly with a sensor such as LiDAR. In addition, depth information is three-dimensional; in other words, spoofing attacks that use 2D digital images are immediately identifiable by their lack of 3D information.

The main feature of the proposed method is a face recognition model that incorporates depth information into RGB images. The method uses a device equipped with a LiDAR sensor to collect the supplementary data. Because the method utilizes point cloud and depth data, it solves the liveness detection problem of the existing 2D face recognition method. We also hypothesize that a deep learning framework using depth information can demonstrate higher performance on the classification model for face recognition systems.

According to the developers of the SqueezeSegv3 model [3], point cloud data present strong spatial priors, and their feature distributions vary according to spatial location. Thus, we built an attention-based deep convolutional model based on SqueezeSegv3, called SqueezeFace. Its architecture is shown in Figure 2.

Based on previous studies [4–9], we additionally adopted large-margin loss as a supervision signal that enables the model to learn highly discriminative deep features for face recognition by maximizing interclass variance and minimizing intraclass variance during the training phase. In the test phase, facial embedding features are extracted using our proposed convolutional network for face verification. The method can then verify an identity by calculating the cosine similarity between embedding features. The proposed method delivers performance superior to that of existing methods that use only RGB images. The remainder of this paper is organized as follows. In Section 2, related work is reviewed. The structure of the proposed method is described in detail in Section 3. The experimental results are discussed in Section 4. Finally, we conclude the paper in Section 5.
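Concretely, the verification step amounts to a thresholded cosine-similarity comparison between embedding vectors. The following minimal sketch is our own illustration, not the authors’ released code; the function names and the threshold value are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_probe: np.ndarray, emb_enrolled: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the identity claim when the similarity exceeds a tuned threshold.
    The value 0.5 is a placeholder, not the threshold used in the paper."""
    return cosine_similarity(emb_probe, emb_enrolled) >= threshold
```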

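2. Related Work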
Convolutional neural networks (CNNs) are powerful models that play an essential role in learning feature representations that best describe the given domain while maintaining the spatial information of an image. Because of their excellence in learning important patterns, CNNs have achieved breakthroughs on a variety of computer vision tasks such as image classification, object detection, and semantic segmentation [10–16].

Attention-based CNNs in particular have attracted considerable interest and have been extensively exploited to improve a model’s performance on numerous computer vision tasks by integrating attention modules with the existing CNN architecture [3, 17–20]. The attention module allows the model to selectively emphasize important features and discard less informative ones. Hu et al. [17] proposed the Squeeze-and-Excitation (SE) block, which learns the relationship between the channels of its convolutional features and adaptively recalibrates channel weights according to the relationship learned. Specifically, the SE block extracts a representative scalar value for each channel using global average pooling (GAP) and assigns a weight to each channel based on the interdependency between channels through the excitation process. Park et al. [19] introduced the simple yet efficient Bottleneck Attention Module (BAM), which generates attention maps by separating the process of inferring an attention map into a channel attention module and a spatial attention module and configures them in parallel. Woo et al. [20] presented the lightweight Convolutional Block Attention Module (CBAM), which sequentially applies channel and spatial attention modules to emphasize important elements in both the channel and spatial axes.
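To make the channel-attention idea concrete, the following is a minimal PyTorch sketch of an SE block in the spirit of [17]; the reduction ratio and layer sizes are illustrative choices, not values taken from the cited papers.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP per channel, then an excitation MLP
    that produces a weight in (0, 1) for each channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                  # (B, C) channel descriptors
        w = self.excite(w).view(b, c, 1, 1)             # per-channel weights
        return x * w                                    # recalibrate the feature map
```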

Exploiting face representation embedding features extracted using a deep CNN is one of several methods used in face recognition tasks [9, 21–24]. Face recognition using a deep CNN involves two essential preprocessing steps: face detection and face alignment. These two tasks should be performed jointly because they are inherently correlated [25]. Softmax loss [26] is commonly used as a loss function to supervise the face recognition model and was used in DeepID [21] and DeepFace [22]. However, recent studies have indicated that softmax loss is not suitable for face recognition tasks owing to its inability to optimize the feature embedding to enforce strong similarity within positive class samples and diversity across negative class samples, which can deteriorate model performance on face recognition. Suggested alternatives include functions based on Euclidean distance, such as contrastive loss, triplet loss, and center loss, which alleviate these limitations while strengthening discriminative features.

Contrastive loss was proposed as the loss function in DeepID2 [21] and DeepID3 [27]. Generally, this loss requires pairs of inputs and adjusts the distance between embedding features differently depending on whether the pair belongs to the positive class (an intraclass pair) or the negative class (an interclass pair). To improve the learning efficiency over contrastive loss, triplet loss was proposed in FaceNet [23]. Unlike contrastive loss, triplet loss requires three inputs, two of which belong to the same class while the third belongs to a different class. This loss function reduces the distance between the intraclass pair and increases the distance between the interclass pairs. Despite being used in many metric learning methods because of its excellent performance, triplet loss requires an expensive preprocessing step to construct the input data for the distance comparison. Thus, center loss was proposed, which learns the centroid of the features of each class and penalizes the distances between the centroids and their corresponding class features. This loss not only avoids the complicated input-data preprocessing step but also boosts performance.
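As a brief illustration of the metric-learning losses discussed above, the sketch below computes a triplet loss on anchor, positive, and negative embeddings using PyTorch’s built-in `TripletMarginLoss`; the embedding size, batch size, and margin value are arbitrary examples, not settings from the cited works.

```python
import torch
import torch.nn as nn

# Embeddings for anchor, positive (same identity), and negative (different identity).
anchor   = torch.randn(8, 512)
positive = torch.randn(8, 512)
negative = torch.randn(8, 512)

# Pull anchor-positive together and push anchor-negative apart by at least `margin`.
triplet_loss = nn.TripletMarginLoss(margin=0.2)
loss = triplet_loss(anchor, positive, negative)
```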

In addition to the losses described above, there exists a series of losses that incorporate a large angular margin to strengthen discriminatory power on classification, decrease the distance between features within the same class, and increase the distance between features from different classes [79]. We discuss these losses in detail in Section 3.

Traditional face recognition methods utilize only RGB data as the input. Such methods perform relatively well, but they present a disadvantage with regard to liveness in that the model cannot distinguish whether an image has been captured directly from a person’s face or is a digital image obtained from other sources. This characteristic makes such methods vulnerable to face spoofing attacks. Recent studies have sought to mitigate this problem by adding depth information in the form of point cloud and depth data as inputs. Fuseseg [28], Fusenet [29], and Chinet [30] have been proposed for boosting model performance by effectively fusing such data collected from various sensors. Each model fuses the data in a different way, and the resulting embedding features are fused at the layer level.

3. Proposed Method

In this section, we describe the proposed face recognition method, which uses not only RGB images but also depth and point cloud data (3D coordinates) extracted from LiDAR sensors. We constructed the proposed model with a data integration network that serially processes data from different sensors. Because it is imperative to emphasize the features that most influence the model’s performance, an attention mechanism was adopted to allow the model to capture and best exploit important features from the point cloud. As the operational technique, we incorporated the spatially adaptive convolution (SAC) block of SqueezeSegv3 into the data integration network to process our data and extract features from them.

In addition, we replaced softmax loss with large-margin loss for supervising the feature embedding process to increase similarity within the same class and discrepancy between different classes. We discuss in detail the construction of the proposed data integration network and the large-margin loss function in Sections 3.1 and 3.2, respectively.

3.1. SqueezeSegv3

Most face recognition models are based on deep convolutional neural networks (DCNNs) to obtain discriminatory power for classification. Facial feature representations can be extracted with standard convolution as

\[
Y[m,u,v] = \sigma\!\left(\sum_{k,i,j} W[m,k,i,j] \times X[k,\, u+\hat{i},\, v+\hat{j}]\right),
\]

where $Y \in \mathbb{R}^{O \times H \times W}$ and $X \in \mathbb{R}^{I \times H \times W}$ are the output and input tensors; $W \in \mathbb{R}^{O \times I \times K \times K}$ is the convolutional weight matrix, in which $K$ is the convolutional kernel size; $O$ and $I$ are the output and input channel sizes; $H \times W$ represents the image size; and $\sigma$ is a nonlinear activation function such as ReLU [31]. In this method, $\hat{i}$ and $\hat{j}$ are defined as $i - \lfloor K/2 \rfloor$ and $j - \lfloor K/2 \rfloor$. As mentioned with regard to the SqueezeSegv3 model [3], standard convolution is based on the assumption that the distribution of visual features is invariant to the spatial location of the image. This assumption is largely true in the case of RGB images; thus, a convolution uses the same weight for all input locations. However, this assumption cannot be applied to point cloud data: $xyz$-coordinate point cloud data present very strong spatial priors, and the feature distribution of the point cloud varies substantially at different locations. In consideration of this fact, the SAC block, which is designed to be spatially adaptive and content aware using the 3D coordinates of a point cloud, is proposed to apply different weights for different image locations as follows:

\[
Y[m,u,v] = \sigma\!\left(\sum_{k,i,j} W(X_0)[m,k,i,j,u,v] \times X[k,\, u+\hat{i},\, v+\hat{j}]\right).
\]

In SqueezeSegv3 [3], $W(X_0)$ is a spatially adaptive function of the raw input $X_0$, which depends on the location $(u, v)$. In this method, $X_0$ is only the raw input point cloud. $W(X_0)$, the spatially adaptive function of SqueezeFace, is shown in detail in the lower part of Figure 2.

To process our data, which are gathered from different sources, an appropriate data fusion model is required. Seven input channels are constructed by stacking the RGB, depth, and point cloud data, which are collected from different sensors and possess different characteristics. To obtain the attention map, the point cloud data are fed into a convolution followed by a sigmoid function. Next, this attention map is combined with the input tensor. Then, a standard convolution with weight $W$ is applied to the adapted input. For the embedding network, we employ the well-known ResNet34 architecture [32]. The ResNet model reduces the image size as it passes through each layer, and because of the small size of our dataset, this downsampling makes it difficult to properly exploit the spatial coordinate information of the point cloud. Therefore, the SAC block is used at the initial layer, as shown in Figure 2. The network successfully maps the face input to face representation embedding features, combining the three types of data.
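The fusion step can be sketched as follows. This is our own minimal PyTorch illustration of an SAC-style block, assuming a 7-channel fused input (RGB, depth, xyz) and a 7x7 attention convolution; the channel sizes and kernel sizes are assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class SACBlock(nn.Module):
    """Spatially adaptive convolution in the spirit of SqueezeSegv3's SAC:
    an attention map computed from the raw point cloud reweights the fused
    input before a standard convolution is applied."""
    def __init__(self, in_channels: int = 7, out_channels: int = 64):
        super().__init__()
        # Attention branch: point-cloud coordinates (3 channels) -> spatial weights.
        self.attention = nn.Sequential(
            nn.Conv2d(3, in_channels, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Standard convolution applied to the spatially reweighted input.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, fused: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # fused: (B, 7, H, W) = RGB (3) + depth (1) + point cloud xyz (3)
        # xyz:   (B, 3, H, W) = raw point-cloud coordinates
        attn = self.attention(xyz)        # spatially varying weights
        return self.conv(fused * attn)    # adapted input -> standard convolution


# Usage: the SAC output would then feed a ResNet34-style embedding network.
x_fused = torch.cat([torch.randn(1, 3, 112, 112),   # RGB
                     torch.randn(1, 1, 112, 112),   # depth
                     torch.randn(1, 3, 112, 112)],  # xyz
                    dim=1)
features = SACBlock()(x_fused, x_fused[:, 4:7])
```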

3.2. Large-Margin Loss

The face recognition task is a multiclass classification problem, defined as classifying images into one of a certain number of classes. The most commonly used loss for multiclass classification is softmax loss, which is a softmax activation function followed by cross-entropy loss [33]. The softmax activation function outputs the probability of each class, whose sum is one, and the cross-entropy loss is the sum of the negative logarithms of these probabilities, defined as

\[
L_{\text{softmax}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j} e^{W_j^{T} x_i + b_j}},
\]

where $N$ is the number of samples, $x_i$ is the feature vector of the $i$-th sample, $y_i$ represents the ground-truth class corresponding to $x_i$, and $W_j$ and $b_j$ are the weight and bias terms, respectively. Despite being widely used, softmax loss has some limitations, as it does not strictly enforce higher similarity within the same class and discrepancy between different classes. Thus, traditional softmax loss may leave a performance gap for face recognition when intraclass variation is high because of factors such as age gaps, differences in facial expression, and variations in pose (left, right, or frontal). To enable the model to circumvent this problem, A-Softmax loss was proposed as a reformulation of the traditional softmax loss in SphereFace [5] as follows:

\[
L_{\text{A-Softmax}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\|x_i\|\cos(m\theta_{y_i,i})}}{e^{\|x_i\|\cos(m\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos\theta_{j,i}}},
\]

where $m$ is the angular margin and $\theta_{j,i}$ is the angle between the vectors $W_j$ and $x_i$. A-Softmax loss adopts the margin in a multiplicative angular form, expressed as $\cos(m\theta_{y_i,i})$. This loss enables metric learning by constraining the classification weight’s norm to 1 through normalization, setting the bias to 0, and incorporating the angular margin adjusted via the parameter $m$ to capture discriminative features with a clear geometric interpretation.

Then, the CosFace model [8] was proposed, which includes a large-margin cosine loss function that normalizes both weights and features by L2 normalization to eliminate radial variations and adds a fixed parameter $m$ used to control the magnitude of the cosine margin. The overall loss function can be expressed as

\[
L_{\text{CosFace}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(\cos\theta_{y_i,i} - m)}}{e^{s(\cos\theta_{y_i,i} - m)} + \sum_{j \neq y_i} e^{s\cos\theta_{j,i}}},
\]

where $s$ is a rescale parameter used by the loss function to rescale the weights and features after normalizing them.

ArcFace [9] adds an additive angular margin penalty between the weights and features. This penalty is equal to the geodesic distance margin penalty on the normalized hypersphere, hence the name ArcFace. The loss function is formulated as follows:

\[
L_{\text{ArcFace}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i,i} + m)}}{e^{s\cos(\theta_{y_i,i} + m)} + \sum_{j \neq y_i} e^{s\cos\theta_{j,i}}}.
\]

Thus, we can supervise our model using an additive angular margin loss that combines the margin penalties of SphereFace [5], CosFace [8], and ArcFace [9], which demonstrates the best performance, as follows:

\[
L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(\cos(m_1\theta_{y_i,i} + m_2) - m_3)}}{e^{s(\cos(m_1\theta_{y_i,i} + m_2) - m_3)} + \sum_{j \neq y_i} e^{s\cos\theta_{j,i}}},
\]

where $m_1$, $m_2$, and $m_3$ are the angular margin parameters, each represented as $m$ in the loss functions described above. Our main task is to identify a class for each input identity. By adopting the proposed additive angular margin loss, the proposed model can increase the similarity of positive classes and enforce a wide diversity of negative classes in metric learning. The proposed large-margin loss can generate high-quality embedding features from our data, enabling high-accuracy classification with both the training dataset and the unseen test dataset.
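To make the combined-margin formulation concrete, here is a minimal PyTorch sketch of a loss that applies $s(\cos(m_1\theta + m_2) - m_3)$ to the ground-truth logit. It is our own illustration of the general technique; the default scale and margin values are placeholders, not the paper’s tuned settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedMarginLoss(nn.Module):
    """Softmax cross-entropy on s * (cos(m1*theta + m2) - m3) for the target
    class, combining SphereFace (m1), ArcFace (m2), and CosFace (m3) margins.
    Margin and scale values below are illustrative defaults."""
    def __init__(self, num_classes: int, emb_dim: int = 512,
                 s: float = 64.0, m1: float = 1.0, m2: float = 0.3, m3: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.s, self.m1, self.m2, self.m3 = s, m1, m2, m3

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Normalize features and class weights so logits become cosines of angles.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Apply the combined margin only to the ground-truth class logit.
        target_logit = torch.cos(self.m1 * theta + self.m2) - self.m3
        one_hot = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(one_hot, target_logit, cos) * self.s
        return F.cross_entropy(logits, labels)


# Usage with random embeddings and labels for 83 identities (illustrative only).
loss = CombinedMarginLoss(num_classes=83)(torch.randn(8, 512), torch.randint(0, 83, (8,)))
```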

4. Numerical Experiments

4.1. Datasets

The face dataset consisted of 784 face scans from 83 Korean individuals. The face data were captured using Apple’s latest device equipped with a LiDAR scanner. Specifically, the device was equipped with three cameras (main, wide, and telephoto) and a LiDAR scanner for capturing both RGB images and depth information. ARKit can be used to connect to the scanner on the Apple device and process the depth and point cloud (3D coordinate) data. ARKit recently introduced a new depth API, available only for devices equipped with a LiDAR scanner, which provides several methods to access depth information collected from the scanner. Through this API, per-pixel depth information of a person’s face can be obtained, and the 3D coordinates of the point cloud can be generated by setting the appropriate device parameters. We modified ARKit’s sample code and set up the application to simultaneously store RGB and point cloud data within one scene. We installed this modified app on the device and collected data through the app.

4.2. Experiment Setup

We trained three different models to compare their performance. The first model used only RGB data. The second model used three types of sensor data (RGB, depth, and point cloud) with three different characteristics, and the third model was the SqueezeFace model, which uses the SAC block on the three types of sensor data. All three models used the ResNet34 architecture [32] and large-margin loss [6]. The ResNet34 model was pretrained using a facial image dataset of 400 Korean individuals provided by AI Hub (https://aihub.or.kr/). For the models using the three types of sensor data, the pretrained ResNet34 weights were used for the RGB channels, and the weights for the point cloud and depth channels were initialized with the Xavier initializer.
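The initialization scheme described above can be sketched as follows: the pretrained 3-channel RGB filters of the first convolution are copied into a 7-channel stem, and the remaining depth and point-cloud channels are Xavier-initialized. This is a minimal illustration assuming a torchvision ResNet34 backbone; the actual pretrained weights are not loaded here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34(num_classes=512)   # embedding head; in practice the weights
                                    # pretrained on the Korean face dataset would be loaded here

# Replace the 3-channel stem with a 7-channel one (RGB + depth + xyz).
old_conv = model.conv1
new_conv = nn.Conv2d(7, 64, kernel_size=7, stride=2, padding=3, bias=False)

with torch.no_grad():
    nn.init.xavier_uniform_(new_conv.weight)    # Xavier init for all channels first
    new_conv.weight[:, :3] = old_conv.weight    # then reuse the pretrained RGB filters
model.conv1 = new_conv
```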

4.3. Experiment Results

We split our face dataset into a training set and a test set, and the sensor data were configured as three types (RGB, depth, and point cloud). In addition, to evaluate face verification performance, we constructed a face verification dataset with pairs of face images from the test set. Accuracy, precision, and recall were used as metrics to measure the model’s performance for face verification. Accuracy is the ratio of the number of correct predictions to the total number of inputs. Precision is the ratio of the number of true positive predictions to the total number of the model’s predicted positive values, and recall is the ratio of the number of true positive predictions to the number of all positive samples. These three definitions are represented as

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN},
\]

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. For the face verification dataset, the number of interclass combinations was much greater than the number of intraclass combinations. Because the intraclass and interclass counts were considerably imbalanced, the F1 score, the harmonic mean of precision and recall, was used as the evaluation metric for face verification:

\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]
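For reference, these metrics follow directly from the confusion-matrix counts, as in the small sketch below (our own illustration, with a hypothetical function name).

```python
def verification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 computed from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```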

4.3.1. Analysis of Face Verification Results of the Proposed Method

According to the experimental results shown in Table 1, the model using the three types of sensor data outperformed the model using only RGB data, demonstrating that employing depth information can enrich the facial representation. More importantly, the proposed SqueezeFace model, with the added SAC attention block, achieved the best accuracy and F1 score. This result shows that the proposed model effectively learned the facial points of high importance by actively utilizing the point cloud data, whose distribution varies according to spatial location. Intraclass variance due to pose variations and age gaps significantly increases the angle between positive pairs and therefore can increase the best threshold for face verification on test data. However, if the training data for each identity are limited, making the intraclass variance small, it is difficult to increase the best threshold for face verification on test data. A low threshold in the evaluation of face verification indicates low reliability of the model. The proposed model addresses this problem by adding point cloud and depth data to the RGB data.

The face verification performance on three-shot learning is compared in Table 2. Three-shot learning is learning performed with only three training samples. The best threshold is the threshold that yields the maximum F1 score. The model using the three types of sensor data shows higher accuracy, a higher F1 score, and a higher best threshold than the RGB-images-only model. This demonstrates that by making use of supplementary information such as point cloud and depth data, the proposed model can increase intraclass variance and, as a result, increase the best threshold for face verification.

4.3.2. Analysis of Cosine Similarity on Three-Shot Learning of the Proposed Method

We examined the cosine similarity for various facial expressions under three-shot learning, with the results shown in Table 3. The proposed model produced higher similarity values between positive pairs than the RGB-images-only model, even across a variety of facial expressions. Because the proposed method uses more facial information by adding depth and point cloud data, it can maintain high cosine similarity between positive pairs even when intraclass variance increases the angle between them, and the higher cosine similarity in turn allows a higher best threshold for face verification. This result demonstrates that adding depth and point cloud data enables the model to learn important facial features for face verification more effectively than the model using only RGB data. In addition, despite differences within the same identity caused by pose variations, the proposed method can distinguish identities well in the test data by adding depth and point cloud data.

5. Conclusion

This paper has proposed a face recognition approach that considers depth information using point cloud data. By using depth information, false facial verification using a face photo or video of an authorized person can be avoided, thereby increasing the reliability of the face recognition system. The method incorporates the SAC block based on the attention mechanism to capture important features and weight them to enhance model performance. In addition, we used a modified loss function constructed by adding a large margin to reinforce high discriminatory power for face recognition applications [34]. The proposed method delivers a considerable performance improvement over the baseline models and uses a higher threshold for face verification when subjected to an increase in intraclass variance.

Data Availability

All source code is available online at https://github.com/kyoungmingo/Fusion_face (author’s webpage).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2020R1C1C1A01005229 and NRF-2021R1A4A5032622).