Abstract

Accurate user authentication via biometric systems is becoming essential in modern real-world applications. Authentication systems based on biometric identifiers such as faces and fingerprints are being adopted in a variety of fields in preference to existing password input methods. The face image is the most widely used biometric identifier because the registration and authentication processes are noncontact and concise. However, face images are comparatively easy to acquire from social networking services (SNS) and similar sources, making such systems vulnerable to forgery via photos and videos. To solve this problem, much research on face spoofing detection has been conducted. In this paper, we propose a method for face spoofing detection based on convolutional neural networks that uses the color and texture information of face images. The combined color-texture information from the luminance and chrominance channels is analyzed using a local binary pattern (LBP) descriptor, applied to the Cb, S, and V bands of the color spaces. The CASIA-FASD dataset was used to verify the proposed scheme, which showed better performance than state-of-the-art methods from previous studies. Using an AI FPGA board, the performance of existing methods was evaluated and compared with that of the proposed method. Based on these results, it was confirmed that the proposed method can be effectively implemented in edge environments.

1. Introduction

Recently, authentication systems based on biometric information have been applied to various mobile devices such as smartphones, and many users perform identity authentication using facial or fingerprint information instead of the existing password input methods. In addition, biometric authentication is being applied to bank transactions and mobile payment applications. As a result, researchers are greatly interested in developing high-performance authentication systems.

Among user biometric information, the face image is the most widely used biometric identifier because the associated registration and authentication processes are noncontact and concise. However, face images are very easy to acquire from social networks and other sources and are vulnerable to various spoofing techniques, including printed photos and video replay. To solve this problem, research utilizing software solutions has become more popular than antispoofing hardware solutions that require additional sensors. These software approaches can be classified into motion-based methods and texture-based methods [1].

Motion-based counterfeit face detection measures eye/head movement, eye blinking, and changes in facial expression [2, 3]. Counterfeit face detection methods utilizing the eyes rely on the fact that a still face, such as one in a photograph, does not exhibit eye blinking or pupil movement, whereas a real human face exhibits relatively large amounts of movement over time. This approach is very simple and fast. However, because it classifies a spoofing face using only eye movement, it cannot defend against simple attack variations that focus on and accurately emulate the eye area based on a photo.

Texture-based spoofing face detection mainly uses lighting characteristics that differ between 2D planar and 3D stereoscopic objects, or exploits the fine texture differences that an external medium such as printing introduces between spoofing face data and live face data [4-8]. This approach mainly uses a local image descriptor such as the local binary pattern (LBP) [9] to express the differences in texture characteristics between live and spoofing face images. Such texture-based methods have been actively researched because they are easy to implement and have short detection times; however, they have difficulty classifying live faces in nonuniform images or images with large amounts of noise. Recently, researchers have been working on the detection of spoofing faces using convolutional neural networks (CNNs) [10, 11]. Since CNNs can effectively derive features through learning, their performance improves on existing texture-based detection methods.

Although the field of spoofing face detection has developed tremendously, existing methods mainly focus on the brightness information of face images. More specifically, color information beyond brightness is often overlooked in spoofing face detection. Therefore, considering both the color and brightness information of face images, a method was proposed that independently extracts texture features from the brightness space and color space of the face image using an LBP [12].

The difference between a real face and a spoofing face is discriminated using a descriptor (such as an LBP) that encodes, at every pixel location, the comparison results with respect to surrounding pixel values as a binary pattern. However, since high-resolution spoofing images can now be produced, it is very difficult to distinguish the detailed surface differences between real and spoofing faces using pixel brightness alone.

In this paper, we propose a face liveness detection method based on a convolutional neural network that utilizes the color and texture information of a face image. The proposed method analyzes the combined color-texture information of the luminance and chrominance channels using an LBP descriptor. For color-texture analysis, the Cb, S, and V bands are used from the color spaces.

The rest of the paper is organized as follows. In Section 2, the related key technologies are illustrated. The proposed scheme for our color-texture-based antispoofing is presented in Section 3. Section 4 thoroughly presents the results and discussion. Finally, conclusions are presented in Section 5.

2. Related Work

2.1. Face Antispoofing

Conventional face antispoofing methods generally characterize spoofing patterns by extracting features from face images. Classic local descriptors such as LBP [13], SIFT [14], SURF [15], HOG [16], and DoG [17] are used to extract frame-level features, while methods such as dynamic texture [18], micromotion [19], and eye blinking [20] extract video-level features.

Recently, several deep learning-based methods have been studied to prevent face spoofing at the frame and video levels. In frame-level methods [21-24], a pretrained CNN model is fine-tuned to extract features in a binary classification setup [25-27].

2.2. Color Spaces

RGB is the color space most commonly used for sensing and displaying color images. However, its use in image analysis is typically limited because the three channels (red, green, and blue) do not separate luminance from chrominance information. Thus, it is common to convert RGB images into the YCbCr and HSV color spaces before use. These two color spaces are based on luminance and chrominance information [28-31]. In particular, the YCbCr color space separates RGB into luminance (Y), blue-difference chrominance (Cb), and red-difference chrominance (Cr). Similarly, the HSV color space uses the hue and saturation dimensions to define the color information of the image, while the value dimension corresponds to the luminance.
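In practice, these conversions are usually performed with a library call. The following minimal Python sketch (our own illustration, not from the paper) uses OpenCV; note that OpenCV reads images in BGR order and orders the converted channels as Y, Cr, Cb:

```python
# Minimal sketch (not from the paper): converting a face crop into the
# YCbCr and HSV color spaces with OpenCV. The input file name is hypothetical.
import cv2

face_bgr = cv2.imread("face_crop.png")        # OpenCV loads images as BGR

ycrcb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2YCrCb)
hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)

y, cr, cb = cv2.split(ycrcb)                  # luminance and chrominance
h, s, v = cv2.split(hsv)                      # hue, saturation, value
```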

2.3. LBP (Local Binary Pattern)

LBPs [32, 33] are a feature originally developed for classifying image textures and have since been used for face recognition. The LBP is a simple operator used for image analysis and recognition and is discriminative and robust to illumination changes. Equation (1) gives the LBP operator:

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^{p}, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \tag{1}$$

Here, $g_p$ $(p = 0, \ldots, P-1)$ ranges over the neighboring pixel values excluding the center pixel, and $g_c$ is the center pixel value in equation (1). In Figure 1, $P$ is the number of adjacent pixels and $R$ is the radius of the circle. Figure 2 shows an example result of the LBP operation applied to a real photo [34].
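To make the operator concrete, here is a minimal NumPy sketch of the basic 3x3 LBP of equation (1) (our own illustration; it assumes a grayscale input with P = 8 neighbors at R = 1):

```python
# Minimal sketch (our own illustration): the basic 3x3 LBP of equation (1),
# with P = 8 neighbors at radius R = 1, applied to a grayscale image.
import numpy as np

def lbp_3x3(gray: np.ndarray) -> np.ndarray:
    """Compute the basic 8-neighbor LBP code for each interior pixel."""
    g = gray.astype(np.int32)
    center = g[1:-1, 1:-1]
    # Neighbor offsets in a fixed clockwise order starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy : g.shape[0] - 1 + dy, 1 + dx : g.shape[1] - 1 + dx]
        codes += ((neighbor - center) >= 0).astype(np.int32) << p  # s(g_p - g_c) * 2^p
    return codes.astype(np.uint8)
```

For circular neighborhoods with arbitrary (P, R), a library routine such as skimage.feature.local_binary_pattern can be used instead.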

3. Proposed Scheme for Color-Texture-Based Antispoofing

The RGB color space contains three color components: red, green, and blue. The YCbCr color space contains luminance and chrominance information, and the HSV color space contains three components: hue, saturation, and brightness (value). Each color space contains different information and has its own characteristics. RGB contains rich spatial information and most closely resembles the colors perceived by humans, while the YCbCr and HSV color spaces separate out the information that is more sensitive to brightness. The RGB color space can be converted into HSV and YCbCr; with $R$, $G$, and $B$ normalized to $[0, 1]$ and $\Delta = \max(R,G,B) - \min(R,G,B)$, the HSV conversion is

$$V = \max(R, G, B), \qquad S = \begin{cases} \Delta / V, & V \neq 0 \\ 0, & V = 0 \end{cases}, \qquad H = \begin{cases} 60 \times \dfrac{G - B}{\Delta}, & V = R \\ 60 \times \left(2 + \dfrac{B - R}{\Delta}\right), & V = G \\ 60 \times \left(4 + \dfrac{R - G}{\Delta}\right), & V = B \end{cases}$$

The YCbCr calculation formula (ITU-R BT.601, for 8-bit samples) is shown as

$$Y = 0.299R + 0.587G + 0.114B, \qquad Cb = 0.564(B - Y) + 128, \qquad Cr = 0.713(R - Y) + 128$$
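As an illustration only (not code from the paper), the following Python sketch implements the BT.601 conversion above with NumPy:

```python
# Minimal sketch (our own illustration): the BT.601 RGB -> YCbCr conversion
# written out per the formulas above, for an 8-bit RGB image.
import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB image to YCbCr (same shape/dtype)."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y) + 128.0   # blue-difference chrominance
    cr = 0.713 * (r - y) + 128.0   # red-difference chrominance
    return np.clip(np.stack([y, cb, cr], axis=-1), 0, 255).astype(np.uint8)
```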

In existing methods, RGB face images are converted into the YCbCr and HSV color spaces, and spoofing images are classified by applying an LBP to each color space. However, this approach increases the amount of computation because it uses a 6-channel color space. Figure 3 shows a conceptual diagram of the existing methods.

In this paper, we use a 3-channel color space consisting of Cb, S, and V, from which many facial features can be derived. The proposed method aims toward high-speed processing and robustness against lighting changes in face antispoofing. Figure 4 shows a conceptual diagram of the proposed scheme.

The advantages of this approach are summarized as follows:

(1) The proposed scheme reduces false detections by using a 3-channel color space in which sufficient facial feature information is expressed.
(2) The proposed scheme uses less memory and fewer feature dimensions, thus enabling high-speed processing.
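As an illustration of the proposed 3-channel input, the sketch below assembles the Cb, S, and V channels and applies the LBP to each; the face-crop source, the working resolution, and the 3x3 LBP parameters are our assumptions, since the paper does not fix these details here:

```python
# Minimal sketch (assumptions noted above): assembling the proposed 3-channel
# Cb/S/V LBP input from a BGR face crop, reusing lbp_3x3() from the sketch in
# Section 2.3.
import cv2
import numpy as np

def make_cb_s_v_lbp_input(face_bgr: np.ndarray) -> np.ndarray:
    # Resize to 229x229 so the 3x3 LBP (which trims a 1-pixel border)
    # yields 227x227 maps, matching AlexNet's assumed input size.
    face_bgr = cv2.resize(face_bgr, (229, 229))
    cb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2YCrCb)[..., 2]  # Cb channel
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    s, v = hsv[..., 1], hsv[..., 2]                           # S and V channels
    # Apply the LBP operator to each selected channel and stack the code maps.
    return np.stack([lbp_3x3(cb), lbp_3x3(s), lbp_3x3(v)], axis=-1)
```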

4. Performance Evaluation

4.1. Train/Test Dataset

In this paper, we performed spoofing face detection experiments using the CASIA Face Antispoofing Database (CASIA-FASD) [35] for performance evaluation. CASIA-FASD consists of real face videos and fake face videos acquired from 50 different subjects. The real face videos come in three quality levels: low, medium, and high. Similarly, the fake face videos consist of three types of attack videos: printed photo attacks, cut photo attacks, and video replay attacks. Videos from 20 subjects are used for training, while the videos from the remaining 30 subjects are used for performance evaluation.

We extracted individual frames from the CASIA-FASD videos for performance evaluation. In total, 4,577 live face images, 5,054 printed photo attack images, 2,368 cut photo attack images, and 4,429 video replay attack images were used for training. In addition, 5,912 live face images, 7,450 printed photo attack images, 4,437 cut photo attack images, and 5,652 video replay attack images were used for evaluation. Table 1 shows detailed information on the data partitioning of CASIA-FASD.

4.2. Experimental Setup

In this paper, we used an FPGA for performance evaluation, evaluating the proposed scheme with the FPGA's AI accelerator. The specifications of the FPGA and the implemented board are shown in Figure 5.

Zynq® UltraScale+™ MPSoC devices provide 64-bit processor scalability while combining real-time control with soft and hard engines for graphics, video, waveform, and packet processing. Built on a common real-time processor and programmable logic-equipped platform, three distinct variants (dual application processor (CG) devices, quad application processor and GPU (EG) devices, and video codec (EV) devices) are included, creating numerous possibilities for various applications such as 5G wireless, next-generation ADAS, and industrial internet-of-things technologies [36].

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards. It consists of optimized IP, tools, libraries, models, and example designs. Vitis AI is designed with high efficiency and ease of use in mind, leading to great potential for AI acceleration on Xilinx FPGA and ACAP devices [37].

For face antispoofing detection, we use AlexNet, a CNN-based model built from convolutional layers, pooling layers, and fully connected layers [38].

AlexNet consists of five convolutional layers and three fully connected (FC) layers, where the last FC layer uses softmax as the activation function for category classification. Figure 6 shows AlexNet's architecture.
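For illustration (not the authors' exact training setup), a two-class AlexNet can be instantiated in PyTorch as follows; the optimizer and hyperparameters are assumptions:

```python
# Minimal sketch (assumptions noted above): a two-class AlexNet for
# live-vs-spoof classification, fed with the 3-channel Cb/S/V LBP input.
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(num_classes=2)   # randomly initialized, 2 outputs
criterion = nn.CrossEntropyLoss()       # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

batch = torch.randn(8, 3, 227, 227)     # stand-in for a Cb/S/V LBP batch
labels = torch.randint(0, 2, (8,))      # 0 = spoof, 1 = live (assumed)
loss = criterion(model(batch), labels)  # one illustrative training step
loss.backward()
optimizer.step()
```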

4.3. Experimental Analysis Method

To evaluate the proposed scheme, we measured the HTER (half total error rate) on the CASIA-FASD dataset. The HTER is calculated from the false acceptance rate (FAR) and false rejection rate (FRR) on the attack dataset, both of which are defined below. The HTER is given as follows [39]:

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2}$$

The FAR [40] measures how likely the biometric security system is to incorrectly accept an access attempt by an unauthorized user. A system's FAR is typically defined as the number of false acceptances divided by the number of identification attempts.

The FRR [41] measures how likely the biometric security system is to incorrectly reject an access attempt by an authorized user. A system's FRR is typically defined as the number of false rejections divided by the number of identification attempts.

Smaller HTER values indicate better performance, since the HTER is defined using only misclassification ratios. Additionally, the EER (equal error rate) is the rate at which the FRR and FAR values coincide, where a smaller value also indicates better performance.
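A minimal sketch of these metrics, assuming binary decisions with label 1 for live faces and 0 for attacks (our own illustration):

```python
# Minimal sketch (our own illustration): FAR, FRR, and HTER computed from
# binary decisions, where label 1 = live (authorized) and 0 = spoof (attack).
import numpy as np

def far_frr_hter(labels: np.ndarray, decisions: np.ndarray):
    attacks, lives = (labels == 0), (labels == 1)
    far = np.mean(decisions[attacks] == 1)  # attacks wrongly accepted
    frr = np.mean(decisions[lives] == 0)    # live faces wrongly rejected
    hter = (far + frr) / 2.0                # half total error rate
    return far, frr, hter
```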

The EER [42] is the operating point used to set the decision threshold of a biometric security system: the threshold is chosen so that the FAR and FRR are equal, and this common value is referred to as the equal error rate. The lower the EER, the better the accuracy of the biometric system.

The ROC (receiver operating characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

The AUC (area under the curve) is the area under the ROC curve. A higher AUC value means that the classification model performs better.
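A minimal sketch (our own illustration) that derives the ROC curve, AUC, and an EER estimate from continuous liveness scores using scikit-learn:

```python
# Minimal sketch (our own illustration): ROC, AUC, and an EER estimate from
# continuous liveness scores, using scikit-learn's roc_curve/auc.
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_auc_eer(labels: np.ndarray, scores: np.ndarray):
    """labels: 1 = live, 0 = spoof; scores: higher means more likely live."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # EER: the operating point where FAR (= fpr) equals FRR (= 1 - tpr).
    idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))
    eer = (fpr[idx] + (1.0 - tpr[idx])) / 2.0
    return roc_auc, eer
```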

4.4. Experimental Results and Discussion

To verify the performance of the proposed scheme, eight scenarios were compared and tested using the CASIA-FASD attack dataset.

Table 2 shows HTERs according to eight different scenarios in the CASIA-FASD dataset. The proposed method showed improved performance for printed photo attacks, cut photo attacks, and video replay attacks. Figure 7 shows the performance comparison for the CASIA-FASD dataset.

Table 3 shows the EER values for the eight scenarios on the CASIA-FASD dataset. Among the compared schemes, only "YCbCr_lbp + HSV_lbp" shows EER performance competitive with the proposed scheme.

Receiver operating characteristic (ROC) curves are also presented. These curves plot the true positive rate against the false positive rate and are well suited for comparing the performance of different systems. Figures 8 and 9 show the ROC curves generated for each scenario on the CASIA-FASD dataset.

Table 4 shows the FAR, FRR, and area under the curve (AUC) results according to eight different scenarios in the CASIA-FASD dataset. A high AUC indicates good performance.

Table 5 shows the accuracy for different facial spoofing attacks. The accuracy for YCbCr_lbp + HSV_lbp is the highest, but the proposed method shows similar performance.

The overall test results of this paper are shown in Table 6. Compared to the existing YCbCr_lbp + HSV_lbp method, the method proposed in this paper improves performance with respect to printed photo attacks (0.18%), cut photo attacks (0.69%), and video replay attacks (1.52%), for an overall performance improvement of 0.73%. Additionally, the EER was low, while the accuracy values were similar. Overall, the YCbCr_lbp + HSV_lbp method showed similar performance but uses six color-space channels, while the proposed method uses only three, leading to a faster calculation speed.

5. Conclusions

In this paper, we proposed a face antispoofing method utilizing CNN training and inference, constructing the key inputs by extracting texture information via an LBP from the color spaces of the face image. CASIA-FASD was used as the dataset for performance verification. Images were extracted from the videos and divided into printed photo attacks, cut photo attacks, and video replay attacks; these images were used for both training and evaluation. It was confirmed that detection performance improves when texture features are extracted from the Cb, S, and V channels of the face image, which are useful for antispoofing. In previous studies, a 6-channel (YCbCr + HSV) color space was typically used, leading to large computational costs. In contrast, the proposed approach reduces the computational load by considering only three color-space channels (Cb, S, and V). Using an AI FPGA board, the performance of existing methods was evaluated and compared with that of the proposed scheme. It was confirmed that the proposed method can be effectively used in edge environments.

As future work, we plan to verify the performance of the proposed method on other well-known face spoofing datasets. In addition, we plan to conduct cross-database performance tests.

Data Availability

The data used to support the findings of this study are included within the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by BK21 FOUR (Fostering Outstanding Universities for Research) (no. 5199990914048), and this research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1I1A3066543). In addition, this work was supported by the Soonchunhyang University Research Fund.