#### Abstract

Face representation and matching are two essential issues in the face verification task. Various approaches have been proposed that focus on these two issues; however, few of them address their joint optimization in a unified framework. In this paper, we present a second-order face representation method for face pairs and a unified face verification framework, in which the feature extractor and the subsequent binary classification model can be selected flexibly. Our contributions can be summarized as follows. First, we propose a novel face-pair representation that employs the second-order statistical properties of face pairs and retains more information than existing methods. Second, a flexible binary classification model, which differs from the conventionally used metric learning, is constructed on top of the new face-pair representation. Finally, we verify that the proposed face-pair representation can benefit from large training datasets. All experiments are carried out on Labeled Faces in the Wild (LFW) to verify the algorithm’s effectiveness under challenging uncontrolled conditions.

#### 1. Introduction

Face recognition has been extensively studied in computer vision and pattern recognition. In general, it can be categorized into two tasks: face verification and face identification. In this paper, we focus on the former. Face verification normally works on face pairs; that is, given a face pair, we need to decide whether the two faces are from the same person or not. This is not an easy task because of various challenges, such as position, background, pose, lighting, and occlusion (Figure 1). Like most pattern recognition systems, a face verification system has two key components: face representation and face matching. These two components are the central concern of most researchers. Although many algorithms have been proposed for each component, few of them consider both in a unified framework. Intuitively, formulating these two components jointly may lead to better performance.


Over the years, many face representation approaches have been proposed. According to the intrinsic characteristics they depict, these can be roughly classified into three categories: local feature descriptors, holistic feature descriptors, and feature descriptors based on deep learning.

Local feature descriptors have been proven to be very effective for texture description. For instance, the local binary pattern (LBP) [1, 2] encodes the structure distribution into a histogram by computing the relative intensity difference between each pixel and its neighbors according to predefined rules. Haar-like features [3] are rule-based local feature descriptors. However, such handcrafted encoding methods are unlikely to yield an optimal encoding for a specific task. Alternatively, more flexible and uniformly distributed local descriptors can be learned for the face recognition task [4]. HoG [5] and SIFT features [6] can also be categorized as local feature descriptors, and SIFT additionally achieves robustness to scale variations by building scale spaces. Local feature descriptors are more stable under local changes such as illumination, expression, and inaccurate alignment. Gabor wavelets [7–9] capture the local structure corresponding to a specific spatial frequency (scale), spatial locality, and selective orientation. They have been demonstrated to be discriminative and robust to illumination and expression changes.

Holistic feature descriptors mainly focus on feature transformation, which comprises various algorithms that create new but fewer features with higher discriminatory power in a space different from the original feature space. This can also be used for feature reduction. Such transformations exploit the advantages of statistical techniques. Typical techniques are PCA [10, 11], LDA [12], ICA [13], LPP [14], and so forth. These techniques are categorized as subspace learning methods and are distinguished by their ability to preserve some desired properties of the data. A holistic feature represents the visual features of the whole face. In general, the effectiveness of holistic features relies heavily on well-aligned faces, that is, faces that are uniform in scale, pose, illumination, and so forth. Consequently, holistic features are more sensitive to misalignment [15] and to other variations arising from intrinsic changes (e.g., expression, pose) and ambient environment changes (e.g., side lighting).

Deep learning [16–18] has received increasing interest in computer vision and face verification recently, and a number of deep learning methods have been proposed in the literature. One merit of deep learning methods is that they preserve the face image’s neighborhood relations and spatial locality in their latent higher-level feature representations. However, deep learning methods need a huge number of labeled samples, which raises computational cost issues.

Face matching is the other important component of face verification. Usually, a metric is needed to measure the similarity of an input face pair, and a learned threshold is then used to perform the matching task. To find the optimal similarity (or distance) metric for a specific feature space, many automatic metric learning algorithms have been proposed under different objectives. For instance, Guillaumin et al. [19] learned a Mahalanobis-like metric using the logistic discriminant method with the objective that positive pairs have smaller distances than negative pairs; Davis et al. [20] formulated the Mahalanobis learning problem as that of minimizing the differential relative entropy between two multivariate Gaussians, while adding a regularization term that emphasizes a prior on the covariance matrix. Other metric learning methods in the literature [21–23] differ in their objectives for specific tasks, but they share the same face verification framework, which is depicted in Figure 2.

Mahalanobis-like distance learning can also be viewed as computing a Euclidean distance in a linearly transformed feature space. This relates to a more extensively researched topic, namely, the various projection techniques described above as feature representation approaches. However, all these projections are independent of the subsequent classifiers. Reconsidering the face verification task, if we can represent a face pair using a unified vector, then face verification can subsequently be treated as a simple binary classification problem. That is, we determine whether a face pair is mismatched (i.e., different persons) or matched (i.e., the same person). Such a representation not only describes the dissimilarities between face pairs but also naturally implies a simile classifier [24]. Therefore, this data representation strategy gives great flexibility to the subsequent classification model design.

Motivated by this idea, we propose a classification-oriented data organization for face pairs, which is more flexible and better suited to the subsequent classification model design. Consequently, we obtain a joint approach that puts face representation and face matching into a unified framework. This paper is organized as follows. Section 2 briefly reviews the fundamental knowledge used to introduce the motivation of the proposed approach. In Section 3, a novel representation of a face pair, which employs the second-order statistical property, is proposed, together with a face verification framework based on this representation. In Section 4, three basic experiments under the framework are carried out to illustrate the effectiveness of the proposed method. All the experiments are carried out on Labeled Faces in the Wild. Finally, Section 5 concludes the paper.

#### 2. Preliminaries

In this section, we will briefly review the conventional Mahalanobis distance metric learning and some classical binary classifiers.

##### 2.1. Mahalanobis Distance Metric Learning

Many metric learning algorithms have been proposed over the decades, and some of them have already been applied to face verification. The common objective of these methods is to learn a suitable metric that reduces the distance between positive face pairs and enlarges that between negative pairs. Since the metric can be parameterized by a positive semidefinite matrix $M$, metric learning can be transformed into learning the positive semidefinite matrix $M$.

Let $X = \{x_i\}_{i=1}^{N}$ be a training set of samples, where $x_i \in \mathbb{R}^{d}$ is the $i$th sample and $N$ is the total number of training samples. The task of traditional Mahalanobis distance metric learning is to seek a square matrix $M \in \mathbb{R}^{d \times d}$. Based on the matrix $M$, we can define the squared distance between two samples using the following equation:

$$d_M^2(x_i, x_j) = (x_i - x_j)^{\top} M (x_i - x_j), \tag{1}$$

where $x_i$ is the $i$th sample in the training set and $M$ is a symmetric and positive semidefinite matrix. It can be decomposed as follows:

$$M = L^{\top} L, \tag{2}$$

where $L \in \mathbb{R}^{k \times d}$ performs a linear transformation and normally $k \le d$.

Therefore, the squared distance defined in (1) can be reformulated as

$$d_M^2(x_i, x_j) = (x_i - x_j)^{\top} L^{\top} L (x_i - x_j) = \lVert L x_i - L x_j \rVert_2^2. \tag{3}$$

Equation (3) shows that learning a Mahalanobis metric is equivalent to learning a linear transformation $L$, which projects an original sample into a low-dimensional subspace because $k \le d$. The Mahalanobis metric in the original space reduces to a Euclidean distance between two samples in the transformed space.
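The equivalence between a Mahalanobis distance and a Euclidean distance after a linear projection can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the dimensions, the random transformation `L`, and the sample vectors are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3                      # original and projected dimensions (illustrative)
L = rng.normal(size=(k, d))      # hypothetical linear transformation
M = L.T @ L                      # induced symmetric positive semidefinite matrix

x_i = rng.normal(size=d)
x_j = rng.normal(size=d)
diff = x_i - x_j

# Squared Mahalanobis distance in the original space.
dist_mahalanobis = diff @ M @ diff
# Squared Euclidean distance between the two projected samples.
dist_euclidean = np.sum((L @ x_i - L @ x_j) ** 2)

print(dist_mahalanobis, dist_euclidean)
```

The two printed values agree up to floating-point error, and `M` is positive semidefinite by construction because it has the form `L.T @ L`.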

##### 2.2. Logistic Regression

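Consider a set of training samples, denoted by $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the $i$th sample and $y_i \in \{-1, +1\}$ denotes its corresponding class label. With the logistic model $p(y_i \mid x_i) = 1 / \bigl(1 + \exp(-y_i (w^{\top} x_i + b))\bigr)$, the likelihood function of these samples is defined by $\prod_{i=1}^{N} p(y_i \mid x_i)$, and the average logistic loss is defined as [26]

$$f(w, b) = -\frac{1}{N} \log \prod_{i=1}^{N} p(y_i \mid x_i) = \frac{1}{N} \sum_{i=1}^{N} \log\bigl(1 + \exp(-y_i (w^{\top} x_i + b))\bigr). \tag{4}$$

Therefore, we can determine the parameters $w$ and $b$ by minimizing the average logistic loss; that is,

$$\min_{w, b} f(w, b). \tag{5}$$

This is a smooth convex optimization problem because the average logistic loss is a smooth and convex function.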
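As an illustration of minimizing the average logistic loss, the sketch below implements the loss and its gradient in NumPy and runs plain gradient descent on synthetic two-class data. It is a minimal example under stated assumptions: labels are taken in {-1, +1}, the data, step size, and iteration count are invented for illustration, and plain gradient descent stands in for the solver used by SLEP.

```python
import numpy as np

def average_logistic_loss(w, b, X, y):
    """Mean of log(1 + exp(-y_i (w^T x_i + b))) with labels y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.mean(np.logaddexp(0.0, -margins))

def gradient(w, b, X, y):
    """Gradient of the average logistic loss with respect to w and b."""
    margins = y * (X @ w + b)
    coeff = -y / (1.0 + np.exp(margins))   # derivative of each term w.r.t. the score
    return X.T @ coeff / len(y), np.mean(coeff)

# Synthetic, roughly separable two-class data (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = np.zeros(2), 0.0
initial_loss = average_logistic_loss(w, b, X, y)
for _ in range(200):
    grad_w, grad_b = gradient(w, b, X, y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b
final_loss = average_logistic_loss(w, b, X, y)

print(initial_loss, final_loss)
```

Because the objective is smooth and convex, a sufficiently small fixed step size decreases the loss monotonically; dedicated solvers are preferable for large problems.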

##### 2.3. Support Vector Machine

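Given training samples $x_i \in \mathbb{R}^{d}$, $i = 1, \ldots, N$, in two classes and a label vector $y \in \{-1, +1\}^{N}$ denoting the corresponding labels of the samples, the Support Vector Machine determines the weight $w$ by solving the following optimization problem [27]:

$$\min_{w, b, \xi} \ \frac{1}{2} w^{\top} w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \bigl(w^{\top} \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N, \tag{6}$$

where $\phi(x_i)$ maps the sample $x_i$ into a higher-dimensional space and $C > 0$ is the regularization parameter. Considering the possibly high dimensionality of the vector variable $w$, we usually solve its dual problem:

$$\min_{\alpha} \ \frac{1}{2} \alpha^{\top} Q \alpha - e^{\top} \alpha \quad \text{s.t.} \quad y^{\top} \alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \tag{7}$$

where $e$ is the vector of all ones and $Q$ is an $N \times N$ positive semidefinite matrix with elements $Q_{ij} = y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^{\top} \phi(x_j)$ is called the kernel function.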
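To make the dual formulation concrete, the following sketch builds the matrix with entries y_i y_j K(x_i, x_j) for an RBF kernel and evaluates the dual objective. It only constructs and inspects the problem data; it is not a solver, and the sample data, the `gamma` value, and the function names are illustrative assumptions rather than part of libSVM.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Pairwise RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def dual_objective(alpha, Q):
    """Objective of the SVM dual: 0.5 * alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha @ Q @ alpha - np.sum(alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                       # six toy samples
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)  # their labels

K = rbf_kernel(X, gamma=0.5)
Q = np.outer(y, y) * K                            # Q_ij = y_i y_j K(x_i, x_j)

print(np.linalg.eigvalsh(Q).min())                # nonnegative up to rounding
print(dual_objective(np.zeros(6), Q))
```

The smallest eigenvalue of `Q` is nonnegative up to rounding, which is what makes the dual a convex quadratic program.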

#### 3. Face Verification Based on Second-Order Face-Pair Representation

In this section, a novel face-pair representation approach is proposed. We first present a second-order face-pair representation, with which the task of face verification can be viewed as a binary classification problem in a unified framework. On the one hand, the second-order representation should retain more discriminant information; we therefore expect that this view of the face verification problem can benefit from large training sets. On the other hand, since the proposed unified framework integrates feature extraction and classifier selection for the face verification problem, it should enhance the flexibility of the system.

##### 3.1. Reformulate Face Verification

Face verification is a problem of determining whether two face images depict the same person or not. Its formal description is as follows.

Given two faces, denoted by $x_i$ and $x_j$, the task of face verification is to decide whether these two faces (a face pair) are from different persons or the same one.

The conventional solution to face verification is formulated as a metric learning problem, for example, learning a Mahalanobis distance between $x_i$ and $x_j$, which is used to represent the similarity of a face pair. The final result is obtained by comparing the predicted distance with a properly chosen threshold, which in effect implements a matching procedure.

However, there are four shortcomings in the metric-learning-based face verification process: (1) the positive semidefiniteness of $M$ is required; (2) it is difficult to obtain a closed-form solution for $M$; (3) prior domain knowledge is difficult to embed flexibly; (4) it is hard to choose an appropriate threshold to determine the final result.

To overcome these drawbacks, we propose a unified face verification framework, in which a face-pair representation is employed instead of a single-face representation. Based on the face-pair representation, the face verification problem can be transformed into a binary classification problem defined on a face-pair measurement space. To distinguish it from the single-face representation, we refer to this as the face-pair representation.

For a pair of faces, denoted by $x_i$ and $x_j$, the measurement between a face pair is defined as

$$z_{ij} = \psi(x_i, x_j), \tag{8}$$

where $z_{ij} \in \mathcal{Z}$. Then, the face verification problem can be transformed into a binary classification problem on the face-pair measurement space.

As a result, the target of face verification is equivalent to training a function $f$, which is defined on the transformed face-pair measurement space $\mathcal{Z}$:

$$y_{ij} = f(z_{ij}), \tag{9}$$

where $y_{ij} \in \{0, 1\}$, $f$ functions as a binary classifier, 0 denotes that the face pair is from different persons, and 1 denotes that the face pair is from the same person.

Specifically, the classifier reduces to a linear classifier when $f$ is a linear function:

$$f(z_{ij}) = w^{\top} z_{ij}, \tag{10}$$

where $w$ corresponds to the weight to be learned in a binary classification problem.

##### 3.2. A Second-Order Face-Pair Representation

In the reformulated binary classification task of face verification, it is essential to construct a representation space for face pairs. First, we define positive samples and negative samples, respectively, as face pairs from the same person and those from different persons. We intend to find a unified representation of a face pair $(x_i, x_j)$, which is different from the conventional single-face representation. Intuitively, the representation should retain more discriminant information so as to facilitate the subsequent verification task.

Since face verification aims to measure the similarity between the two faces in a given face pair, we can simply represent the face pair by the difference between the measurements of the two faces. Kumar et al. [24] used the absolute value of the difference between two trait vectors, together with their weighted product, to measure the similarity of two faces. However, this definition is chosen subjectively, and it retains less similarity information.

It is well known that the Mahalanobis distance computes the distance between two samples while taking into account the covariance structure across the $d$-dimensional features. Motivated by this characteristic, we rewrite the initial Mahalanobis distance defined in Section 2 as follows:

$$d_M^2(x_i, x_j) = (x_i - x_j)^{\top} M (x_i - x_j) = \operatorname{vec}(M)^{\top} \operatorname{vec}\bigl((x_i - x_j)(x_i - x_j)^{\top}\bigr), \tag{11}$$

where $\operatorname{vec}(\cdot)$ denotes the operation that rearranges a matrix into a column vector.

If we denote $\operatorname{vec}\bigl((x_i - x_j)(x_i - x_j)^{\top}\bigr)$ in (11) by $z_{ij}$, that is,

$$z_{ij} = \operatorname{vec}\bigl((x_i - x_j)(x_i - x_j)^{\top}\bigr), \tag{12}$$

then (11) can be rewritten as

$$d_M^2(x_i, x_j) = w^{\top} z_{ij}, \tag{13}$$

where the weight $w$ corresponds to $\operatorname{vec}(M)$; it can be learned by training a binary classifier on labeled face pairs, where 0 denotes that the face pair is from different persons and 1 denotes that the face pair is from the same person.

*Remark 1.* The definition of $z_{ij}$ in (12) contains a covariance-like feature representation; such a second-order representation preserves more information for the face verification problem.

*Remark 2.* Due to the symmetry of the feature matrix for face-pair representation, we can use only the upper-triangular entries to reduce the storage and computation burden without any information loss.
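The identity behind the second-order representation, and the upper-triangle reduction mentioned in Remark 2, can be verified with a short NumPy sketch. This is an illustrative check rather than the authors' implementation; the function names, the random test matrix, and the choice to double the off-diagonal weights when packing are assumptions introduced here.

```python
import numpy as np

def second_order_pair(x_i, x_j):
    """Full representation: the flattened outer product of the difference vector."""
    diff = x_i - x_j
    return np.outer(diff, diff).ravel()

def second_order_pair_upper(x_i, x_j):
    """Compact representation: only the upper-triangular entries."""
    diff = x_i - x_j
    rows, cols = np.triu_indices(diff.size)
    return diff[rows] * diff[cols]

def pack_weights_upper(M):
    """Pack a symmetric M so that its dot product with the compact
    representation equals the full quadratic form (off-diagonals count twice)."""
    rows, cols = np.triu_indices(M.shape[0])
    scale = np.where(rows == cols, 1.0, 2.0)
    return scale * M[rows, cols]

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
M = A.T @ A                                   # a symmetric PSD test matrix
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

diff = x_i - x_j
direct = diff @ M @ diff
via_full = M.ravel() @ second_order_pair(x_i, x_j)
via_upper = pack_weights_upper(M) @ second_order_pair_upper(x_i, x_j)

print(direct, via_full, via_upper)
```

All three values coincide up to rounding, so a linear classifier on the compact vector can express the same quadratic form as one on the full vector.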

The second-order feature representation is closely related to the Mahalanobis distance metric. When a linear model is chosen as the binary classifier in the above architecture (Figure 3), the face verification process is equivalent to computing a Mahalanobis-like distance between two faces; when a nonlinear classifier is chosen, the face verification process is implemented nonlinearly. Throughout this paper, we apply this mechanism for face-pair representation, while the basic representation of a single face can vary.

##### 3.3. A Face Verification Framework Based on the Second-Order Face-Pair Representation

The proposed second-order face-pair representation transforms the face verification problem into a more general two-category classification task. An overview of face verification based on the proposed face-pair representation is illustrated in Figure 4. The architecture can be decomposed into three main processes: (1) basic feature extraction for a single face, which can be implemented using any appropriate feature extractor; (2) second-order face-pair representation based on the difference of two faces; (3) a flexible two-category classifier.

It is worth noting that this is a general framework based on the second-order face-pair representation. In practice, the basic feature extraction can be implemented with great flexibility while incorporating prior knowledge from the specific problem domain. For instance, a face can be encoded by global feature descriptors or local feature descriptors. The classification module can also be implemented with great flexibility. For instance, logistic regression can be employed to tackle a linear classification problem, while the kernel support vector machine is more suitable for a nonlinear classification problem.
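The three-stage structure of the framework can be sketched as follows. This is a hedged illustration only: `extract_features` is a placeholder for any single-face extractor, `LogisticPairClassifier` is a small gradient-descent logistic regression written here for self-containment (not SLEP or libSVM), and the synthetic "faces" are random vectors standing in for real descriptors.

```python
import numpy as np

def extract_features(face):
    """Stage 1 (placeholder): any single-face extractor can be plugged in here."""
    return np.asarray(face, dtype=float)

def pair_representation(f_i, f_j):
    """Stage 2: upper-triangular entries of (f_i - f_j)(f_i - f_j)^T."""
    diff = f_i - f_j
    rows, cols = np.triu_indices(diff.size)
    return diff[rows] * diff[cols]

class LogisticPairClassifier:
    """Stage 3 (one possible choice): logistic regression via gradient descent."""

    def __init__(self, steps=300, lr=0.05):
        self.steps, self.lr = steps, lr

    def fit(self, Z, labels):
        self.w, self.b = np.zeros(Z.shape[1]), 0.0
        for _ in range(self.steps):
            # Numerically stable sigmoid of the linear score.
            prob = 0.5 * (1.0 + np.tanh(0.5 * (Z @ self.w + self.b)))
            err = prob - labels
            self.w -= self.lr * (Z.T @ err) / len(labels)
            self.b -= self.lr * err.mean()
        return self

    def predict(self, Z):
        return (Z @ self.w + self.b > 0).astype(int)

def verify(face_a, face_b, classifier):
    """Return 1 for 'same person' and 0 for 'different persons'."""
    z = pair_representation(extract_features(face_a), extract_features(face_b))
    return int(classifier.predict(z[None, :])[0])

# Synthetic stand-in data: each identity is a random centre, and each
# "image" of that identity is the centre plus small noise.
rng = np.random.default_rng(0)
dim, n_pairs = 6, 200
centres = rng.normal(size=(2 * n_pairs, dim))

def image_of(k):
    return centres[k] + 0.1 * rng.normal(size=dim)

same = [(image_of(k), image_of(k)) for k in range(n_pairs)]
different = [(image_of(k), image_of(k + n_pairs)) for k in range(n_pairs)]
Z = np.array([pair_representation(a, b) for a, b in same + different])
labels = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

clf = LogisticPairClassifier().fit(Z, labels)
train_accuracy = np.mean(clf.predict(Z) == labels)
print(train_accuracy, verify(*same[0], clf))
```

Swapping the extractor or the classifier changes only one stage; the pair representation in the middle stays the same, which is the flexibility the framework aims for.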

Another advantage of the proposed framework is that the system can benefit from learning with a large amount of data. Intuitively, most statistical learning methods benefit from more training data. Because second-order statistical information is embedded in the proposed face-pair representation, we expect the method to exploit large training sets more effectively; the experimental results in Section 4 support this. However, the dimension of the second-order face-pair representation grows quadratically with the dimension of the single-face representation, which may limit its application to high-dimensional data. That is, if the feature vector of a single face has dimension $d$, then the face-pair measurement space has dimension $d^2$ (or $d(d+1)/2$ when only the upper triangle is kept, as in Remark 2). Dimension reduction techniques are therefore necessary to tackle this problem, although the achievable compression ratio is limited if as much information as possible is to be preserved for the subsequent classification process.
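The quadratic growth is easy to quantify. The short sketch below tabulates the size of the full and upper-triangular pair representations for a few single-face dimensions; the chosen values of `d` are illustrative, with 100 and 1152 echoing dimensions that appear in the experimental setup.

```python
def pair_dim_full(d):
    """Number of entries in the full d x d second-order representation."""
    return d * d

def pair_dim_upper(d):
    """Number of entries when only the upper triangle (with diagonal) is kept."""
    return d * (d + 1) // 2

for d in (10, 100, 1152):
    print(d, pair_dim_full(d), pair_dim_upper(d))
```

Even with the upper-triangle reduction, a raw 1152-dimensional descriptor yields several hundred thousand pair features, which motivates reducing the single-face dimension first.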

#### 4. Implementation

In this section, experiments are conducted to verify the performance and effectiveness of our proposed method.

##### 4.1. Dataset

We choose the Labeled Faces in the Wild (LFW) dataset [25], the de facto standard dataset for face verification, as our test bed. The dataset has three notable characteristics: (1) a large number of people and of images per person; (2) challenging variations in scale, pose, lighting, background, hairstyle, clothing, expression, color saturation, image resolution, focus, and so forth; (3) flexibility in data organization. Data can be used in both “restricted” and “unrestricted” settings. In the “restricted” setting, data are provided in the form of paired faces. In the “unrestricted” setting, a large number of paired faces can be generated as needed. Following the standard performance reporting protocol, we report our performance on the 10 folds of View 2 [25]. These 10 folds are disjoint in identity, that is, each person appears in only one fold, and each fold contains 300 positive pairs and 300 negative pairs. When reporting performance, each fold in turn is held out for testing and the other 9 folds are used for training. This generates 10 sets of testing scores. The estimated mean accuracy and standard deviation are computed following the standard definition in [25].

We use the aligned version of this dataset referred to as “funneled,” in which all the faces are globally aligned. Semantic facial parts are localized using an off-the-shelf facial feature detector [34], which outputs 9 points at the corners of the eyes, mouth, and nose. Based on these points, we derive another 3 points as the centers of each eye and of the mouth. Then, we extract 128-dimensional SIFT descriptors at these 12 localized points with three patch sizes (16, 32, and 46) and three blur scales, leading to a 9 × 128 = 1152-dimensional descriptor for each facial part.
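The 10-fold reporting step can be sketched as below. This is a minimal illustration under an assumption: it follows the commonly cited LFW convention of reporting the mean of the per-fold accuracies together with the sample standard deviation (divisor 9 for 10 folds) and the resulting standard error of the mean. The function name and the toy accuracies are hypothetical.

```python
import numpy as np

def summarize_folds(fold_accuracies):
    """Return (mean accuracy, sample standard deviation, standard error of the mean)."""
    p = np.asarray(fold_accuracies, dtype=float)
    mean = p.mean()
    std = p.std(ddof=1)               # sample standard deviation over the folds
    return mean, std, std / np.sqrt(p.size)

# Toy per-fold accuracies, one per held-out fold (invented for illustration).
toy_accuracies = [0.0, 1.0] * 5
mean, std, std_err = summarize_folds(toy_accuracies)
print(mean, std, std_err)
```

Each entry in the input would be the accuracy on one held-out fold after training on the other nine.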

##### 4.2. Experimental Setup

The performance of a face verification system relies on many factors, such as basic feature representation, feature transformation, metric learning, and classification design. We try to fully consider the effects of these factors while evaluating the proposed method. The main consideration and the corresponding methods are summarized in several aspects below.

First, misalignment of face images affects the performance of the proposed system. However, our focus in this paper is the face-pair representation and its influence on the performance of a face verification system. Accurate global face alignment is not required here, since geometric alignment based on the locations of specified facial parts does not perform well under large changes of pose and expression. Moreover, facial part detection is a hard problem, and incorrect localization of one part can lead to severe misalignment of the whole face.

Second, we consider the basic feature representation of a single face. Various feature extraction methods can be used to represent a single face, and each has its own advantages and deficiencies. Here, we employ the SIFT feature [6] as the basic feature for single-face representation, considering its robustness against common challenges such as variations of scale, lighting, and rotation, especially in real-life scenarios.

Third, local feature representation of a single face is useful in practical face verification systems, especially for enhancing the system’s robustness against large changes of pose, expression, and lighting, since incorrect localization of one local part does not affect other components. In this implementation, we use the facial feature detector proposed in [34] to extract several semantic local facial components that together make up the face, such as the eyes, nose, mouth corners, and eyebrows.

The final process of the pipeline is classifier design. Most popular classifiers can be used to perform such a two-category classification task. In addition, the classifier can be constructed in two ways: one global classifier for the whole face or one classifier for each facial part. Popular techniques such as logistic regression and the support vector machine are chosen as our classification models. In this paper, we use off-the-shelf implementations of these two techniques for our experiments, that is, SLEP [35] for logistic regression and libSVM [36] for SVM.

Based on the above settings, we design three experiments. The first illustrates the performance of our method with different feature descriptors (i.e., global features and local features) and with different classifiers (i.e., logistic regression and kernel SVM). The second illustrates that the verification accuracy of our approach benefits from a growing number of training face pairs. Finally, several face verification algorithms are compared with our method in terms of verification accuracy and ROC curves.

##### 4.3. Experimental Results

We first evaluate the performance of the system using the proposed face-pair representation in a global way, where all the SIFT features extracted at the localized points are concatenated into a single vector as the single-face descriptor. Since the dimension of a single-face descriptor is high, we apply PCA to obtain a more compact single-face representation. The dimension retained by PCA is also an important issue; we experimentally chose 100 as the best trade-off between accuracy and complexity. However, in the global way, we find that with 100 principal components the preserved energy is only approximately 67%. Such a low energy ratio implies that much information is lost in the dimension reduction process. In order to preserve more useful information at an acceptable resource cost, we train a local classifier for each facial point. The process is similar to the global way: we select an appropriate dimension for each local descriptor and then simply average the results of all the local classifiers to make the final decision. The experimental results are listed in Table 1.
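The two operations described above, reducing a descriptor with PCA while tracking the preserved energy and averaging the outputs of several local classifiers, can be sketched in NumPy as follows. This is an illustrative sketch: the helper names and the random data are assumptions, and "energy" is taken here to mean the fraction of total variance captured by the retained components.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA by SVD; return the mean, the components, and the preserved energy."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], energy[n_components - 1]

def pca_transform(X, mean, components):
    """Project centred data onto the retained principal components."""
    return (X - mean) @ components.T

def average_scores(local_scores):
    """Fuse local classifiers by averaging their scores for each face pair."""
    return np.mean(local_scores, axis=0)

# Random stand-in descriptors with decaying variance per dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) * np.linspace(3.0, 0.1, 20)

mean, components, energy = pca_fit(X, n_components=5)
X_reduced = pca_transform(X, mean, components)
print(X_reduced.shape, energy)

# Two hypothetical local classifiers scoring the same two face pairs.
fused = average_scores(np.array([[1.0, -1.0], [3.0, 1.0]]))
print(fused)
```

Training one classifier per facial point keeps each PCA problem small, so a fixed number of components preserves a larger share of the energy of each local descriptor than of the concatenated global one.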

Table 1 shows that when we choose logistic regression as the binary classifier, the performance of the system with local feature descriptors is enhanced compared to the one with global feature descriptors. Since different binary classifiers can be adopted as the classifier model in our proposed framework, we can also choose SVM with RBF kernel (KSVM) as the classification model; the performance of the system with local feature descriptors is enhanced from 80.35% to 83.07% compared to the one with global feature descriptors. We also noted that integrating the nonlinear classifier into the system will improve the performance compared to using a linear classifier, that is, logistic model. The corresponding ROC curve is shown in Figure 5.

Due to the second-order statistical property of the proposed face-pair representation, the corresponding face verification model benefits from a large amount of training data. A single-face encoding method [24] has shown its effectiveness for the face verification problem, where only 73 attribute scores are used to encode each face. Based on this single-face encoding method, we get a 1701-dimensional vector for a face pair. In order to enlarge the training dataset, we randomly add 1000 positive pairs and the same number of negative pairs, to verify the trend of performance improvement with the amount of training data. The final results show that performance improves as the number of training samples increases. From Table 2, we can see that an improvement of more than one percentage point is obtained compared to the model trained under the restricted setting, and an improvement of nearly four percentage points is achieved compared to the initial abs. prod. [24].

Finally, several face verification algorithms are compared with our method in terms of verification accuracy and ROC curves. The experiments follow the Image-Restricted, No Outside Data protocol. All experiments use centered 150 × 150 crops of “LFW-funneled” images. The recognition rate and AUC of the different methods are shown in Table 3, and the corresponding ROC curves are shown in Figure 6. The results show that our method outperforms most of the existing methods except for Spartans, which preprocesses the data more delicately.
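For readers who want to reproduce this kind of comparison, an ROC curve and its AUC can be computed from verification scores as sketched below. This is a generic illustration, not the evaluation script used in the paper; it assumes distinct scores (ties are not specially handled), labels in {0, 1}, and uses toy inputs.

```python
import numpy as np

def roc_curve(scores, labels):
    """False and true positive rates as the decision threshold sweeps downwards."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")   # highest score first
    sorted_labels = labels[order]
    true_pos = np.cumsum(sorted_labels == 1)
    false_pos = np.cumsum(sorted_labels == 0)
    tpr = np.concatenate([[0.0], true_pos / true_pos[-1]])
    fpr = np.concatenate([[0.0], false_pos / false_pos[-1]])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy scores: higher means "more likely the same person".
scores = [0.9, 0.8, 0.7, 0.6]
perfect = auc(*roc_curve(scores, [1, 1, 0, 0]))
inverted = auc(*roc_curve(scores, [0, 0, 1, 1]))
print(perfect, inverted)
```

A perfectly ranked set of pairs gives an AUC of 1 and a perfectly inverted ranking gives 0, which brackets the values reported for real systems.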

#### 5. Conclusions

In this paper, we derived a second-order face-pair representation from the formula of the conventional Mahalanobis distance used in metric learning. This face-pair representation provides more discriminative information. Based on the face-pair representation, a unified face verification framework is put forward, which makes the proposed approach more flexible. The flexibility can be summarized in two aspects: flexibility in choosing the feature extractor for a single face and flexibility in selecting the classifier. Moreover, this novel face-pair representation captures the second-order statistical property of the difference between two faces. Three basic experiments on LFW have been carried out to illustrate the effectiveness and validity of the proposed method.

Face verification is a complex problem that has substantial connections with deep convolutional neural networks, and their integration could improve performance both in theory and in applications. In our future work, we will try to integrate deep CNNs into the unified framework to obtain a more satisfactory feature representation and improve the performance of the face verification task.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This research is supported by the Natural Science Foundation of Hebei Province (F2018201115, F2018201096) and the Youth Scientific Research Foundation of Education Department of Hebei Province (QN2015026, QN2017019).