Abstract

With the rise of deep neural networks, the performance of biometric systems has increased tremendously. Biometric systems for face recognition are now used in everyday life, e.g., for border control, crime prevention, or personal device access control. Although the accuracy of face recognition systems is generally high, they are not without flaws. Many biometric systems have been found to exhibit demographic bias, resulting in different demographic groups not being recognized with the same accuracy. This especially holds for face recognition with respect to demographic factors such as gender and skin color. While many previous works have reported demographic bias, this work aims to reduce demographic bias for biometric face recognition applications. In this regard, 12 face recognition systems are benchmarked regarding biometric recognition performance as well as demographic differentials, i.e., fairness. Subsequently, multiple fusion techniques are applied with the goal of improving fairness in contrast to single systems. The experimental results show that it is possible to improve fairness regarding single demographics, e.g., skin color or gender, while improving fairness for demographic subgroups turns out to be more challenging.

1. Introduction

Biometrics are already employed in many areas of life in the form of automated algorithms. According to recent market value analyses, the biometrics market is expected to grow even further in the coming years [1]. Automated algorithms, such as face recognition, have already outperformed human capabilities [2]. Therefore, these algorithms are also used in areas that can immediately and strongly impact an individual's life. For example, automated algorithms are used in the judiciary [3], healthcare [4], credit scoring [5], and other fields [6]. However, face recognition technologies are also error-prone. For example, in the U.S., there are known cases where misidentifying a person as a wanted criminal has led to a wrongful arrest, accompanied by at least temporary imprisonment and inappropriate treatment by the police [7–9]. In this context, Garvie and Bedoya [10] documented a disproportionately higher arrest and search rate of African-Americans based on face recognition software decisions. In addition to these individual cases, researchers have reported differences in the performance of face recognition algorithms based on the demographic characteristics (skin color/ethnicity, gender, age) of the individual being identified or verified. Demographic bias in face recognition is already known from the field of human expert analysis: the so-called other-race effect describes the fact that people can recognize faces within their own demographic group better than faces of another demographic group [11]. Many researchers even refer to algorithmic bias as one of the most critical challenges in the field of biometrics [12–14].

In response to said issues, organizations such as the Association for Computing Machinery call for an immediate suspension of face recognition software [15]. Both in the U.S. [16] and in the EU [17], standards have been created to regulate automated algorithms with respect to demographic bias. There are now several proposed measures to evaluate the fairness and demographic differentials of biometric algorithms [18–21]. Also, a vast number of techniques and algorithms have been put forward to mitigate demographic bias, mainly focused on face recognition [22, 23]; these include methods applied during the training process [24], the removal of sensitive attributes [25], and domain adaptation [26]. Most approaches focus on the verification scenario, while only a few consider the identification scenario [22].

In contrast to previously published works, this work investigates whether fusing multiple face recognition systems can mitigate demographic bias and make biometric systems fairer, as shown in Figure 1. Algorithm fusion has been successfully applied in the field of biometrics in order to achieve more robust recognition systems. However, to the best of the authors' knowledge, algorithm fusion has not yet been applied for the purpose of improving the fairness of face recognition. To do so, 12 different face recognition models are evaluated in verification mode with respect to accuracy and demographic fairness. The metrics used are general and demographic-specific false non-match rates (FNMRs) and false match rates (FMRs), as well as the resulting fairness metrics inequity rate (IR), fairness discrepancy rate (FDR), and Gini Aggregation Rate for Biometric Equitability (GARBE). In a case study, 33 different fusions are evaluated: these are composed of three selection criteria, three demographic attributes, and three to four types of fusions. The fusions applied are decision-level and score-level fusions. The decision-level fusions use the AND-, OR-, and Majority-Vote-operators. The score-level fusion is an equally weighted min–max normalized average fusion. The fusions are evaluated based on the selection criteria and the covariate under consideration. Fairness is evaluated for the three covariates of gender, skin color, and subgroups of gender and skin color.

In summary, this study presents a way to improve the fairness of biometric systems through carefully selected fusions. This gives providers of face recognition systems new opportunities to improve the fairness of their systems and helps to establish the equal treatment of individuals from different demographic groups. The key contributions of this paper can be summarized as follows:

(1) Twelve face recognition systems are benchmarked on the composite University of North Carolina at Wilmington (UNCW) dataset [27] to report their demographic bias toward gender, skin color, and the combined subgroups. The results are presented in terms of biometric performance in a verification scenario as well as a fairness score. Generally, it is observed that error rates are lower for males than for females. Further, lower error rates are obtained for dark-skinned subjects compared to light-skinned subjects, although this difference is less pronounced than the aforementioned gender accuracy gap.

(2) Multiple fusion schemes are implemented to combine the strengths of different face recognition systems. In this context, different fusion techniques are applied as well as different selection methods for possible fusion candidates.

(3) The fusion results are evaluated to understand whether the fairness score can be improved and how the fusion affects the biometric recognition performance. It is observed that biometric performance as well as fairness scores can be improved for distinct fusion approaches.

The rest of this paper is structured as follows: related work is reviewed in Section 2 and relevant metrics are defined in Section 3. Section 4 introduces terminology and concepts of our approach. The experimental evaluation is discussed in Section 5, and Section 6 concludes our findings.

2. Related Work

There are multiple works reporting the existence of demographic bias in face recognition. The following works estimate the demographic bias concerning different biometric applications, e.g., verification, identification, soft-biometric classification, and sample quality assessment. Most studies look at the verification scenario. Here, gender is the most commonly studied demographic attribute, followed by skin color, which is frequently referred to as ethnicity. A trend can be identified from the results of the different studies: the biometric performance is mostly better for male individuals [28–44], while only Lui et al. [45] found no difference in the performance of algorithms with respect to gender. Lower performance for females was also observed in classification tasks [46–48].

The analysis of bias in the face recognition performance for different ethnicities is more challenging due to the broader definition of ethnicity. The vast majority of studies only focus on the ethnicities East Asian, Caucasian, and Black. A common observation is that East Asians are the best-performing ethnicity, followed by Caucasians, while dark-skinned people perform the worst in various studies [26, 28, 30–32, 38, 40, 42, 44, 46, 47, 49, 50]. However, other studies [11, 34, 41, 45, 51, 52] indicate that the performance differentials are not inherent to the different ethnicities but are a result of the own-race effect and/or algorithm-specific training or implementation. The own-race effect causes algorithms to work best on ethnicities that originate from the same region as the algorithm's training data.

For the identification scenario, the so-called watchlist imbalance effect has been examined [53, 54]. The effect describes the influence of the gallery composition on the performance of face recognition: considering the distribution of the gallery in terms of gender and skin color, the FMR for a demographic group increases with the proportion of that group in the watchlist.

Fairness measurement metrics have been introduced by different researchers. de Freitas Pereira and Marcel [55] proposed the FDR. The FDR determines fairness by the maximum absolute distance of the FMR and/or FNMR between two demographic groups at a certain decision threshold. Grother et al. [41] proposed the IR as a fairness measurement. In contrast to the FDR, the IR calculates the ratio between the worst and best FMR and/or FNMR observed across demographic groups. Howard et al. [21] introduced a set of interpretable criteria referred to as the functional fairness measure criteria (FFMC). These criteria were applied to identify shortcomings of the aforementioned fairness measurements, based on which the same authors proposed the Gini Aggregation Rate for Biometric Equitability (GARBE). When Grother [56] later published the "Face Recognition Vendor Test Part 3: Summarizing Demographic Differentials," he adopted the FFMCs defined in Howard et al.'s [21] study and added two additional FFMCs. The mentioned FFMCs and fairness measurements are detailed in the subsequent section.

In addition to the estimation of demographic bias, there are also numerous approaches that attempt to mitigate the bias. These approaches can be roughly divided into three categories. The first category contains approaches that focus on training [24, 30, 39, 57–63]. Some of them rely on a training dataset that is as balanced as possible with respect to the demographic covariates to be mitigated. Other approaches use specialized loss functions. For example, some algorithms are trained with more or fewer data from a particular covariate, depending on what results in the fairest outcome.

Another category of approaches dynamically selects the most appropriate recognition algorithm, decision threshold, or score normalization depending on the individual under consideration [42, 64–67].

Furthermore, some approaches try to obfuscate or remove an individual's demographic information, such that the demographic covariate no longer influences the performance of the face recognition algorithms [68–71].

3. Fairness Metrics

Although the biometric standardization community is working on standardizing fairness metrics [72], no final definitions are available yet. However, as mentioned before, several metrics have been proposed by different researchers, which will be described in detail in the following. Said metrics should fulfill five FFMCs [21, 56]:

(1) FFMC.1. The net contributions of FMR and FNMR differentials to the overall fairness measure should be intuitive when using a normal range of risk parameter weights and operationally relevant error rates.

(2) FFMC.2. There should be recognizable points of reference in the domain of the fairness measure, e.g., one bounded by known minimum and maximum possible values.

(3) FFMC.3. The fairness measure should be calculable when no recognition errors are observed for a demographic group. Given a finite image dataset partitioned into intersectional demographic groups, the likelihood that one group has zero FNMR rises with the number of groups.

(4) FFMC.4. The measure should reward more accurate algorithms if they distribute errors uniformly or in the same way as less accurate ones.

(5) FFMC.5. The measure should rank algorithms intuitively, correctly penalizing algorithms with the most nonuniform error rates.

The published fairness metrics, i.e., FDR, IR, and GARBE, have in common that they are composed of FNMR and FMR, as suggested in FFMC.1. In this regard, each formula is split into two terms: one term calculates the fairness concerning the FMR, and the other calculates the fairness with respect to the FNMR. To flexibly and intuitively weigh the composition of the terms, a weighting parameter α in the range [0, 1] is used. A high α means that the FMR is strongly considered, and a low α means that the FNMR is more strongly considered. More specifically, for α = 0 only the fairness concerning the FNMR is computed, for α = 1 only the FMR is considered, and for α = 0.5 both rates are equally weighted.

3.1. Fairness Discrepancy Rate

The calculation of the FDR is shown in Equation (1) for two demographic groups d_1 and d_2 and a given decision threshold τ. The two fairness terms are determined by the largest difference in the FMRs and FNMRs between the demographic groups. This means that fairness is generally lower when the system is more accurate, which partially contradicts FFMC.4. On the other hand, the FDR can still be computed in case one error rate is 0, which fulfills FFMC.3. The main drawback of the FDR is that, while its theoretical range of values is between 0 and 1 as required by FFMC.2, it uses only a small portion of that range in practice. In fact, the observed values are mostly narrowed to a small interval close to 1, as shown in Howard et al.'s [73] study. Since 1 means fair and 0 means unfair, this could create the impression that all systems are fair, even if that is not the case.
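For reference, the FDR as defined by de Freitas Pereira and Marcel [55] can be written as

\[ \mathrm{FDR}(\tau) = 1 - \bigl( \alpha \, A(\tau) + (1 - \alpha) \, B(\tau) \bigr), \tag{1} \]

where \(A(\tau) = \max_{i,j} \lvert \mathrm{FMR}_{d_i}(\tau) - \mathrm{FMR}_{d_j}(\tau) \rvert\) and \(B(\tau) = \max_{i,j} \lvert \mathrm{FNMR}_{d_i}(\tau) - \mathrm{FNMR}_{d_j}(\tau) \rvert\).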

3.1.1. Inequity Rate

The IR is calculated from the ratios between the maximum and minimum FMR and FNMR, each taken over all demographic groups, as can be seen in Equation (2). A system with an IR close to 1 is considered fair, and the higher the IR, the more unfair the system. The IR is not upper bounded, so it does not satisfy FFMC.2 and is difficult to interpret on its own without a reference system. In addition, the IR does not satisfy FFMC.3: if the error rate of a demographic group is 0, the metric is not defined, since this leads to a division by 0.
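Restated from Grother et al. [41], Equation (2) reads

\[ \mathrm{IR}(\tau) = \left( \frac{\max_{d_i} \mathrm{FMR}_{d_i}(\tau)}{\min_{d_j} \mathrm{FMR}_{d_j}(\tau)} \right)^{\!\alpha} \left( \frac{\max_{d_i} \mathrm{FNMR}_{d_i}(\tau)}{\min_{d_j} \mathrm{FNMR}_{d_j}(\tau)} \right)^{\!1-\alpha}. \tag{2} \]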

3.1.2. Gini Aggregation Rate for Biometric Equitability

The GARBE is inspired by the Gini coefficient and satisfies FFMC.1, FFMC.2, and FFMC.3. The GARBE can be calculated using Equation (3). The variable n represents the number of observations of the variable x, i.e., the number of demographic groups; x_i represents one observation from x, i.e., the FMR/FNMR of a demographic group; and x̄ represents the mean of all observations. The GARBE has a range of values of [0, 1], where 0 denotes the fairest and 1 the most unfair system. Unlike the previous two fairness metrics, which only consider the difference or ratio of the highest and lowest error rates of the demographic groups, the GARBE also includes all values in between. This matters when the fairness between more than two demographic groups is calculated, which is the case when combining, e.g., skin color and gender information.
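Restated from Howard et al. [21], Equation (3) reads

\[ \mathrm{GARBE}(\tau) = \alpha \, G_{\mathrm{FMR}}(\tau) + (1 - \alpha) \, G_{\mathrm{FNMR}}(\tau), \tag{3} \]

where each term is the sample-corrected Gini coefficient over the per-group error rates \(x_1, \dots, x_n\):

\[ G_{x} = \frac{n}{n-1} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \lvert x_i - x_j \rvert}{2 n^2 \bar{x}}. \]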

4. Proposed System

While further discussion is required to standardize final definitions of fairness metrics, for this study we follow the argumentation of Howard et al. [73] and use the GARBE to compare the fairness of different systems, since it satisfies the largest number of FFMCs. Additionally, the GARBE differentiates well between fair and unfair systems compared to the FDR, and its fixed range is easier to interpret than the unbounded IR.

First, the facial images are processed by multiple face recognition systems and the biometric performance is reported. In addition, a fairness score is computed for each system. Finally, different fusion schemes are evaluated in terms of biometric performance as well as demographic fairness. The whole procedure is executed on the full database as well as on subsets for different demographic groups. More details for each step are provided in the following.

4.1. Face Recognition and Demographics

For our work, we want to use multiple face recognition systems. Each system is then evaluated in terms of biometric performance as defined in ISO/IEC 19795-1 [19] regarding FMR and FNMR. In this context, the FMR is fixed to 0.1%, as recommended, e.g., for border control [74], to benchmark the different systems regarding their FNMRs. Additionally, the biometric performance is monitored for separate demographic groups. With this, we can see how biased the different face recognition systems are toward specific demographics.
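As a minimal sketch of this protocol (variable names and data layout are illustrative, not taken from the paper), the decision threshold can be calibrated once on all non-mated scores and the FNMR then reported per demographic group:

```python
import numpy as np

def threshold_at_fmr(non_mated_scores, target_fmr=0.001):
    """Decision threshold at which the global FMR equals the target (0.1%)."""
    # FMR(tau) is the fraction of non-mated scores >= tau, so tau is the
    # (1 - target_fmr) quantile of the non-mated score distribution.
    return float(np.quantile(np.asarray(non_mated_scores), 1.0 - target_fmr))

def fnmr_at_threshold(mated_scores, threshold):
    """FNMR: fraction of mated comparisons falling below the threshold."""
    return float(np.mean(np.asarray(mated_scores) < threshold))

# Hypothetical usage: calibrate once on all non-mated scores, then
# evaluate the FNMR separately for each demographic group.
# tau = threshold_at_fmr(non_mated_all)
# fnmr_female = fnmr_at_threshold(mated_female, tau)
# fnmr_male = fnmr_at_threshold(mated_male, tau)
```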

4.2. Pareto Efficiency

Pareto efficiency is an optimization concept mainly used in economics. The idea of using Pareto efficiency for biometric systems originated from [75]. In this work, we can make use of Pareto efficiency to preselect biometric systems that lie on the Pareto curve. The Pareto-efficient systems are identified using the FNMR and the GARBE with respect to FMR (α = 1). With this setup, we combine all three inputs into a 2D Pareto curve. A system is Pareto efficient if no parameter (FNMR or GARBE) can be improved without worsening the other parameter.
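A minimal sketch of this preselection, assuming each system is summarized by an (FNMR, GARBE) pair where lower is better for both criteria:

```python
def pareto_front(systems):
    """Return the names of Pareto-efficient systems.

    `systems` is a list of (name, fnmr, garbe) tuples. A system is
    dominated if another system is at least as good in both criteria
    and strictly better in at least one.
    """
    front = []
    for name, fnmr, garbe in systems:
        dominated = any(
            f <= fnmr and g <= garbe and (f < fnmr or g < garbe)
            for _, f, g in systems
        )
        if not dominated:
            front.append(name)
    return front
```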

4.3. Fusion Techniques

In general, there are many ways to fuse information in biometric systems [76]. In our approach, the two relevant fusion techniques operate on the decision level and on the comparison score level. For the decision-level fusion, each face recognition system compares its computed comparison score to its decision threshold. Subsequently, the decisions are fused using either AND/OR combinations or a majority vote. Especially the latter requires an odd number of fused systems or a fallback strategy. For the score-level fusion, each face recognition system computes one comparison score. These scores then need to be normalized to the same value range before fusing them. In our case, we use min–max normalization to map all scores to the range [0, 1]. The single comparison scores can then be weighted equally, or a specific system can influence the final score to a larger extent. In any case, a new decision threshold is required for the fused system. This also implies a new calibration whenever different systems are fused or the weights are adjusted.
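The following sketch illustrates both fusion levels under these assumptions (score arrays aligned over the same comparison pairs; all names are illustrative, and the fused score-level system still needs a newly calibrated threshold as described above):

```python
import numpy as np

def minmax_norm(scores):
    """Map one system's comparison scores to the range [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def score_level_fusion(score_lists, weights=None):
    """Weighted average of min-max normalized scores (equal weights by default)."""
    normed = [minmax_norm(s) for s in score_lists]
    return np.average(np.stack(normed), axis=0, weights=weights)

def decision_level_fusion(decisions, rule):
    """Fuse boolean accept/reject decisions of shape (num_systems, num_pairs)."""
    d = np.asarray(decisions)
    if rule == "and":       # accept only if all systems accept
        return d.all(axis=0)
    if rule == "or":        # accept if at least one system accepts
        return d.any(axis=0)
    # majority vote; assumes an odd number of fused systems
    return d.sum(axis=0) > d.shape[0] // 2
```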

This leaves us with the question of how to select the corresponding systems for the fusion. In this regard, we test three different approaches and evaluate how they improve the fairness metric as well as how they affect the biometric recognition performance:

(1) We select fusion candidates based on complementary FMRs. More specifically, focusing on one demographic characteristic (e.g., gender), we select the face recognition model with the lowest FMR for one group (e.g., female) and another model with the lowest FMR for the other group (e.g., male); a sketch of this selection follows after this list. By fusing both models, we hope that the strengths of both models combined can improve the fairness regarding this demographic characteristic (e.g., gender). The same selection process is applied for all demographic groups.

(2) The GARBE values are used to choose fusion candidates. Here, the idea is that the face recognition models with the best fairness scores are fused to hopefully complement each other, resulting in an even better fairness score.

(3) The Pareto efficiency for all models is computed. By visually inspecting the graph, the Pareto-efficient systems are identified and selected for the fusion. The computation of the Pareto efficiency relies on the biometric performance as well as the fairness score, thus combining both previous approaches.
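For the first criterion, the candidate selection reduces to an argmin per demographic group over a table of per-group FMRs, as in this sketch (the fmr_table layout is hypothetical):

```python
def lowest_fmr_candidates(fmr_table, groups):
    """Select, for each group, the model with the lowest FMR.

    `fmr_table` maps a model name to a dict of per-group FMRs,
    e.g., groups = ("female", "male") for the gender covariate.
    """
    return {min(fmr_table, key=lambda model: fmr_table[model][g]) for g in groups}
```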

5. Experimental Evaluation

This section provides information about the experimental setup including database preparation and selected face recognition models. Subsequently, the results for all selected demographic groups are presented and discussed.

5.1. Experimental Setup

The face image database used in this study is the UNCW-MORPH dataset [27], see Figure 2. More specifically, UNCW offers a free academic dataset (https://uncw.edu/oic/tech/morph_academic.html) and a commercial dataset (https://uncw.edu/oic/tech/morph.html), which comes in two parts and licensing options (we only licensed the first part). We combined the first part of the commercial dataset with the free academic dataset to obtain a larger face database for our study. The two subsets contain neither identical images nor identical subjects, which was verified using cryptographic hash functions at the file level and face recognition systems for biometric comparisons.
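For the file-level part of this check, a minimal sketch could look as follows (SHA-256 is our assumption; the hash function is not specified above). Re-encoded near-duplicates are not caught this way, which is why the mentioned biometric comparisons complement the check:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Cryptographic hash of the raw file bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def exact_duplicates(paths_a, paths_b):
    """Files in subset B whose byte content also occurs in subset A."""
    digests_a = {file_digest(p) for p in paths_a}
    return [p for p in paths_b if file_digest(p) in digests_a]
```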

For the experiments, we split the UNCW database into smaller sets according to demographic attributes, i.e., gender and skin color. It should be noted that we did not assign gender and ethnicity to the data subjects ourselves but used the labels that come with the database as ground truth. In terms of gender, the database labels are binary, thus only distinguishing between female and male. In order to evaluate the influence of skin color, we focused on the ethnicity labels African and European to have a clear separation of skin tones in these experiments. We are aware that these limitations do not represent all people, but we selected this setting to analyze bias reduction capabilities on clearly separable demographic subgroups. The idea is that these subgroups consist of a combination of two demographic attributes, namely gender and skin color. In the following, the ethnicity labels are discarded and the terms dark and light are used to separate the skin tones. The resulting demographic subgroups are therefore: dark female, dark male, light female, and light male. The resulting database comprises more than 246,000 images from 35,633 subjects, as can be seen in Table 1. This makes it one of the largest annotated databases providing demographic labels that were captured in a controlled environment. When evaluating demographic fairness and bias, we do not want additional factors from unconstrained capture processes to influence the experimental results.

For the analysis of the demographic bias reduction capabilities, we need multiple face recognition systems in order to have a pool of possible fusion candidates to choose from. The selected face recognition systems should have state-of-the-art performance in terms of biometric recognition rates, thus only the leading open source models are used in this study.

The original ArcFace [77] is constantly updated [78] and retrained on new datasets. Looking at the different models (https://github.com/deepinsight/insightface/tree/master/model_zoo) that are made available by the authors, the reported performance increases for larger backbones (e.g., R100) compared to smaller ones (e.g., R50, R34, R18). Hence, we selected all available R100 models for this study. However, for some of the pretrained models, only R50 versions are available. Here, the WebFace600K model stands out, since its reported performance is better than that of some of the previously selected R100 models. Thus, this model is also included. The model naming mirrors the dataset on which the corresponding model was trained, except for mxnet, which was trained on MS1MV2 but builds upon a different backbone structure than the remaining models. From now on, all ArcFace models are marked with the prefix af_ followed by their original model name.

For the following open source face recognition systems, the selection process is simpler. The authors of CurricularFace [79] provide only one model (https://github.com/HuangYG123/CurricularFace), and MagFace [80] comes in multiple versions (https://github.com/IrvingMeng/MagFace), of which again the R100 model is selected. Finally, ElasticFace [81] offers four pretrained models (https://github.com/fdbtrs/ElasticFace), which are all included. Analogous to the ArcFace naming, the ElasticFace models are given the prefix ef_ to be discernible in the following. In addition to the 12 open source systems summarized in Table 2, one commercial off-the-shelf (COTS) face recognition system is also included in the benchmark. However, this system is not considered for the fusion approaches in order to ensure full reproducibility of our results.

For more details on the specific pretrained face recognition models, the reader is referred to the descriptions of the original authors. For this study, we now focus on the demographic bias of each model and how to fuse them to improve the fairness.

To retrieve comparable results of the different face recognition systems, RetinaFace [82] was used for face detection and alignment of the cropped face regions. Hence, all different face recognition models receive the same preprocessed face images as inputs.
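One plausible implementation of this preprocessing step uses the insightface package, which ships RetinaFace as its default detector; exact module paths and defaults may differ between package versions:

```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

app = FaceAnalysis(allowed_modules=["detection"])  # RetinaFace-based detection
app.prepare(ctx_id=0, det_size=(640, 640))

def aligned_crop(image_path):
    """Detect one face and return the aligned crop fed to all models."""
    img = cv2.imread(image_path)
    faces = app.get(img)
    assert len(faces) == 1, "expected exactly one face per sample"
    # Warp to the canonical 112x112 layout using the five detected landmarks.
    return face_align.norm_crop(img, landmark=faces[0].kps, image_size=112)
```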

5.2. Benchmark Results

Figure 3 shows biometric performance on the full database for all selected models.

5.2.1. False Non-Match Rate

Table 3 summarizes the FNMRs for all models and each demographic group. Since FMR and FNMR have a trade-off character, i.e., the higher the FMR, the lower the FNMR, we need to look at the FNMRs in addition to the FMRs. Magface is the best open-source model, and COTS is the best of all evaluated systems in terms of FNMR at a fixed FMR of 0.1%. The worst model with the highest FNMR is Casia. The statement that Magface is the best model and Casia the worst is valid in this case, i.e., for the comparison of all subjects with all subjects, since the FMR is uniformly 0.1%. For the following observations regarding the FMRs of individual demographic groups, it must be noted that the FMR is not 0.1% for each individual group but varies. Thus, a statement regarding the improved accuracy or performance of the models cannot be made directly. However, it is still possible to say that the FNMR for certain groups is better or worse than for others at a fixed FMR of 0.1% across all groups.

COTS performs best for each demographic group, but due to the smaller number of comparisons made with COTS and since it is not considered for fusion, only the FNMR values of the open-source models are described. First, we look at the gender-related columns. Among females, Casia has the highest FNMR, followed by the second-worst model Curricularface; the best FNMR among females is achieved by Magface. In the comparison among males, Casia is again the worst-performing model, and Magface again has the lowest FNMR. These observations match those for the comparisons of all subjects. In the direct comparison between the genders, the FNMR among males is better than the FNMR among females for every model.

The lowest and thus best FNMR within dark-skinned individuals is achieved by the Mxnet model; many of the other models have an only slightly higher FNMR, while Casia has the highest and thus worst FNMR. Casia also has the highest FNMR for light-skinned individuals. This time, however, Magface has the lowest FNMR. If we again compare the values of all models for dark-skinned and light-skinned subjects, the FNMR is always significantly lower, i.e., better, for dark-skinned than for light-skinned subjects. This is the opposite pattern to that of the FMRs. The obtained results could be caused by differences in the quality of the facial images of certain subgroups. It has been observed that, especially for light-skinned female subjects, parts of the facial region may be occluded by hair. On the contrary, male subjects may have more distinct facial hair covering some of their facial region. This is one hypothesis for the observed results; however, a more detailed investigation of the causes of bias is out of scope for this work.

For the dark female subgroup, Webface600k provides the best FNMR, and the FNMRs are quite close for all models except Casia. Within the dark male subgroup, Mxnet has the lowest FNMR; again, the values of all models are close to each other, except for Casia. Magface has the best FNMR in the light female category as well as in the light male subgroup. Casia has the highest and therefore worst FNMR in each group. Comparing all subgroups with each other, the FNMR is lowest, i.e., best, for dark females for each model, followed by dark males and light males. Light females have the worst FNMR for every evaluated model at a fixed FMR of 0.1% across all groups. This observation is the opposite of the behavior of the FMRs. However, it is consistent with the trade-off effect between FNMR and FMR: if the FMR is higher, the FNMR is lower, and conversely, if the FMR is lower, the FNMR is higher.

Figure 4 shows the biometric performance of the best open source model for each analyzed demographic group.

5.2.2. FMR Within Demographic Groups

Table 4 shows the FMRs for all demographic groups, split into comparisons within each group and across different groups. Looking first at the FMR values within demographic groups in Table 4a, it can be seen that, with a global FMR of 0.1% across all demographic groups, the female group performs worse than 0.1% in most models; only for Cosplus and Arcplus are the FMRs below 0.1%. Also, in the male group, the FMR is higher than 0.1% for most models; only for Magface is the FMR significantly lower than 0.1%.

If we compare the male and female FMR values within their own demographic group, it is clear that in most models, the FMR of females is lower than the FMR of males. In 10 out of 13 models, females have a lower FMR than males, which means that a false match between females is less likely with these models than a false match between males. The two models that have a lower FMR among males are Webface600k and Magface.

Looking at the demographic groups dark and light, we notice that dark-skinned individuals have a higher FMR in most cases, as is to be expected when comparing these individuals only among themselves. The FMR for comparisons between dark-skinned individuals is above 0.1% in all cases except for Magface. A different picture emerges if we look at the comparisons between light-skinned persons: for seven of the 13 models, the FMR between light-skinned individuals is below 0.1%. Since one tends to expect higher values for comparisons within a demographic group than for comparisons between all groups, this is very striking. Unsurprisingly, 12 of the 13 models have a better FMR for light-skinned than for dark-skinned individuals; the only exception is, again, Magface.

For the dark female subgroup, the FMRs are all higher than 0.1%. Similarly, in the dark male subgroup, the FMRs of all models are above 0.1%; the only exception is Magface for the comparisons between dark males. In the light female subgroup, most FMRs are also above 0.1%, with Arcplus and Cosplus being the two exceptions. When comparing within the light male subgroup, nine of the 13 models still have an FMR above 0.1%. If we compare the FMR values of the subgroups, the following picture emerges: in most models, dark females have the highest FMR, followed by dark males and light males, while light females have the lowest FMR. For the Magface model, the order of descending FMR is dark females, light females, dark males, and light males. It is noticeable that dark females have the worst FMR in every model. In most cases, dark males have the second-worst FMR, while light females and light males have the best or second-best FMR in almost equal proportions. Exceptions are Mxnet and Magface: although dark females still have the highest FMR, here light females and light males, respectively, have the second-highest FMR. In summary, the FMRs are generally higher within dark-skinned subgroups, and the FMR within dark-skinned females is higher than within dark-skinned males.

5.2.3. FMR across Different Demographic Groups

Table 4b shows the FMRs across demographic groups. That is, subjects from one demographic group are only compared to subjects from other demographic groups and not to their own group. Since the two subjects being compared do not belong to the same demographic group and thus do not share certain characteristics (gender, skin color), it is expected that the FMR is lower compared to the average value of 0.1% [18, 41]. This is indeed true in most cases. When comparing subjects of different genders, no model has an FMR above 0.1%; the best model is Magface, and the worst model is Curricularface. When comparing different skin colors, the same pattern emerges: no model has an FMR of more than 0.1%, Magface is the best model, and Curricularface is the worst model. Comparing the FMRs across gender with those across skin color, the values are very similar in magnitude. Only Casia distinguishes skin color significantly better than gender.

Looking at the FMRs between the subgroups, it is noticeable that especially the FMRs between dark females and dark males are relatively high. Only three of the 13 models have an FMR of less than 0.1% when comparing dark females and dark males: besides COTS, these are Magface and, only just below the threshold, Webface600k. This observation aligns with the findings of Kolberg et al. [54], where the error rates across dark-skinned subgroups were also significantly higher than for other demographic subgroups.

In the comparison between dark females and light females, the FMR of 12 of the 13 models is below 0.1%; only Curricularface performs worse in this category, while Magface again performs best. The FMRs between dark females and light males are relatively low; Magface distinguishes best, and Curricularface distinguishes worst. When comparing dark males with light females, the same picture emerges: COTS distinguishes best, followed by Magface, and Curricularface distinguishes worst.

When comparing dark males and light males, the FMR values are slightly higher for most models than for the comparison of dark males and light females. For Mxnet and Cosplus, they are above 0.1%. The best differentiator is again Magface.

The last comparison is between the light female and light male groups. No model has an FMR of more than 0.1%. The best model is Casia, and the worst model is Mxnet.

Table 5 lists the GARBE fairness scores for all different models and demographics.

For each model and metric, α was varied to include only FNMR fairness (α = 0), only FMR fairness (α = 1), and both equally combined (α = 0.5). The GARBE is 0 for a very fair system and 1 for a very unfair system.
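As a compact sketch, the scores in Table 5 follow from applying the Gini formulation of Equation (3) to the per-group error rates (function names are ours):

```python
import numpy as np

def gini(rates):
    """Sample-corrected Gini coefficient over per-group error rates."""
    x = np.asarray(rates, dtype=float)
    n = len(x)
    pairwise = np.abs(x[:, None] - x[None, :]).sum()
    return (n / (n - 1)) * pairwise / (2 * n**2 * x.mean())

def garbe(fmrs, fnmrs, alpha=0.5):
    """GARBE: alpha-weighted sum of FMR- and FNMR-based Gini coefficients."""
    return alpha * gini(fmrs) + (1 - alpha) * gini(fnmrs)

# alpha = 1.0 considers only FMR fairness, alpha = 0.0 only FNMR fairness.
```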

5.3. Subgroups

Finally, we consider the fairness values with respect to the subgroups. The values can also be found in Table 5. As for the categories gender and skin color, the Casia model has the best fairness values with regard to FNMR; according to the GARBE, the Arc model is the least fair. The fact that Casia is the fairest model with respect to the GARBE for FNMR in all three considered categories could again be related to the fact that Casia has generally higher FNMR values than all other models. This lowers the chance of a high ratio between the considered values (FNMR-male and FNMR-female, etc.) and thus creates a fairer impression than if the considered FNMR values were low. According to the GARBE, Mxnet provides the best fairness between subgroups in terms of FMR, while Curricularface is the least fair model. Looking at both error rates combined (α = 0.5), the fairest model with respect to the GARBE is Casia, and the least fair model is Curricularface.

5.4. Summary—Individual Algorithms

We observe that the GARBE, as mentioned earlier in the discussion of the metrics, is a fairness metric with good predictive power with respect to the equal treatment of different demographic groups. In contrast to other metrics, e.g., the IR or the FDR, the GARBE considers all error values, which becomes especially relevant when assessing the fairness between subgroups, since four demographic groups have to be considered there instead of two. However, the results also reveal a weakness of the GARBE, since it does not fulfill FFMC.4: when a system performs equally badly for all demographic groups, its GARBE fairness score is better than that of other systems (cf. the af_casia GARBE scores).

5.5. Fusion Results

In the following evaluation of the fusions, we again only consider the GARBE fairness metric. Additionally, only the FMR fairness (α = 1) is evaluated, as the scope of this paper does not allow for a further analysis of the FNMR fairness values. The combined fairness of FMR and FNMR is also left out; for these values, an additional complication is that the influence of the initial error rates is not directly comprehensible, since one does not know to what extent the FMR or the FNMR is causal for the result. Table 6 shows the fusion results when selecting the candidates based on their FMR performance, Table 7 when selecting the candidates based on their GARBE scores, and Table 8 when selecting the candidates based on their Pareto efficiency. Figure 5 plots the Pareto efficiency for all tested systems based on the particular demographics. The systems forming the Pareto curve toward the lower left corner are Pareto efficient and marked in green. These systems are then used for fusion and further evaluation.

5.5.1. Skin Color

Arcplus and Magface were chosen with the intention of aligning the two FMR values within light-skinned and dark-skinned subjects. Arcplus has the best FMR value for light–light comparisons, while Magface has the best FMR value for dark–dark comparisons. Consequently, we expect a better fairness score for the demographic characteristic skin color from a fusion of these two models. In Table 6a, we compare the new FMR values for dark–dark and light–light comparisons of the AND-, OR-, and Score-fusions with the values of the baseline models.

For the AND-fusion, the FMR value for dark–dark comparisons improves compared to the baseline models, and the FMR value for light–light comparisons also improves compared to both baseline models. Since these two values lie relatively closer to each other than the values of the initial models, the GARBE measure also improves over the fairness scores of both Magface and Arcplus. Thus, this fusion was able to improve fairness with respect to the considered fairness score. The OR-fusion and the Score-fusion improve the fairness value as well. Based on these values, the selection criterion for the initial models seems reasonable; furthermore, the Score-fusion has the best effect on fairness. The other values (FMR and FNMR within genders and subgroups) are not compared, since they were irrelevant for the selection of the initial models, and their inclusion and discussion would exceed the scope of this work.

For the GARBE-based selection, the candidates are the Webface600k, MS1MV3, and Mxnet models. The results of this fusion are shown in Table 7a. As previously observed, the AND-fusion lowers the FMR across all demographic groups, while the FNMR increases, as expected. The FMR between dark-skinned subjects and the FMR between light-skinned subjects both decrease. The resulting GARBE of the AND-fusion is not better than the GARBE of Webface600k, but it is better than the GARBE of MS1MV3 and Mxnet; an improvement over all initial models could not be achieved with the AND-fusion. In contrast to the AND-fusion, the OR-fusion increases the FMR across all groups, while the FNMR improves, as expected. The FMRs between dark-skinned and between light-skinned subjects increase compared to all baseline models, and the GARBE deteriorates compared to all baseline models. The Majority-Vote-fusion improves the FMR, while the FNMR settles between those of the baseline models. The FMR between dark-skinned subjects improves, as does the FMR between light-skinned subjects; nevertheless, the GARBE deteriorates compared to the baseline models. The Score-fusion can again use the fixed FMR of 0.1%. The FNMR across all groups improves. The FMR among dark-skinned subjects deteriorates, and among light-skinned subjects it lies between the scores of the initial models. Thus, the Score-fusion achieves a GARBE value that is only better than the fairness value of Mxnet but worse than those of MS1MV3 and Webface600k.

None of the tested fusions could improve on the fairness of Webface600k; all fusions perform worse than the fairest initial model in terms of the GARBE.

For the skin color characteristic, only two models form the Pareto curve, i.e., Webface600k and Magface, as can be seen in Figure 5(a). Since only two models lie on the Pareto curve, only these two are fused, and there is no opportunity for a Majority-Vote-fusion. The results are shown in Table 8a. The AND-fusion again improves all FMR values and worsens the FNMR value across all groups; the GARBE improves. With the OR-fusion, the FNMR value improves compared to the baseline models and the FMRs worsen; the GARBE also improves, this time even further. With the Score-fusion, the FMR across all groups is normalized to 0.1%. The FNMR improves slightly. The FMR between dark-skinned subjects is lower than that of Webface600k but higher than that of Magface; the FMR between light-skinned subjects is lower than that of Magface but higher than that of Webface600k. The GARBE here is lower than that of the baseline models but higher than the GARBE values of the other fusions. The OR-fusion and the AND-fusion are the only systems forming a new Pareto curve and offer a Pareto-efficient trade-off between FNMR and GARBE.

5.5.2. Gender

To improve the fairness between the female and male demographic groups, Cosplus and Magface were chosen: Cosplus has the lowest FMR for females, and Magface has the lowest FMR for males. Table 6b shows the fusion results. The AND-fusion results in the FMRs of females and males being almost equal, which yields a better fairness score for the AND-fusion than the fairness scores of the baseline models Cosplus and Magface. The OR-fusion yields a similar result: the GARBE value is also decreased. The Score-fusion approach can improve the fairness score as well. At least for these two models, the selection criterion appears to be appropriate for improving fairness with each tested fusion.

The three fairest models in the gender category are MS1MV3, Webface600k, and Curricularface. Table 7 summarizes the results. With the AND-fusion, the FMR across all groups improves, while the FNMR across all groups deteriorates. The FMR among females improves, as does the FMR among males. The GARBE over these two values is worse than all GARBE values of the baseline models. With the OR-fusion, the FMR increases, and the FNMR across all groups improves. The FMR within female subjects increases, as does the FMR within male subjects. The GARBE improves compared to all baseline models. The Majority-Vote-fusion again lowers the FMR among all demographic groups, and the FNMR lies between the FNMRs of the baseline models. The FMR among female subjects drops, as does the FMR among male subjects. The GARBE fails to improve over all baseline models. With the Score-fusion, the FNMR improves. The FMR among female subjects lies between those of the baseline models, while that of the male subjects deteriorates. Thus, the GARBE worsens compared to the baseline models.

For the gender characteristic, MS1MV3, Webface600k, and Magface form the Pareto curve, as can be seen in Figure 5(b), and are selected for fusion. The results of the fusion are summarized in Table 8b. The AND-fusion again improves the FMRs, both within subjects of all groups and within the specific groups, female and male. In return, the FNMR across all groups worsens. For the AND-fusion, the GARBE over the FMR among females and the FMR among males is fairer than that of Magface but still more unfair than those of MS1MV3 and Webface600k. With the OR-fusion, again, the opposite effect can be seen: the FNMR across all groups decreases, and the FMRs increase compared to all initial models. The GARBE is slightly worse than the GARBE of the AND-fusion. The Majority-Vote-fusion also behaves like the previous Majority-Vote-fusions: the FMRs decrease across all groups and in the male and female demographic groups, while the FNMR lies within the range of the baseline models. The GARBE improves compared to the AND- and OR-fusions, but the fusion is still more unfair than the baseline models MS1MV3 and Webface600k. The Score-fusion improves the FNMR across all groups slightly. The FMR within males is lower than those of MS1MV3 and Webface600k but higher than that of Magface. The GARBE is the highest compared to the other fusions; among the baseline models, only Magface is more unfair. The OR-fusion and the Majority-Vote-fusion both form the new Pareto curve and are therefore Pareto efficient.

5.5.3. Demographic Subgroups

In the comparison of fairness between subgroups, the Casia, Arcplus, and Magface models were used as initial models for the fusion. These models were chosen such that, together, they cover the lowest FMR for all subgroups. The results of the fusion are documented in Table 6b. For the AND-fusion, all FMR scores of the subgroups improve compared to all baseline models; nevertheless, the GARBE fairness score of the AND-fusion does not improve on the baseline models. For the OR-fusion, the FMR values deteriorate, as expected, and the fairness score also declines. A deterioration in fairness also occurs with the Majority-Vote-fusion. With the Score-fusion, a GARBE value between those of the baseline models is reached. Fusions of these three models chosen by this selection criterion do not improve the fairness between demographic subgroups compared to the initial models.

The three fairest models in terms of subgroups are Mxnet, Webface600k, and Glint360k. The results of the fusion are summarized in Table 7b. In the AND-fusion, the FMR decreases, while the FNMR between all subjects increases. The FMR of each subgroup decreases, i.e., within dark females, dark males, light females, and light males. The GARBE of the AND-fusion is worse than that of the baseline models. For the OR-fusion, the FMR worsens across all groups, as do the FMRs for each demographic group, while the FNMR improves. The GARBE also improves compared to the baseline models. In the Majority-Vote-fusion, as in the AND-fusion, all FMR values improve, and the FNMR lies between the values of the initial models. The GARBE is worse than that of the baseline models. In the Score-fusion, the FNMR again improves. The FMR of dark females worsens, while the FMRs of dark males, light females, and light males lie between the initial FMRs. The GARBE of the Score-fusion is only better than that of Glint360k but still worse than those of Mxnet and Webface600k.

The models Mxnet, Webface600k, and Magface, which form the Pareto curve concerning the subgroups as can be seen in Figure 5(c), are fused. The results are shown in Table 8. The AND-fusion again reduces all FMRs, while the FNMR within all subjects increases. The GARBE is higher than the GARBE of the initial models, making the AND-fusion more unfair. For the OR-fusion, the FMRs increase within all subjects and within the specific demographic groups; in return, the FNMR decreases. The GARBE also drops in this case, making the OR-fusion fairer than all baseline models. The Majority-Vote-fusion lowers the FMR across all subjects, and the FNMR lies between the values of the baseline models. The FMR among dark females decreases, as does the FMR among light females, while the FMRs within dark males and light males lie between the values of the initial models. The GARBE is lower than that of Magface but still higher than those of Mxnet and Webface600k. The Score-fusion is again standardized to an FMR of 0.1%. The FNMR is slightly worse than the FNMR of Magface. The FMRs within the demographic subgroups lie between those of the baseline models, except for dark females, for which the FMR is worse. The GARBE is fairer than that of Magface but more unfair than those of Mxnet and Webface600k.

When comparing the baseline systems and the other fusions, the OR-fusion is the only Pareto-efficient system.

5.6. Summary—Algorithm Fusion
5.6.1. Effect on Fairness

We conclude that 12 out of 33 fusions improved fairness in terms of the GARBE. Six of these are accounted for by the covariates gender and skin color in combination with the selection criterion of the lowest FMR per considered covariate; in those cases, every fusion improved the fairness of the baseline models. The OR-fusion accounts for two improvements in fairness under the selection criterion of the best fairness values. Three improvements are accounted for by the fusions with the Pareto-curve selection criterion in the case where only two models lie on the Pareto curve. The last one is also accounted for by the OR-fusion with the Pareto-curve selection criterion. The summary is visualized in Table 9.

5.6.2. Effect of the AND-Fusion

In every fusion performed, the AND-fusion led to a reduction of the FMR between all subjects. This means that the overall probability of a false match can be significantly reduced with the AND-fusion. However, the AND-fusion also significantly increases the FNMR between all subjects, i.e., the probability of a false non-match is higher in systems with an AND-fusion.

It could also be shown that the FMR of the individual covariates decreases in each case due to the AND-fusion. More interesting is the effect of the AND-fusion on the GARBE fairness for FMR. In three out of nine cases, the AND-fusion improved fairness relative to all baseline models. This was the case for all fusions using only two models. The AND-fusion improved fairness twice for the covariate skin color and once for gender; no improvement in subgroup fairness was possible with the AND-fusion. The AND-fusion combined with the selection criterion of the lowest FMR per group appears promising when a covariate with only two groups (male and female, or dark- and light-skinned) is considered. In both cases, fairness could be improved.

5.6.3. Effect of the OR-Fusion

The OR-fusion behaves opposite to the AND-fusion with respect to FMR and FNMR. The FMR within all subjects and within each covariate increases significantly compared to the baseline values of the fused models. In turn, the FNMR decreases significantly. This secures the OR-fusion a place on the Pareto curve for every selection criterion. The OR-fusion improves the fairness of the initial models in six out of nine cases. In particular, the OR-fusion could improve the fairness of all combinations created by the selection criterion of the lowest FMR per covariate.

For the selection criterion based on the lowest fairness scores, OR-fusion improved fairness in two out of three cases. For the Pareto curve, fairness was only improved in one case.

5.6.4. Effect of the Majority-Vote-Fusion

The Majority-Vote-fusion could only be applied in six of the nine fusions. The FMR within all subjects could be reduced in every case. The FNMR, on the other hand, always lies between the initial values of the baseline models. The FMRs of the individual covariates are mostly between the initial values and, in some cases, below them.

The fairness could not be improved compared to all baseline models in any case; in some cases, the fairness even worsened compared to all initial models.

5.6.5. Effect of the Score-Fusion

The Score-fusion normalizes the FMR to 0.1%, as in the baseline models. In five out of nine cases, the FNMR was reduced by the Score-fusion; in the other four cases, it is comparable to the FNMR of the best baseline model. The FMR of the individual covariates mostly lies between those of the baseline models, but in some cases it is higher.

In three out of nine cases, fairness could be improved. Fairness was improved specifically in those cases where only two models were fused, mirroring the observation for the AND-fusion. In the other cases, the GARBE lies between or even above that of the baseline models.

6. Conclusions

This work presented a benchmark of 12 open source face recognition systems on a common database. The overall biometric performance as well as the performance for specific demographic groups is evaluated as a baseline to inspect the raw system bias.

The main contribution of this work was to analyze whether fairness can be improved by fusing face recognition models. Since evaluating all possible combinations of models would have been infeasible, three selection criteria for the models to be fused were formulated. The first selection criterion chooses the models with the lowest FMR for their demographic group. The second selection criterion selects the three models with the best fairness in terms of FMR. The last criterion selects the models based on the Pareto curve. For the selection criterion based on the FMR, improvements in fairness were achieved concerning gender and skin color for all applicable types of fusions (AND-, OR-, and Score-fusion). However, fairness between subgroups could not be improved. For the GARBE selection criterion, the fairness could only be improved in two cases: in each case by the OR-fusion, with respect to gender and demographic subgroups. For the Pareto-efficiency criterion, fairness could be improved with the AND-, OR-, and Score-fusion for the skin color demographics, while gender fairness could not be improved; fairness for demographic subgroups could only be improved with the OR-fusion. In addition, we tested whether the fusions were Pareto efficient relative to the baseline models, thus adding better points to the Pareto curve. The OR-fusion is always Pareto efficient, while the AND- and Majority-Vote-fusions were Pareto efficient only in individual cases.

Based on these results, the following trends could be identified. The OR-fusion was most successful in improving fairness, while the Majority-Vote-fusion failed to improve fairness in any case. Fairness was best improved for skin color and gender, while the fairness of demographic subgroups could only be improved in two of 12 cases. The combination of two models seems to give better results regarding fairness than the combination of three models. The selection criterion of the lowest FMR seems to be the most effective for improving fairness.

The question of how the fusions influence the general performance, i.e., the FNMR, must be answered for each individual fusion. The OR-fusion always improves the FNMR at the cost of a worse FMR. The AND-fusion worsens the FNMR in every case but, on the other hand, improves the FMR. The Majority-Vote-fusion mostly achieves an FNMR between those of the initial models, while the FMR can be significantly improved. The Score-fusion is the only one where a fixed FMR of 0.1% can be maintained, with the effect that the FNMR depends on the new threshold and can thus either increase or decrease. Accordingly, the choice of fusion is closely related to the application scenario and whether security or user convenience is preferred.

A general recommendation on how systems should be fused cannot be made from the above trends. This would require a statistical study for each criterion, fusion type, and demographics. However, the trends can be used to examine the different types of fusions, selection criteria, and demographics more closely and individually to avoid a flood of combinations.

Data Availability

The image data used to support the findings of this study were supplied by the University of North Carolina Wilmington under license and cannot be made freely available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research work has been funded by the German Federal Ministry of Education and Research and the Hessian Ministry of Higher Education, Research, Science and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. Open Access funding enabled and organized by Projekt DEAL.