Abstract

Owing to its usability, practical applications, and lack of intrusiveness, face recognition technology, which identifies individuals from information derived from their facial features, has recently attracted considerable attention. The recognition rates reported for commercial face recognition systems cannot be accepted as official figures, because they rest on assumptions favorable to the specific system and face database. Performance evaluation methods and tools are therefore necessary to measure the accuracy and performance of any face recognition system objectively. In this paper, we propose and formalize a performance evaluation model for biometric recognition systems and implement an evaluation tool for face recognition systems based on the proposed model. Furthermore, by formalizing the performance test process, the model provides guidelines for the design and implementation of a performance evaluation system and allows evaluations to be performed objectively.

1. Introduction

Face recognition systems offer the benefit of collecting a large amount of biometric information in a relatively easy and cost-effective manner, because subjects are not required to deliberately bring any part of their body into contact with the recognition device, which reduces resistance and inconvenience during collection. A further advantage is that widely deployed image acquisition equipment can be used without modification. Various face recognition algorithms and commercial systems have been developed and proposed, and the marketability of face recognition systems has grown, with many immigration-related facilities, such as airports and seaports in many countries, eager to introduce face recognition systems after the 9/11 terrorist attacks in the US. These benefits, together with the perceived need for increased security, have led to rising social demand for face recognition systems, and certified performance evaluation has become important as a means of evaluating them.

This paper proposes a performance evaluation model (PEM) to evaluate the performance of biometric recognition systems, and designs and implements a performance evaluation tool that enables comparison and evaluation of face recognition systems based on the proposed PEM. The PEM is designed to be compatible with the related international standards, which contributes to the consistency and reliability of any performance evaluation tool developed with reference to the model.

Section 2 outlines existing studies related to performance evaluation of face recognition systems. Section 3 proposes a PEM to evaluate the performance of face recognition systems. Section 4 describes the design and implementation of the performance evaluation tool, based on the proposed PEM. Section 5 compares the performance evaluation method, utilizing the performance evaluation tool proposed in this paper, with existing performance evaluation programs. The last section provides a conclusion, and suggests some remarks for future research.

2. Related Work

The representative certified performance evaluation programs for face recognition systems are FacE REcognition Technology (FERET) and the Face Recognition Vendor Test (FRVT). In use since 1993, FERET encompasses not only the performance evaluation of face recognition systems but also the development of algorithms and the collection of a face recognition database. Headed by the U.S. Department of Defense (USDoD), FERET was executed systematically from 1993 through 1997, testing changes in the environment (e.g., size, posture, background, etc.), differences in the time at which pictures are taken, and algorithm performance on a large-volume database. The FERET evaluations were general evaluations designed to measure algorithms at the research-laboratory level, and their major purpose was to track the latest face recognition technologies and assess their flexibility. Therefore, the FERET test neither clearly measures the influence of individual components on algorithm performance nor assesses performance in a fully scientific manner under all operating conditions of a system [13].

The Face Recognition Vendor Test (FRVT) is a performance test for face recognition systems that built on the three FERET performance evaluations (1994, 1995, and 1996). The FERET program introduced evaluation methodology to the face recognition field and developed the field at its earliest stage (system prototype development). As face recognition technology matured from the prototype level to the commercial system level, FRVT 2000 measured the performance of these commercial systems and evaluated how far the technology had evolved through comparison with the last FERET evaluation. The public began to pay more attention to face recognition technology in 2002; as a consequence, FRVT 2002 measured the degree of technical development since 2000, evaluated performance on the large databases then in use, and introduced new experiments to better understand face recognition performance. The size, difficulty, and complexity of these performance evaluations rose as evaluation theory and face recognition technology grew. For example, FERET SEP96 performed just 14.5 million comparisons over a period of 72 hours, and FRVT 2000 carried out 192 million comparisons in 72 hours, whereas FRVT 2002 made 15 billion comparisons in 264 hours [46].

Certified performance evaluation programs such as FERET and FRVT were designed to measure the algorithm accuracy of face recognition systems. In these projects, a common face image database was provided for the test, face recognition was performed for a certain period of time according to the respective method, and the results were evaluated. However, this approach evaluates only the face recognition technology vendors that participated in the program during the evaluation period. In addition, the database attributes were limited to image size, subject posture, image acquisition environment, and time, so the varied conditions required for algorithm evaluation could not be covered flexibly. Moreover, the algorithm evaluation environment was set up by each of the face recognition system developers, creating inconsistency among performance evaluation environments, and additional work was required to determine the accuracy of each algorithm and to re-analyze the execution results. Therefore, it is necessary to design an algorithm evaluation method that resolves these problems, builds a standardized evaluation environment, and automatically computes the evaluation results of any algorithm measured in that environment.

2.1. Factors Affecting Performance Evaluation

Results from the performance evaluation of facial recognition systems vary with factors such as lighting, posture, facial expression, and elapsed time. The JTC 1/SC 37/WG 5 international standard [7] classifies the factors affecting the performance of biometric systems.

As outlined in Table 1, there are a number of factors affecting the performance of facial recognition systems, and such factors must be prudently taken into consideration when the probe and gallery are selected during the performance evaluation of these systems. Algorithm performance evaluation of biometric recognition technology is conducted in such a way that the standard gallery is trained or registered, and is then compared with the test biometric information (probe) to be recognized, after which the similarities between the two sets of information are measured.

Generally, basic factors such as posture, angle, facial expression, lighting brightness, and gender are considered in the construction of a face image database. However, records of the face image database need to be further subdivided in order to process the test under conditions similar to those prevalent in the real world.

The research facial database that was developed by Korea Information Security Agency (KISA) from 2002 to 2004 was used for performance evaluation [8]. Table 2 shows a classification of KISA’s database.

3. Performance Evaluation Model (PEM)

Many factors must be considered when building a fair performance evaluation system for biometric recognition systems. For example, to evaluate the performance of face recognition systems, a database of facial information for use in face recognition should be collected, and performance evaluation items (changes in facial expression, lighting, etc.) as well as performance evaluation measurement criteria, such as the false acceptance rate (FAR), false rejection rate (FRR), and equal error rate (EER), should be selected. A standardized interface between the face recognition system to be evaluated and the performance evaluation system should be designed, and international standards need to be applied at each stage of performance evaluation in order to enhance fairness and reliability.

Thus, the performance evaluation model (PEM) is created to analyze and arrange the criteria to be considered in building up the performance evaluation system, and to support the development of the performance evaluation system. The PEM presents the basic system structure, guidelines, and development process used to build a system for performance evaluation.

The PEM proposed in this paper is designed (1) to evaluate the performance of the biometric recognition algorithm, and (2) to build a system that automatically evaluates performance and outputs the results in tandem with the biometric recognition system.

3.1. Structure of the PEM

The PEM structure for the system that evaluates the biometric recognition algorithm is composed of a data preparation module, an execution module, and a result analysis module, as shown in Figure 1.

3.1.1. Data Preparation Module

The data preparation module prepares the biometric information used for performance evaluation, for which the development of a biometric information database and the design of the test criteria are the major elements to consider. As the biometric information database used affects the evaluation reliability of the performance evaluation system to a large extent, it should be considered a priority at the initial stage of system development. In addition, the biometric information used for performance evaluation should never be exposed, so that evaluation reliability can be improved [9].

As described in Section 2.1, algorithm performance evaluation proceeds by training or registering the standard gallery and comparing it with the test biometric information (probe) to be recognized, after which the similarity between the two sets of information is measured. Algorithm performance varies according to the environment or conditions under which the information was generated. For example, if an expressionless front-view facial image is registered in the gallery and a smiling facial image photographed at a 15-degree angle from the left is used as the probe, we can compare the strengths of different algorithms with respect to facial expression and angle. In this paper, "test criteria" refers to the items that can affect the performance evaluation result, and the developer of the performance evaluation system should design test criteria suited to the objective of the evaluation. The test criteria selected by the PEM are limited to the classification criteria of the biometric information database.
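
To make the notion of test criteria concrete, the following is a minimal sketch, assuming the database records carry attribute labels such as expression and pose; all attribute names, values, and file paths are illustrative stand-ins, not the schema of the database used in this paper.

```python
# Illustrative sketch: test criteria expressed as attribute conditions used
# to select gallery and probe sets. Names and values are hypothetical.

gallery_criteria = {"expression": "neutral", "pose": "frontal"}
probe_criteria = {"expression": "smile", "pose": "left_15deg"}

def select(database, criteria):
    """Return the records whose attributes satisfy every condition."""
    return [rec for rec in database
            if all(rec.get(key) == value for key, value in criteria.items())]

# Example database records (attribute dictionaries plus an image path).
database = [
    {"id": "S001", "expression": "neutral", "pose": "frontal", "path": "S001_n_f.bmp"},
    {"id": "S001", "expression": "smile", "pose": "left_15deg", "path": "S001_s_l15.bmp"},
]

gallery_set = select(database, gallery_criteria)
probe_set = select(database, probe_criteria)
```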

3.1.2. Execution Module

The execution module activates the biometric recognition system to be evaluated and executes a performance evaluation. There are two methods for establishing an interface between the performance evaluation system and the biometric recognition system. The first is to develop the two systems as independent application programs, while the second is to package the biometric recognition algorithm as a component or library and plug it into the performance evaluation system. The former requires advance agreement between the two systems on the input/output file format, since the input data used for performance evaluation and the evaluation result data are generally transferred in a predefined form (typically XML). Although FRVT 2006 did not use a performance evaluation tool, participating companies submitted their biometric recognition systems as executable files, and the names of the input file used for evaluation and the output file recording the evaluation result were passed as program arguments. For the component (or library) method, a standardized interface should be agreed upon in advance. The agreed interface should be as simple as possible, and compatibility with international standards is desirable; the related international standards include the Biometric Application Programming Interface (BioAPI) 1.1 [10] and BioAPI 2.0.
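
As an illustration of the component/library approach, the sketch below defines a minimal interface that a face recognition module under test might be asked to implement; the method names and the score-threshold decision are assumptions made for illustration and are not the BioAPI signatures.

```python
# Minimal sketch of the component/library integration style described above.
# Method names are illustrative placeholders; a real tool would map them
# onto the agreed standardized interface (e.g., BioAPI).
from abc import ABC, abstractmethod

class FaceRecognitionModule(ABC):
    """Interface the module under test is assumed to implement."""

    @abstractmethod
    def create_template(self, image_bytes: bytes) -> bytes:
        """Build an enrollment template from a gallery image."""

    @abstractmethod
    def match(self, template: bytes, probe_image: bytes) -> float:
        """Return a similarity score between a template and a probe image."""

class ExecutionModule:
    """Drives the module under test over the prepared gallery/probe data."""

    def __init__(self, module: FaceRecognitionModule, threshold: float):
        self.module = module
        self.threshold = threshold

    def run(self, gallery_images, probe_images):
        templates = [self.module.create_template(img) for img in gallery_images]
        results = []
        for probe in probe_images:
            for template in templates:
                score = self.module.match(template, probe)
                results.append((score, score >= self.threshold))
        return results
```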

3.1.3. Result Analysis Module

The result analysis module performs a final analysis of recognition algorithm performance using the result values obtained from the execution module. The performance of a specific algorithm can be expressed using several measurement criteria, and the appropriate measurement factors are chosen according to the objectives of the performance evaluation. Measurement factors can be broadly grouped into error rates and throughput rates. Error rates are derived from matching errors and sample acquisition errors, and the focus is on whether the algorithm works properly and accurately. The throughput rate shows the number of users that the face recognition system can process in a given unit of time; this rate is particularly meaningful when performing verification against a large image database [7].
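
As a small worked example of a throughput figure, the computation below uses comparison counts of the scale cited for FRVT 2000 in Section 2; the numbers are illustrative, not results produced by the tool described in this paper.

```python
# Illustrative throughput computation; the totals are example figures only.
total_comparisons = 192_000_000          # e.g., an FRVT-2000-scale run
elapsed_hours = 72
comparisons_per_second = total_comparisons / (elapsed_hours * 3600)
print(f"{comparisons_per_second:.0f} comparisons per second")  # ~741
```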

3.2. Formalization of PEM

As biometric products come into use in national infrastructure, the need for more effective and objective biometric performance testing is increasing. Although evaluation programs such as FRVT have been announced, no objective and proven methodology has yet been reported. By presenting and formalizing a performance test model for biometric systems, this paper aims, first, to secure efficiency by eliminating unnecessary processes or factors that can arise when evaluating the performance of a biometric recognition system and, second, to guarantee objectivity by deriving credible test factors and processes for a performance test that would otherwise depend entirely on heuristic methodology.

To improve the usability of the model presented in this paper and to validate it, a model-based tool was formulated and used to execute the performance test.

The PEM is defined as a structure $PEM = (DB, ATTR, INT, MET)$ with the following meanings:
(a) $DB$ is the set of all images of the test database;
(b) $ATTR$ is a set of pairs $(fact, value)$ representing classification factors for the probe and gallery sets,
(i) where $fact$ is a set of factors influencing performance,
(ii) $fact = \{age, sex, background, expression, pose, illumination, \ldots\}$;
(c) $INT$ is the interface of the biometric recognition system being tested;
(d) $MET$ is a set of performance metrics.

$DB$ represents all facial images in the image database to be used for face recognition performance evaluation. $ATTR$ represents the factors that affect the performance test; these include age, sex, photographing time, expression, background, posture, lighting, and costume, and they are used when selecting the probe and gallery image sets. The probe image set is $PSET \subseteq DB$, and the gallery image set is $GSET \subseteq DB$. $select(DB, cond)$ is a function selecting the elements of $DB$ that satisfy the condition $cond$. For example, to perform the performance test for smiling men's faces, the probe image set is $PSET = select(DB, sex = male \wedge expression = smile)$, and the gallery image set is, for instance, $GSET = select(DB, sex = male \wedge expression = neutral)$.

When executing the performance test, every image of $GSET$ is matched against each image of $PSET$. That is, the full image matching set $MSET$ is the Cartesian product of $PSET$ and $GSET$:
(a) $MSET = PSET \times GSET$.
$INT$ is the interface of the face recognition module that is the object of the performance test. Through this interface, the performance-testing tool calls the functions of the face recognition module. The interface should be as simple as possible, and compatibility with international standards is desirable. $INT$ should include the matching function $match$, which outputs a matching result (accept or reject) when given one element of the image matching set.

(b) $match: MSET \rightarrow \{accept, reject\}$, such that $match(p_i, g_j)$ returns "accept" or "reject," where $(p_i, g_j) \in MSET$.

The execution result of the face recognition function for the performance evaluation is expressed as the results of calling $match$ with every element of $MSET$ as its argument, and this can be expressed as a two-dimensional matrix, $RMATRIX$. That is, the element value is $RMATRIX[i][j] = match(p_i, g_j)$; when the result value is accept, the element is 1, and when the result value is reject, it is 0. For example, if $PSET = \{p_1, p_2, p_3\}$ and $GSET = \{g_1, g_2, g_3\}$, then $MSET$ becomes $\{(p_1, g_1), (p_1, g_2), (p_1, g_3), (p_2, g_1), \ldots, (p_3, g_3)\}$. If the resulting values of applying each element of $MSET$ to $match$ are 1, 0, 0, 0, 1, 1, 0, 0, 1, consecutively, this can be expressed as the following two-dimensional matrix:

$$RMATRIX = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$

$MET$ is a set of performance test measures such as the fail-to-enroll rate (FTER), fail-to-acquire rate (FTAR), false nonmatch rate (FNMR), and false match rate (FMR). Matching-error metrics such as FNMR and FMR can be calculated using $RMATRIX$:

$$FNMR = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} Same(p_i, g_j)\, EQ(RMATRIX[i][j], 0)}{\sum_{i=1}^{m}\sum_{j=1}^{n} Same(p_i, g_j)}, \qquad FMR = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} AND(RMATRIX[i][j], Diff(p_i, g_j))}{\sum_{i=1}^{m}\sum_{j=1}^{n} Diff(p_i, g_j)},$$

where the following hold:
(i) $m$ is the size of $PSET$;
(ii) $n$ is the size of $GSET$;
(iii) $Same(p_i, g_j)$ is 1 if $p_i$ and $g_j$ are images of the same person, otherwise $Same(p_i, g_j)$ is 0;
(iv) $Diff(p_i, g_j)$ is 0 if $p_i$ and $g_j$ are images of the same person, otherwise $Diff(p_i, g_j)$ is 1;
(v) $AND(x, y)$ is 1 if $x$ and $y$ are both 1, otherwise it is 0;
(vi) $EQ(x, y)$ is 1 if the values of $x$ and $y$ are the same, otherwise it is 0.
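
The following sketch illustrates the formalization under the assumption of a decision-level match function; the subject IDs and the set of accepted pairs are invented solely to reproduce the 3 × 3 RMATRIX example above.

```python
# A minimal sketch of the PEM formalization: MSET as a Cartesian product,
# RMATRIX filled with 0/1 match outcomes, and FNMR/FMR derived from it.
from itertools import product

def fnmr_fmr(pset, gset, rmatrix, same_person):
    """Matching-error metrics derived from RMATRIX as in Section 3.2."""
    genuine = impostor = false_nonmatch = false_match = 0
    for i, p in enumerate(pset):
        for j, g in enumerate(gset):
            if same_person(p, g):
                genuine += 1
                false_nonmatch += 1 - rmatrix[i][j]    # genuine pair rejected
            else:
                impostor += 1
                false_match += rmatrix[i][j]           # impostor pair accepted
    return false_nonmatch / genuine, false_match / impostor

pset = ["A_probe", "B_probe", "C_probe"]               # PSET (illustrative)
gset = ["A_gallery", "B_gallery", "C_gallery"]         # GSET (illustrative)
same_person = lambda p, g: p[0] == g[0]                # Same()/Diff() helper

# Stand-in for the module's match function; the accepted pairs reproduce the
# example outcomes 1,0,0, 0,1,1, 0,0,1.
accepted = {("A_probe", "A_gallery"), ("B_probe", "B_gallery"),
            ("B_probe", "C_gallery"), ("C_probe", "C_gallery")}
match = lambda p, g: "accept" if (p, g) in accepted else "reject"

mset = list(product(pset, gset))                       # MSET = PSET x GSET
outcomes = [1 if match(p, g) == "accept" else 0 for p, g in mset]
rmatrix = [outcomes[i * len(gset):(i + 1) * len(gset)] for i in range(len(pset))]

fnmr, fmr = fnmr_fmr(pset, gset, rmatrix, same_person)  # 0.0 and 1/6 here
```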

3.3. Evaluation System Development Process

The following section describes how to build a performance evaluation system according to the PEM.
(1) Describe the objectives of developing and evaluating a performance evaluation system.
(2) Develop or select the biometric information database that will be used for performance evaluation.
(3) Design test criteria that fit the evaluation objectives.
(4) Determine the type of interface to be used between the performance evaluation system and the biometric recognition system. If it runs as a standalone program, select the input/output file format; if it is linked by the component method, design the component interface.
(5) Select the measurement criteria that fit the evaluation objectives.
(6) Implement the "data preparation module," which reads the biometric information according to the test criteria through the interface with the biometric information database.
(7) Implement the "execution module," which executes the recognition algorithm with the gallery/probe biometric information provided by the data preparation module. The execution result (degree of similarity) should be saved in the database containing the results of performance evaluation.
(8) Calculate the values of the measurement criteria by analyzing the similarities saved in the performance evaluation result database, and implement the "result analysis module," which generates a report on the results of the performance evaluation.

The test items and measurement criteria can also be decided when performing the actual performance evaluation rather than when building the performance evaluation system. In this case, the test items and measurement criteria selectable at steps 3 and 5 above are prepared in advance, and the tester should be able to select the necessary items from among them when running the performance evaluation tool.
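
A hypothetical end-to-end flow following steps (1)-(8) might look like the sketch below; every function and field name is an illustrative placeholder rather than the actual tool's API, the module object is assumed to expose the interface sketched in Section 3.1.2, and a score threshold is assumed for turning similarities into decisions.

```python
# Hypothetical skeleton wiring the data preparation, execution, and result
# analysis modules together; names and threshold are illustrative only.
def select(database, criteria):
    """Data preparation: pick records whose attributes satisfy the criteria."""
    return [rec for rec in database
            if all(rec.get(key) == value for key, value in criteria.items())]

def run_evaluation(database, gallery_criteria, probe_criteria, module, threshold):
    # Steps 3 and 6: prepare gallery/probe sets according to the test criteria.
    gallery = select(database, gallery_criteria)
    probe = select(database, probe_criteria)

    # Step 7: enroll every gallery image and compare every probe image to it.
    templates = [(g["id"], module.create_template(g["image"])) for g in gallery]
    results = []
    for p in probe:
        for gallery_id, template in templates:
            similarity = module.match(template, p["image"])
            results.append({"probe_id": p["id"], "gallery_id": gallery_id,
                            "similarity": similarity})

    # Step 8: derive error rates from the stored similarities for the report.
    genuine = [r for r in results if r["probe_id"] == r["gallery_id"]]
    impostor = [r for r in results if r["probe_id"] != r["gallery_id"]]
    frr = sum(r["similarity"] < threshold for r in genuine) / max(len(genuine), 1)
    far = sum(r["similarity"] >= threshold for r in impostor) / max(len(impostor), 1)
    return {"FRR": frr, "FAR": far, "comparisons": len(results)}
```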

4. Designing and Implementing the Performance Evaluation Tool

The performance evaluation tool was designed and implemented using the PEM proposed by this paper. The following section describes the contents and results by step, according to the evaluation system development process. The purpose of performance evaluation is to identify the technology level of the face recognition system through objective performance evaluation and certification, so as to encourage public trust in face recognition products and enhance their competitiveness.

4.1. Test Criteria

The test criteria are designed so that they do not have to be fixed when the evaluation system is developed; instead, the tester selects them in the course of performing the actual performance evaluation. The test item basically lists all the classification criteria so that the tester can select gallery and probe conditions from them separately. The gallery image set and the probe image set, each of which is composed of several items, together form a "test set." One performance evaluation project can generate several test sets, and each test set can generate a different result report. For instance, even though the facial images registered in the face recognition system are white-front (normal) or purple-front (illum. yellow), the actual images acquired by the image acquisition device to identify the user can be normal, eye-closed, or tilted slightly to the left or right. A test set for such a face recognition system can be configured as in Figure 2.

If we assume that 10 face images exist per category in the above example, there are 20 facial images in the gallery set and 40 facial images in the probe set. Therefore, the template generation function will be invoked 20 times for the gallery images, image processing will be performed 40 times for the probe images, and 800 (20 × 40) comparison operations will be executed when performance evaluation is performed on this test set. Among these, 80 comparisons involve images of the same person: the image of a specific person appears twice in the gallery set and 4 times in the probe set, so 8 same-person comparisons are made per person, for the 10 persons in total. The remaining 720 comparisons are made between images of different persons. Using this method, the number of comparisons can be calculated beforehand, so the tester can estimate in advance the time required for the performance evaluation.
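
The comparison-count estimate above can be reproduced with a few lines of arithmetic; the category counts mirror the example test set, while the per-comparison time used for the runtime estimate is an assumed figure, not a measured one.

```python
# Sketch of the comparison-count estimate worked through above.
images_per_category = 10
gallery_categories = 2   # e.g., white-front (normal), purple-front (illum.)
probe_categories = 4     # e.g., normal, eyes closed, tilted left, tilted right
subjects = 10

gallery_images = gallery_categories * images_per_category       # 20
probe_images = probe_categories * images_per_category           # 40
total_comparisons = gallery_images * probe_images                # 800

# Each subject contributes gallery_categories x probe_categories genuine pairs.
genuine_comparisons = subjects * gallery_categories * probe_categories   # 80
impostor_comparisons = total_comparisons - genuine_comparisons           # 720

# With an assumed per-comparison time, the total evaluation time can be
# estimated before the test is run.
seconds_per_comparison = 0.05                                    # assumption
estimated_runtime_s = total_comparisons * seconds_per_comparison  # 40 s here
```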

4.2. Selected BioAPI

The performance evaluation tool provides a standard interface for the face recognition system, and the examinee provides a face recognition module that satisfies this interface as a dynamic library. The performance evaluation tool is designed to let the tester change the face recognition module at run time, so that evaluation can be repeated with a different face recognition module without modifying the performance evaluation tool.

BioAPI 1.1 was adopted for compatibility with international standards, and only the minimum number of functions required for algorithm performance evaluation was selected, in order to reduce the examinee's development burden. Table 3 shows the BioAPI functions selected for the face recognition module.
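
As a sketch of how the tool might load a vendor module at run time as a dynamic library, the snippet below uses Python's ctypes; the exported entry points shown are placeholders invented for illustration, since the actual functions are the BioAPI subset selected in Table 3.

```python
# Minimal sketch of swapping the face recognition module at run time by
# loading it as a dynamic library. The exported function names below are
# placeholders, not the BioAPI entry points listed in Table 3.
import ctypes

def load_recognition_module(dll_path: str) -> ctypes.CDLL:
    """Load a vendor-supplied face recognition module (e.g., a Windows DLL)."""
    module = ctypes.CDLL(dll_path)
    # Declare argument/result types for the placeholder entry points.
    module.CreateTemplate.argtypes = [ctypes.c_char_p, ctypes.c_size_t,
                                      ctypes.POINTER(ctypes.c_char_p),
                                      ctypes.POINTER(ctypes.c_size_t)]
    module.CreateTemplate.restype = ctypes.c_int
    module.VerifyMatch.argtypes = [ctypes.c_char_p, ctypes.c_size_t,
                                   ctypes.c_char_p, ctypes.c_size_t,
                                   ctypes.POINTER(ctypes.c_double)]
    module.VerifyMatch.restype = ctypes.c_int
    return module

# The tester can point the tool at a different vendor DLL without rebuilding:
# module_a = load_recognition_module("vendor_a_face.dll")
# module_b = load_recognition_module("vendor_b_face.dll")
```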

4.3. Measurement Criteria

The performance evaluation metrics used by FERET and FRVT, as well as the metrics proposed by the JTC 1/SC 37/WG 5 standard [7], were analyzed. Among these metrics, the criteria related to the technology evaluation of face recognition systems were chosen, as shown in Table 4.

4.4. Class Diagram of Metadata

Within the performance evaluation tool implemented in this paper, individual projects created for performance evaluation internally generate metadata in a specific structure in order to save the settings related to performance evaluation and performance evaluation results. The structure of these metadata is as illustrated in Figure 3.

4.4.1. CMetaProject Class

Whenever a new project is created, this class creates an instance in connection with the project. It maintains the project name, the path used to save the project itself, and the path used to access the database, as character strings. This information relates to the project configuration and contains values established by the tester upon the project's creation. In addition, the face image data groups designed by the tester for performance evaluation are held as lists of the CMetaTestSet class. A single project may have at least one group of face image data for its performance evaluation and independently creates a report based on each of these evaluation results; therefore, this information is kept in an extensible list format. Moreover, this class allows users to ascertain the total number of images to test (compare) and the numbers of probe and gallery images.

4.4.2. CMetaCategory Class

This class contains data related to the subcategories of face images. Face images are influenced by the direction in which the picture was taken, the location of the lighting, posture, and so forth. The images are classified according to these conditions, and the tester may choose some of the classified image groups as the test subject group using the performance test tool. The chosen information is saved individually in the CMetaCategory class. These metadata contain variables such as the category name of each criterion, the total number of images in the category, and the number of images that failed to enroll (for gallery items) or to be acquired (for probe items).

4.4.3. CMetaTestItem Class

This class contains the data pertaining to one face image: a unique ID for the photographed subject, the image's location, and the face image items. Additionally, it contains Boolean variables that record whether the item failed to enroll (for gallery items) or failed to be acquired (for probe items).

4.4.4. CMetaTestResult Class

This class stores the verification results of the comparison of one probe item with one gallery item in order to determine whether or not they come from an identical person. It contains each item’s criteria, location, and ID information, along with variables, such as the similarity value and the comparison time created as a result of a comparison between two images.

4.4.5. CMetaEnrolled Class

This class saves the template data created from the image data when the system enrolls a gallery item for face recognition. It holds not only the item's criteria and location information but also a binary buffer to store the template data, as well as fields recording the result of template creation and the time required to create the template.

4.4.6. CMetaTestSet Class

This class includes information about the face image probe group and gallery group used to conduct the test. One project may have several test groups, and each test group individually saves its performance evaluation results. The metadata of a test group contain the following information. First, the class contains lists of CMetaCategory objects holding the face image criteria for each of the probe and gallery groups; that is, each of the probe and gallery groups may include face image groups with several different criteria. In addition, the class maintains a list of CMetaTestItem objects for each of the probe and gallery groups; this is not the criteria information but the metadata of the individual image items used for the test. It also contains a list of CMetaTestResult objects holding the test result between each single probe item and single gallery item.

Furthermore, where the system enrolls a gallery item, this class contains a list of CMetaEnrolled objects that store the template data generated by each face recognition module from the face image data. As general data for these lists, the class contains variables for the test start time, the end time, the total number of probe items, the total number of gallery items, the total number of gallery items that failed to enroll, and the total number of probe items that the system failed to acquire. We developed the six metadata classes and the major modules examined above as a Windows program, using Microsoft's Visual Studio development environment. The face recognition modules, which are the subject of the performance evaluation, work in connection with the performance evaluation tool in the form of a dynamic link library.
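
The metadata structure described above can be summarized in the compact sketch below; the field names are paraphrases of the descriptions in Sections 4.4.1-4.4.6, and the actual tool implements these as C++ classes rather than Python dataclasses.

```python
# Illustrative sketch of the metadata classes; field names are paraphrased
# from the descriptions above, not taken from the tool's source code.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CMetaCategory:
    name: str                      # classification criterion, e.g., "smile"
    total_images: int = 0
    failed_count: int = 0          # failed to enroll (gallery) / acquire (probe)

@dataclass
class CMetaTestItem:
    subject_id: str                # unique ID of the photographed person
    location: str                  # where the image is stored
    failed_to_enroll: bool = False
    failed_to_acquire: bool = False

@dataclass
class CMetaTestResult:
    probe_id: str
    gallery_id: str
    similarity: float
    comparison_time_ms: float

@dataclass
class CMetaEnrolled:
    item_location: str
    template: bytes = b""          # binary template produced at enrollment
    creation_succeeded: bool = False
    creation_time_ms: float = 0.0

@dataclass
class CMetaTestSet:
    probe_categories: List[CMetaCategory] = field(default_factory=list)
    gallery_categories: List[CMetaCategory] = field(default_factory=list)
    probe_items: List[CMetaTestItem] = field(default_factory=list)
    gallery_items: List[CMetaTestItem] = field(default_factory=list)
    results: List[CMetaTestResult] = field(default_factory=list)
    enrolled: List[CMetaEnrolled] = field(default_factory=list)
    start_time: Optional[str] = None
    end_time: Optional[str] = None

@dataclass
class CMetaProject:
    name: str
    project_path: str
    database_path: str
    test_sets: List[CMetaTestSet] = field(default_factory=list)
```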

4.5. Implementing Data Preparation/Execution/Result Analysis Module

The performance evaluation tool was developed as an application program running on Windows OS, and the face recognition module to be evaluated was implemented as a dynamic link library (DLL). The data preparation module that has the function of connecting with the biometric database and of setting the gallery and probe image set was implemented for use in the performance evaluation, as shown in Figure 4.

The face recognition module provided by the vendor was checked to verify that it provides the functions presented in Table 3, and the execution module that performs evaluation was implemented, using the functions of the selected face recognition module. A function that visually displays whether performance evaluation is progressing properly or not was included in the execution module, as shown in Figure 5.

Finally, the values of the evaluation criteria are calculated by analyzing the similarities saved in the performance evaluation result database, and the result analysis module, which generates the performance evaluation result report, is implemented. The performance evaluation tool is also equipped with a function that generates the evaluation result, as well as a function that issues a certificate for the face recognition module, depending on the evaluation result.

5. Comparison of Performance Evaluation Methods

Table 5 shows a comparison made between FERET and FRVT, which are the representative face recognition evaluation cases, and the evaluation method that uses the performance evaluation tool proposed by this paper.

Compared with performance evaluation programs such as FERET and FRVT, the performance evaluation tool proposed by this paper provides the following benefits.
(i) Disclosure of the face image database can be fundamentally prevented.
(ii) Development of face recognition modules that comply with international standards will be encouraged.
(iii) The performance evaluation target can be separated from the performance tester.
(iv) The evaluation cost can be reduced significantly, and individual evaluations can be performed for each vendor.

6. Conclusion

This paper proposed a PEM to evaluate the performance of biometric recognition systems. The proposed PEM is designed for compatibility with the related international standards, thereby contributing to the enhanced consistency and reliability of the performance evaluation tool developed according to this design. The proposed PEM is essential for the following reasons.
(i) It represents a model and development method for the performance evaluation system.
(ii) It applies the related international standards to the performance evaluation system.
(iii) It enhances the consistency and reliability of the performance evaluation system.
(iv) It provides guidelines for the design and implementation of the performance evaluation system by formalizing the performance test process.

In addition, a performance evaluation tool capable of comparing and evaluating the performance of commercialized facial recognition systems was designed and implemented, and an evaluation that executed 800 billion comparisons in 596 hours using the KFDB [8] was conducted. As future work, the certificate issuance criteria for the performance of face recognition systems should be presented systematically, and a method that can promote certification should be prepared.