Abstract

In recent years, various studies have been conducted on providing real-time services based on face recognition in Internet of Things (IoT) environments such as smart homes. In particular, for face recognition in a network-based surveillance camera environment, the size of the image information to be transmitted varies with the communication capabilities, which can significantly change the performance or utility of face recognition technology. In this paper, we propose a multiresolution face recognition method that trains on virtual face images generated for each distance, in order to solve the problem of low recognition rates caused by changes in communication conditions, camera, and distance. Face images for each virtual distance are generated from a single high-resolution face image through clarity measurement and image degradation at each resolution. The proposed method achieved an accuracy 5.9% higher than methods using MPCA and SVM when LDA and the Euclidean distance were employed on a DB composed of faces acquired in real environments at five different distances.

1. Introduction

The Internet of Things (IoT) is an intelligent framework that enables communication between people and things, or among things themselves, over Internet networks. With recent advances in technologies such as high-speed networks, large-capacity data transmission, and wired and wireless sensor networks, not only IoT itself but also related application technologies have been actively developed. Owing to developments in hardware technology, video surveillance systems using high-resolution cameras are now commonly used. However, in the case of an intelligent surveillance system, data transmission is limited by the high computational cost of real-time analysis. With the spread of high-speed communication technology such as 5G, high-resolution surveillance cameras have been applied to remote face recognition. Along with the recent increase in the availability of high-resolution cameras, there has also been an increase in the use of image monitoring systems based on such cameras to improve face recognition performance. In face recognition systems, faces are recognized using high-resolution face images acquired at close range, and such images are an important factor in guaranteeing effective recognition performance. However, outside of access control systems, it is difficult for face recognition systems to acquire high-resolution face images at close range because the distance between the camera and the person varies. Regardless of the camera's performance, the resolution of a face captured through the camera inevitably suffers if the distance between the camera and the person becomes too large. In other words, even if high-resolution cameras are employed in face recognition systems, the face recognition performance cannot be guaranteed for long-range, low-resolution images [1].

Many studies exist regarding various face recognition methods, but issues remain, such as the processing speed of recovering low-resolution face images, reliance on learning data, and reduced recognition rates according to the changes in distance that occur in actual face recognition situations. Although some studies are limited to a single environment, such as a high- or low-resolution face environment, no existing face recognition studies are based on environments that include changes in resolution. In other words, there is a need for multiresolution face recognition technology that is robust to changes in resolution resulting from camera performance or changes in distance [2]. This paper proposes a face recognition system that learns faces at virtual resolutions in order to resolve the issue of reduced recognition rates resulting from the changes in face image resolution that occur as the distance changes. Face images for each virtual distance are generated from a single high-resolution face image through clarity measurement and image degradation at each resolution. Clarity is measured using a blur metric, and Gaussian blurring is applied for image degradation. The performance of the proposed method is analyzed on a face DB composed of five different real resolutions. According to the test results, the proposed method achieved a performance 5.9% better than methods using MPCA and SVM when LDA and the Euclidean distance were employed. When the standard image size was set to 30×30, the average size of all face images, under the same conditions using LDA and the Euclidean distance, the average performance improved by 40.7% at the low resolutions of 16×16, 12×12, and 10×10.

This paper is organized as follows. Section 2 reviews related work together with the structure of a network-based surveillance system. Section 3 explains face recognition using the proposed virtual multiresolution face images, Section 4 analyzes the experimental results, and Section 5 presents conclusions.

2. Related Works

In an IoT environment, network camera-based face recognition depends heavily on the camera, because recognition performance varies with the quality of the obtained image. Recently, UHD-class high-resolution cameras have become commonplace thanks to rapid advances in camera performance. However, in network camera-based surveillance systems, practical use is limited by the transmission speed. Therefore, in this paper, we propose a face recognition technology applicable to various resolutions. Figure 1 shows the structure of a network-based surveillance system using intelligent surveillance technology combined with IoT in a network environment. First, the images obtained from the camera are transmitted to the network-based surveillance system for event analysis. The surveillance system decides whether an event has occurred by comparing the image received from the camera with the images stored on the server. Finally, when an event occurs, an alarm is sent to the user, enabling them to recognize the situation, and the event image is stored on the network server.

2.1. Video Surveillance System

A video surveillance system transmits the image information captured by a camera to a receiver so that a specific area of interest can be monitored based on the transmitted images. Video surveillance systems are widely used for various applications in the fields of security, industry, vehicle surveillance, and traffic management. Recently, the availability of such systems has grown sharply owing to the development of communication technologies such as IoT, 4G, and 5G. In addition, advances in IT have transformed the analog environment based on VCR storage devices into a digital one that uses DVR-based image compression and digital transmission technology. It has since evolved into an intelligent video security technology that combines IP-based networks, using broadband networks and open protocols, with automatic image analysis and recognition. CCTV cameras can be divided into dome, box, IR, PTZ, and IP cameras depending on their use and shape. Considering camera performance is as important as selecting the right camera in a video surveillance system; that is, the use of the surveillance system is determined by the quality of the image obtained through the camera. High-resolution, high-quality images contain considerable information, which is a very important factor in intelligent surveillance systems based on image processing. Even when an image is captured from the same location, in the case of an HD-class image the information of an object can be confirmed intuitively without additional image processing, whereas at normal image quality it is often difficult to understand the object's information directly. That is, the higher the image resolution, the clearer and more detailed the image. However, higher camera performance comes at a higher price, and the image data become larger, resulting in longer processing times.

2.2. Low-Resolution Face Recognition Technology

Recently, studies have been conducted on face recognition technology using low-resolution images [3]. One such face recognition method recognizes faces by using super-resolution recovery, which converts low-resolution images into high-resolution images [4-6]. The super-resolution recovery method achieves a superior high-resolution recovery performance visually, but it is unable to achieve a satisfactory performance when it comes to face recognition. Because this method relies on learning data, it has the disadvantage of requiring large quantities of learning data, and processing times increase as the amount of data increases. There are also studies on a method that exploits structural characteristics through the mapping and learning of pairs of low- and high-resolution images generated from the same face [3, 7, 8]. This method using structural characteristics has an advantage over the super-resolution recovery method, owing to its lower computational complexity. However, the recognition rate varies significantly according to undefined values such as changes in lighting, distance, facial expression, other external variables, and mapping-related weights. Face recognition using a Pan-Tilt-Zoom (PTZ) camera boasts an extremely high face recognition rate because it can solve the fundamental issue of image deterioration in long-range imaging [9, 10]. However, because the camera used in this method is extremely expensive, it cannot be employed for general purposes.

3. Proposed Algorithms for Multiresolution Face Recognition

Among face recognition methods that depend on learning, approaches to improving the recognition rate include configuring the test data so that the most similar faces can be identified from the training data, and increasing the amount of training data. The former approach uses face images extracted at various real resolutions. However, because this approach acquires face images while the user is moving, it requires considerable cooperation from users. This paper proposes a method that generates face images at various resolutions without requiring such cooperation.

Figure 2 illustrates the proposed face recognition method, in which faces are recognized in multiresolution environments by generating multiple low-resolution face images from a single high-resolution face image. Virtual low-resolution face images are scaled to a set reference face size. Bilinear interpolation is used for scaling, and histogram equalization is applied to adjust the lighting. Faces for testing were not generated virtually; rather, images at five real resolutions were employed. The testing data were normalized in face size, and an image blur value was imposed for image degradation. Real multiresolution faces for testing were acquired while varying the distance between the camera and the person from 1 m to 5 m. To recognize faces, LDA was used to extract features from the training and testing data, and the Euclidean distance was employed to measure the similarity [11, 12].
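As a rough illustration of this preprocessing, the sketch below (in Python with OpenCV) scales a grayscale face crop to the 30×30 reference size using bilinear interpolation and equalizes its histogram. The function name, file name, and fixed reference size are illustrative assumptions, not details fixed by the paper.

```python
import cv2

REF_SIZE = (30, 30)  # assumed reference face size: the average of 50x50 and 10x10

def preprocess_face(face_gray):
    """Scale a grayscale face crop to the reference size and normalize lighting."""
    # Bilinear interpolation, as used in the proposed pipeline
    resized = cv2.resize(face_gray, REF_SIZE, interpolation=cv2.INTER_LINEAR)
    # Histogram equalization to reduce illumination differences
    return cv2.equalizeHist(resized)

# Usage (hypothetical file name):
# face = cv2.imread("face_crop.png", cv2.IMREAD_GRAYSCALE)
# probe = preprocess_face(face)
```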

3.1. Generating Training Faces Using a Single Image per Person

In order to minimize the required computation and use the smallest number of samples, this paper proposes a method that employs a single high-resolution face image to automatically generate multiple face images for learning. Figure 3 illustrates the process of generating face images at virtual resolutions from one high-resolution face image. One high-resolution face image is input and then reduced to fit the standard size for each resolution. The size of the original image that is reduced for each resolution is based on a real face image, detected as the distance between the person and the camera varied in 1 m intervals. The size of the original image in which faces can be detected is 320×240, and the distance between the person and the camera varies from 1 m to 5 m. Images that are reduced to the appropriate size for each resolution are then expanded to the same size as the reference image for face recognition. The definition of an expanded face image is reduced through low-pass filtering until it reaches the target clarity. Equation (1) below defines the method for generating virtual face images from a high-resolution face image:

$$X_i = \big( h \ast D_{r_i}(I) \big) \uparrow s_i, \quad i = 1, \ldots, 5, \tag{1}$$

where $X_i$ is the face image at the $i$th virtual size that is used for learning, and $\uparrow s_i$ denotes expansion by the ratio $s_i$ for scaling to the standard image size for acquiring features.

Furthermore, $D_{r_i}(I)$ represents the high-resolution image $I$ reduced to fit the face size at each resolution. Here, $h$ indicates the blur kernel, $\ast$ represents the convolution operation, and $\uparrow$ represents the scaling. The high resolution is 50×50, while the low resolution is 10×10. Finally, $r_i$ represents the scaling ratio for generating face images at each resolution. The reduction ratio for face images at each resolution is acquired from the average face image size at each real resolution. Assuming that there are five face images in the training data for each candidate, $X = \{X_1, X_2, \ldots, X_5\}$ is the group configured through the learning data for each candidate.
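A minimal sketch of this generation step, assuming OpenCV and the per-distance face widths reported in Section 4 (50, 25, 16, 12, and 10 pixels). The paper degrades each image iteratively to a target clarity (Section 3.2); for brevity, this sketch applies a single Gaussian blur with an assumed sigma.

```python
import cv2

FACE_WIDTHS = [50, 25, 16, 12, 10]  # average face widths observed at 1 m ... 5 m
REF_SIZE = (30, 30)                 # standard size for feature extraction

def virtual_training_faces(face_hr, sigma=1.0):
    """Generate one virtual training face per distance from a single
    high-resolution face image, following the structure of Eq. (1)."""
    faces = []
    for w in FACE_WIDTHS:
        # D_{r_i}: reduce the high-resolution face to the size at this distance
        small = cv2.resize(face_hr, (w, w), interpolation=cv2.INTER_LINEAR)
        # h * (.): Gaussian low-pass filtering to degrade the image
        blurred = cv2.GaussianBlur(small, (3, 3), sigma)
        # (.) scaled up by s_i: expand back to the common reference size
        faces.append(cv2.resize(blurred, REF_SIZE, interpolation=cv2.INTER_LINEAR))
    return faces
```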

Figure 4 presents the results of comparing face images from each distance with virtual multiresolution face images. Figure 4(a) shows the high-resolution face image extracted from a distance of 1 m, and Figure 4(b) shows the virtual face images generated using the proposed method. Figure 4(d) shows the face images at each resolution extracted at the corresponding real distances, and Figure 4(c) shows the images from Figure 4(d) after being normalized to fit the standard image size of 30×30. The results indicate that the automatically generated virtual face images are more similar to the real face images at each distance than the normalized image obtained at the fixed distance of 1 m.

3.2. Image Degradation and Reference Face Size

Existing low-resolution face recognition technology employs a method that recovers face images to high resolution to improve the face recognition rate. In this paper, the definition of a high-resolution face image is instead reduced through low-pass filtering, which is the opposite of the existing approach. The process of increasing similarity by lowering the definition not only requires less computation than existing high-resolution recovery technology but also does not require a reference image. To minimize the differences between the automatically generated multiresolution training face images and the real low-resolution testing face images, the clarity of the two images was taken into account.

The blur value of the input image, derived as in (2), is measured and compared to the standard blur value for each resolution:

$$b = \Phi(X_i), \tag{2}$$

where $\Phi$ represents the clarity measurement method, for which the blur metric method [13], which does not employ a reference image, is used. If the blur value $b$ of the current input image is greater than the standard value $b_{\mathrm{ref}}$, then the definition is reduced through low-pass filtering with the blur kernel $h$. This process is repeated until the value becomes lower than the standard blur value. The weight $\sigma$ of the Gaussian blur kernel in (3) is used to reduce the definition of the generated image:

$$h(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{x^2 + y^2}{2\sigma^2} \right). \tag{3}$$

The complexity of the algorithm is generally important when considering the processing speed of an image processing technique, but this speed is also influenced by the resolution of the processed image. Face recognition can be performed smoothly at 400,000 pixels, but once an image exceeds 1 million pixels, real-time processing cannot be guaranteed. The proposed multiresolution face recognition method adopts the average of the high-resolution and low-resolution image sizes as the standard size for the images input for face recognition, instead of using the size of a high-resolution face image as the standard, as in existing techniques. In other words, the sizes of all face images used for training and testing are scaled to the same standard size. The average face size used for face recognition is calculated through the following:

$$S = \frac{W_H + W_L}{2}, \tag{4}$$

where $S$ is the standard size that all face images are normalized to, $W_H$ is the width of the highest face image resolution, and $W_L$ is the width of the lowest resolution.
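A hedged sketch of this degradation loop: the paper measures clarity with the no-reference blur metric of [13], but the sketch below substitutes the variance of the Laplacian as a simple sharpness proxy (higher values mean sharper images). The target value, kernel size, and sigma are illustrative assumptions.

```python
import cv2

def sharpness(img):
    """Clarity proxy: variance of the Laplacian (higher = sharper). The paper
    instead uses the no-reference blur metric of [13]."""
    return cv2.Laplacian(img, cv2.CV_64F).var()

def degrade_to_target(img, target, sigma=0.8, max_iters=20):
    """Low-pass filter the image until its clarity drops below the
    per-resolution reference value, mirroring the loop around Eqs. (2)-(3)."""
    out = img.copy()
    for _ in range(max_iters):
        if sharpness(out) <= target:
            break
        out = cv2.GaussianBlur(out, (3, 3), sigma)
    return out
```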

3.3. Illumination Normalization

In this paper, histogram equalization was used to normalize the brightness of faces. A histogram is a means of assessing the distribution of illumination, which is an important element of an image's data. A histogram is constructed by taking the illumination values from 0 to 255 as indices and accumulating, for each pixel in the image, the frequency in the table entry corresponding to that pixel's illumination value. Histogram equalization alters an image's histogram so that it is spread evenly across the entire gray-scale range. This method does not merely stretch the image's histogram, but changes the histogram distribution based on the characteristics of the original image's histogram. The function that changes the image's pixel values is defined in (5):

$$s = T(r), \tag{5}$$

where $r$ is the input gray level and $s$ is the gray level output through the conversion function $T$. In general, the conversion function $T$ is assumed to be a monotonically increasing function, and histogram equalization can also be described in the form of a monotonically increasing function. Based on the above, the histograms of the input and output images can be represented through the probability density functions $p_r(r)$ and $p_s(s)$, respectively [14, 15]. The conversion function of histogram equalization is defined in a form corresponding to the accumulated values of $p_r$, and is expressed through (6):

$$s = T(r) = \int_0^r p_r(w)\, dw, \tag{6}$$

where $w$ is a dummy variable for the integration. The function defined by integrating the probability density function is called the cumulative distribution function.
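The discrete form of Eqs. (5) and (6) can be written directly: the sketch below builds the gray-level histogram, forms the cumulative distribution function, and applies it as the conversion function T. It assumes an 8-bit grayscale NumPy image; the function name is illustrative.

```python
import numpy as np

def equalize(img):
    """Histogram equalization on an 8-bit grayscale image: each gray level r is
    mapped through the cumulative distribution function, as in Eqs. (5)-(6)."""
    hist = np.bincount(img.ravel(), minlength=256)  # histogram of gray levels
    p = hist / hist.sum()                           # probability density p_r(r)
    cdf = np.cumsum(p)                              # T(r): cumulative distribution
    lut = np.round(255 * cdf).astype(np.uint8)      # discrete conversion function
    return lut[img]                                 # s = T(r) applied per pixel
```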

3.4. Feature Extraction

PCA has several inherent limitations. The biggest is that, although it is good at summarizing data, it is ineffective at separating classes. Discrimination between subjects is important because face recognition aims to distinguish them. Therefore, we need to know whether changes in a face image occur because the subject changes or because of changes in illumination or facial expression. LDA separates different components into groups and discriminates between changes in face components and changes resulting from other factors [10]. LDA consists of a linear transformation that maximizes the ratio of the between-class scatter matrix to the within-class scatter matrix while reducing the dimension of the data's feature vectors. The between-class scatter matrix $S_B$ and within-class scatter matrix $S_W$ can be written as follows:

$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \tag{7}$$

$$S_W = \sum_{i=1}^{C} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T, \tag{8}$$

where $C$ is the number of classes, $\mu_i$ is the mean image of each class, $\mu$ is the mean image of all images, and $N_i$ is the number of images of the $i$th class. Therefore, the transformation $W$ for which the between-class scatter becomes maximal and the within-class scatter becomes minimal can be described as below:

$$W_{\mathrm{opt}} = \arg\max_{W} \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}. \tag{9}$$

If the eigenvectors and eigenvalues calculated from (7) and (8) are applied to (9), the optimal eigenvectors can be calculated, which express the degree of discrimination between groups. This process is repeated whenever a new image is entered to calculate its weights, and face recognition is performed by comparing the weights of the images in the database with those of the new image. LDA is widely used in face recognition research because it clearly discriminates the features of each group.
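A minimal sketch of this feature extraction step, using scikit-learn's LinearDiscriminantAnalysis in place of a hand-rolled eigendecomposition; the variable names and the flattening of face images into vectors are assumptions about the data layout.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_lda(train_faces, labels):
    """Fit LDA on flattened training faces (e.g., the five virtual-resolution
    images per person) so that projections maximize the between-class to
    within-class scatter ratio of Eqs. (7)-(9)."""
    X = np.asarray([f.ravel() for f in train_faces], dtype=np.float64)
    lda = LinearDiscriminantAnalysis()
    return lda.fit(X, np.asarray(labels))

# Features for a preprocessed probe face (hypothetical variable names):
# probe_feat = lda.transform(probe.ravel().reshape(1, -1))
```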

3.5. Similarity Measurement

In order to verify the recognition rates of the long-range face recognition system, the similarities between feature matrices were compared to calculate the recognition rates. For two feature matrices $A = (a_{ij})$ and $B = (b_{ij})$, their similarity can be calculated using the Euclidean distance:

$$d(A, B) = \sqrt{\sum_i \sum_j \left( a_{ij} - b_{ij} \right)^2}. \tag{10}$$

After converting a given verification image into a feature matrix $F_t$, its distance to the feature matrices of all the training data is measured. The feature matrix with the smallest distance, i.e., the greatest similarity, is determined as shown in (11), where $N$ is the total number of learning images, and the class of this feature matrix is assigned as the final class of the verification image:

$$k^* = \arg\min_{1 \le k \le N} d(F_t, F_k). \tag{11}$$

The final face recognition rate is the proportion of verification images assigned the correct class out of the total number of verification images $T$, as shown in (12), where $T_c$ is the number of correctly classified verification images:

$$R = \frac{T_c}{T} \times 100. \tag{12}$$
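A short sketch of the matching and scoring steps of Eqs. (10)-(12), assuming the gallery (training) features are stored as rows of a NumPy array; the function and variable names are illustrative.

```python
import numpy as np

def classify(probe_feat, gallery_feats, gallery_labels):
    """Assign the probe the class of the gallery feature vector at the smallest
    Euclidean distance, as in Eqs. (10)-(11)."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return gallery_labels[int(np.argmin(dists))]

def recognition_rate(predicted, actual):
    """Percentage of verification images assigned the correct class, Eq. (12)."""
    return 100.0 * float(np.mean(np.asarray(predicted) == np.asarray(actual)))
```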

4. Experimental Results

In traditional face recognition experiments, a face DB such as the Yale DB [16], MIT Face DB [17], or FERET DB [18] is normally used. Existing face DBs include external variations such as lighting, head pose, and changes in facial expression, but they do not consider changes in face image resolution according to changes in distance. However, there is a difference between real low-resolution face images extracted from long distances and low-resolution face images obtained by shrinking a high-resolution image to the same size. In other words, the high-resolution face images from existing DBs are used after simply being reduced in size, and such images are not suitable for analyzing face recognition performance at real multiresolutions. The ETRI face DB used in this paper is composed of face images of 10 candidates from a u-Robot test bed environment [19]. Table 1 illustrates the composition of the ETRI face DB used in this experiment. The images for each candidate include changes in lighting and distance.

Changes in lighting are achieved using indoor lighting, and the distance varies from 1 m to 5 m. The sizes of the original face images were configured to 50×50, 25×25, 16×16, 12×12, and 10×10 using a high-resolution image. In this paper, face recognition was performed using a 1:N search method instead of a 1:1 authentication method, and the most similar face image out of all those stored in the DB was selected through the image verification results.

In this experiment, the DB was configured by manually extracting face areas, assuming that all faces would be detected from the input images regardless of the distance. When faces are extracted manually, face areas can be extracted with higher precision than with an automatic face detection method. In addition, the original images were used as they were, without compensating for any twisting or turning of the extracted faces.

The performance of the proposed method was analyzed by comparing it with PCA, LDA, and MPCA [20], as well as CMs [5] and CLPMs [6], which recognize low-resolution faces by considering structural features. The experiments were all conducted under the same conditions, and only one original face image was used in training for each candidate. There were 30 verification images for each distance, giving 150 images per candidate and a total of 1,500 verification images for all candidates; no learning images were included among the verification images. Figure 5 presents the change in the face recognition rate when virtual face images are used for training. In the experimental results, when a single sample was used, all methods achieved an excellent performance on high-resolution faces. However, as the resolution decreased, significant differences in performance appeared. In terms of the average recognition rate over all resolutions, the proposed method using LDA achieved the highest performance, at 86.8%. When the size of the standard image was based on the average resolution (30×30) instead of the high resolution (50×50), the average face recognition performance improved by 40.7% even at the low resolutions of 16×16, 12×12, and 10×10. When face recognition was performed using a single sample at multiple resolutions with face images generated for each virtual resolution, LDA performed better than PCA. During feature extraction, using the average size of all resolutions improved the general recognition rate more than using the high-resolution face size as the standard. Moreover, on account of the issues that occur when the number of pixels in an image is larger than the number of images, LDA achieved a better performance than existing techniques such as CLPMs and MPCA.

5. Conclusions

In camera-based face recognition, face images can be acquired at various resolutions, from close-range high-resolution images to long-range low-resolution images. If high-resolution images are converted into low-resolution images for face recognition, the face recognition performance may suffer on account of a loss of data and a difference between the resolution of the face images used for training and that of the verification images. This paper has proposed a method that recognizes faces by generating face images for each virtual resolution, in order to resolve the issue of reduced recognition rates resulting from changes in resolution caused by changes in the distance between the camera and the subject. The proposed method uses one high-resolution image to automatically generate face images at each resolution for use in training. The face image size is normalized to the average size of all faces and used for face recognition. The proposed method achieved the highest performance of the tested methods, with an average recognition rate of 86.8%, when LDA and the Euclidean distance were used. Moreover, when the standard image size was set to 30×30, the average image size, the performance improved by an average of 37.1% compared with a standard image size of 50×50 under the same experimental conditions.

Data Availability

The ETRI DB data used to support the findings of this study are held by the Electronics and Telecommunications Research Institute under license and so cannot be made publicly available.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (no. 2018R1A2B6001984) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2017R1A6A1A03015496).