Abstract

This paper develops a machine learning and deep learning-based real-time framework for detecting and recognizing human faces in closed-circuit television (CCTV) images. A traditional CCTV system requires a human for 24/7 monitoring, which is costly and inefficient. Automatic recognition of faces in CCTV images with minimal human intervention and reduced cost can help many organizations, such as law enforcement, in identifying suspects, missing persons, and people entering restricted areas. However, image-based recognition faces many issues, such as scaling, rotation, cluttered backgrounds, and variation in light intensity. We develop a CCTV image-based human face recognition system using different techniques for feature extraction and face recognition. The proposed system includes image acquisition from CCTV, image preprocessing, face detection, localization, and extraction from the acquired images, and recognition. We use two feature extraction approaches, principal component analysis (PCA) and convolutional neural network (CNN) features, and compare the performance of the K-nearest neighbor (KNN), decision tree, random forest, and CNN classifiers. Recognition is evaluated on a dataset of more than 40K real-time images acquired under different settings, such as light level, rotation, and scaling, for simulation and performance evaluation. Finally, we recognize faces with minimal computing time and an accuracy of more than 90%.

1. Introduction

Today’s organizations face significant security challenges; they need many specially trained personnel to achieve the required level of security. However, humans make mistakes that affect safety. Closed-circuit television (CCTV) is currently used for various purposes in everyday life. The development of video surveillance has transformed simple passive monitoring into an integrated intelligent control system. Face detection has found new applications in secure access control, financial transactions, and similar areas. Biometric systems (based on faces, palms, and fingerprints) have recently gained new importance, and with advances in microelectronics and vision systems, biometrics has become economically viable. Facial recognition is an essential part of biometrics: fundamental human traits are matched against stored data. Facial features are extracted and processed using an efficient algorithm, and variations are introduced to improve on existing algorithmic models. Computer-based face recognition can be applied to a variety of applications, including criminal identification, security systems, and authentication. A facial recognition system typically involves a face detection step, in which the face in the input image is located, followed by image processing that cleans the face image for easier recognition.

In the modern age, face recognition has become a necessity, as the need to identify individuals grows daily with globalization. Over the last two decades, face recognition has received much attention because of its wide range of applications in image analysis and understanding. Face recognition is also becoming important in other fields such as image processing, animation [1], security [2], human-computer interfaces [3], and medicine [4]. Face recognition is natural, noninvasive, and easy to use, and it has a wide range of applications in public safety, entertainment, attendance management, and financial payment. While today’s facial recognition systems work well in relatively controlled environments, they suffer from significant problems when used in existing surveillance systems due to low image resolution, background clutter, lighting variations, and variations in pose and expression.

Face recognition systems consist of three steps: image preprocessing, feature extraction, and classification for recognition [5]. Features extracted from the face, such as the mouth, nose, and eyebrows, are geometric features. The detected and processed face is compared to a database of known faces to determine who the person is. A conventional surveillance system needs people to monitor it, and human monitoring involves reliability issues, scalability issues, and the inability to identify everyone.

Facial occlusions, such as beards and accessories (glasses, hats, and masks), complicate the evaluation of facial recognition systems, making the task diverse and challenging in a nonsimulated environment. Another essential factor is the variety of expressions of the same identity: macro- and micro-expressions appear on a person’s face as the emotional state changes, and this variety of expressions makes effective recognition difficult. An ideal face recognition system should tolerate changes in lighting, expression, pose, and occlusion and should scale to many users while requiring the fewest captured images per user.

The overall contributions of this paper can be summarized as follows:
(i) a machine learning-based framework for detecting and recognizing faces in CCTV images with various cluttered backgrounds and occlusions,
(ii) a dataset of 40K images covering different environmental conditions, cluttered backgrounds, and occlusions, and
(iii) a performance comparison of classical machine learning and deep learning algorithms for face recognition in CCTV images.

The rest of the paper is organized as follows: Section 2 briefly introduces the related works. Section 3 explains the methodology, and the results are discussed in Section 4. Finally, we conclude the paper in Section 5.

2. Related Works

In this section, we briefly introduce related works on face detection and recognition using classical approaches and deep learning.

2.1. Face Detection Algorithms
2.1.1. Geometric Methods for Face Detection

In the early stages of computer vision, researchers explored many algorithms that extracted image characteristics and used geometric constraints to understand the arrangement of the features. This was partly due to very limited computational resources: the reduction of information achieved by feature extraction is what made computer vision feasible on early computers [6, 7].

2.1.2. Template-Based Face Detection [8]

Most face detection algorithms are model-based; they encode facial images directly on the basis of pixel intensity. These facial images are mostly characterized by probabilistic models, neural networks, or other mechanisms. The parameters of these models are adjusted automatically from sample images or set manually.

2.1.3. Simple Templates

If a skin-color-based method is used and other skin-colored regions appear in the image (such as arms and hands), these algorithms produce false detections. Many researchers have tried to overcome this by using simple templates to integrate the results of skin-color matching. These templates have ranged from simple ovals fitted to the edge image of the input to correlation templates for skin-colored regions and facial regions (such as lips, hands, or eyes). Such techniques can enhance the robustness of color-based detectors while also improving their speed.

2.2. Face Recognition Algorithms

Face recognition is a technique that has attained considerable attention in machine learning and artificial intelligence. It plays an essential role in many social security applications. Many studies and approaches are currently being investigated to solve the face recognition problem. Vivek and Guddeti [9] proposed combining cat swarm optimization (CSO), particle swarm optimization (PSO), and a genetic algorithm (GA); this hybrid technique has inspired much similar work. Ali et al. combined SVM, higher-order spectra (HOS), and the random transformation (RT) [10].

2.2.1. Iterative Closest Point-Based Alignment

The alignment approach based on the iterative closest point (ICP) [11, 12] determines the translation and rotation parameters iteratively in order to transform one point cloud onto another. The mean square error between the point clouds becomes minimal when the two clouds are aligned. The distance between the point clouds is thus reduced to a minimum by translating and rotating one of the point clouds with respect to the other; for every point in the first cloud, the distance to the second cloud is determined, and the average of all distances is calculated. An important disadvantage of ICP-based alignment is that it needs an initial coarse alignment for convergence. Another disadvantage is that the approach is computationally very expensive.

2.2.2. Simulated Annealing-Based Alignment

Simulated annealing is a stochastic algorithm used for local search [13]. The difference between hill climbing and simulated annealing is that the latter may accept a solution worse than the current one during the iteration process. Because simulated annealing is not trapped by local minima, it is more likely to find a good solution. Simulated annealing requires six parameters (three for translation and three for rotation with respect to a 3D coordinate system), which define the transformation matrix used to align two 3D faces. This approach aligns face images in three phases: (1) initial alignment, (2) coarse alignment, and (3) final alignment [14]. Initially, the centers of mass of the two faces are aligned. The approach then minimizes a coarse measure that uses the M-estimator sample consensus (MSAC) together with the mean square error between corresponding points of the two faces being compared. An accurate alignment is then obtained by means of a simulated-annealing-based search algorithm that uses the surface interpenetration measure (SIM) as its evaluation criterion. The disadvantage of simulated-annealing-based alignment is its long computation time, which is comparable to that of the iterative-closest-point-based alignment.

2.2.3. Average-Based Face Model

This alignment approach is based on an average face model [15]. First, reference points are located on the face automatically or manually. Subsequently, the average of the landmark coordinates is calculated, followed by Procrustes analysis, and the transformed landmarks [16] are averaged again to obtain the face model. In this method, the probe face image is aligned with the average model using iterative-closest-point alignment. A notable weakness of alignment based on the average face model is its low precision [17] and the loss of part of the spatial information during the creation of the average face model.

The first step in face recognition is preprocessing. Images taken from a camera or in real-time video surveillance setups may suffer from various degradations during capture, transformation, conversion, or compression [18]. For instance, blurry, noisy, and low-resolution images affect the face recognition process. Such issues can pose significant challenges to a face recognition scheme and decrease its performance. Therefore, preprocessing is an essential step in any face recognition system. Many color normalization, statistical, and convolutional methods are used as preprocessing tools [19]. Another major problem in face recognition with surveillance cameras is that many images of a person are collected, and applying a face recognition algorithm to each of them is costly in terms of processing and energy consumption. Vignesh et al. [20] presented an image quality assessment (IQA) technique using a CNN to select the best image of a person. Tudavekar et al. [21] proposed video inpainting based on the dual-tree complex wavelet transform to fill missing regions in a video.

PCA is among the most widely used techniques in signal and image processing. Its orthogonal basis vectors, also known as eigenfaces, aid face recognition. Drume and Jalal proposed a two-level classification technique that uses principal component analysis (PCA) at level one and boosts its results with a support vector machine (SVM) at level two [22]. Kanade employed image processing techniques to extract 16 facial parameters based on ratios of distances, angles, and areas and used Euclidean distance to achieve a performance of 75% [23]. On this basis, the eigenface method for face recognition was proposed for the first time [24]. This method is built on principal component analysis (PCA). From then on, PCA gathered a great deal of attention and became one of the most effective approaches for face recognition. Many improvements have been made to the PCA algorithm to obtain the best results [25–30].

Rala used PCA and kernel PCA for feature extraction and face recognition, respectively, exploring nonlinear kernel functions to improve PCA [31]. Abdullah et al. optimized the time complexity of PCA without affecting its performance [32]. Another approach uses hexagonal feature detection, which works on the principle of edge detection [33]. A part-based method in [34] utilizes PCA, NMF, ICA, LDA, and related techniques under partial occlusion. Another effective algorithm, called AFMC, produces more accurate results at reduced computational cost and proposes eliminating the SSS problem [35]. A Viola–Jones-based algorithm was also presented that smooths invalid regions and excludes near-ear regions [36].

Deep hidden identity feature (DeepID), a face representation based on CNNs, is suggested in [37]. Unlike DeepFace, which learns features from a single large CNN, DeepID learns features from an ensemble of small CNNs that are utilized for network fusion. Similarly, a face recognition pipeline, WebFace, is proposed in [38], which uses a CNN to learn the face representation. The convolutional neural network (CNN) [39] has been one of the most prominent approaches in computer vision over the last decade, with applications including image classification [40], object identification [41], and face recognition [38]. Different methods, such as PCA-based eigenfaces [42] and LDA-based Fisherfaces [43], employ the nearest neighbor (NN) classifier and its variants [44]. Supervised classifiers such as support vector machines (SVM) [45] and neural networks [46] have also been proposed for face recognition systems. Huang et al. [47, 48] developed a novel learning technique for single hidden layer feedforward networks (SLFNs) called the extreme learning machine (ELM), which can be utilized in regression and classification applications [42, 49–51]. Yang et al. [52] proposed a reinforcement-based deep learning algorithm for multirobot path planning. Table 1 summarizes the literature review.

3. Proposed Framework for Face Detection and Recognition in CCTV Images

The proposed method consists of four major steps: (i) image acquisition, (ii) image enhancement, (iii) face detection, and (iv) face recognition, as shown in Figure 1. For the recognition step, we applied different machine learning techniques, including random forest, decision tree, K-nearest neighbor (KNN), and convolutional neural network (CNN).

3.1. Image Acquisition

In this phase, we acquire an image. Images must first be retrieved from the source (usually a hardware camera), making acquisition the first step in the workflow, because no processing is possible without an image. Our CCTV camera constantly captures images, which serve as the input to preprocessing.

3.1.1. Camera Interfacing

An Internet protocol (IP) camera, Hikvision DS-2CD2T85FWD-15/18, is used for image acquisition. It is an 8-megapixel camera that captures video at 15 frames per second with a resolution of 1248 × 720. First, the camera captures the image, which is then saved and accessed using a software tool such as MATLAB. Table 2 shows the specifications of the CCTV camera used for image acquisition.
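As a concrete illustration only (the paper reports accessing the images with a tool such as MATLAB), the following Python/OpenCV sketch shows how frames could be grabbed from an IP camera stream; the RTSP URL, credentials, and stream path are hypothetical.

```python
import cv2

# Hypothetical RTSP address of the IP camera; the actual URL, credentials,
# and stream path depend on the camera's configuration.
RTSP_URL = "rtsp://user:password@192.168.1.64:554/Streaming/Channels/101"

cap = cv2.VideoCapture(RTSP_URL)
if not cap.isOpened():
    raise RuntimeError("Could not connect to the CCTV camera stream")

ok, frame = cap.read()                    # grab a single frame (BGR image) from the stream
if ok:
    cv2.imwrite("frame_0001.jpg", frame)  # store the frame for later preprocessing
cap.release()
```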

The face database includes the faces of the people whom the system will recognize. Because facial recognition relies on classification algorithms, each image in the dataset is labeled, and all images of a given person’s face share a unique label. We have more than 41,320 images of 90 people, so the class (person) labels range from 1 to 90, and each label has multiple images. The dataset description is given below.

For example, label 1 has approximately 775 images, and the counts for the other labels are displayed similarly in the figure (classes on the x-axis and number of images on the y-axis). Figure 2 shows sample images from the dataset.

3.2. Preprocessing

After image acquisition, preprocessing prepares the image for further handling. Preprocessing includes two main steps: grayscale conversion and edge detection.

3.2.1. Grayscale Conversion

From the camera, we acquire an RGB image (R for red, G for green, and B for blue). An RGB pixel combines a red, a green, and a blue component. RGB images make computation expensive: each channel uses 8 bits per pixel, so an RGB pixel requires 24 bits, whereas each pixel of a grayscale image is a single scalar stored in 8 bits. The RGB image is therefore converted to grayscale as a weighted combination of the R, G, and B pixel values.
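A commonly used weighting for this conversion is the ITU-R BT.601 luminance formula shown below (the same weighting used by MATLAB's rgb2gray); whether these exact coefficients were used in this work is an assumption.

```latex
\mathrm{Gray} = 0.299\,R + 0.587\,G + 0.114\,B
```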

3.2.2. Canny Edge Detection

The Canny filter detects edges by finding abrupt intensity changes in an image. We use it to enhance the edges of the images: the better the edges are enhanced, the more accurately facial features can be recognized. The filter combines Gaussian and Sobel filtering. First, a Gaussian filter with a predefined standard deviation is applied to the grayscale image to smooth it before edge finding.

In the second step, the Sobel filter is applied to find the edges in the image. One kernel responds to horizontal edges and another to vertical edges, and the horizontal and vertical responses are combined to obtain all the edges in the image.
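The filter kernels themselves are not shown above; assuming the standard formulation, the 3 × 3 Sobel kernels for the two directions and the combined gradient magnitude are

```latex
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}, \qquad
G = \sqrt{G_x^{2} + G_y^{2}}
```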

In the third and last step of the Canny edge detector, hysteresis thresholding is applied to the image containing the edges.

The maximum and minimum thresholds are selected first. If a pixel’s value is greater than the maximum threshold, the pixel is set to 1; if its value is less than the minimum threshold, it is set to 0; and if its value lies between the two thresholds, it remains unchanged. Finally, the detected edges are added to the original image to obtain the enhanced image. This makes the detection and extraction of facial features easier and increases the efficiency of the overall system.
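A minimal Python/OpenCV sketch of this preprocessing stage (grayscale conversion, Canny edge detection, and edge enhancement) is given below; the Gaussian kernel size and the hysteresis thresholds are illustrative values, not the ones used in the paper.

```python
import cv2

def enhance_edges(bgr_image, low_thresh=100, high_thresh=200):
    """Grayscale conversion, Canny edge detection, and edge enhancement.

    The smoothing parameters and thresholds are illustrative; the exact
    values used in the paper are not reported.
    """
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)   # color image -> 8-bit grayscale
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)        # Gaussian smoothing before edge finding
    edges = cv2.Canny(blurred, low_thresh, high_thresh)  # Sobel gradients + hysteresis thresholding
    return cv2.add(gray, edges)                          # add detected edges back onto the image
```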

3.3. Face Detection

The next step, after acquiring and enhancing the image, is to detect the face using the Viola–Jones algorithm, which distinguishes face regions from nonface regions. The face region is then extracted for further processing.

3.3.1. Face Detection Using Viola–Jones Algorithm

The Viola–Jones algorithm was the first algorithm to provide competitive object detection rates in real time. It is robust, achieves high detection rates, and is suitable for real-time applications, as it can process about two frames per second. After detection, different classification techniques are used to recognize the face. The main steps are the following:
(1) Haar feature computation
(2) Integral image
(3) AdaBoost training
(4) Cascading classifiers

3.3.2. ROI Extraction and Resizing

The face detected by the Viola–Jones technique is extracted and resized to a 40 × 40 image, which is then used by the feature extraction techniques to compute the features.
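The sketch below illustrates this detection and ROI-extraction step using OpenCV's Haar cascade implementation of the Viola–Jones detector; the cascade file and the detection parameters are assumptions rather than values reported in the paper.

```python
import cv2

# Haar cascade shipped with OpenCV; an implementation of the Viola-Jones detector.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

def extract_faces(gray_image, size=(40, 40)):
    """Detect faces, crop each region of interest, and resize it to 40 x 40."""
    faces = face_cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    rois = []
    for (x, y, w, h) in faces:
        roi = gray_image[y:y + h, x:x + w]   # crop the detected face region
        rois.append(cv2.resize(roi, size))   # resize the ROI for feature extraction
    return rois
```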

3.4. Feature Extraction from Detected Face Images

We use the principal component analysis (PCA) technique to extract features of the face, which are used to recognize the face in later steps.

3.4.1. PCA-Based Facial Feature Extraction

PCA is a technique used to reduce the dimensionality of the images in our dataset. It captures the characteristics of the images, namely the difference and variance of the pixel values in one column relative to the others [58]. PCA consists of the following steps, as shown in Figure 3:
(1) Mean of each column. We first calculate the mean value of each column of the image matrix, μ_i = (1/k) Σ_j x_ji, where μ_i is the mean of the i-th column and k is the number of rows.
(2) Covariance matrix. The second step is calculating the covariance matrix. The covariance between columns i and j of the original image matrix is C_ij = (1/(k − 1)) Σ_m (x_mi − μ_i)(x_mj − μ_j).
(3) Eigenvalues. After the covariance matrix is calculated, its eigenvalues λ are obtained from det(C − λI) = 0.
(4) Eigenvectors. Using the eigenvalues calculated in the previous step, the corresponding eigenvectors v are found from (C − λI)v = 0.
The resulting eigenvectors are the features of the extracted face and are used for recognition.
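A minimal NumPy sketch of these four steps, assuming 40 × 40 face crops flattened to 1600-dimensional row vectors, is given below; the variable names and the choice of eigendecomposition routine are illustrative, and the 5/10/15-component settings follow the experiments reported later.

```python
import numpy as np

def pca_features(images, n_components=5):
    """PCA over a stack of flattened 40x40 face images.

    images: array of shape (n_samples, 1600).
    Returns the column means, the leading eigenvectors, and the projections
    of the training images onto those eigenvectors (the face features).
    """
    X = images.astype(np.float64)
    mean = X.mean(axis=0)                   # step 1: mean of each column
    centered = X - mean
    cov = np.cov(centered, rowvar=False)    # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # steps 3-4: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]       # keep the largest-variance components
    components = eigvecs[:, order[:n_components]]
    return mean, components, centered @ components

# A new face is projected with the same mean and components before matching:
# features = (new_face.flatten() - mean) @ components
```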

3.5. Face Recognition Using Machine Learning Algorithms
3.5.1. Random Forest

Random forest is a machine learning approach for solving classification and regression problems. It makes use of ensemble learning, a technique for solving difficult problems by combining many classifiers. A random forest is made up of many decision trees, and the resulting “forest” is trained via bagging (bootstrap aggregation). Bagging is an ensemble meta-algorithm that improves accuracy by aggregating the predictions of models trained on bootstrap samples of the data.

3.5.2. Decision Tree

The decision tree is a nonparametric supervised learning approach for classification and regression. The objective is to learn simple decision rules from the data features in order to construct a model that predicts the value of a target variable. It is a flowchart-like tree structure in which each internal node represents an attribute test, each branch indicates the outcome of the test, and each leaf (terminal) node carries a class label.
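For illustration, a short scikit-learn sketch of both tree-based classifiers (decision tree and random forest) applied to PCA feature vectors is given below; the placeholder data, split ratio, and hyperparameters are assumptions rather than the settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the PCA feature vectors (5, 10, or 15
# eigen-coefficients per face) and the 90 person labels described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 90, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(n_estimators=100, random_state=0))]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```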

3.5.3. K-Nearest Neighbor

We use 5, 10, and 15 eigenvectors as our features. The dataset is built from these feature vectors, and every new face image passes through all the steps of PCA. We then calculate the distance between its features and those of the other images in the dataset, and the nearest one is our prediction. We use the Manhattan distance, as it proved more accurate in our experiments. The Manhattan distance between a dataset instance z and a test image b is

d(z, b) = Σ_i |z_i − b_i|.

We then check which instance in the dataset has the minimum distance to the test image; that instance is our prediction.
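A minimal NumPy sketch of this nearest-neighbor matching with the Manhattan distance (shown here for K = 1) follows; how ties and larger values of K are handled in the paper's implementation is not reported.

```python
import numpy as np

def predict_1nn(train_features, train_labels, test_feature):
    """Predict the label of a test face as that of its nearest training face (L1 distance)."""
    distances = np.abs(train_features - test_feature).sum(axis=1)  # Manhattan distance to every training face
    return train_labels[np.argmin(distances)]                      # label of the closest face
```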

3.6. Face Recognition Using Convolutional Neural Network

A convolutional neural network consists of convolutional layers, pooling layers, and, at the end, a fully connected layer. A CNN has a very different architecture from a simple neural network: it has an input layer, convolutional layers, max-pooling layers, and, at the end, a fully connected network, as shown in Figure 4.

We used the Adam optimizer to optimize the weights during training.
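For illustration only, a small Keras CNN of the general form described above is sketched below; the layer sizes, filter counts, and number of convolutional blocks are assumed values, as the exact architecture is not reported, and the 10 output classes follow the CNN experiments described in the conclusion.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40, 40, 1)),              # 40 x 40 grayscale face crops
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),        # 10 person classes
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```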

3.6.1. Adam Optimizer

The optimizer is controlled by the following quantities: α, the learning rate (set to 0.001); g_t, the gradient at time step t; m_t, the exponential moving average of the gradient; s_t, the exponential moving average of the squared gradient; and β_1, β_2, the decay hyperparameters.
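The update equations themselves are not reproduced above; written with the symbols defined here, the standard Adam update rules of Kingma and Ba are

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
s_t = \beta_2 s_{t-1} + (1 - \beta_2)\, g_t^{2},
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad
\hat{s}_t = \frac{s_t}{1 - \beta_2^{t}},
\theta_{t+1} = \theta_t - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{s}_t} + \epsilon}
```

where θ_t denotes the network weights at step t and ε is a small constant added for numerical stability.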

4. Results and Discussion

When we apply PCA, we obtain eigenvectors, and these eigenvectors are our features. We experimented with different numbers of features, using 5, 10, and 15 eigenvectors.

4.1. K-Nearest Neighbor (KNN) Algorithm Results

Results obtained by simulating different values of k are as shown in Table 3.

Figure 5 shows the results obtained with 5 eigenvectors, with a maximum accuracy of 94.7%. As the value of K increases, the accuracy decreases. For K = 1, we obtain approximately 95% accuracy with the Manhattan distance and 89% with the Euclidean distance.

Figure 6 shows the results for PCA features with 10 coefficients. With 10 eigenvectors, we obtained a maximum accuracy of 93.7% with the Manhattan distance and 87.6% with the Euclidean distance. The accuracy again decreased as the value of K increased, and the Manhattan distance again performed better than the Euclidean distance. Moreover, as the number of eigenvectors increases, the accuracy decreases, because the leading eigenvectors carry most of the feature importance.

Figure 7 shows the results for PCA features with 15 coefficients. The same trends hold: accuracy decreases as the number of features increases and as the value of K increases.

4.2. Decision Tree Result

For the decision tree, the results obtained with different numbers of features are given in tabular form in Table 4 and in graphical form in Figure 8.

4.3. Random Forest Results

The random forest achieves its highest accuracy of 93.20% with 5 eigenvectors, as shown in Table 5 and Figure 9.

4.4. CNN Results

For the CNN, the model must first be trained on our dataset. We trained it for 5000 steps and obtained 95.7% accuracy with only 30 images for testing and 30 for training.

4.4.1. With 50% Training and Testing Data

We obtained a maximum accuracy of 95.67% with a 50/50 split of training and testing data, training for 4000 steps. During training, the accuracy increased at some steps and decreased at others, but in the end we reached a maximum accuracy of 95.67%, as shown in Figure 10.

4.4.2. With 90% Training and 10% Testing Data

With 90% training and 10% testing data, we obtained 95% accuracy, possibly because the testing data is much smaller than the training data. This accuracy was reached within 300 steps, as shown in Figure 11.

4.4.3. With 80% Training and 20% Testing Data

With 80% training and 20% testing data, we obtained 97.5% accuracy, which may again be because the testing data is much smaller than the training data. The model was trained for 5000 steps, as shown in Figure 12.

5. Conclusion

In this work, we have developed a framework for automatic face recognition from CCTV images using different machine learning algorithms. One of the objectives of this work was to collect more than 40,000 face images and to compare the performance of the algorithms in order to obtain the highest recognition accuracy. We implemented different algorithms and obtained the highest accuracy with the CNN. The CNN is much more reliable than PCA combined with DT, RF, or KNN: KNN is a lazy algorithm that checks every instance in the dataset for each prediction, whereas the CNN recognizes a face in very little time using its trained model. Another reason is that we used 41,320 images covering 90 classes for PCA, whereas for the CNN we used ten classes with 30 images per class, and we still obtained good accuracy compared to PCA. In total, we collected more than 41,320 images. We plan to enhance this system into a complete security system. Currently, we recognize a single face per image; our next step is to recognize multiple faces in live-streaming video.

Data Availability

The data are available with the first author and will be provided on request for research purposes.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.