#### Abstract

Locating the region of interest (ROI) that contains the face is the first step in any automatic face recognition system, and it is a special case of face detection. However, face localization in an input image is challenging because of possible variations in location, scale, pose, occlusion, illumination, facial expression, and cluttered background. In this paper we introduce a new optimized k-means algorithm that finds the optimal centers for each cluster, corresponding to the global minimum of the k-means clustering objective. The method was tested on locating faces in an input image by image segmentation: it separates the input image into two classes, faces and nonfaces. The proposed algorithm is evaluated on the MIT-CBCL, BioID, and Caltech datasets. The results show significant localization accuracy.

#### 1. Introduction

The face detection problem is to determine whether a face exists in the input image and, if so, to return its location. In the face localization problem, by contrast, the input image is known to contain a face and the goal is to determine the location of that face. An automatic face recognition system locates, in its earliest phase, the region of interest (ROI) that contains the face; this localization step is a special case of face detection. Face localization in an input image is challenging because of possible variations in location, scale, pose, occlusion, illumination, facial expression, and cluttered background. Various methods have been proposed to detect and/or localize faces in an input image; however, the performance of localization and detection methods still needs to be improved. A survey of some face recognition and detection techniques can be found in [1]. A more recent survey, mainly on face detection, was written by Yang et al. [2].

One trend in ROI detection is to segment the image pixels into a number of groups or regions based on similarity properties. This idea has gained attention in recognition technologies, which group feature vectors to distinguish between image regions and then concentrate on the particular region that contains the face. One of the earliest surveys on image segmentation techniques was done by Fu and Mui [3]. They classified the techniques into three classes: feature thresholding (clustering), edge detection, and region extraction. A later survey by N. R. Pal and S. K. Pal [4] concentrated only on fuzzy, nonfuzzy, and colour image techniques. Much work has been done on ROI detection, which may be divided into two types: region growing and region clustering. The difference between the two is that region clustering searches for clusters without prior information, while region growing needs initial points, called seeds, to detect the clusters. The main problem in the region growing approach is finding suitable seeds, since the clusters grow from the neighbouring pixels of these seeds based on a specified deviation. For seed selection, Wan and Higgins [5] defined a number of criteria to select insensitive initial points for the subclass of region growing. To reduce the region growing time, Chang and Li [6] proposed a fast region growing method that uses parallel segmentation.

As mentioned before, the region clustering approach searches for clusters without prior information. Pappas and Jayant [7] generalized the K-means algorithm to group the image pixels based on spatial constraints and their intensity variations. The constraints are, first, that the region intensity be close to the data and, second, that the regions have spatial continuity. Their generalized algorithm is adaptive and, like the K-means clustering algorithm, iterative. The spatial constraints are included using a Gibbs random field model, and the authors concluded that each region is characterized by a slowly varying intensity function. Because unimportant features are retained, the algorithm is not very accurate. Later, to improve their method, they proposed using a caricature of the original image to keep only the most significant features and remove the unimportant ones [8]. Besides the time cost of segmentation, three basic problems occur during the clustering process: center redundancy, dead centers, and centers trapped at local minima. Moving K-means (MKM), proposed in [9], can overcome these three problems by minimizing center redundancy and dead centers and by indirectly reducing the effect of centers trapped at local minima. In spite of that, MKM is sensitive to noise, its centers are not located in the middle (centroid) of a group of data, and some members of the centers with the largest fitness are assigned as members of a center with the smallest fitness. To reduce the effects of these problems, Isa et al. [10] proposed three modified versions of the MKM clustering algorithm for image segmentation: fuzzy moving k-means, adaptive moving k-means, and adaptive fuzzy moving k-means. After that, Sulaiman and Isa [11] introduced a new segmentation method based on a new clustering algorithm, the Adaptive Fuzzy k-means Clustering Algorithm (AFKM).

On the other hand, the Fuzzy C-Means (FCM) algorithm has been used in image segmentation, but it still has some drawbacks: it lacks robustness to noise and outliers, its crucial parameter is generally selected based on experience, and the time needed to segment an image depends on the image size. To overcome these drawbacks, Cai et al. [12] integrated both local spatial and gray information to propose a fast and robust Fuzzy C-means algorithm for image segmentation, called the Fast Generalized Fuzzy C-Means Algorithm (FGFCM). Moreover, many researchers have identified data characteristics that have a significant impact on K-means clustering analysis, including the scattering of the data, noise, high dimensionality, the size of the data, outliers, the types of datasets, the types of attributes, and the scales of attributes [13]. However, more investigation is needed to understand how data distributions affect K-means clustering performance. In [14] Xiong et al. provided a formal and organized study of the effect of skewed data distributions on K-means clustering. Previous research has also examined the impact of high dimensionality on K-means clustering performance. First, the traditional Euclidean notion of proximity is not very useful on real-world high-dimensional datasets, such as document datasets and gene-expression datasets. To overcome this challenge, researchers make use of dimensionality reduction techniques, such as multidimensional scaling [15]. Second, K-means has difficulty in detecting "natural" clusters with nonspherical shapes [13]. Modifying the K-means clustering algorithm is one direction for solving this problem. Guha et al. [16] proposed the CURE method, which makes use of multiple representative points to obtain the shape information of the "natural" clusters.

Outliers and noise in the data can also reduce the performance of clustering algorithms [17], especially for prototype-based algorithms such as K-means. One way to address this problem is to apply an outlier removal technique before conducting K-means clustering. For example, a simple method [9] of detecting outliers is based on a distance measure. On the other hand, many modified K-means clustering algorithms that work well for small or medium-size datasets are unable to deal with large datasets. Bradley et al. [18] discussed scaling K-means clustering to large datasets.

Arthur and Vassilvitskii implemented a preliminary version of k-means++ in C++ to evaluate its performance. K-means++ [19] uses a careful seeding method to improve both the speed and the accuracy of k-means. Experiments were performed on four datasets, and the analysis shows that the algorithm is $O(\log k)$-competitive with the optimal clustering. The experiments also showed better speed (70% faster) and better accuracy (the potential value obtained is better by factors of 10 to 1000) than k-means. Kanungo et al. [20] also proposed an algorithm for the k-means problem. It is competitive but quite slow in practice. Xiong et al. investigated the measures that best reflect the performance of K-means clustering [14]. An organized study was performed to understand the impact of data distributions on the performance of K-means clustering. Their work also improved the traditional K-means clustering so that it can handle datasets with a large variation in cluster sizes. This formal study illustrates that entropy sometimes provides misleading information on the clustering performance. Based on this, the coefficient of variation (CV) is proposed to validate the clustering outcome. Experimental results showed that, for datasets with a large variation in "true" cluster sizes, K-means reduces the variation in the resultant cluster sizes (to a CV of less than 1.0), whereas for datasets with a small variation in "true" cluster sizes, K-means increases the variation in the resultant cluster sizes (to a CV greater than 0.3).
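The "careful seeding" of k-means++ can be illustrated with a short sketch. It assumes squared Euclidean distance and uses hypothetical names (`sq_dist`, `kmeanspp_seeds`) that do not come from [19]: the first center is drawn uniformly, and each further center is drawn with probability proportional to the squared distance to the nearest center already chosen.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points: List[Point], k: int,
                   rng: Optional[random.Random] = None) -> List[Point]:
    """Pick k initial centers with distance-squared weighted sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # every point coincides with a chosen center; fall back to uniform
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers
```

The seeds returned here would then be handed to an ordinary k-means iteration; the sketch covers only the initialization step.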

Scalability of the k-means for large datasets is one of the major limitations. Chiang et al. [21] proposed a simple and easy to implement algorithm for reducing the computation time of k-means. This pattern reduction algorithm compresses and/or removes iteration patterns that are not likely to change their membership. The pattern is compressed by selecting one of the patterns to be removed and setting its value to the average of all patterns removed. If is the pattern to be removed then we can write it as follows:

This algorithm can also be applied to other clustering algorithms, such as population-based clustering algorithms. The computation time of many clustering algorithms can be reduced using this technique, as it is especially effective for large and high-dimensional datasets. In [22] Bradley et al. proposed Scalable k-means, which uses buffering and a two-stage compression scheme to compress or remove patterns in order to improve the performance of k-means. The algorithm is slightly faster than k-means but does not always produce the same result. The factors that degrade the performance of Scalable k-means are the compression processes and the compression ratio. Farnstrom et al. also proposed a simple single-pass k-means algorithm [23], which reduces the computation time of k-means. Ordonez and Omiecinski proposed relational k-means [24], which uses the block and incremental concept to improve the stability of Scalable k-means. The computation time of k-means can also be reduced using parallel bisecting k-means [25] and triangle inequality [26] methods.

Although K-means is the most popular and simple clustering algorithm, the major difficulty with this process is that it cannot ensure the global optimum results because the initial cluster centers are selected randomly. Reference [27] presents a novel technique that selects the initial cluster centers by using Voronoi diagram. This method automates the selection of the initial cluster centers. To validate the performance experiments were performed on a range of artificial and real world datasets. The quality of the algorithm is assessed using the following error rate (ER) equation:

The lower the error rate the better the clustering. The proposed algorithm is able to produce better clustering results than the traditional K-means. Experimental results proved reduction in iterations for defining the centroids.

All of the previous solutions and efforts to increase the performance of the K-means algorithm still need more investigation, since they all look for a local minimum. Searching for the global minimum can be expected to improve the performance of the algorithm. The rest of the paper is organized as follows. Section 2 reviews K-means clustering and introduces the proposed method, which finds the optimal centers for each cluster corresponding to the global minimum of the k-means objective. Section 3 applies the method to face localization. In Section 4 the results are discussed and analyzed using sets from the MIT-CBCL, BioID, and Caltech datasets. Finally, conclusions are drawn in Section 5.

#### 2. K-Means Clustering

K-means, an unsupervised learning algorithm first proposed by MacQueen in 1967 [28], is a popular method of cluster analysis. It organizes the given objects into uniform classes, called clusters, according to the similarity among objects under certain criteria. It solves the well-known clustering problem by considering certain attributes and performing an iterative alternating fitting process. The algorithm partitions a dataset into $k$ disjoint clusters such that each observation belongs to the cluster with the nearest mean. Let $\mu_j$ be the centroid of cluster $c_j$ and let $d(x_i, \mu_j)$ be the dissimilarity between $\mu_j$ and object $x_i \in c_j$. Then the function minimized by k-means is given by the following equation:

$$\min_{\mu_1, \dots, \mu_k} E = \sum_{j=1}^{k} \sum_{x_i \in c_j} d(x_i, \mu_j).$$

K-means clustering is used in a vast number of applications, including machine learning, fault detection, pattern recognition, image processing, statistics, and artificial intelligence [11, 29, 30]. The k-means algorithm is considered one of the fastest clustering algorithms, but it is sensitive to the selection of the initial points. A number of variants are intended to address issues of k-means such as the estimation of the number of clusters [31], the initialization of the cluster centroids [32], and the speed of the algorithm [33].
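The alternating fitting process described at the start of this section can be sketched as follows. This is a minimal illustration that assumes squared Euclidean distance as the dissimilarity and takes the initial centers as an input; the function names are hypothetical and the code is the standard (Lloyd-style) iteration, not the optimized method proposed later in this paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance, used here as the dissimilarity d."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: List[Point], centers: List[Point]) -> List[int]:
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def update(points: List[Point], labels: List[int],
           centers: List[Point]) -> List[Point]:
    """Mean of each cluster; an empty cluster keeps its previous center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(old))))
        else:
            new_centers.append(tuple(old))
    return new_centers


def kmeans(points: List[Point], init_centers: List[Point], max_iter: int = 100):
    """Alternate assignment and update until the centers stop moving."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    objective = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, objective
```

Because the result depends on `init_centers`, different starting points can end in different local minima, which is the sensitivity discussed above.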

##### 2.1. Modifications in K-Means Clustering

The global k-means algorithm [34] was proposed to improve the global search properties of the k-means algorithm and its performance on very large datasets by computing clusters successively. The main weakness of k-means clustering, namely its sensitivity to the initial positions of the cluster centers, is overcome by the global k-means clustering algorithm, which is a deterministic global optimization technique. It does not select initial values randomly; instead, an incremental approach is applied to optimally add one new cluster center at each stage. Global k-means also includes a method to reduce the computational load while maintaining the solution quality. Experiments were performed to compare the performance of k-means, global k-means, min k-means, and fast global k-means, as shown in Figure 1. Numerical results show that the global k-means algorithm considerably outperforms the other k-means algorithms.

Bagirov [35] proposed a new version of the global k-means algorithm for minimum sum-of-squares clustering problems. He also compared three different versions of the k-means algorithm to propose the modified version of the global k-means algorithm. The proposed algorithm computes clusters incrementally, and the cluster centers from the previous iteration are used to compute a k-partition of the dataset. The starting point of the $k$th cluster center is computed by minimizing an auxiliary cluster function. Given a finite set of points $A = \{a^1, \dots, a^m\}$ in the $n$-dimensional space $\mathbb{R}^n$, the global k-means algorithm computes the centroid $x^1$ of the set $A$ as in the following equation:

$$x^1 = \frac{1}{m} \sum_{i=1}^{m} a^i.$$

Numerical experiments performed on 14 datasets demonstrate the efficiency of the modified global k-means algorithm in comparison to the multistart k-means (MS k-means) and the GKM algorithms when the number of clusters . Modified global k-means however requires more computational time than the global k-means algorithm. Xie and Jiang [36] also proposed a novel version of the global k-means algorithm. The method of creating the next cluster center in the global k-means algorithm is modified and that showed a better execution time without affecting the performance of the global k-means algorithm. Another extension of the standard k-means clustering is the Global Kernel k-means [37] which optimizes the clustering error in feature space by employing kernel-means as a local search procedure. A kernel function is used in order to solve the M-clustering problem and near-optimal solutions are used to optimize the clustering error in the feature space. This incremental algorithm has high ability to identify nonlinearly separable clusters in input space and no dependence on cluster initialization. It can handle weighted data points making it suitable for graph partitioning applications. Two major modifications were performed in order to reduce the computational cost with no major effect on the solution quality.

Video imaging and image segmentation are important applications of k-means clustering. Hung et al. [38] modified the k-means algorithm for color image segmentation, proposing a weight selection procedure in the W-k-means algorithm. The following evaluation function is used for comparison:

$$F(I) = \sqrt{R} \sum_{i=1}^{R} \frac{e_i^2}{\sqrt{A_i}},$$

where $I$ is the segmented image, $R$ is the number of regions in the segmented image, $A_i$ is the area (the number of pixels) of the $i$th region, and $e_i$ is the color error of region $i$.

$e_i$ is the sum of the Euclidean distances between the color vectors of the original image and those of the segmented image. Smaller values of $F(I)$ indicate better segmentation results. Results from color image segmentation illustrate that the proposed procedure produces better segmentation than random initialization. Maliatski and Yadid-Pecht [39] also proposed an adaptive, hardware-driven k-means clustering approach. The algorithm uses both pixel intensity and pixel distance in the clustering process. In [40] an FPGA implementation of real-time k-means clustering for color images is presented; a filtering algorithm is used for the hardware implementation. Sulaiman and Isa [11] presented a clustering algorithm called adaptive fuzzy-K-means clustering for image segmentation. It can be used for different types of images, including medical images, microscopic images, and CCD camera images. The concepts of fuzziness and belongingness are applied to produce more adaptive clustering than other conventional clustering algorithms. The proposed adaptive fuzzy-K-means clustering algorithm provides better segmentation and visual quality for a range of images.
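As a rough illustration of how such an evaluation function can be computed, the sketch below assumes the commonly used form $F(I) = \sqrt{R} \sum_i e_i^2 / \sqrt{A_i}$, treats every distinct label as one region (connectivity is ignored), and takes the segmented color of a region to be the mean color of its pixels. The function name and these simplifications are assumptions made here, not details taken from [38].

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def evaluation_F(pixels: Sequence[Sequence[float]],
                 labels: Sequence[Hashable]) -> float:
    """F(I) = sqrt(R) * sum_i e_i^2 / sqrt(A_i) for a labelled image."""
    regions: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
    for color, label in zip(pixels, labels):
        regions[label].append(color)
    total = 0.0
    for members in regions.values():
        area = len(members)                      # A_i: number of pixels
        dim = len(members[0])
        mean = [sum(m[d] for m in members) / area for d in range(dim)]
        # e_i: summed Euclidean distance between original and region color
        e = sum(math.dist(m, mean) for m in members)
        total += e * e / math.sqrt(area)
    return math.sqrt(len(regions)) * total
```

A segmentation whose regions are perfectly uniform in color gives a value of 0, and larger color errors or many small regions push the value up.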

##### 2.2. The Proposed Methodology

In this method, the face location is determined automatically by using an optimized K-means algorithm. First, the input image is reshaped into a vector of pixel values. The pixels are then clustered into two classes by applying a threshold determined by the algorithm. Pixels with values below the threshold are put in the nonface class, and the rest of the pixels are put in the face class. However, some unwanted parts may also be clustered into the face class. Therefore, the algorithm is applied again to the face class to obtain only the face part.

###### 2.2.1. Optimized K-Means Algorithm

The proposed modified K-means algorithm is intended to overcome the limitations of the standard version by using differential equations to determine the optimum separation point. It finds the optimal centers for each cluster, corresponding to the global minimum of the k-means objective. As an example, suppose we would like to cluster an image into two subclasses: face and nonface. We look for a separator that splits the image into two different clusters in two dimensions; the means of the clusters can then be found easily. Figure 2 shows the separation points and the mean of each class. To find these points we propose the following modification for the continuous and discrete cases.

Let be a dataset and let be an observation. Let be the probability mass function (PMF) of where is the Dirac function.

When the size of the data is very large, the question of the cost of the minimization of arises, where is the number of clusters and is the mean of cluster .

Let

Then it is enough to minimize without losing the generality since where is the dimension of and .

###### 2.2.2. Continuous Case

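In the continuous case, $f(x)$ plays the role of the probability density function (PDF). For 2 classes, with a separator $M$ and class means $\mu_1$ and $\mu_2$, we need to minimize the following equation:

$$\mathrm{Var}_2 = \int_{-\infty}^{M} (x - \mu_1)^2 f(x)\,dx + \int_{M}^{\infty} (x - \mu_2)^2 f(x)\,dx.$$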

If we can use (8)

Let be any random variable with probability density function ; we want to find and such that it minimizes (7) as follows:

We know that

So we can find and in terms of . Let us define and as follows:

After simplification we get

To find the minimum, and are both partially differentiated with respect to and , respectively. After simplification, we conclude that the minimization occurs when where and .

To find the minimum of , we need to find

After simplification we get

In order to find the minimum of , we will find and separately and add them as follows:

After simplification we get

Similarly

After simplification we get

Adding both these differentials and after simplifying, it follows that

Equating this with 0 gives

But , as in that case , which contradicts the initial assumption. Therefore

Therefore, we have to find all values of that satisfy (26), which is easier than minimizing directly.

To find the minimum in terms of , it is enough to derive

Then we find
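For readers who want the two-class continuous argument in one place, the following is a compact sketch. The symbols $M$ (separator), $\mu_1$ and $\mu_2$ (class means), and $\mathrm{Var}_2$ (the two-class objective) are notation assumed here for illustration; the steps are standard calculus under the assumption that $f$ is continuous and positive at $M$, and they are not a transcription of the numbered equations.

```latex
\begin{align*}
\mathrm{Var}_2(M,\mu_1,\mu_2)
  &= \int_{-\infty}^{M} (x-\mu_1)^2 f(x)\,dx
   + \int_{M}^{\infty} (x-\mu_2)^2 f(x)\,dx, \\[4pt]
\frac{\partial \mathrm{Var}_2}{\partial \mu_1} = 0
  &\;\Longrightarrow\;
  \mu_1(M) = \frac{\int_{-\infty}^{M} x f(x)\,dx}{\int_{-\infty}^{M} f(x)\,dx}, \\[4pt]
\frac{\partial \mathrm{Var}_2}{\partial \mu_2} = 0
  &\;\Longrightarrow\;
  \mu_2(M) = \frac{\int_{M}^{\infty} x f(x)\,dx}{\int_{M}^{\infty} f(x)\,dx}, \\[4pt]
\frac{d}{dM}\,\mathrm{Var}_2\bigl(M,\mu_1(M),\mu_2(M)\bigr)
  &= \bigl[(M-\mu_1)^2 - (M-\mu_2)^2\bigr] f(M) \\
  &= (\mu_2-\mu_1)\,(2M-\mu_1-\mu_2)\,f(M), \\[4pt]
\mu_1 \neq \mu_2,\; f(M) > 0
  &\;\Longrightarrow\;
  M = \frac{\mu_1(M) + \mu_2(M)}{2}.
\end{align*}
```

In words: each center is the conditional mean of its side, and a stationary separator sits halfway between the two centers. Several values of $M$ may satisfy the last condition, so each candidate still has to be compared by its value of $\mathrm{Var}_2$.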

###### 2.2.3. Finding the Centers of Some Common Probability Density Functions

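*Uniform Distribution*. The probability density function of the continuous uniform distribution is

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b, \\ 0, & \text{otherwise}, \end{cases}$$

where $a$ and $b$ are the two parameters of the distribution. The expected values on the left and right of the separator are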

Putting all these in (28) and solving, we get

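*Log-Normal Distribution*. The probability density function of a log-normal distribution is

$$f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0.$$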

The probabilities in left and right of are, respectively,

The expected values for and are

Putting all these in (28) and assuming and and solving, we get

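*Normal Distribution*. The probability density function of a normal distribution is

$$f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$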

The probabilities in left and right of are, respectively,

The expected values for and are

Putting all these in (28) and solving we get . Assuming and and solving, we get
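The centers for a given density can also be checked numerically. The sketch below is an assumption-laden illustration rather than the closed-form solutions of this section: it discretizes a density on a uniform grid and iterates the condition "separator = midpoint of the two conditional means" until it stops changing. The function name, the grid size, and the starting value `m0` are hypothetical choices, and the density is assumed to be positive at both ends of the interval `[lo, hi]`.

```python
import math
from typing import Callable, Tuple


def separator_fixed_point(pdf: Callable[[float], float], lo: float, hi: float,
                          m0: float, n: int = 20001,
                          max_iter: int = 1000) -> Tuple[float, float, float]:
    """Iterate M <- (mu1(M) + mu2(M)) / 2 on a uniform grid over [lo, hi]."""
    step = (hi - lo) / (n - 1)
    # Prefix sums of f(x) and x*f(x); the grid step cancels in the ratios below.
    cw, cxw = [0.0], [0.0]
    for i in range(n):
        x = lo + i * step
        w = pdf(x)
        cw.append(cw[-1] + w)
        cxw.append(cxw[-1] + x * w)
    m = m0
    mu1 = mu2 = m0
    for _ in range(max_iter):
        # j = number of grid points <= m, clamped so both sides are non-empty
        j = min(max(int((m - lo) / step) + 1, 1), n - 1)
        mu1 = cxw[j] / cw[j]                         # mean of the left class
        mu2 = (cxw[n] - cxw[j]) / (cw[n] - cw[j])    # mean of the right class
        new_m = 0.5 * (mu1 + mu2)
        if new_m == m:                               # separator no longer moves
            break
        m = new_m
    return m, mu1, mu2


def standard_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
```

For a uniform density on $[a, b]$ the iteration settles near $(a+b)/2$, and for the standard normal density near 0 with centers close to $\pm\sqrt{2/\pi}$. A skewed density such as the log-normal can be passed in the same way on a truncated interval with `lo > 0`; in that case the result is only a stationary point and may depend on `m0`.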

###### 2.2.4. Discrete Case

Let be a discrete random variable and assume we have a set of observations from . Let be the mass density function for an observation .

Let and . We define as the mean for all , and as the mean for all . Define as the variance of two centers forced by as a separator

The first part of simplifies as follows:

Similarly, Define

Now, in order to simplify the calculation, we rearrange as follows:

The first part is simplified as follows:

Similarly, simplifying the second part yields

In general, these types of optimization arise when the dataset is very big.

If we consider to be very large then , , and : and, as , we can further simplify as follows:

To find the minimum of , let us locate when :

But , and so is , since is an increasing sequence, which implies that where

Therefore,

Let us define where and are the means of the first and second parts, respectively.

To find the minimum for the total variation, it is enough to find the good separator between and such that and (see Figure 3).

After finding a few local minima, we obtain the global minimum by taking the smallest among them.
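The search over separators in the discrete case can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the observations are equally weighted, the data are one-dimensional, and a split is treated as a local-minimum candidate when the midpoint of its two class means falls between the two neighbouring observations. The function name and the returned tuple are hypothetical.

```python
from typing import Iterable, Tuple


def optimal_two_class_split(values: Iterable[float]) -> Tuple[float, float, float, float]:
    """Return (separator, mean_1, mean_2, total_squared_error) for 1-D data."""
    xs = sorted(float(v) for v in values)
    n = len(xs)
    if n < 2 or xs[0] == xs[-1]:
        raise ValueError("need at least two distinct values")
    # Prefix sums of x and x^2 give every class mean and error in O(1).
    s, s2 = [0.0], [0.0]
    for x in xs:
        s.append(s[-1] + x)
        s2.append(s2[-1] + x * x)
    stationary, every_split = [], []
    for p in range(1, n):                 # left class = xs[:p], right = xs[p:]
        if xs[p - 1] == xs[p]:
            continue                      # equal values cannot be separated
        mu1 = s[p] / p
        mu2 = (s[n] - s[p]) / (n - p)
        mid = 0.5 * (mu1 + mu2)
        sse = (s2[p] - p * mu1 * mu1) + (s2[n] - s2[p] - (n - p) * mu2 * mu2)
        candidate = (sse, mid, mu1, mu2)
        every_split.append(candidate)
        if xs[p - 1] <= mid <= xs[p]:     # midpoint lies between the neighbours
            stationary.append(candidate)
    # Smallest error among the local minima; fall back to all splits if
    # rounding left the candidate list empty.
    sse, mid, mu1, mu2 = min(stationary or every_split)
    return mid, mu1, mu2, sse
```

After sorting, the scan is linear in the number of observations, so taking the smallest of the local minima costs little more than a single pass.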

#### 3. Face Localization Using Proposed Method

The modified K-means algorithm finds a global minimum, which the standard algorithm cannot guarantee, and our face localization solution is built on it. Suppose we have an image with pixel values from 0 to 255. First, we reshape the input image into a vector of pixels; Figure 4 shows an input image before the modified K-means algorithm is applied. The image contains a face part, a normal background, and a background with illumination. Then we split the image pixels into two classes, one from 0 to the threshold and another from the threshold to 255, where the threshold is determined by the algorithm.

Class 1 represents the nonface part and class 2 represents the face with the shade.

All pixels with values less than the threshold belong to the first class and are set to 0. This keeps the pixels with values above the threshold, which belong to both the face part and the illuminated part of the background, as shown in Figure 5.

To remove the illuminated background part, the algorithm is applied to the remaining pixels to obtain a second threshold that separates the face part from the illuminated background part. Two new classes are obtained, one from the first threshold to the second threshold and another from the second threshold to 255. The pixels with values less than the second threshold are then set to 0, and the nonzero pixels that are left represent the face part. Here class 1 represents the illuminated background part and class 2 represents the face part.

The result of the removal of the illuminated background part is shown in Figure 6.

One can see that some face pixels were set to 0 because of the effects of the illumination. Therefore, we need to return to the original image in Figure 4 to fully extract the image window corresponding to the face. A filtering process is then performed to reduce the noise, and the result is shown in Figure 7.

Figure 8 shows the final result which is a window containing the face extracted from the original image.
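The two thresholding passes described in this section can be sketched on a flat list of gray levels. The sketch assumes an exhaustive scan for the best two-class split as a stand-in for the optimized algorithm, and it leaves out the filtering and window extraction steps; `two_class_threshold` and `two_pass_face_mask` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def two_class_threshold(values: Sequence[float]) -> float:
    """Midpoint of the two class means for the split with the least squared error."""
    xs = sorted(values)
    n = len(xs)
    total = float(sum(xs))
    total_sq = float(sum(x * x for x in xs))
    best_sse, best_t = None, None
    left, left_sq = 0.0, 0.0
    for p in range(1, n):                 # left class = xs[:p]
        left += xs[p - 1]
        left_sq += xs[p - 1] ** 2
        if xs[p - 1] == xs[p]:
            continue                      # equal values cannot be separated
        mu1 = left / p
        mu2 = (total - left) / (n - p)
        sse = (left_sq - p * mu1 ** 2) + (total_sq - left_sq - (n - p) * mu2 ** 2)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, 0.5 * (mu1 + mu2)
    if best_t is None:
        raise ValueError("need at least two distinct values")
    return best_t


def two_pass_face_mask(pixels: Sequence[int]) -> Tuple[List[int], float, float]:
    """Zero out everything below the second threshold; keep the rest."""
    t1 = two_class_threshold(pixels)               # pass 1: nonface vs. the rest
    kept = [v for v in pixels if v >= t1]
    # pass 2: illuminated background vs. face, applied to the kept pixels only
    t2 = two_class_threshold(kept) if len(set(kept)) > 1 else t1
    mask = [v if v >= t2 else 0 for v in pixels]
    return mask, t1, t2
```

For a real image the flat list would be obtained by reshaping the pixel array, and the nonzero entries of `mask` would be mapped back to image coordinates to cut out the face window.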

#### 4. Results

In this Section, the experimental results of the proposed algorithm will be analyzed and discussed. First, it starts with presenting a description of the datasets used in this work, followed by the results of the previous part of the work. Three popular databases were used to test the proposed face localization method. The MIT-CBCL database has image size , and images were selected from this dataset based on different conditions such as light variation, pose, and background variations. The second dataset is the BioID dataset with image size and images from the database were selected based on different conditions such as indoor/outdoor. The last dataset is the Caltech dataset with image size and the images from the database were selected based on different conditions such as indoor/outdoor.

##### 4.1. MIT-CBCL Dataset

This dataset was established by the Center for Biological and Computational Learning (CBCL) at Massachusetts Institute of Technology (MIT) [41]. It has persons with images per person and an image size of pixels. The images of each person are with different light conditions, clutter background, and scale and different poses. Few examples of these images are shown in Figure 9.

##### 4.2. BioID Dataset

This dataset [42] consists of gray level images with a resolution of pixels. Each one shows the frontal view of a face of one out of different test persons. The images were taken under different lighting conditions, with a complex background, and contain tilted and rotated faces. This dataset is considered as the most difficult one for eye detection (see Figure 10).

##### 4.3. Caltech Dataset

This database [43] has been collected by Markus Weber at California Institute of Technology. The database contains face images of subjects under different lighting conditions, different face expressions, and complex backgrounds (see Figure 11).

##### 4.4. Proposed Method Result

The proposed method is evaluated on these datasets where the result was with a localization time of s. Table 1 shows the proposed method results with the MIT-CBCL, Caltech, and BioID databases. In this test, we have focused on the use of images of faces from different angles from 30 degrees left to 30 degrees right and the variation in the background as well as the cases of indoor and outdoor images. Two sets of images are selected from the MIT-CBCL dataset and two other sets from the other databases to evaluate the proposed method. The first set from MIT-CBCL contains the images with different poses and the second set contains images with different backgrounds. Each set contains 150 images. The BioID and Caltech sets contain images with indoor and outdoor settings. The results showed the efficiency of the proposed K-mean modified algorithm to locate the faces from the image with high background and poses changes. In addition, another parameter was considered in this test which is the localization time. From the results, the proposed method can locate the face in a very short time because it has less complexity due to the use of differential equations.

Figure 12 shows some examples of face localization on MIT-CBCL.

Figure 13(a) shows an original image from the BioID dataset, and Figure 13(b) shows the localization position. A filtering operation is still needed to remove the unwanted parts; its result is shown in Figure 13(c), after which the area of the ROI is determined.

**Figure 13:** (a) original image from the BioID dataset; (b) localization position; (c) result after filtering.

Figure 14 shows the face position on an image from the Caltech dataset.

**Figure 14:** panels (a), (b), and (c) show the face position on an image from the Caltech dataset.

#### 5. Conclusion

In this paper, we focus on improving ROI detection by suggesting a new method for face localization. In this method, the input image is reshaped into a vector and then the optimized k-means algorithm is applied twice to cluster the image pixels into classes. At the end, the block window corresponding to the class of pixels containing only the face is extracted from the input image. Three sets of images taken from the MIT, BioID, and Caltech datasets, differing in illumination and backgrounds as well as indoor/outdoor settings, were used to evaluate the proposed method. The results show that the proposed method achieved significant localization accuracy. Our future research direction includes finding the optimal centers for higher-dimensional data, which we believe is a straightforward generalization of our approach to higher dimensions (8) and to more subclusters.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This study is a part of the Internal Grant Project no. 419011811122. The authors gratefully acknowledge the continued support from Alfaisal University and its Office of Research.