Low-cost and highly efficient 3D face reconstruction is an important technique for developing virtual interaction applications. The Kinect, a typical low-cost 3D data collection device, has been widely used in many applications. In this work, we propose a highly efficient 3D face point cloud optimization method for Kinect-based face reconstruction. Based on the different characteristics of the Kinect point cloud, diverse optimization strategies are carefully applied to improve the reconstruction quality. Extensive experiments clearly show that our approach outperforms traditional approaches in terms of accuracy.

1. Introduction

The human face displays unique human biological characteristics and conveys a wide range of emotions. With the development of science and technology, the demand for digital human faces is no longer limited to two-dimensional images. Digital 3D face techniques are receiving increasing attention because they can convey more facial information, such as spatial shape, facial contour, and wrinkles. Many applications use 3D face technology to create vivid visual effects. For instance, virtual teleconference software uses reconstructed heads to compress the transmission bandwidth [1]. Security and electronic payment systems utilize facial depth characteristics to validate user identities [2]. In the medical field, surgical planning and analysis, as well as communication between doctors and patients, have become more efficient with the interactive assistance of virtual faces [3, 4]. 3D face techniques thus have broad application prospects in our society.

The physiological structure and geometry of the face are very complex, so describing the face is a key problem. The classical model-based face construction method 3D Morphable Model (3DMM) [5], which uses a single RGB image, reconstructs a face model by solving the model parameters and building a linear combination of 3D faces. It can reconstruct a complete face as well as its topological structure, but it lacks deformation expression and is not flexible enough to represent high-frequency deformations, resulting in poor distinction between different human faces.

The depth information contains more details, which can enhance the realism of the face model. High-precision depth information is usually obtained with high-resolution laser or structured-light scanning equipment. For the high-precision instrument Go!SCAN, each scanned frontal face model has approximately 8,500 to 9,500 vertices. However, such equipment is often cumbersome and expensive, has strict usage conditions, and is inconvenient for the subjects being measured. To address these problems, several low-cost RGB-D devices have been developed. For example, Microsoft Kinect can obtain colour and depth images in real time [6]. Due to its convenience and low cost, it has become the most popular 3D data collection device. Kinect HD Face [7] is a standard built-in API of the Kinect SDK for face reconstruction. It can detect and track human faces at a rate of 30 frames per second, but it generates only 1,347 facial vertices. Too few point cloud vertices constrain the ability to describe the rich information of the human face. Moreover, its face tracking is based on body skeleton detection: if only the head is visible without the body, the face may go undetected or data may be lost. In [8], Liang et al. used high-precision database matching to improve the quality of face models. However, due to the long time spent on nonrigid alignment and retrieval, the whole algorithm is very slow, requiring approximately 92.16 seconds per frame. In addition, texture information cannot be obtained.

In summary, model-based methods that use RGB information tend to produce results that are too smooth. The depth frame that is efficiently captured by the low-cost Kinect contains more information. However, the low precision and weak anti-interference ability of Kinect do not meet the requirements of detailed facial expressions. The purpose of this paper is to generate high-quality face models efficiently and at a low cost while restoring as many different shapes and details of human faces as possible.

Our scheme is shown in Figure 1. With the data acquired from Kinect, we use face recognition and landmark detection to extract the face region and obtain 68 key facial landmarks. Then, we utilize the landmarks to delineate a number of facial feature regions, which are mapped to the 3D face point cloud. Based on the different characteristics of each region's point cloud, various optimization strategies are applied. The proposed algorithm improves the quality of face reconstruction in a low-cost and highly efficient manner.

The main contributions of our scheme can be summarized as follows:

(i) We propose a face detail optimization method guided by key regions from a single RGB-D image, which smooths faces while retaining more facial details. In addition, the final face model includes texture information.

(ii) We develop high-precision face models by using optimized algorithms, overcoming the inevitable low-quality problems caused by the low-cost Kinect equipment. Compared with existing methods, our method performs better in terms of both qualitative and quantitative results for seven expressions.

The remainder of the paper is organized as follows. First, we summarize the related works in Section 2. Then, we present our 3D face reconstruction scheme in Section 3. After showing the experimental results in Section 4, the conclusions are presented in Section 5.

2. Related Works

In this section, we briefly summarize state-of-the-art 3D face reconstruction techniques and Microsoft (MS) Kinect-based research.

2.1. 3D Face Reconstruction

Currently, 3D face reconstruction methods can be classified as active or passive based on the information acquisition mode. For the active mode, depth data are extracted by dedicated scanning instruments, whereas for the passive mode, depth data are inferred from 2D information such as videos and pictures.

Among the active methods, contact and noncontact scanning are the two popular styles. Contact instruments usually have high precision and do not impose strict scanning conditions, such as illumination and object colour, but their scanning speed is slow and their usage is inconvenient. Noncontact instruments actively emit infrared light to scan the environment; they are popular due to their lightweight structure and fast scanning speed, but their accuracy is low. The scanning results are often affected by ambient conditions and frequently contain considerable noise. To improve accuracy, Hernandez et al. proposed a real-time 3D face reconstruction algorithm that used the first depth image as a reference frame and then used a GPU-based ICP algorithm to fuse the subsequent consecutive frames to the reference frame [9]. The algorithm can restore more facial details, but the texture mapping is unstable. Meyer and Do proposed a fast face segmentation method for separating the face from the depth image, in which the segmented depth image was registered to a 3D model [10]. However, incomplete depth data may result in a partial 3D face, and the texture cannot be mapped. Point cloud data scanned or calculated by instruments often have problems such as noise and holes due to missing data [11, 12], which are not conducive to processing [13, 14]. Therefore, many scholars have studied the problem of hole repair in point clouds. Kumar et al. used the point cloud data around a hole to fit a nonuniform B-spline surface [15]. Leong et al. directly connected the boundary of the hole [16]; since the boundary was randomly distributed, the result was not satisfactory. He et al. proposed a point cloud hole repair method based on a cubic B-spline surface [17] that performed better for large holes. However, the repair effect on local areas is greatly affected by noise.

For passive methods, face reconstructions are generated directly from 2D images [18–21] according to the shape-from-shading principle, which is an inverse photography process that uses boundary conditions, reflectance properties, lighting, and depth values to determine the 3D shape of the object. Boundary localization requires accurate two-dimensional feature point detection [22, 23]. To achieve this, Jiang et al. proposed a method for inferring geometric details from a single image [20]. Gecer et al. proposed a generative adversarial network model to fit feature points [24]. Hernandez et al. proposed a motion-based prior constrained structure [25]. Roth et al. proposed a reconstruction method using albedo information and personalized templates [26]. Another 3D face reconstruction trend is based on dense alignment. Fan et al. proposed an approach for recovering realistic facial textures that used facial feature points to adjust the overall and partial general frame models [21]. In [27], Liu et al. utilized multiangle 2D pictures with marked feature points to regress the 3D model. In [28], Feng et al. developed a UV position map to record the 3D shape of a complete face and trained a convolutional neural network to iteratively learn this UV map from a single 2D image. Liu et al. proposed learning 3D-to-2D mapping by computing the adjustment between the 3D face shape and the 2D landmarks [29, 30]. Tu et al. proposed a self-supervised learning method for overcoming noisy values that used sparse two-dimensional face landmarks as supplementary information [31].

Some methods combine the two types. For example, in [32], Zollhöfer et al. used colour images captured by Kinect to detect facial features and then used the features to adjust the general model and register it with the depth image. Macedo et al. improved the KinectFusion algorithm [33]. The algorithm uses head pose estimation to address occlusion problems, which provides the ICP algorithm with a better initial guess and allows it to reregister to obtain a complete face model.

2.2. MS Kinect-Based Applications

Microsoft Kinect can collect depth information from the environment and objects in a convenient and low-cost manner [6], supporting the expansion of applications such as postural control assessment [34], medical assistance [35–37], indoor mapping [38], and 3D scene reconstruction.

Newcombe et al. used a handheld Kinect to reconstruct room-sized scenes [39]. This was pioneering work for real-time dense 3D reconstruction using an RGB-D camera. Based on this work, many optimized reconstruction algorithms have been developed [40–42]. Zhang et al. proposed an integral imaging display based on KinectFusion that fuses multiframe depth information to eliminate the inherent noise in a single frame [43]. To restore and denoise depth maps, Esfahani and Pourreza proposed a method based on Gaussian filtering to repair small-scale noise and holes in depth images [44]. Lee et al. used asynchronous cellular automata and combined the depth and colour information to enhance the reconstruction accuracy [45].

Kinect has been used extensively for face recognition [46–50], emotion recognition [51, 52], wearable face recognition [53], facial recognition databases [54], and pose estimation [55–57]. Ren et al. applied the Kinect sensor for part-based hand gesture recognition, using a novel distance metric known as the Finger-Earth Mover's Distance to assess the dissimilarity of handshapes [58]. Saeed et al. proposed a head pose estimation method based on Kinect and a new depth-based feature descriptor that provided competitive estimation accuracy while requiring less computation time [59]. In [60], Southwell and Fang proposed a human object recognition approach using Kinect and a depth information mask matching model. Kinect has also been used to detect humans with a 3D human segmentation scheme [61, 62]. Siv et al. used Kinect to create 3D facial avatars [63], and Nose and Igarashi created talking avatars for contact on the Internet [64]. Bondi et al. used Kinect as a low-resolution scanner to reconstruct high-resolution face models by computing the 3D face registration among depth face sequences [65].

3. Proposed Method

Our algorithm consists of two modules: coarse face point cloud generation and optimization. In the first module, we use Kinect to capture depth and colour images of humans in real time and align them to the same size. Then, the face region and landmarks are extracted from the colour image with face detection and mapped to the depth image regions. Finally, we obtain a coarse 3D face model by transforming the depth map to a point cloud.

In the second module, we perform a key region-guided detail optimization algorithm. Specifically, for different face regions, we use various strategies based on the characteristics of each face subregion to balance the competing goals of preserving facial details and smoothing noise.

3.1. Aligning Depth and Colour Images

We used the Kinect for Windows V2 sensor to capture the colour and depth images. Kinect V2 is a 3D somatosensory camera manufactured by Microsoft that includes a built-in colour camera, an infrared transmitter, and a depth (infrared) camera. The colour camera captures colour image frames at a resolution of 1920 × 1080 pixels. The infrared transmitter emits near-infrared light, which is reflected off the surfaces of objects in the scene. The depth information is calculated from the time of flight of this light, and a depth image with a resolution of 512 × 424 pixels is generated. The depth and colour images are captured simultaneously. Since the depth and colour information must be aligned for reconstruction, we use an efficient built-in function of the Kinect SDK to align the two kinds of images, obtaining colour and depth images at the same resolution. Due to the parallax between the colour and depth cameras, some pixels captured by one camera may not exist in the other, resulting in repeated pixels and shadows (see Figure 2).

3.2. Face Region Extraction and Landmark Detection

To calculate the face point cloud, we need to extract the face region from the image. Following previous works, we use the Dlib [66] face detection module to obtain the face region and the corresponding 68 face landmarks (see Figure 3(a)), which are used as indicators of key face regions. The face region is represented by a classical bounding box. Since we aligned the depth image and the colour image, the coordinates of the face region in the two images are the same. We can directly extract the face region from the colour image and the depth image.

3.3. Face Point Cloud Generation

Face point cloud generation is the process of converting the pixel coordinates of the face region in the depth image into world coordinates. This process includes converting among four coordinate systems: the pixel, image, camera, and world coordinate systems [34]. The units of the pixel coordinate axes and the image coordinate axes are pixels and millimetres (mm), respectively. The pixel coordinate system and the image coordinate system lie on the same imaging plane and are related by a translation of the origin and a change of units (Figure 4(a)). This transformation can be expressed as follows:
\[
u = \frac{x}{dx} + u_0, \qquad v = \frac{y}{dy} + v_0, \tag{1}
\]
where \(dx\) and \(dy\) are the physical lengths that a pixel occupies in the \(x\) and \(y\) directions, respectively, and \((u_0, v_0)\) is the origin of the image coordinate system expressed in pixel coordinates.

The conversion of the image coordinate system to the camera coordinate system involves a perspective projection (Figure 4(b)), and it can be expressed as Equation (2):
\[
Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}, \tag{2}
\]
where \(f\) is the focal length (the distance from the optical centre to the image plane), \((X_c, Y_c, Z_c, 1)^T\) is the homogeneous coordinate of a point in the camera coordinate system, and \((x, y, 1)^T\) is the homogeneous coordinate of the corresponding point in the image coordinate system.

The transformation of the camera coordinate system to the world coordinate system can be expressed as follows:
\[
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
= R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T, \tag{3}
\]
where \(R\) and \(T\) are the camera extrinsics that represent the rotation matrix and the translation vector, respectively. In the experiment, the camera coordinate system coincides with the origin of the world coordinate system, so \(R\) and \(T\) are set to the following:
\[
R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
T = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \tag{4}
\]

The camera coordinate system has the same depth axis as the world coordinate system. In summary, the conversion from the pixel coordinates \((u, v)\) and depth value \(Z\) to the world coordinates \((X_w, Y_w, Z_w)\) can be expressed as follows:
\[
X_w = \frac{(u - c_x)\,Z}{f_x}, \qquad
Y_w = \frac{(v - c_y)\,Z}{f_y}, \qquad
Z_w = Z. \tag{5}
\]

After we calibrate the intrinsics of the depth camera, namely, the focal lengths \(f_x\) and \(f_y\) of the camera in the \(x\) and \(y\) dimensions and the principal point coordinates \(c_x\) and \(c_y\) in the \(x\) and \(y\) dimensions, we can perform the coordinate conversion.

The final point cloud is an organized point cloud. An organized point cloud is a 2D array that stores vertices based on their spatial relationships. Adjacent points in the 2D array are also adjacent in three-dimensional space, allowing the index values to quickly locate adjacent points. Moreover, organized point clouds enable us to quickly map 2D points in the colour image to 3D points in the point cloud.
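As an illustration, the pixel-to-world conversion above and the organized point cloud layout can be sketched in NumPy as follows; the intrinsic values passed in below are placeholders, not the calibrated values used in the paper:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Convert a depth map (mm) to an organized point cloud.

    Returns an H x W x 3 array: adjacent entries in the 2D array are
    adjacent in 3D space, so neighbours can be located by index.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx   # back-project along the x axis, Eq. (5)
    y = (v - cy) * z / fy   # back-project along the y axis, Eq. (5)
    return np.dstack((x, y, z))
```

Because the output keeps the image grid's row-column order, a 2D pixel in the aligned colour image maps to its 3D point by the same `(row, col)` index.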

3.4. Landmark-Guided Face Point Cloud Optimization

Due to the limitations of environmental factors and Kinect sensor precision, the depth data may have holes formed by invalid points, which are represented as NaN values. We need to fill the holes to recover the face regions. In addition, we observe that point clouds often have rough burrs. For these two reasons, preliminary smoothing is performed. Because the organized point cloud is sequential, the smoothing traversal method is similar to traversing a two-dimensional array. We traverse the point cloud sequentially from the top left corner to the bottom right corner, smoothing each point with a mean filter. Mean smoothing calculates the average depth values of the eight neighbouring points around the current point (see Figure 5) and uses the mean result as the smoothed value for this point. When an invalid point is encountered within these eight points, we ignore it and use the remaining points to compute the average. Hole filling is similar to mean smoothing. We replace the invalid point with the average of the surrounding eight points. For boundary points, we use a zero-padding strategy to fill the template. This operation yields a point cloud with no holes and few burrs.
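The mean smoothing and hole filling described above can be sketched as follows (a simple NumPy version, assuming invalid points are stored as NaN and operating on the depth channel of the organized point cloud):

```python
import numpy as np

def mean_smooth_fill(depth):
    """3x3 neighbour-mean filter that also fills NaN holes.

    The new value of each point is the average of its eight valid
    neighbours; NaN neighbours are ignored, and the border is handled
    with zero padding, as described in the text.
    """
    padded = np.pad(depth, 1, mode="constant", constant_values=0.0)
    h, w = depth.shape
    out = np.zeros_like(depth)
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 3, j:j + 3].copy()
            win[1, 1] = np.nan               # exclude the centre point itself
            neigh = win[~np.isnan(win)]      # drop invalid (NaN) neighbours
            if neigh.size:
                out[i, j] = neigh.mean()     # smoothing and hole filling alike
    return out
```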

Experiments show that the point cloud is still not smooth enough, even after mean smoothing. Thus, we use a bilateral filter to smooth the point cloud. Bilateral filtering is frequently used in depth image denoising. In [67], Li et al. used a bilateral filter to denoise depth images and achieved good results. The bilateral filter is a nonlinear and noniterative extension of the Gaussian filter [68, 69]. In a Gaussian filter, the weight depends only on the spatial distance: the closer the spatial distance, the greater the weight. The bilateral filter uses both spatial-domain and range-domain (greyscale) information: points with a large greyscale difference are assigned a small weight, and points with a small greyscale difference are assigned a large weight, allowing edge information to be preserved while smoothing. For the spatial domain and the range domain, the kernel functions of bilateral filtering are as follows:
\[
w_s(i, j, k, l) = \exp\left(-\frac{(i - k)^2 + (j - l)^2}{2\sigma_s^2}\right), \qquad
w_r(i, j, k, l) = \exp\left(-\frac{\left(g(i, j) - g(k, l)\right)^2}{2\sigma_r^2}\right), \tag{6}
\]
where \((i, j)\) is the centre point of the template, with a size of \(n \times n\), and \((k, l)\) represents the other points in the template. The grey value \(g\) is converted from RGB values according to \(g = 0.299R + 0.587G + 0.114B\). \(\sigma_s\) and \(\sigma_r\) are the standard deviations of the spatial domain and the range domain, respectively. The bilateral filter can be denoted as:
\[
\tilde{d}(i, j) = \frac{1}{W} \sum_{(k, l)} d(k, l)\, w_s(i, j, k, l)\, w_r(i, j, k, l), \tag{7}
\]
where \(W = \sum_{(k, l)} w_s(i, j, k, l)\, w_r(i, j, k, l)\) is the weight sum over the template, used for the standard normalization operation. Figure 6 shows that the smoothing has noticeably improved. The face becomes smoother after bilateral smoothing, and the edge information is preserved as much as possible.
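A minimal NumPy sketch of a grey-guided bilateral filter of this kind is shown below; border pixels are left unfiltered for brevity, and the template size is our own assumption:

```python
import numpy as np

def bilateral_filter_depth(depth, gray, size=5, sigma_s=1.0, sigma_r=1.0):
    """Bilateral filter for a depth map, guided by the aligned grey image.

    Spatial weights fall off with pixel distance; range weights fall off
    with grey-level difference, so depth edges that coincide with
    intensity edges are preserved while flat regions are smoothed.
    """
    r = size // 2
    h, w = depth.shape
    ky, kx = np.mgrid[-r:r + 1, -r:r + 1]
    ws = np.exp(-(kx**2 + ky**2) / (2 * sigma_s**2))      # spatial kernel
    out = depth.copy()
    for i in range(r, h - r):
        for j in range(r, w - r):
            win_d = depth[i - r:i + r + 1, j - r:j + r + 1]
            win_g = gray[i - r:i + r + 1, j - r:j + r + 1]
            wr = np.exp(-(win_g - gray[i, j])**2 / (2 * sigma_r**2))  # range kernel
            wgt = ws * wr
            out[i, j] = (win_d * wgt).sum() / wgt.sum()   # normalized weighted mean
    return out
```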

The point cloud after bilateral filtering is smooth over small ranges, but it is still uneven as a whole. The least squares method can be used to fit the scattered points over a larger area than in the previous steps. We traversed the point cloud sequentially. Taking the current point as the centre, a surface was fitted with the least squares method within a \(k \times k\) range. Then, we calculated the value of the points on the smooth surface to replace the original value. The quadric surface to be fitted can be expressed as follows:
\[
f(x, y) = a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 x y + a_5 y^2, \tag{8}
\]
where \(a = (a_0, a_1, \ldots, a_5)^T\) represents the surface coefficients to be fitted. The total error between the surface and the original points can be expressed as follows:
\[
E = \sum_{i=1}^{n} \left(f(x_i, y_i) - d_i\right)^2, \tag{9}
\]
where \(d_i\) is the depth value of the \(i\)-th point. Fitting the surface minimizes the total error \(E\), which can be written in matrix form as follows:
\[
E = \|B a - d\|^2 = (B a - d)^T (B a - d), \tag{10}
\]
where \(d\) is an \(n \times 1\) vector of depth values and \(n = k^2\) indicates the number of points in the template. \(B\) is an \(n \times 6\) matrix, where 6 indicates the bases constituted by \(x\), \(y\), \(x^2\), \(xy\), \(y^2\), and 1. The \(i\)-th row of \(B\) can be calculated from the \(x\) and \(y\) values of the \(i\)-th point.

Since \(a^T B^T d\) and \(d^T B a\) can be transposed to each other and both results are scalars, noting that the transpose of a scalar is equal to itself, then
\[
E = a^T B^T B a - 2 a^T B^T d + d^T d. \tag{11}
\]

The above equation can be solved when the partial derivative of \(E\) with respect to \(a\) is equal to 0:
\[
\frac{\partial E}{\partial a} = 2 B^T B a - 2 B^T d = 0. \tag{12}
\]

The above equation can be simplified as follows:
\[
a = (B^T B)^{-1} B^T d. \tag{13}
\]

The quadric expression of the surface can be obtained from \(a\). Finally, the new depth value can be obtained by combining the \(x\) and \(y\) coordinates of the point. The result is shown in Figure 7.
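The quadric fit above reduces to an ordinary least squares problem. A minimal NumPy sketch follows; it uses `lstsq`, which solves the same normal equations as Equation (13) in a numerically more stable way:

```python
import numpy as np

def fit_quadric_depth(x, y, z):
    """Fit z = a0 + a1*x + a2*y + a3*x^2 + a4*x*y + a5*y^2 by least squares.

    x, y, z are flat arrays of the template points; returns the
    coefficient vector a solving (B^T B) a = B^T d.
    """
    B = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])  # n x 6 basis matrix
    a, *_ = np.linalg.lstsq(B, z, rcond=None)
    return a

def eval_quadric(a, x, y):
    """Depth predicted by the fitted quadric at (x, y)."""
    return a[0] + a[1]*x + a[2]*y + a[3]*x**2 + a[4]*x*y + a[5]*y**2
```

In the smoothing pass, each point's depth would be replaced by `eval_quadric` at its own `(x, y)` after fitting its neighbourhood.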

After this step, however, the face is overly smooth, and some important detail features are lost. Most of the human face is smooth, except for key facial features such as the eyes and nose. Our goal is to apply least squares smoothing only to the areas without facial features while keeping the key facial regions intact. Thus, we utilize the 68 previously obtained face landmarks to guide the smoothing procedure. Figure 8(a) shows the 68 landmarks marked with red dots on the colour image. Then, we divide the areas of the facial features into six key regions based on the landmarks. For example, for the mouth, the \(x\)-coordinates of the left and right corners of the mouth serve as the left and right boundaries of the region, while the \(y\)-coordinates of the upper lip and lower lip points serve as the upper and lower boundaries. The six obtained key regions are represented by bounding boxes in Figure 8(b). Since the points in the colour image and the depth image are aligned, we can use the 2D landmark information to extract the depth information directly.
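The key-region extraction can be sketched as below. The landmark index ranges follow the common 68-point Dlib layout (an assumption to verify against the detector actually used), and the fixed margin is our own illustrative choice:

```python
import numpy as np

# Index ranges of the standard 68-point Dlib landmark layout (assumed here).
REGIONS = {
    "right_brow": range(17, 22),
    "left_brow":  range(22, 27),
    "nose":       range(27, 36),
    "right_eye":  range(36, 42),
    "left_eye":   range(42, 48),
    "mouth":      range(48, 68),
}

def key_region_boxes(landmarks, margin=2):
    """Compute an axis-aligned bounding box for each key facial region.

    `landmarks` is a (68, 2) array of (x, y) pixel coordinates; the
    extreme landmark coordinates give the box edges, expanded by a
    small margin.
    """
    boxes = {}
    for name, idx in REGIONS.items():
        pts = landmarks[list(idx)]
        x0, y0 = pts.min(axis=0) - margin   # left/top from extreme landmarks
        x1, y1 = pts.max(axis=0) + margin   # right/bottom from extreme landmarks
        boxes[name] = (int(x0), int(y0), int(x1), int(y1))
    return boxes
```

Because the colour and depth images are aligned, these 2D boxes can be used directly as index ranges into the organized point cloud.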

For the key facial regions, we perform least squares smoothing on the area outside the boxes while maintaining the values inside the boxes. This produces a smooth face model that does not blur the detailed facial features (Figure 9(a)). However, because different smoothing operations are performed inside and outside a region, an obvious discontinuity appears at the region's edges. Therefore, we additionally perform mean smoothing on the edge regions. Finally, we use two efficient methods (Delaunay [70] and Trimesh [71]) to calculate the triangular mesh and display the final face model (Figure 9(b)).

4. Experiment

In the experiments, the face scanned by Go!SCAN was regarded as the ground truth. The Hausdorff distance was used to measure the difference between two meshes. MeshLab, a 3D geometric processing system, integrates the Hausdorff distance function [72]. Our implementation was written in MATLAB and run on a PC with a 2.2 GHz Intel Core i7-8750H CPU and 16.0 GB of RAM. In Section 3.4, both \(\sigma_s\) and \(\sigma_r\) were set to 1. We reconstructed 3D face models of four people with seven different expressions (neutral, happy, sad, disgust, surprise, angry, and fear). We compared our method with two popular approaches: Kinect HD Face and GRIDFIT.
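For reference, the symmetric Hausdorff distance between two vertex sets can be computed in a few lines of NumPy; this brute-force sketch compares vertices only, whereas MeshLab additionally samples points on the mesh surfaces:

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (n x 3) and b (m x 3).

    For each point in one set, find its nearest point in the other set;
    the Hausdorff distance is the worst such nearest-neighbour distance,
    taken over both directions.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # pairwise distance matrix
    return max(d.min(axis=1).max(),   # farthest point of a from b
               d.min(axis=0).max())   # farthest point of b from a
```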

Kinect HD Face has two functions [7]. First, it can capture faces and display them in three dimensions. Second, it can track faces in real time and capture facial movements. In the first stage, Kinect uses 16 frames to fully capture the face: four on the left, four on the right, four in the front, and four on the top of the face. In the second stage, the face model is created by deforming the mean face based on the captured face. Finally, the generated face mesh can be used in other animation applications. However, no colour information is included in the final face model. We show the results in Figure 10(a).

In GRIDFIT, the surface is represented as a ductile film. The film is stretched only in the direction of the real data points and is fixed in all other directions. In contrast to interpolation methods, it does not predict the value at each data point exactly but instead pushes the fitted surface as close as possible to the data points to obtain an approximate solution [73]. A lattice of grid nodes covers all the points, forming matrix units, and every data point falls into one of these units. The fitting problem can be written as \(A z = d\), where \(z\) is the vector of unknown surface values at the grid nodes and \(d\) is the vector of observed depths. We define the number of grid nodes in the \(x\) direction as \(n_x\) and the number in the \(y\) direction as \(n_y\), so the length of \(z\) is \(n_x n_y\). Thus, \(A\) has one row per data point and \(n_x n_y\) columns. Because this is often an ill-posed problem in practice, a regularization term is added. The goal of the objective function is then to minimize \(\|A z - d\|^2 + \lambda \|R z\|^2\), where \(R\) is a smoothness operator. GRIDFIT allows the user to control the stiffness through \(\lambda\) [73]. In the comparative experiments, \(\lambda\) was set to 1. We show the results in Figure 10(b). There were still many bulges on the face, and the facial features were not clear.
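The GRIDFIT objective can be illustrated with a one-dimensional sketch. This is our own simplified version: the linear interpolation scheme and the second-difference regularizer are illustrative assumptions, not GRIDFIT's exact implementation:

```python
import numpy as np

def grid_fit_1d(x, d, nodes, lam=1.0):
    """Regularized least-squares fit of scattered 1D data to grid nodes.

    Each observation is expressed as a linear interpolation of its two
    enclosing nodes (a row of A); a second-difference operator R
    penalizes roughness, and lam controls the stiffness, loosely
    mirroring the GRIDFIT formulation min ||Az - d||^2 + lam*||Rz||^2.
    """
    n = len(nodes)
    A = np.zeros((len(x), n))
    for row, xi in enumerate(x):
        k = int(np.clip(np.searchsorted(nodes, xi) - 1, 0, n - 2))
        t = (xi - nodes[k]) / (nodes[k + 1] - nodes[k])  # interpolation weight
        A[row, k], A[row, k + 1] = 1 - t, t
    R = np.diff(np.eye(n), n=2, axis=0)                  # second-difference operator
    # Solve the regularized problem as one stacked least-squares system
    M = np.vstack([A, np.sqrt(lam) * R])
    rhs = np.concatenate([d, np.zeros(n - 2)])
    z, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return z
```

Because the regularizer vanishes on linear data, a linear trend is reproduced exactly regardless of the stiffness value.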

Figure 11 shows the 3D face reconstruction results of four people using our method, the GRIDFIT method, and the Kinect HD Face method. We also compare the results with the ground truth faces (Go!SCAN results). Our method achieved the lowest mean and standard deviation of the RMS error among the three methods, and the figure shows that our results always had lower errors than those of the GRIDFIT and HD Face methods. The results show that our method can express human faces more accurately: we preserved as many face details as possible while ensuring smoothness. This is verified in Figure 12. In terms of speed, our method requires approximately 1.31 seconds to load the face landmark detection model for the first frame. After the first frame, our method takes approximately 0.23 seconds per frame. The GRIDFIT method takes 0.63 seconds per frame on average. The Kinect HD Face method is divided into two stages. The first stage is human detection, which detects the body skeleton and the dynamic face; this stage takes approximately 3.67 seconds. The second stage is the face tracking stage, which takes approximately 0.23 seconds per frame. We calculated the time consumption of the three methods for ten frames. Our method, the GRIDFIT method, and the Kinect HD Face method take approximately 3.61 seconds, 6.30 seconds, and 5.97 seconds for ten frames, respectively. Our method not only produces higher-quality models but also takes less time.

5. Conclusion

In this paper, we propose a key region-guided detail optimization approach for 3D face reconstruction from a single RGB-D image. We extract the face region from an image frame and use face landmarks to guide the optimization of the face point cloud. For different face regions, we design various strategies for smoothing noise and holes while retaining necessary face details, resulting in a more realistic face model. When comparing our approach with existing popular methods on a real face reconstruction dataset, our algorithm performs better than the other methods, with a clear advantage in both quantitative and qualitative terms. The Kinect HD Face method performed poorly on disgust, surprise, anger, and fear expressions, while our method was consistent across the seven expressions. Especially in the eyebrow and mouth regions, the Kinect HD method had difficulty expressing the real emotion. The speed comparisons also show that our method is effective for high-quality 3D face generation.

One limitation of our method is that the effective depth range of Kinect is approximately 0.5–4.5 m; measurement data beyond this range have large errors. This leads to a high modelling error when the distance between the subject and the Kinect is large. In addition, in this study, we used Kinect to obtain RGB-D images quickly and easily. Many mobile phones, such as the iPhone X, now include RGB-D cameras, which makes it possible to extend our method to mobile terminals. In future work, we plan to adapt the method for mobile phones to apply it to a broader range of scenarios.

Data Availability

The data used to support the findings of this study are currently under embargo. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work has been supported in part by the National Natural Science Foundation of China (Nos. 61962021, 61602222, 61972344, and 61732015), the Natural Science Foundation of Jiangxi, China (20202ACBL202008), the Postgraduate Course (Virtual Reality Technology) Construction Project of Jiangxi, China, the Higher Education Reform Research Provincial Project of Jiangxi, China (JXJG-20-5-20), and the Technology Project of Jiangxi Education Department, China (GJJ210623).