Abstract

Calibration of the extrinsic parameters of an RGB-D camera is useful in many fields, such as 3D scene reconstruction, robotics, and target detection. Many calibration methods employ a specific calibration object (e.g., a chessboard or cuboid) to calibrate the extrinsic parameters of the RGB-D color camera without using the depth map. As a result, it is difficult to simplify the calibration process, and the color sensor gets calibrated instead of the depth sensor. To this end, we propose a method that employs the depth map to perform extrinsic calibration automatically. In detail, the depth map is first transformed to a 3D point cloud in the camera coordinate system, and then the planes in the 3D point cloud are automatically detected using the Maximum Likelihood Estimation Sample Consensus (MLESAC) method. After that, according to the constraint relationship between the ground plane and the world coordinate system, all planes are traversed and screened until the ground plane is obtained. Finally, the extrinsic parameters are calculated using the spatial relationship between the ground plane and the camera coordinate system. The results show that the mean roll angle error of extrinsic parameter calibration was −1.14°, the mean pitch angle error was 4.57°, and the mean camera height error was 3.96 cm. The proposed method can accurately and automatically estimate the extrinsic parameters of a camera. Furthermore, after parallel optimization, it can achieve real-time performance for automatically estimating a robot's attitude.

1. Introduction

RGB-D cameras, such as Kinect [16], PrimeSense, and Asus Xtion Pro, are traditional RGB cameras with added infrared cameras. Figure 1 shows the structure of a Kinect camera. It can be seen that it includes a color camera, an infrared camera, and an infrared illuminator. The color camera outputs a color image, the infrared camera outputs a depth image, and the infrared illuminator emits infrared light for the calculation of the depth image. With the emergence of low-cost RGB-D cameras such as Kinect, such cameras are increasingly used for tasks such as 3D scene reconstruction and navigation [8, 9], target recognition and tracking [10, 11], 3D measurement [12–15], and even social networks [16], for which the extrinsic parameters of the camera often must be calibrated.

For example, in target detection and recognition [11], the extrinsic parameters of the camera should be calibrated first, and then pedestrians in the scene are tracked. If the extrinsic parameters of the camera can be obtained, the negative effects of perspective can be eliminated. In addition, the detection algorithm can be unified, and the recognition process can be simplified, thus accelerating recognition speed.

However, studies of the calibration of RGB-D cameras mainly focus on the extrinsic parameters of the color camera relative to the infrared camera [17, 18]. Calibrating the extrinsic parameters of an RGB-D camera is often similar to the calibration of the color camera, and the method of chessboard calibration is generally used to calibrate the color camera [19]. Munaro et al. [10] used a chessboard to calibrate the extrinsic parameters of multiple cameras and carried out pedestrian detection based on this. This calibration method did not make full use of the depth information provided by the RGB-D camera, and the calibration results were essentially the extrinsic parameters of the color camera. If it were directly used for 3D reconstruction of the depth map, then there would be a large error. Shibo and Qing [20] designed a calibration board for RGB-D infrared camera recognition, which had holes with regular intervals and calibrated the infrared camera by automatically identifying holes. The need to design special calibration objects increased the difficulty of calibration. Liao et al. [17] divided the calibration into three categories: calibration that required a calibrator, calibration that required human intervention, and fully automatic calibration. The proposed method belonged to the third category.

Liao et al. [17] divided the extrinsic parameter calibration methods for RGB-D cameras into three categories. (1) The first is to calibrate the extrinsic parameters of the color camera and then to use those parameters, together with images collected by the infrared camera, to obtain the extrinsic parameters of the infrared camera through transformation. This method can directly use the color image calibration method, but it needs the extrinsic parameters of the color camera relative to the infrared camera and does not use the information provided by the depth map, so the process is complicated. (2) The second is to detect features on the depth map provided by the infrared camera by designing a specific calibration object and to obtain the extrinsic parameters of the camera from those features (such as chessboard corners). (3) The third method uses the depth map itself to calibrate the extrinsic parameters of the camera: the calibration is carried out by detecting a target on the depth map and using the relationship between the target and the world coordinate system. This method directly uses the depth information, which greatly simplifies the process and improves the efficiency of calibration. Our proposed method, which we will call the ground plane calibration method, automatically calibrates the extrinsic parameters via ground plane detection and belongs to the third category. This method can directly calibrate the infrared camera of an RGB-D camera using its depth information.

2. Extrinsic Parameter Calibration

RGB-D cameras can be divided into color cameras and infrared cameras. For this study, the extrinsic parameters of the infrared camera were calibrated; thus, extrinsic parameter calibration refers to that of the infrared camera. The extrinsic parameter of the camera is

$$M = [\,R \mid T\,], \qquad (1)$$

where $R$ is the $3 \times 3$ rotation matrix of the camera, $T$ is the $3 \times 1$ translation vector of the camera, and $M$ represents the transformation from the camera coordinate system to the world coordinate system.

Figure 2 shows the flowchart of extrinsic parameter calibration. After the establishment of a world coordinate system, the depth image is first obtained from the RGB-D camera. Then, the depth image is transformed to a 3D point cloud under the camera coordinate system using the internal parameters of the infrared camera. After that, the ground plane in the point cloud is solved by iteration. Subsequently, the extrinsic parameters of the camera are obtained by calculation.

2.1. Establishment of the Camera Coordinate System and World Coordinate System

The camera coordinate system is a 3D coordinate system with the infrared camera at its origin, as shown in Figure 3. The $X_c$ axis is along the transverse direction of the Kinect, and the $Z_c$ axis is perpendicular to the Kinect body and points in the shooting direction.

The world coordinate system can generally be established at will. To facilitate the calculation of the extrinsic parameters of the camera, as shown in Figure 4, the world coordinate system should meet the following requirements:
(1) The origin of the world coordinate system is the projection point of the origin of the camera coordinate system on the ground plane.
(2) The $X_w$ axis is the projection of the $Z_c$ axis of the camera coordinate system on the ground.
(3) The $Z_w$ axis is downwardly perpendicular to the ground plane.
(4) The coordinate system is a right-handed system.

In this way, it is convenient to find the point group corresponding to camera coordinates and world coordinates and simplify the calibration process.

2.2. Calculation of Extrinsic Parameters of the Camera

The matrix of transformation from the camera coordinate system to the world coordinate system, that is, the extrinsic parameter of the camera, is a $3 \times 4$ matrix with 12 unknown parameters, as shown in (1). First, the transformation matrix $M' = [\,R' \mid T'\,]$ from the world coordinate system to the camera coordinate system is calculated; the extrinsic parameters of the camera are then obtained by inversion:

$$\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} R' & T' \\ 0 & 1 \end{bmatrix}^{-1}. \qquad (2)$$

A point in the camera coordinate system is $P^c$, and the corresponding point in the world coordinate system is $P^w$, as shown in Figure 5. Four special points are selected in the world coordinate system:

$$P_0^w = (0, 0, 0), \quad P_1^w = (1, 0, 0), \quad P_2^w = (0, 1, 0), \quad P_3^w = (0, 0, 1). \qquad (3)$$

By calculating the corresponding points in the camera coordinate system, that is, $P_0^c$, $P_1^c$, $P_2^c$, and $P_3^c$, the value of $M'$ can be obtained as

$$M' = [\,P_1^c - P_0^c \quad P_2^c - P_0^c \quad P_3^c - P_0^c \quad P_0^c\,]. \qquad (4)$$

It is known that the plane in the camera coordinate system is $ax + by + cz + d = 0$. Its normal vector perpendicular to the ground is $n = (a, b, c)$, with unit normal $\hat{n} = n / \|n\|$. As the origin of the world coordinate system is the projection of the origin of the camera coordinate system on the plane, the coordinates corresponding to the origin of the world coordinate system in the camera coordinate system are

$$P_0^c = -\frac{d}{a^2 + b^2 + c^2}\,(a, b, c). \qquad (5)$$

If point $P_3^w = (0, 0, 1)$ is on the $Z_w$ axis of the world coordinate system, then $P_0^c$, $P_3^c$, and the normal $n$ are collinear, so

$$P_3^c = P_0^c + \hat{n}, \qquad (6)$$

with the sign of $\hat{n}$ chosen so that it points from the camera toward the ground.

Because the $X_w$ axis is the projection of the $Z_c$ axis on the plane,

$$P_1^c = P_0^c + \frac{q - P_0^c}{\|q - P_0^c\|}, \qquad (7)$$

where $\hat{k} = (0, 0, 1)$ is the unit vector of the $Z_c$ axis and $q$ is the projection of point $\hat{k}$ in the camera coordinate system to the plane. The projection point $q$ lies on the line through $\hat{k}$ along the normal and on the plane:

$$q = \hat{k} + t\,n, \qquad (8)$$

$$a q_x + b q_y + c q_z + d = 0. \qquad (9)$$

By solving (9), the projection point can be obtained as

$$q = \hat{k} - \frac{(\hat{k} - p) \cdot n}{\|n\|^2}\,n,$$

where $a$, $b$, $c$, and $d$ are the parameters of the plane equation and $p$ is an arbitrary point on the plane.
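The projection of a point onto the plane can be written directly from the plane coefficients. A minimal NumPy sketch (an illustration, not the authors' MATLAB code):

```python
import numpy as np

def project_to_plane(point, plane):
    """Project a 3D point onto the plane a*x + b*y + c*z + d = 0."""
    a, b, c, d = plane
    n = np.array([a, b, c], dtype=float)
    t = (np.dot(n, point) + d) / np.dot(n, n)  # signed offset along the normal
    return point - t * n

# Example: project the unit point of the camera Z axis onto the plane z = 2.
q = project_to_plane(np.array([0.0, 0.0, 1.0]), (0.0, 0.0, 1.0, -2.0))
# q lies on the plane: [0, 0, 2]
```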

After $P_0^c$, $P_1^c$, and $P_3^c$ are obtained, $P_2^c$ can be obtained by vector cross-multiplication as

$$P_2^c = P_0^c + (P_3^c - P_0^c) \times (P_1^c - P_0^c). \qquad (10)$$

By solving (5)–(7) and (10), the coordinates in the camera coordinate system of the four points selected from the world coordinate system are obtained. Then, the transformation matrix $M'$ from world coordinates to camera coordinates is obtained by solving (4). Finally, the extrinsic parameter matrix of the infrared camera is obtained by solving (2).
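Putting the steps of this section together, the computation can be sketched in NumPy (a sketch under the coordinate conventions of Section 2.1, not the authors' MATLAB implementation; it assumes the camera is above the ground and not looking straight down):

```python
import numpy as np

def extrinsic_from_ground_plane(plane):
    """Camera-to-world 4x4 transform from a ground plane a*x + b*y + c*z + d = 0
    expressed in camera coordinates."""
    a, b, c, d = plane
    n = np.array([a, b, c], dtype=float)
    # (5): world origin = projection of the camera origin onto the plane.
    p0 = -d * n / n.dot(n)
    # (6): unit point of the world Z axis; normal oriented toward the ground.
    n_hat = n / np.linalg.norm(n)
    if np.dot(n_hat, p0) < 0:
        n_hat = -n_hat
    p3 = p0 + n_hat
    # (7): unit point of the world X axis = projection of the camera Z axis.
    k = np.array([0.0, 0.0, 1.0])
    q = k - (np.dot(n, k) + d) / n.dot(n) * n
    x_hat = (q - p0) / np.linalg.norm(q - p0)
    p1 = p0 + x_hat
    # (10): unit point of the world Y axis by the right-hand rule (Y = Z x X).
    p2 = p0 + np.cross(n_hat, x_hat)
    # (4): world-to-camera matrix M' from the four correspondences, then (2).
    m_wc = np.eye(4)
    m_wc[:3, 0] = p1 - p0
    m_wc[:3, 1] = p2 - p0
    m_wc[:3, 2] = p3 - p0
    m_wc[:3, 3] = p0
    return np.linalg.inv(m_wc)
```

For example, for a level camera 1.5 m above the ground (plane $y = 1.5$ in camera coordinates, camera $Y$ axis pointing down), the returned matrix maps the camera center to the world point $(0, 0, -1.5)$, i.e., a height of 1.5 m along the downward world $Z$ axis.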

3. Ground Plane Estimation

To facilitate plane detection, the depth map is first transformed to a 3D point cloud in the camera coordinate system using the internal parameters. Then, the maximum likelihood estimation sample consensus (MLESAC) method [21] is used to extract planes. The detected planes are traversed iteratively: for each candidate plane, the extrinsic parameters are computed, and the extrinsic parameters and point cloud are used to determine whether the plane is the ground plane. We finally obtain the extrinsic parameters of the camera, as shown in Figure 6.

3.1. Transformation of Depth Map to 3D Point Cloud in the Camera Coordinate System

Consider the following:

$$Z = s\,d(u, v), \qquad X = \frac{(u - u_0)\,dx\,Z}{f}, \qquad Y = \frac{(v - v_0)\,dy\,Z}{f}. \qquad (11)$$

Figure 7 shows the imaging model of the camera. The coordinate axes of the image plane are $u$ and $v$, respectively. The imaging point of point $P = (X, Y, Z)$ in the camera coordinate system on the image plane is $(u, v)$. The focal length of the camera is $f$. The coordinates of the intersection of the optical axis and the image plane are $u_0$ and $v_0$. The length and width of the pixels are $dx$ and $dy$, respectively. The value of pixel $(u, v)$ is $d(u, v)$. The proportion between the $Z$ coordinate value in the camera coordinate system and the value of the pixel is $s$.

Then, $(u, v, d(u, v))$ can be transformed to $(X, Y, Z)$, as shown in (11).
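Equation (11) vectorizes directly over the whole depth image. A minimal NumPy sketch (`f`, `dx`, `dy`, `u0`, `v0`, and the depth scale `s` are placeholders for the calibrated intrinsic values):

```python
import numpy as np

def depth_to_pointcloud(depth, f, dx, dy, u0, v0, s):
    """Back-project a depth image into a 3D point cloud in camera coordinates
    using (11): Z = s*d(u,v), X = (u-u0)*dx*Z/f, Y = (v-v0)*dy*Z/f."""
    v, u = np.indices(depth.shape)            # pixel row (v) and column (u) grids
    z = s * depth.astype(float)
    x = (u - u0) * dx * z / f
    y = (v - v0) * dy * z / f
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                 # drop pixels with no depth reading
```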

3.2. Ground Plane Estimation

Ground plane estimation is the basis of the subsequent extrinsic parameter calculation of the camera; its accuracy determines the accuracy of the extrinsic camera parameters. In a scene, multiple planes will be fitted. Whether a plane is the ground plane is then determined according to the following conditions:

$$\theta \le \theta_t, \qquad \mathrm{med}(Z_{in}) + \varepsilon \ge \mathrm{med}(Z_{out}), \qquad (12)$$

where $\theta$ is the angle between the $Z_c$ axis of the camera and the plane; $\mathrm{med}(\cdot)$ is the operation of taking the median value; $Z_{in}$ is the set of interior-point $Z$-values of the fitted plane model; $Z_{out}$ is the set of exterior-point $Z$-values of the fitted plane model; and $\varepsilon$ is the set tolerance value. Equation (12) represents two conditions: (1) the inclination angle of the camera relative to the plane does not exceed the threshold $\theta_t$, and (2) after the extrinsic parameters are calculated according to the plane and the point cloud is transformed to the world coordinate system, the point set on the plane has the largest $Z$-value.

According to this definition, the planes were screened until a qualifying ground plane was found. The flowchart is shown in Figure 6.

First, the 3D point cloud in the camera coordinate system was input, and a plane was fitted using the MLESAC method; the set of interior points belonging to the plane was recorded. Using this plane and the method in Section 2.2, the extrinsic parameters of the camera were calculated. Combined with the internal parameters of the camera, the 3D point cloud in the camera coordinate system was transformed to the world coordinate system, and it was judged whether condition (12) was met. If not, the recorded set of interior points was removed from the point cloud, and plane fitting continued. Otherwise, the plane was taken as the ground plane, and the extrinsic parameters of the camera were output.
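The screening loop can be sketched as follows. For brevity this sketch fits each plane with a plain RANSAC rather than MLESAC, and the thresholds (`tol`, `max_tilt_deg`, `eps`) are illustrative, not the paper's values:

```python
import numpy as np

def ransac_plane(pts, n_iter=200, tol=0.02, seed=0):
    """Fit a plane (a, b, c, d) with plain RANSAC (a simplified stand-in for
    MLESAC): a*x + b*y + c*z + d = 0 with unit normal (a, b, c)."""
    rng = np.random.default_rng(seed)
    best, best_in = None, None
    for _ in range(n_iter):
        p = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-9:
            continue                                  # degenerate sample
        n = n / np.linalg.norm(n)
        d = -np.dot(n, p[0])
        inliers = np.abs(pts @ n + d) < tol
        if best_in is None or inliers.sum() > best_in.sum():
            best, best_in = (n[0], n[1], n[2], d), inliers
    return best, best_in

def find_ground_plane(pts, max_tilt_deg=60.0, eps=0.05):
    """Traverse fitted planes until one satisfies both conditions of (12)."""
    remaining = pts.copy()
    while len(remaining) > 100:
        plane, inl = ransac_plane(remaining)
        if plane is None:
            return None
        a, b, c, d = plane
        n = np.array([a, b, c])
        # Condition 1: tilt of the camera Z axis relative to the plane.
        tilt = np.degrees(np.arcsin(min(abs(c), 1.0)))
        # Condition 2: along the downward normal (pointing away from the
        # camera), the plane's interior points must lie farthest.
        n_down = n if d < 0 else -n
        proj = remaining @ n_down
        outside = proj[~inl]
        cond2 = outside.size == 0 or np.median(proj[inl]) + eps >= np.median(outside)
        if tilt <= max_tilt_deg and cond2:
            return plane
        remaining = remaining[~inl]                   # drop this plane, continue
    return None
```

On a synthetic scene containing a horizontal ground patch and a vertical wall, the wall fails the tilt test and is discarded, after which the ground plane is accepted.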

4. Experiment

4.1. Experiment Process

In this experiment, a PrimeSense camera was used to collect video data, and MATLAB was used for simulation to validate the proposed algorithm. To facilitate the accuracy comparison, the chessboard calibration results of the color camera were used as the reference data. Because the matrix form is not suitable for comparison, the extrinsic parameters were transformed to camera height ($h$), roll angle ($\gamma$), and pitch angle ($\theta$). The calibration accuracy was measured by comparing the camera height, roll angle, and pitch angle. Following the world coordinate system established in Section 2, the values can be obtained as

$$\theta = \arcsin(z_c^w \cdot z_w), \qquad \gamma = \arcsin(x_c^w \cdot z_w), \qquad h = \|T\|,$$

where $z_c^w$ is the camera $Z$-axis in the world coordinate system, $x_c^w$ is the camera $X$-axis in the world coordinate system, and $z_w$ is the world $Z$-axis.
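The conversion from an extrinsic matrix to the attitude $(h, \gamma, \theta)$ can be sketched as follows (a NumPy sketch assuming a 4×4 camera-to-world matrix and the downward world $Z$ axis of Section 2.1; not the authors' MATLAB code):

```python
import numpy as np

def camera_attitude(m):
    """Extract camera height h, roll and pitch (degrees) from a 4x4
    camera-to-world extrinsic matrix, with the world Z axis pointing down."""
    r = m[:3, :3]
    z_cam_w = r @ np.array([0.0, 0.0, 1.0])   # camera Z axis in world coords
    x_cam_w = r @ np.array([1.0, 0.0, 0.0])   # camera X axis in world coords
    z_w = np.array([0.0, 0.0, 1.0])           # world Z axis (downward)
    pitch = np.degrees(np.arcsin(np.clip(z_cam_w @ z_w, -1.0, 1.0)))
    roll = np.degrees(np.arcsin(np.clip(x_cam_w @ z_w, -1.0, 1.0)))
    # The camera center projects to the world origin, so its world position
    # is (0, 0, -h); the height is the norm of the translation part.
    h = np.linalg.norm(m[:3, 3])
    return h, roll, pitch
```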

The experiment was carried out according to the following steps:
(1) A PrimeSense camera was used to collect video data (each frame of the video had a clear chessboard), including color video and depth video.
(2) Video frames were selected randomly from the color video as the input to the Zhang camera calibration method [19], and the internal camera parameters were estimated.
(3) Each frame of the color video was traversed. The camera's extrinsic parameters were estimated using the internal parameters and the chessboard corners detected in the current frame, and the camera attitude ($h_c$, $\gamma_c$, and $\theta_c$) of each frame of color video was obtained.
(4) Each frame of the depth video was traversed. The ground plane was detected using the proposed method, and the extrinsic parameters of the camera were estimated to obtain the camera attitude ($h_d$, $\gamma_d$, and $\theta_d$) of each frame of depth video.

The camera height difference ($\Delta h$), roll angle difference ($\Delta \gamma$), and pitch angle difference ($\Delta \theta$) of each frame were calculated using the following equations:

$$\Delta h = h_c - h_d, \qquad \Delta \gamma = \gamma_c - \gamma_d, \qquad \Delta \theta = \theta_c - \theta_d,$$

where the subscripts $c$ and $d$ denote the values estimated from the color (chessboard) video and the depth video (proposed method), respectively.

The video formats collected in Step 1 are shown in Table 1.

Each frame of the video contained a chessboard, as shown in Figure 8. In the experiment, the chessboard was used to estimate the camera's extrinsic parameters, which were taken as the reference data. Because different chessboard images yield different internal parameters when the camera's internal parameters are calculated using the Zhang calibration method [19], Steps 1 to 3 were executed 86 times. Each video frame could thus estimate 86 sets of extrinsic parameters, whose medians were taken as the reference data.

Both the length and width of each chessboard square were 40 mm. The camera's extrinsic parameters, as obtained through chessboard calibration, are shown in Figure 8.

In Figure 9, $\theta_c$, $\gamma_c$, and $h_c$ are the medians of the camera pitch angle, roll angle, and height, respectively, as measured using the chessboard method, and $\theta_d$, $\gamma_d$, and $h_d$ are the corresponding medians as measured using the proposed method.

Figure 10 shows the experimental errors, and Table 2 shows the final measured results.

4.2. Error Analysis

Because there was no high-precision instrument to measure the extrinsic parameters of the camera in the experiment, these were measured using the chessboard calibration method and taken as the reference data against which to assess the accuracy of the proposed method. The factors affecting the accuracy of this experiment are as follows:
(1) Quantization error of corner detection. Because of the limited resolution of the video frames, the quantization error of corner detection was relatively large when calculating the camera's internal and extrinsic parameters.
(2) The chessboard method and the proposed method measured the extrinsic parameters of the RGB-D camera's color sensor and infrared camera, respectively, so the two sets of parameters refer to physically offset sensors.
(3) The noise of the RGB-D camera's scanned scene data affects the detection of the ground plane and thus the accuracy of the camera's extrinsic parameters.
(4) The parameters used to stop the MLESAC iteration in ground plane detection also affect the results of this experiment.

4.3. Influence of Scene Noise on Ground Plane Detection

RGB-D cameras have many noise sources, such as temperature, the incident angle and intensity of ambient light, and texture [22]. The MLESAC method used in this study can deal with small amounts of noise, but not with scenes with too much noise or data loss:
(1) Strong sunlight causes too many ground plane noise points, resulting in inaccurate estimation of the plane parameters and a slight influence on the camera height and roll angle.
(2) If the reflectivity of the ground is too high (for example, a mirror placed on the ground), the data of that area are lost. If too much data are missing, the ground plane cannot be detected, and the extrinsic parameters of the camera cannot be estimated.

To verify the influence of noise on camera attitude, Figure 11 shows how the camera attitude error of the 929th frame changes as the variance of added Gaussian noise increases. The mean of the Gaussian noise was 0, and its variance was varied. It can be seen that as the noise variance increases, the height error increases and the stability of the pitch angle and roll angle decreases. In addition, when the variance is greater than 0.25, the plane cannot be correctly estimated.
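The trend in Figure 11 can be reproduced in outline with synthetic data. This is a simplified sketch, not the paper's experiment: a noisy synthetic ground patch is fitted by least squares instead of MLESAC, and the noise variances are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ground patch 1.5 m below a level camera (camera Y axis down).
xs, zs = np.meshgrid(np.linspace(-2, 2, 40), np.linspace(0.5, 4, 40))
pts = np.stack([xs.ravel(), np.full(xs.size, 1.5), zs.ravel()], axis=1)

for var in [0.0, 0.01, 0.05, 0.1]:
    noisy = pts + rng.normal(0.0, np.sqrt(var), pts.shape)
    # Least-squares plane y = p0*x + p1*z + p2; the estimated camera height is
    # p2, the plane height at the point directly below the camera (x = z = 0).
    A = np.column_stack([noisy[:, 0], noisy[:, 2], np.ones(len(noisy))])
    p, *_ = np.linalg.lstsq(A, noisy[:, 1], rcond=None)
    print(f"variance {var:.2f}: height estimate {p[2]:.3f} m")
```

As the variance grows, the scatter of the recovered plane parameters (and hence the height, roll, and pitch estimates) increases, mirroring the behavior reported above.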

5. Conclusion

During pedestrian detection based on RGB-D cameras, environmental vibration can shift an RGB-D camera from its original position, and the extrinsic parameters then change greatly, which directly affects the detection accuracy. By using the automatic extrinsic parameter calibration method proposed in this study, the extrinsic parameters can be automatically corrected when no pedestrian is present, solving this problem.

The proposed method can be applied to the automatic adjustment of the extrinsic parameters of 3D cameras (speckle, time-of-flight, and binocular cameras) and is highly parallelizable. After parallel implementation, it achieves real-time performance and can be used for the automatic calibration of the extrinsic parameters of 3D cameras on mobile robots.

In this method, we extracted planes from the 3D point cloud and used the positional relationship between each plane and the world coordinate system to obtain the camera's extrinsic parameters, which were in turn used to determine whether the current plane was the ground plane. If not, the next plane was used to calculate the extrinsic parameters, until the ground plane was found and the extrinsic parameters of the infrared camera were obtained. In this study, conditions for the ground plane were given that guarantee the correctness of the established world coordinate system, and plane detection was combined with extrinsic parameter estimation to achieve automatic extrinsic parameter calibration. This method does not require an additional calibration object, it calibrates the infrared camera directly, and the results are reliable.

The proposed method has the limitation that there must be a ground plane in the scene; if the ground plane cannot be detected, the calibration cannot be carried out. Two problems remain to be solved: (1) improving the accuracy of calibration by using targets such as pedestrians in the scene for fine calibration, and (2) automatically calibrating the extrinsic parameters of multiple cameras according to the common area captured by the cameras. If no common area is captured by two cameras, a simple calibration object should be designed so that automatic calibration can be carried out according to its geometric size.


Data Availability

The MATLAB source code and the video data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grant 61572083, in part by the Ministry of Education Joint Fund Project of China under Grant 6141A02022610, in part by the Fundamental Research Fund for the Central Universities of China under Grants 310824173601, 300102249304, and 300102248303, in part by the Fundamental Research Funds for the Central Universities Team Cultivation Project under Grant 300102248402, and in part by the Funds for Shaanxi Key R&D Program under Grant 2018ZDXMGY-047.