Abstract

The real-time calculation of positioning error, error correction, and state analysis have always been difficult challenges in autonomous manipulator positioning. To solve this problem, a simple depth imaging device (Kinect) is used, and a Kalman filtering method based on three-frame differencing is proposed to capture the end-effector motion. Moreover, a backpropagation (BP) neural network is adopted to recognize the target. At the same time, a batch point cloud model is built from the depth video stream to calculate the space coordinates of the end-effector and the target. Then, a 3D surface is fitted using radial basis functions (RBF) and morphological processing. The experiments demonstrate that the end-effector positioning error can be corrected in a short time. The prediction accuracies of both position and velocity reach 99%, and a recognition rate of 99.8% is achieved for cylindrical objects. Furthermore, the gradual convergence of the end-effector center (EEC) to the target center (TC) shows that the autonomous positioning is successful. Simultaneously, 3D reconstruction is completed to analyze the positioning state. Hence, the algorithm proposed in this paper is competent for autonomous positioning of the manipulator, and its effectiveness is also validated by the 3D reconstruction. The computational ability and system efficiency are greatly improved.

1. Introduction

In computer vision, 3D reconstruction refers to the process of restoring a 3D scene from a single view or from multiple-view images. There are four kinds of reconstruction methods: binocular stereovision, sequence images, photometric stereo, and motion view analysis. The binocular stereo method is suitable for larger objects, the sequence image method for small objects, and photometric stereo and motion view analysis for large and complex scene reconstruction. Because single-view information is incomplete, its reconstruction requires empirical knowledge. Multiview reconstruction is comparatively straightforward: the pose relation between the image coordinate frame and the world frame is calculated, and 3D information is then reconstructed from a plurality of 2D images; however, the computational complexity is high and the cost is expensive. A few scholars have begun to study 3D reconstruction with simple depth imaging apparatus and have achieved certain results. Izadi et al. [1] propose a new 3D reconstruction technology based on Microsoft Kinect; however, due to the limitation of TOF technology, the accuracy of the surface texture information is not high. The significance of 3D reconstruction is to monitor the end-effector positioning process and error in real time, and it also lays the foundation for debugging and analysis of autonomous manipulator tasks. Ma et al. [2] propose a gradual human reconstruction method based on an individual Kinect: body feature points are located in the depth video frames by combining feature point detection with an error correction algorithm, and the human body model is obtained by estimating the body size. Guo and Gao [3] propose a robust automatic UAV image reconstruction method under a batch framework. Li et al. [4] approach multiview reconstruction from the perspective of motion visual analysis; the sparse point cloud and initial mesh are built from each view bias model. Lü et al. [5] propose a Bayesian network model that describes the spatial relationship and dynamic characteristics of body joints; a 3D reconstruction system for the golf swing process is built from the similarities of swing movements, and the problem of limb occlusion is effectively solved by using a simple depth imaging device to capture the motions. Lin et al. [6] use an adaptive-window stereo-matching reconstruction method based on integral gray variance and integral gradient variance: image texture quality is determined according to the integral variance, the correlation is computed when it exceeds a preset variance threshold, and the whole image must be traversed to obtain a dense disparity map. Izadi et al. [7] obtain point cloud data using a single mobile Kinect and four fixed ones, and the point cloud alignment and fitting problems are solved by iterative closest points. Kahl and Hartley [8] convert 3D reconstruction into a norm minimization problem; a closed-form approximate solution is derived by second-order cone programming (SOCP), and when the camera rotation is known, the camera translation and spatial position can be solved simultaneously. In this paper, the reconstruction process must be understood by capturing the end-effector motion and recognizing the target object; the model is a 3D point cloud fitting.

(1) The paper’s overall engineering problem: the vision system is used to guide the positioning control of the manipulator, and each joint trajectory is corrected continuously according to the positioning error information. ① The internal and external parameters of the Kinect are obtained by camera calibration; they provide the known conditions for solving the 3D model. ② The kinematics model is established to guide the motion control of the manipulator. ③ The movement of the end-effector is detected and tracked in the RGB images using Kalman filtering, and the motion state (the end-effector’s position and velocity) is estimated. ④ The object to be positioned and the position of the TC are determined by target object recognition. ⑤ The manipulator’s motion is corrected by the error between the EEC and the TC until the positioning succeeds. ⑥ The effectiveness of the algorithm is verified by 3D reconstruction of the positioning process, which also provides convenient visual monitoring.

(2) The paper’s research intention: firstly, we hope to improve the autonomous operability of the manipulator and its adaptability to the environment. Secondly, we hope that the system can visually monitor the positioning process and that the positioning error can be calculated and analyzed in real time.

2. Camera Calibration

2.1. Kinect Hardware

A simple depth imaging device, the Kinect, is used as the sensor; it consists of an RGB camera, an IR camera, a rotating motor, and a microphone array. The IR camera consists of an infrared transmitter and an infrared receiver as shown in Figures 1(a) and 1(b). The RGB camera outputs color images, and the IR camera outputs depth images directly as shown in Figures 1(c) and 1(d). Figure 1(c) shows the depth point cloud information, Figure 1(d) shows the mapping between the depth point cloud and distance, and Figure 1(e) shows the real-time depth value corresponding to point (320, 240). The RGB camera supports three frame-rate modes (12 fps, 30 fps, and 15 fps, depending on resolution), and the IR camera runs at 30 fps. So far, most studies based on Kinect images concern human detection and pose estimation, and newer research addresses human behavior recognition. Hassine et al. [9] use the Kinect to detect and identify a target object and calculate the target position in real time. Akshara et al. [10] propose a new autonomous positioning method for a manipulator based on machine learning and the Kinect camera. An autonomous learning algorithm (in which the target feature set is trained) based on Kinect is studied at Cornell University [11]. In this paper, RGB images are used for object recognition, segmentation, motion capture, and state estimation; depth images are used to calculate the target’s 3D information (including the EEC, the TC, and the 3D surface fitting).

2.2. Kinect Calibration

Internal parameters describe the transform relation between the camera coordinate frame and the image coordinate frame, and external parameters describe the transform relation between the camera coordinate frame and the world frame. Hence, camera calibration is the premise of 3D reconstruction. Calibration with OpenCV is convenient, but the results are often inaccurate and unstable, so this paper uses the Matlab calibration toolbox. The calibration results are then applied to stereo matching and parallax calculation. In Zhang’s calibration [12, 13], the radial distortion coefficients are estimated by the least squares method, and the internal and external parameters are estimated by a closed-form solution. The external parameters of the Kinect are formed by the rotation and translation from the calibration plate to the IR camera coordinate frame and by the corresponding rotation and translation for the RGB camera. The Herrera method [14] is used for Kinect calibration: Zhang’s method is first used for parameter initialization, and the internal and external parameters are then refined by nonlinear minimization of the cost function with the Levenberg-Marquardt method. The perspective projection transformation describes the mapping from a space point $M=(X,Y,Z)^{T}$ to a 2D image point $m=(u,v)^{T}$ and is denoted by the $3\times 4$ matrix $P$:
$$ s\,\tilde{m}=P\,\tilde{M}=K\,[R\ \ t]\,\tilde{M},\qquad K=\begin{bmatrix}\alpha_{x}&\gamma&u_{0}\\ 0&\alpha_{y}&v_{0}\\ 0&0&1\end{bmatrix}, \tag{1} $$
where $s$ denotes a nonzero scale factor, $\tilde{m}$ and $\tilde{M}$ denote the homogeneous pixel and world coordinates, $R$ denotes the rotation matrix, and $t$ denotes the translation vector; $R$ and $t$ describe the camera direction and position in the world frame, respectively. $K$ denotes the internal parameter matrix. $\alpha_{x}$ and $\alpha_{y}$ denote the conversion factors between an imaging-plane pixel and the physical length in space. In the RGB image one pixel represents 5.48 μm, and in the depth image one pixel represents 10.34 μm, so the focal lengths in pixel units follow from dividing the metric focal lengths by these pixel sizes. $\gamma$ denotes a skew factor; here $\gamma=0$. $(u_{0},v_{0})$ is the principal point, the intersection of the optical axis with the image plane. The distortion coefficients are not included in (1), but they can be obtained by the Herrera method [14], where $k_{1}$ and $k_{2}$ denote the radial distortion coefficients and $p_{1}$ and $p_{2}$ denote the tangential distortion coefficients.
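As an illustration of this camera model, the following MATLAB sketch projects a 3D world point into pixel coordinates through the intrinsic matrix and the Brown radial/tangential distortion; all numerical parameter values are placeholder assumptions, not the calibration results of this paper.

```matlab
% Sketch: pinhole projection with Brown radial/tangential distortion.
% All numerical parameters are placeholders, not this paper's calibration results.
K  = [580 0 320; 0 580 240; 0 0 1];   % assumed intrinsics [ax gamma u0; 0 ay v0; 0 0 1]
R  = eye(3);  t = [0; 0; 0];          % assumed extrinsics (camera frame = world frame)
k1 = 0.1;  k2 = -0.05;                % assumed radial distortion coefficients
p1 = 0.001;  p2 = -0.002;             % assumed tangential distortion coefficients

M  = [0.2; -0.1; 1.5];                % a 3D point in the world frame (metres)
Pc = R*M + t;                         % world frame -> camera frame
x  = Pc(1)/Pc(3);  y = Pc(2)/Pc(3);   % ideal normalized image coordinates

r2 = x^2 + y^2;                       % Brown distortion model
xd = x*(1 + k1*r2 + k2*r2^2) + 2*p1*x*y + p2*(r2 + 2*x^2);
yd = y*(1 + k1*r2 + k2*r2^2) + p1*(r2 + 2*y^2) + 2*p2*x*y;

uv = K*[xd; yd; 1];                   % distorted normalized coordinates -> pixels
fprintf('pixel coordinates: (%.1f, %.1f)\n', uv(1), uv(2));
```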

The length and width of the calibration plate are 575 mm and 350 mm, respectively. Simulink is used to read the Kinect depth video [15–17]. Twenty videos are obtained from different perspectives, and each video records the calibration plate for 2~3 seconds. Each video is converted into images frame by frame, from which 20 images from different perspectives are clustered as shown in Figure 2. The calibration can be divided into two steps: initialization and nonlinear optimization. The optimization is an accurate calculation process that iterates by gradient descent on the Jacobian matrix. The color camera calibration results, with the corresponding rotation matrix and translation vector, are shown in Table 1; the depth camera calibration results are shown in Table 2.

3. Kinematics Modeling

Manipulator motion control is guided by the forward and inverse kinematics models. The system obtains the position of the target with the Kinect; the inverse kinematics model is then used to compute the rotation angle of each joint, and the corresponding commands are sent to control the movement of each joint so that the end-effector reaches the target position.

This paper is illustrated with the example of a 5-degree-of-freedom (DOF) manipulator. There are five rotation axes, six joints, and three links as shown in Figure 3. The rotation axes are the waist rotation axis, arm pitching axis, forearm pitching axis, wrist pitching axis, and wrist rotation axis. The forward kinematics model is established so that the end-effector motion trajectory can be derived from the joint angles. Define
$$ {}^{i-1}T_{i}=\begin{bmatrix} c\theta_{i} & -s\theta_{i}c\alpha_{i} & s\theta_{i}s\alpha_{i} & a_{i}c\theta_{i} \\ s\theta_{i} & c\theta_{i}c\alpha_{i} & -c\theta_{i}s\alpha_{i} & a_{i}s\theta_{i} \\ 0 & s\alpha_{i} & c\alpha_{i} & d_{i} \\ 0 & 0 & 0 & 1 \end{bmatrix}, $$
which is the homogeneous transformation matrix from the $(i-1)$th to the $i$th coordinate frame, where $i$ denotes the joint index and $(\theta_{i}, d_{i}, a_{i}, \alpha_{i})$ denotes the link parameter vector. Sine and cosine functions are abbreviated as s and c. The inverse kinematics model is established as shown in [18].
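As a minimal sketch of how the forward kinematics model composes these homogeneous transformations, the MATLAB fragment below chains the standard DH matrices for five joints; the DH parameter values are placeholders, not the actual link parameters of the manipulator in this paper.

```matlab
% Sketch: forward kinematics by chaining standard DH transformations.
% The DH parameters below are placeholders, not the actual link parameters.
theta = [0.1 0.3 -0.2 0.4 0.0];          % joint angles (rad)
d     = [0.10 0 0 0 0.05];               % link offsets (m)
a     = [0 0.25 0.20 0 0];               % link lengths (m)
alpha = [pi/2 0 0 pi/2 0];               % link twists (rad)

T = eye(4);                              % base frame
for i = 1:5
    ct = cos(theta(i));  st = sin(theta(i));
    ca = cos(alpha(i));  sa = sin(alpha(i));
    Ti = [ct -st*ca  st*sa a(i)*ct;      % homogeneous transform from frame i-1 to i
          st  ct*ca -ct*sa a(i)*st;
          0   sa     ca    d(i);
          0   0      0     1];
    T = T*Ti;                            % accumulate base -> frame i
end
endEffectorPos = T(1:3,4)                % end-effector position in the base frame
```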

4. Manipulator End Motion Capture

Proper detection of the end-effector motion is important for postprocessing. In this paper, the changing pixel regions are detected from the image sequence, and the moving object is extracted from the static background using the three-frame differencing method. The consecutive three-frame differencing method [19] copes better with environmental noise, such as weather, light, shadow, and cluttered background interference, and handles the double-shadow problem better than the two-adjacent-frame differencing method [20]. Then, morphological erosion and dilation are applied to the binary image to remove holes.
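A minimal MATLAB sketch of the three-frame differencing with morphological cleanup might look as follows; it assumes consecutive RGB frames stored in a cell array `frames`, and the structuring element size is an illustrative choice.

```matlab
% Sketch: three-frame differencing followed by morphological cleanup.
% frames{k} are consecutive RGB frames of the Kinect video (an assumed variable).
g1 = rgb2gray(frames{k-1});  g2 = rgb2gray(frames{k});  g3 = rgb2gray(frames{k+1});

d12 = imabsdiff(g2, g1);                       % differences of adjacent frames
d23 = imabsdiff(g3, g2);
b12 = im2bw(d12, graythresh(d12));             % binarize each difference image
b23 = im2bw(d23, graythresh(d23));
moving = b12 & b23;                            % intersection = three-frame differencing result

se = strel('disk', 3);                         % structuring element (illustrative size)
moving = imdilate(imerode(moving, se), se);    % erosion then dilation to remove noise and holes

stats = regionprops(moving, 'BoundingBox', 'Centroid');   % moving-region box and center
```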

The motion state is estimated after the moving end-effector has been extracted, which makes it convenient to calculate the position error between the end-effector and the target object. The end-effector movement in the video is a discrete-time dynamic system. The state vector is $x_{k}=(x, y, v_{x}, v_{y})^{T}$, where $x$ and $y$ denote the end-effector center coordinates on the x- and y-axes and $v_{x}$ and $v_{y}$ denote the velocities on the x- and y-axes, respectively. Assume that the observation vector is $z_{k}$, whose components are the observations on the x- and y-axes. The system state is predicted and tracked using Kalman filtering [21]. The state equations can be expressed as
$$ x_{k}=Ax_{k-1}+Bu_{k-1}+w_{k-1},\qquad z_{k}=Hx_{k}+v_{k}, $$
where $k$ denotes the iteration time, $x_{k}$ denotes the system state at step $k$, and $u_{k-1}$ denotes the system control variable; in this paper the system is assumed to be free from outside influence, so $Bu_{k-1}=0$. $z_{k}$ denotes the measurement at step $k$, and $w$ and $v$ denote the process noise and measurement noise, respectively, which are assumed to be white Gaussian noise (WGN). $A$ denotes the state transition matrix from $k-1$ to $k$; since the end-effector’s movement is approximated as uniform motion, the state transition matrix is
$$ A=\begin{bmatrix}1&0&\Delta t&0\\ 0&1&0&\Delta t\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}. $$
$B$ denotes the matrix of control coefficients, and $H$ denotes the measurement transition matrix, also called the observation matrix; here the measured quantities are read directly from the state vector. Prediction estimates the state of the next moment from the current state and the error covariance, giving the a priori estimate. Correction is a feedback process in which the new actual observation and the a priori estimate are combined, giving the a posteriori estimate. When the system is represented by the state equations above, the posterior probability density function of the mean and covariance can be estimated. The state prediction equation is
$$ \hat{x}_{k}^{-}=A\hat{x}_{k-1}+Bu_{k-1}. $$
The covariance prediction equation is
$$ P_{k}^{-}=AP_{k-1}A^{T}+Q. $$
The Kalman gain matrix is
$$ K_{k}=P_{k}^{-}H^{T}\bigl(HP_{k}^{-}H^{T}+R\bigr)^{-1}. $$
The covariance estimation is
$$ P_{k}=\bigl(I-K_{k}H\bigr)P_{k}^{-}. $$
The state estimation is
$$ \hat{x}_{k}=\hat{x}_{k}^{-}+K_{k}\bigl(z_{k}-H\hat{x}_{k}^{-}\bigr), $$
where $\hat{x}_{k}^{-}$ denotes the a priori state prediction at $k$, $\hat{x}_{k}$ denotes the a posteriori state estimate at $k$, $z_{k}$ denotes the measurement at $k$, $P_{k}^{-}$ denotes the a priori error covariance matrix, $P_{k}$ denotes the a posteriori error covariance matrix, $Q$ denotes the covariance matrix of the process noise, $R$ denotes the covariance matrix of the measurement noise, and $K_{k}$ denotes the Kalman gain matrix. After each prediction and correction, the a priori estimate of the next time step is predicted from the posterior estimate, and the above steps are repeated. This algorithm does not need to store the previous measurement data: after the data are updated, the new parameters are estimated from the recurrence formulas. Thus, the storage and computation of the filter are greatly reduced, and the operational efficiency of the system is improved.
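The prediction-correction recursion above can be sketched in MATLAB as below; the measurement sequence `meas` (2 × N pixel positions) and the noise covariances Q and R are assumptions for illustration, using the constant-velocity transition matrix from the text.

```matlab
% Sketch: constant-velocity Kalman filter tracking the end-effector center.
% meas (2 x N pixel measurements), Q, and R are assumed for illustration.
dt = 1;                                        % one frame per step
A  = [1 0 dt 0; 0 1 0 dt; 0 0 1 0; 0 0 0 1];   % state transition for (x, y, vx, vy)
H  = [1 0 0 0; 0 1 0 0];                       % position components observed (assumption)
Q  = 0.01*eye(4);  R = eye(2);                 % process / measurement noise (assumed)

xEst = [meas(:,1); 0; 0];                      % initial state from the first measurement
P    = eye(4);                                 % initial error covariance
est  = zeros(4, size(meas,2));
for k = 1:size(meas,2)
    xPred = A*xEst;                            % state prediction
    PPred = A*P*A' + Q;                        % covariance prediction
    Kk    = PPred*H'/(H*PPred*H' + R);         % Kalman gain
    xEst  = xPred + Kk*(meas(:,k) - H*xPred);  % state correction (a posteriori estimate)
    P     = (eye(4) - Kk*H)*PPred;             % covariance correction
    est(:,k) = xEst;                           % store estimated position and velocity
end
```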

5. Object Recognition

5.1. Image Preprocessing

This section describes the preprocessing based on the Kinect RGB images. Target recognition is illustrated with the example of a cylindrical target object (CTO); the end-effector and the CTO appear in the same video. Firstly, image graying is carried out, that is, the process by which the color images are converted into gray images, which greatly reduces the computation. The gray image gradation is 0~255. The grayscale conversion method of [22] is used in this paper.

Secondly, median filtering is carried out. It suppresses random noise without blurring the edges [23] and is a kind of nonlinear smoothing. The gray values of the pixels in a sliding window are sorted, and the original gray value of the pixel at the window center is replaced by the median.

Thirdly, mathematical morphology operations are carried out. Dilation and erosion are used [24]; they are widely applied in edge detection, image segmentation, image thinning, noise filtering, and so forth. Assume that $A$ denotes the binary image and $B$ denotes the structuring element. The following operations are used: (1) morphological dilation, $A\oplus B=\{z\mid(\hat{B})_{z}\cap A\neq\varnothing\}$; (2) morphological erosion, $A\ominus B=\{z\mid(B)_{z}\subseteq A\}$.

Then, a weighted fusion is made between the input image and its Canny edge detection, and threshold segmentation of the fused image is carried out. Image segmentation is the basis for determining the feature parameters, and the whole contour of the object is obtained after segmentation. An example is demonstrated in Figure 4, where the shaded area represents the boundary.

Finally, the autonomous positioning algorithm should give the system the capability of automatically extracting geometric features, and these features should remain invariant when the image is transformed, for example, by translation, rotation, twisting, or scaling.
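A minimal MATLAB sketch of the preprocessing chain described above (graying, median filtering, morphological cleanup, Canny fusion, and threshold segmentation) is given below; the window size, structuring element, and fusion weights are illustrative assumptions.

```matlab
% Sketch: preprocessing chain - graying, median filtering, morphology,
% Canny fusion, and threshold segmentation. Window size, structuring
% element, and fusion weights are illustrative assumptions.
rgb   = imread('frame.png');                   % assumed input frame
gray  = rgb2gray(rgb);                         % color -> gray (0~255)
med   = medfilt2(gray, [3 3]);                 % median filtering in a 3x3 sliding window

se    = strel('disk', 2);
morph = imdilate(imerode(med, se), se);        % erosion followed by dilation

edges = edge(morph, 'canny');                  % Canny edge detection
fused = 0.7*im2double(morph) + 0.3*double(edges);   % weighted fusion (assumed weights)

bw    = im2bw(fused, graythresh(fused));       % threshold segmentation -> object contour mask
```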

There are two kinds of CTO feature parameters: the edge contour feature and the shape parameters. The number of contour points belongs to the edge contour feature. The shape parameters include perimeter, area, longest axis, azimuth, boundary matrix, and shape coefficient. These parameters are defined below; a MATLAB sketch of computing them follows the list.

① The contour points are the pixels required to outline the contour; their number is 22 in Figure 4.

② The perimeter represents the contour length of the outer boundary and is calculated as the sum of the distances between adjacent pixels on the outer boundary. On the boundary, the distance between two adjacent edge pixels is 1 in the horizontal or vertical direction and $\sqrt{2}$ in the oblique direction; the perimeter in Figure 4 is obtained by summing these distances.

③ The area is the number of pixels in the target region; it is 41 in Figure 4.

④ The longest axis denotes the maximum extension of the target region, that is, the line connecting the two outer-boundary pixels with the maximum distance between them; it is 8 in Figure 4.

⑤ The azimuth represents the angle between the longest axis of the target region and the horizontal (x-) axis; it is 0 in Figure 4.

⑥ The boundary matrix denotes the minimum rectangle encompassing the target region and is an intuitive expression of how flat the region is. It is composed of four outer-boundary tangents, two parallel to the longest axis and two perpendicular to it; the boundary matrix for Figure 4 is determined accordingly.

⑦ The shape coefficient denotes the ratio of the area to the square of the perimeter; it is 0.0639 in Figure 4.
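As referenced above, the following MATLAB sketch computes these shape parameters from the segmented binary image; `regionprops` supplies most of them directly, and scalarizing the boundary matrix into a single feature is an assumption made here for illustration.

```matlab
% Sketch: computing the seven shape parameters from the segmented binary
% image bw produced by the preprocessing step. regionprops supplies most
% of them; scalarizing the boundary matrix into one feature is an assumption.
bw    = bwareaopen(bw, 20);                    % drop tiny noise regions (assumed size)
props = regionprops(bw, 'Area', 'Perimeter', 'MajorAxisLength', ...
                        'Orientation', 'BoundingBox');
p = props(1);                                  % assume a single target region

contourPts   = nnz(bwperim(bw));               % (1) number of contour points
perim        = p.Perimeter;                    % (2) perimeter
regArea      = p.Area;                         % (3) area
longAxis     = p.MajorAxisLength;              % (4) longest axis
azimuth      = p.Orientation;                  % (5) angle to the horizontal axis (degrees)
boundBoxFeat = p.BoundingBox(3)*p.BoundingBox(4);   % (6) boundary-rectangle size (scalarized)
shapeCoef    = regArea / perim^2;              % (7) shape coefficient = area / perimeter^2

featureVec = [contourPts perim regArea longAxis azimuth boundBoxFeat shapeCoef];
```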

5.2. Neural Network Recognition

This section describes recognition based on the Kinect RGB images. The BP neural network learning algorithm is used in this paper. It is a learning process based on error back propagation. The network can learn and store a large number of input-output mappings without requiring an explicit mathematical description of these mappings in advance. Its learning rule is gradient descent: the mean square error is minimized by continuously adjusting the network weights and thresholds. The network topology includes an input layer, hidden layers, and an output layer.

Feature vectors are extracted as training samples. The neural network is used as a classifier, instead of the Euclidean distance method, to implement target recognition.

The design of the input and output layers is as follows: the input layer has 7 nodes, and the elements of the input vector are the contour points, perimeter, area, longest axis, azimuth, boundary matrix, and shape coefficient. The output layer has 3 nodes, and the elements of the output vector correspond to cylinder, square, and sphere, with normalized output values of 0.1, 0.2, and 1, respectively. The design of the hidden layers is as follows: there are two hidden layers, using a logarithmic sigmoid transfer function and a “purelin” function; the first hidden layer has 20 nodes and the second hidden layer has 3 nodes. A linear excitation function is used in the output layer. The appropriate number of hidden nodes is related to the number of neurons and the specific problem, and at present it is difficult to give an exact function describing this relation; experiments show that increasing the number of hidden layers and neurons does not necessarily improve the accuracy of the network. The initial number of hidden nodes can be selected with an empirical formula involving $m$, $n$, and $a$, where $m$ denotes the number of neurons in the input layer, $n$ denotes the number of neurons in the output layer, and $a$ denotes an integer from 1 to 10; here this initial value is set to 15.

Sample Set. The sample set is collected from the Kinect shooting scene. In Figure 5, there are cylindrical, square, and spherical objects: 30 kinds of cylindrical objects (Figure 5(a)), 10 kinds of square objects (Figure 5(b)), and 10 kinds of spherical objects (Figure 5(c)). Each target object is captured from 20 different viewing angles (schematic diagrams in Figures 5(d), 5(e), and 5(f)). So, the sample set contains 600 cylindrical, 200 square, and 200 spherical samples. In Figure 5(g), the edge contour is extracted.

Network Training. The weights of the neurons are adjusted in the process of training the network, and training stops when the mean square error (mse) reaches the preset goal. The maximum number of iterations is set to 10000, the momentum constant to 0.8, and the initial learning rate to 0.01; the increase ratio of the learning rate is 1.05 and the reduction ratio is 0.7. The training set contains 600 cylindrical, 200 square, and 200 spherical samples, and the validation and testing sets each contain 60 cylindrical, 20 square, and 20 spherical samples. The sample set (including the training, validation, and testing sets) is normalized into a fixed range before training.
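A minimal MATLAB sketch of this network design and training setup is shown below; the variables `samples` (7 × N feature matrix) and `targets` are assumed to exist, and the mse goal value is an assumption standing in for the preset goal.

```matlab
% Sketch: BP classifier with 7 inputs, two hidden layers (20 logsig, 3 purelin),
% and a linear output layer. samples (7 x N) and targets are assumed variables;
% the mse goal below is an assumed value standing in for the preset goal.
net = feedforwardnet([20 3], 'traingdx');      % gradient descent with momentum + adaptive lr
net.layers{1}.transferFcn = 'logsig';          % logarithmic sigmoid in the first hidden layer
net.layers{2}.transferFcn = 'purelin';         % linear second hidden layer
net.trainParam.epochs = 10000;                 % maximum number of iterations
net.trainParam.goal   = 1e-3;                  % mse goal (assumed value)
net.trainParam.lr     = 0.01;                  % initial learning rate
net.trainParam.mc     = 0.8;                   % momentum constant
net.trainParam.lr_inc = 1.05;                  % learning-rate increase ratio
net.trainParam.lr_dec = 0.7;                   % learning-rate reduction ratio

[net, tr] = train(net, samples, targets);      % train on the normalized sample set
scores    = net(samples);                      % network outputs used for classification
```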

After the identification, we need to calculate the target center (TC). The shape of the target object is regular, so the spatial position of the TC is the destination of the end-effector positioning. The TC is calculated as
$$ u_{c}=\frac{u_{\min}+u_{\max}}{2},\qquad v_{c}=\frac{v_{\min}+v_{\max}}{2}, $$
where $u_{\min}$ and $u_{\max}$ denote the minimum and maximum pixels of the target object along the row direction, $v_{\min}$ and $v_{\max}$ denote the minimum and maximum pixels along the column direction, $T$ denotes an adaptive threshold, and $f(u,v)$ denotes the grayscale value; a pixel belongs to the target when its grayscale value exceeds $T$.
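Assuming the TC formula is the midpoint of the extreme target pixels selected by the adaptive threshold, as described above, a short MATLAB sketch is given below; `grayTarget` is an assumed grayscale image of the recognized target region, and Otsu’s method stands in for the adaptive threshold.

```matlab
% Sketch: TC as the midpoint of the extreme target pixels selected by an
% adaptive (Otsu) threshold; grayTarget is an assumed grayscale image of the
% recognized target region.
T   = graythresh(grayTarget);                  % adaptive threshold
bwT = im2bw(grayTarget, T);                    % pixels whose gray value exceeds the threshold
[rows, cols] = find(bwT);                      % coordinates of the target pixels

uTC = (min(cols) + max(cols)) / 2;             % midpoint along the column (u) direction
vTC = (min(rows) + max(rows)) / 2;             % midpoint along the row (v) direction
fprintf('TC pixel coordinate: (%.0f, %.0f)\n', uTC, vTC);
```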

6. 3D Reconstruction

This section describes the 3D surface fitting based on the Kinect depth images. The vision system captures the end-effector motion; in other words, the moving object can be recognized and tracked while its motion state is estimated. Hence, 3D reconstruction is carried out during the positioning process. The color frame stream and depth frame stream are not fully synchronized and aligned, so the two data streams must be corrected. 3D reconstruction of the positioning process requires fast processing and the capacity to handle a large amount of data, so the batch mode of sequence images is used. According to the depth imaging principle [25–27], the extraction model of the 3D point cloud is established as shown in Figure 6, which involves the world frame, the IR camera coordinate frame, the image physical coordinate frame, and the image pixel coordinate frame. In 3D space there is a point $M$ with world coordinate $(X_{w},Y_{w},Z_{w})$. The camera coordinate frame is established at the optical center of the IR camera, and $(X_{c},Y_{c},Z_{c})$ denotes the coordinate of $M$ in the camera frame. $(x_{u},y_{u})$ and $(x_{d},y_{d})$ denote the ideal and actual coordinates in the image physical frame, and $(u_{u},v_{u})$ and $(u_{d},v_{d})$ denote the ideal and actual coordinates in the image pixel frame. The physical and pixel coordinates are related by
$$ u=\frac{x}{dx}+u_{0},\qquad v=\frac{y}{dy}+v_{0}, $$
applied to both the ideal and the actual coordinates, where $dx$ and $dy$ denote the physical size of a pixel.

The ideal coordinate denotes the coordinate without distortion, and the actual coordinate denotes the coordinate with radial or tangential distortion. The three optical-axis centers marked in Figure 6 are those of the IR camera, the RGB camera, and the infrared transmitter, respectively.

For the Brown distortion model [28], the world coordinate frame and the camera coordinate frame are made to coincide, so the rotation matrix equals the identity matrix, the translation vector is zero, and the world coordinate of the target point equals its camera coordinate $(X_{c},Y_{c},Z_{c})$. From the pinhole model of the camera,
$$ x_{u}=f\,\frac{X_{c}}{Z_{c}},\qquad y_{u}=f\,\frac{Y_{c}}{Z_{c}}, $$
from which the spatial coordinate of the target point can be deduced:
$$ X_{w}=\frac{x_{u}Z_{ir}}{f},\qquad Y_{w}=\frac{y_{u}Z_{ir}}{f},\qquad Z_{w}=Z_{ir}, $$
where $Z_{ir}$ denotes the depth value measured by the IR camera, which can be read from the depth image directly. The ideal and actual image coordinates are related by the distortion model
$$ \begin{aligned} x_{d}&=x_{u}\bigl(1+k_{1}r^{2}+k_{2}r^{4}\bigr)+2p_{1}x_{u}y_{u}+p_{2}\bigl(r^{2}+2x_{u}^{2}\bigr),\\ y_{d}&=y_{u}\bigl(1+k_{1}r^{2}+k_{2}r^{4}\bigr)+p_{1}\bigl(r^{2}+2y_{u}^{2}\bigr)+2p_{2}x_{u}y_{u}, \end{aligned}\qquad r^{2}=x_{u}^{2}+y_{u}^{2}, $$
where $k_{1}$ and $k_{2}$ denote the radial distortion coefficients and $p_{1}$ and $p_{2}$ denote the tangential distortion coefficients. The actual coordinate $(x_{d},y_{d})$ is obtained from the pixel coordinate through the pixel-to-physical relation above, but recovering the ideal coordinate $(x_{u},y_{u})$ from the distortion model is difficult because it forms a bivariate quartic equation set. In the case of only radial distortion, the model reduces to
$$ x_{d}=x_{u}\bigl(1+k_{1}r^{2}+k_{2}r^{4}\bigr),\qquad y_{d}=y_{u}\bigl(1+k_{1}r^{2}+k_{2}r^{4}\bigr). $$
Let $\lambda=1+k_{1}r^{2}+k_{2}r^{4}$; then $x_{d}^{2}=\lambda^{2}x_{u}^{2}$ and $y_{d}^{2}=\lambda^{2}y_{u}^{2}$, and adding the two equations gives
$$ x_{d}^{2}+y_{d}^{2}=\lambda^{2}\bigl(x_{u}^{2}+y_{u}^{2}\bigr)=\lambda^{2}r^{2}. $$
The left-hand side can be computed from the pixel coordinates, so this relation reduces to a solvable polynomial equation in $r^{2}$, whose root is substituted back into the radial-only model to yield $(x_{u},y_{u})$; substituting $(x_{u},y_{u})$ into the back-projection equations then gives the 3D coordinate $(X_{w},Y_{w},Z_{w})$ of the target point. Since the distortion of the Kinect itself cannot be ignored, the radial and tangential distortions must be considered simultaneously. The distortion model is therefore rewritten with implicit factors [29], which are determined from the calibrated distortion coefficients by converting the problem into a least squares problem. Assume that there are $n$ calibration points in total, with actual and ideal values in the image physical coordinate frame; vectors of the actual coordinates, the ideal coordinates, and the implicit factors are assembled so that they satisfy a linear system, which is solved by the least squares method. At last, the 3D coordinate of the target point is obtained by substituting the recovered ideal coordinates into the back-projection equations.
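A minimal MATLAB sketch of the batch back-projection of one depth frame into a 3D point cloud is given below; distortion is ignored for brevity, and the IR intrinsics and the variable `depth` (a matrix of depth values in millimetres) are placeholder assumptions.

```matlab
% Sketch: batch back-projection of one depth frame into a 3D point cloud.
% Distortion is ignored for brevity; fx, fy, u0, v0 and the depth matrix
% (in millimetres) are placeholder assumptions.
fx = 580;  fy = 580;  u0 = 320;  v0 = 240;     % assumed IR intrinsics (pixels)
[U, V] = meshgrid(1:size(depth,2), 1:size(depth,1));

Z = double(depth);                             % Z_ir read directly from the depth image
X = (U - u0) .* Z ./ fx;                       % X_w = (u - u0) * Z_ir / fx
Y = (V - v0) .* Z ./ fy;                       % Y_w = (v - v0) * Z_ir / fy

valid = Z(:) > 0;                              % drop pixels with no depth return
cloud = [X(:) Y(:) Z(:)];
cloud = cloud(valid, :);                       % N x 3 point cloud of the scene
```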

After the point cloud of the target is obtained, surface fitting is performed using triangular facets [30–32]; this process is also called grid generation. In real space $\mathbb{R}^{3}$, $n$ scattered points $x_{i}$ are given, and each point corresponds to a constraint value $d_{i}$. A function $f$ is constructed such that, for each scattered point,
$$ f(x_{i})=d_{i},\qquad i=1,\ldots,n. $$
The number of solutions to this interpolation problem is infinite; the ideal solution is the one that minimizes a second-order smoothness (bending) energy while satisfying the constraints. Variational techniques are used to solve this minimization, and the general solution takes the form
$$ f(x)=p(x)+\sum_{i=1}^{n}w_{i}\,\varphi\bigl(\lVert x-x_{i}\rVert\bigr), $$
where $\varphi$ denotes the radial basis function (RBF), $x$ denotes any point on the surface, $x_{i}$ denotes a scattered point (also called a sample point), $w_{i}$ denotes the weight of the RBF, $\lVert\cdot\rVert$ denotes the Euclidean distance, and $p(x)$ is a low-degree polynomial that ensures the continuity and linear reproduction of the surface. To minimize the energy function and make the solution unique, the orthogonality conditions
$$ \sum_{i=1}^{n}w_{i}=0,\qquad \sum_{i=1}^{n}w_{i}x_{i}=\sum_{i=1}^{n}w_{i}y_{i}=\sum_{i=1}^{n}w_{i}z_{i}=0 $$
are imposed. Additional constraint points are generated along the normal directions of the scattered points; the interpolation constraint points and the additional constraint points are substituted into the equations above to obtain the unique weights and polynomial coefficients, which are then substituted into the general solution to give the surface equation $f(x)=0$. Finally, the Bloomenthal algorithm [33] is used to triangulate the surface. The 3D surface obtained in this way is often rough, and the quality of the interpolation surface is affected by many narrow triangular facets, so the triangular mesh needs some local optimization.
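A minimal MATLAB sketch of the RBF interpolation step is given below: it assembles the kernel matrix and the linear polynomial block, imposes the orthogonality conditions, and solves for the weights. The point matrix `pts` (n × 3) with constraint values `d`, and the choice of the biharmonic kernel φ(r) = r, are assumptions for illustration.

```matlab
% Sketch: solving for the RBF interpolation weights. pts (n x 3) with
% constraint values d (n x 1) are assumed inputs, and the biharmonic kernel
% phi(r) = r is an assumed choice of radial basis function.
n  = size(pts, 1);
D2 = sum(pts.^2, 2);                                   % pairwise distance matrix
r  = sqrt(max(bsxfun(@plus, D2, D2') - 2*(pts*pts'), 0));
Phi = r;                                               % phi(r) = r

P    = [ones(n,1) pts];                                % linear polynomial block [1 x y z]
Asys = [Phi P; P' zeros(4)];                           % interpolation + orthogonality conditions
sol  = Asys \ [d; zeros(4,1)];                         % solve for weights and polynomial coeffs
w = sol(1:n);  c = sol(n+1:end);

% Evaluate f at a query point q (1 x 3); the fitted surface is f(q) = 0.
f = @(q) c(1) + q*c(2:4) + sum(w .* sqrt(sum((pts - repmat(q, n, 1)).^2, 2)));
```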

7. Experiment and Analysis

The experimental platform is composed of a computer (Acer TMP455, 16 GB memory, 500 GB SSD), the manipulator system, and the Kinect vision system, as shown in Figure 7. The effective measurement range of the Kinect is 0.8~2.3 meters, and the measurement accuracy decreases with increasing distance. The software includes VC++ 2010, Matlab 2012a, and Kinect for Windows SDK v1.7.

In accordance with the 3D reconstruction steps, the experiments are divided into end-effector motion capture, target recognition, and 3D surface fitting. The motion capture experiment includes converting the RGB video into a frame stream, image fusion, image binarization, and state estimation. The target recognition experiment includes feature extraction, gray processing, threshold segmentation, feature vectorization, data normalization, and BP identification. The 3D reconstruction experiment includes extraction of the target point cloud information and surface fitting.

7.1. End-Effector Motion Capture

The video is converted into individual image frames; there are 73 frames in total. The target region is captured whenever the end-effector is moving in a frame, and the regional boundary is described by a wireframe and its geometric center. A representative motion capture result of the end-effector, marked by the square region, is shown in Figure 8. The blue “∘” denotes the center of the target region obtained by the three-frame differencing method, and the blue frame denotes the corresponding moving-region boundary. The motion state of the end-effector is estimated by the Kalman filter, which yields the position and velocity of the end-effector; the initial error covariance matrix, the measurement noise covariance matrix, and the process noise covariance matrix are set before filtering. The red “□” denotes the center of the target region obtained by the Kalman filter, and the red frame denotes the corresponding moving-region boundary. During the experiment, the blue center “∘” is tracked by the red center “□,” and the blue frame is tracked by the red frame.

Figure 9 shows the relation between the actual position and the predicted position; here the actual values are the measurements, also called the observations. The curve is fitted from the image pixel coordinates of the 73 centers and reflects how the center position changes across the image sequence. The green line indicates the actual position of the end-effector, and the red line represents the Kalman estimate. As Figure 9 shows, the position tracking is very stable in the initial period; in the intermediate period there is a certain pixel error, with a maximum of 8 pixels, caused by the variable motion of the end-effector. In the final period, the position of the end-effector is corrected and the tracking becomes stable again. Thus, the position of the end-effector can be obtained in real time.

Figure 10 shows the relation between the actual velocity and the estimated velocity; here again the actual values are the measurements. It reflects how the center velocity changes across the image sequence. The green line indicates the actual velocity of the end-effector, and the red line represents the Kalman estimate. As Figure 10 shows, the velocity tracking is stable over essentially the whole time, so 3D reconstruction can be carried out in real time during the positioning process.

7.2. Target Recognition Experiment

The edge contour of the target object is extracted according to Section 5.1, the seven shape parameters are calculated to obtain the sample set, and the sample set is normalized. The BP network training is shown in Figure 11; this training takes 5 seconds and 47 iterations. The sample set is automatically divided into a training set, a validation set, and a test set, whose convergence curves are shown by the blue, green, and red lines, respectively. The network stops training when the mse reaches the preset goal, and the validation and test sets converge to comparably small errors. The function gradient decreases from 1.49 to a small final value. The thin dotted line and “◯” denote the best status of the validation set, and the thick dotted line denotes the preset mse for stopping training.

The identification result is shown in Figure 12. The horizontal axis denotes the index of the verification and test samples (per group), and the vertical axis denotes the classification (identification result). The blue marker denotes the network prediction and the red “∘” denotes the actual classification. A classification of 0.1 indicates a cylindrical object, 0.2 a square object, and 1 a spherical object. From Figure 12 we can see a one-to-one correspondence between the network classification and the actual result. For the validation samples the recognition rate is 0.998, and for the test samples it is 0.997. The high recognition rate shows that the extracted features are comprehensive and critical and that the design of the BP network is rational. At last, a cylindrical object is selected randomly from the test samples; the coordinate of its TC is (507 pixels, 306 pixels) according to the TC formula in Section 5.1.

7.3. Reconstruction Experiment

First of all, the calibration results are imported. There are 20 calibration images with 4 calibration points in each image. The focal lengths, principal point, radial distortion coefficients, and tangential distortion coefficients are taken from the calibration results. The coordinates of the 80 calibration points are shown in Table 3: on the left are the image pixel coordinates, and on the right are the image physical coordinates. The clamping mechanism (also called the end-effector) has a maximum opening range of 287 mm. According to the actual positioning requirement, the maximum permissible errors are 20 mm, 25 mm, and 20 mm along the x-, y-, and z-axes, respectively.

The 3D coordinates (of the end-effector and the cylindrical TC) are calculated according to the point cloud extraction model in Section 6. The cylindrical TC is at (214.2 mm, −3.9 mm, 825 mm). Under ideal conditions, positioning is successful if the TC coordinate coincides with the center coordinate of the end-effector; in the experiment a certain deviation is normal, and the center coordinate of the end-effector should gradually converge to the TC coordinate.

For the video (73 frames), the image coordinates and 3D coordinates of the end-effector centers are shown in Table 4, which lists the image coordinate and the spatial coordinate of the end-effector center with respect to the base coordinate frame. From the table we can see that the center of the end-effector gradually approaches the TC in the time domain. The absolute error along the y-axis is −3.3 + 3.9 = 0.6 mm, and the absolute errors along the x- and z-axes are likewise within the maximum permissible errors.

The theoretical values are constant along the x-axis, but the experimental data fluctuate randomly within a small range. The theoretical values decrease along the y-axis, and the experimental data also decrease. The theoretical values increase along the z-axis, and the overall trend of the experimental data also increases; however, in some consecutive frames the data are unchanged (e.g., the 34th~39th and 40th~41st frames) or fluctuate (e.g., the 42nd~46th and 50th~52nd frames). The deviation is caused by the low pixel accuracy of the Kinect. The 3D point cloud extraction is shown in Figure 13(a), and the surface fitting by triangular facets and morphological processing is shown in Figure 13(b). The paper displays only the reconstruction results of the 5th, 25th, 45th, 65th, and 73rd frames.

7.4. Comparative Experiment

In [1], the accumulated error cannot be corrected. In [4], 3D reconstruction based on a monocular camera relies only on a mathematical model, and the requirement for the light source is harsh. In [6], 3D reconstruction based on binocular vision requires pattern matching and a large amount of computation, and the reconstruction quality drops significantly when the baseline distance is large. From Table 5 we can draw the following conclusions about the paper’s method: (1) it can meet large-computation applications; (2) it has low demands on the light source, being mainly affected by ultraviolet rays; (3) the neural network recognition algorithm reduces the reliance on a mathematical model; and (4) it improves the reconstruction efficiency and accuracy.

8. Conclusion

First of all, this paper proposes a Kalman filtering method based on three-frame differencing to capture the end-effector motion. Secondly, target objects are identified using BP classification, and the coordinates of the TC are calculated. Then, the center of the end-effector gradually converges to the TC, which indicates that the positioning is successful. Finally, surface fitting of the 3D point cloud is achieved by triangular facets and morphological processing. The comparative experiment shows that the paper’s method is more practicable and efficient.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is supported by the Beijing project “Equipment Development and Demonstration Application in Rail Accident Scene Emergency” (Project no. Z131100004513006).