Abstract

This paper proposes a novel vision and inertial fusion algorithm, S2fM (Simplified Structure from Motion), for camera relative pose estimation. Unlike existing algorithms, ours estimates the rotation parameter and the translation parameter separately. S2fM employs gyroscopes to estimate the camera rotation parameter, which is later fused with the image data to estimate the camera translation parameter. Our contributions are in two aspects. (1) Under the circumstance that no inertial sensor can estimate the translation parameter accurately enough, we propose a translation estimation algorithm that fuses gyroscope data and image data. (2) Our S2fM algorithm is efficient and suitable for smart devices. Experimental results validate the efficiency of the proposed S2fM algorithm.

1. Introduction

Camera relative pose estimation (CPE) is the estimation of camera extrinsic parameters, that is, the camera's 3D rotation parameter and 3D translation parameter. It is one of the key issues in computer vision and is widely applied in 3D scene reconstruction, augmented reality, panorama stitching, and digital video stabilization.

The traditional solutions to the CPE problem are based on image processing techniques, that is, the visual methods. These solutions usually first extract feature correspondences between frame pairs and then model the CPE problem as linear equations under the epipolar geometry constraint. In such a way the CPE problem is transformed into an optimal solution problem. Hartley [1] proved the feasibility of using 8 pairs of feature correspondences to handle the CPE problem and proposed the 8-point (8 pt) algorithm for uncalibrated cameras. After that, in order to find simpler solutions, researchers proposed the 7 pt algorithm [2], 6 pt algorithms [3, 4], and 5 pt algorithms [4–6]. These are mature traditional solutions based on image processing techniques, with accurate estimation results but complex calculation and slow computing speed. With the fast growing employment of MEMS sensors in smart devices, inertial-based solutions to the CPE problem have been tried recently. These solutions [7, 8] usually first perform CPE by visual and inertial methods individually and then adopt a data filter to fuse the two results in order to obtain a more reliable estimation. The two individual methods complement each other and improve the robustness of CPE. The disadvantage of this approach lies in the fact that fusing the two results takes additional time and thus reduces efficiency.

It can be seen from the above analysis that visual solutions for CPE are mature and accurate but computationally complex, while inertial solutions have been tried without satisfactory results. Considering that rotation can be estimated quickly and accurately by gyroscopes but there is no proper sensor to estimate translation accurately enough for CPE applications, this paper proposes a visual and inertial fusion solution, S2fM (Simplified Structure from Motion). S2fM divides the CPE problem into two parts: rotation estimation and translation estimation. It first employs the gyroscope to estimate the rotational information and then fuses the estimated rotation with image data to estimate the camera translation. Our solution relies on both gyroscope data and image data, but there may be a time delay between them, so a calibration algorithm is necessary to align the gyroscope data and the image data. The camera focal length is also estimated in the calibration algorithm, which further simplifies the visual algorithm for translation estimation. Since the calibration needs to be done only once for each device, the main cost of our solution lies in the visual stage, which has been simplified to deal with only 3 feature pairs. Our main contributions are in two aspects. (1) Under the circumstance that no inertial sensor can estimate the translation parameter accurately enough, we propose a translation estimation algorithm fusing gyroscope data and image data. (2) Our S2fM algorithm is efficient and suitable for smart devices.

The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 describes the proposed solution. Section 4 presents the experimental results, and Section 5 draws the conclusion.

2. Related Works

Generally, CPE solutions can be classified into two major groups.

The first group of solutions are the traditional ones. These solutions model the CPE problem as a linear estimation problem based on image feature correspondences under two-view geometry (mostly adopted) or multiview geometry [2]. A fundamental matrix is determined from the feature correspondences and can then be decomposed to give the relative camera orientation and translation. Thus, the CPE problem is transformed into a fundamental matrix estimation and decomposition problem. The fundamental matrix decomposition problem is called the minimal problem in computer vision, whose solutions are divided into two categories: one for calibrated cameras and the other for uncalibrated cameras. The essential issue in the minimal problem is the minimum number of correspondence points a solution requires. For uncalibrated cameras, the solutions include the 8 pt (point), 7 pt, and 6 pt algorithms. For calibrated cameras, the solution is the 5 pt algorithm, since the number of relative pose parameters is 5, that is, 3 for rotation and 2 for translation (up to an unknown scale factor). Hartley proved the validity of the 8 pt algorithm [1] in 1997, in which the correspondence problem is supposed to have been solved. After the 8 pt algorithm, in order to find simpler algorithms for uncalibrated cameras, researchers tried to add constraints to the formulated equations and proposed the 7 pt and 6 pt algorithms. In 2003, Hartley and Zisserman presented the 7 pt algorithm [2], which added the constraint that the fundamental matrix and essential matrix are singular. In 2005, Stewénius et al. proposed a 6 pt algorithm [3], and in 2012 Kukelova et al. proposed a 6 pt algorithm based on polynomial eigenvalues [4]. The 5 pt algorithms for calibrated cameras include Nistér's 5 pt algorithm [5] in 2004, Li and Hartley's 5 pt algorithm [6] in 2006, and Kukelova's polynomial-eigenvalue-based 5 pt algorithm [4] in 2012. These widely used traditional visual solutions rely on image correspondence points, which may contain errors and noise; the RANSAC method [9] is usually introduced to reduce them. Brückner et al. [10] compared the traditional solutions above. The advantage of these traditional solutions is that they generate accurate results; the disadvantage is their computational complexity: the more correspondence points an algorithm needs, the slower its computing speed.

The other group of solutions for CPE are the inertial-based solutions, which were not proposed until MEMS sensors became accurate enough. In 2008, Gaida et al. [7] introduced a multisensor framework that combines gyroscopes, accelerometers, and magnetometers as a unit to estimate the camera pose; a visual method is adopted to estimate the camera pose as well, and finally an extended Kalman filter fuses the two results to obtain the final pose. One disadvantage of using accelerometers for translation estimation is that translation measurements from accelerometers are significantly less accurate than orientation measurements [11–13]. This is because gyroscope data need to be integrated only once to obtain the camera's orientation, whereas accelerometer data need to be integrated twice to obtain the camera's translation, which introduces too much noise and significantly affects the accuracy. Miyano et al. [8] proposed a combined inertial and visual solution: it uses accelerometers and a magnetic sensor to roughly estimate the camera pose and then searches for the accurate pose by matching a captured image with a set of reference images. Corke et al. [14] surveyed inertial and vision fusion solutions. These fusion solutions usually first perform CPE with separate inertial-based and visual-based methods, generating respective results, and then fuse them by data filters. This is cooperation between inertial and visual methods. Its advantage lies in robustness, because the two methods complement each other; its disadvantage is the slow computing speed caused by the fusion process.

This paper proposes an inertial and visual fusion solution called S2fM for CPE. Different from existing fusion solutions, which fuse inertial data and visual data in a cooperative manner, our solution fuses them in a division manner: it divides the CPE problem into a rotation part and a translation part. Our solution first estimates the camera rotation by gyroscopes and then uses it as a known parameter in the visual method to estimate the camera translation. Since the reliability and efficiency of gyroscopes for rotation estimation have been proven [7, 8, 11–14], they can significantly simplify the visual solution for camera translation estimation. As we derive in the next section, only 3 pairs of correspondence points are needed for translation estimation. Different from Hartley and Nistér, who made great efforts to find the solution of the established equation sets, our focus is on proposing an inertial and visual fusion solution that solves the CPE problem efficiently under the circumstance that no inertial sensor can estimate the translation parameter accurately enough.

3. Proposed Solution

This section describes our proposed solution, which is based on the pinhole camera model and consists of three steps: camera and gyroscope calibration, estimation of camera rotation, and estimation of camera translation.

3.1. Camera and Gyroscope Calibration

Our solution first calibrates the camera and gyroscope; the calibrating contents are as follows:
(1) Gyroscope noise processing
(2) Camera focal length calibration (in pixel units)
(3) The delay between the gyroscope and frame sample timestamps

3.1.1. Gyroscope Noise Processing

Raw MEMS gyroscope data need to be processed to remove zero-drift and random noise. We take the general statistical method to remove zero-drift: the device is kept static for a period of time to gather statistics of the zero-drift, which is then subtracted from the source data to obtain a series of stable, zero-expectation, normally distributed random noise. This random noise is then modeled by a time sequence method and suppressed by a Kalman filter to give usable gyroscope data.
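For concreteness, the sketch below shows this step in Python with NumPy. The sample layout, the `static_samples` array, and the exponential smoother (a lightweight stand-in for the paper's time-sequence-modeled Kalman filter) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def remove_zero_drift(static_samples, gyro_samples):
    """Estimate zero-drift from a static recording and subtract it.

    static_samples: (M, 3) rad/s, recorded while the device is still.
    gyro_samples:   (N, 3) rad/s, the working data to be corrected.
    """
    bias = static_samples.mean(axis=0)   # statistical zero-drift per axis
    return gyro_samples - bias

def smooth(samples, alpha=0.2):
    """Exponential smoothing as a simple stand-in for the Kalman filter
    used to suppress the remaining random noise."""
    out = np.empty_like(samples)
    out[0] = samples[0]
    for i in range(1, len(samples)):
        out[i] = alpha * samples[i] + (1.0 - alpha) * out[i - 1]
    return out
```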

3.1.2. Calibrating Algorithm

After noise processing, the gyroscope data can be used for the calibrating operation. The purpose of our calibrating algorithm is to calibrate the parameters $t_d$ (the delay between the gyroscope and frame sample timestamps) and $f$ (the camera focal length). We take a calibrating algorithm similar to that of Miyano et al. [8] under the camera rotation model (as Figure 1 shows) but with an optimized objective function.

As Figure 1 shows, a camera moves under the rotation model with its optical center unchanged [15]. A point $X$ in world coordinates and its projected image coordinate $x$ have the following mapping relation:

$$x = \lambda K R X. \tag{1}$$

And we have the following:

(1) $K$ is the camera intrinsic matrix:

$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}$$

where $f$ is the camera focal length to be recovered and $(c_x, c_y)$ is the projected point of the optical center, which is $(0, 0)$ in the proposed solution, since we set the image center as the optical center.

(2) $\lambda$ is an unknown scaling factor, since the translation can only be determined up to scale in CPE.

Under this rotation model, the relationship between image points in a pair of frames for two different camera orientations (as Figure 1 shows) can be derived. For a scene point $X$, the projected points $x_1$ and $x_2$ in the image planes of the two frames would be

$$x_1 = \lambda_1 K R_1 X, \qquad x_2 = \lambda_2 K R_2 X, \tag{3}$$

where $R_1$ and $R_2$ are 3 × 3 rotation matrices representing the rotational parameters at the two frame timestamps, which will be detailed in the next section. Rearranging (3) and substituting $(1/\lambda_1) R_1^{T} K^{-1} x_1$ for $X$, we get a mapping of all points in one frame to the other, as (4) shows:

$$x_2 = \lambda K R_2 R_1^{T} K^{-1} x_1, \qquad \lambda = \lambda_2 / \lambda_1. \tag{4}$$

With these parameters above, we formulate calibration as an optimization problem, as shown in the following equation (the rotation matrices are integrated from the gyroscope data at the $t_d$-shifted frame timestamps):

$$(t_d^{*}, f^{*}) = \arg\min_{t_d, f} \sum_{i=1}^{N-1} \sum_{j=1}^{M_i} \left\| x'_j - K R_{i+1} R_i^{T} K^{-1} x_j \right\|_F. \tag{5}$$

And we have the following:
(1) $N$ is the frame amount, $i$ is the frame number, and $j$ is the feature number in the current frame.
(2) $M_i$ is the feature amount of the $i$th frame pair.
(3) $\|\cdot\|_F$ is the Frobenius norm.
(4) $(x_j, x'_j)$ is the $j$th SIFT pair [16] in the current frame pair.
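A minimal sketch of how objective (5) could be searched is given below. It assumes a hypothetical `gyro_rotation(t1, t2)` helper that integrates the gyroscope data between two timestamps (see Section 3.2), SIFT pairs given as homogeneous pixel 3-vectors centered on the image center, and illustrative grid ranges, none of which are values from the paper.

```python
import numpy as np

def calibration_cost(t_d, f, frame_pairs, gyro_rotation):
    """Evaluate objective (5) for one (delay t_d, focal length f) pair.

    frame_pairs: list of ((t_i, t_next), [(x, x_prime), ...]) where
    x, x_prime are homogeneous pixel coordinates of a SIFT pair.
    gyro_rotation(t1, t2): rotation matrix integrated from gyro data.
    """
    K = np.diag([f, f, 1.0])
    K_inv = np.diag([1.0 / f, 1.0 / f, 1.0])
    cost = 0.0
    for (t1, t2), pairs in frame_pairs:
        R_rel = gyro_rotation(t1 + t_d, t2 + t_d)  # R_{i+1} R_i^T, eq. (4)
        H = K @ R_rel @ K_inv                      # rotation-only homography
        for x, x_prime in pairs:
            pred = H @ x
            pred = pred / pred[2]                  # back to the image plane
            cost += np.linalg.norm(x_prime - pred)
    return cost

def calibrate(frame_pairs, gyro_rotation):
    """Exhaustive grid search over delay and focal length (toy ranges)."""
    grid = [(t_d, f) for t_d in np.linspace(-0.10, 0.10, 41)
                     for f in np.linspace(500.0, 2000.0, 61)]
    return min(grid, key=lambda p: calibration_cost(*p, frame_pairs,
                                                    gyro_rotation))
```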

3.2. Estimation of Rotation

After calibration, the gyroscope data can be used to estimate the rotation of the device accurately. The gyroscope outputs the angular velocity of each axis, so the angular rotation of each axis can be calculated by integrating the angular velocities. For any axis of the gyroscope, let $\omega(t)$ be its angular velocity and let its corresponding sampling times be $t_i$; then the angular value $\theta$ from moment $t_a$ to moment $t_b$ can be obtained as shown in the following equation:

$$\theta = \int_{t_a}^{t_b} \omega(t)\,dt \approx \sum_{t_a \le t_i < t_b} \omega(t_i)\,(t_{i+1} - t_i). \tag{6}$$

Let $\boldsymbol{\theta} = (\theta_x, \theta_y, \theta_z)$ be the rotation values of the three gyroscope axes, where each component can be calculated by (6). Let $\theta = \|\boldsymbol{\theta}\|$ be the total rotation value and $n = \boldsymbol{\theta}/\theta$ the unit rotation axis; then the rotation matrix can be given by the Rodrigues formula, as shown in the following equation:

$$R = \cos\theta\, I + (1 - \cos\theta)\, n n^{T} + \sin\theta\, [n]_{\times}, \tag{7}$$

where $I$ is the identity matrix and $[n]_{\times}$ is the cross product matrix of $n$:

$$[n]_{\times} = \begin{bmatrix} 0 & -n_z & n_y \\ n_z & 0 & -n_x \\ -n_y & n_x & 0 \end{bmatrix}. \tag{8}$$

Finally, we obtain the rotation matrix $R$ in (3), a unitary matrix representing the rotation of the device.
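The following sketch implements (6)-(8) with NumPy. The sample layout (one row of angular velocities per gyroscope timestamp) is an assumption; as in the paper's single-interval formulation, one Rodrigues step is applied to the angle vector integrated over the whole interval.

```python
import numpy as np

def rotation_from_gyro(omega, t):
    """Integrate gyro angular velocities (6) and apply Rodrigues (7)-(8).

    omega: (N, 3) angular velocities in rad/s; t: (N,) sample times in s.
    Returns the 3 x 3 rotation matrix accumulated over [t[0], t[-1]].
    """
    dt = np.diff(t)
    theta_vec = (omega[:-1] * dt[:, None]).sum(axis=0)  # eq. (6) per axis
    angle = np.linalg.norm(theta_vec)                   # total rotation
    if angle < 1e-12:
        return np.eye(3)                                # no motion
    n = theta_vec / angle                               # unit rotation axis
    n_cross = np.array([[0.0, -n[2], n[1]],
                        [n[2], 0.0, -n[0]],
                        [-n[1], n[0], 0.0]])            # eq. (8)
    return (np.cos(angle) * np.eye(3)
            + (1.0 - np.cos(angle)) * np.outer(n, n)
            + np.sin(angle) * n_cross)                  # eq. (7)
```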

3.3. Estimation of Translation

The obtained rotation matrix is used as a known parameter in the visual estimation of translation to generate the final camera pose. Similar to the traditional methods, we formulate linear equations using correspondence points and solve them to get the camera translation.

3.3.1. Formulation of Linear Equations

As shown in the two-view geometry of the proposed S2fM algorithm (Figure 2), given a scene point $X$ in Euclidean space, its projections $X_1$ and $X_2$ in the camera coordinate systems $C_1$ and $C_2$ have the mapping relation of

$$X_2 = R X_1 + T, \tag{9}$$

where $R$ is the rotation matrix (3 × 3) and $T$ is the translation 3-vector. Define $[T]_{\times}$ as the cross product matrix of $T$.

We have

$$[T]_{\times} X_2 = [T]_{\times} R X_1. \tag{10}$$

Premultiplying both sides of (10) by $X_2^{T}$ gives

$$X_2^{T} [T]_{\times} R X_1 = 0. \tag{11}$$

Let $E = [T]_{\times} R$ be the essential matrix (3 × 3); then

$$X_2^{T} E X_1 = 0. \tag{12}$$

Equation (12) also holds for the image points $x_1$ and $x_2$, which gives the epipolar constraint [2]:

$$x_2^{T} E x_1 = 0. \tag{13}$$

Writing (13) in the form of pixel coordinates $q_1 = K x_1$ and $q_2 = K x_2$ gives

$$q_2^{T} K^{-T} E K^{-1} q_1 = 0. \tag{14}$$

Rearranging (14) with $E = [T]_{\times} R$:

$$(K^{-1} q_2)^{T}\, [T]_{\times}\, (R K^{-1} q_1) = 0. \tag{15}$$

Let $u = R K^{-1} q_1$ and let $v = K^{-1} q_2$, which are now knowns; then

$$v^{T} [T]_{\times} u = 0. \tag{16}$$

Equation (16) can be rearranged as linear equations:

$$(u \times v)^{T} T = (u_y v_z - u_z v_y)\,T_x + (u_z v_x - u_x v_z)\,T_y + (u_x v_y - u_y v_x)\,T_z = 0, \tag{17}$$

where $T_x$, $T_y$, and $T_z$ in (17) are the unknowns to be estimated.
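In code, each correspondence contributes one row of coefficients, the cross product $u \times v$. A minimal sketch under these definitions (function name hypothetical):

```python
import numpy as np

def translation_row(K_inv, R, q1, q2):
    """One row of the linear system (17) from a pixel correspondence.

    q1, q2: homogeneous pixel coordinates (3-vectors) of a feature pair.
    The row holds the coefficients of (T_x, T_y, T_z) in (u x v)^T T = 0.
    """
    u = R @ (K_inv @ q1)   # u = R K^{-1} q1, known once R is estimated
    v = K_inv @ q2         # v = K^{-1} q2
    return np.cross(u, v)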

3.3.2. Solution of Linear Equations

The inputs of (17) are the camera intrinsic $K$, the rotation matrix $R$, and the feature correspondences $q_1$ and $q_2$. This is a typical ternary homogeneous linear equation set, and we need only 3 pairs of feature correspondences to solve it ($T$ can also be estimated with 2 pairs of feature correspondences by enforcing a certain constraint to reduce the degrees of freedom to 2). We extract SIFT correspondence features [16] between frames and introduce the RANSAC algorithm [9] to remove mismatches and noise. Then the singular value decomposition (SVD) method is employed to solve (17). Since the essential matrix is defined up to scale, the solved translation vector is also defined up to scale; that is, $\|T\| = 1$.
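A sketch of the SVD solve, reusing the hypothetical `translation_row` above: the right singular vector belonging to the smallest singular value of the stacked coefficient matrix gives the null-space direction, that is, $T$ up to scale.

```python
def estimate_translation(K, R, pairs):
    """Solve (17) for T (up to scale) from >= 3 RANSAC-filtered pairs."""
    K_inv = np.linalg.inv(K)
    A = np.stack([translation_row(K_inv, R, q1, q2) for q1, q2 in pairs])
    _, _, Vt = np.linalg.svd(A)   # least-squares null space of A
    return Vt[-1]                 # right singular vector, ||T|| = 1
```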

Two solutions are still possible due to the arbitrary choice of sign for the translation $T$. The correct one can be easily determined by ensuring that the reconstructed points lie in front of the cameras [5], as shown in Figure 3. One pair of feature correspondence is enough to determine the sign of $T$. First, select a random pair of feature correspondence and a random sign of $T$; then apply the 3D reconstruction solution by Hartley and Sturm [17] to reconstruct its 3D space coordinate $X$. If $X$ lies in front of the cameras, as Figure 3(a) shows, the selected sign of $T$ is already correct; otherwise, invert the sign of $T$, as shown in Figure 3(b).
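The sign test can be sketched as follows; here OpenCV's linear triangulation stands in for the Hartley-Sturm method cited above, and the input layout is an assumption.

```python
import cv2
import numpy as np

def fix_translation_sign(K, R, T, q1, q2):
    """Flip the sign of T if a reconstructed point lies behind a camera.

    q1, q2: one feature correspondence as 2D pixel coordinates (2-vectors).
    """
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera
    P2 = K @ np.hstack([R, T.reshape(3, 1)])           # second camera
    Xh = cv2.triangulatePoints(P1, P2,
                               q1.reshape(2, 1).astype(float),
                               q2.reshape(2, 1).astype(float))
    X = (Xh[:3] / Xh[3]).ravel()                       # Euclidean point
    in_front = X[2] > 0 and (R @ X + T)[2] > 0         # cheirality test
    return T if in_front else -T
```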

4. Experimental Results

Our experimental device is a Lenovo Vibe Z with a built-in ST L3GD20 three-axis gyroscope, which runs at a frequency of 100 Hz.

We show the experimental results of the S2fM algorithm for CPE. To show the results explicitly, we draw the camera 3D motion computed by the proposed solution to check its validity. Then we compare our solution with the traditional solutions (including the 8 pt, 6 pt, and 5 pt algorithms; the 7 pt algorithm performs similarly to the 8 pt algorithm, so we omit it) and the inertial-based solution proposed by Gaida et al. [7]. But, first of all, we show the calibrating result for the device and gyroscope used in our experiments.

4.1. Calibrating Result

We record a video of about 10 seconds with a rotation motion around a single axis of the camera, record the gyroscope data of all three axes, and then run the calibrating algorithm. Under a pure rotation model, the interframe motion can be estimated both from SIFT features and from gyroscope data. After calibration, the frame motions estimated by the gyroscope align with those estimated from SIFT features. As the results in Figures 4 and 5 show, the best parameters align the two motion results well.

In order to show the results more explicitly, the calibration error for every frame is computed, as shown in Figure 6. The average calibration error is 0.0024 pixels and the average absolute calibration error is 0.75 pixels per frame. There are some error peaks in the calibration because the motion is not pure rotation (translation exists), but this does not affect the calibrating results.

4.2. CPE Results

We write the letters L, E, N, O, V, and O and the word Lenovo in the air with our experimental device and run the S2fM algorithm to estimate the camera pose. Figure 7 shows the results, where every point represents one estimated camera pose and the red one is the start position.

4.3. Evaluation

We compare our solution with the 8 pt algorithm [1], the 6 pt algorithm [3], the 5 pt algorithm [6], and the fusion method by Gaida et al. [7]. In order to make the experiments as complete as possible, we design six basic camera motion scenes:
(1) Left-right translation
(2) Up-down translation
(3) Forward-backward translation
(4) Rotation around the x-axis
(5) Rotation around the y-axis
(6) Rotation around the z-axis

All possible camera motions can be expressed as combinations of the six basic motions above. To suit different application precisions, different error thresholds (2.0, 1.0, and 0.5 pixels) are set in the linear least squares solution.

4.3.1. Accuracy

The symmetric squared geometric error is introduced to measure the reprojection error [10], as shown in the following equation:

$$d(x, x')^2 = \frac{(x'^{T} F x)^2}{(F x)_1^2 + (F x)_2^2} + \frac{(x'^{T} F x)^2}{(F^{T} x')_1^2 + (F^{T} x')_2^2}, \tag{18}$$

where $x$ and $x'$ are feature correspondences, $F$ is the fundamental matrix, and $(\cdot)_k$ denotes the $k$th component of a vector. The reprojection errors of all six motion scenes with different error thresholds are shown in Figure 8.
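A direct transcription of (18), assuming correspondences given as homogeneous pixel 3-vectors:

```python
import numpy as np

def symmetric_geometric_error(F, x, x_prime):
    """Symmetric squared geometric error (18) for one correspondence.

    F: 3 x 3 fundamental matrix; x, x_prime: homogeneous 3-vectors.
    """
    Fx = F @ x                     # epipolar line in the second image
    Ftx = F.T @ x_prime            # epipolar line in the first image
    e2 = float(x_prime @ Fx) ** 2  # squared epipolar residual x'^T F x
    return e2 / (Fx[0]**2 + Fx[1]**2) + e2 / (Ftx[0]**2 + Ftx[1]**2)
```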

We compute the average reprojection errors of the 6 basic scenes under different thresholds in the RANSAC algorithm, as shown in Figure 9 and Table 1. The results show that, in our experiments, Nistér's 5 pt algorithm generates the minimal reprojection error and our solution performs similarly to the 6 pt algorithm.

4.3.2. Efficiency

We analyze the efficiency of S2fM both in theory and in practice.

(1) Theoretical Analysis. Since the camera rotation can be read out from the gyroscope in real time, S2fM spends most of its time on translation estimation by the visual method. The time complexity of the traditional algorithms can be analyzed by the number of feature pairs needed in solving the linear equations. In our experiments, we adopt the SVD algorithm to solve the linear equations, whose time complexity is $O(mn^2)$ for an $m \times n$ coefficient matrix. As a result, the time complexity ratios of solving one linear equation set for CPE are as shown in Table 2 (with S2fM set to 1).

(2) Practical Analysis. In practice, all of the algorithms (S2fM, 8 pt, 6 pt, and 5 pt) need to adopt the RANSAC algorithm to estimate the optimal solution of the linear equations, and the numbers of iterations are not the same. However, for each algorithm, we set the upper limit of RANSAC iterations to 128 in our experiments and then compute the average frame-computing time over all the 11715 experiments above with the same SIFT features and the same running environment. With such a large number of experiments, the average value should be sufficient to compare the performances of all the algorithms, as shown in Table 3.

As seen in Table 3, the practical efficiency result generally tallies with the theoretical analysis.

4.4. Limitations

One main limitation of the proposed solution lies in the drift of gyroscopes. After running for a long period of time, the gyroscope data may drift and degrade the estimation accuracy. This can be compensated by timely recalibration, for example, assisted by other inertial sensors, to remain drift-free.

Another limitation is that our solution would ideally perform similarly to the 5 pt algorithm, which is the most similar in estimating translation, but the experimental results show that it performs worse than the 5 pt algorithm in accuracy. The reason S2fM does not achieve the ideal results lies in the focal length calibrating error and the drift of the gyroscopes; a more accurate calibrating algorithm would be of great help.

What is more, in our experiments we choose to use 3 pairs of feature correspondences to estimate the camera translation, but $T$ can also be estimated with 2 pairs of feature correspondences by enforcing a certain constraint to reduce the degrees of freedom to 2. However, the purpose of this paper is to provide a fusion solution for CPE, not a mathematical technique for solving equation sets.

5. Conclusion

Traditionally, CPE has been formulated as the problem of estimating the optimal camera pose given a set of point correspondences. This is the vision-based method. The first 8-point solver was proposed in the 1990s, and 7-point, 6-point, and 5-point solvers followed in the 2000s, all of which are based on the feature correspondence problem as reviewed in the literature. These solvers all focus on the mathematical technique of solving the formulated optimal estimation problem. In recent years, as MEMS sensors have become accurate enough and popular in many hand-held devices, MEMS-based methods have appeared in CPE-related problems. Generally, sensor data are coupled with image processing data through a data filter, for example, a Kalman filter, for robust camera pose estimation. The drawback of this method is again the computing complexity, due to the additional data filtering process. This paper proposes a camera pose estimation algorithm with a gyroscope sensor, which estimates camera rotation with the built-in gyroscopes of the device and then fuses it with the image data to estimate camera translation. The proposed fusion solution is quite different from existing fusion solutions in the manner in which it fuses inertial and visual data. We compare our solution with both the traditional solutions and an existing fusion solution, and the experimental results validate the efficiency of our solution. As for accuracy, our solution performs similarly to the 6 pt algorithm and can be further improved with better focal length calibration and drift compensation techniques.

Under the circumstance that no proper MEMS sensor can estimate translation accurately enough, S2fM provides a way to fuse inertial data with visual data to solve the problem.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.