Research Article  Open Access
An Indoor Navigation System Based on Stereo Camera and Inertial Sensors with Points and Lines
Abstract
An indoor navigation system based on stereo camera and inertial sensors with points and lines is proposed to further improve the accuracy and robustness of the navigation system in complex indoor environments. The point and line features, which are fast extracted by ORB method and line segment detector (LSD) method, are both employed in this system to improve its ability to adapt to complex environments. In addition, two different representations of lines are adopted to improve the efficiency of the system. Besides stereo camera, an inertial measurement unit (IMU) is also used in the system to further improve its accuracy and robustness. An estimator is designed to integrate the camera and IMU measurements in a tightly coupled approach. The experimental results show that the performance of the proposed navigation system is better than the pointonly VINS and the visiononly navigation system with points and lines.
1. Introduction
Indoor navigation technique, which has been widely applied in the field of mobile robot (e.g., home service robot) or unmanned aerial vehicle (UAV) system [1, 2], has received considerable attention in the past few years. One major challenge of the indoor navigation system is the unavailability of the global position system (GPS) signal in indoor environment. Therefore, many other sensors have been applied in the system such as sonar [3], odometry [4], light detection and ranging (LiDAR) [2], camera [5], and inertial measurement unit (IMU) [6]. With the recent development in visionbased techniques, cameras used as sensors make the visionbased navigation system more and more attractive [7]. Direct method and featurebased method are two main methods for indoor navigation system. While the direct method estimates the motion by minimizing the photometric error [8], we deal with featurebased method in this paper. For conventional featurebased visual navigation system, feature extraction methods are applied to extract the point features from images collected by the camera. After matching these point features between two frames, the pose of the system can be estimated through egomotion estimation methods (e.g., perspectivenpoint (PnP)) or by minimizing the reprojection error [9, 10]. Note that only point features are used in the conventional featurebased visual navigation system. However, in some lowtextured scenarios exemplified by manmade or indoor environments, it is hard to extract enough key points by suitable point feature extraction method. As a result, the performance of conventional featurebased method will decrease, which thus implies that other complementary features need to be explored for further performance improvement. In recent years, line feature has received more and more attention since lines or line segments provide significantly more information than points in encoding the structure of the surrounding environment [11]. They are usually abundant in humanmade scenarios, which are characterized by regular structures rich in edges and linear shapes [12]. In addition, recent advances in the line segment detection method have made it possible to extract line features fast and accurately. Therefore, in some lowtextured environments, the line features can be used in the featurebased visual navigation system to improve the navigation accuracy [11, 13]. In this paper, we choose the stereo camera to acquire images of indoor environments; thereafter, points and lines features are all extracted from these images. In addition, the scale information of the features can be estimated by stereo camera because of the small scale of the indoor environments.
Although the camera can be used efficiently in navigation system, it is not robust to motion blur induced by rapid motion [14] and the visionbased navigation system often fails in rapid motion scenarios. In order to make the system more robust, a common solution is fusing camera with inertial sensors, which leads to the visualinertial navigation system (VINS) [6, 15]. Inertial sensors usually contain three orthogonal gyroscopes and accelerometers to measure the angular velocity and acceleration of the carrier and estimate the carrierâ€™s motion in high frequency. However, it suffers from the accumulated error and relatively large measurement uncertainty at slow motion [16]. It can be found that visual and inertial measurements offer complementary properties which make them particularly suitable for fusion [17], since inertial sensors have a relatively low uncertainty at high velocity, whereas cameras can track features very accurately at low velocity and less accurately with increasing velocity [16]. Consequently, VINS works better than visiononly or pure inertial navigation system. Without loss of generality, the VINS can be classified into two methods, filteringbased method and optimizationbased method. Although the accuracy of optimization approach is better than the filtering approach, it is computationally expensive. Considering the extra computational cost in line feature extraction, we adopt the filtering approach which usually uses the extended Kalman filter (EKF) to estimate the pose of the system.
In this paper, we present a visualinertial navigation system with point and line features for indoor environments. The stereo camera is combined with an inertial sensor in a tightly coupled filtering approach to overcome the defect in single sensor. The point and line features are both used in the proposed system, such that this system can perform well in lowtextured environments such as indoor environments. In addition, the robustness of this navigation system can be improved by these two schemes. The performance of this system is also tested in an experiment which uses visualinertial benchmark dataset.
The rest of the paper is organized as follows. The related work is described simply in Section 2. In Section 3, we present the IMU model based on the inertial measurements and the representations of point and line features. The estimator and the implementation of the algorithm are described in Section 4. Experiments are conducted in Section 5 to demonstrate the performance of the proposed system. Finally, conclusions are drawn in Section 6.
2. Related Work
Compared to the simultaneous localization and mapping (SLAM) system, the visual navigation system does not have loop closures, but in a large part, it follows the SLAM paradigm (especially the frontend of the SLAM). For featurebased method, pioneering work has been carried out in [18]. It presents a monocular camera SLAM system which uses sparse key points and EKF to estimate the camera motion. In [19], it points out that the monocular camera suffers from the â€śscale driftâ€ť problem; however, the stereo camera can efficiently overcome this issue. In [20], it presents a stereo visual SLAM system which uses Harris corner detector to extract point features and EKF to estimate the pose. More recently, the ORBSLAM has been proposed in [21, 22], which supports all monocular, stereo, and RGBD cameras. In ORBSLAM, the ORB point features [23] have been used for the first time and the accuracy of the system is very high. However, the systems mentioned above would reduce their performance in lowtextured environment with insufficient point features. Therefore, massive efforts are devoted to the line segment detection method and its application in visual navigation and SLAM system.
A lineartime line segment detector has been proposed in [24, 25] with a high detection accuracy. In [13], a linebased monocular 2D SLAM system has been also presented, which incorporates vertical and horizontal lines to estimate the motion with EKF. In addition, another linebased system has been proposed in [11] to improve the efficiency of the system by adopting two different representations of a line segment. However, the line feature only lends itself to the structured environment, which implies that the system performance is likely to be degraded in complex environment by employing the single line feature. Taking into account the complementarity between point and line features, a combination of both features has beneficial effects on the robustness and accuracy of the navigation system. Recently, visual navigation systems with points and line segments have been proposed in many works [26â€“28], which reveals superior performance in a wide variety of realtime scenarios. However, the visualonly system has its own limitation and the VINS is developed to overcome it.
The filteringbased VINS is particularly relevant to our work. A visionaided inertial navigation system based on a multistate constraint Kalman filter has been proposed in [6, 15], presenting a novel system model which does not contain the feature position in state vector. Therefore, the computational overhead of the method is low. However, this system relies on large storage capacity, which makes it unavailable in some applications. A consistent VINS based on EKF has been proposed in [29]. In this paper, the author proposes a new method which improves the consistency of the system based on analyzing the observability of the VINS. An autonomous navigation system combines the stereo camera, and IMU has been proposed in [30], which improves the positioning accuracy by considering both the near and the far point features. However, all these methods mentioned above use point features only. Linebased VINS has been proposed in [31, 32], which use straight lines as features. These systems exhibit superior reconstruction performance against pointbased navigation system in linerich environment. However, in our work, we mainly focus on the VINS combining both point and line features.
3. IMU Model and Feature Representation
In this section, we will describe the IMU model and the camera model which contain the point and line features. To begin with, we define the reference coordinate frames which would be applied in the rest of the paper.
The navigation coordinate frame {} (also can be referred as the world coordinate frame) is chosen as the EastNorthUp (ENU) coordinate frame in this paper (the origin of this coordinate frame is fixed in the first frame of the system, and the , , and axes are aligned with the east, north, and up of the systemâ€™s first frame, resp.). The pose information of the system is all characterized with respect to in this coordinate frame. In addition, the navigation coordinate frame is fixed with the first frame of the system such that the effect of the earth rotation can be ignored (cf. (1) and (2)).
The IMU coordinate frame {} and the camera coordinate frame {} are fixed with the IMU and the stereo camera, respectively. Their origins are located at the center of the IMU and the optical center of the camera. The , , and axes of the IMU coordinate frame are aligned with the right, front, and up of the IMU, respectively. The axis of camera coordinate frame points aligns with the optical axis. There are two camera coordinate frames, which are denoted in terms of left camera coordinate frame and the right camera coordinate frame . The transition matrix between them is obtained though camera calibration, and the is chosen as the reference camera coordinate frame {}.
In addition, for the sake of simplicity, in Sections 3 and 4, we assume that the IMU coordinate frame {} and the camera coordinate frame {} (left camera coordinate frame) are coincident (in practice, the transition matrix between these two coordinate frames can be extrinsically calibrated following the method proposed in [33]). To avoid confusion, in Sections 3 and 4, we use the other notation {} (body coordinate frame) to represent both coordinate frames (in practice, one of the {} and {} can be selected as the body coordinate frame).
3.1. IMU Model
The angular velocity and the specific force of the system in the body coordinate frame can be measured by the IMU. These measurements include the angular velocity and the acceleration information with noise [34]: where denotes the true angular velocity in body coordinate frame, denotes the true acceleration in navigation coordinate frame, and are the constant drifts of the gyroscopes and the accelerometers in body coordinate frame, respectively, and are the random noises of the gyroscopes and the accelerometers in body coordinate frame, which can be modeled as uncorrelated zeromean white Gaussian noise, denotes the rotation matrix from the navigation coordinate frame to the body coordinate frame, and is the gravity vector.
The estimates of the attitude information (described in the form of unit quaternion), the velocity , and the position can be updated by the IMU measurements as follows [34]: where denotes the estimated rotation matrix from the body coordinate frame to the navigation coordinate frame, and are the estimates of the constant drifts of the gyroscopes and the accelerometers, which can be obtained by the calibration method [35], and
In addition, the unit quaternion , the attitudes , and the rotation matrixes and can be interchangeable [34].
In addition, and can be modeled as follows:
According to the analysis above, the IMU state vector can be described as follows:
In practice, the angular velocity and the specific force are sampled by IMU at discrete times, so (3) should be calculated in a discrete method. Therefore, we assume that the and are constant between the two sampling times; thereafter, (3) can be discretized in an easy way.
3.2. Feature Representation
The point and line features are extracted from the images collected by the stereo camera. In this subsection, we will describe the representations of the point and line features.
3.2.1. Point Features
Owing to small scale of the indoor environments, the depth of the most points in images can be estimated by stereo camera through the baseline between the left and right cameras, so point features can be coded with a Cartesian coordinate frame as follows: where denotes the position of point features in body coordinate frame and it can be estimated from the pixel coordinate value of the point features as follows: where is the baseline between the left camera and the right camera, is the disparity, and are the pixel coordinate value of the point features in the left camera image, and are the pixel coordinate value of the optical center in the left camera image, and is the focal length. In addition, , , and are the intrinsic parameters of the camera which can be obtained by the camera calibration method [36].
The position of point features in navigation coordinate frame is also important in the proposed system, and the estimated position can be computed as follows:
This process is only executed when the point feature is detected in the first time. In this section, for the sake of simplicity, the rotation matrix between IMU coordinate frame and camera coordinate frame and the relative position between these two coordinate frames are ignored, but in practice, they should be considered.
According to the analysis above, the point feature state vector can be described as follows: where denotes the position of the th point features in navigation coordinate frame and is the number of the point features and this number is a variable because of the different images collected by the camera.
3.2.2. Line Features
Different from the points represented by in threedimensional spaces, lines are parameterized in 4 DOFs. In this paper, we introduce two different representations of lines, the PlĂĽcker coordinate frame representation and the orthonormal representation [11, 37]. The PlĂĽcker coordinate frame representation is initialized when the line features are detected in the first time, and the orthonormal representation is used in the estimator.
The PlĂĽcker coordinate frame representation is determined by two points on the line. Assuming the homogeneous coordinate frames of these two points are and , the PlĂĽcker matrix can be determined as follows:
The PlĂĽcker matrix is a 4â€‰Ă—â€‰4 antisymmetric matrix in which the diagonal elements are zero and there are six nonzero elements in this matrix. The PlĂĽcker coordinate frame is a 6â€‰Ă—â€‰1 vector composed by these six nonzero elements as below: where and are the 3â€‰Ă—â€‰1 inhomogeneous coordinate frames of the two points. is orthogonal to the interpretation plane containing the line and the origin, and is the direction of the line. The relationship between the PlĂĽcker matrix and the PlĂĽcker coordinate frame is expressed as where denotes the skewsymmetric crossproduct matrix with a vector and the is defined as
The transformation of the PlĂĽcker coordinate frame in different reference coordinate frames (e.g., navigation coordinate frame and body coordinate frame) can be described as follows [38]: where denotes the translation vector from navigation coordinate frame to the body coordinate frame.
The 3D lines represented by PlĂĽcker coordinate frames in body coordinate frame can also be projected onto the image plane as follows: where is the 2D line in the image plane represented by pixel coordinate frame. It can be found that only is used in this projection process.
According to the analysis above, six elements are used in the PlĂĽcker coordinate frame representation while the lineâ€™s DOFs are four. If using this representation in estimator, superfluous elements will cost more computation. In addition, the orthogonal constraint between the and , that is, , will decrease the numerical stability, and thus, we need to use another representation method in the estimator.
The orthonormal representation uses minimum 4 parameters to represent a line, and this representation can be derived from the PlĂĽcker coordinate frame representation [37].
We define a matrix and decompose this matrix by QR decomposition as and set
The 3D line can be represented by where . This means the and are three and twodimensional rotation matrices, respectively. Thus, the matrix can be updated accordingly by a vector containing 3 parameters such as Euler angles and can be updated by a scalar parameter as follows: where and denote the threedimensional rotation matrices.
We can define a 4â€‰Ă—â€‰1 increment vector which can be used in the estimator presented in the next section to represent the error of line features. The detailed discussions will be provided in Section 4.
In addition, the PlĂĽcker coordinate frame can be transferred from the orthonormal representation as follows: where denotes the th column of U.
4. Estimator Establishment and Algorithm Implementation
In this section, an EKF estimator is presented. The IMU measurements along with the point and line features mentioned in Section 3 are combined based on this estimator in a tightly coupled approach. In addition, we will elaborate the feature detection methods, the feature initialization, and some other processing steps.
4.1. ErrorState Equation
In our work, the EKF estimator uses the error model, which is defined as follows: where , , and are the errorstate vectors of IMU, point features, and line features, respectively.
The IMU errorstate vector can be defined as where , , , and are the velocity error, the position error, the error of gyroscope constant drift, and the error of accelerometer constant drift, respectively. The aforementioned error is formulated as where is the true value and is the estimated value. is the attitude error computed from the error quaternion which is defined as
The relationship between the true quaternion and the estimated quaternion is defined as , where is the quaternion multiplication [34].
Similarly, the errorstate vector of point features can be defined as where is the position error of the th point features in navigation coordinate frame which is defined as . In addition, .
The errorstate vector of the line features can be defined as follows: where denotes the error vector of the th line features in navigation coordinate frame and is the number of the line features. It is worth to mention that is a variable because of the different images collected by the camera. In addition,
According to the analysis above, the errorstate equation can be established as follows: where is the process noise assumed to be uncorrelated white zeromean Gaussian noise the covariance matrix of which is and
Owing to the estimator is executed in the discrete time, and the differential equation needs to be transformed into the discrete equation as follows: where is the sampling period and denotes the th frame.
In addition, in any environment, the stereo camera is used to capture a series of fixed images. According to the point and line feature extraction methods, for a series of fixed images, the extracted points and lines are also fixed and we assume that the matched features are all true positive, and thus, there is no random noise for these features. Actually, they only contain the measurement noise the differential of which equal to zero. Therefore, in (27), there is no noise of point and line features.
4.2. Measurement Equation
The featuresâ€™ reprojection errors in the image plane are selected as the measurement. For the sake of illustration, we present the reprojection errors by considering the case where a single point feature and a single line feature are present.
We first define as the th point feature which can be observed by the camera in the th frame. The reprojection error of the point feature can be defined as follows: where is the systemâ€™s position in the th frame, is the point featureâ€™s estimated position in the body coordinate frame, is the point featureâ€™s reprojection position in the image plane, and is its pixel value. is the point featureâ€™s observation value in the image plane, and is its pixel value.
We define as the th line feature which can be observed by the camera in the th frame. The reprojection error of the line feature can be defined as follows: where is the line featureâ€™s estimated position in the body coordinate frame, is the point featureâ€™s reprojection position in the image plane. In addition, when the line feature is observed by the camera, we can obtain the pixel value of the startpoint and endpoint of the line feature. In (33), and denote the startpoint and endpoint of the line featureâ€™s observation value in the image plane. Note that and are represented by the homogeneous coordinate frames which are vectors (the first two elements are the pixel value, and the third element is 1).
According to the analysis above, the measurement vector can be selected as follows:
The errorstate equation can be established as follows: where is the measurement noise assumed to be uncorrelated white zeromean Gaussian process the covariance matrix of which is and H is the measurement Jacobian matrix formulated as: where the individual matrixes and can be computed as where , , , , and .
The measurement equation can also be written as the discrete form:
4.3. Estimator and Algorithm Implementation
According to the analysis in the previous section, it can be found that the errorstate equation and the measurement equation are nonlinear. In order to deal with the linearization error, we use the EKF estimator to fuse the inertial measurement and the images observed by the stereo camera. As presented in the previous section, these nonlinear functions have been linearized.
Based on the linearized errorstate and measurement equations, the EKF estimator can be executed with the following two steps: (i)Predict where is the covariance matrix of the state vector.(ii)Update where is the Kalman gain matrix.
The covariance matrices , , and in the above equations are empirically initialized based on the characteristics of the sensors. Specifically, and are initialized, respectively, associated with the random noise of the inertial sensors and the scales of the feature space. is initialized concerning the constant drifts of the inertial sensors and the scales of the feature space.
Finally, we summarize the whole algorithm. In this system, static base alignment is achieved for obtaining the initial attitude angle by following the method in [30]. Subsequently, analogous to the inertial navigation system [34], the IMU measurement is used to update the pose and the velocity of the system by solving (3).
On the other hand, the images are collected by the stereo camera; after performing the feature detection and initialization which will be discussed in the next subsection, the errorstate and measurement equations can be established by (22), (23), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), and (38) with the IMU measurement and the update result of the inertial navigation. In the last step, the EKF estimator is executed by (39) and (40). In addition, the covariance matrix of the EKF should be augmented when new point or line features are observed.
4.4. Feature Detection and Initialization
In terms of the feature detection, the ORB method [23] is used to extract point features from the images. This method deals with the orientation problem of the FAST method and generates a very fast binary descriptor based on BRIEF to describe the detected point features. Therefore, this method can detect and describe the point features in a fast, efficient, and accurate manner. In [21, 22], the ORB method was used in the visual SLAM system, which resulted in an excellent performance.
On the other hand, the LSD method [24, 25] is used to extract the line features from the images. Its mechanism is to merge pixels with the similar gradient direction. This method can extract the lines with subpixel accuracy in linear time. In addition, the detected lines are described by a binary descriptor, line band descriptor (LBD) [39], which allows accurate matching between two frames or two cameras (left and right).
The nearestneighbor method is used to match the features between the two frames based on the descriptor of features. In order to reduce the false matches, the random sample consensus (RANSAC) method is used to filter out the outliers for both point features and line features.
Once a new feature is detected and described, we next perform binocular matching. For a point feature in the left image, its counterpart in the right image can be acquired along the parallel epipolar line in the right image. The Hamming distance is used to measure the similarity of point features with a predefined threshold. For the line feature, we exploit the matching method proposed in [40].
The unsuccessfully matched features will be discarded. After matching the point and line features between the two images, the positions of the point features and the two endpoints of the line features can be obtained by (8). The representations of line features in the 3D space can be computed by (12). The point features and lines in the navigation coordinate frame can be computed by (9) and (16). With the abovementioned feature initialization, the initial positions of the features in the navigation system are obtained and are thus used to compute the reprojection error in the measurement equation. In addition, after executing one filtering process, the error of the featuresâ€™ positions in the navigation coordinate frame can be estimated after feature filtering. Therefore, these positions can be updated in the filtering processes and become progressively accurate.
5. Experiments and Results
In this section, experiments are conducted to test the effectiveness of the proposed indoor navigation system. We assume our system is tailored for any structured indoor environment where the geometric scale information can be calculated by the stereo camera and the object textures can be characterized by either point or line features. We use EuRoC dataset [41] for evaluating our approach and use rootmeansquare error (RMSE) for performance measure. In the comparative study, we compare the proposed navigation system (PLVINS), with the StVOPL [26] and the system based on multistate constraint Kalman filter (MSCKF) [15]. The StVOPL is a stateoftheart visual navigation system using stereo camera with points and lines, while the MSCKF is an EKFbased visualinertial navigation system with high accuracy. In implementation, we directly use the publicly available source code for both StVOPL and MSCKF methods. For the sake of consistency, we repeat all the competing approaches ten times on all the sequences and the final experimental results are obtained by averaging the accuracy scores.
5.1. Experiment Description
The EuRoC dataset contains 11 sequences which are categorized into two types both recorded in three indoor environments. The first type of the dataset contains the first five sequences which are recorded in a machine hall (MH). The second type of the dataset contains the remaining six sequences recorded in Vicon room 1 (V1) and Vicon room 2 (V2) with an approximate size of 8â€‰mâ€‰Ă—â€‰8.4â€‰mâ€‰Ă—â€‰4â€‰m. All eleven sequences include the stereo images and the IMU measurements containing the gyroscopes, accelerometers data along with the ground truth. More specifically, the stereo images are recorded by a stereo camera, MT9V034 with 2â€‰Ă—â€‰20â€‰Hz while the IMU measurements are recorded by a MEMS inertial sensor, ADIS16448 with 200â€‰Hz. Besides, the ground truths are provided by Leica MS50 and VICON. All these equipment are mounted on a micro air vehicle (MAV) which flies in the indoor environment to record the data. More details about the EuRoC dataset can be found in [41]. In addition, these 11 sequences provide varying challenges relevant to the speed of the MAV, illumination, texture, and so forth. In some sequences, especially for the sequence Vicon room 2 03, the environment is lowtextured with few visual contents.
According to the description of the EuRoC dataset, it can be found that these sequences are suitable for testing the proposed indoor navigation system because of the visualinertial data and the lowtextured indoor environments. In addition, to the best of our knowledge, this dataset is the only dataset which provides visualinertial data and ground truth in indoor environments.
In order to test the algorithms quantitatively, the transformation error matrix between two coordinate frames can be computed as follows: where is the transformation matrix between two coordinate frames provided by the ground truth, is the estimation transformation matrix from the navigation system, and is the error matrix. The translation errors and the rotation errors can be obtained from this error matrix.
The RMSE of translation and rotation can also be computed for comparison of different systems: where is the number of the frames, is the 3â€‰Ă—â€‰1 translation error vector, and is the 3â€‰Ă—â€‰1 rotation error vector which are represented by the Euler angles.
5.2. Results and Discussion
5.2.1. Translation Errors
Extensive evaluations on 11 sequences are carried out by making use of three different navigation systems mentioned above. First, we analyze the translation errors of the dataset. Figures 1 and 2 show the translation errors of different methods on sequences machine hall 04 and Vicon room 2 02. Table 1 gives the relative , , , and whole translation RMSE errors of all 11 sequences. In Table 1, the â€śâ€ť column means the whole translation RMSE errors. â€śMH,â€ť â€śV1,â€ť and â€śV2,â€ť respectively, indicate the â€śmachine hall,â€ť the â€śVicon room 1,â€ť and the â€śVicon room 2.â€ť â€śMH01â€ť implies the sequence 01 in machine hall, and another four sequences named from â€śMH02â€ť to â€śMH05â€ť are also used in our evaluations. Besides, we also use the respective three sequences in V1 and V2 in our experiments, namely, â€śV1 01,â€ť â€śV1 02,â€ť â€śV1 03,â€ť â€śV2 01,â€ť â€śV2 02,â€ť and â€śV2 03.â€ť In addition, in order to demonstrate the robot movement in some environment, we give the ground truth trajectories and the trajectories estimated by the proposed method as shown in Figure 3.

(a)
(b)
It can be seen from Figures 1 and 2 and Table 1 that the proposed navigation system reports higher accuracy than MSCKF and StVOPL in terms of translation errors. The StVOPL achieves the lowest positioning accuracy with the largest , , , and whole translation RMSE errors. For most of the sequences, the translation RMSE errors of the proposed method are better than those of the MSCKF. In addition, the proposed navigation system is more robust than the other two systems. The error curves of the proposed method are smoother than those of the StVOPL. On sequence Vicon room 2 03, both the StVOPL and MSCKF underperform while the proposed system exhibits superior performance.
It is observed that in terms of the sequences with substantial point features generated, pointbased VINS performs well such that the line features provide limited performance improvements. In some cases, particularly, the measurement noise of the line features and the incorrect matches could lead to a slight worse performance such as the results in sequence machine hall 02. It is also shown that our approach outperforms MSCKF in the case when the features are insufficient. Thus, in specific cases, the performance of the proposed system is likely to be compromised by the measurement noise and the false matches; however, most of the results significantly demonstrate the advantage of our system against other competing system.
It is also observed from figures that the error of the StVOPL increases over time. Since StVOPL is a visionbased navigation system, serious accumulated error is produced, which significantly impairs the performance of the system. By contrast, the proposed method and the MSCKF are visualinertial navigation system, which enables reducing the error accumulation. Besides, both of them achieve very high positioning accuracy. Furthermore, compared to the MSCKF, the proposed method utilizes both the line features and point features. The line features abundant in indoor or manmade environments can use edges and linear shapes in environments. Thus, in some complex lowtextured environments such as sequence Vicon room 2 03, the proposed system can still extract enough features to estimate the motion, while other systems such as MSCKF suffer from the lack of point features, which contributes to better positioning accuracy and the robustness of the proposed system.
5.2.2. Rotation Errors
In this subsection, we analyze the rotation errors of the EuRoC dataset. Similarly, as for sequences machine hall 05 and Vicon room 1 03, the rotation errors of three systems along with time are shown in Figures 4 and 5. The relative , , , and whole rotation RMSE errors of all 11 sequences are shown in Table 2. In Table 2, the â€śâ€ť column denotes the whole rotation RMSE errors.

The results of the rotation is similar to the results of the translation. It can be seen from the two figures and the table that the rotation accuracy of the StVOPL is much worse than the other two systems with large errors of roll, pitch, and yaw angles. Despite the smooth motion in some sequences, the StVOPL system does not perform well, due to the accumulated error of the visual navigation system. As for the other two navigation systems, the proposed navigation system has the relatively small rotation errors. However, in some sequences, the rotation errors of these two systems are very close, probably due to the smooth movement. For the challenging sequences with sudden angle changes, the proposed navigation system performs better than the MSCKF. In addition, the robustness of the proposed navigation is also better than other two systems. Both StVOPL and MSCKF fail for the sequence Vicon room 2 03 which has many lowtextured scenes, but the proposed navigation system executes successfully. As analogous to the translation errors, our method reports higher rotation errors than the MSCKF in some scenario, which can be caused by the measurement noise and false matches of line features.
Finally, it can be observed from Tables 1 and 2 that our approach has been evaluated on all the eleven video clips which provides different challenges, and the experimental results demonstrate that our method beats the other competing techniques on ten video clips, which considerably suggests the consistent superiority of our method.
To summarize, the proposed navigation system performs much better than the visualonly navigation system with points and lines and better than the pointonly VINS. The reason is that the inertial sensors can aid the visual sensors to improve the performance of the navigation system. What is more, the datasets used in our work are all recorded in manmade environments which contain abundant line features and in some sequences; these environments contain some lowtextured senses which lack enough point features. Therefore, the application of line feature can improve the performance of the system to a certain degree.
5.2.3. Qualitative Evaluations of Both Features in Different Environment
Figures 6 and 7 qualitatively illustrate the matched point and line features between two frames in the dark environment and in the environment with some curves.
(a)
(b)
(a)
(b)
It is shown in Figure 6 that it is difficult to extract substantial point features and line features due to insufficient visual contents in the dark environment. To be specific, the point features are distributed intensively while the line patterns are scattered sparsely. Despite the significant variance in visual distribution, combining point and line features sufficiently encode the most image textures, which indicates the strong correlation between the two features.
It is observed in Figure 7 that line features can be extracted on the smooth surfaces or barshaped objects except some curves with high curvature. By contrast, the point features can be extracted on these curves with high curvature yet fail to be mined on the smooth surfaces or barshaped objects. Based on this observation, we argue that there exists substantial complementarity between these two features. Therefore, fusing the point and line features significantly contributes to the performance improvements.
In addition, sufficient correct matches are observed from these two figures, which thus guarantees the estimation accuracy. Subsequently, the estimated position errors of both features can be derived and revised accordingly. Therefore, the accuracy of these positions will be progressively accurate.
6. Conclusions
In this paper, we present an indoor navigation system combining stereo camera and inertial sensors with point and line features. The stereo images and the IMU measurements are fused by the EKF estimator for more accurate and robust pose estimation. Furthermore, the point and line features are all extracted from the stereo images. This scheme can further improve the robustness and accuracy of the system and makes it perform well in complex indoor environment such as lowtexture or humanmade scenes. Experiments based on eleven sequences provided by the EuRoC dataset are conducted to test the proposed system. The results show that the proposed navigation system performs better than the pointonly VINS or the visiononly navigation system with points and lines in indoor environment, especially in more complex and challenging scenarios. So, the proposed navigation system can be applied to the complex indoor environment with high accuracy and robustness. In addition, the MEMS inertial sensors and the camera are all the lowcost and passive sensors, so, this system can be used in most environments and suitable for the lowcost autonomous navigation system. On the other hand, this system does not include the loop closure detection, so, its accuracy may be no higher than the visualinertial SLAM system. In addition, we do not consider the features which cannot obtain the scale information by stereo camera because of the far distance, so, the performance in outdoor environment will be worse. In future work, we will further improve this system in these two fields.
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
This study is supported in part by the National Natural Science Foundation of China (Grant nos. 61473085, 51775110, and 51375088), Fundamental Research Funds for the Central Universities (2242015R30031), and key laboratory fund of the Ministry of Public Security based on large data structure (2015DSJSYS002).
References
 M. Jung and J.B. Song, â€śRobust mapping and localization in indoor environments,â€ť Intelligent Service Robotics, vol. 10, no. 1, pp. 55â€“66, 2017. View at: Publisher Site  Google Scholar
 G. A. Kumar, A. K. Patil, R. Patil, S. S. Park, and Y. H. Chai, â€śA LiDAR and IMU integrated indoor navigation system for UAVs and its application in realtime pipeline classification,â€ť Sensors, vol. 17, no. 6, p. 1268, 2017. View at: Publisher Site  Google Scholar
 M. W. M. G. Dissanayake, P. Newman, S. Clark, H. F. DurrantWhyte, and M. Csorba, â€śA solution to the simultaneous localization and map building (SLAM) problem,â€ť IEEE Transactions on Robotics, vol. 17, no. 3, pp. 229â€“241, 2001. View at: Publisher Site  Google Scholar
 W. Chen and T. Zhang, â€śAn indoor mobile robot navigation technique using odometry and electronic compass,â€ť International Journal of Advanced Robotic Systems, vol. 14, no. 3, article 1729881417711643, 2017. View at: Publisher Site  Google Scholar
 A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, â€śMonoSLAM: realtime single camera SLAM,â€ť IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052â€“1067, 2007. View at: Publisher Site  Google Scholar
 A. I. Mourikis and S. Roumeliotis, â€śA multistate constraint Kalman filter for visionaided inertial navigation,â€ť in 2007 IEEE International Conference on Robotics and Automation, pp. 3565â€“3572, Roma, Italy, April 2007. View at: Publisher Site  Google Scholar
 A. BenAfia, V. GayBellile, A.C. Escher et al., â€śReview and classification of visionbased localisation techniques in unknown environments,â€ť IET Radar, Sonar & Navigation, vol. 8, no. 9, pp. 1059â€“1072, 2014. View at: Publisher Site  Google Scholar
 J. Engel, V. Koltun, and D. Cremers, â€śDirect sparse odometry,â€ť IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611â€“625, 2018. View at: Publisher Site  Google Scholar
 F. Fraundorfer and D. Scaramuzza, â€śVisual odometry: part I: the first 30 years and fundamentals,â€ť IEEE Robotics and Automation Magazine, vol. 18, no. 4, pp. 80â€“92, 2011. View at: Google Scholar
 F. Fraundorfer and D. Scaramuzza, â€śVisual odometry : part II: matching, robustness, optimization, and applications,â€ť IEEE Robotics and Automation Magazine, vol. 19, no. 2, pp. 78â€“90, 2012. View at: Publisher Site  Google Scholar
 G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, â€śBuilding a 3D linebased map using stereo SLAM,â€ť IEEE Transactions on Robotics, vol. 31, no. 6, pp. 1364â€“1377, 2015. View at: Publisher Site  Google Scholar
 R. GomezOjeda, J. Briales, and J. GonzalezJimenez, â€śPLSVO: semidirect monocular visual odometry by combining points and line segments,â€ť in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4211â€“4216, Daejeon, South Korea, October 2016. View at: Publisher Site  Google Scholar
 G. Zhang and I. Suh, â€śBuilding a partial 3D linebased map using a monocular SLAM,â€ť in 2011 IEEE International Conference on Robotics and Automation, pp. 1497â€“1502, Shanghai, China, May 2011. View at: Publisher Site  Google Scholar
 L. Clement, V. Peretroukhin, J. Lambert, and J. Kelly, â€śThe battle for filter supremacy: a comparative study of the multistate constraint Kalman filter and the sliding window filter,â€ť in 2015 12th Conference on Computer and Robot Vision (CRV), pp. 23â€“30, Halifax, NS, Canada, June 2015. View at: Publisher Site  Google Scholar
 M. Li and A. I. Mourikis, â€śHighprecision, consistent EKFbased visualinertial odometry,â€ť International Journal of Robotics Research, vol. 32, no. 6, pp. 690â€“711, 2013. View at: Publisher Site  Google Scholar
 P. Corke, J. Lobo, and J. Dias, â€śAn introduction to inertial and visual sensing,â€ť The International Journal of Robotics Research, vol. 26, no. 6, pp. 519â€“535, 2007. View at: Publisher Site  Google Scholar
 S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, â€śKeyframebased visualâ€“inertial odometry using nonlinear optimization,â€ť The International Journal of Robotics Research, vol. 34, no. 3, pp. 314â€“334, 2015. View at: Publisher Site  Google Scholar
 A. J. Davison, â€śRealtime simultaneous localisation and mapping with a single camera,â€ť in 2003. Proceedings. Ninth IEEE International Conference on Computer Vision, pp. 1403â€“1410, Nice, France, October 2003. View at: Publisher Site  Google Scholar
 L. M. Paz, P. Pinies, J. D. Tardos, and J. Neira, â€śLargescale 6DOF SLAM with stereoinhand,â€ť IEEE Transactions on Robotics, vol. 24, no. 5, pp. 946â€“957, 2008. View at: Publisher Site  Google Scholar
 A. J. Davison and D. W. Murray, â€śSimultaneous localization and mapbuilding using active vision,â€ť IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 865â€“880, 2002. View at: Publisher Site  Google Scholar
 R. MurArtal, J. M. M. Montiel, and J. D. Tardos, â€śORBSLAM: a versatile and accurate monocular SLAM system,â€ť IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147â€“1163, 2015. View at: Publisher Site  Google Scholar
 R. MurArtal and J. D. TardĂłs, â€śORBSLAM2: an opensource SLAM system for monocular, stereo, and RGBD cameras,â€ť IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255â€“1262, 2017. View at: Publisher Site  Google Scholar
 E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, â€śORB: an efficient alternative to SIFT or SURF,â€ť in 2011 International Conference on Computer Vision, pp. 2564â€“2571, Barcelona, Spain, November 2011. View at: Publisher Site  Google Scholar
 R. G. von Gioi, J. Jakubowicz, J.M. Morel, and G. Randall, â€śLSD: a fast line segment detector with a false detection control,â€ť IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 722â€“732, 2010. View at: Publisher Site  Google Scholar
 R. G. von Gioi, J. Jakubowicz, J.M. Morel, and G. Randall, â€śLSD: a line segment detector,â€ť Image Processing on Line, vol. 2, pp. 35â€“55, 2012. View at: Publisher Site  Google Scholar
 R. GomezOjeda and J. GonzalezJimenez, â€śRobust stereo visual odometry through a probabilistic combination of points and line segments,â€ť in 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2521â€“2526, Stockholm, Sweden, May 2016. View at: Publisher Site  Google Scholar
 A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. MorenoNoguer, â€śPLSLAM: realtime monocular visual SLAM with points and lines,â€ť in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4503â€“4508, Singapore, Singapore, May 2017. View at: Publisher Site  Google Scholar
 R. GomezOjeda, D. ZuĂ±igaNoĂ«l, F.A. Moreno, D. Scaramuzza, and J. GonzalezJimenez, â€śPLSLAM: A stereo SLAM system through the combination of points and line segments,â€ť 2017, http://arxiv.org/abs/1705.09479. View at: Google Scholar
 G. Huang, M. Kaess, and J. Leonard, â€śTowards consistent visualinertial navigation,â€ť in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4926â€“4933, Hong Kong, China, May 2014. View at: Publisher Site  Google Scholar
 Z. Xian, X. Hu, and J. Lian, â€śFusing stereo camera and lowcost inertial measurement unit for autonomous navigation in a tightlycoupled approach,â€ť Journal of Navigation, vol. 68, no. 3, pp. 434â€“452, 2015. View at: Publisher Site  Google Scholar
 D. Kottas and S. Roumeliotis, â€śEfficient and consistent visionaided inertial navigation using line observations,â€ť in 2013 IEEE International Conference on Robotics and Automation, pp. 1540â€“1547, Karlsruhe, Germany, May 2013. View at: Publisher Site  Google Scholar
 H. Yu and A. Mourikis, â€śVisionaided inertial navigation with line features and a rollingshutter camera,â€ť in 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 892â€“899, Karlsruhe, Germany, September 2015. View at: Publisher Site  Google Scholar
 C. Guo and I. Roumeliotis, â€śIMURGBD camera 3D pose estimation and extrinsic calibration: observability analysis and consistency improvement,â€ť in 2013 IEEE International Conference on Robotics and Automation, pp. 2935â€“2942, Karlsruhe, Germany, May 2013. View at: Publisher Site  Google Scholar
 D. H. Titterton and J. L. Weston, Strapdown Inertial Navigation Technology, Lavenham Press Ltd, London, UK, 2nd edition, 2004.
 J. Rohac, M. Sipos, and J. Simanek, â€śCalibration of lowcost triaxial inertial sensors,â€ť IEEE Instrumentation & Measurement Magazine, vol. 18, no. 6, pp. 32â€“38, 2015. View at: Publisher Site  Google Scholar
 Z. Zhang, â€śA flexible new technique for camera calibration,â€ť IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330â€“1334, 2000. View at: Publisher Site  Google Scholar
 A. Bartoli and P. Sturm, â€śStructurefrommotion using lines: representation, triangulation and bundle adjustment,â€ť Computer Vision and Image Understanding, vol. 100, no. 3, pp. 416â€“441, 2005. View at: Publisher Site  Google Scholar
 A. Bartoli and P. Sturm, â€śThe 3D line motion matrix and alignment of line reconstructions,â€ť International Journal of Computer Vision, vol. 57, no. 3, pp. 159â€“178, 2004. View at: Publisher Site  Google Scholar
 L. Zhang and R. Koch, â€śAn efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency,â€ť Journal of Visual Communication and Image Representation, vol. 24, no. 7, pp. 794â€“805, 2013. View at: Publisher Site  Google Scholar
 D.M. Woo, D.C. Park, S.S. Han, and S. Beack, â€ś2D line matching using geometric and intensity data,â€ť in 2009. AICI '09. International Conference on Artificial Intelligence and Computational Intelligence, pp. 99â€“103, Shanghai, China, November 2009. View at: Publisher Site  Google Scholar
 M. Burri, J. Nikolic, P. Gohl et al., â€śThe EuRoC micro aerial vehicle datasets,â€ť International Journal of Robotics Research, vol. 35, no. 10, pp. 1157â€“1163, 2016. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2018 Bo Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.