Journal of Sensors

Volume 2018, Article ID 4801584, 14 pages

https://doi.org/10.1155/2018/4801584

## An Indoor Navigation System Based on Stereo Camera and Inertial Sensors with Points and Lines

Key Laboratory of Micro-Inertial Instrument and Advanced Navigation Technology, Ministry of Education, School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China

Correspondence should be addressed to Xiaosu Xu; nc.ude.ues@sxx

Received 20 December 2017; Revised 25 May 2018; Accepted 4 June 2018; Published 5 July 2018

Academic Editor: Frederick Mailly

Copyright © 2018 Bo Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

An indoor navigation system based on stereo camera and inertial sensors with points and lines is proposed to further improve the accuracy and robustness of the navigation system in complex indoor environments. The point and line features, which are fast extracted by ORB method and line segment detector (LSD) method, are both employed in this system to improve its ability to adapt to complex environments. In addition, two different representations of lines are adopted to improve the efficiency of the system. Besides stereo camera, an inertial measurement unit (IMU) is also used in the system to further improve its accuracy and robustness. An estimator is designed to integrate the camera and IMU measurements in a tightly coupled approach. The experimental results show that the performance of the proposed navigation system is better than the point-only VINS and the vision-only navigation system with points and lines.

#### 1. Introduction

Indoor navigation technique, which has been widely applied in the field of mobile robot (e.g., home service robot) or unmanned aerial vehicle (UAV) system [1, 2], has received considerable attention in the past few years. One major challenge of the indoor navigation system is the unavailability of the global position system (GPS) signal in indoor environment. Therefore, many other sensors have been applied in the system such as sonar [3], odometry [4], light detection and ranging (LiDAR) [2], camera [5], and inertial measurement unit (IMU) [6]. With the recent development in vision-based techniques, cameras used as sensors make the vision-based navigation system more and more attractive [7]. Direct method and feature-based method are two main methods for indoor navigation system. While the direct method estimates the motion by minimizing the photometric error [8], we deal with feature-based method in this paper. For conventional feature-based visual navigation system, feature extraction methods are applied to extract the point features from images collected by the camera. After matching these point features between two frames, the pose of the system can be estimated through ego-motion estimation methods (e.g., perspective-n-point (PnP)) or by minimizing the reprojection error [9, 10]. Note that only point features are used in the conventional feature-based visual navigation system. However, in some low-textured scenarios exemplified by man-made or indoor environments, it is hard to extract enough key points by suitable point feature extraction method. As a result, the performance of conventional feature-based method will decrease, which thus implies that other complementary features need to be explored for further performance improvement. In recent years, line feature has received more and more attention since lines or line segments provide significantly more information than points in encoding the structure of the surrounding environment [11]. They are usually abundant in human-made scenarios, which are characterized by regular structures rich in edges and linear shapes [12]. In addition, recent advances in the line segment detection method have made it possible to extract line features fast and accurately. Therefore, in some low-textured environments, the line features can be used in the feature-based visual navigation system to improve the navigation accuracy [11, 13]. In this paper, we choose the stereo camera to acquire images of indoor environments; thereafter, points and lines features are all extracted from these images. In addition, the scale information of the features can be estimated by stereo camera because of the small scale of the indoor environments.

Although the camera can be used efficiently in navigation system, it is not robust to motion blur induced by rapid motion [14] and the vision-based navigation system often fails in rapid motion scenarios. In order to make the system more robust, a common solution is fusing camera with inertial sensors, which leads to the visual-inertial navigation system (VINS) [6, 15]. Inertial sensors usually contain three orthogonal gyroscopes and accelerometers to measure the angular velocity and acceleration of the carrier and estimate the carrier’s motion in high frequency. However, it suffers from the accumulated error and relatively large measurement uncertainty at slow motion [16]. It can be found that visual and inertial measurements offer complementary properties which make them particularly suitable for fusion [17], since inertial sensors have a relatively low uncertainty at high velocity, whereas cameras can track features very accurately at low velocity and less accurately with increasing velocity [16]. Consequently, VINS works better than vision-only or pure inertial navigation system. Without loss of generality, the VINS can be classified into two methods, filtering-based method and optimization-based method. Although the accuracy of optimization approach is better than the filtering approach, it is computationally expensive. Considering the extra computational cost in line feature extraction, we adopt the filtering approach which usually uses the extended Kalman filter (EKF) to estimate the pose of the system.

In this paper, we present a visual-inertial navigation system with point and line features for indoor environments. The stereo camera is combined with an inertial sensor in a tightly coupled filtering approach to overcome the defect in single sensor. The point and line features are both used in the proposed system, such that this system can perform well in low-textured environments such as indoor environments. In addition, the robustness of this navigation system can be improved by these two schemes. The performance of this system is also tested in an experiment which uses visual-inertial benchmark dataset.

The rest of the paper is organized as follows. The related work is described simply in Section 2. In Section 3, we present the IMU model based on the inertial measurements and the representations of point and line features. The estimator and the implementation of the algorithm are described in Section 4. Experiments are conducted in Section 5 to demonstrate the performance of the proposed system. Finally, conclusions are drawn in Section 6.

#### 2. Related Work

Compared to the simultaneous localization and mapping (SLAM) system, the visual navigation system does not have loop closures, but in a large part, it follows the SLAM paradigm (especially the front-end of the SLAM). For feature-based method, pioneering work has been carried out in [18]. It presents a monocular camera SLAM system which uses sparse key points and EKF to estimate the camera motion. In [19], it points out that the monocular camera suffers from the “scale drift” problem; however, the stereo camera can efficiently overcome this issue. In [20], it presents a stereo visual SLAM system which uses Harris corner detector to extract point features and EKF to estimate the pose. More recently, the ORB-SLAM has been proposed in [21, 22], which supports all monocular, stereo, and RGB-D cameras. In ORB-SLAM, the ORB point features [23] have been used for the first time and the accuracy of the system is very high. However, the systems mentioned above would reduce their performance in low-textured environment with insufficient point features. Therefore, massive efforts are devoted to the line segment detection method and its application in visual navigation and SLAM system.

A linear-time line segment detector has been proposed in [24, 25] with a high detection accuracy. In [13], a line-based monocular 2D SLAM system has been also presented, which incorporates vertical and horizontal lines to estimate the motion with EKF. In addition, another line-based system has been proposed in [11] to improve the efficiency of the system by adopting two different representations of a line segment. However, the line feature only lends itself to the structured environment, which implies that the system performance is likely to be degraded in complex environment by employing the single line feature. Taking into account the complementarity between point and line features, a combination of both features has beneficial effects on the robustness and accuracy of the navigation system. Recently, visual navigation systems with points and line segments have been proposed in many works [26–28], which reveals superior performance in a wide variety of real-time scenarios. However, the visual-only system has its own limitation and the VINS is developed to overcome it.

The filtering-based VINS is particularly relevant to our work. A vision-aided inertial navigation system based on a multistate constraint Kalman filter has been proposed in [6, 15], presenting a novel system model which does not contain the feature position in state vector. Therefore, the computational overhead of the method is low. However, this system relies on large storage capacity, which makes it unavailable in some applications. A consistent VINS based on EKF has been proposed in [29]. In this paper, the author proposes a new method which improves the consistency of the system based on analyzing the observability of the VINS. An autonomous navigation system combines the stereo camera, and IMU has been proposed in [30], which improves the positioning accuracy by considering both the near and the far point features. However, all these methods mentioned above use point features only. Line-based VINS has been proposed in [31, 32], which use straight lines as features. These systems exhibit superior reconstruction performance against point-based navigation system in line-rich environment. However, in our work, we mainly focus on the VINS combining both point and line features.

#### 3. IMU Model and Feature Representation

In this section, we will describe the IMU model and the camera model which contain the point and line features. To begin with, we define the reference coordinate frames which would be applied in the rest of the paper.

The navigation coordinate frame {} (also can be referred as the world coordinate frame) is chosen as the East-North-Up (ENU) coordinate frame in this paper (the origin of this coordinate frame is fixed in the first frame of the system, and the -, -, and -axes are aligned with the east, north, and up of the system’s first frame, resp.). The pose information of the system is all characterized with respect to in this coordinate frame. In addition, the navigation coordinate frame is fixed with the first frame of the system such that the effect of the earth rotation can be ignored (cf. (1) and (2)).

The IMU coordinate frame {} and the camera coordinate frame {} are fixed with the IMU and the stereo camera, respectively. Their origins are located at the center of the IMU and the optical center of the camera. The -, -, and -axes of the IMU coordinate frame are aligned with the right, front, and up of the IMU, respectively. The -axis of camera coordinate frame points aligns with the optical axis. There are two camera coordinate frames, which are denoted in terms of left camera coordinate frame and the right camera coordinate frame . The transition matrix between them is obtained though camera calibration, and the is chosen as the reference camera coordinate frame {}.

In addition, for the sake of simplicity, in Sections 3 and 4, we assume that the IMU coordinate frame {} and the camera coordinate frame {} (left camera coordinate frame) are coincident (in practice, the transition matrix between these two coordinate frames can be extrinsically calibrated following the method proposed in [33]). To avoid confusion, in Sections 3 and 4, we use the other notation {} (body coordinate frame) to represent both coordinate frames (in practice, one of the {} and {} can be selected as the body coordinate frame).

##### 3.1. IMU Model

The angular velocity and the specific force of the system in the body coordinate frame can be measured by the IMU. These measurements include the angular velocity and the acceleration information with noise [34]: where denotes the true angular velocity in body coordinate frame, denotes the true acceleration in navigation coordinate frame, and are the constant drifts of the gyroscopes and the accelerometers in body coordinate frame, respectively, and are the random noises of the gyroscopes and the accelerometers in body coordinate frame, which can be modeled as uncorrelated zero-mean white Gaussian noise, denotes the rotation matrix from the navigation coordinate frame to the body coordinate frame, and is the gravity vector.

The estimates of the attitude information (described in the form of unit quaternion), the velocity , and the position can be updated by the IMU measurements as follows [34]: where denotes the estimated rotation matrix from the body coordinate frame to the navigation coordinate frame, and are the estimates of the constant drifts of the gyroscopes and the accelerometers, which can be obtained by the calibration method [35], and

In addition, the unit quaternion , the attitudes , and the rotation matrixes and can be interchangeable [34].

In addition, and can be modeled as follows:

According to the analysis above, the IMU state vector can be described as follows:

In practice, the angular velocity and the specific force are sampled by IMU at discrete times, so (3) should be calculated in a discrete method. Therefore, we assume that the and are constant between the two sampling times; thereafter, (3) can be discretized in an easy way.

##### 3.2. Feature Representation

The point and line features are extracted from the images collected by the stereo camera. In this subsection, we will describe the representations of the point and line features.

###### 3.2.1. Point Features

Owing to small scale of the indoor environments, the depth of the most points in images can be estimated by stereo camera through the baseline between the left and right cameras, so point features can be coded with a Cartesian coordinate frame as follows: where denotes the position of point features in body coordinate frame and it can be estimated from the pixel coordinate value of the point features as follows: where is the baseline between the left camera and the right camera, is the disparity, and are the pixel coordinate value of the point features in the left camera image, and are the pixel coordinate value of the optical center in the left camera image, and is the focal length. In addition, , , and are the intrinsic parameters of the camera which can be obtained by the camera calibration method [36].

The position of point features in navigation coordinate frame is also important in the proposed system, and the estimated position can be computed as follows:

This process is only executed when the point feature is detected in the first time. In this section, for the sake of simplicity, the rotation matrix between IMU coordinate frame and camera coordinate frame and the relative position between these two coordinate frames are ignored, but in practice, they should be considered.

According to the analysis above, the point feature state vector can be described as follows: where denotes the position of the th point features in navigation coordinate frame and is the number of the point features and this number is a variable because of the different images collected by the camera.

###### 3.2.2. Line Features

Different from the points represented by in three-dimensional spaces, lines are parameterized in 4 DOFs. In this paper, we introduce two different representations of lines, the Plücker coordinate frame representation and the orthonormal representation [11, 37]. The Plücker coordinate frame representation is initialized when the line features are detected in the first time, and the orthonormal representation is used in the estimator.

The Plücker coordinate frame representation is determined by two points on the line. Assuming the homogeneous coordinate frames of these two points are and , the Plücker matrix can be determined as follows:

The Plücker matrix is a 4 × 4 antisymmetric matrix in which the diagonal elements are zero and there are six nonzero elements in this matrix. The Plücker coordinate frame is a 6 × 1 vector composed by these six nonzero elements as below: where and are the 3 × 1 inhomogeneous coordinate frames of the two points. is orthogonal to the interpretation plane containing the line and the origin, and is the direction of the line. The relationship between the Plücker matrix and the Plücker coordinate frame is expressed as where denotes the skew-symmetric cross-product matrix with a vector and the is defined as

The transformation of the Plücker coordinate frame in different reference coordinate frames (e.g., navigation coordinate frame and body coordinate frame) can be described as follows [38]: where denotes the translation vector from navigation coordinate frame to the body coordinate frame.

The 3D lines represented by Plücker coordinate frames in body coordinate frame can also be projected onto the image plane as follows: where is the 2D line in the image plane represented by pixel coordinate frame. It can be found that only is used in this projection process.

According to the analysis above, six elements are used in the Plücker coordinate frame representation while the line’s DOFs are four. If using this representation in estimator, superfluous elements will cost more computation. In addition, the orthogonal constraint between the and , that is, , will decrease the numerical stability, and thus, we need to use another representation method in the estimator.

The orthonormal representation uses minimum 4 parameters to represent a line, and this representation can be derived from the Plücker coordinate frame representation [37].

We define a matrix and decompose this matrix by QR decomposition as and set

The 3D line can be represented by where . This means the and are three- and two-dimensional rotation matrices, respectively. Thus, the matrix can be updated accordingly by a vector containing 3 parameters such as Euler angles and can be updated by a scalar parameter as follows: where and denote the three-dimensional rotation matrices.

We can define a 4 × 1 increment vector which can be used in the estimator presented in the next section to represent the error of line features. The detailed discussions will be provided in Section 4.

In addition, the Plücker coordinate frame can be transferred from the orthonormal representation as follows:
where denotes the th column of **U**.

#### 4. Estimator Establishment and Algorithm Implementation

In this section, an EKF estimator is presented. The IMU measurements along with the point and line features mentioned in Section 3 are combined based on this estimator in a tightly coupled approach. In addition, we will elaborate the feature detection methods, the feature initialization, and some other processing steps.

##### 4.1. Error-State Equation

In our work, the EKF estimator uses the error model, which is defined as follows: where , , and are the error-state vectors of IMU, point features, and line features, respectively.

The IMU error-state vector can be defined as where , , , and are the velocity error, the position error, the error of gyroscope constant drift, and the error of accelerometer constant drift, respectively. The aforementioned error is formulated as where is the true value and is the estimated value. is the attitude error computed from the error quaternion which is defined as

The relationship between the true quaternion and the estimated quaternion is defined as , where is the quaternion multiplication [34].

Similarly, the error-state vector of point features can be defined as where is the position error of the th point features in navigation coordinate frame which is defined as . In addition, .

The error-state vector of the line features can be defined as follows: where denotes the error vector of the th line features in navigation coordinate frame and is the number of the line features. It is worth to mention that is a variable because of the different images collected by the camera. In addition,

According to the analysis above, the error-state equation can be established as follows: where is the process noise assumed to be uncorrelated white zero-mean Gaussian noise the covariance matrix of which is and

Owing to the estimator is executed in the discrete time, and the differential equation needs to be transformed into the discrete equation as follows: where is the sampling period and denotes the th frame.

In addition, in any environment, the stereo camera is used to capture a series of fixed images. According to the point and line feature extraction methods, for a series of fixed images, the extracted points and lines are also fixed and we assume that the matched features are all true positive, and thus, there is no random noise for these features. Actually, they only contain the measurement noise the differential of which equal to zero. Therefore, in (27), there is no noise of point and line features.

##### 4.2. Measurement Equation

The features’ reprojection errors in the image plane are selected as the measurement. For the sake of illustration, we present the reprojection errors by considering the case where a single point feature and a single line feature are present.

We first define as the th point feature which can be observed by the camera in the th frame. The reprojection error of the point feature can be defined as follows: where is the system’s position in the th frame, is the point feature’s estimated position in the body coordinate frame, is the point feature’s reprojection position in the image plane, and is its pixel value. is the point feature’s observation value in the image plane, and is its pixel value.

We define as the th line feature which can be observed by the camera in the th frame. The reprojection error of the line feature can be defined as follows: where is the line feature’s estimated position in the body coordinate frame, is the point feature’s reprojection position in the image plane. In addition, when the line feature is observed by the camera, we can obtain the pixel value of the start-point and end-point of the line feature. In (33), and denote the start-point and end-point of the line feature’s observation value in the image plane. Note that and are represented by the homogeneous coordinate frames which are vectors (the first two elements are the pixel value, and the third element is 1).

According to the analysis above, the measurement vector can be selected as follows:

The error-state equation can be established as follows:
where is the measurement noise assumed to be uncorrelated white zero-mean Gaussian process the covariance matrix of which is and **H** is the measurement Jacobian matrix formulated as:
where the individual matrixes and can be computed as
where , , , , and .

The measurement equation can also be written as the discrete form:

###### 4.3. Estimator and Algorithm Implementation

According to the analysis in the previous section, it can be found that the error-state equation and the measurement equation are nonlinear. In order to deal with the linearization error, we use the EKF estimator to fuse the inertial measurement and the images observed by the stereo camera. As presented in the previous section, these nonlinear functions have been linearized.

Based on the linearized error-state and measurement equations, the EKF estimator can be executed with the following two steps: (i)Predict where is the covariance matrix of the state vector.(ii)Update where is the Kalman gain matrix.

The covariance matrices , , and in the above equations are empirically initialized based on the characteristics of the sensors. Specifically, and are initialized, respectively, associated with the random noise of the inertial sensors and the scales of the feature space. is initialized concerning the constant drifts of the inertial sensors and the scales of the feature space.

Finally, we summarize the whole algorithm. In this system, static base alignment is achieved for obtaining the initial attitude angle by following the method in [30]. Subsequently, analogous to the inertial navigation system [34], the IMU measurement is used to update the pose and the velocity of the system by solving (3).

On the other hand, the images are collected by the stereo camera; after performing the feature detection and initialization which will be discussed in the next subsection, the error-state and measurement equations can be established by (22), (23), (24), (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35), (36), (37), and (38) with the IMU measurement and the update result of the inertial navigation. In the last step, the EKF estimator is executed by (39) and (40). In addition, the covariance matrix of the EKF should be augmented when new point or line features are observed.

###### 4.4. Feature Detection and Initialization

In terms of the feature detection, the ORB method [23] is used to extract point features from the images. This method deals with the orientation problem of the FAST method and generates a very fast binary descriptor based on BRIEF to describe the detected point features. Therefore, this method can detect and describe the point features in a fast, efficient, and accurate manner. In [21, 22], the ORB method was used in the visual SLAM system, which resulted in an excellent performance.

On the other hand, the LSD method [24, 25] is used to extract the line features from the images. Its mechanism is to merge pixels with the similar gradient direction. This method can extract the lines with subpixel accuracy in linear time. In addition, the detected lines are described by a binary descriptor, line band descriptor (LBD) [39], which allows accurate matching between two frames or two cameras (left and right).

The nearest-neighbor method is used to match the features between the two frames based on the descriptor of features. In order to reduce the false matches, the random sample consensus (RANSAC) method is used to filter out the outliers for both point features and line features.

Once a new feature is detected and described, we next perform binocular matching. For a point feature in the left image, its counterpart in the right image can be acquired along the parallel epipolar line in the right image. The Hamming distance is used to measure the similarity of point features with a predefined threshold. For the line feature, we exploit the matching method proposed in [40].

The unsuccessfully matched features will be discarded. After matching the point and line features between the two images, the positions of the point features and the two endpoints of the line features can be obtained by (8). The representations of line features in the 3D space can be computed by (12). The point features and lines in the navigation coordinate frame can be computed by (9) and (16). With the above-mentioned feature initialization, the initial positions of the features in the navigation system are obtained and are thus used to compute the reprojection error in the measurement equation. In addition, after executing one filtering process, the error of the features’ positions in the navigation coordinate frame can be estimated after feature filtering. Therefore, these positions can be updated in the filtering processes and become progressively accurate.

#### 5. Experiments and Results

In this section, experiments are conducted to test the effectiveness of the proposed indoor navigation system. We assume our system is tailored for any structured indoor environment where the geometric scale information can be calculated by the stereo camera and the object textures can be characterized by either point or line features. We use EuRoC dataset [41] for evaluating our approach and use root-mean-square error (RMSE) for performance measure. In the comparative study, we compare the proposed navigation system (PL-VINS), with the StVO-PL [26] and the system based on multistate constraint Kalman filter (MSCKF) [15]. The StVO-PL is a state-of-the-art visual navigation system using stereo camera with points and lines, while the MSCKF is an EKF-based visual-inertial navigation system with high accuracy. In implementation, we directly use the publicly available source code for both StVO-PL and MSCKF methods. For the sake of consistency, we repeat all the competing approaches ten times on all the sequences and the final experimental results are obtained by averaging the accuracy scores.

##### 5.1. Experiment Description

The EuRoC dataset contains 11 sequences which are categorized into two types both recorded in three indoor environments. The first type of the dataset contains the first five sequences which are recorded in a machine hall (MH). The second type of the dataset contains the remaining six sequences recorded in Vicon room 1 (V1) and Vicon room 2 (V2) with an approximate size of 8 m × 8.4 m × 4 m. All eleven sequences include the stereo images and the IMU measurements containing the gyroscopes, accelerometers data along with the ground truth. More specifically, the stereo images are recorded by a stereo camera, MT9V034 with 2 × 20 Hz while the IMU measurements are recorded by a MEMS inertial sensor, ADIS16448 with 200 Hz. Besides, the ground truths are provided by Leica MS50 and VICON. All these equipment are mounted on a micro air vehicle (MAV) which flies in the indoor environment to record the data. More details about the EuRoC dataset can be found in [41]. In addition, these 11 sequences provide varying challenges relevant to the speed of the MAV, illumination, texture, and so forth. In some sequences, especially for the sequence Vicon room 2 03, the environment is low-textured with few visual contents.

According to the description of the EuRoC dataset, it can be found that these sequences are suitable for testing the proposed indoor navigation system because of the visual-inertial data and the low-textured indoor environments. In addition, to the best of our knowledge, this dataset is the only dataset which provides visual-inertial data and ground truth in indoor environments.

In order to test the algorithms quantitatively, the transformation error matrix between two coordinate frames can be computed as follows: where is the transformation matrix between two coordinate frames provided by the ground truth, is the estimation transformation matrix from the navigation system, and is the error matrix. The translation errors and the rotation errors can be obtained from this error matrix.

The RMSE of translation and rotation can also be computed for comparison of different systems: where is the number of the frames, is the 3 × 1 translation error vector, and is the 3 × 1 rotation error vector which are represented by the Euler angles.

##### 5.2. Results and Discussion

###### 5.2.1. Translation Errors

Extensive evaluations on 11 sequences are carried out by making use of three different navigation systems mentioned above. First, we analyze the translation errors of the dataset. Figures 1 and 2 show the translation errors of different methods on sequences machine hall 04 and Vicon room 2 02. Table 1 gives the relative , , , and whole translation RMSE errors of all 11 sequences. In Table 1, the “” column means the whole translation RMSE errors. “MH,” “V1,” and “V2,” respectively, indicate the “machine hall,” the “Vicon room 1,” and the “Vicon room 2.” “MH01” implies the sequence 01 in machine hall, and another four sequences named from “MH02” to “MH05” are also used in our evaluations. Besides, we also use the respective three sequences in V1 and V2 in our experiments, namely, “V1 01,” “V1 02,” “V1 03,” “V2 01,” “V2 02,” and “V2 03.” In addition, in order to demonstrate the robot movement in some environment, we give the ground truth trajectories and the trajectories estimated by the proposed method as shown in Figure 3.