Abstract

We present a novel feature-level data fusion method for autonomous localization in an inactive, multiple-reference, unknown indoor environment. Since a monocular sensor cannot provide depth information directly, the proposed method fuses the edge information of camera images with the corresponding depth information received from an infrared sensor. Real-time experimental results demonstrate that the proposed fusion method greatly improves position and orientation accuracy in an unknown, complex indoor environment. Compared with monocular localization, the proposed method achieves up to a 70 percent improvement in accuracy.

1. Introduction

To achieve seamless positioning and navigation in degraded or denied unknown areas such as indoor environments [1, 2], various new technologies have replaced the Global Positioning System (GPS) [3–6] and the Inertial Navigation System (INS) [7] as the mainstream technologies in the field of indoor autonomous localization. The most popular one is visual Simultaneous Localization and Mapping (SLAM) [3, 8, 9]. SLAM consists of two key tasks: autonomous localization, that is, localizing oneself in an unknown environment, and building a map of that environment. The two tasks are coupled with each other: accurate localization benefits mapping, and a good map is crucial for localizing oneself. For this reason, both tasks must be performed at the same time [10].

The key to good localization is to ensure the correct processing of the geometric information gathered by cameras. Three types of methods have been considered for autonomous localization estimation. (a) Monocular autonomous localization: at an early stage, researchers preferred monocular methods for their simple device structure. A good example was introduced by Civera et al. in [11, 12], where 1-point Random Sample Consensus (RANSAC) based on the Extended Kalman Filter (EKF) was used as the data association method. They also showed the real-time feasibility of monocular SLAM on affordable hardware, which cannot provide depth information, using the EKF algorithm to estimate depth from feature observations [13]; however, the monocular method requires a long convergence time to obtain depth information. (b) Multi-sensor of the same "type": stereovision SLAM, which treats two linked cameras as a single 3D sensor [14, 15], is a typical example; it suffers from a limited 3D estimation range and mechanical fragility, since a stereovision sensor can only provide a reasonable 3D estimate within a limited range. (c) Multi-sensor of different "types": with the development of data fusion and filtering methods for multi-sensor systems, researchers have recently paid increasing attention to multi-sensor methods [16–18]. A multi-sensor system of different "types" can provide more kinds of feature information, enhancing estimation accuracy and anti-interference capability. In order to further improve the performance of multi-sensor systems of different "types" in SLAM, we introduce a novel feature-level data fusion method for indoor autonomous localization. The basic idea is to fuse camera and infrared sensor information. Compared with a laser sensor, this type of sensor also has the outstanding advantage of low cost.

This paper is organized as follows. Our system framework is introduced in Section 2. Section 3 describes the proposed feature-level data fusion based indoor localization method. Section 4 presents the experimental results. Section 5 concludes the paper.

2. Framework of Proposed System

Among the various SLAM methods [10], the most popular one is monocular SLAM [11]. In the literature, feature detection [19], data association, and prediction are important parts of SLAM. Feature detection extracts features from images, and one can localize the camera by analyzing the relative positions between these feature points and the camera. Data association matches the feature points between two adjacent frames. Because of the lack of depth information, an initial depth is usually assumed before localization, and the depth is then estimated after several iterations of a sequential estimator such as the EKF.
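As a concrete illustration of the feature detection and data association steps just described, the following Python/OpenCV sketch detects keypoints in two adjacent frames and matches them. It is a minimal, generic example (ORB features with brute-force matching) rather than the EKF-based pipeline used in this paper.

```python
# Minimal sketch of feature detection and frame-to-frame data association
# with OpenCV. Illustrative only: ORB keypoints and brute-force matching are
# generic choices, not the EKF/1-point-RANSAC pipeline used in this paper.
import cv2

def detect_and_match(prev_gray, curr_gray, max_matches=100):
    """Detect keypoints in two adjacent frames and return matched pixel pairs."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return []
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    # Each match links a feature in the previous frame to one in the current frame.
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches[:max_matches]]
```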

In this section, we present a framework that supports the feature-level data fusion based autonomous localization, as shown in Figure 1. In the proposed localization method, features in the unknown area are first detected. The feature-level data fusion based autonomous localization of Section 3 is then performed. The framework is designed to deal with two problems: (a) the unavailability of initial depth information in the monocular method and (b) the self-calibration of multiple dissimilar sensors.

Figure 1 is the block diagram of the proposed system. The state vector $\mathbf{x}$ can be represented by the camera state $\mathbf{x}_C$ and the set of feature point states $\mathbf{y}_i$:

$$\mathbf{x} = \left(\mathbf{x}_C^T,\ \mathbf{y}_1^T,\ \ldots,\ \mathbf{y}_n^T\right)^T, \qquad \mathbf{x}_C = \left(\mathbf{r}^{W\,T},\ \mathbf{q}^{WC\,T},\ \mathbf{v}^{W\,T},\ \boldsymbol{\omega}^{C\,T}\right)^T, \tag{1}$$

where $\mathbf{x}_C$ contains the position and orientation of a 6-DOF [20] constant velocity model, $\mathbf{y}_i$ contains the world coordinates of the $i$th feature point, $\mathbf{r}^W$ is the position in the world coordinate system, $\mathbf{q}^{WC}$ is the direction of the sensor view, and $\mathbf{v}^W$ and $\boldsymbol{\omega}^C$ are the velocity and angular velocity.

Consider the following:

$$\mathbf{x}_{C,k+1} = f\!\left(\mathbf{x}_{C,k}\right) = \begin{pmatrix} \mathbf{r}^{W}_{k} + \left(\mathbf{v}^{W}_{k} + \mathbf{V}^{W}\right)\Delta t \\ \mathbf{q}^{WC}_{k} \times \mathbf{q}\!\left(\left(\boldsymbol{\omega}^{C}_{k} + \boldsymbol{\Omega}^{C}\right)\Delta t\right) \\ \mathbf{v}^{W}_{k} + \mathbf{V}^{W} \\ \boldsymbol{\omega}^{C}_{k} + \boldsymbol{\Omega}^{C} \end{pmatrix}. \tag{2}$$

In (2), $k$ refers to the step number, $\mathbf{q}(\cdot)$ transforms the local orientation vector into a quaternion, $\mathbf{V}^W$ and $\boldsymbol{\Omega}^C$ are the impulses of velocity and angular velocity, and $\Delta t$ is the delay between steps $k$ and $k+1$. From the velocity model, the EKF prediction of the state vector can be described as

$$\hat{\mathbf{x}}_{k+1|k} = f\!\left(\hat{\mathbf{x}}_{k|k}\right), \qquad \mathbf{P}_{k+1|k} = \mathbf{F}_{k}\,\mathbf{P}_{k|k}\,\mathbf{F}_{k}^{T} + \mathbf{G}_{k}\,\mathbf{Q}_{k}\,\mathbf{G}_{k}^{T}, \tag{3}$$

where $\hat{\mathbf{x}}_{k+1|k}$ stands for the prediction of the position and orientation of the camera, $\mathbf{F}_k$ is the Jacobian of $f$, $\mathbf{Q}_k$ is the covariance of the zero-mean Gaussian noise assumed for the dynamic model, and $\mathbf{G}_k$ is the Jacobian with respect to this noise.

Consider the following:

$$\hat{\mathbf{x}}_{k+1|k+1} = \hat{\mathbf{x}}_{k+1|k} + \mathbf{K}_{k+1}\!\left(\mathbf{z}_{k+1} - h\!\left(\hat{\mathbf{x}}_{k+1|k}\right)\right), \qquad \mathbf{P}_{k+1|k+1} = \left(\mathbf{I} - \mathbf{K}_{k+1}\mathbf{H}_{k+1}\right)\mathbf{P}_{k+1|k}, \tag{4}$$

where $\hat{\mathbf{x}}_{k+1|k+1}$ is the updated state containing the feature information, $\mathbf{z}_{k+1}$ is the observation of the features, $\mathbf{K}_{k+1}$ stands for the Kalman gain, and $\mathbf{H}_{k+1}$ is the Jacobian of the measurement equation $h$. With the state vector and covariance matrix, the prediction of the camera motion is available.
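To connect equations (2)–(4) to an implementation, the following numpy sketch shows a generic EKF prediction and update step of exactly this form. The motion model f, measurement model h, and the Jacobians F, G, H are passed in by the caller; this is an illustrative skeleton, not the authors' code.

```python
# Generic EKF prediction/update in the form used above (a sketch, not the
# paper's implementation). f/h are the motion and measurement models;
# F, G, H are their Jacobians; Q and R are noise covariances.
import numpy as np

def ekf_predict(x, P, f, F, G, Q):
    x_pred = f(x)                        # propagate state with the velocity model
    P_pred = F @ P @ F.T + G @ Q @ G.T   # propagate covariance
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, h, H, R):
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```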

The data association block in the framework reduces the size of the data set, and 1-point RANSAC [11] is the paradigm used for this model. The number of RANSAC iterations, and hence the computational cost, can be greatly reduced. After data association, the features of two adjacent frames are linked, deleted, or created.
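The reduction in computational cost comes from the standard RANSAC sample-count bound: the number of hypotheses needed grows with the minimal sample size m, so a 1-point model needs far fewer iterations than, say, a 5-point model. The short calculation below is textbook RANSAC arithmetic with illustrative values for the confidence and outlier ratio.

```python
# Why 1-point RANSAC cuts the computational cost: the standard bound on the
# number of RANSAC hypotheses grows with the minimal sample size m.
import math

def ransac_iterations(p, eps, m):
    """Iterations needed to draw at least one all-inlier sample of size m
    with confidence p, given outlier ratio eps."""
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - eps) ** m))

print(ransac_iterations(0.99, 0.5, 1))  # 1-point sample: 7 iterations
print(ransac_iterations(0.99, 0.5, 5))  # 5-point sample: 146 iterations
```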

3. Proposed Feature-Level Data Fusion Based Autonomous Localization

In this section, we present an autonomous localization method based on a novel feature-level data fusion scheme. Because of the differences between the camera and the infrared sensor, the depth information cannot be projected onto the image with high accuracy. The proposed method fuses the depth and image information through their edge information, as shown in Figure 2. With a known camera pose, the world coordinates of the feature points can be inferred; similarly, given the features, the pose can also be calculated.

3.1. Camera and Infrared Sensor Geometric Modeling

A 3D point in the scene and its corresponding projection are related by the following equations. We use $(X, Y, Z)$ to represent the world coordinates and $(u, v)$ for the image coordinates:

$$s\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \mathbf{K}_{C}\left[\mathbf{R} \mid \mathbf{t}\right]\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \tag{5}$$

where $s$ is a scale factor, $\mathbf{K}_C$ is the camera calibration matrix, $\mathbf{K}_{IR}$ is the infrared sensor calibration matrix, $\mathbf{R}$ is the rotation matrix, and $\mathbf{t}$ is the translation vector between the world and camera coordinate systems.

Consider the following:

$$\mathbf{K}_{C} = \begin{pmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{pmatrix}, \tag{6}$$

where $f_x$ and $f_y$ refer to the focal length of the camera along the $x$-axis and $y$-axis and $c_x$ and $c_y$ are the elements of the principal point vector. The infrared sensor has the same $\mathbf{R}$ and $\mathbf{t}$ as the camera. The corresponding projection in the infrared sensor is $(u_{IR}, v_{IR})$, and the relation between $(u, v)$ and $(u_{IR}, v_{IR})$ can be expressed as

$$\begin{pmatrix} u_{IR} \\ v_{IR} \\ 1 \end{pmatrix} = \mathbf{K}_{IR}\,\mathbf{K}_{C}^{-1}\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}. \tag{7}$$
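Under the assumption stated above that the camera and the infrared sensor share the same R and t, the mapping (7) reduces to a simple matrix product. The sketch below applies it in numpy; the intrinsic values are placeholders, not the calibration results reported later.

```python
# Applying the mapping relation (7) in numpy, under the paper's assumption
# that the camera and the infrared sensor share the same R and t.
# The intrinsic values below are placeholders, not the paper's calibration.
import numpy as np

def intrinsic(fx, fy, cx, cy):
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

K_C  = intrinsic(525.0, 525.0, 319.5, 239.5)   # placeholder RGB intrinsics
K_IR = intrinsic(580.0, 580.0, 319.5, 239.5)   # placeholder IR intrinsics

def rgb_to_ir_pixel(u, v):
    """Project an image pixel (u, v) onto the depth map via K_IR * K_C^-1."""
    p = K_IR @ np.linalg.inv(K_C) @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]

print(rgb_to_ir_pixel(320, 240))
```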

3.2. Feature-Level Data Fusion Method of Image and Depth

While most of the feature point extraction methods discussed in Section 2 concentrate on the edges of objects, the depth information at these locations is not accurate. To solve this problem, we use a feature-level data fusion method based on edge information, carried out in three steps. Before performing data fusion, edge information is chosen as the feature by analyzing the image and the depth information. After comparing various methods, we select the Canny operator for edge extraction. The block diagram of the proposed method is shown in Figure 3.
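A minimal sketch of this edge extraction step with OpenCV's Canny operator is given below; the thresholds and the 8-bit scaling of the 16-bit depth map are illustrative choices, not the paper's tuned parameters.

```python
# Edge extraction on the grayscale image and the depth map with Canny.
# Thresholds and depth scaling are illustrative, not the paper's values.
import cv2

def extract_edges(gray_image, depth_mm):
    image_edges = cv2.Canny(gray_image, 50, 150)
    # Scale the 16-bit depth (millimeters, up to ~9000 mm) into 8 bits for Canny.
    depth_8u = cv2.convertScaleAbs(depth_mm, alpha=255.0 / 9000.0)
    depth_edges = cv2.Canny(depth_8u, 50, 150)
    return image_edges, depth_edges
```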

First, $p_e$ is an edge point of the image, $p_d = (u_d, v_d)$ is the projection of $p_e$ onto the depth information $D$, and $M$ is the cross-shaped matrix centered at $p_d$. It is constituted by the $n$ nearest points around $p_d$ in the horizontal and vertical directions, as shown in Figure 4.

Consider the following:

$$M = \left\{ D(u_{d} \pm i,\ v_{d}),\ D(u_{d},\ v_{d} \pm i) \ :\ i = 1, \ldots, n \right\}. \tag{8}$$

Second, the nearest nonzero point to $p_d$ in $M$, denoted $p_n$, is obtained; this is the point used for edge alignment. We assume that $p_n = \arg\min_{p \in M,\ D(p) \neq 0} \lVert p - p_d \rVert$, which means that the distance from $p_n$ to $p_d$ is the minimum among all nonzero points in $M$.

Third, moving outward from $p_n$ in the same horizontal or vertical direction and extracting the depth of the point next to $p_n$ (the one farther than $p_n$ from $p_d$), we can fill in all the empty points between $p_d$ and $p_n$. Given the relative positions of $p_e$, $p_d$, and $p_n$, there are four cases (one for each direction of the cross) that should be considered, and the fused depth information $D_f$ is obtained by applying the corresponding fill rule in each case, as illustrated in the sketch below.
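The following Python sketch implements one reading of these three steps on a depth map stored as a 2D array in which 0 marks missing depth. The neighborhood size n and the exact fill rule are assumptions drawn from the description above, so treat it as an illustration rather than the authors' reference implementation.

```python
# Sketch of the three-step edge alignment, assuming D is a 2D depth array
# in which 0 marks missing depth. Neighborhood size and fill rule are my
# reading of the text, not the authors' reference implementation.
import numpy as np

DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up (the 4 cases)

def fuse_edge_point(D, p_d, n=5):
    """Align the depth edge to the image edge point projected at p_d (row, col)."""
    r, c = p_d
    h, w = D.shape
    best = None  # (distance, direction, position of the nearest nonzero point p_n)
    # Steps 1-2: scan the cross-shaped neighborhood for the nearest nonzero depth.
    for dr, dc in DIRS:
        for k in range(1, n + 1):
            rr, cc = r + dr * k, c + dc * k
            if 0 <= rr < h and 0 <= cc < w and D[rr, cc] != 0:
                if best is None or k < best[0]:
                    best = (k, (dr, dc), (rr, cc))
                break
    if best is None:
        return  # no depth support near this edge point
    k, (dr, dc), (rn, cn) = best
    # Step 3: take the depth just beyond p_n (farther from p_d) and fill the
    # empty points between p_d and p_n with it.
    rb, cb = rn + dr, cn + dc
    fill = D[rb, cb] if 0 <= rb < h and 0 <= cb < w and D[rb, cb] != 0 else D[rn, cn]
    for j in range(k):
        D[r + dr * j, c + dc * j] = fill
```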

3.3. Pose Estimation

In order to register a feature point into the world coordinate system as a 3D point, the following parameters should be known: the pose of the camera (its world coordinates $(x_c, y_c, z_c)$ and the angles $\alpha$ (horizontal) and $\beta$ (vertical) between the current camera view and the initial camera view), the depth $d$ of the feature point, the vertical and horizontal visual angles $\theta_v$ and $\theta_h$ of the camera view, and the image size $(W, H)$. The feature point is then registered into the world coordinate system by back-projecting its image coordinates with depth $d$ and transforming the result with the camera pose.
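A compact way to perform this registration, using the intrinsic matrix K_C instead of the paper's visual-angle parameterization, is sketched below; the camera pose (R_wc, t_wc) is assumed to be known.

```python
# Back-projection of an image feature with known depth into the world frame:
# a sketch using the intrinsic matrix K_C instead of the paper's visual-angle
# parameterization. R_wc and t_wc (camera pose in the world) are assumed known.
import numpy as np

def register_feature(u, v, depth, K_C, R_wc, t_wc):
    """Return the world coordinates of the feature at pixel (u, v) with the given depth."""
    p_cam = depth * np.linalg.inv(K_C) @ np.array([u, v, 1.0])  # camera frame
    return R_wc @ p_cam + t_wc                                  # world frame
```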

After introducing the projection of features from the image plane into the 3D world coordinate system, the relation between the camera pose and the 3D feature points in world coordinates is described in this paragraph. Before depicting the process, one problem should be noticed. Assume that a feature point is registered in the world coordinate system, that its depth information is available, and that the camera moves along the tangent of a circle centered at this feature point. Even though there is no change in the image projection and depth of the feature point, the pose of the camera should not be considered unchanged. Therefore, the number of features used for camera pose estimation should be more than two. In order to perform this calculation, the following parameters must be known, as shown in Figure 5: (1) the world coordinates of the feature points, $P_1 = (x_1, y_1, z_1)$ and $P_2 = (x_2, y_2, z_2)$; (2) the projections of the feature points on the image, $(u_1, v_1)$ and $(u_2, v_2)$; (3) the depth information of the feature points, $d_1$ and $d_2$; (4) the vertical and horizontal visual angles of the camera view, $\theta_v$ and $\theta_h$; (5) the size of the images, $W$ and $H$. Assume that the camera world coordinate is $(x_c, y_c, z_c)$ and that the angles between the camera view and the $x$-axis and $y$-axis are $\alpha$ and $\beta$. One can calculate the camera pose using the following equations, from which $x_c$, $y_c$, $z_c$, $\alpha$, and $\beta$ can be derived; the pose of the camera is thus obtained from the previously mentioned parameters.

Consider the following:

Finally, consolidating the previous equations: if the camera pose at time $t$ is known, the image coordinates of the feature points and their depths are available at all times, and the feature point information is stable, then the camera pose at time $t+1$ can be calculated, and the world coordinates of the feature points at time $t+1$ can also be calculated from the previously described data. Therefore, by mathematical induction, as long as more than two feature points can be found in the subsequent images, the camera pose can be obtained at every moment. The algorithm ultimately achieves autonomous localization estimation.
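As a practical stand-in for the closed-form equations above, the pose can also be recovered by rigidly aligning the camera-frame coordinates of the registered features (obtained from pixel + depth back-projection) with their known world coordinates. The least-squares (Kabsch) fit below does this; it requires at least three non-collinear points and is offered only as an alternative illustration, not as the paper's method.

```python
# One standard way to recover the camera pose from world-registered features:
# a least-squares (Kabsch) rigid fit between camera-frame and world-frame
# coordinates of the same features. Needs >= 3 non-collinear points.
import numpy as np

def pose_from_features(pts_cam, pts_world):
    """Solve for R, t such that pts_world ~= R @ pts_cam + t (both Nx3 arrays)."""
    mu_c, mu_w = pts_cam.mean(axis=0), pts_world.mean(axis=0)
    H = (pts_cam - mu_c).T @ (pts_world - mu_w)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_w - R @ mu_c
    return R, t  # t is the camera position in the world; R is its orientation
```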

4. Experiments and Results Analysis

4.1. Experimental Platform

In this experiment, the Kinect [21, 22] sensor is used. The Kinect is equipped with two sets of lenses. The middle sensor is an RGB camera (color CMOS) with VGA resolution (640 × 480). The infrared transmitter on the left and the receiver on the right comprise a 3D depth sensor with QVGA resolution (320 × 240). The hardware operating principle, which consists of the camera, the infrared transmitter, and the infrared receiver, is shown in Figure 6. A feature on the image plane is projected into world coordinates using the information from these sensors. The frame rate is 30 frames per second. The software platforms for the Kinect are OpenNI [23] and OpenCV [24]. The Kinect classifies all information into two types: color images and depth information. In the depth information, each pixel (16 bits, or 2 bytes) expresses the distance from the sensor to the closest object at the specified (x, y) coordinates. The depth distance generally ranges from 850 mm to 9000 mm. A depth of 0 indicates an unknown situation, for example, the projection of glass (low reflection) or a mirror (high reflection).
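For completeness, the following sketch shows one possible way to grab synchronized color and depth frames from the Kinect through OpenCV's OpenNI capture backend. Availability of this backend depends on how OpenCV was built, so this is a plausible setup rather than the authors' exact capture code.

```python
# Hedged sketch: grabbing color and depth frames from a Kinect through
# OpenCV's OpenNI backend. Requires an OpenCV build with OpenNI support.
import cv2

cap = cv2.VideoCapture(cv2.CAP_OPENNI)        # open the Kinect via OpenNI
while cap.isOpened():
    if not cap.grab():
        break
    ok_d, depth = cap.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)  # 16-bit, millimeters
    ok_c, color = cap.retrieve(flag=cv2.CAP_OPENNI_BGR_IMAGE)  # 8-bit BGR image
    if not (ok_d and ok_c):
        break
    # depth == 0 marks pixels with unknown range (e.g., glass or mirrors).
    cv2.imshow("color", color)
    if cv2.waitKey(30) == 27:                 # press Esc to quit
        break
cap.release()
```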

4.2. Experimental Results and Analysis of Data Fusion

The image and the depth information of the carton shown in Figure 7 are used for the experiments. (1) After calibration, the intrinsic parameters of the infrared sensor and the camera, $\mathbf{K}_{IR}$ and $\mathbf{K}_C$, are obtained. Since their rotation and translation matrices are the same, the mapping relation between pixels of the depth information and the image can be calculated through equation (7). (2) The edge detection results for the depth information and the image are shown in Figure 8; the red lines represent the image edges and the blue lines represent the depth information edges. (3) After the alignment of the edges, the results of the feature-level data fusion are shown in Figure 9: the left image shows the original data and the fused data are on the right.

From the edge regions in Figure 9, we can see that there are many glitches in the raw data. Compared with the left image, the right image shows a qualitative improvement along the carton edges. After removing the glitch noise, the fused information is ready for further data analysis.

4.3. Indoor Autonomous Localization Estimation Error and Analysis

In the autonomous localization experiments, the chosen scene is the complex indoor environment shown in Figure 10. All of the depth information is continuously distributed within the range of 0.85 m to 9 m. In order to evaluate the positioning method comprehensively, two experiments are conducted: the sensors are in uniform rectilinear motion in the first and in nonuniform, nonlinear motion in the second. In the first experiment, the camera moves 2.2 m from the original point along the negative -axis; the Kinect is placed on a cart moving at a uniform speed of 2 cm/s. The second experiment is conducted to simulate human behavior: the camera first moves 1.6 m uniformly along the positive -axis from the original point and then moves along a quarter circle with a radius of 1 m whose tangent is the -axis. The speeds of the uniform motion and of the quarter circle are 4 cm/s and 2 cm/s, respectively.

The infrared sensor and the camera acquire depth and image information at a frequency of 2.5 Hz; the images consist of color pixels and the depth information of 16-bit grey-level pixels. The raw data from the sensors are recorded on a hard drive for post-experiment analysis. Three kinds of data are compared: the images without depth information, the images with the raw depth information, and the images with the fused depth information.

The positioning cumulative errors and the positioning root-mean-square (RMS) errors of the experiments are shown in the following figures: Figures 11 and 12 correspond to the first experiment and Figures 13 and 14 to the second.

Figure 11 presents the positioning cumulative errors in terms of the error in Euclidean coordinates and the errors along the -axis and the -axis. The first panel shows that the fused data improve accuracy by 70 percent along the -axis, reducing the error from 0.5 m to 0.15 m. The second panel shows that the fused data improve accuracy by 67 percent along the -axis, reducing the error from 0.45 m to 0.15 m. The third panel shows that the fused data improve accuracy by 67 percent in Euclidean distance. These results show that the localization method based on feature-level data fusion is especially advantageous in the direction of motion and improves positioning accuracy by about 70 percent.

In Figure 12, the positioning RMS errors are shown from four aspects: the error in Euclidean coordinates (position and attitude) and the errors along the -axis and the -axis; these are evaluations of the stability of the localization method. Comparing the third and fourth panels, we find that the localization method based on fused information performs very well in the direction of motion. At the same time, the comparison of the first and second panels shows a considerable improvement in the stability of the pose estimation.

With the same arrangement as Figure 11, Figure 13 can also be divided into three parts: the error in Euclidean coordinates and the errors along the -axis and the -axis. The first panel shows the improvement of the error along the -axis, with the error reduced from 0.35 m to 0.02 m. The accuracy of localization is improved by about 60 percent along the -axis and in Euclidean coordinates, with the error reduced from 1 m to 0.3 m.

Figure 14 shows the positioning RMS errors of the second experiment; the method can be evaluated from the same four aspects as in Figure 12. From these panels, comparative analysis shows that the localization method based on feature-level fused information maintains a low RMS error level.

5. Conclusion

We presented a novel feature-level data fusion based autonomous localization method for unknown indoor environments. The accuracy of the method was evaluated with real-time experiments conducted in an inactive, multiple-reference, unknown indoor environment, with a platform consisting of a Kinect and a laptop. In the first experiment, the sensors moved in uniform linear motion; in the second, they moved in nonlinear, nonuniform motion. The positioning errors of the three types of information produced by the different data fusion methods were compared. The first experiment showed that feature-level data fusion improves the accuracy of autonomous localization by up to 70 percent compared with monocular SLAM (without depth information) and by 50 percent compared with positioning based on pixel-level fused data. The second experiment showed that the positioning method is also practical in a real environment. The advantage of feature-level data fusion thus addresses two problems: the data fusion of multiple sensors of different "types" and the reliability of visual SLAM.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (no. 61261010), the China Postdoctoral Science Foundation funded project (no. 2012M521291), the Jiangxi Province Postdoctoral Science Foundation funded project, the Science and Technology Project of Jiangxi Provincial Department of Education (no. GJJ12006), and the National 863 Program (no. 2013AA013804).