Abstract

This paper proposes a vision-based Semantic Unscented FastSLAM (UFastSLAM) algorithm for mobile service robots that combines semantic relationships with the Unscented FastSLAM framework. The landmark positions and the semantic relationships among landmarks are detected by a binocular vision system. The semantic observation model is then created by transforming the semantic relationships into the semantic metric map. Semantic Unscented FastSLAM can update the locations of the landmarks and the robot pose even when the encoder accumulates large errors that cannot be corrected by the loop closure detection of the vision system. Experiments have been carried out to demonstrate that the Semantic Unscented FastSLAM algorithm achieves much better performance in indoor autonomous surveillance than the Unscented FastSLAM.

1. Introduction

Visual simultaneous localization and mapping (SLAM) uses cameras as the only exteroceptive sensors to recover a representation of the environment and to localize the robot, complemented with information from proprioceptive sensors to increase accuracy and robustness. For mobile robots, vision has proved to be an effective and inexpensive sensing modality for localization and mapping. Sim et al. solved the SLAM problem with a stereo pair of cameras [1, 2]. Schleicher et al. used a top-down Bayesian method to perform a vision-based mapping process in which identification and localization of natural landmarks from the images were provided by a wide-angle stereo camera [3]. In this paper, a new semantic visual SLAM framework is proposed to improve performance without dramatically increasing the complexity of the algorithm.

The literature on visual SLAM has focused on feature-based SLAM, where a feature could be described by points with 2D positions (SIFT [4], SURF [5]) or 3D positions [6, 7], or by edge segments [8, 9]. However, feature extraction from natural visual scenes depends heavily on the environment, in which only sparse features might be found; such features could occasionally be too few to fully constrain the pose of the robot. Hence appearance-based SLAM was proposed, which represents the recorded images of the environment with prominent features as a whole [10]. Morita et al. reported another novel appearance-based localization approach for outdoor navigation with feature or object learning, recognition, and classification using SVM [11]. However, the use of rich sensory information in these appearance-based SLAM solutions results in very time-consuming computation, especially for large-scale environments. To allow real-time operation in moderately sized environments, one method was proposed to observe the interframe motion of every other corner feature in a visual odometry style [12, 13]. Other researchers have proposed methods for discovering and incorporating higher-level map structure in the form of lines [14] and planes [15, 16].

Different kinds of maps have been applied in SLAM. Metric maps capture the geometric properties of the environment, whereas topological maps describe the connectivity between different locations [17]. Topological maps can represent the environment as a list of significant places, which simplifies the problem of large-scale mapping [18]. However, one limitation of the topological representation is the lack of metric information. Hence, the strategy of mixing metric and topological information in a single consistent model was proposed [19]. Fernández et al. also developed a hybrid metric-topological algorithm that builds a metric map while maintaining a topological graph and detecting loop closures [20]. Thrun and Buecken combined grid-based and topology-based methods to map indoor robot environments [21]. Such hybrid algorithms take advantage of local metric grids for enhanced local planning while avoiding the computation of a complete global grid map. However, these maps are very limited in describing the environment beyond distinguishing between occupied and empty areas. In order to exploit richer information about the environment, semantic mapping has recently become a research topic. Wolf and Sukhatme proposed a semantic classification method based on HMMs and SVMs to tackle the problems of terrain mapping and activity-based mapping [22]. Ranganathan and Dellaert described a technique to model and recognize places using objects as the basic semantic concept [23]. Yi et al. proposed a semantic representation and a Bayesian model for robot localization using spatial contexts among objects [24, 25]. This paper takes advantage of the semantic relationships of features within the visual SLAM framework.

Early work on SLAM was done by Smith et al., where the Extended Kalman Filter (EKF) was applied [26]. Later Doucet et al. introduced the Rao-Blackwellized particle filter (RBPF) as an efficient solution to the SLAM problem which is also called FastSLAM [27]. The Unscented FastSLAM algorithm was then proposed to overcome the drawbacks of FastSLAM where the scaled unscented transformation (SUT) was applied to replace the linearization in the FastSLAM framework [28]. The SLAM solution in this paper will be based on Unscented FastSLAM.

The main contribution of this paper is a novel Semantic Unscented FastSLAM algorithm that improves the accuracy of localization and mapping while maintaining a sparse map for real-time implementation. The semantic relationships and the topological metric map are combined to form a new kind of map for SLAM. A set of experiments has been carried out to validate the proposed technique.

The rest of the paper is organized as follows: Section 2 describes the semantic topological metric map and observation model. Framework of the Semantic Unscented FastSLAM is presented in Section 3. The experimental results and discussion are presented in Section 4. The concluding remarks are presented in Section 5.

2. Semantic Topological Metric Map and Observation Model

2.1. Semantic Topological Metric Map

The semantic topological metric map is defined as the combination of the topological metric map and the semantic relationships between the landmarks, under the assumption that such semantic relationships can be represented by mathematical equations. The spatial semantic relationship between the available landmarks is always invariant with respect to the robot location. Denote the semantic topological metric map as $M_s$ and the semantic metric relationship between landmarks $m_i$ and $m_j$ as $s_{i,j}$. Figure 1 shows the process of creating the semantic topological map. The procedures are summarized as follows. (i) When a robot starts to move and the first landmark $m_1$ is observed, the semantic topological map only includes the position vector of $m_1$:

$$M_s = \{m_1\}. \quad (1)$$

(ii) As the robot moves forward, more landmarks are observed. If there are no semantic relationships between any pair of landmarks, the semantic topological metric map will be the same as the regular topological metric map. If the number of the observed landmarks is $n$, the semantic topological map includes the position vectors of $m_1, \ldots, m_n$:

$$M_s = \{m_1, m_2, \ldots, m_n\}. \quad (2)$$

(iii) When the robot observes landmark $m_{n+1}$, the semantic relationship between landmark $m_{n+1}$ and landmark $m_i$, $s_{n+1,i}$, is also found. If all the semantic relationships with the observed landmark $m_{n+1}$ are defined as the set $\mathcal{S}_{n+1}$, the semantic topological metric map is then updated with the addition of the semantic metric relationship as in (3)-(4):

$$\mathcal{S}_{n+1} = \{s_{n+1,i}\}, \quad (3)$$
$$M_s = \{m_1, \ldots, m_n, m_{n+1}, \mathcal{S}_{n+1}\}. \quad (4)$$

(iv) When the robot observes landmark $m_{n+2}$, if the semantic relationship between landmarks $m_{n+2}$ and $m_{n+1}$ is found, then the total semantic topological metric map would be

$$\mathcal{S}_{n+2} = \{s_{n+2,n+1}\}, \quad (5)$$
$$M_s = \{m_1, \ldots, m_{n+2}, \mathcal{S}_{n+1}, \mathcal{S}_{n+2}\}. \quad (6)$$

Since more than one semantic relationship between different landmarks, (4) and (6), has been observed, an extended semantic topological relationship is created, as shown in Figure 2, where landmarks $m_i$, $m_{n+1}$, and $m_{n+2}$ are associated together. The semantic topological metric map at this time becomes

$$M_s = \{m_1, \ldots, m_{n+2}, \mathcal{S}_{n+1}, \mathcal{S}_{n+2}, s^{e}_{i,n+2}\}, \quad (7)$$

where $s^{e}_{i,n+2}$ is the extended semantic relationship between landmark $m_i$ and landmark $m_{n+2}$. When $s_{n+2,n+1}$ coincidentally expresses the same semantic relationship as $s_{n+1,i}$, we can associate them together as

$$\mathcal{S}_{n+1} \cup \mathcal{S}_{n+2} = \{s_{n+1,i},\ s_{n+2,n+1},\ s^{e}_{i,n+2}\}. \quad (8)$$
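To make the map-building procedure concrete, the following is a minimal Python sketch of one possible data structure for the semantic topological metric map. The class names, the relationship encoding, and the merge rule for extended relationships are illustrative assumptions of this sketch, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticRelation:
    kind: str       # e.g., "x-line", "y-line", "collinear"
    members: tuple  # ids of the landmarks participating in the relation

@dataclass
class SemanticTopologicalMap:
    landmarks: dict = field(default_factory=dict)  # id -> (x, y) position vector
    relations: list = field(default_factory=list)  # list of SemanticRelation

    def add_landmark(self, lid, position):
        # Steps (i)-(ii): the metric part of the map, as in (1)-(2).
        self.landmarks[lid] = position

    def add_relation(self, relation):
        # Steps (iii)-(iv): the semantic part of the map, as in (3)-(6).
        self.relations.append(relation)
        self._extend(relation)

    def _extend(self, new_rel):
        # Extended relationship, as in (7)-(8): if an existing relation of the
        # same kind shares a landmark with the new one, associate all of their
        # members together in a single relation.
        for rel in self.relations[:-1]:
            if rel.kind == new_rel.kind and set(rel.members) & set(new_rel.members):
                rel.members = tuple(sorted(set(rel.members) | set(new_rel.members)))
                self.relations.pop()  # the merged relation subsumes the new one
                return
```

For instance, adding an "x-line" relation over landmarks (4, 2) and later one over (2, 3) leaves a single relation associating 4, 2, and 3, mirroring the extended relationship exploited in Experiment 1.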

2.2. Semantic Observation Model

A semantic observation model is the observation model of the vision sensor with the semantic relationships implicitly embedded. Hence the semantic observation model consists of not only the metric distance and the bearing of each landmark but also the semantic metric relationships between different landmarks. The dimension of the semantic observation model grows with the number of semantic relationships and could be as large as $2 + (N - 1)$, where $N$ is the total number of the landmarks observed so far. In this case, the semantic observation model can be represented as

$$z_t = h\left(x_t, m_{n+1}, M_t\right) = \begin{bmatrix} r_{n+1} & \phi_{n+1} & g_1\left(m_{n+1}, M_t\right) & \cdots & g_k\left(m_{n+1}, M_t\right) \end{bmatrix}^{T}, \quad (9)$$

where $g_j(\cdot)$ is the mathematical expression of the $j$th semantic metric relationship, $x_t$ is the coordinates of the current robot pose, and $m_{n+1}$ is the coordinates of the landmark observed at the current time period. $r_{n+1}$ and $\phi_{n+1}$ are the metric range and bearing of that landmark, and $g_1, \ldots, g_k$ are the series of the possible semantic metric relationships associated with $m_{n+1}$. The position vector of the robot is defined as $x_t = [x_t, y_t, \theta_t]^{T}$. $M_t$ is defined as the position set of all the landmarks observed at the current time as follows:

$$M_t = \{m_1, m_2, \ldots, m_{n+1}\}. \quad (10)$$
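To illustrate (9), the sketch below stacks the metric range and bearing with the semantic residuals of the observed landmark. The function name and the residual-as-callable encoding are assumptions made for this sketch, not the authors' interface.

```python
import numpy as np

def semantic_observation(pose, landmark, relations):
    """Sketch of the semantic observation model (9): the metric range and
    bearing of the newly observed landmark, stacked with the residuals
    g_1, ..., g_k of the semantic relationships associated with it. Each
    entry of `relations` is a callable returning one scalar residual."""
    x, y, theta = pose
    mx, my = landmark
    r = np.hypot(mx - x, my - y)              # metric range
    phi = np.arctan2(my - y, mx - x) - theta  # metric bearing
    g = [g_j(landmark) for g_j in relations]  # semantic residuals
    return np.concatenate(([r, phi], g))
```

Each residual callable closes over the positions of the other landmarks in $M_t$, which is how (9) couples the newly observed landmark to the existing map.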

3. Semantic Unscented FastSLAM

Semantic Unscented FastSLAM partitions the SLAM posterior into a localization problem and independent landmark position estimation problems conditioned on the robot pose estimate and the semantic metric relationships between the landmarks as follows:

$$p\left(x^{t}, M_s \mid z^{t}, u^{t}\right) = p\left(x^{t} \mid z^{t}, u^{t}\right) \prod_{i} p\left(m_i \mid x^{t}, z^{t}, u^{t}, \mathcal{S}\right), \quad (11)$$

where $x_t$ is the robot pose at time $t$ (with $x^{t}$ the robot path up to time $t$) and $M_s$ denotes the full semantic metric map at the current time period as follows:

$$M_s = \{m_1, \ldots, m_N, \mathcal{S}_1, \ldots, \mathcal{S}_N\}. \quad (12)$$

Suppose the control vector of the robot is $u_t = [v_t, \omega_t]^{T}$, where $v_t$ and $\omega_t$ represent the linear and angular velocities of the robot. According to the kinematics of the wheeled mobile robot [29], the motion model is represented as follows:

$$x_t = f\left(x_{t-1}, u_t\right) = x_{t-1} + \begin{bmatrix} v_t \Delta t \cos\theta_{t-1} \\ v_t \Delta t \sin\theta_{t-1} \\ \omega_t \Delta t \end{bmatrix}. \quad (13)$$
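For concreteness, here is a minimal sketch of a motion model in the form of (13), assuming standard unicycle kinematics; the exact wheeled-robot kinematics cited as [29] may differ from this assumed form.

```python
import numpy as np

def motion_model(pose, control, dt):
    """Unicycle-kinematics sketch of the motion model (13):
    pose = (x, y, theta), control = (v, w) with linear and
    angular velocities, dt the sampling period."""
    x, y, theta = pose
    v, w = control
    return np.array([x + v * dt * np.cos(theta),
                     y + v * dt * np.sin(theta),
                     theta + w * dt])
```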

3.1. Robot Pose Estimation

Since the particle filter is incorporated into the FastSLAM framework, the following derivation is associated with only one particle as an example. The robot pose at time $t$ for the $i$th particle can be estimated as

$$p\left(x_t \mid z^{t}, u^{t}\right) \sim \mathcal{N}\left(\mu^{(i)}_t, \Sigma^{(i)}_t\right), \quad (14)$$

where the pose is represented by a Gaussian with the mean $\mu^{(i)}_t$ and covariance $\Sigma^{(i)}_t$; these are predicted in the following according to the motion model of the robot (the particle index $(i)$ is omitted hereafter for brevity). In order to integrate the robot pose and the map update, the state vector is augmented with the control input and the observation vector as

$$\mu^{a}_{t-1} = \begin{bmatrix} \mu_{t-1} \\ 0 \\ 0 \end{bmatrix}, \qquad \Sigma^{a}_{t-1} = \begin{bmatrix} \Sigma_{t-1} & 0 & 0 \\ 0 & Q_t & 0 \\ 0 & 0 & R_t \end{bmatrix}, \quad (15)$$

where $\mu^{a}_{t-1}$ is the augmented state vector, $Q_t$ is the motion noise covariance, $R_t$ is the observation noise covariance, and $\Sigma^{a}_{t-1}$ is the augmented covariance matrix.

In order to apply the unscented transformation, a symmetric set of $2L + 1$ sigma points ($L$ is the dimension of the augmented state vector) needs to be extracted first as follows [30]:

$$\mathcal{X}^{a}_{t-1} = \begin{bmatrix} \mu^{a}_{t-1} & \mu^{a}_{t-1} + \gamma\left(\sqrt{\Sigma^{a}_{t-1}}\right)_{j} & \mu^{a}_{t-1} - \gamma\left(\sqrt{\Sigma^{a}_{t-1}}\right)_{j} \end{bmatrix}, \qquad \gamma = \sqrt{L + \lambda}, \quad (16)$$

where the subscript $j$ means the $j$th column of a matrix. The $\lambda$ is computed by $\lambda = \alpha^{2}(L + \kappa) - L$, where $\alpha$ is a small number chosen to avoid the sampling nonlocal effects for high nonlinearities and $\kappa$ is a scaling parameter determining how far the sigma points are separated from the mean value. Each sigma point contains the robot pose, control noise, and semantic observation noise components as

$$\mathcal{X}^{a,[j]}_{t-1} = \begin{bmatrix} \left(\mathcal{X}^{x,[j]}_{t-1}\right)^{T} & \left(\mathcal{X}^{u,[j]}_{t}\right)^{T} & \left(\mathcal{X}^{z,[j]}_{t}\right)^{T} \end{bmatrix}^{T}. \quad (17)$$

So the prediction of the robot pose can be derived by passing the above sigma points through the motion model in (13). The transformed sigma points of the robot pose, $\bar{\mathcal{X}}^{x,[j]}_{t}$, are calculated as

$$\bar{\mathcal{X}}^{x,[j]}_{t} = f\left(\mathcal{X}^{x,[j]}_{t-1},\ u_t + \mathcal{X}^{u,[j]}_{t}\right), \quad (18)$$

where the current control vector is the sum of $u_t$ and the control noise component of each sigma point. Then the prediction of the robot pose can be calculated as

$$\bar{\mu}_t = \sum_{j=0}^{2L} w^{[j]}_{m} \bar{\mathcal{X}}^{x,[j]}_{t}, \qquad \bar{\Sigma}_t = \sum_{j=0}^{2L} w^{[j]}_{c} \left(\bar{\mathcal{X}}^{x,[j]}_{t} - \bar{\mu}_t\right)\left(\bar{\mathcal{X}}^{x,[j]}_{t} - \bar{\mu}_t\right)^{T}. \quad (19)$$

The weights are calculated by the following equations:

$$w^{[0]}_{m} = \frac{\lambda}{L + \lambda}, \qquad w^{[0]}_{c} = \frac{\lambda}{L + \lambda} + \left(1 - \alpha^{2} + \beta\right), \qquad w^{[j]}_{m} = w^{[j]}_{c} = \frac{1}{2(L + \lambda)}, \quad j = 1, \ldots, 2L, \quad (20)$$

where the weight $w_m$ is used to compute the mean of the predicted robot pose, and the weight $w_c$ is used to recover the covariance of the Gaussian. The parameter $\beta$ is used to incorporate the knowledge of the higher order moments of the posterior distribution.
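A compact sketch of the sigma-point extraction (16) and the weight computation (20), following the standard scaled unscented transform; the default values for alpha, beta, and kappa below are common choices rather than the paper's tuning.

```python
import numpy as np

def sigma_points_and_weights(mu_a, Sigma_a, alpha=1e-3, beta=2.0, kappa=0.0):
    """Scaled unscented transform: 2L+1 sigma points around the augmented
    mean, as in (16), with the mean/covariance weights of (20)."""
    L = mu_a.shape[0]
    lam = alpha**2 * (L + kappa) - L
    sqrt_term = np.linalg.cholesky((L + lam) * Sigma_a)   # matrix square root
    points = np.column_stack([mu_a,
                              mu_a[:, None] + sqrt_term,
                              mu_a[:, None] - sqrt_term])
    wm = np.full(2 * L + 1, 1.0 / (2 * (L + lam)))
    wc = wm.copy()
    wm[0] = lam / (L + lam)                               # mean weight w_m^[0]
    wc[0] = lam / (L + lam) + (1 - alpha**2 + beta)       # covariance weight w_c^[0]
    return points, wm, wc
```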

Suppose the $n$th landmark $m_n$ and its semantic relationships are observed; the transformed sigma points of the semantic observation vector can be derived as

$$\bar{\mathcal{Z}}^{[j]}_{t} = h\left(\bar{\mathcal{X}}^{x,[j]}_{t}, m_n, M_t\right) + \mathcal{X}^{z,[j]}_{t}, \quad (21)$$

where the semantic metric relationships, $g_1, \ldots, g_k$, are included in the semantic observation model $h(\cdot)$ in (9) for the robot pose update. So this new update will result in the improvement of robot localization. Then the prediction of the semantic observation vector can be calculated as

$$\hat{z}_t = \sum_{j=0}^{2L} w^{[j]}_{m} \bar{\mathcal{Z}}^{[j]}_{t}. \quad (22)$$

The Kalman gain can then be obtained by the following equations as usual:

$$S_t = \sum_{j=0}^{2L} w^{[j]}_{c} \left(\bar{\mathcal{Z}}^{[j]}_{t} - \hat{z}_t\right)\left(\bar{\mathcal{Z}}^{[j]}_{t} - \hat{z}_t\right)^{T}, \qquad \Sigma^{x,z}_{t} = \sum_{j=0}^{2L} w^{[j]}_{c} \left(\bar{\mathcal{X}}^{x,[j]}_{t} - \bar{\mu}_t\right)\left(\bar{\mathcal{Z}}^{[j]}_{t} - \hat{z}_t\right)^{T}, \qquad K_t = \Sigma^{x,z}_{t} S^{-1}_{t}, \quad (23)$$

where $S_t$ is the innovation covariance and $\Sigma^{x,z}_{t}$ is the cross-covariance.

Therefore, the mean and covariance of the robot pose are estimated at the time period $t$ by

$$\mu_t = \bar{\mu}_t + K_t\left(z_t - \hat{z}_t\right), \quad (24)$$
$$\Sigma_t = \bar{\Sigma}_t - K_t S_t K^{T}_{t}, \quad (25)$$

where $z_t$ is the actual semantic observation at time $t$.
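Equations (19)-(25) amount to a generic unscented measurement update applied to the pose particle. The helper below is an illustrative sketch of that step; it is also reused for the landmark update of Section 3.2.

```python
import numpy as np

def unscented_update(X_sigma, Z_sigma, z, wm, wc, R=None):
    """Generic unscented measurement update in the style of (19)-(25).
    X_sigma: state sigma points (d_x x 2L+1), already propagated through
    the motion model (18); Z_sigma: semantic observation sigma points
    (d_z x 2L+1) from (21); z: the actual semantic observation. For the
    pose update the observation noise is carried by the augmented sigma
    points (17), so R may be omitted; the landmark update (30) adds R."""
    mu_bar = X_sigma @ wm                   # predicted mean (19)
    z_hat = Z_sigma @ wm                    # predicted observation (22)
    dX = X_sigma - mu_bar[:, None]
    dZ = Z_sigma - z_hat[:, None]
    Sigma_bar = (wc * dX) @ dX.T            # predicted covariance (19)
    S = (wc * dZ) @ dZ.T                    # innovation covariance (23)
    if R is not None:
        S = S + R
    C = (wc * dX) @ dZ.T                    # cross-covariance (23)
    K = C @ np.linalg.inv(S)                # Kalman gain (23)
    mu = mu_bar + K @ (z - z_hat)           # mean update (24)/(31)
    Sigma = Sigma_bar - K @ S @ K.T         # covariance update (25)/(32)
    return mu, Sigma
```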

3.2. Landmark Position Estimate with Semantic Constraints

For the observed landmark $m_n$, the probability of the landmark position estimate can be represented as

$$p\left(m_n \mid x^{t}, z^{t}, u^{t}, \mathcal{S}\right) \sim \mathcal{N}\left(\mu_{n,t}, \Sigma_{n,t}\right), \quad (26)$$

where the probability is represented by a Gaussian with the mean $\mu_{n,t}$ and covariance $\Sigma_{n,t}$, which are derived in the following. Likewise, the sigma points of the observed landmark position, $\mathcal{X}_{n,t-1}$, are initialized as

$$\mathcal{X}_{n,t-1} = \begin{bmatrix} \mu_{n,t-1} & \mu_{n,t-1} + \gamma\left(\sqrt{\Sigma_{n,t-1}}\right)_{j} & \mu_{n,t-1} - \gamma\left(\sqrt{\Sigma_{n,t-1}}\right)_{j} \end{bmatrix}. \quad (27)$$

The transformed sigma points of the landmark position estimation with semantic relationships can be derived as

$$\bar{\mathcal{Z}}^{[j]}_{n,t} = h\left(\mu_t, \mathcal{X}^{[j]}_{n,t-1}, M_t\right), \quad (28)$$

where $h(\cdot)$ is the observation model in (9) and $\mu_t$ is the current estimation of the robot pose in (24). Hence the predicted semantic observation vector, $\hat{z}_{n,t}$, is

$$\hat{z}_{n,t} = \sum_{j=0}^{2L} w^{[j]}_{m} \bar{\mathcal{Z}}^{[j]}_{n,t}. \quad (29)$$

Then the Kalman gain is calculated as follows:

$$S_{n,t} = \sum_{j=0}^{2L} w^{[j]}_{c} \left(\bar{\mathcal{Z}}^{[j]}_{n,t} - \hat{z}_{n,t}\right)\left(\bar{\mathcal{Z}}^{[j]}_{n,t} - \hat{z}_{n,t}\right)^{T} + R_t, \qquad \Sigma^{m,z}_{n,t} = \sum_{j=0}^{2L} w^{[j]}_{c} \left(\mathcal{X}^{[j]}_{n,t-1} - \mu_{n,t-1}\right)\left(\bar{\mathcal{Z}}^{[j]}_{n,t} - \hat{z}_{n,t}\right)^{T}, \qquad K_{n,t} = \Sigma^{m,z}_{n,t} S^{-1}_{n,t}. \quad (30)$$

Note that the weights $w^{[j]}_{m}$ and $w^{[j]}_{c}$ are the same as in (20). Finally, the mean and the covariance of the $n$th landmark position are updated by

$$\mu_{n,t} = \mu_{n,t-1} + K_{n,t}\left(z_t - \hat{z}_{n,t}\right), \quad (31)$$
$$\Sigma_{n,t} = \Sigma_{n,t-1} - K_{n,t} S_{n,t} K^{T}_{n,t}. \quad (32)$$

Note that $z_t$ includes the true observation of the relative position between the landmark and the robot as well as the semantic relationships associated with this landmark. These observation values are obtained from the image processing of the vision sensor data. If more landmarks are observed at one time, the derivation is similar except that more semantic relationships are included in the observation model.
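As a usage-style sketch, the landmark update (26)-(32) reuses the hypothetical helpers introduced above (sigma_points_and_weights, semantic_observation, and unscented_update, all our own constructions): sigma points are drawn around the landmark mean, pushed through the semantic observation model at the fixed pose estimate, and corrected by the resulting gain. All numeric values below are toy inputs for illustration.

```python
import numpy as np

mu = np.array([0.0, 0.0, 0.0])       # current robot pose estimate from (24)
mu_lm = np.array([3.1, 5.0])         # landmark position mean
Sigma_lm = 0.04 * np.eye(2)          # landmark position covariance
x_line = lambda m: m[1] - 5.0        # hypothetical X-line residual (shared Y)
z = np.array([5.86, 1.02, 0.05])     # observed range, bearing, residual
R = np.diag([0.05, 0.02, 0.01])      # observation noise, as added in (30)

pts, wm, wc = sigma_points_and_weights(mu_lm, Sigma_lm)        # (27)
Z = np.column_stack([semantic_observation(mu, p, [x_line])     # (28)
                     for p in pts.T])
mu_lm, Sigma_lm = unscented_update(pts, Z, z, wm, wc, R)       # (29)-(32)
```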

As mentioned at the beginning, all the above derivation is with respect to a single particle $i$. The traditional resampling procedure is then applied, and the robot pose and the landmark positions are finally estimated.

4. Experiments and Discussions

4.1. Experiment Procedures

The platform used in the experiments was a Pioneer 3-DX robot equipped with a binocular camera system. The camera was the only exteroceptive sensor used to recover the representation of the environment. The sampling period was 0.5 seconds. The proposed technique was evaluated in three different types of experiments. In Experiment 1, the robot moved along a simple rectangular trajectory (8 m × 14 m) in a tidy lab environment. The environment in Experiment 2 was a regular office area, more representative of the workspaces of typical indoor service robots, chosen to verify the superior performance of the Semantic Unscented FastSLAM. Experiment 3 was conducted in a cluttered environment where the robot had to move along a zig-zag path through aisles.

In the experiments, three kinds of semantic metric relationships were found. One semantic relationship was that the newly observed landmark and another landmark existing in the previous map both lay on a line parallel to the $X$-axis ($X$-line). The second was that two landmarks both lay on a line parallel to the $Y$-axis ($Y$-line). These two kinds of semantic relationships are denoted by $s^{line}_{i,j}$. The third was that three landmarks were collinear, such as along the walls of neighboring cubicles in an office. Suppose $m_a = (x_a, y_a)$, $m_b = (x_b, y_b)$, and $m_c = (x_c, y_c)$ are three landmarks with the above semantic relationships; the semantic observation models can be represented, respectively, as

$$z = \begin{bmatrix} r & \phi & y_a - y_b \end{bmatrix}^{T}, \quad (33)$$
$$z = \begin{bmatrix} r & \phi & x_a - x_b \end{bmatrix}^{T}, \quad (34)$$
$$z = \begin{bmatrix} r & \phi & (x_b - x_a)(y_c - y_a) - (x_c - x_a)(y_b - y_a) \end{bmatrix}^{T}, \quad (35)$$

where $r$ and $\phi$ are the range and bearing of the newly observed landmark and the semantic component equals zero whenever the corresponding relationship holds exactly.
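Under the reading of (33)-(35) above, the three residuals can be implemented directly. The coordinate convention here, where an $X$-line constrains the $Y$ coordinate and a $Y$-line the $X$ coordinate, is an assumption of this sketch.

```python
def g_x_line(ma, mb):
    # X-line (33): both landmarks lie on a line parallel to the X-axis,
    # so the residual drives their Y coordinates to agree.
    return ma[1] - mb[1]

def g_y_line(ma, mb):
    # Y-line (34): both landmarks lie on a line parallel to the Y-axis.
    return ma[0] - mb[0]

def g_collinear(ma, mb, mc):
    # Collinearity (35): the 2D cross product of (mb - ma) and (mc - ma)
    # vanishes exactly when the three landmarks lie on one straight line.
    return (mb[0] - ma[0]) * (mc[1] - ma[1]) - (mc[0] - ma[0]) * (mb[1] - ma[1])
```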

4.2. Experiment Results and Discussions

Experiment 1. Figure 3 shows one image taken by the vision sensor on the robot with three landmarks, two of which were located along the same axis-parallel line. This semantic relationship was applied for localization and mapping. Figure 4 compares the system performance of the Unscented FastSLAM (Figure 4(a)) and the proposed method (Figure 4(b)). As shown in Figure 4(a), the error of the robot pose grew larger, especially after the robot turned. This error could not be corrected by loop closure detection because none of the landmarks observed after turning had been observed before. In Figure 4(b), the localization error was greatly reduced after the semantic topological metric map was applied. Figure 5 is a partially enlarged view of Figure 4, where A1 and B1 are the estimates from odometry only, A2 and B2 are the estimates from the Unscented FastSLAM, and A3 and B3 are the estimates from the proposed Semantic Unscented FastSLAM. When Landmark #6 was observed by the robot for the first time, it was also found to have the line-type semantic topological relationship with Landmarks #4, #2, and #3 in the existing map. This semantic relationship resulted in a much better robot pose estimate, B3 in Figure 5(b), pulling the dead-reckoning estimate B1 back from its deviation, in contrast to B2, which does not take advantage of semantic relationships.

Experiment 2. Figure 6 shows the experimental environment of Experiment 2, where the reference trajectory started from the circle and ended at the same point after a complex surveillance run along the arrow directions. The start point was defined as the origin of the inertial frame. It is worth noting that this office was composed of a few cubicles that were taller than the robot. Hence, when the robot moved along the reference trajectory, most landmarks could not be observed more than once until the robot was close to the end point.
The experimental results using the Unscented FastSLAM and the proposed Semantic Unscented FastSLAM are shown in Figure 7. In this experiment, when the robot moved close to the end point, Landmark #1 would be observed again after a long period, enabling loop closure detection. The estimate of the end point by odometry only, C1, was far away from the real end point. As shown in Figure 7(a), the end point estimated by the Unscented FastSLAM before the loop closure detection of Landmark #1, C2, was better but still had a large error; this error was too large to be corrected by the loop closure detection (see D2 for the estimate after loop closure detection). Figure 7(b) shows that the error of the end point estimated by the proposed Semantic Unscented FastSLAM before the loop closure detection, C3, was much smaller because of the semantic updates in the algorithm. Therefore, after the loop closure detection, the estimate was brought close to the reference end point (see D3 for the estimate by the Semantic Unscented FastSLAM in Figure 7(b)).

Experiment 3. The experimental environment is shown in Figure 8, where the small triangles represent a few locations along the reference path and the solid lines represent the walls of the cubicles. Notice that the reference path was not straight within each aisle because the aisles had irregular widths and the robot also needed to avoid chairs and boxes on both sides. Figure 9 shows an example of two pictures captured by the camera in which three green landmarks were detected as having a collinear relationship. Figure 10 illustrates the performance of the surveillance robot in Experiment 3. As shown in Figure 10(b), the estimated locations of the robot and the landmarks are much closer to the reference path using the proposed Semantic Unscented FastSLAM than without considering semantic relationships (Figure 10(a)).

5. Conclusions

This paper has proposed a vision-based Semantic Unscented FastSLAM algorithm for mobile robots. The semantic relationships are combined with the traditional topological metric map to improve the accuracy of localization and mapping. Experiments were conducted to verify that the Semantic Unscented FastSLAM is more robust than the Unscented FastSLAM and applicable to more general indoor autonomous surveillance tasks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is currently supported by the National Science and Technology Ministry of China under Grant no. 2013BAK01B02 and was supported by the State Key Laboratory of Robotics and System (HIT) under SKLRS-2011-ZD-04. The partial support of the National Natural Science Foundation of China (no. 61273335) and National High-Tech Research and Development Program of China (863 Program, no. 2015AA042303) is also appreciated.