Abstract

Self-localization and mapping are important for indoor mobile robots. We report a robust algorithm for map building and subsequent localization that is especially suited to indoor floor-cleaning robots. Common methods such as SLAM are easily defeated by kidnapping (e.g., after a collision) or disturbed by similar objects. A keyframe global map establishing method is therefore needed for robot localization across multiple rooms and corridors. Content-based image matching is the core of this method, which is designed for this situation by establishing keyframes containing both ceiling and distorted wall images. The image distortion caused by the robot's view angle and movement is analyzed and modeled, and an image matching solution is presented that consists of extracting the overlap regions of keyframes and rebuilding the overlap region through subblock matching. To improve accuracy, ceiling point detection and mismatched subblock checking are incorporated. The matching method processes environment video efficiently: in experiments, fewer than 5% of the frames are extracted as keyframes to build the global map, and these keyframes are widely spaced yet overlap one another. With this map, the robot localizes itself by matching its real-time vision frames against the keyframe map. Even with many similar objects in the environment, or when the robot is kidnapped, localization is achieved with a position RMSE below 0.5 m.

1. Introduction

The ideal indoor mobile robot should store a global map of the entire indoor space for self-localization, especially in a building with multiple rooms and corridors [1]. It is even better if the robot can build this global map by itself while studying the indoor environment, without human assistance [2].

In this research field, SLAM (Simultaneous Localization and Mapping) is the most commonly employed approach [3], especially V-SLAM (Visual Simultaneous Localization and Mapping) [4, 5]. Many recent variants take advantage of new features or 3D information to build navigation maps [6, 7], including ORB SLAM [8], dense SLAM [9, 10], semidense SLAM [11], LSD SLAM [12], and CV-SLAM [13, 14]. The CV-SLAM method is specially designed for indoor robot localization [15] and makes good use of ceiling features as a navigation map through an upward-looking camera [16]. This technology has been widely used in the Roomba, a product of iRobot Inc. Similar methods are used in other models made by Dyson, SAMSUNG, and LG.

However, the SLAM approach is easily disturbed, and the most troublesome problems are the kidnap problem and interference from similar objects [17]. The kidnap problem occurs frequently when a robot is suddenly involved in a collision, kicked, or intentionally repositioned during operation; the robot then cannot localize itself from the video information of previous moments [18]. Similar-object interference means that a robot is easily confused by similar features on similar objects in different places and consequently fixes itself to a wrong position. An indoor robot that looks mostly at the ceiling does not detect many distinctive features, and to make matters worse, indoor environments usually contain many similar objects (e.g., air-conditioning outlets, ceiling lamps, and the ceiling itself). Hence established methods are not sufficiently reliable for global indoor positioning.

To solve these problems, this paper presents a keyframe global map establishing method for robot localization through content-based image matching that involves both the ceiling and wall regions. The robot can establish the global map by itself and fix its position at any time without the localization information of previous moments, so it is robust against kidnapping.

Compared with common keyframe extraction methods (e.g., extraction according to the robot's travel time or distance), keyframes extracted according to image content are more useful; here, image content mainly refers to the objects in a room, their layout, and their color distribution. Problems of lost or redundant keyframes can be avoided, especially when the robot changes speed or turns suddenly [19, 20].

In this paper, the keyframe global map establishing method is based on a specially designed content-based image matching method that can compute image similarity accurately. The keyframes for each room and corridor are extracted from the robot's environment-study video by image content matching. The matching method is designed to avoid interference from image distortion caused by the robot's view angle and movement. It is built on a distortion model, which is derived to analyze the distortion features of the ceiling and wall regions (e.g., wall, door, window, and furniture). Based on these features, two image processing methods are presented: overlap region extraction and overlap region rebuilding through subblock matching. We first adjust the distortion of vision frames so that they are nearly consistent and then compute the similarity between frames. To adjust the distortion more accurately, a ceiling point detection method and a mismatched subblock checking method are incorporated. Through content-based image matching, the method exploits the different objects imaged in different rooms to extract keyframes for each room from the robot's vision. These keyframes are widely spaced, so the whole indoor environment can be described by only tens of keyframes. The keyframes overlap one another, so their relative positions can be extracted. The keyframes and their global positions compose the global map of the indoor environment. The same content-based image matching method is also used for robot self-localization: as the robot moves through rooms or corridors, the most similar keyframe for each real-time vision frame is chosen automatically from the global map, and the robot can fix its position precisely.

In the test, the experiment site is composed of two rooms (20 m²) and two corridors (11 m). The robot moves automatically through this site to study environment features, and a video about 9.5 min long (1710 frames) was taken by the robot vision system during indoor environment study. Through the map establishing method presented in this paper, 72 keyframes (fewer than 5% of the frames) are extracted from this video to build the map of the experiment site. The robot makes good use of this map to localize itself in the whole building and can recover its trajectory accurately, even in the event of kidnapping, with a position RMSE of less than 0.4 m.

The rest of this paper is organized as follows. Section 2 describes the design of the robot vision system and the distortion features of common indoor objects. Section 3 discusses the content-based image matching method and the keyframe global map establishing method. Sections 4 and 5 present and discuss the experimental results and conclusions.

2. Image Distortion Model and Feature Analysis for Common Indoor Objects

Content-based image matching that uses both the ceiling and the walls is one of the best ways to detect keyframes and build the global map: it can extract keyframes from the environment video as efficiently as possible to describe the indoor environment while keeping the connections between these frames. However, the main problem for content-based image matching is image distortion. If two frames are taken at different view angles and positions, the objects in them are distorted differently, and the image content similarity between the two frames is difficult to compute.

Here we discuss a method for analyzing the features of image content distortion by establishing a distortion model. An upward-looking camera (SY012HD wide-angle camera, Weixinshijie Technology Co., Ltd, Shenzhen, China) is installed on a wheeled robot (Roomba, R1-L081A, Midea, Suzhou, China) as the robot vision sensor [21], as shown in Figure 1(a), the same configuration as that used for CV-SLAM [22, 23].

2.1. The Features of Robot View Angle and Movement for Ceiling and Wall Region

The camera model, which projects a three-dimensional object in the real world to a two-dimensional image, is given in (1) [24, 25]. Its rotation term is determined by the robot view angles of roll, pitch, and heading, and its translation term is determined by the robot movements and the ceiling height. The remaining variables are the coordinates of an object point in the three-dimensional real world and the coordinates of its corresponding point in the robot vision, together with the camera constants: the camera focal length and two scale factors.
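Equation (1) itself is not reproduced in this text; for reference, a standard pinhole projection with extrinsic rotation and translation that is consistent with the description above can be written as follows (the symbols are our own notation, not necessarily those of the original equation):

% Standard pinhole projection with extrinsics; symbol names are assumptions.
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} k_x f & 0 & u_0 \\ 0 & k_y f & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\left( R(\alpha, \beta, \psi) \begin{bmatrix} x \\ y \\ z \end{bmatrix} + T(t_x, t_y, h) \right)

Here R is the rotation matrix built from roll α, pitch β, and heading ψ; T is the translation determined by the robot movement (t_x, t_y) and the ceiling height h; (x, y, z) is an object point in the real world; (u, v) is its image point; f is the focal length; and k_x, k_y are the scale factors.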

There are two main kinds of indoor surfaces: the ceiling and the wall regions. The ceiling is parallel to the floor, and the wall regions (e.g., walls, windows, doors, and furniture) are perpendicular to the floor. For the upward-looking camera on the robot, the view angles for them are different.

When the robot is moving on the floor, its route is parallel to the ceiling; the heading is the only robot view angle that can change, and both roll and pitch are equal to 0.

For the wall regions, because they are fixed to the floor, the freedom of the robot view angles is also very limited. We analyze each wall separately by giving each wall an independent coordinate system (front/back wall or side wall), as shown in Figure 1(b), with one axis perpendicular to the wall, one axis parallel to the wall and the floor, and one axis perpendicular to the floor. The heading between a wall and the robot is the only view angle that can change. As a wall is perpendicular to the floor and the ceiling, it can be treated as part of the ceiling with a roll or pitch of 90 deg. For the front (or back) wall relative to the robot, the pitch is 90 deg and the roll is 0; for the left (or right) wall, the roll is 90 deg and the pitch is 0, as shown in Figure 1(b).

Because of the different view angles, the ceiling and walls deform differently in the robot vision. To match precisely, this deformation should be corrected before image content matching. It also follows from (1) that the robot movement and the object position contribute to image distortion. Therefore, substituting the robot view angle, movement, and object position into (1), the distortion models of the ceiling and wall regions can be established and their distortion features extracted.

2.2. Ceiling Distortion Model and Feature Extraction

Given that the coordinate origin is on the ceiling, the camera position is described by the robot movement in the plane and the ceiling height. The heading between the ceiling and the camera can vary, while the roll and pitch are 0. Therefore (1) can be transformed as follows:

It can be simplified as follows:

The distortion of the ceiling includes only rotation and translation, and the shape of the ceiling is unchanged. As rotation and translation are affine transforms, they can be corrected by an affine transform as well. Through SURF (Speeded-Up Robust Features), the matched feature points in the two frames (denoted frames i and j below) are extracted [26, 27]. Substituting the coordinates of these points into (5), the heading difference and the movement differences between the two frames can be resolved [28]:

After rotation and translation according to these resolved differences, frames i and j are adjusted to the same robot view angle and camera shooting position. Their overlap regions (containing the same objects in the two frames) can then be extracted, and the image similarity is obtained by comparing the similarity of their overlap regions.
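As an illustration, the following sketch shows how the heading difference and translation between two ceiling frames could be recovered from matched feature points. It is a minimal example under stated assumptions: grayscale numpy images, ORB as a stand-in detector (SURF requires the opencv-contrib package), and a partial affine (rotation plus translation) fit playing the role of (5).

# Sketch: recover heading difference and translation between two ceiling frames.
import cv2
import numpy as np

def estimate_ceiling_motion(frame_i, frame_j):
    detector = cv2.ORB_create(nfeatures=1000)
    kp_i, des_i = detector.detectAndCompute(frame_i, None)
    kp_j, des_j = detector.detectAndCompute(frame_j, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_j, des_i)          # match frame j against frame i

    pts_j = np.float32([kp_j[m.queryIdx].pt for m in matches])
    pts_i = np.float32([kp_i[m.trainIdx].pt for m in matches])

    # M maps frame j pixel coordinates into frame i's view (rotation + translation).
    M, inliers = cv2.estimateAffinePartial2D(pts_j, pts_i, method=cv2.RANSAC)
    d_heading = np.degrees(np.arctan2(M[1, 0], M[0, 0]))   # heading difference
    dx, dy = M[0, 2], M[1, 2]                               # translation
    return M, d_heading, (dx, dy)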

2.3. Wall Distortion Model and Feature Extraction

For the front wall and the side wall, the heading difference is 90 deg (or −90 deg), but their image distortions take the same form and can be deduced as follows.

Given the robot heading when frame i is taken, for a point on the front wall, its corresponding point in frame i of the robot vision is calculated as follows:

And it can be simplified as follows:

For a point on the side wall, its corresponding point in frame i of the robot vision is calculated as follows:

And it can be simplified as

Compared with (7), the two horizontal axis terms are exchanged in (9) and (10), and the corresponding movement and image-coordinate terms are exchanged accordingly. The distortion feature of the front wall in the robot vision is therefore the same as that of the side wall, and the distortion correction method for the front wall is the same as that for the side wall.

Take the distortion of the front wall as an example. The robot heading and movement when taking frame i differ from the parameters of frame j, which is to be matched with frame i. According to (8), the image of an object point in frame j is calculated as follows:

For the ceiling and wall regions in frames i and j, the deformation of the ceiling images, consisting of rotation and translation, can be adjusted through (5). However, the wall region in frame j, after rotation and translation, is still transformed differently from that of frame i: the result of processing a wall point of frame j by (5) can be simplified as

If this point is further translated by an appropriate amount,

The transform result is

Compared with (10) and (11), the transformed frame j is much closer to frame i than the original frame j. The similarity between frames i and j can therefore be calculated more accurately by matching against the transformed frame j than against the original frame j.

However, some difference remains between the transformed frame j and frame i. On one hand, their denominators are different. On the other hand, some terms in (15) are unknown (e.g., the point coordinates in the room space), because they cannot be calculated from the two equations alone. To overcome these disadvantages, an overlap region extraction method and a subblock matching method are presented, and the unknown translation can be resolved through subblock matching. The term subblock is defined as follows: an image is divided into many equal-sized small blocks, and these blocks are called subblocks in this paper. Equation (5) is likewise realized through the overlap region extraction method designed in this paper.

If frames i and j contain the same image content, their transformed results are very similar after being processed by (5) and (15), which is very useful for image matching and similarity analysis. For the same ceiling regions in frames i and j, the correlation coefficient between them is very large; for the same wall regions, as they contain the same objects, the number of correctly matched subblocks is also very large. This paper combines the two measures to analyze the similarity between frames i and j.

3. Match Method Design

The content-based image matching method is designed first, and then the keyframe global map is established through this matching method.

3.1. Content-Based Image Matching through Overlap Region Extraction and Subblocks Rebuild

The content-based image matching method is designed according to the features of image distortion and consists of three parts: image overlap region extraction, overlap region rebuilding through subblock matching, and, lastly, image content similarity calculation, as shown in Figure 2. For frames i and j, the method adjusts the distortion of frame j to be similar to that of frame i and then calculates their degree of similarity. To make this adjustment more accurate, a ceiling point detection method is designed to improve the accuracy of overlap region extraction, and a mismatched subblock checking method is designed to rebuild the overlap region more exactly.

3.1.1. Image Overlap Region Extraction

For frames i and j, which are taken by the robot at different positions, only parts of them may overlap. Content-based image matching therefore focuses on analyzing the image similarity within this overlap region.

To extract this overlap region, frames i and j are adjusted to the same view angle and camera shooting position according to the matched feature points extracted by the SURF method. If all the feature points lie on the ceiling, the translation and rotation between the two frames can be calculated directly from (5). Through rotation and translation, the same objects in frames i and j are brought to the same position; this image processing procedure is shown in Figure 3. The overlap regions can then be extracted effectively from the two adjusted frames.

The image processing result is shown in Figure 4. The rotation and translation of frames i and j are shown in Figures 4(c) and 4(d), and the same objects are overlapped, as shown in Figure 4(e). The overlap region mask is the set of pixels covered by both images, as shown in Figure 4(f). The overlap regions of frames i and j can be extracted through this mask.
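A simplified sketch of this step is given below. As an assumption, it warps only frame j into frame i's view with the matrix M estimated earlier (rather than adjusting both frames) and takes the jointly covered pixels as the overlap mask; grayscale images are assumed.

# Sketch: extract the overlap regions of frames i and j.
import cv2
import numpy as np

def extract_overlap(frame_i, frame_j, M):
    h, w = frame_i.shape[:2]
    warped_j = cv2.warpAffine(frame_j, M, (w, h))

    # Warp an all-ones image to mark which pixels of frame i's view frame j covers.
    cover_j = cv2.warpAffine(np.ones(frame_j.shape[:2], np.uint8), M, (w, h))
    mask = cover_j > 0

    overlap_i = np.where(mask, frame_i, 0)
    overlap_j = np.where(mask, warped_j, 0)
    return overlap_i, overlap_j, mask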

If some feature points are extracted from wall regions, they interfere with the calculation of the rotation and translation. To remove the points that are not on the ceiling (including those on walls, furniture, windows, and doors), this paper designs a ceiling point detection method based on the defining feature of ceiling points: they are all at the same height.

For two ceiling points, given that the coordinate origin is on the ceiling, the length of their connecting line in frames i and j (for the two points in Figures 5(a) and 5(b), this is the line joining them) can be calculated from (4) in terms of the real-world coordinates of the two points, their image coordinates in frames i and j, and the robot movements for the two frames, as in (17). Since the two points are at equal heights, it can be deduced from (17), as stated in (18), that the lengths of the connecting lines of the same ceiling points in the two images are unchanged, as shown in Figure 5.

However, if one of the feature points lies on a wall rather than on the ceiling, its height differs from the ceiling height. In this case, the length of the connecting line is given by (19), in which the wall point's coordinates in frames i and j and its height appear; because this height enters the denominator of (19), the lengths given by (18) and (19) are not equal, and the connecting-line lengths differ between the two frames. Therefore, points on the wall can be removed effectively by comparing the lengths of the connecting lines of matched feature points in the two frames, while the feature points on the ceiling are retained. The image processing is shown in Figure 6.
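The following sketch illustrates one possible implementation of this ceiling point test: matched point pairs vote for each other when their connecting lines keep the same length in both frames, and points with few consistent votes are discarded. The 5% length tolerance and the majority-vote rule are assumptions.

# Sketch: keep only matched feature points whose pairwise connect-line lengths
# are preserved between the two frames (the ceiling-point property above).
import numpy as np

def keep_ceiling_points(pts_i, pts_j, tol=0.05):
    pts_i, pts_j = np.asarray(pts_i, float), np.asarray(pts_j, float)
    n = len(pts_i)
    votes = np.zeros(n, int)
    for a in range(n):
        for b in range(a + 1, n):
            len_i = np.linalg.norm(pts_i[a] - pts_i[b])
            len_j = np.linalg.norm(pts_j[a] - pts_j[b])
            if abs(len_i - len_j) <= tol * max(len_i, len_j, 1e-6):
                votes[a] += 1
                votes[b] += 1
    keep = votes >= 0.5 * (n - 1)      # points consistent with most other points
    return pts_i[keep], pts_j[keep]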

Thus the points on the ceiling can be extracted by comparing the lengths of their connecting lines in frames i and j. These points make the result of (5) more accurate and the overlap region extraction more effective. Within the overlap regions, since rotation and translation have been applied, the difference in the ceiling image between frames i and j caused by robot rotation and translation is removed. The remaining wall image distortion difference between the two frames is handled in the next section.

3.1.2. Overlap Region Rebuild through Subblocks Matching

Unlike the points in the ceiling region, the translation of each wall point is difficult to calculate directly through (15), because the robot cannot measure the coordinates of each point in the room with its monocular camera. This paper presents a subblock matching method to calculate the translation value.

Since a distorted object in a frame remains a connected entity in the image, the translation values of its points within a small image region are nearly equal, so the region can be translated in small units according to its average translation value. The overlap region of frame j is first divided into many small subblocks, and each subblock is matched against the overlap region of frame i through the SAD (Sum of Absolute Differences) method to obtain the average translation of that block. In this way the distortion of the overlap region of frame j can be adjusted to be similar to that of frame i, as shown in Figure 7.

The SAD matching criterion is given by (20), in which one operand is an m × n subblock of the overlap region of frame j and the other is the overlap region of frame i. Traversing the candidate positions, the offset that makes (20) smallest is the most suitable rebuild position of that subblock in the overlap region of frame i, and it gives the translation of the subblock. Through this method, every subblock of frame j finds its rebuild position and obtains its translation, and the rebuilt overlap region of frame j becomes similar to that of frame i.
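A minimal sketch of the SAD search in (20) is shown below; the search radius and the exhaustive search strategy are assumptions.

# Sketch: slide one m x n subblock from frame j's overlap region over frame i's
# overlap region around its original location; the offset with the smallest sum
# of absolute differences is the subblock's rebuild position.
import numpy as np

def match_subblock(block, region_i, row, col, search=20):
    m, n = block.shape
    best_sad, best_offset = np.inf, (0, 0)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            r, c = row + dr, col + dc
            if r < 0 or c < 0 or r + m > region_i.shape[0] or c + n > region_i.shape[1]:
                continue
            sad = np.abs(block.astype(np.int32)
                         - region_i[r:r + m, c:c + n].astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_offset = sad, (dr, dc)
    return best_offset, best_sad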

Considering that some subblocks may be mismatched to wrong positions because of similar objects, this paper presents a mismatched subblock checking method to detect the wrong translations.

As the subblock translation is the average of the translations of all points in the subblock, its behavior is very similar to that of a single point's translation. For a subblock, (15) can be changed into matrix form:

It can be seen from (21) that the translation consists of two parts: the projection of the heading difference, produced by the first and second terms, and the translation, produced by the third and fourth terms. The core of the first term is a function of the heading difference and represents its projection onto the image. The core of the second term is the same as the rotation part in (5); it reflects the rotation applied to frame j during overlap region extraction. So the subtraction of the first two terms in (21) extracts the projection of the heading difference between frames i and j and excludes the rotation introduced by overlap region extraction. The core of the third term is the translation difference between frames i and j caused by the wall object height and the difference in robot movement between the two frames. The fourth term, the same as the translation part in (5), reflects the translation applied to frame j during overlap region extraction. So the subtraction of the last two terms in (21) extracts the translation difference between frames i and j and excludes the translation of overlap region extraction.

Therefore the wrong translations of mismatched subblocks can be picked out according to the rotation difference and translation difference between frames i and j after overlap region extraction; since the unknown scale is the projection ratio between the real world and the camera image, it can be substituted by the corresponding subblock coordinates. The thresholds for the translation of each subblock can then be calculated by an affine model, as in (22), where the threshold parameters bound the rotation difference and translation difference between frames i and j after overlap region extraction. To remove the mismatched subblocks effectively, these thresholds are set smaller than the maximum rotation and translation differences.
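Because (21) and (22) are not reproduced in this text, the sketch below substitutes a simple surrogate for the threshold test: an affine model is fitted to all subblock translations, and subblocks whose translations deviate from the model by more than a threshold are flagged as mismatched. The residual threshold is an assumption.

# Sketch of a surrogate mismatched-subblock check (stand-in for (22)).
import numpy as np

def check_subblocks(centers, offsets, e_xy=8.0):
    centers = np.asarray(centers, float)   # (N, 2) subblock centers in frame j
    offsets = np.asarray(offsets, float)   # (N, 2) translations from SAD matching
    A = np.hstack([centers, np.ones((len(centers), 1))])
    coef, *_ = np.linalg.lstsq(A, offsets, rcond=None)   # least-squares affine fit
    residual = np.linalg.norm(offsets - A @ coef, axis=1)
    return residual <= e_xy                # True = kept, False = mismatched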

If the translation of a subblock exceeds its threshold, that block is deemed mismatched and is deleted. Through mismatched block detection, a correct content-based image matching result can be obtained. The matching result for two similar frames is shown in Figure 8, and the matching result for two dissimilar frames is shown in Figure 9. This subblock matching method can match similar frames and evaluate their similarity effectively.

3.1.3. Image Content Similarity Calculating

If the content of frames i and j is similar, the number of correctly matched subblocks is large, and the correlation coefficient between the rebuilt overlap region of frame j and the overlap region of frame i is also large [29, 30]. The similarity between the two frames is therefore defined in (23) as the product of the matched subblock count and the correlation coefficient, where the correlation is computed from the pixel values of the rebuilt overlap region of frame j and the overlap region of frame i after subtracting their average pixel values.
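A minimal sketch of (23) under these definitions:

# Sketch of (23): similarity = (number of correctly matched subblocks) x
# (correlation coefficient between the rebuilt overlap region of frame j
# and the overlap region of frame i).
import numpy as np

def frame_similarity(rebuilt_j, overlap_i, n_matched):
    a = rebuilt_j.astype(np.float64).ravel()
    b = overlap_i.astype(np.float64).ravel()
    corr = np.corrcoef(a, b)[0, 1]         # mean-subtracted, normalized correlation
    return n_matched * corr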

3.2. Keyframes Global Map Establishing through Content-Based Image Matching

Before establishing the map, the robot moves through the building automatically and records a video of the indoor environment with its camera. Then, through the content-based image matching method, the robot extracts a keyframe sequence from the vision video to build the global map of the indoor environment by itself.

The first keyframe is the first frame of the vision video. The remaining keyframes are extracted as follows:

Step 1. For the k-th keyframe, the robot calculates its similarity with each of the subsequent 50 frames of video (about 17 seconds).

Step 2. The maximum similarity within the subsequent 50 frames is found first; it corresponds to the frame whose spatial position is nearest to the k-th keyframe. Then the frame whose similarity falls to 50% of this maximum is extracted as the (k+1)-th keyframe. If the similarities of all 50 subsequent frames stay above 50% of the maximum, the 50th frame becomes the (k+1)-th keyframe. Because there is about 50% similarity between the k-th and (k+1)-th keyframes, they are separated by a long spatial distance yet still overlap each other.

Repeating Steps 1 and 2 over all frames of the indoor environment video yields the keyframe sequence, as shown in Figure 10; a minimal sketch of this loop is given below.
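The sketch assumes a similarity function such as the one defined above; the 50-frame window and the 50% ratio follow the text, while the exact tie-breaking details are assumptions.

# Sketch of Steps 1 and 2: scan the 50 frames after the current keyframe, find
# the window maximum similarity, and take the first frame whose similarity falls
# to 50% of that maximum as the next keyframe (or the 50th frame if none does).
def extract_keyframes(frames, similarity, window=50, ratio=0.5):
    keyframes = [0]                        # the first frame is the first keyframe
    k = 0
    while k + 1 < len(frames):
        end = min(k + window, len(frames) - 1)
        sims = [similarity(frames[k], frames[t]) for t in range(k + 1, end + 1)]
        threshold = ratio * max(sims)
        next_kf = end                      # default: the last frame in the window
        for idx, s in zip(range(k + 1, end + 1), sims):
            if s <= threshold:
                next_kf = idx
                break
        keyframes.append(next_kf)
        k = next_kf
    return keyframes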

The global map is built from these keyframes and consists of two parts: the keyframe sequence and the global position of each keyframe. Since neighboring keyframes partially overlap, the feature points in their overlap regions can be extracted and substituted into (5) to resolve the relative position relationship between them, including the heading difference and the position difference. Substituting these relative positions into (24) then yields the global position of each keyframe, where (24) relates the global positions of the k-th and (k+1)-th keyframes.
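Since (24) is not reproduced in this text, the sketch below shows a standard planar pose-chaining form that matches the description: each keyframe's global position is obtained by accumulating the relative heading and position differences between neighboring keyframes. Units and the dead-reckoning form are assumptions.

# Sketch of the pose chaining behind (24).
import numpy as np

def chain_keyframe_poses(relative_poses):
    # relative_poses: list of (d_heading, dx, dy) from keyframe k to keyframe k+1,
    # expressed in keyframe k's local frame (radians and meters assumed).
    poses = [(0.0, 0.0, 0.0)]              # (heading, X, Y) of the first keyframe
    for d_heading, dx, dy in relative_poses:
        heading, X, Y = poses[-1]
        X += dx * np.cos(heading) - dy * np.sin(heading)
        Y += dx * np.sin(heading) + dy * np.cos(heading)
        poses.append((heading + d_heading, X, Y))
    return poses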

Through this content-based image matching method, the spatial distance between keyframes can be very large, so the number of keyframes is very small and the robot can resolve its position quickly by matching against them. In the experiment, a total of 1710 frames were taken by the robot during indoor environment study, and 72 keyframes were extracted by this method.

Through this global map, the robot can localize itself in real time; the image processing flow of robot localization is shown in Figure 11. While the robot moves through the indoor environment to serve humans, the content-based image matching method is used to match its real-time vision frames against the keyframe sequence in the map and to find the most similar keyframe for each vision frame. As in the map establishing stage, feature points between the robot vision frame and this keyframe are extracted, and their relative position is resolved through (5). The global position of each vision frame, which is also the robot's global position, is then resolved through (24), in the same way as the keyframe global positions.
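A minimal sketch of this localization loop, assuming hypothetical similarity and relative_pose helpers built from the matching and motion-estimation steps above:

# Sketch of the localization loop of Figure 11: pick the most similar keyframe
# for the live frame, solve the relative pose against it, and compose that with
# the keyframe's stored global pose.
import numpy as np

def localize(frame, keyframes, keyframe_poses, similarity, relative_pose):
    scores = [similarity(kf, frame) for kf in keyframes]
    best = int(np.argmax(scores))
    d_heading, dx, dy = relative_pose(keyframes[best], frame)

    heading, X, Y = keyframe_poses[best]
    X += dx * np.cos(heading) - dy * np.sin(heading)
    Y += dx * np.sin(heading) + dy * np.cos(heading)
    return (heading + d_heading, X, Y), best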

4. Results of Experiments

The keyframe global map establishing method for robot localization through content-based image matching was tested in two parts: map building and localization in a large indoor environment, and robot localization under the kidnap condition.

4.1. The Experiment of Map Building and Localization in Large Indoor Environment

The experiment site is composed of two rooms (20 m²) and two corridors (11 m). The experiment procedure is shown in Figure 12. During indoor environment study, a video about 9.5 minutes long (1710 frames) was taken by the robot vision system.

In total, 72 keyframes (less than 5% of the frames) were extracted by our method from this video to build the map of the experiment site. Their resolved global positions are shown in Figure 13. The environment of the experiment site, including the ceiling and wall regions (walls, windows, doors, and furniture), is described by this keyframe sequence and their global positions, and the keyframe position RMSE is less than 0.3 m, as shown in Figure 14.

The robot can make good use of this map to localize itself in the whole building and to draw its route through the different rooms and corridors effectively, as shown in Figure 15.

To evaluate the localization precision of this map building method, the corners of the floor tiles are taken as ground reference marks, and the air-conditioning ports and ceiling lamps are taken as ceiling reference marks. The RMSE between the robot localization results and these marks is less than 0.5 m, as shown in Table 1.

Table 1 also shows the comparison between our method and ORB SLAM. The algorithm architecture of our method is similar to that of ORB SLAM, but this paper uses image content matching in place of feature point matching. Through image content matching, the robot can more reliably pick the keyframe from the global map that is most similar to its real-time vision and is seldom disturbed by similar objects in the indoor environment. As the experiment site (Figure 14) includes four parts, two corridors and two rooms, the comparison between our method and ORB SLAM is also divided into four parts, as shown in Table 1.

It can be seen from Figure 14 that there are many similar objects in the experiment site, which poses a serious test for both our method and ORB SLAM.

When the robot moves in the two rooms, the result of our method is better than that of ORB SLAM. This is because there are many similar objects in the two rooms, such as air-conditioning outlets and ceiling lamps, and the feature points on these similar objects in different rooms are easily mismatched as the same points. When ORB SLAM has to match against all keyframes in the map, the robot can easily localize itself to the wrong room (the distance between the two rooms is 12 m). Our method, however, takes advantage of the dissimilar objects imaged in different rooms, so the interference caused by similar objects is suppressed and the robot can fix its position precisely in these rooms with fewer localization errors.

When the robot moves in the two corridors, where there are fewer similar objects than in the rooms, ORB SLAM can extract feature points more effectively, and its result is better than that of our method.

Table 1 also shows the comparison between our method and CV-SLAM. The difference is not significant under normal conditions. However, compared with the CV-SLAM method on commercial equipment, our method is immune to kidnapping events, because it builds a global map of the indoor environment and uses this map to fix the robot position at any time.

4.2. The Test for Robot Localization under Kidnap Condition

The kidnap problem in robot self-localization can be solved effectively by our method. To test the robot under a more complex kidnap condition, two adjoining rooms (20 m² and 10 m²) and a corridor (5 m) were chosen as the experiment site. With our method, 16 frames were extracted as keyframes to build the global map of these rooms and the corridor; the position relationship of the keyframe sequence is shown in Figure 16, and the keyframe mosaic result is shown in Figure 17. The rooms and the small corridor restrict the field of view of the robot camera, so the robot cannot fix its position by watching distant landmarks when it is kidnapped and suddenly placed far away. In the test, the robot was repeatedly carried by experimenters from one place to another 2 or 3 m away to simulate kidnapping. Through matching with the global map, the robot was still able to fix its position effectively, even when it was suddenly moved from a room to the corridor, as shown in Figure 17, with a position RMSE of less than 0.4 m.

5. Conclusion

For robot localization and mapping in indoor environments, this paper presents a keyframe global map establishing method for robot localization through content-based image matching, with the ability to analyze the distortion and overlap of keyframes. Results show that common problems, such as kidnapping or disturbance by similar objects, can be resolved through the content-based image matching method presented in this paper, which is specially designed for indoor environments. In the tests, the keyframe global map was established by this method and described the indoor environment effectively. Although there are many similar objects in the experiment site, the robot is not defeated by kidnapping and can localize itself accurately (Figure 18), with a position RMSE of less than 0.5 m.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work is supported by the NSFC (nos. 61271147, 61372052).