#### Abstract

In this work we present a topological map building and localization system for mobile robots based on the global appearance of visual information. We include a comparison and analysis of global-appearance techniques applied to wide-angle scenes in retrieval tasks. Next, we define the multiscale analysis, which improves the association between images and allows topological distances to be extracted. Then, a topological map-building algorithm is proposed. Initially, the algorithm only has information about some isolated positions of the navigation area in the form of nodes. Each node is composed of a collection of images that covers the complete field of view from a certain position. The algorithm solves the node retrieval and estimates the spatial arrangement of the nodes. To this end, it uses the visual information captured along some routes that cover the navigation area. As a result, the algorithm builds a graph (the map) that reflects the distribution and adjacency relations between nodes. After the map building, we also propose a route path estimation system. This algorithm takes advantage of the multiscale analysis, so the pose estimation is not restricted to the node locations but extends to intermediate positions between them. The algorithms have been tested using two different databases captured in real indoor environments under dynamic conditions.

#### 1. Introduction

The autonomous navigation of a mobile robot usually involves a minimal knowledge of the surrounding environment. Normally, that knowledge is used with the purpose of building an internal representation of the area in a map. Using the map and the current information that the robot receives from its sensors, it is possible to carry out the localization of the robot and also to simultaneously add new information to the map.

In the literature, we can find a wide variety of environment representations depending on the sensor used. For instance, it is possible to find examples that try to compute the position of the robot using GPS, laser, or wheel encoders as input sensors. Among all the possibilities, vision systems have become common sensors for robot control due to the richness of the information they provide, their relatively low weight and cost, and the variety of possible configurations. Thus, we can find research based on single standard cameras [1], wide-angle cameras [2], stereo cameras [3], catadioptric systems that provide omnidirectional images [4], or an array of cameras arranged circularly to obtain a panoramic image [5]. In this work, we use a single fisheye camera, since it provides a wide-angle view of the environment and has a lower cost than other visual systems.

In the great majority of real visual applications, it is not possible to work with the image information directly from the sensors, as the memory requirements and computational cost would make the process unfeasible. Taking this into account, it is necessary to find an alternative representation of the images that contains as much information as possible with a reduced memory size. For this task, two main categories can be found: feature-based and global-appearance-based descriptors. The first approach is based on the extraction and description of significant points or regions from the scene. In this sense, we find examples of the use of SIFT features [6, 7], SURF [8, 9], or the Harris edge and corner detector [10] applied to localization and mapping tasks. On the other hand, global-appearance descriptors try to describe the scene as a whole, without the extraction of local features or regions. These techniques are of special interest in unstructured and changing environments, where finding patterns to recognize the scene might be difficult. For example, Kröse et al. [11] demonstrate the robustness of PCA (principal component analysis) applied to image processing. Menegatti et al. [12] take advantage of the properties of the discrete Fourier transform (DFT) applied to panoramic images in order to build descriptors of the scene. In [13], Kunttu et al. describe the behaviour of a descriptor based on the Fourier transform and wavelet filters in image retrieval tasks.

Regarding the representation of the map, three main approaches stand out: metric, topological, and hybrid techniques. Metric maps include information about distances with respect to a predefined coordinate system. These maps provide the position of the robot up to an uncertainty associated with the sensor error. However, they usually have a high computational cost. For example, there are approaches based on a sonar sensor applied to robot navigation [14] and others that solve the SLAM problem using a team of robots with a map represented by the three-dimensional position of visual landmarks [15].

In contrast, topological techniques use graph-based representations of the environment. The nodes correspond to different areas of the environment, whereas the edges represent the connectivity relationships between the nodes. These maps contain no absolute distances. Since they constitute a simple and compact representation, time and memory requirements are generally lower than in metric maps. Nevertheless, they contain enough information to allow the robot to navigate autonomously through the environment. Many different vision-based navigation systems can be found, such as those presented in [16, 17]. Both use an omnidirectional camera as input sensor and topological maps as a representation of structured indoor office environments. In these maps, the nodes correspond to images of the areas where the robot navigates. These images are described using PCA techniques in order to reduce the memory requirements. The navigation between nodes relies on a visual path following algorithm that extracts the edges of corridor walls. A similar system using a single camera is developed in [18]. Nodes correspond to special locations where certain actions must be taken, such as turning or crossing a door, while the edges represent trajectories where visual servoing navigation can be carried out. Specifically, the visual servoing is based on the vanishing point to keep the trajectory of the robot centered in the corridor. Other examples of indoor navigation with topological maps can be found in [19], where the gradient orientation histogram of the image is used to describe the scenes. Štimec et al. [20] present an appearance-based method for path-based map learning in both indoor and outdoor environments. This system is based on clustering of PCA features extracted from panoramic images in order to obtain distinctive visual aspects. In [21] we can find a topological localization system in a home environment, which uses a sonar sensor and grid map matching to carry out the localization of a robot, dealing also with the kidnapping problem. In [22], another example of a topological homing navigation system is presented. In that proposal, information from an omnidirectional visual system, a 3D stereo vision system, and the odometry are combined to carry out mapping and localization tasks using a mobile robot. FAB-MAP [23, 24] is another well-known topological SLAM approach, based on the extraction of SURF features to describe the appearance of the images. This algorithm has been tested in large-scale navigation environments. On the other hand, Bellotto et al. [25] describe a visual topological localization system for mobile robots that uses digital image zooming. This work is based on the appearance of omnidirectional images and includes an image matching algorithm that improves the image association by means of a digital zoom of the scenes.

Finally, hybrid techniques try to take advantage of both topological and metric approaches. Normally, hybrid maps use metric information to build local maps of separate areas, whereas topological relations are used to create a general map. It is also possible to introduce topological relations to carry out loop closures in metric maps. An example of a hybrid SLAM algorithm is RatSLAM [26]. This technique integrates the internal odometry of the robot provided by wheel encoders and visual information, using a neural network in a biologically inspired fashion. In [27] we can find the joint use of FAB-MAP and RatSLAM.

In this paper we propose a framework for vision-only topological map building and localization. Our technique stands out because of the use of global-appearance techniques to describe the scenes and the application of a multiscale analysis of the visual information to estimate relative distances between images. The system is intended for autonomous robotic navigation mainly in indoor spaces, such as offices and industrial environments, where topological navigation is suitable.

The map is represented as a graph. In this graph, the nodes are collections of 8 wide-angle images captured every 45 degrees, covering the complete field of view around their positions. The topological distance between nodes, which is estimated by means of the multiscale analysis, will be proportional to the actual distance between positions of the nodes in the real world.

We use the information of several routes of images acquired along the environment, which pass through the nodes, to carry out the map building. The map building system is able to recognize new nodes, find their orientation, their relative position, and connectivity using these routes of images. We use the multiscale analysis to obtain both an increase of correct matching of route images in the map database and also a measurement of the relative position of the compared scenes.

After the map building, we propose a route estimation algorithm, which also takes advantage of the multiscale analysis. In this case, the analysis is used to enhance the localization of the robot, making it possible to locate the robot not only at the node positions but also at intermediate points. We also introduce a weighting function that improves the localization accuracy using the graph obtained during the map building.

In [28], we find an example of visual route navigation that tries to keep the input memory to a minimum. Following that idea, we include a study of the computational cost of image retrieval using different global-appearance descriptors and image sizes in order to minimize the time and memory requirements. This study will be used to select the descriptor and the features of the images in the map building and localization experiments.

The following glossary includes some of the terms used in the text.

(i) *Node*: collection of 8 images captured from the same position on the ground plane every 45° approximately, covering the complete field of view around that position.

(ii) *Map database or node database*: collection of the images of all the nodes.

(iii) *Map*: graph that represents the topological layout of the nodes.

(iv) *Map building*: process of finding the topological connection between nodes and their relative position.

(v) *Topological distance*: relative position between images or nodes in the map.

(vi) *Image distance (d)*: Euclidean distance between the descriptors of two images.

The contributions of this paper are the following.

(i) We propose the multiscale analysis, which allows us to determine the relative position of two images captured with approximately the same orientation using global-appearance descriptors.

(ii) The paper includes a map building algorithm. Our algorithm consists of finding the spatial distribution of a collection of nodes of images distributed along an area. The system provides a topological graph that represents the adjacency relations and layout of the nodes.

(iii) We develop a topological localization system that extends the pose estimation of the robot not only to the node locations but also to intermediate positions, making use of the multiscale analysis. This system allows the robot to estimate its path during the navigation using only the visual information.

(iv) Finally, we offer an experimental validation of the map building and route path estimation algorithms using our own database, which contains information of two different areas.

The remainder of the work is structured as follows. Section 2 introduces the specifications of the images captured to carry out the experiments and the global-appearance descriptors considered to describe the scenes. In Section 3 we compare and analyze the computational requirements and matching precision of different descriptors using several image resolutions. Section 4 introduces the multiscale analysis. Section 5 presents the algorithm developed to build the topological map. In Section 6, we explain a novel route path estimation algorithm. Section 7 includes the experimental results using our own database. Finally, a summary of the main contributions of this work is included.

#### 2. Images Features and Descriptors

This section describes the main features of the images used in the experimental part and the techniques we have applied in order to obtain a descriptor that extracts the main information from the images based on their global appearance.

The images are captured using a fisheye lens camera. Specifically, we use the Hero2 camera of GoPro [29]. We choose this camera due to its wide-angle field of view (127°), its low weight, and its relatively low cost compared to other visual systems.

The goal of the image descriptors is to solve the problem of place recognition using the global appearance of the scenes, trying to keep the memory requirements and computational cost to a minimum. The descriptors based on the global appearance concentrate the visual information of the image by working with it as a whole, avoiding the extraction of landmarks or local features. They have shown good results in visual navigation tasks. It is possible to find previous works comparing these techniques [30, 31] or using them in map building and localization [32]. These works use omnidirectional vision sensors. However, we are not aware of any work where these techniques have been applied to nonpanoramic images.

Due to the use of a fisheye lens, the images captured with our camera present a radial distortion that must be corrected. It would be impossible to obtain useful information from the distorted images using global appearance descriptors, since they are based on the spatial distribution and disposition of the elements in the scene. For that reason, we use the Matlab toolbox *OCamCalib* [33] to calibrate the camera and compute the undistorted scenes from the original images.

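We consider the coordinate system of the image, $(u, v)$, and the coordinate system of the real world, $(X, Y, Z)$, which is situated in the focal point of the lens. We consider the directions $X$ and $Y$ aligned with $u$ and $v$. The coordinates $X$ and $Y$ of a point in the real world are proportional to the coordinates $u$ and $v$ of an image point:

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \alpha \begin{bmatrix} u \\ v \end{bmatrix}. \tag{1}$$

Therefore, the vector $\vec{p}$ can be defined as

$$\vec{p} = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} \alpha u \\ \alpha v \\ f(u, v) \end{bmatrix}. \tag{2}$$

We can include the parameter $\alpha$ in the function $f$. In this way, the previous equation can be expressed as

$$\vec{p} = \alpha \begin{bmatrix} u \\ v \\ f(u, v) \end{bmatrix}. \tag{3}$$

Due to the symmetric geometry of the lens, the coordinate $Z$ of the point only depends on the distance $\rho$ of the image point to the image coordinate center:

$$f(u, v) = f(\rho), \quad \rho = \sqrt{u^2 + v^2}. \tag{4}$$

The function $f(\rho)$, also named forward projection function, depends on the lens geometry. In general, it can be expressed as an $N$-degree polynomial:

$$f(\rho) = a_0 + a_1 \rho + a_2 \rho^2 + \cdots + a_N \rho^N. \tag{5}$$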

In our particular case, the minimum calibration error is obtained for a polynomial of degree equal to 3. The calibration function is

This function provides information about the direction of the rays that arrive at the camera system. The undistorted image corresponds to the projection of these rays onto a plane parallel to the camera sensor.
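The role of the forward projection function can be illustrated with a short Python sketch. It is only a minimal example, assuming the usual OCamCalib convention in which the ray through an image point $(u, v)$, measured from the image center, has direction $(u, v, f(\rho))$ with $\rho = \sqrt{u^2 + v^2}$. The function names and the coefficient values are hypothetical and are not the calibration results of this work.

```python
import math

def forward_projection(rho, coeffs):
    """Evaluate f(rho) = a0 + a1*rho + ... + aN*rho**N with Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * rho + a
    return result

def pixel_to_ray(u, v, coeffs):
    """Return the unit direction of the ray through the image point (u, v).

    (u, v) are measured from the image coordinate center (assumed convention).
    """
    rho = math.hypot(u, v)
    z = forward_projection(rho, coeffs)
    norm = math.sqrt(u * u + v * v + z * z)
    return (u / norm, v / norm, z / norm)

# Hypothetical degree-3 coefficients, NOT the calibration values of the paper.
EXAMPLE_COEFFS = [-200.0, 0.0, 1.0e-3, 1.0e-6]
ray = pixel_to_ray(30.0, 40.0, EXAMPLE_COEFFS)
```

Intersecting such rays with a plane parallel to the sensor would give the undistorted image coordinates.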

Figure 1 shows the forward projection function of the camera, obtained after the calibration process.

In Figure 2 we can see an example of the original image and the undistorted view.

In the remainder of the paper, when we talk about the images, we will refer to the undistorted version of the original scenes.

Next, we include a summary of different descriptors based on the global appearance of the scenes.

##### 2.1. Fourier-Based Techniques

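It is possible to describe an image using the discrete Fourier transform over its rows. We can transform each row of the image, $\{a_n\} = \{a_0, a_1, \ldots, a_{N_x - 1}\}$, into the sequence of complex numbers $\{A_k\} = \{A_0, A_1, \ldots, A_{N_x - 1}\}$:

$$A_k = \sum_{n=0}^{N_x - 1} a_n e^{-j (2\pi / N_x) k n}, \quad k = 0, \ldots, N_x - 1. \tag{7}$$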

The most relevant information of the image is concentrated in the low frequencies, which represent large-scale features of the images. Moreover, in real images, high frequencies are often affected by noise. Figure 3 shows the magnitudes of the first components of the Fourier transform of each row of an image. Hence, we select only the first $k$ coefficients of the discrete Fourier transform of each row to build the descriptor.

In [12], Menegatti et al. present a descriptor that uses the discrete Fourier transform in panoramic images, defining the Fourier signature. Since the magnitude of the transform presents rotational invariance, in that case it constitutes the localization descriptor. However, our database images are not panoramic, and the rotational invariance may introduce localization errors in areas where there is a symmetry between different images, such as corridors. To avoid this, our descriptor is built not from the magnitude but from the original complex values.
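As an illustration of this descriptor, the following Python sketch computes the first coefficients of the row-wise discrete Fourier transform and concatenates them, keeping the complex values. It is a minimal sketch rather than the implementation used in the experiments: the function names and the parameter `k` (number of coefficients kept per row) are assumptions, and the transform is evaluated directly instead of with an FFT.

```python
import cmath

def row_dft(row, k):
    """First k DFT coefficients of one image row (direct evaluation)."""
    n = len(row)
    return [
        sum(row[x] * cmath.exp(-2j * cmath.pi * m * x / n) for x in range(n))
        for m in range(k)
    ]

def fourier_descriptor(image, k):
    """Concatenate the first k complex DFT coefficients of every row."""
    descriptor = []
    for row in image:
        descriptor.extend(row_dft(row, k))
    return descriptor

# Toy 2 x 4 "image" used only to exercise the functions.
example = [[1, 1, 1, 1], [1, 0, -1, 0]]
descriptor = fourier_descriptor(example, 2)
```

The descriptor length is the number of rows times `k`, so reducing `k` trades detail for memory.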

##### 2.2. Histogram of Oriented Gradients

The descriptor based on the histogram of oriented gradients (HOG) [34] uses the orientation of the gradient of an image. First we have to compute the spatial derivatives of the image along the $x$- and $y$-axes ($D_x$ and $D_y$). Then, we obtain the magnitude and direction values of the gradient of each pixel:

$$|G| = \sqrt{D_x^2 + D_y^2}, \qquad \theta = \arctan\left(\frac{D_y}{D_x}\right). \tag{8}$$

Next, the image is divided into cells, and the histograms of the cells are computed. In Figure 4 we can see the division of the gradient of an image into different cells (Figure 4(b)) and the estimation of the histogram of each cell (Figure 4(c)). The histogram is computed from the gradient orientation of the pixels within the cell, weighted by the magnitude of the gradient. The descriptor consists of the histogram values of all the cells into which the image is divided, arranged in a vector.
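The cell-wise histogram computation can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: derivatives are central differences clamped at the borders, orientations are unsigned (folded into 0 to 180 degrees), each pixel votes in a single bin, and no block normalization is applied. The function name and parameters are hypothetical.

```python
import math

def hog_descriptor(image, cell_h, cell_w, n_bins=8):
    """Magnitude-weighted orientation histograms of non-overlapping cells."""
    rows, cols = len(image), len(image[0])
    cells_y, cells_x = rows // cell_h, cols // cell_w
    hist = [[0.0] * n_bins for _ in range(cells_y * cells_x)]
    for y in range(cells_y * cell_h):
        for x in range(cells_x * cell_w):
            # Central differences, clamped at the image borders.
            dx = image[y][min(x + 1, cols - 1)] - image[y][max(x - 1, 0)]
            dy = image[min(y + 1, rows - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % math.pi  # unsigned orientation
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            cell = (y // cell_h) * cells_x + (x // cell_w)
            hist[cell][b] += magnitude
    # Arrange the histograms of all the cells in a single vector.
    return [value for h in hist for value in h]

# Horizontal intensity ramp: every gradient points along the x-axis.
ramp = [[x for x in range(4)] for _ in range(4)]
hog = hog_descriptor(ramp, 2, 2, n_bins=4)
```

For the ramp image, all the gradient energy falls into the first orientation bin of each cell.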

##### 2.3. Gist-Based Techniques

Gist denotes a group of techniques that can be used to compress visual information, as detailed in [35]. These descriptors try to obtain the essential information of the images by simulating the human perception system, that is, identifying a scene through its colour or remarkable structures and avoiding the representation of specific objects. Oliva and Torralba [36] develop this idea under the name of *holistic representation of the spatial envelope* to create a descriptor. In [37], this model uses global scene features, such as spatial frequencies and different scales based on Gabor filtering.

A Gabor filter is a linear filter whose impulse response is a sinusoid modulated by a Gaussian function [38]. Therefore, a Gabor mask is localized both in the spatial and in the frequency domains (Figure 5). Thanks to its properties regarding the treatment of textures, the Gabor filter can be used in compression and segmentation of digital images, as [39] shows.

First, we create a bank of Gabor masks including different resolutions and orientations. Then, the image is filtered using the set of filters. The orientation of each filter depends on the number of masks of each resolution, since they are equally distributed between 0 and 180 degrees. The filtered images encode different structural information according to the mask applied. After that, the images are divided into cells, and we compute the average pixel value within each cell. This process is represented in Figure 6 and is repeated for every filtered image. The final descriptor is composed of the mean intensity of the pixels contained in horizontal cells applied to every filtered scene.
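The steps above can be sketched in Python as follows. This is a rough illustration, not the implementation used in the experiments. It assumes real-valued Gabor masks (cosine carrier only), takes the absolute value of the filter response, replicates border pixels, and interprets the horizontal cells as equal-height bands. All names, the mask size, and the relation between wavelength and Gaussian width are hypothetical choices.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor mask: a cosine carrier under a Gaussian envelope."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * x_rot / wavelength))
        kernel.append(row)
    return kernel

def filter_image(image, kernel):
    """Correlate the image with the mask and return the absolute response."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                yy = min(max(y + ky - half, 0), rows - 1)  # replicate borders
                for kx, weight in enumerate(krow):
                    xx = min(max(x + kx - half, 0), cols - 1)
                    acc += weight * image[yy][xx]
            out[y][x] = abs(acc)
    return out

def cell_means(image, n_cells):
    """Mean value inside n_cells horizontal bands (leftover rows are ignored)."""
    band = len(image) // n_cells
    means = []
    for c in range(n_cells):
        values = [v for row in image[c * band:(c + 1) * band] for v in row]
        means.append(sum(values) / len(values))
    return means

def gist_descriptor(image, wavelengths, n_orientations, n_cells, size=5):
    """Cell means of the image filtered at every resolution and orientation."""
    descriptor = []
    for wavelength in wavelengths:
        for i in range(n_orientations):
            theta = i * math.pi / n_orientations  # evenly spread over 0..180 deg
            kernel = gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength)
            descriptor.extend(cell_means(filter_image(image, kernel), n_cells))
    return descriptor

# Toy 8 x 8 image used only to exercise the functions.
toy_image = [[(x + 2 * y) % 5 for x in range(8)] for y in range(8)]
gist = gist_descriptor(toy_image, [2.0, 4.0], 4, 2)
```

The descriptor length is the product of the number of resolutions, orientations, and cells, which is why it stays compact even for larger images.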

#### 3. Localization Recognition

In this section, we present a comparison between the different global appearance descriptors included in the previous section applied to location retrieval tasks. The aim of this study is to check the performance of these descriptors applied over perspective images using different resolutions. The comparison takes into account both the precision in correct matching and the computational requirements.

The database is composed of several nodes of images captured in different isolated places randomly chosen in both indoor and outdoor environments. Note that every node is composed of 8 images captured with a phase lag of 45° between consecutive images. We also take a set of test images. The test images are captured at the same locations as the nodes with unknown orientations. Specifically, we capture 26 nodes and 3 test images per location. This database is different from the images used in the following sections.

In the experiments, we create a database with the descriptors of all the images of the nodes. When a new test image arrives, we compute its descriptor and compare it with the database. We define the image distance as the Euclidean distance between descriptors, which allows us to measure the similarity between scenes. Regarding the classifier, we choose the nearest neighbour.
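The retrieval step described above can be sketched in a few lines of Python. This is a minimal illustration: the function names are hypothetical, descriptors are plain lists, and `abs()` is used so that the same code works for real and complex descriptor components.

```python
import math

def image_distance(desc_a, desc_b):
    """Euclidean distance between two descriptors (real or complex entries)."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(desc_a, desc_b)))

def nearest_neighbour(query, database):
    """Return (index, distance) of the database descriptor closest to query."""
    distances = [image_distance(query, d) for d in database]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

# Toy database with three two-component descriptors.
database = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
match_index, match_distance = nearest_neighbour([1.0, 0.9], database)
```

Here the query is associated with the third database entry, which is the closest descriptor.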

Moreover, we are interested in finding the minimum image resolution we can use without detriment of matching precision. Table 1 shows the image sizes we have tested along the experiments. Size 1 is the original resolution of the camera.

In Figure 7 we can see the time necessary to compute the descriptor of an image. The information has been divided into two different graphs for clarity. Figure 8 shows the memory required to store the image descriptors of all the nodes in a database. We can observe an exponential reduction of the requirements when we use smaller resolutions.

Gist-Gabor stands out as the computationally most expensive descriptor (Figure 7), although as the image size is reduced, the time differences between descriptors decrease. Regarding the memory requirements (Figure 8), the Fourier signature is the technique with the highest memory requirements. Figure 9 shows the performance of each descriptor when finding the correct position using recall and precision measurements [40] for the three nearest neighbours. The Fourier signature is almost invariant to the image resolution, whereas HOG presents a notable reduction of the precision at the smallest resolutions.

The main criterion for the selection of the descriptor is the precision in image association. For that reason, HOG is the least appropriate descriptor, although its memory and time performance are favorable. The Fourier signature presents a high precision in the position estimation for all image sizes, but it is lower than that of Gist-Gabor. Moreover, the database created when we use the Fourier signature is larger than with the other techniques. For those reasons, the descriptor selected is Gist-Gabor. The results obtained using the fifth image size (64 × 32 pixels) show an appropriate compromise between time and memory requirements and precision.

#### 4. Multiscale Analysis

This section describes the multiscale analysis. During the matching process between the images of the nodes and the routes, the nodes might be too far apart for a correct association, especially at the route locations that are halfway between nodes. As a consequence, the route images could be insufficiently similar to the node scenes to obtain a reliable retrieval in the node database. The aim of the multiscale analysis is to improve the association accuracy and to estimate the relative position between images making use of the global appearance descriptors.

Given two images, this technique compares several zooms-in of the central area of each image at different scales. Figure 10 shows the field of view of a camera when it moves forward perpendicularly to its projection plane. We can observe that the scene in the forward position, represented in blue, corresponds to the central area of the field of view associated with the first position, represented in orange. If we select the central area of the orange image and rescale it to the original image size (simulating a digital zoom), its similarity to the second image (the blue one) increases. Figure 11 illustrates an example of this idea. It includes two images captured during the forward movement of the robot along a route (Figures 11(a) and 11(b)). The figure also includes a zoom-in of the central area of (a). We can observe that the zoomed image is more similar in appearance to (b) than the original scene (a).

The similarity between scenes is measured using global-appearance descriptors (Section 3). After the comparisons with different scales, we select the association with the lowest image distance (i.e., the nearest neighbour), since it denotes the most similar images using the global appearance.

The scales of the two images matched during the association process provide information about their relative position. Specifically, the algorithm uses the difference of the scales to estimate the distance between images.

Figure 12 illustrates this process. The example includes four images of a route captured sequentially as the camera moves forward. In the example, we aim to estimate the topological distance of the four scenes regarding Scene 1, which is our reference image. For that purpose, we estimate different scales of Scene 1, compute their global-appearance descriptors, and compare them with (a) Scene 1, (b) Scene 2, (c) Scene 3, and (d) Scene 4 without zoom.

Since these images are captured sequentially, each scene is geometrically more separated from Scene 1 in the real world. In other words, Scene 3 is more separated from Scene 1 than Scene 2, and Scene 4 has been captured in the most distant point from Scene 1.

On the other hand, Figure 12 includes on the right side a graph that represents the image distances of the four scenes versus the different scales of Scene 1.

The scale factor ($s$) is the quotient between the original resolution of the image and the size of the area we select. For instance, if the resolution of the image is 32 × 64, a scale $s = 2$ selects the 16 × 32 central pixels.
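The digital zoom associated with a scale factor can be sketched in Python as follows. This is a minimal illustration under assumptions that are not stated in the paper: the image is a list of rows, the crop is centered, and the rescaling to the original size uses nearest-neighbour interpolation. The function name is hypothetical.

```python
def central_zoom(image, scale):
    """Crop the central 1/scale area and rescale it to the original size."""
    rows, cols = len(image), len(image[0])
    crop_h = max(1, round(rows / scale))
    crop_w = max(1, round(cols / scale))
    top = (rows - crop_h) // 2
    left = (cols - crop_w) // 2
    # Nearest-neighbour resampling of the cropped area.
    return [
        [image[top + (y * crop_h) // rows][left + (x * crop_w) // cols]
         for x in range(cols)]
        for y in range(rows)
    ]

# Toy 4 x 8 image whose pixel value encodes its position (10 * row + column).
toy = [[10 * y + x for x in range(8)] for y in range(4)]
zoomed = central_zoom(toy, 2)
```

With a scale of 2, the 2 × 4 central pixels of the toy image are stretched back to 4 × 8, and a scale of 1 leaves the image unchanged.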

In the graph, we highlight the scale of Scene 1 that presents the minimum image distance for each scene of the example. Note that lower image distances (i.e., Euclidean distances between descriptors) denote higher similarity between scenes.

As expected, the minimum image distance in the comparison with Scene 1, which is the same image as the reference scene, is obtained using a scale equal to 1, that is, when no zoom is applied. Regarding the comparisons with the other scenes, we observe that the minimum image distance is obtained for higher zoom scales as the scene is geometrically farther from the reference image (Scene 1) in the real world.

Therefore, there is a direct correlation between the scale where the minimum image distance is obtained and the geometrical distance of the scenes in the real world. In our map building and localization algorithms, we use the difference of scales as topological distance between scenes.
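The selection of the best scale of a reference image can be sketched as follows. This is only an illustrative sketch: it assumes that the descriptors of the zoomed reference image have already been computed and stored in a dictionary indexed by scale, and both the function name and the toy descriptor values are hypothetical.

```python
import math

def best_reference_scale(reference_by_scale, scene_descriptor):
    """Return (scale, distance) of the zoomed reference closest to the scene."""
    scale = min(
        reference_by_scale,
        key=lambda s: math.dist(reference_by_scale[s], scene_descriptor),
    )
    return scale, math.dist(reference_by_scale[scale], scene_descriptor)

# Made-up descriptors of a reference image at three zoom scales.
reference_by_scale = {1.0: [0.0, 0.0], 1.5: [1.0, 0.0], 2.0: [2.0, 0.0]}

near_scale, near_distance = best_reference_scale(reference_by_scale, [0.1, 0.0])
far_scale, far_distance = best_reference_scale(reference_by_scale, [1.9, 0.0])
```

In this toy data, the scene that resembles the unzoomed reference selects scale 1, while the scene that resembles the most zoomed reference selects scale 2, mirroring the trend described for the four scenes.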

Moreover, as seen in Figure 12, we also obtain a reduction of the image distance when comparing two images, which means that we increase the similarity of the compared scenes using the global appearance.

This increase in the similarity between images turns into an improvement of the precision in the image association task. The map building and navigation algorithms proposed in the following sections of this work rely on the matching between the images of isolated positions in the environment (the nodes) and images acquired along routes. For that reason, an improvement of the association precision is important.

To measure this improvement, we study the association between 352 node images and 172 images of routes. We compare each image of the route with all the node images and select the association with minimum image distance (the nearest neighbour). We consider the association correct when the selected node is the nearest in geometric distance in the real world. Figure 13 shows the recall-precision results with and without the multiscale analysis. Thanks to the introduction of the multiscale analysis, the precision in correct node retrieval increases by 14%.

Therefore, the multiscale analysis improves the association between images and provides a measurement of the topological displacement between two images by means of the zoom scales ($s$).

#### 5. Map Building

This section details our topological map building algorithm. It starts with a database that contains the descriptors of all the images of the nodes, with no information of their spatial distribution. In the database, the images of the same node are stored consecutively, but the order of the nodes does not provide information about their spatial layout.

We also have different routes of images captured while the robot navigates along the nodes. The scenes of the routes are ordered as they are captured during the navigation, and the algorithm incorporates their information sequentially, that is, in the same order they are captured.

The aim of this algorithm is to select the nodes of the database as they are associated with the images of the routes using the global appearance descriptors, to establish the adjacency relationships between nodes, to define their orientation, and to estimate the distances between nodes using the multiscale analysis.

After the map building process, we obtain a graph that represents the spatial distribution of the nodes and whose edges are the adjacency relations between those nodes.

##### 5.1. Estimation of the Relative Position between Nodes and Route Images Using the Multiscale Analysis

During the map building, the algorithm uses the multiscale analysis to compare each image of the route with the images of the nodes. Given a route image, the matching process compares different scales of that image with several scales of the node images. After the comparisons, we select the experiment with the minimum image distance. $s_n$ and $s_r$ represent the specific scale factors of the node and the route images, respectively, obtained with the multiscale analysis.

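These two scales make it possible to determine the relative position between the image of the node and the route. The topological distance ($l$) between the route image and the node can be estimated from the scale factors of the node ($s_n$) and route ($s_r$) images as

$$l = s_r - s_n. \tag{9}$$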

Following the example included in Figure 14, when the route image is in front of the node (example Node 1), the comparison with the highest similarity between scenes is obtained by zooming in on the node image (Figure 14). Hence, $s_n > s_r$, obtaining a negative topological distance ($l < 0$). On the contrary, if the image of the route is situated behind the node (example Node 2), the minimum image distance is obtained when we compare the zoomed-in image of the route with the image of the node without any zoom. In that case, according to (9), we obtain $l > 0$. Therefore, the topological distance provides information not only about the relative distance between the nodes and the route images but also about the direction of that distance by means of its sign.
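The two-sided comparison and the sign of the topological distance can be sketched in Python as follows. This is a hedged illustration, not the implementation used in the experiments: it assumes that the descriptors of both images at every scale are available in dictionaries indexed by scale, and it assumes the simplest form consistent with the sign convention described above, namely the route scale minus the node scale. The function name and the toy descriptor values are hypothetical.

```python
import math

def topological_distance(node_by_scale, route_by_scale):
    """Compare every pair of scales and return (l, s_n, s_r, image_distance).

    Assumed convention: l = s_r - s_n, so a zoom on the node image
    (s_n > s_r) yields a negative topological distance.
    """
    distance, s_n, s_r = min(
        (math.dist(node_desc, route_desc), node_scale, route_scale)
        for node_scale, node_desc in node_by_scale.items()
        for route_scale, route_desc in route_by_scale.items()
    )
    return s_r - s_n, s_n, s_r, distance

# Made-up one-component descriptors: the zoomed node resembles the raw route image.
node_by_scale = {1.0: [0.0], 2.0: [5.0]}
route_by_scale = {1.0: [5.1], 2.0: [9.0]}

l, s_n, s_r, d = topological_distance(node_by_scale, route_by_scale)
```

In this toy case the best match uses the zoomed node image and the unzoomed route image, so the estimated topological distance is negative.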

Figure 15 shows an example of a reduced experiment of node retrieval and distance estimation. It includes two node images and nine images of a route whose path coincides with the node positions. The results show the nearest node $n$, the scales of the node image ($s_n$) and the route image ($s_r$) estimated using the multiscale analysis, and the topological distance $l$. In this example, we have omitted the estimation of the orientation, since the route follows a straight line. In the localization results, we can see that the topological distance is negative when the route images are behind the nearest node and positive when they are ahead of the node.

##### 5.2. Association between Routes and Nodes Images

The first step in the map building algorithm is the matching between the routes’ and the nodes’ images. This association is used to decide whether a new node is included in the map and is based on the image distance using the global appearance of the scene.

The algorithm can be summarized as follows.

(i) First, we create the map retrieval database. For that purpose, the algorithm computes the descriptors of the node images (including different scales of every image). All the descriptors have the same number of components.

(ii) The descriptors are stored as the columns of a matrix, which represents the map database. The number of columns is the number of images included in the database, which corresponds to the product of the number of nodes, the number of orientations per node, and the number of zoom scales per image.

(iii) Since the descriptors are stored following the same order as the database images, it is possible to know the node, the orientation within the node, and the zoom scale of each descriptor included in the database, since they are functions of the column index of the descriptor in the matrix (10). It should be noted that the order in which the nodes are stored in the database does not provide information about their spatial distribution. The position of the nodes is totally unknown to the system when the algorithm starts.

(iv) When a new route image arrives, the algorithm computes its descriptor and calculates its Euclidean distance to all the descriptors included in the database (11).

(v) The image distance is used as a classifier. The algorithm selects the nearest neighbour and associates with the minimum distance the corresponding node, orientation, and node scale.

(vi) The algorithm repeats this process using different scales of the route image.

(vii) Once we have estimated the image association for the different route scales, we sort the results by image distance and select the one with the minimum distance.

(viii) Finally, we save the parameters corresponding to the minimum image distance. From every route image, we obtain an information vector that gathers these parameters (12).
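The association steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the column ordering used in `column_to_labels` (node first, then orientation, then scale) is hypothetical, since the actual mapping is the one given in (10), and the function and key names are illustrative only.

```python
import math


def column_to_labels(column, n_orientations, n_scales):
    """Map a 0-based column index to (node, orientation, scale) indices.

    Assumes the columns are ordered by node, then by orientation,
    then by zoom scale (a hypothetical storage order).
    """
    node, rest = divmod(column, n_orientations * n_scales)
    orientation, scale = divmod(rest, n_scales)
    return node, orientation, scale


def associate(route_descriptors, database, n_orientations, n_scales):
    """Nearest-neighbour association over all route scales.

    route_descriptors: dict {route_scale: descriptor of the zoomed route image}
    database: list of node descriptors, one per column of the map database
    """
    best = None
    for route_scale, descriptor in route_descriptors.items():
        for column, node_descriptor in enumerate(database):
            d = math.dist(descriptor, node_descriptor)  # Euclidean distance
            if best is None or d < best[0]:
                best = (d, column, route_scale)
    d, column, route_scale = best
    node, orientation, scale = column_to_labels(column, n_orientations, n_scales)
    return {"node": node, "orientation": orientation, "node_scale_index": scale,
            "route_scale": route_scale, "distance": d}


# Toy database: 2 nodes x 2 orientations x 2 scales, 2-component descriptors.
database = [[float(j), 0.0] for j in range(8)]
route = {1.0: [5.2, 0.0], 1.2: [6.9, 0.0]}
info = associate(route, database, n_orientations=2, n_scales=2)
```

The returned dictionary plays the role of the information vector: it keeps the node, orientation, both scales, and the minimum image distance.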

##### 5.3. Graph Creation

The process of including a new node in the map starts with the information vector described in (12). We obtain a vector from every route image, and as we study a new image, we add the new information vector to an array. The decision of including a node in the map involves the last 5 route images.

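The decision uses the number of repetitions of the mode of the node numbers retrieved for the last 5 route images, together with the mean and standard deviation of all the image distances included in the information vectors so far. The node is included in the graph if either of these two conditions is met: (i) the mode appears in 3 of the last 5 retrievals; (ii) the mode appears in 2 of them and the image distance is low with respect to that mean and standard deviation.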

Therefore, the algorithm includes a new node if it associates the same node in 3 of the last 5 route images, or in 2 of them but with a low image distance (which denotes a highly reliable association).

When the image distance of an association is above a given threshold, the information vector is not taken into account, since a high image distance indicates low reliability in the association.
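The inclusion rule can be illustrated with a short sketch. The exact thresholds are not reproduced here, so the "low image distance" test below (distance under the mean minus one standard deviation) is purely an assumption for illustration, as are the function and argument names.

```python
from collections import Counter
from statistics import mean, pstdev


def node_to_add(recent, all_distances, window=5):
    """Decide whether a node should be added to the graph.

    recent: list of (node, image_distance) pairs, one per route image,
            already filtered of unreliable associations
    all_distances: every image distance stored in the information vectors
    Returns the node to add, or None.
    """
    last = recent[-window:]
    node, repetitions = Counter(n for n, _ in last).most_common(1)[0]
    if repetitions >= 3:
        return node
    if repetitions == 2:
        # Assumed threshold for a "low" distance: mean minus one std.
        threshold = mean(all_distances) - pstdev(all_distances)
        if all(d < threshold for n, d in last if n == node):
            return node
    return None


history = [0.1, 0.5, 0.1, 0.6, 0.7, 0.5, 0.6]
three_hits = node_to_add([(1, 0.5), (1, 0.4), (2, 0.6), (1, 0.5), (3, 0.7)], history)
two_reliable = node_to_add([(1, 0.1), (2, 0.5), (1, 0.1), (3, 0.6), (4, 0.7)], history)
two_unreliable = node_to_add([(1, 0.4), (2, 0.5), (1, 0.4), (3, 0.6), (4, 0.7)], history)
```

The first two cases return node 1, while the third is rejected because the two matching associations are not reliable enough.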

To know the connections between nodes, we create the adjacency matrix, a square matrix with as many rows and columns as nodes. It is a sparse matrix with rows labelled by graph nodes, where 1 denotes adjacent nodes and 0 denotes non-adjacent nodes. Supposing we have included a node in the graph, when the next node is found, the entry of the adjacency matrix indexed by these two nodes is set to 1.

Regarding the topological distance between the nodes in the graph, we make use of the image scale factors. To estimate the distance between two consecutive nodes, the algorithm uses two values: the topological distance of the last route image in which the first node is detected, and the scale difference of the first route image matched with the next node. It is worth recalling that, as stated above, since the last route scene where a node is detected is expected to be in front of that node, the first value will be positive. On the contrary, as the first route image where a new node is matched is usually behind that node, the second value is expected to be negative. Thus, the topological distance between two consecutive nodes takes into account the sign of both distances, according to the relative position of the route images and the nodes (13).

Following the example included in Figure 15, the first value corresponds to the topological distance obtained in the fifth route image (the last one where node 1 is detected) and the second value to the topological distance of the sixth image (the first one where node 2 is retrieved). The distance between nodes 1 and 2 then follows from (13).

To build the graph, it is necessary to incorporate information about the orientation. We suppose that the routes follow a straight path until we detect a change of direction in one of the nodes. The input angle denotes the orientation associated with the first route image where the node is retrieved, and the output angle is the direction of the last image where the same node is detected. The difference between these angles provides a phase lag that coincides with the change in the direction of the graph (14).

Moreover, the output angle is the direction the robot has to follow in order to travel from one node to the next. Figure 16 illustrates the phase lag between two nodes.

We set the orientation of the map by defining the direction of the output image in the first node. That direction determines the global orientation system of the map. The orientation of the graph is updated in every node with the phase lag defined in (14). Since we have a global orientation system defined, we can compute for each node the difference of orientation between its local system and the global one. For instance, if the input direction of a node is 0° in the global system and the same direction is offset by 90° in the local system, we have a phase lag of 90° for that node. That way, we can include the new nodes and orient them according to our global reference system.
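A short sketch may help to see how the phase lag propagates the global orientation along a route. It assumes angles in degrees, a phase lag computed as output angle minus input angle, and a global heading of 0° for the output direction of the first node; these conventions and all names are illustrative assumptions.

```python
def phase_lag(theta_in, theta_out):
    """Change of direction in a node, in degrees within [0, 360)."""
    return (theta_out - theta_in) % 360


def global_headings(nodes, first_heading=0.0):
    """Global heading of the path leaving each node.

    nodes: list of (input angle, output angle) pairs in the local system of
           each node, in the order in which the route visits them
    first_heading: global direction assigned to the output of the first node
    """
    heading = first_heading
    headings = []
    for index, (theta_in, theta_out) in enumerate(nodes):
        if index > 0:
            # The robot arrives with the current heading and turns by the lag.
            heading = (heading + phase_lag(theta_in, theta_out)) % 360
        headings.append(heading)
    return headings


# Straight segment, then a 90 degree turn, then a 270 degree turn.
headings = global_headings([(0, 0), (90, 90), (45, 135), (0, 270)])
```

Note that the second node keeps the global heading even though its local angles are 90°: only the difference between input and output angles matters, which is why each node can use its own local reference system.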

When a new route is studied, the algorithm initializes a new coordinate system for its nodes. That route is analysed independently of the global graph until a common node is found. Using the position and orientation of the common node in both systems, we are able to add the new nodes of the current route to the global graph. If two routes share a common path and we match nodes of the map database that have been previously included in the global map, the topological distances between those common nodes are estimated again, and the results are incorporated into the graph by calculating the mean of the new estimation and the previous estimations. The mean is weighted by the number of times that the same distance has been estimated.
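The distance bookkeeping between nodes can be sketched as follows. The form used for the distance between consecutive nodes (the positive offset measured after the first node minus the negative offset measured before the next one) is an assumption consistent with the signs described in the text, not a transcription of (13); the running mean is the weighted average described in the paragraph above.

```python
def node_distance(last_offset, first_offset):
    """Assumed combination of the two signed topological distances.

    last_offset: distance of the last route image matched with the first
                 node (expected to be positive)
    first_offset: distance of the first route image matched with the next
                  node (expected to be negative)
    """
    return last_offset - first_offset


def update_mean(previous_mean, count, new_estimate):
    """Merge a new estimate into a mean built from `count` earlier estimates."""
    total = previous_mean * count + new_estimate
    return total / (count + 1), count + 1


first = node_distance(0.2, -0.3)
merged, observations = update_mean(0.5, 2, 0.8)
```

Here a distance previously estimated twice with mean 0.5 becomes about 0.6 after a third estimation of 0.8.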

Therefore, our map building algorithm takes advantage of the information provided by the routes in order to obtain the relative position of the nodes by matching the sequence of images with the nodes descriptors database. It uses that information to estimate the adjacency relations, but it also gives information about the relative distance and position between them using the multiscale analysis.

#### 6. Path Estimation Algorithm

Once we have carried out the map building and we obtain the graph that represents the layout of the nodes, our aim is to estimate the path of the routes that the robot follows during the navigation in this graph. We can divide the localization of the robot in the map in two main steps: first, we carry out a coarse estimation, identifying the nearest node and the orientation. Note that the orientation of the robot is determined using the phase of the node image associated with the route image. The algorithm uses a weighting function to penalize associations that are geometrically far from the previous route pose, since consecutive route images should be located nearby in the graph.

If the localization of the route images is based only on the matching with the nodes in an image retrieval process, the localization accuracy will be limited to the node positions. In order to obtain a more accurate estimation of the pose, the second step in the localization algorithm includes the multiscale analysis. Specifically, we apply this technique using the current route image and the associated node image. That way, the algorithm is able to find the relative position between both images and to extend the possible position values to intermediate positions between the node locations.

When a route image arrives, we compare several zoom scales of this image with the nodes database, which includes the descriptors of different scales of the node images. The association between the route image and the database is again determined using the nearest neighbour with respect to the image distance. Since the test images come from a route path, we can assume that the distance and phase lag between consecutive images are small. For that reason, in order to improve the localization of the route images, the algorithm introduces a weighting function that penalizes locations or orientations far away from the previous image pose.

##### 6.1. Weighting Function

As stated at the beginning of this section, we introduce a weighting function in the algorithm to improve the localization accuracy of the route images in the topological map. This function reduces the probability of locating the current nearest node far from the previous image pose. Since the image association criterion is the nearest neighbour, the weighting function increases the image distance of the associations whose node image pose is distant from the last robot pose. In this way, we reduce the likelihood of selecting them as the current nearest node.

The weighting function is composed of 2 terms: the first one takes into account the topological distance between consecutive route images in the graph and the second their phase lag.

The first term uses the positions of the nearest nodes associated with the previous and the current route images in order to estimate their topological distance in the map. The adjacency matrix allows us to find the shortest path between two nodes in the map. Since we have a connected graph, we can always find a path that connects any 2 nodes of the map. Taking as the cost of traversing 2 adjacent nodes the topological distance between them, defined in (13), the cost associated with the sequence of nodes of the shortest path that connects the two nodes is obtained by accumulating the costs of its consecutive node pairs (15).

The second term takes into account the phase lag between consecutive route poses.

Finally, the weighting function between two images combines both terms by means of two constants that modulate the influence of the topological distance and the phase lag, respectively (16).

As (16) shows, the weighting value between 2 images depends on the cost to traverse the path that connects their respective closest nodes and their orientation difference.
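The two ingredients of the weighting function can be sketched as follows. The shortest-path cost uses Dijkstra's algorithm, which requires non-negative edge costs. The way both terms are combined in `weight` (one plus a linear combination, with the phase lag normalized by 180°) is only an assumption for illustration and is not the expression in (16); the constants `k_dist` and `k_phase` are hypothetical names.

```python
import heapq


def shortest_path_cost(graph, start, goal):
    """Dijkstra's algorithm on a dict {node: {neighbour: cost}}."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, edge_cost in graph[node].items():
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def weight(graph, prev_node, prev_theta, node, theta, k_dist, k_phase):
    # Assumed form: 1 for an unchanged pose, growing with distance and turn.
    path_cost = shortest_path_cost(graph, prev_node, node)
    turn = angular_difference(prev_theta, theta) / 180.0
    return 1.0 + k_dist * path_cost + k_phase * turn


graph = {1: {2: 0.5}, 2: {1: 0.5, 3: 0.7}, 3: {2: 0.7}}
same_pose = weight(graph, 1, 0, 1, 0, k_dist=0.1, k_phase=0.2)
far_pose = weight(graph, 1, 0, 3, 90, k_dist=0.1, k_phase=0.2)
```

With this form, an association that keeps the previous node and orientation is not penalized, while distant or rotated candidates receive a weight above 1.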

##### 6.2. Route Images Localization in the Graph

First, the algorithm computes the image distance between the current route image and the node images using (11). From it, we obtain one image distance for every image included in the map database, which contains the descriptors of all the node images at different scales.

Since the map database includes different zoom scales of the node images, each descriptor included in the database has an associated node, orientation, and node scale (10). The algorithm compares the current route image descriptor with the database, sorts the results by image distance, and selects a fixed number of nearest neighbours. Then, this process is repeated using different zoom scales of the route image.

After that, we update the image distance of the neighbours selected for each scale using the weighting function (17). The weighting value is computed between the nearest node and orientation of the previous route image and the node and orientation of each selected neighbour. It may change for every neighbour, since the node and the orientation might be different in every particular case.

When all the image distances have been updated, we sort the results again by the updated image distance and choose the nearest neighbour. With the information associated with the retrieval, we find the closest node of the route image, its orientation, and the scale factors of both the node and the route images.
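The re-ranking step can be sketched as follows. The update rule used here (multiplying each image distance by its weight) is an assumption for illustration, since the expression in (17) is not reproduced; `weight_fn` is a hypothetical callable that returns the weight of a candidate node and orientation with respect to the previous pose.

```python
def localize(candidates, weight_fn, k=3):
    """Pick the best association after weighting the k nearest neighbours.

    candidates: dict {route_scale: [(distance, node, theta, node_scale), ...]}
    weight_fn: callable (node, theta) -> weight, at least 1 for distant poses
    Returns (weighted distance, node, theta, node_scale, route_scale).
    """
    shortlisted = []
    for route_scale, matches in candidates.items():
        # Keep only the k nearest neighbours found for this route scale.
        for distance, node, theta, node_scale in sorted(matches)[:k]:
            weighted = distance * weight_fn(node, theta)  # assumed update rule
            shortlisted.append((weighted, node, theta, node_scale, route_scale))
    return min(shortlisted)


# Toy weight: the previous route image was associated with node 2.
def toy_weight(node, theta):
    return 1.0 + 0.5 * abs(node - 2)


candidates = {
    1.0: [(0.30, 5, 0, 1.0), (0.32, 2, 0, 1.2), (0.50, 3, 90, 1.0)],
    1.3: [(0.31, 2, 0, 1.0), (0.29, 6, 180, 1.4)],
}
pose = localize(candidates, toy_weight, k=2)
```

In this toy case the raw nearest neighbour is node 6, but after weighting the algorithm keeps node 2, which is consistent with the previous pose.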

With these data, we can determine the current robot pose in the map. The position is estimated using the nearest node and the relative position between the matched images, provided by the multiscale analysis through the difference of the scale factors defined in (9). The direction of advance is provided by the orientation of the matched node image. Note that this orientation is expressed in the local reference system of each node and must be corrected with the phase lag between the map global system and the node reference system, estimated previously during the map building process.

#### 7. Experiments and Results

This section details the database used during the experiments and the results of the map building and route path estimation using the multiscale analysis and the global appearance of images. As stated at the end of Section 3, the technique selected to describe the global appearance of the images is Gist-Gabor, and the image resolution is 32 × 64 pixels.

##### 7.1. Dataset: Nodes and Routes Images

Two different databases have been captured. They correspond to common areas of the Merchant Venturers Building of the University of Bristol.

Each database is composed of a set of nodes and different routes of images distributed along the areas where the nodes were captured. Note that each node has 8 images, with a phase lag of 45° between consecutive images, covering the complete field of view around the position of the node.

The experiments are divided into two different areas. The number of images of each area appears in Table 2, and the real distribution of the nodes in Figure 17. The actual distance between consecutive nodes is 2 meters as a rule, but in places where an important change of appearance is produced, that is, changes of direction or crossing a door, a new node is captured independently of the distance with the previous node. For that reason, the distance may be lower.

**Figure 17:** (a) Area 1; (b) Area 2.

Regarding the routes, the frequency of image acquisition is higher in the routes of Area 2. The images are taken every 0.5 meter in Area 1 and every 0.2 meter in Area 2. We increase the capture rate at turnings. We take a minimum of four images per position when a change of orientation is produced. In Area 2, this frequency increases with a minimum of 6 images per position. Figure 18 shows the distribution of the nodes and the routes in a synthetic representation. Figure 19 presents some examples of node and route images. They show typical situations of real applications, such as changes in illumination conditions and movements of the furniture and occlusions produced by people moving in the area. The system must be able to cope with these situations.

##### 7.2. Map Building Results

Figure 20 presents the node graphs of (a) Area 1 and (b) Area 2, obtained after running our algorithm. The algorithm is able to estimate the connections between nodes, with a distribution similar to the real layout in both areas. Area 1 has been the most challenging due to the higher number of nodes, the transitions between different rooms, and the loop closure in the map. In the loop closure, the graph representation differs slightly from the real layout. However, although the map loses some accuracy, the navigation of the robot is not affected. The robot can navigate from one node to another by knowing the node output image that connects the first node to the second one, and this does not depend on the graph layout.

It is important to remark that the algorithm needs a minimum number of route images between nodes. Otherwise, a node might not be included in the map.

We use the Procrustes analysis [41, 42] in order to measure the error of the graphs obtained in the map building. This analysis studies the geometric error between the real layout of the nodes and the layout obtained with our algorithm. It returns a standardized dissimilarity value: the lower the value, the more accurate the graph. We show the results in Table 3. In both cases, the geometric error is considerably low.
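As an illustration of what such a dissimilarity measures, the following sketch computes a Procrustes-style value for two 2D layouts. It is a simplified version under stated assumptions: both layouts are centred and scaled to unit norm, the best rotation and scaling are found in closed form using complex numbers, and reflections are not considered. It is not necessarily the exact procedure used to produce Table 3.

```python
def procrustes_dissimilarity(reference, estimate):
    """Dissimilarity in [0, 1] between two 2D point sets in the same order.

    0 means the layouts coincide up to translation, rotation and scaling.
    Assumes each layout has at least two distinct points.
    """
    def normalise(points):
        z = [complex(x, y) for x, y in points]
        centre = sum(z) / len(z)
        z = [p - centre for p in z]            # remove translation
        norm = sum(abs(p) ** 2 for p in z) ** 0.5
        return [p / norm for p in z]           # remove scale

    a = normalise(reference)
    b = normalise(estimate)
    # The optimal rotation is absorbed by the modulus of the inner product.
    inner = sum(p.conjugate() * q for p, q in zip(a, b))
    return 1.0 - abs(inner) ** 2


real_layout = [(0, 0), (2, 0), (2, 1), (0, 3)]
# Same layout rotated, scaled by 2 and translated.
moved = []
for x, y in real_layout:
    w = 2 * (0.6 + 0.8j) * complex(x, y) + (1 - 2j)
    moved.append((w.real, w.imag))
distorted = [(0, 0), (2, 0), (2.5, 1.5), (0, 3)]

exact = procrustes_dissimilarity(real_layout, moved)
approximate = procrustes_dissimilarity(real_layout, distorted)
```

A graph that reproduces the real layout up to a rotation or a change of scale yields a value close to 0, which matches the observation that the orientation of the global reference system does not change the shape of the map.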

The system is especially sensitive to the phase lag between nodes, since it is based on the angle estimation of the input and output images of the node. For that reason, it is advisable to raise the frequency of the image acquisition in the nodes where there is a change of direction. In the experiments, the node scale factor varies with a fixed step of 0.1 between consecutive scale factors. Regarding the route images, the scale factor has a maximum value of 1.4, with a step of 0.05. We have chosen a small step in both route and node scales since we have given priority to the quality of the results over the computational requirements. The average time per image is 725 ms in Area 1 and 680 ms in Area 2. The difference is due to the matching of the route images with the nodes database: the computation of the Euclidean distance between the new route image descriptor and the descriptors contained in the database accounts for 45% of the total time in the map building process. Area 1 contains more nodes than Area 2, so its descriptor database has more elements, which requires more time to carry out the retrieval and, therefore, increases the global processing time of the map building.

The orientation of the global reference system is determined by defining the direction of the output image of the first node. In the experiments, we choose this direction so that the graph has the same orientation as the layout represented in Figure 18. If we had chosen any other direction, the map shape would have been the same, but rotated. In any case, the orientation of the global reference system does not affect the localization algorithm.

##### 7.3. Path Estimation Results

In order to find proper values of the weighting constants, we carry out a study of the localization performance with respect to the values of both constants. Figure 21 shows the precision in the node image retrieval varying both parameters. The dataset of the experiments is composed of the images included in the routes 1 and 5 of the map 1. In the precision measurement, we consider that a retrieval has been successful when it selects the node image that corresponds to both the correct position and orientation. In Figure 21(a) we study the retrieval performance with respect to one of the weighting constants. We can notice an increase of the precision for low values of this constant. Once we have selected its value, we study the precision varying the other constant. The results are shown in Figure 21(b). Both graphs show that the weighting function improves the retrieval precision. However, if we are too restrictive with the position or phase changes, the precision decreases. For that reason, when the constants are given high values, the retrieval accuracy is lower.

In the path localization experiments, the weighting constants are fixed to the values selected in this study, and a fixed number of nearest neighbours is retained in the retrieval of each zoom scale of the route images. The node zoom scale varies from 1 to 2.2 with a step of 0.4. The route zoom scale varies from 1 to 2.2 with a step of 0.3 between consecutive scales.

Figure 22 shows the path estimation of different routes of both areas. The dots in the paths of the routes represent the position of the different images studied. As can be seen, the algorithm copes with the interpolation of the location in halfway positions between the nodes using the information of the image scales. In general, the precision decreases at turnings in the routes. It is also important that, despite the weighting function, the algorithm is able to recover the correct position even when a previous estimation is not correct, as we can see in Figures 22(a) and 22(c). The result of the path estimation of the fourth route of the first area (Figure 22(b)) is also interesting. As we can see in Figure 18(a), route number 4 presents a variation in its path that differs from the layout of the nodes. Despite that fact, the path estimation algorithm is able to estimate the position accurately.

Therefore, the results show that our algorithm is able to estimate the path of the route even in intermediate positions between the nodes and to correct false node associations made in previous parts of the route.

#### 8. Conclusions

In this paper we have studied the problem of vision-only topological mapping and route navigation using global appearance image descriptors. First, we have included a comparison of three global appearance techniques and different image sizes. Next, we have presented the multiscale analysis, which permits estimating the relative position of two images using digital zooming. Then, we have included an algorithm to build a topological map from a set of nodes and routes of images. Finally, we have developed a localization algorithm that estimates the position of the mobile robot in the graph using the visual information as input.

In the comparison of the global appearance descriptors, all the techniques show a reasonably high accuracy in image retrieval tasks. Fourier analysis presents a stable retrieval precision with regard to the image size and a reduced time requirement, but its memory requirement is clearly higher than that of the other techniques, especially with the bigger images. The HOG descriptor shows good computational cost and memory requirements. However, when we reduce the image size, its accuracy in localization decreases more than in the case of the other techniques. Hence, since we intend to use a reduced image resolution in the map building and localization algorithms, HOG is inadvisable in these applications. Gist-Gabor is the most compact representation in almost all the experiments. We select this descriptor to carry out the experiments because it is the most reliable descriptor at the lower image resolutions. It is possible to reduce the scene by a factor of almost 30 with respect to the original size without an important loss of precision. In this way, the computational time needed to obtain the descriptor is reduced by a factor of more than 10.

Regarding the multiscale analysis, it improves the association between images using the global appearance of the scenes and provides a measurement of the topological distance between images.

The map building algorithm is able to determine the adjacency relationships between the nodes distributed in the navigation area and to create a graph using the information of routes taken along the nodes positions. The results present a high accuracy in the node detection and estimation of adjacency and relative orientation. Moreover, the estimation of the topological distance between the nodes provides a graph representation of the nodes with similar layout to the real distribution.

The algorithm created to estimate the path of routes along the area takes advantage of the multiscale analysis to improve the topological localization of the robot in the map. After matching the route image with the map database, the difference of scales between the node and the route image provides the relative position of both images. Although we use a weighting function to penalize important changes in position and orientation between consecutive route images, the algorithm is able to recover the correct location even when a previous image of the route has introduced a false pose.

The results obtained both in the map building and in the path estimation of routes encourage us to continue exploring the application of global appearance image descriptors to these tasks. It would be interesting to extend this study to find the minimum information that the map has to include in order to allow a correct navigation of the robot, to apply new global appearance descriptors, to use omnidirectional visual information, or to improve the estimation of the orientation in order to correct small errors during the navigation.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This work has been supported by the Spanish government through the Project DPI2010-15308 “Exploración integrada de entornos mediante robots cooperativos para la creación de mapas 3D visuales y topológicos que puedan ser usados en navegación con 6 grados de libertad.”