Abstract

Current RGB-D sensors provide a large amount of valuable information for mobile robotics tasks like 3D map reconstruction, but the storage and processing of the incremental data provided by the different sensors over time quickly become unmanageable. In this work, we focus on 3D map representation and propose the use of the Growing Neural Gas (GNG) network as a model to represent 3D input data. The GNG method is able to represent the input data with a desired number of neurons or resolution while preserving the topology of the input space. Experiments show that the GNG method yields a better input space adaptation than other state-of-the-art 3D map representation methods.

1. Introduction

A 3D point comprises $x$, $y$, and $z$ values representing the spatial coordinates. Moreover, if color information (R, G, and B) is available for each point, it is referred to as RGB-D data. RGB-D cameras, like the Kinect sensor, provide this kind of data and have become very popular due to their low cost. This sensor can provide more than 300,000 3D points per capture. That feature is very useful to carry out a fundamental task in mobile robotics: mapping [1]. Mapping is a task that builds a map from the observations and movements of the robot. Each time the robot moves, an observation is linked to that movement. Then, using different methods, for example, registration, the map can be built by transforming each observation with respect to a common coordinate frame. This map is useful for subsequent tasks, like localization, navigation, and recognition. The use of RGB-D data as observations is referred to as RGB-D mapping, and the resulting maps as RGB-D maps.

The amount of data in RGB-D maps is huge since the number of poses is high. In a typical map with 10,000 poses, the data could consist of more than 3 billion 3D points, which is unaffordable for representation and other tasks. Furthermore, as the acquisition frame rate is high, a huge number of redundant points is used to represent any common area of the input space. Due to the huge quantity of data, several methods have been proposed to reduce the number of points in the map while preserving the main features of the data, as they will be used in further tasks.
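The figure quoted above follows from simple arithmetic, which the short sketch below reproduces. The per-capture and pose counts come from the text; the storage layout per point (three 32-bit floats plus three 8-bit color channels) is only an assumption chosen for illustration.

```python
# Back-of-the-envelope size of a raw RGB-D map.
POINTS_PER_CAPTURE = 300_000   # from the text: Kinect-like sensor
POSES = 10_000                 # from the text: a typical map
BYTES_PER_POINT = 3 * 4 + 3    # assumption: float32 x, y, z plus 8-bit R, G, B

total_points = POINTS_PER_CAPTURE * POSES
total_bytes = total_points * BYTES_PER_POINT

print(f"{total_points:,} points")
print(f"{total_bytes / 1e9:.0f} GB uncompressed")
```

Under these assumptions the raw map would occupy tens of gigabytes, which illustrates why a reduced representation is needed before further processing.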

Elevation maps were a commonly used structure in the past [2]. These elevation maps are represented using a regular 2D cell grid where each cell value represents the elevation or height of the surface in that space. This compact model allows a simple representation of large areas but with a low level of detail. Triebel et al. [3] presented an extension of the height maps to represent different surfaces at different heights. This multilevel surface map (MLS map) allows the representation of vertical structures and different surfaces in a 2D cell-based structure like the ones used in traditional height maps. This approach focuses on the representation of planar surfaces to help mobile robotics tasks like robot navigation.
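As a minimal sketch of the elevation map idea, the following code stores one height per 2D cell. The function name `elevation_map`, the `cell_size` parameter, and the choice of keeping the highest observed value are assumptions made for illustration, not details taken from [2].

```python
import math

def elevation_map(points, cell_size):
    """Map each (ix, iy) grid cell to the highest z value observed in it."""
    heights = {}
    for x, y, z in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        if cell not in heights or z > heights[cell]:
            heights[cell] = z
    return heights

# Four 3D points collapse into two cells of a 1 m grid.
points = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.5), (1.4, 0.2, 2.0), (1.6, 0.1, 0.1)]
grid = elevation_map(points, cell_size=1.0)
print(grid)
```

Because each cell keeps a single height, a surface lying below another one in the same cell (for example, the floor under a table) is lost, which is the limitation that multilevel surface maps address.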

Following this idea of 3D space representation, some other structures have been proposed, like occupancy grids or Octrees. Occupancy grids represent the entire space as 3D cell grids. The cell information could consist of a single value of occupancy or contain more complex information, such as the probability of occupancy. Several works in mobile robotics have used this structure as a base for their applications [4-6]. Another common structure is the Octree [7]. The Octree is a tree structure in which each internal node has eight children. Each node of the tree is recursively subdivided into eight new nodes until a certain condition is fulfilled, like the size of the area represented by a node. This structure represents both occupied and empty space in the area represented by the Octree. It also allows some optimized operations like closest point searching or occupancy checking. In [8], an Octree-based framework called OctoMap is presented. It uses probabilistic occupancy estimation where areas of the space are represented as occupied, empty, or uncertain. Another commonly used structure is the voxel grid (VG). The VG downsampling technique is based on sampling the input space using a grid of 3D voxels. This technique has been traditionally employed in the area of computer graphics to subdivide the input space and reduce the number of points [7, 9].

Wang et al. [10] presented a feature-based 3D point cloud simplification method. They detect the points with more information (high curvature) and subsample the rest of the points using a uniform spherical sampling method. Therefore, they preserve the keypoints and subsample those points with less curvature information. This method is able to subsample 3D point clouds obtained from object surfaces. It does not work on scene maps, since the spherical sampling and the feature selection process are usually harder and problem dependent.

Other approaches use self-organizing maps in order to reduce the input space. Viejo et al. [11] used a Growing Neural Gas (GNG) algorithm to filter and reduce single frontal point clouds. In this paper, we propose the extension of that work to manage complete maps. The GNG adapts its structure to the complete map, reducing its size, preserving the input space topology, and providing better adjustment than existing methods. To validate our method, we present several experiments comparing our method with map size reduction state-of-the-art methods.

The rest of this work is organized as follows. First, in Section 2 we introduce and describe the proposed GNG application and the Octree and voxel grid methods that we use in the experimentation. Next, in Section 3, our method is validated by comparing it with the two previously mentioned methods. Finally, conclusions and future work are presented.

2. 3D Representation Methods

One way of selecting points of interest in 3D point clouds is the use of a topographic mapping where a low-dimensional map is fitted to the high dimensional manifold of the model, whilst preserving the topographic structure of the data.

In this section, we review some typical methods to represent and compress 3D data. First, we propose the use of a Growing Neural Gas algorithm to reduce and represent 3D point cloud maps. Then, we briefly describe two well-known data structures in order to compare them with our method.

2.1. GNG Method

A common way to achieve data dimension reduction is by using self-organising neural networks where input patterns are projected onto a network of neural units such that similar patterns are projected onto units adjacent in the network and vice versa. As a result of this projection, a representation of the input patterns is achieved, which in postprocessing stages allows exploiting the similarity relations of the input patterns.

However, most common approaches do not provide good neighborhood and topology preservation if the logical structure of the input pattern is not known a priori. In fact, the most common approaches specify in advance the number of neurons in the network and a graph that represents topological relationships between them, for example, a two-dimensional grid, and seek the best match to the given input pattern manifold. When this is not the case, the networks fail to provide good topology preservation as in the case of Kohonen’s algorithm [12].

The approach presented in this paper is based on self-organising networks trained using the Growing Neural Gas learning method [13], an incremental training algorithm. The links between the neurons in the network are established through competitive Hebbian learning [14]. As a result, the algorithm is suitable in cases where the topological structure of the input pattern is not known a priori and yields topology preserving maps of feature manifold [15].

In GNG, the nodes of the network compete to determine the ones with the highest similarity to the input distribution. In our case, the input distribution is a finite set of 3D points extracted from different types of sensors. The highest similarity reflects which node, together with its topological neighbors, is the closest one to the input sample point, which is the pattern generated by the network. The $d$-dimensional input signals are randomly generated from a finite input distribution.

The nodes move towards the input distribution by adapting their position to the input data geometry. During the learning process local error measures are gathered to determine where to insert new nodes. New nodes are inserted near the node with the highest accumulated error. At each adaptation step a connection between the winner and its topological neighbors is created as dictated by the competitive Hebbian learning method. This is continued until an ending condition is fulfilled, for example, the evaluation of the optimal network topology, a predefined network size or a deadline.

The network is specified as follows.
(i) A set $N$ of nodes (neurons). Each neuron $c \in N$ has its associated reference vector $w_c \in \mathbb{R}^d$. The reference vectors can be regarded as the positions of their corresponding neurons in the input space.
(ii) A set of edges (connections) between pairs of neurons. These connections are not weighted, and their purpose is to define the topological structure. The edges are determined using the competitive Hebbian learning algorithm. An edge-aging scheme is used to remove connections that become invalid due to the activation of the neurons during the adaptation process.

The GNG learning algorithm is as follows.
(1) Start with two neurons $a$ and $b$ at random positions $w_a$ and $w_b$ in $\mathbb{R}^d$.
(2) Generate a random input signal $\xi$ according to a density function $P(\xi)$.
(3) Find the nearest neuron (winner neuron) $s_1$ and the second nearest $s_2$.
(4) Increase the age of all the edges emanating from $s_1$.
(5) Add the squared distance between the input signal and the winner neuron to a counter error of $s_1$: $\Delta\mathrm{error}(s_1) = \lVert w_{s_1} - \xi \rVert^2$.
(6) Move the winner neuron $s_1$ and its topological neighbours (neurons connected to $s_1$) towards $\xi$ by learning steps $\epsilon_w$ and $\epsilon_n$, respectively, of the total distance: $\Delta w_{s_1} = \epsilon_w (\xi - w_{s_1})$ and $\Delta w_{s_n} = \epsilon_n (\xi - w_{s_n})$ for every neighbour $s_n$ of $s_1$.
(7) If $s_1$ and $s_2$ are connected by an edge, set the age of this edge to 0. If it does not exist, create it.
(8) Remove the edges with an age larger than $a_{\max}$. If this results in isolated neurons (without emanating edges), remove them as well.
(9) With every certain number $\lambda$ of input signals generated, insert a new neuron as follows.
    (i) Determine the neuron $q$ with the maximum accumulated error.
    (ii) Insert a new neuron $r$ between $q$ and its furthest neighbor $f$: $w_r = 0.5\,(w_q + w_f)$.
    (iii) Insert new edges connecting the neuron $r$ with neurons $q$ and $f$, removing the old edge between $q$ and $f$.
    (iv) Decrease the error variables of neurons $q$ and $f$ by multiplying them with a constant $\alpha$. Initialize the error variable of $r$ with the new value of the error variable of $q$ and $f$.
(10) Decrease all error variables by multiplying them with a constant $\beta$.
(11) If the stopping criterion is not yet achieved, go to step 2.
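The steps above can be sketched in a few dozen lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name `gng`, all parameter names and default values (`lam`, `eps_w`, `eps_n`, `alpha`, `beta`, `a_max`, `n_signals`), the use of a fixed number of input signals as stopping criterion, and the reading of step 9(iv) as "the new neuron inherits the reduced error of the highest-error neuron" are choices made here.

```python
import heapq
import math
import random

def gng(points, max_nodes=50, lam=100, eps_w=0.1, eps_n=0.001,
        alpha=0.5, beta=0.9995, a_max=50, n_signals=20000, seed=0):
    """Fit a Growing Neural Gas to a list of points; returns (positions, edges)."""
    rng = random.Random(seed)
    w = {0: list(rng.choice(points)), 1: list(rng.choice(points))}  # step 1
    err = {0: 0.0, 1: 0.0}
    age = {}                          # frozenset({i, j}) -> age of edge i-j
    next_id = 2

    def neighbours(i):
        return [j for e in age if i in e for j in e if j != i]

    for t in range(1, n_signals + 1):
        xi = rng.choice(points)                                      # step 2
        s1, s2 = heapq.nsmallest(2, w, key=lambda i: math.dist(w[i], xi))  # step 3
        for e in age:                                                # step 4
            if s1 in e:
                age[e] += 1
        err[s1] += math.dist(w[s1], xi) ** 2                         # step 5
        moves = [(s1, eps_w)] + [(n, eps_n) for n in neighbours(s1)]
        for i, eps in moves:                                         # step 6
            w[i] = [wi + eps * (x - wi) for wi, x in zip(w[i], xi)]
        age[frozenset((s1, s2))] = 0                                 # step 7
        for e in [e for e, a in age.items() if a > a_max]:           # step 8
            del age[e]
            for i in e:
                if not neighbours(i):     # neuron left without edges
                    del w[i], err[i]
        if t % lam == 0 and len(w) < max_nodes:                      # step 9
            q = max(err, key=err.get)
            f = max(neighbours(q), key=lambda j: math.dist(w[q], w[j]))
            r, next_id = next_id, next_id + 1
            w[r] = [0.5 * (a + b) for a, b in zip(w[q], w[f])]
            del age[frozenset((q, f))]
            age[frozenset((q, r))] = 0
            age[frozenset((r, f))] = 0
            err[q] *= alpha
            err[f] *= alpha
            err[r] = err[q]               # assumed reading of step 9(iv)
        for i in err:                                                # step 10
            err[i] *= beta
    return w, set(age)

# Usage: 2,000 noisy samples of a flat 1 m x 1 m patch reduced to at most 30 neurons.
sample_rng = random.Random(1)
cloud = [(sample_rng.random(), sample_rng.random(), sample_rng.gauss(0.0, 0.01))
         for _ in range(2000)]
nodes, edges = gng(cloud, max_nodes=30, n_signals=5000)
print(len(nodes), "neurons,", len(edges), "edges")
```

The sketch keeps adapting after the target size is reached and only stops inserting neurons, so `max_nodes` acts as the desired resolution of the map. A practical implementation would replace the linear nearest-neuron search with a vectorized or GPU-based one.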

In summary, the adaptation of the network to the input space takes place in step 6. The insertion of connections (step 7) between the two closest neurons to the input patterns establishes an induced Delaunay triangulation in the input space. The elimination of connections (step 8) removes the edges that no longer belong to the triangulation. This is done by eliminating the connections between neurons that are no longer activated or that are isolated. Finally, the accumulated error (step 5) allows the identification of those areas in the input space where it is necessary to increase the number of neurons to improve the mapping.

Using a Growing Neural Gas model to represent 3D data has some advantages over traditionally used methods like the voxel grid or Octrees. For example, we specify the number of neurons (representative points of the map) directly, while methods like the voxel grid or Octree yield a different number of occupied cells depending on the distribution of the data and the resolution of the cells (voxels in the voxel grid and leaves in Octree-based methods).

2.2. Octree Based Method

Most 3D point cloud mapping algorithms use the spatial organization of the points to encode them in a structure like an Octree to reduce the amount of information. An Octree is a tree data structure in which each internal node has exactly eight children. Octrees partition the three-dimensional space by recursively subdividing it into eight octants. The subdivision starts from a user-specified volume, or from the bounding box computed for the input set. Then, each node or cell is subdivided into 8 child nodes until a certain condition is reached. These conditions vary depending on the problem or the Octree implementation. A commonly used condition is to stop producing new child nodes when the volume or size of the corresponding cell reaches the desired precision.

One of the main features of the Octree representation is that nodes not containing input space points are not subdivided, and therefore those leaf nodes represent an empty volume of the space. This feature is useful for some mobile robotics applications, such as robot navigation. There exist different approaches to select the representative point of the occupied nodes. A simple one is to take the center of the node cell, but using the mean or centroid of the points inside the cell improves the preservation of the topology. The latter approach offers better results but has a higher computational and memory cost.
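The recursive subdivision and the two representative selection strategies described above can be sketched as follows. This is an illustrative sketch, not the PCL Octree; the names `octree_leaves`, `centroid`, `half`, and `min_half` are assumptions, and the stopping condition is the cell-size criterion mentioned in the text.

```python
def octree_leaves(points, center, half, min_half):
    """Recursively split a cube; return (cell center, inner points) per occupied leaf."""
    if not points:
        return []                          # empty octant: never subdivided
    if half <= min_half:
        return [(center, points)]          # occupied leaf at the target resolution
    buckets = {}
    for p in points:
        # One boolean per axis selects one of the eight octants.
        key = tuple(p[k] >= center[k] for k in range(3))
        buckets.setdefault(key, []).append(p)
    leaves = []
    for key, pts in buckets.items():
        child = tuple(center[k] + (half / 2 if key[k] else -half / 2)
                      for k in range(3))
        leaves += octree_leaves(pts, child, half / 2, min_half)
    return leaves

def centroid(pts):
    """Mean of a list of 3D points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

cloud = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (0.9, 0.9, 0.9), (-0.6, 0.3, 0.2)]
leaves = octree_leaves(cloud, center=(0.0, 0.0, 0.0), half=1.0, min_half=0.25)
centers = [c for c, _ in leaves]                   # representative: cell center
centroids = [centroid(pts) for _, pts in leaves]   # representative: mean of inner points
print(len(leaves), "occupied leaves")
```

Keeping the inner points of each leaf is what makes the centroid option more expensive in memory than simply returning the cell centers.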

2.3. Voxel Grid Method

The VG downsampling technique is based on sampling the input space using a grid of 3D voxels [16]. The VG algorithm defines a voxel grid in 3D space, and for each voxel a point is chosen as the representative of all points that lie in that voxel. It is necessary to define the size of the voxels, as this size establishes the resolution of the filtered point cloud and therefore the number of points that form the new point cloud. The representative of each cell can be chosen using one of the approaches described in the previous section (the cell center or the centroid of its inner points). Thus, a subset of the input space is obtained that roughly represents the underlying surface.

The VG method shares the drawbacks of the Octree-based methods when compared to other subsampling techniques: it is not possible to define the final number of points that represent the surface, geometric information is lost due to the reduction of the points within a voxel, and the result is sensitive to noisy input spaces.
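A minimal sketch of voxel grid downsampling with centroid representatives is given below. The function name `voxel_grid` and the parameter `leaf_size` are assumptions made for illustration; this is not the PCL implementation.

```python
import math
from collections import defaultdict

def voxel_grid(points, leaf_size):
    """Replace all points falling in the same cubic voxel by their centroid."""
    voxels = defaultdict(list)
    for p in points:
        voxels[tuple(math.floor(c / leaf_size) for c in p)].append(p)
    return [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in voxels.values()]

cloud = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (0.9, 0.9, 0.9), (-0.6, 0.3, 0.2)]
for size in (0.5, 2.0):
    print(size, len(voxel_grid(cloud, size)))
```

The number of output points is a by-product of the chosen voxel size and of how the data are distributed, so it can only be tuned indirectly, which is the first drawback listed above.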

2.4. Discussion

In this subsection we briefly describe the main differences of the above-described methods. Figures 2, 3 and 4 show several examples of using the three described methods. The GNG representation provides a set of neurons and their neighbors. These representatives and their connections can be used in some algorithms like 3D mesh reconstruction or feature extraction.

Both the voxel grid and Octree methods should provide similar results due to their final representation of the points. In a point cloud reduction application, the Octree takes its representatives from the leaf nodes, and if we use the same resolution as in the voxel grid method, we get a similar division of the space into cubes or cells of the same dimension. The voxel grid method is the simplest and fastest reduction method, but it lacks the advantages of the Octree structure or the GNG model, such as neighbor searching facilities.

Figure 5 shows a 2D description of the representative points of the described methods. We observe that the GNG method assigns more neurons on high density input areas (bottom left area) than the voxel grid and Octree methods. We also observe how the GNG is able to eliminate some noisy values like the point near the center in contrast with the representatives used in the VG and Octree methods.

3. Experimentation

In this section we test the quality of adaptation of the three described methods. We first describe the data used in the experiments and then we analyze the results of the tested methods, both quantitatively and qualitatively.

3.1. Experimentation Setup

To test the implemented scene mapping systems on room map scenarios, we used the TUM RGB-D dataset [17]. This dataset provides RGB-D and ground-truth data with the goal of evaluating visual odometry and visual SLAM systems. The dataset contains the color and depth images obtained using a Microsoft Kinect sensor along with the ground-truth trajectory of the sensor. It provides images at full frame rate (30 Hz) and sensor resolution (640 × 480). The ground-truth trajectory was obtained from a high-accuracy motion-capture system with eight high-speed tracking cameras (100 Hz).

This dataset contains 39 sequences recorded in two different scenarios. The fr1 datasets were recorded in a typical office environment (first scenario) and the fr2 datasets were recorded in a large industrial hall (second scenario). Figures 6 and 7 show the ground-truth reconstruction maps of the “fr1 360” and “fr2 desk,” respectively.

Table 1 shows the number of points of the input maps used in the experimentation. We can observe that the number of input points ranges from one million (“fr1 ”) to 6 million (“fr2 desk”).

3.2. Quality Adaptation Experiment

As previously mentioned, we compare the proposed GNG adaptation with two commonly used state-of-the-art data structures: the Octree and the voxel grid. Implementations of both methods are included in the Point Cloud Library (PCL), a large-scale, open project [18] for 2D/3D image and point cloud processing. The Octree implementation uses the center of the leaf nodes as representative points, whereas the voxel grid implementation uses the centroid of the points of each nonempty voxel. Both implementations use a resolution parameter that represents the size of the voxel in the VG method and the side of the leaf cell in the Octree implementation. The GNG results are obtained using 10,000 λ input patterns.

We extensively tested the implemented methods using different numbers of representatives. Since the three tested methods reduce the amount of noise in the generated map, we need to know the real distance from the selected representatives to the original input space. The following measure specifies how close the representations are to the original model.

A quantitative measure of the input space adaptation of the generated map is obtained by computing the mean error (ME) of the reduced map against sampled points (input space):

$$ME = \frac{1}{|V|} \sum_{p \in V} \lVert p - q \rVert,$$

where $V$ is the input space, $p$ is a point that belongs to the input space, and $q$ is the representative point with the minimum distance to $p$. Euclidean distances to closest points are averaged over the entire input space.
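The mean error can be computed directly from its definition, as in the brute-force sketch below. The function name `mean_error` and the toy data are assumptions made for illustration.

```python
import math

def mean_error(input_points, representatives):
    """Average Euclidean distance from each input point to its closest representative."""
    total = sum(min(math.dist(p, q) for q in representatives) for p in input_points)
    return total / len(input_points)

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
reps = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(mean_error(cloud, reps))
```

In this toy case the three closest distances are 0, 1, and 1, so the mean error is 2/3. The exhaustive search costs one distance evaluation per pair of input point and representative, so for maps with millions of points a spatial index such as a k-d tree would be the more practical choice.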

Figure 8 shows the RMS errors of the three methods on the six different tested maps. We observe that the three methods behave similarly on the different datasets. The Octree method obtains the worst results, probably due to the selection of its cell-node center as representative. The voxel grid obtains lower errors than the Octree due to the use of the centroid of the inner points instead of the center of the voxel or cell.

It is important to point out again that the representative selection method used in this comparison is given by the implementations used; the Octree and voxel grid methods could use a different representative selection strategy. The GNG adaptation shows the best results on all datasets. It is noticeable that the GNG obtains lower errors for different numbers of representatives, but as the number of representatives increases, the three methods converge to the same error.

3.3. Qualitative Results

In this subsection we qualitatively analyze the results obtained with the three different methods. Figure 9 shows the original map and the three tested representations of the “fr1 360” scene. Part (a) shows the point cloud that we are trying to represent and reduce. Parts (b) and (c) are, respectively, the Octree and voxel grid representations, and part (d) is the GNG representation of the scene. The Octree representation, using the centers of the leaf nodes, yields a strongly structured point representation. This representation has a more uniform distribution of the representatives, but the adaptation error is worse, as we saw in Figure 5 and the mean error graphs. The voxel grid representation yields results similar to the Octree, with uniformly distributed points, as can be observed in the points that represent the floor. However, it obtains better results on the boundaries compared with the Octree method. Both VG and Octree place representatives in isolated and noisy points. In contrast, the GNG neurons are uniformly distributed over the input space, and the impact of the noisy points and undefined borders on the obtained representation is reduced. We also observe the inherent triangulation of the space that the GNG algorithm obtains using the neighborhood of the neurons.

Figure 10 shows two more experiments with GNG representation. In these experiments we have selected two maps with holes. GNG is able to adapt to these holes and it does not insert neurons in them.

3.4. Execution Time

With respect to computational cost, our method can feasibly be included in a modern system using general purpose computing platforms. Moreover, in a previous work [17] we designed a GPU-based implementation of the GNG algorithm that speeds up the sequential version several times. The speedup increases as the number of neurons used for the representation grows.

Table 2 reports CPU and GPU runtimes for different numbers of neurons and input patterns, together with the speedup of the GPU version with respect to the CPU one. The GPU used was an NVIDIA GTX 480 graphics card with 480 cores, a global memory of 1.5 GB, and a memory bandwidth of 177.4 GB/s.

3.5. GNG Maps Applications

The experiments showed how the GNG is able to adapt its topology to represent the input map space. In [19] we showed many applications of the GNG structure for improving 3D data representation and computer vision methods. We showed that the GNG representation partially discards most of the noisy values provided by these RGB-D sensors. In addition, we performed some experiments to show the improvement obtained using the GNG representation on normal estimation using point neighborhoods and on keypoint detection and description.

Normal estimation methods are based on the analysis of the eigenvectors and eigenvalues of a covariance matrix created from the nearest neighbours and are very sensitive to noisy data. Therefore, we computed normals on raw and filtered point clouds in order to demonstrate how a simple 3D processing step like normal or curvature estimation is affected by the presence of noise.

Figure 11 shows how more stable normals are estimated using filtered point clouds produced by the GNG method. Normals are considered more stable when their distribution is smooth and they have fewer abrupt changes in direction. A flat wall with some small changes in its surface was selected to appreciate changes in normal directions. We employed 20,000 neurons and 1,000 λ patterns as configuration parameters for the GNG method in the normal estimation experiment (Figure 11(b)).

In order to test the keypoint detector/descriptor improvement, we performed a transformation estimation algorithm and we compared the GNG results against the voxel grid representation and against the entire source point cloud. We used the available descriptors [18, 20, 21] and detectors [22] provided by the Point Cloud Library [23]. The minimum, median, mean, and maximum of RMS transformation error (with respect to different keypoint detectors) are presented in Table 3. These results show how filtered point clouds using the GNG method generally improved the precision of the estimated transformation. Moreover, the worst estimated transformations (maximum errors) were also slightly improved using the GNG compared to the other techniques.

4. Conclusions

3D maps obtained from RGB-D data are useful for robotics tasks like robot navigation. However, these maps contain a huge amount of data, which must be reduced to process the map properly. In this paper, we have presented a method to represent and reduce 3D maps. Our method is based on a GNG neural network which is adapted to the 3D input space. The experiments carried out demonstrated the validity of our method, as it provided better adaptation than two of the most used methods for these tasks: the voxel grid and the Octree.

As future work, we propose to extend our method to provide a useful map for robot navigation. We also plan to provide the GNG with a way to revert the reduction or compression of the points by storing information in the neighborhood of the neurons (color, point distribution, etc.).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was partially funded by the Spanish Government DPI2013-40534-R grant. Experiments were made possible with a generous donation of hardware from NVIDIA.