Abstract

This paper describes in detail a method for implementing a simultaneous localization and mapping (SLAM) system based on monocular vision, for applications of visual odometry, appearance-based sensing, and emulation of range-bearing measurements. SLAM techniques are required to operate mobile robots in a priori unknown environments, using only on-board sensors to simultaneously build a map of the surroundings, a map the robot needs in order to track its own position. In this context, the 6-DOF (degree of freedom) monocular camera case (monocular SLAM) is possibly the most difficult variant of SLAM: a single camera, moving freely through its environment, is the sole sensory input to the system. The method proposed in this paper is based on a technique called delayed inverse-depth feature initialization, which is intended to initialize new visual features in the system. Detailed formulation, extended discussion, and experiments with real data are presented in order to validate the proposal and to show its performance.

1. Introduction

Estimating the robot position online from measurements of self-mapped features is a class of problem known in the robotics community as the simultaneous localization and mapping (SLAM) problem. The technique consists of incrementally building a consistent map of the environment while, at the same time, localizing the robot within that map as it explores its world. SLAM is perhaps the most fundamental problem to solve in robotics in order to build truly autonomous mobile robots.

Robot sensors have a large impact on the algorithms used in SLAM. Early SLAM approaches focused on the use of range sensors such as sonar rings and lasers; see [1, 2]. Nevertheless, range sensors have some disadvantages in SLAM: correspondence (data association) becomes difficult, they are expensive, some of them are limited to 2D maps, and they are computationally inefficient when the number of features is large (see [3, 4] for a complete survey).

The aforementioned issues have prompted recent work to move towards the use of cameras as the primary sensing modality. Cameras have become more and more interesting for the robotics research community because they yield a lot of valuable information for data association. A wide variety of algorithms from the computer vision research community can be used to extract high-level primitives from images; these primitives are then matched against the primitives stored in the map, thereby enabling data association, although this is still an open problem. A camera is a sensor that yields 3D information. Even for indoor robots whose pose can be represented in 2D, the ability to gather 3D information about the environment is essential. Cameras are also suitable for embedded systems: they are lightweight, cheap, and power saving. Using vision, a robot can locate itself using as landmarks common objects that it encounters along its path.

In this context, the 6-DOF (degrees of freedom) monocular camera case (monocular SLAM) possibly represents the most difficult variant of SLAM; in monocular SLAM, a single camera moving freely through its environment is the sole sensory input to the system. On the other hand, while range sensors (e.g., laser) provide both range and angular information, a camera is a projective sensor that measures only the bearing of image features. Therefore, depth information (range) cannot be obtained in a single frame. This drawback has triggered the emergence of special feature initialization techniques that allow the use of bearing-only sensors (such as cameras) in SLAM systems.

As computational power grows, an inexpensive monocular camera can be used to perform range and appearance-based sensing simultaneously, replacing typical sensors for range measurement (lasers and sonar rings) and for dead reckoning (encoders). Thus, a camera connected to a computer becomes a position sensor which can be applied to different fields such as robotics (motion estimation, e.g., in humanoid robots), wearable robotics (motion estimation for camera-equipped devices worn by humans), tele-presence (head motion estimation using an outward-looking camera), or television (camera motion estimation for live augmented reality) [5].

Monocular SLAM is closely related to the structure-from-motion (SFM) problem of reconstructing scene geometry. SFM techniques, which originally come from the computer vision research community, are sometimes formulated as off-line algorithms that require batch, simultaneous processing of all the images acquired in the sequence. Nevertheless, several recursive solutions to the SFM problem can be found in the literature. One of the first works in this direction was presented by Matthies et al. in [6], although it assumes that the camera motion is known. By contrast, Gennery in [7] proposes a method to estimate motion from a known structure. A method addressing both problems at once, unknown structure as well as unknown motion, was first introduced by Azarbayejani et al. in [8]. Other SFM-based techniques are presented in [9, 10]. Some hybrid techniques (SFM-Kalman filtering) based on stereo vision, as stated in [11], have also appeared.

Monocular SLAM has received much attention in recent years. In [12], Deans and Martial propose a combination of global bundle adjustment (BA) for feature initialization and a Kalman filter for state estimation. Strelow proposes in [13] a similar method, but mixing camera and inertial sensor measurements in an iterated extended Kalman filter (IEKF). In [14], Bailey proposes a variant of constrained initialization for bearing-only SLAM, where past vehicle pose estimates are retained in the SLAM state so that feature initialization can be deferred until the estimates become well-conditioned. Moreover, there are some works that use estimation techniques other than the EKF, such as particle filter (PF) methods. In [15, 16], Kwok and Dissanayake propose a method based on particle filtering techniques. In this case, the initial state of a feature is approximated using a sum of Gaussians, which defines a set of hypotheses for the landmark position. In subsequent steps, bad depth hypotheses are pruned as new observations are made. A similar approach is presented in [17] for bearing-only tracking.

Davison et al. [5, 18] propose a real-time method based on the well-established EKF framework as the main estimation technique. In those works, a Bayesian partial initialization scheme for incorporating new landmarks is used: a separate particle filter, which is not correlated with the rest of the map, estimates the feature depth prior to its inclusion in the map. It is important to note that, prior to Davison, the feasibility of real-time monocular SLAM was first demonstrated by Jin et al. in [19].

In [20], Smith et al. present a real-time method based on straight-line detection using the EKF estimator. Jensfelt et al. in [21] present a method whose main idea is to let the SLAM estimation lag behind n frames; those n frames are used to determine which points are good landmarks and to obtain an estimate of their 3D location. In [22], Solà et al. present a method based on the federated Kalman filtering technique. Starting from an initial probability distribution function (PDF) for the features, a geometric sum of Gaussians is defined. The method is an approximation of the Gaussian sum filter (GSF) that permits undelayed initialization with a simple additive growth of the problem size. In [23], Lemaire et al. present a method similar to [22], but the features are initialized with a delayed method.

In [24], Eade and Drummond propose a FastSLAM-based [25] approach. In that work, the pose of the robot is represented by particles, and a set of Kalman filters refines the estimation of the features. When the inverse-depth uncertainty collapses, the feature is converted into a fully initialized standard Euclidean representation. Montiel et al. in [26] present an approach where the transition from partially to fully initialized features does not need to be explicitly tackled, making it suitable for direct use in the EKF framework for sparse mapping. In this approach, features are initialized in the first frame in which they are observed, with an initial, fixed inverse depth and an uncertainty heuristically determined to cover the range from nearby to infinity; therefore distant points can be coded.

In delayed methods, a feature observed at instant t is added to the map at a later time t+k. In this kind of method, the delay is usually used to collect information that permits initializing the feature in a robust manner. Undelayed methods, on the other hand, take advantage of the observation of the feature from instant t onwards in order to update the system; in return, the system update step must be computed carefully.

In the authors' previous works [27, 28], a monocular SLAM approach was proposed. In this paper, that method, called delayed inverse-depth feature initialization, is used to initialize new features in the system. The method, which is based on the inverse-depth parameterization, defines a single hypothesis for the initial depth of each feature by means of a stochastic technique of triangulation.

This paper presents an extended and detailed version of the monocular SLAM scheme proposed in [27, 28]. Although the fundamental idea of the original method remains the same, some important improvements have been made and are included in this paper. Moreover, the implementation of the proposed method is presented in detail. New experiments with real data and an extended discussion are also included in order to show the performance of the proposed method and its relevance to what is still an open problem in autonomous robotics engineering.

2. Method Description

In this section, the proposed scheme for monocular SLAM is described. Figure 1 shows a block diagram indicating the flow of information and the main sub-processes of the system. All of its components are explained in detail in the subsequent sub-sections. It is important to remark that, once the system is initialized, the detection and measurement of image features represent the sole sensory input to the system.

2.1. System Parameterization

The complete system state $\hat{x}$ consists of:

$$\hat{x} = \begin{bmatrix} \hat{x}_v & \hat{y}_1 & \hat{y}_2 & \cdots & \hat{y}_n \end{bmatrix}^T, \quad (2.1)$$

where $\hat{x}_v$ represents the state of a free camera moving in any direction in $\mathbb{R}^3 \times SO(3)$, and $\hat{y}_i$ represents a feature point $i$ in the 3D scene. In turn, $\hat{x}_v$ is composed of:

$$\hat{x}_v = \begin{bmatrix} r^{WC} & q^{WC} & v^W & \omega^W \end{bmatrix}^T, \quad (2.2)$$

where

$$r^{WC} = \begin{bmatrix} x_v & y_v & z_v \end{bmatrix}^T \quad (2.3)$$

represents the camera optical center position in Cartesian coordinates, and

$$q^{WC} = \begin{bmatrix} q_1 & q_2 & q_3 & q_4 \end{bmatrix}^T \quad (2.4)$$

represents the orientation of the camera with respect to the navigation frame by a unit quaternion. Unit quaternions provide a convenient mathematical notation for representing orientations and rotations of objects in three dimensions. Compared with Euler angles, they are simpler to compose and avoid the problem of gimbal lock. Compared with rotation matrices, they are numerically more stable and in many cases more efficient. A good review of attitude representations is given in [29]. The vectors

$$v^W = \begin{bmatrix} v_x & v_y & v_z \end{bmatrix}^T, \qquad \omega^W = \begin{bmatrix} \omega_x & \omega_y & \omega_z \end{bmatrix}^T \quad (2.5)$$

denote linear and angular velocities, respectively. The superscripts $W$ and $WC$ denote magnitudes expressed in the world reference frame and in the camera reference frame, respectively.

A feature $\hat{y}_i$ is composed of the following 6-dimensional state vector:

$$\hat{y}_i = \begin{bmatrix} x_i & y_i & z_i & \theta_i & \phi_i & \rho_i \end{bmatrix}^T, \quad (2.6)$$

which models the 3D point located at:

$$\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + \frac{1}{\rho_i} m\left(\theta_i, \phi_i\right), \quad (2.7)$$

where $x_i$, $y_i$, $z_i$ correspond to the coordinates of the camera optical center when the feature was observed for the very first time, and $\theta_i$, $\phi_i$ represent the azimuth and the elevation, respectively (with respect to the world reference frame $W$), of the directional unit vector $m(\theta_i, \phi_i)$:

$$m\left(\theta_i, \phi_i\right) = \begin{bmatrix} \cos\phi_i \sin\theta_i & -\sin\phi_i & \cos\phi_i \cos\theta_i \end{bmatrix}^T. \quad (2.8)$$

The point depth $r_i$ is coded by its inverse value, $\rho_i = 1/r_i$, as proposed in [26]. Figure 2 illustrates the camera and feature parameterization.
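As an illustration of this parameterization, the following minimal sketch (Python is used purely for illustration; it is not the authors' MATLAB implementation) recovers the Euclidean point of (2.7) from a feature vector. The sign convention of $m(\theta,\phi)$ follows the one assumed in (2.8) above and should be checked against the implementation at hand.

```python
# Minimal sketch: recovering the Euclidean 3D point encoded by the
# 6-dimensional inverse-depth feature of (2.6)-(2.8).
import numpy as np

def m(theta, phi):
    # Directional unit vector of (2.8), azimuth theta and elevation phi,
    # expressed in the world frame W (sign convention assumed here).
    return np.array([np.cos(phi) * np.sin(theta),
                     -np.sin(phi),
                     np.cos(phi) * np.cos(theta)])

def feature_to_point(y):
    # y = [x_i, y_i, z_i, theta_i, phi_i, rho_i] as in (2.6);
    # the 3D point is the anchor plus the ray scaled by the depth 1/rho_i (2.7).
    x_i, y_i, z_i, theta_i, phi_i, rho_i = y
    return np.array([x_i, y_i, z_i]) + (1.0 / rho_i) * m(theta_i, phi_i)
```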

2.2. System Initialization

In a robotics context, obtaining the metric scale of the world can be very useful. However, in monocular SLAM the scale of the observed world cannot be obtained using vision alone; therefore, another sensor, or the observation of a reference of known dimensions, has to be used in order to retrieve the scale of the world.

In this work, the metric initialization of the system is analogous to a classical problem in computer vision: the perspective-of-n-points (PnP) problem [30]. The PnP problem consists of finding the position and orientation of a camera with respect to a scene object from n corresponding points. In [30] it is demonstrated that, when the control points are coplanar, the perspective-of-4-points (P4P) problem has a unique solution.

Therefore, in the present work it is assumed that 4 coplanar points are known (for instance, the corners of a black paper sheet of known dimensions over a white background). It is also assumed that the intrinsic parameters of the camera are already known: (i) $f$ (focal length); (ii) $i_0$, $j_0$ (displacement of the image center); and (iii) $k_1, \ldots, k_n$ (radial lens distortion).

The problem consists of estimating two extrinsic camera parameters: $R^{CW}$ (the world-to-camera rotation matrix encoding the camera orientation) and $t$ (the translation vector locating the camera center), given 4 coplanar points with spatial coordinates $(x_i, y_i, 0)$, with $i \in [1..4]$, and their corresponding 4 undistorted image coordinates $(i_i, j_i)$, as shown in Figure 3.

In order to estimate $R^{CW}$ and $t$, the method is based on the approach proposed in [31]. The following system of linear equations, with the vector $b$ as the unknown, is formed:

$$
\begin{bmatrix}
x_1 f & y_1 f & 0 & 0 & -i_1 x_1 & -i_1 y_1 & f & 0 \\
0 & 0 & x_1 f & y_1 f & -j_1 x_1 & -j_1 y_1 & 0 & f \\
x_2 f & y_2 f & 0 & 0 & -i_2 x_2 & -i_2 y_2 & f & 0 \\
0 & 0 & x_2 f & y_2 f & -j_2 x_2 & -j_2 y_2 & 0 & f \\
x_3 f & y_3 f & 0 & 0 & -i_3 x_3 & -i_3 y_3 & f & 0 \\
0 & 0 & x_3 f & y_3 f & -j_3 x_3 & -j_3 y_3 & 0 & f \\
x_4 f & y_4 f & 0 & 0 & -i_4 x_4 & -i_4 y_4 & f & 0 \\
0 & 0 & x_4 f & y_4 f & -j_4 x_4 & -j_4 y_4 & 0 & f
\end{bmatrix} b =
\begin{bmatrix} i_1 \\ j_1 \\ i_2 \\ j_2 \\ i_3 \\ j_3 \\ i_4 \\ j_4 \end{bmatrix}, \quad (2.9)
$$

where

$$
b = \begin{bmatrix} \dfrac{r_{11}}{t_3} & \dfrac{r_{12}}{t_3} & \dfrac{r_{21}}{t_3} & \dfrac{r_{22}}{t_3} & \dfrac{r_{31}}{t_3} & \dfrac{r_{32}}{t_3} & \dfrac{t_1}{t_3} & \dfrac{t_2}{t_3} \end{bmatrix}^T. \quad (2.10)
$$

The linear system in (2.9) is solved for $b = [b_1\ b_2\ b_3\ b_4\ b_5\ b_6\ b_7\ b_8]^T$. Since the first column of $R^{CW}$, namely $t_3[b_1\ b_3\ b_5]^T$, must have unit norm, $t_3$ is obtained as follows:

$$
t_3 = \frac{1}{\sqrt{b_1^2 + b_3^2 + b_5^2}}. \quad (2.11)
$$

The extrinsic parameters $R^{CW}$ and $t$ are obtained using the following equations:

$$
R^{CW} = \begin{bmatrix}
t_3 b_1 & t_3 b_2 & R_{21} R_{32} - R_{31} R_{22} \\
t_3 b_3 & t_3 b_4 & R_{31} R_{12} - R_{11} R_{32} \\
t_3 b_5 & t_3 b_6 & R_{11} R_{22} - R_{21} R_{12}
\end{bmatrix}, \quad (2.12)
$$

$$
t = \begin{bmatrix} t_3 b_7 & t_3 b_8 & t_3 \end{bmatrix}^T. \quad (2.13)
$$

In (2.12), the third column of the matrix $R^{CW}$ is formed as the cross product of its first and second columns. A good review of algorithms for coplanar camera calibration can be found in [29].
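Under the assumption that the image coordinates $(i_i, j_i)$ are already centered and undistorted, the linear solution (2.9)-(2.13) can be sketched as follows; a least-squares solve is used, which reduces to the exact solution when only four points are given. This is an illustrative sketch, not the authors' implementation.

```python
# Minimal sketch of the coplanar P4P extrinsic estimation (2.9)-(2.13).
import numpy as np

def extrinsics_from_coplanar_points(pts_plane, pts_image, f):
    # pts_plane: 4x2 array of (x_i, y_i) on the z = 0 reference plane.
    # pts_image: 4x2 array of centered, undistorted image coords (i_i, j_i).
    A, c = [], []
    for (x, y), (i, j) in zip(pts_plane, pts_image):
        A.append([x * f, y * f, 0, 0, -i * x, -i * y, f, 0])
        A.append([0, 0, x * f, y * f, -j * x, -j * y, 0, f])
        c.extend([i, j])
    b = np.linalg.lstsq(np.asarray(A, float), np.asarray(c, float), rcond=None)[0]

    # Scale factor (2.11): the first column of R^CW, t3*[b1, b3, b5], has unit norm.
    t3 = 1.0 / np.sqrt(b[0]**2 + b[2]**2 + b[4]**2)

    # Rotation (2.12): first two columns from b, third column as their cross product.
    c1 = t3 * np.array([b[0], b[2], b[4]])
    c2 = t3 * np.array([b[1], b[3], b[5]])
    R_cw = np.column_stack((c1, c2, np.cross(c1, c2)))

    # Translation (2.13).
    t = np.array([t3 * b[6], t3 * b[7], t3])
    return R_cw, t
```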

At the beginning of the iterative process, the system state $\hat{x}_{ini}$ is formed by the camera state $\hat{x}_v$ and the four initial features used for estimating the extrinsic camera parameters:

$$\hat{x}_{ini} = \begin{bmatrix} r^{WC}_{ini} & q^{WC}_{ini} & v^W_{ini} & \omega^W_{ini} & \hat{y}_1 & \hat{y}_2 & \hat{y}_3 & \hat{y}_4 \end{bmatrix}^T, \quad (2.14)$$

where

$$r^{WC}_{ini} = t, \qquad q^{WC}_{ini} = q\left(\left(R^{CW}\right)^T\right), \qquad v^W_{ini} = 0_{3\times1}, \qquad \omega^W_{ini} = 0_{3\times1}. \quad (2.15)$$

$r^{WC}_{ini}$ takes the value of the extrinsic parameter $t$ calculated by (2.13). $q^{WC}_{ini}$ is estimated using the rotation-matrix-to-quaternion transformation $q(R)$, described by (A.2) in the Appendix, applied to the transpose of the rotation matrix $(R^{CW})^T$ obtained by (2.12).

Each initial feature $\hat{y}_i$, with $i \in [1..4]$, is composed of:

$$\hat{y}_i = \begin{bmatrix} r^{WC}_{ini} & \operatorname{atan2}\!\left(g_1, g_3\right) & \operatorname{atan2}\!\left(-g_2, \sqrt{g_1^2 + g_3^2}\right) & \dfrac{1}{\left\| r^{WC}_{ini} \right\|} \end{bmatrix}^T, \quad (2.16)$$

where

$$\begin{bmatrix} g_1 \\ g_2 \\ g_3 \end{bmatrix} = \begin{bmatrix} x_i \\ y_i \\ 0 \end{bmatrix} - r^{WC}_{ini}, \quad (2.17)$$

and atan2 is the two-argument function that computes the arctangent of $y/x$ for given values of $y$ and $x$, within the range $[-\pi, \pi]$. If certain confidence is assumed in the estimation of the initial system state $\hat{x}_{ini}$, then the initial covariance matrix of the system $P_{ini}$ can be filled with zeros or with $\varepsilon$, an arbitrarily small positive value:

$$P_{ini} = \varepsilon_{37\times37}, \quad (2.18)$$

where the dimension 37 corresponds to the 13 camera states plus four 6-dimensional features.
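A minimal sketch of how the initial state (2.14)-(2.18) could be assembled from the P4P output follows; `q_from_R` is assumed to implement the transformation (A.2), and the elevation sign follows the convention assumed for (2.8).

```python
# Minimal sketch of the initial state (2.14)-(2.18), given R^CW and t from the
# P4P step and the planar coordinates (x_i, y_i, 0) of the four reference points.
import numpy as np

def initial_state(R_cw, t, pts_plane, q_from_R, eps=1e-10):
    r_ini = t                                  # r^WC_ini = t            (2.15)
    q_ini = q_from_R(R_cw.T)                   # q^WC_ini = q((R^CW)^T)  (2.15)
    x = [r_ini, q_ini, np.zeros(3), np.zeros(3)]
    for (xi, yi) in pts_plane:
        g = np.array([xi, yi, 0.0]) - r_ini                      # (2.17)
        theta = np.arctan2(g[0], g[2])                           # azimuth
        phi = np.arctan2(-g[1], np.hypot(g[0], g[2]))            # elevation
        rho = 1.0 / np.linalg.norm(r_ini)                        # (2.16)
        x.append(np.concatenate([r_ini, [theta, phi, rho]]))
    x_ini = np.concatenate(x)                  # 13 + 4*6 = 37 states
    P_ini = eps * np.eye(x_ini.size)           # (2.18)
    return x_ini, P_ini
```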

2.3. Camera Motion Model and Filter Prediction Equations

An unconstrained, constant-velocity prediction model for the camera motion, perturbed by unknown accelerations, can be defined by the following equation [5, 32]:

$$f_v = \begin{bmatrix} r^{WC}_{k+1} \\ q^{WC}_{k+1} \\ v^W_{k+1} \\ \omega^W_{k+1} \end{bmatrix} = \begin{bmatrix} r^{WC}_k + \left(v^W_k + V^W_k\right)\Delta t \\ q^{WC}_k \times q\left(\left(\omega^W_k + \Omega^W\right)\Delta t\right) \\ v^W_k + V^W \\ \omega^W_k + \Omega^W \end{bmatrix}, \quad (2.19)$$

where $q\big((\omega^W_k + \Omega^W)\Delta t\big)$ is the quaternion computed from the rotation vector $(\omega^W_k + \Omega^W)\Delta t$. This transformation from rotation vector to quaternion is defined by (A.1) in the Appendix.

No prior assumption can be made about the camera movement. Therefore, the model defined in (2.19) supposes, at every step $k$, that the changes in linear and angular velocity $v^W$ and $\omega^W$ are produced by an input $u$ of unknown linear and angular velocity $V^W$ and $\Omega^W$, arising from zero-mean Gaussian linear and angular accelerations $a^W$ and $\alpha^W$ with known covariance:

$$u = \begin{bmatrix} V^W \\ \Omega^W \end{bmatrix} = \begin{bmatrix} a^W \Delta t \\ \alpha^W \Delta t \end{bmatrix}. \quad (2.20)$$

The unknown linear and angular velocities $V^W$ and $\Omega^W$ are incorporated into the system through the process noise covariance matrix $U$:

$$U = \begin{bmatrix} \sigma_V^2 \Delta t^2 I_{3\times3} & 0_{3\times3} \\ 0_{3\times3} & \sigma_\Omega^2 \Delta t^2 I_{3\times3} \end{bmatrix}. \quad (2.21)$$

If a static scene is assumed, where the landmarks remain in a fixed place, the prediction step of the extended Kalman filter (EKF) is defined as follows:

$$\hat{x}_{k+1} = \begin{bmatrix} f_v\left(\hat{x}_v\right) \\ \hat{y}_1 \\ \vdots \\ \hat{y}_n \end{bmatrix}, \quad (2.22)$$

$$P_{k+1} = F_x P_k F_x^T + F_u Q F_u^T, \quad (2.23)$$

where $Q$ and the Jacobians $F_x$ and $F_u$ are defined as follows:

$$F_x = \begin{bmatrix} \dfrac{\partial f_v}{\partial \hat{x}_v} & 0_{13\times n} \\ 0_{n\times 13} & I_{n\times n} \end{bmatrix}, \qquad F_u = \begin{bmatrix} \dfrac{\partial f_v}{\partial u} & 0_{13\times n} \\ 0_{n\times 6} & 0_{n\times n} \end{bmatrix}, \qquad Q = \begin{bmatrix} U & 0_{6\times n} \\ 0_{n\times 6} & 0_{n\times n} \end{bmatrix}. \quad (2.24)$$

$\partial f_v / \partial \hat{x}_v$ contains the derivatives of the prediction model $f_v$ with respect to the camera state $\hat{x}_v$, and $\partial f_v / \partial u$ contains the derivatives of $f_v$ with respect to the parameters of the covariance matrix $U$. Here $n$ corresponds to the total dimension of the features $\hat{y}_i$ contained in the system state $\hat{x}$.
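A minimal sketch of this prediction step follows. The Jacobians $\partial f_v/\partial \hat{x}_v$ and $\partial f_v/\partial u$ are assumed to be computed elsewhere and passed in, and `q_from_rotvec` (A.1) and `quat_mult` (the quaternion product denoted by $\times$ in (2.19)) are assumed helpers.

```python
# Minimal sketch of the EKF prediction step (2.19)-(2.24).
import numpy as np

def predict(x, P, dfv_dxv, dfv_du, sigma_V, sigma_Omega, dt,
            q_from_rotvec, quat_mult):
    # Camera part of the state: r (3), q (4), v (3), w (3).
    r, q, v, w = x[0:3], x[3:7], x[7:10], x[10:13]

    # Constant-velocity prediction (2.19); the noise has zero mean, so the
    # predicted velocities are left unchanged.
    x_new = x.copy()
    x_new[0:3] = r + v * dt
    x_new[3:7] = quat_mult(q, q_from_rotvec(w * dt))

    # Process noise covariance U of (2.21).
    U = np.diag(np.r_[np.full(3, (sigma_V * dt) ** 2),
                      np.full(3, (sigma_Omega * dt) ** 2)])

    # F_x of (2.24): identity over the features, d f_v / d x_v over the camera.
    Fx = np.eye(x.size)
    Fx[0:13, 0:13] = dfv_dxv              # 13x13 camera-state Jacobian

    # F_u Q F_u^T has only the camera block nonzero: (d f_v / d u) U (.)^T.
    FuQFu = np.zeros_like(P)
    FuQFu[0:13, 0:13] = dfv_du @ U @ dfv_du.T   # d f_v / d u is 13x6

    # Covariance prediction (2.23).
    P_new = Fx @ P @ Fx.T + FuQFu
    return x_new, P_new
```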

2.4. Measurement Prediction Model

The camera location, together with the locations of the features already mapped, is used by the measurement prediction model $h_i$ to forecast the distorted pixel coordinates $u_d$ and $v_d$ of each feature $\hat{y}_i$:

$$\begin{bmatrix} u_d \\ v_d \end{bmatrix} = h_i\left(\hat{x}_v, \hat{y}_i\right). \quad (2.25)$$

$h_i(\hat{x}_v, \hat{y}_i)$ is derived using the following procedure. The observation of a point $\hat{y}_i$ from a camera location defines a ray expressed in the camera frame as follows (see Figure 2):

$$h^C = \begin{bmatrix} h_x \\ h_y \\ h_z \end{bmatrix} = R^{CW}\left( \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + \frac{1}{\rho_i} m\left(\theta_i, \phi_i\right) - r^{WC} \right). \quad (2.26)$$

$h^C$ is observed by the camera through its projection in the image. $R^{CW}$ is the transformation matrix from the global reference frame to the camera reference frame; it is computed as $R^{CW} = \big(R^{WC}(q^{WC})\big)^T$, where $R^{WC}(q^{WC})$ is the rotation matrix depending on the camera orientation quaternion $q^{WC}$, obtained with (A.3), which computes a rotation matrix from a quaternion. The directional unit vector $m(\theta_i, \phi_i)$ is computed with (2.8).

The vector $h^C$ is projected onto the normalized camera retina as follows:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \dfrac{h_x}{h_z} \\ \dfrac{h_y}{h_z} \end{bmatrix}. \quad (2.27)$$

The camera calibration model is applied to produce the undistorted pixel coordinates $u_u$ and $v_u$:

$$\begin{bmatrix} u_u \\ v_u \end{bmatrix} = \begin{bmatrix} u_0 - f u \\ v_0 - f v \end{bmatrix}, \quad (2.28)$$

where $(u_0, v_0)$ is the camera center in pixels and $f$ is the focal length. Finally, a radial distortion model is applied to obtain the distorted pixel coordinates $u_d$, $v_d$. In [18] the following distortion model is proposed, $K_1$ being the distortion coefficient:

$$\begin{bmatrix} u_d \\ v_d \end{bmatrix} = \begin{bmatrix} \dfrac{u_u - u_0}{1 + 2 K_1 r^2} + u_0 \\ \dfrac{v_u - v_0}{1 + 2 K_1 r^2} + v_0 \end{bmatrix}, \qquad r = \sqrt{\left(u_u - u_0\right)^2 + \left(v_u - v_0\right)^2}. \quad (2.29)$$
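A minimal sketch of $h_i$ under these equations follows; `R_from_q` is assumed to implement (A.3), `m` the unit vector of (2.8), and the distortion factor follows the reconstruction of (2.29) given above.

```python
# Minimal sketch of the measurement prediction model h_i of (2.25)-(2.29).
import numpy as np

def predict_measurement(x_v, y_i, f, u0, v0, K1, R_from_q, m):
    r_wc, q_wc = x_v[0:3], x_v[3:7]
    x_i, y_ic, z_i, theta_i, phi_i, rho_i = y_i

    # Ray from the camera to the feature, expressed in the camera frame (2.26).
    R_cw = R_from_q(q_wc).T
    h_c = R_cw @ (np.array([x_i, y_ic, z_i]) + m(theta_i, phi_i) / rho_i - r_wc)

    # Projection to the normalized retina (2.27) and pinhole model (2.28).
    u, v = h_c[0] / h_c[2], h_c[1] / h_c[2]
    u_u, v_u = u0 - f * u, v0 - f * v

    # Radial distortion (2.29).
    r2 = (u_u - u0) ** 2 + (v_u - v0) ** 2
    u_d = (u_u - u0) / (1.0 + 2.0 * K1 * r2) + u0
    v_d = (v_u - v0) / (1.0 + 2.0 * K1 * r2) + v0
    return np.array([u_d, v_d])
```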

2.5. Features Matching and Filter Update Equations

If the feature $\hat{y}_i$ is predicted to appear in the image (i.e., the predicted coordinates $u_d$, $v_d$ fall within the image boundaries), then an active search technique [33] can be used to constrain the feature matching from frame to frame inside elliptical regions around the predicted measurement $h_i = [u_d, v_d]^T$. The elliptical regions are defined by the innovation covariance matrix $S_i$:

$$S_i = H_i P_{k+1} H_i^T + R_{uv}. \quad (2.30)$$

Measurement noise is incorporated into the system through the noise covariance matrix $R_{uv}$:

$$R_{uv} = \begin{bmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_v^2 \end{bmatrix}. \quad (2.31)$$

The variances $\sigma_u^2$ and $\sigma_v^2$ are defined in pixel units. $P_{k+1}$ is the prior state covariance estimated by (2.23). $H_i$ is the Jacobian of the measurement model $h_i$:

$$H_i = \begin{bmatrix} \dfrac{\partial h_i}{\partial \hat{x}_v} & 0_{2\times6} & \cdots & \dfrac{\partial h_i}{\partial \hat{y}_i} & \cdots & 0_{2\times6} \end{bmatrix}, \quad (2.32)$$

where $\partial h_i / \partial \hat{x}_v$ contains the derivatives of the measurement prediction model $h_i$ with respect to the camera state $\hat{x}_v$, and $\partial h_i / \partial \hat{y}_i$ contains the derivatives of $h_i$ with respect to the parameters of the feature $\hat{y}_i$. Note that, apart from the camera block, $H_i$ is nonzero only at the location (indexes) of the observed feature $\hat{y}_i$.

The sizes of the axes $s_x$, $s_y$ of the elliptical search region defined by the matrix $S_i$ are determined by (see Figure 4):

$$\begin{bmatrix} s_x \\ s_y \end{bmatrix} = \begin{bmatrix} 2n\sqrt{S_i(1,1)} \\ 2n\sqrt{S_i(2,2)} \end{bmatrix}, \quad (2.33)$$

where $n$ is the number of standard deviations for the desired region of search.
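A minimal sketch of this computation follows; the square roots turn the innovation variances into axis lengths in pixels, and the default of three standard deviations is an assumption of the sketch.

```python
# Minimal sketch of the active-search region of (2.30)-(2.33): the innovation
# covariance S_i defines an elliptical window around the predicted pixel h_i.
import numpy as np

def search_region(H_i, P_pred, sigma_u, sigma_v, n_std=3.0):
    R_uv = np.diag([sigma_u ** 2, sigma_v ** 2])        # (2.31)
    S_i = H_i @ P_pred @ H_i.T + R_uv                   # (2.30)
    s_x = 2.0 * n_std * np.sqrt(S_i[0, 0])              # (2.33)
    s_y = 2.0 * n_std * np.sqrt(S_i[1, 1])
    return S_i, (s_x, s_y)
```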

When a feature $\hat{y}_i$ is added to the map, a unique image patch of $n$-by-$n$ pixels is stored and linked to the feature. To match a feature in the current image frame, a patch cross-correlation technique [33] is applied over all the image locations $(u_d, v_d)$ within the search region defined by (2.33). If the score of a pixel location $(u_d, v_d)$, determined by the best cross-correlation between the feature patch and the patches extracted from the region of search, is higher than a threshold, then this pixel location $(u_d, v_d)$ is considered as the current feature measurement $z_i$.

If the matching process is successful, the filter is updated as follows:

$$\hat{x}_k = \hat{x}_{k+1} + W g, \qquad P_k = P_{k+1} - W S_i W^T, \quad (2.34)$$

where the innovation $g$ is:

$$g = z_i - h_i, \quad (2.35)$$

and $W$ is the Kalman gain:

$$W = P_{k+1} H_i^T S_i^{-1}. \quad (2.36)$$
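A minimal sketch of this update step:

```python
# Minimal sketch of the EKF update (2.34)-(2.36) after a successful match z_i.
import numpy as np

def update(x_pred, P_pred, z_i, h_i, H_i, S_i):
    g = z_i - h_i                                       # innovation (2.35)
    W = P_pred @ H_i.T @ np.linalg.inv(S_i)             # Kalman gain (2.36)
    x_new = x_pred + W @ g                              # (2.34)
    P_new = P_pred - W @ S_i @ W.T
    return x_new, P_new
```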

In SLAM, the harmful effect of incorrect or incompatible matches is well known. In monocular SLAM systems, when delayed feature initialization techniques are used (as in this work), some weak image features are implicitly pruned prior to their addition to the stochastic map, for example, image features produced by fast lighting changes, by shine on highly reflective surfaces, or even by dynamic elements in the scene. Nevertheless, the risk of incorrect or incompatible matches remains, due to several factors: (i) incompatibility due to repeated design; (ii) fake composite landmarks; (iii) incompatibilities produced by reflections on curved surfaces and materials; (iv) detection of landmarks running along edges. The above drawbacks can be mitigated by the use of batch validation techniques [34, 35] (or variants of them), at the cost of increased computational time.

On the other hand, for the standard EKF-SLAM algorithm there is significant empirical evidence that the computed state estimates tend to be inconsistent. If the covariance matrix is underestimated, the search regions determined by the active search technique could even collapse, causing the tracking to fail. A feasible solution to this problem is to fix a minimum allowable size for the axes $s_x$, $s_y$ in (2.33).

2.6. Delayed Inverse-Depth Feature Initialization

As stated before, depth information cannot be obtained in a single measurement when bearing sensors (e.g., a single camera) are used. To infer the depth of a feature, the sensor must observe this feature repeatedly as it moves freely through its environment, estimating the angle from the feature to the sensor center. The difference between those angle measurements is the feature parallax. Parallax is in fact the key that allows estimating feature depth. For indoor sequences, a displacement of centimeters is enough to produce parallax; on the other hand, the more distant the feature, the more the sensor has to travel to produce parallax. Hence, special feature initialization techniques are needed for incorporating new features into the map when bearing-only sensors are used in SLAM systems.

In this work, the authors propose a novel method called delayed inverse-depth feature initialization in order to incorporate new features $\hat{y}_{new}$ into the stochastic map from the bearing measurements of the camera. The proposed method uses the inverse-depth parameterization and implements a stochastic technique of triangulation, by means of a delay, in order to define a hypothesis of initial depth for each feature. The use of the inverse-depth parameterization, in the context of motion estimation and scene reconstruction using vision, was introduced earlier by Favaro in [36] and extended later in [24, 26].

The proposed method requires that a minimum number of features $\hat{y}_i$ be predicted to appear in the image; otherwise, new features should be added to the map. In this latter case, new points are detected in the image with a saliency operator. In this particular implementation the Harris corner detector is applied, but other detectors could also be used.

These image points, which are not yet added to the map, are called candidate points $\lambda_i$. Only image areas free of both candidate points $\lambda_i$ and mapped features $\hat{y}_i$ are considered for detecting new points with the saliency operator; see Figure 5.

When a candidate point $\lambda_i$ is detected for the first time, the following data are stored:

$$\lambda_i = \begin{bmatrix} r_\lambda & \sigma_r & q_\lambda & \sigma_q & u_1 & v_1 \end{bmatrix}, \quad (2.37)$$

where $r_\lambda$ is taken from the current camera optical center position $r^{WC}$:

$$r_\lambda = \begin{bmatrix} x_1 & y_1 & z_1 \end{bmatrix} = \begin{bmatrix} x_v & y_v & z_v \end{bmatrix}, \quad (2.38)$$

$\sigma_r$ contains the associated variances of $r^{WC}$, taken from the state covariance matrix $P$:

$$\sigma_r = \begin{bmatrix} \sigma_{x_1} & \sigma_{y_1} & \sigma_{z_1} \end{bmatrix} = \begin{bmatrix} P(1,1) & P(2,2) & P(3,3) \end{bmatrix}, \quad (2.39)$$

$q_\lambda$ is taken from the quaternion $q^{WC}$ representing the current camera orientation:

$$q_\lambda = q^{WC}, \quad (2.40)$$

$\sigma_q$ contains the associated variances of $q^{WC}$, taken from the state covariance matrix $P$:

$$\sigma_q = \begin{bmatrix} \sigma_{1q_1} & \sigma_{1q_2} & \sigma_{1q_3} & \sigma_{1q_4} \end{bmatrix} = \begin{bmatrix} P(4,4) & P(5,5) & P(6,6) & P(7,7) \end{bmatrix}, \quad (2.41)$$

and $u_1$, $v_1$ are the initial pixel coordinates where the candidate point $\lambda_i$ was detected for the first time. In subsequent frames, each candidate point $\lambda_i$ is tracked (the approach is independent of the tracking method used) in order to keep the pixel coordinates $u$ and $v$ of the candidate point updated. In practice, not all the candidate points $\lambda_i$ can be tracked; nevertheless, the method takes advantage of this fact in order to prune weak image features. Each candidate point $\lambda_i$ is tracked until it is either pruned or initialized in the system as a new feature $\hat{y}_{new}$. Thus, at each frame, candidate points $\lambda_i$ may be detected, pruned, or considered for addition to the map.

A candidate point $\lambda_i$ is added as a new feature $\hat{y}_{new}$ when a minimum parallax threshold $\alpha_{min}$ is reached. The parallax $\alpha$ is estimated using (as can be seen in Figure 6): (i) the displacement base-line $b$; (ii) $\lambda_i$ and its associated data; (iii) the current camera state $\hat{x}_v$ and the current measurement $z_i$.

For each candidate point $\lambda_i$, when a new measurement $z_i$ is available, the parallax angle $\alpha$ is estimated by:

$$\alpha = \pi - (\beta + \gamma). \quad (2.42)$$

The angle $\beta$ is determined by the directional projection ray vector $h_1$ and the vector $b_1$ defining the base-line $b$ in the direction of the camera trajectory:

$$\beta = \cos^{-1}\left(\frac{h_1 \cdot b_1}{\|h_1\| \|b_1\|}\right), \quad (2.43)$$

where the directional projection ray vector $h_1$, expressed in the absolute frame $W$, is computed from the camera position and the coordinates of the observed point when it was seen for the first time, using the data stored in $\lambda_i$:

$$h_1 = R^{WC}\left(q_\lambda\right) h^C_1\!\left(u_{1u}, v_{1u}\right). \quad (2.44)$$

$R^{WC}(q_\lambda)$ is the transformation matrix from the camera reference frame to the global reference frame, depending on the stored quaternion $q_\lambda$; (A.3) is used for computing a rotation matrix from a quaternion.

$h^C_1(u_{1u}, v_{1u}) = [h_{1x}\ h_{1y}\ h_{1z}]$ is the directional vector, in the camera frame $C$, pointing from the position where the candidate point $\lambda_i$ was observed for the first time towards the feature location, and it can be computed as follows:

$$h^C_1\!\left(u_{1u}, v_{1u}\right) = \begin{bmatrix} \dfrac{u_0 - u_{1u}}{f} & \dfrac{v_0 - v_{1u}}{f} & 1 \end{bmatrix}^T. \quad (2.45)$$

The undistorted pixel coordinates $u_{1u}$ and $v_{1u}$ are obtained from $u_1$, $v_1$ by applying the inverse of the distortion model in (2.29):

$$\begin{bmatrix} u_{1u} \\ v_{1u} \end{bmatrix} = \begin{bmatrix} \dfrac{u_1 - u_0}{1 - 2 K_1 r^2} + u_0 \\ \dfrac{v_1 - v_0}{1 - 2 K_1 r^2} + v_0 \end{bmatrix}, \qquad r = \sqrt{\left(u_1 - u_0\right)^2 + \left(v_1 - v_0\right)^2}. \quad (2.46)$$

$b_1 = [b_{1x}\ b_{1y}\ b_{1z}]$ is the vector representing the camera base-line $b$ between the camera optical center position $[x_1\ y_1\ z_1]$, where the point was observed for the first time, and the current camera optical center $r^{WC} = [x_v\ y_v\ z_v]$, taken from the current camera state $\hat{x}_v$:

$$b_1 = \begin{bmatrix} x_v - x_1 & y_v - y_1 & z_v - z_1 \end{bmatrix}. \quad (2.47)$$

The angle $\gamma$ is determined in a similar way as $\beta$, but using instead the directional projection ray vector $h_2$ and the vector $b_2$:

$$\gamma = \cos^{-1}\left(\frac{h_2 \cdot b_2}{\|h_2\| \|b_2\|}\right). \quad (2.48)$$

The directional projection ray vector $h_2$, expressed in the absolute frame $W$, is computed in a similar way as $h_1$, but using the current camera position from $\hat{x}_v$ and, therefore, the current undistorted point coordinates $u_u$, $v_u$:

$$h_2 = R^{WC}\left(q^{WC}\right) h^C_2\!\left(u_u, v_u\right), \quad (2.49)$$

where $R^{WC}(q^{WC})$ is the transformation matrix from the camera reference frame to the global reference frame, depending on the current quaternion $q^{WC}$. Equation (A.3) is used for computing a rotation matrix from a quaternion.

$h^C_2(u_u, v_u) = [h_{2x}\ h_{2y}\ h_{2z}]$ is the directional vector, in the camera frame $C$, pointing from the current camera position towards the feature location, and it can be computed as follows:

$$h^C_2\!\left(u_u, v_u\right) = \begin{bmatrix} \dfrac{u_0 - u_u}{f} & \dfrac{v_0 - v_u}{f} & 1 \end{bmatrix}^T. \quad (2.50)$$

The undistorted pixel coordinates $u_u$ and $v_u$ are obtained from the current pixel coordinates $u$, $v$ by applying the inverse of the distortion model in (2.29):

$$\begin{bmatrix} u_u \\ v_u \end{bmatrix} = \begin{bmatrix} \dfrac{u - u_0}{1 - 2 K_1 r^2} + u_0 \\ \dfrac{v - v_0}{1 - 2 K_1 r^2} + v_0 \end{bmatrix}, \qquad r = \sqrt{\left(u - u_0\right)^2 + \left(v - v_0\right)^2}. \quad (2.51)$$

$b_2 = [b_{2x}\ b_{2y}\ b_{2z}]$ is similar to $b_1$ but points in the opposite direction:

$$b_2 = \begin{bmatrix} x_1 - x_v & y_1 - y_v & z_1 - z_v \end{bmatrix}. \quad (2.52)$$

The base-line $b$ is the modulus of $b_1$ or $b_2$:

$$b = \|b_1\| = \|b_2\|. \quad (2.53)$$

If $\alpha > \alpha_{min}$, then $\lambda_i$ is initialized as a new map feature $\hat{y}_{new}$. The feature $\hat{y}_{new}$ is made up of:

$$\hat{y}_{new} = g\left(\hat{x}, \lambda_i, z_i\right) = \begin{bmatrix} x_i & y_i & z_i & \theta_i & \phi_i & \rho_i \end{bmatrix}^T. \quad (2.54)$$

The first three elements of $\hat{y}_{new}$ are taken directly from the current camera optical center position $r^{WC} = [x_v\ y_v\ z_v]$ of $\hat{x}_v$, and the angles are obtained from $h_2$:

$$\begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} = \begin{bmatrix} x_v \\ y_v \\ z_v \end{bmatrix}, \qquad \begin{bmatrix} \theta_i \\ \phi_i \end{bmatrix} = \begin{bmatrix} \operatorname{atan2}\!\left(h_{2x}, h_{2z}\right) \\ \operatorname{atan2}\!\left(-h_{2y}, \sqrt{h_{2x}^2 + h_{2z}^2}\right) \end{bmatrix}, \quad (2.55)$$

where $[h_{2x}\ h_{2y}\ h_{2z}]$ is computed using (2.49).

The inverse depth $\rho_i$ is derived from:

$$\rho_i = \frac{\sin(\alpha)}{b \sin(\beta)}, \quad (2.56)$$

since, by the law of sines, $b\sin(\beta)/\sin(\alpha)$ is the distance from the current camera position to the feature. The new system state $\hat{x}_{new}$ is obtained simply by appending the new feature $\hat{y}_{new}$ to the end of the state vector $\hat{x}$. If $\hat{x} = [\hat{x}_v\ \hat{y}_1\ \hat{y}_2]^T$, then $\hat{x}_{new}$ is composed as follows:

$$\hat{x} = \begin{bmatrix} \hat{x}_v \\ \hat{y}_1 \\ \hat{y}_2 \end{bmatrix}, \qquad \hat{x}_{new} = \begin{bmatrix} \hat{x}_v \\ \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_{new} \end{bmatrix}. \quad (2.57)$$

The new state covariance matrix $P_{new}$ is derived from the diagonal measurement error covariance matrix $R_j$ and the current state covariance matrix $P$:

$$R_j = \operatorname{diag}\left( \sigma_u^2 \;\; \sigma_v^2 \;\; \sigma_u^2 \;\; \sigma_v^2 \;\; \sigma_{x_1} \;\; \sigma_{y_1} \;\; \sigma_{z_1} \;\; \sigma_{1q_1} \;\; \sigma_{1q_2} \;\; \sigma_{1q_3} \;\; \sigma_{1q_4} \right), \quad (2.58)$$

where $R_j$ is composed of the measurement error variances $\sigma_u^2$ and $\sigma_v^2$ (defined in pixel units) and of the variances stored with the candidate point $\lambda_i$.
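A minimal sketch of the parallax test and of the construction of $\hat{y}_{new}$ follows. The candidate-point fields and the undistorted pixel coordinates are assumed to be available as in (2.37) and (2.46), `R_from_q` is assumed to implement (A.3), and the default threshold matches the $\alpha_{min} = 5°$ used in the experiments.

```python
# Minimal sketch of the parallax test and depth hypothesis of (2.42)-(2.56).
import numpy as np

def ray_in_world(q, u_und, v_und, f, u0, v0, R_from_q):
    # Directional projection ray in the world frame W, (2.44)/(2.49),
    # built from the camera-frame vector of (2.45)/(2.50).
    h_c = np.array([(u0 - u_und) / f, (v0 - v_und) / f, 1.0])
    return R_from_q(q) @ h_c

def try_initialize(cand, x_v, z_und, f, u0, v0, R_from_q, alpha_min=np.deg2rad(5)):
    # cand stores r_lambda, q_lambda and the first (undistorted) observation (2.37).
    r1, q1, uv1 = cand['r'], cand['q'], cand['uv1']
    r2, q2 = x_v[0:3], x_v[3:7]

    h1 = ray_in_world(q1, *uv1, f, u0, v0, R_from_q)          # (2.44)
    h2 = ray_in_world(q2, *z_und, f, u0, v0, R_from_q)        # (2.49)
    b1, b2 = r2 - r1, r1 - r2                                 # (2.47), (2.52)
    b = np.linalg.norm(b1)                                    # base-line (2.53)

    beta = np.arccos(h1 @ b1 / (np.linalg.norm(h1) * b))      # (2.43)
    gamma = np.arccos(h2 @ b2 / (np.linalg.norm(h2) * b))     # (2.48)
    alpha = np.pi - (beta + gamma)                            # parallax (2.42)
    if alpha < alpha_min:
        return None                                           # keep as candidate

    theta = np.arctan2(h2[0], h2[2])                          # (2.55)
    phi = np.arctan2(-h2[1], np.hypot(h2[0], h2[2]))
    rho = np.sin(alpha) / (b * np.sin(beta))                  # (2.56)
    return np.concatenate([r2, [theta, phi, rho]])            # y_new (2.54)
```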

The new state covariance matrix $P_{new}$ after the initialization process is given by:

$$P_{new} = Y \begin{bmatrix} P_k & 0 \\ 0 & R_j \end{bmatrix} Y^T. \quad (2.59)$$

The Jacobian $Y$ is composed as follows:

$$Y = \begin{bmatrix} I_{n\times n} & 0 \\ \dfrac{\partial g}{\partial \hat{x}_v},\ 0_{6\times6},\ \ldots,\ 0_{6\times6} & \dfrac{\partial g}{\partial R_j} \end{bmatrix}, \quad (2.60)$$

where $I$ is an $n$-dimensional identity matrix, $n$ being the dimension of $P_k$. $\partial g / \partial \hat{x}_v$ contains the derivatives of the initialization equations $g(\hat{x}, \lambda_i, z_i)$ with respect to the camera state $\hat{x}_v$; note that the derivatives with respect to the previously mapped features are filled with zeros. $\partial g / \partial R_j$ contains the derivatives of the initialization equations $g(\hat{x}, \lambda_i, z_i)$ with respect to the parameters of the covariance matrix $R_j$.
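A minimal sketch of the covariance augmentation (2.58)-(2.60); the Jacobians of the initialization function $g$ are assumed to be computed elsewhere and passed in.

```python
# Minimal sketch of the covariance augmentation (2.59)-(2.60).
import numpy as np

def augment_covariance(P, dg_dxv, dg_dRj, R_j):
    # P: n x n, dg_dxv: 6 x 13, dg_dRj: 6 x 11, R_j: 11 x 11 (see (2.58)).
    n = P.shape[0]
    Y = np.zeros((n + 6, n + R_j.shape[0]))
    Y[:n, :n] = np.eye(n)            # old states are unchanged
    Y[n:, :13] = dg_dxv              # new feature w.r.t. the camera state
    Y[n:, n:] = dg_dRj               # new feature w.r.t. the noise terms in R_j

    big = np.zeros((n + R_j.shape[0], n + R_j.shape[0]))
    big[:n, :n] = P
    big[n:, n:] = R_j
    return Y @ big @ Y.T             # P_new of (2.59)
```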

In certain environments, features located far away from the camera will not produce parallax (for trajectories that are small with respect to the feature depth). Including features with a large depth uncertainty in the map is not very helpful for estimating the camera location $r^{WC}$, although this kind of feature can provide useful information for estimating the camera orientation $q^{WC}$.

A special case arises when a near feature does not produce parallax. This happens when the camera moves in the direction of the feature (e.g., $\beta \approx 0$). In this work it is assumed that the necessary conditions for producing parallax hold at every instant for at least one feature observed by the camera. Furthermore, features located in front of the direction of the camera movement are discarded (if $\beta < 20°$).

Regarding whether distant features should be included in the map, the proposed method uses a minimum base-line $b_{min}$ as an additional threshold for considering a candidate point $\lambda_i$ for initialization as a new feature $\hat{y}_{new}$. In this case, a candidate point $\lambda_i$ is considered to be distant from the camera when $b > b_{min}$ but $\alpha < \alpha_{min}$: the camera has moved from its previous position, but not enough parallax has been produced, owing to the distant condition of the landmark.

The value of $b_{min}$ can be determined heuristically, depending on the particularities of the application. For initializing a candidate point $\lambda_i$ as a new feature $\hat{y}_{new}$ when $b > b_{min}$ but $\alpha < \alpha_{min}$, the simple method for initializing new features proposed in [26] can be used. Nevertheless, in this work the following expressions are proposed in order to determine the initial values $\rho_{ini}$, $\sigma_{\rho(ini)}$ as a function of $b_{min}$ and $\alpha_{min}$:

$$\rho_{ini} < \frac{2\sin\left(\alpha_{min}/2\right)}{b_{min}}, \qquad \sigma_{\rho(ini)} = \frac{\rho_{ini}}{4}. \quad (2.61)$$
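A small sketch of these initial values; since (2.61) states an upper bound on $\rho_{ini}$, taking the bound itself as the initial value is an assumption of this sketch.

```python
# Minimal sketch of the low-parallax initialization values of (2.61).
import numpy as np

def distant_feature_prior(b_min, alpha_min):
    # Assumption: the bound of (2.61) is used directly as the initial value.
    rho_ini = 2.0 * np.sin(alpha_min / 2.0) / b_min
    sigma_rho_ini = rho_ini / 4.0
    return rho_ini, sigma_rho_ini
```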

2.7. Map Management

When the number of features in the system state increases, the computational cost grows rapidly, and it consequently becomes difficult to maintain frame-rate operation. To alleviate this drawback, old features can be removed from the state in order to maintain a stable number of features and, therefore, a stable computational cost per frame. Obviously, if old features are removed, then previously mapped areas cannot be recognized in the future. However, in the context of visual odometry this is not considered a problem. A monocular SLAM method modified to keep its computational cost stable can be viewed as a complex real-time virtual sensor, which provides appearance-based sensing and emulates typical sensors such as a laser for range measurements and encoders for dead reckoning (visual odometry).

In the authors' related work [37], the virtual sensor idea is proposed, developed, and extended so that, using a distributed scheme, it is possible to recognize previously mapped areas for loop-closing tasks. In that scheme, the outputs provided by the virtual sensor are used by an additional SLAM process (decoupled from the camera's frame rate) in order to build and maintain the global map and the final camera pose.

The present work regards the monocular SLAM approach as a virtual sensor. Therefore, if the number of features exceeds a specified threshold, the algorithm removes older features that have been left behind by the camera movement, as well as features that have been predicted to appear in the image but have not been matched in several consecutive frames, in order to maintain a stable number of features in the system.

Removing features from the system state is much easier than adding them. To delete a feature from the state vector and the covariance matrix, the rows and columns which contain it have to be removed. As an example, the remaining state vector and covariance matrix after the removal of feature $y_2$ are the following:

$$\begin{bmatrix} x_v \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} \rightarrow \begin{bmatrix} x_v \\ y_1 \\ y_3 \end{bmatrix}, \qquad \begin{bmatrix} P_{xx} & P_{xy_1} & P_{xy_2} & P_{xy_3} \\ P_{y_1x} & P_{y_1y_1} & P_{y_1y_2} & P_{y_1y_3} \\ P_{y_2x} & P_{y_2y_1} & P_{y_2y_2} & P_{y_2y_3} \\ P_{y_3x} & P_{y_3y_1} & P_{y_3y_2} & P_{y_3y_3} \end{bmatrix} \rightarrow \begin{bmatrix} P_{xx} & P_{xy_1} & P_{xy_3} \\ P_{y_1x} & P_{y_1y_1} & P_{y_1y_3} \\ P_{y_3x} & P_{y_3y_1} & P_{y_3y_3} \end{bmatrix}. \quad (2.62)$$
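A minimal sketch of this removal, assuming each feature occupies six consecutive entries after the 13-dimensional camera state:

```python
# Minimal sketch of removing mapped feature j from the state vector and the
# covariance matrix, as illustrated in (2.62).
import numpy as np

def remove_feature(x, P, j):
    start = 13 + 6 * j                        # first entry of feature j
    idx = np.r_[0:start, start + 6:x.size]    # indices that are kept
    return x[idx], P[np.ix_(idx, idx)]
```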

3. Experimental Results

The proposed method has been implemented in MATLAB in order to test its performance. First, several video sequences were acquired using two different low-cost cameras. Then, the algorithm was executed off-line using those video sequences as input signals. In the experiments, the following parameter values were used: process noise parameters for linear and angular velocity, respectively, $\sigma_V = 4\ \mathrm{m/s^2}$ and $\sigma_\Omega = 4\ \mathrm{rad/s^2}$; measurement noise $\sigma_u = \sigma_v = 1$ pixel; minimum base-line $b_{min} = 15$ cm; and minimum parallax angle $\alpha_{min} = 5°$.

In the first series of experiments, a Microsoft LifeCam Studio webcam was used. This low-cost camera has a USB interface and a wide-angle lens. It is capable of acquiring HD color video, but in the present experiments gray-level video sequences, with a resolution of 424 × 240 pixels at 30 frames per second, were used.

Figure 7 shows the experimental results for a video sequence of 790 frames acquired in a desktop environment with the LifeCam Studio webcam. In this case, the camera was moved from left to right following a trajectory similar to the curved front edge of the desktop (see Figure 9, left side). A black paper sheet was used as the initial metric reference (Frame 1, Figure 7). It is important to remark that the initial metric reference determines the origin of the world reference frame. The initial camera position and orientation with respect to the world reference frame are determined as explained in Section 2.2. Some candidate points were detected from the first frame onwards; however, the first feature was not added to the map until frame 111. In this frame, two new candidate points were detected (in red), and two candidate points were tracked (in blue). For simplicity, the 3σ feature depth uncertainty is indicated by red 3D lines in the plots illustrating the outputs of the proposed method (central and right columns in Figure 7); the features are located along these red lines. Of course, the full geometric uncertainties should be represented by 3D ellipsoids. In frame 203, the first features, corresponding to the initial metric reference, have been left behind by the camera movement. Other features, corresponding to a painting placed approximately in the same plane as the initial metric reference, have been initialized. In frame 533, several features corresponding to two different PC screens have been added to the map. Weak image features, due for example to brightness on reflective surfaces or to the changing video on the PC screens, are implicitly pruned by the image tracking of candidate points. At frame 688, several features have been left behind by the camera movement and removed from the system (in yellow), while other features still within the camera view are maintained in the map (in red).

Figure 8 shows a top view of the experimental environment (left plot) and the final camera trajectory and map estimates in an X-Z view at frame 790 (right plot). Several objects have been labeled in both plots to illustrate the reconstruction of the scene structure. In this experiment, both the camera trajectory and the scene structure have been recovered in a satisfactory manner.

In the second series of experiments, a Fire-i Unibrain monochrome webcam was used. This low-cost camera has an IEEE 1394 interface and an interchangeable wide-angle lens. Video sequences with a resolution of 320 × 240 pixels, acquired at 30 frames per second, were used in the experiments. Figure 9 shows the experimental results for a video sequence containing about 1750 frames. This sequence was recorded while walking over a predefined cyclical trajectory inside a laboratory environment. By the end of the sequence, 390 features had been initialized in the map, but only about 20 to 30 features on average were maintained in the system state, so that the computational time per frame remained stable along the sequence.

In this experiment, the only reference for recovering the metric scale is a computer screen (frame 3), which was left behind by the camera movement around frame 300. In frame 565, some features have clearly been removed from the map; here, a feature is removed if it cannot be detected and tracked for 20 consecutive frames. In frame 777, the camera has completed its first turn (or cycle) around the defined trajectory. In frame 1635, the camera has returned near its initial position (note that the PC screen used as the initial metric reference appears again). At this point, the discrepancy between the estimated trajectory and the real trajectory is very noticeable (the real path followed is like a long rectangle with rounded edges). The biggest degradation of the estimated trajectory happens when the camera is turning the curves, mainly the second and third curves, because there are no distant features providing enough information about the orientation. Without underestimating the importance of the effects produced by the drift in the trajectory estimates, it can be appreciated in Figure 10 that the 3D structure of the environment has been recovered reasonably well.

4. Conclusions

In this paper, a method addressing the problem of 3D visual SLAM, in a context of visual odometry, using a single free-moving camera has been presented.

Depth information cannot be obtained in a single measurement when a bearing sensor (e.g., a monocular camera) is used. Therefore, in order to incorporate new features into the map, special feature initialization techniques are needed to enable the use of bearing sensors in SLAM systems. In this work, a method called delayed inverse-depth feature initialization has been proposed to incorporate new features into the stochastic map from the bearing measurements of the camera. The proposed method is based on the inverse-depth parameterization and implements, by means of a delay, a stochastic technique of triangulation in order to define a hypothesis of initial depth for each feature.

In SLAM methods based on the extended Kalman filter, it is difficult to maintain frame-rate operation when the number of features in the system state increases. Old features have to be removed from the state in order to maintain a stable computational cost per frame. Obviously, if features are removed, then previously mapped areas cannot be recognized in the future. However, a monocular SLAM method modified to maintain a stable computational cost can be viewed as a complex real-time virtual sensor, which provides appearance-based sensing and emulates typical sensors (e.g., a laser) for range measurements and encoders for dead reckoning (visual odometry).

The experimental results with real data show that, although a single small metric reference is used for recovering the world scale, the metric degeneration at the end of the trajectory is low. In the experiments carried out, the camera trajectory and the scene structure have been recovered in a satisfactory manner. The trajectory estimates are not error-free; errors are more likely to appear in curves where distant features are not available.

Even though this work is concerned with the approach of monocular SLAM as a virtual sensor (e.g., in the context of visual odometry), the effectiveness of the proposed method has also been verified in a more general SLAM context. A real-time implementation of the proposed method [37], fully adapted as a virtual sensor, provides a rich sensory input to a decoupled (global) SLAM process. This global SLAM process is used for loop-closing tasks and, therefore, for building and maintaining the global map and the final camera pose.

Appendix

This appendix collects some transformations used repeatedly throughout the paper.

A quaternion $q$ can be computed from a rotation vector $\omega$ as follows:

$$q(\omega) = \begin{bmatrix} \cos\dfrac{\|\omega\|}{2} & \sin\dfrac{\|\omega\|}{2}\dfrac{\omega}{\|\omega\|} \end{bmatrix}^T. \quad (A.1)$$

A quaternion $q$ can be computed from a rotation matrix $R^b_n$ as follows:

$$q\left(R^b_n\right) = \begin{bmatrix} q_1 \\ \dfrac{R^b_n(3,2) - R^b_n(2,3)}{4q_1} \\ \dfrac{R^b_n(1,3) - R^b_n(3,1)}{4q_1} \\ \dfrac{R^b_n(2,1) - R^b_n(1,2)}{4q_1} \end{bmatrix}, \qquad q_1 = \frac{1}{2}\sqrt{1 + R^b_n(1,1) + R^b_n(2,2) + R^b_n(3,3)}. \quad (A.2)$$

A rotation matrix $R$ can be computed from a quaternion $q$ as follows:

$$R(q) = \begin{bmatrix} q_1^2 + q_2^2 - q_3^2 - q_4^2 & 2\left(q_2 q_3 - q_1 q_4\right) & 2\left(q_1 q_3 + q_2 q_4\right) \\ 2\left(q_2 q_3 + q_1 q_4\right) & q_1^2 - q_2^2 + q_3^2 - q_4^2 & 2\left(q_3 q_4 - q_1 q_2\right) \\ 2\left(q_2 q_4 - q_1 q_3\right) & 2\left(q_1 q_2 + q_3 q_4\right) & q_1^2 - q_2^2 - q_3^2 + q_4^2 \end{bmatrix}. \quad (A.3)$$
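A minimal sketch of these three transformations (scalar part first, with the 1-based matrix indices of (A.2) translated to 0-based indexing):

```python
# Minimal sketch of the transformations (A.1)-(A.3); q = [q1, q2, q3, q4]
# with the scalar part first, matching the convention of (A.3).
import numpy as np

def q_from_rotvec(w):
    # (A.1): quaternion from a rotation vector w.
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = w / angle
    return np.r_[np.cos(angle / 2.0), np.sin(angle / 2.0) * axis]

def q_from_R(R):
    # (A.2): quaternion from a rotation matrix (form valid when 1 + trace > 0).
    q1 = 0.5 * np.sqrt(1.0 + R[0, 0] + R[1, 1] + R[2, 2])
    return np.array([q1,
                     (R[2, 1] - R[1, 2]) / (4.0 * q1),
                     (R[0, 2] - R[2, 0]) / (4.0 * q1),
                     (R[1, 0] - R[0, 1]) / (4.0 * q1)])

def R_from_q(q):
    # (A.3): rotation matrix from a quaternion.
    q1, q2, q3, q4 = q
    return np.array([
        [q1**2 + q2**2 - q3**2 - q4**2, 2*(q2*q3 - q1*q4),             2*(q1*q3 + q2*q4)],
        [2*(q2*q3 + q1*q4),             q1**2 - q2**2 + q3**2 - q4**2, 2*(q3*q4 - q1*q2)],
        [2*(q2*q4 - q1*q3),             2*(q1*q2 + q3*q4),             q1**2 - q2**2 - q3**2 + q4**2]])
```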

Acknowledgments

This work has been funded by the Spanish Ministry of Science Project DPI2010-17112.