Abstract

With the increasing demand for location-based services in venues such as railway stations, airports, and shopping malls, indoor positioning technology has become one of the most attractive research areas. Due to the effects of multipath propagation, wireless indoor localization methods such as WiFi, Bluetooth, and pseudolite have difficulty achieving high-precision positioning. In this work, we present an image-based localization approach that estimates the position simply by taking a picture of the surrounding environment. This paper proposes a novel approach that classifies different scenes with a deep belief network and solves the camera pose from several spatial reference points extracted from depth images by the perspective-n-point algorithm. To evaluate the performance, experiments are conducted on a public dataset and in real scenes; the results demonstrate that our approach achieves submeter positioning accuracy. Compared with other methods, image-based indoor localization does not require additional infrastructure and has a wide range of applications, including self-driving, robot navigation, and augmented reality.

1. Introduction

According to statistics, people spend more than 80 percent of their time in indoor environments such as shopping malls, airports, libraries, campuses, and hospitals. The purpose of an indoor localization system is to provide accurate positions inside large buildings. It is vital to applications such as evacuating trapped people at fire scenes, tracking valuable assets, and indoor service robots. For these applications to be widely accepted, indoor localization requires an accurate and reliable position estimation scheme [1].

In order to provide a stable indoor location service, a large number of technologies have been researched, including pseudolite, Bluetooth, ultrasonic, WiFi, ultra-wideband, and LED [2, 3]. Because of multipath interference, it is almost impossible for radio-based approaches to obtain very accurate results through time-of-arrival and angle-of-arrival methods. The time-varying indoor environment and the movement of pedestrians also have adverse effects on the stability of fingerprint information [4–6]. In addition, the high cost of hardware, construction, installation, maintenance, and updating is an important factor limiting the development of indoor positioning technology. Besides, these kinds of methods can only output the position but not the view angle (pitch, yaw, and roll angles).

The vision-based positioning method is a kind of passive positioning technology that can achieve high positioning accuracy and does not need extra infrastructure. Moreover, it can output not only the position but also the view angle at the same time. Therefore, it has gradually become a hotspot of indoor positioning technology [7, 8]. Such methods typically involve four steps: first, establishing an indoor image dataset collected by depth cameras with exact positional information; second, comparing the image captured by the user's camera to the images in this database; third, retrieving the most similar pictures, extracting features, and matching points; and finally, solving the perspective-n-point problem [9–12]. However, applying scene recognition to mobile localization raises several challenges [13–15]. The complex three-dimensional shape of the environment results in occlusions, overlaps, shadows, and reflections, which require a robust description of the scene [16]. To address these issues, we propose a particularly efficient approach based on a deep belief network with local binary pattern feature descriptors. It enables us to find the most similar pictures quickly. In addition, we restrict the search space according to adaptive visibility constraints, which allows us to cope with extensive maps.

Before presenting the proposed approach, we briefly review previous work on image-based localization methods, which fall roughly into three categories.

2. Related Work

Manual mark-based localization methods address the fact that relying completely on the natural features of the image lacks robustness, especially under varying illumination. In order to improve the robustness and accuracy of the reference points, special coded marks are used to meet the higher positioning requirements of the system. There are three benefits: they simplify the automatic detection of corresponding points, they introduce scale information into the system, and they distinguish and identify targets by giving each mark a unique code. Common types of marks include concentric rings, QR codes, and patterns composed of colored dots. The advantage is a higher recognition rate and an effective reduction in the complexity of the positioning method. The disadvantages are that installation and maintenance costs are high, some marks are easily obstructed, and the scope of application is limited [17, 18].

Natural mark-based localization methods usually detect objects in the image and match them with an existing building database that contains the location information of the natural marks in the building. The advantage of this method is that it does not require additional local infrastructure; in other words, the reference objects are actually a series of digital reference points (control points in photogrammetry) in the database. Therefore, this type of system is suitable for large-scale coverage without adding too much cost. The disadvantages are that the recognition algorithm is complex and easily affected by the environment, the features are prone to change, and the dataset needs to be updated [19–22].

Learning-based localization methods have emerged in the past few years. These are end-to-end methods that directly regress the 6-DoF pose and have been proposed to solve loop-closure detection and pose estimation [23]. They do not require feature extraction, feature matching, or complex geometric calculations and are intuitive and concise. They are robust to weak textures, repeated textures, motion blur, and lighting changes. However, the training phase is computationally expensive and usually requires GPU servers, so such models cannot run smoothly on mobile platforms [20]. In many scenarios, learned features are not as effective as traditional features such as SIFT, and their interpretability is poor [24–27].

3. Framework and Method

In this section, we first give an overview of the framework. The key modules are then explained in more detail in the subsequent subsections.

3.1. Framework Overview

The whole pipeline of the visual localization system is shown in Figure 1. In the following, we briefly provide an overview of our system.

In the offline stage, an RGB-D camera is carried around the indoor environment to collect enough RGB images and depth images. At the same time, the poses of the camera and the 3D point cloud are constructed. The RGB images are used as the training dataset for the network model, and the network model parameters are saved once the loss function value no longer decreases. In the online stage, a user entering the room downloads the trained network model parameters to a mobile phone and takes a picture with the phone; the most similar image is then identified by the deep learning network. Mismatched points are eliminated, and the pixel coordinates of the matched points and the depths of the corresponding points are extracted. According to the pinhole imaging model, the perspective-n-point (PnP) solution method can be used to calculate the pose of the mobile phone in the world coordinate system. Finally, the pose is converted into a real position and displayed on the map.
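
For concreteness, the online stage can be summarized by the following minimal Python sketch. It is illustrative only: `scene_classifier`, `database`, and `match_features` are hypothetical components standing in for the modules described in Sections 3.3-3.4, and OpenCV's `solvePnPRansac` stands in for the pose solver of Section 3.5.

```python
import cv2
import numpy as np

def localize(query_gray, scene_classifier, database, match_features):
    """Hypothetical online-stage pipeline: scene retrieval -> 2D-3D matching -> PnP.
    scene_classifier, database, and match_features are assumed components."""
    # 1. Retrieve the most similar database image with the trained scene classifier.
    scene_id = scene_classifier.predict(query_gray)
    ref = database[scene_id]   # assumed to hold the reference image, depth, pose, and intrinsics

    # 2. Match 2D feature points between the query and the retrieved image, and look up
    #    the 3D world coordinates of the matched points from the depth data (Sections 3.4-3.5).
    pts2d, pts3d = match_features(query_gray, ref)

    # 3. Solve the perspective-n-point problem for the camera pose.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        np.asarray(pts3d, dtype=np.float32),
        np.asarray(pts2d, dtype=np.float32),
        ref.K, ref.dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)            # rotation matrix from rotation vector
    return -R.T @ tvec                    # camera center in the world coordinate system
```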

3.2. Camera Calibration and Image Correction

Due to the processing and installation errors of the camera lens, the image has radial distortion and tangential distortion. Therefore, we must calibrate the camera and correct the images in the preprocessing stage. The checkerboard contains a set of calibration reference points, and the coordinates of each point are disturbed by the same noise. We establish the objective function

$$F=\sum_{i=1}^{n}\sum_{j=1}^{m}\left\| m_{ij}-\hat{m}\left(K,R_{i},t_{i},M_{j}\right)\right\|^{2},$$

where $m_{ij}$ is the coordinate of the projection point on image $i$ for reference point $j$ in three-dimensional space, $R_i$ and $t_i$ are the rotation and translation vectors of image $i$, $M_j$ is the three-dimensional coordinate of reference point $j$ in the world coordinate system, and $\hat{m}\left(K,R_{i},t_{i},M_{j}\right)$ is its reprojected two-dimensional coordinate in the image coordinate system ($K$ denotes the camera intrinsic matrix).
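
As a hedged illustration, checkerboard calibration and undistortion can be carried out with OpenCV as sketched below; the board geometry, square size, and image folder are assumptions made for the example, not values from the paper.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry: 9 x 6 inner corners, 25 mm squares.
BOARD = (9, 6)
SQUARE = 0.025  # meters

# 3D reference points of the checkerboard corners in the board's own frame.
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib/*.jpg"):          # hypothetical folder of calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Minimizes the reprojection error F over K, the distortion, and each image's (R_i, t_i).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Correct radial and tangential distortion of a new image.
undistorted = cv2.undistort(cv2.imread("calib/test.jpg"), K, dist)
```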

3.3. Scene Recognition

In this section, we use the deep belief network (DBN) to categorize the different indoor scenes. The framework includes image preprocessing, LBP feature extraction, DBN training, and scene classification.

3.3.1. Local Binary Pattern

The improved LBP feature is insensitive to rotation and illumination changes. The LBP operator can be described as follows: the gray value of the window's center pixel is taken as the threshold, and the gray values of the surrounding 8 pixels are compared with this threshold in a clockwise direction; if a gray value is larger than the threshold, the pixel is marked as 1, otherwise as 0. The comparison yields an 8-bit binary number, which after decimal conversion gives the LBP value of the center pixel in this window. This value reflects the texture information at this position. The calculation process is shown in Figure 2.

The local binary pattern is computed as

$$LBP\left(x_{c},y_{c}\right)=\sum_{p=0}^{P-1}2^{p}\,s\left(i_{p}-i_{c}\right),\qquad s\left(x\right)=\begin{cases}1, & x\ge 0,\\ 0, & x<0,\end{cases}$$

where $\left(x_{c},y_{c}\right)$ are the horizontal and vertical coordinates of the center pixel; $P$ is the number of neighbors (here $P=8$); $i_{c}$ and $i_{p}$ are the gray values of the center pixel and the $p$-th neighborhood pixel, respectively; and $s\left(\cdot\right)$ is the two-valued symbol function.
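
A minimal NumPy sketch of the basic 3 x 3 LBP operator defined above follows; the function name and the simple border handling are illustrative choices rather than part of the original method.

```python
import numpy as np

def basic_lbp(gray):
    """Basic 8-neighbor LBP on a 2D grayscale array (borders left as zero)."""
    h, w = gray.shape
    lbp = np.zeros((h, w), dtype=np.uint8)
    # Clockwise neighbor offsets starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = gray[y, x]
            code = 0
            for p, (dy, dx) in enumerate(offsets):
                # s(i_p - i_c): 1 if the neighbor is not darker than the center.
                code |= int(gray[y + dy, x + dx] >= center) << p
            lbp[y, x] = code
    return lbp
```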

The earliest LBP operator can only cover a small neighborhood of the image, so optimizations and improvements of the LBP operator have constantly been proposed by researchers. We adopt the method that remedies the limited window size of the original LBP operator by replacing the traditional square neighborhood with a circular neighborhood and expanding the window size, as shown in Figure 3.

In order to make the LBP operator rotation invariant, the circular neighborhood is rotated clockwise to obtain a series of binary strings, the minimum binary value among them is taken, and this value is then converted into decimal, which is the LBP value of the point. The process of obtaining the rotation-invariant LBP operator is shown in Figure 4.
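
In practice, the circular, rotation-invariant variant can be computed with scikit-image as sketched below; the radius, number of sampling points, and placeholder image are example values, and the normalized code histogram used as the descriptor fed to the DBN is an assumption for illustration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1                                  # example: 8 samples on a circle of radius 1
gray_image = np.random.randint(0, 256, (120, 160), dtype=np.uint8)  # placeholder image

# method="ror" rotates each pattern to its minimal binary value (rotation invariance).
lbp_map = local_binary_pattern(gray_image, P, R, method="ror")

# Assumed descriptor: a normalized histogram of LBP codes passed to the DBN classifier.
hist, _ = np.histogram(lbp_map, bins=np.arange(0, 2 ** P + 1), density=True)
```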

3.3.2. Deep Belief Network

The deep belief network consists of stacked restricted Boltzmann machines (RBMs) and a backpropagation (BP) neural network. The Boltzmann machine is a stochastic neural network. It consists of a visible layer and a hidden layer, and neurons within the same layer as well as neurons in different layers are connected to each other. A neuron has two output states, active and inactive, represented by 1 and 0. The advantage of the Boltzmann machine is its powerful unsupervised learning ability, which can learn complex rules from a large amount of data; the disadvantages are the huge amount of computation and the long training time. The restricted Boltzmann machine removes the connections between neurons in the same layer, so the hidden units are conditionally independent of one another given the visible units, and vice versa. Le Roux and Bengio theoretically proved that, as long as the numbers of hidden neurons and training samples are sufficient, an arbitrary discrete distribution can be fitted. The structures of the BM and the RBM are shown in Figure 5.

The joint configuration energy of its visible and hidden layers is defined as

$$E\left(v,h\mid\theta\right)=-\sum_{i}a_{i}v_{i}-\sum_{j}b_{j}h_{j}-\sum_{i}\sum_{j}v_{i}w_{ij}h_{j},$$

where $\theta=\left\{a,b,W\right\}$ are the parameters of the RBM, $a_{i}$ is the bias of visible unit $v_{i}$, $b_{j}$ is the bias of hidden unit $h_{j}$, and $w_{ij}$ is the weight between them.

The output of hidden unit $j$ is

$$h_{j}=\sigma\left(b_{j}+\sum_{i}v_{i}w_{ij}\right).$$

When the parameters $\theta$ are known, based on the above energy function, the joint probability distribution of $\left(v,h\right)$ is

$$P\left(v,h\mid\theta\right)=\frac{e^{-E\left(v,h\mid\theta\right)}}{Z\left(\theta\right)},\qquad Z\left(\theta\right)=\sum_{v,h}e^{-E\left(v,h\mid\theta\right)},$$

where $Z\left(\theta\right)$ is the normalization factor. The marginal distribution of $v$ is obtained by summing the joint probability distribution over $h$:

$$P\left(v\mid\theta\right)=\frac{1}{Z\left(\theta\right)}\sum_{h}e^{-E\left(v,h\mid\theta\right)}.$$

Since the activation states of the hidden units and visible units are conditionally independent, when the states of the visible (respectively, hidden) units are given, the activation probabilities of the $j$-th hidden unit and the $i$-th visible unit are

$$P\left(h_{j}=1\mid v\right)=\sigma\left(b_{j}+\sum_{i}v_{i}w_{ij}\right),\qquad P\left(v_{i}=1\mid h\right)=\sigma\left(a_{i}+\sum_{j}w_{ij}h_{j}\right),$$

where $\sigma\left(x\right)=1/\left(1+e^{-x}\right)$ is the sigmoid activation function.
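
The conditional probabilities above translate directly into the Gibbs sampling step used when an RBM is trained with contrastive divergence. The following NumPy sketch is illustrative only: the layer sizes, learning rate, and single-vector update are arbitrary assumptions, not the paper's training configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr, rng):
    """One contrastive-divergence (CD-1) update for a single training vector v0.
    W: (n_visible, n_hidden) weights; a: visible biases; b: hidden biases."""
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(b.shape[0]) < ph0).astype(float)
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j)
    pv1 = sigmoid(a + h0 @ W.T)
    ph1 = sigmoid(b + pv1 @ W)
    # Positive phase minus negative phase (gradient approximation).
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b

# Example usage with arbitrary sizes (e.g., an LBP-histogram input vector).
rng = np.random.default_rng(0)
n_visible, n_hidden = 256, 64
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v0 = rng.random(n_visible)
W, a, b = cd1_step(v0, W, a, b, lr=0.01, rng=rng)
```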

3.4. Feature Point Detection and Matching

In this paper, we propose a multifeature point fusion algorithm. The combination of the Canny edge detection algorithm and the ORB detection algorithm enables the detector to extract edge information, thereby increasing the number of matching points on objects with few textures. The feature points on edges are obtained by the Canny algorithm to ensure that objects with little texture still have feature points. ORB features have scale and rotation invariance, and their extraction is faster than SIFT. The BRIEF description algorithm is used to construct the feature point descriptors [28–31].
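
A hedged sketch of such a fusion is given below, in which Canny edge pixels are promoted to keypoints and described together with ORB's own detections; the sampling stride, thresholds, and keypoint size are illustrative parameters, not values from the paper.

```python
import cv2
import numpy as np

def fused_keypoints_and_descriptors(gray, stride=10):
    """Combine ORB keypoints with keypoints sampled from Canny edges,
    then describe them all with ORB's BRIEF-based descriptor."""
    orb = cv2.ORB_create(nfeatures=1000)

    # Native ORB detections (scale- and rotation-invariant corners).
    kps = list(orb.detect(gray, None))

    # Sample additional keypoints along Canny edges for low-texture objects.
    edges = cv2.Canny(gray, 50, 150)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs[::stride], ys[::stride]):
        kps.append(cv2.KeyPoint(float(x), float(y), 31))  # 31 = ORB patch size

    # Compute binary descriptors for the fused keypoint set.
    kps, desc = orb.compute(gray, kps)
    return kps, desc
```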

The brute-force algorithm is adopted as the feature matching strategy. It calculates the Hamming distance between each feature point of the template image and each feature point of the sample image. The minimum Hamming distance is then compared with a threshold: if the distance is less than the threshold, the two points are regarded as a matching pair; otherwise, they are not matched. The framework of feature extraction and matching is shown in Figure 6.
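
With OpenCV this strategy can be written as the following sketch; the Hamming-distance threshold of 40 is an assumed example value, not taken from the paper.

```python
import cv2

def brute_force_match(desc_query, desc_ref, max_hamming=40):
    """Brute-force Hamming matching of binary descriptors with a distance threshold."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_query, desc_ref)          # best match per query descriptor
    # Keep only pairs whose minimum Hamming distance is below the threshold.
    good = [m for m in matches if m.distance < max_hamming]
    return sorted(good, key=lambda m: m.distance)
```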

3.5. Pose Estimation

The core idea is to select four noncoplanar virtual control points; all the spatial reference points are then represented as combinations of these four virtual control points, and the coordinates of the virtual control points are solved from the correspondence between the spatial reference points and their projection points, thereby obtaining the coordinates of all the spatial reference points. Finally, the rotation matrix and the translation vector are solved. The specific algorithm is described as follows.

Given $n$ reference points, their coordinates in the world coordinate system are $P_{i}^{w}$, $i=1,2,\dots,n$. The coordinates of the corresponding projection points in the image coordinate system are $u_{i}$, and the corresponding homogeneous coordinates are $\tilde{P}_{i}^{w}$ and $\tilde{u}_{i}$. The correspondence between a reference point and its projection point is

$$d_{i}\tilde{u}_{i}=K\left[R\mid t\right]\tilde{P}_{i}^{w},$$

where $d_{i}$ is the depth of the reference point and $K$ is the internal parameter matrix of the camera:

$$K=\begin{bmatrix}f & 0 & u_{0}\\ 0 & f & v_{0}\\ 0 & 0 & 1\end{bmatrix},$$

where $f$ is the focal length of the camera and $\left(u_{0},v_{0}\right)$ is the optical center coordinate.

First, select four noncoplanar virtual control points in the world coordinate system. The relationship between the virtual control points and their projection points is shown in Figure 7.

In Figure 7, $c_{1}^{w}$, $c_{2}^{w}$, $c_{3}^{w}$, and $c_{4}^{w}$ are the four virtual control points in the world coordinate system. $\tilde{c}_{j}^{c}$ are the homogeneous coordinates of the virtual control points in the camera coordinate system and $c_{j}^{c}$ are the corresponding nonhomogeneous coordinates; $\tilde{u}_{c_{j}}$ are the homogeneous coordinates of the corresponding projection points in the image coordinate system and $u_{c_{j}}$ are the corresponding nonhomogeneous coordinates. $\tilde{P}_{i}^{c}$ is the homogeneous coordinate of reference point $i$ in the camera coordinate system and $P_{i}^{c}$ is the corresponding nonhomogeneous coordinate. The relationship between the spatial reference points and the control points in the world coordinate system is as follows:

$$P_{i}^{w}=\sum_{j=1}^{4}\alpha_{ij}c_{j}^{w},\qquad \sum_{j=1}^{4}\alpha_{ij}=1,$$

where the vector $\alpha_{i}=\left(\alpha_{i1},\alpha_{i2},\alpha_{i3},\alpha_{i4}\right)$ gives the coordinates of the reference point in the Euclidean space based on the control points $c_{j}^{w}$. From the invariance of this linear relationship under the Euclidean transformation,

$$P_{i}^{c}=\sum_{j=1}^{4}\alpha_{ij}c_{j}^{c}.$$

Assume $\tilde{u}_{i}=\left[u_{i},v_{i},1\right]^{T}$ and $c_{j}^{c}=\left[x_{j}^{c},y_{j}^{c},z_{j}^{c}\right]^{T}$; then, substituting the control-point representation into the projection equation gives

$$d_{i}\begin{bmatrix}u_{i}\\ v_{i}\\ 1\end{bmatrix}=K\sum_{j=1}^{4}\alpha_{ij}\begin{bmatrix}x_{j}^{c}\\ y_{j}^{c}\\ z_{j}^{c}\end{bmatrix}.$$

Then, eliminating the depth $d_{i}$ by means of the third row, obtain the two linear equations:

$$\sum_{j=1}^{4}\alpha_{ij}f\,x_{j}^{c}+\alpha_{ij}\left(u_{0}-u_{i}\right)z_{j}^{c}=0,\qquad \sum_{j=1}^{4}\alpha_{ij}f\,y_{j}^{c}+\alpha_{ij}\left(v_{0}-v_{i}\right)z_{j}^{c}=0.$$

Assume $x=\left[c_{1}^{cT},c_{2}^{cT},c_{3}^{cT},c_{4}^{cT}\right]^{T}$, the 12-dimensional vector of the unknown control-point coordinates in the camera frame; then, the equations obtained from the correspondences between the spatial points and the image points can be written as

$$Mx=0,$$

where $M$ is the $2n\times 12$ coefficient matrix formed by stacking the two equations above for all $n$ reference points.

The solution lies in the kernel space of the matrix $M$:

$$x=\sum_{k=1}^{N}\beta_{k}v_{k},$$

where $v_{k}$ are the eigenvectors of $M^{T}M$ corresponding to its smallest eigenvalues, $N$ is the dimension of the kernel, and $\beta_{k}$ are the undetermined coefficients. For a perspective projection model, the value of $N$ is 1, resulting in

$$x=\beta v,$$

where $v=\left[v^{\left(1\right)T},v^{\left(2\right)T},v^{\left(3\right)T},v^{\left(4\right)T}\right]^{T}$ is partitioned into four 3-vectors so that $c_{j}^{c}=\beta v^{\left(j\right)}$; then, the image coordinates of the four virtual control points are

$$u_{c_{j}}=\left(f\frac{x_{j}^{c}}{z_{j}^{c}}+u_{0},\; f\frac{y_{j}^{c}}{z_{j}^{c}}+v_{0}\right),\qquad j=1,2,3,4,$$

which do not depend on the scale factor $\beta$.

The image coordinates of the four virtual control points obtained from this solution and the camera focal length obtained during the calibration process are substituted into the absolute positioning algorithm to obtain the rotation matrix and the translation vector.
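
In practice, an EPnP-style solution of this kind is available off the shelf. The sketch below uses OpenCV's `cv2.solvePnP` with the `SOLVEPNP_EPNP` flag as a stand-in for the derivation above; the assumption is that the 3D points come from the matched depth-image pixels and that the query image has already been undistorted.

```python
import cv2
import numpy as np

def estimate_pose(pts3d_world, pts2d_image, f, cx, cy):
    """Estimate camera rotation R and position from n >= 4 2D-3D correspondences.
    pts3d_world: (n, 3) reference points lifted from the depth image.
    pts2d_image: (n, 2) matched pixel coordinates in the query image."""
    K = np.array([[f, 0, cx],
                  [0, f, cy],
                  [0, 0, 1]], dtype=np.float64)
    dist = np.zeros(5)  # assume the query image has already been undistorted
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(pts3d_world, np.float64),
        np.asarray(pts2d_image, np.float64),
        K, dist, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None, None
    R, _ = cv2.Rodrigues(rvec)
    camera_center = -R.T @ tvec          # camera position in the world coordinate system
    return R, camera_center
```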

4. Experiments

We conducted two experiments to evaluate the proposed system. In the first experiment, we compare the proposed algorithm with other state-of-the-art algorithms on a public dataset and then perform numerical analysis to show the accuracy of our system. The second experiment evaluates the localization accuracy in the real world.

4.1. Experiment Setup

The experimental devices include an Android mobile phone (Lenovo Phab 2 Pro) and a depth camera (Intel RealSense D435) as shown in Figure 8. The user interface of the proposed visual positioning system on a smart mobile phone running in an indoor environment is shown in Figure 9.

4.2. Experiment on Public Dataset

In this experiment, we adopted the ICL-NUIM dataset, which consists of RGB-D images from camera trajectories in two indoor scenes. The ICL-NUIM dataset is aimed at benchmarking RGB-D, visual odometry, and SLAM algorithms [32–34]. Two different scenes (a living room and an office room) are provided with ground truth. The living room has 3D surface ground truth together with depth maps and camera poses and as a result is perfectly suited not only to benchmarking camera trajectories but also to reconstruction. The office room scene comes with only trajectory data and does not have an explicit 3D model. The images were captured at 640 × 480 resolution.

Table 1 shows the localization results of our approach compared with state-of-the-art methods. The proposed localization method is implemented on an Intel Core i5-4460 CPU @ 3.20 GHz. The total procedure from scene recognition to pose estimation takes about 0.17 s to output a location for a single image.

4.3. Experiment on Real Scenes

The images are acquired by a handheld depth camera at a series of locations. The image size is pixels, and the focal length of the camera is known. Several images of the laboratory are shown in Figure 10.

Using the RTAB-Map algorithm, we get the 3D point cloud of the laboratory. It is shown in Figure 11. The blue points are the position of the camera, and the blue line is the trajectory.

The 2D map of our laboratory is shown in Figure 12. The length and width of the laboratory are 9.7 m and 7.8 m, respectively. First, a point in the laboratory is selected as the origin, and a world coordinate system is established. Then, the mobile phone is held while walking along different routes and taking photos, as indicated by the arrows.

In the offline stage, we collect a total of 144 images. Because some images captured at different locations are similar, we divide them into 18 categories. In the online stage, we captured 45 images at different locations on route 1 and 27 images on route 2. The classification accuracy is

$$\text{accuracy}=\frac{N_{c}}{N_{t}},$$

where $N_{c}$ is the number of correctly classified scene images and $N_{t}$ is the total number of scene images. The classification accuracy of our method is 0.925.

Most misclassified scenes are concentrated at corners, mainly due to the lack of distinctive features or to mismatches. Several misclassified scenes are shown in Figure 13.

After removing the wrongly matched results, the cumulative distribution function of the positioning error is shown in Figure 14.

The trajectory of the camera is compared with the predefined route. After calculating the Euclidean distance between the positions estimated by our method and the true positions, we obtain the error cumulative distribution function graph (Figure 14). It can be seen that the average positioning error is 0.61 m. Approximately 58% of the points have a positioning error of less than 0.5 m, about 77% less than 1 m, and about 95% less than 2 m, and the maximum error is 2.55 m.
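
For reference, the error statistics and the empirical CDF can be computed from the estimated and ground-truth positions as in the following sketch; the placeholder arrays stand in for the actual measurement data.

```python
import numpy as np

# Placeholder arrays standing in for the estimated and true positions (in meters).
est_xy = np.random.rand(72, 2) * 5.0
true_xy = np.random.rand(72, 2) * 5.0

errors = np.linalg.norm(est_xy - true_xy, axis=1)   # Euclidean positioning errors

mean_error = errors.mean()
frac_within_05 = np.mean(errors < 0.5)              # fraction of points with error < 0.5 m
frac_within_10 = np.mean(errors < 1.0)
frac_within_20 = np.mean(errors < 2.0)

# Empirical cumulative distribution function of the positioning error.
sorted_err = np.sort(errors)
cdf = np.arange(1, len(sorted_err) + 1) / len(sorted_err)
```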

Since the depth images in our experiment are generated by RTAB-Map, their accuracy is limited. For example, in an indoor environment, intense illumination and strong shadows may lead to inconspicuous local features, which makes it difficult to construct a good point cloud model. In the future, we plan to use laser equipment to construct the point cloud.

5. Conclusions and Future Work

In this article, we have presented an indoor positioning system based only on cameras. The main idea is to use deep learning to identify the category of the scene and to use 2D-3D matched feature points to calculate the location. We implemented the proposed approach on a mobile phone and achieved submeter positioning accuracy. A preliminary indoor positioning experiment is reported in this paper, but the experimental site is a small-scale place. The following work needs to be done in the future: exploit the rapid development of deep learning to generate high-level semantics and overcome the limitations of hand-crafted features, use a more robust lightweight image retrieval algorithm, and carry out tests under different lighting and dynamic environments, system tests in large-scale scenarios, and long-term performance tests.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was partially supported by the Key Research and Development Program of Hebei (Project No. 19210906D).