Article

A Robust Localization System Fusing Vision-CNN Relocalization and Progressive Scan Matching for Indoor Mobile Robots

State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin 150001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(6), 3007; https://doi.org/10.3390/app12063007
Submission received: 3 February 2022 / Revised: 3 March 2022 / Accepted: 11 March 2022 / Published: 16 March 2022

Abstract

Map-based, high-precision dynamic pose tracking and rapid relocalization in the case of unknown poses are very important for indoor navigation robots. This paper proposes a robust, high-precision indoor robot positioning algorithm that combines vision and laser sensor information. The algorithm consists of two parts: initialization and real-time pose tracking. The initialization component addresses the uncertainty of the robot's initial pose and the loss of pose tracking. First, laser information is added as a geometric constraint to the posenetLSTM neural network, which originally considers only image information, and the loss function is redesigned, thereby improving global positioning accuracy. Second, starting from the coarse visual estimate, the branch and bound method is used to quickly search for the high-precision pose of the robot. In the real-time tracking component, small-scale correlation sampling is performed on the high-resolution environment grid map, and the robot's pose is dynamically tracked in real time. When the score of the tracked pose falls below a threshold, nonlinear graph optimization is used to refine the pose. To demonstrate the robustness, precision, and real-time performance of the algorithm, this article first builds a simulation environment in Gazebo for evaluation and then verifies the algorithm on the Mir robot platform. Both simulations and experiments show that introducing laser information into the neural network greatly improves the accuracy of vision relocalization, and that the system can quickly perform high-precision repositioning when the camera is not severely blocked. Compared with the adaptive Monte Carlo localization (AMCL) algorithm, the proposed algorithm also improves pose tracking accuracy and real-time performance.

1. Introduction

The stable and precise positioning of a robot is an important function in the research field of intelligent mobile robots, and it is an important foundation for the realization of other functions such as robot trajectory tracking, navigation, and environment detection. The indoor positioning of a robot can be divided into absolute positioning and relative positioning. Indoor absolute positioning systems can be implemented using radio frequency identification (RFID), ultra-wide band (UWB), and wireless local area network (WLAN) technologies. RFID-based localization methods [1] require additional readers, antennas, and tags to be deployed between the environment and the target (robot). The UWB-based positioning method [2] calculates the pose of the robot by measuring the distance between the robot and fixed base stations. Its positioning accuracy is low (errors exceeding 10 cm), and it cannot provide the robot's heading. The wireless local area network (WLAN) [3] positioning method first establishes a radio map of the environment in an offline manner, which is usually very time-consuming, and then uses the map to achieve target sensing and low-precision localization. Relative positioning includes the use of a wheel odometer, an inertial measurement unit (IMU), etc. At the same time, laser range finders and cameras can be used to localize indirectly by perceiving the environment. The visual sensor is low cost and can obtain abundant environmental information. However, vision-based positioning accuracy is low and is easily affected by lighting changes and weak textures. A high-precision positioning method can be realized with a laser sensor, which directly measures accurate distances to obstacles and is widely used in indoor mobile robot positioning and navigation. However, because the distance information obtained by the laser sensor is sparse, it is difficult to recover from positioning failures caused by similar structures or occlusion. The combination of laser and vision can complement each other: during normal dynamic pose tracking, laser information ensures accurate positioning, and when positioning fails, the rich environmental information obtained by the visual sensor enables coarse repositioning, improving the robustness of the system [4,5,6].
Aiming at the navigation process in an indoor environment, the Monte Carlo localization algorithm [7] is widely used as a laser positioning method with particle filtering, which has the characteristics of fast calculation, high accuracy, and stability. However, after the system completes the initial positioning, the Monte Carlo algorithm has poor repositioning ability if the robot's dynamic tracking fails. At the same time, efficiency is not high because the algorithm uses a fixed number of particles. In view of these shortcomings, there have been many improved versions of Monte Carlo localization [8,9]: the robustness of the system is improved by injecting random particles, and the computational efficiency is improved by adaptively setting the number of particles. Yet, in practical applications, the Monte Carlo method and the adaptive Monte Carlo method usually require us to manually specify the initial position and gradually converge to the true position through the movement of the robot. When the system needs to be relocalized (for example, when the robot encounters the kidnapping problem), adaptive Monte Carlo positioning cannot quickly obtain satisfactory results.
In response to this problem, many works integrate the visual information collected by a camera into the Monte Carlo method to improve the robustness of system positioning. In [10], visual odometry is introduced in the pose tracking stage to improve laser positioning accuracy in a dynamic environment. Shi et al. [11] proposed using the simultaneous localization and mapping (SLAM) method to actively record Oriented FAST and Rotated BRIEF (ORB) feature points and then training the collected landmarks with the kmeans++ method to generate a database of bag-of-words (DBoW); when the robot's laser is blocked, the bag of words is used to perform pose regression. ORB-SLAM [12] and VINS-Mono [13] also use the bag-of-words model for loop detection. Using the bag-of-words model for pose regression usually requires a large overlap between the current pose of the robot and the poses in the pre-training set, which means that, in order to perform better relocalization, the pre-training set must cover a large enough environmental region, resulting in a larger bag-of-words; however, landmark-based retrieval takes longer if the bag-of-words is too large. In contrast to extracting hand-crafted feature points, relocalization based on a convolutional neural network can increase the stability of positioning because the network extracts high-dimensional features of the environment. In 2015, Kendall et al. proposed PoseNet [14], the first deep learning network framework for relocalization using a single image. Inspired by the PoseNet framework, many improved versions, including posenetLSTM [15], have emerged that modify the network structure, the loss function, added constraint relations, multi-task modeling, etc. Both algorithms use the trained model to directly estimate the six degree of freedom (6-DoF) pose from a two-dimensional image. These relocalization algorithms are mainly aimed at outdoor or large scenes, and they are highly robust to illumination changes, weather changes, and motion blur. However, their relocalization accuracy for small indoor scenes is not high.
In summary, combining the relocalization capability of vision with the high-precision capability of laser positioning has produced many results. However, due to the low accuracy of visual relocalization, how to use this positioning information to perform rapid and high-precision positioning is still worth studying. This paper proposes an indoor robot positioning system based on an improved posenetLSTM visual neural network and a particle filter that can realize fast, high-precision relocalization and dynamic tracking. The methods and the main contributions of this paper are:
(1)
Based on posenetLSTM, the laser measurement and the prior map information are used as geometric constraint relationships and they are added to the loss function of the neural network to improve the accuracy of visual relocation;
(2)
The use of particle filtering to sample the rotation and the translation of particles separately on the pre-calculated, multi-resolution grid map; and the use of the branch and bound method to accelerate the high-precision relocation of the robot;
(3)
The pose tracking stage is based on the original high-resolution map: an extended Kalman filter (EKF) fuses the IMU and wheel odometer information for the initial pose estimate, and correlation-based particle sampling obtains the particle with the highest score. The system also introduces a nonlinear optimization method to refine the pose, which further improves the tracking accuracy and the real-time performance of the system;
(4)
Simulations and physical experiments were carried out in Gazebo and in real environments, respectively, which verified that the system can relocalize quickly and accurately and can track the trajectory in real time with high accuracy.
The remainder of the paper is organized as follows. In Section 2, related work on laser-based, vision-based, and fused vision-laser localization is reviewed. Section 3 introduces the system and the methods of this article. In Section 4, a simulation environment is set up in Gazebo to perform algorithm simulation analyses. In Section 5, the algorithm is verified on the physical Mir robot platform. Section 6 discusses the results and the outlook of this article, and Section 7 gives the conclusion.

2. Related Works

2.1. Laser-Based Localization Method

The Monte Carlo localization method [7] uses particles for posterior state estimation. It is a particle filter positioning algorithm that is widely used in robot localization and navigation. However, the Monte Carlo positioning method cannot solve the kidnapping problem or recover from a global positioning failure. To solve this problem, the adaptive Monte Carlo method [16] generates random particles in the sampling process, and the number of random particles is determined by the estimated values of long-term and short-term weights. The number of particles is an important parameter of the particle filter; it directly affects the computational load of the algorithm, which in turn affects the real-time performance. When the robot has not yet completed global positioning, a high number of random particles can express the belief more accurately, which helps the robot obtain its global position faster and more accurately. However, once the global position of the robot is obtained, using a high number of particles for pose tracking wastes computing resources and reduces the real-time performance of the system. Kullback-Leibler divergence (KLD) sampling [17] can adaptively adjust the required number of particles by calculating the distribution of the particle set weights: when the particles are relatively dispersed, more particles are sampled, and when the particles are concentrated, fewer particles are sampled.
The paper [18] uses correlative scan matching to address the inefficiency of brute-force sampling in the Monte Carlo method. Hess and Kohler [19] proposed a real-time correlative scan matching applied to pose tracking and loop detection in the SLAM process. Yuan et al. [20] proposed a progressive scan matching method that samples in the rotation and translation directions separately, which improved the real-time performance of the system. In that algorithm, the positioning uncertainty in a long corridor environment was considered and a multi-resolution map was used to ensure positioning accuracy. However, it sampled in the translation direction in units of grid cells, so the positioning accuracy of the system was severely limited by the grid resolution. Peng et al. [21] used a nonlinear optimization method for scan matching and inserted the pose estimate into the AMCL particle set, which improved the positioning of the system in complex and unstructured environments. Tipaldi and Spinello [22] used 2D laser measurements for retrieval, which enables efficient and accurate place recognition in very large-scale environments. Park and Roh [23] proposed a coarse-to-fine positioning method, which used two-dimensional laser measurements for place learning and coarse localization and then scan matching for the fine pose. In indoor environments, Monte Carlo localization and its improved versions use a 2D laser sensor for positioning and navigation and have achieved good results that are now widely applied. However, due to the brute-force sampling method and the randomness of the particles, there is still room for improvement in positioning accuracy and computational efficiency. At the same time, using only a sparse two-dimensional laser point cloud for place recognition (relocalization) is still not sufficient. Many advances have been made in relocalization using abundant visual information; therefore, using visual information to enhance the laser's relocalization ability is a reasonable solution.

2.2. Vision-Based Localization Method

In traditional visual positioning, bag-of-words [24] and locality-sensitive hashing (LSH) [25] are two popular methods, but in essence they are still based on low-level, hand-crafted local feature points such as oriented FAST and rotated BRIEF (ORB) [26], the scale-invariant feature transform (SIFT) [27], and speeded-up robust features (SURF) [28]. This makes it difficult to deal with complex conditions such as changes in lighting and the presence of dynamic objects. In recent years, with the development of deep learning, visual positioning algorithms based on convolutional neural networks (CNN) have been widely studied. PoseNet [14] is the first end-to-end visual positioning algorithm based on a neural network. It adds a fully connected layer to GoogLeNet [29], originally used for classification, to directly perform 6-DoF camera pose regression on the input RGB image. It obtains approximately 2 m and 3° accuracy for large-scale outdoor scenes and 0.5 m and 5° accuracy for indoor scenes. The weight file is only 50 MB, and processing one image takes only approximately 5 ms. Many improved versions have subsequently appeared based on this idea. In order to solve the over-fitting problem of PoseNet, Walch and Hazirbas [15] introduced LSTM units after the fully connected layer to perform structured dimensionality reduction and to choose the feature correlations most useful for pose estimation, which improved the positioning accuracy of the algorithm by 32-37%. In the paper [30], the PoseNet framework was extended to a Bayesian neural network, which can estimate the positioning uncertainty while positioning. The network retains all the advantages of PoseNet and improves the positioning accuracy of the system by averaging Monte Carlo dropout samples. Valada and Radwan [31] proposed a multi-task convolutional neural network to perform 6-DoF pose regression and odometry on consecutive images. The global pose and odometry sub-networks share parameters, and the relative pose between two images is used during training to limit the search space and obtain a consistent pose estimate. This method greatly improves the positioning accuracy of the system, even approaching traditional positioning methods based on local features. The paper [32] added semantic information on the basis of [31]; it learns 6-DoF pose regression, odometry, and semantics simultaneously, adopting an adaptive weighting approach to aggregate motion-specific temporal information and adding semantic information to the localization stream. While completing multiple tasks, the positioning accuracy of the system is further improved. Geometric consistency constraints have recently been used to help improve the accuracy of pose regression, and they have proven to be more effective than using Euclidean distance constraints alone. In the paper [33], a photometric error and a geometric similarity error are added to the loss function by using depth image information. Kendall et al. [34] introduced the reprojection error, using the predicted camera pose to project 3D points in the environment onto the 2D image plane and using the deviation of the pixel positions as a constraint term.
Adding geometric constraints can improve positioning accuracy, but current methods rarely consider introducing the laser information commonly used in indoor positioning into these geometric constraints.

2.3. Localization Method Combining Vision and Laser

Fusing visual and laser information can exploit the advantages of each to improve the robustness and the accuracy of the positioning system. There have been many successful applications in SLAM [5,35,36]. In the indoor environment, the fusion of visual information with map-based Monte Carlo positioning has also yielded many results. The paper [37] retrieved similar pictures based on the feature points extracted from the image, thereby reducing the matching area; the laser information was then combined with the Monte Carlo localization algorithm to optimize the pose, and the system performed well in relocalization. In the paper [38], Soonyong et al. established a hybrid topological map containing metric scale information and object location information. The system first used visual information for object recognition and least squares optimization to obtain coarse pose information, and then used particle filtering to obtain a fine pose estimate. To solve the problem of robot kidnapping, Xu et al. [39] used an LSTM network for image association to achieve relocalization and an adaptive Monte Carlo algorithm to achieve precise positioning. Shi et al. [11] proposed collecting ORB feature points while using the laser to build a grid map and then training the collected feature points to generate a visual bag-of-words in an offline stage. The online stage obtains prior information by extracting visual features and generating the proposal distribution; the system then removes occluded information and selects reliable lidar beams for precise positioning. At present, most positioning frameworks that integrate vision and laser proceed from coarse to fine, that is, they first use visual information for a rough estimate and then feed this rough pose estimate into the AMCL system for accurate pose estimation. This process usually requires the robot to be in motion, and the algorithm needs several iteration cycles to complete the pose convergence. In order to overcome these shortcomings and improve the convergence speed, this paper investigates a high-precision, real-time relocalization system for indoor environments.

3. System Overview and Methodology

3.1. System Overview

The system framework of this paper is shown in Figure 1, which is mainly composed of three parts. The first part is to collect the image and 2D laser information, and then train the convolutional neural network together with the prior map information (which can be obtained through the SLAM method). The second part is to use the trained neural network to provide rough pose estimation and then use the branch and bound method to speed up the search for accurate pose when the robot is in the stage of pose initialization or pose tracking failure (such as the kidnapping problem). The third part is to fuse the information of the IMU and the wheel odometer in the robot tracking stage to obtain the predicted pose of the robot. Then, the correlation sampling method is used to obtain the optimal estimated pose of the robot, and the nonlinear optimization method can be further used to obtain a higher-precision pose.

3.2. Design of Convolutional Neural Network

PoseNet uses stochastic gradient descent (SGD) to optimize a Euclidean loss function. This causes no problem for the position, because the three position components are independent of each other. However, for the orientation represented by the quaternion q, the modulus is 1, and q and -q represent the same rotation. A Euclidean loss cannot guarantee that the estimate has unit norm, and defining the loss function in this way is over-constrained, which reduces the speed of training convergence and causes the loss value to fluctuate during training. Here we take the sum of the squares of the imaginary part of $\hat{q} q^{-1}$ as the orientation loss, which is $\left| \hat{q} q^{-1} \right|_{x,y,z}^{2}$. Several papers have noted that adding geometric information can improve the positioning accuracy of the system. Here, the laser measurement data and the prior map are used as environmental constraints. For each laser measurement point, the predicted pose $(\hat{p}, \hat{q})$ is projected into the prior grid map and the grid coordinates are calculated. The distance from that grid cell to the nearest obstacle grid cell is defined as the loss value, which is added to the loss function of the neural network. The overall loss function of the neural network is:
$$\mathrm{loss}(I) = \alpha \left| \hat{p} - p \right|^{2} + \beta \left| \hat{q} q^{-1} \right|_{x,y,z}^{2} + \gamma \sum_{z} \left| \mathrm{dis}(z, \hat{p}, \hat{q}, m) \right|^{2} \tag{1}$$
where $\alpha$, $\beta$, and $\gamma$ are scale factors used to make the displacement error, the direction error, and the geometric error approximately equal at final convergence. The laser measurement point is $z$ and $m$ is the prior grid map. The acquisition of the prior map and the calculation of the map grid coordinates from the laser measurement points are introduced below. The overall vision-based neural network structure is shown in Figure 2.
The size of the weight file of the network is approximately 50 MB, which is nearly the same as the weight file of the PoseNet network. Since the environmental constraints are only added when training the network and not when predicting, adding geometric constraints will not affect the computational efficiency of the system. The system takes approximately 5 ms to process an image for pose prediction. In the following, the vision-based relocalization estimation is used as a rough estimate of the pose combined with laser measurement data to obtain a high-precision pose estimation.
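For illustration, a minimal PyTorch-style sketch of how the loss in Equation (1) could be assembled for one training sample is given below. This is not the implementation used in this paper: the function and variable names (geometric_loss, dist_map, map_origin) and the default weights are illustrative, quaternions are taken in (w, x, y, z) order, and the planar yaw is recovered from the quaternion under a rotation-about-z assumption. A nearest-cell lookup is used for the map term for brevity; a differentiable (bilinear) interpolation, as in Section 3.5, would be needed in practice for gradients to reach the pose through the map term.

```python
# Illustrative sketch only (not the authors' code): assembling the loss of Equation (1)
# for a single training sample. `dist_map` is the obstacle-distance grid of Section 3.3;
# alpha/beta/gamma are the scale factors of Equation (1) and must be tuned.
import torch

def quat_mul(q1, q2):
    """Hamilton product of two quaternions given as (w, x, y, z) tensors."""
    w1, x1, y1, z1 = q1.unbind(-1)
    w2, x2, y2, z2 = q2.unbind(-1)
    return torch.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2], dim=-1)

def geometric_loss(pred_p, pred_q, gt_p, gt_q, scan, dist_map, map_origin, scale,
                   alpha=1.0, beta=1.0, gamma=1.0):
    # Position term |p_hat - p|^2.
    loss_p = ((pred_p - gt_p) ** 2).sum()
    # Orientation term: squared imaginary part of q_hat * q^(-1) (unit ground-truth quaternion).
    q_inv = gt_q * torch.tensor([1.0, -1.0, -1.0, -1.0])
    q_err = quat_mul(pred_q / pred_q.norm(), q_inv)
    loss_q = (q_err[1:] ** 2).sum()
    # Map term: project every range reading with the predicted pose and read the
    # obstacle distance stored in the preprocessed grid map (Section 3.3).
    yaw = 2.0 * torch.atan2(pred_q[3], pred_q[0])   # planar (z-axis) approximation
    ranges, angles = scan[:, 0], scan[:, 1]
    px = pred_p[0] + ranges * torch.cos(yaw + angles)
    py = pred_p[1] + ranges * torch.sin(yaw + angles)
    ix = ((px - map_origin[0]) / scale).long().clamp(0, dist_map.shape[1] - 1)
    iy = ((py - map_origin[1]) / scale).long().clamp(0, dist_map.shape[0] - 1)
    # Nearest-cell lookup for brevity; bilinear interpolation would keep this
    # term differentiable with respect to the predicted pose.
    loss_m = (dist_map[iy, ix] ** 2).sum()
    return alpha * loss_p + beta * loss_q + gamma * loss_m
```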

3.3. Map Preprocessing

An occupancy grid map describes the environment in the form of discrete grids. In order to make it more convenient to apply to the positioning algorithm, the information contained in the original grid is expanded to include distance information $d$ and score information $score$. The specific steps are as follows. First, set the distance of every occupied grid cell to 0 and its score to 1, and set the distance of the remaining cells to the maximum value and their score to 0. Then, starting from the occupied cells and expanding outwards, update the distance and score information of the surrounding cells. Assuming that the coordinates of a grid cell are $(x_n, y_n)$ and the grid coordinates of the nearest obstacle are $(x_{obstacle}, y_{obstacle})$, the distance information of the cell is calculated as a Euclidean distance:
$$d(x_n, y_n) = \min\left\{ \sqrt{(x_n - x_{obstacle})^2 + (y_n - y_{obstacle})^2},\ d_{\max} \right\} \tag{2}$$
In Formula (2), $d_{\max}$ is a preset value. At the same time, the score information can be calculated according to the distance information:
$$\mathrm{score}(x_n, y_n) = \exp\left( -\frac{d(x_n, y_n)^2}{2\,\sigma^2} \right) \tag{3}$$
In Formula (3), $\sigma$ is a design parameter. In this way, the resolution of the updated map is the same as that of the original occupancy grid map (usually 5 cm). When initializing the global pose or relocalizing, there may be a large error between the actual pose and the initial estimated pose. Precise pose estimation at this resolution requires sampling a large number of particles to cover the pose space, which would greatly reduce the real-time performance of the system. Here, the original map is first processed into multiple resolutions, and the branch and bound algorithm is then used to accelerate the search for the optimal pose. When using branch and bound, particles are first sampled on the low-resolution grid map to obtain the particle with the highest score. Starting from this particle, the resolution of the map is increased and a higher-precision pose is obtained, until the optimal pose is found on the highest-resolution map. In order to ensure that the method obtains the global optimal solution in the search window, the low-resolution grid map needs to satisfy the following constraint. Suppose the low-resolution grid map to be calculated is $M_h$ and the translation sampling step is $2^h$ grid cells. Under a specific rotation, the number of particles contained at the highest resolution is $2^h \times 2^h$, with corresponding states $\xi_j,\ j = 0 : 2^h \times 2^h$. When calculating the score of the laser points, the following must hold:
$$\mathrm{score}(c) = \sum_{k=1}^{K} M_h\bigl(T_{\xi_c} z_k\bigr) \geq \max_{\xi_j} \sum_{k=1}^{K} M_0\bigl(T_{\xi_j} z_k\bigr) \tag{4}$$
where $M_0$ is the original high-resolution grid map and $c$ is the particle index. $T_{\xi_c}$ is the robot pose transformation represented by particle $c$ sampled on the low-resolution grid map $M_h$, and $T_{\xi_j}$ are the possible pose transformations represented by the corresponding $2^h \times 2^h$ particles on the original high-resolution map $M_0$. The laser measurement data are $z_k$, and $K$ is the number of laser measurement points. In order to ensure that the inequality holds, the low-resolution map $M_h$ is computed from the high-resolution map $M_0$ as:
$$M_h(x, y) = \max_{\substack{x' \in [x,\ x + r(2^h - 1)] \\ y' \in [y,\ y + r(2^h - 1)]}} M_0(x', y') \tag{5}$$
where $M_0$ and $M_h$ have the same number of pixels. The original grid cell size is $r$, and the preprocessed grid map is shown in Figure 3. The darker the color, the closer to the obstacle.
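For illustration, the following sketch (assuming a numpy/scipy environment; the values of d_max and sigma and the edge padding at the map border are arbitrary choices, not the parameters used in this paper) shows how the distance map of Equation (2), the score map of Equation (3), and the windowed-maximum low-resolution map $M_h$ of Equation (5) could be precomputed.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def preprocess_map(occupancy, resolution=0.05, d_max=1.0, sigma=0.25):
    """Distance map (Eq. 2) and score map (Eq. 3) from a binary occupancy grid
    (True = occupied). d_max and sigma are illustrative values."""
    # Distance (in metres) from every cell to the nearest occupied cell;
    # occupied cells get distance 0 and therefore score 1.
    dist = distance_transform_edt(~occupancy) * resolution
    dist = np.minimum(dist, d_max)
    score = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    return dist, score

def downsample_max(score, h):
    """Low-resolution map M_h of Eq. (5): each cell stores the maximum score of the
    2^h x 2^h window starting at that cell, so branch and bound never underestimates
    the best score reachable at full resolution. Same number of pixels as M_0."""
    w = 2 ** h
    padded = np.pad(score, ((0, w - 1), (0, w - 1)), mode='edge')
    out = np.full_like(score, -np.inf)
    for dy in range(w):
        for dx in range(w):
            out = np.maximum(out, padded[dy:dy + score.shape[0],
                                         dx:dx + score.shape[1]])
    return out
```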

3.4. Correlation Particle Sampling Method

Brute-force sampling according to the motion model samples the displacement and the angle around the predicted pose at the same time according to a Gaussian distribution. Each particle's state contains its pose and weight, and each particle is scored according to its pose. Assuming that the particle pose is $(x, y, \theta)$, the coordinates of the laser measurement point $(z_t^k, \theta_{k,sens})$ in the global coordinate system are:
$$\begin{pmatrix} x_{z_t^k} \\ y_{z_t^k} \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_{k,sens} \\ y_{k,sens} \end{pmatrix} + z_t^k \begin{pmatrix} \cos(\theta + \theta_{k,sens}) \\ \sin(\theta + \theta_{k,sens}) \end{pmatrix} \tag{6}$$
It can be seen from the above equation that the right side contains translation and rotation components that are independent of each other. With brute-force sampling, the rotation and translation components must be computed for every particle state. However, if we first uniformly sample the angle and then uniformly sample the translation, we only need to apply the corresponding translation (a simple addition) on top of each rotation, which greatly simplifies the calculation. The particle sampling process based on the correlation method is shown in Figure 4.
Step 1 uses the robot's internal sensors, such as the wheel odometer and the IMU, to obtain the proposal pose of the robot. From the laser measurement, the maximum distance measured by the laser is obtained and the angle step is calculated as $\Delta\theta = \arccos\left(1 - \tfrac{1}{z_{\max}^{2}}\right)$. The number of rotation samples can then be calculated from the angle search window set by the algorithm. As shown in step 2, there are 11 rotation directions. For each rotation direction, the coordinates of the laser measurement points in the world coordinate system are computed and recorded. In step 3, according to the translation search window size and the sampling interval set by the algorithm, the number of sampled particles for translation is calculated. In Figure 4, three particle samples are set in the x and the y directions for each orientation. By adding the corresponding offsets in the x and the y directions to the world coordinates of the laser measurement data saved for each orientation in step 2, the coordinates of the laser measurement points of all sampled particles in the world coordinate system are obtained. Then, Formula (7) is used to calculate the corresponding grid coordinates.
$$x_{z_t^k}^{index} = (x_{z_t^k} - x_{origin})/scale, \qquad y_{z_t^k}^{index} = (y_{z_t^k} - y_{origin})/scale \tag{7}$$
where $(x_{origin}, y_{origin})$ is the coordinate of the grid map origin in the world coordinate system and $scale$ is the resolution of the grid map. According to the grid coordinates of the laser measurement points, the score information can be obtained directly from the grid map by indexing, and the sampled particle with the highest score can be found. The optimal pose estimate is obtained by adding the offset corresponding to the highest-scoring particle to the initial proposal pose.
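A minimal numpy sketch of this correlation sampling procedure is given below; it is not the authors' implementation, and the window sizes and step values are purely illustrative. The structure of the loops shows the key point of Section 3.4: the scan is rotated once per angular sample, and every translational sample then reuses those rotated points with a simple (x, y) offset before looking them up in the score map of Section 3.3. The proposal pose may come from the vision CNN (initialization) or from the EKF prediction (tracking).

```python
import numpy as np

def correlative_sample(score_map, origin, scale, scan_xy_robot, pose0,
                       lin_win=0.15, lin_step=0.05, ang_win=0.35, ang_step=0.0175):
    """Progressive correlative sampling around the proposal pose `pose0`.
    `scan_xy_robot` holds the scan endpoints in the robot frame, shape (K, 2).
    Window and step values are illustrative only."""
    x0, y0, th0 = pose0
    best = (-np.inf, pose0)
    offsets = np.arange(-lin_win, lin_win + 1e-9, lin_step)
    for dth in np.arange(-ang_win, ang_win + 1e-9, ang_step):
        th = th0 + dth
        c, s = np.cos(th), np.sin(th)
        # World coordinates of the scan for this orientation (computed once).
        wx = x0 + c * scan_xy_robot[:, 0] - s * scan_xy_robot[:, 1]
        wy = y0 + s * scan_xy_robot[:, 0] + c * scan_xy_robot[:, 1]
        for dx in offsets:
            for dy in offsets:
                # Translation only adds an offset; no re-rotation is needed.
                ix = ((wx + dx - origin[0]) / scale).astype(int)
                iy = ((wy + dy - origin[1]) / scale).astype(int)
                valid = (ix >= 0) & (ix < score_map.shape[1]) & \
                        (iy >= 0) & (iy < score_map.shape[0])
                sc = score_map[iy[valid], ix[valid]].sum()
                if sc > best[0]:
                    best = (sc, (x0 + dx, y0 + dy, th))
    return best   # (best score, best pose)
```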

3.5. Scan Matching Optimization

Graph optimization may be used for pose optimization during the robot's global initial positioning and pose tracking phases. When the initial pose estimate lies outside the branch-and-bound search window during global initial positioning, the optimal solution cannot be found by the search. However, the following simulation shows that, in a structured environment, using graph optimization can increase the proportion of successful relocalizations. Additionally, in the pose tracking stage, when the tracking error is large (the highest score of the particle set is lower than the preset threshold), scan matching based on graph optimization can quickly converge the estimated pose to near the true value. The scan matching method based on nonlinear optimization can be described as a least squares problem:
$$E(T) = \min_{\xi} \sum_{k=1:n} \Bigl(1 - M\bigl(I(P(z_t^k))\bigr)\Bigr)^2 = \min_{\xi} \sum_{k=1:n} f_k(\xi)^2 \tag{8}$$
where $\xi = (x, y, \theta)^T$ is the variable to be optimized and $n$ is the number of laser points; $I(P(z_t^k))$ represents the conversion of the coordinates $P(z_t^k)$ of the laser measurement point $z_t^k$ in the world coordinate system to grid coordinates in the map coordinate system; and $P(z_t^k)$ and $I(P(z_t^k))$ are, respectively:
$$P(z_t^k) = \begin{pmatrix} x_{z_t^k} \\ y_{z_t^k} \end{pmatrix} = \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_{k,sens} \\ y_{k,sens} \end{pmatrix} + z_t^k \begin{pmatrix} \cos(\theta + \theta_{k,sens}) \\ \sin(\theta + \theta_{k,sens}) \end{pmatrix} \tag{9}$$
$$I\bigl(P(z_t^k)\bigr) = \bigl( x_{z_t^k}^{index},\ y_{z_t^k}^{index} \bigr)^T = \bigl( (x_{z_t^k} - x_{origin})/scale,\ (y_{z_t^k} - y_{origin})/scale \bigr)^T \tag{10}$$
This least squares problem can be solved using the Gauss-Newton method, the Levenberg-Marquardt (LM) method, or the Dogleg method. The algorithm proposed in this paper uses the framework of the Ceres library [40] to complete the calculation. However, before applying any of these solvers, we first need the derivative of the residual $1 - M\bigl(I(x_{z_t^k}, y_{z_t^k})\bigr)$ with respect to the pose $\xi$:
$$J_k = \frac{\partial f_k(\xi)}{\partial \xi} = -\frac{\partial M\bigl(I(P(z_t^k))\bigr)}{\partial I(P(z_t^k))}\,\frac{\partial I(P(z_t^k))}{\partial P(z_t^k)}\,\frac{\partial P(z_t^k)}{\partial \xi} \tag{11}$$
where:
$$\frac{\partial P(z_t^k)}{\partial \xi} = \begin{pmatrix} 1 & 0 & -x_{k,sens}\sin\theta - y_{k,sens}\cos\theta - z_t^k \sin(\theta + \theta_{k,sens}) \\ 0 & 1 & x_{k,sens}\cos\theta - y_{k,sens}\sin\theta + z_t^k \cos(\theta + \theta_{k,sens}) \end{pmatrix}, \qquad \frac{\partial I(P(z_t^k))}{\partial P(z_t^k)} = \frac{1}{scale} \tag{12}$$
In the above formula, only $\partial M\bigl(I(P(z_t^k))\bigr)/\partial I(P(z_t^k))$ is unknown. Since the grid map is discrete, each grid cell only records the score at its center. In general, the grid coordinate $I(P(z_t^k))$ of a laser point projected onto the map does not coincide exactly with a cell center, and the discrete form is not convenient for computing derivatives. Here, bilinear interpolation is used to calculate the score and its derivatives. Bilinear interpolation uses the grid coordinates to determine the four adjacent cells, as shown in Figure 5.
For convenience of notation, suppose the grid coordinate of the laser point projected into the map coordinate system is $P(x, y)$ and its four adjacent grid points are $P(x_0, y_0)$, $P(x_1, y_0)$, $P(x_1, y_1)$, and $P(x_0, y_1)$. The score corresponding to each grid point is $M(P(x_0, y_0))$, $M(P(x_1, y_0))$, $M(P(x_1, y_1))$, and $M(P(x_0, y_1))$; the score at $P(x, y)$ can be obtained by Lagrange interpolation:
$$M(P(x,y)) \approx \frac{y - y_0}{y_1 - y_0}\left( \frac{x - x_0}{x_1 - x_0} M(P(x_1, y_1)) + \frac{x_1 - x}{x_1 - x_0} M(P(x_0, y_1)) \right) + \frac{y_1 - y}{y_1 - y_0}\left( \frac{x - x_0}{x_1 - x_0} M(P(x_1, y_0)) + \frac{x_1 - x}{x_1 - x_0} M(P(x_0, y_0)) \right) \tag{13}$$
The partial derivative of $M(P(x,y))$ with respect to $x$ is:
$$\frac{\partial M(x,y)}{\partial x} = \frac{y - y_0}{(y_1 - y_0)(x_1 - x_0)}\bigl( M(P(x_1, y_1)) - M(P(x_0, y_1)) \bigr) + \frac{y_1 - y}{(y_1 - y_0)(x_1 - x_0)}\bigl( M(P(x_1, y_0)) - M(P(x_0, y_0)) \bigr) \tag{14}$$
The partial derivative of $M(P(x,y))$ with respect to $y$ is:
$$\frac{\partial M(x,y)}{\partial y} = \frac{x - x_0}{(x_1 - x_0)(y_1 - y_0)}\bigl( M(P(x_1, y_1)) - M(P(x_1, y_0)) \bigr) + \frac{x_1 - x}{(x_1 - x_0)(y_1 - y_0)}\bigl( M(P(x_0, y_1)) - M(P(x_0, y_0)) \bigr) \tag{15}$$
Substituting Equations (14) and (15) into Equation (11) completes the calculation of the Jacobian matrix $J_k$. After obtaining $J_k$, the Ceres library [40] is conveniently used to solve the nonlinear least squares problem (8).
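To make the scan matching step concrete, the sketch below shows, in numpy, the bilinear map interpolation of Equations (13)-(15), the residual and Jacobian of Equation (11), and a single Gauss-Newton update for problem (8). It is only an illustration under simplifying assumptions: the lidar is placed at the robot origin (so the $(x_{k,sens}, y_{k,sens})$ offset of Equation (9) is dropped), bounds checking is omitted, and the paper itself solves the problem with the Ceres library [40].

```python
import numpy as np

def map_value_and_gradient(M, px, py):
    """Bilinear interpolation of the score map M at continuous grid coordinates
    (px, py): the value of Eq. (13) and the derivatives of Eqs. (14) and (15),
    with x1 - x0 = y1 - y0 = 1 grid cell."""
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    tx, ty = px - x0, py - y0
    m00, m10 = M[y0, x0], M[y0, x0 + 1]
    m01, m11 = M[y0 + 1, x0], M[y0 + 1, x0 + 1]
    value = ty * (tx * m11 + (1 - tx) * m01) + (1 - ty) * (tx * m10 + (1 - tx) * m00)
    dM_dx = ty * (m11 - m01) + (1 - ty) * (m10 - m00)
    dM_dy = tx * (m11 - m10) + (1 - tx) * (m01 - m00)
    return value, dM_dx, dM_dy

def residuals_and_jacobian(M, pose, scan, origin, scale):
    """Residuals f_k = 1 - M(I(P(z_k))) and one Jacobian row per beam (Eq. (11)).
    The lidar is assumed to sit at the robot origin for brevity."""
    x, y, th = pose
    r, a = scan[:, 0], scan[:, 1]                 # range and beam angle per point
    px = (x + r * np.cos(th + a) - origin[0]) / scale
    py = (y + r * np.sin(th + a) - origin[1]) / scale
    res, J = [], []
    for k in range(len(r)):
        m, gx, gy = map_value_and_gradient(M, px[k], py[k])
        res.append(1.0 - m)
        dP_dxi = np.array([[1.0, 0.0, -r[k] * np.sin(th + a[k])],
                           [0.0, 1.0,  r[k] * np.cos(th + a[k])]])
        # Chain rule of Eq. (11): dM/dI * dI/dP * dP/dxi, with a leading minus
        # because the residual is 1 - M(.).
        J.append(-(np.array([gx, gy]) / scale) @ dP_dxi)
    return np.array(res), np.vstack(J)

def gauss_newton_step(M, pose, scan, origin, scale):
    """One Gauss-Newton update of the pose; the paper uses Ceres [40] instead."""
    res, J = residuals_and_jacobian(M, pose, scan, origin, scale)
    delta = np.linalg.solve(J.T @ J, -J.T @ res)
    return np.asarray(pose) + delta
```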

4. Simulation

4.1. Simulation Settings

In order to verify the relocalization and pose tracking performance of the algorithm, we set up a simulation environment in Gazebo. The Mir mobile robot platform was equipped with a 2D laser sensor that could directly obtain obstacle information around the robot. The sensing angle range was -2.356 rad to +2.356 rad, the sensing distance range was 0.05 m to 24.0 m, and 541 laser points were collected per frame. At the same time, the robot was equipped with a visual sensor that provided real-time RGB raw images with a resolution of 1920 × 1080. The algorithm was developed on the Ubuntu 18.04 operating system in the Robot Operating System (ROS) environment. All simulation experiments were completed on a computer equipped with an Intel i7-7700K CPU and a GTX 1060 GPU, which was used to accelerate neural network training.

4.2. Initialization of the Robot Global Positioning

4.2.1. Visual Localization Accuracy Analysis

The simulation environment built in Gazebo is shown in Figure 6d. First, 665 images and their corresponding laser measurement data and pose information were collected in the simulation environment as the training data of the neural network. Figure 6a is the image information collected by the camera of the robot at the origin of the coordinate system. The red dot in Figure 6b represents the laser measurement information at the same position. The pose information of the robot can be obtained directly through the Gazebo backend, and then the same collected data can be used to train the posenetLSTM and the neural network proposed in this paper. Since the neural network directly outputs the pose only by inputting the image, the trained neural network is directly used to estimate the robot’s motion trajectory to characterize its relocalization ability.
The green line in Figure 6c is the trajectory of the robot used to evaluate the relocalization accuracy of the algorithm. The green dot is the starting point of the trajectory and the red dot is the end point. The estimated trajectories and the trajectory errors based on posenetLSTM and the method proposed in this paper are shown in Figure 7. It can be seen from Figure 7 that the amplitude of the error fluctuation of the original posenetLSTM method is significantly larger than that of the proposed method, in both the translation and the rotation directions. Moreover, at some positions the error of the posenetLSTM method also exhibits abrupt jumps. Since the proposed method introduces the laser measurement as a structural constraint, the positioning accuracy and stability are greatly improved. The statistical results, including root mean square error (RMSE), mean absolute error (MAE), and standard deviation (SD), are shown in Table 1. It can be seen that, compared with the posenetLSTM method, the relocalization accuracy of the proposed algorithm in the x, y, and yaw directions is improved by approximately 28.92%, 43.13%, and 46.83%, respectively.

4.2.2. System Relocalization Analysis

The experiment in this section verifies the ability of the algorithm to perform accurate global relocalization when the initial position of the robot is unknown. The algorithm's default initial position of the robot is the origin of the coordinate system, that is, $(x_0^{map\_base}, y_0^{map\_base}, \varphi_0^{map\_base}) = (0.0, 0.0, 0.0)$. The linear and angular search ranges of the algorithm are (-1.5 m, +1.5 m) and (-π rad, +π rad), respectively. It is time-consuming to calculate the scores of all laser points for every particle state. To improve the time efficiency, only every eighth laser point is scored, so approximately 70 laser points per frame are used for scoring. The simulation results in the previous section showed that the accuracy of the pose estimated by the neural network alone was poor. Here, the pose provided by the neural network is used as the initial pose, and the laser information is used to estimate the precise pose of the robot. A total of 36 × 4 = 144 tests were carried out, covering 36 positions with 4 test directions each. The relocalization test results based on posenetLSTM and the method proposed in this paper are shown in Figure 8. A green arrow in Figure 8 indicates that the relocalization test was successful (the exact pose of the robot could be estimated correctly), and a red arrow indicates that it failed. A blue arrow in Figure 8a indicates a pose for which the method proposed in this paper succeeded but the posenetLSTM-based method failed. Based on the method proposed in this paper, the number of successful relocalizations is 118, a success rate of 81.9%; based on posenetLSTM, it is 103, a success rate of 71.5%. It can be seen from Figure 8 that most of the relocalization failures are near obstacles, because the camera's viewing angle is almost completely blocked in such cases. A second reason for relocalization failure is that the surfaces of some obstacles in the simulation environment have repeated brick-like textures, which is unfavorable for relocalization by the visual neural network.
The pose change under the condition that both the proposed method and posenetLSTM can relocate successfully is shown in Figure 9. For the yaw angle in Figure 9e,f, the −π rad and +π rad represent the same orientation. At the same time, when the relocalization is successful, the two methods can both achieve extremely high positioning accuracy after optimizing the initial pose estimated by the visual neural network using laser information.
The pose estimated by the visual neural network is used as the initial pose, and then the laser information is used to search for the optimal posture. The search time of the two methods is shown in Figure 10. Since the algorithm’s time consumption fluctuates every time the algorithm runs, the program for each location is run five times and the running time is recorded. Then, the average calculation time is taken as the search time for that location. It can be seen from Figure 10 that since the initial pose estimated by the neural network proposed in this article is more accurate, the search time for the optimal pose based on it is also shorter. Compared with the average time based on the posenetLSTM algorithm of 0.4526 s, the average time based on the method proposed in this paper is 0.3856 s, a reduction of 14.81%.

4.3. Dynamic Pose Tracking of Robot

This section mainly verifies the high-precision pose tracking performance of the algorithm. The positioning system involves three main coordinate systems: map, odom, and base. The base coordinate system is the coordinate system attached to the robot, the EKF filter [41] estimates in the odom coordinate system, and the global pose is defined in the map coordinate system. Before particle sampling, the EKF framework is used to fuse the robot's internal motion sensors, the IMU and the wheel odometer, to obtain the predicted initial pose of the robot in the odom coordinate system. Assume that the fused poses at times $k-1$ and $k$ are $(x_{k-1}^{odom\_base}, y_{k-1}^{odom\_base}, \varphi_{k-1}^{odom\_base})$ and $(x_k^{odom\_base}, y_k^{odom\_base}, \varphi_k^{odom\_base})$, respectively. Then, the global pose of the robot at time $k$ is:
$$\begin{aligned} x_k^{map\_base} &= x_{k-1}^{map\_base} + \delta_{trans}\cos\bigl(\varphi_{k-1}^{map\_base} + \delta_{rot1}\bigr) \\ y_k^{map\_base} &= y_{k-1}^{map\_base} + \delta_{trans}\sin\bigl(\varphi_{k-1}^{map\_base} + \delta_{rot1}\bigr) \\ \varphi_k^{map\_base} &= \varphi_{k-1}^{map\_base} + \delta_{rot1} + \delta_{rot2} \end{aligned} \tag{16}$$
where:
$$\begin{aligned} \delta_{rot1} &= \arctan2\bigl(y_k^{odom\_base} - y_{k-1}^{odom\_base},\ x_k^{odom\_base} - x_{k-1}^{odom\_base}\bigr) - \varphi_{k-1}^{odom\_base} \\ \delta_{trans} &= \sqrt{\bigl(x_k^{odom\_base} - x_{k-1}^{odom\_base}\bigr)^2 + \bigl(y_k^{odom\_base} - y_{k-1}^{odom\_base}\bigr)^2} \\ \delta_{rot2} &= \varphi_k^{odom\_base} - \varphi_{k-1}^{odom\_base} - \delta_{rot1} \end{aligned} \tag{17}$$
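For illustration, a short sketch of the motion update of Equations (16) and (17) is given below; it is a simplified version (the near-zero-translation case is handled with a crude threshold and the yaw is wrapped at the end) rather than the exact implementation used in the system.

```python
import numpy as np

def propagate_global_pose(prev_map_pose, prev_odom_pose, odom_pose):
    """Rotation-translation-rotation update of Eqs. (16) and (17).
    Poses are (x, y, yaw); the two odom poses are consecutive EKF outputs."""
    xo1, yo1, po1 = prev_odom_pose
    xo2, yo2, po2 = odom_pose
    d_trans = np.hypot(xo2 - xo1, yo2 - yo1)
    # For near-zero translation the heading of the displacement is undefined;
    # fall back to a pure-rotation update in that case.
    d_rot1 = 0.0 if d_trans < 1e-6 else np.arctan2(yo2 - yo1, xo2 - xo1) - po1
    d_rot2 = po2 - po1 - d_rot1
    xm, ym, pm = prev_map_pose
    xm += d_trans * np.cos(pm + d_rot1)
    ym += d_trans * np.sin(pm + d_rot1)
    pm += d_rot1 + d_rot2
    return xm, ym, np.arctan2(np.sin(pm), np.cos(pm))   # wrap yaw into (-pi, pi]
```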
The robot follows basically the same trajectory as in Section 4.2, and the pose tracking accuracy of the EKF, the AMCL algorithm, and the algorithm proposed in this paper is compared. Two experiments, case 1 and case 2, are carried out: in case 1 the IMU and the wheel odometer have a small error, and in case 2 they have a large error. For case 1, the pose estimates and the corresponding pose errors of the EKF, AMCL, and the proposed algorithm are shown in Figure 11, and the statistical results of the errors are shown in Table 2. Taking the RMSE results as an example, compared with the AMCL algorithm, the proposed algorithm improves the pose tracking accuracy by 9.75%, 14.98%, and 8.6% in the x, y, and yaw directions, respectively. Both AMCL and the proposed algorithm achieve high accuracy, with the RMSE of the position error being less than 0.05 m (the size of one grid cell).
In case 2, the pose estimates and the pose errors of the EKF, AMCL, and the proposed algorithm are shown in Figure 12. It can be seen from Figure 12 and Table 2 that when the error of the wheel odometer and the IMU is large, the correlation sampling algorithm proposed in this paper and the AMCL algorithm can still achieve high accuracy, which is the same conclusion as in case 1. At the same time, when the IMU and wheel odometer errors are large, the pose tracking error of the AMCL algorithm increases slightly, while the method proposed in this article is basically unaffected. In practical applications, errors of the IMU and the wheel odometer are inevitable, and the method proposed in this paper keeps its accuracy unaffected or only slightly affected, which is one of the major advantages of the proposed algorithm.
From the above experiments, we can see that the accuracy of the algorithm in this article is slightly better than that of the AMCL algorithm, but the difference is not large. The number of sampled particles set in the simulation is 775, and every fifth laser point is scored for the statistical evaluation. The running times of the algorithm proposed in this paper and the AMCL algorithm are shown in Figure 13. The jumps in calculation time in Figure 13b correspond to cases where the AMCL algorithm adaptively increases the number of sampled particles, with a maximum of 5000 particles. The statistical results of the running time are shown in Table 3. It can be seen from the table that the average running time of the algorithm proposed in this article is 30% less than that of the AMCL algorithm. In the simulation we also found that if the number of sampled particles or of laser points selected for scoring increases, the computational advantage of the proposed algorithm over the AMCL algorithm becomes even more significant.

5. Experiments

5.1. Experimental Platform and Measuring Equipment

In order to verify the performance of the algorithm, we built an autonomous mobile robot platform, Mir-UR5. The experimental platform is shown in Figure 14a. Its hardware mainly included a mobile robot MiR100, a six-degree-of-freedom manipulator UR5, and a depth camera Kinect V2. The mobile controller was an Alienware computer configured with a GTX 1080 GPU, 32 GB of RAM, and an Intel i9 CPU. The operating system was Ubuntu 18.04 and the Robot Operating System (ROS) framework was used. The external measurement system, shown in Figure 14b, contains a Leica Geosystems Absolute Tracker (AT960) and laser tracker targets, with an absolute positioning accuracy of up to 0.01 mm.

5.2. Relocalization Experiment

5.2.1. Test of Visual Relocalization Accuracy

The occupancy grid map of the indoor environment, with a grid resolution of 0.05 m, was established using the SLAM algorithm Cartographer [19], as the laser-based positioning accuracy on the map is one order of magnitude higher than the visual relocalization accuracy. After establishing the indoor environment grid map, we let the robot run in the environment while recording the image information published by the camera and the measurement information published by the laser scanner. The laser-based pose estimates were then used as the ground truth of the pose for training the visual neural network.
After completing the training, the following analyzes the accuracy of vision relocalization and of the precise relocalization fusing vision and laser information. We let the robot run along the green trajectory shown in Figure 15d. The laser-based positioning was used as the reference for analyzing the relocalization accuracy of the visual neural network. The absolute pose errors based on posenetLSTM and the neural network proposed in this paper are shown in Figure 15a,b, respectively. The statistical results of the absolute pose error are shown in Figure 15c. Taking the RMSE statistics as an example, the absolute positioning accuracy of the proposed neural network improved by 27.84% compared with the original posenetLSTM.
The pose estimations in the x, y, and yaw directions are shown in Figure 16a,c,e, respectively, and the corresponding errors are shown in Figure 16b,d,f. The error statistics results are shown in Table 4. The positioning accuracy in the actual environment was higher than that in the simulation environment. The main reason was that the visual features in the actual environment were obviously richer than those in the simulation environment.

5.2.2. Test of Relocalization Integrating Visual and Laser Information

The relocation process of fusing vision and laser information is shown in Figure 17. Figure 17a is the robot in a certain test pose in the actual environment and Figure 17b is the image obtained by the camera. The coarse relocation result based on vision is shown in Figure 17c, and the accurate relocation result fused with laser information is shown in Figure 17d.
We tested 11 points in the environment, each with eight directions, as shown in Figure 18a. The search window of the algorithm was set to -2.0 m to +2.0 m in the translation direction and -0.785 rad to +0.785 rad in angle. The system successfully performed accurate relocalization in all test directions of all test points, and the search time from the coarse pose estimated by vision to the fine pose fused with laser information is shown in Figure 18b. The average times using posenetLSTM and the neural network proposed in this paper over all tests were 0.3261 s and 0.3602 s, respectively, a difference of 9.46%. The improvement in computational efficiency is not obvious because the main factor determining the system relocalization time is the search window size parameter of the algorithm, which was set to the same value for both methods. The visual convolutional neural network proposed in this paper has a higher visual relocalization accuracy, and the real-time performance of system relocalization can be greatly improved by reducing the search window in practical applications.
The test results demonstrate the system's excellent precise relocalization capability. For example, at test positions 1 and 9, which both face a door, the image information obtained by the camera was very similar, and position 9 also contained cluttered parts of the environment. The positions of the two doors and the cluttered part are marked as A, B, and C in Figure 18a. Nevertheless, the algorithm successfully completed precise relocalization, demonstrating excellent precise relocalization capability; the relocalization process is shown in Figure 19.

5.3. Dynamic Pose Tracking

The experiment in this section verifies the accuracy and the real-time performance of the system during dynamic pose tracking. The measurements provided by the Leica Geosystems Absolute Tracker (AT960) measuring system were used as the ground truth of the robot's pose. The two test trajectories of the robot on the prior grid map are shown as the green lines in Figure 20. The green dot is the starting point, the red dot is the end point, and the particles are distributed in the green box. In the experiment, in order to ensure that the robot could run safely without collision and that the test equipment could continuously track the robot, the speed of trajectory 1 was lower than that of trajectory 2.
The comparison between the two trajectories estimated using the proposed algorithm, the AMCL algorithm, and the EKF, and the measured trajectories, is shown in Figure 21a,b. The comparison of the estimation errors of the proposed algorithm and the AMCL algorithm in the x and the y directions is shown in Figure 21c-f. As can be seen from Figure 21c-f, both the proposed algorithm and the AMCL algorithm achieve high positioning accuracy. This also supports the use of laser-based positioning as the ground truth for training the visual neural network in the previous section. The statistics of their errors are shown in Table 5. It can be seen from the table that the accuracy of the algorithm proposed in this paper is approximately 10% higher than that of the AMCL algorithm, and the advantage is more obvious in trajectory 2. This is because the speed of trajectory 2 is greater than that of trajectory 1; when the speed is low, the AMCL algorithm has enough time to complete the convergence of the estimated pose. The estimation error here is smaller than that in the simulation environment, mainly because the robot runs at a lower speed than in the simulation so that it can operate safely and the measurement equipment can continuously track it.
In the process of dynamically tracking the robot’s pose, the comparison of the calculation time of the two algorithms is shown in Figure 22. The AMCL algorithm adaptively adjusts the number of sampled particles between 775 and 5000 during operation, which leads to large fluctuations in the calculation time. In this paper, the correlation sampling method combined with the nonlinear graph optimization method can ensure the high real-time performance of the system and it can achieve a high positioning accuracy with a low number of particles. The average calculation time of the two trajectories of the AMCL algorithm is 54.01 ms and 58.20 ms, respectively, while the average calculation time of the algorithm proposed in this paper is only 8.70 ms and 9.35 ms, which reduces the calculation time by 83.87% and 83.92%.

6. Discussion

Both simulations and experiments demonstrate the robustness, real-time performance, and high precision of the localization algorithm proposed in this paper. When collecting images to train the neural network, we not only collected observation images during the translational movement of the robot, but also added observation images of rotational movement at certain points. Compared with translational motion, rotational motion allows the camera to obtain richer information about the environment. In the experimental tests, it was found that, except for cases where the camera's field of view was severely obscured, the system could relocalize correctly in almost all possible positions and orientations of the robot. In the simulation environment, using nonlinear optimization methods is very beneficial to the precise relocalization of the system. However, when testing in an actual environment, it was found that at a very small number of test positions the pose score did not improve after applying the optimization method. The main reason may be that there are unstructured parts in the actual environment and that an approximate interpolation of the map occupancy probability is used in the optimization process. Therefore, this article combines the two methods to achieve very good results in precise relocalization. In the pose tracking stage, dynamic objects will also affect the tracking and positioning accuracy of the system, and in turn the navigation of the robot.

7. Conclusions

In this paper, based on the original posenetLSTM convolutional neural network, laser information is added as a geometric constraint and the rotation loss function is redesigned. Both simulation and experiment prove that the accuracy of visual relocalization can be greatly improved after this modification. Taking the visual pose estimate as a coarse estimate and using the correlation scanning method for particle sampling in combination with the branch and bound method to quickly search for the optimal solution, rapid, high-precision relocalization of the system is realized. The system can complete accurate relocalization within 0.4 s without manual assistance, which greatly enhances the robustness of the system in the case of an unknown initial pose or the loss of pose tracking. In the pose tracking stage, we use a method combining correlation scanning and nonlinear optimization; compared with the AMCL algorithm, it improves accuracy and greatly improves real-time performance. In the experiments, compared with the AMCL method, the localization accuracy of the system in the x and y directions can be improved by 10% and 12%, respectively, and the calculation time can be reduced by 83.87% and 83.92%. In the future, we will study how to remove the measurements of dynamic objects (for example, based on velocity estimates of the dynamic objects) to improve the positioning accuracy of the system. At the same time, the neural network used in this paper can only perform relocalization; next, we will combine information such as semantic segmentation and odometry to improve the positioning accuracy of the system and its ability to complete multiple tasks.

Author Contributions

Conceptualization, Y.L. and C.Z.; methodology, C.Z.; software, Y.L.; validation, Y.L., C.Z. and Y.W.; formal analysis, C.Z.; investigation, C.Z.; resources, Y.L.; data curation, C.Z.; writing—original draft preparation, C.Z.; writing—review and editing, Y.W.; visualization, C.Z.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by State Key Laboratory of Robotics and System, grant number SKLRS201813B.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data generated during the experiments are presented in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, M.; Wang, H.; Yang, Y.; Zhang, Y.; Ma, L.; Wang, N. RFID 3-D Indoor Localization for Tag and Tag-Free Target Based on Interference. IEEE Trans. Instrum. Meas. 2019, 68, 3718–3732. [Google Scholar] [CrossRef]
  2. Zhang, Y.; Tan, X.; Zhao, C. UWB/INS Integrated Pedestrian Positioning for Robust Indoor Environments. IEEE Sens. J. 2020, 20, 14401–14409. [Google Scholar] [CrossRef]
  3. Zhou, M.; Lin, Y.; Zhao, N.; Jiang, Q.; Yang, X.; Tian, Z. Indoor WLAN Intelligent Target Intrusion Sensing Using Ray-Aided Generative Adversarial Network. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 4, 61–73. [Google Scholar] [CrossRef]
  4. Zhang, J.; Singh, S. Laser-visual-inertial odometry and mapping with high robustness and low drift. J. Field Robot. 2018, 35, 1242–1264. [Google Scholar] [CrossRef]
  5. Wang, Z.; Zhang, J.; Chen, S.; Yuan, C.; Zhang, J.; Zhang, J. Robust High Accuracy Visual-Inertial-Laser SLAM System. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 6636–6641. [Google Scholar] [CrossRef]
  6. Debeunne, C.; Vivet, D. A review of visual-lidar fusion based simultaneous localization and mapping. Sensors 2020, 20, 2068. [Google Scholar] [CrossRef] [Green Version]
  7. Thrun, S.; Fox, D.; Burgard, W.; Dellaert, F. Robust Monte Carlo localization for mobile robots. Artif. Intell. 2001, 128, 99–141. [Google Scholar] [CrossRef] [Green Version]
  8. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  9. Wang, Y.; Zhang, W.; Li, F.; Shi, Y.; Chen, Z.; Nie, F.; Zhu, C.; Huang, Q. An improved Adaptive Monte Carlo Localization Algorithm Fused with Ultra Wideband Sensor. In Proceedings of the 2019 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO), Beijing Institute of Technology (BIT), Beijing, China, 31 October–2 November 2019. [Google Scholar]
  10. Yuan, R.; Zhang, F.; Fu, Y.; Wang, S. A robust iterative pose tracking method assisted by modified visual odometer. Sens. Rev. 2021, 41, 1–15. [Google Scholar] [CrossRef]
  11. Shi, Y.; Zhang, W.; Li, F.; Huang, Q. Robust localization system fusing vision and lidar under severe occlusion. IEEE Access 2020, 8, 62495–62504. [Google Scholar] [CrossRef]
  12. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef] [Green Version]
  13. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef] [Green Version]
  14. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2938–2946. [Google Scholar] [CrossRef] [Green Version]
  15. Walch, F.; Hazirbas, C.; Leal-Taixe, L.; Sattler, T.; Hilsenbeck, S.; Cremers, D. Image-Based Localization Using LSTMs for Structured Feature Correlation. In Proceedings of the IEEE ICCV, Venice, Italy, 22–29 October 2017; pp. 627–637. [Google Scholar] [CrossRef] [Green Version]
  16. Hanten, R.; Buck, S.; Otte, S.; Zell, A. Vector-AMCL: Vector based adaptive monte carlo localization for indoor maps. Adv. Intell. Syst. Comput. 2017, 531, 403–416. [Google Scholar] [CrossRef]
  17. Li, T.; Sun, S.; Sattar, T.P. Adapting sample size in particle filters through KLD-resampling. Electron. Lett. 2013, 49, 740–742. [Google Scholar] [CrossRef] [Green Version]
  18. Olson, E.B. Real-time correlative scan matching. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 4387–4393. [Google Scholar] [CrossRef] [Green Version]
  19. Hess, W.; Kohler, D.; Rapp, H.; Andor, D. Real-time loop closure in 2D LIDAR SLAM. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–20 May 2016; pp. 1271–1278. [Google Scholar] [CrossRef]
  20. Yuan, R.; Zhang, F.; Qu, J.; Li, G.; Fu, Y. An enhanced pose tracking method using progressive scan matching. Ind. Rob. 2019, 46, 235–246. [Google Scholar] [CrossRef]
  21. Peng, G.; Zheng, W.; Lu, Z.; Liao, J.; Hu, L.; Zhang, G.; He, D. An improved AMCL algorithm based on laser scanning match in a complex and unstructured environment. Complexity 2018, 2018, 1–11. [Google Scholar] [CrossRef]
  22. Tipaldi, G.D.; Spinello, L.; Burgard, W. Geometrical FLIRT phrases for large scale place recognition in 2D range data. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 2693–2698. [Google Scholar] [CrossRef]
  23. Park, S.; Roh, K.S. Coarse-to-Fine Localization for a Mobile Robot Based on Place Learning With a 2-D Range Scan. IEEE Trans. Robot. 2016, 32, 528–544. [Google Scholar] [CrossRef]
  24. Gálvez-López, D.; Tardós, J.D. Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
  25. Andoni, A.; Indyk, P.; Laarhoven, T.; Razenshteyn, I.; Schmidt, L. Practical and optimal LSH for angular distance. Adv. Neural Inf. Process. Syst. 2015, 28, 1225–1233. [Google Scholar]
  26. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
  27. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  28. Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef] [Green Version]
  30. Kendall, A.; Cipolla, R. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the IEEE ICRA, Stockholm, Sweden, 16–21 May 2016; pp. 4762–4769. [Google Scholar] [CrossRef] [Green Version]
  31. Valada, A.; Radwan, N.; Burgard, W. Deep Auxiliary Learning for Visual Localization and Odometry. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 6939–6946. [Google Scholar] [CrossRef] [Green Version]
  32. Radwan, N.; Valada, A.; Burgard, W. VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry. IEEE Robot. Autom. Lett. 2018, 3, 4407–4414. [Google Scholar] [CrossRef] [Green Version]
  33. Tian, M.; Nie, Q.; Shen, H. 3D Scene Geometry-Aware Constraint for Camera Localization with Deep Learning. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4211–4217. [Google Scholar] [CrossRef]
  34. Kendall, A.; Cipolla, R. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6555–6564. [Google Scholar] [CrossRef] [Green Version]
  35. Zuo, X.; Geneva, P.; Lee, W.; Liu, Y.; Huang, G. LIC-Fusion: LiDAR-Inertial-Camera Odometry. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 5848–5854. [Google Scholar] [CrossRef] [Green Version]
  36. Shao, W.; Vijayarangan, S.; Li, C.; Kantor, G. Stereo Visual Inertial LiDAR Simultaneous Localization and Mapping. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 4–8 November 2019; pp. 370–377. [Google Scholar] [CrossRef] [Green Version]
  37. Wolf, J.; Burgard, W.; Burkhardt, H. Robust vision-based localization by combining an image-retrieval system with Monte Carlo localization. IEEE Trans. Robot. 2005, 21, 208–216. [Google Scholar]
  38. Park, S.; Park, S.K. Vision-based global localization for mobile robots using an object and spatial layout-based hybrid map. In Proceedings of the 2009 International Conference on Advanced Robotics, Munich, Germany, 22–26 June 2009; pp. 2111–2116. [Google Scholar]
  39. Xu, S.; Chou, W.; Dong, H. A robust indoor localization system integrating visual localization aided by CNN-based image retrieval with Monte Carlo localization. Sensors 2019, 19, 249. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  40. Agarwal, S.; Mierle, K. Ceres Solver. Available online: http://ceres-solver.org (accessed on 3 February 2022).
  41. Meeussen, W. robot_pose_ekf. Available online: http://wiki.ros.org/robot_pose_ekf (accessed on 3 February 2022).
Figure 1. System framework.
Figure 2. Convolutional neural network structure.
Figure 3. Grid map preprocessing. (a) h = 0; (b) h = 2; (c) h = 4; (d) original map.
Figure 4. Correlative particle sampling.
Figure 5. Bilinear interpolation.
Figure 6. (a) Obtained image information; (b) Laser measurement information; (c) The robot's trajectory; (d) Simulation environment in Gazebo.
Figure 7. Visual pose tracking results based on the neural network. (a,c,e) The comparison of pose estimation results in the x, y, and yaw directions, respectively; (b,d,f) The comparison of pose estimation errors in the x, y, and yaw directions, respectively.
Figure 8. Recorded relocalization results. (a) The posenetLSTM method; (b) The method proposed in this paper.
Figure 9. The change of pose from coarse to fine in the process of relocalization that combines laser and visual information. (a,c,e) The pose change in the x, y, and yaw directions, respectively, when using the posenetLSTM method; (b,d,f) The pose change in the x, y, and yaw directions, respectively, when using the method proposed in this paper.
Figure 10. Search time from coarse to fine pose.
Figure 11. The pose estimation result of case 1. (a) The estimated error in the x direction; (b) The estimated error in the y direction; (c) The estimated error in the yaw direction; (d) True trajectory and estimated trajectory.
Figure 12. The pose estimation result of case 2. (a) The estimated error in the x direction; (b) The estimated error in the y direction; (c) The estimated error in the yaw direction; (d) True trajectory and estimated trajectory.
Figure 13. The comparison of time–cost. (a) Case 1; (b) Case 2.
Figure 14. Experiment setup. (a) The experimental platform; (b) The measuring equipment.
Figure 15. (a) Absolute pose error of posenetLSTM; (b) Absolute pose error of the proposed neural network; (c) Statistics of absolute pose error; (d) The robot's trajectory.
Figure 16. (a,c,e) The comparison of pose estimation in the x, y, and yaw directions, respectively; (b,d,f) The comparison of pose estimation errors in the x, y, and yaw directions, respectively.
Figure 17. (a) Robot at the test position; (b) Image acquired by the camera; (c) The coarse relocalization result based on vision; (d) The accurate relocalization result fused with laser information.
Figure 18. (a) Locations of the 11 test points; (b) Search time from coarse to fine pose.
Figure 19. (a–c) The image obtained by the camera, the coarse visual positioning result, and the accurate relocalization result fused with laser information at test point 1, respectively; (d–f) The corresponding results at test point 9.
Figure 20. The robot's two test trajectories.
Figure 21. Comparison of the pose estimation and the corresponding errors for the two test trajectories. (a,b) Comparison of predicted trajectories; (c,d) Position deviation in the x direction; (e,f) Position deviation in the y direction.
Figure 22. Comparison of the calculation time for the two test trajectories. (a) Trajectory 1; (b) Trajectory 2.
Table 1. The metric absolute pose error when the robot is in motion.
Pose            | posenetLSTM (RMSE / MAE / S.D.)  | The System (RMSE / MAE / S.D.)   | Improvement (RMSE / MAE / S.D.)
pose_x (m)      | 0.5938 / 0.4642 / 0.5939         | 0.4291 / 0.3191 / 0.4291         | 27.74% / 31.26% / 27.75%
pose_y (m)      | 0.5595 / 0.4094 / 0.5586         | 0.3187 / 0.2345 / 0.3148         | 43.04% / 42.72% / 43.64%
pose_yaw (rad)  | 0.1840 / 0.1180 / 0.1831         | 0.09518 / 0.06790 / 0.09197      | 48.27% / 42.45% / 49.77%
Table 2. The results of the pose estimation errors in case 1 and case 2, respectively.
Case    | Pose           | AMCL (RMSE / MAE / S.D.)   | The System (RMSE / MAE / S.D.)  | Improvement (RMSE / MAE / S.D.)
case 1  | pose_x (m)     | 0.0400 / 0.0319 / 0.0350   | 0.0361 / 0.0295 / 0.0321        | 9.75% / 7.52% / 8.28%
case 1  | pose_y (m)     | 0.0327 / 0.0266 / 0.0272   | 0.0278 / 0.0220 / 0.0238        | 14.98% / 17.29% / 12.5%
case 1  | pose_yaw (rad) | 0.0186 / 0.0119 / 0.0186   | 0.0170 / 0.0108 / 0.0170        | 8.6% / 9.24% / 8.6%
case 2  | pose_x (m)     | 0.0415 / 0.0322 / 0.0411   | 0.0355 / 0.0289 / 0.0310        | 14.5% / 10.2% / 24.5%
case 2  | pose_y (m)     | 0.0381 / 0.0301 / 0.0329   | 0.0269 / 0.0218 / 0.0225        | 29.4% / 27.6% / 31.6%
case 2  | pose_yaw (rad) | 0.0195 / 0.0128 / 0.0193   | 0.0157 / 0.0097 / 0.0156        | 19.5% / 24.2% / 19.2%
Table 3. The comparison of mean time–cost in case 1 and case 2, respectively.
Case    | AMCL Mean Time–Cost (ms) | The System Mean Time–Cost (ms) | Time–Cost Reduction
case 1  | 10.442                   | 7.258                          | 30.49%
case 2  | 10.883                   | 7.286                          | 33.05%
Table 4. The results of the pose estimation errors in the x, y, and yaw directions, respectively.
Pose            | posenetLSTM (RMSE / MAE / S.D.)  | The System (RMSE / MAE / S.D.)   | Improvement (RMSE / MAE / S.D.)
pose_x (m)      | 0.2738 / 0.1769 / 0.2686         | 0.1870 / 0.1370 / 0.1867         | 31.70% / 22.56% / 30.49%
pose_y (m)      | 0.2758 / 0.2070 / 0.2751         | 0.2089 / 0.1359 / 0.2078         | 24.26% / 34.34% / 24.46%
pose_yaw (rad)  | 0.0799 / 0.0545 / 0.0798         | 0.0584 / 0.0424 / 0.0571         | 26.91% / 22.20% / 28.45%
Table 5. The results of the pose estimation errors in the x and the y directions, respectively.
Trajectory | Pose       | AMCL (RMSE / MAE)  | The System (RMSE / MAE)  | Improvement (RMSE / MAE)
1          | pose_x (m) | 0.0310 / 0.0293    | 0.0304 / 0.0290          | 1.94% / 7.52%
1          | pose_y (m) | 0.0192 / 0.0175    | 0.0185 / 0.0157          | 3.65% / 10.28%
2          | pose_x (m) | 0.0240 / 0.0225    | 0.0214 / 0.0204          | 10.8% / 9.33%
2          | pose_y (m) | 0.0237 / 0.0230    | 0.0207 / 0.0192          | 12.65% / 16.52%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
