Article

YKP-SLAM: A Visual SLAM Based on Static Probability Update Strategy for Dynamic Environments

Lisang Liu, Jiangfeng Guo and Rongsheng Zhang
1 School of Electronic, Electrical Engineering and Physics, Fujian University of Technology, Fuzhou 350118, China
2 National Demonstration Center for Experimental Electronic Information and Electrical Technology Education, Fujian University of Technology, Fuzhou 350118, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(18), 2872; https://doi.org/10.3390/electronics11182872
Submission received: 24 June 2022 / Revised: 24 August 2022 / Accepted: 5 September 2022 / Published: 11 September 2022
(This article belongs to the Special Issue Advances in Image Enhancement)

Abstract

Visual simultaneous localization and mapping (SLAM) algorithms in dynamic scenes can incorrectly add moving feature points to the camera pose calculation, which leads to low accuracy and poor robustness of pose estimation. In this paper, we propose a visual SLAM algorithm for dynamic scenes based on object detection and a static probability update strategy, named YKP-SLAM. Firstly, we use the YOLOv5 target detection algorithm and an improved K-means clustering algorithm to segment the image into static regions, suspicious static regions, and dynamic regions. Secondly, the static probability of the feature points in each region is initialized and used as a weight to solve for the initial camera pose. Then, we use motion constraints and epipolar constraints to update the static probabilities of the feature points and solve the final camera pose. Finally, the algorithm is tested on the TUM RGB-D dataset. The results show that the proposed YKP-SLAM algorithm effectively improves pose estimation accuracy. Compared with the ORB-SLAM2 algorithm, the absolute pose estimation accuracy is improved by 56.07% and 96.45% in low dynamic scenes and high dynamic scenes, respectively, and it achieves nearly the best results compared with other state-of-the-art dynamic SLAM algorithms.

1. Introduction

Simultaneous localization and mapping (SLAM) estimates the camera pose and builds a map of the environment simultaneously, using sensor data collected by a robot during motion. After decades of development, some very mature SLAM algorithms have emerged, such as PTAM [1], LSD-SLAM [2], DSO [3], ORB-SLAM2 [4], and VINS-Mono [5], which are essentially all based on the assumption of a static environment. However, in practical robotic applications, dynamic scenes are more common than static ones, and most application scenes contain dynamic objects, e.g., pedestrians, vehicles, and animals. Dynamic objects introduce anomalous “outliers” that disrupt the normal correspondence between image features, resulting in significant drift of the camera pose. Some optimization algorithms, such as random sample consensus (RANSAC) [6] and graph optimization, can filter out a small number of weakly dynamic features in the environment as outliers. These methods achieve acceptable results for low-speed motion with a small number of outliers. However, they cannot handle dynamic features well in high-speed, complex motion scenes, and the visual SLAM system may fail to track and localize. Therefore, it is particularly important to study SLAM algorithms for dynamic environments.
To solve the visual SLAM problem in a dynamic environment, the traditional approach is to eliminate dynamic objects through geometric constraints and to set a threshold on the reprojection error to distinguish static objects from dynamic ones. However, this approach has two problems. (1) It cannot distinguish the residuals caused by moving objects from those caused by mismatching. (2) The segmentation threshold is difficult to set: if it is too large, static features are wrongly rejected, and if it is too small, the dynamic features in the environment cannot be completely rejected. Therefore, the approach is better suited to low dynamic environments; in high dynamic environments, the accuracy of dynamic feature detection is low and the accuracy of pose estimation is poor.
In recent years, with the development of computer vision and deep learning, semantic constraints have been widely applied to visual SLAM in dynamic environments. The semantic constraint approach mainly applies semantic segmentation and target detection to obtain semantic information about the environment. By identifying and removing potential dynamic objects, the performance of visual SLAM in dynamic scenes can be greatly improved. Semantic segmentation algorithms can provide fine pixel-level object masks, but their real-time performance is poor; improvements in segmentation accuracy and robustness often come at a huge computational cost, and even then, object boundaries cannot be segmented very accurately. Target detection algorithms can quickly obtain the bounding box of an object at low computational cost, but they cannot obtain accurate object boundaries, and if the features inside a dynamic object box are removed directly, some static features are falsely removed. Moreover, there are three problems with semantic constraints. (1) An object with a dynamic semantic prior may actually be stationary, but the algorithm cannot judge whether it is really moving, which may lead to the false removal of some static features. (2) It can only handle known objects labeled in the training set of the network and may still fail for unknown moving objects, which leads to the missed detection of some dynamic features. (3) All features judged dynamic by the semantic information are deleted and excluded from the pose calculation, even though dynamic features can still provide weak constraints; deleting them directly reduces the number of constraints and therefore the accuracy of pose estimation.
To address the above problems and improve the pose estimation accuracy and robustness of the SLAM system in dynamic environments, this paper proposes the YKP-SLAM algorithm. On the basis of ORB-SLAM2, YKP-SLAM adds three major components: YOLOv5 target detection, improved K-means clustering, and a probability update strategy. Our experiments show that the YKP-SLAM algorithm can effectively reduce the tracking error and improve the accuracy and robustness of the SLAM system, in both slow-moving and fast-moving dynamic environments.
The main contributions of this paper are as follows:
(1) We incorporate the lightweight YOLOv5 object detection algorithm into the SLAM system, which quickly provides accurate semantic priors for the subsequent operations.
(2) A K-means clustering algorithm specifically for depth images is proposed, which can select the number of clusters adaptively and can segment dynamic object contours from dynamic object frames quickly and accurately.
(3) A method for initializing the static probability is proposed. The image is divided into three regions by combining YOLOv5 and the improved K-means clustering. The feature points in each region are then assigned initial static probabilities, which are used to solve the initial pose. This provides a more accurate initial pose for the subsequent motion constraints and epipolar constraints.
(4) A probability update strategy based on motion constraints and epipolar constraints is proposed. Probability updates are performed for all feature points in the image. Then, all feature points are added to the pose calculation to solve the final pose.

2. Related Work

2.1. Dynamic SLAM Based on Traditional Method

Traditional dynamic SLAM algorithms are mainly based on geometric constraints to filter out dynamic feature points in the environment. For example, Zou [7] et al. project feature points from the previous frame onto the current frame, calculate the 2D reprojection error of the matching points, and classify feature points as static or dynamic according to the magnitude of the reprojection error. Wang [8] et al. detect mismatched outlier points in two adjacent frames using the epipolar constraint and then fuse the clustering information of the depth map provided by the RGB-D camera to identify moving targets in the scene. Dai [9] et al. proposed a static object geometry prior within a feature-based SLAM framework. The algorithm uses the connectivity of map points to separate moving objects from the static background, thus reducing the impact of moving objects on pose estimation.
In addition to geometric constraints, optical flow methods are also used to distinguish dynamic and static features. For example, Klappstein [10] et al. defined the likelihood of “moving objects in the scene” based on a motion metric calculated from optical flow. Fang [11] et al. improved the optical flow method to detect dynamic targets based on point matching techniques and uniform sampling strategies and introduced a Kalman filter to enhance detection and tracking. FlowFusion [12] estimates the optical flow of two adjacent frames through a PWC-Net [13] network while estimating the camera pose from the intensity and depth of the two frames; the estimated optical flow and camera motion are then used to compute the 2D scene flow, which is finally used for dynamic feature segmentation.

2.2. Dynamic SLAM Based on Semantic Constraints

In recent years, deep-learning-based image semantic segmentation and target recognition have been widely used, and detection methods have improved greatly in both efficiency and accuracy. Many researchers have tried to solve the dynamic SLAM problem by removing potential dynamic objects through semantic segmentation or target detection preprocessing. For example, Yang [14] et al. used the target detection network Faster R-CNN [15] to detect dynamic objects and then performed geometric matching between the current frame and keyframes to determine whether they are dynamic. Yu [16] et al. proposed the DS-SLAM algorithm, which combines a semantic segmentation network and an optical flow method to reduce the influence of dynamic objects and provide a semantic octree map representation. DynaSLAM, proposed by Bescos [17] et al., uses a combination of multi-view geometry and Mask R-CNN [18] to detect and filter dynamic targets. Zhang [19] et al. used the target detection network YOLOv3 [20] to filter dynamic feature points in the scene, which effectively reduced the trajectory error of the SLAM system. Zhong [21] et al. proposed Detect-SLAM, which uses the target detection network SSD [22] to identify dynamic targets, such as pedestrians and vehicles, as a priori dynamic targets and then filters the feature points on those targets to improve localization accuracy. Blitz-SLAM [23] obtains object masks with BlitzNet [24], completes the masks with depth information, and finally separates static and dynamic feature points through epipolar constraints.

3. Materials and Methods

3.1. System Architecture

The algorithm framework of YKP-SLAM is shown in Figure 1. Based on ORB-SLAM2, we add the YOLOv5 target detection algorithm and the improved K-means clustering algorithm to the front end and add a complete probability update strategy to the back-end pose calculation. The algorithmic flow of YKP-SLAM can be described as follows. Firstly, the RGB image is processed by the YOLOv5 target detection algorithm to obtain the dynamic object boxes, and at the same time, ORB [25] feature points are extracted from the RGB image. Secondly, the depth values of the pixels inside the dynamic object boxes are clustered by the improved K-means clustering algorithm using the depth image. The results of YOLOv5 target detection and K-means clustering are used to segment the image into static regions, suspicious static regions, and dynamic regions; the static probability of the feature points in each region is initialized and added as a weight to the camera pose estimation to calculate the initial camera pose T_{cw1}. Finally, the static probabilities of the feature points are updated by the motion constraint and the epipolar constraint, and the second-stage pose T_{cw2} and the final pose T_{cw} of the camera are solved, respectively.
We also considered failure of the YOLOv5 algorithm. When YOLOv5 fails, the dynamic object boxes cannot be obtained. In that case, we match the feature points of the current frame against the dynamic feature points of the previous frame: successfully matched feature points in the current frame are marked as dynamic feature points, and the remaining feature points are marked as static feature points. The only difference from normal operation is that the feature points are divided into three categories during normal operation, whereas they are divided into two categories when YOLOv5 fails. The subsequent static probability initialization and probability update strategy are the same. The feature point classification process when YOLOv5 fails is shown in the purple dashed box in Figure 1.

3.2. YOLOv5 Target Detection

You Only Look Once (YOLO) is a regression-based target detection algorithm and the pioneering work of the one-stage approach; it is one of the most widely used target detection algorithms. It treats target detection as a regression problem and directly obtains the bounding box position and classification of the predicted objects from an input image, achieving very good speed while maintaining accuracy. YOLOv5, released by Ultralytics on 10 June 2020, provides four network models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The network structure of the four models is the same; the difference is that the depth_multiple and width_multiple parameters control the depth of the model and the number of convolution kernels, respectively. Among them, YOLOv5s has the smallest network depth and the smallest feature map width. It occupies only 7.5 MB of memory, and its detection speed on a Tesla P100 reaches 140 FPS, which fully meets the real-time requirement. The other three models are deepened and widened on this basis, with improved accuracy but lower speed.
To meet the real-time requirement of the SLAM system, the fastest model, YOLOv5s, is adopted and embedded in the front end of the SLAM system to perform target detection on each RGB image passed by the camera and obtain the bounding box position and category of each object. Among the bounding boxes, those containing people and animals are taken as dynamic object boxes D_B. The target detection results of YOLOv5s are shown in Figure 2. The yellow boxes in Figure 2 are the dynamic object boxes. It can be seen from the figure that whether a person is facing the camera, seen from the side or back, or only partially visible, YOLOv5 frames them accurately.
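As an illustration of this step, the following sketch loads a pretrained YOLOv5s model through the public PyTorch Hub interface and keeps detections of people and animals as dynamic object boxes D_B. The confidence threshold and the exact class list are our own assumptions for illustration (the paper only states that people and animals are treated as dynamic), and the snippet is not the authors' front-end integration.

```python
import torch

# Load the pretrained YOLOv5s model through the public PyTorch Hub interface.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Classes treated as dynamic; the list and threshold below are illustrative assumptions.
DYNAMIC_CLASSES = {'person', 'cat', 'dog', 'horse'}

def dynamic_boxes(rgb_image, conf_thresh=0.5):
    """Return [x1, y1, x2, y2] dynamic object boxes D_B for one RGB image."""
    results = model(rgb_image)                    # single-image inference
    det = results.xyxy[0].cpu().numpy()           # columns: x1, y1, x2, y2, conf, class id
    boxes = []
    for x1, y1, x2, y2, conf, cls in det:
        if conf >= conf_thresh and results.names[int(cls)] in DYNAMIC_CLASSES:
            boxes.append((int(x1), int(y1), int(x2), int(y2)))
    return boxes
```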

3.3. Improved Adaptive K-means Clustering Algorithm

Although the YOLOv5 target detection algorithm can quickly and accurately locate the bounding boxes of dynamic objects, it cannot obtain an accurate dynamic object mask. Therefore, this paper proposes an adaptive K-means clustering segmentation algorithm based on depth images, which can segment dynamic objects from the dynamic object box D_B quickly and accurately.
The K-means algorithm is an unsupervised clustering algorithm, which is easy to implement and runs fast. However, the traditional K-means clustering algorithm pre-specifies the number of clusters and randomly initializes the cluster centers according to experience, which is likely to cause too many iterations of the algorithm or misclassification. Since the number of clusters is artificially set in advance, the direct application of the traditional K-means clustering algorithm to depth image clustering will have the following two problems:
(1) If the number of clusters is set too large, a complete dynamic object may be divided into multiple categories, causing incomplete segmentation of the dynamic object.
(2) If the number of clusters is set too small, the dynamic objects cannot be separated from the static background.
In order to solve the above problems, an improved adaptive K-means algorithm is proposed in this paper. The algorithm can automatically generate the optimal number of clusters and the initial cluster centers, so that dynamic objects can be segmented from the static background more quickly and accurately. The steps of the improved K-means algorithm are as follows:
(1) Take the depth image I_DB^i inside the dynamic object box D_B and count the total number of pixels M and the maximum pixel depth D_max in I_DB^i.
(2) Compute the histogram of the depth image I_DB^i and divide the histogram data into k segments:

k = D_{\max} / T    (1)

where T is the segmentation threshold, whose size determines the number of clusters. Since the depth of a dynamic object does not change much between two adjacent frames, we first use the mean depth D_p of the dynamic feature points in the previous frame as a prior for the depth of the dynamic object in the current frame. Then, the ratio λ of the number of pixels belonging to the dynamic object in the previous frame to the number of pixels in the dynamic object box is calculated. Finally, we find the neighborhood U(D_p, δ) = { x | D_p − δ < x < D_p + δ } of the point D_p in the histogram of the depth image I_DB^i such that the number of pixels in the neighborhood equals λM; the segmentation threshold T is then the width of this neighborhood:

T = 2\delta    (2)

(3) We take k as the number of clusters for the subsequent K-means clustering and take the maximum depth value of each histogram segment as the initial cluster center of the corresponding category.
(4) The K-means clustering algorithm produces a depth image segmentation map based on the number of clusters from step (2) and the initial cluster centers from step (3).
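To make the steps above concrete, the following Python sketch derives the cluster count and initial centers from the depth histogram and then runs a plain 1-D K-means over the depth values. It is an illustrative reconstruction under our own assumptions (for example, the neighborhood growth step delta_step and the handling of empty histogram segments are not specified in the paper), not the authors' implementation.

```python
import numpy as np

def adaptive_kmeans_depth(depth_roi, prev_dyn_depth_mean, prev_ratio,
                          delta_step=0.01, max_iters=50):
    """Illustrative sketch of the adaptive K-means of Section 3.3.

    depth_roi           : depth values (metres) inside the dynamic object box D_B
    prev_dyn_depth_mean : mean depth D_p of dynamic feature points in the previous frame
    prev_ratio          : ratio lambda of dynamic-object pixels to box pixels in the previous frame
    """
    d = depth_roi[depth_roi > 0].ravel()          # valid depths only
    M, d_max = d.size, d.max()

    # Step (2): grow the neighbourhood U(D_p, delta) until it holds lambda*M pixels,
    # then use its width as the segmentation threshold T = 2*delta and k = D_max / T.
    delta = delta_step
    while np.sum(np.abs(d - prev_dyn_depth_mean) < delta) < prev_ratio * M and delta < d_max:
        delta += delta_step
    T = 2 * delta
    k = max(2, int(np.ceil(d_max / T)))

    # Step (3): initial centre = maximum depth inside each histogram segment.
    edges = np.linspace(0, d_max, k + 1)
    centers = np.array([d[(d > lo) & (d <= hi)].max() if np.any((d > lo) & (d <= hi)) else hi
                        for lo, hi in zip(edges[:-1], edges[1:])])

    # Step (4): plain 1-D K-means on the depth values.
    for _ in range(max_iters):
        labels = np.argmin(np.abs(d[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([d[labels == j].mean() if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(d[:, None] - centers[None, :]), axis=1)

    # The cluster whose mean depth is closest to D_p is later marked as the dynamic region.
    dynamic_cluster = int(np.argmin(np.abs(centers - prev_dyn_depth_mean)))
    return labels, centers, dynamic_cluster
```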
Since the depth values of dynamic objects do not change too much between two adjacent frames, the mean depth D_p of the dynamic features in the previous frame is used as a criterion. The mean pixel depth of each cluster in the dynamic object box is computed, and the cluster whose mean depth is closest to the mean depth of the dynamic points in the previous frame is marked as the dynamic region; the other clusters in the dynamic object box are marked as suspicious static regions, and the regions outside the dynamic object box are marked as static regions. The whole dynamic region classification process is shown in Figure 3.
The results of K-means clustering are shown in Figure 4. From the figure, we can see that the improved K-means clustering algorithm proposed in this paper can segment people from the background completely and does not lead to mis-segmentation.

3.4. Initialize the Static Probability and Calculate the Initial Camera Pose

In this paper, the YOLOv5 target detection algorithm and the improved adaptive K-means clustering algorithm are used to segment the image into dynamic regions, suspicious static regions, and static regions. In order to obtain a more accurate initial pose, the feature points in different regions are assigned static probability initial values of
\text{Static probability} = \begin{cases} \omega_a = 0, & \text{dynamic region} \\ \omega_b = 0.5, & \text{suspicious static region} \\ \omega_c = 1, & \text{static region} \end{cases}    (3)
These initial static probabilities are then used as weights in the pose calculation, and the initial pose T_{cw1} of the current frame is calculated by minimizing the weighted reprojection error.
The structure of the camera pose T_{cw1} is

\mathrm{SE}(3) = \left\{ T_{cw1} = \begin{bmatrix} R_{cw1} & t_{cw1} \\ \mathbf{0}^{T} & 1 \end{bmatrix} \in \mathbb{R}^{4 \times 4} \;\middle|\; R_{cw1} \in \mathrm{SO}(3),\ t_{cw1} \in \mathbb{R}^{3} \right\}    (4)

where R_{cw1} is the rotation matrix and t_{cw1} is the translation vector.
T_{cw1} can be solved by Equation (5):

T_{cw1} = \arg\min \left( \sum_{a=1}^{N_a} \left\| K T_{cw1} x_a - p_a \right\|_{\Sigma_1}^{2} + \sum_{b=1}^{N_b} \left\| K T_{cw1} x_b - p_b \right\|_{\Sigma_2}^{2} + \sum_{c=1}^{N_c} \left\| K T_{cw1} x_c - p_c \right\|_{\Sigma_3}^{2} \right)    (5)

with

\Sigma_1 = \omega_a \times n \times E, \quad \Sigma_2 = \omega_b \times n \times E, \quad \Sigma_3 = \omega_c \times n \times E    (6)

where p_a, p_b, and p_c are the 2D pixel coordinates of the dynamic, suspicious static, and static feature points in the current frame, respectively, and x_a, x_b, and x_c are the coordinates of their matched 3D map points. Σ_1, Σ_2, and Σ_3 are the information matrices of the feature points in each region, n is the image pyramid level of the current feature point, and E is the 3 × 3 identity matrix. N_a, N_b, and N_c are the numbers of dynamic, suspicious static, and static feature points in the current frame, respectively.
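The following sketch evaluates the weighted reprojection cost of Equation (5) for a candidate pose. It is illustrative only: the weights enter as scalar information values ω·n rather than explicit 3 × 3 matrices, and in the actual system the minimization over T_{cw1} is carried out by the graph optimizer inherited from ORB-SLAM2 rather than by this function.

```python
import numpy as np

def weighted_reprojection_cost(T_cw, K, pts3d_world, pts2d, static_prob, pyramid_level):
    """Weighted reprojection cost of Eq. (5) for one candidate pose (illustrative sketch).

    T_cw         : 4x4 camera pose (world -> camera)
    K            : 3x3 camera intrinsics
    pts3d_world  : (N, 3) matched map points x_i
    pts2d        : (N, 2) observed pixel coordinates p_i
    static_prob  : (N,)  static probabilities omega_i used as weights
    pyramid_level: (N,)  pyramid level n of each feature, as in Eq. (6)
    """
    P = np.hstack([pts3d_world, np.ones((len(pts3d_world), 1))])   # homogeneous 3D points
    Pc = (T_cw @ P.T).T[:, :3]                                     # points in the camera frame
    uv = (K @ Pc.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                    # projected pixel coordinates
    residual = np.sum((uv - pts2d) ** 2, axis=1)                   # squared reprojection error
    info = static_prob * pyramid_level                             # scalar stand-in for Sigma = omega * n * E
    return np.sum(info * residual)
```

With this weighting, dynamic points (ω = 0) contribute nothing to the cost, while suspicious static points contribute with reduced influence.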

3.5. Probability Update Based on Motion Constraints

The traditional geometric method distinguishes dynamic points from static points by the size of the reprojection error: it sets a threshold and judges points with a reprojection error larger than the threshold as dynamic and points with a smaller error as static. This threshold is difficult to set and can easily lead to mis-segmentation of dynamic and static points. Therefore, this paper proposes a new method that uses the motion distance of the a priori dynamic points p_a identified by the front end of the SLAM system (YOLOv5 and K-means) as a scale to update the static probabilities of the suspicious static points p_b and static points p_c. A schematic diagram of the motion constraint is shown in Figure 5.
Given the initial pose T_{cw1} of the current frame, the camera intrinsic matrix K, and the depth Z of each feature point obtained directly from the depth camera, we first back-project the dynamic point p_a of the current frame into the world coordinate system to obtain the 3D point P_a:

P_a = \begin{bmatrix} X'_a & Y'_a & Z'_a \end{bmatrix}^{T} = T_{wc1} \, Z K^{-1} p_a    (7)

Then, the squared motion distance L_a between the back-projected point P_a and its corresponding map point x_a is calculated:

L_a = (X'_a - X_a)^2 + (Y'_a - Y_a)^2 + (Z'_a - Z_a)^2    (8)

where [X_a \ Y_a \ Z_a]^{T} are the 3D coordinates of the map point x_a.
Similarly, the squared motion distances of the suspicious static points p_b and the static points p_c are solved as L_b and L_c, respectively.
Then, the mean μ_L and variance S_L of the squared motion distances of the dynamic points p_a in the current frame are computed:

\mu_L = \frac{1}{N_a} \sum_{a=1}^{N_a} L_a    (9)

S_L = \frac{1}{N_a} \sum_{a=1}^{N_a} (L_a - \mu_L)^2    (10)
The static probabilities of the suspicious static points p_b and static points p_c are updated by comparing their motion distances with those of the dynamic points p_a. To this end, a sigmoid function is designed:

\omega_{b1} = \frac{1}{1 + \exp\left( \alpha \, \frac{L_b - \mu_L}{S_L} \right)}    (11)

\omega_{c1} = \frac{1}{1 + \exp\left( \alpha \, \frac{L_c - \mu_L}{S_L} \right)}    (12)

where α is a coefficient greater than 0.
Combining the initial static probabilities, the static probability of each feature point in each region is updated:

\omega'_a = \omega_a, \quad \omega'_b = \omega_b \times \omega_{b1}, \quad \omega'_c = \omega_c \times \omega_{c1}    (13)
The updated static probabilities of the feature points are then substituted into Equation (5) to calculate the second-stage camera pose T_{cw2}.
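A minimal sketch of the update in Equations (9)-(13) is given below. The value of alpha is a tuning assumption on our part, and the arrays of squared motion distances are assumed to have been computed as in Equations (7) and (8).

```python
import numpy as np

def motion_constraint_update(L_dyn, L_sus, L_stat, w_sus, w_stat, alpha=1.0):
    """Static-probability update of Eqs. (9)-(13) (illustrative sketch).

    L_dyn, L_sus, L_stat : squared motion distances of dynamic / suspicious / static points
    w_sus, w_stat        : current static probabilities of suspicious and static points
    """
    mu_L = np.mean(L_dyn)                          # Eq. (9): mean over the prior dynamic points
    S_L = np.mean((L_dyn - mu_L) ** 2)             # Eq. (10): variance of the dynamic distances

    # Eqs. (11)-(12): points that move as far as the dynamic points get a low static probability.
    w_sus_1 = 1.0 / (1.0 + np.exp(alpha * (L_sus - mu_L) / S_L))
    w_stat_1 = 1.0 / (1.0 + np.exp(alpha * (L_stat - mu_L) / S_L))

    # Eq. (13): multiply into the current probabilities (dynamic points stay at 0).
    return w_sus * w_sus_1, w_stat * w_stat_1
```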

3.6. Probability Update Based on Epipolar Constraint

As shown in Figure 6, O_1 and O_2 are the camera optical centers at the moments of the current frame and the reference frame, respectively, and p_1 and p_2 are a pair of matching points between the two frames. x is the map point corresponding to the point p_1 in the reference frame; if the point is stationary, its projection into the current frame lies on the epipolar line l_2, and if it is moving, the projection does not lie on the epipolar line. In this paper, the static probability of a feature point is updated based on the distance from the point p_2 to the epipolar line l_2.
From the current-frame camera pose T_{cw2} solved in the second stage and the reference-frame camera pose T_{cwr}, the rotation matrix R_{2r} and translation vector t_{2r} between the two frames can be solved:

R_{2r} = R_{cw2} \times R_{cwr}^{-1}    (14)

t_{2r} = -R_{cw2} \times R_{cwr}^{-1} \times t_{cwr} + t_{cw2}    (15)

where R_{cw2} and t_{cw2} are the rotation matrix and translation vector of the current frame, respectively, and R_{cwr} and t_{cwr} are the rotation matrix and translation vector of the reference frame.
The fundamental matrix F is

F = K^{-T} \, (t_{2r})^{\wedge} \, R_{2r} \, K^{-1}    (16)

where (t_{2r})^{\wedge} denotes the skew-symmetric matrix of t_{2r}.
According to the fundamental matrix, the epipolar line in the current frame corresponding to a feature point in the reference frame is expressed as

\begin{bmatrix} A & B & C \end{bmatrix}^{T} = F \begin{bmatrix} u_1 & v_1 & 1 \end{bmatrix}^{T}    (17)

where [u_1 \ v_1 \ 1]^{T} is the homogeneous coordinate of the reference-frame feature point p_1.
The squared distance from a feature point of the current frame to its corresponding epipolar line is

H = \frac{(A u_2 + B v_2 + C)^2}{A^2 + B^2}    (18)

where [u_2 \ v_2 \ 1]^{T} is the homogeneous coordinate of the current-frame feature point p_2.
According to Equations (16)-(18), the epipolar distances H_a, H_b, and H_c of the dynamic points, suspicious static points, and static points of the current frame can be calculated, respectively.
As with the motion constraints, the mean μ_H and variance S_H of the epipolar distances of the dynamic points are calculated:

\mu_H = \frac{1}{N_a} \sum_{a=1}^{N_a} H_a    (19)

S_H = \frac{1}{N_a} \sum_{a=1}^{N_a} (H_a - \mu_H)^2    (20)
The static probabilities of the suspicious static points p_b and static points p_c are updated by comparing their epipolar distances with those of the dynamic points p_a:

\omega_{b2} = \frac{1}{1 + \exp\left( \beta \, \frac{H_b - \mu_H}{S_H} \right)}    (21)

\omega_{c2} = \frac{1}{1 + \exp\left( \beta \, \frac{H_c - \mu_H}{S_H} \right)}    (22)

where β is a coefficient greater than 0.
The final static probability of the feature points in each region is updated using the static probabilities from the epipolar constraint:

\omega''_a = \omega'_a, \quad \omega''_b = \omega'_b \times \omega_{b2}, \quad \omega''_c = \omega'_c \times \omega_{c2}    (23)
The final camera pose T_{cw} can then be calculated from the final static probabilities of the feature points and Equation (5).
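For illustration, the sketch below computes the fundamental matrix of Equation (16) and the squared point-to-epipolar-line distances of Equation (18) for a set of matched points. It assumes the two poses are given as rotation matrices and translation vectors; the same sigmoid update used for the motion constraint is then applied to the resulting distances.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix (t)^ of a 3-vector t."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def epipolar_distances(K, R_cw2, t_cw2, R_cwr, t_cwr, pts_ref, pts_cur):
    """Squared point-to-epipolar-line distances H of Eqs. (14)-(18) (illustrative sketch).

    pts_ref, pts_cur : (N, 2) matched pixel coordinates in the reference and current frame
    """
    # Relative motion between the reference and current frame, Eqs. (14)-(15).
    R_2r = R_cw2 @ np.linalg.inv(R_cwr)
    t_2r = -R_2r @ t_cwr + t_cw2

    # Fundamental matrix, Eq. (16): F = K^-T (t_2r)^ R_2r K^-1.
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t_2r) @ R_2r @ K_inv

    # Epipolar lines l2 = F * p1 in the current frame, Eq. (17).
    p1 = np.hstack([pts_ref, np.ones((len(pts_ref), 1))])
    lines = (F @ p1.T).T                               # rows are [A, B, C]
    A, B, C = lines[:, 0], lines[:, 1], lines[:, 2]

    # Squared distance of p2 to its epipolar line, Eq. (18).
    u2, v2 = pts_cur[:, 0], pts_cur[:, 1]
    return (A * u2 + B * v2 + C) ** 2 / (A ** 2 + B ** 2)
```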

4. Experiments and Analysis

In order to evaluate the performance of the YKP-SLAM algorithm, this paper uses the public TUM RGB-D dataset [26] for the experiments. The TUM dataset was produced by the Technical University of Munich using a Kinect sensor that captures data at 30 Hz with an image resolution of 640 × 480; a high-precision VICON motion capture system together with an inertial measurement system records the camera position and pose in real time while the image data are acquired, which can be taken as the ground-truth trajectory of the RGB-D camera. In this paper, we mainly use eight dynamic scene sequences from the TUM RGB-D dataset, divided into two categories: walking and sitting. The sitting sequences are low dynamic scenes, in which two people sit in front of a desk and chat, with little motion. The walking sequences are high dynamic scenes, in which two people walk in front of or around a desk, with large motion. For each category, the camera motion is divided into four states: static means the camera is at rest, xyz means the camera translates along the spatial X-Y-Z axes, rpy means the camera rotates in roll, pitch, and yaw, and halfsphere means the camera moves along the trajectory of a hemisphere with a diameter of 1 m.
The experiments were run on a server with Ubuntu 18.04, a GeForce RTX 3060 graphics card with 12 GB of video memory, a 7-core Intel(R) Xeon(R) CPU, and 20 GB of RAM.

4.1. Comparison with ORBSLAM2

Since the YKP-SLAM algorithm proposed in this paper is built on ORB-SLAM2, a comparison experiment with ORB-SLAM2 is conducted first. The absolute trajectory error (ATE) and relative pose error (RPE) [26] are adopted to evaluate algorithm accuracy. The absolute trajectory error is the direct difference between the estimated and ground-truth poses, which intuitively reflects the algorithm accuracy and the global consistency of the trajectory. The relative pose error contains the relative translation error and the relative rotation error and directly measures the drift of the visual odometry. The experimental results are shown in Table 1 and Table 2, where RMSE denotes the root mean square error, Mean denotes the mean error, and Std denotes the standard deviation.
The improvement rates in the table are calculated as follows:
\eta = \left( 1 - \frac{\beta}{\alpha} \right) \times 100\%    (24)
where η represents the algorithm improvement rate, β represents the experimental results of the YKP-SLAM algorithm, and α represents the experimental results of the ORBSLAM2 algorithm.
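As a minimal sketch of how these numbers are obtained, the snippet below computes the RMSE of the absolute trajectory error for a pair of already associated trajectories (the usual similarity alignment step is omitted for brevity) and the improvement rate of Equation (24), reproducing the walking_xyz RMSE entry of Table 1.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the absolute trajectory error for associated poses (alignment omitted)."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)   # per-pose translational error
    return np.sqrt(np.mean(err ** 2))

def improvement_rate(ours, baseline):
    """Eq. (24): improvement of YKP-SLAM (beta) over ORB-SLAM2 (alpha), in percent."""
    return (1.0 - ours / baseline) * 100.0

# Example with the walking_xyz ATE RMSE values from Table 1:
print(round(improvement_rate(0.0147, 0.5185), 2))    # -> 97.16 (%)
```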
Table 1 and Table 2 show the quantitative evaluation of the errors. In the low dynamic sitting sequences, the RMSE of the absolute and relative trajectory errors of the YKP-SLAM algorithm improves over the ORB-SLAM2 algorithm by 46.16% and 40.86% on average, respectively. In the high dynamic walking sequences, the RMSE of the absolute and relative trajectory errors improves by 96.52% and 57.79% on average, respectively. This shows that the YKP-SLAM algorithm greatly improves the trajectory accuracy over the traditional ORB-SLAM2 algorithm in both low and high dynamic scenes.
Figure 7 and Figure 8 show the absolute trajectory error distributions of the ORBSLAM2 algorithm and the YKP-SLAM algorithm under the low dynamic sequences s_xyz, s_half and the high dynamic sequences w_xyz, w_half, respectively. Figure 9 and Figure 10 show the comparison of the estimated trajectory and the real trajectory of the ORBSLAM2 algorithm and the YKP-SLAM algorithm under the low dynamic sequences s_xyz, s_half and the high dynamic sequences w_xyz, w_half, respectively. It can be seen that under the low dynamic sequences s_xyz and s_half, the absolute trajectory error of the YKP-SLAM algorithm is slightly smaller than that of the ORBSLAM2 algorithm, and the estimated trajectory is closer to the real trajectory than the ORBSLAM2 algorithm. Under the high dynamic sequences w_xyz and w_half, the absolute pose error of the YKP-SLAM algorithm is smaller than that of the ORBSLAM2 algorithm, and the estimated trajectory is still very close to the real trajectory, while the estimated trajectory of the ORBSLAM2 algorithm is far away from the real trajectory. This proves that the YKP-SLAM algorithm can effectively improve the pose estimation accuracy of the SLAM system in low dynamic and high dynamic scenes.

4.2. Comparison with Advanced Dynamic SLAM Algorithms

In order to verify the superiority of the YKP-SLAM algorithm, DS-SLAM [16], DynaSLAM [17], and Blitz-SLAM [23] are selected for comparison experiments with YKP-SLAM. The root mean square error (RMSE) and standard deviation (Std) of the absolute trajectory error are selected as the evaluation metrics. The experimental results are shown in Table 3, where bold font indicates the best results. The DS-SLAM and DynaSLAM code and experimental data are open source, while the Blitz-SLAM code is not. As can be seen from the table, the YKP-SLAM algorithm achieves almost the best results compared to the other dynamic SLAM algorithms, in both high dynamic and low dynamic scenes. The performance is slightly worse on the s_rpy and w_rpy sequences, because the camera motion is too large there, making the YOLOv5 target detection results less accurate.

4.3. Ablation Experiment

In order to verify the effectiveness of the improved K-means clustering algorithm and the probability update strategy proposed in this paper, we conduct ablation experiments; the results are shown in Table 4, where bold font indicates the best results and underlining indicates the second best.
In Table 4, Y-SLAM refers to the direct elimination of feature points within the dynamic object frame by YOLOv5 target detection; YK-SLAM is the combination of YOLOv5 and improved K-means clustering to eliminate feature points within the dynamic object; YKP-SLAM is the proposed algorithm.
The comparison between Y-SLAM and YK-SLAM shows that YK-SLAM performs better in the low dynamic environment. This is because the number of dynamic points is small in the low dynamic environment, whereas Y-SLAM eliminates all points inside the dynamic object box and mistakenly deletes some static points, reducing the number of constraints in the pose calculation and thus the pose accuracy. Y-SLAM performs better than YK-SLAM in the high dynamic environment, because the number of dynamic points is larger and their motion amplitude is greater; the dynamic object box is larger than the dynamic object itself, so Y-SLAM rejects more dynamic points and obtains a more accurate pose. YKP-SLAM, with the addition of the probability update strategy, achieves the best results in both low and high dynamic scenes. This is because the probability update strategy assigns appropriate static probabilities to static and dynamic points and then adds all points to the pose calculation, which avoids both the false deletion of static points and the missed detection of dynamic points.

4.4. Real-Time Analysis

Real-time performance is one of the important evaluation indicators of a SLAM system. As shown in Table 5, in order to measure the real-time performance of the YKP-SLAM algorithm, we time each module of the YKP-SLAM algorithm and the ORB-SLAM2 algorithm under the highly dynamic walking_xyz sequence. In the table, A represents the YOLOv5 target detection module, B the ORB feature extraction module, C the improved K-means clustering module, D the probability update module, and E the normal tracking and pose calculation module. The YOLOv5 target detection module and the ORB feature extraction module in the YKP-SLAM algorithm run in parallel. The results show that the YOLOv5 target detection module costs less time than the ORB feature extraction module; that is, there is no need to wait for the YOLOv5 detection results after ORB feature extraction is completed. Therefore, with sufficient computing power, adding the YOLOv5 module does not increase the system time. The average total time per frame of ORB-SLAM2 and YKP-SLAM is 48.20 ms and 62.05 ms, respectively; that is, the running speed reaches about 20 FPS and 16 FPS, respectively. Overall, YKP-SLAM basically meets the real-time requirement of SLAM while ensuring accuracy in dynamic environments.

5. Conclusions

In this paper, a YKP-SLAM algorithm for dynamic environments is proposed. The algorithm first segments the whole current frame with the YOLOv5 target detection algorithm and the improved K-means clustering algorithm and assigns an a priori static probability to each feature point according to the segmentation result. The a priori static probabilities are used as weights to calculate the initial camera pose, and the static probabilities of the feature points are then updated according to the motion constraint and the epipolar constraint to solve the final camera pose. The algorithm is verified on the TUM dataset. Compared with the ORB-SLAM2 algorithm, its accuracy and robustness are greatly improved in both low and high dynamic scenes; compared with other dynamic SLAM algorithms, YKP-SLAM also achieves almost the best localization accuracy. In future work, we will build on the current system to construct dense semantic maps in dynamic scenes and make full use of the localization accuracy in high dynamic scenes and the semantic information provided by YOLOv5 to realize path planning and obstacle avoidance in dynamic scenes.

Author Contributions

Conceptualization, L.L. and J.G.; methodology, J.G.; software, J.G.; validation, L.L., J.G. and R.Z.; formal analysis, L.L.; investigation, R.Z.; resources, J.G.; data curation, J.G.; writing—original draft preparation, L.L. and J.G.; writing—review and editing, J.G.; visualization, R.Z.; supervision, L.L.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Natural Science Foundation of Fujian Province under Grant 2022H6005 and 2022J01952, in part by the Initial Scientific Research Fund of FJUT under Grant GY-Z12079, Grant GY-Z21036, and Grant GY-Z20067.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to the editors and the anonymous reviewers for their insightful comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Klein, G.; Murray, D. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar]
  2. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Computer Vision—ECCV 2014, Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar]
  3. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  4. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  5. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  6. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  7. Zou, D.; Tan, P. Coslam: Collaborative visual slam in dynamic environments. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 354–366. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, R.; Wan, W.; Wang, Y.; Di, K. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes. Remote Sens. 2019, 11, 1143. [Google Scholar] [CrossRef]
  9. Dai, W.; Zhang, Y.; Li, P.; Fang, Z.; Scherer, S. RGB-D SLAM in dynamic environments using point correlations. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 373–389. [Google Scholar] [CrossRef] [PubMed]
  10. Klappstein, J.; Vaudrey, T.; Rabe, C.; Wedel, A.; Klette, R. Moving object segmentation using optical flow and depth information. In Advances in Image and Video Technology, Proceedings of the Pacific-Rim Symposium on Image and Video Technology, Tokyo, Japan, 13–16 January 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 611–623. [Google Scholar]
  11. Fang, Y.; Dai, B. An improved moving target detecting and tracking based on optical flow technique and Kalman filter. In Proceedings of the 2009 4th International Conference on Computer Science & Education, Nanning, China, 25–28 July 2009; pp. 1197–1202. [Google Scholar]
  12. Zhang, T.; Zhang, H.; Li, Y.; Nakamura, Y.; Zhang, L. Flowfusion: Dynamic dense RGB-D SLAM based on optical flow. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 15 September 2020; pp. 7322–7328. [Google Scholar]
  13. Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943. [Google Scholar]
  14. Yang, S.; Wang, J.; Wang, G.; Hu, X.; Zhou, M.; Liao, Q. Robust RGB-D SLAM in dynamic environment using faster R-CNN. In Proceedings of the 2017 3rd IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 2398–2402. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. Available online: https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf (accessed on 23 June 2022). [CrossRef] [PubMed]
  16. Yu, C.; Liu, Z.; Liu, X.J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  17. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef]
  18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  19. Zhang, J.; Shi, C.; Wang, Y. SLAM method based on visual features in dynamic scene. Comput. Eng. 2020, 46, 95–102. [Google Scholar]
  20. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  21. Zhong, F.; Wang, S.; Zhang, Z.; Chen, C.; Wang, Y. Detect-SLAM: Making object detection and SLAM mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1001–1010. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  23. Fan, Y.; Zhang, Q.; Tang, Y.; Liu, S.; Han, H. Blitz-SLAM: A semantic SLAM in dynamic environments. Pattern Recognit. 2022, 121, 108225. [Google Scholar] [CrossRef]
  24. Dvornik, N.; Shmelkov, K.; Mairal, J.; Schmid, C. BlitzNet: A real-time deep network for scene understanding. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4154–4162. [Google Scholar]
  25. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  26. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
Figure 1. The algorithmic framework of YKP-SLAM. In the image, green points represent static points, blue points represent suspicious static points, red points represent dynamic points, and yellow points represent points where the probability changes.
Figure 2. YOLOv5 target detection results. (a-d) show the detection results of the YOLOv5 algorithm in several different scenes.
Figure 3. Schematic diagram of dynamic region division.
Figure 4. Improved K-means clustering, where each color represents one class. The red region is the dynamic region, and the other colored regions are suspicious static regions. (a-d) show the clustering results of the improved K-means clustering algorithm in several different scenes.
Figure 5. Schematic diagram of motion constraints, where the ellipse represents the local map, the rectangle represents the current frame, the red point inside the ellipse represents the local map point, the blue point represents the 3D point of the current frame feature point back-projected to the world coordinate system, and the line between the red point and the blue point represents the motion distance of the feature point.
Figure 6. Schematic diagram of epipolar constraint.
Figure 7. Absolute trajectory error distribution of ORBSLAM2 algorithm. (a) s_xyz. (b) s_half. (c) w_xyz. (d) w_half.
Figure 8. Absolute trajectory error distribution of YKP-SLAM algorithm. (a) s_xyz. (b) s_half. (c) w_xyz. (d) w_half.
Figure 9. Comparison of estimated trajectory and real trajectory of ORBSLAM2 algorithm. The colored line is the estimated trajectory, and the gray line is the real trajectory. (a) s_xyz. (b) s_half. (c) w_xyz. (d) w_half.
Figure 10. Comparison of estimated trajectory and real trajectory of YKP-SLAM algorithm. The colored line is the estimated trajectory, and the gray line is the real trajectory. (a) s_xyz. (b) s_half. (c) w_xyz. (d) w_half.
Table 1. Comparison of absolute trajectory error (ATE) between ORB-SLAM2 and YKP-SLAM.

Sequences | ORB-SLAM2/m (RMSE / Mean / Std) | YKP-SLAM/m (RMSE / Mean / Std) | Improvement/% (RMSE / Mean / Std)
sitting_xyz | 0.0111 / 0.0093 / 0.0059 | 0.0072 / 0.0065 / 0.0033 | 35.14 / 30.11 / 44.07
sitting_half | 0.0437 / 0.0360 / 0.0247 | 0.0153 / 0.0132 / 0.0076 | 64.99 / 63.33 / 69.23
sitting_static | 0.0128 / 0.0120 / 0.0046 | 0.0052 / 0.0043 / 0.0028 | 59.38 / 64.17 / 39.13
sitting_rpy | 0.0358 / 0.0293 / 0.0205 | 0.0268 / 0.0237 / 0.0126 | 25.13 / 19.11 / 38.53
walking_xyz | 0.5185 / 0.4420 / 0.2711 | 0.0147 / 0.0130 / 0.0068 | 97.16 / 97.06 / 97.49
walking_half | 0.5820 / 0.4571 / 0.3603 | 0.0245 / 0.0220 / 0.0107 | 95.79 / 95.19 / 97.03
walking_static | 0.2742 / 0.2286 / 0.1514 | 0.0063 / 0.0056 / 0.0026 | 97.70 / 97.55 / 98.28
walking_rpy | 1.5320 / 1.4262 / 0.5594 | 0.0702 / 0.0489 / 0.0514 | 95.41 / 96.57 / 90.81
Table 2. Comparison of relative pose error (RPE) between ORB-SLAM2 and YKP-SLAM.

Sequences | ORB-SLAM2/m (RMSE / Mean / Std) | YKP-SLAM/m (RMSE / Mean / Std) | Improvement/% (RMSE / Mean / Std)
sitting_xyz | 0.0148 / 0.0126 / 0.0077 | 0.0079 / 0.0070 / 0.0038 | 46.62 / 44.44 / 50.65
sitting_half | 0.0227 / 0.0121 / 0.0192 | 0.0137 / 0.0108 / 0.0084 | 39.64 / 10.74 / 56.25
sitting_static | 0.0180 / 0.0169 / 0.0063 | 0.0058 / 0.0055 / 0.0031 | 67.78 / 67.46 / 50.79
sitting_rpy | 0.0256 / 0.0208 / 0.0148 | 0.0232 / 0.0171 / 0.0151 | 9.38 / 17.79 / −2.27
walking_xyz | 0.0382 / 0.0303 / 0.0233 | 0.0139 / 0.0116 / 0.0076 | 63.61 / 61.72 / 67.38
walking_half | 0.0452 / 0.0317 / 0.0322 | 0.0196 / 0.0148 / 0.0128 | 56.64 / 53.31 / 60.25
walking_static | 0.0473 / 0.0291 / 0.0373 | 0.0072 / 0.0062 / 0.0031 | 84.78 / 78.69 / 91.69
walking_rpy | 0.0429 / 0.0316 / 0.0291 | 0.0317 / 0.0218 / 0.0239 | 26.11 / 31.01 / 17.97
Table 3. Comparison of absolute trajectory error (ATE) between the YKP-SLAM algorithm and other dynamic SLAM algorithms.

Sequences | DS-SLAM/m (RMSE / Std) | DynaSLAM/m (RMSE / Std) | Blitz-SLAM/m (RMSE / Std) | YKP-SLAM/m (RMSE / Std)
sitting_xyz | 0.0187 / 0.0119 | 0.0135 / 0.0063 | 0.0148 / 0.0069 | 0.0072 / 0.0033
sitting_half | 0.0162 / 0.0061 | 0.0193 / 0.0084 | 0.0160 / 0.0076 | 0.0153 / 0.0076
sitting_static | 0.0065 / 0.0033 | 0.0085 / 0.0051 | / / | 0.0052 / 0.0028
sitting_rpy | 0.0266 / 0.0153 | 0.0865 / 0.0516 | / / | 0.0268 / 0.0126
walking_xyz | 0.0247 / 0.0186 | 0.0176 / 0.0086 | 0.0153 / 0.0078 | 0.0147 / 0.0068
walking_half | 0.0303 / 0.0159 | 0.0273 / 0.0130 | 0.0256 / 0.0126 | 0.0245 / 0.0107
walking_static | 0.0081 / 0.0036 | 0.0067 / 0.0031 | 0.0102 / 0.0052 | 0.0063 / 0.0026
walking_rpy | 0.4442 / 0.2350 | 0.0389 / 0.0237 | 0.0356 / 0.0220 | 0.0702 / 0.0514
Table 4. Comparison of absolute trajectory error in the ablation experiment.

Sequences | Y-SLAM/m (RMSE / Std) | YK-SLAM/m (RMSE / Std) | YKP-SLAM/m (RMSE / Std)
sitting_xyz | 0.0168 / 0.0079 | 0.0129 / 0.0068 | 0.0072 / 0.0033
sitting_half | 0.0858 / 0.0178 | 0.0189 / 0.0084 | 0.0153 / 0.0076
sitting_static | 0.0072 / 0.0035 | 0.0079 / 0.0032 | 0.0052 / 0.0028
sitting_rpy | 0.0481 / 0.0376 | 0.0384 / 0.0221 | 0.0268 / 0.0126
walking_xyz | 0.0181 / 0.0105 | 0.0212 / 0.0111 | 0.0147 / 0.0068
walking_half | 0.0292 / 0.0144 | 0.0301 / 0.0135 | 0.0245 / 0.0107
walking_static | 0.0079 / 0.0034 | 0.0080 / 0.0035 | 0.0063 / 0.0026
walking_rpy | 0.0962 / 0.0625 | 0.1457 / 0.0701 | 0.0702 / 0.0514
Table 5. The average running time of each module.

Algorithm | A/ms | B/ms | C/ms | D/ms | E/ms | Total Time/ms
ORB-SLAM2 | / | 19.28 | / | / | 28.92 | 48.20
YKP-SLAM | 15.46 | 19.28 | 7.33 | 6.52 | 28.92 | 62.05
