Article

A Semantic Information-Based Optimized vSLAM in Indoor Dynamic Environments

1
School of Geomatics and Urban Spatial Information, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
2
Engineering Research Center of Representative Building and Architectural Heritage Database, Ministry of Education, Beijing 102616, China
3
Key Laboratory for Urban Spatial Informatics of Ministry of Natural Resources, Beijing 102616, China
4
Beijing Key Laboratory for Architectural Heritage Fine Reconstruction and Health Monitoring, Beijing 102616, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8790; https://doi.org/10.3390/app13158790
Submission received: 2 March 2023 / Revised: 26 July 2023 / Accepted: 27 July 2023 / Published: 29 July 2023
(This article belongs to the Special Issue AI-Based Image Processing)

Abstract

In unknown environments, mobile robots can use visual Simultaneous Localization and Mapping (vSLAM) to localize themselves while building sparse feature maps and dense maps. However, traditional vSLAM assumes a static scene and rarely accounts for the dynamic objects present in real environments. In addition, because sparse feature maps and dense maps carry no semantic information, it is difficult for a robot to perform high-level semantic tasks. To improve the environment perception and mapping accuracy of mobile robots in dynamic indoor environments, we propose a semantic information-based optimized vSLAM algorithm that extends ORB-SLAM2 with a dynamic region detection module and a semantic segmentation module. First, a dynamic region detection module is added to the visual odometry: dynamic regions of the image are detected by combining a homography matrix with a dense optical flow method, which improves the accuracy of pose estimation in dynamic environments. Second, semantic segmentation of images is implemented with the BiSeNet V2 network. To address the over-segmentation problem in semantic segmentation, a region growing algorithm that incorporates depth information is proposed to optimize the 3D segmentation. During mapping, the semantic information and the detected dynamic regions are used to remove dynamic objects and to build an indoor map containing semantic information. The system not only removes the effect of dynamic objects on pose estimation effectively, but also uses the semantic information of the images to build semantic indoor maps. The proposed algorithm is evaluated on the TUM RGB-D dataset and in real dynamic scenes. The results show that its accuracy outperforms that of ORB-SLAM2 and DS-SLAM in dynamic scenarios.

1. Introduction

Simultaneous Localization and Mapping (SLAM) [1] is considered the basis for intelligent mobile robots to perceive their surroundings and perform tasks such as navigation, obstacle avoidance, and path planning. SLAM is already used in many fields, such as autonomous driving and virtual reality [2]. The core problem SLAM solves is for a mobile robot to use its sensors to build a map of an unknown environment while simultaneously estimating its own pose within it [3]. Current SLAM research falls into two main categories: laser SLAM and vSLAM. Compared with the expensive sensors required by laser SLAM, vSLAM is popular among researchers for its price advantage. The sensors commonly used for vSLAM are monocular, stereo (binocular), and depth cameras. Furthermore, vSLAM can use the images captured by the camera for semantic perception of the surrounding environment [4].
Traditional vSLAM has achieved many results. However, these vSLAM schemes work under the assumption of a static environment, and dynamic objects in real scenes degrade their performance [5]. Moreover, dynamic objects left in the maps are not conducive to robot navigation or human-machine interaction [6]. Therefore, the effect of dynamic objects on vSLAM cannot be ignored.
With the development of deep learning, many scholars have integrated methods such as object detection and semantic segmentation into vSLAM. These methods improve the scene understanding of vSLAM and open new directions for its development [7]. In tracking, semantic information can assist inter-frame matching to obtain more robust results. In mapping, vSLAM can build semantic maps by combining the semantic information in images, giving mobile robots a higher-level understanding of the environment.
To handle the effect of dynamic objects on vSLAM, a semantic information-based optimized vSLAM is proposed with ORB-SLAM2 [8] as the base framework. During frame matching, to improve the accuracy of pose estimation in dynamic scenes, dynamic regions are detected using a homography matrix and a dense optical flow method, and the feature points falling in those regions are removed. During mapping, the semantic information in the image is obtained with a deep learning method; a region growing algorithm combined with depth information then improves the accuracy of object semantic segmentation. Finally, the 3D information of dynamic objects is removed using the semantic information and the dynamic region detection, so that dynamic objects are not built into the map. The main contributions of this paper are as follows:
1. A strategy that enables vSLAM initialization in dynamic environments is proposed, which is able to initialize local maps and poses robustly and efficiently.
2. An efficient method for inter-frame pose estimation in dynamic environments based on homography matrix and dense optical flow method is proposed.
3. A method of optimized semantic segmentation based on depth images is proposed for accurately annotating semantic objects in key frames.
The remainder of this paper is organized as follows: Section 2 discusses related research on vSLAM in dynamic environments. Section 3 presents the proposed method for localization and map building in indoor dynamic scenes. Section 4 validates the proposed method using a publicly available dataset and a real scenario. Finally, Section 5 summarizes this study.

2. Related Work

At present, many vSLAM schemes already perform well in static environments, for example monocular schemes [9,10], stereo schemes [11] and RGB-D schemes [12]. With the popularity of depth cameras such as the Kinect and RealSense series, RGB-D cameras are widely used in vSLAM. Compared to monocular and stereo cameras, depth cameras measure range directly and thus avoid much of the computational cost of purely visual schemes. ORB-SLAM2 is widely acclaimed in the vSLAM field as a complete open-source scheme. It determines the correspondence between the 2D ORB points of the current frame and the 3D points of the local map by feature matching; the matched 3D points of the local map are then projected into the current frame, and the camera pose is estimated by minimizing the reprojection error. Finally, a global sparse map is built by bundle adjustment (BA). To run vSLAM in specific static environments, several researchers have designed variants of the basic scheme. Cai et al. [13] proposed an affine transformation-based ORB feature extraction method (Affine-ORB) and applied it to existing robot vision SLAM methods. Li et al. [14] designed an RGB-D SLAM system that works specifically in structured environments, aiming to improve tracking and mapping accuracy by relying on geometric features extracted from the surroundings. Nebiker et al. [15] designed a mobile mapping system installed on a motorized tricycle to collect on-street parking statistics using an RGB-D camera and GPS. Chen et al. [16] built an RGB-D SLAM system suitable for complex orchard environments. Zhang et al. [17] combined Grid-based Motion Statistics (GMS) with the RANSAC algorithm to jointly filter the ORB feature matching set and improve the pose accuracy of ORB-SLAM2. However, all of the above works assume static environments. When dynamic objects are present, the performance of vSLAM can deteriorate by a factor of several or even tens [18].
Dynamic objects usually affect the inter-frame pose estimation and mapping of vSLAM systems, and mismatches between feature points on dynamic objects are an important source of pose estimation error. Wei et al. [19] combined a GMS feature point matching method with a k-means clustering algorithm to reject feature points on dynamic objects. Yang et al. [20] clustered regions in each image frame using depth and intensity information and then eliminated dynamic feature points by checking the motion consistency of the feature points in each region.
In recent years, many scholars have regarded the incorporation of deep learning into vSLAM as the key to solving the dynamic environment problem. The semantic information obtained from images by deep learning further improves the perception and understanding of the environment. Bescos et al. [21] proposed DynaSLAM based on ORB-SLAM2, which uses the Mask R-CNN [22] network for dynamic detection and background inpainting. Yu et al. [23] proposed DS-SLAM, which uses SegNet [24] to perform semantic segmentation and obtain semantic masks. Later, Xi et al. [25] improved the segmentation accuracy of DS-SLAM with the PSPNet network [26] to achieve higher pose estimation accuracy in dynamic environments. Xiao et al. [27] proposed Dynamic-SLAM, which uses a convolutional Single Shot MultiBox Detector (SSD) [28] to detect dynamic objects. Sun et al. [29] were the first to apply weakly supervised semantic segmentation to dynamic detection in vSLAM; the drawback of this approach is that motion detection takes too much time, which affects the real-time performance of SLAM. Zhang et al. proposed VDO-SLAM [30], which does not rely on prior motion information; instead, it adopts a dense optical flow method and semantic segmentation to estimate the motion of objects in the scene and builds a globally consistent scene map with high robustness and accuracy. Soares et al. [31] proposed Crowd-SLAM, which uses the YOLOv3 [32] Tiny network to detect crowds in images and copes well with crowded environments.
From the above analysis, there has been considerable research on vSLAM, but dynamic objects in the scene greatly affect conventional vSLAM schemes. Many scholars have therefore begun to study vSLAM systems that operate in dynamic environments, focusing on dynamic feature point elimination and point cloud map building.
1. For dynamic feature point elimination, existing methods fall into two types. One eliminates dynamic feature points using multi-view geometry [33] or new matching methods. The other uses deep learning to detect dynamic objects and then eliminates the feature points in the dynamic regions; however, deep learning-based approaches require a large amount of computation and usually cannot run in real time.
2. For point cloud map building, existing methods eliminate dynamic objects from local maps by semantically segmenting key frames. This not only reduces the impact of dynamic objects on global optimization and loop closure detection, but also makes it possible to build maps with semantic information. However, limited by segmentation accuracy, most segmentation networks cannot extract objects precisely.

3. Methodology

3.1. Overview

The vSLAM tracking and mapping modules are affected when objects move within the camera's field of view during indoor data acquisition. Therefore, vSLAM should be capable of robust inter-frame matching and dynamic object detection in dynamic environments. To address these problems, we optimized a vSLAM system based on ORB-SLAM2 that runs in dynamic environments and generates semantic maps, as shown in Figure 1. Compared with ORB-SLAM2, our system introduces the following four modules.
  • Tracking module. ORB feature points are extracted from the color image. The dynamic regions in the image are then acquired by the dynamic detection module, and the ORB feature points inside these regions are removed. Finally, the remaining feature points are matched; the feature matching set is used to estimate the camera pose and to decide whether the current image is a key frame. For the first frame, the dynamic regions cannot yet be acquired, so only the first frame is semantically segmented to obtain the dynamic targets directly; pose estimation is then performed between the first and second frames to complete initialization.
  • Motion detection module. It calculates the homography between the current color image frame and the previous frame. The current frame is then warped with this homography, and the optical flow field is computed with a dense optical flow method. Finally, the dynamic region used by the tracking module is obtained from the optical flow field. In addition, the semantic masks are combined with the dynamic regions to obtain the dynamic targets for the map building module.
  • Semantic segmentation module. Semantic information is obtained by segmenting the color images with the BiSeNet V2 network. The segmentation results are then optimized using a region growing algorithm and the depth image. Finally, the semantic information in the image is mapped to the point cloud.
  • Map building module. The tracking module selects key frames; the depth image and color image regions of dynamic targets are removed from each key frame before building the point cloud of the current frame and adding semantic information. Finally, the mapping module uses the pose information to join the point clouds of the key frames, and if a loop closure is detected it performs global pose optimization and outputs a point cloud map with semantic information together with the pose information.

3.2. Tracking Module

The initialization performance of vSLAM has a great impact on the subsequent pose estimation and mapping [18]. Conventional RGB-D SLAM usually adopts the first frame k_1 as a key frame once the initialization condition is satisfied. The local map is initialized by projecting k_1 into 3D space using the depth information acquired by the camera. Then, Perspective-n-Point (PnP) is used on the matches between the feature points of the current frame k_2 and the local map points of k_1 to compute the initial pose. However, when vSLAM initializes in a dynamic environment with this traditional approach, objects moving in the camera's field of view are not only added to the local map; the feature points located on dynamic objects in the two adjacent frames also significantly disturb inter-frame matching. Therefore, we propose a strategy for vSLAM initialization in dynamic environments that can robustly and efficiently initialize the local map and pose. First, the ORB feature points of k_1 are extracted. At the same time, the semantic segmentation function of the semantic segmentation module (Section 3.4) is used to obtain the semantic information of dynamic objects (e.g., pedestrians) in k_1; this information is used to eliminate the key points located on dynamic objects in k_1. Next, the matching relationship between the key points of k_1 and k_2 is checked against the initialization condition of ORB-SLAM2. If it is not satisfied, k_1 is released and k_2 becomes the new k_1. Otherwise, the dynamic objects of k_1 are removed from the 3D local map, and the matches between the key points of k_2 and k_1 are processed by the dynamic detection module (Section 3.3) to obtain the initial pose and eliminate the dynamic feature points in k_2. After initialization, ORB feature extraction, dynamic feature rejection, feature matching set construction and pose estimation are performed on the subsequent images.
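The initialization strategy above can be summarized in a short sketch. This is a minimal illustration in Python with OpenCV, not the authors' implementation: the dynamic mask of k_1 is assumed to come from the semantic segmentation module (Section 3.4) as a binary array, and the simple match-count check stands in for the full ORB-SLAM2 initialization condition.

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def static_keypoints(gray, dynamic_mask):
    """Extract ORB keypoints/descriptors and drop those lying on dynamic pixels (mask != 0)."""
    kps, desc = orb.detectAndCompute(gray, None)
    if desc is None:
        return [], None
    keep = [i for i, kp in enumerate(kps)
            if dynamic_mask[int(kp.pt[1]), int(kp.pt[0])] == 0]
    return [kps[i] for i in keep], desc[keep]

def try_initialize(frame1, frame2, dyn_mask1, min_matches=100):
    """Return matched static keypoints of k1/k2 if the initialization condition
    (enough matches) holds; otherwise return None so that k2 becomes the new k1."""
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    kp1, d1 = static_keypoints(g1, dyn_mask1)   # semantic mask removes e.g. pedestrians in k1
    kp2, d2 = orb.detectAndCompute(g2, None)
    if d1 is None or d2 is None or len(d1) == 0:
        return None
    matches = matcher.match(d1, d2)
    if len(matches) < min_matches:              # condition not met: release k1
        return None
    return kp1, kp2, matches                    # feed into PnP and local-map creation
```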

3.3. Dynamic Detection Module

In the odometry of vSLAM, the conventional tracking module determines the inter-frame pose from the matched feature point pairs of the previous frame k_{n−1} and the current frame k_n. When objects move within the camera's field of view, inter-frame matching is easily disturbed, because feature points located on dynamic objects usually produce incorrect matches. Moreover, when the camera itself is moving, the image positions of stationary objects, background and dynamic objects all change, which makes it harder to determine the dynamic regions. Considering that odometry requires real-time inter-frame pose estimation, we propose an inter-frame pose estimation method based on a homography and a dense optical flow method, which addresses these problems efficiently and effectively. As shown in Figure 2, the homography is used to align k_n to the moment of k_{n−1}. The dynamic region in k_n is then obtained with the dense optical flow method, and the feature points located in this region are eliminated, as shown in Figure 3. Finally, the inter-frame pose is re-estimated from the static feature points in k_{n−1} and k_n.

3.3.1. Homography Calculation

To determine the homography matrix between adjacent frames accurately, matches located on dynamic objects must be excluded as far as possible before inter-frame matching. For the first inter-frame matching, the dynamic feature points in k_1 are eliminated using the semantic information from the semantic segmentation module. For the n-th inter-frame matching, the dynamic feature points in k_{n−1} were already eliminated during the (n−1)-th matching based on the dynamic region information, so they do not participate in the homography computation for the current pair of adjacent frames. On this basis, the matching relationship is constructed between the static feature points in k_{n−1} and the feature points in k_n. Finally, the RANSAC algorithm is used to obtain a robust homography matrix between the adjacent frames k_{n−1} and k_n.
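As a minimal sketch of this step (assuming the static keypoints of k_{n−1} and their matches to k_n are already available as OpenCV structures), the robust homography can be estimated with RANSAC as follows; the reprojection threshold is illustrative.

```python
import cv2
import numpy as np

def estimate_homography(kp_prev, kp_curr, matches, ransac_thresh=3.0):
    """Robust homography mapping k_n onto k_{n-1}, using only static matches.

    kp_prev / kp_curr: lists of cv2.KeyPoint; matches: list of cv2.DMatch.
    Returns the 3x3 homography H and the RANSAC inlier mask.
    """
    pts_prev = np.float32([kp_prev[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_curr = np.float32([kp_curr[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Map points of the current frame onto the previous frame (camera-motion compensation).
    H, inliers = cv2.findHomography(pts_curr, pts_prev, cv2.RANSAC, ransac_thresh)
    return H, inliers
```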

3.3.2. Dynamic Region Detection

The dense pyramidal optical flow method is applied to detect dynamic regions in images. By building a dense two-dimensional optical flow field over the image, richer motion information is obtained. In motion scenes, if a dynamic object moves too fast, the difference between the two images becomes large, and single-layer optical flow easily falls into a local optimum and cannot track correctly. The image is therefore scaled by an image pyramid to obtain versions at different resolutions, as shown in Figure 4. When calculating the optical flow, the result computed at the previous (coarser) layer is used as the initial value for the next layer. Computing the optical flow in this coarse-to-fine way effectively handles the uncertainty of object motion; the results are shown in Figure 5b. After the optical flow field is obtained, the Munsell color system [34] is used to color-code it, as shown in Figure 5c. To determine the dynamic regions in the image, a dynamic optical flow threshold θ is defined: regions where the optical flow amplitude is higher than θ are considered moving, whereas in static regions the amplitude stays below θ. The optical flow field is binarized with θ, as shown in Figure 5d. The optical flow amplitude is defined by Equation (1).
f = \sqrt{f_u^2 + f_v^2}    (1)
where f represents the optical flow amplitude, f_u the horizontal optical flow component, and f_v the vertical optical flow component.
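The dynamic-region detection described above can be sketched as follows. Farneback's pyramidal dense optical flow is used here as a stand-in for whichever coarse-to-fine dense flow the authors used, and the threshold θ is an illustrative value rather than the one tuned in the paper.

```python
import cv2
import numpy as np

def dynamic_region_mask(prev_gray, curr_gray, H, theta=2.0):
    """Binary mask of dynamic pixels in the current frame k_n.

    prev_gray, curr_gray: grayscale images of k_{n-1} and k_n.
    H: homography mapping k_n onto k_{n-1} (compensates camera motion).
    theta: flow-magnitude threshold in pixels (illustrative value).
    """
    h, w = prev_gray.shape
    # Align k_n to the viewpoint of k_{n-1}; residual motion is then object motion.
    curr_aligned = cv2.warpPerspective(curr_gray, H, (w, h))
    # Pyramidal (coarse-to-fine) dense optical flow.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_aligned, None,
                                        pyr_scale=0.5, levels=4, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Flow amplitude of Eq. (1), then binarization with theta.
    magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    return (magnitude > theta).astype(np.uint8)   # 1 = dynamic region
```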

3.4. Semantic Segmentation Module

Conventional SLAM mapping modules can build a complete 3D map of the surrounding environment, but the map contains no semantic information, which makes it difficult to perform high-level semantic tasks. At the same time, loop closure detection and global map optimization are affected when dynamic objects remain in the local map. Therefore, we add a separate semantic segmentation thread to segment the key frames semantically. In addition, a method is proposed to optimize the semantic segmentation based on depth images. Finally, the dynamic region detection of the tracking module is combined with the segmentation to eliminate dynamic objects.

3.4.1. BiSeNet V2

In the semantic segmentation thread, the BiSeNet V2 network [35] performs real-time semantic segmentation of the original image to obtain its semantic information; it meets the real-time requirement while maintaining relatively accurate detection. To apply BiSeNet V2 to vSLAM in indoor environments, the network was trained on the Microsoft Common Objects in Context (MS COCO) dataset [36]. The trained model was then used for semantic segmentation experiments on dynamic environment datasets and in real scenes. As shown in Figure 6, the BiSeNet V2 network effectively performs the segmentation task and obtains the semantic information of images in indoor dynamic environments.
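A hedged sketch of how a key frame might be pushed through the trained segmentation model. The TorchScript file name, the input normalization and the output layout (per-class logits) are assumptions made for illustration; the actual BiSeNet V2 training and export pipeline is not described in this paper.

```python
import cv2
import numpy as np
import torch

# Hypothetical export: a TorchScript version of the trained BiSeNet V2 model that maps a
# normalized RGB tensor (1, 3, H, W) to per-class logits (1, C, H, W).
model = torch.jit.load("bisenetv2_coco.pt").eval()

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def semantic_labels(bgr_image):
    """Per-pixel class labels (H, W) for one key frame."""
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = (rgb - MEAN) / STD
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)   # (1, 3, H, W)
    with torch.no_grad():
        logits = model(tensor)                                     # (1, C, H, W)
    return logits.argmax(dim=1).squeeze(0).cpu().numpy()           # class id per pixel
```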

3.4.2. Refine Segmentation

The depth camera provides color and depth images. The two-dimensional image and the semantic information of each key frame can be mapped to a three-dimensional point cloud using the camera intrinsics, as shown in Figure 7. However, because of over-segmentation when obtaining the semantic information of the image, the two-dimensional semantic information is mapped to the three-dimensional point cloud with large deviations, as shown in Figure 7c. Therefore, a region growing algorithm that uses depth information is proposed to optimize the 3D semantic annotation, as shown in Table 1. As shown in Figure 8, the accuracy of the 3D semantic annotation optimized by the algorithm in Table 1 improves significantly.
m_{pq} = \sum_{u,v} u^p v^q I(u,v), \quad p, q = 0, 1; \qquad P_{center} = (u_0, v_0) = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right)    (2)
where p and q take the values 0 or 1; I(u, v) represents the gray value at pixel (u, v); m_pq represents the geometric moment of the mask contour; and P_center represents the mask center point.
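For illustration, the seed point of Equation (2) can be computed directly from the image moments of the binary mask; this is a sketch, not the authors' code.

```python
import cv2

def mask_seed(mask):
    """Seed point (u0, v0) of a binary semantic mask via the image moments of Eq. (2)."""
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:                       # empty mask: no seed
        return None
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
```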
In step 2 of the algorithm in Table 1, the Euclidean distance threshold T must be determined. A fixed threshold is only applicable to a certain class of cases, so we give a method that selects the Euclidean distance threshold T dynamically by random sampling within the mask.
First, i pixel points are randomly selected in the semantic mask, together with the eight pixels adjacent to each of them. The average Euclidean distance between the spatial coordinates of each random pixel and those of its neighbours is calculated by Equation (3). The values ρ̄_m are then sorted and, after eliminating the largest and smallest terms, their mean ρ is computed. Finally, the Euclidean distance threshold T used in step 2 of Table 1 is obtained from Equation (4).
\bar{\rho}_m = \frac{1}{8} \sum_{n=1}^{8} \sqrt{(x_n^m - x_m)^2 + (y_n^m - y_m)^2 + (z_n^m - z_m)^2}, \quad m = 1, 2, \ldots, i; \qquad \rho = \frac{1}{i-2} \sum_{m=2}^{i-1} \bar{\rho}_m    (3)
where ρ̄_m represents the average Euclidean distance between the spatial coordinates of the m-th random pixel and those of its adjacent pixels; (x_m, y_m, z_m) are the spatial coordinates corresponding to the m-th random pixel; (x_n^m, y_n^m, z_n^m) are the spatial coordinates corresponding to its eight adjacent pixels; and ρ represents the average Euclidean distance after trimming.
T = \varepsilon \rho    (4)
where ε represents the dynamic coefficient, and ρ represents the average value of Euclidean distance.
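The following sketch combines the seed of Equation (2), the adaptive threshold of Equations (3) and (4), and the growth loop of Table 1. The intrinsics tuple, the number of random samples and the coefficient ε are placeholder values; depth is assumed to be in metres.

```python
import numpy as np

def backproject(u, v, depth, intr):
    """Pixel (u, v) plus depth (metres) -> 3D point in the camera frame."""
    fx, fy, cx, cy = intr
    z = float(depth[v, u])
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def dynamic_threshold(mask, depth, intr, n_samples=50, eps=2.0):
    """Eqs. (3)-(4): trimmed mean of 8-neighbour 3D distances inside the mask, scaled by eps."""
    h, w = mask.shape
    vs, us = np.nonzero(mask)
    inside = (us > 0) & (us < w - 1) & (vs > 0) & (vs < h - 1)   # keep neighbours in bounds
    us, vs = us[inside], vs[inside]
    idx = np.random.choice(len(us), size=min(n_samples, len(us)), replace=False)
    means = []
    for u, v in zip(us[idx], vs[idx]):
        p = backproject(u, v, depth, intr)
        dists = [np.linalg.norm(backproject(u + du, v + dv, depth, intr) - p)
                 for du in (-1, 0, 1) for dv in (-1, 0, 1) if (du, dv) != (0, 0)]
        means.append(np.mean(dists))
    means = sorted(means)[1:-1]              # drop the largest and smallest terms
    return eps * np.mean(means)

def grow_region(mask, depth, seed, intr):
    """Table 1: grow from the seed over 8-neighbours whose 3D distance stays below T."""
    T = dynamic_threshold(mask, depth, intr)
    visited = np.zeros_like(mask, dtype=bool)
    u0, v0 = seed
    visited[v0, u0] = True
    cloud, stack = [], [seed]
    while stack:
        u, v = stack.pop()
        p = backproject(u, v, depth, intr)
        cloud.append(p)
        for du in (-1, 0, 1):
            for dv in (-1, 0, 1):
                uu, vv = u + du, v + dv
                if (0 <= vv < mask.shape[0] and 0 <= uu < mask.shape[1]
                        and mask[vv, uu] and not visited[vv, uu]):
                    if np.linalg.norm(backproject(uu, vv, depth, intr) - p) < T:
                        visited[vv, uu] = True
                        stack.append((uu, vv))
    return np.asarray(cloud)                 # optimized semantic point cloud set P
```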

3.5. Mapping Module

To remove dynamic objects from the local maps, the motion regions obtained in Section 3.3 are used to decide whether each semantic mask produced by the BiSeNet V2 network is a dynamic object mask. Specifically, the proportion ρ_i of dynamic-region pixels in each semantic mask is first calculated according to Equation (5). If ρ_i is higher than the threshold φ, the semantic mask is judged dynamic; otherwise it is static, as shown in Figure 9. The local three-dimensional semantic map is then generated after removing the dynamic mask areas from the color and depth images, as shown in Figure 10. The point clouds of the individual frames are concatenated into a global map using the pose information. Finally, the global map is optimized into an accurate 3D semantic map using the loop closure detection and global BA of ORB-SLAM2.
\rho_i = \frac{F_i}{M_i}    (5)
where M_i represents the number of pixels of one semantic mask in the image and F_i represents the number of pixels of the dynamic region contained in M_i.
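Equation (5) and the threshold test can be written compactly as below; the value of φ is illustrative, as the threshold actually used is not stated here.

```python
import numpy as np

def is_dynamic_mask(semantic_mask, dynamic_region, phi=0.3):
    """Eq. (5): a semantic mask is dynamic if its share of dynamic pixels exceeds phi."""
    M_i = np.count_nonzero(semantic_mask)
    if M_i == 0:
        return False
    F_i = np.count_nonzero((semantic_mask > 0) & (dynamic_region > 0))
    return F_i / M_i > phi
```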
After a local map containing semantic information has been constructed for each key frame, the global map is assembled using the camera poses. Let the 3D points constructed in frame i form the point set P_i; the local point cloud map is built by stitching the 3D points of the individual frames. When a loop closure is detected, the vSLAM system optimizes the overall motion poses and outputs the global point cloud map together with the pose information.
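A minimal sketch of this map assembly step, assuming a depth image in metres, a binary static mask with dynamic objects already removed, and a 4 × 4 camera-to-world pose per key frame; loop closure and global BA are handled by ORB-SLAM2 and are not shown.

```python
import numpy as np

def frame_point_cloud(depth, color, static_mask, intr):
    """Back-project the static pixels of a key frame into a coloured camera-frame cloud."""
    fx, fy, cx, cy = intr
    vs, us = np.nonzero((static_mask > 0) & (depth > 0))
    z = depth[vs, us]
    pts = np.stack([(us - cx) * z / fx, (vs - cy) * z / fy, z], axis=1)   # (M, 3) points
    return pts, color[vs, us]                                             # (M, 3) colours

def stitch(clouds, poses):
    """Transform each frame cloud P_i by its 4x4 camera-to-world pose and concatenate."""
    world = []
    for pts, T in zip(clouds, poses):
        world.append(pts @ T[:3, :3].T + T[:3, 3])
    return np.vstack(world)
```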

4. Experiments and Evaluation

4.1. Experimental Data and Evaluation Criteria

The feasibility of the proposed method was verified on the TUM RGB-D dataset and in real scenarios, using Ubuntu 18.04 on a computer with an i7-9700K CPU, 16 GB RAM and an Nvidia GeForce RTX 2060 GPU. The TUM RGB-D dataset consists of color and depth images (640 × 480) acquired by a Microsoft Kinect sensor at full frame rate (30 Hz). The ground-truth trajectory of the sensor is acquired by a high-precision motion-capture system with eight high-speed tracking cameras (100 Hz). The frame-to-frame relative error of the ground-truth data, measured at the Kinect optical centre, is less than 1 mm and 0.5 deg, and the absolute error over the entire motion-capture area is less than 10 mm and 0.5 deg [37]. Five indoor dynamic scenes were selected from the dataset for testing: freiburg3_sitting_static, freiburg3_walking_halfsphere, freiburg3_walking_rpy, freiburg3_walking_static, and freiburg3_walking_xyz. For the real scenario, an Intel RealSense D435 was used to collect the data; the device parameters are given in Table 2.
The experimental results are evaluated with two criteria: the absolute trajectory error (ATE) and the relative pose error (RPE). The ATE is defined as the error between the ground-truth trajectory and the trajectory estimated by SLAM, calculated by Equation (6). The RPE is defined as the error between the true pose change and the pose change estimated by SLAM over a given time interval, calculated by Equation (7).
e_i = \left\| \log\left( T_{gt,i}^{-1} T_{est,i} \right) \right\|_2, \qquad \mathrm{RMSE}_{ATE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \| e_i \|^2}    (6)
f_i = \left\| \log\left( \left( T_{gt,i}^{-1} T_{gt,i+\Delta t} \right)^{-1} \left( T_{est,i}^{-1} T_{est,i+\Delta t} \right) \right) \right\|_2, \qquad \mathrm{RMSE}_{RPE} = \sqrt{\frac{1}{N - \Delta t} \sum_{i=1}^{N - \Delta t} \| f_i \|^2}    (7)
where T_{gt,i} represents the ground-truth pose of the i-th frame, T_{est,i} represents the pose estimated by SLAM for the i-th frame, and e_i represents the absolute trajectory error of the i-th frame. T_{gt,i+Δt} and T_{est,i+Δt} represent the ground-truth and estimated poses of the (i+Δt)-th frame, and f_i represents the relative pose error from the i-th to the (i+Δt)-th frame.
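As a worked illustration of Equation (6): when only camera positions are compared (as in the TUM benchmark tools, which also handle time association and trajectory alignment), the ATE RMSE reduces to the root mean square of the per-frame position differences. This sketch assumes already-associated and aligned (N, 3) position arrays.

```python
import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    """Translational ATE RMSE between aligned ground-truth and estimated trajectories."""
    err = np.linalg.norm(gt_xyz - est_xyz, axis=1)   # per-frame absolute trajectory error e_i
    return np.sqrt(np.mean(err ** 2))                # RMSE over all N frames, Eq. (6)
```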

4.2. Evaluation Using TUM RGB-D Dataset

Our algorithm was compared with ORB-SLAM2 and DS-SLAM on the TUM RGB-D dataset. Table 3 and Table 4 show the absolute trajectory error and the relative pose error of the three algorithms in the different scenarios. Compared with ORB-SLAM2, our algorithm improves significantly; compared with DS-SLAM, which also handles dynamic objects, our algorithm still shows some improvement.
To visualize the estimated trajectory data, the trajectory data of freiburg3_walking_halfsphere scenario and freiburg3_walking_xyz scenario are visualized as shown in Figure 11.
Real-time performance is an important factor in evaluating vSLAM systems. We therefore measured the average frame rate (FPS) of the different systems. Table 5 shows the average FPS of ORB-SLAM2 and of our system on each sequence of the TUM dataset. Ours achieves an average of 30.38 FPS, while ORB-SLAM2 achieves 25.24 FPS.
In addition, the BiSeNet V2 network performs semantic segmentation much faster than the segmentation network used in DS-SLAM, so our system also outperforms other vSLAM systems that use GPU acceleration, for example DS-SLAM (13.08 FPS on fr3-w-h).

4.3. Evaluation Using Real Environment

In addition to the TUM RGB-D dataset, we compared our algorithm with the open-source ORB-SLAM2 in a real environment, in which constantly moving pedestrians were added to interfere with SLAM. The point cloud maps of the dynamic environment constructed by ORB-SLAM2 and by our algorithm are shown in Figure 12, Figure 13 and Figure 14. Affected by the dynamic objects, the camera pose estimation of ORB-SLAM2 had large errors, as shown in Figure 13a; when the map was constructed, point cloud information was mapped to wrong locations, and the dynamic objects were also retained in the map, as shown in Figure 13. Compared with ORB-SLAM2, our algorithm greatly alleviates these drawbacks: in scenes with dynamic objects it estimates the pose and constructs the map well, and it also obtains semantic maps while removing the dynamic objects, as shown in Figure 14.
In terms of pose accuracy, our method outperforms ORB-SLAM2 in all scenes, because our system removes the moving feature points from the images and thus improves the matching accuracy between images. In addition, by jointly using the dynamic targets and the region growing algorithm, we obtain more accurate locations of dynamic points in the depth image, ensuring that dynamic object features are largely kept out of the mapping process. The real-scene tests show that the point cloud maps created by our system are more effective than those of ORB-SLAM2. Compared with DS-SLAM in terms of pose accuracy, we use a homography combined with an optical flow method to determine dynamic feature points, whereas DS-SLAM uses the fundamental matrix combined with epipolar constraints. The homography and the fundamental matrix produce different errors in different motion scenes, but the homography combined with optical flow can recover dynamic regions in the image, whereas the fundamental matrix with epipolar constraints can only filter feature points.
Thanks to the elimination of dynamic feature points, our system achieves faster tracking with local optimization than ORB-SLAM2. The semantic segmentation of the SegNet network in DS-SLAM is much slower than that of the BiSeNet V2 network, so its tracking module has to wait for the segmentation module, which slows down the whole system. Moreover, as can be seen from Figure 1, our semantic segmentation network is not directly involved in the tracking module, so the tracking thread never waits for the segmentation results.

5. Conclusions

To address indoor dynamic environments, we designed a semantic information-based vSLAM scheme that improves on ORB-SLAM2 in several aspects, including dynamic region detection, 3D semantic annotation, and geometric segmentation of depth images. Experimental results on the TUM RGB-D dataset show that our algorithm outperforms ORB-SLAM2 and DS-SLAM in pose estimation in dynamic scenes, demonstrating that adding a dynamic region detection module improves the camera's ability to localize and build maps in dynamic environments. In addition, experiments in real dynamic scenes show that our algorithm can build point cloud maps of unknown environments that contain semantic information rich enough to support downstream work. Our scheme improves the ORB-SLAM2 algorithm with the goal of removing dynamic information from the environment. However, owing to the limitations of handcrafted feature descriptors such as ORB, spurious feature matches still occur when the camera moves violently and neighbouring frames share little co-visibility, which is disastrous for pose estimation and mapping. Therefore, filtering the feature matches or using a data-driven approach to build a more robust feature descriptor is worthy of further research.

Author Contributions

Conceptualization, H.L.; Methodology, S.W. (Shangxing Wang), H.L. and G.L.; Validation, H.L. and G.L.; Investigation, S.W. (Shuangfeng Wei), S.W. (Shangxing Wang) and H.L.; Data curation, T.Y. and C.L.; Writing—original draft preparation, G.L.; Writing—review and editing, S.W. (Shangxing Wang); Visualization, S.W. (Shangxing Wang); Project administration, S.W. (Shuangfeng Wei); Funding acquisition, S.W. (Shuangfeng Wei). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Project of China (2021YFB2600101), the National Natural Science Foundation of China (41601409), the Nanchang Major Science and Technology Research Project (Hongkezi [2022] No. 104), the Project Funding of Science and Technology Plan of Beijing Municipal Education Commission (KM202010016010), the Jiangxi Geological Bureau Young Science and Technology Leaders Training Program Project (2022JXDZKJRC06) and the Postgraduate Innovation Project of Beijing University of Civil Engineering and Architecture (PG2021082).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in the current study can be accessed from the corresponding authors depending on the purpose of use.

Acknowledgments

We are grateful to Jiangxi Nuclear industry Surveying and Mapping Institute Group Co., Ltd. for the funding support, especially Liping Tu and Junlin Fan for helping us to create a good experimental environment and giving us hardware support such as GPU and industrial camera.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jia, G.; Li, X.; Zhang, D.; Xu, W.; Lv, H.; Shi, Y.; Cai, M. Visual-SLAM Classical framework and key Techniques: A review. Sensors 2022, 22, 4582. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, Z.; Wei, S.; Pang, F. Simultaneous localization and mapping scheme based on monocular and IMU. Sci. Surv. Mapp. 2020, 45, 10. [Google Scholar]
  3. Wei, S.; Shi, X.; Liu, Z.; Xiao, B. Point-and-line joint optimization visual inertial odometer. Sci. Surv. Mapp. 2021, 46, 9. [Google Scholar]
  4. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An Overview on Visual SLAM: From Tradition to Semantic. Remote Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
  5. Wei, S.; Liu, Z.; Zhao, J.; Pang, F. A review of indoor 3D reconstruction with SLAM. Sci. Surv. Mapp. 2018, 43, 12. [Google Scholar]
  6. Sun, Y.; Ming, L.; Meng, Q.H. Improving RGB-D SLAM in dynamic environments: A motion removal approach. Robot. Auton. Syst. 2016, 89, 110–122. [Google Scholar] [CrossRef]
  7. Zhong, F.; Wang, S.; Zhang, Z.; Wang, Y. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar]
  8. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef] [Green Version]
  9. Koestler, L.; Yang, N.; Zeller, N.; Cremers, D. Tandem: Tracking and Dense Mapping in Real-Time Using Deep Multi-View stereo. In Proceedings of the Conference on Robot Learning. PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 34–45. [Google Scholar]
  10. Wimbauer, F.; Yang, N.; Von Stumberg, L.; Zeller, N.; Cremers, D. MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6112–6122. [Google Scholar]
  11. Labbé, M.; Michaud, F. RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. J. Field Robot. 2019, 36, 416–446. [Google Scholar] [CrossRef]
  12. Teed, Z.; Deng, J. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
  13. Cai, L.; Ye, Y.; Gao, X.; Li, Z.; Zhang, C. An improved visual SLAM based on affine transformation for ORB feature extraction. Optik 2021, 227, 165421. [Google Scholar] [CrossRef]
  14. Li, Y.; Yunus, R.; Brasch, N.; Navab, N.; Tombari, F. RGB-D SLAM with Structural Regularities. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 11581–11587. [Google Scholar]
  15. Nebiker, S.; Meyer, J.; Blaser, S.; Ammann, M.; Rhyner, S. Outdoor mobile mapping and AI-based 3D object detection with low-cost RGB-D cameras: The use case of on-street parking statistics. Remote Sens. 2021, 13, 3099. [Google Scholar] [CrossRef]
  16. Chen, M.; Tang, Y.; Zou, X.; Huang, Z.; Zhou, H.; Chen, S. 3D global mapping of large-scale unstructured orchard integrating eye-in-hand stereo vision and SLAM. Comput. Electron. Agric. 2021, 187, 106237. [Google Scholar] [CrossRef]
  17. Zhang, D.; Zhu, J.; Wang, F.; Hu, X.; Ye, X. GMS-RANSAC: A Fast Algorithm for Removing Mis-matches Based on ORB-SLAM2. Symmetry 2022, 14, 849. [Google Scholar] [CrossRef]
  18. Schofield, S.; Bainbridge-Smith, A.; Green, R. Evaluating Visual Inertial Odometry Using the Windy Forest Dataset. In Proceedings of the 2021 36th International Conference on Image and Vision Computing New Zealand (IVCNZ), Tauranga, New Zealand, 9–10 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  19. Wei, H.; Zhang, T.; Zhang, L. GMSK-SLAM: A new RGB-D SLAM method with dynamic areas detection towards dynamic environments. Multimed. Tools Appl. 2021, 80, 31729–31751. [Google Scholar] [CrossRef]
  20. Yang, X.; Yuan, Z.; Zhu, D.; Chi, C.; Li, K.; Liao, C. Robust and efficient RGB-D SLAM in dynamic environments. IEEE Trans. Multimed. 2020, 23, 4208–4219. [Google Scholar] [CrossRef]
  21. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083. [Google Scholar] [CrossRef] [Green Version]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Yu, C.; Liu, Z.; Liu, X.J.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1168–1174. [Google Scholar]
  24. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  25. Xi, Z.; Han, S.; Wang, H. Indoor dynamic scene synchronous positioning and semantic mapping based on semantic segmentation. Comput. Appl. 2019, 39, 5. [Google Scholar]
  26. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  27. Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16. [Google Scholar] [CrossRef]
  28. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  29. Sun, T.; Sun, Y.; Liu, M.; Yeung, D.Y. Movable-object-aware visual slam via weakly supervised semantic segmentation. arXiv 2019, arXiv:1906.03629. [Google Scholar]
  30. Zhang, J.; Henein, M.; Mahony, R.; Ila, V. VDO-SLAM: A visual dynamic object-aware SLAM system. arXiv 2020, arXiv:2005.11052. [Google Scholar]
  31. Soares JC, V.; Gattass, M.; Meggiolaro, M.A. Crowd-SLAM: Visual SLAM towards crowded environments using object detection. J. Intell. Robot. Syst. 2021, 102, 50. [Google Scholar] [CrossRef]
  32. Farhadi, A.; Redmon, J. Yolov3: An Incremental Improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  33. Verykokou, S.; Ioannidis, C. An Overview on Image-Based and Scanner-Based 3D Modeling Technologies. Sensors 2023, 23, 596. [Google Scholar] [CrossRef]
  34. Cesar, J.C.O. Chromatic harmony in architecture and the Munsell color system. Color Res. Appl. 2018, 43, 865–871. [Google Scholar] [CrossRef]
  35. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  36. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft Coco: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  37. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 573–580. [Google Scholar]
Figure 1. System Framework.
Figure 2. Schematic of dynamic area detection.
Figure 3. Dynamic feature point elimination.
Figure 4. Schematic diagram of image pyramid.
Figure 5. Dynamic region detection. (a) Original image; (b) Dense optical flow field; (c) Optical flow field color-coded with the Munsell color system; (d) Binarized optical flow field.
Figure 6. The effect of semantic segmentation with the BiSeNet V2 network. In (a), the first row is publicly available data from the TUM RGB-D dataset [25] and the second row is data collected from an actual dynamic scene. (b) shows the semantic segmentation results of BiSeNet V2 on these images; light blue indicates pedestrians, dark blue monitors, red books, and pink chairs.
Figure 7. Schematic diagram of 3D semantic annotation.
Figure 8. Comparison of the effect of three-dimensional semantic annotation before and after optimization. (a) Keyboard; (b) Mouse; (c) Book; (d) Monitor.
Figure 9. Schematic diagram of dynamic target detection.
Figure 10. Schematic diagram of restoring 3D coordinates from depth information. (a) Constructing 3D point cloud from original depth image; (b) Combining static depth image and geometric segmentation to construct 3D point cloud.
Figure 11. Trajectory data visualization in different scenarios. The first row shows the trajectory visualization for the freiburg3_walking_halfsphere scene, and the second row for the freiburg3_walking_xyz scene. The estimated trajectory of our algorithm deviates little from the real trajectory: the maximum absolute trajectory error is only 0.0867 m when the sensor moves in the halfsphere pattern and 0.056 m when it moves in the xyz pattern. These results indicate that the pose estimation accuracy of our algorithm performs well in highly dynamic scenarios.
Figure 12. Point cloud map of dynamic scene.
Figure 13. Details of map construction by ORB-SLAM2.
Figure 14. Details of map construction by our algorithm.
Table 1. Region Growing Algorithm with Depth Information.
Main Steps | Operation Content
Step 1 | Select a semantic mask from the semantic segmentation results produced by the BiSeNet V2 network. After the initial seed point (u_0, v_0) is determined by Equation (2), the corresponding 3D coordinates (x_0, y_0, z_0) of the seed point are computed from the camera intrinsics. Add (x_0, y_0, z_0) to the semantic point cloud set P.
Step 2 | Taking the seed point (u_0, v_0) as the centre, search within the semantic mask for the spatial coordinates (x_i, y_i, z_i) corresponding to the eight pixels (u_i, v_i) around the seed point. Calculate the Euclidean distance between (x_i, y_i, z_i) and the spatial coordinates of the seed point (x_0, y_0, z_0). If the distance is less than the threshold T, push (u_i, v_i) onto the stack and add (x_i, y_i, z_i) to P.
Step 3 | Take a pixel from the stack as a new seed point (u_0, v_0) and return to Step 2. When the stack is empty, the region growing of the current semantic mask ends, and the point cloud set P is returned as the optimized result.
Step 4 | Return to Step 1 to optimize the next semantic mask; the algorithm ends after all masks are processed.
Table 2. Specific parameters of the RealSense D435 depth camera.
Sensor Metrics | Configuration Parameters
Dimensions | 90 mm × 25 mm × 25 mm
Depth detection method | Active IR Stereo
Depth Resolution | Up to 1280 × 720
Depth Frame Rate | Up to 90 FPS
Depth Field of View | 91.2° × 65.5° × 100.6°
Depth Detection Range | 0.3~10 m
RGB Resolution | 1920 × 1080
RGB Frame Rate | 30 FPS
RGB Field of View | 69.4° × 42.5° × 77°
Image Sensor Type | Global shutter
Data Transfer Interface | USB 3.0
Table 3. Absolute trajectory error.
Scenario | ORB-SLAM2 (m) | DS-SLAM (m) | Ours (m) | Improvement
f3_s_s | 0.007476 | 0.006015 | 0.006312 | −4.94%
f3_w_h | 0.748307 | 0.029287 | 0.028974 | 1.07%
f3_w_r | 0.817401 | 0.050603 | 0.043179 | 14.67%
f3_w_s | 0.407255 | 0.007145 | 0.007098 | 0.66%
f3_w_x | 0.823788 | 0.022801 | 0.017345 | 23.93%
Table 4. Relative pose error.
 | Relative Translation Error (m) | Relative Rotation Error (deg)
Scenario | ORB-SLAM2 | DS-SLAM | Ours | Improvement | ORB-SLAM2 | DS-SLAM | Ours | Improvement
f3_s_s | 0.009126 | 0.007644 | 0.007984 | −4.45% | 0.007476 | 0.288099 | 0.276316 | −4.18%
f3_w_h | 0.498807 | 0.031374 | 0.028132 | 10.33% | 0.748307 | 10.440685 | 0.770315 | 5.66%
f3_w_r | 0.584204 | 0.068926 | 0.050974 | 26.05% | 0.817401 | 11.262848 | 1.071637 | 22.85%
f3_w_s | 0.2267 | 0.008876 | 0.009831 | −10.76% | 0.407255 | 4.104781 | 0.255871 | 1.20%
f3_w_x | 0.404481 | 0.052734 | 0.020853 | 60.46% | 0.823788 | 7.694351 | 0.638247 | 38.50%
Table 5. Mean tracking time [FPS] of ORB-SLAM2 and Ours on the TUM dataset.
Sequence | ORB-SLAM2 | Ours
f3_s_s | 31.084 | 33.183
f3_w_h | 23.159 | 31.861
f3_w_r | 25.964 | 31.843
f3_w_s | 22.362 | 28.167
f3_w_x | 23.632 | 26.886
Average | 25.240 | 30.388
