Article

FastFusion: Real-Time Indoor Scene Reconstruction with Fast Sensor Motion

1 School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China
2 Lishui Institute of Hangzhou Dianzi University, Lishui 323010, China
3 Hangzhou Innovation Institute of Beihang University, Hangzhou 310052, China
4 School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
5 Linx Robot Company, Hangzhou 311100, China
6 BNRist and School of Software, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(15), 3551; https://doi.org/10.3390/rs14153551
Submission received: 1 June 2022 / Revised: 16 July 2022 / Accepted: 18 July 2022 / Published: 24 July 2022

Abstract: Real-time 3D scene reconstruction has attracted considerable attention in the fields of augmented reality, virtual reality and robotics. Previous works usually assume slow sensor motion to avoid large interframe differences and strong image blur, but this limits the applicability of the techniques in real cases. In this study, we propose an end-to-end 3D reconstruction system that combines color, depth and inertial measurements to achieve robust reconstruction under fast sensor motion. We employ an extended Kalman filter (EKF) to fuse RGB-D-IMU data and jointly optimize feature correspondences, camera poses and scene geometry with an iterative method. A novel geometry-aware patch deformation technique is proposed to adapt to the changes of patch features in the image domain, leading to highly accurate feature tracking under fast sensor motion. In addition, we maintain the global consistency of the reconstructed model by achieving loop closure with submap-based depth image encoding and 3D map deformation. The experiments reveal that our patch deformation method improves the accuracy of feature tracking, that our improved loop detection method is more efficient than the original method and that our system achieves superior 3D reconstruction results compared with state-of-the-art solutions in handling fast camera motions.

1. Introduction

With the development of depth sensing and parallel computing, more and more real-time 3D reconstruction techniques have been proposed in recent years. Real-time 3D reconstruction has many practical applications. It can generate high-fidelity 3D scene models for real-time interaction in augmented reality (AR) and for offline, highly realistic modeling and rendering in virtual reality (VR). For instance, a user could reconstruct his/her home and preview the layout before purchasing furniture [1]. It can also help a robot navigate and explore its environment.
Many previous works [2,3,4,5,6] have demonstrated promising performance for real-time indoor scene reconstruction. However, they rely on assumptions such as slow camera motion, a static scene without dynamic objects, abundant texture information or invariant illumination. In many real cases, these assumptions cannot be satisfied [7,8]. In fact, it is important and valuable to robustly reconstruct a 3D scene without such idealized assumptions.
In this study, we mainly focus on relaxing the assumption of slow camera motion. In traditional vision-based 3D reconstruction systems, camera poses are estimated by finding feature correspondences between consecutive frames, and the 3D maps of each frame are then fused using these poses. For slow sensor motion, it is relatively easy to find correspondences between consecutive frames (either depth frames or color frames), since corresponding features in two consecutive frames are similar in appearance and position, leading to a small search space. As a consequence, camera poses can be accurately estimated in real time. However, during fast camera motion, color information is severely degraded by motion blur, leading to the failure of traditional feature-based camera tracking and loop closure. The depth image is less sensitive to fast camera motion, but depth alone is not enough to estimate large interframe camera motion either, as the iterative closest point (ICP) algorithm used for depth-based camera pose estimation requires small interframe motion. With fast camera motion, the ICP algorithm is liable to become trapped in a local optimum.
Performing image motion deblurring to acquire sharp images is a theoretically feasible way to estimate fast camera motion, as well as to detect loops. However, deblurring an image and estimating camera motion form a chicken-and-egg problem. To deblur an image, the camera motion is required to calculate the blur kernels [9]. On the other hand, estimating accurate camera motion requires sharp images to build correct feature correspondences.
We solve this problem by incorporating an inertial measurement unit (IMU), a device that measures its linear acceleration and angular velocity, and by proposing a geometry-aware feature tracking approach that accurately tracks patch features under fast sensor motion. With an extended Kalman filter (EKF) framework effectively integrating the multi-modal information of IMU, color and depth, robust camera pose estimation and geometry fusion of an indoor scene are achieved in real time.
Compared to our preliminary version [10], which does not have a loop closure module, this paper further achieves globally consistent 3D reconstructions for both slow and fast sensor motions using a novel loop closure method. To achieve real-time loop closure, we applied a submap-based mechanism to close a loop between two submaps rather than all voxels. Additionally, we propose a depth-only loop detection method based on the keyframe encoding [11]. Our loop detection method is not affected by blurry color images and is able to detect loops under fast sensor motions. By performing the detection process on the keyframe lists of candidate submaps, our method achieves a more efficient and accurate loop detection than the method in [11]. To demonstrate the advantage of our system with fast camera motions, we also detail the definition of the speed of a sequence and list the speed information of public datasets. Additional experiments were performed in this version to compare our system with state-of-the-art methods, as well as the system in our preliminary version, for both slow and fast datasets.
In summary, the main contributions of our work are as follows:
(1) We present an RGB-D-inertial 3D reconstruction framework that tightly combines the three kinds of information to iteratively refine camera pose estimation using an extended Kalman filter.
(2) We present a geometry-aware feature-tracking method for handling fast camera motions that utilizes patch features to adapt blurred images and considers the deformation of patches in building feature matching for images with very different perspectives.
(3) An improved keyframe encoding method is proposed for loop detection, which uses submaps and the prior camera pose to increase the accuracy, as well as the computational efficiency, and is also able to detect loops in fast camera motions due to the use of the depth image only.
The experimental results demonstrate that our method is on par with state-of-the-art methods in slow camera motions, while it possesses a superior performance in fast camera motions. The ablation study also shows that our improved keyframe encoding method achieves a better accuracy and computational efficiency for loop detection. An example of our reconstruction result is shown in Figure 1.

2. Related Work

The technique in this paper mainly focuses on indoor scene reconstruction under fast camera motion. In this section, we discuss traditional 3D reconstruction techniques, as well as techniques related to the estimation of fast camera motion.

2.1. 3D Reconstruction

There have been many impressive works on 3D reconstruction in recent years, and we analyze them according to the sensors they use. Methods such as KinectFusion [12] and InfiniTAM [3] only use depth information to reconstruct the 3D model, and estimate the camera pose from distance data by variants of the iterative closest point (ICP) algorithm [13,14]. However, depth-only camera tracking is brittle in geometry-poor scenes, near bright windows and under depth sensor noise. Monocular RGB camera tracking has made breakthrough progress, including direct methods [15,16,17] and feature point methods [18,19,20,21]; however, these methods cannot reconstruct detailed 3D models, whereas the reconstruction of detailed and dense 3D models is the purpose of this paper. Furthermore, Refs. [2,3,4,5,22] use both color and depth information to estimate camera motion and generate dense 3D models based on implicit truncated signed distance functions (TSDFs) or surfel representations. Different from them, Ref. [23] reconstructs large-scale scenes by merging images and laser scans. Although these approaches are state-of-the-art in terms of 3D indoor scene reconstruction, they only work in situations that strictly follow their assumptions, such as static scenes without dynamic objects, sufficient texture and geometric information, slow camera motion and invariant illumination.
These assumptions are invalid in many applications, so we have to improve the system's robustness rather than establish more assumptions. Some researchers have noticed these problems and have published impressive works. For example, MixedFusion [24] focuses on dynamic scene reconstruction; it decouples dynamic objects from the static scene and estimates both object motions and camera motions without any motion prior, such as a scene template [25] or motion model [26]. DRE-SLAM [27] builds OctoMaps for the static scene and the dynamic scene, respectively, using information from an RGB-D camera and two wheel encoders. Refs. [28,29] allow multiple users to synergistically reconstruct a 3D scene rather than requiring only one input sequence for the reconstruction. This method does not require GPUs. Unlike traditional methods that rely on a simplified parallel lighting model, Refs. [30,31] estimate a lighting function from images captured under different illuminations. Ref. [32] reconstructs scenes with mirror and glass surfaces by detecting the mirrors and glasses from RGB-D inputs beforehand.

2.2. Estimation of Fast Camera Motion

The estimation of fast camera motions is a tough challenge. The fast camera motion not only blurs the color images but also induces the failure of the depth-based estimation method. For example, the fast interframe camera motion makes the real-time ICP algorithm easily fall into a local optimum.
Pushing up the frame rate of the color camera could reduce the effect of image motion blur. However, we cannot arbitrarily increase the camera frame rate due to the trade-off among the signal-to-noise ratio, motion blur and computational cost [33,34].
Lee et al. [35] and Zhang et al. [9] minimize the undesired effect of motion blur by deblurring images before feature extraction. However, these methods cannot be adopted in a real-time 3D reconstruction system, considering their computational cost.
Adopting image patch matching could be a solution to adapt to the blur of color images, because consecutive image frames tend to exhibit similar blur and thus very similar pixel values. Traditional patch matching methods, such as [36], match patches by measuring the photometric error. Recently, Refs. [37,38,39] proposed modern neural networks to match patches using descriptors extracted for each patch. However, the above methods cannot match patches across varied viewpoints. Ref. [40] achieves the matching of patches across varied viewpoints by proposing CNNs that learn a consistent descriptor for a patch, even under very different viewpoints.
Integrating IMU information is an effective way to estimate camera poses under fast motion. Refs. [3,41,42] use the rotation calculated by the IMU as the initialization of the camera pose estimate. Laidlow et al. [43] present a complete RGB-D-inertial SLAM system based on ElasticFusion. Ref. [44] tracks camera motion via IMU pre-integration. Ref. [45] is devoted to the use of low-cost IMU sensors by treating the acceleration bias as a variable during the initialization process. However, none of them address the influence of fast camera motion on real-time 3D reconstruction.

3. Overview

The pipeline of our system is illustrated in Figure 2. In our system, we tightly combined color, depth and inertial information for real-time indoor scene 3D reconstruction, and the system operated in an extended Kalman filter framework. We also propose a depth image encoding method and submap-based pose graph optimization for the loop closures under fast camera motions.
Kalman prediction (Section 4.2.1) was performed to predict the current camera motion by the camera state of the previous frame and the IMU measurements of the current frame. In the Kalman update step, we first performed feature tracking between two consecutive frames. During fast camera motions, the shape and appearance of an image patch may vary in two consecutive frames. We named this phenomenon the shrink and extend effect (SE effect), and we evaluated this SE effect for all patches by their intensity and geometric information (Section 4.1). The patches affected by the SE effect are deformed by our geometry-aware patch deformation for improved feature tracking. Then, we iteratively updated the camera pose and deformed the patches.
In the reconstruction process, we separated the reconstructed model into several small models called submaps (Section 4.3). When a new submap is created, the subsequent frames are fused into the submap and the global pose of the submap is initialized by its first frame. Patch features are updated after the fusion of the current frame; their geometries are refined by the so-far reconstructed 3D model.
In the loop closure module, the current depth image is encoded by a randomized fern encoding method, and compared with lists that store the codes of historical depth images. A loop is detected once an image that belongs to a historical submap successfully matches the current frame; then, a link between the two submaps is generated with the loop constraint. After that, we performed the submap-based pose graph optimization process to close the loop. Finally, the global 3D model of an indoor scene was rendered by fusing all submaps.

4. Method

We first introduce our novel geometry-aware feature-tracking method, which is the key to accurately estimate camera poses under fast sensor motions. We did not directly perform the feature tracking, but ran it in an extended Kalman filter framework, which integrates the IMU information to better achieve the sensor motion estimation. Then, we used the estimated camera pose to fuse the current depth into the current submap and update the features for the tracking of the next frame. Finally, we introduced our loop closure module, which detects loops during fast sensor motions and closes loops by deforming the submap-based geometry model.

4.1. Geometry-Aware Feature Tracking

During fast sensor motions, color images are severely blurred. Feature point methods such as ORB-SLAM [18] cannot detect enough feature points in the blurry images. Patch-based direct methods such as [36,46] outperform feature point methods in terms of the robustness of feature tracking on blurry images because the image intensities change similarly in two consecutive blurry images. However, patches may contain multiple objects with different depths, and, thus, the shape and appearance of a patch may change in interframes as fast camera motions generate larger perspective changes, leading to inaccurate feature tracking. As a consequence, a feature tracking method that considers the changes will improve the tracking performance.
We handled this by combining depth information with color to back-project 2D patches in one frame into 3D space, and then project them to the pixel coordinate by the camera pose of the next frame. The projection helps to deform the original patches to model the shape and appearance changes, and the patch tracking can be easily and accurately achieved by the deformed patches. To perform this in real time, we investigated the possible deformations, classified them into different cases and handled them separately.

4.1.1. SE Effect for Patch Deformation

During fast camera motions, an image patch feature is observed in frames with very different viewpoints, and, thus, the shape and appearance of the patch feature in the image domain vary across frames. To model the deformations of patch features across frames, different from Bloesch et al. [47], who only consider the 2D planar information of patches, we involve the 3D geometry of a patch to determine the 2D shape deformations of patches between frames.
Because of the different 3D shapes of patches and the irregularity of camera motions, patches may produce different deformations in consecutive frames. Figure 3 shows the three representative cases of patch deformations:
  • Case 1. When the depths of a patch are consistent, the shape and appearance of the patch will not be influenced by the camera motions in interframes.
  • Case 2. In a static scene with an uneven surface, the shape of patches in interframes will not change in slow camera motions because the occlusion does not change over two consecutive frames with almost the same camera views.
  • Case 3. Different from case 1 and case 2, if a large depth variance within a patch coincides with aggressive camera motion, the intensity distribution and the shape of the patch will change. As shown in Figure 3, if the camera moves from $V_0$ to $V_2$, the yellow region is occluded and the patch shrinks in the frame of $V_2$. In addition, the black region occluded in $V_0$ becomes visible once the camera moves to $V_1$, and thus the patch shape extends in the frame of $V_1$. We name these phenomena the shrink effect and extend effect (SE effect).
From the above cases, the patch deformation between frames is mainly caused by changes in occlusion. Camera rotation (yaw, pitch, roll) alone will not generate patch deformation, as the occlusion does not change under rotation-only camera motion.
Matching patches that are severely affected by the SE effect will generate a large margin of error. Thus, we designed a geometry-aware patch deformation method to handle the SE effect. The process of the deformation method is shown in Figure 4 and is formularized as follows.
Each pixel $i$ in a patch of frame $k-1$ is defined as $P_i^{k-1} := (p_i^{k-1}, I_i^{k-1}, d_i^{k-1}, n_i^{k-1})$, where $p_i$ and $I_i$ denote the pixel coordinate and the intensity of pixel $i$, and $d_i^{k-1}$ and $n_i^{k-1}$ denote the depth and the 3D normal of pixel $i$ in frame $k-1$. Note that the depth and the normal are from the so-far reconstructed model rather than the raw depth image of frame $k-1$.
We first back-projected each pixel into the camera coordinate to obtain the local vertex position:
$$v_i^{k-1} = \pi^{-1}(P_i^{k-1}),$$
where $\pi(\cdot)$ projects a point from 3D space to the 2D pixel coordinate, and $\pi^{-1}(\cdot)$ denotes the inverse operation. Then, we projected $v_i^{k-1}$ to the pixel coordinate of frame $k$:
$$\hat{p}_i^k = \pi(\hat{v}_i^k) = \pi\!\left(T_k^{-1} T_{k-1} v_i^{k-1}\right).$$
Here, $T_k$ is the camera pose of the $k$-th frame, and $T_k v = R_k v + t_k$ (with $3\times 3$ rotation matrix $R_k$ and $3\times 1$ translation vector $t_k$) transforms a 3D point from the local camera coordinate of the $k$-th frame to the global coordinate. Note that the global coordinate is defined to be the camera coordinate of the first frame. The camera pose $T_k$ is to be solved and affects the 2D position of the projection.
We used $\hat{P}_i^k$ to represent the pixel that is projected from frame $k-1$ to frame $k$, as opposed to originating from frame $k$. The projected pixel coordinate $\hat{p}_i^k$ belongs to the pixel $\hat{P}_i^k$, and its depth value was set as the z-coordinate of $\hat{v}_i^k$. The intensity $\hat{I}_i^k$ and the normal $\hat{n}_i^k$ were directly set as the values of $P_i^{k-1}$.
As shown in Figure 4, the SE effect can be evaluated after the projection. In the pixel coordinate, when the distance between two projected pixels $\hat{P}_i^k$ and $\hat{P}_j^k$ is smaller than half a pixel, the pixel with the greater depth is covered by the other one, and the shrink effect occurs in the patch of frame $k$. Therefore, the pixel with the greater depth value should be removed from the projected patch. The extend effect was evaluated by comparing the height and width of the projected patch with those of the original one. Note that the projected patch was also kept as a rectangle for easy operation, and that empty pixels were marked and assigned as black.
If neither of these effects appear after projection, we considered the patch to not be affected by the SE effect. Note that the SE effect evaluation and the patch deformation are initialized by the camera pose predicted from the Kalman prediction, and are iteratively updated in the Kalman update. We specify these steps in Section 4.2.
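To make the projection and occlusion test above concrete, the following minimal sketch (in Python/NumPy, which is not the authors' implementation language) shows one way a single patch could be projected into the next frame and checked for the shrink effect; the function names, the pinhole-intrinsics handling and the half-pixel threshold parameter are assumptions of this sketch.

```python
import numpy as np

def project_patch(patch_px, patch_depth, K, T_prev, T_cur):
    """Back-project patch pixels of frame k-1 and project them into frame k.

    patch_px:     (N, 2) pixel coordinates in frame k-1
    patch_depth:  (N,)   model depths of those pixels
    K:            (3, 3) camera intrinsics
    T_prev, T_cur:(4, 4) camera-to-world poses of frames k-1 and k
    Returns the projected pixel coordinates and their depths in frame k.
    """
    ones = np.ones((patch_px.shape[0], 1))
    # pi^-1: pixel + depth -> 3D point in the camera frame of k-1
    rays = (np.linalg.inv(K) @ np.hstack([patch_px, ones]).T).T
    v_prev = rays * patch_depth[:, None]
    # move to frame k: T_k^-1 * T_{k-1} * v
    v_prev_h = np.hstack([v_prev, ones])
    v_cur = (np.linalg.inv(T_cur) @ T_prev @ v_prev_h.T).T[:, :3]
    # pi: 3D point -> pixel in frame k
    proj = (K @ v_cur.T).T
    px_cur = proj[:, :2] / proj[:, 2:3]
    return px_cur, v_cur[:, 2]

def shrink_mask(px_cur, depth_cur, half_pixel=0.5):
    """Mark pixels occluded after projection (the shrink effect): if two
    projected pixels land closer than half a pixel, keep only the nearer one."""
    keep = np.ones(len(px_cur), dtype=bool)
    order = np.argsort(depth_cur)          # nearer pixels processed first
    for a_idx, a in enumerate(order):
        for b in order[a_idx + 1:]:
            if keep[b] and np.linalg.norm(px_cur[a] - px_cur[b]) < half_pixel:
                keep[b] = False            # farther pixel is covered
    return keep
```

The extend effect would then be evaluated by comparing the bounding box of the projected pixel coordinates with the original patch size, as described above.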

4.1.2. Objective

In feature tracking, patches affected by the SE effect are replaced with the corresponding deformed patches. Inspired by Park et al. [48], who presented an algorithm for aligning two colored geometry maps, we track the projected patch features using both their intensity and depth information.
The photometric error of each deformed patch was computed as follows. Firstly, the pixel $P_i^k$ of a patch in the current frame $k$ was extracted at the pixel coordinate $\hat{p}_i^k$ of the corresponding pixel $\hat{P}_i^k$ of the deformed patch. Then, we calculated the intensity difference between the extracted patch and the projected patch. The photometric error can be formulated as:
$$E_p(T_k) = \sum_{i}^{Y} \left\| I[P_i^k] - \hat{I}[\hat{P}_i^k] \right\|_2^2,$$
where $Y$ is the number of valid pixels in a patch. The geometric error was calculated by the point-to-plane ICP algorithm:
$$E_g(T_k) = \sum_{i}^{Y} \left\| \left(T_{k-1}^{-1} \cdot T_k \cdot (v_i^k - \hat{v}_i^k)\right) \cdot \hat{n}[\hat{P}_i^k] \right\|_2^2,$$
where $v_i^k$ and $\hat{v}_i^k$ are the vertices of the corresponding pixels in the camera coordinate of the $k$-th frame. Given $E_p$ and $E_g$, the cost function for patch tracking was formulated as:
$$E(T_k) = \lambda \sum_{j}^{M} E_p^j(T_k) + (1-\lambda) \sum_{j}^{M} E_g^j(T_k),$$
with $\lambda = 0.1$, which is similar to related work [2]. $M$ denotes the number of patches, and $j$ is the index of a patch.
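As a rough illustration of how the combined objective of Equation (5) could be evaluated for one candidate pose, the snippet below sums the photometric and point-to-plane terms over the deformed patches. The per-patch data layout and the helper names are assumptions of this sketch, and the relative-transform factor of Equation (4) is assumed to be absorbed into the projected quantities produced by the deformation step.

```python
import numpy as np

LAMBDA = 0.1  # weight between photometric and geometric terms (Equation (5))

def patch_cost(I_cur, I_proj, v_cur, v_proj, n_proj):
    """Cost of one deformed patch.

    I_cur, I_proj: (Y,)   intensities sampled in frame k / projected from k-1
    v_cur, v_proj: (Y, 3) corresponding vertices, both in the camera frame of k
    n_proj:        (Y, 3) normals carried over with the projected patch
    """
    e_photo = np.sum((I_cur - I_proj) ** 2)                     # Equation (3)
    point_to_plane = np.sum((v_cur - v_proj) * n_proj, axis=1)  # Equation (4)
    e_geo = np.sum(point_to_plane ** 2)
    return LAMBDA * e_photo + (1.0 - LAMBDA) * e_geo

def total_cost(patches):
    """Sum over all M tracked patches (Equation (5)). `patches` is a list of
    (I_cur, I_proj, v_cur, v_proj, n_proj) tuples from the deformation step."""
    return sum(patch_cost(*p) for p in patches)
```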

4.2. Extended Kalman Filtering Framework

We did not directly solve for the energy minimization of Equation (5). Instead, we used it in our extended Kalman filter (EKF) framework. Our EKF framework aims to tightly combine the color, depth and inertial measurements. To be specific, we modeled the camera pose of each frame as the state of the EKF. The observations of the EKF are color and depth images. The relationship between the state and the observations was measured by the energy defined in Equation (5). If a state fits exactly to the observation, the energy is zero. On the other hand, the inertial information was used in the Kalman prediction step, which serves to build the motion prediction model.
We followed the traditional extended Kalman filter to define the variables. A nonlinear discrete-time system with state $x$, observation $z$, process noise $\omega \sim \mathcal{N}(0, Q)$ and update noise $\mu \sim \mathcal{N}(0, U)$ in the $k$-th frame can be written as:
$$x_k = f(x_{k-1}, \omega_k),$$
$$z_k = h(x_k, \mu_k).$$
In our framework, the state of the filter is composed of the elements $x := (R, t, v, b_a, b_g)$, where $v$ is the camera linear velocity and $b_a, b_g$ are the IMU biases (accelerometer and gyroscope); the process noise is $\omega := (n_a, n_w, n_{b_a}, n_{b_w})$, where $n_a, n_w$ are the IMU noises and $n_{b_a}, n_{b_w}$ are the white Gaussian noises of the IMU biases, which are modeled as a random walk process. $Q$ is the covariance matrix of the IMU noises and biases, derived by offline calibration using Kalibr [49,50]. In the following, we use the symbol $\tilde{\cdot}$ to represent values calculated in the Kalman prediction step, such as the predicted state $\tilde{x}$.

4.2.1. Kalman Prediction

Given the a posteriori estimation $x_{k-1}$ with covariance $P_{k-1}$, the prediction step of the EKF yields an a priori estimation at the next frame $k$:
$$\tilde{x}_k = f(x_{k-1}, 0),$$
$$\tilde{P}_k = F_k P_{k-1} F_k^\top + G_k Q G_k^\top,$$
with the Jacobians:
$$F_k = \left.\frac{\partial f}{\partial x}\right|_{x_{k-1}, 0}, \qquad G_k = \left.\frac{\partial f}{\partial \omega}\right|_{x_{k-1}, 0}.$$
The key to the Kalman prediction step is to define the function $f$. In our work, we employed the inertial measurements in the definition. Assuming the IMU is synchronized with the camera and acquires measurements with a time interval $\tau$, which is much smaller than that of the camera, then, similar to [47,51], the IMU pre-integration method [52] was performed to integrate the IMU measurements acquired between two consecutive frames:
$$\Delta R = \prod_{n=1}^{N} \mathrm{Exp}(w_n \cdot \tau),$$
$$\Delta v = \sum_{n=1}^{N} \Delta R_n \cdot a_n \cdot \tau,$$
$$\Delta t = \sum_{n=1}^{N} \left(\Delta v_n \tau + \tfrac{1}{2} \Delta R_n \cdot a_n \cdot \tau^2\right),$$
where $\mathrm{Exp}(\cdot)$ denotes the exponential map from the Lie algebra to the Lie group. $a, w$ are the bias-corrected linear acceleration and angular velocity of the IMU, already transformed to the camera coordinate via the extrinsic matrix calibrated offline [49,50]. $N$ denotes the number of inertial measurements acquired between two consecutive camera frames. $\Delta R_n, \Delta v_n$ represent the integration results from the first measurement to the $n$-th (rather than the $N$-th). Details about this pre-integration can be found in [52].
Finally, the state of the $k$-th frame predicted in the Kalman prediction step was formulated as:
$$\begin{aligned} \tilde{R}_k &= R_{k-1} \cdot \Delta R, \\ \tilde{v}_k &= v_{k-1} + g N\tau + R_{k-1} \Delta v, \\ \tilde{t}_k &= t_{k-1} + v_{k-1} N\tau + \tfrac{1}{2} g (N\tau)^2 + R_{k-1} \Delta t, \\ \tilde{b}_{a,k} &= b_{a,k-1} + n_{b_a}, \\ \tilde{b}_{w,k} &= b_{w,k-1} + n_{b_w}, \end{aligned}$$
where $g$ is the gravitational acceleration in the global coordinate. The predicted camera state is used in Section 4.1.1 for the initialization of patch deformation. The Jacobian calculations in Equation (10) are complicated but a tractable matter of differentiation, and are similar to those in [51]; for brevity, we do not present them here.
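A minimal sketch of the pre-integration in Equations (11)–(13) and the state prediction in Equation (14), assuming bias-corrected IMU samples already expressed in the camera frame; the Rodrigues-based exponential map, the sample ordering inside the loop and the default gravity vector are simplifications of this sketch rather than the authors' exact implementation.

```python
import numpy as np

def exp_so3(w):
    """Exponential map from an axis-angle vector to a rotation matrix (Rodrigues)."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(acc, gyr, tau):
    """Integrate N IMU samples between two camera frames (Equations (11)-(13))."""
    dR = np.eye(3)
    dv = np.zeros(3)
    dt = np.zeros(3)
    for a_n, w_n in zip(acc, gyr):            # bias-corrected measurements
        dt += dv * tau + 0.5 * (dR @ a_n) * tau ** 2
        dv += (dR @ a_n) * tau
        dR = dR @ exp_so3(w_n * tau)
    return dR, dv, dt

def predict_state(R, t, v, dR, dv, dt, N, tau, g=np.array([0.0, 0.0, -9.81])):
    """Kalman prediction of the camera state (Equation (14))."""
    R_pred = R @ dR
    v_pred = v + g * N * tau + R @ dv
    t_pred = t + v * N * tau + 0.5 * g * (N * tau) ** 2 + R @ dt
    return R_pred, t_pred, v_pred
```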

4.2.2. Kalman Update

In a traditional Kalman update step, the measurement residual is modeled as:
$$y_k = z_k - h(\tilde{x}_k, 0).$$
Here, $0$ means that we directly used $\tilde{x}_k$ to calculate the residual without adding any Gaussian noise. The updated state was formulated as:
$$x_k = \tilde{x}_k + K_k \cdot y_k,$$
where $K_k$ is the Kalman gain. In our method, we defined the residual as the photometric and geometric errors of the patches (Equation (5)); thus, the residual is:
$$y_k = 0 - E(\tilde{T}_k) = -E\!\left(\begin{bmatrix} \tilde{R}_k & \tilde{t}_k \\ 0 & 1 \end{bmatrix}\right).$$
The patch deformations used in calculating $y_k$ by Equation (17) are greatly affected by the camera pose. Therefore, after acquiring an updated camera pose by Equation (16), we used this pose to re-derive the patch deformations and then refined the camera pose again by Equation (16), iterating the two steps. Specifically, using $m$ to denote the iteration index, we have:
$$h(x_k^m, 0) = E\!\left(\begin{bmatrix} R_k^m & t_k^m \\ 0 & 1 \end{bmatrix}\right),$$
and the Kalman gain with respect to each iteration is:
$$K_k^m = \tilde{P}_k \cdot (H_k^m)^\top \cdot (S_k^m)^{-1},$$
$$S_k^m = H_k^m \cdot \tilde{P}_k \cdot (H_k^m)^\top + J_k^m \cdot U \cdot (J_k^m)^\top.$$
$U = \mathrm{diag}(\sigma_u^2, \sigma_v^2, \sigma_d^2)$ is the covariance matrix of the $3\times 1$ image noise $\mu_k$, which is assigned by the depth image model of [53]. In addition, the Jacobians updated in each iteration are
$$H_k^m = \left.\frac{\partial h}{\partial x}\right|_{x_k^m, 0}, \qquad J_k^m = \left.\frac{\partial h}{\partial \mu}\right|_{x_k^m, 0},$$
where the calculation of $H_k^m$ is equal to the Jacobian calculation with respect to $R_k^m, t_k^m$ in Equation (5). Then, the updated state of each iteration was calculated as follows:
$$x_k^{m+1} = x_k^m - K_k^m \cdot h(x_k^m, 0).$$
Notice that $x_k^0$ is initialized by $\tilde{x}_k$. The covariance matrix is only updated after convergence:
$$P_k = (I - K_k \cdot H_k) \cdot \tilde{P}_k.$$
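The iterative update of Equations (15)–(23) could be organized as below. The residual/Jacobian evaluation and the patch re-deformation are left as user-supplied callbacks, the state is treated as a flat vector (a real implementation would update the rotation on the manifold), and the convergence test is an assumption of this sketch.

```python
import numpy as np

def iterated_ekf_update(x_pred, P_pred, U, residual_and_jacobians,
                        redeform_patches, max_iters=5, tol=1e-4):
    """Iterated EKF update: re-deform the patches with each refined pose.

    x_pred, P_pred: predicted state vector and covariance from the prediction step
    U:              measurement noise covariance (diagonal of image/depth variances)
    residual_and_jacobians(x) -> (h, H, J): patch cost and its Jacobians at state x
    redeform_patches(x): recompute the SE-effect patch deformations for state x
    """
    x = x_pred.copy()
    for _ in range(max_iters):
        redeform_patches(x)                      # deformations depend on the pose
        h, H, J = residual_and_jacobians(x)      # Equations (18) and (21)
        S = H @ P_pred @ H.T + J @ U @ J.T       # Equation (20)
        K = P_pred @ H.T @ np.linalg.inv(S)      # Equation (19)
        dx = K @ h                               # Equation (22): x <- x - K h
        x = x - dx
        if np.linalg.norm(dx) < tol:
            break
    P = (np.eye(len(x)) - K @ H) @ P_pred        # Equation (23), after convergence
    return x, P
```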

4.3. Model Fusion and Patch Update

Similar to [3], we used the TSDF [54] and a hierarchical hashing scheme [55] to incrementally fuse each depth image using the camera poses obtained from the Kalman update. To perform a non-rigid deformation of the TSDF-represented geometry model in loop closure, we used the submap mechanism introduced in [56] to build the whole model from several small submaps. We generated a new submap once the camera moved far away from the "central point" of the current submap, which was implemented by checking whether the "central point" could still be projected into the current camera viewport; a sketch of this test is given below. The "central point" is the geometric center of the current submap and is updated online during fusion. The subsequent depth frames are fused into the newly generated submap, and the global pose of the new submap is initialized by the camera pose of its first frame. Similar to [56], we also used a pose graph to represent the submaps, where the pose of each submap is denoted as a node and two consecutive submaps are linked with one edge. Note that the pose of a submap is updated by loop closure (described in the next section).
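A minimal sketch of that visibility test, assuming a pinhole camera model; the intrinsics handling, the image-bounds check and the `submap.center` attribute are our own illustrative choices.

```python
import numpy as np

def central_point_visible(center_world, T_cam, K, width, height):
    """Return True if the submap's central point projects inside the current view."""
    c_h = np.append(center_world, 1.0)
    c_cam = (np.linalg.inv(T_cam) @ c_h)[:3]      # world -> current camera frame
    if c_cam[2] <= 0:                             # behind the camera
        return False
    u, v, w = K @ c_cam
    u, v = u / w, v / w
    return 0 <= u < width and 0 <= v < height

def maybe_create_submap(submap, T_cam, K, width, height):
    """Start a new submap once the current camera no longer sees the central point.
    The new submap's global pose is then initialized from the current camera pose."""
    return not central_point_visible(submap.center, T_cam, K, width, height)
```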
After the fusion of the current frame, we updated the patch features for the subsequent tracking. Firstly, features with great intensity errors should be excluded. Therefore, we excluded those bad features by comparing the average pixel intensity error of the patches. Then, we utilized the FAST corner detector [57] and the number of pixels with available depth information to add new features with distinct intensity gradients and sufficient depth information. Note that the SE patches were also replaced by their normal patches. Finally, the intensity information of the patch features was updated by the current color image, and their depth information was set by the fused geometry, which has better quality than the current depth image.

4.4. Loop Closure

4.4.1. Loop Detection

During fast camera motions, color images are severely blurred, leading to the failure of the color-based loop detection methods, such as [58,59]. Fortunately, depth images captured by the structured-light depth sensor are insensitive to fast sensor motions, thanks to the high frame-rate of the IR projector (more than 1000 FPS). Therefore, we utilized the keyframe encoding method [11] to only encode depth images for the detection of latent loops.
The original keyframe encoding method compares the code of the current frame with codes of all so-far harvested keyframes to detect loops and harvest keyframes. Its processing time grows with the number of keyframes, which goes against the real-time performance of the reconstruction system for long-term large-scale indoor scene reconstruction. In our work, we stored image codes separately into the code lists of their corresponding submaps, and a new code was only compared with the codes of the same submap. This design greatly reduces the computation cost by matching the code of the current frame only with the lists of nearby submaps. In addition, our method also improves the detection accuracy by removing outliers implicitly. Note that outliers are usually from regions with similar geometries, such as different corners in a room.
The encoding process is similar to Glocker et al. [11]; we utilized randomized ferns to encode a binary code $B$ for a depth image. The binary code consists of $N$ four-bit binary code blocks $b$, and each code block is calculated from a random pixel. The dissimilarity between the current frame $I$ and a keyframe $J$ was calculated via the block-wise Hamming distance:
$$\mathrm{BlockHD}(B^I, B^J) = \frac{1}{N} \sum_{i=1}^{N} \left(b_i^I \equiv b_i^J\right),$$
where the operation $\equiv$ returns 0 only if two code blocks are identical. In our experiment, $N$ was assigned as 1000.
Since we maintained one list for each submap, the code $B^I$ of the current frame $I$ was stored in the code list of the current submap $M_i$. During the loop detection of each frame, we only matched the current frame with the candidate keyframes in the lists that belong to the $k$ nearest submaps. The nearest submaps were selected by comparing the poses of the current frame and each submap. We set $k = 5$ by considering the experimental results, as well as our practical experience. If the smallest dissimilarity is higher than a threshold $\sigma_1 = 0.1$, then the current frame is added as a new keyframe and its code is included in the code list of the current submap. Otherwise, a loop is successfully detected when the smallest dissimilarity (excluding keyframes of the current submap) is lower than the threshold $\sigma_2 = 0.05$.
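To illustrate the depth-only encoding and the per-submap code lists, here is a simplified sketch of a fern encoder and the matching step using the BlockHD dissimilarity defined above. The exact fern design (four random depth thresholds tested at one random pixel per block) is our reading of [11] rather than a confirmed detail, and the image size, depth range and data structures are assumptions.

```python
import numpy as np

SIGMA_NEW_KEYFRAME = 0.1   # add a keyframe if the smallest dissimilarity exceeds this
SIGMA_LOOP         = 0.05  # declare a loop if it falls below this (other submaps only)

class FernEncoder:
    """Encode a depth image into N four-bit blocks, one per random pixel."""
    def __init__(self, n_blocks=1000, h=480, w=640, d_max=5.0, seed=0):
        rng = np.random.default_rng(seed)
        self.pixels = rng.integers(0, [h, w], size=(n_blocks, 2))
        # four random depth thresholds per block (one plausible fern design)
        self.thresholds = rng.uniform(0.0, d_max, size=(n_blocks, 4))

    def encode(self, depth):
        d = depth[self.pixels[:, 0], self.pixels[:, 1]][:, None]
        return d > self.thresholds                # (N, 4) boolean code blocks

def block_hd(code_a, code_b):
    """Fraction of non-identical four-bit blocks (the BlockHD dissimilarity)."""
    return np.mean(np.any(code_a != code_b, axis=1))

def detect_loop(code, nearby_submap_lists, current_id):
    """Match the current code only against keyframes of the k nearest submaps."""
    best_all, best_other, best_other_id = 1.0, 1.0, None
    for submap_id, codes in nearby_submap_lists.items():
        for kf_code in codes:
            d = block_hd(code, kf_code)
            best_all = min(best_all, d)
            if submap_id != current_id and d < best_other:
                best_other, best_other_id = d, submap_id
    is_new_keyframe = best_all > SIGMA_NEW_KEYFRAME   # harvest a new keyframe
    is_loop = best_other < SIGMA_LOOP                 # loop against another submap
    return is_new_keyframe, is_loop, best_other_id
```

In practice, `nearby_submap_lists` would hold the code lists of the k = 5 submaps whose poses are closest to the current frame.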

4.4.2. Pose Graph Optimization

Once a loop was successfully detected between the current submap and a previous submap, we relocated the camera pose of the current frame via ICP between the current frame and the previous submap [11]. Note that the ICP was initialized by the pose of the matched keyframe in loop detection. A new edge between the two submaps was also added in the pose graph (Section 4.3) by comparing the camera poses of the two submaps. As the pose graph was updated, we used the pose graph optimization [60] to close the detected loop.

5. Experiments

In this section, we first introduce the hardware information of our platform and our computational performance. Then, we define the speed of sensor motion in a recorded sequence. After that, we evaluate the effectiveness of our geometry-aware feature tracking and the importance of integrating the inertial information by sequences with different speeds, as well as the accuracy and efficiency of our loop detection method. Next, we compare our indoor scene 3D reconstruction method with state-of-the-art methods in terms of both quantitative and qualitative experiments.

5.1. Performance and Parameters

All experiments were performed on a laptop with an Intel Core i7-7820HK CPU at 2.9 GHz, 32 GB of RAM and a GeForce GTX 1080 GPU with 8 GB of memory. We exhibit the computational performance of our system in Figure 5. The overall processing time of each frame does not increase with the frame number thanks to our improved loop detection method. We allocated voxels of size 0.005 m × 0.005 m × 0.005 m; thus, the size of one voxel block (8 × 8 × 8 voxels) was 0.04 m × 0.04 m × 0.04 m, and we pre-allocated 16,384 blocks for each submap.

5.2. Speed Definition

The TUM RGB-D dataset [61] uses the average sensor motion speed of a sequence as the speed of that sequence. However, we find that it is the fastest part of a sequence that determines the real difficulty for camera tracking, so we use the fastest camera motion speed to classify a sequence as "fast" or "slow". In practice, we did not directly use the single fastest speed value but the value at the top 10% (i.e., the 90th percentile), because the obtained camera motion speed is not very accurate and the single fastest value is easily affected by errors. We then empirically classify a sequence as "fast" if its linear speed is greater than 1 m/s or its angular speed is greater than 2 rad/s; for common sensors, we found that these speeds lead to considerable motion blur.
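The speed definition can be made concrete with a few lines: given a trajectory, compute per-frame linear and angular speeds, take the value at the 90th percentile and apply the 1 m/s and 2 rad/s thresholds. The trajectory format assumed here (timestamps, positions and rotation matrices) is ours.

```python
import numpy as np

def sequence_speed(timestamps, positions, rotations, percentile=90):
    """Top-10% (90th percentile) per-frame speeds of a trajectory.

    timestamps: (F,)      seconds
    positions:  (F, 3)    camera positions
    rotations:  (F, 3, 3) camera rotation matrices
    """
    dt = np.diff(timestamps)
    lin = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    # relative rotation angle between consecutive frames: R_k R_{k-1}^T
    rel = np.einsum('fij,fkj->fik', rotations[1:], rotations[:-1])
    cos_angle = np.clip((np.trace(rel, axis1=1, axis2=2) - 1) / 2, -1, 1)
    ang = np.arccos(cos_angle) / dt
    return np.percentile(lin, percentile), np.percentile(ang, percentile)

def is_fast(lin_speed, ang_speed):
    """Classify a sequence as 'fast' if either speed exceeds the empirical thresholds."""
    return lin_speed > 1.0 or ang_speed > 2.0
```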
In Table 1, we calculate the speeds of sequences for several public datasets, such as the ICL NUIM [53], VaFRIC [33], TUM RGB-D [61] and ETH3D [22]. Since most of the public datasets are recorded under slow camera motions (the ETH3D provides three fast sequences with small camera shakes), we made a dataset with both slow and fast sequences to further evaluate the scene reconstruction performance of our method. Our dataset was recorded in indoor scenes “Dorm1, Dorm2, Bedroom, Hotel” using the sensor Intel Realsense ZR300, and the sensor provided color, depth and IMU information. We have released our dataset to the public (https://github.com/zhuzunjie17/FastFusion).
Figure 6 shows the linear and angular camera velocities of the three sequences recorded by ourselves. In the two fast sequences, the camera velocities of consecutive frames change violently; this is mainly caused by large accelerations of the camera. In fact, the other fast sequences used in our experiments also exhibit this kind of phenomenon. Notice that the speeds of the public datasets are derived from ground truth camera trajectories acquired by Vicon systems, and the speeds of the dataset recorded by ourselves are derived from successfully tracked camera trajectories.

5.3. Evaluation

5.3.1. Feature Tracking

To test whether the proposed geometry-aware feature tracking method tracks features accurately, we compared our feature tracking method with and without considering the SE effect. The method without considering the SE effect matches patches directly using the patch shapes of the previous frame. For the two compared methods, we set the patch size to 10 × 10 and extracted no more than 100 patches per frame. Figure 7 shows the qualitative tracking results for patches affected by the SE effect: our method successfully detects the SE effect and deforms the patches for tracking. The results show that, for patches affected by the SE effect, our method obtains more accurate positions and smaller residuals than the method that ignores the effect. We list the average pixel intensity errors (AIE) of various sequences in Table 2 to quantitatively compare the two methods. In this table, ICL_kt2 is the "livingroom" sequence and ICL_kt3 is the "officeroom" sequence. From the table, we find that our method obtains a smaller AIE than the competing method for all sequences, especially those with fast camera motions. In addition, the numbers of frames and SE patches are also listed.

5.3.2. IMU

To verify whether the integration of the IMU helps in reconstructing the scene model under fast camera motion, we compared the results with and without IMU on the three sequences Hotel_slow, Hotel_fast1 and Hotel_fast2. Note that the loop closure module is disabled in this comparison. Figure 6 shows the details of the camera motions in the three sequences. For our system without IMU, we used a uniform motion model in the Kalman prediction step, i.e., the predicted camera state is identical to that of the previous frame.
Figure 8 shows the corresponding reconstruction results. In the sequence with slow camera motion, the system without IMU works on par with the full system. In the sequences with aggressive camera motion, our system without IMU reconstructs models with several fractures. As shown in Figure 8b, a large fracture appears in the reconstructed model of Hotel_fast2, which is essentially caused by the bad initialization of the SE patch deformation during frames 300 to 350. The bad initialization of the SE patch deformation degrades the geometry-aware patch feature tracking, leading to inaccurate camera pose estimation. A side-by-side comparison of the reconstruction process of our system with and without IMU can be viewed in our video in the Supplementary Materials.

5.3.3. Loop Detection

Computational Efficiency. As described in Section 4.4.1, we improved the original keyframe encoding method [11] for more efficient loop detection. Therefore, we show the processing time of the original and our improved methods on a recorded long sequence, "Ours_House", with more than 20k frames. The reconstruction of "Ours_House" (first 5k frames) is shown in Figure 9 and in our video in the Supplementary Materials. Note that both methods run on the CPU of the same platform (detailed in Section 5.1). In this experiment, the two methods share identical parameters, such as the number of code blocks (1000). Figure 5 shows the relationship between the computational time and the number of keyframes. A frame is designated as a keyframe as soon as its code is added into a list. It is obvious that the computational time of the original method [11] is linearly correlated with the number of keyframes, because all keyframes are used for the dissimilarity calculation as well as the loop detection. This affects the real-time performance of a 3D reconstruction system, even when running the loop detection process in parallel on another CPU core, because the computational time grows indefinitely. On the contrary, our improved loop detection method only calculates the dissimilarity for the candidate keyframes of the k = 5 nearby submaps, and the number of candidate keyframes does not increase with the total number of keyframes. Consequently, the computational time of our method always stays below a threshold, which guarantees the real-time performance of the 3D reconstruction.
Accuracy. We further evaluated the accuracy of our method on the public dataset "7 scenes" [11]. This dataset provides training and test image sequences in seven indoor scenes. The training sequences are used for keyframe harvesting and the test sequences are used for an accuracy evaluation of camera relocalization using the harvested keyframes. In this experiment, we first reconstructed the indoor scenes from the training sequences, then initialized the test sequences in the same global coordinate. The successful relocalization of a frame is defined in the same way as in [11], i.e., the recovered pose of a frame after ICP camera tracking (which is initialized by the pose of the matched keyframe) is within a 2 cm translational error and 2 degree angular error. Figure 10 shows the accuracy of the original method and our method with depth input only, i.e., using only the depth information for relocalization. Interestingly, our method improves the accuracy in all seven scenes, especially in the scene "Stairs". The reason is that, by selecting candidate keyframes using the prior camera pose, our method implicitly removes the outliers produced by the very similar geometry of two different regions; for example, depth images that record different sections of stairs in the scene "Stairs".
We further evaluated the effect of the nearest submap number k; the results are summarized in Figure 11. In the figure, the accuracy increases with the nearest submap number when k is small, and decreases when more and more submaps are selected, because the outliers are reintroduced. In summary, we set k = 5 as a trade-off between performance and processing time. Note that a 120 m² indoor scene requires nearly 30 submaps (the "House" scene in Figure 9); thus, k = 5 is enough for most house-scale indoor scenes to cover potential loops, and it can also be tuned to adapt to other scenes.

5.4. Comparison

5.4.1. Quantitative Comparison

We quantitatively compared our full system with state-of-the-art methods, namely ORB-SLAM2 [44], ElasticFusion [5], InfiniTAM-v3 [56], BundleFusion [4], BAD-SLAM [22] and our system without loop closure [10]. ORB-SLAM2, BundleFusion and BAD-SLAM use feature points for loop detection, whereas ElasticFusion, InfiniTAM-v3 and our system use the keyframe encoding method; note that InfiniTAM-v3 also uses depth only for the detection.
We used the absolute trajectory errors (ATE) (as used in [61]; smaller is better) for the comparison; the errors calculate the root-mean-square of the Euclidean distances between the estimated camera trajectory and the ground truth camera trajectory. We evaluated three slow sequences of the TUM RGB-D dataset [61] and three fast sequences of the ETH3D [22]. The TUM RGB-D dataset is a recognized and widely used real-world benchmark. Both datasets provide color and depth images, while ETH3D further provides IMU measurements. Table 3 shows the results in sequences fr1/desk, fr2/xyz and fr3/office of TUM RGB-D, as well as the results in sequences camera_shake1(cs1), camera_shake2(cs2) and camera_shake3(cs3) of ETH3D. The speeds of these sequences are listed in Table 1. Note that the TUM RGB-D dataset does not provide full IMU information (acceleration only), so we used color and depth images of the TUM RGB-D dataset, and used the uniform motion model in our Kalman prediction step. In addition, since all of the other methods (except InfiniTAM-v3) cannot utilize IMU data, we only input IMU measurements for our system and InfiniTAM-v3.
The TUM RGB-D dataset contains neither IMU data nor fast camera motions, but our system still generates reasonable results in this situation, and our system with loop closure outperforms our original one on these sequences with multiple loops. Both our method and the other methods track camera poses well and successfully perform loop detection, as well as global optimization, under slow camera motion. Our strategies for fast camera motion (i.e., handling the SE effect) are of little help under slow movement; therefore, there is not much difference in performance between ours and the other methods under slow camera motion. The advantage of our work is adequately demonstrated on sequences cs1, cs2 and cs3 with fast camera motions. The three fast sequences in the ETH3D dataset are collected under fast camera shakes, which capture the same small scene without a loop (the reconstructed model of the scene is shown in Figure 12); therefore, our method with and without loop closure outputs the same RMSE. The results in Table 3 show that most approaches cannot successfully track the camera in all three fast sequences, whereas our method outperforms the other methods on all three.

5.4.2. Qualitative Comparison

In the qualitative experiments, we only compared our system with the methods that focus on the 3D reconstruction, including InfiniTAM-v3 [56], which uses the rotation values calculated by IMU information for pose estimation; BundleFusion [4], which has an efficient global pose optimization algorithm; ElasticFusion [5], which contains loop closure and model refinement via non-rigid surface deformation; BAD-SLAM [22], which performs direct RGB-D bundle adjustment; and our system without loop closure [10].
We compared them on our dataset that provides slow and fast sequences with IMU and RGB-D images; each sequence of our dataset aims to reconstruct an indoor scene. We first compared all methods in our slow sequence “Bedroom_slow” with a looped camera trajectory. The reconstruction results are shown in Figure 13. As we can see, our system without loop closure cannot consistently reconstruct the scene because it cannot eliminate the accumulated error for the camera tracking. On the contrary, our system with loop closure successfully detects and closes loops to reconstruct a globally consistent 3D model. Meanwhile, due to the good color images and small interframe motions of the slow sequence, other methods also successfully track the camera and close loops.
Then, we evaluated them on three fast sequences: two from our dataset (named "Dorm2_fast2" and "Livingroom_fast1") and one from the ETH3D dataset (named "Camera_shake1"). The reconstruction results are shown in Figure 12. BAD-SLAM fails at the very beginning of these fast sequences, which is also indicated in Table 3. In addition, because of the cautious tracking and reconstruction strategy of BundleFusion, it fails when the camera quickly scans large unknown regions, and then restarts the tracking and reconstruction when the scanning speed slows down. Therefore, BundleFusion restarts repeatedly and outputs incomplete 3D maps of the three sequences. The reason why BundleFusion still works on the fast sequences cs1 and cs2 in Table 3 is that those sequences only reconstruct a small region (see Figure 12) under camera shakes.
The performance of ElasticFusion and InfiniTAM-v3 is much better than BAD-SLAM and BundleFusion. However, due to the severe image motion blur caused by aggressive camera motions, the loop closure of ElasticFusion, which only depends on color information, does not work well, leading to the inconsistent reconstructions. Though InfiniTAM-v3 uses depth images to successfully detect loops, it still fails to rectify the great accumulated errors caused by fast camera motions. Different from them, our camera tracking and loop closure are both robust for fast camera motions, and the indoor scenes are well reconstructed.
More reconstruction results can be viewed in Figure 9 and our video in Supplementary Materials; we encourage the reader to watch our video in Supplementary Materials for a better visualization of the comparison.

6. Limitation

A large loop cannot be detected in time by our loop detection method. The reason is that the target submap may not be included in the candidate submaps due to a large gap between the current camera and the target submap. A solution is to adaptively increase the nearest submap number k, i.e., the size of k can be controlled by the total submap number to adapt to scenes of arbitrary scale.
In addition, our method cannot recover the sharp texture of the reconstructed 3D model; the blurry images should be recovered during or after the reconstruction.

7. Conclusions

We presented a real-time system for indoor scene reconstruction that tightly couples RGB-D-IMU information with an extended Kalman filter, and our system is demonstrated to be able to estimate camera poses and reconstruct 3D scene models under fast camera motions. With our system, a static scene can be reconstructed quickly and casually without requiring slow sensor motion, which significantly enhances the user-friendliness and robustness of scene reconstruction. In addition, we explored the SE effect caused by fast camera motions and handled it with a geometry-aware patch deformation method for accurate feature tracking under fast camera motions. Globally consistent reconstruction is achieved in our system for both slow and fast camera motions. We detect loops via a depth-based keyframe encoding method and reduce the computational time of detection using a submap-based candidate selection mechanism. When a loop is successfully detected between two submaps, we perform pose graph optimization to close the loop and update the poses of all submaps.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs14153551/s1, a video demonstrating the reconstruction effect and comparing qualitative results.

Author Contributions

Conceptualization, Z.Z.; Data curation, Z.Z.; Formal analysis, R.C.; Investigation, T.W.; Methodology, Z.Z. and F.X.; Project administration, Z.Z.; Resources, F.X.; Software, Z.X.; Supervision, C.Y.; Validation, Z.X.; Visualization, R.C.; Writing—original draft, Z.Z.; Writing—review & editing, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant (2020YFB1406604), National Nature Science Foundation of China (61931008, 62071415), Zhejiang Province Nature Science Foundation of China (LR17F030006), Hangzhou Innovation Institute of Beihang University (2020-Y3-A-027), Lishui Institute of Hangzhou Dianzi University (2022-001).

Data Availability Statement

Our datasets used above have been released on https://github.com/zhuzunjie17/FastFusion.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Piao, J.C.; Kim, S.D. Real-Time Visual–Inertial SLAM Based on Adaptive Keyframe Selection for Mobile AR Applications. IEEE Trans. Multimed. 2019, 21, 2827–2836.
  2. Whelan, T.; Kaess, M.; Johannsson, H.; Fallon, M.; Leonard, J.J.; McDonald, J. Real-time large-scale dense RGB-D SLAM with volumetric fusion. Int. J. Robot. Res. 2015, 34, 598–626.
  3. Kähler, O.; Prisacariu, V.A.; Ren, C.Y.; Sun, X.; Torr, P.; Murray, D. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph. 2015, 21, 1241–1250.
  4. Dai, A.; Nießner, M.; Zollhöfer, M.; Izadi, S.; Theobalt, C. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Trans. Graph. (TOG) 2017, 36, 76.
  5. Whelan, T.; Salas-Moreno, R.F.; Glocker, B.; Davison, A.J.; Leutenegger, S. ElasticFusion: Real-time dense SLAM and light source estimation. Int. J. Robot. Res. 2016, 35, 1697–1716.
  6. Wu, P.; Liu, Y.; Mao, Y.; Jie, L.; Du, S. Fast and Adaptive 3D Reconstruction with Extensively High Completeness. IEEE Trans. Multimed. 2017, 19, 266–278.
  7. Han, J.; Pauwels, E.J.; De Zeeuw, P. Visible and infrared image registration in man-made environments employing hybrid visual features. Pattern Recognit. Lett. 2013, 34, 42–51.
  8. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332.
  9. Zhang, H.; Yang, J. Intra-frame deblurring by leveraging inter-frame camera motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4036–4044.
  10. Zhu, Z.; Xu, F.; Yan, C.; Hao, X.; Ji, X.; Zhang, Y.; Dai, Q. Real-time Indoor Scene Reconstruction with RGBD and Inertial Input. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 7–12.
  11. Glocker, B.; Shotton, J.; Criminisi, A.; Izadi, S. Real-Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding. IEEE Trans. Vis. Comput. Graph. 2015, 21, 571–583.
  12. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136.
  13. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures; International Society for Optics and Photonics: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–607.
  14. Rusinkiewicz, S.; Levoy, M. Efficient variants of the ICP algorithm. In Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling, Quebec City, QC, Canada, 28 May–1 June 2001; pp. 145–152.
  15. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 834–849.
  16. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
  17. Wang, W.; Liu, J.; Wang, C.; Luo, B.; Zhang, C. DV-LOAM: Direct visual lidar odometry and mapping. Remote Sens. 2021, 13, 3340.
  18. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163.
  19. Tang, F.; Wu, Y.; Hou, X.; Ling, H. 3D Mapping and 6D Pose Computation for Real Time Augmented Reality on Cylindrical Objects. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2887–2899.
  20. Bonato, V.; Marques, E.; Constantinides, G.A. A Parallel Hardware Architecture for Scale and Rotation Invariant Feature Detection. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 1703–1712.
  21. Lentaris, G.; Stamoulias, I.; Soudris, D.; Lourakis, M. HW/SW Codesign and FPGA Acceleration of Visual Odometry Algorithms for Rover Navigation on Mars. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 1563–1577.
  22. Schops, T.; Sattler, T.; Pollefeys, M. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 134–144.
  23. Gao, X.; Shen, S.; Zhu, L.; Shi, T.; Wang, Z.; Hu, Z. Complete Scene Reconstruction by Merging Images and Laser Scans. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 3688–3701.
  24. Zhang, H.; Xu, F. MixedFusion: Real-Time Reconstruction of an Indoor Scene with Dynamic Objects. IEEE Trans. Vis. Comput. Graph. 2017, 24, 3137–3146.
  25. Guo, K.; Xu, F.; Wang, Y.; Liu, Y.; Dai, Q. Robust non-rigid motion tracking and surface reconstruction using l0 regularization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3083–3091.
  26. Ye, M.; Yang, R. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2345–2352.
  27. Yang, D.; Bi, S.; Wang, W.; Yuan, C.; Wang, W.; Qi, X.; Cai, Y. DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot. Remote Sens. 2019, 11, 380.
  28. Golodetz, S.; Cavallari, T.; Lord, N.A.; Prisacariu, V.A.; Murray, D.W.; Torr, P.H. Collaborative large-scale dense 3D reconstruction with online inter-agent pose optimisation. IEEE Trans. Vis. Comput. Graph. 2018, 24, 2895–2905.
  29. Stotko, P.; Krumpen, S.; Hullin, M.B.; Weinmann, M.; Klein, R. SLAMCast: Large-Scale, Real-Time 3D Reconstruction and Streaming for Immersive Multi-Client Live Telepresence. IEEE Trans. Vis. Comput. Graph. 2019, 25, 2102–2112.
  30. Sato, I.; Okabe, T.; Yu, Q.; Sato, Y. Shape reconstruction based on similarity in radiance changes under varying illumination. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; pp. 1–8.
  31. Zhang, Q.; Tian, F.; Han, R.; Feng, W. Near-surface lighting estimation and reconstruction. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1350–1355.
  32. Whelan, T.; Goesele, M.; Lovegrove, S.J.; Straub, J.; Green, S.; Szeliski, R.; Butterfield, S.; Verma, S.; Newcombe, R.A.; et al. Reconstructing scenes with mirror and glass surfaces. ACM Trans. Graph. 2018, 37, 102–111.
  33. Handa, A.; Newcombe, R.A.; Angeli, A.; Davison, A.J. Real-time camera tracking: When is high frame-rate best? In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 222–235.
  34. Zhang, Q.; Huang, N.; Yao, L.; Zhang, D.; Shan, C.; Han, J. RGB-T Salient Object Detection via Fusing Multi-Level CNN Features. IEEE Trans. Image Process. 2019, 29, 3321–3335.
  35. Lee, H.S.; Kwon, J.; Lee, K.M. Simultaneous localization, mapping and deblurring. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1203–1210.
  36. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Trans. Robot. 2017, 33, 249–265.
  37. Zagoruyko, S.; Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4353–4361.
  38. Liu, W.; Shen, X.; Wang, C.; Zhang, Z.; Wen, C.; Li, J. H-Net: Neural Network for Cross-Domain Image Patch Matching. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 856–863. [Google Scholar]
  39. Pemasiri, A.; Thanh, K.N.; Sridharan, S.; Fookes, C. Sparse over-complete patch matching. Pattern Recognit. Lett. 2019, 122, 1–6. [Google Scholar] [CrossRef] [Green Version]
  40. Wu, G.; Lin, Z.; Ding, G.; Ni, Q.; Han, J. On Aggregation of Unsupervised Deep Binary Descriptor with Weak Bits. IEEE Trans. Image Process. 2020, 29, 9266–9278. [Google Scholar] [CrossRef]
  41. Nießner, M.; Dai, A.; Fisher, M. Combining Inertial Navigation and ICP for Real-time 3D Surface Reconstruction. In Eurographics (Short Papers); Citeseer: Strasbourg, France, 2014; pp. 13–16. [Google Scholar]
  42. Prisacariu, V.A.; Kähler, O.; Murray, D.W.; Reid, I.D. Real-time 3d tracking and reconstruction on mobile phones. IEEE Trans. Vis. Comput. Graph. 2014, 21, 557–570. [Google Scholar] [CrossRef]
  43. Laidlow, T.; Bloesch, M.; Li, W.; Leutenegger, S. Dense RGB-D-inertial SLAM with map deformations. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 6741–6748. [Google Scholar]
  44. Mur-Artal, R.; Tardós, J.D. Visual-inertial monocular SLAM with map reuse. IEEE Robot. Autom. Lett. 2017, 2, 796–803. [Google Scholar] [CrossRef] [Green Version]
  45. Xu, C.; Liu, Z.; Li, Z. Robust visual-inertial navigation system for low precision sensors under indoor and outdoor environments. Remote Sens. 2021, 13, 772. [Google Scholar] [CrossRef]
  46. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar]
  47. Bloesch, M.; Burri, M.; Omari, S.; Hutter, M.; Siegwart, R. Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback. Int. J. Robot. Res. 2017, 36, 1053–1072. [Google Scholar] [CrossRef] [Green Version]
  48. Park, J.; Zhou, Q.Y.; Koltun, V. Colored point cloud registration revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017; pp. 143–152. [Google Scholar]
  49. Furgale, P.; Rehder, J.; Siegwart, R. Unified temporal and spatial calibration for multi-sensor systems. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 1280–1286. [Google Scholar] [CrossRef]
  50. Furgale, P.; Barfoot, T.D.; Sibley, G. Continuous-time batch estimation using temporal basis functions. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 2088–2095. [Google Scholar] [CrossRef]
  51. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef] [Green Version]
  52. Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-Manifold Preintegration for Real-Time Visual–Inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21. [Google Scholar] [CrossRef] [Green Version]
  53. Handa, A.; Whelan, T.; McDonald, J.; Davison, A. A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014. [Google Scholar]
  54. Curless, B.; Levoy, M. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 4–9 August 1996; pp. 303–312. [Google Scholar]
  55. Kähler, O.; Prisacariu, V.A.; Valentin, J.P.C.; Murray, D.W. Hierarchical Voxel Block Hashing for Efficient Integration of Depth Images. IEEE Robot. Autom. Lett. 2016, 1, 192–197. [Google Scholar] [CrossRef]
  56. Kähler, O.; Prisacariu, V.A.; Murray, D.W. Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure. In Proceedings of the ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 500–516. [Google Scholar]
  57. Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 430–443. [Google Scholar]
  58. Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
  59. Lowry, S.; Sünderhauf, N.; Newman, P.; Leonard, J.J.; Cox, D.; Corke, P.; Milford, M.J. Visual place recognition: A survey. IEEE Trans. Robot. 2016, 32, 1–19. [Google Scholar] [CrossRef] [Green Version]
  60. Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. G2o: A general framework for graph optimization. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3607–3613. [Google Scholar]
  61. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the International Conference on Intelligent Robot Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012. [Google Scholar]
  62. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Our real-time system reconstructs a 360° scene under fast camera motions (480 frames recorded in 16 s). Top row: two sampled input frames (note the strong motion blur); bottom row: the reconstructed 3D model.
Figure 2. Overview of our pipeline. Red blocks represent the input information of the current frame. The green arrow represents the iterative operation between the two modules, and the blue arrows denote that results from previous frames are used in the current frame.
Figure 3. How the SE effect is influenced by the camera motion and the 3D shape of a patch. The dotted triangles represent the cameras at the previous frame, and the red/blue ones the cameras at the current frame. In each case, the upper graph is a top view of the cameras observing the scene, and the bottom grids show the corresponding patches as seen in the images.
Figure 4. The deformation process of a patch influenced by the SE effect. In this figure, a 3 × 3 patch in the previous frame with depth values d1 and d2 is projected into the current frame. The shrink effect and the extend effect both appear on the patch.
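To make the projection step in Figure 4 concrete, the sketch below back-projects each pixel of a patch with its own depth and reprojects it into the current frame; it is the per-pixel depth that produces the shrink/extend deformation. This is a minimal illustration under a pinhole model, not the system's implementation: the intrinsics K, relative rotation R, translation t, and the 3 × 3 patch with depths d1/d2 are all hypothetical inputs.

```python
import numpy as np

def warp_patch(pixels, depths, K, R, t):
    """Project patch pixels (u, v) with per-pixel depth from the previous
    frame into the current frame (pinhole model). Pixels with different
    depths move by different amounts, which shrinks or extends the patch."""
    K_inv = np.linalg.inv(K)
    uv1 = np.c_[pixels, np.ones(len(pixels))]      # homogeneous pixel coordinates
    X_prev = depths[:, None] * (uv1 @ K_inv.T)     # back-project to 3D (previous frame)
    X_curr = X_prev @ R.T + t                      # transform into the current frame
    uv = X_curr @ K.T                              # project with the intrinsics
    return uv[:, :2] / uv[:, 2:3]

# Hypothetical example: a 3x3 patch whose left column is nearer (d1) than the rest (d2).
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1]])
us, vs = np.meshgrid(np.arange(100, 103), np.arange(100, 103))
pixels = np.stack([us.ravel(), vs.ravel()], axis=1).astype(float)
depths = np.where(pixels[:, 0] == 100, 1.0, 2.0)   # d1 = 1 m, d2 = 2 m
R, t = np.eye(3), np.array([0.05, 0.0, 0.0])       # small lateral camera motion
print(warp_patch(pixels, depths, K, R, t))         # nearer pixels shift more: the patch deforms
```

With this toy motion, the near column moves roughly twice as far in the image as the far columns, which is exactly the non-rigid patch deformation the caption describes.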
Figure 5. The computational time of the original frame encoding method (black) and our improved method (blue) for loop detection on a long sequence. We also show the computational time of our full system (green) and of our system using the loop detection method in [11] (cyan). Red dotted lines are linear fits of the computational times.
Figure 6. Linear and angular camera velocities of the three sequences used for the IMU evaluation. The dotted lines denote the corresponding linear and angular velocities of each sequence. Two sequences are classified as 'fast' and one as 'slow' according to our definition.
Figure 7. Feature tracking results of our method with/without consideration of the SE effect (marked with purple/yellow bounding boxes) for patches affected by the shrink effect (top) and the extend effect (bottom). The images on the left and right are enlarged views of two consecutive frames from the VaFRIC dataset [33]. The residuals of the tracked patches are depicted in grayscale, i.e., black denotes 0. As mentioned in Section 4.1, empty pixels of deformed patches are marked and shown in black.
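The residual images in Figure 7 can be read as a masked per-pixel intensity difference between the reference patch and the patch sampled at the tracked (possibly deformed) location, with empty pixels set to zero. The following is only an illustrative sketch of that visualization; the function name, patch size, and noise levels are hypothetical and not taken from the paper.

```python
import numpy as np

def patch_residual(ref_patch, warped_patch, valid_mask):
    """Per-pixel absolute intensity residual between a reference patch and the
    patch sampled at the tracked location. Empty pixels of the deformed patch
    are assigned zero, so they appear black as in Figure 7."""
    residual = np.abs(ref_patch.astype(np.float32) - warped_patch.astype(np.float32))
    residual[~valid_mask] = 0.0
    return residual

# Hypothetical 5x5 patches: a near-perfect match with two invalid (empty) pixels.
ref = np.random.randint(0, 256, (5, 5)).astype(np.float32)
warped = ref + np.random.normal(0, 2, (5, 5)).astype(np.float32)
mask = np.ones((5, 5), dtype=bool)
mask[0, :2] = False                      # two empty pixels after deformation
res = patch_residual(ref, warped, mask)
print(res.mean())                        # small mean residual indicates an accurate track
```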
Figure 8. The reconstruction results of a hotel under slow and fast camera motions. For a fair comparison, we disable the loop closure function and compare our system with and without the IMU. (a) Results of our system with the IMU; (b) results without the IMU.
Figure 9. Reconstruction results of our system on our own datasets.
Figure 10. Percentages of the successfully recovered frames on each test sequence with depth input only.
Figure 11. The effect of the number of selected nearest submaps. Selecting more submaps incurs a higher computational cost.
Figure 12. Qualitative comparison of the reconstruction results of different methods on three sequences with fast camera motions. Each column shows the results of the different methods on the same video sequence.
Figure 13. Reconstruction results of a bedroom with slow camera motions and several loops.
Table 1. The speed of sequences in different datasets.

Sequence | Linear Velocity (m/s) | Angular Velocity (rad/s)
ICL_kt0 | 0.27 | 0.33
ICL_kt1 | 0.09 | 0.40
ICL_kt2 | 0.40 | 0.46
ICL_kt3 | 0.24 | 0.31
VaFRIC | 1.81 | 0.22
TUM_fr1_desk | 0.66 | 0.94
TUM_fr2_xyz | 0.1 | 0.08
TUM_fr3_office | 0.36 | 0.35
ETH3D_camera_shake1 | 0.64 | 2.65
ETH3D_camera_shake2 | 0.48 | 3.30
ETH3D_camera_shake3 | 0.51 | 3.13
Ours_Dorm1_slow | 0.59 | 1.07
Ours_Dorm1_fast1 | 0.92 | 2.59
Ours_Dorm1_fast2 | 0.93 | 2.42
Ours_Dorm2_fast1 | 1.60 | 2.16
Ours_Dorm2_fast2 | 1.43 | 2.49
Ours_Hotel_slow | 0.68 | 1.16
Ours_Hotel_fast1 | 1.26 | 2.34
Ours_Hotel_fast2 | 1.55 | 2.10
Ours_Bedroom_slow | 0.25 | 0.42
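A per-sequence speed such as those in Table 1 can be obtained by averaging frame-to-frame pose differences over the camera trajectory; the sketch below shows that computation under the assumption of time-synchronized poses (it is an illustration of the metric, not necessarily the exact procedure used for the table; all variable names are hypothetical).

```python
import numpy as np

def sequence_speed(positions, rotations, timestamps):
    """Mean linear (m/s) and angular (rad/s) velocity of a camera trajectory.
    positions: (N, 3) translations, rotations: (N, 3, 3) rotation matrices,
    timestamps: (N,) seconds."""
    dt = np.diff(timestamps)
    lin = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    ang = []
    for R_prev, R_curr in zip(rotations[:-1], rotations[1:]):
        R_rel = R_curr @ R_prev.T                              # relative rotation
        cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        ang.append(np.arccos(cos_theta))                       # rotation angle in radians
    return lin.mean(), (np.array(ang) / dt).mean()

# Hypothetical 30 Hz trajectory moving 1 cm and rotating 2 degrees per frame.
N, dt = 100, 1.0 / 30.0
ts = np.arange(N) * dt
pos = np.c_[0.01 * np.arange(N), np.zeros(N), np.zeros(N)]
theta = np.deg2rad(2.0) * np.arange(N)
rot = np.array([[[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]] for a in theta])
print(sequence_speed(pos, rot, ts))   # approx (0.30 m/s, 1.05 rad/s)
```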
Table 2. Comparison of patch feature tracking.

Speed | Sequence | Frame Num. | SE Patch Num. | AIE (w/o SE) | AIE (w/ SE)
Slow | ICL_kt2 | 882 | 91 | 4.90 | 4.08
Slow | ICL_kt3 | 1240 | 298 | 6.3 | 5.87
Slow | TUM_fr1_desk | 613 | 323 | 13.38 | 9.53
Slow | TUM_fr3_office | 2585 | 1302 | 12.93 | 12.1
Slow | Dorm1_slow | 1534 | 403 | 8.13 | 7.89
Slow | Hotel_slow | 1283 | 277 | 7.78 | 7.35
Fast | VaFRIC (20FPS) | 300 | 1032 | 17.22 | 7.83
Fast | ETH3D_cs1 | 318 | 988 | 15.61 | 8.01
Fast | Dorm1_fast1 | 1477 | 1940 | 13.90 | 7.93
Fast | Dorm1_fast2 | 484 | 1639 | 16.45 | 9.2
Fast | Dorm2_fast1 | 564 | 1351 | 20.80 | 11.02
Fast | Hotel_fast1 | 645 | 1367 | 14.83 | 9.69
Table 3. ATE RMSE (cm) on the TUM RGB-D [61] and ETH3D [22] datasets.

Method | TUM Desk | TUM xyz | TUM Office | ETH3D cs1 | ETH3D cs2 | ETH3D cs3
ORB-SLAM2 [62] | 1.6 | 0.4 | 1.0 | - | 6.9 | -
ElasticFusion [5] | 2.0 | 1.1 | 1.7 | 8.4 | - | -
InfiniTAM-v3 [56] | 1.8 | 1.5 | 2.5 | 3.6 | 3.3 | 7.1
BundleFusion [4] | 1.6 | 1.1 | 2.2 | 5.2 | 4.0 | -
BAD-SLAM [22] | 1.7 | 1.1 | 1.7 | - | - | -
Ours w/o loop [10] | 3.3 | 1.5 | 4.6 | 1.7 | 1.6 | 2.8
Ours | 1.8 | 1.2 | 2.4 | 1.7 | 1.6 | 2.8
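ATE RMSE in Table 3 follows the standard absolute trajectory error protocol of the TUM benchmark [61]: the estimated trajectory is rigidly aligned to the ground truth and the RMSE of the remaining translational differences is reported. A minimal sketch of that metric is given below, assuming time-synchronized trajectories; the alignment is a plain Horn/Umeyama least-squares fit rather than the benchmark's exact tooling, and the example data are hypothetical.

```python
import numpy as np

def ate_rmse_cm(est, gt):
    """Absolute trajectory error (RMSE, in cm) after rigid alignment.
    est, gt: (N, 3) arrays of time-synchronized camera positions in meters."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    # Least-squares rotation aligning the estimate to the ground truth (Horn/Umeyama).
    U, _, Vt = np.linalg.svd((gt - mu_g).T @ (est - mu_e))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # guard against reflections
    R = U @ S @ Vt
    aligned = (est - mu_e) @ R.T + mu_g
    return 100.0 * np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

# Hypothetical check: a trajectory rotated and shifted, plus ~1 cm of per-axis noise.
gt = np.cumsum(np.random.randn(500, 3) * 0.02, axis=0)
theta = np.deg2rad(30)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0], [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
est = gt @ Rz.T + np.array([1.0, -2.0, 0.5]) + np.random.randn(500, 3) * 0.01
print(ate_rmse_cm(est, gt))   # approx 1.7 cm, i.e., the residual noise after alignment
```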