Article

Implicit–Explicit Coupling Enhancement for UAV Scene 3D Reconstruction

1 International School, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2425; https://doi.org/10.3390/app14062425
Submission received: 6 February 2024 / Revised: 6 March 2024 / Accepted: 11 March 2024 / Published: 13 March 2024
(This article belongs to the Special Issue UAV Remote Sensing and 3D Reconstruction)

Abstract
In unmanned aerial vehicle (UAV) large-scale scene modeling, challenges such as missed shots, low overlap, and data gaps caused by flight paths and environmental factors (e.g., variations in lighting, occlusion, and weak textures) often lead to incomplete 3D models with blurred geometric structures and textures. To address these challenges, an implicit–explicit coupling enhancement framework for UAV large-scale scene modeling is proposed. Benefiting from the mutual reinforcement of implicit and explicit models, we first address the issue of missing co-visibility clusters caused by environmental noise through large-scale implicit modeling with UAVs, which enhances the inter-frame photometric and geometric consistency. We then increase the density of the multi-view point cloud reconstruction via synthetic co-visibility clusters, effectively recovering missing spatial information and constructing a more complete dense point cloud. Finally, during the mesh modeling phase, high-quality 3D modeling of large-scale UAV scenes is achieved by inverting the radiance field and mapping additional texture details into 3D voxels. The experimental results demonstrate that our method achieves state-of-the-art modeling accuracy across various scenarios, outperforming existing commercial and open-source UAV aerial photogrammetry software (COLMAP 3.9, Context Capture 2023, PhotoScan 2023, Pix4D 4.5.6) and related algorithms.

1. Introduction

The 3D point cloud reconstruction and mesh modeling of UAV large-scale scenes are key technologies in various domains such as environmental quality monitoring, geographic resource surveys, and urban planning [1]. Typically, UAVs capture a series of images from which the 3D geometric information of the scanned scene is reconstructed. Consequently, a significant challenge in modern large-scale scene modeling lies in efficiently utilizing the multi-view images acquired from UAV platforms to reconstruct three-dimensional models of large-scale scenes.
However, prevailing methods [2,3,4] for modeling extensive 3D maps from UAV aerial images often encounter challenges such as incomplete scene modeling and blurred surface geometry and textures. These issues are frequently attributed to limitations in existing visual 3D reconstruction methods and suboptimal UAV aerial image acquisition techniques.
Current mainstream methods for large-scale multi-view 3D reconstruction, such as structure from motion (SfM) [5,6,7,8], Multi-view stereo (MVS) [9,10,11], and simultaneous localization and mapping (SLAM) [12,13], have been successfully integrated into various commercial software, producing satisfactory high-quality scene models. However, challenges arise when applying these methods to UAV video data, particularly in complex large-scale geographic environments, such as areas with diverse terrain like plateaus, plains, hills, and mountains. Various factors contribute to these challenges, including numerous blind spots in the field of view, unstable signal propagation, and thin air at high altitudes, which collectively limit the operational radii of UAVs. These limitations can impact the redundancy and effectiveness of UAVs’ multi-view image collection. Even in scenarios where elements in the scene, such as buildings, may appear in multiple frames, certain aspects of the target or scene may be overlooked due to geographical interference or insufficient coverage in some images. This directly affects the accuracy of camera pose estimation and loop closure detection, resulting in reduced completeness and accuracy in subsequent 3D map modeling and leading to deformations in the modeling process.
On the other hand, large-scale aerial surveys are impacted by climatic conditions and the material properties of target surfaces. The complexities of large-scale environments, with variations in lighting at different times, contribute to inconsistent data collection. Many previous methods rely on multi-view stereo (MVS) [14,15,16] for scene reconstruction, predicting dense depth maps for each frame and fusing them to create a comprehensive model. However, depth-based methods, while adept at accurately reconstructing local geometric shapes, face challenges in enhancing the accuracy of depth information due to inconsistencies between different viewpoints. Additionally, in complex scenes, numerous textureless, repetitive, or reflective objects (e.g., fields, water surfaces, glass walls, etc.) can directly lead to depth estimation errors or missing data, posing difficulties in effective fusion of these depth maps.
Finally, existing reconstruction methods [17,18,19,20,21,22,23,24] frequently omit images unsuitable for geometric modeling to mitigate the influence of redundant frames and images with excessively wide or narrow baselines on reconstruction accuracy. However, this curation process may lead to the exclusion of images containing vital texture information, thereby diminishing the quality of subsequent modeling. In scenarios with relatively sparse imagery data of building surfaces, particularly in areas lacking texture, this selective approach can substantially affect the accuracy of measurements (depth estimation), lighting consistency, the reconstruction of dense point clouds, and the realism of geometric modeling. Ultimately, for areas posing reconstruction challenges or exhibiting modeling gaps, manual filling and repair by professionals may be essential to establish a complete and detailed geometric model structure.
In summary, it is evident that existing professional large-scale indoor modeling [25,26,27] requires extensive division, block-level computation, data merging, heightened computational complexity, and reduced fault tolerance. The entire workflow process encounters numerous challenges, emphasizing the necessity for operators to possess extensive experience in both indoor and outdoor operations to adeptly handle various situations that may arise. For the majority of non-professionally captured UAV imagery, the accuracy of utilizing existing reconstruction methods is significantly compromised and may even lead to failure. The entire data collection and reconstruction process is tailored for professional operators, rendering it complex, time-consuming, and costly. Consequently, extending building scene modeling with UAVs to ordinary users and achieving consumer-level applications proves challenging.
Recently, the advent of NeRF technology [28] has facilitated the modeling of non-professional UAV-captured data [29]. Through the calibration of UAV image parameters, precise implicit model reconstruction is attainable, enabling the reconstructed implicit model to generate high-definition images from specified viewpoints. Addressing unbounded scenes, NeRF++ [30] partitions the space into foreground and background regions, employing separate MLP models for each region. These models independently conduct ray rendering before their final combination. Furthermore, there are implicit modeling methods specifically designed for large scenes, such as CityNeRF [31], which utilizes multiscale data inputs, including LiDAR, to achieve large-scale modeling. In contrast, BlockNeRF [32] and Mega-NeRF [33] spatially decompose the scene, aligning with our approach in the implicit modeling stage. However, the spatial decomposition algorithms in these methods are relatively straightforward, resulting in mediocre performance.
In this paper, our objective is to develop a more flexible and straightforward large-scale UAV scene modeling method that strikes a balance between high accuracy and efficiency. This method harnesses the mutually beneficial aspects of implicit and explicit models at different stages of modeling, employing a novel neural radiance field to synthesize images with spatiotemporal coherence between frames. Subsequently, it performs dense depth estimation unit by unit using co-visible clusters and integrates the implicit scene model to recover missing spatial information, resulting in a more comprehensive and high-quality dense point cloud. In the modeling phase, it inverts the neural radiance field to recover keyframe poses that were initially omitted during reconstruction initialization. Furthermore, it integrates more texture images into the signed distance field (SDF) [34] to generate a more realistic and high-quality scene mesh model. The comparative results can be observed in Figure 1. The innovations in this paper are as follows:
(1) A novel implicit–explicit coupling framework is designed for improving UAV large-scale scene modeling performance. The dense reconstruction module (explicit) supervises the neural radiance field (implicit) for the precise implicit modeling of large-scale UAV scenes. Building on this foundation, the neural radiance field employs neural rendering to synthesize high-precision scene images. This compensates for occlusions and missing depth data caused by obstacles and environmental noise during the dense reconstruction process, thereby further enhancing the accuracy of the dense point cloud reconstruction. Ultimately, in the mesh modeling stage, the texture-rich images overlooked in the dense reconstruction are recovered by inverting the neural radiance field, thereby improving the details of the mesh modeling. The experiments show that our method achieves SOTA performance compared with the related mainstream methods and commercial software.
(2) An implicit synthetic co-visibility-cluster-guided dense point cloud reconstruction is proposed. We integrate the rendering of images from new perspectives, using a neural radiance field implicit model, into the construction of traditional co-visibility clusters. This method addresses the challenge of viewpoint occlusion and low overlap resulting from flight path and environmental factors. In contrast to traditional co-visibility clusters, implicitly synthesized co-visibility clusters leverage the capabilities of the NeRF model to implicitly render new perspective images, compensating for viewpoint gaps induced by flight trajectories and environmental conditions. The NeRF model learns the depth and illumination information of a 3D scene from images, providing implicitly synthesized co-visibility clusters with robustness against variations in lighting, occlusions, and low-texture regions. This robustness, in turn, guides the dense point cloud reconstruction module toward more robust reconstructions.
(3) An implicit texture image-pose-recovery-based high-accuracy mesh modeling is proposed. This method innovatively restores image poses that were not considered in the reconstruction process by reversing the neural radiance field. This decouples the mapping of image textures from the constraints of the 3D scene reconstruction. Through the implicit model, we accurately recover image poses that were excluded during the reconstruction phase, thereby incorporating more texture information into the mesh model. Specifically, the mesh model takes dense point clouds as input and undergoes surface mesh computation through Poisson reconstruction. Additionally, local optimization of the mesh is performed by integrating illumination models, achieving a more realistic and finely detailed 3D mesh modeling of large-scale UAV scenes.

2. Implicit–Explicit Coupling-Enhancement-Based 3D Reconstruction

This article presents an implicit–explicit coupling framework for UAV large-scale scene modeling. This framework eliminates the necessity for professional UAV trajectory planning and data preprocessing, enabling high-quality large-scale 3D reconstruction with non-professional aerial photography. The resultant model surfaces also exhibit finer textures. The overall framework is illustrated in Figure 2.
Specifically, the proposed framework addresses the challenge of processing non-trajectory-planned and non-preprocessed UAV cityscape aerial images. The framework is meticulously structured into two distinct stages, each contributing to the overall efficacy of the reconstruction process.
In the initial stage, a cutting-edge multi-view 3D reconstruction method is strategically employed. This method serves the crucial purpose of rapidly estimating intricate camera parameters and depth information within tightly knit clusters of co-visible images. In this way, the framework gains access to highly accurate poses and depth information, which play a pivotal role in supervising the neural radiance field. This supervision is paramount for establishing the implicit representation of the large-scale scene, ensuring the faithful reconstruction of the intricate details within the cityscape.
Furthermore, to enhance the completeness and accuracy of the reconstructed scene, a sophisticated neural radiance field (NeRF) model is brought into play. This well-trained NeRF model is specifically tailored to the characteristics of the large scene under consideration. Leveraging this model, the framework synthesizes images and corresponding dense depth data for regions where information may be missing or overlap is limited due to unplanned flight paths. This integration of the NeRF model not only addresses potential data gaps but also significantly contributes to the overall robustness of the reconstruction process.
In summary, the framework’s comprehensive approach begins with the rapid acquisition of camera parameters and depth information, followed by the meticulous supervision of the neural radiance field for implicit scene representation. The incorporation of a specialized NeRF model further enhances the reconstruction accuracy by synthesizing data for areas with limited information, ultimately culminating in an explicit high-accuracy mesh model of the UAV cityscape aerial images.
Subsequently, the synthesized data from the NeRF model are utilized to optimize the co-visible relationships within the multi-view 3D dense reconstruction. This optimization specifically addresses feature mismatch problems arising from low overlap, occlusion, and environmental noise caused by unplanned flight paths and pose planning. Concurrently, depth information is employed to rectify point cloud data in textureless or repetitive-texture areas, generating a dense, high-precision point cloud model. Finally, the image and depth information are fused into a signed distance field (SDF), and the neural radiance field is inverted to recover the excluded close-range image poses. This precise texturing of building exteriors results in a high-quality 3D model of large-scale scenes. Each key step of the algorithm will be elaborated upon in the subsequent sections.

2.1. Implicit Synthetic Co-Visibility-Cluster-Guided Dense 3D Reconstruction

In this section, we provide a detailed explanation of the UAV-based large-scale scene dense point cloud reconstruction module, which is guided by implicit synthetic co-visibility clusters. The module first uses a generic visual odometry method to rapidly and accurately compute the pose and depth information from the input images, which are used for the supervised training of the implicit model. The trained implicit model is then employed to render synthesized new perspectives along with their corresponding dense depth data, which optimizes the spatiotemporal co-visible clusters for dense point cloud reconstruction. Finally, robust and accurate reconstruction results are obtained through the explicit dense point cloud reconstruction model.

2.1.1. Implicit Representation of UAV Large-Scale Scenes

In this work, we commence with an input sequence of UAV-captured frames for a large-scale scene, denoted as $K = \{ I_i \mid i = 1, 2, 3, \ldots, n \}$, without prior trajectory planning or preprocessing. Our approach initially employs the visual odometry provided by a dense reconstruction module [35] to generate accurate depth and pose information. Similar to the RAFT [36] framework, we iteratively and globally optimize camera pose and depth information with finer updates to the optical flow field. This iterative process is particularly well suited for addressing issues such as drift in long trajectories and loop closures that commonly arise in large-scale UAV scene reconstruction. Upon determining the 3D point positions (D) and camera parameters (T) that minimize the reprojection error, these data are employed as supervision for the subsequent training of the large-scale implicit model for UAV-based scene reconstruction. This approach effectively leverages the accurate depth and pose information obtained from visual odometry to generate supervised data for training the implicit model, a step that is crucial for achieving high-quality large-scale scene reconstructions with UAVs without the need for prior trajectory planning and preprocessing.
In contrast to conventional large-scale implicit modeling, this work primarily focuses on non-professional data acquisition with UAVs. Unplanned flights can introduce more variation in pitch and roll angles, resulting in irregular alignment of non-professional UAV images. This irregularity is mainly evident in significant variations in heading and cross-track overlap, leading to the possibility of some parts of the scene being missed in the images, sparse image overlap, and significant changes in yaw angles between adjacent image frames. However, utilizing a single multi-layer perceptron (MLP) in traditional neural radiance field (NeRF) models has limitations in terms of capacity. If a specific area in the scene cannot capture a substantial number of images and complex scene details, it can lead to underfitting in the neural radiance field, resulting in significant blurring of synthesized new perspective images. This can be detrimental for the subsequent training of implicit models for large-scale scenes and can negatively impact modeling accuracy.
Based on the characteristics of UAV-captured data, it is observed that the images in the sequence are arranged in chronological order, representing the progression of time. Consequently, adjacent images along the time axis are likely to exhibit high spatial similarity. However, within short time intervals, the number of images may be insufficient to supervise the construction of implicit models. Regardless of whether the aerial photography is professional or non-professional, the nature of UAV flights often involves multiple repeated scans and loops in a given region. This implies that images taken at different times might depict the same scene. Recognizing this pattern, the approach adopted in this work is to divide the entire large-scale scene into regions based on the UAV’s trajectory. For each region, a spatiotemporal consistency co-visible cluster is constructed, ensuring that each region of the large scene has a sufficient number of images with high overlap. This method provides each region of the large scene with good training data for implicit modeling.
Specifically, the approach commences with the first frame $I_1$ of the aerial image sequence as the starting point. It then utilizes the LoFTR [37] method, which performs keypoint detection with full-image coverage and is effective in handling low-texture or repetitive-pattern areas, to conduct similarity matching with subsequent images. Any image frame with a similarity exceeding a predefined threshold $\xi_{covisibility}$ (typically around 20%) is considered to form a spatial co-visible relationship. The similarity $\xi$ is calculated as follows:
$\xi = \dfrac{2 N_{fit}}{N_{I_i}^{LoFTR} + N_{I_j}^{LoFTR}}$
where $N_{I_i}^{LoFTR}$ represents the number of LoFTR features in image $I_i$, and $N_{fit}$ represents the number of matched feature pairs.
Due to the decreasing similarity between $I_1$ and nearby images $(I_2, I_3, \ldots, I_i)$, the last image $I_i$ with a similarity greater than or equal to $\xi_{covisibility}$ is selected as the central scene for the co-visible cluster. Using $I_i$ as the reference, subsequent images are compared for scene overlap. Images in the sequence are sampled based on their temporal relationship with $I_i$, and those with a similarity exceeding $\xi_{covisibility}$ are added to the co-visible cluster sequence $C_i = \{I_i, \ldots\}$. This completes the selection of the first co-visible cluster. Similarly, we find the last image among the frames before and after $I_i$ whose regional overlap with $I_i$ is greater than $\xi_{covisibility}$, and this image becomes the central scene for the second co-visible cluster. This process is repeated until the last co-visible center in the vicinity still has a similarity greater than $\xi_{covisibility}$ when compared to the final image $I_n$ of the sequence. At this point, all co-visible clusters within the existing image sequence have been identified, and the subsequent neural radiance field training phase begins.
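As a concrete illustration, the following Python sketch implements the similarity measure of Equation (1) together with a simplified version of the greedy cluster selection described above. The `loftr_match` callback is a hypothetical stand-in for running LoFTR on an image pair and returning the two feature counts and the number of matches; it is not part of the original pipeline.

```python
def covisibility_similarity(n_feat_i, n_feat_j, n_matched):
    """Similarity xi of Eq. (1): matched LoFTR pairs over the mean feature count."""
    return 2.0 * n_matched / (n_feat_i + n_feat_j)


def build_covisibility_clusters(frames, loftr_match, xi_covis=0.2):
    """Simplified greedy sketch of the co-visible cluster selection.

    `loftr_match(a, b)` is assumed to return (n_feat_a, n_feat_b, n_matched).
    """
    clusters, anchor, n = [], 0, len(frames)
    while anchor < n:
        # walk forward while the overlap with the anchor frame stays above the
        # threshold; the last such frame becomes the cluster centre
        center = anchor
        for j in range(anchor + 1, n):
            if covisibility_similarity(*loftr_match(frames[anchor], frames[j])) >= xi_covis:
                center = j
            else:
                break
        # collect every frame that shares enough overlap with the centre
        members = [center] + [
            j for j in range(n)
            if j != center
            and covisibility_similarity(*loftr_match(frames[center], frames[j])) >= xi_covis
        ]
        clusters.append(sorted(members))
        anchor = center + 1  # continue searching after the current centre
    return clusters
```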
The method for establishing an individual neural radiance field is as follows: First, obtain the depth map $D_i$ and camera parameters $T_i$ for each frame based on DROID-SLAM [35]. Next, use an autoencoder structure to encode the input image into a latent vector and decode this latent vector into the parameters of a multi-plane NeRF. These NeRF parameters include the color and density of each plane. Then, employ a renderer to generate the image for each frame based on the parameters of the multi-plane NeRF and the camera pose. The renderer uses the same volumetric rendering formula as Instant NGP [38], taking into account the color and density of each plane and the distance between the camera and the planes to compute the color of each pixel. Finally, minimize the reprojection loss, defined as the L1 norm between the rendered image and the input image:
$L_{rep} = \sum_i \left\| I_i - \hat{I}_i \right\|_1$
where $I_i$ is the input image, and $\hat{I}_i$ is the rendered image.
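A minimal PyTorch rendition of this loss could look as follows (a sketch only; the batch layout of the frames is an assumption):

```python
import torch

def reprojection_loss(rendered, targets):
    """L1 photometric loss of Eq. (2).

    rendered, targets: (B, H, W, 3) tensors holding the rendered and input frames.
    """
    per_frame_l1 = torch.abs(rendered - targets).flatten(1).sum(dim=1)  # ||I_i - I_i_hat||_1
    return per_frame_l1.sum()                                           # summed over frames
```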
To generate more accurate modeling results, the neural implicit representation must provide not only color images from new viewpoints but also the corresponding dense depth maps, in order to eliminate the holes that MVS methods leave in textureless surface models. However, the output of the neural radiance field model does not directly contain depth information; it only includes scene opacity (density) information, which depends on the three-dimensional position of the described point and is independent of the selected viewpoint. Clearly, we cannot simply use the opacity information as depth. Therefore, it is necessary to derive the correspondence between the density predictions of the neural radiance field and depth. First, assume a ray originates from a point $x$ in space. To obtain the depth from a specific viewpoint $v_i$ to the point $x$, we sample along the direction of the ray. Let $x_i$ be a sampling point along the ray; each point has a corresponding density value $\sigma(x_i)$, which represents the probability of the point emitting or reflecting rays. To calculate depth, we need to determine the position where the ray terminates when it encounters an opaque object. In the neural radiance field, when a ray encounters an opaque object, it is completely absorbed or reflected and cannot continue to propagate; the final position the ray reaches is the depth. We use $d(x)$ to represent the distance from $x$ to the termination position. The depth at point $x$ is therefore the weighted sum of the distances to all sampling points, where the probability of each sampling point being the termination position equals the probability of emitting or scattering rays at that point multiplied by the probability of the ray not being absorbed or reflected between $x$ and that point. The formula is as follows:
$d(x) = \sum_{i=1}^{N} t_i \, \sigma(x_i) \, e^{-\sum_{j=1}^{i-1} t_j \sigma(x_j)}$
where $t_i$ is the distance between the $(i-1)$th sampling point and the $i$th sampling point.
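In code, this expected termination depth is obtained from the same quantities used for colour rendering; the sketch below assumes per-ray tensors of sampled densities, sample spacings, and sample distances, and uses the standard discrete transmittance weights.

```python
import torch

def render_depth(sigma, delta, t_vals):
    """Expected ray-termination depth in the spirit of Eq. (3).

    sigma:  (R, N) densities sigma(x_i) at the N samples of each ray
    delta:  (R, N) spacing t_i between consecutive samples
    t_vals: (R, N) distance from the ray origin to each sample
    """
    # transmittance: probability that the ray is not absorbed before sample i
    tau = torch.cumsum(sigma * delta, dim=-1)
    trans = torch.exp(-torch.cat([torch.zeros_like(tau[:, :1]), tau[:, :-1]], dim=-1))
    # termination probability per sample, then the weighted sum of sample distances
    weights = sigma * delta * trans
    return (weights * t_vals).sum(dim=-1)
```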
So far, we have described the large-scale NeRF modeling method for the original images. Next, we need to supplement the high-overlap scene co-visibility clusters based on the modeling results of the neural implicit model.

2.1.2. Multi-View Implicit Synthetic Co-Visibility Cluster Generation

We now have the foundation for synthesizing new-view color images and depth images. However, we need to determine which viewpoints to output as supplements. As previously mentioned, images with overlaps greater than $\xi_{dense}$ (usually set at 80% as an empirical value) are considered dense. If every image within a co-visibility cluster has an overlap with the surrounding images greater than $\xi_{dense}$, we can consider the entire cluster to be sufficiently dense and capable of providing good modeling results. To achieve this, we establish a mathematical model. First, we select an image frame $I_k$ from the co-visibility cluster and identify the set of images $\{I_l, \ldots\}$ within the same cluster that share a co-visibility relationship with it. We then find the images in this set whose feature overlap with the current image is less than $\xi_{dense}$. Under the previous assumption that all images are captured on the same plane, generating a viewpoint only requires determining its plane coordinates. In plane coordinates, when the center-to-center distance between two rectangles is less than a certain value, the overlap is guaranteed to be greater than $\xi_{dense}$. Through a simple geometric derivation, we obtain that when the center-to-center distance satisfies
$D \leq \left(1 - \xi_{dense}^{\frac{1}{2}}\right)\sqrt{W^2 + H^2}$
where $W$ and $H$ are the width and height of the image, respectively, the overlap between the two images is greater than $\xi_{dense}$.
Therefore, when generating the plane coordinates of new viewpoints, we only need to compute the equation of the line connecting image frame $I_k$ and each image $I_l$ in its co-visibility set, together with their Euclidean distance $D_{kl}$. Along this line we insert $\left\lceil D_{kl} \big/ \left( \left(1 - \xi_{dense}^{\frac{1}{2}}\right)\sqrt{W^2 + H^2} \right) \right\rceil - 1$ new viewpoints, so that the spacing between adjacent viewpoints is $D_{kl} \big/ \left\lceil D_{kl} \big/ \left( \left(1 - \xi_{dense}^{\frac{1}{2}}\right)\sqrt{W^2 + H^2} \right) \right\rceil$. We apply this process to every image within the co-visibility cluster that shares a co-visibility relationship with $I_k$. Combined with the flight height and camera angle information, the new viewpoints are fed to the NeRF model, which renders the corresponding RGB images and depth maps. This process produces a highly dense set of RGB-D aerial viewpoints for the subsequent reconstruction phase, ensuring the availability of data.
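A small geometric sketch of this viewpoint densification is given below; it keeps the same-plane assumption of the text, and the array conventions (2D image-centre coordinates) are illustrative assumptions.

```python
import numpy as np

def max_center_distance(xi_dense, width, height):
    """Largest centre-to-centre distance of Eq. (4) that still guarantees an
    overlap of at least xi_dense between two equally sized images."""
    return (1.0 - np.sqrt(xi_dense)) * np.hypot(width, height)

def new_viewpoints_between(c_k, c_l, xi_dense, width, height):
    """Plane coordinates of synthetic viewpoints inserted between the image
    centres c_k and c_l so that every gap respects Eq. (4)."""
    c_k, c_l = np.asarray(c_k, float), np.asarray(c_l, float)
    d_kl = np.linalg.norm(c_l - c_k)
    n_seg = int(np.ceil(d_kl / max_center_distance(xi_dense, width, height)))
    if n_seg <= 1:
        return np.empty((0, 2))              # the pair is already dense enough
    ts = np.arange(1, n_seg) / n_seg         # n_seg - 1 interior positions
    return c_k + ts[:, None] * (c_l - c_k)   # evenly spaced new viewpoint centres
```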

2.1.3. Dense Point Cloud Reconstruction of UAV Large-Scale Scenes

In our algorithm, each co-visibility cluster corresponds to a series of multi-view depth images. However, the depth maps predicted for the original input images by the dense reconstruction method [35] may themselves contain missing depth information, whereas the depth generated by NeRF is dense. To address this, we refer to the depth maps of the new viewpoints to supplement the missing information in the original images’ depth maps. Ultimately, we generate complete depth maps for each corresponding image frame within a co-visibility cluster. Next, we perform depth optimization on all generated depth maps to ensure depth consistency when merging information from multi-view depth images. For each pixel in an image, we use its depth $\lambda$ and the camera parameters to project it back into three-dimensional space:
$X = \lambda R_i^{T} K_i^{-1} p + C_i$
where $p$ is the pixel’s homogeneous coordinate, and $X$ is the three-dimensional point in the world coordinate system. Then, we further project $X$ into the neighboring images $N(i)$ of $I_i$. Assuming $N_k$ is the $k$th neighboring image in the set of adjacent depth images $N(i)$, we use $d(X, N_k)$ as the depth information of the image $N_k$, and use $\lambda(X, N_k)$ to represent the reprojected depth value, as mentioned above. If $\lambda(X, N_k)$ is close enough to $d(X, N_k)$, we can determine that $X$ is consistent in both $I_i$ and $N_k$, which means
$\dfrac{\left| d(X, N_k) - \lambda(X, N_k) \right|}{\lambda(X, N_k)} < \tau_2$
where $\tau_2$ is a threshold. If $X$ is consistent in at least two neighboring images in the set $N(i)$, it is considered a reliable scene point, and the depth value for the corresponding pixel $p$ in $I_i$ is retained. Otherwise, it is removed. Finally, all the depth maps are reprojected into a 3D form and merged into a single complete point cloud.
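The back-projection of Equation (5) and the consistency test of Equation (6) translate directly into NumPy; the sketch below assumes the pinhole convention of Eq. (5) and an illustrative threshold value.

```python
import numpy as np

def backproject(depth, K, R, C):
    """Lift a depth map into world space: X = lambda * R^T K^{-1} p + C (Eq. (5))."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(float)
    rays = np.linalg.inv(K) @ pix                        # camera-space rays per pixel
    X = R.T @ (rays * depth.reshape(1, -1)) + C.reshape(3, 1)
    return X.T                                           # (H*W, 3) world points

def depth_consistent(d_neighbor, lam_reproj, tau2=0.01):
    """Eq. (6): relative agreement between the neighbour's depth d(X, N_k)
    and the reprojected depth lambda(X, N_k); tau2 is an assumed value."""
    return np.abs(d_neighbor - lam_reproj) / np.maximum(lam_reproj, 1e-8) < tau2
```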

2.2. Detail-Implicit Texture Mapping Enhancement for High-Accuracy Mesh Modeling

This section focuses on enhancing building model surface texture and modeling based on implicit 3D pose guidance. It aims to address the challenges of achieving high-accuracy modeling and handling common tasks in scene modeling, such as segmentation and texturing.

2.2.1. Implicit Pose-Recovery-Based Texture Refinement

In order to achieve higher modeling accuracy, it is necessary to collect information at multiple heights during aerial data acquisition. In addition to capturing data from scenes at higher altitudes, it is common to include close-range information such as details of buildings. However, traditional reconstruction methods often discard images containing fine details, as they typically lack a sufficient number of co-observable scene images and are, thus, excluded during scene modeling. These images, containing rich texture information, have the potential to significantly improve the accuracy of scene modeling.
To effectively utilize this information, it is essential to remap multi-view images containing potential texture details back into three-dimensional space. Nevertheless, the images filtered out during the reconstruction process have lost their original camera poses, making texture mapping challenging. Inspired by the implicit modeling of the entire UAV scene and by iNeRF [39], we can establish a differential relationship between the camera pose $T$ and the rendered image $I$ through an end-to-end implicit neural rendering method. Given the camera pose $T$, the rendered image $I$ can be represented as $I = F_\theta(T)$, where $\theta$ denotes the learnable parameters of the neural network. Thus, if the scene image whose pose is to be estimated is denoted as $I_{tar}$, we need to solve for its true pose $T_{tar}$. Based on the differential relationship between images and poses established by NeRF, the problem of missing image poses can be addressed by driving an estimated pose $T_{est}$ toward $T_{tar}$ via the following equation:
$T_{est} = \underset{T_{est}}{\arg\min} \left\| F_\theta(T_{est}) - I_{tar} \right\|$
In principle, the optimization above gradually yields a precise pose estimate. However, due to the lack of depth supervision, directly using Equation (7) easily falls into local minima in complex scenes, making it difficult to reach the optimal result. Moreover, NeRF relies on pose initialization for iterative fitting, resulting in slow convergence and susceptibility to error accumulation. To address this, we combine NeRF depth estimation with feature-point-based pose computation. By rapidly obtaining a better initial pose for NeRF, we improve both the accuracy and the efficiency of pose recovery. Initially, for the scene images $I_{tar}$ and $I_{est}$, we extract feature points and compute the set of matched feature-point pairs $S_{tar,est}$. For $I_{est}$, NeRF provides its corresponding depth $D_{est}$. Thus, combining $D_{est}$ and $S_{tar,est}$, the pose estimation problem is transformed into a PnP problem, and the EPnP algorithm is employed to rapidly solve for an appropriate initial pose. Finally, starting from this more accurate pose initialization, we iteratively fit the image to better approximate $T_{tar}$, obtaining a higher-precision camera pose $P_C$. The corresponding dense depth map is generated through the neural radiance field. Following the reconstruction method in Section 2.1.3, the texture details of the image are projected onto the reconstructed scene surface.
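A hedged sketch of the EPnP initialization follows; the pose convention (camera-to-world), the pixel indexing, and the helper name are assumptions used for illustration, not the paper’s implementation.

```python
import cv2
import numpy as np

def epnp_initial_pose(pts2d_tar, pts2d_est, depth_est, K, T_est_c2w):
    """Hypothetical sketch of the EPnP initialization described above.

    pts2d_tar / pts2d_est: (N, 2) matched pixel coordinates in I_tar and I_est
    depth_est:             NeRF depth map of I_est
    T_est_c2w:             4x4 camera-to-world pose of I_est (assumed convention)
    Returns a rotation matrix and translation vector that seed the
    iNeRF-style pose refinement.
    """
    # lift the matched pixels of I_est to 3D using its NeRF depth
    z = depth_est[pts2d_est[:, 1].astype(int), pts2d_est[:, 0].astype(int)]
    pts_cam = (np.linalg.inv(K) @ np.c_[pts2d_est, np.ones(len(pts2d_est))].T) * z
    R, t = T_est_c2w[:3, :3], T_est_c2w[:3, 3:4]
    pts_world = (R @ pts_cam + t).T
    # EPnP solves the resulting 3D-2D correspondence problem for I_tar
    ok, rvec, tvec = cv2.solvePnP(pts_world.astype(np.float64),
                                  pts2d_tar.astype(np.float64),
                                  K.astype(np.float64), None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R_tar, _ = cv2.Rodrigues(rvec)
    return ok, R_tar, tvec
```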

2.2.2. UAV Large-Scale Scene Voxel Mesh Modeling

To reconstruct more detailed scene models, it is crucial to consider the impact of lighting on the reconstructed textures and project the rich textures, combined with the lighting model, to appropriate distances. To achieve this, we first use Poisson reconstruction to convert the previously generated point cloud into a mesh model. Poisson reconstruction is an implicit surface-based surface reconstruction method that can recover a smooth surface from an unordered point cloud. For the point cloud obtained from the previous dense reconstruction, we estimate the normal vector for each point. These normal vectors are then used as samples for the gradient field. We solve a Poisson equation to obtain an implicit function, where the zero-level set represents the desired surface. The Poisson equation can be expressed as
$\nabla^2 f = \nabla \cdot v$
where $\nabla$ represents the gradient operator, $v$ is a vector field obtained by interpolating the normal vectors of the point cloud, and $f$ is the unknown implicit function. To solve this equation, we discretize space into an octree grid. Then, at each grid node, an unknown variable $f_i$ is defined to represent the value of the implicit function at that node. We can use finite difference methods to approximate the Poisson equation, resulting in a system of linear equations:
$Lf = b$
where $L$ is the Laplacian matrix, $f$ is a vector composed of all $f_i$ values at the grid nodes, and $b$ is a vector composed of the values of $\nabla \cdot v$ at the grid nodes. We utilize the preconditioned conjugate gradient method as an iterative algorithm to solve this system of linear equations, obtaining an approximate solution for $f$. Finally, we use an isosurface extraction algorithm, marching cubes, to extract the zero-level set of $f$, which represents the reconstructed surface.
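For prototyping, this Poisson step can be reproduced with an off-the-shelf implementation such as Open3D; the sketch below is written under that assumption, and the octree depth and normal-estimation parameters are illustrative values rather than the paper’s settings.

```python
import open3d as o3d

# load the dense point cloud produced in Section 2.1.3 (file name is illustrative)
pcd = o3d.io.read_point_cloud("dense_cloud.ply")

# estimate and consistently orient per-point normals, which sample the gradient field v
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=2.0, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(k=15)

# screened Poisson reconstruction; the octree depth controls the grid resolution
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10)
o3d.io.write_triangle_mesh("scene_mesh.ply", mesh)
```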
Next, we use a virtual laser scanner to scan the mesh from multiple angles and then employ a kd-tree to find the nearest surface points. The sign of the signed distance function (SDF) is determined based on either normals or a depth buffer. To reconstruct the initial SDF from the mesh model, we simulate a virtual laser scanner that projects the mesh from different directions, producing a series of depth images. Each pixel on these depth images corresponds to a point in space. We use the kd-tree algorithm to find the nearest neighbor point on the mesh for each pixel, i.e., the nearest surface point, and calculate the distance between these two points as the SDF value for that pixel.
To determine the sign of the SDF (whether the point is inside or outside the surface), we employ two methods based on the characteristics of different implicit surfaces. For smooth and continuous surfaces, we use the angle between the point’s normal and the surface normal: if the angle is greater than 90 degrees, the point is considered inside; otherwise, it is considered outside. For complex or non-smooth surfaces, we introduce a depth buffer. If the depth value of a point is greater than the depth value of the corresponding surface point, the point is considered inside; otherwise, it is considered outside. This dynamic selection of methods based on the surface type ensures optimal results. Ultimately, we obtain the SDF values for each point, completing the initial SDF reconstruction.
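The normal-based sign rule can be sketched with a kd-tree nearest-neighbour query, as below; this simplified variant signs the distance by the side of the surface the query point falls on and omits the depth-buffer fallback used for non-smooth surfaces.

```python
import numpy as np
from scipy.spatial import cKDTree

def signed_distance(query_pts, surf_pts, surf_normals):
    """Approximate SDF values for query points against sampled surface points."""
    dist, idx = cKDTree(surf_pts).query(query_pts)       # nearest surface point
    to_query = query_pts - surf_pts[idx]                 # vector surface -> query
    # positive (outside) when the query lies on the outward-normal side,
    # negative (inside) otherwise
    sign = np.sign(np.einsum('ij,ij->i', to_query, surf_normals[idx]))
    sign[sign == 0] = 1.0
    return sign * dist
```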
Finally, we utilize spatially varying spherical harmonic functions [40] to solve for the reflectance of each image frame and the scene illumination. Employing shape-from-shading (SfS) techniques, we perform local geometric optimization for each voxel within the SDF volume to minimize the reprojection errors of the image frames and ensure voxel smoothness. We iterate through these steps until convergence or until the maximum iteration count is reached. This process results in a high-quality three-dimensional model with intricate geometric details and consistent surface textures.
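For reference, the low-frequency illumination term in such SfS refinements is commonly represented by a second-order spherical-harmonic expansion; the following minimal evaluation sketch illustrates the idea (it is not the paper’s spatially varying implementation [40]).

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical-harmonic basis values for a unit normal n
    (second-order SH, as commonly used for low-frequency lighting)."""
    x, y, z = n
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ])

def shaded_intensity(albedo, normal, sh_coeffs):
    # Lambertian shading under the estimated low-frequency illumination;
    # an SfS step compares this prediction with the observed image intensity
    return albedo * float(sh_coeffs @ sh_basis(normal))
```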

3. Experiments

To demonstrate the effectiveness of our method, we conducted experiments on five different scenes, including urban, rural, countryside, mountain, and campus scenarios. The test data consisted of multiscale aerial images covering areas of several square kilometers. The UAV used for data collection was a DJI Mavic 2. We conducted aerial operations at an altitude of 120 m, capturing photographs at a 30° angle, with a final image resolution of 4000 × 3000 pixels. We employed a ground station for UAV positioning to ensure an image overlap rate exceeding 85%. The data included regions with rich textures (buildings), repetitive textures (roads and vegetation), and non-textured areas (water). Additionally, since the collected data were not acquired simultaneously, they exhibited variations in lighting conditions, making precise 3D reconstruction of the entire dataset challenging. We evaluated the dataset from multiple perspectives. In the implicit modeling phase, we compared our method to several advanced techniques (NeRF [28], TensoRF [41], Mega-NeRF [33]). In the explicit modeling phase, we compared our results to those of various existing aerial modeling software packages (COLMAP 3.9, Context Capture 2023, PhotoScan 2023, Pix4D 4.5.6).

3.1. Implicit Modeling for Image Generation

We tested the performance of implicit model generation, with the results shown in Figure 3 and Table 1. PSNR (peak signal-to-noise ratio), SSIM (structural similarity index), and LPIPS (learned perceptual image patch similarity) are metrics used for assessing image quality. Significant improvements were achieved in both the visual quality and evaluation metrics. The neural radiance fields supervised by depth information outperformed methods relying solely on visual images, exhibiting fewer artifacts and achieving a photo-realistic quality. This indicates that the new viewpoint images generated by our method are sufficiently realistic to be included in the co-visibility clusters to compensate for the missing views required for reconstruction.
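As a reference for how these metrics could be computed per rendered/ground-truth image pair, a short sketch is given below; the choice of scikit-image (>= 0.19) and the lpips package is an assumption about tooling, not the paper’s evaluation code.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # learned perceptual metric

def eval_view(rendered, gt):
    """rendered, gt: HxWx3 float arrays with values in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, rendered, data_range=1.0)
    ssim = structural_similarity(gt, rendered, channel_axis=-1, data_range=1.0)
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    lp = lpips_fn(to_t(rendered), to_t(gt)).item()  # LPIPS expects inputs in [-1, 1]
    return psnr, ssim, lp
```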

3.2. The 3D Point Cloud Reconstruction Accuracy

Our GPU-accelerated offline modeling framework, based on the coupling of implicit–explicit representations for large-scale UAV aerial scenes, strikes a balance between reconstruction accuracy and efficiency in complex aerial environments. It is suitable for rapid presentation and analysis in practical applications. In order to demonstrate the effectiveness of our method based on implicit co-visibility clusters, we conducted ablation experiments by subsampling video frames. We simulated scenarios for non-professional UAV aerial scenes (overlap was only 50%). The results generated using traditional methods such as COLMAP for low-overlap images are often poor. Figure 4 presents the comparative results of point cloud reconstruction for low-overlap scenes.
Our method achieves competitive quality in sparse and dense reconstruction compared to software such as Context Capture 2023, Pix4D 4.5.6, and COLMAP 3.9. To assess the accuracy, we randomly selected several ROIs from the point cloud, searched for their nearest points in the reference model, and computed their mean Euclidean distance (MD) and standard deviation (SD). The results are provided in Table 2 (with PhotoScan as the reference).
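The MD/SD evaluation reduces to nearest-neighbour distance statistics between sampled ROI points and the reference cloud; a short sketch (SciPy kd-tree, hypothetical array inputs):

```python
import numpy as np
from scipy.spatial import cKDTree

def md_sd(roi_points, reference_points):
    """Mean Euclidean distance (MD) and standard deviation (SD) of ROI points
    of the reconstructed cloud against a reference point cloud."""
    dists, _ = cKDTree(reference_points).query(roi_points)
    return float(dists.mean()), float(dists.std())
```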
To further demonstrate the accuracy and quality of our method in the dense point cloud reconstruction of natural scenes, we compare our approach to RTSfM [42], a state-of-the-art method for 3D dense point cloud reconstruction that offers an efficient, high-precision solution for large-baseline, high-resolution aerial imagery and can generate large-scale, high-quality dense point cloud models in real time.
Additionally, to illustrate our model’s performance, we use very large aerial images (rural and mountain scenarios) as test data. These datasets have a wide range of capture times, different textures, and occlusions, which pose significant challenges for dense 3D point cloud reconstruction. As shown in Table 3, the point cloud results generated by our method are on a par with the state-of-the-art RTSfM. This indicates that our reconstruction accuracy is highly competitive.

3.3. The Accuracy of 3D Mesh Modeling

We tested the final mesh modeling results using PhotoScan as the baseline. We randomly sampled the triangular faces of the model and calculated the face centroids as sampling points. Similarly, we evaluated our final mesh modeling results by calculating their MD (mean Euclidean distance) and SD (standard deviation). Table 4 shows the superiority of our method, which leverages co-visibility constraints for large-scale scene representation, uses neural implicit modeling to compensate for missing views, and finally, accurately reconstructs the point cloud through dense SLAM. After texture enhancement and smoothing, we obtain high-quality mesh models. It is evident that our method achieves good accuracy in most scenes, and it ranks first in terms of stability across all scenarios. Finally, in Figure 5, we show the final modeling results of our method for all the scenes we considered.

4. Conclusions

Due to challenges such as the trajectories of UAV data acquisition and environmental noise, UAV large-scale scene modeling often yields low model accuracy, incompleteness, and blurry geometric structures and textures. This paper proposes a novel implicit–explicit coupling framework for high-accuracy 3D modeling of UAV large-scale scenes. It comprises an implicit co-visibility-cluster-guided dense 3D reconstruction of large-scale scenes and an implicit 3D pose-recovery-based mesh modeling and surface texture enhancement.
Initially, the framework utilizes implicit rendering to synthesize new views, ensuring spatiotemporal consistency in co-visible relationships. Subsequently, within the co-visibility clusters, it employs implicit scene models to restore missing depth information from explicit multi-view pixel matching. Finally, the framework jointly optimizes image and depth information fused into the signed distance function (SDF) and estimates spatially varying lighting for building surface texture images and geometric shapes, resulting in high-quality 3D models of buildings. In comparison to existing methods, our approach achieves significant improvements in accuracy and reconstruction details, demonstrating strong competitiveness against commercial software.
Nevertheless, limitations exist, such as vulnerability to the influence of dynamic objects and restricted generalization to new scenes, necessitating substantial training data for different scenarios. On the other hand, due to the extensive use of implicit modeling in this paper to compensate for perspective and depth information loss caused by occlusion or blurriness, our approach inherits limitations from the neural radiance field (NeRF) method when dealing with large-scale, complex, or dynamic three-dimensional scenes involving UAVs. These limitations include:
(1) Memory and Computational Efficiency Issues: NeRF typically demands significant memory for storing and optimizing neural network parameters. Especially, as the scene size increases, the required parameter count sharply rises. Training time and computational costs significantly increase with the complexity and scale of the scene, impacting the practical applicability of our method in large-scale UAV scene modeling.
(2) Resolution Limitation Issues: The modeling resolution produced by our method is constrained by the number of points sampled by NeRF and the learning capacity of the network. In large scenes, maintaining overall and local accuracy may necessitate higher-resolution sampling, greatly escalating computational complexity. Therefore, the model may struggle to accurately capture intricate geometric and textural details in distant or extensive areas.
(3) Scalability and Updatability Issues in Scene Reconstruction: Original NeRF methods face challenges in achieving incremental updates and expansions for continuously expanding large or dynamically changing scenes. In other words, efficient learning and model updates focusing on newly added or altered parts are difficult. To address this, new algorithmic frameworks need to be designed to support modular, chunk-based, or adaptive updates, better meeting the modeling requirements of large-scale and dynamic environments.
Our current method solely focuses on dense 3D modeling and model surface texture accuracy in large-scale UAV scenarios. However, the resulting 3D scene models lack interactivity, editability, necessary semantic information for users, and generalization across different dynamic and static scenes, making them challenging for direct practical applications. Hence, future enhancements in this work will consider:
(1) Integration of Semantic and Instance Segmentation: Developing a 3D semantic reconstruction framework for large UAV scenes that integrates semantic information, not only for geometric modeling but also for semantic understanding of the scene and object-level decoupling mapping.
(2) Interactivity and Editability: Creating a large-scale reconstruction system with enhanced interactivity and editability, allowing users to modify or update constructed scene models in real time.
(3) Generalization and Open-World Adaptability: Improving the model’s generalization ability to adapt to different environmental and lighting conditions, as well as entirely new and unseen scenes.
In conclusion, the current research outcomes of this work can directly provide high-quality 3D map grid models for various mobile robots (such as autonomous cars and UAVs). The related technologies can also offer technical support for virtual roaming, urban planning, military simulation, forensic investigation, VR gaming, and other applications in virtual reality intelligence, demonstrating significant practical value.

Author Contributions

Software, X.L.; Validation, X.L.; Investigation, X.L.; Data curation, X.L.; Writing—original draft, X.L.; Writing—review & editing, S.X.; Supervision, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Beijing Natural Science Foundation (No. JQ23014) and in part by the National Natural Science Foundation of China (No. 62271074).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Xu, S.; Zhao, Y.; Sun, J.; Xu, S.; Zhang, X. Aerial orthoimage generation for UAV remote sensing: Review. Inf. Fusion 2023, 89, 91–120. [Google Scholar] [CrossRef]
  2. Haala, N.; Kada, M. An update on automatic 3D building reconstruction. ISPRS J. Photogramm. Remote Sens. 2010, 65, 570–580, ISPRS Centenary Celebration Issue. [Google Scholar] [CrossRef]
  3. Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D.; Breitkopf, U.; Jung, J. Results of the ISPRS benchmark on urban object detection and 3D building reconstruction. ISPRS J. Photogramm. Remote Sens. 2014, 93, 256–271. [Google Scholar] [CrossRef]
  4. Snavely, N.; Seitz, S.M.; Szeliski, R. Photo tourism: Exploring photo collections in 3D. In Proceedings of the ACM Siggraph 2006 Papers, Boston, MA, USA, 30 July–3 August 2006; pp. 835–846. [Google Scholar] [CrossRef]
  5. Wu, C. Towards Linear-Time Incremental Structure from Motion. In Proceedings of the 2013 International Conference on 3D Vision—3DV 2013, Seattle, WA, USA, 29 June–1 July 2013; pp. 127–134. [Google Scholar] [CrossRef]
  6. Moulon, P.; Monasse, P.; Marlet, R. Adaptive Structure from Motion with a Contrario Model Estimation. In Computer Vision—ACCV 2012: Proceedings of the 11th Asian Conference on Computer Vision, Daejeon, Republic of Korea, 5–9 November 2012, Revised Selected Papers, Part IV; Springer: Berlin/Heidelberg, Germany, 2013; pp. 257–270. [Google Scholar]
  7. Moisan, L.; Moulon, P.; Monasse, P. Automatic Homographic Registration of a Pair of Images, with A Contrario Elimination of Outliers. Image Process. Line 2012, 2, 56–73. [Google Scholar] [CrossRef]
  8. Cui, H.; Gao, X.; Shen, S.; Hu, Z. HSfM: Hybrid Structure-from-Motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  9. Lhuillier, M.; Quan, L. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 418–433. [Google Scholar] [CrossRef] [PubMed]
  10. Wu, T.P.; Yeung, S.K.; Jia, J.; Tang, C.K. Quasi-dense 3D reconstruction using tensor-based multiview stereo. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1482–1489. [Google Scholar] [CrossRef]
  11. Hamzah, R.A.; Kadmin, A.F.; Hamid, M.S.; Ghani, S.F.A.; Ibrahim, H. Improvement of stereo matching algorithm for 3D surface reconstruction. Signal Process. Image Commun. 2018, 65, 165–172. [Google Scholar] [CrossRef]
  12. Romanoni, A.; Delaunoy, A.; Pollefeys, M.; Matteucci, M. Automatic 3D reconstruction of manifold meshes via delaunay triangulation and mesh sweeping. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–8. [Google Scholar] [CrossRef]
  13. Wang, P.S.; Liu, Y.; Guo, Y.X.; Sun, C.Y.; Tong, X. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. 2017, 36, 1–11. [Google Scholar] [CrossRef]
  14. Ravendran, A.; Bryson, M.; Dansereau, D.G. Burst imaging for light-constrained structure-from-motion. IEEE Robot. Autom. Lett. 2021, 7, 1040–1047. [Google Scholar] [CrossRef]
  15. Lao, Y.; Ait-Aider, O.; Bartoli, A. Rolling shutter pose and ego-motion estimation using shape-from-template. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–482. [Google Scholar]
  16. Schonberger, J.L.; Frahm, J.M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4104–4113. [Google Scholar]
  17. Ye, X.; Ji, X.; Sun, B.; Chen, S.; Wang, Z.; Li, H. DRM-SLAM: Towards dense reconstruction of monocular SLAM with scene depth fusion. Neurocomputing 2020, 396, 76–91. [Google Scholar] [CrossRef]
  18. Yousif, Y.M.; Hatem, I. Video Frames Selection Method for 3D Reconstruction Depending on ROS-Based Monocular SLAM. In Robot Operating System (ROS) The Complete Reference (Volume 5); Springer: Cham, Switzerland, 2021; pp. 351–380. [Google Scholar]
  19. Cheng, X.; Wang, P.; Yang, R. Depth estimation via affinity learned with convolutional spatial propagation network. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 103–119. [Google Scholar]
  20. Fan, Y.; Zhang, Q.; Tang, Y.; Liu, S.; Han, H. Blitz-SLAM: A semantic SLAM in dynamic environments. Pattern Recognit. 2022, 121, 108225. [Google Scholar] [CrossRef]
  21. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
  22. Rosinol, A.; Leonard, J.J.; Carlone, L. Nerf-slam: Real-time dense monocular slam with neural radiance fields. arXiv 2022, arXiv:2210.13641. [Google Scholar]
  23. Shivakumar, S.S.; Nguyen, T.; Miller, I.D.; Chen, S.W.; Kumar, V.; Taylor, C.J. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 13–20. [Google Scholar]
  24. Chen, X.; Zhu, X.; Liu, C. Real-Time 3D Reconstruction of UAV Acquisition System for the Urban Pipe Based on RTAB-Map. Appl. Sci. 2023, 13, 13182. [Google Scholar] [CrossRef]
  25. Chen, K.; Lai, Y.K.; Wu, Y.X.; Martin, R.; Hu, S.M. Automatic semantic modeling of indoor scenes from low-quality RGB-D data using contextual information. ACM Trans. Graph. 2014, 33, 1–12. [Google Scholar] [CrossRef]
  26. Choi, S.; Zhou, Q.Y.; Koltun, V. Robust Reconstruction of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  27. Ning, X.; Gong, L.; Li, F.; Ma, T.; Zhang, J.; Tang, J.; Jin, H.; Wang, Y. Slicing components guided indoor objects vectorized modeling from unilateral point cloud data. Displays 2022, 74, 102255. [Google Scholar] [CrossRef]
  28. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  29. Jiang, C.; Shao, H. Fast 3D Reconstruction of UAV Images Based on Neural Radiance Field. Appl. Sci. 2023, 13, 10174. [Google Scholar] [CrossRef]
  30. Zhang, K.; Riegler, G.; Snavely, N.; Koltun, V. Nerf++: Analyzing and improving neural radiance fields. arXiv 2020, arXiv:2010.07492. [Google Scholar]
  31. Xiangli, Y.; Xu, L.; Pan, X.; Zhao, N.; Rao, A.; Theobalt, C.; Dai, B.; Lin, D. Citynerf: Building nerf at city scale. arXiv 2021, arXiv:2112.05504. [Google Scholar]
  32. Tancik, M.; Casser, V.; Yan, X.; Pradhan, S.; Mildenhall, B.; Srinivasan, P.P.; Barron, J.T.; Kretzschmar, H. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8248–8258. [Google Scholar]
  33. Turki, H.; Ramanan, D.; Satyanarayanan, M. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12922–12931. [Google Scholar]
  34. Oleynikova, H.; Millane, A.; Taylor, Z.; Galceran, E.; Nieto, J.; Siegwart, R. Signed distance fields: A natural representation for both mapping and planning. In RSS 2016 Workshop: Geometry and Beyond-Representations, Physics, and Scene Understanding for Robotics; University of Michigan: Ann Arbor, MI, USA, 2016. [Google Scholar]
  35. Teed, Z.; Deng, J. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Adv. Neural Inf. Process. Syst. 2021, 34, 16558–16569. [Google Scholar]
  36. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 402–419. [Google Scholar]
  37. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  38. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (ToG) 2022, 41, 1–15. [Google Scholar] [CrossRef]
  39. Yen-Chen, L.; Florence, P.; Barron, J.T.; Rodriguez, A.; Isola, P.; Lin, T.Y. inerf: Inverting neural radiance fields for pose estimation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1323–1330. [Google Scholar]
  40. Maier, R.; Kim, K.; Cremers, D.; Kautz, J.; Nießner, M. Intrinsic3D: High-quality 3D reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 3114–3122. [Google Scholar]
  41. Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 333–350. [Google Scholar]
  42. Zhao, Y.; Chen, L.; Zhang, X.; Xu, S.; Bu, S.; Jiang, H.; Han, P.; Li, K.; Wan, G. Rtsfm: Real-time structure from motion for mosaicing and dsm mapping of sequential aerial images with low overlap. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
Figure 1. Comparison with the SOTA open-source reconstruction method COLMAP (a general-purpose structure-from-motion and multi-view stereo pipeline with graphical and command-line interfaces) shows that our method handles UAV images with low overlap better, achieving the best performance.
Figure 2. Overall framework of our method. This work employs the NeRF model to supplement low-overlap co-visibility clusters, which are then processed by the explicit reconstruction module for dense reconstruction. Meanwhile, the reverse radiation field is utilized to recover missing camera poses and to achieve high-precision texture mapping through the illumination model. Ultimately, large-scale UAV scene modeling is achieved through dense reconstruction with texture details.
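To make the coupling summarized in Figure 2 easier to follow, the sketch below outlines the workflow as pseudocode: an implicit NeRF-style branch densifies low-overlap co-visibility clusters, an explicit branch performs dense reconstruction, and inverse radiation plus an illumination model recover missing poses and texture. All class and function names here are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of the implicit-explicit coupling workflow in
# Figure 2. Every name below is an illustrative placeholder, not the paper's API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class CoVisibilityCluster:
    """A group of UAV frames observing the same scene region."""
    frames: List[str] = field(default_factory=list)  # image paths
    poses: List[list] = field(default_factory=list)  # 4x4 camera-to-world matrices


def synthesize_missing_views(cluster: CoVisibilityCluster) -> CoVisibilityCluster:
    """Placeholder: render extra views from a NeRF-style implicit model to
    densify a low-overlap cluster (implicit branch)."""
    return cluster  # no-op stub


def dense_reconstruction(clusters: List[CoVisibilityCluster]):
    """Placeholder: run an MVS-style explicit reconstruction on the
    (augmented) clusters to obtain a dense point cloud."""
    return []  # stub point cloud


def recover_poses_and_texture(point_cloud, clusters: List[CoVisibilityCluster]):
    """Placeholder: invert the radiance field to recover missing camera poses
    and map texture details onto the mesh via an illumination model."""
    return point_cloud  # stub textured model


def reconstruct_scene(clusters: List[CoVisibilityCluster]):
    # 1) Implicit branch: fill in low-overlap co-visibility clusters.
    augmented = [synthesize_missing_views(c) for c in clusters]
    # 2) Explicit branch: dense point cloud reconstruction.
    cloud = dense_reconstruction(augmented)
    # 3) Pose recovery, texture mapping, and mesh modeling.
    return recover_poses_and_texture(cloud, augmented)
```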
Figure 3. Qualitative comparison between the baseline methods ((A) [28], (B) [41], (C) [33]) and our approach (D), alongside the ground truth (E). In large-scale scenes, all three baselines suffer from significant blur, artifacts, and noticeable noise, whereas our method performs well in this test.
Figure 4. The impact of overlap on point cloud reconstruction (ablation experiment; compared with COLMAP in a 50% overlap scene). The orange area shows the red area in greater detail. At low overlap, the model typically contains many gaps; after supplementing with implicit co-visibility clusters, denser point cloud results are obtained.
Figure 5. The results of the three-dimensional mesh reconstruction.
Table 1. A quantitative comparison on the datasets from the three scenes: the urban scene features numerous buildings with rich textures, the countryside scene contains many textureless water surfaces, and the rural scene contains many repetitive plant textures (a sketch of the metric computation follows the table).
Scene     | Urban                   | Rural                   | Countryside
Metric    | PSNR↑  | SSIM↑ | LPIPS↓ | PSNR↑  | SSIM↑ | LPIPS↓ | PSNR↑  | SSIM↑ | LPIPS↓
NeRF      | 19.531 | 0.417 | 0.622  | 20.105 | 0.509 | 0.562  | 21.651 | 0.535 | 0.541
TensoRF   | 20.912 | 0.452 | 0.531  | 20.678 | 0.577 | 0.491  | 23.800 | 0.670 | 0.478
Mega-NeRF | 23.096 | 0.466 | 0.510  | 22.625 | 0.545 | 0.477  | 23.508 | 0.642 | 0.431
Ours      | 24.887 | 0.712 | 0.373  | 25.307 | 0.757 | 0.358  | 25.463 | 0.811 | 0.208
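Table 1 reports PSNR, SSIM, and LPIPS on rendered versus ground-truth views. The snippet below is a minimal sketch of how these metrics are commonly computed; the use of NumPy, scikit-image, PyTorch, and the lpips package is an assumption about tooling, not a statement of the authors' exact evaluation code.

```python
# A minimal sketch of PSNR/SSIM/LPIPS for rendered vs. ground-truth uint8 RGB
# images. Library choices here are common defaults, assumed for illustration.

import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity


def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """PSNR in dB for uint8 RGB images (higher is better)."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)


def ssim(pred: np.ndarray, gt: np.ndarray) -> float:
    """SSIM for uint8 RGB images (higher is better); requires skimage >= 0.19."""
    return structural_similarity(pred, gt, channel_axis=-1, data_range=255)


_lpips_net = lpips.LPIPS(net="alex")  # perceptual metric (lower is better)


def lpips_dist(pred: np.ndarray, gt: np.ndarray) -> float:
    """LPIPS expects NCHW float tensors scaled to [-1, 1]."""
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        return _lpips_net(to_tensor(pred), to_tensor(gt)).item()
```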
Table 2. Accuracy comparison of 3D dense point cloud reconstruction (MD and SD are measured in meters; a sketch of the accuracy computation follows the table).
Dataset     | OpenMVS        | COLMAP         | Context Capture | Pix4D          | Ours
            | MD    | SD     | MD    | SD     | MD    | SD      | MD    | SD     | MD    | SD
Urban       | 0.385 | 0.291  | 0.119 | 0.112  | 0.187 | 0.166   | 0.192 | 0.152  | 0.111 | 0.107
Rural       | 0.853 | 0.210  | 0.141 | 0.176  | 0.306 | 0.229   | 0.278 | 0.261  | 0.121 | 0.116
Countryside | 0.062 | 0.161  | 0.335 | 0.256  | 0.295 | 0.301   | 0.164 | 0.166  | 0.138 | 0.096
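Tables 2–4 report MD and SD in meters. A reasonable reading, and the assumption behind the sketch below, is that MD and SD are the mean and standard deviation of nearest-neighbor distances from the reconstructed cloud to the reference cloud; the SciPy-based helper is illustrative only and not the paper's exact protocol.

```python
# A minimal sketch of a point-cloud accuracy measure consistent with Tables 2-4,
# assuming MD/SD denote the mean and standard deviation (in meters) of
# nearest-neighbor distances from the reconstructed cloud to the reference cloud.

import numpy as np
from scipy.spatial import cKDTree


def cloud_accuracy(reconstructed: np.ndarray, reference: np.ndarray):
    """Both inputs are (N, 3) point arrays in a common metric frame.

    Returns (MD, SD): mean and standard deviation of per-point distances
    from each reconstructed point to its nearest reference point.
    """
    tree = cKDTree(reference)
    distances, _ = tree.query(reconstructed, k=1)
    return float(distances.mean()), float(distances.std())


if __name__ == "__main__":
    # Toy example: a noisy copy of a synthetic reference cloud.
    rng = np.random.default_rng(0)
    ref = rng.uniform(0, 100, size=(10_000, 3))
    rec = ref + rng.normal(scale=0.1, size=ref.shape)
    md, sd = cloud_accuracy(rec, ref)
    print(f"MD = {md:.3f} m, SD = {sd:.3f} m")
```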
Table 3. RTSfM [42] and our method are evaluated for accuracy on new data (scenes 4 and 5 are large-scale scene data; ground truth is obtained from PhotoScan; MD and SD are measured in meters).
Method   | RTSfM          | Ours
         | MD    | SD     | MD    | SD
Mountain | 0.762 | 0.583  | 0.626 | 0.508
Campus   | 0.416 | 0.441  | 0.213 | 0.437
Table 4. The accuracy of 3D mesh modeling (the best results are highlighted for reference), with PhotoScan as the benchmark.
Dataset     | COLMAP          | Context Capture | Pix4D          | Ours
            | MD    | SD      | MD    | SD      | MD    | SD     | MD    | SD
Urban       | 1.405 | 0.953   | 1.563 | 0.661   | 1.732 | 0.891  | 1.412 | 0.553
Campus      | 2.528 | 2.611   | 1.345 | 1.737   | 3.022 | 2.145  | 1.257 | 1.316
Rural       | 0.953 | 0.776   | 3.589 | 2.718   | 1.658 | 0.946  | 0.736 | 0.641
Mountain    | 8.631 | 13.253  | 6.763 | 5.707   | 3.713 | 2.017  | 2.828 | 1.716
Countryside | 2.219 | 1.546   | 1.894 | 1.993   | 0.942 | 1.033  | 1.106 | 0.977
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
