Article

Long-Range 3D Reconstruction Based on Flexible Configuration Stereo Vision Using Multiple Aerial Robots

by Borwonpob Sumetheeprasit *, Ricardo Rosales Martinez, Hannibal Paul and Kazuhiro Shimonomura
Department of Robotics, Ritsumeikan University, Kusatsu 525-8577, Shiga, Japan
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(2), 234; https://doi.org/10.3390/rs16020234
Submission received: 15 November 2023 / Revised: 28 December 2023 / Accepted: 3 January 2024 / Published: 7 January 2024

Abstract: Aerial robots, or unmanned aerial vehicles (UAVs), are widely used in 3D reconstruction tasks employing a wide range of sensors. In this work, we explore the use of wide baseline and non-parallel stereo vision for fast and movement-efficient long-range 3D reconstruction with multiple aerial robots. Each viewpoint of the stereo vision system is distributed on a separate aerial robot, facilitating the adjustment of various parameters, including baseline length, configuration axis, and inward yaw tilt angle. Additionally, multiple aerial robots with different sets of parameters can be used at the same time: multiple baselines allow 3D monitoring at several depth ranges simultaneously, and the combined use of horizontal and vertical stereo improves the quality and completeness of depth estimation. Depth estimation at distances of up to 400 m with less than 10% error using only 10 m of active flight distance is demonstrated in simulation. Additionally, depth estimation at distances of up to 100 m with flight distances of up to 10 m along the vertical and horizontal axes is demonstrated in an outdoor mapping experiment using the developed prototype UAVs.

1. Introduction

Unmanned aerial vehicles (UAVs) are widely used for image acquisition in aerial surveillance and reconnaissance tasks [1]. In recent years in particular, multi-rotor platforms, owing to their mobility in deployment and operation, have enabled use in small- to medium-sized areas of operation. Various approaches to 3D reconstruction and mapping using one or a fusion of sensors on multi-rotor platforms exist, such as photogrammetry and structure from motion (SfM) for medium- to high-altitude mapping [2,3], light detection and ranging (LiDAR)-based 3D mapping [4,5,6], and visual simultaneous localisation and mapping (SLAM)-based mapping of global navigation satellite system (GNSS)-denied areas [7,8].
The mentioned approaches require an extensive movement distance of the UAV in order to cover a large area, since the area covered at a time is limited by the maximum range of the sensor used. Therefore, in the preceding work, a 3D reconstruction and mapping remote sensing method using multi-rotor aerial robots and variable baseline, flexible configuration stereo vision was introduced as a flight-time- and movement-efficient mapping approach [9]. That work explored the use of multiple baselines and their effective depth estimation ranges and introduced a baseline fusion algorithm, which combines the point clouds generated from each baseline. The resultant point cloud has a broad depth estimation range compared with using each individual baseline alone. The possibility and concept of using variable baseline and flexible configuration stereo as a low-altitude, long-range mapping approach was established as a distinctive alternative to the aforementioned approaches. Variable baseline and flexible configuration stereo enable medium- to large-scale mapping with significantly reduced movement distance, reduced processing time per area gained, and real-time monitoring capability.
In this work, various system improvements are introduced. A semi-autonomous prototype UAV is developed to extend the capability of the system. This work explores the use of even wider baselines, of up to 10 m, in long-range 3D reconstruction tasks. Furthermore, this work introduces the advantage of non-parallel stereo by utilising an inward yaw tilt of the stereo pair, which increases stereo matching performance and the completeness of depth estimation when a wide baseline is used. Additionally, this work shows the use of multiple aerial robots, increasing the number of viewpoints of the system, and introduces the scalability of the system by adding further mapping agents. In this fashion, multiple stereo vision configurations can be used at the same time, including the simultaneous use of horizontal and vertical stereo and multiple baselines. Using more viewpoints leads to improved depth estimation accuracy and completeness as well as an extended depth estimation range. Finally, an improvement in the tracking and pose calculation method using inside-out sensors, such as depth cameras and tracking cameras, and an augmented reality (AR) marker is proposed. This design completely removes the need for a portable motion capture camera as a ground tracking station, as implemented in the previous work, while allowing for the re-calibration of the relative pose without the need to return to the vicinity of a ground tracking station.
Several applications are envisioned for the proposed system, exploiting its advantages, i.e., rapidity, movement efficiency, and the ability to reconstruct the 3D map of a large area without the need to enter the area of interest. The proposed system can be used as an early reconnaissance system for another operator robot, similar to the system proposed by Kim et al. (2019) [10]. The system creates a rough map of an area and maintains 3D monitoring of an area of operation without entering it, which might otherwise pose a risk to the equipment. Once a rough estimate of the area of operation is known, another ground robot or aerial robot can be sent into the area to carry out the mission without the need to create the map by itself. An example use case is a disaster-stricken area where the situation is unpredictable and time is of the essence, but directly sending in a rescue robot is deemed risky. The proposed system can create an approximate map of the area to determine a safe path and continue to guide the rescue robot from a distance. Another use case is the mapping of a metropolis where the number of vehicles and pedestrians makes sending a UAV into the area dangerous. The proposed system can create a long-range map from a distance without the need to enter the city and without the need to fly at a high altitude, which would disrupt the operation of aircraft.

2. Concept

In this work, we distribute the viewpoints of the stereo vision system on separate aerial robots. Each viewpoint is, therefore, independently and conveniently adjustable in position and orientation, allowing for the adjustment of stereo parameters: baseline length; configuration axis, namely, horizontal configuration and vertical configuration; and inward yaw tilt angle between each viewpoint. The adjustment of the above parameters is thus called flexible configuration stereo.
Flexible configuration stereo allows for the convenient realisation of wide baseline stereo, enabled by the mobility and the position- and orientation-holding ability of multi-rotor aerial robots. The use of a large baseline essentially decreases the error of stereo depth estimation at long range by reducing the low confidence of the estimated depth caused by the pixel-per-space limitation [11]. However, if the baseline-to-object-distance ratio is too large (we refer to such a baseline as excessive), the disparity image deteriorates. In such cases, the disparity computation of target objects or areas becomes noisy or fails entirely, leaving void spaces in the resulting disparity image due to the occlusion of objects close to the camera frames [12]. In this work, we emphasise the use of the inward yaw tilt angle to counter the degradation of disparity due to an excessive baseline.
Due to the limited effective range of each individual baseline, reliable depth estimation over a broad variation in distance is difficult; simulation results in the previous work [9] pointed out this issue. In this work, a simplified, scalable system design is introduced. The system is implemented in an easily scalable fashion, using only inside-out sensors for both localisation and relative pose calculation. As a foundation, each aerial robot uses, but is not limited to, an identical set of sensors, frame, and flight hardware. The modular design allows the number of aerial robots to be increased. The use of multiple mapping agents, each utilising a different configuration, such as multiple baselines at once, is demonstrated in this paper to enhance the tolerance to distance variation and broaden the effective range achieved by the system in a single instance.

3. Methodology

3.1. System Design

Two major factors are considered in the design of the system: the ability to operate without an external localisation or tracking system, and the scalability of the system. Therefore, a small quadrotor prototype UAV is developed for the realisation of the proposed system; the specifications are shown in Table 1. Figure 1 shows an overview of the system components. A small QAV250 quadrotor frame is chosen as the platform due to its small cross-section, which allows the UAVs to safely hover closer to each other without colliding or interfering with each other’s thrust. The minimum distance between two UAVs is around 1 m horizontally and around 3 m vertically; these values define the minimum baseline distance usable in each setup. A Pixhawk 6C flight controller with Ardupilot firmware is used for flight control. An Intel UP 4000 onboard PC is used for relaying commands and sensor data from and to a ground control station. The additional sensors used in the system are a front-facing Realsense D435F depth camera and a down-facing Realsense T265 tracking camera providing localisation. The global shutter camera of the D435F is utilised for 2D imaging of the scene. In this work, the depth data acquired from the D435F camera are not used; the possibility of using the depth data is discussed as future work in Section 5. The camera settings are shown in Table 2. The T265 tracking camera provides accurate, high-frequency 6-DOF localisation data of the UAV. Additionally, an AR marker [13] with a unique ID is attached to the back of each UAV. The AR marker is used for measuring the relative position and rotation of each UAV in the pose calibration process.

3.2. Data Flow, Communication and Time Synchronisation

A communication scheme is devised for the control of the UAVs as well as for information relaying. The main communication framework used is the Robot Operating System (ROS). Figure 2 shows the overall communication scheme. In each UAV, the onboard PC runs its own ROS master to relay information between the sensors and the flight controller. Likewise, the control PC (AMD Ryzen 7 6800H, 4.7 GHz, 16 logical cores) runs its own ROS master. Each element of the system communicates through a WiFi router using the IEEE 802.11ac 5.0 GHz standard. The crucial reason for each UAV to have its own ROS master, instead of a shared master on the control PC, is that, even though it complicates the overall communication between the UAVs and the control PC, it ensures that pose data are always relayed to the flight controller even when the connection to the control PC is unstable. As a result, the UAVs can continue to fly manually using a transmitter even if they do not receive commands from the control PC or there is a problem with the tracking system. To relay the communication from the ROS master of each UAV to the control PC, a multimaster communication relay package is used [14]. Only a selection of necessary topics is relayed to reduce the amount of data transferred over the limited WiFi bandwidth.
The communication between each UAV and the control PC is straightforward. From each UAV, image data, camera parameters, the localisation pose, and the state of the flight controller are sent to the command node on the control PC. From the control PC, setpoints for position control are sent to each UAV by the command node. The pose data obtained from the T265 camera and the calibrated pose described in Section 3.3 are also sent to the command node. The largest portion of the data is the colour image stream from the front-facing camera of each UAV. At a resolution of 640 × 480 pixels and 30 Hz, the raw data amount to roughly 222 Mbps. In practice, images reach the control PC at only about half that frequency with the currently implemented communication hardware.
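As a quick sanity check of that figure, the raw bandwidth of an uncompressed RGB8 stream at this resolution and frame rate can be computed directly (a minimal sketch; the values below are the ones quoted above):

```python
# Back-of-the-envelope check of the raw colour-image bandwidth per UAV.
width, height, channels = 640, 480, 3      # RGB8 frames from the D435F colour stream
frame_rate_hz = 30
bits_per_frame = width * height * channels * 8
raw_mbps = bits_per_frame * frame_rate_hz / 1e6
print(f"raw stream per UAV: {raw_mbps:.0f} Mbps")   # ~221 Mbps, in line with the text
```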
In order to synchronise the time stamps of messages between each UAV and the control PC, a network time protocol (NTP) server is implemented. The control PC’s clock is used as the NTP server, and the UAVs synchronise their time with it. Since NTP compensates for the time shift caused by communication delay, the received message time stamps refer to the same synchronised clock.

3.3. Tracking and Pose Calibration

In stereo vision, the quality of rectification based on relative orientation directly affects the quality of stereo matching [15]. Subsequently, the accuracy of the relative position directly affects the accuracy of depth estimation and consequently point cloud projection. Therefore, the accuracy of pose tracking is crucial for the overall performance and accuracy of a flexible configuration stereo system.
Each UAV of the proposed system tracks its own pose using the down-facing Realsense T265 tracking camera. According to Realsense specifications, the tracking camera accumulates a positional error of around 1% of the distance travelled. The tracking information is used for position control of the UAV. The control PC sends position setpoints to the UAVs. At each setpoint, the position is held until the imaging process is completed.
In the stereo vision system, at least two UAVs are operated simultaneously. In order to control the UAVs to hold the desired positions, pose calibration is required so that the UAVs know their relative poses. Pose calibration is performed at least once after take-off using the AR marker attached to the UAV directly in front. At the beginning, the UAVs take off manually. The UAVs are then controlled manually to form a column and hold their positions such that the AR marker is visible in the front-facing camera of the UAV behind. The pose of the AR marker of the UAV in front is acquired by marker pose estimation. The pose obtained is used for the relative pose calibration of the UAVs by calculating the offset of the localisation pose origins of the UAVs based on the AR marker pose using Equation (1). Figure 3 shows the vectors used in the calculation. In the case of multiple UAVs, they are named UAV $n$, UAV $n-1$, and so on down to UAV 1. The UAV in question, UAV $n$, is denoted as $D_n$, and the UAV preceding it in position, UAV $n-1$, is denoted as $D_{n-1}$. Here, $R_{O_n O_{n-1}}$ is the rotation matrix that rotates the localisation origin of UAV $n$, $O_n$, to the localisation origin of the previous UAV, $O_{n-1}$. Meanwhile, $R_{D_n}$ is the orientation of $D_n$ in its own localisation origin provided by the T265 camera, $R_A$ is the rotation from the front-facing camera of $D_n$ to the AR marker of the UAV in front, $D_{n-1}$, measured using marker pose estimation, and $R_{D_{n-1}}$ is the orientation of $D_{n-1}$ in its own localisation origin provided by the T265 camera. Subsequently, $T_{O_n O_{n-1}}$ is the translation from the origin of $D_n$ to the origin of UAV $D_{n-1}$. Here, $T_{D_n}$ is the position of $D_n$ in its own localisation origin provided by the T265 camera, $T_A$ is the translation from the front-facing camera to the AR marker of the UAV in front measured using marker pose estimation, and $T_{D_{n-1}}$ is the position of $D_{n-1}$ in its own localisation frame provided by the T265 camera. This calculation is performed for each UAV in the system. The localisation origin of the first UAV, denoted as $O_1$, is used as the origin point of the overall system. The calibration step is performed at least once before the actual mapping process. During the movement of the UAVs, if a UAV’s position appears visually different from the information shown on the control PC, re-calibration is required, and the UAVs are set to repeat the calibration process. The most effective stage at which to perform calibration is right before the mapping process, to reduce the chance of localisation drift causing errors in position control.
$$R_{O_n O_{n-1}} = R_{D_n} R_A R_{D_{n-1}}^{T}, \qquad T_{O_n O_{n-1}} = T_{D_n} + R_{D_n}\left(T_{C_n} + R_A\left(T_A + T_{D_{n-1}}\right)\right) \tag{1}$$
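For reference, a minimal NumPy sketch of Equation (1) is given below. It assumes that all rotations are 3 × 3 matrices and all translations are 3-vectors in the frames defined above, and that $T_{C_n}$ denotes the mounting offset of the front-facing camera on $D_n$ (an assumption, as this term is not defined explicitly in the text); the frame conventions follow the reconstruction above rather than a verified implementation.

```python
import numpy as np

def origin_offset(R_Dn, T_Dn, R_A, T_A, R_Dn1, T_Dn1, T_Cn):
    """Relative pose of localisation origin O_{n-1} with respect to O_n (Equation (1)).

    R_Dn, T_Dn   -- pose of D_n in its own origin O_n (from the T265)
    R_A,  T_A    -- pose of the AR marker of D_{n-1} seen from D_n's front camera
    R_Dn1, T_Dn1 -- pose of D_{n-1} in its own origin O_{n-1} (from the T265)
    T_Cn         -- assumed body-frame offset of the front camera on D_n
    """
    R_On_On1 = R_Dn @ R_A @ R_Dn1.T
    T_On_On1 = T_Dn + R_Dn @ (T_Cn + R_A @ (T_A + T_Dn1))
    return R_On_On1, T_On_On1
```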

3.4. Position Control, Baseline Selection and Image Acquisition

After the pose calibration of all the UAVs, the first UAV is moved to an ideal position where the desired 3D reconstruction target area is fully visible in the front-facing camera frame. The first UAV holds its position at a point of reference and sends this point of reference to other UAVs as well as the control PC to mark the stereo vision system origin. This stereo vision origin is the zero point of the movement of all the UAVs as well as the origin of tracking for position and yaw of all the UAVs in position control. Additional UAVs move to each desired checkpoint position and orientation based on this origin, depending on the desired configuration of the stereo vision system. At each checkpoint, the UAV holds the position for a certain period, usually a few seconds, until a sufficient number of images taken from within the vicinity of the setpoint are acquired. During that time period, the image data are acquired from the front-facing camera and sent to the control PC along with the tracking pose information, which is then processed into a point cloud. The UAV then moves to another checkpoint until the last checkpoint is reached and the reconstruction process is completed.
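The paper does not specify the command interface between the command node and the flight controller; purely as an illustration, the sketch below assumes a MAVROS bridge to the Ardupilot firmware and streams each checkpoint as a local position setpoint while the UAV holds and images are collected. The topic name, rates, and checkpoint values are assumptions.

```python
import rospy
from geometry_msgs.msg import PoseStamped

def hold_checkpoints(checkpoints, hold_time=3.0):
    """Stream each checkpoint as a position setpoint and hold it while images are collected."""
    pub = rospy.Publisher("/mavros/setpoint_position/local", PoseStamped, queue_size=1)
    rate = rospy.Rate(20)  # setpoints are streamed continuously while holding position
    for x, y, z in checkpoints:
        sp = PoseStamped()
        sp.pose.position.x, sp.pose.position.y, sp.pose.position.z = x, y, z
        t_end = rospy.Time.now() + rospy.Duration(hold_time)
        while not rospy.is_shutdown() and rospy.Time.now() < t_end:
            sp.header.stamp = rospy.Time.now()
            pub.publish(sp)
            rate.sleep()

if __name__ == "__main__":
    rospy.init_node("checkpoint_commander")
    # Illustrative horizontal sweep relative to the stereo vision origin (metres).
    hold_checkpoints([(0.0, 2.0, 0.0), (0.0, 4.0, 0.0), (0.0, 6.0, 0.0), (0.0, 8.0, 0.0)])
```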
The choice of camera position and orientation, referred to as the configuration of the stereo system, is guided by benchmark information from the simulation presented in Section 4. The configuration parameters include the vertical or horizontal stereo setup (the axis along which the UAVs hold position relative to the first UAV), the baseline distance (the distance between the UAVs), and the yaw tilt angle of the UAV (the inward tilt angle of the cameras in the case of horizontal stereo). Note that in the current system, an inward pitch tilt in the case of a vertical setup is not possible because the UAVs cannot hold position and pitch down at the same time. An additional camera-tilting mechanism would be required to achieve the pitch tilt.
The choice of parameters is based on benchmark data shown in the previous work [9] and on the simulation results in Section 4. A vertical or horizontal setup is chosen based on the shape and orientation of the objects in the scene as well as the area to be mapped. For example, the following apply:
  • In the horizontal setup, the movement axis is to the right of the primary UAV. Therefore, a sizable space is needed, depending on the baseline distance used, which in turn varies with the distance of the area of interest. Reconstruction of an open area, a field, or a far view of a structure is recommended. A horizontal setup also tends to be more effective at mapping objects with a vertically spanning shape, such as poles.
  • The movement axis of the vertical setup is upward; therefore, less space in the movement area is needed, which is ideal for mapping an area with less open space. The reconstruction of a small alleyway surrounded by buildings on two sides is more effective using a vertical setup, because translation along the horizontal axis would create occlusion from the buildings. Reconstruction of the shape of objects that span horizontally, for example a bridge, is more effective when a vertical setup is used due to the characteristics of stereo matching.
A mixture of both setups can be used with a single scene to create a more complete reconstruction as well.
Subsequently, the baseline distance is chosen based on the per-baseline error threshold of Equation (4), introduced in the preceding work [9], as well as on estimates from benchmark data in the simulation. The error estimation is based on the quantisation error equation of stereo depth estimation proposed by Rodriguez and Aggarwal (1988) [16]:
$$\epsilon_z = \frac{z^2}{b f} \, \epsilon_d \tag{2}$$
where $\epsilon_z$ is the depth estimation error in metres, $z$ is the target distance in metres, $b$ is the baseline distance in metres, $f$ is the focal length in pixels, and $\epsilon_d$ is the disparity error in pixels. Additionally, the maximum range of each baseline can be calculated using:
$$z_c = \frac{\epsilon_c \, b f}{\epsilon_d} \tag{3}$$
where $z_c$ is the maximum distance, and $\epsilon_c$ is the desired error threshold. In the current work, we specify a fixed set of baseline distances so that they are easy to understand, command, and monitor. The maximum and minimum distances required for mapping are inserted into Equation (4); the resulting maximum and minimum baseline distances are rounded, and intermediate baselines are generated at whole-number intervals between them. Since the baselines used in this work are much larger than in the previous work, the baseline lengths are specified at intervals of 1 m or 2 m; for example, the baselines used in the outdoor mapping experiment are 2, 4, 6, 8, and 10 m. However, in real use cases where the chosen baseline is excessive and the quality of stereo matching deteriorates, the baseline distances can be adjusted per situation, since the maximum usable baseline cannot be estimated in advance. Finally, the inward tilt angle is adjusted to compensate, at least partially, for an excessive baseline. Simulation results in Section 4 show that inward tilting of the cameras can reduce the degradation of stereo matching when a large baseline is used, as well as reduce the non-overlapping area of the stereo pair caused by the translation of the camera. In principle, the tilt angle is set so that the target object or area appears in whole in the images taken from all cameras, maximising the visibility of the object.
$$b = \frac{z_{max}^2}{\epsilon_c f} \, \epsilon_d \tag{4}$$
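The following sketch illustrates how Equations (2) and (3) can be used to screen candidate baselines, reading $\epsilon_c$ as a relative error threshold (e.g., 10%) consistent with the trimming thresholds used in Section 4. The focal length and disparity error values are illustrative assumptions, not values taken from the paper.

```python
def depth_error(z, b, f, eps_d=1.0):
    """Quantisation error of stereo depth estimation, Equation (2): z, b in metres; f, eps_d in pixels."""
    return (z ** 2) / (b * f) * eps_d

def max_range(b, f, eps_c=0.1, eps_d=1.0):
    """Maximum distance satisfying a relative error threshold eps_c, Equation (3)."""
    return eps_c * b * f / eps_d

# Illustrative baseline screening, assuming f = 600 px and eps_d = 1 px.
f = 600.0
for b in (2, 4, 6, 8, 10):
    print(f"{b} m baseline -> effective range up to {max_range(b, f):.0f} m")
```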

3.5. Image Processing and Point Cloud Reprojection

The image processing steps are similar to those of a traditional fixed-configuration stereo setup, except that the rectification and projection matrices need to be calculated in real time. The intrinsic matrices, namely the camera matrix and the distortion coefficient matrix, are obtained from the camera calibration process; in the case of the proposed system, the intrinsics are saved in the D435F camera and published in ROS along with the images. The calculated relative orientation is used in the image rectification process. Using the origin frame of UAV 1 as the reference frame, the images obtained from all the UAVs are rectified so that their pitch and roll are level with the pitch and roll origin of UAV 1 and their yaw is aligned with the yaw of the stereo vision origin. This specific position and rotation origin is denoted as $O_s$, the stereo vision origin. Therefore, the rectification matrix of each UAV $n$ in this system, $R_{rect_n}$, can be formed with Equation (5). Here, $R_n$ is the rotation of $D_n$ in its own localisation origin, and $R_{O_n O_s}$ is the rotation from the localisation origin $O_n$ to the stereo vision origin $O_s$, which can be calculated using Equation (6). The reason for choosing the orientation origin of UAV 1 is that the orientation origin of the Realsense T265 camera is close to horizontally level. By rectifying the images to this plane, the resulting disparity image and point cloud are also level with the ground plane. Additionally, the yaw of $O_s$ is used as the yaw zero in the rectification process. The images are undistorted and rectified using OpenCV’s undistort rectify algorithm.
$$R_{rect_n} = R_n R_{O_n O_s} \tag{5}$$
$$R_{O_n O_s} = R_{O_n O_{n-1}} R_{O_{n-1} O_{n-2}} \cdots R_{O_2 O_1} R_{O_1 O_s} \tag{6}$$
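As an illustration of the per-UAV rectification step, the following OpenCV sketch applies the rectification rotation $R_{rect_n}$ from Equation (5) together with the intrinsics published by the camera. It is a minimal stand-in under these assumptions; the exact OpenCV calls used in the implementation are not given in the paper.

```python
import cv2

def rectify(image, K, dist, R_rect):
    """Undistort one UAV's image and rotate it into the stereo vision origin frame O_s.

    K      -- 3x3 camera matrix from the camera_info published by the D435F
    dist   -- distortion coefficients from the same source
    R_rect -- rectification rotation of Equation (5)
    """
    h, w = image.shape[:2]
    map1, map2 = cv2.initUndistortRectifyMap(K, dist, R_rect, K, (w, h), cv2.CV_32FC1)
    return cv2.remap(image, map1, map2, interpolation=cv2.INTER_LINEAR)
```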
After the rectification, the parallel stereo image pairs are processed using the quasi-dense stereo matching algorithm [17]. The reason for choosing this algorithm is that it produces dense stereo matches even if the stereo image pairs are not perfectly rectified, which is a common case in the proposed system. The resulting disparity image is then reprojected into a point cloud with the OpenCV reprojection algorithm. The projection matrix $Q$ used in the reprojection is constructed as shown in Equation (7). Here, $Q$ is the projection matrix, $c_x$ is the horizontal image centre of the camera, $c_y$ is the vertical image centre of the camera, $f$ is the focal length, and $b$ is the baseline distance. $c_x$, $c_y$, and $f$ are obtained from camera calibration; in the case of the proposed system, these intrinsic parameters are saved in the camera and sent via ROS along with other data to the command node.
$$Q = \begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 0 & f \\ 0 & 0 & 1/b & 0 \end{bmatrix} \tag{7}$$
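A hedged sketch of the matching and reprojection step is shown below. OpenCV’s StereoSGBM is used here only as a stand-in for the quasi-dense matcher of [17], and the $Q$ matrix is built as in Equation (7); the matcher parameters are illustrative.

```python
import cv2
import numpy as np

def disparity_to_points(img_left, img_right, cx, cy, f, b):
    """Match a rectified pair and reproject to 3D using the Q matrix of Equation (7)."""
    # Stand-in block matcher; the paper uses the quasi-dense algorithm of [17] instead.
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=256, blockSize=7)
    disparity = matcher.compute(img_left, img_right).astype(np.float32) / 16.0
    Q = np.float32([[1, 0, 0, -cx],
                    [0, 1, 0, -cy],
                    [0, 0, 0,   f],
                    [0, 0, 1 / b, 0]])
    points = cv2.reprojectImageTo3D(disparity, Q)   # H x W x 3, in the same units as b
    valid = disparity > 0                           # keep only pixels with a valid match
    return points[valid]
```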
After reprojection, point clouds with their origin at the centre of UAV 1’s front-facing camera are generated. In the case of multiple UAVs, if the frame used as the reference frame in stereo matching is from another UAV’s camera, the point cloud can be translated and rotated into the same coordinate system as UAV 1’s by using the pose offset obtained from the pose calibration process and the current localisation position of the reference UAV, as in Equation (8). Here, $P_n$ is the translated and rotated point cloud of $D_n$ represented in $O_1$, and $p_x$, $p_y$, and $p_z$ are the point coordinates in the frame of the chosen reference UAV, $D_n$. In practice, $D_1$ should be the reference frame unless another viewpoint is used.
$$P_n = \left[\, R_{O_n O_1} \mid T_{O_n O_1} \,\right] \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix} \tag{8}$$
Finally, the obtained point cloud is filtered and trimmed using the depth estimation error-based point cloud trimming method proposed in the preceding work. Equation (9) is used to calculate the trimming distance. The trimmed point clouds are then joined together into the resultant map.
$$z_{trim} = \frac{\epsilon_c \, b f}{\epsilon_d} \tag{9}$$
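A minimal sketch of the per-baseline trimming of Equation (9) and the subsequent merging is given below; $\epsilon_c$, $\epsilon_d$, and the lower bound are assumptions used for illustration only.

```python
import numpy as np

def trim_point_cloud(points, b, f, eps_c=0.1, eps_d=1.0, z_min=0.0):
    """Keep points whose depth lies within the trusted range of this baseline (Equation (9))."""
    z_trim = eps_c * b * f / eps_d
    z = points[:, 2]
    return points[(z > z_min) & (z <= z_trim)]

# Fusing several baselines: trim each cloud with its own z_trim, then stack the survivors, e.g.
# merged = np.vstack([trim_point_cloud(p, b, f) for p, b in zip(clouds, baselines)])
```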

4. Experiment and Evaluation

4.1. Large Baseline and Inward Tilt Simulation

The simulation software AirSim [18] is used in the experiment for verification of the system in a virtual environment. The goal of the first experiment is to understand the behaviour of large baselines in an area with a wide range of depths. The experiment area, from the AirSim environment CityEnviron, is shown in Figure 4, and its ground truth depth is shown in Figure 5. The chosen street has buildings on both sides at distances ranging from 20 to 400 m. Stereo images are taken from the simulation with baselines of 2, 4, 6, and 8 m in a horizontal setup with parallel cameras. All disparity images are filtered to the range of 32 to 255 pixels. The respective images and disparity images are shown in Figure 6. The effect of occlusion can be clearly seen at large baselines, where the disparity estimation of closer buildings becomes mostly noise.
In order to mitigate this deterioration, inward tilt angles are introduced at larger baselines. Figure 7 shows disparity images of the tilted stereo vision system, where the inward yaw tilt of the moving viewpoint is increased by 1 degree for every 1 m of baseline. The completeness of the disparity map of closer buildings, especially Building C and Building D, is retained even at the 6.0 and 8.0 m baselines, in contrast to the non-tilting cases.
For each baseline of the tilted case, a point cloud is generated and trimmed using the estimation error-based point cloud trimming algorithm described in Section 3.5 and introduced in the preceding work. In this experiment, the error threshold is set to 10 percent of the estimated distance, and the trimming points for the baselines are 120, 240, 360, and 480 m, respectively, with the depth limited to between 0 and 400 m. The trimmed and merged point cloud map, together with a comparison point cloud merged without trimming, is shown in Figure 8. The result shows that although the trimmed point cloud is less dense, trimming greatly reduces the noise caused by incorrect depth estimation from inappropriate baselines.

4.2. Multiple Configuration Simulation

In this experiment, the advantage of using more than two mapping agents is demonstrated in the simulation software. Three mapping agents are used simultaneously for mapping. $D_1$, the reference frame UAV, holds its position at a reference point, while $D_2$ and $D_3$ move to different configurations. The images taken from $D_2$ and $D_3$ are stereo matched with the images taken from $D_1$.
Figure 9 shows the use of multiple agents with different configurations. A horizontal setup and a vertical setup, both with a 2 m baseline, are used simultaneously. The depth images computed from the two setups are shown along with a combined depth image. The combined depth image is computed by averaging the depths from both configurations; missing depth in one configuration is filled with the depth from the other if available, as sketched below. An improvement in completeness and density is observed in the combined depth compared with each individual setup.
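A minimal NumPy sketch of this fusion rule, assuming two depth images of equal size in which invalid pixels are zero or NaN, is given below.

```python
import numpy as np

def fuse_depth(depth_a, depth_b):
    """Average two depth images where both are valid; otherwise keep whichever is available."""
    valid_a = np.isfinite(depth_a) & (depth_a > 0)
    valid_b = np.isfinite(depth_b) & (depth_b > 0)
    fused = np.zeros_like(depth_a)
    both = valid_a & valid_b
    fused[both] = 0.5 * (depth_a[both] + depth_b[both])   # average where both setups agree on validity
    fused[valid_a & ~valid_b] = depth_a[valid_a & ~valid_b]  # fill gaps from the other setup
    fused[valid_b & ~valid_a] = depth_b[valid_b & ~valid_a]
    return fused
```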
In addition to multiple setups, various baseline distances can be used simultaneously to monitor several different depth ranges. Figure 10 shows the use of two configurations with different baselines at the same time. Figure 10a shows the depth image from $D_2$ with a horizontal setup and a 2 m baseline, while Figure 10b shows the depth image from $D_3$ with a vertical setup and a 4 m baseline. The depth image of $D_2$ is trimmed to the range of 0 to 70 m; referring to Figure 4, a depth of 70 m is just behind Building A. The depth image of $D_3$ is trimmed to a depth range of 70 to 250 m, starting from the beginning of the intersection behind Building A. By keeping $D_2$ and $D_3$ in different configurations, different depth ranges can be monitored in 3D in real time. This example is useful for monitoring vehicles moving along a street with a wide range of depths without losing the depth estimation accuracy limited by baseline distance.

4.3. Mapping with Prototype UAVs

An experiment using the prototype UAVs is carried out in order to verify the usability of the system. Two UAVs are used in this experiment. Figure 11 shows a time lapse of the steps of the mapping process using two UAVs. First, the two UAVs take off and are flown to the starting position manually. Second, the two UAVs are manually controlled to form a column (one in front of the other) so that the AR marker of the UAV in front is visible in the camera frame. Third, the two UAVs are then controlled semi-automatically from the control PC to follow the predetermined data collection setpoints. The chosen configurations are a horizontal configuration with baselines of 2, 4, 6, 8, and 10 m, and a vertical configuration with baselines of 5, 7, and 10 m. In the horizontal setup, the inward yaw tilt is increased by 2 degrees for every 2 m of baseline.
Rectified image pairs and the disparities of each setup are shown in Figure 12 and Figure 13. Due to the high amount of occlusion and noise in the disparity acquired from the horizontal setup with the 10 m baseline, caused by an excessive baseline distance, the data collected from this configuration are excluded from the result. The current single-threaded stereo matching node used in this experiment produces disparity at a frame rate of around 0.3 FPS on the control PC; the frame rate could be increased if a multi-threaded pipeline were implemented. The disparity image can be shown to the operator in real time to pre-assess the quality of the depth estimation. The disparity shown in Figure 12 and Figure 13 is filtered to the range of 16 to 256 pixels, locally normalised, and a colour map is applied to visualise the disparity. The disparity images are used for pre-evaluation of the quality of the resulting point cloud. Several pieces of crucial information can be extracted from a disparity image, including the amount of occlusion, the objects in the scene that are successfully matched, and whether a baseline is too large or too small. Based on the disparity image, the operator evaluates whether the data obtained from each configuration are included in the final resultant point cloud.
In order to evaluate the depth estimation accuracy of the proposed algorithm, a selection of points of interest (POIs) in the mapped area is chosen, denoted by the rectangles shown in Figure 14. The three main areas of evaluation are POI A in red, POI B in green, and POI C in blue. The depth estimation of the three POIs is sampled within rectangles of size 160 × 30 pixels, as shown in Figure 14. The evaluation of the depth estimation, namely the z-axis of the point cloud, is shown in Figure 15. The estimated depth from the proposed system can be compared with the ground truth planar map shown in Figure 16, which was obtained from Google Maps [19]. Two measures are used for the analysis: the estimated distance and its distribution, and the percentage of the sampling area successfully stereo matched. For POI A, shown in the red-coloured graphs, the horizontal 2 m and 4 m baselines are excluded because the building falls below the disparity threshold of 16 pixels: the building is too far, and the baseline is too small. From the graph, high match rates are observed for all baselines and configurations. The mean value of the depth estimation is close to the ground truth value of 100 m, and the estimate increasingly approaches the ground truth value as the baseline increases. The second point of interest is POI B, denoted in green. The mean value of the depth estimation is more varied, while the match rate is relatively acceptable. The reason for the high variation can also be observed in Figure 12 and Figure 13, where the disparity values of the tree area vary considerably. This is caused by the shape of the trees being full of dimples and dents from the viewpoint of the stereo matching algorithm, making accurate depth difficult to estimate. Furthermore, the area is partially occluded by meshed fences in front of the trees, which induce further variation in the depth estimation. However, the mean value of the estimated depth falls in the range of 45 to 60 m, which is close to the ground truth value of roughly 50 m. Additionally, the accuracy of depth estimation varies with baseline distance and setup axis. The most accurate estimation is observed around the 4 and 6 m baselines in the horizontal setup and the 7 and 10 m baselines in the vertical setup; the appropriate baseline distance for this POI is estimated to be in this range. Finally, POI C is denoted in blue. The number of matches is fairly low, and the depth estimation varies widely. This is caused by the low-texture surface of the building as well as the building not appearing fully in the image frame of both cameras. The depth estimation of this building can largely be regarded as noise.
Finally, the point clouds from each baseline and setup are merged into a resultant point cloud. The trimming distance is based on the estimated distances of the POIs. For the horizontal setups with baselines of 2, 4, 6, and 8 m, the point cloud of each baseline is trimmed at distances of 30, 60, 90, and 120 m, respectively. For the vertical setups, the point clouds of the 5, 7, and 10 m baselines are trimmed at distances of 40, 80, and 120 m, respectively. These specific distances are chosen according to the distances of the POIs. The closest POI is the field, which extends to a distance of roughly 40 m, while POI B is at a distance of roughly 50 m, POI C is the building at a distance of roughly 65 m, and POI A is at a distance of roughly 100 m. In this experiment, the operator aims to cover each POI with a suitable baseline; by specifying the trimming points as shown, the distances of the POIs fall within the ranges of different baselines and not in between.
The resultant point cloud is shown in Figure 17. Although a point cloud is difficult to evaluate when represented in 2D images, a 3D view of the point cloud data is provided in Supplementary Video S1. The evaluation of the resultant point cloud against the ground truth planar data acquired from Google Maps is shown in Figure 18. Coloured lines are drawn on the POIs to compare them between the ground truth and the resultant point cloud. Depth estimation is observed to be relatively accurate, although there are some positional errors due to imperfections in the rectification process caused by camera vibration.

5. Conclusions

This work proposed a system for fast and movement-efficient medium- to large-scale mapping using wide baseline, flexible configuration stereo vision. A prototype quadrotor UAV is developed. A tracking method and pose calibration process using inside-out sensors and AR markers is introduced. The prototype UAV is designed with the goal of being scalable, meaning that the number of UAVs can be increased using the same set of sensors and AR markers, though they are not limited to the same components. In the simulation experiment, large-scale mapping of a street with estimated distances of up to 400 m using only 10 m of active UAV movement distance is demonstrated. Additionally, the possibility of using multiple configurations to increase the completeness of the resultant point cloud, as well as multiple-range 3D monitoring, is demonstrated. Finally, the physical system implementation is evaluated in an outdoor mapping experiment at distances of up to 100 m with only 10 m of UAV movement along the horizontal and vertical axes. The reason for the range reduction from 400 m in the simulation to 100 m in the outdoor mapping experiment is the difficulty of securing a larger space for the experiment.
In future work, some of the issues in the system implementation are to be addressed. The vibration of the cameras statically mounted on the UAVs has caused errors in the rectification process, which worsen the accuracy of depth estimation. This problem can be addressed by using a stabilisation mechanism, such as a gimbal, or an image stabilisation algorithm. In addition, the evaluation of disparity based on the operator’s discretion is counterintuitive for an autonomous mapping system. To counter this problem, a neural network model could be used to evaluate the disparity image and exclude the noisy parts from the resultant depth estimation. Moreover, the current limitation of the data transmission bandwidth and the stability of the WiFi module limit the amount of data that can be sent to the control PC. A better alternative with longer range and greater stability could be used to counter this problem in the future.
In terms of system improvement, the depth image from the front-facing depth camera is currently not used in the mapping process. In future work, the depth image from the D435F can be used in combination with the tracking data of the T265 camera to create a point cloud of near-field objects. Objects that are too close to the UAVs for the proposed system can be mapped using a visual–inertial SLAM such as RTAB-Map [20] on the front-facing D435F, and the map generated in this way can be added to the resultant map.
Finally, the estimation error-based baseline fusion algorithm relies on the rough distance to the POIs being known. Furthermore, even though the baseline fusion algorithm can estimate the maximum effective distance of each baseline, the minimum effective distance cannot be estimated due to the variation in size, shape, texture, and other characteristics of the objects in the scene. Additionally, the current baseline fusion algorithm is a hard-trim algorithm, meaning that the point cloud is cut cleanly at the calculated fusion point. A soft-trim fusion algorithm that blends some degree of depth from adjacent baselines could improve the quality of the resultant point cloud.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/rs16020234/s1, Video S1: System demonstration.

Author Contributions

Conceptualisation, Formal analysis, Investigation, Software, Visualisation, Writing—original draft, B.S.; Methodology, Validation, Writing—review and editing, R.R.M. and H.P.; Conceptualisation, Resources, Supervision, Validation, Writing—review and editing, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article and supplementary materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AR      Augmented Reality
LiDAR   Light Detection and Ranging
NTP     Network Time Protocol
POI     Point of Interest
ROS     Robot Operating System
SfM     Structure from Motion
SLAM    Simultaneous Localisation and Mapping
UAV     Unmanned Aerial Vehicle

References

  1. Nex, F.; Remondino, F. UAV for 3D mapping applications: A review. Appl. Geomat. 2014, 6, 1–15. [Google Scholar] [CrossRef]
  2. Li, Q.; Huang, H.; Yu, W.; Jiang, S. Optimized views photogrammetry: Precision analysis and a large-scale case study in Qingdao. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1144–1159. [Google Scholar] [CrossRef]
  3. Śledź, S.; Ewertowski, M.W.; Piekarczyk, J. Applications of unmanned aerial vehicle (UAV) surveys and Structure from Motion photogrammetry in glacial and periglacial geomorphology. Geomorphology 2021, 378, 107620. [Google Scholar] [CrossRef]
  4. Room, M.; Anuar, A. Integration of Lidar system, mobile laser scanning (MLS) and unmanned aerial vehicle system for generation of 3d building model application: A review. In IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2022; Volume 1064, p. 012042. [Google Scholar]
  5. Setyawan, A.A.; Taftazani, M.I.; Bahri, S.; Noviana, E.D.; Faridatunnisa, M. Drone LiDAR application for 3D city model. J. Appl. Geospat. Inf. 2022, 6, 572–576. [Google Scholar] [CrossRef]
  6. Meng, H.; Wang, G.; Han, Y.; Zhang, Z.; Cao, Y.; Chen, J. A 3D modeling algorithm of ground crop based on light multi-rotor UAV lidar remote sensing data. In Proceedings of the 2019 IEEE International Conference on Unmanned Systems and Artificial Intelligence (ICUSAI), Xi’an, China, 22–24 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 246–250. [Google Scholar]
  7. Martinez Rocamora Jr, B.; Lima, R.R.; Samarakoon, K.; Rathjen, J.; Gross, J.N.; Pereira, G.A. Oxpecker: A tethered UAV for inspection of stone-mine pillars. Drones 2023, 7, 73. [Google Scholar] [CrossRef]
  8. Leclerc, M.A.; Bass, J.; Labbé, M.; Dozois, D.; Delisle, J.; Rancourt, D.; Lussier Desbiens, A. NetherDrone: A tethered and ducted propulsion multirotor drone for complex underground mining stopes inspection. Drone Syst. Appl. 2023, 11, 1–17. [Google Scholar] [CrossRef]
  9. Sumetheeprasit, B.; Rosales Martinez, R.; Paul, H.; Ladig, R.; Shimonomura, K. Variable Baseline and Flexible Configuration Stereo Vision Using Two Aerial Robots. Sensors 2023, 23, 1134. [Google Scholar] [CrossRef] [PubMed]
  10. Kim, P.; Park, J.; Cho, Y.K.; Kang, J. UAV-assisted autonomous mobile robot navigation for as-is 3D data collection and registration in cluttered environments. Autom. Constr. 2019, 106, 102918. [Google Scholar] [CrossRef]
  11. Olson, C.F.; Abi-Rached, H. Wide-baseline stereo vision for terrain mapping. Mach. Vis. Appl. 2010, 21, 713–725. [Google Scholar] [CrossRef]
  12. Strecha, C.; Fransens, R.; Van Gool, L. Wide-baseline stereo from multiple views: A probabilistic account. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, CVPR 2004, Washington, DC, USA, 27 June–2 July 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 1, pp. 552–559. [Google Scholar]
  13. Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.J.; Marín-Jiménez, M.J. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
  14. Tiderko, A.; Hoeller, F.; Röhling, T. The ROS multimaster extension for simplified deployment of multi-robot systems. In Robot Operating System (ROS) the Complete Reference; Springer: Cham, Switzerland, 2016; Volume 1, pp. 629–650. [Google Scholar]
  15. Hirschmuller, H.; Gehrig, S. Stereo matching in the presence of sub-pixel calibration errors. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 437–444. [Google Scholar]
  16. Rodriguez, J.J.; Aggarwal, J. Quantization error in stereo imaging. In Proceedings of the CVPR’88: The Computer Society Conference on Computer Vision and Pattern Recognition, Ann Arbor, MI, USA, 5–9 June 1988; IEEE Computer Society: Washington, DC, USA, 1988; pp. 153–154. [Google Scholar]
  17. Stoyanov, D.; Scarzanella, M.V.; Pratt, P.; Yang, G.Z. Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010: 13th International Conference, Beijing, China, 20–24 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 275–282. [Google Scholar]
  18. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference; Springer: Cham, Switzerland, 2018; pp. 621–635. [Google Scholar]
  19. Google Maps. Ritsumeikan University Biwako-Kusatsu Campus. Available online: https://goo.gl/maps/BNZfmefa5A41mT2c7 (accessed on 6 November 2023).
  20. Labbé, M.; Michaud, F. RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. J. Field Robot. 2019, 36, 416–446. [Google Scholar] [CrossRef]
Figure 1. Prototype UAV components overview from the front (left) and from behind (right).
Figure 2. Block diagram of the communication and data processing nodes.
Figure 3. Coordinate frames of each UAV in the calibration process.
Figure 4. Reference frame with areas of interest marked in colour.
Figure 5. Ground truth of the experiment area viewed from (a) 45 degrees pitch, and (b) directly above with each area of interest marked in colored lines.
Figure 6. Image frame from baseline (a) 2.0 m, (b) 4.0 m, (c) 6.0 m and (d) 8.0 m, and their respective disparity images.
Figure 7. Disparity image from baseline (a) 2.0 m, (b) 4.0 m, (c) 6.0 m and (d) 8.0 m. The left column is computed from non-tilting stereo pairs and the right column is computed from tilting stereo pairs.
Figure 8. Resultant point cloud maps of (a) ground truth, (b) trimmed point cloud, and (c) non-trimmed point cloud.
Figure 9. Normalised depth image of a multiple configuration system with (a) horizontal setup, (b) vertical setup, and (c) averaged depth between two setups.
Figure 10. Normalised depth image of a system with two configurations and two baselines with (a) horizontal setup 2 m baseline, (b) vertical setup 4 m baseline, and (c) combined depth image of both configurations.
Figure 11. A time lapse of the mapping process including (a) the UAVs moving to a starting point manually, (b) pose calibration using the AR marker, and (c) the movement vectors in the stereo image collection process.
Figure 12. Disparity images and their image pairs from each corresponding baseline using horizontal configuration.
Figure 13. Disparity images and their image pairs from each corresponding baseline using vertical configuration.
Figure 14. Rectangles denoting depth sampling area of three areas of interest in the mapping process, POI A (red), POI B (green), and POI C (blue).
Figure 15. Depth estimation and corresponding match percentages for the areas of interest POI A (red), POI B (green), and POI C (blue).
Figure 16. Ground truth acquired from Google Maps of the area of mapping fitted with a distance grid.
Figure 17. Resultant point cloud seen from a pitch of minus 60 degrees (left) and directly above (right).
Figure 18. Ground truth acquired from Google Maps (left) and resultant point cloud from point cloud fusion (right).
Table 1. Developed prototype UAV specification.
Dimensions: 35.9 × 30.5 × 19 cm | Rotor Span: 25 cm
Flight Controller: Pixhawk 6C | Firmware: Ardupilot
Weight with Battery: 1.5 kg | Motor: ARRIS 2205 (2300 KV)
Propeller: Inverted Tri-wing 5045 | Total Thrust: 4 kg
Table 2. Realsense D435F settings and specification.
Resolution: 640 × 480 pixels | Field of view: 56 degrees
Frame Rate: 30 FPS | Exposure: Automatic
