Article

KDP-Net: An Efficient Semantic Segmentation Network for Emergency Landing of Unmanned Aerial Vehicles

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China
3 School of Information Science and Engineering, Wuchang Shouyi University, Wuhan 430064, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2024, 8(2), 46; https://doi.org/10.3390/drones8020046
Submission received: 27 December 2023 / Revised: 28 January 2024 / Accepted: 30 January 2024 / Published: 1 February 2024

Abstract

As UAV applications become more widespread, accidental crashes in everyday use can injure people, damage property, and destroy the UAVs themselves. To reduce such accidents, UAVs need the ability to autonomously choose a safe area to land in an emergency, and the key lies in realizing on-board real-time semantic segmentation. In this paper, we propose an efficient semantic segmentation method, KDP-Net, designed for the large feature scale changes and high real-time processing requirements of the emergency landing process. The proposed KDP module effectively improves the accuracy and performance of the semantic segmentation backbone; the proposed Bilateral Segmentation Network improves the extraction accuracy and processing speed of important feature categories in the training phase; and the proposed edge extraction module improves the classification accuracy of fine features. Experimental results on the UDD6 and SDD show that the processing speed of this method reaches 85.25 fps and 108.11 fps while the mIoU reaches 76.9% and 67.14%, respectively. The processing speed reaches 53.72 fps and 38.79 fps when measured on the Jetson Orin, which can meet the requirements of airborne real-time segmentation for emergency landing.

1. Introduction

With the wide application of UAVs in many fields such as remote sensing and geological exploration, search and rescue, agriculture and forestry, and film and media production [1], the applicable scenarios of UAVs have been expanding, and the frequency of their use has been increasing. It is necessary to ensure the accuracy of emergency autonomous landing when UAVs respond to sudden emergencies such as low battery power and signal loss during flight.
In recent decades, with the rapid development of visual navigation technology [2,3,4], researchers have started to focus on vision-based autonomous landing techniques. Generally, UAV landing based on visual navigation requires optical cameras that capture high-resolution images of the surroundings, allowing the visual navigation system to accurately identify ground features, obstacles, and other important navigational reference points. Alam et al. [5] proposed a detection method for landmark obstacles in small-scale safety detection by combining different image processing and safe landing area detection (SLAD) algorithms. However, the method is limited to scenes with a small number of obstacles and is not reliable for visual guidance in complex terrains and scenes. Respall et al. [6] proposed an autonomous visual detection, tracking, and landing system that is capable of detecting and tracking unmanned ground vehicles (UGVs) online without artificial markers. However, the UAV must perform its task in the vicinity of a moving UGV, and in practice it is difficult for UAVs to communicate and interact with ground stations in congested areas or disaster zones. Symeonidis et al. [7] proposed a UAV navigation method for safe landings, which relies on a lightweight computer vision module that can be executed on the limited computational resources of the UAV. However, this network model must contend with issues such as overfitting and limited generalization. It is not applicable to complex scene tasks or to landings with large altitude drops, where a larger drop produces larger scale changes of the target, leading to misjudgment during landing.
Traditional UAV landing methods rely mainly on the UAV's own sensors and controllers to perceive fixed landing markers and execute an emergency landing; however, identifying and spatially localizing fixed ground landing markers from a higher flight altitude places strict demands on the accuracy of small-target recognition.
In recent years, with the rapid development of deep learning technologies [8,9,10,11,12], semantic segmentation of natural images has become increasingly mature, the segmentation quality has constantly improved, and the relevant algorithms have gradually been applied to the semantic segmentation of UAV images. Semantic segmentation can precisely describe the shape and boundary of targets in an image, providing more detailed information for subsequent visual tasks, and is therefore suitable for more complex scenes. To improve the accuracy of UAV landing, researchers have started to use semantic segmentation methods in recent years. Bartolomei et al. [13] proposed a deep reinforcement learning method for the autonomous emergency landing of multi-rotor UAVs, which uses semantic and depth information from the sensors to allow the UAV to fly to a safe area. It also learns the spatial relationships between different categories in the scene through multi-class semantic segmentation, allowing the UAV to acquire more contextual information. However, due to the uncertainty of depth estimation, the system may not adapt well to complex terrain and environmental changes, which can affect the landing accuracy of UAVs in different scenarios. Yu et al. [14] proposed BiSeNet, a semantic segmentation network with a two-branch topology that uses parallel spatial and context paths to accurately segment landing scenes at a fixed altitude. However, the segmentation quality and speed of this method degrade considerably when faced with the large scale differences between high- and low-altitude objects and the large image resolutions encountered during UAV landing in complex scenes [15].
In summary, this paper proposes a real-time semantic segmentation network, the Kernel-Sharing DepthWise and Partial Convolution Network (KDP-Net), for UAV landing in emergency situations. To address the problems of large feature scale changes and slow segmentation speeds in UAV landing, we design the KDP module for the capture and fusion of multi-scale features to improve the model’s ability to perceive targets at different scales, and we use Partial Convolution (PConv) to reduce the amount of computation and increase the computational speed of the model. In addition, we adopt the Bilateral Segmentation Network (BSN) for the fast recovery of high-resolution multi-scale feature map information, and we introduce the Canny edge detection operator to assist in improving the accuracy and real-time performance of the segmentation algorithm. On this basis, to improve the model’s focus on important feature categories at different heights, we design a special hybrid loss function, which can dynamically adjust the loss weight parameters according to the height changes, in order to pay more attention to these key categories. Our network focuses on solving the problems of large object scale changes and the untimely recognition of dynamic objects encountered when landing in natural scenes, improves the segmentation accuracy of small targets, and considers real-time image segmentation in emergency landing scenarios.

2. Related Works

2.1. Encoder–Decoder Architecture

Semantic segmentation is one of the three basic tasks of computer vision; it assigns a label to each pixel in an image [16,17]. In recent years, with the development of deep learning technology, many semantic segmentation algorithms have been applied in various fields, including smart cities, industrial production, remote sensing image processing, and medical image processing [18,19,20,21,22]. Existing semantic segmentation methods usually rely on a convolutional encoder–decoder architecture, where the encoder generates low-resolution image features and the decoder up-samples the features and performs segmentation mapping using pixel-by-pixel class scores. U-Net [23] uses contraction and expansion paths based on the encoder–decoder framework and fuses low-level and high-level semantic information by introducing skip connections. DeepLabV3+ [24] uses atrous convolution and larger convolution kernels to enlarge the receptive field. The Pyramid Scene Parsing Network (PSPNet) [25] designs a pyramid pooling module to capture local and global context information on a dilated backbone. SegNet [26] utilizes an encoder–decoder architecture to recover high-resolution feature maps. MKANet [27] is also a typical encoder–decoder structure that uses a parallel shallow architecture to improve inference speed while supporting large-scale image block inputs. In addition, with the great success of transformers in the field of Natural Language Processing (NLP), researchers have begun to explore their application in computer vision tasks. Vision Transformer (ViT) [28] employs a fully transformer-based image classification design. SegFormer [29] combines a transformer with a lightweight multilayer perceptron (MLP) decoder to efficiently fuse locally and globally focused information, presenting a powerful feature representation capability.
However, when embedding semantic segmentation algorithms into terminal devices such as UAV chips, it is necessary to ensure that the model size and computational cost are small to meet the demand for fast interaction. Due to the high resolution of UAV images and the large variation in feature scales, it is difficult for the above methods to meet the requirement of fast and accurate landing scene segmentation. Therefore, for emergency UAV landing, it is still necessary to design a professional semantic segmentation model.

2.2. Kernel-Sharing Mechanism

Atrous convolution [30] has been widely used to increase the receptive field without increasing the convolution kernel size or the computational effort. Since atrous convolution does not require changing the model structure or adding other parameters, it can be seamlessly embedded into any network model. On this basis, Huang et al. [31] proposed kernel-sharing atrous convolution (KSAC) to improve the network's ability to generalize and represent information at different scales while reducing the computational cost. Che et al. [32] proposed a hybrid convolutional attention (MCA) module, which enables the segmentation model to obtain richer feature information at the shallow level and improves the range of receptive field capture and the ability to distinguish objects of different sizes and shapes. Li et al. [33] applied KSAC and multi-receptive-field convolution to input feature maps to explore multi-scale contextual information and captured clear target boundaries by gradually recovering spatial information.
The above studies show that the kernel-sharing mechanism provides an effective way to control the receptive field and find the best trade-off between local and global information extraction. The kernel-sharing mechanism expands deep network features according to different expansion rates, captures the contextual information of multi-scale features using a spatial pyramid pooling module, and introduces a complementary global average pooling module to refine the contextual information. In addition, the input feature map is dilated and grouped according to the expansion coefficient, and features at different scales are extracted using different expansion kernels, which captures interactions between different layers and enhances the expression of global semantic information. By aggregating shallow coarse information and deep fine information, it also helps to mitigate the loss of feature information caused by differences in image resolution across data sources. As a result, problems such as the loss of edge details of multi-scale targets, as well as the omission and misclassification of small targets, can be alleviated.

3. Proposed Method

In UAV emergency landing scenarios, real-time and accurate image segmentation is crucial for safe UAV landing. In this paper, an efficient semantic segmentation method, KDP-Net, for UAV emergency landing is proposed to realize the improvement in multi-scale target segmentation accuracy while maintaining excellent segmentation speed. It includes the KDP module, BSN, and edge extraction module in the loss function, as shown in Figure 1.

3.1. KDP Module

During the emergency landing process, the scale of features changes rapidly and dramatically. In order to improve the segmentation accuracy and efficiency during the process, we designed the Kernel-Sharing Depthwise and Partial Convolution (KDP) module, the structure of which is shown in Figure 2. This module fully considers the multi-scale features of objects in natural scenes, as well as the importance of local details and overall context. Several key techniques are fused within this module, such as the Kernel-Sharing mechanism [31], Depth-Wise Separable Convolution (DW Conv) [34], and Partial Convolution (PConv) [35], which enable the model to reduce the computational burden while maintaining high-precision segmentation results to meet the real-time and computational resource requirements of UAV emergency landing scenarios.
In order to avoid the problem of slowing down inference due to increasing network depth, our KDP module adopts the Kernel-Sharing mechanism, where multiple branches with different expansion rates can share a single convolutional kernel to efficiently learn the local details of small objects and the global semantic features of large objects, and the shared convolutional kernel enhances the model’s generalization ability and feature representation ability. The Kernel-Sharing mechanism is complemented with data augmentation in the preprocessing stage to enhance the representation capability of the convolutional kernel [27].
In addition, DW Conv uses a 3 × 3 convolution kernel to perform a channel-by-channel convolution operation on the input feature maps and then uses a 1 × 1 convolution to perform a channel-wise relational mapping to further capture features at different scales. DW Conv has three parallel branches with expansion rates of 1, 2, and 3 to capture features at different scales, thus realizing multi-scale feature fusion and enhancing the model's ability to perceive targets of different sizes. The ratio of computational effort between depthwise separable convolution and regular convolution is as follows:
$$\frac{\underbrace{k \times k \times c_i \times h \times w}_{\text{depthwise convolution}} + \underbrace{1 \times 1 \times c_i \times c_o \times h \times w}_{\text{pointwise convolution}}}{k \times k \times c_i \times c_o \times h \times w} = \frac{k^2 + c_o}{k^2 \times c_o} = \frac{1}{c_o} + \frac{1}{k^2}$$
where $c_i \times h \times w$ is the input feature map size, $c_o \times h \times w$ is the output feature map size, $k \times k \times c_i \times c_o$ is the regular convolutional kernel size, $c_o$ is the number of convolution kernels, $k \times k \times c_i$ is the channel-by-channel (depthwise) kernel size of DW Conv, and $1 \times 1 \times c_i \times c_o$ is the 1 × 1 pointwise kernel size. Thus, the ratio of the computational volume of DW Conv to that of regular convolution is $1/c_o + 1/k^2$; since $k^2$ is usually much smaller than the number of output channels, the ratio is roughly $1/k^2$, so with a 3 × 3 kernel DW Conv reduces the computation by about 8–9 times.
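As a concrete illustration of the shared-kernel design, the following minimal PyTorch sketch (our own illustration, not the authors' released code) shows three dilated depthwise branches that reuse a single 3 × 3 kernel and differ only in their expansion rates, each followed by its own 1 × 1 pointwise convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKernelDWBranches(nn.Module):
    """Three depthwise branches (expansion rates 1, 2, 3) that reuse one shared
    3x3 kernel, each followed by its own 1x1 pointwise convolution."""
    def __init__(self, channels):
        super().__init__()
        # one 3x3 depthwise kernel shared by all branches (groups == channels)
        self.shared_dw = nn.Parameter(torch.empty(channels, 1, 3, 3))
        nn.init.kaiming_normal_(self.shared_dw)
        self.pointwise = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(3)])

    def forward(self, x):
        outs = []
        for rate, pw in zip((1, 2, 3), self.pointwise):
            # same weights, different dilation -> different receptive fields
            y = F.conv2d(x, self.shared_dw, padding=rate, dilation=rate,
                         groups=x.shape[1])
            outs.append(pw(y))
        return outs  # multi-scale features to be fused downstream

# example: three feature maps of the same shape, captured at different scales
feats = SharedKernelDWBranches(64)(torch.randn(1, 64, 32, 32))
```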
However, although DW Conv performs well in reducing FLOPs, the increase in its channel number leads to higher memory access, which reduces the overall computational speed. Therefore, we introduce PConv as an alternative operator to DW Conv to address the memory access problem. PConv is a basic operator that further reduces the computational cost by exploiting the redundancy between feature maps. As can be seen in Figure 3, the feature maps of different channels are highly similar. PConv reduces computational redundancy and memory access by applying regular convolution to only some of the input channels for spatial feature extraction. In addition, both the FLOPs and the memory accesses of PConv are drastically reduced compared to regular convolution, improving computational efficiency. Therefore, we feed the results of the last two depthwise separable convolutional branches into the PConv layer. Owing to its structural properties, PConv applies regular convolution to only part of the input channels for spatial feature extraction, while the remaining channels are kept unchanged, which ensures that subsequent feature information still flows through all channels.
The ratio of computational effort for PConv to regular convolution is as follows:
$$\frac{k^2 \times c_p^2 \times h \times w}{k \times k \times c_i \times c_o \times h \times w} = \frac{c_p^2}{c_i c_o}$$
In our implementation, the partial ratio is 1/4, i.e., the regular convolution is applied to only 1/4 of the input channels, so the computation of PConv is only about 1/16 of that of a regular convolution. Here, $k^2 \times c_p^2$ is the Partial Convolution kernel size, $k \times k \times c_i \times c_o$ is the regular convolution kernel size, and $c_o$ is the number of convolution kernels; the ratio of the computation of PConv to that of regular convolution is $c_p^2 / (c_i c_o)$. Finally, to fully utilize the information from all channels, we append a pointwise convolution (PW Conv) to PConv. Next, the outputs of the branches are concatenated and passed through a 1 × 1 convolution that transforms the extracted texture features into 1D vectors, reducing the dimensionality, followed by the ReLU activation function. This exploits the redundancy between filters and further reduces the FLOPs. An image classifier is then trained on the extracted features, evaluated on an independent validation set, and its hyperparameters are tuned so that it can better understand and classify the semantic information in the image.
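The following sketch illustrates the partial-convolution idea with a partial ratio of 1/4 and the appended pointwise convolution; it is our own illustration under the FasterNet-style definition of PConv, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial Convolution in the FasterNet sense (assumed layout): a regular
    3x3 convolution is applied to only the first 1/4 of the channels, the rest
    pass through untouched, and a 1x1 pointwise conv then mixes all channels."""
    def __init__(self, channels, partial_ratio=0.25):
        super().__init__()
        self.cp = max(1, int(channels * partial_ratio))    # convolved channels c_p
        self.spatial = nn.Conv2d(self.cp, self.cp, 3, padding=1)
        self.pointwise = nn.Conv2d(channels, channels, 1)  # PW Conv over all channels

    def forward(self, x):
        xp, xrest = x[:, :self.cp], x[:, self.cp:]   # split along channels
        xp = self.spatial(xp)                        # spatial features on c_p channels only
        return self.pointwise(torch.cat((xp, xrest), dim=1))

y = PartialConv(64)(torch.randn(1, 64, 32, 32))      # shape preserved: (1, 64, 32, 32)
```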

3.2. Bilateral Segmentation Network

The scene segmentation of images captured during a UAV emergency landing must complete the task quickly at different altitudes, preserve rich edge information of the segmented objects, and gradually recover the high-resolution information of the multi-scale feature maps. To further improve segmentation accuracy while maintaining segmentation speed, we propose a Bilateral Segmentation Network (BSN), which uses only three feature maps at 1/8, 1/16, and 1/32 of the UAV image scale as inputs, as shown in Figure 4. These three feature layers, passed through the auxiliary segmentation head, form the first path, which integrates low-level and high-level features. The coordinate attention mechanism constitutes the second path, which improves our method's ability to perceive and localize targets, especially low-altitude targets, in scene segmentation; the two paths yield two outputs. Notably, the two outputs are compared via cross-entropy with different mask labels to obtain UAV segmentation results that focus more on key categories.
In the BSN, each segmentation head consists of a 3 × 3 convolution, batch normalization, the ReLU activation function, a 1 × 1 convolution, and a category logit operation, where the 1 × 1 convolution projects the dimensionality of the processed features onto the number of semantic categories. After the segmentation head, the input feature map is converted into category logits and then restored to the same resolution as the input UAV image by up-sampling. Through this step, the predictive logits Output1 are obtained.
Additionally, this segmentation head enhances the feature representation in the training phase and can be discarded in the inference phase; therefore, it adds little computational complexity at inference time. Figure 5 illustrates the details of the segmentation head. We can adjust its computational complexity by controlling the channel dimension Ci.
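A minimal sketch of such a segmentation head (our illustration of the description above; the layer widths in the example call are placeholders) could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Segmentation head as described above: 3x3 conv + BN + ReLU, a 1x1 conv
    projecting to the number of classes, then bilinear upsampling back to the
    input resolution. The hidden width mid_ch (Ci) controls its cost."""
    def __init__(self, in_ch, mid_ch, num_classes):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_classes, 1),
        )

    def forward(self, x, out_size):
        logits = self.block(x)                                    # category logits
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)

out1 = SegHead(128, 64, 6)(torch.randn(1, 128, 128, 128), out_size=(1024, 1024))
```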
To minimize other unnecessary computational overheads, our network first extracts low-level features and then directly reduces the resolution by a factor of four to obtain f1. In order to enhance the representation of the object of interest, we introduce the coordinate attention mechanism, which is inserted into our network with almost no computational overhead. By using the coordinate attention mechanism to fuse the multi-scale feature maps f2, f3, and f4, the position information is embedded into the channel attention while reducing the computational overhead for the UAV device.
Specifically, after processing through the encoder, the feature maps f3 and f4 are upsampled by a factor of 2 and 4, respectively, and spliced with the feature maps output from stage2. After the feature mapping is passed through the first coordinate attention mechanism to obtain the feature combination with enhanced representation, 1 × 1 convolution is used to facilitate the information communication between the channels, and the processed feature mapping is projected to the number of dimensions of the semantic category through the second coordinate attention mechanism with residual connections and then through the segmentation header to obtain Output2.

3.3. Hybrid Loss Function

Our research counteracts the bias toward majority classes caused by category imbalance by dynamically adjusting the loss weights for datasets covering different altitude ranges, focusing the algorithm's attention on the key categories that affect the segmentation accuracy of the UAV landing scene, such as people, cars, and water. To focus more on the key categories, Output1 and Output2 from the BSN are taken as inputs, and a hybrid loss function is designed to guide the training process, as shown in Figure 6.
In this study, cross-entropy (CE) loss is used as the main loss function, which, as a classical loss function for studying multi-class problems, can guide the proposed network to segment various types of targets in UAV images. The CE loss is defined as
$$\mathrm{CE\text{-}Loss} = -\sum_{i=1}^{N} y_i \log(p_i)$$
where $i$ denotes the class index, $N$ is the number of classes, $y_i$ is the ground truth label for class $i$, and $p_i$ is the predicted probability for class $i$.
Also, Lmain consists of three parts:
$$L_{main} = L_1 + L_2 + L_{canny}$$
where L1 denotes the loss obtained by cross-entropy loss calculation between the network output Output1 (as shown in Figure 4) and the truth value, L2 denotes the loss obtained by cross-entropy loss calculation between the network output Output2 (as shown in Figure 4) and the truth value, and Lcanny denotes the loss obtained by cross-entropy loss calculation between the network output Output2 and the Canny edge mask. The boundary calculation for the segmented object makes the segmented image have finer edge lines. Among them, the generation process of the Canny edge mask is shown in Figure 7.
To enhance the degree of spatial detail learning and target boundary recovery in UAV images, we introduce the Canny edge detection operator [36] as the edge extraction module, which aims to extract the main edge features in the image while suppressing noise and fine edges. To enhance spatial detail learning and boundary recovery, the edge extraction module performs Canny operator convolution and unfolding operations on the ground truth mask to generate Canny edge masks with clearer contour features, which are used to enhance the edge weights in the training phase and are used as a target to assist in the computation of loss. The edge extraction module connects neighboring weak edge pixels to form continuous edge lines, which is especially important for improving the classification accuracy of fine ground features, such as well covers and people with smaller scales as shown in Figure 7.
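As an illustration of how such an edge mask can be generated (a sketch with assumed Canny thresholds and dilation settings, not the exact parameters used in the paper):

```python
import cv2
import numpy as np

def canny_edge_mask(label_mask: np.ndarray, low=50, high=150, dilate_iter=1):
    """Generate a binary Canny edge mask from a ground-truth label mask.
    The thresholds and the 3x3 dilation (used to connect weak neighboring
    edge pixels into continuous lines) are our assumptions."""
    edges = cv2.Canny(label_mask.astype(np.uint8), low, high)
    if dilate_iter > 0:
        edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=dilate_iter)
    return (edges > 0).astype(np.uint8)

# example: edges of a 1024 x 1024 label mask with class IDs 0..5
mask = canny_edge_mask(np.random.randint(0, 6, (1024, 1024), dtype=np.uint8))
```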
For multi-category UAV landing scenarios, this study introduces the focal loss function [37] as an auxiliary loss function. The focal loss function is defined as follows:
$$L_{aux} = \mathrm{FL}(p) = -\sum_{i=1}^{N} \alpha_i y_i (1 - p_i)^{\gamma} \log(p_i)$$
where the hyperparameter αi is the balancing weight of class i, and the hyperparameter γ is the focusing parameter, which is shared by all classes as in the original focal loss and down-weights easy (safe) classes. In our method, α is set as a parameter list according to the dataset and the corresponding key categories, and γ is set to the default value of 2.0. pi is the softmax probability of the predicted result; the smaller the pi value for the current sample's category, the more inaccurate the prediction. The term (1 − pi)γ therefore increases, so the focal loss treats such samples as key samples and assigns them a larger coefficient, allowing them to contribute more to the loss and the gradient. αi is the category weight coefficient, which assigns a high weight to the loss contribution of key categories such as water, roof, car, and person.
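A compact PyTorch version of this auxiliary loss is sketched below; the alpha values in the example call are arbitrary placeholders, not the weights used in the paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha, gamma=2.0):
    """Per-pixel focal loss; alpha is a length-N tensor of class weights
    (set per dataset, with higher weights for key categories).
    logits: (B, N, H, W), target: (B, H, W) with class indices."""
    log_p = F.log_softmax(logits, dim=1)                       # log p_i
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log-prob of true class
    pt = log_pt.exp()
    at = alpha.to(logits.device)[target]                       # per-pixel alpha_i
    return (-at * (1.0 - pt) ** gamma * log_pt).mean()

loss = focal_loss(torch.randn(2, 6, 64, 64),
                  torch.randint(0, 6, (2, 64, 64)),
                  alpha=torch.tensor([0.5, 1.0, 1.0, 2.0, 2.0, 2.0]))
```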
In summary, the hybrid loss function consists of two parts:
$$L = \alpha L_{main} + (1 - \alpha) L_{aux}$$
where Lmain is the main loss function, which represents the loss of semantic category classification, and Laux is the auxiliary loss function, which is used to adjust the sample parameters of the key feature categories. α is the parameter that balances the two loss functions and is set to 0.8 in our experiments.
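Putting the pieces together, the overall objective can be sketched as follows; this reflects our reading of the description above, and implementing the Canny term as an edge-weighted cross-entropy on Output2 is an assumption:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(out1, out2, target, edge_mask, alpha, a=0.8):
    """L = a * Lmain + (1 - a) * Laux with Lmain = L1 + L2 + Lcanny.
    How Lcanny consumes the Canny mask is not fully specified in the text;
    here we assume an edge-weighted cross-entropy on Output2."""
    l1 = F.cross_entropy(out1, target)
    l2 = F.cross_entropy(out2, target)
    per_pixel = F.cross_entropy(out2, target, reduction="none")      # (B, H, W)
    w = edge_mask.float()
    l_canny = (per_pixel * w).sum() / w.sum().clamp(min=1.0)          # loss on edge pixels
    l_main = l1 + l2 + l_canny
    l_aux = focal_loss(out2, target, alpha)                           # from the sketch above
    return a * l_main + (1.0 - a) * l_aux
```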

4. Experiments

4.1. Datasets

We benchmarked KDP-Net on two publicly available UAV semantic segmentation datasets: the Urban Drone Dataset (UDD) [38] and the Semantic Drone Dataset (SDD) [39]. The UDD represents high-altitude scenarios during UAV landings, while the SDD represents low-altitude scenarios. The UDD6 subset, containing six semantic categories, is used for all experiments in this work. It consists of 141 aerial images with resolutions of 3840 × 2160, 4096 × 2160, or 4000 × 3000; its six segmentation classes are road, roof, vehicle, other, building facade, and vegetation. The SDD focuses on the semantic understanding of urban scenes and contains 400 images captured from a bird's-eye view over more than 20 houses, each with a resolution of 6000 × 4000. The SDD contains 23 classes: unlabeled, tree, bald-tree, grass, vegetation, dirt, gravel, rocks, water, paved-area, pool, people, dog, car, bicycle, roof, wall, fence, fence-pole, windows, door, ar-marker, and obstacle. Figure 8 and Figure 9 show images captured during UAV landings: Figure 8 shows the UDD6, with a height range of 60–100 m, simulating the perspective of a UAV landing from high altitude, and Figure 9 shows the SDD, with a height range of 5–30 m, mimicking the perspective of a UAV landing from low altitude. There is a significant difference between the heights and the object proportions in these images.

4.2. Implementation Details and Metrics

Prior to the experiments in this paper, we preprocessed the datasets by cropping all images to a uniform resolution of 1024 × 1024, and the training, validation, and testing sets of both datasets were split in a 6:2:2 ratio. The hybrid loss function (Section 3.3) was used to guide network training, with stochastic gradient descent (SGD) as the optimizer; the momentum and learning rate were set to 0.9 and 0.001, respectively; the weight decay was set to 1 × 10−4; and cosine annealing learning rate scheduling was used. Our implementation ran on an Intel(R) Core(TM) i9-10900K deca-core 3.7 GHz processor and an Nvidia GeForce RTX 4090 GPU with 24 GB of video memory. The software environment was PyTorch 1.10.1 and CUDA 11.4 with the cuDNN 8.2.4 library.
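For reference, the optimizer and learning-rate schedule above map onto PyTorch as in the following sketch; the model and epoch count are placeholders, not the actual network or training schedule:

```python
import torch
import torch.nn as nn

# Sketch of the optimizer and scheduler configuration described above.
# The model and the epoch count are placeholders, not the actual KDP-Net setup.
model = nn.Conv2d(3, 6, 3, padding=1)
num_epochs = 100  # assumed; the paper does not state the epoch count here
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                            weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one training epoch over the 1024 x 1024 crops ...
    scheduler.step()
```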
In this paper, the common metrics for semantic segmentation algorithms were used to verify the segmentation accuracy of the proposed network, including mean accuracy (mAcc), mean intersection over union (mIoU), per-category intersection over union (category IoU), F1 score, number of parameters (Params), and frames per second (FPS). The formulas are as follows:
$$\mathrm{mAcc} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i + TN_i}{TP_i + FP_i + FN_i + TN_i}$$
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}$$
$$F1 = \frac{2 \times P \times R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
where N denotes the number of semantic categories, TPi is the number of pixels correctly classified as category i, FPi is the number of pixels of other categories predicted as category i, FNi is the number of pixels of category i predicted as other categories, and TNi is the number of pixels correctly predicted as not belonging to category i. P and R stand for Precision and Recall, respectively.
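These metrics can be computed from a per-class confusion matrix, for example as in the sketch below; macro-averaging the F1 score over classes is our assumption about how the per-class scores are aggregated:

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute mAcc, mIoU, and a macro-averaged F1 from an N x N confusion
    matrix (rows: ground truth, columns: prediction), following the
    formulas above."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    tn = conf.sum() - tp - fp - fn
    macc = np.mean((tp + tn) / np.maximum(tp + fp + fn + tn, 1))
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1))
    p = tp / np.maximum(tp + fp, 1)
    r = tp / np.maximum(tp + fn, 1)
    f1 = np.mean(2 * p * r / np.maximum(p + r, 1e-12))
    return macc, miou, f1

macc, miou, f1 = segmentation_metrics(np.array([[50, 2], [3, 45]]))
```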
Real-time performance is a key metric for UAV safe emergency landing area recognition applications and is especially important on performance-constrained embedded devices [40,41,42,43]. In this study, inference speed is used as the metric for real-time performance evaluation, and the number of model parameters is used to evaluate the practicality of the proposed network on UAV devices. In addition, the method is deployed on a performance-constrained embedded device, the Jetson Orin, for real-time testing.

4.3. Experimental Results

To verify the effectiveness of the proposed method, we compare DeepLabV3+ [24], BiSeNetV2 [44], SegFormer [45], MKANet [27], and our proposed KDP-Net on the high-altitude dataset, UDD6, and the low-altitude dataset, SDD. All methods use 1024 × 1024 resolution images as input. The quantitative results on the high-altitude dataset are shown in Table 1, and Table 2 shows the per-category IoU comparison between our method and the competing methods on this dataset. The quantitative results on the low-altitude dataset are shown in Table 3, and Table 4 shows the corresponding per-category IoU comparison. The visualizations for the UDD6 and the SDD are shown in Figure 10 and Figure 11, and the overall performance of each algorithm at low and high altitude is shown in Figure 12.

4.3.1. Experimental Results on UDD6

Table 1 shows the experimental results on the high-altitude dataset UDD6. The overall performance of our proposed method is better than those of DeepLabV3+, BiSeNetV2, SegFormer, and MKANet, with advantages of 2.82%, 6.48%, 20.54%, and 6.71% in mIoU, respectively. In terms of Params, KDP-Net reduces the number of parameters by 23 M compared to MKANet. In terms of F1, KDP-Net improves by 5.56% compared to MKANet. It is worth mentioning that our proposed KDP-Net uses PConv to reduce the amount of computation and contains fewer repetitive feature extraction blocks; therefore, KDP-Net processes images faster than the other networks.
Table 1. Quantitative comparison of UDD6 test dataset results with competitive methods. The best value of each metric is labeled in bold.
| Method | Acc (%) | mIoU (%) | Params (M) | F1 (%) | FPS |
|---|---|---|---|---|---|
| DeepLabV3+ [24] | 88.26 | 74.08 | 63 | 84.32 | 3.99 |
| BiSeNetV2 [44] | 85.84 | 70.42 | 66 | 82.00 | 69.47 |
| SegFormer [45] | 78.69 | 56.36 | 168 | 70.06 | 54.15 |
| MKANet [27] | 86.59 | 70.19 | 129 | 81.64 | 96.89 |
| KDP-Net | 89.38 | 76.90 | 106 | 86.76 | 101.17 |
A quantitative comparison of the IoUs of our method with those of competing methods for each category on the high-altitude dataset UDD6 is shown in Table 2, where the IoUs of the key categories we focus on also show competitive performance with other methods.
Table 2. Comparative results of quantitative experiments (%) on the UDD6 test set per category. The best value of each metric is labeled in bold.
| Method | Background | Vegetation | Facade | Road | Vehicle | Roof | mIoU1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | 62.28 | 82.64 | 72.93 | 76.23 | 69.16 | 81.27 | 74.89 | 74.08 |
| BiSeNetV2 | 60.53 | 83.13 | 67.89 | 67.29 | 65.33 | 78.35 | 69.72 | 70.42 |
| SegFormer | 48.23 | 77.36 | 49.68 | 60.17 | 38.10 | 64.58 | 53.13 | 56.36 |
| MKANet | 61.57 | 82.39 | 69.41 | 68.99 | 57.79 | 80.95 | 69.29 | 70.19 |
| KDP-Net | 60.79 | 81.66 | 76.95 | 76.11 | 78.15 | 87.73 | 79.74 | 76.90 |

Others IoU (%): Background, Vegetation. Key IoU1 (%): Facade, Road, Vehicle, Roof.
As can be seen from the table, for the key classes of facade, road, vehicle, and roof, our method outperforms the other comparison networks in terms of average mIoU1 by 4.85%, 10.02%, 26.61%, and 10.45%, respectively, and it outperforms the other methods in three of the four key classes. Vehicle is the class that accounts for the fewest pixels and the smallest segmentation objects in the UDD6, so segmenting vehicles is very challenging, but our method achieves the highest segmentation accuracy for this class among the compared networks. For roof, our method also performs well, with an IoU of 87.73%, which is at least 6.46% higher than those of the other networks.
In Figure 10, we perform qualitative experiments on the UDD6 test set to compare the segmentation visualization effect of our method and the boundary details of the segmented objects. The UAV image samples in the UDD6 have heights ranging from 60 to 100 m, and we use three stages of them with significantly different heights to perform the qualitative experiments.
The first image sample is captured at low altitude (row 1). The key category of vehicles circled by the red box cannot be accurately segmented by DeepLabV3+, BiSeNetV2, SegFormer, or MKANet, whereas the visual segmentation result of KDP-Net achieves a more pronounced segmentation boundary for the vehicles as well as a clearer background in the yellow-boxed image, and our method also shows clearer segmentation boundaries than the other methods. The second image sample was taken at mid altitude (row 2); as shown by the red rectangles, DeepLabV3+, BiSeNetV2, and MKANet can segment key categories such as vehicles, roads, and vegetation, while the segmentation boundary details of the proposed KDP-Net outperform those of the other networks. The third image sample represents the high-altitude UAV image (row 3). As can be seen from the key category of vehicles in the red box, DeepLabV3+, BiSeNetV2, and MKANet can still recognize the vehicles, but their ability to segment the vehicles' contour boundaries decreases significantly, and none of them obtain satisfactory segmentation results for the vehicles in the figure; despite the high altitude, the proposed KDP-Net remains superior to the other methods in terms of the boundary details of vehicle segmentation.

4.3.2. Experimental Results on SDD

Table 3 shows a comparison of the quantitative results of our method with the competing methods on the low-altitude dataset SDD. Compared to the other methods, KDP-Net has advantages of 5.16%, 4.30%, 35.06%, and 14.85% in mIoU, which is at least 4.30% higher than those of the other methods. KDP-Net reduces the number of parameters by 23 M compared to MKANet, and in terms of F1, KDP-Net improves by 5.56% over MKANet. The advantages in terms of FPS are 104.5, 37.67, 53.11, and 6.91, respectively, indicating that our KDP-Net likewise exhibits a higher inference speed than several other networks.
Table 3. Quantitative comparison of Semantic Drone test dataset results with competitive methods. The best value of each metric is labeled in bold.
| Method | Acc (%) | mIoU (%) | Params (M) | F1 (%) | FPS |
|---|---|---|---|---|---|
| DeepLabV3+ [24] | 91.76 | 62.40 | 63 | 82.51 | 3.61 |
| BiSeNetV2 [44] | 90.68 | 63.26 | 66 | 78.86 | 70.44 |
| SegFormer [45] | 78.75 | 32.50 | 168 | 49.05 | 55.00 |
| MKANet [27] | 88.27 | 52.71 | 129 | 73.17 | 101.20 |
| KDP-Net | 91.96 | 67.56 | 106 | 79.01 | 108.11 |
The quantitative comparison of IoU for each category in the SDD between our method and the competing methods is shown in Table 4. The advantages of our proposed method in the key category mIoU1 are 8.69%, 6.65%, 39.81%, and 16.04%, respectively.
Table 4. Comparative results of quantitative experiments (%) on the SDD test set per category. The best value of each metric is labeled in bold.
| Method | Unlabelled | Paved-Area | Dirt | Gravel | Water | Rocks | Pool | Roof | Wall | Windows | Door | Fence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | 87.74 | 60.30 | 67.68 | 63.76 | 89.44 | 70.88 | 97.44 | 80.40 | 42.43 | 49.66 | 0.00 | 51.41 |
| BiSeNetV2 | 88.41 | 53.69 | 63.01 | 64.70 | 88.69 | 66.51 | 96.83 | 81.71 | 43.16 | 46.89 | 18.34 | 41.12 |
| SegFormer | 79.38 | 29.26 | 43.66 | 32.95 | 57.98 | 21.00 | 68.09 | 64.53 | 15.24 | 18.91 | 0.00 | 13.78 |
| MKANet | 86.10 | 40.03 | 65.88 | 57.35 | 89.51 | 56.89 | 96.11 | 87.31 | 33.65 | 42.10 | 0.00 | 41.96 |
| KDP-Net | 76.79 | 41.13 | 71.99 | 67.55 | 92.93 | 69.46 | 98.14 | 87.29 | 47.21 | 47.86 | 55.49 | 53.56 |

| Method | Grass | Vegetation | Fence-Pole | Person | Dog | Car | Bicycle | Tree | Bald-Tree | Ar-Marker | Obstacle | mIoU1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | 76.45 | 66.89 | 0.00 | 71.37 | 0.00 | 97.49 | 79.80 | 82.02 | 85.04 | 75.22 | 36.84 | 60.05 | 62.40 |
| BiSeNetV2 | 76.09 | 57.11 | 16.42 | 69.38 | 63.59 | 92.12 | 77.64 | 65.75 | 72.33 | 76.33 | 35.80 | 62.09 | 63.26 |
| SegFormer | 51.52 | 35.11 | 0.00 | 22.13 | 0.00 | 59.88 | 11.88 | 37.54 | 47.98 | 8.80 | 18.59 | 28.93 | 32.50 |
| MKANet | 29.26 | 52.71 | 0.00 | 55.76 | 0.00 | 93.70 | 44.37 | 67.12 | 62.82 | 72.79 | 33.51 | 52.70 | 52.71 |
| KDP-Net | 65.03 | 50.81 | 36.10 | 72.15 | 73.51 | 99.20 | 70.79 | 76.83 | 76.98 | 81.90 | 41.20 | 69.48 | 67.56 |

Others IoU (%): Unlabelled, Paved-Area, Grass, Vegetation. Key IoU1 (%): the remaining 19 categories.
As can be seen from the table, the proposed method outperforms the other methods in 13 of the 19 key categories in terms of category IoU1. In the SDD, the rocks, pool, door, fence-pole, and dog categories have fewer samples; in particular, door, fence-pole, and dog have at least 10 times fewer pixels than the rocks category. As a result, DeepLabV3+, SegFormer, and MKANet obtain IoUs of zero for these categories. However, our network performs well on these three categories, with IoUs of 55.49%, 36.10%, and 73.51%, respectively. In addition, our method achieves the highest segmentation accuracy among the compared networks for the extremely critical categories of person, water, and pool, with 72.15%, 92.93%, and 98.14%, respectively.
In Figure 11, we compare the effect of segmentation visualization of the SDD on various methods. The SDD acquisition height is between 5 and 30 m. We choose three samples of UAV images at different heights to conduct qualitative experiments.
The first sample (row 1) was captured at low altitude, and the outlines of the key category of people on the ground can be clearly seen in the input image. As can be seen from the standing person enlarged by the red rectangle, DeepLabV3+, BiSeNetV2, and MKANet are unable to segment the person category effectively, whereas KDP-Net shows the best segmentation boundary detail among the compared networks. The second sample (row 2) was taken in the middle of the dataset's height range, with the red box zoomed in on the key categories of bicycle and person. Only SegFormer and KDP-Net detect them at all, and SegFormer does not segment the bicycle and the person accurately. The third sample (row 3) is an overhead scene next to a pool with a more complex background. BiSeNetV2, SegFormer, and MKANet are unable to segment the two key people enlarged by the red rectangles, whereas DeepLabV3+ and the proposed KDP-Net can effectively recognize them. Compared with DeepLabV3+, the proposed KDP-Net also pays more attention to vehicles; therefore, it can segment small-target bicycles, and the comparison shows that our method achieves outstanding performance in boundary detail preservation.

4.3.3. Overall Performance

Figure 12 illustrates the overall segmentation accuracy, processing speed, and key-category IoU performance of the compared methods on the high- and low-altitude datasets. From the figure, we can see that our method has the highest mIoU and FPS, and that the mIoU1 of the key categories we focus on is also the highest and exceeds the overall mIoU. This indicates that our method succeeds in focusing more attention on the key categories, so that the UAV pays more attention to the safety of people and objects on the ground during an emergency landing and minimizes potential risks and losses.

4.4. Ablation Study

The effectiveness of the proposed modules is verified by ablation experiments. The experiments are conducted on the UDD6, using the same training and inference settings. The baseline model does not contain any of the proposed modules, and the fusion method is element-wise summation. The quantitative results of the ablation study are given in Table 5. We compare the effect of the hybrid loss parameter α on the network accuracy, where α determines the relative weights of the main and auxiliary loss functions. As can be seen from Table 5, the hybrid loss function is significantly better; specifically, the network achieves the best results when α = 0.8, at which point it outperforms the baseline using the cross-entropy loss function by 2.89%, 5.13%, and 4.00% in terms of Acc, mIoU, and average F1 score, respectively. This shows that our designed hybrid loss function has excellent UAV image feature extraction and representation capabilities. In addition, the KDP module improves the mIoU by 5.43% while reducing the number of parameters by 19 M and increasing the throughput by nearly 15 frames per second, which indicates that the KDP module effectively reduces redundant computation. The added BSN further improves the mIoU of the network by 3.17%. Adding the hybrid loss also improves the segmentation accuracy, with a slight decrease in inference speed. With all three proposed modules, KDP-Net achieves an mIoU of 76.9% and an FPS of 101.17. Figure 13 provides a qualitative comparison; we can observe that the KDP-Net prediction becomes more consistent with the ground truth as the hybrid loss, the KDP module, and the BSN are added sequentially. In conclusion, our proposed modules are effective for semantic segmentation.
Table 5. The effect of different policies in the segmentation network on the UDD6.
| Method | Hybrid Loss α = 0.2 | Hybrid Loss α = 0.4 | Hybrid Loss α = 0.8 | KDP Module | BSN | Acc (%) | mIoU (%) | Params (M) | F1 (%) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | | | | | 82.76 | 64.87 | 125 | 77.75 | 86.77 |
| KDP-Net | ✓ | | | | | 83.94 | 68.38 | 125 | 80.54 | 85.09 |
| KDP-Net | | ✓ | | | | 85.53 | 69.83 | 125 | 81.64 | 86.79 |
| KDP-Net | | | ✓ | | | 85.65 | 70.00 | 125 | 81.75 | 86.50 |
| KDP-Net | | | | ✓ | | 86.01 | 70.30 | 106 | 81.94 | 101.42 |
| KDP-Net | | | | ✓ | ✓ | 87.72 | 73.47 | 106 | 84.15 | 102.09 |
| KDP-Net | | | ✓ | ✓ | ✓ | 89.38 | 76.90 | 106 | 86.76 | 101.17 |
Figure 13 shows the segmentation visualization and the boundary details of the segmented objects for the modules used in the ablation experiments. In this comparison, CE loss, hybrid loss α = 0.2, hybrid loss α = 0.4, hybrid loss α = 0.8, and the KDP module are set against the proposed KDP-Net, where KDP-Net consists of hybrid loss α = 0.8, the KDP module, and the BSN.
This image sample was captured in a street scene, and it can be seen that the car circled by the yellow box cannot be accurately segmented by the baseline network using CE loss, whereas with the hybrid loss the visual segmentation result of KDP-Net achieves the desired segmentation boundaries for the car. In addition, the red rectangles show that KDP-Net can recognize cars that are not on the main road when α is set to 0.8, compared with α set to 0.2 or 0.4 in the hybrid loss function. The approximate contours of the cars that are not on the main road can be segmented after PConv is used. In conclusion, the proposed KDP-Net focuses more on key categories, such as cars, so the segmentation boundary details of our method for key categories are better than those of the baseline.

5. Discussion

UAV aerial images contain complex scenes and cluttered backgrounds, and UAVs face great challenges when performing emergency landing tasks. Since the semantic segmentation task assigns category information to every pixel in the image, targets in the image obtain finer object edge information. In addition, it can predict regions that cannot be handled by object detection methods, such as pools, grass, and other image regions without a fixed shape. Therefore, semantic segmentation is better suited to UAV landing scenarios, and the proposed method brings a significant improvement in segmentation.
As shown in Tables 2 and 4 and Figures 10 and 11, SegFormer does not focus its attention on the appropriate regions, which leads to the worst segmentation accuracy; the other semantic segmentation networks can utilize contextual information for segmentation but do not focus their attention on the key categories, which can lead to misclassification and make the UAV landing hazardous to people or other targets on the ground. In contrast, our proposed KDP-Net deliberately increases the network's attention to the key categories while balancing the samples using a hybrid loss function; it excels in key category identification and obtains multi-scale features through different operations, which is important for small-target segmentation.
In addition, UAVs operate at relatively high altitudes, the visibility of small targets such as people on the ground is limited, and the movements of such targets are random. This makes the requirements for real-time processing even more stringent. Therefore, in addition to the accuracy of the semantic segmentation algorithm, we focus on its real-time performance.
In detail, we consider that, in general, the maximum descent speed in a UAV emergency landing is around 5 m/s; we assume that the captured video has a frame rate of 24 fps and that the UAV descends at this maximum speed. In the high-altitude stage, tested on the UDD6 and assuming an altitude range from 100 to 60 m, the time required for the UAV to descend is 40 m ÷ 5 m/s = 8 s, so 8 s × 24 fps = 192 frames must be processed, without frame extraction. According to the FPS of our method, processing these images takes about 192 frames × 9.88 ms/frame ≈ 1.9 s, which is much less than the time required for landing. In addition, on the onboard embedded device, the Jetson Orin, we achieved an inference speed of 53.72 fps, so the Orin needs about 192 frames × 18.62 ms/frame ≈ 3.57 s to process them. In the low-altitude stage, validated on the SDD and assuming an altitude range from 30 to 10 m, the descent takes 20 m ÷ 5 m/s = 4 s, so 4 s × 24 fps = 96 frames need to be processed. According to the FPS of our method, processing one frame takes about 9.26 ms, so processing 96 frames takes about 0.89 s. Furthermore, KDP-Net on the Jetson Orin achieves a processing speed of 38.79 fps, so processing 96 frames takes about 2.48 s. Therefore, the speed shows that KDP-Net can meet the real-time processing requirements of UAV landing scenarios.
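The high-altitude budget above can be reproduced with a few lines of arithmetic; this is a sketch using the throughput figures reported in this paper:

```python
# Back-of-the-envelope check of the high-altitude case described above:
# a 40 m descent at 5 m/s filmed at 24 fps, with per-frame times taken from
# the reported throughput (101.17 fps on the desktop GPU, 53.72 fps on Orin).
descent_time = 40 / 5                    # 8 s of descent
frames = int(descent_time * 24)          # 192 frames to process
gpu_time = frames / 101.17               # ~1.9 s on the desktop GPU
orin_time = frames / 53.72               # ~3.6 s on the Jetson Orin
print(frames, round(gpu_time, 2), round(orin_time, 2))
```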

6. Conclusions

In this study, to reduce the occurrence of accidents such as injuries to personnel, property damage, and the loss or destruction of UAVs caused by accidental UAV crashes, we propose an efficient semantic segmentation method, KDP-Net, designed for the large feature scale changes and strict real-time processing requirements of the emergency landing process, so that UAVs can autonomously select a safe landing area in an emergency. We designed the KDP module to capture and fuse multi-scale features, enhancing the model's ability to perceive targets at different scales, and by introducing PConv we reduce the consumption of computational resources, enabling our method to operate efficiently in practical applications. Meanwhile, we adopt a Bilateral Segmentation Network to integrate low-level and high-level features through two parallel paths, which improves our method's ability to perceive and localize targets, especially low-altitude targets, in scene segmentation. On this basis, we propose a hybrid loss function that dynamically adjusts the loss weight parameters and includes an edge extraction module that improves the segmentation accuracy for fine feature categories, thereby improving landing accuracy.
In the application scenario of UAV emergency landing, the UAV needs to acquire and process image data in time to respond and make decisions, so timeliness is also a very important consideration. Our KDP-Net can process 1024 × 1024-pixel images at up to 108 fps. When mounted on the Jetson Orin, a common embedded computing device for UAVs, and tested in practice, the method achieves a processing speed of 53 fps, which can meet the application requirements of real emergency landing scenarios.
Given that we use a purely visual solution based on semantic segmentation, our first priority is to improve the segmentation accuracy and performance of images obtained during UAV landing without additional devices. The experimental results show that compared with other methods, the proposed method has higher accuracy and better performance, and can meet the real-time processing requirements on the embedded device. It is worth noting that in order to better compare the performance of the proposed method with others, this paper mainly uses stationary images, which has limitations. In future work, we will further carry out on-board adaptation, testing, and fine-tuning of this method, in order to better apply this method to complex and changeable practical application scenarios by accessing and processing video frames in real-time.

Author Contributions

Z.Z.: Conceptualization, Methodology, Writing—Original Draft, Resources, Funding Acquisition. Y.Z.: Methodology, Software, Writing—Original Draft. S.X.: Resources, Software, Writing—Original Draft, Writing—Review and Editing. L.W.: Methodology, Writing—Original Draft, Writing—Review and Editing, Project Administration. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61901307) and the Scientific Research Foundation for Doctoral Program of Hubei University of Technology (No. BSQD2020054).

Data Availability Statement

The data presented in this study are openly available at https://github.com/MarcWong/UDD (UDD) and http://dronedataset.icg.tugraz.at/ (SDD), accessed on 8 November 2022.

Acknowledgments

The authors would like to thank the creators and sharers of the UDD and SDD datasets.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Hall, O.; Wahab, I. The use of drones in the spatial social sciences. Drones 2021, 5, 112. [Google Scholar] [CrossRef]
  2. Meng, Y.; Wang, W.; Han, H.; Ban, J. A visual/inertial integrated landing guidance method for UAV landing on the ship. Aerosp. Sci. Technol. 2019, 85, 474–480. [Google Scholar] [CrossRef]
  3. Falanga, D.; Zanchettin, A.; Simovic, A.; Delmerico, J.; Scaramuzza, D. Vision-based Autonomous Quadrotor Landing on a Moving Platform. In Proceedings of the 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR), Shanghai, China, 11–13 October 2017. [Google Scholar] [CrossRef]
  4. Zhang, H.T.; Hu, B.B.; Xu, Z.; Cai, Z.; Liu, B.; Wang, X.; Geng, T.; Zhong, S.; Zhao, J. Visual Navigation and Landing Control of an Unmanned Aerial Vehicle on a Moving Autonomous Surface Vehicle via Adaptive Learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5345–5355. [Google Scholar] [CrossRef]
  5. Alam, M.S.; Oluoch, J. A survey of safe landing zone detection techniques for autonomous unmanned aerial vehicles (UAVs). Expert Syst. Appl. 2021, 179, 115091. [Google Scholar] [CrossRef]
  6. Respall, V.M.; Sellami, S.; Afanasyev, I. Implementation of Autonomous Visual Detection, Tracking and Landing for AR. Drone 2.0 Quadcopter. In Proceedings of the 2019 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, 7–10 October 2019; pp. 477–482. [Google Scholar] [CrossRef]
  7. Symeonidis, C.; Kakaletsis, E.; Mademlis, I.; Nikolaidis, N.; Tefas, A.; Pitas, I. Vision-based UAV safe landing exploiting lightweight deep neural networks. In Proceedings of the 2021 4th International Conference on Image and Graphics Processing, Sanya, China, 1–3 January 2021; pp. 13–19. [Google Scholar] [CrossRef]
  8. Zhang, Z.; Zheng, H.; Cao, J.; Feng, X.; Xie, G. FRS-Net: An Efficient Ship Detection Network for Thin-Cloud and Fog-Covered High-Resolution Optical Satellite Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2326–2340. [Google Scholar] [CrossRef]
  9. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object detection in high resolution remote sensing imagery based on convolutional neural networks with suitable object scale features. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2104–2114. [Google Scholar] [CrossRef]
  10. Xiang, S.; Wang, M.; Xiao, J.; Xie, G.; Zhang, Z.; Tang, P. Cloud Coverage Estimation Network for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6004505. [Google Scholar] [CrossRef]
  11. Xiang, S.; Wang, M.; Jiang, X.; Xie, G.; Zhang, Z.; Tang, P. Dual-task semantic change detection for remote sensing images using the generative change field module. Remote Sens. 2021, 13, 3336. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Lu, W.; Feng, X.; Cao, J.; Xie, G. A Discriminative Feature Learning Approach with Distinguishable Distance Metrics for Remote Sensing Image Classification and Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 889–901. [Google Scholar] [CrossRef]
  13. Bartolomei, L.; Kompis, Y.; Teixeira, L.; Chli, M. Autonomous Emergency Landing for Multicopters using Deep Reinforcement Learning. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 3392–3399. [Google Scholar] [CrossRef]
  14. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar] [CrossRef]
  15. Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
  16. Zhang, T.; Lin, G.; Cai, J.; Shen, T.; Shen, C.; Kot, A.C. Decoupled spatial neural attention for weakly supervised semantic segmentation. IEEE Trans. Multimed. 2019, 21, 2930–2941. [Google Scholar] [CrossRef]
  17. Gao, G.; Xu, G.; Yu, Y.; Xie, J.; Yang, J.; Yue, D. Mscfnet: A lightweight network with multi-scale context fusion for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2021, 23, 25489–25499. [Google Scholar] [CrossRef]
  18. Xing, Y.; Zhong, L.; Zhong, X. An encoder-decoder network based FCN architecture for semantic segmentation. Wirel. Commun. Mob. Comput. 2020, 2020, 8861886. [Google Scholar] [CrossRef]
  19. Saiz, F.A.; Alfaro, G.; Barandiaran, I.; Graña, M. Generative adversarial networks to improve the robustness of visual defect segmentation by semantic networks in manufacturing components. Appl. Sci. 2021, 11, 6368. [Google Scholar] [CrossRef]
  20. Xiang, S.; Xie, Q.; Wang, M. Semantic Segmentation for Remote Sensing Images Based on Adaptive Feature Selection Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8006705. [Google Scholar] [CrossRef]
  21. Wang, M.; Dong, Z.; Cheng, Y.; Li, D. Optimal Segmentation of High-Resolution Remote Sensing Image by Combining Superpixels with the Minimum Spanning Tree. IEEE Trans. Geosci. Remote Sens. 2018, 56, 228–238. [Google Scholar] [CrossRef]
  22. Khan, M.Z.; Gajendran, M.K.; Lee, Y.; Khan, M.A. Deep neural architectures for medical image semantic segmentation. IEEE Access 2021, 9, 83002–83024. [Google Scholar] [CrossRef]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Part III, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  24. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  25. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar] [CrossRef]
  26. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  27. Zhang, Z.; Lu, W.; Cao, J.; Xie, G. MKANet: An efficient network with Sobel boundary loss for land-cover classification of satellite remote sensing imagery. Remote Sens. 2022, 14, 4514. [Google Scholar] [CrossRef]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  29. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar] [CrossRef]
  30. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar] [CrossRef]
  31. Huang, Y.; Wang, Q.; Jia, W.; Lu, Y.; Li, Y.; He, X. See more than once: Kernel-sharing atrous convolution for semantic segmentation. Neurocomputing 2021, 443, 26–34. [Google Scholar] [CrossRef]
  32. Che, Z.; Shen, L.; Huo, L.; Hu, C.; Wang, Y.; Lu, Y.; Bi, F. MAFF-HRNet: Multi-Attention Feature Fusion HRNet for Building Segmentation in Remote Sensing Images. Remote Sens. 2023, 15, 1382. [Google Scholar] [CrossRef]
  33. Li, M.; Rui, J.; Yang, S.; Liu, Z.; Ren, L.; Ma, L.; Li, Q.; Su, X.; Zuo, X. Method of Building Detection in Optical Remote Sensing Images Based on SegFormer. Sensors 2023, 23, 1258. [Google Scholar] [CrossRef] [PubMed]
  34. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar] [CrossRef]
  35. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
  36. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698. [Google Scholar] [CrossRef]
  37. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  38. Chen, Y.; Wang, Y.; Lu, P.; Chen, Y.S.; Wang, G.P. Large-scale structure from motion with semantic constraints of aerial images. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Guangzhou, China, 23–26 November 2018; pp. 347–359. [Google Scholar] [CrossRef]
  39. Fraundorfer, F.; Weilharter, R.J.; Sormann, C.; Ainetter, S. Semantic Drone Dataset. Available online: https://dronedataset.icg.tugraz.at/ (accessed on 8 November 2023).
  40. Zhang, Z.; Wei, L.; Xiang, S.; Xie, G.; Liu, C.; Xu, M. Task-Driven Onboard Real-Time Panchromatic Multispectral Fusion Processing Approach for High-Resolution Optical Remote Sensing Satellite. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7636–7661. [Google Scholar] [CrossRef]
  41. Zhang, Z.; Qu, Z.; Liu, S.; Li, D.; Cao, J.; Xie, G. Expandable on-board real-time edge computing architecture for Luojia3 intelligent remote sensing satellite. Remote Sens. 2022, 14, 3596. [Google Scholar] [CrossRef]
  42. Wang, M.; Zhang, Z.; Zhu, Y.; Dong, Z.; Li, Y. Embedded GPU implementation of sensor correction for on-board real-time stream computing of high-resolution optical satellite imagery. J. Real-Time Image Process. 2018, 15, 565–581. [Google Scholar] [CrossRef]
  43. Wang, M.; Zhang, Z.; Dong, Z.; Jin, S.; Su, H. Stream-computing based high accuracy on-board real-time cloud detection for high resolution optical satellite imagery. Acta Geod. Cartogr. Sin. 2018, 47, 760. [Google Scholar] [CrossRef]
  44. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
  45. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar] [CrossRef]
Figure 1. The overall framework of KDP-Net.
Figure 2. The framework of the KDP module.
Figure 3. The texture of feature maps in detail.
Figure 4. The overall framework of the BSN.
Figure 5. Details of the segmentation head in the BSN.
Figure 6. Hybrid loss function in detail.
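To make the idea illustrated in Figure 6 easier to reproduce, the minimal sketch below shows one way such a hybrid loss could be assembled in PyTorch: a pixel-wise focal loss [37] combined with an edge-weighted cross-entropy term through a balancing coefficient α (the same coefficient varied in the ablation of Figure 13). The function names, focal-loss parameters, and the exact way α mixes the two terms are illustrative assumptions, not the formulation used by KDP-Net.

```python
# Minimal sketch (illustrative, not the authors' exact formulation): a hybrid
# segmentation loss mixing a pixel-wise focal loss with an edge-weighted
# cross-entropy term, balanced by a coefficient alpha.
import torch
import torch.nn.functional as F


def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss [37] built on top of per-pixel cross-entropy."""
    ce = F.cross_entropy(logits, target, reduction="none")  # (N, H, W)
    pt = torch.exp(-ce)                                     # prob. of the true class
    return ((1.0 - pt) ** gamma * ce).mean()


def hybrid_loss(logits, target, edge_mask, alpha=0.4, edge_weight=2.0):
    """Combine a global focal term with a term that emphasizes edge pixels.

    edge_mask: float tensor in {0, 1}, e.g., a Canny mask as in Figure 7;
    alpha balances the edge-aware term against the global focal term.
    """
    ce = F.cross_entropy(logits, target, reduction="none")  # (N, H, W)
    weights = 1.0 + edge_weight * edge_mask                 # boost edge pixels
    edge_term = (weights * ce).sum() / weights.sum()
    return alpha * edge_term + (1.0 - alpha) * focal_loss(logits, target, gamma=2.0)
```

In this sketch, α values in the range explored by the ablation of Figure 13 (0.2–0.8) shift the emphasis between boundary accuracy and overall accuracy.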
Figure 7. The Canny edge mask in detail.
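As a companion to Figure 7, the sketch below shows how a binary edge mask could be derived from a ground-truth label map with OpenCV's Canny detector [36], for example to serve as the edge weighting in the hybrid loss sketched above. The scaling of class indices, the Canny thresholds, and the dilation step are assumptions for illustration rather than the paper's exact settings.

```python
# Minimal sketch: deriving a binary class-boundary mask from a ground-truth
# label map with OpenCV's Canny detector [36]. The index scaling, thresholds,
# and dilation step are illustrative assumptions, not the paper's settings.
import cv2
import numpy as np


def canny_edge_mask(label_map: np.ndarray, dilate_px: int = 2) -> np.ndarray:
    """Return a float mask that is 1 near class boundaries and 0 elsewhere."""
    # Spread small class indices (e.g., 0-5 for UDD6) over 0-255 so the gradient
    # at class boundaries comfortably exceeds the Canny thresholds.
    max_idx = max(int(label_map.max()), 1)
    scaled = (label_map.astype(np.float32) * (255.0 / max_idx)).astype(np.uint8)
    edges = cv2.Canny(scaled, 50, 150)
    if dilate_px > 0:
        # Thicken the boundaries slightly so the loss covers a small neighborhood.
        kernel = np.ones((3, 3), np.uint8)
        edges = cv2.dilate(edges, kernel, iterations=dilate_px)
    return (edges > 0).astype(np.float32)
```

In a training pipeline, such a mask would typically be computed once per label image and passed alongside the labels to the loss function.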
Figure 8. Urban Drone Dataset (UDD6) containing UAV-captured aerial outdoor images.
Figure 9. Semantic Drone Dataset (SDD) containing UAV-captured aerial city images.
Figure 10. Qualitative prediction results on the UDD6 test dataset. (a) Input image; (b) ground truth; (c) DeepLabv3+; (d) BiSeNetV2; (e) SegFormer; (f) MKANet; (g) KDP-Net.
Figure 11. Qualitative prediction results on the SDD test dataset. (a) Input image; (b) ground truth; (c) DeepLabv3+; (d) BiSeNetV2; (e) SegFormer; (f) MKANet; (g) KDP-Net.
Figure 12. Overall performance in detail.
Figure 13. Qualitative prediction results of the different strategies on the UDD6 test dataset. (a) Input image; (b) ground truth; (c) baseline; (d) hybrid loss α = 0.2; (e) hybrid loss α = 0.4; (f) hybrid loss α = 0.8; (g) KDP module; (h) KDP-Net.