Article

DFANet: Denoising Frequency Attention Network for Building Footprint Extraction in Very-High-Resolution Remote Sensing Images

1 School of Information Engineering, Yulin University, Yulin 719000, China
2 Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
3 Key Laboratory of Collaborative Intelligence Systems, School of Computer Science and Technology, Ministry of Education, Xidian University, Xi’an 710071, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(22), 4592; https://doi.org/10.3390/electronics12224592
Submission received: 16 October 2023 / Revised: 3 November 2023 / Accepted: 8 November 2023 / Published: 10 November 2023
(This article belongs to the Special Issue Applications of Computational Intelligence, Volume 2)

Abstract: With the rapid development of very-high-resolution (VHR) remote-sensing technology, the automatic identification and extraction of building footprints are significant for tracking urban development and evolution. Nevertheless, while VHR imagery characterizes building details more accurately, it also inevitably amplifies background interference and noise, which degrades the fine-grained detection of building footprints. To tackle these issues, we intensively exploit the attention mechanism, a computational intelligence technique inspired by the biological vision system that rapidly and automatically captures critical information, to provide a feasible solution. On the basis of the a priori frequency differences among ground objects, we propose the denoising frequency attention network (DFANet) for building footprint extraction in VHR images. Specifically, we design a denoising frequency attention module and a pyramid pooling module, which are embedded into an encoder–decoder network architecture. The denoising frequency attention module efficiently filters high-frequency noise in the feature maps and enhances the frequency information related to buildings. In addition, the pyramid pooling module strengthens the adaptability and robustness to buildings at different scales. Experimental results on two commonly used real datasets demonstrate the effectiveness and superiority of the proposed method; visualization and ablation analyses further confirm the critical role of the proposed modules.

1. Introduction

With the rapid development of satellite, aircraft, and UAV technology, it has become easier to obtain high-resolution and very-high-resolution (VHR) remote-sensing images [1]. Based on these high-quality remote-sensing images, the detailed information of ground objects can be clearly depicted, which facilitates many remote-sensing tasks, including but not limited to land-cover classification [2], object detection [3], change detection [4], etc. Among the ground objects covered by VHR images, buildings, as the carrier of human production and living activities, are of vital significance to the human living environment, and are good indicators of population aggregation, energy consumption intensity, and regional development [5]. Therefore, the accurate extraction of buildings from remote-sensing images is conducive to the study of urban dynamic expansion and population distribution patterns, promoting the digital construction and management of cities, and enhancing the sustainable development of cities [6].
Although some research progress has been made in building footprint extraction in recent years, the diversity of remote-sensing image sources and the complexity of the environment still bring many challenges to this task, mainly including:
(a)
In optical remote-sensing images, buildings exhibit small inter-class variance and large intra-class variance [7]. For example, non-building objects such as roads, playgrounds, and parking lots share similar characteristics with buildings (e.g., spectrum, shape, size, and structure), which easily confuse the extraction method [8].
(b)
Due to the different imaging angles of sensors, high-rise buildings often produce different degrees of geometric distortion, which increases the difficulty of algorithm recognition [9].
(c)
Due to differences in the solar altitude angle at acquisition time, buildings cast shadows at different angles, which not only interfere with the coverage area of the building itself, but also easily conceal the characteristics of other buildings covered by the shadows [10].
In recent years, deep learning methods represented by the convolutional neural network (CNN) have shown great potential in the fields of computer vision [11,12] and remote-sensing image interpretation [13,14]. With their powerful ability to extract high-level features, CNN-based building footprint extraction methods alleviate the above-mentioned problems to a certain extent. Most of these methods adopt the fully convolutional encoder–decoder architecture. For example, Ji et al. proposed a Siamese U-shaped network named SiU-Net for building extraction, which enhances the robustness to buildings of different scales by simultaneously processing original images and downsampled low-resolution images [15]. The method proposed by Sun et al. improves the detection accuracy of building edges by combining a CNN with an active contour model [16]. Yuan et al. designed a CNN with a simple structure, which integrates pixel-level predictions activated by multiple layers and introduces a signed distance function to represent boundaries in the output, yielding stronger representation ability [17,18]. In addition, BRRNet, proposed by Shao et al., introduces atrous convolutions with different dilation rates to extract more global features by gradually enlarging the receptive field during feature extraction, together with a residual refinement module that further refines the residual between the prediction and the ground truth [19]. However, existing approaches still suffer from challenges and limitations. Most of the above methods are extensions of general end-to-end semantic segmentation approaches; they neither analyze the characteristics of buildings themselves in a targeted way nor filter noise effectively.
Inspired by the human visual attention mechanism and the frequency characteristics of different ground objects, in this paper, we propose a denoising frequency attention network (DFANet) for building footprint extraction in VHR images. The whole network still adopts the fully convolutional encoder–decoder architecture but introduces two designed modules, namely the denoising frequency attention block (DFAB) and the pyramid pooling module (PPM). Specifically, the DFAB is parameter-free and is embedded in each layer of the network to better extract building footprints by refining the feature maps of different layers. It first uses a low-pass filter to remove high-frequency noise in the feature map; then, the feature map is reweighted in the transform domain to enhance the information most relevant to buildings; finally, a high-pass filter is used to reduce the loss of details. The PPM is inserted between the encoder and decoder; it builds multi-scale feature maps with adaptive average pooling layers of different sizes and then stacks them together to obtain better multi-scale object recognition. In this way, the proposed DFANet can effectively filter background noise while enhancing the frequency details of buildings, highlighting the characteristics of the buildings themselves and improving extraction accuracy. We conduct extensive experiments on two commonly used real-world datasets, and the results demonstrate the validity and superiority of our proposal. The visualization and ablation analyses further demonstrate the critical role of our method. The main contributions of this paper can be summarized as follows:
(1)
We propose a novel denoising frequency attention network (DFANet) for building footprint extraction in VHR images. It contributes to the enhancement of the frequency details of the building while filtering out the background noise interference, which in turn greatly improves the building extraction capability.
(2)
We specifically design the denoising frequency attention block and pyramid pooling module to enable better extraction of building footprints by refining the feature maps of different layers and constructing multi-scale fusion feature maps with adaptive average pooling layers of different sizes.
(3)
Numerous experiments on public datasets demonstrate the advanced performance achieved by our method. In addition, both the visualization analysis and the ablation experiments confirm that the proposed DFAB and PPM contribute positively to the results.
The rest of this paper is organized as follows. Section 2 introduces some related work. Section 3 expounds the proposed approach in detail. In Section 4 and Section 5, experimental results are reported and discussed. The conclusion and future work are in Section 6.

2. Related Works

Remote-sensing imagery can provide effective data support for humans to reform nature, and it has been widely used in Earth observation [20,21,22]. With the rapid development of aerial photography technologies such as satellites and aviation, high-resolution remote-sensing images allow for observing detailed ground targets such as buildings, roads, and vehicles. In particular, building footprint extraction is of great significance for urban development planning and urban disaster prevention and mitigation, since buildings are one of the main man-made targets through which humans transform the Earth’s surface [23,24,25,26]. Building footprint extraction has long been a concern of scholars, and many extraction methods have been proposed in the past decade. These methods can be grouped into two categories: conventional building footprint extraction methods and deep-learning-based building footprint extraction methods. We briefly review them as follows.

2.1. Conventional Building Footprint Extraction Methods

Building footprint extraction plays an important role in the interpretation and application of remote-sensing images [27]. In the early stage, scholars worked on extracting building footprints through different mathematical models or combining multiple types of data information. For instance, Reference [28] designed a fully automatic building footprint extraction approach from the differential morphological profile of high-resolution satellite imagery. In Reference [29], a Bayesian-based approach is proposed to extract building footprints through aerial LiDAR data. This method employs the shortest path algorithm and maximizes the posterior probability using linear optimization to automatically obtain building footprints. Sahar et al. utilized vector parcel geometries and their attributes to extract building footprints by using integrated aerial imagery and geographic information system (GIS) data [23]. These methods often require different types of data support to achieve building footprint extraction, and the results are not reliable enough [30,31]. In addition, scholars have devoted themselves to designing various hand-crafted features to automatically extract building footprints from high-resolution remote-sensing images. Zhang et al. devised a pixel shape index to extract buildings by classifying the shape and contour information of pixels [32]. Huang et al. proposed a morphological building index for automatic building extraction in [33]. Similarly, Huang et al. also developed a morphological shadow index for building extraction from high-resolution remote-sensing images [34]. Moreover, some methods use morphological attributes to achieve building footprint extraction [35,36]. In summary, these conventional approaches have been exploited to extract building footprints from high-resolution remote-sensing images.

2.2. Deep-Learning-Based Building Footprint Extraction Methods

Computational intelligence (CI) is a biology- and linguistics-driven computational paradigm [37,38]. In recent years, deep learning technology, as a main pillar, has been widely used in remote-sensing image interpretation with powerful layer-by-layer learning and nonlinear fitting capabilities, such as change detection [14], scene classification [39], semantic segmentation [40], object detection [41,42], etc. In this context, the building footprint extraction method based on deep learning has attracted the attention of many scholars. The building footprint extraction task can be treated as a single-objective semantic segmentation task [43]. Therefore, the direct idea is to use a deep learning-based semantic segmentation network for building footprint extraction, which can fully utilize mainstream deep neural networks (such as VGGNet [44], ResNet [45], etc.) to mine deep semantic features to recognize buildings. For example, compared with conventional methods, semantic segmentation networks such as fully convolutional network (FCN) [46] and U-Net [47] based on VGGNet can achieve a substantial improvement in the performance of building footprint extraction [17]. These methods promote the research of deep-learning-based building footprint extraction methods. According to this, recently, many deep-learning-based approaches have been proposed for building footprint extraction from high-resolution remote-sensing images in an end-to-end manner [43]. These recent methods can be broadly reviewed as follows.
As the spatial resolution of images continues to increase, the features of various building styles, such as material, color, texture, shape, scale, and distribution, show more obvious differences, which makes it difficult to accurately extract pixel-wise building footprints with conventional semantic segmentation networks [48]. To overcome these challenges, many novel networks based on multi-scale and attention structures have been proposed for building footprint extraction. For example, Ji et al. proposed a Siamese U-Net (SiU-Net) for multi-source building extraction [15]. SiU-Net [15] feeds the down-sampled counterparts of the original images into the other Siamese branch during training to enhance the multi-scale perception ability of the network and improve the performance of building extraction. In [49], a novel network with an encoder–decoder structure, named building residual refine network (BRRNet), is devised for building extraction, which introduces a residual refinement module to enlarge the receptive field of the network, thus improving the extraction of buildings at various scales. Chen et al. proposed a context feature enhancement network (CFENet) to extract building footprints [50], which builds a spatial fusion module and a focus enhancement module to enhance multi-scale feature representation. Other similar networks can be found in [51,52]. In addition to these networks with multi-scale structures, attention-based networks can also enhance multi-scale feature representation, thus effectively improving building footprint extraction accuracy. For instance, Guo et al. developed a U-Net with an attention block for building extraction in [53]. In Reference [54], a scene-driven multitask parallel attention convolutional network is proposed for building extraction from high-resolution remote-sensing images. An attention-gate-based pyramid network (AGPNet) with an encoder–decoder structure is designed for building extraction in [55], which integrates a grid-based attention gate and an atrous spatial pyramid pooling module to enhance multi-scale features. Other attention-based building footprint extraction methods are available in [56,57,58,59].
Recently, some methods have introduced edge information and frequency information to enhance the recognition of buildings [48,60]. For instance, Zhu et al. proposed an edge-detail network for building extraction [61], which considers the edge information of the images to enhance the ability to identify building footprints. In [62], a multi-task frequency–spatial learning network is proposed for building extraction. Zhao et al. adopted a multi-scale attention-guided UNet++ with edge constraints to achieve accurate building footprint segmentation in [63]. For other related work, one can refer to the following studies [64,65,66]. In addition, advanced transformer-based networks have also received attention for building extraction, such as References [57,67,68]. These methods have largely contributed to the development of building footprint extraction.

3. Methodology

In this section, the proposed denoising frequency attention network (DFANet) is introduced in detail. Firstly, the overview of DFANet is demonstrated in Section 3.1. Then, the proposed denoising frequency attention block (DFAB) is illustrated in detail in Section 3.2. Finally, we introduce the pyramid pooling module (PPM) [69] in Section 3.3, which is a widely used module to better extract multi-scale objects.

3.1. Overview

To extract fine building footprints from remote-sensing imagery, we employ the U-shaped encoder–decoder architecture with skip connections as the backbone of DFANet, which has been well proven in similar image segmentation tasks [47], as shown in Figure 1. It is well known that skip connections help preserve detailed information in the deeper layers of the network, which benefits the segmentation performance. Considering that plenty of small building objects need to be well extracted, a backbone with such properties can promote the detection of these objects.
In terms of details, the backbone consists of four different blocks, i.e., the input block, the encoder block, the decoder block, and the output block. The input block contains two 3 × 3 convolutional layers, each with batch normalization (BN) and a rectified linear unit (ReLU) activation function, which map the input remote-sensing image into the feature space of DFANet. The encoder block also has two 3 × 3 convolutional layers followed by BN and ReLU; the difference from the input block is that the encoder block additionally contains a 2 × 2 max pooling layer, which decreases the spatial size. The decoder block has the same convolutional layers as the encoder block but with different convolutional configurations, i.e., input and output channel sizes. Different from the encoder block, the decoder block has a 2 × 2 bilinear interpolation layer instead of the max pooling layer. Finally, the output block has only one 1 × 1 convolutional layer, without BN and ReLU, to generate the raw prediction of DFANet.
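A minimal PyTorch sketch of these blocks is given below for concreteness; the module names and channel sizes are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BN and ReLU (used by the input/encoder/decoder blocks)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class EncoderBlock(nn.Module):
    """Encoder block: 2x2 max pooling to halve the spatial size, then the double convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x):
        return self.conv(self.pool(x))

class DecoderBlock(nn.Module):
    """Decoder block: 2x bilinear upsampling, concatenation with the skip feature, then the double convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([x, skip], dim=1))

# Output block: a single 1x1 convolution without BN/ReLU producing the raw prediction
# (the 64 input channels here are an assumed value).
output_block = nn.Conv2d(64, 1, kernel_size=1)
```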
Apart from the backbone, the proposed DFAB is stacked in the network to acquire better building footprints by refining the feature maps of different layers. To achieve this, the DFAB first utilizes a low-pass filter to remove the noise in the feature maps, since noise usually resides in the high-frequency spectrum. Then, the feature maps are reweighted in the transform domain to enhance the information relevant to building footprint extraction. Finally, since some detailed information is weakened by the low-pass filter, we employ a high-pass filter to avoid the loss of details. Moreover, the DFAB works without any trainable parameters, so the feature maps of DFANet can be refined at a lower cost.
Moreover, to better deal with buildings of varied shapes and sizes, we employ the widely used PPM [69] to increase the receptive field of DFANet and acquire better cognition of multi-scale objects. The proposed DFAB and employed PPM will be demonstrated in detail below.

3.2. Denoising Frequency Attention Block

Attention-based techniques have been time-tested in the remote-sensing field and similar applications, such as building change detection [8] and building extraction [48]. These attention mechanisms usually obtain attention scores or masks based on the data distribution of the feature maps in networks and utilize the acquired scores to reweight the feature maps for better performance. Notably, the features in networks contain noise caused by the input data and by premature feature mappings at the early stage of training, which can poison the generated attention scores and degrade performance. Moreover, to become effective, most attention mechanisms need to be guided and adjusted through training on specific tasks, which limits their usability. In addition, most building footprint extraction methods with attention techniques refine their features with spatial and channel attention mechanisms, which can be affected by the noise inside the feature maps. It is also notable that frequency-based attention mechanisms have been proposed and well proven for change detection [60]. Based on these facts, the proposed DFAB utilizes a transformation-based attention mechanism and kernel filters to refine the features of DFANet without training, which can potentially avoid the aforementioned limitations, as shown in Figure 2. In general, the DFAB is parameter-free and is embedded in each layer of the network to better extract building footprints by refining the feature maps of different layers. It first uses a low-pass filter to filter out high-frequency noise in the feature map; then, the feature map is reweighted in the transform domain to enhance the information more relevant to buildings; finally, a high-pass filter is used to reduce the loss of details.
The DFAB can be formulated as follows. Firstly, let the input feature maps of the DFAB be $f_i \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel, height, and width sizes of $f_i$, respectively. To reduce the high-frequency noise of the input features, we utilize a low-pass filter, namely a Gaussian filter, to process the input feature maps. Different from other widely used low-pass filters, a Gaussian filter can preserve more detailed information due to its structure. For fine-grained feature representation, we utilize a Gaussian filter with a size of 3 × 3 to suppress the noise, which can be denoted as:
$f_L = \mathrm{Gaussian}(f_i)$
where $\mathrm{Gaussian}(\cdot)$ represents the 3 × 3 Gaussian kernel filter. Then, we use the discrete cosine transform (DCT) and global average pooling (GAP) to screen the informative feature maps [70] and further lower the impact of the noise, which can be represented as follows:
$A_s = \mathrm{Softmax}(\mathrm{GAP}(\mathrm{DCT}(f_L)))$
where $\mathrm{DCT}(\cdot)$ and $\mathrm{GAP}(\cdot)$ denote the channel-wise DCT and GAP, respectively. Notably, the Softmax function, denoted as $\mathrm{Softmax}(\cdot)$, is employed to further emphasize the relevant informative features and suppress the irrelevant ones. With these procedures, the attention score $A_s \in \mathbb{R}^{C \times 1 \times 1}$ is obtained. Considering that the low-pass filter inevitably damages the high-frequency information of the input features, we employ a high-pass filter, the Laplacian filter, to extract and enhance the detailed information, which can be represented as:
$f_H = \mathrm{Laplacian}(f_L \otimes A_s)$
where $\mathrm{Laplacian}(\cdot)$ indicates a 3 × 3 Laplacian kernel filter, and $\otimes$ denotes the channel-wise multiplication. Finally, we utilize a residual path to make the output more stable and avoid gradient vanishing [45], which can be denoted as:
$f_o = f_i \oplus f_H$
where $\oplus$ denotes the pixel-wise addition.
To sum up, the DFAB proposed in this paper attempts to achieve a better fine-grained feature representation by utilizing an attention mechanism based on a parameter-free transformation and a kernel filter. Through DFAB, the refined feature maps with less noise and more informative details can potentially benefit the building extraction performance of the proposed DFANet.
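The following PyTorch sketch shows one way the DFAB could be realized from the formulas above; the exact Gaussian and Laplacian kernel coefficients and the matrix-based DCT are our assumptions, since the paper specifies the block only at the operator level.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_filter(x, kernel):
    """Apply one fixed 3x3 kernel to every channel of x independently."""
    c = x.shape[1]
    weight = kernel.to(x.device, x.dtype).view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    return F.conv2d(x, weight, padding=1, groups=c)

def dct_basis(n, device, dtype):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = torch.arange(n, device=device, dtype=dtype)
    basis = torch.cos(math.pi * (k[None, :] + 0.5) * k[:, None] / n)
    basis[0] /= math.sqrt(2.0)
    return basis * math.sqrt(2.0 / n)

class DFAB(nn.Module):
    """Denoising frequency attention block: low-pass -> transform-domain attention -> high-pass -> residual."""
    # Assumed 3x3 Gaussian (low-pass) and Laplacian (high-pass) kernels.
    GAUSSIAN = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
    LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])

    def forward(self, f_i):
        # f_L = Gaussian(f_i): suppress high-frequency noise.
        f_l = depthwise_filter(f_i, self.GAUSSIAN)
        # A_s = Softmax(GAP(DCT(f_L))): channel attention computed in the transform domain.
        b, c, h, w = f_l.shape
        d_h = dct_basis(h, f_l.device, f_l.dtype)
        d_w = dct_basis(w, f_l.device, f_l.dtype)
        dct = d_h @ f_l @ d_w.t()                                   # channel-wise 2D DCT
        a_s = torch.softmax(dct.mean(dim=(2, 3)), dim=1).view(b, c, 1, 1)
        # f_H = Laplacian(f_L ⊗ A_s): re-weight channels, then recover detail information.
        f_h = depthwise_filter(f_l * a_s, self.LAPLACIAN)
        # f_o = f_i ⊕ f_H: residual path for stability.
        return f_i + f_h
```

Because the block holds no trainable weights, it can be inserted after any layer of the network without increasing the parameter count.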

3.3. Pyramid Pooling Module

Convolutional-neural-network-based methods usually suffer from a limited receptive field, which degrades the performance when detecting multi-scale objects. To alleviate this problem, the PPM was proposed in [69], which greatly improves segmentation performance. Encouraged by its success in natural image segmentation tasks, we employ the PPM in the building footprint extraction task to tackle buildings with varied scales. The PPM mainly utilizes adaptive average pooling (AAP) layers of different sizes to construct multi-scale feature maps and then stacks them together to acquire better recognition of multi-scale objects. In the proposed DFANet, the PPM uses the same configuration as in its original work. The detailed process of the PPM is shown in Figure 3 and can also be formulated as follows:
Let the input feature maps of the PPM be $F_I \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel, height, and width sizes, respectively. The input features are processed by AAP layers of four different sizes, i.e., 1 × 1, 2 × 2, 3 × 3, and 6 × 6, which can be represented as:
$F_x = \mathrm{AAP}_n(F_I)$
where $F_x \in \mathbb{R}^{C \times n \times n}$ ($x = 1, 2, 3, 4$; $n = 1, 2, 3, 6$) denotes the processed features, and $\mathrm{AAP}_n(\cdot)$ denotes AAP with an output size of $n \times n$. Then, all of them are compressed in the channel dimension and up-sampled to $H \times W$ by 1 × 1 convolutional layers and bilinear interpolation, respectively, which can be denoted as:
$F_x^O = \mathrm{Upsample}(\mathrm{conv}_x(F_x))$
where $\mathrm{conv}_x(\cdot)$ and $\mathrm{Upsample}(\cdot)$ indicate the 1 × 1 convolutional layers and the bilinear interpolation, respectively. Finally, the feature maps with multi-scale information are stacked together with the input features to better detect multi-scale objects, which can be demonstrated as:
$F_O = F_I \circledcirc F_1^O \circledcirc F_2^O \circledcirc F_3^O \circledcirc F_4^O$
where $\circledcirc$ denotes the channel-wise concatenation.
To conclude, the PPM can increase the receptive field of convolutional neural networks by multi-scale pooling layers and better capture objects with varied sizes. As a result, we attempt to employ the PPM in the proposed DFANet to better deal with multi-scale building objects and further improve the building detection performance.
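A compact PyTorch sketch of the PPM is shown below; compressing each branch to one quarter of the input channels follows the original PSPNet configuration and should be read as an assumption here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid pooling module with bin sizes 1, 2, 3 and 6, following PSPNet [69]."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        # One 1x1 convolution (with BN and ReLU) per branch to compress the channel dimension.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, in_ch // len(bins), kernel_size=1, bias=False),
                nn.BatchNorm2d(in_ch // len(bins)),
                nn.ReLU(inplace=True),
            )
            for _ in bins
        ])

    def forward(self, f_in):
        h, w = f_in.shape[2:]
        outputs = [f_in]
        for n, branch in zip(self.bins, self.branches):
            x = F.adaptive_avg_pool2d(f_in, n)        # F_x = AAP_n(F_I)
            x = branch(x)                             # channel compression
            x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
            outputs.append(x)                         # F_x^O, up-sampled back to H x W
        return torch.cat(outputs, dim=1)              # channel-wise concatenation F_O
```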

4. Experimental Results

In this section, we implement extensive experiments and ablation analysis to demonstrate the superior performance of the proposed method. First, the dataset, evaluation metrics, and implementation details are provided, and then six comparison algorithms are used to show that our method is able to achieve state-of-the-art performance, and finally, ablation analysis is implemented to demonstrate that each module contributes to the improvement of the results.

4.1. Datasets and Evaluation Metrics

In the experiments, two benchmark building detection datasets are employed to demonstrate the effectiveness of our proposed DFANet, namely the Massachusetts dataset [71] and the East Asia Dataset [15].
The Massachusetts dataset [71] was collected over the Boston area of the United States and covers approximately 340 square kilometers. This dataset contains a total of 151 aerial images of 1500 × 1500 pixels with a 1 m spatial resolution. In the original dataset, 137, 4, and 10 of these images are used as the training, validation, and test sets, respectively. In these experiments, consistent with [48], we sequentially crop each image into four non-overlapping tiles of size 512 × 512. Therefore, the training, validation, and test sets are composed of 548, 16, and 40 aerial image tiles, respectively. Some aerial image tiles are shown in Figure 4a. It can be seen that the Massachusetts dataset contains both dense and sparse buildings against complex backgrounds, which places strict demands on the model’s building extraction ability across multiple scene categories.
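For illustration, a possible tiling routine is sketched below; taking the four image corners as tile positions is our assumption, since the paper only states that four non-overlapping 512 × 512 tiles are cropped from each image.

```python
import numpy as np

def crop_four_tiles(image, tile=512):
    """Crop four non-overlapping tile x tile patches from an H x W (x C) image.
    The corner positions used here are an assumption, not the authors' exact scheme."""
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - tile), (h - tile, 0), (h - tile, w - tile)]
    return [image[r:r + tile, c:c + tile] for r, c in offsets]
```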
The East Asia dataset [15] covers an area of 550 square kilometers in East Asia and was collected from six adjacent satellite images with a spatial resolution of 2.7 m. The image size of this dataset is 512 × 512, and some image tiles are shown in Figure 4b. In these experiments, consistent with [48], we selected the parts with buildings from the whole dataset, yielding 3153 tiles and 903 tiles in the training and testing sets, respectively. Since these images are collected from different data sources but share similar architectural styles in the same geographical area, this dataset can be leveraged to accurately test and evaluate the generalization ability of deep models for building extraction.
In terms of evaluation metrics, four commonly used building detection indicators, namely Precision, Recall, F1-Score, and Intersection over Union (IOU), are employed to measure the performance of all the methods. Precision and Recall are the two most commonly used quantitative indicators. Precision is the proportion of pixels predicted as buildings that are extracted correctly, while Recall is the proportion of ground-truth building pixels that are correctly predicted. Due to the imbalance between the number of building pixels and the number of non-building pixels in the dataset, the two comprehensive indicators F1-Score and IOU are also considered, which more objectively reflect the ability of the model to handle the building detection task. Their detailed definitions are as follows:
$\text{Precision} = \frac{TP}{TP + FP},$
$\text{Recall} = \frac{TP}{TP + FN},$
$\text{F1-Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}},$
$\text{IOU} = \frac{TP}{TP + FN + FP}.$
where TP and FP are the number of true-positive and false-positive pixels, respectively. Similarly, TN and FN represent the number of true-negative and false-negative pixels, respectively.
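For reference, the four metrics can be computed from binary masks as in the following sketch, which is a straightforward implementation of the formulas above rather than the authors' evaluation code.

```python
import numpy as np

def building_metrics(pred, gt, eps=1e-8):
    """Precision, Recall, F1-Score and IOU from binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # building pixels correctly predicted
    fp = np.logical_and(pred, ~gt).sum()       # non-building pixels predicted as building
    fn = np.logical_and(~pred, gt).sum()       # building pixels missed by the prediction
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * recall * precision / (recall + precision + eps)
    iou = tp / (tp + fn + fp + eps)
    return precision, recall, f1, iou
```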

4.2. Implementation Details

In the experiments, we reproduced all the comparison methods and ran them under the same experimental conditions to ensure a fair comparison with DFANet. DFANet was implemented on the PyTorch platform with CUDA 11.6, using a single NVIDIA RTX 3090 GPU with 24 GB of video memory.
In the setting of hyperparameters, the batch size is set to 4, and the Adam optimizer is employed. The initial learning rate is set to $10^{-4}$, with a weight decay rate of $10^{-5}$. In addition, a multi-step learning rate decay is deployed to progressively reduce the learning rate during training, with gamma set to 0.9 and milestones set to [30, 35, 40, 45, 50, 55, 60, 65, 70]. It is worth mentioning that the experiments do not employ data augmentation strategies.
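A hypothetical training setup reflecting these settings is sketched below; DFANet, train_loader, criterion, and num_epochs are placeholders rather than names from the released code.

```python
import torch

# Placeholder training loop mirroring the reported hyperparameters.
model = DFANet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 35, 40, 45, 50, 55, 60, 65, 70],  # epochs at which the learning rate decays
    gamma=0.9,                                        # decay factor applied at each milestone
)

for epoch in range(num_epochs):
    for images, labels in train_loader:               # batch size 4, no data augmentation
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()
```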

4.3. Comparison with Other Methods

4.3.1. Comparative Algorithms

To verify the effectiveness of our proposed DFANet, six excellent peers are selected as comparative algorithms, which are detailed as follows.
(1)
U-Net [47] is built on the basis of FCN8s [46], which mainly includes a contraction path for extracting image features or context and an expansion path for precise segmentation.
(2)
PANet [72] is a pyramid attention network that exploits global context information in semantic segmentation and introduces feature pyramid attention and global attention upsampling to overcome the loss of localization information.
(3)
SiU-Net [15] is designed based on a Siamese fully convolutional network, where the two branches of the network share weights, and the original image and its downsampled counterpart are used as input.
(4)
BRRNet [49] consists of a prediction module and a residual refinement module. The prediction module obtains a larger receptive field by introducing dilated convolutions with different dilation rates, while the residual refinement module takes the output of the prediction module as input to improve the accuracy of building segmentation.
(5)
AGPNet [55] is one of the state-of-the-art methods designed for building segmentation. It adopts an encoder–decoder structure that combines a grid-based attention gate and an atrous spatial pyramid pooling module.
(6)
Res2-Unet [73] is an end-to-end building detection network that employs granular-level multi-scale learning to expand the receptive field size of each bottleneck layer, focusing on pixels in complex background boundary regions.

4.3.2. Results on the Massachusetts Dataset

Table 1 shows the quantitative results compared with various building extraction algorithms. It can be observed that, on the two individual indicators, DFANet lags behind the best-performing algorithm in Recall (i.e., BRRNet) by 1.75% and behind the best-performing algorithm in Precision (i.e., U-Net) by 5.24%. However, due to the data imbalance in quite a few images, these two indicators alone are not enough to reflect the performance of the algorithms. On the two comprehensive indicators, DFANet achieves the state-of-the-art (SOTA) results, outperforming BRRNet in F1-Score and IOU by 1.06% and 1.50%, respectively.
In addition to the quantitative analysis, we also provide visual results for some cases in the Massachusetts dataset, as shown in Figure 5. It can be observed from the visualized results that DFANet has a more robust detection ability for buildings, whether in dense or sparse building scenes. Taking the first case in Figure 5 as an example, for a building scene similar to the background, the other comparison algorithms inevitably produce a large number of missed or false detections, while our proposed DFANet can accurately detect most of the buildings. In addition, we can also conclude from the visualization results that DFANet detects building edges noticeably more accurately than the other comparison algorithms.

4.3.3. Results on the East Asia Dataset

Table 2 shows the quantitative results of DFANet and the comparative methods on the East Asia dataset. Although DFANet does not perform well in Precision on this dataset, it leads the previous best result by 2.80% in Recall. However, these two indicators alone cannot be directly related to the effectiveness of the algorithms due to the problem of data imbalance. In terms of the comprehensive indicators, DFANet still reaches the SOTA and exceeds AGPNet by 0.18% in F1-Score and 0.25% in IOU.
Some visualization results on the East Asia dataset are shown in Figure 6. As shown in the fourth case in Figure 6, the buildings are quite similar to the background scene, and it is difficult to distinguish them even with the human eye. Algorithms such as PANet, SiU-Net, BRRNet, and Res2-Unet suffer from missed detections, while only U-Net, AGPNet, and DFANet can accurately identify the buildings to a large extent. Compared with U-Net and AGPNet, DFANet performs better in building edge detection. For densely built scenes, DFANet also outperforms the other comparative algorithms.

5. Discussion

During the implementation of DFANet, it remains difficult to see directly how much the proposed modules contribute to detection performance. In other words, we would like to explore the relationship between the proposed modules and the improvement of building footprint extraction accuracy. We evaluate and analyze the proposed DFAB and PPM on the two benchmark datasets with the four quantitative metrics (i.e., Precision, Recall, F1-Score, and IOU). The quantitative results of the ablation experiments on the Massachusetts dataset and the East Asia dataset are shown in Table 3 and Table 4, respectively. It can be clearly concluded that, compared with the backbone, both the DFAB and the PPM improve the overall performance of the network, and their combination further improves the model's ability to detect buildings rather than the two modules interfering with each other. Specifically, in terms of the two comprehensive indicators, the DFAB, the PPM, and their combination are, respectively, 1.26%, 0.68%, and 2.36% higher than the backbone in F1-Score and 1.75%, 0.93%, and 3.30% higher in IOU on the Massachusetts dataset. Meanwhile, on the East Asia dataset, the DFAB, the PPM, and their combination are, respectively, 0.34%, 0.79%, and 0.86% higher than the backbone in F1-Score and 0.49%, 1.13%, and 1.22% higher in IOU.
In addition to the quantitative analysis, some visualization results of feature maps and heat maps on the East Asia dataset are shown in Figure 7. It can be observed that the PPM helps the backbone network better capture building features of different shapes and sizes, while the DFAB enhances the building-related information to refine the feature maps of each layer at a lower cost.

6. Conclusions

In this article, a novel denoising frequency attention network (DFANet) is proposed for building footprint extraction in VHR remote-sensing images. The proposed DFANet contains three parts: a U-shape backbone network, a denoising frequency attention block (DFAB), and a pyramid pooling module (PPM). In the proposed DFANet, we devised a parameter-free DFAB, which can enhance the relevant information about buildings, thereby refining the feature maps of each layer at a lower cost. In addition, in order to better capture building features of varied shapes and sizes, we also introduced a widely used PPM to enlarge the receptive field of our proposed DFANet. Experiments on two publicly available large building footprint extraction datasets demonstrate that our proposed DFANet is able to achieve competitive performance compared to other state-of-the-art methods. Moreover, sufficient ablation experiments show that introducing our designed parameter-free DFAB can effectively improve the building detection performance. In future work, we will further study this method from the following two aspects. On the one hand, we will use more building extraction datasets to further verify the robustness of the method. On the other hand, we will test the feasibility of the DFAB in other tasks.

Author Contributions

Conceptualization, L.L. and T.L.; methodology, L.L., T.L. and F.J.; validation, L.L. and B.H.; investigation, L.L. and P.Z.; writing—original draft preparation, L.L., T.L. and F.J.; writing—review and editing, L.L., T.L. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Science Foundation of China Funding Project for Department of Education of Shaanxi Province of China (Grant No. 22JC063), the Natural Science and Technology Project Plan in Yulin of China (Grant Nos. CXY-2022-94, CXY-2022-92, CXY-2022-93, CXY-2021-102-04, CXY-2021-102-02, CXY-2021-102-01, CXY-2021-102-03, CXY-2021-102-05, CXY-2021-94-02), the Natural Science Basic Research Plan in Shaanxi Province of China (2021NY-208), and the Scientific Research Program Funded by Yulin National High Tech Industrial Development Zone (Program Nos. CXY-2021-55, CXY-2021-57, CXY-2021-32, CXY-2021-40, 2022HX234).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lv, Z.; Liu, T.; Benediktsson, J.A.; Falco, N. Land cover change detection techniques: Very-high-resolution optical images: A review. IEEE Geosci. Remote. Sens. Mag. 2021, 10, 44–63. [Google Scholar] [CrossRef]
  2. Gong, M.; Li, J.; Zhang, Y.; Wu, Y.; Zhang, M. Two-path aggregation attention network with quad-patch data augmentation for few-shot scene classification. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1. [Google Scholar] [CrossRef]
  3. Gong, Y.; Xiao, Z.; Tan, X.; Sui, H.; Xu, C.; Duan, H.; Li, D. Context-aware convolutional neural network for object detection in VHR remote sensing imagery. IEEE Trans. Geosci. Remote. Sens. 2019, 58, 34–44. [Google Scholar] [CrossRef]
  4. Jiang, F.; Gong, M.; Zheng, H.; Liu, T.; Zhang, M.; Liu, J. Self-Supervised Global-Local Contrastive Learning for Fine-Grained Change Detection in VHR Images. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  5. Liu, P.; Liu, X.; Liu, M.; Shi, Q.; Yang, J.; Xu, X.; Zhang, Y. Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote. Sens. 2019, 11, 830. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Gong, M.; Li, J.; Zhang, M.; Jiang, F.; Zhao, H. Self-supervised monocular depth estimation with multiscale perception. IEEE Trans. Image Process. 2022, 31, 3251–3266. [Google Scholar] [CrossRef]
  7. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple attending path neural network for building footprint extraction from remote sensed imagery. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 6169–6181. [Google Scholar] [CrossRef]
  8. Liu, T.; Gong, M.; Lu, D.; Zhang, Q.; Zheng, H.; Jiang, F.; Zhang, M. Building change detection for VHR remote sensing images via local–global pyramid network and cross-task transfer learning strategy. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1. [Google Scholar] [CrossRef]
  9. Sun, Y.; Hua, Y.; Mou, L.; Zhu, X.X. CG-Net: Conditional GIS-aware network for individual building segmentation in VHR SAR images. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1. [Google Scholar] [CrossRef]
  10. Kadhim, N.; Mourshed, M. A shadow-overlapping algorithm for estimating building heights from VHR satellite images. IEEE Geosci. Remote. Sens. Lett. 2017, 15, 8–12. [Google Scholar] [CrossRef]
  11. Wu, Y.; Liu, J.; Gong, M.; Gong, P.; Fan, X.; Qin, A.; Miao, Q.; Ma, W. Self-Supervised Intra-Modal and Cross-Modal Contrastive Learning for Point Cloud Understanding. IEEE Trans. Multimed. 2023, 1–13. [Google Scholar] [CrossRef]
  12. Gong, M.; Zhao, Y.; Li, H.; Qin, A.; Xing, L.; Li, J.; Liu, Y.; Liu, Y. Deep Fuzzy Variable C-Means Clustering Incorporated with Curriculum Learning. IEEE Trans. Fuzzy Syst. 2023, 1–15. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Gong, M.; Zhang, M.; Li, J. Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling. IEEE Trans. Neural Netw. Learn. Syst. 2023. ahead of print. [Google Scholar] [CrossRef] [PubMed]
  14. Wu, Y.; Li, J.; Yuan, Y.; Qin, A.; Miao, Q.G.; Gong, M.G. Commonality autoencoder: Learning common features for change detection from heterogeneous images. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4257–4270. [Google Scholar] [CrossRef] [PubMed]
  15. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote. Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Gong, M.; Li, J.; Feng, K.; Zhang, M. Autonomous perception and adaptive standardization for few-shot learning. Knowl.-Based Syst. 2023, 277, 110746. [Google Scholar] [CrossRef]
  17. Yuan, J. Learning building extraction in aerial scenes with convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2793–2798. [Google Scholar] [CrossRef]
  18. Liu, T.; Gong, M.; Jiang, F.; Zhang, Y.; Li, H. Landslide inventory mapping method based on adaptive histogram-mean distance with bitemporal VHR aerial images. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  19. Wu, Y.; Liu, J.; Yuan, Y.; Hu, X.; Fan, X.; Tu, K.; Gong, M.; Miao, Q.; Ma, W. Correspondence-Free Point Cloud Registration Via Feature Interaction and Dual Branch [Application Notes]. IEEE Comput. Intell. Mag. 2023, 18, 66–79. [Google Scholar] [CrossRef]
  20. Lv, Z.; Zhong, P.; Wang, W.; You, Z.; Shi, C. Novel Piecewise Distance based on Adaptive Region Key-points Extraction for LCCD with VHR Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens. 2023, 61. [Google Scholar] [CrossRef]
  21. Li, J.; Li, H.; Liu, Y.; Gong, M. Multi-fidelity evolutionary multitasking optimization for hyperspectral endmember extraction. Appl. Soft Comput. 2021, 111, 107713. [Google Scholar] [CrossRef]
  22. Lv, Z.; Zhang, P.; Sun, W.; Benediktsson, J.A.; Li, J.; Wang, W. Novel Adaptive Region Spectral-Spatial Features for Land Cover Classification with High Spatial Resolution Remotely Sensed Imagery. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 5609412. [Google Scholar] [CrossRef]
  23. Sahar, L.; Muthukumar, S.; French, S.P. Using aerial imagery and GIS in automated building footprint extraction and shape recognition for earthquake risk assessment of urban inventories. IEEE Trans. Geosci. Remote. Sens. 2010, 48, 3511–3520. [Google Scholar] [CrossRef]
  24. Van Etten, A.; Hogan, D.; Manso, J.M.; Shermeyer, J.; Weir, N.; Lewis, R. The multi-temporal urban development spacenet dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6398–6407. [Google Scholar]
  25. Ma, H.; Liu, Y.; Ren, Y.; Yu, J. Detection of collapsed buildings in post-earthquake remote sensing images based on the improved YOLOv3. Remote. Sens. 2019, 12, 44. [Google Scholar] [CrossRef]
  26. Li, H.; Li, J.; Zhao, Y.; Gong, M.; Zhang, Y.; Liu, T. Cost-sensitive self-paced learning with adaptive regularization for classification of image time series. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 11713–11727. [Google Scholar] [CrossRef]
  27. Song, W.; Haithcoat, T.L. Development of comprehensive accuracy assessment indexes for building footprint extraction. IEEE Trans. Geosci. Remote. Sens. 2005, 43, 402–404. [Google Scholar] [CrossRef]
  28. Shackelford, A.K.; Davis, C.H.; Wang, X. Automated 2-D building footprint extraction from high-resolution satellite multispectral imagery. In Proceedings of the IGARSS 2004 IEEE International Geoscience and Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 1996–1999. [Google Scholar]
  29. Wang, O.; Lodha, S.K.; Helmbold, D.P. A bayesian approach to building footprint extraction from aerial lidar data. In Proceedings of the 3rd International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’06), Washington, DC, USA, 14–16 June 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 192–199. [Google Scholar]
  30. Zabuawala, S.; Nguyen, H.; Wei, H.; Yadegar, J. Fusion of LiDAR and aerial imagery for accurate building footprint extraction. In Image Processing: Machine Vision Applications II; SPIE: Bellingham, WA, USA, 2009; Volume 7251, pp. 337–347. [Google Scholar]
  31. Wang, J.; Zeng, C.; Lehrbass, B. Building extraction from LiDAR and aerial images and its accuracy evaluation. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 64–67. [Google Scholar]
  32. Zhang, L.; Huang, X.; Huang, B.; Li, P. A pixel shape index coupled with spectral information for classification of high spatial resolution remotely sensed imagery. IEEE Trans. Geosci. Remote. Sens. 2006, 44, 2950–2961. [Google Scholar] [CrossRef]
  33. Huang, X.; Zhang, L. A Multidirectional and Multiscale Morphological Index for Automatic Building Extraction from Multispectral GeoEye-1 Imagery. Photogramm. Eng. Remote. Sens. 2011, 77, 721–732. [Google Scholar] [CrossRef]
  34. Huang, X.; Zhang, L. Morphological building/shadow index for building extraction from high-resolution imagery over urban areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2011, 5, 161–172. [Google Scholar] [CrossRef]
  35. Ma, W.; Wan, Y.; Li, J.; Zhu, S.; Wang, M. An automatic morphological attribute building extraction approach for satellite high spatial resolution imagery. Remote. Sens. 2019, 11, 337. [Google Scholar] [CrossRef]
  36. Li, J.; Cao, J.; Feyissa, M.E.; Yang, X. Automatic building detection from very high-resolution images using multiscale morphological attribute profiles. Remote. Sens. Lett. 2020, 11, 640–649. [Google Scholar] [CrossRef]
  37. Wu, Y.; Ding, H.; Gong, M.; Qin, A.; Ma, W.; Miao, Q.; Tan, K.C. Evolutionary multiform optimization with two-stage bidirectional knowledge transfer strategy for point cloud registration. IEEE Trans. Evol. Comput. 2022, 1. [Google Scholar] [CrossRef]
  38. Wu, Y.; Zhang, Y.; Ma, W.; Gong, M.; Fan, X.; Zhang, M.; Qin, A.; Miao, Q. Rornet: Partial-to-partial registration network with reliable overlapping representations. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–14. [Google Scholar] [CrossRef] [PubMed]
  39. Li, J.; Gong, M.; Liu, H.; Zhang, Y.; Zhang, M.; Wu, Y. Multiform Ensemble Self-Supervised Learning for Few-Shot Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 4500416. [Google Scholar] [CrossRef]
  40. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  41. Hoeser, T.; Kuenzer, C. Object detection and image segmentation with deep learning on earth observation data: A review-part I: Evolution and recent trends. Remote. Sens. 2020, 12, 1667. [Google Scholar] [CrossRef]
  42. Hoeser, T.; Bachofer, F.; Kuenzer, C. Object detection and image segmentation with deep learning on Earth observation data: A review—Part II: Applications. Remote. Sens. 2020, 12, 3053. [Google Scholar] [CrossRef]
  43. Luo, L.; Li, P.; Yan, X. Deep learning-based building extraction from remote sensing images: A comprehensive review. Energies 2021, 14, 7982. [Google Scholar] [CrossRef]
  44. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  46. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  47. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  48. Gong, M.; Liu, T.; Zhang, M.; Zhang, Q.; Lu, D.; Zheng, H.; Jiang, F. Context-content collaborative network for building extraction from high-resolution imagery. Knowl.-Based Syst. 2023, 263, 110283. [Google Scholar] [CrossRef]
  49. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images. Remote. Sens. 2020, 12, 1050. [Google Scholar] [CrossRef]
  50. Chen, J.; Zhang, D.; Wu, Y.; Chen, Y.; Yan, X. A context feature enhancement network for building extraction from high-resolution remote sensing imagery. Remote. Sens. 2022, 14, 2276. [Google Scholar] [CrossRef]
  51. Ma, J.; Wu, L.; Tang, X.; Liu, F.; Zhang, X.; Jiao, L. Building extraction of aerial images by a global and multi-scale encoder-decoder network. Remote. Sens. 2020, 12, 2350. [Google Scholar] [CrossRef]
  52. Ji, S.; Wei, S.; Lu, M. A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery. Int. J. Remote. Sens. 2019, 40, 3308–3322. [Google Scholar] [CrossRef]
  53. Guo, M.; Liu, H.; Xu, Y.; Huang, Y. Building extraction based on U-Net with an attention block and multiple losses. Remote. Sens. 2020, 12, 1400. [Google Scholar] [CrossRef]
  54. Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 4287–4306. [Google Scholar] [CrossRef]
  55. Deng, W.; Shi, Q.; Li, J. Attention-gate-based encoder–decoder network for automatical building extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 2611–2620. [Google Scholar] [CrossRef]
  56. Yang, H.; Wu, P.; Yao, X.; Wu, Y.; Wang, B.; Xu, Y. Building extraction in very high resolution imagery by dense-attention networks. Remote. Sens. 2018, 10, 1768. [Google Scholar] [CrossRef]
  57. Yuan, W.; Xu, W. MSST-Net: A multi-scale adaptive network for building extraction from remote sensing images based on swin transformer. Remote. Sens. 2021, 13, 4743. [Google Scholar] [CrossRef]
  58. Tian, Q.; Zhao, Y.; Li, Y.; Chen, J.; Chen, X.; Qin, K. Multiscale building extraction with refined attention pyramid networks. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  59. Zhou, D.; Wang, G.; He, G.; Long, T.; Yin, R.; Zhang, Z.; Chen, S.; Luo, B. Robust building extraction for high spatial resolution remote sensing images with self-attention network. Sensors 2020, 20, 7241. [Google Scholar] [CrossRef] [PubMed]
  60. Zheng, H.; Gong, M.; Liu, T.; Jiang, F.; Zhan, T.; Lu, D.; Zhang, M. HFA-Net: High frequency attention siamese network for building change detection in VHR remote sensing images. Pattern Recognit. 2022, 129, 108717. [Google Scholar] [CrossRef]
  61. Zhu, Y.; Liang, Z.; Yan, J.; Chen, G.; Wang, X. ED-Net: Automatic building extraction from high-resolution aerial images with boundary information. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 4595–4606. [Google Scholar] [CrossRef]
  62. Yu, B.; Chen, F.; Wang, N.; Yang, L.; Yang, H.; Wang, L. MSFTrans: A multi-task frequency-spatial learning transformer for building extraction from high spatial resolution remote sensing images. GISci. Remote Sens. 2022, 59, 1978–1996. [Google Scholar] [CrossRef]
  63. Zhao, H.; Zhang, H.; Zheng, X. A multiscale attention-guided UNet++ with edge constraint for building extraction from high spatial resolution imagery. Appl. Sci. 2022, 12, 5960. [Google Scholar] [CrossRef]
  64. Jung, H.; Choi, H.S.; Kang, M. Boundary enhancement semantic segmentation for building extraction from remote sensed image. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  65. Xu, Z.; Xu, C.; Cui, Z.; Zheng, X.; Yang, J. CVNet: Contour Vibration Network for Building Extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1383–1391. [Google Scholar]
  66. Chen, S.; Shi, W.; Zhou, M.; Zhang, M.; Xuan, Z. CGSANet: A Contour-Guided and Local Structure-Aware Encoder–Decoder Network for Accurate Building Extraction From Very High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 15, 1526–1542. [Google Scholar] [CrossRef]
  67. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  68. Hu, Y.; Wang, Z.; Huang, Z.; Liu, Y. PolyBuilding: Polygon transformer for building extraction. ISPRS J. Photogramm. Remote. Sens. 2023, 199, 15–27. [Google Scholar] [CrossRef]
  69. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  70. Feng, D.; Chu, H.; Zheng, L. Frequency Spectrum Intensity Attention Network for Building Detection from High-Resolution Imagery. Remote. Sens. 2022, 14, 5457. [Google Scholar] [CrossRef]
  71. Mnih, V. Machine Learning for Aerial Image Labeling; University of Toronto: Toronto, ON, Canada, 2013. [Google Scholar]
  72. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  73. Chen, F.; Wang, N.; Yu, B.; Wang, L. Res2-Unet, a New Deep Architecture for Building Detection from High Spatial Resolution Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 1494–1501. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed DFANet.
Figure 2. The illustration of the proposed DFAB.
Figure 3. Architecture of the PPM.
Figure 4. Some examples of two benchmark datasets: (a) Massachusetts dataset; (b) East Asia Dataset. The first row in each subplot is the aerial image tile, and the second row is the ground truth.
Figure 5. Some case visualization results of the Massachusetts dataset. Each column from left to right is the image, T2 image, ground truth, U-Net, PANet, SiU-Net, BRRNet, AGPNet, Res2-Unet and our proposed DFANet.
Figure 6. Some case visualization results of the East Asia dataset. Each column from left to right is the image, T2 image, ground truth, U-Net, PANet, SiU-Net, BRRNet, AGPNet, Res2-Unet, and our proposed DFANet.
Figure 7. Visualization results of feature map and heat map of two cases (a,b) in the East Asia dataset. Notation: in each subgraph, the left one is the feature map, and the right one is the heat map.
Table 1. Quantitative results of DFANet and comparative methods on the Massachusetts dataset. The best and second best results are marked in bold and underlined, respectively.
Methods | Precision (%) | Recall (%) | F1-Score (%) | IOU (%)
U-Net [47] (2015) | 88.66 | 72.19 | 79.58 | 66.09
PANet [72] (2018) | 85.05 | 42.02 | 56.25 | 39.13
SiU-Net [15] (2019) | 84.82 | 75.80 | 80.06 | 66.74
BRRNet [49] (2020) | 79.48 | 81.46 | 80.46 | 67.31
AGPNet [55] (2021) | 84.72 | 74.86 | 79.48 | 65.95
Res2-Unet [73] (2022) | 81.04 | 65.65 | 72.64 | 56.91
DFANet (Ours) | 83.42 | 79.71 | 81.52 | 68.81
Table 2. Quantitative results of DFANet and comparative methods of the East Asia dataset. The best and second-best results are marked in bold and underlined, respectively.
Methods | Precision (%) | Recall (%) | F1-Score (%) | IOU (%)
U-Net [47] (2015) | 88.41 | 71.22 | 78.89 | 65.14
PANet [72] (2018) | 86.29 | 66.60 | 75.18 | 60.23
SiU-Net [15] (2019) | 88.29 | 70.85 | 78.62 | 64.77
BRRNet [49] (2020) | 84.06 | 78.02 | 80.93 | 67.97
AGPNet [55] (2021) | 86.37 | 76.59 | 81.19 | 68.34
Res2-Unet [73] (2022) | 84.07 | 69.14 | 75.88 | 61.14
DFANet (Ours) | 81.93 | 80.82 | 81.37 | 68.59
Table 3. Ablation study of the proposed DFANet on Massachusetts dataset. Notation: the best results are marked in bold.
Methods | Precision (%) | Recall (%) | F1-Score (%) | IOU (%)
Backbone | 80.63 | 77.74 | 79.16 | 65.51
Backbone + DFAB | 81.48 | 79.39 | 80.42 | 67.26
Backbone + PPM | 84.20 | 75.90 | 79.84 | 66.44
Backbone + DFAB + PPM (DFANet) | 83.42 | 79.71 | 81.52 | 68.81
Table 4. Ablation study of the proposed DFANet on East Asia dataset. Notation: the best results are marked in bold.
Methods | Precision (%) | Recall (%) | F1-Score (%) | IOU (%)
Backbone | 86.68 | 75.02 | 80.51 | 67.37
Backbone + DFAB | 84.73 | 77.31 | 80.85 | 67.86
Backbone + PPM | 83.83 | 78.93 | 81.30 | 68.50
Backbone + DFAB + PPM (DFANet) | 81.93 | 80.82 | 81.37 | 68.59
