In this section, the proposed method is presented in detail. We first briefly overview the architecture of the proposed BARNet. Then, the proposed gated-attention refined fusion unit, the denser atrous spatial pyramid pooling module, the boundary-aware loss and the training loss are elaborated.
3.1. Model Overview
The BARNet takes VHR aerial images as the input and performs pixel-level building extraction in an end-to-end manner. As illustrated in
Figure 2, the BARNet is a standard encoder-decoder structure composed of three parts: encoder, context aggregation module and decoder. ResNet-101 [
16] is adopted as the backbone network to encode the basic features of buildings. The fully-connected layer and the last global average pooling layer in ResNet-101 are removed. To retain more details in the final feature map, all $3 \times 3$ convolutions in the last stage of ResNet-101 are replaced with dilated convolutions, and the stride of its down-sampling module is set to 1. After the encoder, the DASPP is appended to capture dense global context semantics. Before the low-level features are harvested by the decoder, each low-level feature map from the encoder is reduced to 256 channels with a $1 \times 1$ convolution layer followed by Batch Normalization (BN) and ReLU layers, to lower the computational cost. The reduced low-level feature map and the corresponding high-level feature map from the decoder are fused in the GARFU. Then, the fused features are fed into the decoder block to restore the details. Each decoder block in the BARNet is equipped with two cascaded Conv-BN-ReLU blocks, the same as in U-Net and DeepLab-v3+. At the end of the BARNet, a $1 \times 1$ convolution layer and a softmax layer are applied to output the final predictions. To obtain the same size as the original input, the final predictions are further up-sampled by a factor of 4 using bilinear interpolation.
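To make the decoder-side data flow concrete, a minimal PyTorch-style sketch is given below. The module names (`DecoderBlock`, `reduce_conv`, `PredictionHead`) and the two-class output are illustrative assumptions rather than the authors' released implementation, while the 256-channel reduction, the paired Conv-BN-ReLU blocks, the 1 × 1 classifier with softmax and the 4× bilinear up-sampling follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Two cascaded Conv-BN-ReLU blocks, as used after each GARFU fusion."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

def reduce_conv(in_ch, out_ch=256):
    """1 x 1 reduction applied to each low-level encoder feature map."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class PredictionHead(nn.Module):
    """1 x 1 classifier + softmax, followed by 4x bilinear up-sampling."""
    def __init__(self, in_ch, num_classes=2):
        super().__init__()
        self.classifier = nn.Conv2d(in_ch, num_classes, 1)
    def forward(self, x):
        probs = torch.softmax(self.classifier(x), dim=1)
        return F.interpolate(probs, scale_factor=4,
                             mode="bilinear", align_corners=False)
```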
3.2. Gated-Attention Refined Fusion Unit
A key design of U-Net and FPN is the integration of cross-level features via simple concatenation and addition operations. Given a low-level feature map $F_{l}^{i} \in \mathbb{R}^{C_{i} \times H_{i} \times W_{i}}$ and a high-level feature map $F_{h}^{i} \in \mathbb{R}^{C_{i+1} \times H_{i+1} \times W_{i+1}}$, in which $i$, $C$, $H$ and $W$ denote the level order, number of channels and height and width of the feature map, respectively, the two basic fusion strategies can be formulated as:

$$F_{fuse} = \mathcal{C}\big(F_{l}^{i}, \mathcal{U}(F_{h}^{i})\big)$$

$$F_{fuse} = F_{l}^{i} + \mathcal{U}(F_{h}^{i})$$

where $\mathcal{C}(\cdot)$ denotes the concatenation operation, $\mathcal{U}(\cdot)$ means the up-sampling operation and $F_{fuse}$ represents the fused feature map. It can be observed from the above two equations that all feature maps are fused directly, without considering the contribution of each feature. As described earlier in
Section 2.2, features from different levels are complementary, which is beneficial for building extraction. However, the informative features in a feature map are mixed with a large amount of redundant information [
42]. As a result, the cross-level features should be re-calibrated before combining them for the best exploitation of the beneficial features. To achieve this goal, we developed the GARFU. As presented in
Figure 2, the GARFU is embedded at each short-cut, which only incurs a small extra computational cost.
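For reference, the two baseline fusion strategies formulated above reduce to the following tensor operations; this is only a toy sketch, and the tensor shapes are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# Toy shapes: low-level map (C=256, 64x64), high-level map (C=256, 32x32)
F_low = torch.randn(1, 256, 64, 64)
F_high = torch.randn(1, 256, 32, 32)

# Up-sample the high-level map to the low-level resolution first
F_up = F.interpolate(F_high, size=F_low.shape[-2:],
                     mode="bilinear", align_corners=False)

fused_concat = torch.cat([F_low, F_up], dim=1)  # concatenation fusion (channel-wise)
fused_add = F_low + F_up                        # addition fusion (element-wise)
```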
The GARFU consists of two major components: the channel-attention module and the spatial gate module. As displayed in
Figure 3, the low-level feature map $F_{l}^{i}$ is first re-calibrated by a channel-wise attention vector $v \in \mathbb{R}^{C_{i} \times 1 \times 1}$. Then, the re-calibrated low-level feature map and the up-sampled high-level feature map are further re-weighted by multiplication with a gated map $G$ and $(1 - G)$, respectively, in which $G \in [0, 1]^{1 \times H_{i} \times W_{i}}$. The mechanism of the GARFU can be defined as:

$$F_{fuse} = G \otimes \big(v \odot F_{l}^{i}\big) + (1 - G) \otimes \mathcal{U}\big(F_{h}^{i}\big)$$

where $\odot$ represents the multiplication in the channel axis and $\otimes$ represents the multiplication in the spatial dimension.
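Assuming the channel-attention vector $v$ and the gate map $G$ (whose constructions are described next) are already available, the fusion defined above can be sketched in PyTorch as follows; the pairing of $G$ with the low-level branch and $(1 - G)$ with the high-level branch follows the formulation above.

```python
import torch
import torch.nn.functional as F

def garfu_fuse(F_low, F_high, v, G):
    """GARFU fusion sketch: v (N, C, 1, 1) re-weights the channels of the
    low-level map, while G and (1 - G) (N, 1, H, W) spatially re-weight the
    low- and up-sampled high-level maps before they are summed."""
    F_up = F.interpolate(F_high, size=F_low.shape[-2:],
                         mode="bilinear", align_corners=False)
    F_low_ca = v * F_low                     # multiplication along the channel axis
    return G * F_low_ca + (1.0 - G) * F_up   # multiplication in the spatial dimension
```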
(1) Channel attention: Recent works have demonstrated the effectiveness of modelling the contribution of each channel using the channel attention mechanism [
1,
43]. Therefore, we adopt the channel attention module reported in SENet [
43] to exploit the channel-wise useful features in the low-level feature map, because the low-level features usually carry more noise. The attention vector $v$ is generated from the concatenation of $F_{l}^{i}$ and $F_{h}^{i}$. Before concatenating them, $F_{h}^{i}$ is first up-sampled with bilinear interpolation so that it has the same spatial shape as $F_{l}^{i}$. Then, the concatenated features are passed through a $1 \times 1$ convolution layer to reduce the number of parameters. Let $z = \mathcal{P}(F_{c})$, where $F_{c}$ are the concatenated features and $\mathcal{P}(\cdot)$ is the channel-wise global average pooling operation. The attention vector $v$ is obtained by:

$$v = \sigma\big(W_{2}\,\delta(W_{1} z)\big)$$

in which $W_{1}$ and $W_{2}$ are two linear transformations, $\delta(\cdot)$ denotes the ReLU activation function and $\sigma(\cdot)$ is the sigmoid activation. $v$ is multiplied by $F_{l}^{i}$ to enable the network to learn the salient channels that contribute to distinguishing buildings.
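A minimal sketch of this SE-style channel attention is given below; the reduction ratio of 16 and the use of a 1 × 1 convolution to bring the concatenated features back to the low-level channel count are assumptions, since those hyper-parameters are not restated here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """SENet-style channel attention computed from the concatenated
    low-/high-level features (sketch)."""
    def __init__(self, concat_ch, low_ch, reduction=16):
        super().__init__()
        self.reduce = nn.Conv2d(concat_ch, low_ch, 1, bias=False)  # 1x1 conv, fewer parameters
        self.fc1 = nn.Linear(low_ch, low_ch // reduction)          # first linear transform W1
        self.fc2 = nn.Linear(low_ch // reduction, low_ch)          # second linear transform W2

    def forward(self, F_low, F_high):
        F_up = F.interpolate(F_high, size=F_low.shape[-2:],
                             mode="bilinear", align_corners=False)
        F_c = self.reduce(torch.cat([F_low, F_up], dim=1))
        z = F.adaptive_avg_pool2d(F_c, 1).flatten(1)      # channel-wise global average pooling
        v = torch.sigmoid(self.fc2(F.relu(self.fc1(z))))  # v = sigmoid(W2 * ReLU(W1 * z))
        return v.unsqueeze(-1).unsqueeze(-1)              # (N, C, 1, 1) for channel-axis multiply
```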
(2) Gate: Many studies [
26,
27,
30] suggested that introducing low-level information can improve the accuracy of predictions on boundaries and details, whereas the lack of global semantics may cause confusion in other regions due to the limited local receptive field. On the other hand, there exists a semantic gap between low- and high-level features, i.e., not all features benefit building extraction. With this motivation, we adopt the gate mechanism to generate a gate map $G$, which serves as a guide to enhance the informative regions and suppress the useless regions in both the low- and high-level features. Gates are widely used in deep neural networks to control the information propagation [
44]. For example, the gating units in the LSTM network are typical gates [45]. In this work, the gate $G$ is generated by:

$$G = \gamma \cdot \sigma\big(W \ast F_{c}\big)$$

where $W$ is a linear transformation parameterized by a convolution layer, $\sigma(\cdot)$ denotes the sigmoid activation that normalizes the values into $[0, 1]$ and $\gamma$ is a trainable scale factor that prevents the gate from collapsing to a minimum during the initial stage of training. The gate map is learned under the supervision of the ground-truth during training, and the value of each pixel in it measures the degree of importance of that pixel. The feature at position $(x, y)$ is highlighted when the value of $G(x, y)$ is large, and vice versa. In this manner, the useless information is suppressed, and only useful features are harvested by the following decoder block, thus achieving better cross-level feature fusion. Different from the self-attention mechanism, the gate is learned with the explicit supervision of the ground-truth.
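A possible sketch of the gate is shown below; feeding the concatenated cross-level features into the gate and implementing the linear transform $W$ as a 1 × 1 convolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGate(nn.Module):
    """Gate map G in [0, 1]: a linear transform (assumed here to be a 1x1 conv),
    a sigmoid, and a trainable scale factor gamma (sketch)."""
    def __init__(self, concat_ch):
        super().__init__()
        self.linear = nn.Conv2d(concat_ch, 1, kernel_size=1)  # W: linear transform
        self.gamma = nn.Parameter(torch.ones(1))               # trainable scale factor

    def forward(self, F_low, F_high):
        F_up = F.interpolate(F_high, size=F_low.shape[-2:],
                             mode="bilinear", align_corners=False)
        F_c = torch.cat([F_low, F_up], dim=1)
        G = self.gamma * torch.sigmoid(self.linear(F_c))
        return G                                               # (N, 1, H, W)
```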
3.3. Denser Atrous Spatial Pyramid Pooling
As is well known, buildings in complex urban scenes vary greatly in scale, which results in non-unified extraction scales. Thus, an ideal context modelling unit should capture multi-scale features as densely as possible. To achieve this goal, a new ASPP module is developed. Since it is inspired by ASPP and its main idea is to capture a denser pyramid of image features, we name it denser atrous spatial pyramid pooling. As illustrated in
Figure 4, the DASPP module consists of a skip connection, a Cascaded Atrous Spatial Pyramid Block (CASPB) and a global context aggregation block. The skip pathway is composed of only a simple $1 \times 1$ convolution layer, aiming at reusing the high-level features and accelerating network convergence [16]. In the CASPB, we cascade multiple hybrid dilated $3 \times 3$ depthwise separable convolution [46] layers with different dilation rates and connect them with dense connections. Here, the depthwise separable convolution is utilized to reduce the parameters of DASPP, and its negative effect is almost negligible. The CASPB can be formulated as $L_{i} = H_{d_{i}}\big(\mathcal{C}(L_{0}, L_{1}, \ldots, L_{i-1})\big)$, where $d_{i}$ represents the dilation rate of the $i$th layer $L_{i}$, $H_{d_{i}}(\cdot)$ means the convolution operation and $\mathcal{C}(\cdot)$ denotes the concatenation operation. In this study, six dilated layers with progressively increasing dilation rates are cascaded. Compared to ASPP, this change brings two main benefits: a denser feature pyramid and a larger receptive field. The sequence of receptive field sizes in the original ASPP is 13, 25 and 37 when the output stride of the encoder is 16. In contrast, the maximum receptive field of each layer in the CASPB is 3, 8, 19, 36, 55 and 78, which is denser and larger than that of ASPP. This means the CASPB is more robust to building scale variations. In addition, the position-attention module of DANet [
36] is also introduced to replace the image pooling branch in ASPP to generate a denser pixel-wise global context representation. Unlike the global average pooling used in PPM and ASPP, the self-attention can generate a global representation and capture the long-range dependencies between pixels. As presented in
Figure 4, the position-attention module re-weights each pixel according to the degree of correlation between any two pixels.
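The cascaded, densely connected dilated layers of the CASPB can be sketched as follows; the dilation rates, channel widths and the BN-ReLU placement are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class SepConvBNReLU(nn.Module):
    """3x3 depthwise separable convolution with a given dilation rate."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation,
                      groups=in_ch, bias=False),         # depthwise, dilated
            nn.Conv2d(in_ch, out_ch, 1, bias=False),      # pointwise
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class CASPB(nn.Module):
    """Cascaded atrous layers with dense connections: every layer takes the
    concatenation of the block input and all previous layer outputs."""
    def __init__(self, in_ch, growth_ch, dilations):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.layers.append(SepConvBNReLU(ch, growth_ch, d))
            ch += growth_ch
    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

# Illustrative dilation rates only; the paper's exact setting is not restated here.
caspb = CASPB(in_ch=2048, growth_ch=256, dilations=[1, 2, 4, 8, 12, 18])
```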
3.4. Boundary-Aware Loss
Although the re-calibrated low-level features contribute to refining the segmentation results [
30], this is still not sufficient to locate accurate building boundaries. As mentioned earlier, the commonly used per-pixel cross-entropy loss treats each pixel equally. In fact, depicting boundaries is more challenging than locating semantic bodies because of the inevitable degradation of spatial details. Consequently, an individual loss should be applied to explicitly force the model to pay more attention to boundary pixels. The key here is how to decouple the building edges from the final predicted maps. Once the corresponding boundary maps are obtained, the binary cross-entropy loss can be used to reinforce the boundary prediction. Herein, the Laplacian operator, defined by Equation (
8), is applied both to the final prediction maps and to the ground-truths to produce the boundary predictions and the corresponding boundary labels:

$$\nabla^{2} f = \frac{\partial^{2} f}{\partial x^{2}} + \frac{\partial^{2} f}{\partial y^{2}} \qquad (8)$$

where $f$ is a 2D grey-scale image and $x$, $y$ are the two coordinate directions of $f$. The output of Equation (8) is termed the gradient information map, in which a higher value indicates a higher probability that a pixel lies on the boundary, and vice versa. We extend the Laplacian operator to process multidimensional tensors. An instance of using the Laplacian operator to obtain the building boundary is given in
Figure 5. With the yielded boundary maps $P_{b} \in \mathbb{R}^{N \times H \times W}$ and boundary labels $Y_{b} \in \mathbb{R}^{N \times H \times W}$, where $N$, $H$ and $W$ are the batch size, image height and image width, respectively, the boundary refinement is defined as a cost minimization problem expressed by:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_{be}\big(P_{b}^{(n)}, Y_{b}^{(n)}; \theta\big)$$
where $\theta$ denotes the trainable parameters of the BARNet. For every single image, the weighted binary cross-entropy loss [26] is employed to compute the boundary-enhanced loss $\mathcal{L}_{be}$ as:

$$\mathcal{L}_{be} = -\frac{N_{nb}}{N_{b} + N_{nb}} \sum_{j \in Y_{b}^{+}} \log p_{j} - \frac{N_{b}}{N_{b} + N_{nb}} \sum_{j \in Y_{b}^{-}} \log\big(1 - p_{j}\big)$$

where $N_{b}$ and $N_{nb}$ represent the number of pixels in boundary and non-boundary regions, respectively, $Y_{b}^{+}$ and $Y_{b}^{-}$ are the sets of boundary and non-boundary pixels in the boundary label, and $p_{j}$ is the predicted boundary probability at pixel $j$.
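The boundary extraction and the boundary-enhanced loss can be sketched as below; the 3 × 3 kernel is one common discretization of the Laplacian in Equation (8), and applying it as a fixed (non-trainable) convolution is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

# A common 3x3 discretization of the Laplacian operator, applied as a fixed
# convolution so that it extends to batched tensors.
LAPLACIAN_KERNEL = torch.tensor([[0., 1., 0.],
                                 [1., -4., 1.],
                                 [0., 1., 0.]]).view(1, 1, 3, 3)

def boundary_map(x):
    """x: (N, 1, H, W) probability or label map -> gradient information map."""
    edges = F.conv2d(x, LAPLACIAN_KERNEL.to(x.device, x.dtype), padding=1)
    return edges.abs().clamp(0, 1)

def boundary_enhanced_loss(pred_boundary, gt_boundary, eps=1e-6):
    """Weighted binary cross-entropy over boundary (positive) and
    non-boundary (negative) pixels, following the weights defined above."""
    pos = gt_boundary > 0.5
    n_b = pos.sum().clamp(min=1).float()       # number of boundary pixels
    n_nb = (~pos).sum().clamp(min=1).float()   # number of non-boundary pixels
    w_pos = n_nb / (n_b + n_nb)
    w_neg = n_b / (n_b + n_nb)
    p = pred_boundary.clamp(eps, 1 - eps)
    return -(w_pos * torch.log(p[pos]).sum()
             + w_neg * torch.log(1 - p[~pos]).sum())
```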
The final Boundary-Aware (BA) loss $\mathcal{L}_{BA}$ is defined as the sum of the two losses, i.e.,

$$\mathcal{L}_{BA} = \mathcal{L}_{wce}(P, Y; w) + \lambda\,\mathcal{L}_{be}(P_{b}, Y_{b})$$

in which $\lambda$ is empirically set to 1 to balance the contribution of the boundary-enhanced loss, $w$ is the class weight calculated using the median frequency balance strategy [1], $\mathcal{L}_{wce}(\cdot)$ denotes the class-weighted per-pixel cross-entropy loss, and $P$ and $Y$ denote the model predictions and corresponding labels, respectively.
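Putting the pieces together, a sketch of the BA loss is given below; it reuses `boundary_map` and `boundary_enhanced_loss` from the previous sketch, and computing the median-frequency class weights per batch (rather than over the whole training set) and treating class index 1 as the building class are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def median_frequency_weights(labels, num_classes=2):
    """Class weights via median frequency balancing: w_c = median(freq) / freq_c."""
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    freq = counts / counts.sum()
    return freq.median() / freq.clamp(min=1e-12)

def boundary_aware_loss(logits, labels, lam=1.0):
    """L_BA = class-weighted cross-entropy + lam * boundary-enhanced loss (sketch)."""
    w = median_frequency_weights(labels, num_classes=logits.shape[1])
    l_wce = F.cross_entropy(logits, labels, weight=w)
    probs = torch.softmax(logits, dim=1)[:, 1:2]        # assumed building-class probability
    pred_b = boundary_map(probs)                         # boundary prediction
    gt_b = boundary_map(labels.unsqueeze(1).float())     # boundary label
    return l_wce + lam * boundary_enhanced_loss(pred_b, gt_b)
```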