1. Introduction
Building extraction (BE) and change detection (CD) are important yet very challenging topics in the field of earth observation. As a geospatial dense prediction task, BE is crucial for mapping, monitoring, urban management, and 3-D reconstruction [1]. Owing to the complex background and mixed-pixel problems in remote sensing (RS) imagery, BE suffers from inaccurate classification and ambiguous boundaries, and it therefore remains difficult to satisfy practical requirements [2]. Compared with BE, building CD is an even more challenging task, as it focuses on accurately predicting building changes between images acquired at different times, rather than simply predicting the buildings in a single image [3]. Building CD is important for applications such as urban area development and disaster management [4].
Owing to its popularity and the advantage of its end-to-end characteristics, deep learning (DL) has been widely applied in computer vision [5,6]; in particular, feature-learning-based semantic segmentation methods for both the BE and CD tasks have been widely studied. Building feature representation and pixel classification are procedures common to both tasks, and improving the feature representation and fusion ability of a model is usually required to improve its performance [7,8,9].
For BE tasks, early convolutional neural network (CNN)-based models, such as the fully convolutional network (FCN) [10], ResNets [11], and UNet [12], can provide promising results by extracting rich semantic features. Incorporating more powerful backbones [13,14,15], sophisticated processing strategies [16,17], or semi-supervised learning [18] has further improved model performance on the BE task. However, in CNN-based methods, the down-sampling operations lose the spatial details of high-resolution images, resulting in blurry edges for the extracted buildings. Some methods attempt to add edge information to building extraction [19,20], which can boost the ability of a model to perceive edges. In addition, ref. [21] designs a coarse-to-fine boundary refinement network that gradually improves the delineation of building edges and suppresses irrelevant noise in low-level features under the guidance of high-level semantics.
Similar to the BE task, which takes one image as input and outputs a mask layer, the building CD task is also typically regarded as a pixel-level prediction problem. The classification and detection sub-tasks of a CD problem are generally handled together, as DL allows for end-to-end CD and avoids the accumulation of errors inherent in multiple-step approaches such as post-classification CD. In summary, the building CD task can be performed by fusing multiple RS images to output building changes. There are roughly three types of fusion strategies: early, late, and hybrid, depending on how the paired images are dealt with [3]. Early fusion approaches [22] concatenate the multi-temporal images into one input to a network (single-feature structure), whereas late fusion approaches [23,24] separately learn mono-temporal features and later combine them as input to the CD network (dual-feature structure) [4]. Hybrid fusion combines early and late fusion: concatenation is carried out both for the input multi-temporal images and for the learned respective features. A sketch contrasting the first two strategies is given below.
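To make the distinction concrete, the following PyTorch-style sketch (our own simplification, not code from the cited works; the encoder and decoder are placeholder modules) contrasts early and late fusion; hybrid fusion would apply both concatenations:

```python
# Hedged sketch of the fusion strategies described above; `encoder` and
# `decoder` are placeholder modules, not components from the cited works.
import torch
import torch.nn as nn

class EarlyFusionCD(nn.Module):
    """Single-feature structure: concatenate the bi-temporal images first."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

    def forward(self, img_t1, img_t2):
        x = torch.cat([img_t1, img_t2], dim=1)  # stack along the channel axis
        return self.decoder(self.encoder(x))

class LateFusionCD(nn.Module):
    """Dual-feature structure: learn mono-temporal features, then combine."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder  # encoder weights shared

    def forward(self, img_t1, img_t2):
        f1, f2 = self.encoder(img_t1), self.encoder(img_t2)
        return self.decoder(torch.cat([f1, f2], dim=1))
```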
The early CNNs employed for CD tasks were fully convolutional Siamese networks and their subsequent variants [25]. This type of architecture features two encoding branches for feature extraction from the paired input images and one decoding branch to detect changes from the feature differences; a weight-shared encoder makes it easier to detect changes. Additionally, ref. [26] introduces a global co-attention mechanism and designs an attention-guided Siamese network that is based on pyramid features and focuses on the correlation among input feature pairs. Ref. [27] combines a Siamese network with UNet and proposes a large-scale semantic change detection (SCD) network comprising two encoders and two decoders with shared model parameters. Similarly, ref. [28] utilizes a UNet-based Siamese network whose encoder separately learns representations from the bi-temporal inputs and performs a difference connection to improve the generated difference maps. Furthermore, ref. [29] proposes a multi-task constrained deep Siamese convolutional network containing three sub-networks, a CD network and two dense label prediction networks, which improves CD accuracy. In addition, generative adversarial networks (GANs) [30,31] and recurrent neural networks (RNNs) [23,24] have been studied for CD tasks. For example, ref. [24] proposes a new, general deep Siamese convolutional multi-layer RNN for CD from multi-temporal, very high-resolution (VHR) imagery by integrating the merits of CNNs and RNNs.
The impact of learning strategies on network performance has been demonstrated for many kinds of vision tasks, and various types of learning strategies, such as the attention mechanism, multi-scale feature fusion, and transfer learning, have been proposed [32]. Building on one of the most common strategies, the attention mechanism [33], ref. [34] introduces two kinds of general attention mechanisms in the position and channel dimensions and designs a local-global pyramid network. Notably, ref. [35] proposes a novel self-attention mechanism for spatial-temporal relationship modeling, and ref. [36] proposes a super-resolution-based CD network with stacked attention modules. In addition, ref. [37] proposes a multi-scale supervised fusion network (MSF-NET) based on the attention mechanism. Furthermore, transfer learning employs knowledge from other data sources by fine-tuning models pre-trained on related tasks to address the problem of limited annotations. For instance, ref. [38] proposes a transfer-learning-based CD method using recurrent FCNs with multi-scale 3D filters, and ref. [39] proposes a CNN-based CD method with a novel loss function to obtain transferable model performance across various datasets.
In contrast to the classical change detection of bi-temporal remote sensing data discussed above, ref. [40] proposes single-temporal supervised learning (STAR) for CD, which utilizes land cover and land use changes in unpaired images as a kind of supervisory information. Recently, ref. [41] proposes a graph-based segmentation algorithm for multivariate time series (MTS-GS) that analyzes changes in a multivariate time series by considering all variables as a whole, rather than treating the series variable by variable as classical change detection methods do. Ref. [42] is another interesting CD work, which proposes a feature decomposition-optimization-recombination network based on the decoupling idea from semantic segmentation. In addition, edge refinement in building change detection has also been a popular research direction in recent years. Ref. [43] proposes an end-to-end building change detection framework that integrates discriminative information and edge structure priors into a DL framework to improve building change detection results and, in particular, to generate better edges. It is also worth noting that, while BE and CD have seen considerable development over the past few decades, the updating of building databases has not been fully studied. To automatically update building footprints with minimal manual labeling of the current building ground truths, ref. [44] proposes a saliency-guided edge-preservation network that maintains accurate building boundaries; it updates the existing building database to generate up-to-date building footprints, which is crucial for the vectorization of building contours.
Although several change detectors have been proposed, these methods suffer from problems such as insufficient training data, which leads to model overfitting and severely limits the application of trained models. A potential solution to this problem is multi-task learning (MTL), as it can introduce more related supervising signals during network training. Although BE and CD are two highly correlated topics in the RS field, most previous studies have treated them separately. In contrast, we aim to solve these two tasks simultaneously within an MTL framework, the potential of which was demonstrated in [45]. In the MTL setting, each task influences the other, and we assume that the BE task positively promotes the building CD task.
In this study, we propose an MTL framework to simultaneously extract buildings and detect building changes from bi-temporal remote sensing images by taking advantage of advanced networks, including the Swin transformer [13] and Segformer [46]. The contributions of this study are as follows:
We propose an MTL framework based on an advanced transformer-based backbone and lightweight BE and CD heads.
We provide benchmark results on open datasets to validate the design details of our proposed solution, including the backbone choice and pre-training strategies.
We achieved a score of 81.8214 in the “BE and CD in optical images” subject of the 2021 Gaofen challenge, only a few tenths of a point behind first place.
The remainder of this manuscript is organized as follows. Section 2 presents the model choice, the employed Swin-L (the large version of Swin) backbone, and the BE and CD network heads. Section 3 describes the utilized datasets and the experimental setup for testing model performance in detail. Section 4 first tests the BE and CD accuracies of different models under single-task learning to select an optimized MTL setup, then presents and compares the BE and CD results of MTL based on the selected setups, and also reports the CD results from the challenge. Section 5 discusses and analyzes the benefits of MTL and their possible reasons, the validity of the pre-training weights, and the generalization ability of the proposed MTL approach. Finally, Section 6 provides the summary and conclusions of the study.
2. Materials and Methods
An MTL architecture with a shared parsing network and a separate output for each task is proposed; modules of a Swin transformer [13] and lightweight heads are combined as illustrated in Figure 1. The Swin transformer serves as the backbone that learns multi-scale features from the inputs, followed by three branches for the different tasks. For tasks based on pixel-level semantic information, multi-level feature maps are fused to better predict the semantic label of every pixel. An all-multilayer-perceptron (all-MLP) head is used to predict building labels from each of the bi-temporal inputs, and a lightweight convolution-based head is used for CD.
The feature extraction part of our proposed MTL model is a Siamese network built on two backbones with shared weights. The backbone architecture can be selected arbitrarily, for example, a ResNet. The backbone processes the two input patches separately and outputs the learned features in a high-dimensional feature space. The shared weights help enhance the feature similarity between unchanged areas and reduce the feature similarity in changed areas. Subsequently, these embedded features from the backbone are used for the respective BE and CD tasks via different heads. The heads correspond to the decoder part of a common semantic segmentation network; choices include UNet-like, FCN-like, and more advanced attention-based approaches.
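A minimal sketch of this shared-weight, multi-head layout, assuming a backbone that returns a list of multi-scale feature maps (module and variable names are ours, not from a released implementation):

```python
import torch
import torch.nn as nn

class MTLBuildingCD(nn.Module):
    """Sketch of the MTL layout: one shared backbone and three task branches."""
    def __init__(self, backbone: nn.Module, be_head: nn.Module, cd_head: nn.Module):
        super().__init__()
        self.backbone = backbone  # applied to both dates -> shared weights
        self.be_head = be_head    # all-MLP head, reused for each date
        self.cd_head = cd_head    # lightweight convolutional CD head

    def forward(self, img_t1, img_t2):
        feats_t1 = self.backbone(img_t1)   # list of multi-scale feature maps
        feats_t2 = self.backbone(img_t2)   # same weights applied to both dates
        build_t1 = self.be_head(feats_t1)  # BE branch for date 1
        build_t2 = self.be_head(feats_t2)  # BE branch for date 2
        # CD branch: concatenate the bi-temporal features channel-wise; which
        # feature level to fuse is our assumption (finest scale here).
        change = self.cd_head(torch.cat([feats_t1[0], feats_t2[0]], dim=1))
        return build_t1, build_t2, change
```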
When the losses of different tasks are summed for optimization, weighting factors are usually required, and manually tuning these weights as hyper-parameters is tedious. To balance the BE and CD tasks, we instead learned the weights of the different tasks when combining their respective losses. We implemented the task weighting on the basis of homoscedastic uncertainty, which was first introduced in [47]. The multi-task loss function used in our work can be expressed as:
$$\mathcal{L}_{\mathrm{total}}(W, s_1, s_2, s_3) = \sum_{t=1}^{3} \left( \frac{1}{2\sigma_t^2}\,\mathcal{L}_t(W) + \log \sigma_t \right),$$
where $t \in \{1, 2, 3\}$ indexes the three respective tasks; $\mathcal{L}_t$ is the binary cross-entropy loss for each task; $s_t$ is the trainable parameter corresponding to each task; and $\sigma_t$ is a weighting parameter that affects the contribution of the individual task. The regularization term $\log \sigma_t$ avoids the trivial solution in which the effective task weights $1/(2\sigma_t^2)$ become extremely small. For numerical stability during optimization, we trained the weighting terms $s_t = \log \sigma_t^2$ along with the network parameters $W$.
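A minimal PyTorch sketch of this uncertainty-based weighting (our own illustration of the formulation from [47]; the class and variable names are ours):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses with learned homoscedastic uncertainty.

    Each task t owns a trainable s_t = log(sigma_t^2); the combined loss
    0.5 * exp(-s_t) * L_t + 0.5 * s_t is a numerically stable form of
    L_t / (2 * sigma_t^2) + log(sigma_t).
    """
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # s_t, init to 0

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for loss, s in zip(task_losses, self.log_vars):
            total = total + 0.5 * torch.exp(-s) * loss + 0.5 * s
        return total

# Usage: total = criterion([bce(build_t1, gt1), bce(build_t2, gt2), bce(cd, gt_cd)])
```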
2.1. Backbones for Representation Learning
We employed the large version of Swin (vision transformer using shifted windows, Swin-L) as our feature extractor; this choice offers flexible multi-scale feature modeling from the hierarchical structure and long-range dependency encoding from the prevalent transformer architecture. As listed in Table 1, the Swin-L backbone primarily comprises four types of module, applied in order: a patch partition that reduces the spatial size of the input patch, a linear embedding that increases the number of feature channels, several Swin transformer blocks, and patch merging. The details of Swin-L can be found in [13,15]. We used the features from all four stages, at different scales, to address large variations in building size.
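For orientation, the sketch below computes the shapes of the four-stage feature pyramid that a Swin-L backbone produces for a square input patch; the stride and channel progression follows the published Swin-L configuration (patch size 4, embedding dimension 192, channels doubling at each patch-merging step), while the 512-pixel input size is only an example:

```python
# Illustrative sketch (not the authors' code): the multi-scale feature pyramid
# produced by a Swin-L backbone for a square input patch.

def swin_l_feature_shapes(input_size: int = 512):
    """Return (stride, channels, spatial size) for the four Swin-L stages."""
    embed_dim = 192  # Swin-L linear embedding width
    shapes = []
    for stage in range(4):
        stride = 4 * 2 ** stage            # strides 4, 8, 16, 32
        channels = embed_dim * 2 ** stage  # channels 192, 384, 768, 1536
        shapes.append((stride, channels, input_size // stride))
    return shapes

if __name__ == "__main__":
    for stride, c, s in swin_l_feature_shapes(512):
        print(f"stride {stride:2d}: {c:4d} channels, {s}x{s} feature map")
```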
2.2. Network Heads for Building Extraction and Change Detection
BE head. The adapted all-MLP head is illustrated in Figure 2 and consists of three linear layers and one upsampling layer. This head takes the multi-scale representations learned by the backbone, that is, the Swin transformer, as inputs and outputs a segmentation mask of the same size as the input patch. Specifically, the first MLP layer combines the multi-level features from the backbone into features of the same size through resize and concatenation operations. These combined features are subsequently fused by the second layer before the third predicts the buildings, and a final upsampling operation recovers the resolution. The output of each branch is used to calculate the binary cross-entropy loss against the input ground truth of building labels.
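A hedged sketch of such an all-MLP head is given below; the 1×1 convolutions act as per-pixel linear layers, the input channel widths match the Swin-L stages, and the embedding width of 256 is our assumption rather than a reported setting:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPHead(nn.Module):
    """Sketch of the adapted all-MLP BE head (channel numbers are examples)."""
    def __init__(self, in_channels=(192, 384, 768, 1536), embed_dim=256):
        super().__init__()
        # First layer: project every pyramid level to a common width
        # (a 1x1 convolution acts as a per-pixel linear layer).
        self.project = nn.ModuleList(
            nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels)
        # Second layer: fuse the concatenated multi-level features.
        self.fuse = nn.Conv2d(embed_dim * len(in_channels), embed_dim, 1)
        # Third layer: per-pixel building prediction (one logit channel).
        self.predict = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, feats):  # feats ordered fine -> coarse
        target = feats[0].shape[-2:]  # resize all levels to the finest scale
        x = torch.cat([F.interpolate(p(f), size=target, mode="bilinear",
                                     align_corners=False)
                       for p, f in zip(self.project, feats)], dim=1)
        logits = self.predict(self.fuse(x))
        # Final upsampling recovers the input-patch resolution (stride 4 here).
        return F.interpolate(logits, scale_factor=4, mode="bilinear",
                             align_corners=False)
```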
CD head. The employed CD head consists of four Convolution-BatchNorm-ReLU combinations that progressively decrease the number of feature channels, a final convolution layer that predicts the change mask, and a final upsampling layer that recovers the resolution. The input to the CD head is a concatenation of the output features from the two multi-temporal inputs, and the output is a one-channel mask layer indicating the changed pixels. During training, the prediction output is used to calculate the binary cross-entropy loss against the input ground truth. For inference, a sigmoid activation is applied to the prediction output, and the final pixel-level change map is obtained given a threshold, which we set to 0.5 in our study.
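The following sketch mirrors this description (the channel widths and the stride-4, 384-channel fused input are our assumptions, not reported settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch):
    """One Convolution-BatchNorm-ReLU unit used by the CD head."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class CDHead(nn.Module):
    """Sketch of the CD head; channel widths are illustrative assumptions."""
    def __init__(self, in_channels=384, scale=4):
        super().__init__()
        self.scale = scale  # upsampling factor to recover input resolution
        widths = [in_channels, 256, 128, 64, 32]
        # Four Conv-BN-ReLU blocks progressively shrink the channel dimension.
        self.blocks = nn.Sequential(*[conv_bn_relu(a, b)
                                      for a, b in zip(widths[:-1], widths[1:])])
        self.predict = nn.Conv2d(widths[-1], 1, kernel_size=1)  # change logits

    def forward(self, fused_feats):  # concatenated bi-temporal features
        logits = self.predict(self.blocks(fused_feats))
        return F.interpolate(logits, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)

# At inference time, a sigmoid plus the 0.5 threshold yields the binary map:
#   change_map = torch.sigmoid(logits) > 0.5
```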