Article

Prior Semantic Information Guided Change Detection Method for Bi-temporal High-Resolution Remote Sensing Images

1 Faculty of Artificial Intelligence in Education, Central China Normal University, 152 Luoyu Road, Wuhan 430079, China
2 College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
3 School of Remote Sensing and Information Engineering, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
4 Institute of Artificial Intelligence in Geomatics, Wuhan University, 129 Luoyu Road, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(6), 1655; https://doi.org/10.3390/rs15061655
Submission received: 29 January 2023 / Revised: 11 March 2023 / Accepted: 17 March 2023 / Published: 18 March 2023

Abstract
High-resolution remote sensing image change detection technology compares and analyzes bi-temporal or multitemporal high-resolution remote sensing images to determine the changed areas. It plays an important role in land cover/use monitoring, natural disaster monitoring, illegal building investigation, military target strike effect analysis, and land and resource investigation. The change detection of high-resolution remote sensing images has developed rapidly, from data accumulation to algorithm models, because of the rapid development of technologies such as deep learning and earth observation in recent years. However, current deep learning-based change detection methods depend strongly on large sample datasets, and the trained models have insufficient cross-domain generalization ability. As a result, a prior semantic information-guided change detection framework (PSI-CD), which alleviates the change detection model’s dependence on datasets by making full use of prior semantic information, is proposed in this paper. The proposed method mainly includes two parts: one is a prior semantic information generation network that uses a semantic segmentation dataset to extract robust and reliable prior semantic information; the other is a prior semantic information-guided change detection network that makes full use of the prior semantic information to reduce the sample size required for change detection. To verify the effectiveness of the proposed method, we produced pixel-level semantic labels for the bi-temporal images of the public change detection dataset LEVIR-CD. Then, we performed extensive experiments on the WHU and LEVIR-CD datasets, including comparisons with existing methods, experiments with different amounts of data, and an ablation study, to show the effectiveness of the proposed method. Compared with other existing methods, our method achieves the highest IoU on WHU and LEVIR-CD across all training sample sizes, reaching maximums of 83.25% and 83.80%, respectively.

1. Introduction

Remote sensing image change detection is an important component of geographic and national conditions monitoring. It is of great significance to urban dynamic monitoring, geographic information updating, natural disaster monitoring, investigation and punishment of illegal buildings, military target strike effect analysis, and land and resource investigation. High-resolution remote sensing image change detection has developed rapidly, from data accumulation to algorithm models, because of the rapid development of deep learning and earth observation technologies in recent years. However, a gap remains between research results and commercial applications.
Traditional change detection algorithms mainly include layer arithmetic, post-classification change, direct classification, transformation, change vector analysis (CVA), and hybrid change detection; more details can be found in [1] and are not repeated in this work. In recent years, with the continuous development of deep learning technology, change detection based on deep learning algorithms has also developed rapidly. Deep learning-based change detection methods can be classified into four categories: supervised, semi-supervised, weakly supervised, and unsupervised. Supervised deep learning-based change detection has become the mainstream in remote sensing image change detection because of its simple implementation and stable, reliable results. However, such methods often require a large number of high-quality labeled samples, whereas labeling change detection datasets is labor-intensive and time-consuming. In addition, when the amount of training data decreases, the change detection performance drops sharply, especially when the training dataset is only 10% of the original dataset or less. On the other hand, the volume of existing semantic segmentation datasets is much larger than that of current change detection datasets: change detection datasets are much more difficult to produce than single-phase semantic segmentation datasets, because their production usually requires semantic segmentation labels of the bi-temporal images, followed by comparative analysis and manual inspection to obtain the change detection labels. How to make full use of existing geographic information (e.g., semantic segmentation datasets) to effectively improve the accuracy and reliability of change detection with limited samples has therefore been a research hotspot in recent years. With this in mind, a prior semantic information-guided change detection framework is proposed in this paper. In this framework, existing semantic segmentation datasets are first used to obtain prior semantic information, and a new change detection network guided by this prior semantic information is then designed to effectively reduce the annotation workload of change detection.
The main contributions of this paper are presented as follows.
  • On the basis of the existing public change detection dataset (LEVIR-CD), we annotate a set of semantic segmentation labels corresponding to the bi-temporal images, which enables the dataset to be used for both semantic segmentation and change detection.
  • A new change detection framework that makes full use of the prior semantic information generated by semantic segmentation and greatly reduces the dependence on the sample size of change detection datasets is proposed. The proposed prior semantic information-guided change detection framework (PSI-CD) achieves state-of-the-art performance on the WHU and LEVIR-CD datasets.
  • A new change analysis module (CAM) is designed, which adopts different processing methods for the features of different layers and effectively combines multi-scale prior semantic features and attention mechanisms. An ablation study shows that this module can effectively improve the difference representation ability of the model.
The remainder of this paper is arranged as follows. Section 2 provides an overall review of related works. Section 3 describes the proposed method. Section 4 presents the experiments, including the comparison of different methods, the comparison of different sample data volumes, and the ablation study. Section 5 concludes this paper.

2. Related Works

In recent years, with the continuous maturation of deep learning technology, high-resolution remote sensing image change detection has developed rapidly, and its accuracy and reliability now far exceed those of traditional algorithms. According to the degree to which they depend on sample data, existing change detection algorithms can be divided into four categories, namely, supervised, semi-supervised, weakly supervised, and unsupervised change detection.
In terms of supervised change detection, the earliest methods of this kind are the three convolutional neural network (CNN) structures proposed by Zagoruyko and Komodakis [2] to calculate the similarity of bi-temporal image patches. Subsequently, CNNs were used to extract features, and the final change map was then obtained through distance calculation on the bi-temporal features [3], simple threshold segmentation [4], or discrimination learning [5]. Furthermore, to overcome the limitations of existing algorithms, skip connections were used between the encoder and the decoder to replace the simple upsampling operation, forming the now most popular end-to-end fully convolutional neural networks, such as FCN [6,7], Unet and its variants [8,9,10,11,12,13], and UNet++ [14,15]. Regarding attention modules, Shi et al. [16] proposed a deeply supervised attention metric-based network (DSAMNet) in which convolutional block attention modules (CBAM) were integrated to provide more discriminative features. Zhang et al. [17] proposed a dual correlation attention-guided detector (DCA-Det) to detect changed objects. Chen et al. [18] presented a densely connected feature fusion module to fuse semantic segmentation features into the change detection decoder. Zhu et al. [19] added a global hierarchical (G-H) sampling mechanism to the network to solve the imbalance problem of insufficient samples. In addition, given the great success of transformers in image classification, semantic segmentation, and object detection, Chen et al. [20] proposed a bi-temporal image transformer (BIT) to model contexts within the spatial–temporal domain and outperformed a purely convolutional baseline. Zheng et al. [21] proposed a transformer module to learn the difference representation from the bi-temporal semantic representation. Chen et al. [22] focused on boundary accuracy and the integrity of the change area; they introduced a transformer-based edge-guided encoder to achieve long-range context modeling and a feature difference enhancement module to learn the differences between bi-temporal features. Beyond change detection alone, Liu et al. [23], Yang et al. [24], Tian et al. [25], and Shen et al. [26] designed semantic change detection networks that simultaneously implement semantic segmentation and change detection. With a large number of labeled samples, this type of supervised learning usually achieves good change detection results in small areas. However, the sample annotation workload and cost are high, making popularization and application over large areas difficult.
To effectively reduce the number of labeled samples, some semi-supervised, weakly supervised, and unsupervised change detection methods have been developed in recent years. For semi-supervised change detection, Peng et al. [27] proposed a semi-supervised convolutional network for change detection (SemiCDNet) based on a GAN. In terms of weakly supervised change detection, Sakurada et al. [28] divided semantic change detection into two parts, namely, change detection and semantic extraction. In this method, the bi-temporal images are first fed into the pre-trained correlated Siamese change detection network (CSCDNet) to generate a change probability map, and the change map and the bi-temporal images are then inputted into the silhouette-based semantic change detection network (SSCDNet) to realize semantic change detection. In terms of unsupervised change detection, Hou et al. [29] and Saha et al. [30] used pre-trained convolutional neural networks with low-rank decomposition [29] or CVA [30] to extract features from multi-temporal images and obtain the change map. Some researchers employ unsupervised deep learning frameworks, such as deep belief networks (DBNs) [31], symmetric convolutional coupling networks (SCCN) [32], and iterative feature mapping networks [33], to extract the difference representation between image pairs from homologous or different sources; the change detection results are then obtained in an unsupervised way, such as unsupervised clustering [31], a thresholding algorithm [32], or hierarchical clustering analysis [33]. Other researchers [34,35,36,37,38,39] used the results of traditional methods as pre-classification results to train deep neural networks, including a sparse denoising autoencoder (SDAE) [38], CNNs [35,40], a noise modeling-based unsupervised fully convolutional network, a capsule network [37], and a GAN [34,39], so that the entire change detection process requires no manual intervention.
In conclusion, weakly supervised, semi-supervised, and unsupervised change detection methods can significantly reduce the workload of manually annotating sample sets at low cost. However, due to the lack of strict supervision, guaranteeing the change detection effect with such methods is usually difficult. In recent years, supervised change detection methods developed with the help of existing geographic information have been shown to effectively reduce the annotation workload while offering significant advantages over supervised change detection with limited samples. Thus, in this paper, making full use of existing datasets built for other supervised training tasks, a trained semantic segmentation model is introduced as prior semantic information to guide change detection, which can effectively reduce the required sample size of the change detection dataset.

3. Methodology

In this paper, a PSI-CD method is proposed for bi-temporal high-resolution remote sensing images, as shown in Figure 1. The whole framework mainly consists of two parts. (1) A prior semantic information generation network (PSINet): relying on the semantic segmentation dataset constructed from the existing geographic information database, a reliable semantic segmentation model is trained to obtain the prior semantic information. (2) A change detection network (CDNet): a network designed to make full use of the prior semantic information. In this network, first, the encoder and decoder of the semantic segmentation model are directly used to extract the multi-scale convolution features of each single phase; second, these bi-temporal multi-scale convolution features are combined to obtain the multi-scale change features of the bi-temporal images; finally, these multi-scale change features are passed to the decoder to obtain the final change map.

3.1. Overview

PSI-CD consists of two networks that share weights: PSINet and CDNet. PSINet first extracts prior semantic information from a large number of semantic segmentation samples, and the extracted prior semantic information is then incorporated into CDNet, enabling CDNet to perform change detection with limited samples. PSINet uses the SENet module to enhance the channel-wise discrimination of high-order features; CDNet inherits this module because the two networks share weights. At the same time, CDNet uses the change analysis module (CAM) to extract the change information from the bi-temporal data, and the attention mechanism used in the CAM helps the network capture more useful information.
Let $I_{T_1}$ and $I_{T_2}$ denote a pair of images of the same region acquired at different times $T_1$ and $T_2$, respectively; let $L_{T_1}$ and $L_{T_2}$ denote the pixel-level semantic segmentation labels of the two images; and let $L_T$ denote the change detection label of the two images. The PSI-CD process can be summarized as follows (a minimal sketch of the two-stage flow is given after this list):
  • The semantic segmentation model (PSINet) is trained first. During the training of PSINet, the images $I_{T_1}$ and $I_{T_2}$ and the labels $L_{T_1}$ and $L_{T_2}$ are inputted into PSINet to extract multi-scale convolution features. The best PSINet is selected, and the weight parameters of the entire PSINet network are retained.
  • Next, the trained PSINet is embedded into CDNet to extract the multi-scale convolution features of each image. The bi-temporal images $I_{T_1}$ and $I_{T_2}$ and the change labels $L_T$ are inputted into CDNet, and the trained PSINet is used to extract the multi-scale convolutional features of the bi-temporal images, including two sets of encoder features $F = \{f_n^{T_1}, f_n^{T_2}\}, n = 1, 2, 3, 4, 5$, and two sets of decoder features $P = \{p_n^{T_1}, p_n^{T_2}\}, n = 1, 2, 3, 4$.
  • The bi-temporal image features $(F, P)$ are inputted into the change analysis module to obtain the multi-scale change features of the bi-temporal data. To improve the performance of the change analysis module, the attention mechanism and the concatenation of the multi-scale convolutional features are introduced to make the change analysis module more discriminating.
  • The multi-scale change features are inputted into the decoder, and the predicted change map $O_T$ is generated through operations such as convolution, upsampling, and skip connections. Losses are calculated from $O_T$ and the ground truth $L_T$, and backpropagation is performed to optimize the network.
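The process above can be condensed into a short PyTorch-style driver. The following is a minimal sketch, assuming hypothetical PSINet and CDNet classes (with a trainable_parameters helper) and data loaders named seg_loader and cd_loader; none of these names come from the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical two-stage driver mirroring the steps above; PSINet, CDNet,
# seg_loader, and cd_loader are placeholder names used for illustration.

# Step 1: train PSINet on single-date images and their semantic labels.
psinet = PSINet()
opt = torch.optim.Adam(psinet.parameters(), lr=1e-4)
for img, seg_label in seg_loader:                  # (I_Ti, L_Ti) pairs
    loss = F.binary_cross_entropy(psinet(img), seg_label)
    opt.zero_grad(); loss.backward(); opt.step()

# Steps 2-4: CDNet reuses the trained (and frozen) PSINet to extract the
# encoder/decoder features of both dates, fuses them in the change
# analysis module, and decodes the predicted change map O_T.
cdnet = CDNet(prior=psinet)
opt = torch.optim.Adam(cdnet.trainable_parameters(), lr=1e-4)
for img1, img2, change_label in cd_loader:         # (I_T1, I_T2, L_T)
    pred = cdnet(img1, img2)                       # predicted change map O_T
    loss = F.binary_cross_entropy(pred, change_label)
    opt.zero_grad(); loss.backward(); opt.step()
```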

3.2. PSINet

The main role of PSINet is to extract the prior semantic information for the downstream change detection task. Based on the semantic segmentation dataset constructed from the existing geographic information database, a reliable remote sensing image semantic segmentation model, which serves as the source of prior semantic information in this paper, is obtained by training. The structure of PSINet is shown in Figure 2.
In this paper, PSINet consists of three parts: encoder, attention module, and decoder. Skip connections are used between the encoder and the decoder to fuse the deep, semantic, and rough features in the decoder and the shallow, low-level, and fine-grained features in the encoder.
The encoder adopts a classic ResNet50 network [41] and is initialized with parameters pre-trained on ImageNet. The encoder consists of convolution (Conv), batch normalization (BN), max pooling (MaxPooling), and rectified linear units (ReLU). Shallow features are mapped into a high-dimensional space, resulting in deep, high-dimensional features.
To improve the network’s feature extraction ability for small targets, we introduce the SENet attention module [42] to further improve the quality of feature extraction. SENet is a lightweight attention module with a small number of parameters and good results. The SE module first compresses the channel dimensions of the high-dimensional features and then performs global average pooling on the features to obtain channel-level global features. Finally, the features are processed with fully connected layers and a sigmoid activation function and then dot-multiplied with the original features so that the high-dimensional features obtain a global view and better learn the differences between channels. The calculation of SENet is shown in Formulas (1) and (2):
$y_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$ (1)
$\tilde{x} = x \cdot \sigma(L_2(\mathrm{ReLU}(L_1(y_c))))$ (2)
where $x_c$ is the compressed feature on channel $c$; $y_c$ represents the channel-level global features; $\tilde{x}$ is the channel-recalibrated output feature; $H$ and $W$ represent the height and width of the feature, respectively; $L_1$ and $L_2$ are two fully connected layers; $\mathrm{ReLU}$ represents the rectified linear unit; and $\sigma$ represents the sigmoid function.
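As a concrete illustration of Formulas (1) and (2), a minimal PyTorch sketch of the SE block might look as follows; the reduction ratio of 16 is the common default from the SENet paper [42], not a value stated in this article.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block following Formulas (1) and (2).

    A minimal sketch; the reduction ratio is an assumption."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # L1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # L2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, c)         # y_c: channel-level global features
        w = self.fc(y).view(b, c, 1, 1)     # channel weights in (0, 1)
        return x * w                        # excitation: dot-multiply with input
```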
After the SE attention module generates the high-dimensional feature map, it is passed to the decoder, a multi-layer convolution-and-upsampling module, for dimensionality reduction. The decoder consists of upsampling, convolution, batch normalization, and ReLU. Each upsampling step doubles the width and height of the feature map and fuses the multi-level features extracted in the encoder to gradually restore the feature map to the original dimension and size.

3.3. CDNet

Based on the semantic information obtained by PSINet, a prior semantic information-guided change detection network (CDNet) is designed to reduce the required change detection sample size by sharing network weights. The workflow of CDNet mainly includes five parts: weight sharing, embedding of prior features, the change analysis module, the decoder, and the loss function, as shown in Figure 1. Details are explained as follows.

3.3.1. Weight Sharing

In general, the training of a change detection network is mostly conducted from scratch, consuming considerable computing resources and time, and the trained model performs poorly when the dataset is small. On the one hand, the upstream feature extraction tasks of semantic segmentation and change detection have strong similarities. On the other hand, when the semantic segmentation network and the change detection network are designed separately, the semantic information generated by semantic segmentation cannot be utilized effectively in the change detection task. Thus, in this paper, we attempt to combine the two tasks of semantic segmentation and change detection, making full use of semantic segmentation information in the process of change detection. The weight parameters of the feature extraction module of the change detection network are shared with the parameters of the semantic segmentation network so that the prior semantic information generated by the semantic segmentation network can be fully applied to the change detection task, effectively reducing the number of parameters of the change detection network that need to be adjusted. In the specific implementation, the change detection network adopts the same structure as the semantic segmentation network to extract the high-order features of the images. During the training of the change detection network, the parameters of the high-order feature extraction part are fixed, and only the parameters of the change analysis module and the decoder are trained.
Let $W_{net1}$ be the weights of PSINet and $W_{net2}$ be the weights of CDNet. Given that the weights of the high-order feature extraction part of CDNet are the same as the weights of PSINet, the weights of CDNet can be expressed as:
$W_{net2} = W_{net1} + W_{neck} + W_{decoder},$ (3)
where $W_{neck}$ represents the weights of the change analysis module of the change detection network, and $W_{decoder}$ represents the weights of the decoder of the change detection network. During training, $W_{net1}$ remains unchanged, and only $W_{neck}$ and $W_{decoder}$ are trained to make full use of the prior semantic information obtained by PSINet. Experiments verify that this method can significantly improve the change detection results and the convergence speed of training, especially when the dataset is small. A minimal sketch of this scheme follows.
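The following sketch assumes illustrative attribute names (backbone, cam, decoder) on the psinet/cdnet objects from the earlier sketch; it is not the authors' implementation.

```python
import torch

# Sketch of the decomposition W_net2 = W_net1 + W_neck + W_decoder: the
# trained PSINet weights are copied into CDNet's shared feature extractor
# and frozen, so only the change analysis module (W_neck) and the decoder
# (W_decoder) receive gradients. Attribute names are assumptions.
cdnet.backbone.load_state_dict(psinet.state_dict())   # share W_net1
for p in cdnet.backbone.parameters():
    p.requires_grad = False                           # W_net1 stays fixed

trainable = list(cdnet.cam.parameters()) + list(cdnet.decoder.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=1e-8)
```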

3.3.2. Embedding of Prior Features

Given bi-temporal images, the bi-temporal multi-level features $F = \{f_n^{T_1}, f_n^{T_2}\}, n = 1, 2, 3, 4, 5$, are extracted separately by the two branches of the change detection network, which are the same as those of PSINet. Given that the multi-level features of the two images are extracted by the same network, the feature shapes at each level are identical, that is,
$\mathrm{shape}(f_n^{T_1}) = \mathrm{shape}(f_n^{T_2}), \quad n = 1, 2, 3, 4, 5.$ (4)
Therefore, we can operate on the two sets of features at the same level of the two periods without changing the feature dimensions, thus avoiding the loss of feature information. Similarly, because PSINet is a symmetric U-shaped structure, the feature scales at the same level of the encoder and decoder in the same network are identical, that is,
$\mathrm{shape}(f_n) = \mathrm{shape}(p_n), \quad n = 1, 2, 3, 4.$ (5)
Therefore, we can operate on the two sets of features at the corresponding level in the same period without changing the feature dimensions, which makes it convenient to integrate the feature information of the encoder and the decoder and improves the representation ability of the prior semantic information.

3.3.3. Change Analysis Module (CAM)

The high-level features of the images extracted by PSINet correspond to the semantic information of the images. In the change detection task, the change features must be calculated from the high-level features of the bi-temporal images, and these change features are then inputted into the decoder of the change detection network to obtain the change map. We design a change analysis module to enhance the network’s ability to learn the difference between high-order features. The change analysis module learns the change information by comparing the embedded features of the bi-temporal images. In this module, a spatial attention module, abbreviated as SPAM, is introduced to improve the spatial discrimination of features. At the same time, a feature integration and concatenation module, abbreviated as FICM, is introduced to strengthen the network’s ability to extract the differences between the bi-temporal features. The entire change analysis module is shown in Figure 3. In this module, SPAM and FICM are parallel and have the same input; after a series of different processing steps, the two modules output features of the same size. Subsequent operations, such as concatenation and convolution, are performed on these output features, and the processed features, together with the other multi-level features, are passed to the decoder. Details of SPAM and FICM are presented as follows.
  • SPAM
Considering that channel-wise enhancement has already been carried out on the features in PSINet, we design SPAM to strengthen the spatial discriminativeness of the change features. Let the features of the two phases be $F_a$ and $F_b$, with channel, height, and width $C$, $H$, and $W$, respectively.
First, the spatial attention feature $A_S$ is obtained by multiplying $F_a$ and $F_b$ after transformation; $A_S$ is calculated as follows:
$F_a^T = \mathrm{Transpose}(\mathrm{Reshape}(F_a)),$ (6)
$F_b' = \mathrm{Reshape}(F_b),$ (7)
$A_S = F_a^T \otimes F_b',$ (8)
where $F_a^T$ is the feature obtained from $F_a$ through reshape and transpose operations; $F_b'$ is the feature obtained from $F_b$ through a reshape operation; $\mathrm{Transpose}$ represents matrix transposition; and $\otimes$ represents matrix multiplication.
After obtaining the spatial attention feature $A_S$, operations with $F_a^T$ and $F_b'$ are performed to obtain the attention-enhanced features $F_{Sa}$ and $F_{Sb}$, which are consistent with the dimensions of the input data. Here, $F_{Sa}$ is calculated from $F_a^T$ and $A_S$, and $F_{Sb}$ is calculated from $F_b'$ and $A_S$. This process can be expressed as follows:
$F_{Sa} = \mathrm{Reshape}(\mathrm{Reshape}(F_a^T) \otimes \mathrm{Softmax}(A_S)),$ (9)
$F_{Sb} = \mathrm{Reshape}(F_b' \otimes \mathrm{Softmax}(\mathrm{Transpose}(A_S))),$ (10)
where $F_{Sa}$ and $F_{Sb}$ are the features after spatial attention enhancement, with size $C \times H \times W$; the $\mathrm{Softmax}$ function represents column-by-column normalization.
Finally, the bi-temporal features are concatenated, and the output feature $F_{SPAM}$, with the same size and number of channels as the input data, is calculated as follows:
$F_{cat} = \mathrm{Cat}(F_{Sa}, F_{Sb}),$ (11)
$F_{SPAM} = \sigma(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(F_{cat}))))),$ (12)
where $F_{cat}$ represents the concatenation of $F_{Sa}$ and $F_{Sb}$; $\mathrm{Cat}$ represents the feature concatenation function; $\sigma$ represents the sigmoid function mapping features to the range (0, 1); $\mathrm{ReLU}$ represents the rectified linear unit; $\mathrm{BN}$ represents batch normalization; $\mathrm{Conv}_{3 \times 3}$ represents a convolutional layer with a 3 × 3 kernel; and $\mathrm{Conv}_{1 \times 1}$ compresses the number of channels of the feature to the specified number.
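A PyTorch sketch of SPAM, written directly from Formulas (6)-(12); this is an interpretation of the formulas, not the authors' released code.

```python
import torch
import torch.nn as nn

class SPAM(nn.Module):
    """Spatial attention module following Formulas (6)-(12).

    F_a and F_b are bi-temporal features of shape (B, C, H, W); the output
    F_SPAM has the same shape."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),          # Conv1x1
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),   # Conv3x3
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Sigmoid(),
        )

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fa.shape
        fa_t = fa.view(b, c, h * w).transpose(1, 2)    # F_a^T: (B, HW, C)
        fb_r = fb.view(b, c, h * w)                    # F_b':  (B, C, HW)
        attn = torch.bmm(fa_t, fb_r)                   # A_S:   (B, HW, HW)
        # Enhance each date's features with the column-normalized attention.
        f_sa = torch.bmm(fa_t.transpose(1, 2), attn.softmax(dim=1)).view(b, c, h, w)
        f_sb = torch.bmm(fb_r, attn.transpose(1, 2).softmax(dim=1)).view(b, c, h, w)
        return self.fuse(torch.cat([f_sa, f_sb], dim=1))  # F_SPAM
```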
  • FICM
We design FICM to preserve the details of the bi-temporal features as much as possible. In FICM, the bi-temporal features are first concatenated, and then, through a series of operations such as convolution, BN, and ReLU, the change features $F_{FICM}$, which are consistent with the dimensions of the input features, are obtained. This process can be expressed as follows:
$F_f = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(\mathrm{Cat}(F_a, F_b)))),$ (13)
$F_{FICM} = \sigma(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(F_f)))),$ (14)
where $\mathrm{Cat}$, $\sigma$, $\mathrm{ReLU}$, $\mathrm{BN}$, and $\mathrm{Conv}_{1 \times 1}$ are the same as those in Formulas (11) and (12).
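A corresponding PyTorch sketch of FICM, again written from Formulas (13)-(14) rather than the authors' code:

```python
import torch
import torch.nn as nn

class FICM(nn.Module):
    """Feature integration and concatenation module, Formulas (13)-(14).

    The bi-temporal features are concatenated, compressed with a 1x1
    convolution, then refined with a 3x3 convolution; the output matches
    the input feature dimensions."""
    def __init__(self, channels: int):
        super().__init__()
        self.integrate = nn.Sequential(                 # Formula (13)
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.refine = nn.Sequential(                    # Formula (14)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Sigmoid(),
        )

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        return self.refine(self.integrate(torch.cat([fa, fb], dim=1)))
```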
After obtaining the two sets of features $F_{SPAM}$ and $F_{FICM}$, we fuse them to obtain the final feature $F_{CAM}$ through 1 × 1 convolution, 3 × 3 convolution, batch normalization, ReLU, and sigmoid operations. $F_{CAM}$ is calculated as follows:
$F_{cat} = \mathrm{Cat}(F_{SPAM}, F_{FICM}),$ (15)
$F_{CAM} = \sigma(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(F_{cat}))))),$ (16)
where $\mathrm{Cat}$, $\sigma$, $\mathrm{ReLU}$, $\mathrm{BN}$, $\mathrm{Conv}_{1 \times 1}$, and $\mathrm{Conv}_{3 \times 3}$ are the same as those in Formulas (11) and (12).
Since SPAM in CAM adopts the spatial attention mechanism, dealing with bi-temporal features of large sizes is difficult. Here, we make a simplification: we only apply SPAM (as part of the full CAM) to the last layer (i.e., $F_5$). For the other layers (i.e., $F_1$–$F_4$ and $P_1$–$P_4$), we only use FICM in CAM to obtain the change features. The change features $F_n$ for the encoder and $P_n$ for the decoder of PSINet are calculated as follows:
$F_n = \begin{cases} F_{CAM}(f_n^{T_1}, f_n^{T_2}), & n = 5 \\ F_{FICM}(f_n^{T_1}, f_n^{T_2}), & n = 1, 2, 3, 4 \end{cases}$ (17)
$P_n = F_{FICM}(p_n^{T_1}, p_n^{T_2}), \quad n = 1, 2, 3, 4,$ (18)
where $f_n^{T_1}$ and $f_n^{T_2}$ correspond to the inputs $F_a$ and $F_b$ in Figure 3; $f_n^{T_1}$ and $f_n^{T_2}$ are the bi-temporal multi-scale convolutional features of the encoder of PSINet; and $p_n^{T_1}$ and $p_n^{T_2}$ are the bi-temporal multi-scale convolutional features of the decoder of PSINet.
This change analysis module (CAM) can make full use of the multi-level features extracted by the encoder and decoder of the semantic segmentation network and tap the potential of semantic information to the greatest extent, thereby reducing the model’s dependence on datasets effectively.
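Combining the two sketches above per Formulas (15)-(18), the full CAM (applied only at the deepest level $n = 5$, with FICM alone handling the other levels) might be sketched as follows; the SPAM and FICM classes are those from the earlier sketches.

```python
import torch
import torch.nn as nn

class FullCAM(nn.Module):
    """Change analysis module for the deepest level (n = 5): SPAM and FICM
    run in parallel on the same inputs and their outputs are fused per
    Formulas (15)-(16). Per Formulas (17)-(18), FICM alone is applied at
    n = 1..4 and at all decoder levels P_n instead of this module."""
    def __init__(self, channels: int):
        super().__init__()
        self.spam = SPAM(channels)     # defined in the SPAM sketch above
        self.ficm = FICM(channels)     # defined in the FICM sketch above
        self.fuse = nn.Sequential(     # Formula (16)
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Sigmoid(),
        )

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        f_cat = torch.cat([self.spam(fa, fb), self.ficm(fa, fb)], dim=1)  # Formula (15)
        return self.fuse(f_cat)        # F_CAM, same shape as fa and fb
```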

3.3.4. Decoder

To tap the potential of the semantic information more fully, a suitable network must be designed, and the multi-scale high-order features generated by the encoder and decoder of PSINet must be fully exploited. In detail, the change analysis module is first used to extract the change features from the bi-temporal high-order convolutional features at each scale, and the change features of each layer in the encoder are then concatenated with the change features of the same size in the decoder to obtain the fused change features $R_n$. $R_n$ is calculated as follows:
$R_n = \begin{cases} \mathrm{Cat}(F_n, P_n), & n = 1, 2, 3, 4 \\ F_n, & n = 5, \end{cases}$ (19)
where $F_n$ is the change feature obtained from the bi-temporal encoder features, and $P_n$ is the change feature obtained from the bi-temporal decoder features.
Similar to the decoder of the classical Unet structure, the fused change features are upsampled by the decoder starting from the last layer ($F_5$) and concatenated with the fused features of the corresponding level, and the feature map is finally upsampled to the size of the input image to obtain the final change map. A minimal sketch of such a decoder follows.
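The sketch below assumes ResNet50-like channel widths that double per level and a factor-2 spatial gap between adjacent levels; it illustrates how $R_n$ from Formula (19) can feed a Unet-style decoder, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChangeDecoder(nn.Module):
    """Illustrative Unet-style decoder over the fused change features R_n.

    F_levels holds the five encoder change features F_1..F_5 and P_levels
    the four decoder change features P_1..P_4 (0-indexed lists). Channel
    widths and the factor-2 scale gap between levels are assumptions."""
    def __init__(self, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.blocks = nn.ModuleList()
        for n in reversed(range(4)):                   # levels 4 -> 1
            in_ch = channels[n + 1] + 2 * channels[n]  # upsampled x + Cat(F_n, P_n)
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, channels[n], kernel_size=3, padding=1),
                nn.BatchNorm2d(channels[n]),
                nn.ReLU(inplace=True)))
        self.head = nn.Conv2d(channels[0], 1, kernel_size=1)

    def forward(self, F_levels, P_levels):
        x = F_levels[4]                                # R_5 = F_5
        for block, n in zip(self.blocks, reversed(range(4))):
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)
            r_n = torch.cat([F_levels[n], P_levels[n]], dim=1)  # R_n, Formula (19)
            x = block(torch.cat([x, r_n], dim=1))
        # A final upsample to the input size may still be needed,
        # depending on the stride of the encoder's first level.
        return torch.sigmoid(self.head(x))             # change map
```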

3.3.5. Loss Function

Since the change detection task in this paper involves only two classes, binary cross-entropy is used as the loss function, with predicted probabilities $p$ and $1 - p$ for the two classes. The loss function $L$ can be expressed as follows:
$L = \frac{1}{N} \sum_i L_i = -\frac{1}{N} \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],$ (20)
where $N$ is the number of pixels in the image; $y_i$ represents the label of sample $i$, in which the positive class is 1 and the negative class is 0; the base of the logarithm is $e$; and $p_i$ is the probability that sample $i$ is predicted to be the positive class.
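Formula (20) is the standard per-pixel binary cross-entropy (natural log), which PyTorch provides directly; a minimal usage sketch with dummy tensors:

```python
import torch

criterion = torch.nn.BCELoss()                          # expects probabilities p_i
pred = torch.rand(4, 1, 512, 512)                       # predicted change map O_T
target = torch.randint(0, 2, (4, 1, 512, 512)).float()  # ground truth L_T
loss = criterion(pred, target)                          # averaged over all N pixels
```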

4. Experiments

4.1. Datasets

In this paper, two datasets (WHU and LEVIR-CD) are mainly used to validate the proposed method. Each is a semantic change detection dataset that can be used for both semantic segmentation and change detection tasks. WHU is a public dataset that includes a first-phase image T1 with corresponding building semantic labels, a second-phase image T2 with corresponding building semantic labels, and building change labels. For the LEVIR-CD dataset, we newly annotated the corresponding semantic segmentation labels on the basis of the public dataset to form a semantic change detection dataset. To distinguish it from the public change detection dataset LEVIR-CD, we use LEVIR to denote this semantic change detection dataset in the following. Examples of the two datasets are shown in Figure 4. The details of the two datasets are presented as follows.
  • WHU dataset
Semantic Segmentation: This dataset is located in Christchurch, New Zealand. The original aerial images were obtained from the New Zealand Land Information Service website. Approximately 22,000 independent buildings can be found here. Ji et al. [43] first downsampled the image from a ground resolution of 0.075 m to 0.3 m. Then they cropped them into 8189 tiles, and each tile has 512 × 512 pixels. Each image has a corresponding ground truth. In our experiments, we randomly split it into three parts: 4736/1036/2416 for training/validation/test, respectively.
Change detection: This dataset belongs to the same region as the semantic segmentation dataset. The bi-temporal images were obtained in April 2012 and 2016. A total of 2386 image pairs of 512 × 512 pixels are available, and each image pair has a corresponding change label. During the experiment, we randomly split it into three parts: 1559/315/512 for training/validation/test, respectively.
  • LEVIR dataset
Semantic Segmentation: The original images of this dataset are from the public change detection dataset LEVIR-CD [7]. The semantic labels include two classes of buildings and non-buildings, which were manually produced and checked by our research group. The entire semantic segmentation dataset contains 637 × 2 images, and each image has a corresponding ground truth; the image resolution is 0.5 m/pixel, and the original image is 1024 × 1024 pixels. Based on the training set, validation set, and test set of the public dataset, we cropped the images into small patches of size 512 × 512, and the training set, validation set, and test set for semantic segmentation are 8650, 512, and 1024 images, respectively.
Change detection: LEVIR-CD [7] is a building change detection dataset in the field of remote sensing. The original images come from Google Earth. It consists of 637 pairs of bi-temporal very high-resolution (VHR) images. The time span of the bi-temporal VHR images is 5 to 14 years. The ground resolution of the VHR images is 0.5 m, and the original image size is 1024 × 1024 pixels. Each image pair has a corresponding ground truth of building change detection. In our experiment, based on the training set, validation set, and test set of the original dataset, we cropped the images into small patches of size 512 × 512; the training set, validation set, and test set for change detection contain 1524, 256, and 512 image pairs, respectively.

4.2. Training Details

The proposed method, PSI-CD, is implemented with PyTorch. All experiments are conducted on an NVIDIA RTX 2080Ti GPU with 11 GB of memory. The method consists of two networks, namely, PSINet and CDNet. Details are presented as follows.
PSINet: During network training, the initialization parameters of the ResNet backbone come from the pretraining weights on the ImageNet dataset. The number of training epochs is 400. We used the Adam optimizer with β1 = 0.9, β2 = 0.999, an initial learning rate of 1 × 10⁻⁴, and a weight decay of 1 × 10⁻⁸. The input and output image sizes are 512 × 512 × 3 and 512 × 512 × 1, respectively. During data loading, random rotation and random color transformation [11] are adopted to augment the dataset.
CDNet: During training, the feature extraction part that corresponds to PSINet adopts the weights of the trained PSINet. The number of training epochs is 200. We used the Adam optimizer with β1 = 0.9, β2 = 0.999, an initial learning rate of 1 × 10⁻⁴, and a weight decay of 1 × 10⁻⁸. The input of the network corresponds to the bi-temporal images. The input and output image sizes are 512 × 512 × 3 and 512 × 512 × 1, respectively. The same random rotation and random color transformation [11] are adopted to augment the dataset, as sketched below.
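For reference, a torchvision-style sketch of this augmentation is given below; the parameter values are assumptions, since the paper defers the details to [11].

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline (random rotation plus random color
# transformation); the specific parameter values are assumptions. Note
# that geometric transforms such as rotation must be applied identically
# to the image(s) and the label, i.e., sampled once per sample rather
# than independently.
color_jitter = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)
rotation = T.RandomRotation(degrees=(0, 360))
```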
At the same time, to reduce the difficulty of training and the dependence of the model on the dataset, the parameters of the feature extraction part of CDNet are fixed during training, and only the change analysis module and the decoder of CDNet are updated. This greatly reduces the number of trainable parameters and effectively cuts the training time and computing cost of the change detection network.

4.3. Comparison of Different Methods

To verify the effectiveness of the proposed method, extensive comparisons are made with state-of-the-art change detection methods on the two datasets, namely, FC-EF [13], FC-CONC [13], FC-DIFF [13], DTCDSCN [23], SNUNET [15], and BIT [20]. A brief description of the above methods is as follows.
(1) FC-EF [13] is based on the Unet model, including four max-pooling and upsampling layers, and its input is the concatenation of the image pair.
(2) FC-CONC [13] is a Siamese extension of the FC-EF model. The two encoders share weights, and the bi-temporal features are then concatenated into the decoder.
(3) FC-DIFF [13] is another Siamese extension of FC-EF. During decoding, unlike FC-CONC’s concatenation of the bi-temporal image features, FC-DIFF uses the absolute value of the difference between the bi-temporal image features.
(4) DTCDSCN [23] consists of a change detection network and two semantic segmentation networks. It realizes change detection under dual-task constraints by sharing the parameters of feature extraction.
(5) SNUNET [15] designs a densely connected Siamese network for change detection to achieve dense skip connections between encoders and decoders and between decoders and decoders.
(6) BIT [20] introduces a transformer into change detection and proposes a bi-temporal image transformer to model contexts within the spatial–temporal domain.
The above CD networks use the same hyperparameters. In this section, all comparisons are conducted using the same data augmentation, and details of the data augmentation can be seen in [11].

4.3.1. Comparisons on WHU Dataset

On the WHU dataset, the quantitative results of IoU, OA, precision, recall, and F1 for all methods are shown in Table 1 (a sketch of how these metrics are computed is given below). Table 1 shows that FC-EF achieves the lowest IoU and F1 of 59.52% and 74.62%, respectively; the second lowest is FC-DIFF, with 63.24% and 77.48% on IoU and F1, respectively. These results indicate that on the WHU dataset, the Siamese network is beneficial for improving the change detection results. The IoU and F1 of FC-CONC are 71.99% and 83.72%, which shows that the fused feature obtained by directly concatenating the bi-temporal features is more helpful to the network than the difference feature obtained by subtracting the two and taking the absolute value; the former network structure achieves higher detection results on the WHU dataset. These three networks have the fewest training parameters and the highest frames per second (FPS). As the number of training parameters increases, the IoU of SNUNet reaches 79.29% and its F1 reaches 88.32%, indicating that complex backbone networks, such as dense residual network structures and sophisticated attention modules, can effectively capture the change information of the bi-temporal images. The BIT network achieves high scores of 81.29% and 89.68% in terms of IoU and F1, proving that the transformer structure can extract more useful information from the convolutional features. The DTCDSCN network achieves high scores of 82.19% and 90.22% in terms of IoU and F1, proving that reliable semantic information can improve network performance; however, its number of training parameters is the highest, and its frames per second are the lowest. Our method achieves the best results among all methods, with IoU and F1 of 83.25% and 90.86%, respectively, proving that prior semantic features and more advanced attention mechanisms enable the network to capture difference representations more efficiently, which significantly improves the change detection accuracy.
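The reported metrics follow the standard binary definitions (the paper does not restate them); a minimal sketch computing them from a pair of 0/1 masks:

```python
import numpy as np

def change_metrics(pred: np.ndarray, gt: np.ndarray):
    """Standard binary change detection metrics from 0/1 masks.

    Assumes both classes occur; add a small epsilon to the denominators
    to guard against degenerate masks."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, oa, precision, recall, f1
```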
Figure 5 shows the performance of all methods on the WHU dataset more intuitively. We selected building images that are as varied as possible, covering different types, shapes, sizes, and spatial distributions. For example, (a)–(f) are sparsely distributed buildings of different types and sizes, (g) is a building image with obvious radiation changes, (h) is an image of dense small buildings, and (i) and (j) are images of large and medium-sized buildings. FC-EF, FC-CONC, and FC-DIFF have the worst change detection effect on the WHU dataset, especially in areas with relatively small changes, where these three methods produce many false and missed detections. The SNUNet and DTCDSCN results are slightly better. However, when the target is partially occluded, their detection effect tends to be poor, as shown in Figure 5h,i, and their detection effect on the edges of targets is also poor, as shown in Figure 5e,h, which indicates that the attention mechanism helps extract small targets but has limited ability to extract building edges. The BIT network improves the detection results for small targets and target edge regions; however, it is more sensitive to changes in image radiation, as shown by the serious false detection in Figure 5g. Meanwhile, by fully incorporating prior semantic information, our network achieves the best visual detection effect on the WHU dataset. Even for areas with large radiation differences, the proposed algorithm still achieves good change detection results.

4.3.2. Comparisons on LEVIR Dataset

To further verify the effect of the proposed method, more experiments were conducted on the LEVIR dataset. Table 2 shows that, owing to the simple building structures in the LEVIR dataset, FC-EF, FC-DIFF, and FC-CONC achieve varying degrees of improvement in change detection on LEVIR. At the same time, because the LEVIR dataset has complex backgrounds and many small targets, the densely connected SNUNet achieves higher change detection results than the DTCDSCN method. The proposed method still achieves the best results on the LEVIR dataset, reaching 83.80% and 91.19% on IoU and F1, respectively. Compared with the second-ranked BIT, its IoU and F1 are 1.17% and 0.7% higher, respectively.
Figure 6 visually shows the change detection results of different methods on the LEVIR dataset. As on the WHU dataset, the performance of FC-EF, FC-CONC, and FC-DIFF is poor, and missed detections are serious, which corresponds to the high OA, high precision, and low recall in Table 2. SNUNet and DTCDSCN are slightly worse in the building edge areas, as shown in Figure 6b,c, and DTCDSCN has slightly more false detections than SNUNet. BIT has good detection results in most cases, but it is still prone to occasional obvious missed detections, as shown in Figure 6a,f,i. As on the WHU dataset, PSI-CD performs best among all methods.

4.4. Comparisons of Different Methods with Different Amounts of Data

To verify the effect of prior semantic information in our PSI-CD, we randomly selected change detection samples with different training amounts according to the ratios of 5%, 10%, 20%, 40%, 60%, 80%, and 100%. For the comparison method, we chose three models, namely, DTCDSCN, SNUNet, and BIT, because they achieved better experimental results when trained on the 100% dataset, and DTCDSCN also used semantic information to guide change detection. We used the same test set to perform statistics on the IoU metrics. The comparisons of the two datasets are shown in Table 3 and Table 4.
Table 3 and Table 4 show that on the WHU and LEVIR datasets, when the data volume is 100%, the IoU metrics of the four algorithms are not far apart, but as the change detection sample volume is continuously reduced, the results of DTCDSCN, SNUNet, and BIT drop significantly. Comparing the 5% and 100% sample sizes, the IoU of the proposed method drops by 6.13% on the more difficult WHU dataset and by only 2.48% on the LEVIR dataset. In contrast, the IoU of BIT drops by 9.86% and 7.26%, the IoU of SNUNet drops by 14.38% and 12.16%, and the IoU of DTCDSCN drops by 20.3% and 17.23%, respectively.
Although DTCDSCN also uses a semantic segmentation dataset during training, its change detection accuracy is not significantly higher than those of SNUNet and BIT and is lower than that of PSI-CD. The reasons may be as follows: (1) the structure of DTCDSCN is more sensitive to the dataset; (2) the reduction in data volume affects the extraction of semantic features; in particular, when the data volume is less than 60%, the semantic features have little effect on change detection.
Compared with SNUNet and BIT, with the guidance of prior semantic information, the results of PSI-CD on the 5% dataset are much higher than those of SNUNet and BIT, and they also show the best results on other sample sizes. Therefore, PSI-CD has better performance in the small sample change detection task.

4.5. Ablation Study

To effectively evaluate each component of the proposed method, we divide our method into combinations of different components and conduct an ablation study on the two datasets, WHU and LEVIR. The experimental data and hyperparameters used in this section are consistent with the other experiments in this paper. The combined settings of the different components in the ablation study are detailed as follows.
  • “Baseline”: Only the encoder part of PSINet is used; the decoder part of PSINet is not used; the attention module (i.e., SPAM) is not used in the change analysis module, and only FICM is used in the change analysis module.
  • “+Decoder”: The decoder part of PSINet is added on top of the baseline model; that is, both the encoder and decoder of PSINet are used.
  • “+CAM”: On top of “Baseline”, the change analysis module of the last layer (i.e., F5) adds the attention mechanism (i.e., SPAM), but the decoder of PSINet is not used.
  • “PSI-CD”: “+CAM” is added on top of “+Decoder”; this is our full network.
At the same time, to verify the influence of the accuracy of the prior semantic information on change detection, two kinds of prior semantic information are used in the ablation study: a fully trained PSINet and an incompletely trained PSINet. The incompletely trained PSINet is the model from the eighth training epoch.
The results of the ablation study for the two datasets are shown in Table 5.
Table 5 shows that our baseline model can achieve relatively desirable results even when guided by prior semantic information that is not fully trained, with IoU reaching 79.71% and 80.21% on the WHU and LEVIR datasets, respectively. After adding the decoder of PSINet, the change detection performance improved, with IoU increasing by 1.83% and 1.98% on the two datasets, respectively; adding the CAM module improved IoU by 2.27% and 1.94%. This shows that both the decoder’s semantic information and the CAM module help improve the effectiveness of change detection. With the two combined, PSI-CD improved by 3.31% and 2.94% on WHU and LEVIR, respectively, compared to the baseline.
When better prior semantic information was used, the baseline model also performed better on WHU and LEVIR, with IoU of 82.33% and 82.70%, respectively, an improvement of 2.62% and 2.49%. This improvement of the baseline is larger than those brought by the decoder’s semantic information or CAM alone, which indicates that the accuracy of the prior semantic information has a greater impact on the change detection results. In this setting, the decoder’s semantic information and CAM improved IoU by 0.51% and 0.57% on WHU and by 0.58% and 0.56% on LEVIR, respectively, which demonstrates that every component of the proposed network is effective. With both combined, PSI-CD improved by 0.92% and 1.1% on WHU and LEVIR, respectively, relative to the baseline.

5. Conclusions

In this paper, a prior semantic information-guided change detection method is proposed to realize the change detection of bi-temporal high-resolution remote sensing images. In terms of data, we supplement the LEVIR dataset with pixel-level semantic segmentation labels for the bi-temporal images. For the model, we propose a new idea of incorporating prior semantic information into the change detection network to reduce the amount of labeled data required for change detection, and we design PSI-CD accordingly. Extensive experiments performed on the WHU and LEVIR datasets show that the proposed method exhibits good performance and achieves the best results among the comparative methods. In addition, we randomly selected multiple training subsets of different sizes in various proportions and calculated the corresponding model accuracy under each sample size; comparisons with DTCDSCN, SNUNet, and BIT verify the good performance of the proposed method in the case of small samples. Finally, to demonstrate the effectiveness of the components of our network, we conducted an ablation study on the WHU and LEVIR datasets, demonstrating that the decoder of PSINet and the attention mechanism help improve change detection accuracy. In future research, we will continue to explore more efficient and universal prior semantic information extraction methods and continue to optimize the high-resolution remote sensing image change detection model on this basis.

Author Contributions

S.P. conceived and guided the algorithm, and she also wrote the manuscript; X.L. conducted the experiments and wrote the manuscript; J.C. aided in the experimental verification. Z.Z. conceived the algorithm and revised the manuscript. X.H. received the funding and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Special Fund of Hubei Luojia Laboratory (No. 220100028), the Ministry of Education of the People’s Republic of China (No. 22YJC880058), the Knowledge Innovation Program of Wuhan-Shuguang Project (No. 2022010801020281), the Hubei Provincial Natural Science Foundation (No. 2021CFB539), the Fundamental Research Funds for the Central Universities (No. CCNU22QN011 and No. CCNU22QN019), and the University-level Educational Reformation Research Project for Undergraduate Education, Central China Normal University, China (No. 202143).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Jiacheng Sun, Bingchen Wang, and Yueyang Zheng for their delineation of the semantic labels of the LEVIR-CD dataset.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tewkesbury, A.P.; Comber, A.J.; Tate, N.J.; Lamb, A.; Fisher, P.F. A critical synthesis of remotely sensed optical image change detection techniques. Remote Sens. Environ. 2015, 160, 1–14. [Google Scholar] [CrossRef] [Green Version]
  2. Zagoruyko, S.; Komodakis, N. Learning to Compare Image Patches via Convolutional Neural Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  3. Zhan, Y.; Fu, K.; Yan, M.; Sun, X.; Wang, H.; Qiu, X. Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1845–1849. [Google Scholar] [CrossRef]
  4. Zhang, M.; Xu, G.; Chen, K.; Yan, M.; Sun, X. Triplet-Based Semantic Relation Learning for Aerial Remote Sensing Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2019, 16, 266–270. [Google Scholar] [CrossRef]
  5. Zhang, W.; Lu, X. The Spectral-Spatial Joint Learning for Change Detection in Multispectral Imagery. Remote Sens. 2019, 11, 240. [Google Scholar] [CrossRef] [Green Version]
  6. Chen, H.; Li, W.; Shi, Z. Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  7. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  8. Daudt, R.C.; Saux, B.L.; Boulch, A.; Gousseau, Y. High Resolution Semantic Change Detection. arXiv 2018, arXiv:1810.08452. [Google Scholar] [CrossRef]
  9. Kim, J.H.; Lee, H.; Hong, S.J.; Kim, S.; Park, J.; Hwang, J.Y.; Choi, J.P. Objects Segmentation From High-Resolution Aerial Images Using U-Net With Pyramid Pooling Layers. IEEE Geosci. Remote Sens. Lett. 2019, 16, 115–119. [Google Scholar] [CrossRef]
  10. Jiang, H.; Hu, X.; Li, K.; Zhang, J.; Gong, J.; Zhang, M. PGA-SiamNet: Pyramid Feature-Based Attention-Guided Siamese Network for Remote Sensing Orthoimagery Building Change Detection. Remote Sens. 2020, 12, 484. [Google Scholar] [CrossRef] [Green Version]
  11. Pang, S.; Zhang, A.; Hao, J.; Liu, F.; Chen, J. SCA-CDNet: A robust siamese correlation-and-attention-based change detection network for bitemporal VHR images. Int. J. Remote Sens. 2022, 43, 6102–6123. [Google Scholar] [CrossRef]
  12. Zheng, Z.; Wan, Y.; Zhang, Y.; Xiang, S.; Peng, D.; Zhang, B. CLNet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 247–267. [Google Scholar] [CrossRef]
  13. Daudt, R.C.; Saux, B.L.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  14. Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef] [Green Version]
  15. Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A Densely Connected Siamese Network for Change Detection of VHR Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  16. Shi, Q.; Liu, M.; Li, S.; Liu, X.; Zhang, L. A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  17. Zhang, L.; Hu, X.; Zhang, M.; Shu, Z.; Zhou, H. Object-level change detection with a dual correlation attention-guided detector. ISPRS J. Photogramm. Remote Sens. 2021, 177, 147–160. [Google Scholar] [CrossRef]
  18. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119. [Google Scholar] [CrossRef]
  19. Zhu, Q.; Guo, X.; Deng, W.; Shi, S.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Land-Use/Land-Cover change detection based on a Siamese global learning framework for high spatial resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 63–78. [Google Scholar] [CrossRef]
  20. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  21. Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.; Zhang, L. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. Remote Sens. 2022, 183, 228–239. [Google Scholar] [CrossRef]
  22. Chen, Z.; Zhou, Y.; Wang, B.; Xu, X.; He, N.; Jin, S.; Jin, S. EGDE-Net: A building change detection method for high-resolution remote sensing imagery based on edge guidance and differential enhancement. ISPRS J. Photogramm. Remote Sens. 2022, 191, 203–222. [Google Scholar] [CrossRef]
  23. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building Change Detection for Remote Sensing Images Using a Dual Task Constrained Deep Siamese Convolutional Network Model. IEEE Geosci. Remote Sens. Lett. 2020, 18, 811–815. [Google Scholar] [CrossRef]
  24. Yang, K.; Xia, G.S.; Liu, Z.; Du, B.; Yang, W.; Pelillo, M.; Zhang, L. Asymmetric Siamese Networks for Semantic Change Detection in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  25. Tian, S.; Zhong, Y.; Zheng, Z.; Ma, A.; Tan, X.; Zhang, L. Large-scale deep learning based binary and semantic change detection in ultra high resolution remote sensing imagery: From benchmark datasets to urban application. ISPRS J. Photogramm. Remote Sens. 2022, 193, 164–186. [Google Scholar] [CrossRef]
  26. Shen, Q.; Huang, J.; Wang, M.; Tao, S.; Yang, R.; Zhang, X. Semantic feature-constrained multitask siamese network for building change detection in high-spatial-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 78–94. [Google Scholar] [CrossRef]
27. Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Huang, X. SemiCDNet: A Semisupervised Convolutional Neural Network for Change Detection in High Resolution Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5891–5906. [Google Scholar] [CrossRef]
  28. Sakurada, K.; Shibuya, M.; Wang, W. Weakly supervised silhouette-based semantic scene change detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6861–6867. [Google Scholar]
  29. Hou, B.; Wang, Y.; Liu, Q. Change Detection Based on Deep Features and Low Rank. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2418–2422. [Google Scholar] [CrossRef]
  30. Saha, S.; Bovolo, F.; Bruzzone, L. Unsupervised Deep Change Vector Analysis for Multiple-Change Detection in VHR Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3677–3693. [Google Scholar] [CrossRef]
  31. Zhang, H.; Gong, M.; Zhang, P.; Su, L.; Shi, J. Feature-Level Change Detection Using Deep Representation and Feature Change Analysis for Multispectral Imagery. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1666–1670. [Google Scholar] [CrossRef]
32. Liu, J.; Gong, M.; Qin, K.; Zhang, P. A Deep Convolutional Coupling Network for Change Detection Based on Heterogeneous Optical and Radar Images. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 545–559. [Google Scholar]
33. Zhan, T.; Gong, M.; Liu, J.; Zhang, P. Iterative feature mapping network for detecting multiple changes in multi-source remote sensing images. ISPRS J. Photogramm. Remote Sens. 2018, 146, 38–51. [Google Scholar]
  34. Gong, M.; Yang, Y.; Zhan, T.; Niu, X.; Li, S. A Generative Discriminatory Classified Network for Change Detection in Multispectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 321–333. [Google Scholar] [CrossRef]
35. Li, X.; Yuan, Z.; Wang, Q. Unsupervised Deep Noise Modeling for Hyperspectral Image Change Detection. Remote Sens. 2019, 11, 258. [Google Scholar] [CrossRef]
  36. Arabi, M.; Karoui, M.S.; Djerriri, K. Optical Remote Sensing Change Detection through Deep Siamese Network. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2018, Valencia, Spain, 22–27 July 2018. [Google Scholar]
37. Ma, W.; Xiong, Y.; Wu, Y.; Yang, H.; Zhang, X.; Jiao, L. Change Detection in Remote Sensing Images Based on Image Mapping and a Deep Capsule Network. Remote Sens. 2019, 11, 626. [Google Scholar] [CrossRef]
  38. Gong, M.; Zhan, T.; Zhang, P.; Miao, Q. Superpixel-Based Difference Representation Learning for Change Detection in Multispectral Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2658–2673. [Google Scholar] [CrossRef]
  39. Gong, M.; Niu, X.; Zhang, P.; Li, Z. Generative Adversarial Networks for Change Detection in Multispectral Imagery. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2310–2314. [Google Scholar] [CrossRef]
40. Wang, Q.; Yuan, Z.; Du, Q.; Li, X. GETNET: A General End-to-End 2-D CNN Framework for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2018, 57, 3–13. [Google Scholar] [CrossRef]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  42. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  43. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
Figure 1. Workflow of the PSI-CD method.
Figure 2. Prior semantic information generation network.
Figure 3. Change analysis module.
Figure 4. Examples of the WHU and LEVIR-CD datasets.
Figure 5. Experimental results of different methods on the WHU dataset. (a–l) show different types of building change detection results on the WHU dataset. White, black, red, and green represent true positive, true negative, false positive, and false negative, respectively.
Figure 6. Experimental results of different methods on the LEVIR-CD dataset. (a–l) show different types of building change detection results on the LEVIR-CD dataset. White, black, red, and green represent true positive, true negative, false positive, and false negative, respectively.
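The color coding of Figures 5 and 6 compares the predicted binary change mask against the ground truth pixel by pixel. A minimal NumPy sketch of that rendering (the helper name `change_map_to_rgb` is ours, not from the paper):

```python
import numpy as np

def change_map_to_rgb(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Render a prediction/ground-truth comparison with the color
    scheme of Figures 5 and 6 (white/black/red/green = TP/TN/FP/FN)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    rgb = np.zeros((*pred.shape, 3), dtype=np.uint8)
    rgb[pred & gt] = (255, 255, 255)  # true positive: detected change
    # true negatives stay black (the array is zero-initialized)
    rgb[pred & ~gt] = (255, 0, 0)     # false positive: spurious change
    rgb[~pred & gt] = (0, 255, 0)     # false negative: missed change
    return rgb
```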
Table 1. Comparison of change detection results on the WHU dataset (accuracy metrics in %).

| Model | IoU | OA | Precision | Recall | F1 | Training Params | FPS |
|---|---|---|---|---|---|---|---|
| FC-EF | 59.52 | 98.39 | 87.46 | 65.07 | 74.62 | 1.35 × 10⁶ | 163.6 |
| FC-CONC | 71.99 | 98.83 | 85.22 | 82.26 | 83.72 | 1.55 × 10⁶ | 114.4 |
| FC-DIFF | 63.24 | 98.58 | 92.00 | 66.91 | 77.48 | 1.35 × 10⁶ | 117.6 |
| SNUNet | 79.09 | 99.17 | 90.68 | 86.08 | 88.32 | 1.20 × 10⁷ | 19.6 |
| BIT | 81.29 | 99.26 | 90.66 | 88.71 | 89.68 | 1.24 × 10⁷ | 49.6 |
| DTCDSCN | 82.19 | 99.31 | 92.70 | 87.87 | 90.22 | 4.11 × 10⁷ | 4.8 |
| PSI-CD | 83.25 | 99.35 | 93.19 | 88.65 | 90.86 | 2.34 × 10⁷ | 17.6 |
Table 2. Comparison of change detection results on the LEVIR-CD dataset (accuracy metrics in %).

| Model | IoU | OA | Precision | Recall | F1 | Training Params | FPS |
|---|---|---|---|---|---|---|---|
| FC-EF | 66.16 | 97.95 | 87.55 | 73.03 | 79.63 | 1.35 × 10⁶ | 163.6 |
| FC-CONC | 75.01 | 98.54 | 92.50 | 79.87 | 85.72 | 1.55 × 10⁶ | 114.4 |
| FC-DIFF | 71.30 | 98.37 | 95.32 | 73.88 | 83.24 | 1.35 × 10⁶ | 117.6 |
| SNUNet | 82.16 | 98.97 | 94.83 | 86.01 | 90.21 | 1.20 × 10⁷ | 19.6 |
| BIT | 82.63 | 98.99 | 93.88 | 87.33 | 90.49 | 1.24 × 10⁷ | 49.6 |
| DTCDSCN | 79.90 | 98.86 | 89.04 | 88.61 | 88.82 | 4.11 × 10⁷ | 4.8 |
| PSI-CD | 83.80 | 99.07 | 95.03 | 87.64 | 91.19 | 2.34 × 10⁷ | 17.6 |
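The accuracy metrics in Tables 1 and 2 follow the standard confusion-matrix definitions for binary change detection. A minimal sketch of how they can be computed from a predicted mask and its ground truth (the function name is ours; the paper's exact evaluation code is not reproduced here, and the sketch assumes at least one predicted and one true changed pixel):

```python
import numpy as np

def binary_cd_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """IoU, OA, Precision, Recall, and F1 for a binary change mask,
    where 1 marks a changed pixel."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # changed pixels correctly detected
    tn = np.sum(~pred & ~gt)  # unchanged pixels correctly rejected
    fp = np.sum(pred & ~gt)   # false alarms
    fn = np.sum(~pred & gt)   # missed changes
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "IoU": tp / (tp + fp + fn),
        "OA": (tp + tn) / pred.size,
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }
```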
Table 3. Comparison of change detection results (IoU, %) of different methods on the WHU dataset with different training data volumes.

| Model | 5% | 10% | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|---|
| SNUNet | 64.71 | 66.13 | 72.64 | 77.64 | 78.88 | 79.40 | 79.09 |
| BIT | 71.43 | 72.80 | 76.04 | 79.24 | 78.53 | 81.74 | 81.29 |
| DTCDSCN | 61.89 | 64.98 | 70.04 | 76.96 | 78.59 | 81.67 | 82.19 |
| PSI-CD | 76.95 | 78.80 | 80.71 | 79.14 | 80.47 | 82.45 | 83.25 |
Table 4. Comparison of change detection results (IoU, %) of different methods on the LEVIR-CD dataset with different training data volumes.

| Model | 5% | 10% | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|---|
| SNUNet | 70.00 | 74.25 | 76.98 | 80.26 | 80.92 | 81.37 | 82.16 |
| BIT | 75.37 | 77.03 | 78.22 | 80.61 | 80.61 | 80.71 | 82.63 |
| DTCDSCN | 62.67 | 67.19 | 71.66 | 75.87 | 77.92 | 79.09 | 79.90 |
| PSI-CD | 81.14 | 82.12 | 83.04 | 83.37 | 83.57 | 83.54 | 83.80 |
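Tables 3 and 4 report IoU when each model is trained on 5% to 100% of the available training samples. The exact sampling protocol is described in the experiments section; a fixed-seed random subset, as sketched below, is one plausible reading (the helper and its parameters are hypothetical, not the authors' code):

```python
import random

def sample_training_subset(tiles: list, fraction: float, seed: int = 42) -> list:
    """Draw a reproducible random fraction of the training tiles
    (an assumed protocol for the 5-100% settings in Tables 3 and 4)."""
    rng = random.Random(seed)
    k = max(1, round(len(tiles) * fraction))
    return rng.sample(tiles, k)

# e.g., a 5% subset of 1000 bi-temporal tile pairs -> 50 pairs
subset = sample_training_subset(list(range(1000)), 0.05)
```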
Table 5. Results of the ablation study on the WHU and LEVIR-CD datasets (unit: %). The first five metric columns were obtained with an incompletely trained PSINet, the last five with a fully trained PSINet.

| Dataset | Method | IoU | OA | Pre | Rec | F1 | IoU | OA | Pre | Rec | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| WHU | Baseline | 79.71 | 99.23 | 95.56 | 82.78 | 88.71 | 82.33 | 99.32 | 94.57 | 86.42 | 90.31 |
| WHU | +Decoder | 81.54 | 99.29 | 93.91 | 86.09 | 89.83 | 82.84 | 99.34 | 93.61 | 87.81 | 90.62 |
| WHU | +CAM | 81.98 | 99.31 | 94.04 | 86.47 | 90.10 | 82.90 | 99.34 | 94.25 | 87.32 | 90.65 |
| WHU | PSI-CD | 83.02 | 99.02 | 94.25 | 87.45 | 90.72 | 83.25 | 99.35 | 93.19 | 88.65 | 90.86 |
| LEVIR-CD | Baseline | 80.21 | 98.86 | 94.70 | 83.98 | 89.02 | 82.70 | 99.01 | 95.44 | 86.11 | 90.53 |
| LEVIR-CD | +Decoder | 82.19 | 98.99 | 95.83 | 85.23 | 90.22 | 83.28 | 99.35 | 93.01 | 88.84 | 90.88 |
| LEVIR-CD | +CAM | 82.15 | 99.03 | 92.39 | 88.12 | 90.20 | 83.26 | 99.08 | 92.19 | 89.58 | 90.87 |
| LEVIR-CD | PSI-CD | 83.15 | 99.36 | 94.90 | 87.04 | 90.80 | 83.80 | 99.07 | 95.03 | 87.64 | 91.19 |