Article

MU-Net: Embedding MixFormer into Unet to Extract Water Bodies from Remote Sensing Images

1 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 School of Atmospheric Science and Remote Sensing, Wuxi University, Wuxi 214105, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(14), 3559; https://doi.org/10.3390/rs15143559
Submission received: 15 May 2023 / Revised: 3 July 2023 / Accepted: 12 July 2023 / Published: 15 July 2023

Abstract

Water body extraction is important for water resource utilization and for flood prevention and mitigation. Remote sensing images contain rich information, but because of complex spatial background features and noise interference, problems such as incomplete tributary extraction and inaccurate segmentation arise when extracting water bodies. Recently, using convolutional neural networks (CNNs) to extract water bodies has gradually become popular. However, the local property of the CNN limits the extraction of global information, whereas the Transformer, with its self-attention mechanism, has great potential for modeling global information. This paper proposes MU-Net, a hybrid MixFormer architecture, as a novel method for automatically extracting water bodies. First, the MixFormer block is embedded into Unet. The combination of CNN and MixFormer models the local spatial detail and the global context of the image, improving the network's ability to capture the semantic features of water bodies. Then, the features generated by the encoder are refined by an attention mechanism module to suppress the interference of image background noise and non-water features, which further improves the accuracy of water body extraction. Experiments show that our method achieves higher segmentation accuracy and more robust performance than mainstream CNN- and Transformer-based semantic segmentation networks. The proposed MU-Net achieves 90.25% and 76.52% IoU on the GID and LoveDA datasets, respectively. The experimental results also validate the potential of MixFormer in water extraction studies.

1. Introduction

Water body extraction, which refers to the correct and effective detection of water bodies in remote sensing images under the interference of complex spatial backgrounds, is a fundamental task in remote sensing image interpretation [1,2]. The extraction of water bodies is of great importance in disaster prevention [3], resource utilization, and environmental protection [4]. Remote sensing data from high-resolution and multi-spectral satellites are primarily used for water body extraction tasks. Many methods have been proposed to accurately extract water bodies from remote sensing images, which can be mainly divided into water body index-based spectral analysis methods and machine learning-based methods.
Water body index-based spectral analysis methods rely on thresholding. The normalized difference water index (NDWI) [5] exploits the different reflectance of water bodies in different bands to detect water. To address the problem that the NDWI cannot adequately handle image background noise, Xu et al. [6] replaced the near-infrared band in the NDWI with a short-wave infrared band and named the result the modified NDWI (MNDWI). A single threshold may lead to water body misclassification because the shadows of some targets have spectral characteristics similar to those of water bodies. Therefore, some researchers have also combined the NDWI with other relevant indices to eliminate the influence of shadows [7,8]. Although the performance of water body index-based methods continues to improve, they are still affected by static thresholds and subjective factors. In addition, the threshold method is not well suited to extracting information from small-area water bodies.
Machine learning-based water classification methods mostly use manually designed water features. The feature space formed by these features is fed into a machine-learning model for water extraction. Machine-learning approaches to water body extraction perform better because they avoid the subjective selection of thresholds and make better use of the information in the images. Balázs et al. [9] used principal component analysis to extract information from images and successfully identified water bodies and saturated soils. Zhang et al. [10] proposed an SVM-based method that enhances edge feature extraction and reduces some segmentation errors; however, its accuracy still needs improvement, and the algorithm is difficult to apply when the data volume is large. Despite the progress of the above methods, manually designed features have a limited ability to describe water bodies and require a certain amount of prior knowledge. Additionally, they cannot fully utilize deep features and global information to detect water bodies.
Traditional water extraction methods suffer from poor generalization ability and cannot remove noise interference well. Recently, convolutional neural networks (CNNs) have been used extensively on remote sensing images because of their excellent feature extraction ability. Deep CNNs (DCNNs) extract features at different image scales through multiple convolutional layers, avoiding the complicated feature selection process, and have gradually become a mainstream method for water extraction. Wang et al. [11] used a ResNet101 network with depth-wise separable convolution as the encoder and proposed a multi-scale densely connected module that enlarges the receptive field to improve the segmentation accuracy of small lake water bodies. Zhang et al. [12] designed a new multi-feature extraction module and a combined module to extract water bodies, considering feature information at different scales. Chen et al. [13] proposed a method based on feature pyramid enhancement and pixel pair matching to extract water bodies, which alleviates the loss of detail in deep networks and reduces boundary classification errors. Dang et al. [14] considered the multi-scale and multi-shape characteristics of water bodies and proposed a multi-scale residual model with a self-supervised learning strategy for water body extraction.
DCNN-based methods improve the performance of water body extraction, but some limitations remain. The convolution operation has a limited receptive field and lacks the ability to model global information: convolution only collects information from adjacent pixels. For semantic segmentation, modeling only local information ignores the relationships between pixels and results in low segmentation accuracy, whereas global information makes the semantic interpretation of each pixel more precise. To overcome the local property of the CNN and model global information in images, some researchers have combined attention mechanisms with networks [15,16]. Attention mechanisms adaptively weight important features and have also been applied to water body extraction tasks. Duan et al. [17] introduced channel and spatial attention mechanisms into the network and combined multi-scale features to enhance the continuity of water contours. To reduce the influence of noisy information on water body boundaries, Zhong et al. [18] introduced a two-way channel attention mechanism, which improved the network's accuracy for lake boundary segmentation. However, because of the interference of complex spatial backgrounds and shadows, the performance of extracting water body boundaries and tributaries still needs to be improved. Moreover, the above attention mechanisms rely on convolution operations, which limits the representation of global features.
Transformer is an attention-based architecture [19]. It can model the relationship between input tokens and deal with long-range dependencies. Unlike the CNN structure, Transformer processes one-dimensional sequence features generated from two-dimensional image features by flattening operations. The standard Transformer structure mainly consists of layer normalization (LN), multi-head self-attention (MHSA), multilayer perceptron (MLP), and skip connections. The Transformer can be continuously stacked to model the long-range dependencies of image features. Transformer performs global self-attention computation on image features to obtain feature representations, which can effectively capture long-range dependencies and compensate for the deficiency of CNN in the ability to extract global information of images. Although some studies have achieved satisfactory results using Transformer, they are mostly based on large-scale pre-training [19,20]. MixFormer [21] is a kind of Transformer structure with an excellent ability to capture global contextual information.
To extract water bodies accurately from remote sensing images, we embedded MixFormer into Unet [22] and propose MU-Net, a hybrid MixFormer structure. We validated the network's performance on the GID [23] and LoveDA [24] datasets. MU-Net combines CNN and MixFormer to capture the local and global contexts of images, respectively. CNNs have an excellent spatial inductive bias [25], so we used classical convolutional layers to extract shallow image features, which preserves high-resolution information and avoids large-scale pre-training. To obtain feature maps at different scales, the original image was downsampled via max pooling. To accurately identify water features in complex backgrounds and improve the integrity of tributaries and boundary contours, we first extracted local information from deep features with convolutional layers and then modeled global contextual information with a MixFormer block to mine deeper semantic features of water bodies. Afterward, the features generated by the encoder were refined by the attention mechanism module (AMM). The AMM weights the features related to water bodies to suppress the interference of non-water features and noise. Finally, in the decoding process, we recovered the resolution and detail information of the image by bilinear interpolation and skip connections to generate the final water body extraction results. The main contributions of this paper can be summarized as follows:
(1)
MU-Net, a network embedding MixFormer into Unet, is proposed for the automatic extraction of water bodies. To enhance the integrity and accuracy of the water body extraction results, CNN and MixFormer are combined to model the local spatial and global contextual information of the deep features, further mining the details and deep semantic features of water bodies.
(2)
The AMM refines the features generated by the encoder to suppress image background noise and non-water body features.
(3)
Compared with other CNN- and Transformer-based segmentation networks, our proposed method achieves optimal performance on the GID and LoveDA datasets.
The rest of this paper is organized as follows. Section 2 introduces related work, including the background of the model concept. In Section 3, we describe the proposed method in detail, including the overall structure and the specific structure of each module. We present the details of the experiments in Section 4, which include the dataset, the results of the experiments comparing different networks, and the ablation experiments of the proposed method. Section 5 verifies the rationality of some settings in MU-Net, and Section 6 summarizes the work.

2. Related Work

2.1. Semantic Segmentation

The semantic segmentation network ultimately generates a probability map using the Softmax or Sigmoid function and assigns a category label to each pixel based on its maximum classification probability. The emergence of the fully convolutional network (FCN) [26] laid the foundation of semantic segmentation and spurred the development of convolution-based segmentation. Badrinarayanan et al. [27] proposed SegNet, an encoder-decoder network that preserves detailed image information by saving position indices during downsampling. Unet [22], which has a structure similar to SegNet, was originally proposed to solve medical image segmentation problems. Unet recovers the image resolution by deconvolution in the decoding stage and then concatenates the result with the features generated in the encoding stage. However, this simple feature concatenation does not fully exploit the deep information in the image. In 2017, Zhao et al. [28] proposed PSPNet, in which a pyramid pooling module aggregates contexts at different scales to obtain different receptive fields. DeepLabv3+ [29] is one of the better-performing semantic segmentation networks; it utilizes atrous spatial pyramid pooling (ASPP) for multi-scale feature extraction. In recent years, semantic segmentation has also been applied to remote sensing images, for example, for building extraction [30,31], road extraction [32], water body extraction [33,34], and land cover mapping [35]. The extraction of target features is affected by the complex backgrounds of remote sensing images: land cover is diverse, the same type of object appears in different geographic environments, which increases scale variation, and images contain rich spatial detail and much noise, all of which lead to inaccurate segmentation results.

2.2. Attention Mechanism

To overcome the limitation of CNNs in extracting global information, a common approach embeds an attention mechanism into the network. Attention methods mainly include channel and spatial attention. Fu et al. [16] proposed DANet, which integrates local and global features and uses two attention mechanisms to model semantic interdependencies in the spatial and channel dimensions, respectively. Yu et al. [36] proposed an attention refinement module to refine features and guide feature learning. Attention mechanisms are also used in remote sensing. Xu et al. [37] designed a dual attention module to obtain global information in the feature extraction stage. Shi et al. [38] used attention to fuse spatial detail information and deep semantic information to enhance the precision of target boundary segmentation. Niu et al. [39] embedded location information into channel attention to establish long-range dependencies and strengthen feature representation. Applying attention mechanisms enhances the ability of segmentation networks to mine image information. However, the attention operations mentioned above depend on convolution operations and hence limit the representation of global information.

2.3. Transformer in Visual Tasks

Unlike the CNN structure, the Transformer converts two-dimensional image tasks into one-dimensional sequence tasks [19]. Owing to its strong sequence-to-sequence modeling ability, the Transformer performs well in extracting global contextual information. Dosovitskiy et al. [40] proposed ViT, which uses a pure self-attention mechanism to extract information from images and applies it to image classification. The Transformer has quadratic computational complexity with respect to the number of tokens, which slows down image processing. Liu et al. [20] proposed the Swin Transformer, which computes self-attention within local windows and uses a shifted-window scheme for cross-window information interaction. To address the restricted receptive field and weak modeling ability of window-based self-attention, MixFormer [21] combines window-based self-attention with depth-wise convolution to expand the receptive field and adds bi-directional interaction branches to improve modeling ability. The Transformer has also been applied to image segmentation tasks. SETR [41] uses a Transformer to encode the image for segmentation, modeling global information by serializing the image. Xie et al. [42] proposed SegFormer, which uses a Transformer-based encoder and a lightweight MLP decoder to aggregate multi-scale feature information. In the field of medical imaging, Chen et al. [43] proposed TransUNet, which combines CNN and Transformer. TransUNet uses a multilayer Transformer to encode the images; the decoder still uses a CNN structure but has a large number of parameters. Cao et al. [44] proposed a pure Transformer structure for medical image segmentation, using the Swin Transformer as an encoder to extract contextual features and patch-expanding layers to recover the spatial resolution of the images.

2.4. Transformer in Remote Sensing

The Transformer has also been used in remote sensing. Wang et al. [45] proposed a hybrid CNN-Transformer structure to classify crops, combining the local and global information extracted by the CNN and the Transformer. BANet [46] contains a dependency path constructed with a Transformer and a texture path constructed with a CNN and uses an aggregation module to fuse global and spatial detail information for fine segmentation of urban scene images. Chen et al. [31] used the Swin Transformer with shifted windows as the backbone to obtain features, giving more attention to the global information in remote sensing images. Yuan et al. [47] proposed a multi-scale adaptive network based on the Swin Transformer and used convolution operations to fuse features at different levels to extract buildings. Wang et al. [48] proposed an efficient global-local attention mechanism and the Transformer-based decoder architecture UNetFormer to mine image information. These works introduce the Transformer into remote sensing imagery, and all of them have achieved good results. We propose MU-Net to extract water bodies from remote sensing images; it introduces the recent MixFormer [21], a structure with powerful information extraction ability in both the spatial and channel dimensions.

3. Proposed Method

This section introduces the proposed MU-Net. First, we give a general overview of the model’s structure, then we describe the structure of each module in detail and introduce the loss function.

3.1. Overall Structure

As shown in Figure 1, MU-Net is based on the encoder and decoder structure, which mainly consists of Conv block, MixFormer block, and AMM.
In the encoding process, we used max pooling to downsample the image and generate a series of feature maps at different scales. Shallow features are generated by convolution operations, so they contain high-resolution details and focus on local contextual information. Extracting the local features with convolution layers avoids the large-scale pre-training that Transformers require. The shallow features contain rich location and detail information and focus on capturing the boundaries of water bodies; these features are important for accurately extracting small water bodies and tributaries. All Conv blocks consist of two classical convolution modules, each comprising a convolution with kernel size 3, batch normalization (BN), and a rectified linear unit (ReLU).
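For concreteness, a minimal PyTorch sketch of such a Conv block is given below. The module name, channel arguments, and padding of 1 (to preserve spatial size) are our assumptions; only the 3 × 3 convolution + BN + ReLU structure comes from the text.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two (3x3 Conv -> BN -> ReLU) units, matching the Conv blocks described above.
    Channel widths and padding are illustrative assumptions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```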
Deep features focus on capturing semantic information to accurately classify pixels, which is important for accurately identifying water features in different scenarios. For deep-level features, the local information is first modeled by convolutional layers, and then the features are flattened, and self-attention operations are performed in the MixFormer block. In this setting, the detailed information of the image is retained, and the global receptive field of the image is captured, which helps to extract the water body more accurately and completely. Considering the loss of detailed information and the scale of the model, we introduced two layers of MixFormer block in the network. Deep features capture the global receptive field by modeling the global information through the MixFormer block. The deep features have a smaller size, which reduces the computational cost. The sequence features output from the MixFormer block also need to be reshaped to convert 1D sequence features to 2D image features for the next convolution operation. Then, AMM refines the features produced by the encoder to suppress noise interference and strengthen the features related to water bodies.
The decoder keeps low-level features by skip connection, using bilinear interpolation to continuously recover resolution and detail information of the image. Finally, the feature maps are upsampled to the original image resolution, and the predicted water body extraction results are generated.
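A sketch of one decoding step under this description follows; it reuses the ConvBlock sketched above, and the alignment setting of the bilinear interpolation is an implementation detail not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One decoding step: bilinear upsampling, skip-connection concatenation, then a Conv block.
    A sketch of the decoding strategy described above; module names are ours."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = ConvBlock(in_ch + skip_ch, out_ch)  # ConvBlock from the sketch above

    def forward(self, x, skip):
        # Recover resolution with bilinear interpolation, then reuse encoder details via concat.
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.conv(x)
```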

3.2. MixFormer

In remote sensing images, the complex spatial background affects the feature extraction, and the acquisition of global information is beneficial to the accurate segmentation of targets. The usual solution is to add a convolution-based attention mechanism to the model, but it is difficult to obtain global contextual information. Alternatively, Transformer is used as the encoder, increasing the model’s complexity while losing image details. Therefore, we introduced the MixFormer [21] to model global contextual information after the CNN-based encoder obtains the deep feature, improving the long-range dependencies of images.
Although Transformer has an excellent ability to model global information, its structure makes its computational complexity much higher than that of the convolution-based attention mechanism, seriously affecting its potential for applications in remote sensing image processing. Some researchers have used the form of local windows to reduce the computational amount of self-attention [20,49]. MixFormer [21] combines window-based self-attention with depth-wise convolution to form parallel branches, enabling cross-window information interaction and expanding the receptive field. The use of depth-wise convolution also captures the local relationships of the image, which is complementary to the global information obtained using self-attention. Furthermore, MixFormer includes a bi-directional interaction path to deliver complementary information for parallel branches.

3.2.1. MixFormer Block

As shown in Figure 2, the MixFormer block consists of two normalization layers, mixing attention and a multilayer perceptron. The process can be formulated as follows:
$$\hat{X}^{l+1} = \mathrm{MIX}\big(\mathrm{LN}(X^{l}),\, \mathrm{WSA},\, \mathrm{CONV}\big) + X^{l}, \qquad X^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{X}^{l+1})\big) + \hat{X}^{l+1}$$
where WSA represents the window-based self-attention branch, CONV represents the depth-wise convolution branch, and MIX represents the operation that fuses the features of the WSA and CONV branches. LN represents layer normalization, and MLP represents the multilayer perceptron composed of two linear layers and one GELU [50]. In the parallel branches, the outputs of the depth-wise convolution branch and the window-based self-attention branch are concatenated and then fed into the feedforward network.
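The following PyTorch sketch shows one way to organize this block. It is a simplified reading, not the authors' implementation: the bi-directional interaction paths are omitted, WindowSelfAttention stands for the attention sketched in Section 3.2.2 below, and the MIX fusion is rendered as concatenation followed by a linear projection.

```python
import torch
import torch.nn as nn

class MixFormerBlock(nn.Module):
    """High-level sketch of the MixFormer block above: mixing attention (window self-attention
    in parallel with depth-wise convolution) followed by an MLP, each wrapped with LayerNorm
    and a residual connection. Interaction paths are omitted for brevity."""
    def __init__(self, dim, window_size=8, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.wsa = WindowSelfAttention(dim, window_size, num_heads)  # see Section 3.2.2 sketch
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # local branch
        self.proj = nn.Linear(2 * dim, dim)  # fuses the two branches (concat + projection)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence flattened from an H x W feature map
        B, N, C = x.shape
        shortcut = x
        y = self.norm1(x)
        attn = self.wsa(y, H, W)                                    # window-based self-attention
        conv = self.dwconv(y.transpose(1, 2).reshape(B, C, H, W))   # depth-wise convolution
        conv = conv.flatten(2).transpose(1, 2)
        x = shortcut + self.proj(torch.cat([attn, conv], dim=-1))   # MIX(...) + residual
        x = x + self.mlp(self.norm2(x))                             # MLP + residual
        return x
```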

3.2.2. Window-Based Self-Attention

The computational flow of self-attention and an illustration of the image window partition are shown in Figure 3. Window-based self-attention is computed as follows. The 2D feature map is first partitioned into 8 × 8 windows, and the 2D features are flattened into a 1D sequence; self-attention is then performed inside each window. The number of channels of the 1D sequence is expanded 3 times by a linear layer and then split into three vectors: query (Q), key (K), and value (V). The self-attention calculation is expressed as follows:
$$A = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{C}} + B\right)V$$
where QK^T captures the relationship between pixels. The dimensions of Q, K, and V are all N × C, where N is the length of the sequence and C is the number of channels. First, K is transposed (K^T); Q is then multiplied by K^T, and the result is divided by √C. As in the Swin Transformer, a relative position encoding B (B ∈ R^(N×N)) is added, and the result is passed through the Softmax function to generate a spatial attention matrix. Finally, this matrix is multiplied by V to output A, the feature map after the self-attention operation. More details of the window-based self-attention used in this paper can be found in the Swin Transformer [20].
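A simplified sketch of this computation follows. The relative position bias B is modeled as a directly learned matrix rather than the Swin Transformer's relative-position index table, and the scaling uses the per-head dimension; both are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Sketch of window-based self-attention: the map is split into non-overlapping 8x8 windows,
    tokens are projected to Q, K, V by one linear layer, and Softmax(QK^T / sqrt(d) + B)V is
    computed inside each window."""
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.ws, self.heads = window_size, num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # expands channels 3x, then split into Q, K, V
        self.proj = nn.Linear(dim, dim)
        # Relative position bias, simplified to a directly learned matrix (assumption).
        self.bias = nn.Parameter(torch.zeros(num_heads, window_size**2, window_size**2))

    def forward(self, x, H, W):
        B, N, C = x.shape
        ws, h = self.ws, self.heads
        # Partition the H x W map into (H/ws)*(W/ws) windows of ws*ws tokens each.
        x = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, C)                        # (num_windows*B, ws*ws, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(-1, ws * ws, h, C // h).transpose(1, 2)   # (nW*B, heads, ws*ws, C/heads)
        k = k.view(-1, ws * ws, h, C // h).transpose(1, 2)
        v = v.view(-1, ws * ws, h, C // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(C // h) + self.bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(-1, ws * ws, C)
        out = self.proj(out)
        # Reverse the window partition back to a (B, N, C) sequence.
        out = out.view(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, N, C)
```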

3.2.3. Parallel Branches and Bi-Directional Interactions

As shown in Figure 2, the purpose of using window-based self-attention and depth-wise convolution as parallel branches is to solve the limited receptive field problem arising from the self-attention operation within local windows. Considering the inference efficiency, the parameter setting of MU-Net is the same as MixFormer [21], and the convolutional kernel size of depth-wise convolution is set to 3 × 3.
The parallel bi-directional interactions between branches are designed to improve the modeling ability of the network [21]. The self-attention branch performs self-attention operations within the windows and shares weights in the channel dimension, which can lead to weak modeling ability in the channel dimension. The depth-wise convolution shares weights in the spatial dimension while modeling channel dimension features [51]. As shown in Figure 2, the information extracted by depth-wise convolution is passed through the channel interaction path to the window-based self-attention branch to improve the modeling ability of the network in the channel dimension. In the same way, the information obtained from the window-based self-attention branch is passed through the spatial interaction path to the depth-wise convolution branch to enhance the modeling ability of spatial dimension.

3.3. AMM

The shallow features generated by the encoder contain rich spatial details but lack semantic information and are susceptible to noise interference. Moreover, most non-water disturbances exist in the shallow features, so we proposed an attention mechanism module to refine the features and further enhance the network's ability to identify water bodies. The structure of the AMM is shown in Figure 4. We constructed two attention branches to improve the representation of encoder features in the spatial and channel dimensions, where r is set to 16. Specifically, the channel attention branch generates the feature map F ∈ R^(1×1×C) by a global average pooling layer, where C is the channel dimension. Two 1 × 1 convolutions first compress the number of channels to 1/16 of the original and then restore it, and the channel attention map is generated by the sigmoid function. The spatial attention branch uses depth-wise convolution to strengthen information in the spatial dimension and then applies two 1 × 1 convolutions with BN and ReLU; these two convolution layers reduce the number of channels to 1. The channel and spatial attention maps are multiplied by the original input features, and the refined feature maps are generated by feature fusion.
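A sketch of the AMM under this description is given below. The intermediate channel width of the spatial branch, the activation between the two 1 × 1 convolutions of the channel branch, the sigmoid gating of the spatial branch, and the fusion by element-wise addition are assumptions where the text does not fully specify them.

```python
import torch
import torch.nn as nn

class AMM(nn.Module):
    """Sketch of the attention mechanism module (Figure 4) with r = 16."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Channel branch: GAP -> 1x1 conv (C -> C/r) -> 1x1 conv (C/r -> C) -> sigmoid.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),                       # activation placement is an assumption
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial branch: depth-wise conv, then two 1x1 convs (with BN + ReLU) reducing to 1 channel.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, kernel_size=1),
            nn.Sigmoid(),                                # sigmoid gating is an assumption
        )

    def forward(self, x):
        # Both attention maps gate the input; fusion by addition is an assumption.
        return x * self.channel(x) + x * self.spatial(x)
```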

3.4. Loss Function

During training, MU-Net was trained in an end-to-end manner using a target loss function, and the final prediction map was generated by a Softmax function. Water samples are not evenly distributed in the remote sensing images; water pixels make up a much smaller proportion than negative samples. We used dice loss [52] to prevent training from focusing on the background region, which would otherwise lead to inaccurate segmentation results. Our target loss function combines cross-entropy and dice loss [52]. The specific formulas are shown below.
$$\mathcal{L}_{ce} = -\sum_{i=1}^{t} y_i \log(p_i)$$
$$\mathcal{L}_{dice} = 1 - \frac{\sum_{i=1}^{t} y_i p_i + \varepsilon}{\sum_{i=1}^{t} (y_i + p_i) + \varepsilon}$$
$$\mathcal{L}_{p} = \alpha \mathcal{L}_{ce} + \beta \mathcal{L}_{dice}$$
where t represents the total number of pixels, y_i represents the ground-truth value of pixel i, and p_i represents the predicted probability of pixel i generated by the final Softmax function. α and β are two hyperparameters that give different weights to the two losses. In our experiments, we set α = 1.0, β = 0.7, and ε = 10^(-5).
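A sketch of this target loss with the stated settings (α = 1.0, β = 0.7, ε = 10⁻⁵) follows; it assumes a two-channel Softmax output and uses the water channel for the dice term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CombinedLoss(nn.Module):
    """Sketch of the target loss L = alpha * L_ce + beta * L_dice described above."""
    def __init__(self, alpha: float = 1.0, beta: float = 0.7, eps: float = 1e-5):
        super().__init__()
        self.alpha, self.beta, self.eps = alpha, beta, eps

    def forward(self, logits, target):
        # logits: (B, 2, H, W); target: (B, H, W) long tensor with values {0, 1}, 1 = water
        ce = F.cross_entropy(logits, target)
        p = torch.softmax(logits, dim=1)[:, 1]          # predicted water probability
        y = target.float()
        dice = 1 - ((y * p).sum() + self.eps) / (y.sum() + p.sum() + self.eps)
        return self.alpha * ce + self.beta * dice
```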

4. Experiments

4.1. Dataset

We conducted comparison experiments on the GID [23] and LoveDA datasets [24].

4.1.1. GID Dataset

The original GID dataset [23] contains 150 images with annotations from the large-scale Gaofen-2 (GF-2) satellite. The GF-2 satellite carries a multi-spectral sensor with a spatial resolution of 4 m. The dataset images, each with a size of 6800 × 7200 pixels, cover more than 60 cities in China. The GID large classification dataset contains 5 categories, where the water body category includes areas such as lakes, ponds, rivers, and paddy fields. We retained only the water body category in the original labels and set the other categories as background. Finally, the images containing water bodies in the GID dataset were traversed and cropped to 512 × 512 size images. A total of 8953 images were obtained. The cropped images were divided into training and test datasets, containing 7162 and 1791 images (ratio 8:2), respectively.
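The tiling step can be sketched as follows; a non-overlapping stride equal to the tile size is assumed, since the cropping stride is not stated.

```python
import numpy as np

def crop_tiles(image: np.ndarray, tile: int = 512):
    """Traverse an image (H, W, C) and crop it into tile x tile patches.
    Non-overlapping tiles are an assumption."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            patches.append(image[top:top + tile, left:left + tile])
    return patches
```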

4.1.2. LoveDA Dataset

The original LoveDA dataset [24] has 5987 high-resolution images, including 2713 images of urban scenes and 3274 images of rural scenes, each 1024 × 1024 pixels. The spatial resolution of the dataset is 0.3 m, with red, green, and blue bands, and all data come from the Google Earth platform. The dataset contains 7 land cover types. We set all categories except water bodies as background. The images containing water bodies in the original dataset were traversed and cropped into 512 × 512 images, generating 4476 images. The cropped images were divided into training and test datasets in a ratio of 8:2, containing 3581 and 895 images, respectively. Data augmentation, including random flipping and rotation, was applied to the dataset during training.
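A minimal sketch of the random flipping and rotation applied during training follows; the 0.5 flip probability and 90° rotation steps are assumptions.

```python
import random
import numpy as np

def augment(image: np.ndarray, label: np.ndarray):
    """Random flipping and rotation applied jointly to an image tile and its label."""
    if random.random() < 0.5:                 # horizontal flip
        image, label = np.fliplr(image), np.fliplr(label)
    if random.random() < 0.5:                 # vertical flip
        image, label = np.flipud(image), np.flipud(label)
    k = random.randint(0, 3)                  # rotation by a random multiple of 90 degrees
    image, label = np.rot90(image, k), np.rot90(label, k)
    return image.copy(), label.copy()
```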
Examples of images and labels from the two datasets are shown in Figure 5 and Figure 6.

4.2. Implementation Details

4.2.1. Experimental Settings

To validate the performance of the proposed method, we compared MU-Net with widely used CNN- and Transformer-based segmentation networks on the GID and LoveDA datasets. None of the networks used in the experiments used pre-trained weights. The experiments were implemented under the PyTorch framework with Python 3.8 on an NVIDIA GeForce RTX 2080 Ti 11-GB GPU. The operating system was Windows 10, running on an Intel Core i9 at 3.60 GHz. To accelerate convergence during model training, we used the stochastic gradient descent (SGD) optimizer [53] with a momentum of 0.9 and a weight decay of 1 × 10^(-4). The number of training epochs was set to 150, and the initial learning rate was set to 0.001. We used an early stopping mechanism that terminates training when the performance on the test dataset does not improve for 10 epochs. The learning rate decay strategy is "poly", defined as follows:
$$lr = lr_{0} \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$
where iter represents the current iteration, max_iter represents the total number of iterations, lr_0 represents the initial learning rate, and power is set to 0.9.
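Under these settings, the optimizer and the poly schedule can be sketched as follows; the placeholder model and the total iteration count are illustrative values, not the authors'.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=3)   # placeholder; MU-Net would be used here
max_iter = 150 * 1000                    # illustrative total iteration count (assumption)

# SGD with momentum 0.9, weight decay 1e-4, initial lr 0.001, as stated above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-4)
# "Poly" decay: lr = lr0 * (1 - iter / max_iter) ** power, with power = 0.9.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** 0.9)
```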

4.2.2. Evaluation Metrics

Semantic segmentation is image classification at the pixel level. The water extraction task is ultimately a pixel classification, so pixel-level evaluation metrics are generally used to assess the performance of the segmentation network. We used five evaluation metrics. Overall accuracy (OA) represents the proportion of correctly classified pixels among all pixels, while precision (P) represents the proportion of correctly classified water body pixels among all pixels classified as water bodies. Recall (R) represents the proportion of correctly classified water body pixels among the original water body pixels. The F1-score (F1) is the harmonic mean of precision and recall. Intersection over Union (IoU) is the ratio of the intersection to the union of the ground-truth and predicted water body pixels. The specific formulas are shown below:
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1\text{-}score = \frac{2\,TP}{2\,TP + FP + FN}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
where TP (true positive) is the number of pixels correctly identified as water bodies; TN (true negative) is the number of pixels correctly identified as non-water; FP (false positive) is the number of pixels incorrectly identified as water bodies; and FN (false negative) is the number of pixels incorrectly identified as non-water. Precision and recall were used to evaluate the degree of error in water body extraction.
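These metrics can be computed directly from the confusion-matrix counts, as in the following sketch (binary masks with 1 = water are assumed).

```python
import numpy as np

def water_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-level metrics from the confusion-matrix terms defined above.
    pred and gt are binary arrays where 1 = water and 0 = background."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    tn = np.logical_and(pred == 0, gt == 0).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    return {
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "IoU": tp / (tp + fp + fn),
    }
```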

4.3. Experimental Results on GID Dataset

4.3.1. Comparison with CNN-Based Semantic Segmentation Networks on GID Dataset

To verify the performance of MU-Net, we compared the model with other CNN-based semantic segmentation networks on the GID dataset. None of the CNN-based networks in this paper use pre-trained weights. The specific results are shown in Table 1. The results show that MU-Net offers a significant performance improvement over the other networks. Compared to Unet, the IoU, precision, and recall of MU-Net are improved by 2.13%, 1.87%, and 0.51%, respectively. Compared to SegNet, a natural image segmentation method, the IoU of MU-Net is improved by 3.86%. Compared to the water segmentation methods MWEN and SR-SegNet, the IoU of MU-Net is improved by 1.71% and 2.06%, respectively. Our network thus achieves the best results on all five metrics, indicating that our model outperforms the CNN-based segmentation networks.
Figure 7 shows the visualization of the prediction results of MU-Net and other CNN-based segmentation networks. SegNet appears to over-segment in meandering and slender tributary scenes, identifying the scenes around the tributaries as water bodies. PSPNet can also identify some of the small tributaries, but there is missed detection when there are many tributaries, and the river is complex. This is because the model cannot learn rich global contextual information. In other different scenes, although other segmentation networks can identify water bodies well, the results of MU-Net are more continuous and accurate. In the last row, SegNet and SR-SegNet cannot distinguish the lakes due to their weak ability to suppress noise interference. MU-Net utilizes MixFormer to model global contextual information for deep-level features and capture features with long-range dependencies in images. The global information modeling by self-attention enhances the feature representation of water bodies in different scenes and obtains more accurate water body segmentation results. Compared with other networks, MU-Net can capture clearer and more accurate boundaries of water bodies. The results show that our proposed method is more advantageous in identifying tributaries and water body boundaries.

4.3.2. Comparison with Transformer-Based Segmentation Networks on GID Dataset

We compared MU-Net with other Transformer-based segmentation networks. The Transformer-based networks in this paper do not use pre-trained weights. As can be seen in Table 2, MU-Net obtains optimal results on the five evaluation metrics. Remote sensing images contain much information, in which shallow features are very important for accurate localization and segmentation of water bodies. The network with Transformer as the encoder has a slightly lower performance, possibly because the pure Transformer as the encoder loses some image shallow detail information, resulting in inaccurate segmentation results. The networks using CNN or hybrid Transformer structure as the encoder can effectively extract shallow and deep features of images. Compared with TransUNet, our method improves OA by 0.92% and IoU by 3.55%. Compared with UNetFormer, a remote sensing image segmentation network, the precision and IoU of our method are improved by 0.55% and 1.12%, respectively. The above metrics show that our method outperforms these Transformer-based segmentation networks.
Figure 8 shows the visualization results of MU-Net compared with other Transformer-based segmentation networks. The figure shows that the edges of the tributaries detected by BANet are relatively coarse because the model cannot accurately obtain position information of the tributaries, which leads to low accuracy of boundary segmentation. TransUNet is not precise enough in identifying the boundary of water bodies, and there are some missed detection instances. MU-Net can better locate the position of water bodies, and the detected edges are more accurate and continuous. UTNet and UNetFormer are mixed CNN and Transformer networks. The results show that the water bodies identified by UTNet and UNetFormer are relatively complete but lose some detailed information. From row 4 of Figure 8, MU-Net can accurately identify the non-water body part in the lake. MU-Net uses AMM to suppress the background noise interference and obtain more accurate information on lake boundaries. When processing images with complex backgrounds, MU-Net further suppresses the interference of non-water features and maintains the water body integrity. The visualization results demonstrate that our method can obtain richer semantic features of water bodies and can completely identify water bodies in different scenes.

4.4. Experimental Results on LoveDA Dataset

4.4.1. Comparison with CNN-Based Segmentation Networks on LoveDA Dataset

To verify the robustness of MU-Net, we conducted comparative experiments on the LoveDA dataset. Table 3 shows that our model still maintains the highest OA, F1-Score, and IoU compared with other CNN-based networks. Compared to the water extraction network MWEN, the MU-Net improves the IoU by 1.05% and the F1-score by 0.68%. Compared to SegNet, MU-Net improves precision and IoU by 1.42% and 3.23%, respectively. Although the precision of SR-SegNet is high, the recall is low. The low recall tends to lead to under-segmentation, which indicates that the performance of SR-SegNet is not balanced. The experimental results show that MU-Net maintains stable performance on the LoveDA dataset, which indicates that MU-Net has better practical generalization ability.
Figure 9 shows the visualization results of MU-Net and other CNN-based segmentation networks on the LoveDA dataset. PSPNet and SegNet cannot clearly identify the boundaries of water bodies. MU-Net uses MixFormer to model the global information of the image, and the identified water body features are more complete. AMM suppresses the interference of background noise and further refines the boundary information. In the fifth row, MWEN and SR-SegNet identify parts of the islands in the lake as water bodies, while MU-Net accurately excludes these non-water features, yielding more accurate water body extraction results.

4.4.2. Comparison with Transformer-Based Segmentation Networks on LoveDA Dataset

Table 4 shows the results of the experiments comparing MU-Net with other Transformer-based networks on the LoveDA dataset. The experimental results show that MU-Net achieves optimal performance on OA, F1-score, and IoU. Compared to TransUNet, the IoU and F1-score of MU-Net are improved by 1.26% and 0.82%, respectively. Compared to UTNet, the F1-score of MU-Net improves by 0.57%. In summary, the experiment results on the LoveDA dataset not only verify the robustness of MU-Net but also show that Transformer has great potential for water body extraction studies.
Figure 10 shows the visualization results of MU-Net and other Transformer-based networks on the LoveDA dataset. As seen in Figure 10, SwinUnet cannot suppress the image background noise well, and its segmentation of water body edges is not smooth. UTNet and UNetFormer can accurately locate the water body but do not preserve its continuity well. Compared with the other methods, MU-Net uses MixFormer to capture comprehensive contextual relationships of water bodies and accurately locates the water features in the images.

4.5. Ablation Experiments

To further evaluate the performance of each module in the MU-Net, we conducted ablation experiments on the GID dataset. We used OA, precision, recall, F1-score, and IoU as the main metrics to evaluate the model.
Baseline: The baseline is an encoder-decoder structure with 4 downsampling and 4 upsampling operations. In the baseline, the 1/8 and 1/16 feature maps model local features with 3 × 3 convolutions only.
MixFormer: In this section's experiments, MixFormer refers to the MixFormer blocks. Local relationships in the deep features were first modeled by convolution, and global contextual information was then modeled by two layers of MixFormer blocks.
AMM: We added AMM to baseline+MixFormer to construct the MU-Net (indicated as baseline+MixFormer+AMM). The features produced by the encoder were refined through AMM.
As can be seen from Table 5, the network performance was improved after introducing the MixFormer block based on the baseline. The precision, recall, and IoU of the model were improved by 1.1%, 1.07%, and 1.91%, respectively. The results of the experimental visualization are shown in Figure 11. Before the MixFormer block was introduced, the segmentation results of the model appeared hollow and lost water body information at the boundary. This is because only simple convolution operations are used for local feature extraction, which cannot learn global contextual information in the image, resulting in missed and inaccurate boundary segmentation. After adding the MixFormer block, the result of water body extraction is more complete. This indicates that MixFormer improves the ability of the network to locate water bodies by modeling global information. The experiments show that MixFormer performs excellently in maintaining the integrity and continuity of the water extraction results.
The results of AMM ablation experiments are shown in Table 5, and all the metrics are improved after adding AMM to MU-Net. The precision increased by 0.48%, which means that the ability of the network to identify water body pixels correctly has improved. The visualization results of the experiment are shown in Figure 12. A certain degree of misclassification occurs before the addition of AMM due to the interference caused by the similarity of the spectral features of objects near water bodies. AMM can refine the features related to water bodies in the image. It uses an attention mechanism to assign different weights to features and suppress interference of non-water body features and background noise. After the shallow features pass through AMM, the result of water body segmentation excludes these interferences and accurately identifies the water body boundary. The results of the ablation experiments on the GID dataset verify the effectiveness of each component of MU-Net.

5. Discussion

5.1. Discussion of α and β in Loss

Two hyperparameters, α and β, were used in the loss to assign different weights to the cross-entropy and dice losses. We set different values of α and β on the GID dataset to analyze the effect of the two loss weights on the experimental results. As shown in Table 6, the IoU values are very similar when α = 1 and β = 1.0 or 0.7. When β = 0.5, the IoU and F1-score of the network decrease, indicating that a too-small dice loss weight affects the final result of network training. When the positive and negative samples in the image are unbalanced, dice loss causes the model to mine more positive sample regions during training and improves segmentation accuracy. When β = 1 and α = 0.5 or 0.7, the F1-score and IoU of the network decrease to different degrees, with the largest decrease at α = 0.5. A proportionally larger cross-entropy loss benefits the network's convergence because cross-entropy measures the pixel-level difference between the actual and predicted distributions; a too-small proportion of cross-entropy loss can cause training instability. Therefore, after careful consideration, we set α = 1 and β = 0.7.

5.2. MixFormer Block in MU-Net

The mixing attention mechanism, which is the main structure in the MixFormer block, is verified by experiments in this section. We substituted the mixing attention mechanism with other self-attention mechanisms and kept the MU-Net architecture unchanged to illustrate the effectiveness and necessity of introducing MixFormer into MU-Net. AMM was not used in the experiments of this section. As can be seen in Table 7, benefiting from the powerful modeling ability of the mixing attention, MU-Net achieved optimal performance on the GID dataset. Moreover, the experiments verified the effectiveness of MU-Net architecture compared to other segmentation networks on the GID dataset.

5.3. Sampling Layers of the Network

Our proposed method contains four sampling operations. We set up MU-Net with three, four, and five sampling layers to analyze the influence of the number of sampling layers on the network's performance. The experiments in this section were implemented on the GID dataset. The specific settings are shown in Figure 13. In this section's experiments, we added the number of model parameters as an evaluation index, measured in millions (M).
From Table 8, when the number of sampling layers is four, the network’s performance is well-balanced in all metrics. When the sampling layer is three, the network has the lowest IoU. Although the number of parameters is minimized when the sampling layer is three, fewer sampling layers do not extract sufficient water body information from the images, decreasing the network’s accuracy. When the number of sampling layers is four or five, the IoU and F1-score of MU-Net are similar. However, MU-Net with five sampling layers has a larger number of parameters, affecting the network’s convergence and increasing the model’s training time. Moreover, a too-deep network would generate redundant features. Therefore, we set the number of sampling layers to four.

5.4. Model Complexity of the MU-Net

In order to further verify the performance of our proposed method, we also analyzed the complexity of the network. We compared different networks in four aspects: the number of network parameters, network computational complexity (measured in floating point operations (Flops)), IoU, and inference speed. The IoU is derived from experimental results on the GID dataset. The inference speed is expressed in terms of the number of images processed per second. The results are shown in Table 9. With respect to the number of parameters and computational complexity, MU-Net is smaller than most CNN-based networks. MU-Net reduces the number of parameters by 7.75 M and increases the IoU by 2.13% compared to Unet. MU-Net achieves the highest IoU without reducing inference speed much. In conclusion, MU-Net has optimal performance and outperforms most networks in terms of the number of parameters and Flops, achieving a good tradeoff between network performance and complexity.

6. Conclusions

This paper proposes MU-Net, a new hybrid CNN and MixFormer structure, to extract water bodies from remote sensing images. MU-Net combines CNN and MixFormer to fully extract the local and global contextual information of images; it models the global contextual information with a MixFormer block, which compensates for the shortcomings of the CNN. We verified the network's performance on the GID and LoveDA datasets, and the experimental results show that MU-Net achieves optimal performance. The visualization results show that MU-Net identifies water bodies with higher pixel precision and more accurate position information. In addition, we conducted ablation experiments to verify the effectiveness of each component of MU-Net. Although MU-Net achieves better results than other models, some shortcomings remain. First, the complexity of our method is not greatly reduced. In addition, data annotation is laborious and time-consuming, whereas semi-supervised learning can achieve similar performance using only a small amount of labeled data. Therefore, our future work will build a lightweight network and use semi-supervised learning for water body extraction.

Author Contributions

Conceptualization, H.L. and Y.Z.; methodology, H.L.; validation, H.L., Y.Z. and W.T.; formal analysis, H.L.; investigation, H.L.; resources, Y.Z.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, Y.Z., K.T.C.L.K.S., and G.M.; visualization, H.L.; supervision, W.T., H.Z., D.X., S.G., and G.M.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2021YFE0116900), the Fengyun Application Pioneering Project (Grant No. FY-APP-2022.0604), the National Nature Science Foundation of China (Grant No. 42175157), and the Jiangsu Province Graduate Research Innovation Program Project (Grant No. KYCX23_1366).

Data Availability Statement

The experiments were conducted using publicly available datasets. The download sites of the datasets can be found in the corresponding published papers. The code is available at https://github.com/Kakakakakakah/MU-Net, accessed on 13 July 2023.

Acknowledgments

The authors thank the anonymous reviewers and the editors for their valuable comments to improve our manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Haibo, Y.; Zongmin, W.; Hongling, Z.; Yu, G. Water Body Extraction Methods Study Based on RS and GIS. Procedia Environ. Sci. 2011, 10, 2619–2624.
2. Verma, U.; Chauhan, A.; Pai, M.M.M.; Pai, R. DeepRivWidth: Deep Learning Based Semantic Segmentation Approach for River Identification and Width Measurement in SAR Images of Coastal Karnataka. Comput. Geosci. 2021, 154, 104805.
3. Pawełczyk, A. Assessment of Health Hazard Associated with Nitrogen Compounds in Water. Water Sci. Technol. 2012, 66, 666–672.
4. Mantzafleri, N.; Psilovikos, A.; Blanta, A. Water Quality Monitoring and Modeling in Lake Kastoria, Using GIS. Assessment and Management of Pollution Sources. Water Resour. Manag. 2009, 23, 3221–3254.
5. McFeeters, S.K. The Use of the Normalized Difference Water Index (NDWI) in the Delineation of Open Water Features. Int. J. Remote Sens. 1996, 17, 1425–1432.
6. Xu, H. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. Int. J. Remote Sens. 2006, 27, 3025–3033.
7. Xie, C.; Huang, X.; Zeng, W.; Fang, X. A Novel Water Index for Urban High-Resolution Eight-Band WorldView-2 Imagery. Int. J. Digit. Earth 2016, 9, 925–941.
8. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated Water Extraction Index: A New Technique for Surface Water Mapping Using Landsat Imagery. Remote Sens. Environ. 2014, 140, 23–35.
9. Balázs, B.; Bíró, T.; Dyke, G.; Singh, S.K.; Szabó, S. Extracting Water-Related Features Using Reflectance Data and Principal Component Analysis of Landsat Images. Hydrol. Sci. J. 2018, 63, 269–284.
10. Hannv, Z.; Qigang, J.; Jiang, X. Coastline Extraction Using Support Vector Machine from Remote Sensing Image. J. Multimed. 2013, 8, 175–182.
11. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A Novel Deep Learning Network for Lake Water Body Extraction of Google Remote Sensing Images. Remote Sens. 2020, 12, 4140.
12. Zhang, Z.; Lu, M.; Ji, S.; Yu, H.; Nie, C. Rich CNN Features for Water-Body Segmentation from Very High Resolution Aerial and Satellite Imagery. Remote Sens. 2021, 13, 1912.
13. Chen, S.; Liu, Y.; Zhang, C. Water-Body Segmentation for Multi-Spectral Remote Sensing Images by Feature Pyramid Enhancement and Pixel Pair Matching. Int. J. Remote Sens. 2021, 42, 5025–5043.
14. Dang, B.; Li, Y. MSResNet: Multiscale Residual Network via Self-Supervised Learning for Water-Body Detection in Remote Sensing Imagery. Remote Sens. 2021, 13, 3122.
15. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713.
16. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
17. Duan, Y.; Zhang, W.; Huang, P.; He, G.; Guo, H. A New Lightweight Convolutional Neural Network for Multi-Scale Land Surface Water Extraction from GaoFen-1D Satellite Images. Remote Sens. 2021, 13, 4576.
18. Zhong, H.-F.; Sun, H.-M.; Han, D.-N.; Li, Z.-H.; Jia, R.-S. Lake Water Body Extraction of Optical Remote Sensing Images Based on Semantic Segmentation. Appl. Intell. 2022, 52, 17974–17989.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030.
21. Chen, Q.; Wu, Q.; Wang, J.; Hu, Q.; Hu, T.; Ding, E.; Cheng, J.; Wang, J. MixFormer: Mixing Features across Windows and Dimensions. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
22. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention 2015, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015.
23. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322.
24. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; Vanschoren, J., Yeung, S., Eds.; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2021; Volume 1.
25. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv 2022, arXiv:2110.02178.
26. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
27. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2481–2495.
28. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
29. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
30. Xu, H.; Zhu, P.; Luo, X.; Xie, T.; Zhang, L. Extracting Buildings from Remote Sensing Images Using a Multitask Encoder-Decoder Network with Boundary Refinement. Remote Sens. 2022, 14, 564.
31. Chen, X.; Qiu, C.; Guo, W.; Yu, A.; Tong, X.; Schmitt, M. Multiscale Feature Learning by Transformer for Building Extraction From Satellite Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2503605.
32. Sun, Z.; Zhou, W.; Ding, C.; Xia, M. Multi-Resolution Transformer Network for Building and Road Segmentation of Remote Sensing Image. ISPRS Int. J. Geo-Inf. 2022, 11, 165.
33. Yuan, K.; Zhuang, X.; Schaefer, G.; Feng, J.; Guan, L.; Fang, H. Deep-Learning-Based Multispectral Satellite Image Segmentation for Water Body Detection. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2021, 14, 7422–7434.
34. Hu, K.; Li, M.; Xia, M.; Lin, H. Multi-Scale Feature Aggregation Network for Water Area Segmentation. Remote Sens. 2022, 14, 206.
35. Huang, J.; Weng, L.; Chen, B.; Xia, M. DFFAN: Dual Function Feature Aggregation Network for Semantic Segmentation of Land Cover. ISPRS Int. J. Geo-Inf. 2021, 10, 125.
36. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. arXiv 2018, arXiv:1808.00897.
37. Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2020, 13, 71.
38. Shi, H.; Fan, J.; Wang, Y.; Chen, L. Dual Attention Feature Fusion and Adaptive Context for Accurate Segmentation of Very High-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 3715.
39. Niu, X.; Zeng, Q.; Luo, X.; Chen, L. FCAU-Net for the Semantic Segmentation of Fine-Resolution Remotely Sensed Images. Remote Sens. 2022, 14, 215.
40. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
41. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
42. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 6–14 December 2021.
43. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
44. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537.
  45. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images. Remote Sens. 2022, 14, 1956. [Google Scholar] [CrossRef]
  46. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  47. Yuan, W.; Xu, W. MSST-Net: A Multi-Scale Adaptive Network for Building Extraction from Remote Sensing Images Based on Swin Transformer. Remote Sens. 2021, 13, 4743. [Google Scholar] [CrossRef]
  48. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  49. Huang, Z.; Ben, Y.; Luo, G.; Cheng, P.; Yu, G.; Fu, B. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer. arXiv 2021, arXiv:2106.03650. [Google Scholar]
  50. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2020, arXiv:1606.08415. [Google Scholar]
  51. Han, Q.; Fan, Z.; Dai, Q.; Sun, L.; Cheng, M.-M.; Liu, J.; Wang, J. On the Connection between Local Attention and Dynamic Depth-Wise Convolution. arXiv 2022, arXiv:2106.04263. [Google Scholar]
  52. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016. [Google Scholar]
  53. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  54. Guo, H.; He, G.; Jiang, W.; Yin, R.; Yan, L.; Leng, W. A Multi-Scale Water Extraction Convolutional Neural Network (MWEN) Method for GaoFen-1 Remote Sensing Images. ISPRS Int. J. Geo-Inf. 2020, 9, 189. [Google Scholar] [CrossRef]
  55. Weng, L.; Xu, Y.; Xia, M.; Zhang, Y.; Liu, J.; Xu, Y. Water Areas Segmentation from Remote Sensing Images Using a Separable Residual SegNet Network. ISPRS Int. J. Geo-Inf. 2020, 9, 256. [Google Scholar] [CrossRef]
  56. Gao, Y.; Zhou, M.; Metaxas, D. UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. In Proceedings of the MICCAI 2021: Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, Strasbourg, France, 27 September–1 October 2021. [Google Scholar]
  57. Zhang, Q.-L.; Yang, Y.-B. ResT V2: Simpler, Faster and Stronger. arXiv 2022, arXiv:2204.07366. [Google Scholar]
Figure 1. Overview of the proposed MU-Net.
Figure 2. Structure of the MixFormer block. The right part of the figure shows the parallel branch.
Figure 3. (a) Self-attention operation. (b) Window partition.
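For readers unfamiliar with window-based attention, the window partition illustrated in Figure 3b splits a feature map into non-overlapping windows so that self-attention is computed within each window rather than over the full image. The sketch below follows the standard Swin Transformer formulation [20]; it is not the paper's exact implementation, and the function name and tensor layout are illustrative.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    H and W are assumed to be divisible by window_size.
    Returns a tensor of shape (num_windows * B, window_size, window_size, C),
    on which window-wise self-attention can then be applied.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)
```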
Figure 4. Structure of the AMM.
Figure 5. Original images and ground truth of the GID dataset.
Figure 6. Original images and ground truth of the LoveDA dataset.
Figure 7. Visualization results of MU-Net and CNN-based semantic segmentation networks on GID dataset: (a) images; (b) labels; (c) Unet; (d) PSPNet; (e) SegNet; (f) MWEN; (g) SR-SegNet; (h) MU-Net. Black denotes the background, and white denotes water bodies.
Figure 8. Visualization results of MU-Net and Transformer-based segmentation networks on GID dataset: (a) images; (b) labels; (c) TransUNet; (d) SwinUnet; (e) BANet; (f) UTNet; (g) UNetFormer; (h) MU-Net.
Figure 9. Visualization results of MU-Net and CNN-based semantic segmentation networks on the LoveDA dataset: (a) images; (b) labels; (c) Unet; (d) PSPNet; (e) SegNet; (f) MWEN; (g) SR-SegNet; (h) MU-Net.
Figure 10. Visualization results of MU-Net and Transformer-based segmentation networks on the LoveDA dataset: (a) images; (b) labels; (c) TransUNet; (d) SwinUnet; (e) BANet; (f) UTNet; (g) UNetFormer; (h) MU-Net.
Figure 11. Visualization results of MixFormer block ablation experiment: (a) images; (b) labels; (c) baseline; (d) baseline+MixFormer.
Figure 12. Visualization results of the AMM ablation experiment: (a) images; (b) labels; (c) baseline+MixFormer; (d) baseline+MixFormer+AMM (MU-Net).
Figure 13. Illustration of the network sampling layers. The yellow block represents the Conv block, the blue block represents AMM, and the purple block represents the combination of convolution and MixFormer block.
Table 1. Comparison results of MU-Net and CNN-based segmentation networks on the GID dataset (%). The best result for each metric is shown in bold.
Method | OA | P | R | F1 | IoU
Unet [22] | 97.05 | 93.45 | 93.93 | 93.69 | 88.12
PSPNet [28] | 96.79 | 93.73 | 92.44 | 93.08 | 87.06
SegNet [27] | 96.60 | 92.82 | 92.58 | 92.70 | 86.39
MWEN [54] | 97.17 | 94.25 | 93.59 | 93.92 | 88.54
SR-SegNet [55] | 97.08 | 94.10 | 93.35 | 93.72 | 88.19
MU-Net | 97.62 | 95.32 | 94.44 | 94.88 | 90.25
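The OA (overall accuracy), P (precision), R (recall), F1, and IoU values reported in Tables 1–5 are assumed here to follow the standard binary-segmentation definitions computed from the confusion matrix, with water as the positive class. A minimal NumPy sketch of these metrics is given below; the helper name and the small epsilon added for numerical stability are illustrative, not taken from the paper's code.

```python
import numpy as np

def binary_segmentation_metrics(pred, label, eps=1e-8):
    """OA, precision, recall, F1, and IoU for a binary water mask.

    pred, label: arrays of 0/1 values with the same shape
    (1 = water, 0 = background).
    """
    pred = pred.astype(bool)
    label = label.astype(bool)

    tp = np.logical_and(pred, label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    tn = np.logical_and(~pred, ~label).sum()

    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return oa, precision, recall, f1, iou
```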
Table 2. Comparison results of MU-Net and Transformer-based segmentation networks on the GID dataset (%).
Method | OA | P | R | F1 | IoU
TransUNet [43] | 96.70 | 93.57 | 92.19 | 92.88 | 86.70
SwinUnet [44] | 95.76 | 91.64 | 90.02 | 90.83 | 83.19
BANet [46] | 96.39 | 93.09 | 91.32 | 92.19 | 85.52
UTNet [56] | 97.25 | 94.00 | 94.25 | 94.12 | 88.90
UNetFormer [48] | 97.33 | 94.77 | 93.75 | 94.25 | 89.13
MU-Net | 97.62 | 95.32 | 94.44 | 94.88 | 90.25
Table 3. Comparison results of MU-Net and CNN-based segmentation networks on the LoveDA dataset (%).
Method | OA | P | R | F1 | IoU
Unet [22] | 94.38 | 86.26 | 85.29 | 85.77 | 75.09
PSPNet [28] | 93.92 | 87.35 | 81.13 | 84.13 | 72.60
SegNet [27] | 94.01 | 86.46 | 82.80 | 84.59 | 73.29
MWEN [54] | 94.50 | 86.93 | 85.13 | 86.02 | 75.47
SR-SegNet [55] | 94.65 | 90.01 | 82.18 | 85.92 | 75.31
MU-Net | 94.79 | 87.88 | 85.56 | 86.70 | 76.52
Table 4. Comparison results of MU-Net and Transformer-based segmentation networks on the LoveDA dataset (%).
Method | OA | P | R | F1 | IoU
TransUNet [43] | 94.60 | 89.36 | 82.67 | 85.88 | 75.26
SwinUnet [44] | 93.53 | 86.35 | 80.09 | 83.10 | 71.09
BANet [46] | 93.45 | 85.73 | 80.40 | 82.98 | 70.91
UTNet [56] | 94.58 | 87.62 | 84.69 | 86.13 | 75.64
UNetFormer [48] | 94.56 | 87.98 | 84.10 | 86.00 | 75.43
MU-Net | 94.79 | 87.88 | 85.56 | 86.70 | 76.52
Table 5. Ablation experiment results of MixFormer block and AMM on the GID dataset (%).
Method | OA | P | R | F1 | IoU
Baseline | 96.99 | 93.74 | 93.33 | 93.53 | 87.85
Baseline+AMM | 97.18 | 94.37 | 93.48 | 93.92 | 88.54
Baseline+MixFormer | 97.49 | 94.84 | 94.40 | 94.62 | 89.78
Baseline+MixFormer+AMM | 97.62 | 95.32 | 94.44 | 94.88 | 90.25
Table 6. Effect of different α and β in loss (%).
Parameters | P | R | F1 | IoU
α = 1, β = 1.0 | 95.08 | 94.63 | 94.86 | 90.22
α = 1, β = 0.5 | 95.08 | 94.58 | 94.83 | 90.16
α = 1, β = 0.7 | 95.32 | 94.44 | 94.88 | 90.25
α = 0.5, β = 1 | 94.81 | 94.69 | 94.75 | 90.03
α = 0.7, β = 1 | 94.83 | 94.81 | 94.82 | 90.15
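Table 6 does not restate the loss function itself. Under the assumption that α and β weight a cross-entropy term and a Dice term (the Dice formulation follows V-Net [52], which the paper cites), a minimal PyTorch sketch of such a combined loss is given below; the function name, tensor shapes, default weights, and smoothing constant are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=1.0, beta=0.7, eps=1e-6):
    """Weighted sum alpha * CE + beta * Dice (assumed form of the loss).

    logits: (N, 2, H, W) raw network outputs;
    target: (N, H, W) integer class indices in {0, 1}, 1 = water.
    """
    ce = F.cross_entropy(logits, target)

    # Soft Dice loss on the water-class probability map
    prob = torch.softmax(logits, dim=1)[:, 1]          # (N, H, W)
    tgt = target.float()
    inter = (prob * tgt).sum(dim=(1, 2))
    union = prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    return alpha * ce + beta * dice.mean()
```

With this form, the best row of Table 6 (α = 1, β = 0.7) corresponds to the default weights used in the sketch.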
Table 7. Results of different attention mechanisms on the GID dataset (%).
Attention Mechanism | F1 | IoU
Shifted window attention [20] | 94.49 | 89.56
Efficient global–local attention [48] | 94.29 | 89.20
Efficient multi-head self-attention [57] | 93.87 | 88.45
Mixing attention | 94.62 | 89.78
Table 8. Effect of network sampling layers.
Sampling Layers | P (%) | R (%) | F1 (%) | IoU (%) | Params (M)
3 | 94.93 | 94.46 | 94.70 | 89.93 | 11.43
4 | 95.32 | 94.44 | 94.88 | 90.25 | 23.29
5 | 95.06 | 94.70 | 94.88 | 90.26 | 93.08
Table 9. Comparison of model complexity.
Method | Params (M) | Flops (G) | IoU (%) | Speed (img/s)
Unet [22] | 31.04 | 192.94 | 88.12 | 18.79
PSPNet [28] | 46.71 | 184.48 | 87.06 | 19.50
SegNet [27] | 29.44 | 160.44 | 86.39 | 20.77
MWEN [54] | 30.01 | 127.78 | 88.54 | 24.99
SR-SegNet [55] | 23.84 | 200.84 | 88.19 | 16.84
TransUNet [43] | 105.91 | 168.7 | 86.70 | 14.89
UTNet [56] | 14.42 | 82.92 | 88.90 | 18.26
MU-Net | 23.29 | 172.75 | 90.25 | 16.39
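The exact profiling setup behind Table 9 is not reproduced here. A minimal PyTorch sketch of how the Params (M) and Speed (img/s) columns could be estimated is shown below; the input resolution, warm-up count, and number of timed runs are assumptions, and the Flops (G) column would typically come from a separate profiling tool rather than this snippet.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def inference_speed(model: torch.nn.Module,
                    input_size=(1, 3, 256, 256),
                    runs: int = 100) -> float:
    """Rough images-per-second estimate (the 'Speed (img/s)' column)."""
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    model.eval()
    for _ in range(10):                  # warm-up iterations
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return runs * input_size[0] / (time.time() - start)
```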