Article

Fine-Grained Tidal Flat Waterbody Extraction Method (FYOLOv3) for High-Resolution Remote Sensing Images

Lili Zhang, Yu Fan, Ruijie Yan, Yehong Shao, Gaoxu Wang and Jisen Wu
1 College of Computer and Information Engineering, Hohai University, Nanjing 211100, China
2 Arts and Sciences, Ohio University Southern, Ironton, OH 45638, USA
3 State Key Laboratory of Hydrology—Water Resources and Hydraulic Engineering, Nanjing Hydraulic Research Institute, Nanjing 210029, China
* Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(13), 2594; https://doi.org/10.3390/rs13132594
Submission received: 21 May 2021 / Revised: 20 June 2021 / Accepted: 25 June 2021 / Published: 2 July 2021
(This article belongs to the Special Issue Advances in Object and Activity Detection in Remote Sensing Imagery)

Abstract

The tidal flat is a long and narrow area along rivers and coasts with high sediment content, so there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry. Existing waterbody extraction methods are mostly designed for large water bodies such as rivers and lakes, whereas tidal flat waterbody extraction has received far less attention. Extracting tidal flat waterbodies accurately from high-resolution remote sensing imagery is therefore a great challenge. To solve the low accuracy problem of tidal flat waterbody extraction, we propose a fine-grained tidal flat waterbody extraction method, named FYOLOv3, which extracts tidal flat water with high accuracy. FYOLOv3 mainly includes three parts: an improved object detection network based on YOLOv3 (Seattle, WA, USA), a fully convolutional network (FCN) without pooling layers, and a similarity algorithm for water extraction. The improved object detection network uses 13 convolutional layers instead of Darknet-53 as the backbone, which preserves water detection accuracy while reducing the time cost and alleviating overfitting; the FCN without pooling layers then obtains accurate pixel values of the tidal flat waterbody by learning semantic information; finally, the similarity algorithm distinguishes water from non-water pixel by pixel to improve the extraction accuracy of tidal flat water bodies. Compared to other convolutional neural network (CNN) models, the experiments show that our method achieves higher accuracy on the waterbody extraction of tidal flats from remote sensing images: the IoU of our method is 2.43% higher than YOLOv3 and 3.7% higher than U-Net (Freiburg, Germany).


1. Introduction

Water resources are closely related to human survival and development, and many researchers focus on how to obtain water resource information quickly and accurately. The extraction and detection of the water bodies from remote sensing images is one of the main ways to obtain water resource information. It can be widely applied in ecosystem protection and restoration, river supervision, pollution control, and infrastructure construction [1,2]. In recent years, with the rapid development of remote sensing satellite technology, obtaining water resource information from remote sensing images [3] has gradually replaced manual measurement, and the images are widely applied in water resource surveys and flood predictions.
At present, scholars have proposed a variety of water extraction methods for different satellite imagery, which can be summarized into three categories: visual interpretation methods [4], extraction methods based on spectral bands [5,6,7,8,9], and machine learning methods [10,11,12]. However, these methods are mainly applied to extract large water bodies such as rivers and lakes, and there are few waterbody extraction methods for tidal flats. A tidal flat area [13] refers to the zone between the high tide level and the low tide level along rivers and coasts. The water bodies in this kind of area are relatively long and narrow, with high sediment content. Due to the influence of tides, there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry. Meanwhile, the mixture of water and sand makes the spectral band characteristics of tidal flat water different from those of water in other areas, so methods based on spectral bands are not suitable for tidal flat waterbody extraction. Machine learning methods for water extraction are usually based on supervised learning, so a training dataset is necessary; however, there is no public training dataset for tidal flat waterbody extraction. Hence, machine learning methods often fail to learn effectively from the limited training data and hit an accuracy bottleneck in water extraction as a result.
In a tidal flat area the boundary of the waterbody is blurry. To solve the low accuracy problem of waterbody extraction caused by the small feature difference between the waterbody and the background, this paper proposes a fine-grained tidal flat waterbody extraction method, named FYOLOv3. FYOLOv3 mainly includes three parts: an improved object detection network based on YOLOv3 (Seattle, WA, USA), a fully convolutional network (FCN) without pooling layers, and a similarity algorithm for water extraction.
In this paper, our contributions are as follows:
(1) An improved object detection network is introduced, which contains two modules: a 13-layer convolutional neural network (CNN) as the backbone network and a feature pyramid network for multi-scale water detection.
(2) An FCN without pooling layers is proposed to obtain the accurate pixel values of the tidal flat waterbody by learning semantic information, complete the initial extraction of the waterbody, and realize cross-channel information fusion.
(3) A similarity algorithm for water extraction is proposed to distinguish water from non-water pixel by pixel and improve the extraction accuracy of tidal flat waterbodies; it introduces a standard water pixel value and a similarity measure between water pixels and that standard pixel.
The rest of this paper is organized as follows. In Section 2, we introduce some classical methods and analyze the YOLO models. In Section 3, a fine-grained tidal flat waterbody extraction method FYOLOv3 is described in detail. The experiments and analysis are presented in Section 4. Finally, the conclusion of this paper with some discussions and future work are given in Section 5.

2. Related Work

2.1. Water Extraction Methods

Spectral band analysis methods are the earliest methods for waterbody extraction from remote sensing images [5,6,7,8,9]: they analyze the differences in absorption and reflection of different ground objects in each spectral band and then obtain the water region in the image. There are three kinds of method based on spectral analysis: the single-band threshold method [14], the multi-band spectral relationship method [15], and the water index method [8]. Xu et al. [8] proposed a modified normalized difference water index (MNDWI) based on the band combination of the normalized water index; the experiments show that the method is efficient for the extraction of urban water bodies and effectively suppresses the influence of urban building shadows. Guo et al. [7] proposed a weighted normalized difference water index (WNDWI) to reduce the influence of turbid water, small water bodies, and shadow areas on water extraction; the method was tested on Landsat images and achieved good results. Methods based on spectral analysis usually use only the spectral information of remote sensing images and do not effectively exploit texture, space, surrounding background, and other information, so their extraction ability has certain limitations. These methods also have specific requirements for the bands of remote sensing images and therefore have low applicability.
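For reference, the two indices mentioned above follow the standard definitions NDWI = (Green − NIR)/(Green + NIR) [9] and MNDWI = (Green − MIR)/(Green + MIR) [8]. The sketch below assumes the band arrays are already loaded; the variable names are ours.

```python
import numpy as np

def ndwi(green, nir):
    """NDWI (McFeeters): (Green - NIR) / (Green + NIR)."""
    green, nir = green.astype(np.float64), nir.astype(np.float64)
    return (green - nir) / (green + nir + 1e-12)  # epsilon avoids division by zero

def mndwi(green, mir):
    """MNDWI (Xu): (Green - MIR) / (Green + MIR)."""
    green, mir = green.astype(np.float64), mir.astype(np.float64)
    return (green - mir) / (green + mir + 1e-12)

# Pixels above a chosen threshold (0.19 is used for NDWI in Section 4) are water.
water_mask = ndwi(np.random.rand(4, 4), np.random.rand(4, 4)) > 0.19
```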
Some machine learning methods, such as support vector machines (SVM) and maximum likelihood classification [10,11,12], try to balance learning effectiveness and model interpretability and provide a solution framework for classification problems with limited samples. This kind of method improves the accuracy of target extraction within a certain range by learning the distribution characteristics of the training data. However, these methods do not learn effectively from limited training datasets and thus have an accuracy bottleneck in water extraction.
With the concept of deep learning proposed by Hinton et al. [16] in 2006 and the outstanding achievements of the deep convolutional neural network proposed by Krizhevsky et al. [17] in natural image recognition in 2012, deep learning ushered in a new research phase, and many experts and scholars began to apply it to object extraction from remote sensing images. Zhong et al. [18] used a convolutional neural network model to extract waterbodies from remote sensing images, and the experiments showed that the convolutional neural network extracts waterbodies more effectively than the normalized water index. Liang et al. [19] introduced a dense connection structure into a fully convolutional network to reduce shallow feature loss, obtain more detailed information from the remote sensing images, and achieve better water extraction. Song et al. [20] used the self-learning ability of deep learning to construct a modified Mask R-CNN method that integrates bottom-up and top-down processes for water recognition. Yu et al. [21] presented a deep learning framework for waterbody extraction from Landsat images that considers both spectral and spatial information and is a hybrid of a CNN and a logistic regression classifier. Li et al. [22] adopted a fully convolutional network (FCN) to extract water bodies in the case of limited training data, consisting of an encoder for extracting multiscale features and a decoder for recovering spatial contexts. Wang et al. [23] proposed an end-to-end trainable model named the multi-scale lake water extraction network (MSLWENet) to extract lake water from Google remote sensing images. Yu et al. [24] developed a self-attention capsule feature pyramid network (SA-CapsFPN) to extract water bodies from remote sensing images. Li et al. [25] built a deep learning model for water extraction based on EfficientNet-B5 (Perdriel, Argentina).

2.2. YOLO Models

The excellent performance of deep convolutional neural networks [17] has been demonstrated in computer vision. The YOLO models, YOLOv1 (Seattle, WA, USA) [26], YOLOv2 [27], and YOLOv3 [28], were proposed one after another. The YOLOv1 model is based on GoogLeNet (Mountain View, CA, USA) [29] and is mainly composed of convolutional layers and fully connected layers to achieve fast object detection. The model transforms object detection into a coordinate regression problem and carries out the classification and regression of target objects. Because the two prediction boxes generated by YOLOv1 for each grid cell in the image can only predict one target object, the detection accuracy of adjacent objects whose center points fall in the same cell is reduced. In view of this shortcoming, YOLOv2 (Seattle, WA, USA) introduced a variety of strategies to improve the network framework, which significantly improved the speed and accuracy of object detection. To further optimize the YOLO models, the Darknet-53 network is used as the feature extractor in the YOLOv3 (Seattle, WA, USA) model, and the output module uses a feature pyramid structure with three output branches to accurately detect targets of different sizes [28].

3. Methodology

To solve the low accuracy problem of water extraction for tidal flats, this paper proposes a fine-grained tidal flat waterbody extraction method for high-resolution remote sensing images, named FYOLOv3. The key parts of our method are as follows: first, the improved object detection network based on YOLOv3 (Seattle, WA, USA) locates the tidal flat waterbody and produces the frame coordinates of the corresponding waterbody; second, four images of size 32 × 32 are clipped from the obtained border region and used as input to the FCN without pooling layers to get the initial waterbody extraction; finally, the similarity algorithm for water extraction judges all pixels in the obtained initial waterbody region to optimize and improve the initial extraction. We list the steps of our method as follows:
  • Construction of training dataset: This part mainly includes the data preprocessing, data augmentation and waterbody labeling of remote sensing images.
  • Model construction: Based on the YOLOv3 (Seattle, WA, USA) model, this paper constructs an improved network model for water detection. It uses 13 convolutional layers as the backbone network to meet the accuracy requirement of water detection while reducing the time cost and alleviating overfitting; it also uses two branch structures as the output module to avoid missing extractions caused by small prior boxes. An FCN without pooling layers follows the improved object detection network to obtain the semantic information of the waterbody in a tidal flat area.
  • Model training: The cross entropy is used as the loss function, and the backpropagation algorithm is used to train the internal parameters of the network model.
  • Detection and initial extraction: The trained network models are used to detect the water from the remote sensing images to locate the waterbody area and obtain the initial waterbody extraction, respectively.
  • Similarity algorithm for water extraction: This algorithm is used to optimize and improve the initial waterbody extraction by similarity judgment.
The architecture of our method is shown in Figure 1. The 13-layer CNN is constructed for water feature extraction and is mainly composed of convolutional layers, pooling layers, and batch normalization layers. The multi-scale feature pyramid network uses different feature maps to detect the narrow, long waterbodies and the small waterbodies in the tidal flat area, respectively. Hence, our object detection network guarantees water detection accuracy while reducing the time cost and alleviating overfitting.

3.1. Construction of Training Dataset

3.1.1. Preprocessing

The GF-2 remote sensing satellite is the first civil high-resolution satellite in China and was successfully launched in 2014. It has high spatial resolution, accurate positioning, and strong maneuverability. The remote sensing images used in this paper are Level-1 product data, so it is necessary to preprocess them first. The preprocessing of the GF-2 remote sensing images mainly includes radiometric calibration [30], atmospheric correction [31], orthorectification [32], and image fusion [33].
  1. Radiometric correction and orthorectification of multispectral images
Radiometric correction includes two parts: radiometric calibration and atmospheric correction. Radiometric calibration converts the brightness values of pixels into absolute radiance values, which helps researchers compare remote sensing images acquired by different types of sensors at different times. Atmospheric correction eliminates the radiation error caused by atmospheric influence and obtains the true reflectance of surface objects. Orthorectification corrects the geometric distortion of remote sensing images and finally produces plane orthophotos. A preprocessing example for the multispectral images is shown in Figure 2.
  2. Orthorectification of panchromatic images
Different from multispectral images, the band range of the panchromatic images in GF-2 is 0.45–0.90 µm, which covers multiple wavelength ranges. Atmospheric attenuation is selective for light of different wavelengths, and each wavelength is affected by the atmosphere differently; therefore, it is usually impossible to carry out atmospheric correction for panchromatic images. The number and distribution of control points in a remote sensing image influence the error of the orthorectification, and the mean square error is used to evaluate its accuracy. Fan et al. analyzed the accuracy of GF-2 satellite images according to these evaluation indexes [34] and showed that RPC orthorectification better corrects the geometric distortion in panchromatic images. Hence, we use RPC orthorectification for the panchromatic images in this paper; an example is shown in Figure 3.
  3. Image fusion
Image fusion is often used to enrich the image information. It fuses the images of the same area from different channels and finally obtains the fused images with more information and higher quality.
In this paper, the NNDiffuse pan sharpening [35] method is used to fuse the multispectral and panchromatic images. The multispectral and panchromatic images are obtained synchronously by different sensors installed on GF-2; the former has more spectral information but lower spatial resolution, and the latter has higher spatial resolution but less spectral information. By fusing them, we obtain a fused image with both high resolution and rich spectral information, as shown in Figure 4.
  4. Band combination selection
GF-2 multispectral images contain redundant data because of the close correlation between different bands. In order to make full use of the features of GF-2 multispectral images, reduce data redundancy, and maintain the original characteristics of the images, we need to select the optimal band combination for GF-2 multispectral images.
There are three principles for choosing the optimal band combination: the information in a single band should be as much as possible; the information overlap between two bands should be small; and the spectral differences between different types of ground objects after band combination should become clearer [18]. Because the spectral bands of GF-2 multispectral images are the same as those of GF-1 multispectral images, it is appropriate, according to the above three principles, to use the standard deviation and the Optimum Index Factor (OIF) [36] to study the best band combination of GF-2 images. Finally, we selected band 2, band 3, and band 4 as the combined bands to generate the original image in our study. The remote sensing image after band combination is shown in Figure 5.
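As an illustration of the OIF criterion (the classical formula: the sum of the three bands' standard deviations divided by the sum of the absolute pairwise correlations), not the authors' exact computation:

```python
import numpy as np
from itertools import combinations

def oif(bands):
    """Optimum Index Factor for a list of three 2-D band arrays."""
    flat = [b.ravel().astype(np.float64) for b in bands]
    std_sum = sum(np.std(b) for b in flat)
    corr_sum = sum(abs(np.corrcoef(a, b)[0, 1]) for a, b in combinations(flat, 2))
    return std_sum / corr_sum

# Rank all 3-band combinations of a 4-band GF-2 image (placeholder data).
image = {i: np.random.rand(256, 256) for i in range(1, 5)}
best = max(combinations(image, 3), key=lambda c: oif([image[i] for i in c]))
print("Best band combination:", best)  # the paper selects bands 2, 3, 4
```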

3.1.2. Data Labeling and Augmentation

  1. Data labeling
The data labeling includes two parts: labeling the region in which the waterbody is located, to build the training dataset for our proposed object detection network, and labeling the waterbody itself, to build the training dataset for our FCN without pooling layers.
(a) Labeling a region: We use LabelImg (Barcelona, Spain) to label the region in which the waterbody is located. The water area is labeled with a rectangular frame, and an XML file is generated. As shown in Figure 6, the label in the file records the name, path, water area category, and coordinates of the frame.
(b) Waterbody labeling: Labelme is used to label the waterbody; the labeled image is shown in Figure 7, and the labeled information of the waterbody is saved in the index dataset. Because waterbody extraction is essentially binary classification, the black area in the labeled image is the background and is represented by 0, while the red area is the waterbody and is represented by 1.
  2. Data augmentation
Compared with public image datasets such as ImageNet, there is no public remote sensing training dataset for this task, and it is difficult to collect much more data ourselves, so we enlarge the dataset by data augmentation [37,38] to expand the training data and avoid overfitting. The remote sensing images are clipped to a size of 256 × 256, which makes the construction of the training dataset feasible. The geometric transformations used in this paper include rotations of the original images by 90°, 180°, and 270°, a horizontal flip, and a vertical flip. We use OpenCV (Intel, Santa Clara, CA, USA) with Python for the data augmentation. Examples are shown in Figure 8.
We labeled the data first and then performed the data augmentation operations. To meet the training requirements of the FCN without pooling layers, we clipped the waterbody-labeled images to 32 × 32. In total we have 6000 waterbody-labeled images of size 32 × 32 and 6000 waterbody-region-labeled images of size 256 × 256. We use 70% of them as the training set to train the improved water detection network and the FCN without pooling layers, respectively, and the remaining 30% as test data.
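The geometric transformations described above can be reproduced with OpenCV; a minimal sketch, assuming 256 × 256 tiles have already been clipped (the file names are placeholders):

```python
import cv2

def augment(image):
    """Return the five geometric variants used in this paper."""
    return {
        "rot90":  cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE),
        "rot180": cv2.rotate(image, cv2.ROTATE_180),
        "rot270": cv2.rotate(image, cv2.ROTATE_90_COUNTERCLOCKWISE),
        "hflip":  cv2.flip(image, 1),  # flip around the vertical axis
        "vflip":  cv2.flip(image, 0),  # flip around the horizontal axis
    }

tile = cv2.imread("tile_0001.png")  # hypothetical clipped 256x256 tile
for name, aug in augment(tile).items():
    cv2.imwrite(f"tile_0001_{name}.png", aug)
```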

3.2. Improved Water Detection Network Based on YOLOv3

As shown in Figure 9, the improved water detection network based on YOLOv3 mainly includes two parts. The first part is the feature extraction module, in which we use 13 convolutional layers to obtain the water features. The second part is the feature pyramid network structure, which uses feature fusion for multi-scale waterbody detection.

3.2.1. Improved Feature Extraction Module

The Darknet-53 network structure used in the feature extraction module of YOLOv3 (Seattle, WA, USA) easily leads to overfitting in the case of limited training data. To solve this problem, a 13-layer CNN is constructed for water feature extraction in the feature extraction module. The module is mainly composed of convolutional layers, pooling layers, and batch normalization layers. The parameters are shown in Table 1.
In the improved feature extraction module, the convolutional layers with 3 × 3 kernels are used to extract the water features of a tidal flat area, and the convolutional layers with 1 × 1 kernels are used to realize cross-channel information fusion. To ensure the generalization ability of our waterbody detection model in a tidal flat area, this paper uses pooling layers to keep the main characteristic data of the water. To address slow convergence and gradient explosion, the improved feature extraction module adds batch normalization layers; this operation normalizes the data before the activation function, which reduces the amplitude of data changes, makes the data approximately follow a Gaussian distribution, and thus speeds up the convergence of the network model.
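Following Table 1, the 13-layer backbone can be sketched in Keras as below. This is a sketch, not the authors' code: the input size, the ReLU activations, and the 2 × 2 pooling windows are our assumptions, while the layer widths and kernel sizes come from Table 1.

```python
from tensorflow.keras import layers, models

def backbone(input_shape=(208, 208, 3)):  # input size assumed: four poolings yield 13 x 13
    """13-layer feature extraction module following Table 1."""
    m = models.Sequential(name="backbone13")
    m.add(layers.Input(shape=input_shape))
    blocks = [
        [(16, 1), (16, 3), (256, 1)],    # Conv1-3
        [(128, 3), (128, 1), (512, 3)],  # Conv4-6
        [(256, 3), (256, 1), (512, 3)],  # Conv7-9
        [(256, 3), (256, 1), (512, 3)],  # Conv10-12
    ]
    for i, block in enumerate(blocks):
        for filters, size in block:
            m.add(layers.Conv2D(filters, size, strides=1, padding="same",
                                activation="relu"))  # ReLU assumed
        if i < 3:
            m.add(layers.BatchNormalization())  # BN after the first three blocks (Table 1)
        m.add(layers.MaxPooling2D())  # 2 x 2 pooling assumed
    m.add(layers.Conv2D(1024, 3, strides=1, padding="same", activation="relu"))  # Conv13
    return m
```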

3.2.2. Feature Pyramid Network Structure for Multi-Scale Water Detection

Inspired by the design of the feature pyramid, three branches are used in YOLOv3 (Seattle, WA, USA) to obtain feature maps of sizes 13 × 13, 26 × 26, and 52 × 52, respectively. Feature maps of different sizes correspond to different receptive fields: the larger the feature map, the smaller the corresponding receptive field. The correspondence between feature map size and prior box is shown in Table 2. Based on the size and characteristics of tidal flat water, we design two branches in our model. The first branch, used for the detection of narrow and long waterbodies in the tidal flat area, obtains a 13 × 13 feature map through three convolutional layers after the improved feature extraction module; the second branch, used for the detection of small waterbodies, up-samples the output of the 14th convolutional layer in the network, fuses it with the features obtained by the 13th convolutional layer, and finally produces a 26 × 26 feature map through two convolutional layers.
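The two branches can be sketched with the Keras functional API as below; the filter counts and the exact maps being fused are our assumptions based on the description above, so this shows the upsample-and-concatenate pattern rather than the authors' exact wiring. `num_out` would be the per-cell prediction size (anchors × (coordinates + objectness + classes)).

```python
from tensorflow.keras import layers

def detection_head(feat_13, feat_26, num_out):
    """feat_13: 13x13 backbone output; feat_26: a 26x26 intermediate feature map."""
    # Branch 1: three convolutions on the 13x13 map for long, narrow waterbodies.
    x = feat_13
    for _ in range(2):
        x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    out_13 = layers.Conv2D(num_out, 1, padding="same")(x)

    # Branch 2: upsample, fuse with the 26x26 map, two convolutions for small waterbodies.
    up = layers.UpSampling2D(2)(feat_13)
    fused = layers.Concatenate()([up, feat_26])
    y = layers.Conv2D(256, 3, padding="same", activation="relu")(fused)
    out_26 = layers.Conv2D(num_out, 1, padding="same")(y)
    return out_13, out_26
```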

3.3. FCN without Pooling Layers

The improved object detection network based on YOLOv3 (Seattle, WA, USA) is a water object detection model, so it cannot extract the water edge. To solve this problem, we design an FCN without pooling layers to complete the initial waterbody extraction and obtain the feature information of the waterbody in a tidal flat area. The network structure of the FCN without pooling layers is shown in Figure 10.
In general, the pooling layers of a CNN have two main functions: one is to compress the extracted features to reduce the computational time of the model; the other is to enlarge the receptive field of the model, i.e., the receptive range of the neurons in the network, so that each point in the feature map corresponds to a larger area in the original image. The FCN without pooling layers proposed in this paper aims at the initial extraction of waterbodies from 32 × 32 remote sensing images, so a large receptive field is not required and water extraction based on the network still works in our method.
The FCN without pooling layers uses six convolutional layers to extract waterbodies from 32 × 32 remote sensing images, with kernel sizes of 3 × 3 and 1 × 1; all parameters are given in Table 3. Compared with 7 × 7 and 5 × 5 kernels, the 3 × 3 kernels increase the network depth and the nonlinear expression ability of the model at the same receptive field, and the 1 × 1 kernels realize cross-channel information fusion.
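A corresponding Keras sketch of Table 3; the softmax over two output channels (background vs. water) is our assumption about how the per-pixel classification is read out.

```python
from tensorflow.keras import layers, models

def fcn_no_pooling():
    """Six feature convolutions plus a 2-channel per-pixel classifier (Table 3)."""
    m = models.Sequential(name="fcn_no_pooling")
    m.add(layers.Input(shape=(32, 32, 3)))
    for filters, size in [(64, 3), (64, 1), (256, 3), (256, 1), (512, 3), (512, 3)]:
        m.add(layers.Conv2D(filters, size, strides=1, padding="same",
                            activation="relu"))  # ReLU assumed
    # Two output channels: background (0) and waterbody (1), per pixel.
    m.add(layers.Conv2D(2, 3, strides=1, padding="same", activation="softmax"))
    return m
```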

3.4. Similarity Algorithm for Water Extraction

To reduce the false extraction caused by the high similarity between the waterbody and background in a tidal flat area, a similarity algorithm for water extraction is proposed. The steps of the algorithm are as follows:
  1. First, we obtain the detection results of the improved water detection network based on YOLOv3 and the initial water extraction results of the FCN without pooling layers.
  2. Second, we compute the average pixel value of the initial water extraction obtained from the FCN without pooling layers and take it as the standard water pixel value in the tidal flat area:
$$(\bar{r}, \bar{g}, \bar{b}) = \frac{1}{n} \sum_{i=1}^{n} (L_{r_i}, L_{g_i}, L_{b_i})$$
where $(\bar{r}, \bar{g}, \bar{b})$ is the average (standard) water pixel value, $(L_{r_i}, L_{g_i}, L_{b_i})$ is the pixel value of the $i$-th pixel in the initial water extraction results, and $n$ is the number of waterbody pixels.
  3. Third, we traverse every pixel in the detection results and calculate the similarity between each pixel and the standard water pixel:
$$\Upsilon = \sqrt{(L_r - \bar{r})^2 + (L_g - \bar{g})^2 + (L_b - \bar{b})^2}$$
where $L_r$, $L_g$, and $L_b$ are the pixel values of the detection results in the red, green, and blue channels, respectively. Note that $\Upsilon$ is a Euclidean distance in RGB space, so a smaller value indicates a higher similarity to the standard water pixel.
  4. Finally, we set a similarity threshold and finish the water extraction. We set the threshold to 34 based on the experiments. If the distance $\Upsilon$ between a pixel in the water detection results and the standard water pixel is no greater than the threshold, the pixel is considered water; otherwise it is not water.
The similarity algorithm for water extraction proposed in this paper effectively solves the accuracy problem of waterbody extraction caused by the blurry boundary between the waterbody and the background.
The similarity algorithm for water extraction is given as Algorithm 1.
Algorithm 1. Similarity Algorithm for Water Extraction.
Input: Output results of the improved water detection network based on YOLOv3 and of the FCN without pooling layers
Output: Pixel is waterbody or non-waterbody
1. procedure Similarity-Water-Extraction(n: integer);
2. begin
3.   for i := 1 to n do
4.   begin
5.     sumLr := sumLr + L_{r_i};
6.     sumLg := sumLg + L_{g_i};
7.     sumLb := sumLb + L_{b_i};
8.   end;
9.   r̄ := sumLr / n;
10.  ḡ := sumLg / n;
11.  b̄ := sumLb / n;
12.  while (pixel is in the waterbody detection results) do
13.  begin
14.    Υ := sqrt((L_r − r̄)² + (L_g − ḡ)² + (L_b − b̄)²);
15.    if (Υ ≤ 34) then
16.      pixel is waterbody
17.    else
18.      pixel is non-waterbody;
19.  end;
20. end;
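A NumPy rendering of Algorithm 1 (a sketch, with our variable names): the standard water pixel is the mean RGB value over the FCN's initial extraction, and a pixel within the distance threshold of that standard value is kept as water.

```python
import numpy as np

def similarity_water_extraction(region, initial_mask, threshold=34.0):
    """region: HxWx3 RGB crop from the detection network;
    initial_mask: HxW boolean initial water extraction from the FCN."""
    pixels = region.astype(np.float64)
    # Standard water pixel: mean RGB over the initially extracted water pixels.
    standard = pixels[initial_mask].mean(axis=0)
    # Euclidean distance of every pixel to the standard water pixel.
    dist = np.sqrt(((pixels - standard) ** 2).sum(axis=-1))
    return dist <= threshold  # water where the pixel is close enough
```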

4. Experiment and Analysis

4.1. Experimental Configuration

All experiments are implemented on a system with an NVIDIA GeForce GTX 1070 (Santa Clara, CA, USA) and an Intel(R) Core(TM) i7 (Santa Clara, CA, USA), and the operating system is Windows 10 (Redmond, WA, USA). The software environment is ENVI 5.3 (Boulder, CO, USA), Python 3.6 (Wilmington, DE, USA), TensorFlow 1.12.0 (Mountain View, CA, USA), and Keras 2.2.4 (Cobham, UK).

4.2. Evaluation Criterion

To analyze the experiments accurately, this paper selects three indicators to quantitatively evaluate the model: Intersection over Union (IoU), pixel accuracy, and the Kappa coefficient. IoU describes the degree of overlap between the extracted object and the ground truth; pixel accuracy measures the proportion of the detection result that is correct; and the Kappa coefficient measures the pixel classification accuracy. The three indicators are calculated as follows:
$$IoU = \frac{Area(P) \cap Area(T)}{Area(P) \cup Area(T)}$$
where $Area(P)$ is the prediction result and $Area(T)$ is the ground truth.
$$P = \frac{TP}{TP + FP}$$
where $P$ is the pixel accuracy, $TP$ is the number of samples that are positive and identified as positive by the network model, and $FP$ is the number of samples incorrectly classified as positive.
$$k = \frac{p_0 - p_e}{1 - p_e}$$
where $k$ is the Kappa coefficient, $p_0$ is the proportion of correct cells, and $p_e$ is the proportion of agreement expected by chance.
$$p_e = \frac{TP \cdot FN}{n \cdot n}$$
where $TP$ is as above, $n$ is the number of ground feature types, and $FN$ is the number of samples incorrectly classified as negative.
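As an illustration, the sketch below computes the three indicators from binary masks; note that it uses the usual pixel-wise chance-agreement term for Kappa, which may differ from the paper's bookkeeping of $p_e$.

```python
import numpy as np

def evaluate(pred, truth):
    """pred, truth: boolean HxW numpy masks. Returns IoU, pixel accuracy P, and Kappa."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    n = pred.size

    iou = tp / (tp + fp + fn)             # intersection over union
    p = tp / (tp + fp)                    # pixel accuracy as defined above
    p0 = (tp + tn) / n                    # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    kappa = (p0 - pe) / (1 - pe)
    return iou, p, kappa
```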

4.3. Parameter Setting

In this paper, the waterbody detection network plays an important role for the final water extraction. To study the influence of learning rate parameters on the accuracy of water detection, we compare and analyze the decline curve of the loss function under different learning rates and take the optimal learning rate as the model parameter at last. The values of learning rate are set as 0.0001, 0.005, 0.001 and 0.01 respectively. The convergence curve of the loss function is shown in Figure 11.
As shown in Figure 11, when the learning rate is 0.0001 or 0.005, the network model converges slowly, and the final loss value is higher. When the learning rate is 0.001 or 0.01, the network performs better, and its convergence speed and final convergence result are significantly improved compared with the other learning rates. Based on this analysis of the learning rate, together with extensive experiments and model debugging, the training parameters of the water detection model were determined. In this paper, we set the learning rate to 0.001, the batch size to 64, the momentum to 0.9, the weight decay to 0.0005, and the number of epochs to 500 for the improved water detection network based on YOLOv3 (Seattle, WA, USA). The network also uses two IoU thresholds during training: if a prediction overlaps the ground truth with an IoU of at least 0.7, it is treated as a positive example; if the IoU is between 0.5 and 0.7, it is ignored; and if the IoU is less than 0.5 for all ground-truth objects, it is treated as a negative example. For the FCN without pooling layers, we set the learning rate to 0.01, the batch size to 32, the momentum to 0.9, the weight decay to 0.0001, and the number of epochs to 150.
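The two-threshold rule can be written as a small helper (a sketch; `ious` holds the IoU of one predicted box against every ground-truth box):

```python
def label_example(ious, pos_thresh=0.7, neg_thresh=0.5):
    """Return 'positive', 'ignored', or 'negative' per the two-threshold rule."""
    best = max(ious, default=0.0)
    if best >= pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"   # below 0.5 for every ground-truth object
    return "ignored"        # 0.5-0.7 overlaps do not contribute to the loss
```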

4.4. Performance Analysis

4.4.1. Influence of Threshold of Similarity Algorithm for Water Extraction

We set 31, 32, 33, 34, 35, 36, 37, 38, and 39 as thresholds, respectively, and use the extraction accuracy to study their influence. The results are shown in Figure 12.
We calculated the pixel accuracy of water extraction at each threshold; as shown in Figure 12, the pixel accuracy increases at first and decreases after 34, so the accuracy is highest when the threshold is 34. When the threshold is lower than 34, missing extractions begin to appear, which lowers the accuracy of the water extraction. When the threshold is higher than 34, false extractions begin to appear, and the accuracy decreases as the threshold increases. This behavior is likely caused by the definition of the standard water pixel value. To sum up, this paper selects 34 as the threshold of the similarity algorithm for water extraction in the tidal flat area.
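The threshold selection behind Figure 12 amounts to a simple sweep; this sketch reuses the hypothetical `similarity_water_extraction` and `evaluate` helpers from the earlier sketches, with `region`, `initial_mask`, and `truth` standing in for a labeled test sample.

```python
thresholds = range(31, 40)
scores = {}
for t in thresholds:
    mask = similarity_water_extraction(region, initial_mask, threshold=float(t))
    _, pixel_acc, _ = evaluate(mask, truth)
    scores[t] = pixel_acc
best_t = max(scores, key=scores.get)  # 34 in the paper's experiments
```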

4.4.2. Qualitative Analysis

To verify the effectiveness of our method, we compare the following methods: NDWI, support vector machine (SVM), maximum likelihood classification, U-Net (Freiburg, Germany) [39], YOLOv3 (Seattle, WA, USA), and FYOLOv3. Tidal flat remote sensing images from the GF-2 satellite are selected as samples, and the results of the six methods for small waterbodies and for waterbodies with blurry boundaries are shown in Figure 13.
As shown in Figure 13, the NDWI method extracts the waterbody in the remote sensing images effectively, but there is a lot of noise in the extraction results. The water extractions by SVM and the maximum likelihood classification method are relatively good, but they cannot effectively handle the high similarity between water and background, and there are many false extractions. The U-Net results show poor extraction of small waterbodies, with many false and missing extractions. Compared to NDWI, SVM, maximum likelihood classification, and U-Net (Freiburg, Germany), YOLOv3 (Seattle, WA, USA) and FYOLOv3 achieve better extraction. However, YOLOv3 (Seattle, WA, USA) misses some extractions in densely packed small water areas because of its prior boxes, whereas FYOLOv3 checks each pixel in the detection area with the similarity algorithm for water extraction, which resolves the false and missing extractions, so it is superior to YOLOv3 (Seattle, WA, USA).

4.4.3. Accuracy Analysis

We take IoU, pixel accuracy (P), and the Kappa coefficient (k) as the evaluation indexes to compare six methods: NDWI, SVM, maximum likelihood classification, U-Net (Freiburg, Germany), YOLOv3 (Seattle, WA, USA), and FYOLOv3. For YOLOv3 we set the image size to 256 × 256, the learning rate to 0.001, the decay to 0.0005, and the momentum to 0.9, and we use the Adam optimizer. For U-Net (Freiburg, Germany) we set the learning rate to 0.001, the decay to 0.0001, and the momentum to 0.9, also with the Adam optimizer. The threshold of NDWI is 0.19, and the parameter of maximum likelihood is 2.1. The results of the six methods are shown in Table 4.
As shown in Table 4, the IoU, P, and k of the FYOLOv3 method for water extraction in a tidal flat area are the highest, followed by YOLOv3, NDWI, U-Net, maximum likelihood classification, and SVM. The method proposed in this paper has higher extraction accuracy than the other methods and performs better on tidal flat water with fuzzy boundaries and on small waterbodies, which demonstrates its advantage for small waterbody extraction in a tidal flat area.
Table 5 shows the model training time and water extraction time of the three convolutional neural network methods. Although the FYOLOv3 method consists of three parts, its water extraction is the fastest. The method proposed in this paper not only improves the accuracy of water extraction but also reduces the model training time and water extraction time thanks to the improvements over YOLOv3 (Seattle, WA, USA).

5. Conclusions

A tidal flat is long and narrow with high sediment content, so there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry; extracting tidal flat waterbodies accurately from high-resolution remote sensing imagery is therefore a great challenge. To solve the low accuracy problem of tidal flat waterbody extraction, this paper proposes FYOLOv3, which extracts tidal flat waterbodies with high accuracy. FYOLOv3 mainly includes three parts: first, according to the characteristics of tidal flat water extraction, an improved object detection network based on YOLOv3 (Seattle, WA, USA) is proposed to ensure the accuracy of water detection, reduce the computational time of the model, and alleviate overfitting; second, an FCN without pooling layers follows the improved object detection network to obtain the initial water extraction; finally, a similarity algorithm for water extraction is proposed, which distinguishes water from non-water pixel by pixel to improve the extraction accuracy of the tidal flat waterbody. Compared to the other models, the experiments show that our method has higher accuracy on the waterbody extraction of tidal flats and small areas: the IoU of our method is 2.43% higher than YOLOv3 (Seattle, WA, USA) and 3.7% higher than U-Net (Freiburg, Germany). However, the method also has limitations: the similarity threshold must be selected manually, and different thresholds must be set for different data, which affects the robustness of the method. Therefore, our future research will consider how to determine the threshold automatically in order to improve the robustness of the method.

Author Contributions

Conceptualization, L.Z. and Y.F.; Methodology, L.Z. and Y.F.; Validation, Y.S. and R.Y.; Resources, R.Y. and G.W.; Data Curation, J.W. and G.W.; Writing—Original Draft Preparation, L.Z. and Y.F.; Writing—Review and Editing, L.Z. and Y.F.; Supervision, L.Z.; Funding Acquisition, L.Z. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (No.2016YFA0601703, 2016YFC0401005) and National Natural Science Foundation of China (91847301, 42075191, 52009080).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The study did not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shao, Y.; Zhou, J.; Mu, R.; Zhu, L.; Jiang, T.Y. Research on urban development and wetland protection in China. J. Ecol. Environ. 2018, 27, 381–388.
  2. Li, L.; Zhao, J.; Xue, X.F. Comprehensive treatment and water quality simulation of Nanming River in Guizhou. J. Environ. Sci. 2018, 38, 1920–1928.
  3. Jin, J.; Li, G.; Sun, W.; Yang, X.; Chang, X.; Liu, K.; Liu, Y. Application status and prospect of satellite remote sensing water resources survey and monitoring. Surv. Mapp. Bull. 2020, 0, 7–10.
  4. Rasid, H.; Pramanik, M.A.H. Visual interpretation of satellite imagery for monitoring floods in Bangladesh. Environ. Manag. 1990, 14, 815–821.
  5. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated water extraction index: A new technique for surface water mapping using Landsat imagery. Remote Sens. Environ. 2014, 140, 23–35.
  6. Yao, F.; Wang, C.; Dong, D.; Luo, J.; Shen, Z.; Yang, K. High-resolution mapping of urban surface water using ZY-3 multi-spectral imagery. Remote Sens. 2015, 7, 12336–12355.
  7. Guo, Q.; Pu, R.; Li, J.; Cheng, J. A weighted normalized difference water index for water extraction using Landsat imagery. Int. J. Remote Sens. 2017, 38, 5430–5445.
  8. Xu, H. Extraction of water information by improved normalized difference water index (MNDWI). Chin. J. Remote Sens. 2005, 5, 589–595.
  9. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432.
  10. Zhang, J.; He, C.; Pan, Y.; Li, J. Classification of high spatial resolution remote sensing data based on multi-source information combination of SVM. J. Remote Sens. 2006, 10, 49–57.
  11. Sisodia, P.S.; Tiwari, V.; Kumar, A. Analysis of supervised maximum likelihood classification for remote sensing images. In Proceedings of the IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2014), Jaipur, India, 9–11 May 2014; pp. 1–4.
  12. Duan, Q.; Meng, L.; Fan, Z.; Hu, W.; Xie, W. Applicability of water information extraction method from GF-1 satellite images. Land Resour. Remote Sens. 2015, 27, 79–84.
  13. Kirby, R. Practical implications of tidal flat shape. Cont. Shelf Res. 2000, 20, 1061–1077.
  14. Jain, S.K.; Singh, R.D.; Jain, M.K.; Lohani, A.K. Delineation of flood-prone areas using remote sensing techniques. Water Resour. Manag. 2005, 19, 333–347.
  15. Wang, J.; Zhang, Y.; Kong, G. The application of the multi-band spectral relationship method in waterbody extraction. Mine Surv. 2004, 4, 30–32.
  16. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
  18. Zhong, X. Research on Waterbody Recognition from Remote Sensing Images Based on Convolution Neural Network. Master's Thesis, Hohai University, Nanjing, China, 2019.
  19. Liang, Z. Extraction Method of Multi-Source Remote Sensing Water Information Based on Deep Learning and Its Application. Master's Thesis, Anhui University, Hefei, China, 2019.
  20. Song, S.; Liu, J.; Liu, Y.; Feng, G.; Han, H.; Yao, Y.; Du, M. Intelligent object recognition of urban water bodies based on deep learning for multi-source and multi-temporal high spatial resolution remote sensing imagery. Sensors 2020, 20, 397.
  21. Yu, L.; Wang, Z.; Tian, S.; Ye, F.; Ding, J.; Kong, J. Convolutional neural networks for waterbody extraction from Landsat imagery. Int. J. Comput. Intell. Appl. 2017, 16, 1750001.
  22. Li, L.; Yan, Z.; Shen, Q.; Cheng, G.; Gao, L.; Zhang, B. Waterbody extraction from very high spatial resolution remote sensing data based on fully convolutional networks. Remote Sens. 2019, 11, 1162.
  23. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A novel deep learning network for lake water body extraction of Google remote sensing images. Remote Sens. 2020, 12, 4140.
  24. Yu, Y.; Yao, Y.; Guan, H.; Li, D.; Liu, Z.; Wang, L.; Yu, C.; Xiao, S.; Wang, W.; Chang, L. A self-attention capsule feature pyramid network for water body extraction from remote sensing imagery. Int. J. Remote Sens. 2021, 42, 1801–1822.
  25. Li, L.; Wen, Q.; Wang, B.; Fan, S.; Li, L.; Liu, Q. Water body extraction from high-resolution remote sensing images based on scaling EfficientNets. J. Phys. Conf. Ser. 2021, 1894, 012100.
  26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  27. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  28. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  30. Montanaro, M.; Lunsford, A.; Tesfaye, Z.; Wenny, B.; Reuter, D. Radiometric calibration methodology of the Landsat 8 thermal infrared sensor. Remote Sens. 2014, 6, 8803–8821.
  31. Huang, X.; Zhu, J.; Han, B.; Jamet, C.; Tian, Z.; Zhao, Y.; Li, J.; Li, T. Evaluation of four atmospheric correction algorithms for GOCI images over the Yellow Sea. Remote Sens. 2019, 11, 1631.
  32. Liu, S.; Wan, J.; Zhang, J.; Ma, Y.; Ren, G. Orthorectification method of SPOT5 satellite image based on ERDAS. Oceanography 2008, 28, 30–33.
  33. Li, H.; He, X.; Tao, D.; Tang, Y.; Wang, R. Joint medical image fusion, denoising and enhancement via discriminative low-rank sparse dictionaries learning. Pattern Recognit. 2018, 79, 130–146.
  34. Fan, W.; Li, H.; Wen, Q.; Gao, X. Orthorectification accuracy analysis of GF-2 satellite image. Surv. Mapp. Bull. 2016, 0, 63–66.
  35. Sun, W.; Chen, B.; Messinger, D. Nearest-neighbor diffusion-based pan-sharpening algorithm for spectral images. Opt. Eng. 2014, 53, 013107.
  36. Ren, J.; Yang, W.; Deng, X.; Wang, L.; Wang, F. Applicability of GF-2 image classification based on OIF and optimal scale segmentation. Mod. Electron. Technol. 2018, 41, 72–77, 82.
  37. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60.
  38. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; People's Posts and Telecommunications Press: Beijing, China, 2017; pp. 208–209.
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
Figure 1. Fine-grained tidal flat waterbody extraction method.
Figure 2. Comparison of multispectral images before and after preprocessing. (a) Multispectral image before preprocessing; (b) multispectral image after preprocessing.
Figure 3. Comparison of panchromatic images before and after orthorectification. (a) Panchromatic image before orthorectification; (b) panchromatic image after orthorectification.
Figure 4. Comparison of multispectral images before and after fusion. (a) Original multispectral image; (b) image after fusion.
Figure 5. Image after band combination.
Figure 6. Example of labeling a region.
Figure 7. Remote sensing image and regional waterbody labeled image. (a) Remote sensing image; (b) regional waterbody labeled image.
Figure 8. Example of remote sensing images by data augmentation. (a) 90° rotation; (b) 180° rotation; (c) 270° rotation; (d) horizontal flip; (e) vertical flip.
Figure 9. The network structure of the improved water detection network.
Figure 10. Fully convolutional network without pooling layers.
Figure 11. Loss function curve of the water detection model with different learning rates.
Figure 12. Comparison of the pixel accuracy with different thresholds.
Figure 13. Comparison of different methods for water extraction in a tidal flat area. (a) Small water bodies; (b) water bodies with blurry boundaries; (c,d) NDWI; (e,f) SVM; (g,h) maximum likelihood classification; (i,j) U-Net; (k,l) YOLOv3; (m,n) FYOLOv3.
Table 1. Network structure and parameters of the improved feature extraction module.
| Type/Parameter | Number of Convolution Kernels | Convolution Kernel Size | Step | Padding |
|---|---|---|---|---|
| Conv1 | 16 | 1 × 1 | 1 | same |
| Conv2 | 16 | 3 × 3 | 1 | same |
| Conv3 | 256 | 1 × 1 | 1 | same |
| BN | | | | |
| Max Pooling | | | | |
| Conv4 | 128 | 3 × 3 | 1 | same |
| Conv5 | 128 | 1 × 1 | 1 | same |
| Conv6 | 512 | 3 × 3 | 1 | same |
| BN | | | | |
| Max Pooling | | | | |
| Conv7 | 256 | 3 × 3 | 1 | same |
| Conv8 | 256 | 1 × 1 | 1 | same |
| Conv9 | 512 | 3 × 3 | 1 | same |
| BN | | | | |
| Max Pooling | | | | |
| Conv10 | 256 | 3 × 3 | 1 | same |
| Conv11 | 256 | 1 × 1 | 1 | same |
| Conv12 | 512 | 3 × 3 | 1 | same |
| Max Pooling | | | | |
| Conv13 | 1024 | 3 × 3 | 1 | same |
Table 2. Correspondence between feature map size and prior box.
| Size of Feature Map | Receptive Field | Prior Boxes |
|---|---|---|
| 13 × 13 | large | 116 × 90, 156 × 198, 373 × 326 |
| 26 × 26 | middle | 30 × 61, 62 × 45, 59 × 119 |
| 52 × 52 | small | 10 × 13, 16 × 30, 33 × 23 |
Table 3. Parameters of the FCN without pooling layers.
| Layer | Number of Convolution Kernels | Convolution Kernel Size | Step | Padding |
|---|---|---|---|---|
| Conv1 | 64 | 3 × 3 | 1 | same |
| Conv2 | 64 | 1 × 1 | 1 | same |
| Conv3 | 256 | 3 × 3 | 1 | same |
| Conv4 | 256 | 1 × 1 | 1 | same |
| Conv5 | 512 | 3 × 3 | 1 | same |
| Conv6 | 512 | 3 × 3 | 1 | same |
| Conv7 | 2 | 3 × 3 | 1 | same |
Table 4. Accuracy comparison of six methods for water extraction in a tidal flat area.
| Method | IoU | P | k |
|---|---|---|---|
| NDWI | 0.9351 | 0.9665 | 0.9303 |
| SVM | 0.8925 | 0.9432 | 0.8821 |
| Maximum likelihood classification | 0.9041 | 0.9496 | 0.8952 |
| U-Net | 0.9309 | 0.9642 | 0.9251 |
| YOLOv3 | 0.9436 | 0.9710 | 0.9394 |
| FYOLOv3 | 0.9679 | 0.9830 | 0.9613 |
Table 5. Comparison of the model training time and water extraction time.
| Method | Training Time (h) | Water Extraction Time (s/sheet) |
|---|---|---|
| U-Net | 8 | 0.18 |
| YOLOv3 + FCN + similarity algorithm | 13 (10 + 3) | 0.24 (0.11 + 0.08 + 0.05) |
| FYOLOv3 | 9 (6 + 3) | 0.16 (0.03 + 0.08 + 0.05) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
