Article

A Spatial Distribution Extraction Method for Winter Wheat Based on Improved U-Net

1 College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi 830046, China
2 Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3711; https://doi.org/10.3390/rs15153711
Submission received: 13 June 2023 / Revised: 16 July 2023 / Accepted: 21 July 2023 / Published: 25 July 2023

Abstract
This paper focuses on the problems of omission, misclassification, and inter-adhesion caused by overly dense distribution, intraclass diversity, and interclass variability when extracting winter wheat (WW) from high-resolution images. It proposes RAunet, a deeply supervised multi-scale-feature network that incorporates a dual-attention mechanism into an improved U-Net backbone. The model mainly consists of a pyramid input layer, a modified U-Net backbone network, and a side output layer. Firstly, the pyramid input layer fuses the feature information of winter wheat at different scales by constructing multiple input paths. Secondly, the Atrous Spatial Pyramid Pooling (ASPP) residual module and the Convolutional Block Attention Module (CBAM) dual-attention mechanism are added to the U-Net model to form the backbone network, which enhances the model's ability to extract winter wheat information. Finally, the side output layer consists of multiple classifiers that supervise the outputs at different scales. Using the RAunet model to extract the spatial distribution information of WW from GF-2 imagery, the experimental results showed that the mIou of the recognition results reached 92.48%, an improvement of 2.66%, 4.15%, 1.42%, 2.35%, 3.76%, and 0.47% compared to FCN, U-Net, DeepLabv3, SegNet, ResUNet, and UNet++, respectively. These results verify the superiority of the RAunet model for WW extraction from high-resolution images and its effectiveness in improving the accuracy of WW spatial distribution mapping.

1. Introduction

As China’s basic industry, agriculture is closely related to the country’s livelihood and social stability. Globally, wheat is widely grown, and it is one of the most important food crops in the world [1]; winter wheat (WW), a major food crop in China, is mainly distributed in the north of the Qinling and Huaihe rivers, some areas south of the Great Wall, and provinces south of the Qinling and Huaihe rivers [2]. The production of winter wheat in China in 2021 was as high as 13,097,200 tons [3], accounting for approximately 90% of China’s wheat and 18% of the world’s wheat acreage [4]. Therefore, the accurate estimation of WW acreage is essential for understanding the status of grain production and safeguarding national food security [5].
Traditionally, WW planting information has been obtained through labor- and resource-intensive approaches such as agricultural statistical reporting, which cannot meet today's need for rapidly grasping agricultural planting information and adjusting planning. Therefore, accurate and rapid access to crop cultivation information is important in assisting the relevant government departments in formulating scientific and reasonable agricultural management policies [6]. In recent years, remote sensing technology has received international attention for its ability to monitor agricultural cropping information over large areas. With the progress of domestic satellites, spatial resolution has improved: the spatial structure and location layout of features are clearer, and information such as texture and size is finer, providing a good data basis for extracting the spatial distribution information of crops [7,8]. Remote sensing technology has the characteristics of a wide field of view and multiple time scales. Remote sensing images provide instantaneous static views over large areas, and compared to traditional survey methods, remote sensing has the advantages of low cost, obvious hierarchy, efficient information collection, and strong timeliness [9]. It can monitor ground information more rapidly and objectively, and it plays an important role in land cover identification and detection [10]. Remote sensing images can monitor the crop growth cycle, and optical remote sensing data can also reflect the spectral band information of crops, which has good applications in WW classification and extraction tasks [11]. The use of remote sensing images is the most efficient method for accurately and rapidly obtaining information on WW planting over a large area. In order to improve the classification accuracy of remotely sensed images, researchers have conducted a great deal of related research.
Image segmentation algorithms play a critical role in the field of computer vision. Image segmentation groups pixels with consistent elemental features into the same class and extracts the target of interest. Researchers have applied image segmentation techniques to remote sensing images, where segmentation classifies each image element into different classes according to its spectral brightness, spatial structure characteristics, or other information in different bands, following certain rules or algorithms. Traditional classification methods have significantly improved the accuracy of remote sensing classification results by segmenting images pixel by pixel [12]. Threshold-based segmentation methods [13], edge-detection-based segmentation methods [14], and region-based segmentation methods [15,16] are three traditional approaches that characterize segmented image features only at a low level, and some holistic information about the object is lost during deep feature extraction [17]. Therefore, these traditional image segmentation methods are only applicable to simple scene segmentation tasks, and their segmentation accuracy in complex scenes falls far short of practical application requirements. Deep learning has provided new research ideas for solving the above-mentioned problems posed by high-resolution images. Scholars have proposed pixel-to-object segmentation and image semantic segmentation methods to conduct in-depth research on the remote sensing classification of various crops [18,19,20].
Deep learning methods can extract image features in a more abstract way. Building on traditional segmentation methods, deep learning models are applied to pixel-level image semantic segmentation tasks, and they have gradually developed into deep-learning-based image semantic segmentation methods, which can extract the unique features of different targets at a deeper level and achieve higher segmentation accuracy. Convolutional neural networks (CNNs), as one of the most representative deep learning network types, have been extensively used in the classification and segmentation of images [21]. CNNs have the advantage of learning features autonomously, and also offer local connectivity, weight sharing, and reduced network complexity and number of trainable parameters [22]. In recent years, CNNs have been widely used in image classification [23,24,25,26], object detection [27,28,29], semantic segmentation [30,31], and other fields.
FCN was the first encoder-decoder network for semantic segmentation; it introduced a skip connection structure for fusing high- and low-dimensional feature maps, truly achieving end-to-end pixel-level semantic segmentation [19]. However, it does not consider pixel-to-pixel relationships and lacks spatial consistency. The U-Net network has a more efficient skip connection structure that pairs high- and low-level feature maps, which helps to recover target boundary contours and restore the spatial dimension and pixel location information of the image [20]. SegNet makes the decoding process more standardized by recording the pooling indices during encoding [18]. ResNet adds residual operations to the network, which greatly increases the achievable network depth; the identity skip connection in the residual block can send the input data directly into the next layer, alleviating problems such as gradient disappearance and performance degradation and making it possible to train extremely deep networks [32]. Atrous Spatial Pyramid Pooling (ASPP) was proposed in the DeepLab series to fix the resolution of the feature map in each branch of the input pyramidal structure, use different convolution kernels, and introduce convolution layers with different dilation rates to obtain multi-scale semantic information [33,34,35]. ResUNet is a deep learning model that incorporates residuals and combines the advantages of ResNet and U-Net. By introducing multiple residual blocks in the encoder and decoder, it can better extract low-frequency information and alleviate the problems of missing semantic information and gradient disappearance, thus improving the accuracy of image segmentation [36]. UNet++ enhances the segmentation quality of objects of various sizes by introducing a collection of U-Nets of variable depth that partially share a single encoder and can be co-learned via deep supervision, and it can be applied to semantic segmentation tasks across different datasets and backbone architectures [37].
Through the continuous research on convolutional neural networks, researchers have begun to apply convolutional neural networks to WW spatial distribution information extraction. CNN consists of multiple nonlinear mapping layers, and by mining the spatial correlation between the mixed image elements or heterogeneous landscape areas in remote sensing images and combining them for analysis, higher-dimensional features can be obtained to achieve the intelligent extraction of WW planting areas [38]. The extraction ability of the network model for WW was enhanced by adjusting the model structure and introducing some feature extraction modules [39]. The sensitivity of the model to wheat features was enhanced by building a network model combining multi-scale high-level features with low-level features to improve segmentation accuracy [40]. Convolutional-neural-network-based crop spatial distribution information extraction is generally based on the standard CNN model, combining factors such as target and data sources and adjusting the network structure to optimize the feature extraction capability, so as to achieve better extraction results [41].
In this paper, drawing on the idea of dealing with the semantic segmentation of natural images, we propose an RAunet model of a multi-scale-feature deep supervised network incorporating a dual-attention mechanism, ensuring the good overall robustness of the network. It can effectively solve the problems of the omission, misclassification, and inter-adhesion of WW caused by the overly dense distribution, intraclass diversity, and interclass variability of WW extracted using high-resolution images. Furthermore, this approach also provides valuable theoretical and technical approaches that can be applied to extract information for other crops.
In summary, our research makes the following contributions:
  • Using the image pyramid as the input to the RAunet model, the channel utilization of the input image is improved by widening the network width.
  • The residual module, ASPP, and Convolutional Block Attention Module (CBAM) are introduced as backbone networks in the U-Net model, which are used to obtain the rich multi-level feature information of WW and enable the model to perform targeted learning of WW features.
  • A side output layer is constructed as the output layer of the model, and the convolutional layers for feature extraction are deeply supervised, which improves the segmentation accuracy of small region features that are easily lost in cascade convolution and facilitates the network to learn more local perceptual features of WW.

2. Study Area and Data

2.1. Overview of the Study Area

The study area is located in the Linfen Basin, Shanxi Province, China. The Linfen Basin starts from Hanhou Ridge in the north, extends southward to Houma, and westward to the bank of the Yellow River. It is bordered on the east and west by the Huoshan Great Fault and the Luoyun Mountains, and by the Hancheng Great Fault and its mountains, respectively, and on the south by the Emei Plateau, with an elevation of 400–600 m. It is a typical graben basin. The Linfen Basin has a semi-arid to semi-humid temperate continental monsoon climate [42]; it is surrounded by mountains with a flat river valley in the center, and the interior of the basin consists mostly of fluvial and lacustrine deposits. Most areas have fertile soil, and rainfall coincides with the warm season. In the study area, spring is relatively dry, summer is rainy, autumn is mild and cool, and winter is cold and dry with some snow. With an average annual temperature of 10–21 °C and an annual precipitation of 525.1 mm, Linfen has good climatic conditions, affording a good growing environment for crop cultivation and making it a key grain and cotton production base in North China. WW and other grain crops are widely grown in the basin, which is one of the major grain-yielding areas in Shanxi Province. The location of the study area is shown in Figure 1.

2.2. Data Acquisition

2.2.1. Acquisition and Preprocessing of Remote Sensing Data

The experimental data were based on GF-2 data as the remote sensing image data source to meet the requirements of high spatial resolution. The GF-2 satellite carries two cameras with 1 m panchromatic and 4 m multispectral resolutions, which have high positioning accuracy and fast attitude maneuverability. The panchromatic sensor has sub-meter spatial resolution, and the multispectral sensor is designed with four bands (blue, green, red, and near-infrared), which can meet the demand of conventional remote sensing image map production.
A total of nine scenes of GF-2 imagery were used to cover the Linfen Basin, all acquired on 19 April 2021, a period when WW was in a relatively vigorous growth state (the jointing stage), close in time to the ground survey. The nine GF-2 images were cloud-free and of high resolution.
The remote sensing platform ENVI 5.3 was used to preprocess the GF-2 images of the study area. During image acquisition, factors such as satellite attitude, speed, altitude, and atmospheric disturbance introduce geometric distortion and radiometric errors into the images.
First, radiometric calibration and atmospheric correction were applied to the multispectral images, and radiometric calibration was applied to the panchromatic images, converting image digital numbers (DN) to radiance and reflectance. The two image types then underwent orthorectification separately using DEM data with a resolution of 30 m. Finally, the NNDiffuse Pan Sharpening algorithm was used to fuse the 4 m multispectral imagery with the 1 m panchromatic imagery, yielding fused images with a spatial resolution of 1 m.

2.2.2. Ground Survey Data

In order to produce samples for training the model in the crop extraction process, a field survey of the study area was required to ensure that the training samples matched the actual spatial distribution of the crops. Field sampling of the study area was conducted on 26 April 2021, and sample data were obtained through ground surveys using GPS to record the location information of the WW sample sites in the field. The acquisition process was as follows: (1) the location and size of the sample to be surveyed were initially determined according to the existing remote sensing base map, and the vector boundary of the sample was digitized on the remote sensing image, which ensured that the internal features of each sample were as homogeneous as possible and that the size was not less than 16 m2; (2) according to the vector boundary of the existing sample, the sample to be surveyed was found in the study area using GPS, and the category of the sample was determined; and (3) some samples that were difficult to obtain in the survey process or contained mixed species in the field were discarded, and the regularity and completeness of the data were checked. After verification, the remaining samples were deemed to meet the relevant accuracy requirements and to evenly cover the entire study area. After the field survey, 80 WW sampling points were obtained, as shown in Figure 2.

2.2.3. Dataset Production

The preprocessed GF-2 images of the study area were combined with the ground survey data and visual interpretation, followed by manual labeling using LabelMe software to produce a dataset. The wheat-growing areas were labeled as WW and the other areas as non-winter wheat (NW). After labeled images were created, the annotated images were cropped to 600 × 600 pixels, with the original images and labels corresponding to one another, resulting in a final dataset of 500 samples. The distribution of the dataset in the study area is shown in Figure 2, of which 350 were used for training, 100 for validation, and 50 for testing.
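To make the dataset production step concrete, the following minimal Python sketch (not the authors' code) tiles a co-registered image and label mask into 600 × 600 patches and splits them in the same 70/20/10 proportions as the 350/100/50 division described above. The array shapes and the random seed are illustrative assumptions.

```python
import random
import numpy as np

def tile_pair(image, label, tile=600):
    """Cut a co-registered image (H, W, C) and label mask (H, W) into tile x tile patches."""
    h, w = label.shape
    patches = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            patches.append((image[r:r + tile, c:c + tile, :],
                            label[r:r + tile, c:c + tile]))
    return patches

# Placeholder arrays standing in for a fused 1 m GF-2 mosaic (blue, green, red, NIR)
# and its LabelMe-derived mask (1 = WW, 0 = NW); real scenes are much larger.
image = np.zeros((3000, 3000, 4), dtype=np.uint16)
label = np.zeros((3000, 3000), dtype=np.uint8)

samples = tile_pair(image, label)
random.seed(0)
random.shuffle(samples)

# 70/20/10 split, matching the paper's 350/100/50 division of the 500 samples.
n = len(samples)
train = samples[:int(0.7 * n)]
val = samples[int(0.7 * n):int(0.9 * n)]
test = samples[int(0.9 * n):]
```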

3. Research Methodology

For the WW extraction task, we propose RAunet, a multi-scale-feature deeply supervised network model with a fused dual-attention mechanism that uses an improved U-Net as its backbone; it consists of three main parts (Figure 3). First, the image pyramid is used as the model's input: multiple input paths at different scales are added to the encoder to achieve feature fusion at different scales and to widen the network, improving the channel utilization of the input image. The second part improves the U-Net model [20] as the main architecture, including the use of residual blocks to replace the regular convolutions in the U-Net encoder to avoid the gradient explosion problem caused by deepening the network, which effectively bridges the gap between the low- and high-level features of the network model. Additionally, the ASPP structure was introduced to obtain rich multi-level feature information of WW. Meanwhile, the CBAM dual-attention mechanism was added to the skip connection stage to add weight information in both the channel and spatial dimensions of the feature map, which effectively suppresses the model's learning of NW features and enables the model to perform targeted learning of WW features. Finally, the side output layers are constructed by adding auxiliary classifiers to each layer in the decoder of the model, and the convolutional layers used for feature extraction are deeply supervised. Each output path produces a local mapping result, and its output loss is backpropagated to the feature extraction layers along the decoder path, which helps train the feature extraction layers, improves the segmentation accuracy of small regional features that are easily lost in cascaded convolution, and facilitates the network learning more local perceptual features of WW.

3.1. Image Pyramid Input

The RAunet model constructs an image pyramid input (Figure 3a): a 2 × 2 average pooling layer down-samples the image several times to obtain images of different scales, and multiple input paths are constructed on the encoder. Before each max pooling layer of the encoder, the input image at the corresponding scale is passed through a semantic feature extraction path consisting of a 3 × 3 convolution layer, a batch normalization layer, and a ReLU activation layer, and is then integrated into the encoder to increase the width of the encoder path and improve the utilization of the channels at each layer. This method, combined with deep supervision, can effectively enhance the ability of cascaded convolution to capture small-scale features, which facilitates the network learning more local perceptual features of the classification target [43].
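A minimal Keras sketch of this pyramid input is given below (the paper states the model was implemented in Keras, but this is not the authors' code). The 600 × 600 × 4 patch size, the filter counts, and the two pyramid levels shown are assumptions for illustration.

```python
from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    # 3x3 convolution -> batch normalization -> ReLU: the semantic feature extraction unit
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

inp = layers.Input(shape=(600, 600, 4))           # GF-2 patch: blue, green, red, NIR

# Image pyramid built with 2x2 average pooling applied repeatedly to the input.
pyr2 = layers.AveragePooling2D(pool_size=2)(inp)  # 300 x 300
pyr4 = layers.AveragePooling2D(pool_size=4)(inp)  # 150 x 150

# Encoder level 1 operates on the full-resolution image.
e1 = conv_bn_relu(inp, 64)
p1 = layers.MaxPooling2D(2)(e1)

# Before each subsequent pooling step, the matching pyramid level passes through its own
# conv-BN-ReLU path and is concatenated, widening the encoder at that scale.
s2 = conv_bn_relu(pyr2, 64)
e2 = conv_bn_relu(layers.Concatenate()([p1, s2]), 128)
p2 = layers.MaxPooling2D(2)(e2)

s3 = conv_bn_relu(pyr4, 128)
e3 = conv_bn_relu(layers.Concatenate()([p2, s3]), 256)
```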

3.2. Backbone Network Construction

3.2.1. U-Net Model

U-Net is a fully symmetrical encoder-decoder semantic segmentation model that can better fuse image features [20]. The encoder consists of five semantic extraction units that follow a typical CNN architecture. Each semantic extraction unit is a stack of two 3 × 3 convolutions, each followed by ReLU nonlinear activation, and the units are connected via max pooling (maximum pooling layers). The number of feature channels is expanded to 1024 after four down-sampling steps. During encoding, the network gradually compresses the spatial dimension of the feature map, expands the channels, and continuously extracts deeper image features. The decoder continuously enlarges the feature maps via up-sampling and fuses them with the semantic feature maps extracted at the corresponding layer of the encoder, so as to fully combine shallow and deep semantic features. Since padding is not set in the convolution operations during network training, the feature maps need to be cropped before being fused in the decoding process, and then connected using Concatenate to fully fuse the shallow and deep features and recover the detailed information of the feature map. The feature maps obtained after encoding and decoding are compressed along the channels and then fed into the SoftMax classifier to obtain the probability of each pixel belonging to the WW and NW categories; the corresponding pixels are then mapped as WW or NW.

3.2.2. Residual Block

Theoretically, the deeper the convolutional neural network, the more features extracted and the more semantic information included; in practice, however, model degradation occurs due to gradient dispersion [44]. During backpropagation, the gradient may decrease or increase exponentially as it passes through multiple layers, so the parameters of the earlier layers of the network are no longer updated, causing problems such as gradient disappearance or explosion. The ResNet residual block adds an identity mapping path alongside the convolutions, which avoids the degradation of the whole model's learning accuracy caused by stacking many layers as the network continues to deepen. The residual structure is calculated as shown in Equation (1) [32].
$F(x) = H(x) - x$ (1)
In the equation, x is the input, H(x) is the desired optimal mapping, F(x) is the learned residual, and the identity mapping is performed by skipping the stacked convolution layers. When the residual is 0, only the identity mapping remains.
In this paper, the bottleneck structure of ResNet50 was selected as the feature extraction module of the backbone network according to the feature extraction network structure suitable for U-Net. The structure is shown in Figure 4 [32].
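As a concrete illustration, the following Keras sketch builds a ResNet50-style bottleneck block of the kind shown in Figure 4 (1 × 1, 3 × 3, 1 × 1 convolutions with an identity or projection shortcut). This is a generic bottleneck, not the authors' exact configuration; the filter counts and stride handling are assumptions.

```python
from tensorflow.keras import layers

def bottleneck(x, filters, stride=1):
    """ResNet50-style bottleneck: 1x1 -> 3x3 -> 1x1 convolutions with a shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(4 * filters, 1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the shape changes; otherwise keep the identity mapping.
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, padding='same')(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation('relu')(layers.Add()([y, shortcut]))    # H(x) = F(x) + x
```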

3.2.3. Atrous Spatial Pyramid Pooling

Atrous Spatial Pyramid Pooling (ASPP) was proposed in the DeepLabv2 deep learning model [34]; it consists of multiple Atrous convolutions with different dilation rates. The features extracted by each Atrous convolution are further processed in separate branches and then fused to generate the final result, which can effectively capture multi-scale features. The DeepLabv3 model incorporates image-level features containing global contextual information [35], which further improves ASPP's performance and effectively increases segmentation accuracy. The specific structure is shown in Figure 5 [35].
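A minimal Keras sketch of such an ASPP block is shown below, assuming the DeepLabv3-style layout with a 1 × 1 branch, three dilated 3 × 3 branches, and an image-level pooling branch; the dilation rates (6/12/18), filter count, and fixed input size are assumptions rather than the authors' settings.

```python
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    """ASPP: 1x1 branch, dilated 3x3 branches, and an image-level pooling branch."""
    branches = [layers.Conv2D(filters, 1, padding='same', activation='relu')(x)]
    for r in rates:
        branches.append(layers.Conv2D(filters, 3, padding='same',
                                      dilation_rate=r, activation='relu')(x))
    # Image-level branch: global average pooling, 1x1 conv, then up-sample back
    # (assumes a fixed, known spatial size for the input feature map).
    h, w = x.shape[1], x.shape[2]
    img = layers.GlobalAveragePooling2D(keepdims=True)(x)
    img = layers.Conv2D(filters, 1, activation='relu')(img)
    img = layers.UpSampling2D(size=(h, w), interpolation='bilinear')(img)
    branches.append(img)
    y = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding='same', activation='relu')(y)
```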

3.2.4. Convolutional Block Attention Module

The convolutional block attention module (CBAM) is a feedforward convolutional neural network attention mechanism module that integrates the channel attention module (CAM) and the spatial attention module (SAM) [45]. The feature map input into CBAM first obtains the feature map containing the channel information through the CAM, and then enters the SAM to obtain the semantic feature information of the monitoring target using the spatial information of the feature map. Finally, those feature maps containing the channel weight information are multiplied to output feature maps with channel and spatial weight information. The introduction of the attention mechanism can change the allocation of resources so that more resources tend toward WW. In the training process, assigning more weight parameters to the WW in images can strengthen the capability of WW feature extraction and reduce the influence of background noise. The structure is given in Figure 6 [45].
(1)
Channel attention
The CAM applies average pooling and max pooling to the input feature map (H × W × C) to obtain two spatial context descriptors, $F^C_{avg}$ and $F^C_{max}$. The correlation between channels is learned through the weights generated for each feature channel [46]. Average pooling integrates the global spatial information, while max pooling takes the maximum pixel value in each neighborhood of the feature map to reduce the influence of useless information. The two descriptors are fed into a shared network, consisting of a multi-layer perceptron (MLP) with one hidden layer, and the outputs are summed (Figure 7) [45]. The hidden layer size is set to C/r × 1 × 1, where r is the compression ratio. Finally, the CAM [47] is computed with the Sigmoid function using the following equation:
$M_C(F) = \delta(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \delta(W_1(W_0(F^C_{avg})) + W_1(W_0(F^C_{max})))$
In the equation, δ denotes the sigmoid function, and $F^C_{avg}$ and $F^C_{max}$ denote the average-pooled and max-pooled features, respectively.
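The following Keras sketch implements this channel attention computation (a shared MLP applied to the average- and max-pooled descriptors, summed and passed through a sigmoid). The reduction ratio r = 16 is an assumption; this is an illustrative implementation, not the authors' code.

```python
from tensorflow.keras import layers

def channel_attention(x, ratio=16):
    """CAM: shared MLP over average- and max-pooled descriptors, summed, sigmoid-gated."""
    channels = x.shape[-1]
    shared_dense_1 = layers.Dense(channels // ratio, activation='relu')  # hidden layer C/r
    shared_dense_2 = layers.Dense(channels)
    avg = shared_dense_2(shared_dense_1(layers.GlobalAveragePooling2D()(x)))
    mx = shared_dense_2(shared_dense_1(layers.GlobalMaxPooling2D()(x)))
    scale = layers.Activation('sigmoid')(layers.Add()([avg, mx]))
    scale = layers.Reshape((1, 1, channels))(scale)
    return layers.Multiply()([x, scale])          # reweight each channel of the input
```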
(2)
Spatial attention
The SAM enhances feature filtering by focusing on spatial feature information [48]. Like the CAM, it uses average pooling and max pooling, compressing the CAM-weighted feature map Mc (H × W × C) into two feature maps, $F^S_{avg}$ and $F^S_{max}$, each of size H × W × 1, which are concatenated to obtain $F^S$ of size H × W × 2. The resulting feature map is then compressed into a single-channel map with one convolution operation. Finally, the spatial attention features are calculated using the Sigmoid function (Figure 8) [45]. The SAM [47] is calculated as follows:
$M_S(F) = \delta(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) = \delta(f^{7 \times 7}([F^S_{avg}; F^S_{max}]))$
In the equation, δ denotes the sigmoid function and $f^{7 \times 7}$ denotes a convolution operation with a 7 × 7 filter.
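A matching Keras sketch of the spatial attention computation is given below (channel-wise average and max maps, concatenated, then a 7 × 7 convolution with a sigmoid); combining it with the channel attention sketch above, in the order CAM then SAM, yields a CBAM-style block. Again, this is an illustrative sketch rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x):
    """SAM: channel-wise average and max maps -> concat -> 7x7 conv -> sigmoid gate."""
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    mx = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    concat = layers.Concatenate()([avg, mx])                          # H x W x 2
    scale = layers.Conv2D(1, 7, padding='same', activation='sigmoid')(concat)
    return layers.Multiply()([x, scale])

# CBAM applies the channel attention sketch above first, then this spatial attention:
# refined = spatial_attention(channel_attention(feature_map))
```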

3.2.5. Improved U-Net Structure

The main structure of the model is similar to that of the U-Net model (Figure 9). Its encoder is composed of five semantic extraction units, and the first four semantic feature extraction units use residual blocks instead of normal convolutional layers for feature extraction. The feature extraction layers all consist of a convolutional layer, a batch normalization layer, and a ReLU activation layer for semantic feature extraction, which to a certain extent eliminates the gradient disappearance and gradient explosion problems that exist in deep learning architectures. The features are then compressed using a 2 × 2 pooling layer for down-sampling. The bottom layer of the model uses the ASPP structure instead of the normal convolution operation; this module uses multiple Atrous convolutions with different dilation rates to obtain multi-scale object information. The decoder consists of four feature fusion units, each containing a 2 × 2 up-sampling layer and a convolutional layer with a 3 × 3 kernel, so that the low-resolution feature maps carrying high-level abstract features recover their spatial detail and size while retaining those abstract features. In the skip connection stage, the low-level feature map obtained from the encoder is passed through the CBAM dual-attention module to add weight information, and then the feature fusion operation is performed. Padding is set for the convolution operations during network training to avoid cropping the feature maps before fusion during decoding.

3.3. Deeply Supervised Side Output Layer

The multi-path output is set on the right side of the decoder, and the side output layer is constructed by adding auxiliary classifiers to each layer in the decoder (Figure 3c). Each feature fusion path ends with a 1 × 1 convolution that adjusts the number of channels. Then, the SoftMax classifier predicts each pixel of the feature map at each scale. In order to use the prediction images at different scales directly in the side output layer, the outputs of the branch networks are up-sampled by factors of 2 × 2, 4 × 4, and 8 × 8, respectively, so that the resolution of the side output images is consistent with the final output. All side output images are then integrated into the final prediction image by an averaging layer. The side output layer backpropagates the loss values of each output path together with the feature extraction layers in the decoder [49] to effectively supervise the outputs at the different scales of each path.
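The sketch below illustrates this side output construction in Keras under the assumptions that the decoder produces four feature maps at 1×, 1/2×, 1/4×, and 1/8× of the input resolution and that there are two classes (WW/NW); it is not the authors' code.

```python
from tensorflow.keras import layers

def side_output(feature_map, up_factor, n_classes=2):
    """1x1 conv to adjust channels, per-pixel softmax, then up-sample to full resolution."""
    y = layers.Conv2D(n_classes, 1, padding='same')(feature_map)
    y = layers.Softmax(axis=-1)(y)
    if up_factor > 1:
        y = layers.UpSampling2D(up_factor, interpolation='bilinear')(y)
    return y

def fuse_side_outputs(d1, d2, d3, d4):
    """d1..d4: decoder feature maps at 1x, 1/2x, 1/4x, and 1/8x of the input resolution."""
    outs = [side_output(d1, 1), side_output(d2, 2),
            side_output(d3, 4), side_output(d4, 8)]
    # During training each side output also carries its own cross-entropy loss
    # (deep supervision); the averaged map serves as the final prediction.
    return layers.Average()(outs), outs
```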

3.4. Experimental Environment and Setting

The RAunet model was written in Python under the Windows 10 OS and implemented in the Keras framework. An NVIDIA RTX 2080 Ti GPU (11 GB of memory) was used for training and testing. Adam was used as the optimizer in the experiments, with an initial learning rate of 0.0001, a batch size of 2, and cross-entropy as the loss function.
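For reference, a Keras training configuration matching these settings might look like the sketch below. The model head, dummy data, and epoch count are placeholders for illustration only; the paper does not provide its training script.

```python
import numpy as np
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

# Placeholder model: any Keras segmentation graph (such as RAunet) is compiled the same way.
inp = layers.Input(shape=(600, 600, 4))
out = layers.Conv2D(2, 1, activation='softmax')(inp)     # stand-in head predicting WW / NW
model = Model(inp, out)

model.compile(optimizer=Adam(learning_rate=1e-4),        # initial learning rate 0.0001
              loss='categorical_crossentropy',           # pixel-wise cross-entropy loss
              metrics=['accuracy'])

# Dummy one-hot data standing in for the 600 x 600 GF-2 patches and WW/NW masks.
x_train = np.zeros((4, 600, 600, 4), dtype=np.float32)
y_train = np.zeros((4, 600, 600, 2), dtype=np.float32); y_train[..., 1] = 1.0
x_val, y_val = x_train[:2], y_train[:2]

# Batch size 2, as in the paper; the epoch count here is only for illustration.
model.fit(x_train, y_train, batch_size=2, epochs=1, validation_data=(x_val, y_val))
```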

3.5. Evaluation Metrics

The performance of the RAunet model for extracting WW was evaluated using precision, recall, F1-score, mean intersection over union (mIou), and overall accuracy as the evaluation metrics. These evaluation metrics are accuracy measures based on confusion matrices that summarize the pixel values in an image based on the true and predicted categories and describe the performance of the network model on a set of data. The rows and columns of the matrix represent the true and predicted values, respectively (Table 1).
Precision indicates the percentage of predicted WW image elements correctly classified, calculated as shown below [50]:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall indicates the percentage of correctly classified WW pixels over all actual WW, calculated as follows [51]:
$\mathrm{Recall} = \frac{TP}{TP + FN}$
F1-score is the weighted average of precision and recall. The value range is 0–1, and the larger the value, the higher the model accuracy, which is calculated as follows [51]:
$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Iou is the ratio of the intersection to the union of the actual and predicted category samples, while mIou is the average of the Iou values over all categories [52]:
$\mathrm{Iou} = \frac{TP}{TP + FN + FP}$
$\mathrm{mIou} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Iou}_i$
Accuracy indicates the percentage of all correctly predicted pixels over all pixels in the sample, calculated as shown below [8]:
$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$
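For reference, the short Python sketch below computes these metrics from the entries of the binary (WW/NW) confusion matrix defined above; the counts passed in at the end are placeholders, not results from the paper.

```python
# tp, fp, fn, tn are counts taken from the binary WW/NW confusion matrix (Table 1).
def metrics_from_confusion(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_ww = tp / (tp + fn + fp)                 # Iou of the WW class
    iou_nw = tn / (tn + fp + fn)                 # Iou of the NW class
    miou = (iou_ww + iou_nw) / 2                 # mean over the two classes
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, miou, accuracy

# Placeholder counts, not values from the paper.
print(metrics_from_confusion(tp=900, fp=30, fn=20, tn=1050))
```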

4. Results

During the training process, the lower the loss of the training and validation sets, the higher the accuracy, indicating better performance of the network. As shown in Figure 10, RAunet’s training and validation losses gradually decreased with the increase in epoch times. Particularly in the early stage of training, the losses of the training and validation sets converged faster and the overall accuracy improved significantly. At around 10 iterations, the loss curve of the validation set gradually tended to level off.
Figure 10 shows the loss and accuracy curves of RAunet and the six comparison models for the training and validation sets during the training process. As can be seen from the figure, the loss of all seven networks decreased gradually with the increase in the number of iterations. The validation loss curves of the six comparison models started to converge after 20 epochs, and the RAunet loss converged faster compared to the six comparison models. The validation set loss curve gradually leveled off after 10 iterations, and a smaller loss value (0.0971) was obtained than that of the six comparison models. It can be seen that the RAunet model obtained better loss function convergence and was more conducive to improving the WW recognition accuracy compared to the six comparison models.
To verify the accuracy of the RAunet model for extracting the spatial distribution information of WW, the trained FCN [19], U-Net [20], DeepLabv3 [40], SegNet [18], ResUNet [36], and UNet++ [37] models and the RAunet model were tested on the test set. FCN and U-Net are classical pixel-by-pixel codec-structured semantic segmentation models, while DeepLabv3 is better able to capture multi-scale information, and the SegNet model is a supervised codec-structured full convolutional neural-network-based model. ResUNet, UNet++, and RAunet models are all improvements on the U-Net. Comparing the above six models can better reflect the advantages of the RAunet model for WW extraction.
Columns a, b, and c in Figure 11 show the extraction results of each network structure on densely distributed small-scale WW; the yellow boxes mark the more densely distributed winter wheat patches. The segmentation results of the FCN, U-Net, and ResUNet models differ greatly from the original label map, and the WW patches in their results have unclear edges and rough contours. These results are neither fine enough nor sensitive enough to the detailed information captured in the images, so smaller-scale WW patches are not recovered effectively. SegNet, DeepLabv3, and UNet++ improved the edges of the WW patches and partially recovered small-scale WW patches; however, there are obvious adhesions between WW patches, and stitching marks are noticeable. In contrast, the RAunet model accurately identified and captured the features of densely distributed WW borders, and could delineate adjoining borders and accurately separate WW patches that were close to one another.
The extraction results for WW with large intraclass variability are shown in column d of Figure 11, where the WW in the yellow box is light green and the WW intraclass color features vary widely. FCN, SegNet, and ResUNet demonstrated poor recognition ability for WW with large feature variability, while U-Net, DeepLabv3, and UNet++ improved relative to those models and were able to recognize most of this type of WW information. RAunet accurately identified WW with large intraclass variability.
The extraction results for WW obtained under the interference of green vegetation with small interclass variability are shown in column e of Figure 11. The texture and color characteristics of the green vegetation in the yellow box are very close to those of WW, which interferes with the accurate extraction of WW. Because the gap between the WW and green vegetation classes is small, the misclassification produced by networks with poor recognition ability is more serious. FCN, U-Net, and ResUNet all incorrectly identified some of the green vegetation as WW; SegNet, DeepLabv3, and UNet++ performed better by comparison, but their extraction of fine WW patches was incomplete. RAunet demonstrated stronger feature recognition ability, capturing more detailed classification features, overcoming the interference of features similar to WW, and achieving a more complete extraction of WW.
Table 2 provides the confusion matrices for the segmentation results of FCN, U-Net, DeepLabv3, SegNet, ResUNet, UNet++, and RAunet (adopted in this paper). Each column indicates the actual share of WW or NW, while each row indicates the predicted share of WW or NW. As can be seen from Table 2, the RAunet model proposed in this paper has the best segmentation effect, with only 1.54% of WW pixels misclassified as NW and 2.32% of NW pixels misclassified as WW. These values are reduced by 1.1% and 0.33% compared to the FCN model, 1.03% and 1.23% compared to the U-Net model, 0.71% and 0.05% compared to the DeepLabv3 model, 0.17% and 1.11% compared to the SegNet model, and 1.78% and 0.28% compared to the ResUNet model, respectively. Compared with the UNet++ model, the misclassification of WW pixels as NW decreased by 0.3%, while the misclassification of NW as WW increased by 0.05%. However, the overall segmentation error rates of the seven models are 5.29%, 6.12%, 4.62%, 5.14%, 5.92%, 4.11%, and 3.86%, respectively. Overall, the RAunet model proposed in this paper has the best segmentation effect.
In order to accurately analyze and compare the WW extraction accuracy of the seven models, the precision, recall, F1-score, mIou, and accuracy of their WW extraction results are reported as evaluation indexes in Table 3. Comparing the data in Table 3, the segmentation accuracy of the RAunet model is better than that of the six comparison models (FCN, U-Net, DeepLabv3, SegNet, ResUNet, and UNet++). Specifically, RAunet achieved the highest values for all five metrics: precision (96.52%), recall (94.84%), accuracy (96.13%), F1-score (95.67%), and mIou (92.48%). In the mIou metric, the model improved by 2.66%, 4.15%, 1.42%, and 2.35% compared to FCN, U-Net, DeepLabv3, and SegNet, respectively. At the same time, to verify the improvement over other U-Net-based models, the mIou of RAunet increased by 3.76% and 0.47% compared with ResUNet and UNet++, respectively. The above analysis shows that the RAunet model can better solve the WW extraction problem and significantly improves the extraction of WW compared to conventional methods.
The performance of the RAunet model was evaluated through the comparison experiments described above; Table 3 quantitatively indicates that the model has higher segmentation accuracy relative to the six comparison models. As shown in Figure 12, the RAunet model achieved the highest scores of 96.52%, 94.84%, and 95.67% in the precision, recall, and F1-score indicators for WW, respectively. RAunet's precision for WW improved by 0.64% and 0.4% relative to the SegNet and ResUNet models, and its F1-score improved by 0.3% and 1.09% compared to UNet++ and DeepLabv3, respectively, giving it obvious advantages over the other networks. Although the RAunet model has a lower recall than UNet++, its F1-score increased by 0.69%. In addition, combined with Table 2, the error rate of the segmentation results of the RAunet model decreased by 0.25% compared with the UNet++ model. The above results show that RAunet better balances precision and recall, and can effectively improve the extraction accuracy of the spatial distribution information of WW.

5. Discussion

WW has a pivotal position among food crops across the world. Thus, timely and accurate information on WW cultivation is important for government management in formulating agricultural and food policies, adjusting agricultural cultivation structures, and ensuring national food security [53]. Deep learning semantic segmentation models have clear advantages over traditional classification methods for WW extraction from high-resolution images. However, directly applying classical CNNs to WW extraction from high-resolution images leads to the following problems (Figure 11): (1) For densely distributed small-scale WW patches, the segmentation results are not fine enough, the contours are rough, and adjacent WW patches adhere to one another, so smaller-scale WW patches cannot be effectively recovered. (2) Large intraclass disparity in WW leads to poor identification of WW with large differences in characteristics. (3) The similarity between WW and other crops interferes with WW extraction, leading to incorrect extraction. To address these problems, this paper proposed RAunet, a multi-scale-feature deeply supervised network with an improved U-Net backbone incorporating a dual-attention mechanism, which uses a modified U-Net model as the backbone network, takes image pyramids as input, and adds auxiliary classifiers to each branch in the decoder to form a side output layer. The results show that RAunet trained on high-resolution image datasets can effectively identify WW, and the extracted spatial distribution information of WW is more complete.
From the WW extraction results of the RAunet model and the six comparison models, it can be seen that FCN and U-Net performed worst on densely distributed small-scale WW. This is because FCN does not consider the relationships between pixels and ignores spatial consistency in pixel-based classification segmentation [20], resulting in rough contours in the segmentation results. In the U-Net model, the continuous up-sampling during decoding loses a large amount of edge information, which leads to degraded segmentation accuracy. The SegNet model was better at extracting WW pixels when they covered a large area and were not internally interspersed with other similar crops, but was less capable of recognizing WW where a single image element contained multiple ground-class targets or the distribution was fragmented. SegNet is prone to generating additional noise when performing nonlinear up-sampling using the max pooling indices saved during encoding [53], which leads to poor recognition of WW with large within-class disparity and small differences between WW and other crop classes. DeepLabv3 captures multi-scale contextual information by setting different Atrous convolutions [33], and could recover most of the disparate WW patches, but the extracted patches were more fragmented in regions where texture features were weakly differentiated. ResUNet's limited receptive-field size and lack of cross-channel interaction led to a decrease in segmentation accuracy [36]. UNet++ uses a network structure with nested and dense skip connections, but it does not fully exploit the multi-scale information, so detailed information is ignored when extracting winter wheat [37].
The RAunet model increases the network width by constructing multi-scale pyramids to improve the utilization of each channel, which, combined with the advantage of deep supervision, effectively improves the model's ability to capture small-scale features and facilitates the network learning more local perceptual features of WW. In the backbone network, residual blocks replace ordinary convolutions to effectively prevent gradient explosion and vanishing, ASPP is introduced to obtain feature information at more scales, and the CBAM dual-attention module is added in the skip connection stage to strengthen the local attention of the ASPP features. In this way, the RAunet model can effectively solve the difficult problem of identifying WW with large within-class disparity and small differences between WW and other crop classes.
In order to verify the effectiveness of each module of the RAunet model, ablation experiments were conducted to explore the effects of adding the pyramid input layer, residual module, ASPP, CBAM, and side output layer of the U-Net network in turn on the extraction accuracy of winter wheat. The results of the ablation experiments are shown in Table 4, where each row represents a set of experiments and the check marks indicate that the corresponding modules were added.
As shown in Table 4, with the addition of the pyramid input layer, the F1-score of the U-Net network increased by 1.59% and the mIou increased by 2.75%, indicating that the pyramid input layer can effectively improve the model's ability to extract WW. In rows 3–6 of Table 4, the residual module, ASPP, CBAM, and side output layer are added in turn, and F1 and mIou improve in each case. Clearly, all of the modules added in this paper contribute to the extraction of winter wheat, thereby improving the overall extraction accuracy. The results show that the RAunet model trained on high-resolution image datasets can effectively identify WW, and the extracted spatial distribution information of WW is more complete.

6. Conclusions

The use of convolutional neural networks to extract the spatial distribution information of WW from high-resolution remote sensing images has become a mainstream method. Existing methods are affected by the overly dense distribution, large intraclass differences, and interclass variability of WW, leading to misclassification and omission. In this paper, RAunet, a multi-scale-feature deeply supervised network with an improved U-Net backbone incorporating a dual-attention mechanism, was proposed to extract WW from high-resolution images. The model consists of a pyramid input layer, an improved U-Net backbone with residual blocks, ASPP, and CBAM, and a side output layer with multiple auxiliary classifiers. The comparative experiments show that the winter wheat extraction results of this model are better than those of four classical models (FCN, U-Net, DeepLabv3, and SegNet) and two improved U-Net models (ResUNet and UNet++). The effectiveness of each module of RAunet was also verified via ablation experiments. RAunet achieved higher accuracy and robustness than the other models on the self-built dataset. It effectively solves the extraction difficulties caused by the dense distribution of WW, large intraclass differences, and small interclass differences when using high-resolution remote sensing images to extract WW, and it provides a theoretical and technical reference for extracting the spatial distribution information of other crops.
However, there are still some shortcomings in this work. In terms of the image data source, only GF-2 images were used; in the future, other types of remote sensing data and multi-temporal data could be considered for extracting the spatial distribution information of WW.

Author Contributions

Conceptualization, J.L. and H.W.; funding acquisition, H.W.; methodology, J.L.; supervision, H.W.; data curation, J.L.; validation, J.L.; visualization, H.T., Y.L. and J.S.; writing—original draft, J.L.; writing—review and editing, J.L., H.W., Y.Z., X.Z., T.Q., D.L. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Science and Technology Project of Inner Mongolia (2021ZD0015).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ozdarici-Ok, A.; Ok, A.; Schindler, K. Mapping of Agricultural Crops from Single High-Resolution Multispectral Images—Data-Driven Smoothing vs. Parcel-Based Smoothing. Remote Sens. 2015, 7, 5611–5638.
  2. Jing-Song, S.; Guang-Sheng, Z.; Xing-Hua, S. Climatic suitability of the distribution of the winter wheat cultivation zone in China. Eur. J. Agron. 2012, 43, 77–86.
  3. National Bureau of Statistics of China (NBS). Available online: http://www.stats.gov.cn (accessed on 5 July 2023).
  4. Fan, L.; Yang, J.; Sun, X.; Zhao, F.; Liang, S.; Duan, D.; Chen, H.; Xia, L.; Sun, J.; Yang, P. The effects of Landsat image acquisition date on winter wheat classification in the North China Plain. ISPRS-J. Photogramm. Remote Sens. 2022, 187, 1–13.
  5. Zhang, S.; Yang, S.; Wang, J.; Wu, X.; Henchiri, M.; Javed, T.; Zhang, J.; Bai, Y. Integrating a Novel Irrigation Approximation Method with a Process-Based Remote Sensing Model to Estimate Multi-Years Winter Wheat Yield over the North China Plain. J. Integr. Agric. 2023, in press.
  6. Wang, S.; Azzari, G.; Lobell, D.B. Crop type mapping without field-level labels: Random forest transfer and unsupervised clustering techniques. Remote Sens. Environ. 2019, 222, 303–317.
  7. Scott, G.J.; England, M.R.; Starms, W.A.; Marcum, R.A.; Davis, C.H. Training Deep Convolutional Neural Networks for Land–Cover Classification of High-Resolution Imagery. IEEE Geosci. Remote Sens. Lett. 2017, 14, 549–553.
  8. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322.
  9. Chen, F.; Zhang, W.; Song, Y.; Liu, L.; Wang, C. Comparison of Simulated Multispectral Reflectance among Four Sensors in Land Cover Classification. Remote Sens. 2023, 15, 2373.
  10. Li, W.; Zhang, H.; Li, W.; Ma, T. Extraction of Winter Wheat Planting Area Based on Multi-Scale Fusion. Remote Sens. 2022, 15, 164.
  11. Ashourloo, D.; Nematollahi, H.; Huete, A.; Aghighi, H.; Azadbakht, M.; Shahrabi, H.S.; Goodarzdashti, S. A new phenology-based method for mapping wheat and barley using time-series of Sentinel-2 images. Remote Sens. Environ. 2022, 280, 113206.
  12. Zakeri, H.; Yamazaki, F.; Liu, W. Texture Analysis and Land Cover Classification of Tehran Using Polarimetric Synthetic Aperture Radar Imagery. Appl. Sci. 2017, 7, 452.
  13. Guo, A.; Wang, Y.; Guo, L.; Zhang, R.; Yu, Y.; Gao, S. An adaptive position-guided gravitational search algorithm for function optimization and image threshold segmentation. Eng. Appl. Artif. Intell. 2023, 121, 106040.
  14. Feng, Y.; Chow, L.S.; Gowdh, N.M.; Ramli, N.; Tan, L.K.; Abdullah, S.; Tiang, S.S. Gradient-based edge detection with skeletonization (GES) segmentation for magnetic resonance optic nerve images. Biomed. Signal Process. Control 2023, 80, 104342.
  15. Kerkeni, A.; Benabdallah, A.; Manzanera, A.; Bedoui, M.H. A coronary artery segmentation method based on multiscale analysis and region growing. Comput. Med. Imaging Graph. 2016, 48, 49–61.
  16. Espindola, G.M.; Camara, G.; Reis, I.A.; Bins, L.S.; Monteiro, A.M. Parameter selection for region-growing image segmentation algorithms using spatial autocorrelation. Int. J. Remote Sens. 2007, 27, 3035–3040.
  17. Gaetano, R.; Masi, G.; Poggi, G.; Verdoliva, L.; Scarpa, G. Marker-Controlled Watershed-Based Segmentation of Multiresolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2987–3004.
  18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  19. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 79, 1337–1342.
  20. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158.
  22. Liu, M.; Xie, T.; Cheng, X.; Deng, J.; Yang, M.; Wang, X.; Liu, M. FocusedDropout for Convolutional Neural Network. Appl. Sci. 2022, 12, 7682.
  23. Lecun, Y. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
  24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
  25. Huang, R.; Wang, C.; Li, J.; Sui, Y. DF-UHRNet: A Modified CNN-Based Deep Learning Method for Automatic Sea Ice Classification from Sentinel-1A/B SAR Images. Remote Sens. 2023, 15, 2448.
  26. Wu, H.; Shi, C.; Wang, L.; Jin, Z. A Cross-Channel Dense Connection and Multi-Scale Dual Aggregated Attention Network for Hyperspectral Image Classification. Remote Sens. 2023, 15, 2367.
  27. Zheng, J.; Yuan, S.; Wu, W.; Li, W.; Yu, L.; Fu, H.; Coomes, D. Surveying coconut trees using high-resolution satellite imagery in remote atolls of the Pacific Ocean. Remote Sens. Environ. 2023, 287, 113485.
  28. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  29. Tang, Y.; Wu, X. Salient Object Detection Using Cascaded Convolutional Neural Networks and Adversarial Learning. IEEE Trans. Multimed. 2019, 21, 2237–2247.
  30. Zhang, C.; Atkinson, P.M.; George, C.; Wen, Z.; Diazgranados, M.; Gerard, F. Identifying and mapping individual plants in a highly diverse high-elevation ecosystem using UAV imagery and deep learning. ISPRS-J. Photogramm. Remote Sens. 2020, 169, 280–291.
  31. Osco, L.P.; Arruda, M.D.S.D.; Marcato Junior, J.; da Silva, N.B.; Ramos, A.P.M.; Moryia, É.A.S.; Imai, N.N.; Pereira, D.R.; Creste, J.E.; Matsubara, E.T.; et al. A convolutional neural network approach for counting and geolocating citrus-trees in UAV multispectral imagery. ISPRS-J. Photogramm. Remote Sens. 2020, 160, 97–106.
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  33. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 808–818.
  34. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  35. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
  36. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS-J. Photogramm. Remote Sens. 2020, 162, 94–114.
  37. Huo, G.; Lin, D.; Yuan, M. Iris segmentation method based on improved UNet++. Multimed. Tools Appl. 2022, 81, 41249–41269.
  38. Sun, H.; Wang, B.; Wu, Y.; Yang, H. Deep Learning Method Based on Spectral Characteristic Rein-Forcement for the Extraction of Winter Wheat Planting Area in Complex Agricultural Landscapes. Remote Sens. 2023, 15, 1301.
  39. Zhou, K.; Zhang, Z.; Liu, L.; Miao, R.; Yang, Y.; Ren, T.; Yue, M. Research on SUnet Winter Wheat Identification Method Based on GF-2. Remote Sens. 2023, 15, 3094.
  40. Tang, Z.; Sun, Y.; Wan, G.; Zhang, K.; Shi, H.; Zhao, Y.; Chen, S.; Zhang, X. Winter Wheat Lodging Area Extraction Using Deep Learning with GaoFen-2 Satellite Imagery. Remote Sens. 2022, 14, 4887.
  41. Kim, Y.H.; Park, K.R. MTS-CNN: Multi-task semantic segmentation-convolutional neural network for detecting crops and weeds. Comput. Electron. Agric. 2022, 199, 107146.
  42. Zhu, Y.; Zhao, Y.; Li, H.; Wang, L.; Li, L.; Jiang, S. Quantitative Analysis of the Water-Energy-Climate Nexus in Shanxi Province, China. Energy Procedia 2017, 142, 2341–2347.
  43. Abraham, N.; Khan, N.M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. In Proceedings of the IEEE 16th International Symposium on Biomedical Imaging (ISBI), Venice, Italy, 8–11 April 2019; pp. 683–687.
  44. Zanchetta, A.; Zecchetto, S. Wind direction retrieval from Sentinel-1 SAR images using ResNet. Remote Sens. Environ. 2021, 253, 112178.
  45. Yin, M.; Chen, Z.; Zhang, C. A CNN-Transformer Network Combining CBAM for Change Detection in High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 2406.
  46. Kim, M.; Jeong, D.; Kim, Y. Local climate zone classification using a multi-scale, multi-level attention network. ISPRS-J. Photogramm. Remote Sens. 2021, 181, 345–366.
  47. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  48. Zhang, J.; Wang, H.; Wang, Y.; Zhou, Q.; Li, Y. Deep Network Based on up and down Blocks Using Wavelet Transform and Successive Multi-Scale Spatial Attention for Cloud Detection. Remote Sens. Environ. 2021, 261, 112483.
  49. Fu, H.; Cheng, J.; Xu, Y.; Wong, D.W.K.; Liu, J.; Cao, X. Joint Optic Disc and Cup Segmentation Based on Multi-Label Deep Network and Polar Transformation. IEEE Trans. Med. Imaging 2018, 37, 1597–1605.
  50. Zhang, D.; Ding, Y.; Chen, P.; Zhang, X.; Pan, Z.; Liang, D. Automatic extraction of wheat lodging area based on transfer learning method and deeplabv3+ network. Comput. Electron. Agric. 2020, 179, 105845.
  51. Li, H.; Gan, Y.; Wu, Y.; Guo, L. EAGNet: A Method for Automatic Extraction of Agricultural Greenhouses from High Spatial Resolution Remote Sensing Images Based on Hybrid Multi-Attention. Comput. Electron. Agric. 2022, 202, 107431.
  52. Yuan, Y.; Lin, L.; Zhou, Z.-G.; Jiang, H.; Liu, Q. Bridging optical and SAR satellite image time series via contrastive feature extraction for crop classification. ISPRS J. Photogramm. Remote Sens. 2023, 195, 222–232.
  53. Chen, Y.; Zhang, C.; Wang, S.; Li, J.; Li, F.; Yang, X.; Wang, Y.; Yin, L. Extracting Crop Spatial Distribution from Gaofen 2 Imagery Using a Convolutional Neural Network. Appl. Sci. 2019, 9, 2917.
Figure 1. Study area (the remote sensing image of the study area in the figure is a GF-2 image).
Figure 2. Spatial distribution of the WW ground sampling points.
Figure 3. RAunet model structure diagram.
Figure 4. Residual model.
Figure 5. ASPP model.
Figure 6. Convolutional Block Attention Module (CBAM) structure.
Figure 7. Channel attention mechanism.
Figure 8. Spatial attention mechanism.
Figure 9. The improved U-Net model.
Figure 10. Loss and accuracy curves of the RAunet model and the six comparison models (acc and loss denote the accuracy and loss on the training set during model training, while val_acc and val_loss denote the accuracy and loss on the validation set).
Figure 11. Comparison of the model segmentation results ((a–c) densely distributed winter wheat, (d) large intraclass variation, and (e) interference from other green vegetation).
Figure 12. Precision, recall, and F1-score metrics for each model.
Table 1. Confusion matrix (TP represents the pixels in which WW was correctly identified; FP represents pixels where WW was incorrectly identified as NW; TN represents the pixels where NW was correctly identified; FN represents pixels where NW was incorrectly identified as WW).

Confusion Matrix      Winter Wheat    Non-Winter Wheat
Winter Wheat          TP              FN
Non-Winter Wheat      FP              TN
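Written with the symbols defined in the Table 1 caption, the evaluation metrics reported in Table 3 and Table 4 follow the standard formulas below. They are reproduced here only as a reader aid and are assumed (not restated in this back matter) to match the definitions used in the paper's methods section.

```latex
\begin{align*}
\mathrm{Precision} &= \frac{TP}{TP+FP}, &
\mathrm{Recall} &= \frac{TP}{TP+FN}, &
\mathrm{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN},\\[4pt]
F_1 &= \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, &
\mathrm{IoU}_{\mathrm{WW}} &= \frac{TP}{TP+FP+FN}, &
\mathrm{IoU}_{\mathrm{NW}} &= \frac{TN}{TN+FP+FN},\\[4pt]
\mathrm{mIoU} &= \tfrac{1}{2}\bigl(\mathrm{IoU}_{\mathrm{WW}}+\mathrm{IoU}_{\mathrm{NW}}\bigr). & & & &
\end{align*}
```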
Table 2. Confusion matrix for each comparison model.

Model        Real/Predicted    Winter Wheat    Non-Winter Wheat
FCN          WW                0.4159          0.0264
             NW                0.0265          0.5311
U-Net        WW                0.4166          0.0257
             NW                0.0355          0.5221
DeepLabv3    WW                0.4198          0.0225
             NW                0.0237          0.5340
SegNet       WW                0.4252          0.0171
             NW                0.0343          0.5234
ResUNet      WW                0.4241          0.0182
             NW                0.0410          0.5156
UNet++       WW                0.4239          0.0184
             NW                0.0227          0.5349
RAunet       WW                0.4270          0.0154
             NW                0.0232          0.5344
Table 3. Comparison of the evaluation indexes of each comparison model.

Evaluation Indicators    FCN       U-Net     DeepLabv3    SegNet    ResUNet    UNet++    RAunet
Precision                0.9402    0.9417    0.9490       0.9612    0.9588     0.9583    0.9652
Recall                   0.9400    0.9214    0.9466       0.9254    0.9116     0.9491    0.9484
Accuracy                 0.9470    0.9387    0.9530       0.9486    0.9407     0.9588    0.9613
F1-score                 0.9401    0.9314    0.9478       0.9430    0.9347     0.9537    0.9567
mIou                     0.8982    0.8833    0.9106       0.9013    0.8872     0.9201    0.9248
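As a numerical check (not part of the original article), the short Python sketch below applies the formulas given after Table 1 to the RAunet rows of Table 2, assigning the off-diagonal cells the FP and FN labels from the Table 1 caption; the recomputed values agree with the RAunet column of Table 3 to within rounding. The variable names are ours.

```python
# Recompute RAunet's Table 3 metrics from its normalized confusion matrix in Table 2,
# using the TP/FP/FN/TN labelling stated in the Table 1 caption.
tp = 0.4270  # WW correctly identified
fp = 0.0154  # WW incorrectly identified as NW (Table 1 caption's FP)
fn = 0.0232  # NW incorrectly identified as WW (Table 1 caption's FN)
tn = 0.5344  # NW correctly identified

precision = tp / (tp + fp)                           # ≈ 0.9652
recall = tp / (tp + fn)                              # ≈ 0.9485
accuracy = (tp + tn) / (tp + tn + fp + fn)           # ≈ 0.9614
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.9567
iou_ww = tp / (tp + fp + fn)                         # ≈ 0.9171
iou_nw = tn / (tn + fp + fn)                         # ≈ 0.9326
miou = (iou_ww + iou_nw) / 2                         # ≈ 0.9249

print(f"Precision={precision:.4f}, Recall={recall:.4f}, Accuracy={accuracy:.4f}")
print(f"F1={f1:.4f}, mIoU={miou:.4f}")
```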
Table 4. Ablation experiment results.

Pyramid Input Layer    Residual Module    ASPP    CBAM    Side Output Layer    F1-Score    mIou
                                                                               0.9314      0.8833
                                                                               0.9473      0.9108
                                                                               0.9517      0.9172
                                                                               0.9534      0.9194
                                                                               0.9544      0.9215
                                                                               0.9567      0.9248
"√" means that the corresponding module is added.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
