Article

Refined UNet V2: End-to-End Patch-Wise Network for Noise-Free Cloud and Shadow Segmentation

Aerospace Information Research Institute (AIR), Chinese Academy of Sciences (CAS), Beijing 100101, China
* Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(21), 3530; https://doi.org/10.3390/rs12213530
Submission received: 29 September 2020 / Revised: 21 October 2020 / Accepted: 22 October 2020 / Published: 28 October 2020

Abstract

Cloud and shadow detection is an essential prerequisite for further remote sensing processing, whereas edge-precise segmentation remains a challenging issue. In Refined UNet, we considered the aforementioned task and proposed a two-stage pipeline to achieve edge-precise segmentation. The isolated segmentation regions in Refined UNet, however, bring inferior visualization and should be sufficiently eliminated. Moreover, an end-to-end model is also expected to jointly predict and refine the segmentation results. In this paper, we propose the end-to-end Refined UNet v2 to achieve joint prediction and refinement of cloud and shadow segmentation, which is capable of visually neutralizing redundant segmentation pixels or regions. To this end, we inherit the pipeline of Refined UNet, revisit the bilateral message passing in the inference of the conditional random field (CRF), and then develop a novel bilateral strategy derived from the Guided Gaussian filter. Derived from a local linear model of denoising, our v2 can considerably remove isolated segmentation pixels or regions and thus yield “cleaner” results. Compared to the high-dimensional Gaussian filter, the Guided Gaussian filter-based message-passing strategy is quite straightforward and easy to implement, so that a brute-force implementation can easily be given in GPU frameworks, which is potentially efficient and facilitates embedding. Moreover, we prove that Guided Gaussian filter-based message passing is highly relevant to the Gaussian bilateral term in Dense CRF. Experiments and results demonstrate that our v2 is quantitatively comparable to Refined UNet but visually outperforms it from the noise-free segmentation perspective. The comparison of time consumption also supports the potential efficiency of our v2.

Graphical Abstract

1. Introduction

More and more remote sensing applications rely on cloud- and shadow-free images [1,2,3,4], yet remote sensing images are frequently degraded by clouds and cloud shadows, which negatively affects further processing and analysis. Cloud and shadow removal, in particular, requires the segmentation of clouds and their corresponding shadows as a prerequisite, which remains a challenging issue in remote sensing preprocessing. Fundamental solutions to cloud and cloud shadow segmentation focus on manually developed segmentation methods, which can be generally grouped into three categories: spectral tests, temporal differentiation, and statistical methods [4]. Spectral thresholds can be secured in terms of spectral data [3,5,6,7,8,9], temporal differentiation methods [10,11,12] pinpoint the movement of clouds and shadows, and statistical methods [13,14] exploit the statistics of spatial and spectral features. Manually developed methods can also be promoted by machine-learning methods built on large-scale labeled datasets [3], and data-driven segmentation methods have shown promising performance in cloud and shadow segmentation tasks [4,15].
Neural image segmentation (image segmentation by neural networks) approaches, on the other hand, introduce learnable end-to-end solutions in the spatial and spectral feature spaces of remote sensing images. Convolutional neural network-based (CNN-based) feature encoders enable spatial and spectral feature extraction and output representative feature vectors, which can serve as the backbone for dense classification tasks. Learnable parameters [16,17,18,19,20,21] or network structures [22,23] push models to fit the feature space and reach accurate pixel-wise classification results. Typical neural classifiers [4,24,25] have already been transferred to remote sensing segmentation, and some novel design principles [23,26,27] can gradually be applied to segmentation models fitting particular scenarios as well.
Some challenging issues, however, remain in cloud and cloud shadow segmentation tasks, such as edge-precise segmentation [15]. Due to the discrete cost functions and the inflation of receptive fields in CNN-based feature extractors, neural image segmentation is restricted to coarse pixel-level classification, and the precise delineation of clouds and shadows is still limited. To achieve edge-precise segmentation, the fully connected conditional random field (Dense CRF) has been employed to model segmentation at the pixel or patch level, and it is able to refine segmentation performance along the edges. A feasible solution has been given in Refined UNet [15], in which we preliminarily investigated edge-precise cloud and shadow segmentation and proposed a feasible two-stage pipeline: a trainable UNet coarsely locates clouds and shadows patch by patch, and then Dense CRF post-processing refines the segmentation edges on the full images.
Building on the two-stage Refined UNet [15], we expect an end-to-end implementation of UNet-CRF segmentation, which should incorporate UNet coarse prediction and CRF refinement for an individual patch in one forward step. However, the complicated high-dimensional filter-based bilateral message passing makes it difficult to extend Refined UNet to an end-to-end model, which inspires us to explore bilateral message passing with other strategies. Therefore, we intend to go deep into the Dense CRF and simplify its bilateral message passing, which helps build a joint implementation composed of UNet coarse prediction, unary transformation, and CRF refinement. In this paper, we inherit the two-stage pipeline of cloud and shadow segmentation in Refined UNet [15] and further explore an end-to-end solution, in which clouds and shadows can be jointly identified and refined by the concatenation of the pretrained UNet and the following CRF. Guided Gaussian filter-based message passing is employed in our computationally efficient CRF inference, rather than the complicated high-dimensional filter. Practically, a vanilla brute-force GPU implementation can easily be given by the Gaussian filter in GPU frameworks. Derived from the local linear model of denoising, our proposed CRF can effectively eliminate redundant isolated segmentation pixels or regions, which yields “cleaner” results. A visual example of our Refined UNet v2 is shown in Figure 1. Accordingly, our main contributions are listed as follows.
  • Refined UNet V2: an experimental prototype of the end-to-end model for cloud and shadow segmentation is proposed, which can jointly predict and refine clouds and shadows by the concatenation of the UNet and following CRF for an individual image patch in one forward step.
  • Straightforward and potentially efficient GPU implementation: we give an innovative Guided Gaussian filter-based message-passing strategy, which is straightforward and easy to implement in GPU frameworks. Thus, the vanilla implementation is also potentially efficient in computation.
  • Noise-free segmentation: Our proposed Refined UNet v2 can effectively eliminate redundant isolated segmentation pixels or regions and yield “cleaner” results. Moreover, we demonstrate that the CRF can show a particular segmentation preference (edge-precise or clean results) if the bilateral term is customized to fit the preference.
The rest of the paper is organized as follows. Section 2 reviews related work regarding cloud and shadow segmentation. The proposed Refined UNet v2 is described in Section 3. Section 4 presents the experiments on the Landsat 8 OLI dataset, including quantitative and visual comparisons against Refined UNet [15], an ablation study with respect to the proposed CRF, hyperparameter sensitivity with respect to r and ϵ, and computational efficiency. Section 5 concludes this paper.

2. Related Work

In this section, we review semantic segmentation and related techniques, including neural image segmentation methods, CRF methods, and edge-preserving filters.

2.1. Neural Image Segmentation

In our paper, image segmentation by trainable neural networks is referred to as neural image segmentation, because the backbone of feature extraction and transformation is basically built upon convolutional [16,17,18,19,20,21] or graph neural networks [28,29]. Neural semantic segmentation methods mainly model the dense classification tasks by minimizing the pixel-level cost function [16,17]; they usually train an end-to-end segmentation model using pairs of images and labels. Neural semantic segmentation methods generally benefit from CNN-based backbones, which effectively provide semantic feature transformation. Sophisticated CNN-based backbones, such as VGG-16/VGG-19 [30], MobileNets V1/V2/V3 [31,32,33], ResNet18/ResNet50/ResNet101 [34,35], and DenseNet [36], facilitate high-level semantic comprehension and then substantially promote the accuracy and efficacy of these tasks due to their delicate feature extraction.
Typical neural segmentation methods are reviewed here as they show the principles of designing architectures and adapting to particular scenarios. The fully convolutional network (FCN) [16] initiated neural semantic segmentation, replacing fully connected layers with convolutional layers to adapt to segmentation of arbitrary size. UNet [17] introduced intermediate layer concatenation to reuse and fuse the extracted feature maps. Moreover, long-range feature map aggregation was fully employed in segmentation tasks, such as RefineNet [18] enhancing high-resolution segmentation and PSPNet [19] with its pyramid pooling module. In scene understanding applications, SegNet [20,21] inherited the encoder-decoder architecture for segmentation. For efficient computation in segmentation tasks, joint pyramid upsampling was applied in FastFCN [37]. The DeepLab series [38,39,40,41] exploited the atrous convolution, the CRF post-processing, the depth-wise separable convolution, the atrous spatial pyramid pooling module, and novel backbones, attempting to improve both efficiency and robustness.
The current trends of semantic segmentation concentrate on (i) bringing prior knowledge or features to particular scenarios, such as “cars cannot fly up in the sky” in urban scene segmentation [42], Fourier domain adaptation [26], and model transfer (synthetic images to real images) [43], (ii) effective and efficient prediction, such as single-stage effective segmentation [44], boundary-preserving segmentation [45], and very high-resolution segmentation [46], and (iii) novel network architectures, such as learnable dynamic architectures for semantic segmentation [23] and graph reasoning [27].

2.2. CRF-Based Image Segmentation

The CRFs can model semantic segmentation at the pixel or patch level, which implicitly performs a maximum a posteriori (MAP) inference by minimizing the corresponding Gibbs energy function. Adjacency CRFs have been significantly improved by higher-order potentials or hierarchical connectivity, such as the robust $P^n$ CRF [47,48], but they are still restricted to coarse segmentation due to short-range connectivity. Fully connected CRFs benefit from long-range connectivity, while the computation complexity remains a challenging problem; graph cuts [49] and the high-dimensional filter [50] are two common solutions to efficient inference. On the other hand, the unary potentials in a CRF can be predicted by a coarse segmentation structure. For example, CRFasRNN [51] presented an end-to-end segmentation structure concatenating a CNN and a CRF, in which the CNN-based segmentation backbone yielded the unary prediction and the following CRF refinement was built as a recurrent-neural-network (RNN) layer. A similar combination of a deep learning architecture and a Gaussian Conditional Random Field (G-CRF) was given in [52]. CRFs can refine the segmentation performance at the pixel level, which provides a novel perspective for our precise cloud and shadow segmentation task.

2.3. Edge-Preserving Filters

Edge-preserving filters aim to simultaneously smooth images and preserve edges, and they have been widely used in applications of noise removal [53], high dynamic range (HDR) compression [54], haze removal [55], and joint upsampling [56]. Sophisticated filters include the bilateral, weighted least squares (WLS), and guided filters. The bilateral filter [57] was a straightforward edge-preserving method to smooth images, which computed the output pixels weighted by a Gaussian kernel of both the spatial and color intensity discrepancy. The WLS filter [58] smooths images in an edge-preserving way by optimizing a quadratic cost function with the input image itself as guidance, which achieves a global optimization. The guided filter [59] performs edge-preserving smoothing on images, possibly taking the images themselves as guidance. In our study, the edge-preserving filter provides a message-passing strategy of smoothing the segmentation and preserving the edges in the CRF inference.

3. Methodology

We introduce our Refined UNet v2 in four subsections, including an overview of Refined UNet v2 in Section 3.1, the Dense CRF as the segmentation refinement in Section 3.2, the Guided Gaussian filter as the efficient message-passing strategy in Section 3.3, and our end-to-end CRF inference in Section 3.4.

3.1. Overview of Refined UNet v2

We present the overview of our Refined UNet v2, which performs noise-free segmentation on high-resolution remote sensing images. An end-to-end UNet-CRF architecture is used to roughly locate clouds and shadows and to refine away noise from a local perspective; it takes as input a seven-band 512 × 512 patch and yields a corresponding refined segmentation result. Hence, a high-resolution remote sensing image is first padded and cropped into patches of 512 × 512, and then the aforementioned end-to-end network infers and refines patch by patch. The full segmentation result of clouds and shadows is eventually reconstructed from the patches. In this case, the pretrained UNet is inherited from [15] and the proposed CRF inference is introduced in the following subsections. The full pipeline of our Refined UNet v2 is illustrated in Figure 2.

3.2. Revisiting Fully Connected Conditional Random Fields

We revisit the Dense CRF and the corresponding mean-field approximation inference, which has been thoroughly defined in [50]. Given the random field X and its global observation (image) I , the CRF ( I ,   X ) is characterized by a Gibbs distribution, defined in Equation (1), and the corresponding Gibbs energy is given by Equation (2).
$P(\mathbf{X} \mid \mathbf{I}) = \frac{1}{Z(\mathbf{I})} \exp\left(-E(\mathbf{x} \mid \mathbf{I})\right)$ (1)
$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i < j} \psi_p(x_i, x_j)$ (2)
in which x denotes the label assignments for all pixels, ψ u the unary potential, and ψ p the pairwise potential.
In Equation (2), the unary potential can be practically given by a pixel-level classifier, and the pairwise potential is given by Equation (3).
$\psi_p(x_i, x_j) = \mu(x_i, x_j) \underbrace{\sum_{m=1}^{K} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)}_{k(\mathbf{f}_i, \mathbf{f}_j)}$ (3)
in which k ( m ) and w ( m ) denote a Gaussian kernel and its corresponding weight, f i and f j the feature vectors of pixel i and j, μ the label compatibility function.
In [50], the contrast-sensitive two-kernel potentials are given by Equation (4), in which I i , I j , p i , and p j denote color vectors and spatial positions of pixel i and j.
$k(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\alpha^2} - \frac{|I_i - I_j|^2}{2\theta_\beta^2}\right) + w^{(2)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\right)$ (4)
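To make Equation (4) concrete, the following Python sketch evaluates the contrast-sensitive two-kernel potential for a single pair of pixels. It is only an illustration: the kernel weights are free hyperparameters set to 1 here, the default values of θα, θβ, and θγ are taken from the experimental configuration in Section 4.2, and a practical inference never evaluates the kernel pair by pair but relies on the filtering strategies discussed below.

import numpy as np

def pairwise_kernel(p_i, p_j, I_i, I_j,
                    w1=1.0, w2=1.0,
                    theta_alpha=80.0, theta_beta=13.0, theta_gamma=3.0):
    """Contrast-sensitive two-kernel potential k(f_i, f_j) of Equation (4).

    p_i, p_j: 2-D pixel positions; I_i, I_j: color (or grayscale) intensities.
    The first (appearance) kernel links nearby, similarly colored pixels;
    the second (smoothness) kernel depends on spatial distance only.
    """
    d_pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    d_col = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_col / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness

# Example: two neighboring pixels with similar intensity receive a large weight.
print(pairwise_kernel((10, 10), (12, 10), 0.42, 0.40))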
The inference of CRF aims to find a x ^ as the most possible pixel-level classification by minimizing the energy function E ( x ) , and the mean-field approximation facilitates the inference instead of computing the exact distribution P ( X ) . Equation (5) of mean-field approximation leads to an iterative update algorithm, which is presented in Algorithm 1.
$Q_i(x_i = l) = \frac{1}{Z_i} \exp\left\{-\psi_u(x_i) - \sum_{l' \in \mathcal{L}} \mu(l, l') \sum_{m=1}^{K} w^{(m)} \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l')\right\}$ (5)
Algorithm 1 Mean-Field Approximation in Fully Connected CRFs.
1: Initialize Q: $Q_i(x_i) \leftarrow \frac{1}{Z_i} \exp\{-\phi_u(x_i)\}$
2: while not converged do
3:   $\tilde{Q}_i^{(m)}(l) \leftarrow \sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l)$ for all $m$
4:   $\hat{Q}_i(x_i) \leftarrow \sum_{l \in \mathcal{L}} \mu^{(m)}(x_i, l) \sum_{m} w^{(m)} \tilde{Q}_i^{(m)}(l)$
5:   $Q_i(x_i) \leftarrow \exp\{-\psi_u(x_i) - \hat{Q}_i(x_i)\}$
6:   normalize $Q_i(x_i)$
7: end while

3.3. Guided Gaussian Filter

We introduce the Guided Gaussian filter as our proposed efficient message-passing method. As presented in [50], message passing is the bottleneck of efficient Dense CRF inference, and the high-dimensional filter based on the permutohedral lattice [60] is chosen to accelerate it. The implementation of the high-dimensional Gaussian filter significantly reduces the time complexity but is quite complicated. Therefore, we intend to find an alternative for efficient message passing that remains highly relevant to the bilateral term, involving both the color intensity and the Gaussian spatial feature. The Guided Gaussian filter satisfies the requirement of the aforementioned bilateral term, and it is introduced in this subsection.
Assuming a local linear model between the guidance I and the output y, we have Equation (6) in a window ω k centered at the pixel k.
$y_i = a_k I_i + b_k, \quad \forall i \in \omega_k$ (6)
The difference between the desired output $y_i$ and the input $x_i$ is assumed to be the unwanted noise $n_i$, defined in Equation (7).
$y_i = x_i - n_i$ (7)
Given a guidance I, the solution should minimize the discrepancy between the input x and the desired output y, which is formulated as the cost function of a linear ridge regression. Moreover, we introduce a Gaussian weight $g_{ik}$ in window $\omega_k$, defined in Equation (8), and the cost function is defined in Equation (9).
$g_{ik} = \exp\left(-\frac{|p_i - p_k|^2}{\theta_\alpha^2}\right)$ (8)
$E(a_k, b_k) = \sum_{i \in \omega_k} g_{ik}\left[\left(a_k I_i + b_k - x_i\right)^2 + \epsilon a_k^2\right]$ (9)
in which ϵ is a regularization parameter to penalize large a k .
Equation (9) defines a linear ridge regression model and the solution is given by Equations (10)–(13).
$a_k = \frac{\sum_{i \in \omega_k} g_{ik} I_i x_i - \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i}{\sum_{i \in \omega_k} g_{ik} I_i^2 - \mu_k \sum_{i \in \omega_k} g_{ik} I_i + \epsilon \sum_{i \in \omega_k} g_{ik}}$ (10)
$b_k = \frac{\sum_{i \in \omega_k} g_{ik} (x_i - a_k I_i)}{\sum_{i \in \omega_k} g_{ik}} = \bar{x}_k - a_k \mu_k$ (11)
$\bar{x}_k = \frac{\sum_{i \in \omega_k} g_{ik} x_i}{\sum_{i \in \omega_k} g_{ik}}$ (12)
$\mu_k = \frac{\sum_{i \in \omega_k} g_{ik} I_i}{\sum_{i \in \omega_k} g_{ik}}$ (13)
in which x ¯ k and μ k denote the Gaussian-weighted means of x i and I i in window ω k . The derivation is given in Appendix A.
In fact, a k and b k can be computed by the Gaussian filter with radius r, defined in Equations (14) and (15).
$a_k = \frac{f_{\mathrm{GF}}(I_i x_i) - f_{\mathrm{GF}}(I_i)\, f_{\mathrm{GF}}(x_i)}{f_{\mathrm{GF}}(I_i^2) - f_{\mathrm{GF}}^2(I_i) + \epsilon}$ (14)
$b_k = f_{\mathrm{GF}}(x_i) - a_k f_{\mathrm{GF}}(I_i)$ (15)
in which the radius r controls the size of window ω k : its width is 2 r + 1 .
Using the Gaussian filter with radius r to compute the overlapping a k and b k , defined in Equations (16) and (17), y i can be obtained by Equation (18) and the Guided Gaussian filter can be defined in Algorithm 2.
$\bar{a}_i = \frac{\sum_{k \in \omega_i} g_{ki} a_k}{\sum_{k \in \omega_i} g_{ki}} = f_{\mathrm{GF}}(a_k)$ (16)
$\bar{b}_i = \frac{\sum_{k \in \omega_i} g_{ki} b_k}{\sum_{k \in \omega_i} g_{ki}} = f_{\mathrm{GF}}(b_k)$ (17)
$y_i = \bar{a}_i I_i + \bar{b}_i$ (18)
Algorithm 2 Guided Gaussian Filter.
Input: input image to filter x, guidance image I, radius r, regularization parameter ϵ
Output: filtered output y
1: mean_I ← f_GF(I)
2: mean_x ← f_GF(x)
3: corr_I ← f_GF(I .* I)
4: corr_Ix ← f_GF(I .* x)
5: var_I ← corr_I − mean_I .* mean_I
6: cov_Ix ← corr_Ix − mean_I .* mean_x
7: a ← cov_Ix ./ (var_I + ϵ)
8: b ← mean_x − a .* mean_I
9: mean_a ← f_GF(a)
10: mean_b ← f_GF(b)
11: y ← mean_a .* I + mean_b
12: return y
where .* and ./ denote element-wise multiplication and division.
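As a reading aid, the following Python sketch mirrors Algorithm 2, with the Gaussian filter f_GF approximated by scipy.ndimage.gaussian_filter. It is not the TensorFlow implementation used in our experiments, and the mapping between the window radius r and the standard deviation sigma is an assumption of the sketch.

import numpy as np
from scipy.ndimage import gaussian_filter

def guided_gaussian_filter(x, I, sigma, eps):
    """Sketch of Algorithm 2 with f_GF realized by a Gaussian blur.

    x: image to filter (e.g., a per-class probability map), I: guidance image,
    sigma: standard deviation of the Gaussian window (related to the radius r),
    eps: regularization parameter of the ridge regression.
    """
    f_GF = lambda z: gaussian_filter(z, sigma=sigma, mode="nearest")

    mean_I = f_GF(I)
    mean_x = f_GF(x)
    corr_I = f_GF(I * I)
    corr_Ix = f_GF(I * x)

    var_I = corr_I - mean_I * mean_I
    cov_Ix = corr_Ix - mean_I * mean_x

    a = cov_Ix / (var_I + eps)          # Equation (14)
    b = mean_x - a * mean_I             # Equation (15)

    mean_a = f_GF(a)                    # Equation (16)
    mean_b = f_GF(b)                    # Equation (17)

    return mean_a * I + mean_b          # Equation (18)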
A bilateral term composed of both the color intensity and the Gaussian spatial features is desired, as presented in [50]. Actually, the Guided Gaussian filter takes into consideration both the position and color intensity features of pixels i and j, which can be proved by the implicit kernel weights defined in Equation (19).
$W_{ij}(I) = \frac{1}{Z_g^2} \sum_{k:(i,j) \in \omega_k} g_{ik}\, g_{jk} \left[1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k + \epsilon}\right]$ (19)
in which $Z_g$, $\mu_k$, and $\sigma_k$ are the normalization term of the Gaussian filter $\sum_{i \in \omega_k} g_{ik}$, the Gaussian-weighted mean $f_{\mathrm{GF}}(I_i)$, and the variance $f_{\mathrm{GF}}(I_i^2) - f_{\mathrm{GF}}^2(I_i)$ in window $\omega_k$, respectively. The proof is given in Appendix B.

3.4. End-to-End CRF Inference

As is mentioned above, an efficient inference has been presented in [50], in which a high-dimensional filtering algorithm was applied to the efficient bilateral message passing, defined in Equation (20). The high-dimensional filter of the permutohedral lattice [60] can reduce the time complexity of the bilateral message passing to linear time consumption ( O ( N ) ) [50].
$\tilde{Q}_i^{(m)}(l) = \underbrace{\sum_{j \neq i} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)\, Q_j(l)}_{\text{message passing}} = \underbrace{\left[G_{\Lambda^{(m)}} \otimes Q(l)\right](\mathbf{f}_i)}_{\bar{Q}_i^{(m)}(l)} - Q_i(l)$ (20)
Nevertheless, the high-dimensional filter is quite complicated to implement, especially for GPU and end-to-end inference. Conversely, the brute-force Guided Gaussian filter can be applied to the bilateral message passing, as it is highly relevant to the bilateral features. The bilateral message-passing term can therefore be replaced by the Guided Gaussian filter while the smoothness term remains a Gaussian kernel, defined in Equation (21).
$\bar{Q}_i(l) = w^{(1)} \underbrace{\sum_{j \neq i} W_{ij}(I)\, Q_j(l)}_{\tilde{Q}_i^{(1)}(l)} + w^{(2)} \underbrace{\sum_{j \neq i} k_{ij}^{(2)}\, Q_j(l)}_{\tilde{Q}_i^{(2)}(l)}$ (21)
in which the smoothness term is defined in Equation (22).
$k_{ij}^{(2)} = \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\right)$ (22)
Consequently, the mean-field approximation algorithm for our CRF inference is presented in Algorithm 3.
Algorithm 3 End-to-End Mean-Field Approximation in CRF Inference.
1: Initialize Q: $Q_i(x_i) \leftarrow \frac{1}{Z_i} \exp\{-\phi_u(x_i)\}$
2: while not converged do
3:   $\tilde{Q}_i^{(1)}(l) \leftarrow \sum_{j \neq i} W_{ij}(I)\, Q_j(l)$
4:   $\tilde{Q}_i^{(2)}(l) \leftarrow \sum_{j \neq i} k_{ij}^{(2)}\, Q_j(l)$
5:   $\hat{Q}_i(x_i) \leftarrow \sum_{l \in \mathcal{L}} \mu^{(m)}(x_i, l) \sum_{m} w^{(m)} \tilde{Q}_i^{(m)}(l)$
6:   $Q_i(x_i) \leftarrow \exp\{-\psi_u(x_i) - \hat{Q}_i(x_i)\}$
7:   normalize $Q_i(x_i)$
8: end while
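A minimal Python sketch of Algorithm 3 is given below. It reuses guided_gaussian_filter from the sketch of Algorithm 2, takes the label compatibility μ as a simple Potts model, and treats the weights and kernel widths as illustrative hyperparameters; zeroing the kernel center so that a pixel does not message itself is omitted for brevity. The actual model is implemented in TensorFlow and concatenated with the pretrained UNet.

import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import softmax

def mean_field_inference(unary, guidance, n_iters=10, w1=1.0, w2=1.0,
                         sigma_bilateral=20.0, sigma_smooth=3.0, eps=1e-8):
    # unary: (H, W, C) unary potentials, e.g. negative log-softmax of the UNet
    # output; guidance: (H, W) grayscale guidance image (Bands 5, 4, 3 combined).
    Q = softmax(-unary, axis=-1)  # step 1: initialization
    for _ in range(n_iters):
        Q_tilde = np.empty_like(Q)
        for c in range(unary.shape[-1]):
            # step 3: bilateral message passing via the Guided Gaussian filter
            # (guided_gaussian_filter is the sketch of Algorithm 2 above)
            msg_bilateral = guided_gaussian_filter(Q[..., c], guidance,
                                                   sigma=sigma_bilateral, eps=eps)
            # step 4: spatial smoothness message passing via a plain Gaussian
            msg_smooth = gaussian_filter(Q[..., c], sigma=sigma_smooth)
            Q_tilde[..., c] = w1 * msg_bilateral + w2 * msg_smooth
        # steps 5-7: Potts-style compatibility transform, local update, and
        # normalization (softmax)
        Q_hat = Q_tilde.sum(axis=-1, keepdims=True) - Q_tilde
        Q = softmax(-unary - Q_hat, axis=-1)
    return Q.argmax(axis=-1)  # refined label map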

4. Experiments and Discussion

We experiment with our method in several subsections, including quantitative and visual comparisons against Refined UNet [15], an ablation study with respect to our CRF inference, hyperparameter sensitivity with respect to r and ϵ in the CRF, and the computational efficiency of the v2.

4.1. Revisiting Experimental Datasets, Preprocessing, Implementation Details, and Evaluation Metrics

To evaluate the performance of the proposed Refined UNet v2, we reuse the experimental dataset in [15], which is selected from Landsat 8 OLI imagery data [3]. Practically, the test dataset, labels of clouds and shadows, class IDs, and visual settings are inherited: images of Landsat 8 OLI, Path 113, Row 26, Year 2016 are reserved for the test, listed as follows. For a fair comparison, the pretrained UNet backbone is inherited, which has been trained and validated on the training and validation sets given in [15]. QA is derived from the Level-2 Pixel Quality Assessment band as the reference, which specifically labels pixels of clouds and shadows with high confidence generated by the CFMask algorithm [8]. Please note that QA is referred to as a reference rather than ground truth because the labels of clouds and shadows are dilated and not precise enough at the pixel level. Class IDs of background, fill values (invalid values, −9999), shadows, and clouds are assigned to 0, 1, 2, and 3, respectively; land, snow, and water are merged into background as we focus more on cloud and shadow segmentation. Bands 5 (NIR), 4 (Red), and 3 (Green) are stacked as false-color images, and pixels of background (0), cloud shadows (2), and clouds (3) are colorized by gray (#9C9C9C), green (#267300), and cyan (#73DFFF), respectively. Please refer to [15] for more details regarding the dataset.
  • Test images of Landsat 8 OLI, Path 113, Row 26, Year 2016: 2016-03-27, 2016-04-12, 2016-04-28, 2016-05-14, 2016-05-30, 2016-06-15, 2016-07-17, 2016-08-02, 2016-08-18, 2016-10-21, and 2016-11-06
We also inherit the data preprocessing in [15]: fill and padded values are assigned to zero, full images are sliced into patches with a size of 512 × 512 , and all pixels in one patch are normalized to the interval (0, 1] for UNet prediction. In our case, test images are padded to 8192 × 8192 so they can be sliced into 256 patches. UNet backbone takes all seven bands as input, but CRF only uses Bands 5, 4, and 3, and combines them to grayscale. We follow [50] to transform predictions into unary potentials.
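The preprocessing just described can be summarized by the following Python sketch; the array shapes, the max-based normalization, and the helper names are illustrative rather than the exact code of [15].

import numpy as np

def slice_into_patches(image, patch=512, pad_to=8192):
    """Zero-pad the seven-band image to pad_to x pad_to, slice it into
    non-overlapping patch x patch tiles, and normalize each tile to (0, 1]
    for UNet prediction (normalization here is a simple max scaling)."""
    h, w, bands = image.shape
    padded = np.zeros((pad_to, pad_to, bands), dtype=np.float32)
    padded[:h, :w] = image                     # fill/padded values stay zero
    patches = []
    for i in range(0, pad_to, patch):
        for j in range(0, pad_to, patch):
            tile = padded[i:i + patch, j:j + patch]
            scale = tile.max()
            patches.append(tile / scale if scale > 0 else tile)
    return patches                             # 256 patches for 8192/512

def reassemble(patches, patch=512, size=8192):
    """Restore the full-resolution label map from per-patch results."""
    n = size // patch
    full = np.zeros((size, size), dtype=patches[0].dtype)
    for idx, tile in enumerate(patches):
        i, j = divmod(idx, n)
        full[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = tile
    return full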
In the implementation of CRF inference, the Gaussian and Guided Gaussian filters are built upon the TensorFlow [61] framework, and the center weight of the Gaussian kernel is set to zero so that a pixel does not pass a message to itself. Instead of reusing the reported assessment, we reproduce Refined UNet [15], but the number of iterations of Dense CRF inference is set to 10. In our Refined UNet v2, ϵ is empirically assigned to 10^{-8}, while r varies from 10 to 80; we choose v2 with r = 20 for visual assessment as well. We also conduct a subsequent experiment in this section to examine the effect of these hyperparameters. The demo code is available at https://github.com/92xianshen/refined-unet-v2.
For evaluation, we inherit the quantitative metrics from [15] to assess our methods, including accuracy, precision P, recall R, and F1 scores. Considering the indicator $p_{ij}$ denoting the cumulative number of pixels that should be grouped into class i but are actually grouped into class j, precision P reports the ability of the method to predict correctly, recall R reports the ability of the method to retrieve comprehensively, and the F1 score is a synthesized indicator considering both P and R. Equations (23)–(25) define the above indicators.
$P_i = \frac{p_{ii}}{\sum_{j=1}^{C} p_{ji}}$ (23)
$R_i = \frac{p_{ii}}{\sum_{j=1}^{C} p_{ij}}$ (24)
$F1_i = \frac{2\, P_i \cdot R_i}{P_i + R_i}$ (25)
in which C is the number of all the classes.
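Given the definition of $p_{ij}$ above, the per-class scores of Equations (23)–(25) can be computed from a confusion matrix as in the following Python sketch; it is illustrative only, and zero denominators (classes never predicted or never present) are not handled.

import numpy as np

def per_class_scores(confusion):
    """Per-class precision, recall, and F1 from a C x C confusion matrix,
    following Equations (23)-(25); confusion[i, j] counts pixels of true
    class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=np.float64)
    tp = np.diag(confusion)
    precision = tp / confusion.sum(axis=0)   # column sums: predicted as class i
    recall = tp / confusion.sum(axis=1)      # row sums: truly class i
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1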
Additionally, time consumption is also considered an indicator to evaluate these methods, which can demonstrate whether our method is practically efficient.

4.2. Quantitative Comparison between Refined UNets and V2

We first clarify some expressions in the comparison experiments. For brevity, we refer to the segmentation results from UNet as predictions and those from CRF as refinements. The difference between the local and global Refined UNet lies in whether refinement is performed on partial or full predictions. Hence, the local Refined UNet reproduces refinements by taking the image patches as input, and the full segmentation results are restored from patches. Conversely, the global Refined UNet predicts locally but refines globally, yielding full results by CRF. Naturally, the global Refined UNet represents the method in [15].
For the quantitative comparison, we inherit the hyperparameter configuration of the global and local Refined UNets from [15]: θα, θβ, and θγ are empirically assigned to 80, 13, and 3, respectively. The quantitative assessment is listed in Table 1, in which precision, recall, and F1 scores are compared. As can be seen in Table 1, the quantitative evaluations of the UNet backbone, global Refined UNet [15], local Refined UNet, and Refined UNet v2 are approximately close in terms of accuracy, P, R, and F1. Please note that the QA band is in fact not precise enough to serve as ground truth, so we take these quantitative indicators into account only as a secondary evaluation, in order to observe whether there is a distinct difference in the numerical indicators. Accordingly, the quantitative assessment demonstrates that our Refined UNet v2 is quantitatively comparable to Refined UNet [15]. On the other hand, the accuracy score drops slightly with the increase of the radius r, and the precision P and recall R of the cloud shadow class decrease significantly. We attribute this to the intrinsic refinement of our v2: it prunes some true shadow regions along with the noise, so P and R are considerably affected. Additionally, a side effect of these experiments is demonstrating the reproducibility of Refined UNet [15].

4.3. Visual Comparison between Refined UNet and V2

Furthermore, we compare the visual performance between Refined UNet [15] and v2. Theoretically, Refined UNet [15] focuses on global edge-preserving refinement while Refined UNet v2 concentrates more on eliminating noise in local predictions, due to their mathematical derivations. We present the qualitative evaluations as the principal assessment, illustrated in Figure 3 and Figure 4. Visually, our Refined UNet v2 retains the prediction from its backbone but enlarges some small pieces of clouds and shadows from a global perspective (Figure 3). From a local observation, it features “denoising” the prediction (eliminating isolated minor pieces or pixels) and performs a better segmentation on the snow region (the first row in Figure 4). We attribute this to its intrinsic property of denoising, which has been shown in Section 3.3. Therefore, our Refined UNet v2 is superior in the noise-free segmentation of clouds and shadows.
However, it is noted that our Refined UNet v2 sometimes over-refines the “noise”, removing not only misclassified minor regions (snow regions or fine cloud regions) but also small pieces of background, which leads to merging two separated cloud regions; we attribute these drawbacks to the property of denoising as well. The strength of denoising can essentially be controlled by the radius r, which is thoroughly explored in the following subsection. In addition, some misclassified regions remain: a piece of the river, for example, is detected as shadow, and some associated shadows are missing. This is, in our opinion, because of the misclassification of the UNet backbone and the strength of our CRF inference. In the future, we will further explore the properties of other filter-based message-passing mechanisms to improve the CRF inference, and plan to introduce sophisticated weakly supervised strategies to improve the accuracy.

4.4. Ablation Study Regarding Our CRF Inference

We evaluate the performance of the UNet backbone and v2 to demonstrate the effect of our CRF inference, which is illustrated in Figure 3 and Figure 4. Visually, the UNet with adaptive weights concentrates more on the minority categories (pixels of cloud shadows), which results in a prediction with too much noise. The visualizations in Figure 3 and Figure 4 demonstrate the aforementioned efficacy. Refined UNet [15] can refine the boundaries of cloud and shadow entities and, to some extent, eliminate isolated regions, but it still preserves some stubborn fine regions. In contrast, v2 can faithfully denoise these predictions and generate cleaner results. Consequently, our Refined UNet v2 outperforms Refined UNet [15] in noise-free segmentation. Similarly, the over-refinement of isolated regions remains, which is sometimes a drawback to overcome in future exploration.

4.5. Hyperparameter Sensitivity Regarding r and ϵ in Our CRF Inference

We evaluate the performance of Refined UNet v2 with respect to the hyperparameters r and ϵ. As mentioned above, r and ϵ denote the radius of the Gaussian filter and the regularization parameter of the linear ridge regression model, respectively. According to [59], r controls the filter window and affects the inference efficiency of the brute-force implementation: a higher r should yield a more refined result because of longer-range connectivity but takes more time in inference, while ϵ controls the ability to smooth predictions: a higher ϵ yields a more blurred prediction. The qualitative and quantitative assessments are illustrated in Figure 5 and listed in Table 1.
Figure 5 visually presents the effect of r on the segmentation performance. As speculated above, a higher r propagates messages with longer-range connectivity, leading to finer segmentation results; a lower r truncates long-range message passing, leading to coarser segmentation results. The quantitative assessment in Table 1 also confirms the denoising sensitivity: v2 with a larger r prefers to prune more shadows, but it tends to retain more shadow entities if a smaller r is assigned to it. This can be visually explained by the similarity between shadows and segmentation noise. Our Refined UNet v2, on the other hand, is not sensitive to ϵ until it grows up to 1; the edges of segmentation results are not precise enough when ϵ is too high. In summary, we can increase r to yield cleaner results and decrease it to preserve more entities.

4.6. Computational Efficiency of Refined UNet v2

We compare the computational efficiency between the Refined UNets and v2, which is shown in Table 1. Please note that the listed duration is the time consumption of inferring one full image in the test phase. As can be seen in Table 1, we can confirm that our v2 is potentially efficient compared to the global and local Refined UNets: the global and local Refined UNets spend similar time refining results, which is not proportional to θα, θβ, and θγ, whereas our v2 consumes less time than the counterparts if r < 50. The computational efficiency of our v2 benefits from the GPU support of the TensorFlow framework. Moreover, according to Table 1, the computational efficiency of our brute-force implementation is highly related to the selection of the radius r: the time consumption increases significantly when a greater r is used, which we attribute to our brute-force implementation. Thus, we will optimize the time performance with an efficient GPU filter in future work.

5. Conclusions

In this paper, we present an experimental prototype of an end-to-end pixel-level classifier for noise-free patch-wise cloud and shadow segmentation, which concatenates the UNet for coarse cloud and shadow segmentation and the following CRF inference for segmentation noise removal. In contrast to the separated pipeline of the local prediction of UNet and the global refinement of Dense CRF post-processing in [15], our end-to-end Refined UNet v2 locally gives the coarse prediction and the noise-free segmentation result simultaneously. Practically, it is straightforward to implement in GPU-enabled machine-learning frameworks, which makes it potentially efficient. Theoretically, we prove that the proposed bilateral term is highly relevant to both color intensity and Gaussian spatial features, similar to that of [50]. Experiments and results have demonstrated that our v2 is quantitatively comparable to Refined UNet [15] in terms of accuracy, precision, recall, and F1 scores, but visually outperforms it from the noise-free segmentation perspective. It is noted that a larger radius r results in the accuracy score dropping slightly, but the precision, recall, and F1 scores decreasing significantly, particularly for cloud shadows; we attribute this to the intrinsic refinement of our v2. The comparison of time consumption also supports that our implementation is potentially efficient; in fact, our brute-force implementation is more efficient than Refined UNet if r < 50. We will further explore the backpropagation and learnability of our CRF implementation, and improve its computational efficiency.

Author Contributions

Conceptualization, L.J. and P.T.; Funding acquisition, L.J.; Methodology, L.J.; Supervision, C.H. and P.T.; Writing—original draft, L.J.; Writing—review & editing, L.H. and C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Postdoctoral Science Foundation grant number 2019M660852, Special Research Assistant Foundation of CAS, and National Natural Science Foundation of China grant number 41971396.

Acknowledgments

The authors would like to thank the contributors of the open-source implementations of the Cython-based Python wrapper for Dense CRF (https://github.com/lucasb-eyer/pydensecrf) and CRFasRNN-Keras (https://github.com/sadeepj/crfasrnn_keras), as they inspired the brute-force implementation of Refined UNet v2.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of Guided Gaussian Filter

We revisit the derivation of Guided Gaussian filter in this section and first recall the cost function of ridge regression with respect to linear coefficients a k and b k .
$E(a_k, b_k) = \sum_{i \in \omega_k} g_{ik}\left[\left(a_k I_i + b_k - x_i\right)^2 + \epsilon a_k^2\right]$ (A1)
The partial derivatives of E with respect to a k and b k are given by Equations (A2) and (A3).
$\frac{\partial E}{\partial a_k} = \sum_{i \in \omega_k} g_{ik}\left[2 I_i (a_k I_i + b_k - x_i) + 2 \epsilon a_k\right]$ (A2)
$\frac{\partial E}{\partial b_k} = \sum_{i \in \omega_k} g_{ik} \cdot 2 (a_k I_i + b_k - x_i)$ (A3)
The two partial derivatives should be zero if we intend to minimize the cost function. Then we have
$\frac{\partial E}{\partial a_k} = 0$ (A4)
$\frac{\partial E}{\partial b_k} = 0$ (A5)
We first consider the partial derivative of E with respect to b k and then we have
$\sum_{i \in \omega_k} g_{ik} \cdot 2 (a_k I_i + b_k - x_i) = 0$ (A6)
$b_k \sum_{i \in \omega_k} g_{ik} = \sum_{i \in \omega_k} g_{ik} (x_i - a_k I_i)$ (A7)
Thus, b k is given by Equation (A8). For brevity of exposition, we refer to μ k and x ¯ k as the Gaussian-weighted mean of I i and x i in the window ω k .
$b_k = \frac{1}{\sum_{i \in \omega_k} g_{ik}} \sum_{i \in \omega_k} g_{ik} (x_i - a_k I_i) = \bar{x}_k - a_k \mu_k$ (A8)
Now we consider the partial derivative of E with respect to a k , given by Equations (A9) and (A10).
$\sum_{i \in \omega_k} g_{ik}\left[2 I_i (a_k I_i + b_k - x_i) + 2 \epsilon a_k\right] = 0$ (A9)
$a_k \sum_{i \in \omega_k} g_{ik} I_i^2 + \sum_{i \in \omega_k} g_{ik} I_i b_k - \sum_{i \in \omega_k} g_{ik} I_i x_i + \epsilon a_k \sum_{i \in \omega_k} g_{ik} = 0$ (A10)
We substitute $b_k$ (Equation (A8)) into the term with respect to $b_k$ in Equation (A10), which yields
$\sum_{i \in \omega_k} g_{ik} I_i b_k = \sum_{i \in \omega_k} g_{ik} I_i (\bar{x}_k - a_k \mu_k) = \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i - a_k \mu_k \sum_{i \in \omega_k} g_{ik} I_i$ (A11)
$a_k \sum_{i \in \omega_k} g_{ik} I_i^2 + \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i - a_k \mu_k \sum_{i \in \omega_k} g_{ik} I_i - \sum_{i \in \omega_k} g_{ik} I_i x_i + \epsilon a_k \sum_{i \in \omega_k} g_{ik} = 0$ (A12)
$a_k \left(\sum_{i \in \omega_k} g_{ik} I_i^2 - \mu_k \sum_{i \in \omega_k} g_{ik} I_i + \epsilon \sum_{i \in \omega_k} g_{ik}\right) = \sum_{i \in \omega_k} g_{ik} I_i x_i - \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i$ (A13)
Now we have the solution to a k , given by Equation (A14).
$a_k = \frac{\sum_{i \in \omega_k} g_{ik} I_i x_i - \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i}{\sum_{i \in \omega_k} g_{ik} I_i^2 - \mu_k \sum_{i \in \omega_k} g_{ik} I_i + \epsilon \sum_{i \in \omega_k} g_{ik}}$ (A14)

Appendix B. Proof of the Kernel of Guided Gaussian Filter

The Guided Gaussian filter takes into consideration both the position and color intensity features of pixel i and j, which can be proved by the implicit kernel weights, defined in Equation (A15).
$W_{ij}(I) = \frac{1}{Z_g^2} \sum_{k:(i,j) \in \omega_k} g_{ik}\, g_{jk} \left[1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k + \epsilon}\right]$ (A15)
in which $Z_g$, $\mu_k$, and $\sigma_k$ are the normalization term of the Gaussian filter $\sum_{i \in \omega_k} g_{ik}$, the Gaussian-weighted mean $f_{\mathrm{GF}}(I_i)$, and the variance $f_{\mathrm{GF}}(I_i^2) - f_{\mathrm{GF}}^2(I_i)$ in window $\omega_k$, respectively.
Proof. 
The kernel is given by Equation (A16).
$W_{ij} = \frac{\partial y_i}{\partial x_j}$ (A16)
We substitute Equation (15) for $b_k$ in Equation (18) and obtain
$y_i = f_{\mathrm{GF}}(a_k) I_i + f_{\mathrm{GF}}(b_k) = \frac{1}{Z_g} \sum_{k \in \omega_i} g_{ki} \left[a_k (I_i - \mu_k) + \bar{x}_k\right]$ (A17)
Thus, the derivative of y i with respect to x j is
$\frac{\partial y_i}{\partial x_j} = \frac{1}{Z_g} \sum_{k \in \omega_i} g_{ki} \left[\frac{\partial a_k}{\partial x_j} (I_i - \mu_k) + \frac{\partial \bar{x}_k}{\partial x_j}\right]$ (A18)
The derivative of x ¯ k with respect to x j is
$\frac{\partial \bar{x}_k}{\partial x_j} = \frac{1}{Z_g} g_{jk}\, \delta_{j \in \omega_k} = \frac{1}{Z_g} g_{kj}\, \delta_{k \in \omega_j}$ (A19)
The derivative of $a_k$ with respect to $x_j$ is
$\frac{\partial a_k}{\partial x_j} = \frac{1}{\sigma_k + \epsilon} \left[\frac{\partial}{\partial x_j} f_{\mathrm{GF}}(I_i x_i) - \mu_k \frac{\partial \bar{x}_k}{\partial x_j}\right] = \frac{1}{\sigma_k + \epsilon} \left[\frac{1}{Z_g} g_{jk} I_j\, \delta_{j \in \omega_k} - \mu_k \frac{1}{Z_g} g_{jk}\, \delta_{j \in \omega_k}\right] = \frac{1}{\sigma_k + \epsilon} \frac{1}{Z_g} g_{kj} (I_j - \mu_k)\, \delta_{k \in \omega_j}$ (A20)
Finally, we have
$\frac{\partial y_i}{\partial x_j} = \frac{1}{Z_g^2} \sum_{k \in \omega_i,\, k \in \omega_j} g_{ki}\, g_{kj} \left[1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k + \epsilon}\right]$ (A21)

References

  1. Roy, D.P.; Wulder, M.A.; Loveland, T.R.; Woodcock, C.E.; Zhu, Z. Landsat-8: Science and product vision for terrestrial global change research. Remote Sens. Environ. 2014, 145, 154–172.
  2. Wulder, M.A.; White, J.C.; Loveland, T.R.; Woodcock, C.E.; Roy, D.P. The global Landsat archive: Status, consolidation, and direction. Remote Sens. Environ. 2016, 185, 271–283.
  3. Vermote, E.F.; Justice, C.O.; Claverie, M.; Franch, B. Preliminary analysis of the performance of the Landsat 8/OLI land surface reflectance product. Remote Sens. Environ. 2016, 185, 46–56.
  4. Chai, D.; Newsam, S.; Zhang, H.K.; Qiu, Y.; Huang, J. Cloud and cloud shadow detection in Landsat imagery based on deep convolutional neural networks. Remote Sens. Environ. 2019, 225, 307–316.
  5. Sun, L.; Liu, X.; Yang, Y.; Chen, T.; Wang, Q.; Zhou, X. A cloud shadow detection method combined with cloud height iteration and spectral analysis for Landsat 8 OLI data. ISPRS J. Photogramm. Remote Sens. 2018, 138, 193–207.
  6. Qiu, S.; He, B.; Zhu, Z.; Liao, Z.; Quan, X. Improving Fmask cloud and cloud shadow detection in mountainous area for Landsats 4–8 images. Remote Sens. Environ. 2017, 199, 107–119.
  7. Li, Z.; Shen, H.; Li, H.; Xia, G.; Gamba, P.; Zhang, L. Multi-feature combined cloud and cloud shadow detection in GaoFen-1 wide field of view imagery. Remote Sens. Environ. 2017, 191, 342–358.
  8. Zhu, Z.; Woodcock, C.E. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 2012, 118, 83–94.
  9. Foga, S.; Scaramuzza, P.L.; Guo, S.; Zhu, Z.; Dilley, R.D.; Beckmann, T.; Schmidt, G.L.; Dwyer, J.L.; Hughes, M.J.; Laue, B. Cloud detection algorithm comparison and validation for operational Landsat data products. Remote Sens. Environ. 2017, 194, 379–390.
  10. Zhu, X.; Helmer, E.H. An automatic method for screening clouds and cloud shadows in optical satellite image time series in cloudy regions. Remote Sens. Environ. 2018, 214, 135–153.
  11. Frantz, D.; Roder, A.; Udelhoven, T.; Schmidt, M. Enhancing the Detectability of Clouds and Their Shadows in Multitemporal Dryland Landsat Imagery: Extending Fmask. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1242–1246.
  12. Zhu, Z.; Woodcock, C.E. Automated cloud, cloud shadow, and snow detection in multitemporal Landsat data: An algorithm designed specifically for monitoring land cover change. Remote Sens. Environ. 2014, 152, 217–234.
  13. Ricciardelli, E.; Romano, F.; Cuomo, V. Physical and statistical approaches for cloud identification using Meteosat Second Generation-Spinning Enhanced Visible and Infrared Imager Data. Remote Sens. Environ. 2008, 112, 2741–2760.
  14. Amato, U.; Antoniadis, A.; Cuomo, V.; Cutillo, L.; Franzese, M.; Murino, L.; Serio, C. Statistical cloud detection from SEVIRI multispectral images. Remote Sens. Environ. 2008, 112, 750–766.
  15. Jiao, L.; Huo, L.; Hu, C.; Tang, P. Refined UNet: UNet-Based Refinement Network for Cloud and Shadow Precise Segmentation. Remote Sens. 2020, 12, 2001.
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651.
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
  18. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  19. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  20. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  21. Kendall, A.; Badrinarayanan, V.; Cipolla, R. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. In Proceedings of the British Machine Vision Conference, London, UK, 4–7 September 2017.
  22. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  23. Li, Y.; Song, L.; Chen, Y.; Li, Z.; Zhang, X.; Wang, X.; Sun, J. Learning Dynamic Routing for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  24. Xie, F.; Shi, M.; Shi, Z.; Yin, J.; Zhao, D. Multilevel Cloud Detection in Remote Sensing Images Based on Deep Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3631–3640.
  25. Zi, Y.; Xie, F.; Jiang, Z. A Cloud Detection Method for Landsat 8 Images Based on PCANet. Remote Sens. 2018, 10, 877.
  26. Yang, Y.; Soatto, S. FDA: Fourier Domain Adaptation for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  27. Li, X.; Yang, Y.; Zhao, Q.; Shen, T.; Lin, Z.; Liu, H. Spatial Pyramid Based Graph Reasoning for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  28. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. In Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 14–16 April 2014.
  29. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Roseville, CA, USA, 2016; pp. 3844–3852.
  30. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  31. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
  32. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  33. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 630–645.
  36. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
  37. Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation. arXiv 2019, arXiv:1903.11816.
  38. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062.
  39. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  40. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
  41. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  42. Choi, S.; Kim, J.T.; Choo, J. Cars Can’t Fly Up in the Sky: Improving Urban-Scene Segmentation via Height-Driven Attention Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  43. Zhang, Y.; Qiu, Z.; Yao, T.; Ngo, C.W.; Liu, D.; Mei, T. Transferring and Regularizing Prediction for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  44. Araslanov, N.; Roth, S. Single-Stage Semantic Segmentation From Image Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  45. Lee, H.J.; Kim, J.U.; Lee, S.; Kim, H.G.; Ro, Y.M. Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  46. Cheng, H.K.; Chung, J.; Tai, Y.W.; Tang, C.K. CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Conference, Seattle, WA, USA, 14–19 June 2020.
  47. Kohli, P.; Ladicky, L.; Torr, P.H.S. Robust Higher Order Potentials for Enforcing Label Consistency. Int. J. Comput. Vis. 2009, 82, 302–324.
  48. Ladicky, L.; Russell, C.; Kohli, P.; Torr, P.H. Associative Hierarchical CRFs for Object Class Image Segmentation. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009.
  49. Kolmogorov, V.; Zabin, R. What energy functions can be minimized via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 147–159.
  50. Krähenbühl, P.; Koltun, V. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. In Advances in Neural Information Processing Systems 24; Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Roseville, CA, USA, 2011; pp. 109–117.
  51. Zheng, S.; Jayasumana, S.; Romeraparedes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H.S. Conditional Random Fields as Recurrent Neural Networks. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1529–1537.
  52. Chandra, S.; Kokkinos, I. Fast, Exact and Multi-scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 8–16 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 402–418.
  53. Liu, C.; Freeman, W.T.; Szeliski, R.; Kang, S.B. Noise Estimation from a Single Image. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006.
  54. Durand, F.; Dorsey, J. Fast Bilateral Filtering for the Display of High-Dynamic-Range Images. ACM Trans. Graph. 2002, 21, 257–266.
  55. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353.
  56. Kopf, J.; Cohen, M.F.; Lischinski, D.; Uyttendaele, M. Joint bilateral upsampling. ACM Trans. Graph. 2007, 26, 96.
  57. Tomasi, C.; Manduchi, R. Bilateral filtering for gray and color images. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Bombay, India, 4–7 January 1998; pp. 839–846.
  58. Farbman, Z.; Fattal, R.; Lischinski, D.; Szeliski, R. Edge-preserving decompositions for multi-scale tone and detail manipulation. ACM Trans. Graph. 2008, 27.
  59. He, K.; Sun, J.; Tang, X. Guided Image Filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1397–1409.
  60. Adams, A.; Baek, J.; Davis, M.A. Fast High-Dimensional Filtering Using the Permutohedral Lattice. Comput. Graph. Forum 2010, 29, 753–762.
  61. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software. 2015. Available online: tensorflow.org (accessed on 27 October 2020).
Figure 1. Visualization of cloud and cloud shadow segmentation (Landsat 8 OLI, Path 113, Row 26, Year 2016), including false-color image, QA, Refined UNet [15], and our Refined UNet v2. Bands 5 NIR, 4 Red, and 3 Green are stacked as RGB channels to construct false-color images for visualization. Pixels of clouds and shadows with high confidence in the Level-2 Pixel Quality Assessment band are labeled as references (QA). Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively. Compared to Refined UNet [15], our Refined UNet v2 attempts to eliminate isolated pixels and regions, yielding a “cleaner” segmentation result.
Figure 2. The full pipeline of Refined UNet v2. First, patches are extracted from a high-resolution remote sensing image. UNet is employed to localize clouds and shadows roughly, and then the proposed CRF refines the prediction. The full result is restored from the refinement patches.
Figure 3. Visualizations of cloud and cloud shadow segmentation (Landsat 8 OLI, Path 113, Row 26, Year 2016). (a) False-color image, (b) QA, (c) UNet backbone, (d) Refined UNet [15], (e) Refined UNet v2. Bands 5 NIR, 4 Red, and 3 Green are stacked as RGB channels to construct false-color images for visualization. Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively.
Figure 4. Local examples of cloud and shadow segmentation (Landsat 8 OLI, Path 113, Row 26, Year 2016, Date 12 April). (a) False-color patches, (b) QA, (c) UNet backbone, (d) Refined UNet [15], (e) Refined UNet v2. Bands 5 NIR, 4 Red, and 3 Green are stacked as RGB channels to construct a false-color image for visualization. Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively. Visually, Refined UNet [15] adheres more closely to the precise boundaries of clouds and shadows but generates more "noise" in the prediction; Refined UNet v2, in contrast, removes this "noise". Over-refinement, however, can also be observed in the visual assessment and should be addressed in future work.
Figure 5. Visualizations with regard to r and ϵ in our CRF implementation. Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively. The candidate values of r vary from 10 to 80 when ϵ is fixed at 10⁻⁸, and the candidate values of ϵ vary from 10⁻⁸ to 100 when r is fixed at 50. The prediction adheres more closely to object boundaries when a larger r and a smaller ϵ are used, while more isolated regions are also eliminated, especially small pieces of shadows.
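To make the roles of r and ϵ in Figure 5 concrete, the sketch below re-implements the single-channel guided filter of He et al. [59] with NumPy/SciPy: r is the radius of the box window and ϵ regularizes the local linear model q = aI + b, so a larger r and a smaller ϵ let the output follow the guide image's edges more closely. This is an illustrative re-implementation, not the GPU message-passing code used in Refined UNet v2.

```python
# Minimal single-channel guided filter (He et al. [59]) as an illustration of r and eps.
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, r, eps):
    """Filter p (e.g., per-class probabilities) guided by image I; both (H, W) float arrays."""
    box = lambda x: uniform_filter(x, size=2 * r + 1, mode="reflect")  # mean over a (2r+1)^2 window
    mean_I, mean_p = box(I), box(p)
    var_I  = box(I * I) - mean_I * mean_I
    cov_Ip = box(I * p) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)   # larger eps -> flatter, smoother output; smaller eps -> edge-following
    b = mean_p - a * mean_I
    return box(a) * I + box(b)   # q = mean(a) * I + mean(b)
```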
Table 1. Average accuracy, precision, recall, and F1 scores on the test set (average ± standard deviation; + indicates that higher scores are better).
Models | Time (s/img) 1 | Acc.+ (%) | Background (0): P+ / R+ / F1+ (%) | Fill Values (1): P+ / R+ / F1+ (%) | Shadows (2): P+ / R+ / F1+ (%) | Clouds (3): P+ / R+ / F1+ (%)
UNet [15,17] 2 | 20.67 ± 1.96 | 93.04 ± 5.45 | 93.34 ± 4.88 / 81.52 ± 15.3 / 86.35 ± 11.04 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 34.74 ± 14.77 / 54.31 ± 18.72 / 40.43 ± 14.74 | 87.28 ± 18.78 / 95.96 ± 3.63 / 90.12 ± 13.77
Global Refined UNet [15] 3 | 384.81 ± 5.91 | 93.48 ± 5.46 | 89.89 ± 7.39 / 85.94 ± 17.66 / 86.86 ± 12.33 | 99.88 ± 0.07 / 100 ± 0 / 99.94 ± 0.04 | 35.43 ± 20.26 / 17.87 ± 12.07 / 21.21 ± 11.89 | 87.6 ± 19.15 / 95.87 ± 3.2 / 90.15 ± 14.13
Local Refined UNet [15] 4 | 372.04 ± 5.01 | 93.64 ± 5.41 | 89.48 ± 7.48 / 87.04 ± 17.88 / 87.15 ± 12.28 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 40.80 ± 23.99 / 16.83 ± 10.40 / 21.24 ± 10.88 | 88.21 ± 19.02 / 94.92 ± 4.71 / 90.05 ± 13.90
Refined UNet v2 (r = 10) | 61.36 ± 5.25 | 93.60 ± 5.50 | 91.99 ± 5.74 / 84.51 ± 16.24 / 87.29 ± 11.35 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 40.64 ± 19.88 / 39.00 ± 13.77 / 36.79 ± 12.26 | 87.93 ± 18.83 / 95.83 ± 3.99 / 90.37 ± 13.89
Refined UNet v2 (r = 20) | 106.86 ± 2.47 | 93.56 ± 5.49 | 91.21 ± 6.36 / 85.18 ± 16.73 / 87.19 ± 11.59 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 39.55 ± 20.24 / 31.47 ± 11.77 / 32.12 ± 10.2 | 87.97 ± 18.88 / 95.65 ± 4.08 / 90.30 ± 13.91
Refined UNet v2 (r = 30) | 202.7 ± 2.7 | 93.51 ± 5.48 | 90.64 ± 6.85 / 85.62 ± 17.14 / 87.06 ± 11.78 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 37.94 ± 20.25 / 25.55 ± 11.03 / 27.55 ± 8.58 | 87.94 ± 18.91 / 95.52 ± 4.18 / 90.21 ± 13.92
Refined UNet v2 (r = 40) | 320.64 ± 4.77 | 93.47 ± 5.48 | 90.26 ± 7.16 / 85.92 ± 17.44 / 86.97 ± 11.93 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 36.24 ± 20.58 / 20.96 ± 10.68 / 23.54 ± 8.08 | 87.90 ± 18.93 / 95.41 ± 4.26 / 90.12 ± 13.94
Refined UNet v2 (r = 50) | 485.36 ± 2.91 | 93.45 ± 5.48 | 90.00 ± 7.35 / 86.13 ± 17.70 / 86.91 ± 12.09 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 35.07 ± 21.13 / 17.56 ± 10.21 / 20.34 ± 7.96 | 87.85 ± 18.96 / 95.31 ± 4.34 / 90.04 ± 13.96
Refined UNet v2 (r = 60) | 700.57 ± 17.64 | 93.42 ± 5.48 | 89.79 ± 7.48 / 86.28 ± 17.91 / 86.85 ± 12.22 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 34.01 ± 21.59 / 14.91 ± 9.47 / 17.71 ± 7.61 | 87.80 ± 18.97 / 95.21 ± 4.44 / 89.96 ± 13.97
Refined UNet v2 (r = 70) | 892.97 ± 5.36 | 93.40 ± 5.49 | 89.67 ± 7.55 / 86.37 ± 18.10 / 86.80 ± 12.34 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 33.13 ± 22.20 / 12.88 ± 8.75 / 15.60 ± 7.04 | 87.75 ± 19.00 / 95.13 ± 4.53 / 89.88 ± 13.98
Refined UNet v2 (r = 80) | 1213.23 ± 4.97 | 93.38 ± 5.49 | 89.57 ± 7.59 / 86.42 ± 18.25 / 86.75 ± 12.45 | 100 ± 0 / 100 ± 0 / 100 ± 0 | 32.22 ± 22.74 / 11.27 ± 8.16 / 13.85 ± 6.73 | 87.69 ± 19.01 / 95.06 ± 4.63 / 89.80 ± 13.99
1 Inference duration for one full image in the test phase; s/img denotes seconds per image. 2 Accuracy, precision, recall, and F1 scores of UNet [17] are reported in [15]. 3 The number of CRF iterations in our reproduction of Refined UNet is 10, which leads to minor differences in the quantitative results compared to [15]. 4 Local Refined UNet predicts patch by patch, while Global Refined UNet refines the entire image with Dense CRF.
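The per-class scores in Table 1 follow the standard definitions of precision, recall, and F1 over the four classes (0 background, 1 fill values, 2 shadows, 3 clouds). A minimal sketch of how such scores can be computed from a confusion matrix is given below; it is generic metric arithmetic, not necessarily the evaluation code used to produce the table.

```python
# Minimal sketch: per-class precision, recall, and F1 from integer label maps.
import numpy as np

def per_class_scores(y_true, y_pred, num_classes=4):
    """y_true, y_pred: integer arrays of identical shape with values in [0, num_classes)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true.ravel(), y_pred.ravel()), 1)     # rows: ground truth, cols: prediction
    tp = np.diag(cm).astype(np.float64)
    precision = tp / np.maximum(cm.sum(axis=0), 1)         # TP / predicted positives
    recall    = tp / np.maximum(cm.sum(axis=1), 1)          # TP / actual positives
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```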