1. Introduction
Building segmentation from remote sensing images is the process of delineating roof or footprint boundaries by grouping adjacent pixels with similar feature values (such as brightness, edge, texture, and color). By capturing fine details of ground objects, remote sensing images greatly facilitate the classification and extraction of area-related features, such as roads and buildings. They have been widely used in many applications, including urban planning [1], urban environmental modeling [2], disaster management [3], land-use change monitoring [4], and digital city evolution [5].
The application of such methods is quite challenging for three key reasons. First, objects exist at multiple scales: in a remote sensing image, a pixel may belong to inconspicuous small objects, salient large objects, or stuff. Inconspicuous objects and stuff may be dominated by salient objects, such that their information is weakened or even disregarded, resulting in an uneven categorical distribution. Second, remote sensing images are generally obtained through satellites, airplanes, drones, and so on. Therefore, they only provide a top-down view, which does not contain all of the important characteristics of objects; the missing characteristics are usually visible in the ground-based or panoramic view of the object. Third, adjacent buildings belonging to the same category in remote sensing images may have large differences in appearance (such as color and shape). In recent years, research in this field has grown considerably, and some remarkable achievements have been made.
Conventional remote sensing image segmentation methods usually rely on handcrafted features to extract spatial and texture information from the image, and consider a building as a combination of low-level features, such as spectral information [6], tone [7], texture [8], and geometric shape [9,10]. Many classic feature extractors can also achieve high efficiency, such as support vector machines (SVMs) [11], multinomial logistic regression [12,13], boosted decision trees (DTs) [14], neural networks [15,16,17], Scale-Invariant Feature Transform (SIFT) [18], Histogram of Oriented Gradients (HOG) [19], and Speeded Up Robust Features (SURF) [20]. These methods, however, demand substantial domain expertise from the researchers and cannot represent the high-level semantic features of the image well.
A deep Convolutional Neural Network (CNN) [21] predicts using only features learned from training data, instead of hand-engineered features. Essential to their success is the built-in invariance of deep CNNs to local image transformations, which allows them to learn increasingly abstract feature representations. The challenge of CNN-based semantic segmentation comes from the inner content, shape, and scale variations of the same objects, as well as the easily confused and fine boundaries among different objects. First, the convolution layer has both shift- and spatial-invariant characteristics. While invariance is clearly desirable for high-level vision tasks, it can hamper low-level tasks, such as pose estimation and semantic segmentation, in which precise localization is required rather than the abstraction of spatial details. Second, CNNs are able to capture informative representations with global receptive fields by stacking convolutional layers and down-sampling; however, the compression performed by down-sampling is irreversible, resulting in information loss, excessive translation invariance, and overly smooth results. This also leads to inaccurate object boundary acquisition. Third, the multiple scales of objects are also a challenge for CNNs. The features extracted by CNNs usually have a limited receptive field, where the feature mainly describes the core region and largely ignores the context around the boundary, no matter whether the receptive field is large or small. Extracting the same feature from an object at different scales produces dissimilar results.
With the development of deep learning, techniques have been introduced to meet these requirements. To enlarge the receptive field of feature maps, the Pyramid Scene Parsing Network (PSPNet) [22] adopts Spatial Pyramid Pooling, which pools the feature maps into different sizes and concatenates them after up-sampling. DeepLab [23,24,25] adopts Atrous Spatial Pyramid Pooling (ASPP), which employs dilated/atrous convolution [26] in order to capture multi-scale context and to perform accurate and effective classification of regions at any scale. To improve the resolution of the reconstructed image and, thus, object boundaries, Single-Image Super-Resolution (SISR) [27] has been proposed to generate a high-resolution image from low-resolution observations of the same scene. The Super-Resolution Convolutional Neural Network (SRCNN) [28] is the first end-to-end super-resolution algorithm using a CNN structure, and it achieved superior performance over traditional methods. Very Deep Convolutional Networks for Super-Resolution (VDSR) [29] and the Deeply-Recursive Convolutional Network (DRCN) [30] further improve SRCNN by using gradient clipping, skip connections, or recursive supervision to ease the difficulty of training a deep network. The accelerated Super-Resolution Convolutional Neural Network (FSRCNN) [31] has been proposed to speed up the training and testing of SRCNN. Super-Resolution Using a Generative Adversarial Network (SRGAN) has been proposed to construct a deeper network with perceptual losses [32] and a generative adversarial network (GAN) [33] for photo-realistic super-resolution. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [34] uses a shallow network and only upscales the low-resolution image at the very last stage, in order to achieve real-time performance. Deep Back-Projection Networks for Super-Resolution (DBPN) [35] provide an error feedback mechanism for projection errors at each stage. Recent research has made improvements to the CNN structure based on the characteristics of remote sensing images.
Recent studies have also applied CNNs to remote sensing images; networks such as FCN, DeepLabv3+, ENet [36], and ESPNet [37] have been used to achieve excellent results in building detection. Bittner et al. used FCN to extract buildings [38]. Xu et al. used an improved UNet [39] with a guided filter to extract buildings [40]. Wu [41] used SegNet on unmanned aerial vehicle (UAV) images for riverbank monitoring. Zhang [42] proposed a multi-constraint fully convolutional network (MC-FCN) model to perform end-to-end building segmentation. Sanjeevan [43] improved the FCN using an exponential linear unit (ELU), in order to reduce the noise (i.e., falsely classified buildings) and sharpen the boundaries of the buildings. SPP-PCANet [44] combines PCANet and spatial pyramid pooling, in order to reduce the number of false positives and to improve the detection rate. Wang [45] applied ASPP and superpixel-based DenseCRF to perform dense semantic labeling on remote sensing images. Ma [46] used an MR-SSD model to detect different objects in large-scale SAR images. WSF-Net [47] can achieve results comparable to fully supervised methods when using only image-level annotations with weakly supervised binary segmentation. SPMF-Net [48] takes image-level labels as supervision in a classification network that combines superpixel pooling and multi-scale feature fusion structures for building segmentation. Wang [49] combined semi-supervised training, Average Update of Pseudo-label (AUP) with pseudo-labels, and strong labels to improve segmentation performance. Gergelova [50] proposed a processing procedure in a geographic information system (GIS) environment for identifying roof surfaces based on LiDAR point clouds. Lyu [51] proposed statistical region merging (SRM) and a shape context similarity model for Light Detection and Ranging (LiDAR) data.
Inspired by the methods above, we propose a novel building segmentation network, which combines semantic segmentation and super-resolution for accurate building segmentation in remote sensing images. Our main contributions can be summarized as follows:
A Detail Enhancement (DE) module based on SISR is proposed to enhance the boundaries of buildings, and a Deep Residual Block is designed to improve the learning ability of the network. This design reduces the local information loss caused by convolutional layers and down-sampling in a single deep CNN, which negatively impacts image details around small objects.
A Widely Adapted Spatial Pyramid (WASP) is designed to enlarge the receptive fields for extracting more context information. This design builds multi-scale feature maps with pooling layers and dilated convolutions and integrates features of different scales step-by-step, which incorporates neighboring scales of context features more precisely and introduces better feature learning and fitting capabilities. We further apply the improved ResNet50 and Global Feature Guidance to improve segmentation accuracy.
The paper is organized as follows: Section 2 presents our building segmentation network. Section 3 presents the experimental results on two public datasets, along with the discussion. Our conclusions are presented in Section 4.
2. Proposed Methods
In this paper, the Semantic Segmentation and DE modules are designed. The Semantic Segmentation module produces coarse segmentation results for building objects in the remote sensing images. The DE module is independent of the Semantic Segmentation module, using an independent feature extraction module (Deep Residual Block) and a feature transfer method (Feature Fusion Connection) to decrease the information loss caused by the pooling layers and down-sampling in the segmentation network. As shown in the overall architecture in Figure 1, we apply the improved ResNet50 (LResNet50) to extract dense features. We use a pooling operation to obtain spatial distribution information in the image and use this as a guide toward intact bodies and accurate boundaries, by embedding the results of the Detail Enhancement and Semantic Segmentation modules.
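To make the overall wiring concrete, the following is a minimal PyTorch sketch of how the two branches compose, under our assumptions: the module classes are placeholders for the components defined in Sections 2.1 and 2.2, matching channel widths are assumed, and the sigmoid guidance is one plausible reading of the pooling-based guide described above, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BuildingSegNet(nn.Module):
    """Sketch: semantic branch + detail branch fused under a global guide."""
    def __init__(self, backbone, wasp, de_module, channels=64, num_classes=2):
        super().__init__()
        self.backbone = backbone      # LResNet50 feature extractor (Section 2.3)
        self.wasp = wasp              # Widely Adapted Spatial Pyramid (Section 2.2)
        self.de_module = de_module    # Detail Enhancement branch (Section 2.1)
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.wasp(self.backbone(x))    # coarse semantic features
        details = self.de_module(x)            # high-frequency detail features
        details = F.interpolate(details, size=feats.shape[-2:],
                                mode='bilinear', align_corners=False)
        fused = feats + details
        # Global guidance: image-level statistics re-weight the fused map
        guide = torch.sigmoid(fused.mean(dim=(2, 3), keepdim=True))
        return self.classifier(fused * guide)

# Wiring check with identity placeholders (real modules are defined below):
net = BuildingSegNet(backbone=nn.Identity(), wasp=nn.Identity(),
                     de_module=nn.Identity(), channels=3, num_classes=2)
print(net(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 2, 128, 128])
```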
2.1. Detail Enhancement (DE) Module
A super-resolution reconstruction algorithm reconstructs a corresponding high-resolution image from a low-resolution image. In a neural network, the convolutional layer can directly learn low-frequency information through non-linear filtering; the low-frequency information is well-preserved, such that the low-resolution input and the high-resolution output of the network are similar in terms of the outlines of ground objects. However, the loss of high-frequency information still causes the reconstructed image to lack detail. Therefore, existing super-resolution networks generally use a shallow network to ensure the quality of the reconstructed image, but a shallow network has limited learning ability. As remote sensing images contain a large amount of information, the network can be fully utilized only when sufficient depth is guaranteed. Meanwhile, in the deep learning community, many theoretical studies [52,53,54] have shown that deep convolutional neural networks (with more than five layers) can recursively identify patches of the images at lower layers, resulting in an exponential increase in the number of linear regions of the input space. The deep layers typically extract more complex features from the input image than the shallow layers (i.e., the first five layers).
Based on the above analysis, we designed a Detail Enhancement (DE) module, based on ESPCN, to enhance the details of buildings in remote sensing images. The module, as shown in Figure 2, contains a Deep Residual Block and Feature Fusion Connections. ESPCN uses a pixel shuffle layer, which implements sub-pixel up-sampling. Unlike the deconvolution layer and interpolation-based up-sampling, sub-pixel up-sampling implicitly includes the interpolation function in the preceding convolutional layers, where it can be learned automatically. Since the image size is transformed only in the last layer and all previous convolution operations are performed on low-resolution images, sub-pixel up-sampling is more efficient than bicubic up-sampling and deconvolution. In this module, we retain the original sub-pixel up-sampling layer. For the feature extraction part, we designed a feature enhancement module to increase the depth of the network, and we used a feature transfer module to ensure that the network retains more high-frequency information.
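To make the sub-pixel mechanism concrete, the following minimal PyTorch sketch (with illustrative, not paper-specified, layer sizes) shows how a pixel shuffle layer rearranges an r^2-times-wider low-resolution feature map into an output that is r times larger in each spatial dimension, so that all convolutions before it run at low resolution.

```python
import torch
import torch.nn as nn

r = 2  # illustrative upscale factor
# Convolutions run at low resolution; only the final pixel shuffle enlarges the map.
espcn_tail = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),  # r^2 channels per output channel
    nn.PixelShuffle(r),                                  # (N, 3*r^2, H, W) -> (N, 3, r*H, r*W)
)

lr_feat = torch.randn(1, 64, 32, 32)
print(espcn_tail(lr_feat).shape)  # torch.Size([1, 3, 64, 64])
```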
In the Deep Residual Block, we apply the Residual Block proposed in EDSR [55]. Compared with the original ResNet [56], the EDSR Residual Block does not apply the pooling layer or the batch normalization (BN) [57] layer; correspondingly, the structure is simpler and requires less computation, which makes it suitable for processing images with rich information. As shown in Figure 2a, every Group Block contains three Residual Blocks; as shown in Figure 2b, the Deep Residual Block (DRBlock) contains two Group Blocks. Since image super-resolution is an image-to-image translation task in which the input image is highly correlated with the target image, only the residual between them needs to be learned; we name this the global residual feature. In this way, the network avoids learning a complicated transformation from one complete image to another, and instead only needs to learn a residual map to restore the missing high-frequency details. The local residual learning is similar to the residual learning in ResNet; it is used to alleviate the degradation problem caused by increasing network depth, reducing training difficulty and improving learning ability.
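A minimal PyTorch sketch, under an assumed channel width, of the EDSR-style Residual Block (no pooling, no BN) and a Group Block stacking three of them, matching the structure described above; the shortcut around each group anticipates the Feature Fusion Connections discussed next and is our reading of Figure 2a.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv plus identity, no BN or pooling."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # local residual learning

class GroupBlock(nn.Module):
    """Three Residual Blocks with a shortcut around the group (Figure 2a)."""
    def __init__(self, channels=64):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(3)])

    def forward(self, x):
        return x + self.blocks(x)
```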
Between the Deep Residual Block and the Group Blocks, we apply skip connections for feature transfer, using Feature Fusion Connections inspired by ResNet. The Feature Fusion Connections are implemented by shortcut connections with element-wise addition. The difference between the two kinds of connection is that the former directly connects the input and output images (the black line in Figure 2), transporting the global residual feature to the pixel shuffle layer, while the latter adds multiple shortcuts between layers at different depths inside the network, transferring the features extracted by each block step-by-step to the sub-pixel up-sampling layer. In this way, the Feature Fusion Connection part of the module densifies the network connectivity, which helps address the gradient vanishing problem.
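Putting these pieces together, the following hedged sketch wires two Group Blocks, the step-wise fusion shortcuts, and the global residual path into a DE module ending in the sub-pixel up-sampling layer. It reuses the GroupBlock class from the previous sketch; the exact fusion points are our reading of Figure 2, not a verbatim specification.

```python
import torch.nn as nn

class DEModule(nn.Module):
    """Detail Enhancement sketch: head conv -> two Group Blocks with step-wise
    fusion shortcuts -> global residual -> sub-pixel up-sampling."""
    def __init__(self, in_ch=3, channels=64, upscale=2):
        super().__init__()
        self.head = nn.Conv2d(in_ch, channels, 3, padding=1)
        self.group1 = GroupBlock(channels)
        self.group2 = GroupBlock(channels)
        self.tail = nn.Sequential(
            nn.Conv2d(channels, in_ch * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),
        )

    def forward(self, x):
        shallow = self.head(x)
        f1 = self.group1(shallow)
        f2 = self.group2(f1)
        fused = shallow + f1 + f2  # step-wise Feature Fusion Connections
        return self.tail(fused)    # global residual feature reaches pixel shuffle
```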
2.2. Widely Adapted Spatial Pyramid (WASP) Module
In a deep neural network, the receptive field [58] size roughly indicates how much contextual information is used. Due to the convolutional nature of a CNN, local convolutional features usually have a limited receptive field. Even with a large receptive field, the feature mainly describes the core region and ignores the contextual information around the boundary. Due to the arbitrary sizes of objects in the image, purely using single-scale convolution for image semantic segmentation leads to the incorrect classification of pixels between categories. Some small objects, such as streetlights, are hard to distinguish but may be of great importance. In addition, large objects may exceed the receptive field of the CNN, thus causing discontinuous predictions. These issues demand processing at multiple scales. As strided convolutions are used in convolution layers and spatial pooling is used in pooling layers, the resolution of the feature maps is reduced. The use of a pooling layer also reduces localization accuracy in the labelled images. As a result, many recent works have attempted to deal with these problems in one way or another.
The DeepLab series utilizes Atrous Spatial Pyramid Pooling (ASPP), with image-level features responsible for encoding the global context. Differing from conventional approaches, which tackle objects at multiple scales by applying rescaled versions of an input image and then aggregating the feature maps, ASPP applies dilated/atrous convolution to integrate the visual context at multiple scales. With dilated convolution, the size of the captured feature map is not reduced. The structure can also probe an incoming feature map using filters with multiple sampling rates and effective fields-of-view, encoding the context at multiple scales. Aggressively increasing the dilation rate may, however, cause an inherent problem called "gridding" [59]. PSPNet abstracts different sub-regions by adopting varying-size pooling kernels and uses global pooling to extract global context information, combining it with the spatial pyramid [60,61,62,63]; it fuses features under four different pyramid scales and up-samples the low-dimensional feature maps by bilinear interpolation to obtain the same size as the original feature map. The use of pooling layers reduces the resolution of the feature map and abandons local information, which further makes it difficult to distinguish objects, especially small ones.
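As a quick, self-contained illustration (not from the paper) of why dilated convolution preserves resolution, the snippet below compares a strided layer with a dilated one; with padding equal to the dilation rate, a 3x3 dilated convolution keeps the spatial size while enlarging the effective kernel.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
strided = nn.Conv2d(64, 64, 3, stride=2, padding=1)    # halves the resolution
dilated = nn.Conv2d(64, 64, 3, dilation=6, padding=6)  # keeps the resolution
print(strided(x).shape)  # torch.Size([1, 64, 16, 16])
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32]), effective kernel 13x13
```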
Current semantic segmentation architectures perform feature learning by extracting multi-scale features and merging them, but lack selective extraction of features at different scales. ENet, based on SegNet, uses two blocks with max-pooling to heavily reduce the input size and uses only a small set of feature maps. The idea behind this is that visual information is highly spatially redundant and can thus be compressed into a more efficient representation; the intuition is that the initial network layers should not directly contribute to classification but should instead act as good feature extractors that only preprocess the input for later portions of the network. ESPNet uses point-wise convolutions to reduce the computation, while its spatial pyramid of dilated convolutions re-samples the feature maps to learn representations from a large effective receptive field. EncNet [64] employs the dilation strategy and proposes a Context Encoding Module (CEM) incorporating the Semantic Encoding Loss (SE-loss), in order to capture contextual information and to selectively highlight class-dependent feature maps. The proposed SE-loss, unlike per-pixel loss, is capable of taking the global context into consideration, resulting in context-aware training. The Selective Kernel Network (SKNet) [65] proposes a dynamic selection mechanism to adaptively adjust the receptive field size of each neuron based on multiple scales of the input information; multiple branches with different kernel sizes are fused using a softmax attention function guided by the information in these branches. Although these branches can extract receptive fields of different scales from feature maps, they lack a global context prior attention to select features in a channel-wise manner. The Pyramid Attention Network (PAN) [66] exploits the impact of global contextual information, combining an attention mechanism and a spatial pyramid to extract precise dense features for pixel labeling, instead of complicated dilated convolutions and artificially designed decoder networks.
Following the Pyramid Attention Network (PAN) [66], we consider how to exploit high-level feature maps to guide low-level features in recovering pixel localization. Global context features and sub-region context are helpful, in this regard, for distinguishing among various categories. This global prior is designed to remove the fixed-size constraint of CNNs for image classification.
Considering the above observations, the Widely Adapted Spatial Pyramid (WASP) is proposed, as shown in Figure 3. In this module, we concatenate dilated convolution and pooling layers in the Multi-Scale Feature Extraction block. Each branch first uses a pooling operation to capture contextual information at the image level, where the contextual information can use spatial statistics to interpret the entire image scene. Then, we use dilated convolution to expand the receptive field of the feature map without reducing its resolution, where each pixel of the feature map belongs to a different image level and corresponds to a different sub-region in the original image. When the level size of the pyramid is N, we reduce the size of the context representation to 1/N of the original size; specifically, the four branches of the pyramid use 1/8, 1/4, 1/2, and 1 of the original size. Compared with a single branch, this setting can more effectively restore the spatial information damaged by the pooling operation [67], so that objects in the final prediction have sharper boundaries. We use a spatial pyramid structure similar to DeepLabv3, where the selected dilation rates are 6, 12, 18, and 1. This design is based on the formula for the receptive field shown in Equation (1). This structure avoids the receptive field scale of one branch being equal to, or an integral multiple of, that of another branch, under the premise of obtaining a receptive field as large as possible.
$l_n = l_{n-1} + (f_n - 1)\prod_{i=1}^{n-1} s_i$,    (1)

where $n$ represents the number of layers, $l_n$ represents the receptive field size of the $n$th layer, $f_n$ represents the convolution kernel size of the $n$th layer, $d$ represents the dilation rate of the dilated convolution in the $n$th layer, and $s_i$ represents the convolution stride of the $i$th layer.
The calculation formula of the equivalent dilated convolution kernel is:

$f'_n = (f_n - 1) \times d + 1$,

where $f'_n$ represents the equivalent convolution kernel of the dilated convolution in the $n$th layer. In our network design, each branch of the pyramid has a pooling layer and a dilated convolution layer. The receptive field of each branch can then be calculated as:

$l = k_p + (f'_n - 1) \times s_p$,

where $k_p$ represents the convolution kernel size of the pooling layer and $s_p$ its stride. In this paper, we set the stride and kernel size of the pooling layer equal to the pooling rate. Assuming that the pooling rate is $p$, the formula can be expressed as:

$l = p + (f'_n - 1) \times p$.

It can also be expressed as:

$l = p \times f'_n = p \times [(f_n - 1)d + 1]$.
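As a worked example under these formulas, assuming a 3x3 convolution kernel ($f_n = 3$) and an illustrative (not paper-specified) pairing of rates: a branch with pooling rate $p = 2$ and dilation rate $d = 6$ gives an equivalent kernel $f'_n = (3 - 1) \times 6 + 1 = 13$ and a branch receptive field of $l = 2 \times 13 = 26$, while a branch with $p = 4$ and $d = 12$ gives $f'_n = 25$ and $l = 4 \times 25 = 100$. These two values are neither equal nor integral multiples of one another, illustrating the design constraint stated above.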
Then, we upsample the feature maps to obtain features of the same size as the original feature map. Benefiting from the Progressive Upsample structure, the multi-scale feature maps are fused and learned through convolution layers during the progressive process; the structure integrates features at different scales step-by-step and can incorporate neighboring scales of context features more precisely. At the output end of the multi-scale branches, we also adopt a progressive convolution design: after the feature processing of each parallel branch is completed, we use bilinear interpolation to upsample the feature map of the current branch, and feature maps of the same size are added step-by-step until a feature map of the same size as the original input is obtained. Compared with structures that directly concatenate the feature maps of the parallel branches and process them with a single convolution layer, as in DeepLab or PSPNet, this design introduces more convolution operations, better feature learning and fitting capabilities, and relatively dense feature maps.
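The following PyTorch sketch captures our reading of this layout: four parallel branches of average pooling (rates 8, 4, 2, 1) followed by 3x3 convolutions with dilation rates 6, 12, 18, and 1, fused progressively from coarse to fine by bilinear upsampling, element-wise addition, and a convolution at each step. The pooling/dilation pairing and channel width are assumptions, not taken verbatim from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WASP(nn.Module):
    """Sketch of the Widely Adapted Spatial Pyramid: pool -> dilated conv per
    branch, then progressive coarse-to-fine fusion (pairings are assumed)."""
    def __init__(self, channels=256):
        super().__init__()
        self.branches = nn.ModuleList()
        for pool_rate, dilation in zip((8, 4, 2, 1), (6, 12, 18, 1)):
            self.branches.append(nn.Sequential(
                nn.AvgPool2d(kernel_size=pool_rate, stride=pool_rate),
                nn.Conv2d(channels, channels, 3,
                          dilation=dilation, padding=dilation),
            ))
        # one fusion convolution per progressive upsampling step
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # coarse -> fine
        out = feats[0]
        for fuse, finer in zip(self.fuse, feats[1:]):
            out = F.interpolate(out, size=finer.shape[-2:],
                                mode='bilinear', align_corners=False)
            out = fuse(out + finer)  # step-by-step fusion of neighboring scales
        return out

wasp = WASP()
print(wasp(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 256, 64, 64])
```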
We add the multi-scale feature map coarsely obtained from the Semantic Segmentation module and the feature map obtained by the Detail Enhancement module, and apply average pooling to the result. The global context information of the entire image is obtained by the pooling operation: assuming the input feature map has size $C \times H \times W$, a tensor of size $C \times 1 \times 1$ is obtained after pooling. This is more native to the convolution structure, as it enforces correspondences between feature maps and categories; it can generate one feature map for each corresponding category of the classification task and sums the spatial information, and is therefore more robust to spatial translations of the input. We apply the global context information to directly perform pixel-by-pixel weighted selection of the input feature map. We call this Global Feature Guidance (GFG) to facilitate subsequent experiments.
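A minimal sketch of Global Feature Guidance as we read it: global average pooling produces a $C \times 1 \times 1$ context tensor that re-weights the input feature map pixel-by-pixel via broadcasting. The 1x1 convolution and sigmoid gating are our assumptions, since the exact weighting function is not specified here.

```python
import torch
import torch.nn as nn

class GlobalFeatureGuidance(nn.Module):
    """Global average pooling -> per-channel weights -> broadcast re-weighting.
    The 1x1 conv and sigmoid are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # (N, C, H, W) -> (N, C, 1, 1)
        self.proj = nn.Conv2d(channels, channels, 1)  # mix channel statistics
        self.gate = nn.Sigmoid()

    def forward(self, x):
        weights = self.gate(self.proj(self.pool(x)))
        return x * weights  # pixel-by-pixel weighted selection via broadcasting

gfg = GlobalFeatureGuidance(256)
print(gfg(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 256, 64, 64])
```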
2.3. Other Settings in Our Method
In our method, ResNet50 was chosen to extract the feature map. The original ResNet50 uses the Rectified Linear Unit (ReLU) as its activation function. ReLU, calculated as in Equation (2),

$f(x) = \max(0, x)$,    (2)

retains the biological inspiration of the step function (i.e., the neuron is activated only when the input exceeds the threshold). When the input is positive, the derivative is non-zero, allowing gradient-based learning (although, at $x = 0$, the derivative is undefined). Using this function makes the calculation very fast, as neither the function nor its derivative contains complex mathematical operations. However, when the input is less than zero, the gradient is zero, so the corresponding weights cannot be updated; the learning speed of ReLU may therefore become very slow, or the neuron may even become permanently inactive. The Leaky ReLU (LReLU) function, calculated as in Equation (3) with a small slope $\alpha$,

$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$,    (3)

is a variant of the classic ReLU activation function. Its output still has a small slope when the input is negative; as the derivative is then non-zero, it reduces the appearance of silent neurons and allows gradient-based learning (although it will be slow), thus solving the problem of neurons not learning after the ReLU function enters a negative interval. Compared with ReLU, LReLU has a larger activation range. Based on the above, we replaced all ReLU functions in ResNet50 with LReLU (Leaky Rectified Linear Unit) as the basic module. In this paper, the ResNet50 with LReLU instead of ReLU is termed "LResNet50".
3)) is a variant of the classic ReLU activation function. The output of this function still has a small slope when the input is negative. When the derivative is non-zero, it can reduce the appearance of silent neurons, allowing for gradient-based learning (although it will be slow), thus solving the problem of neurons not learning after the ReLU function enters a negative interval. Compared with ReLU, LReLU has a larger activation range. Based on the above, we replaced all ReLU functions in ResNet50 with LReLU (Leaky Rectified Linear Unit) as the basic module. In this paper, the ResNet50 with LReLU, instead of ReLU, is termed “LResNet50”.