Article

Multi-Source Remote Sensing Pretraining Based on Contrastive Self-Supervised Learning

State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(18), 4632; https://doi.org/10.3390/rs14184632
Submission received: 15 July 2022 / Revised: 12 September 2022 / Accepted: 14 September 2022 / Published: 16 September 2022

Abstract

SAR-optical images from different sensors can provide consistent information for scene classification. However, the utilization of unlabeled SAR-optical images in deep learning-based remote sensing image interpretation remains an open issue. In recent years, contrastive self-supervised learning (CSSL) methods have shown great potential for obtaining meaningful feature representations from massive amounts of unlabeled data. This paper investigates the effectiveness of CSSL-based pretraining models for SAR-optical remote-sensing classification. Firstly, we analyze the contrastive strategies of single-source and multi-source SAR-optical data augmentation under different CSSL architectures and find that the CSSL framework without explicit negative sample selection naturally fits the multi-source learning problem. Secondly, we find that registered SAR-optical images can guide the Siamese self-supervised network without negative samples to learn shared features, which is also why the CSSL framework without negative samples outperforms the one with negative samples. Finally, we apply the CSSL pretrained network without negative samples, which learns the shared features of SAR-optical images, to the downstream domain adaptation task of transferring from optical to SAR images. We find that the choice of pretrained network is important for downstream tasks.

1. Introduction

Today, with the development of Earth observation technology and sensor hardware platforms, a large number of Earth observation satellites are in operation [1], and an increasing number of remote-sensing images from different sources can be obtained [2]. However, the images and spectral information from a single sensor have many practical limitations and often cannot meet application requirements [3]. Therefore, the application of multi-source images has become a developmental trend in the field of remote sensing [4]. Multi-source methods can fuse the information of different types of images obtained from different sensors, improving the accuracy and richness of the information contained in the same area [5,6]. Registered multi-source scene images can provide information about the structural consistency of the scene, and the effective use of this information can improve the classification accuracy of scene images. However, traditional fusion methods rely on manually designed filters to extract features [7,8], and it is difficult to design ideal feature extraction methods or fusion rules. With their powerful data processing capability, convolutional neural networks are increasingly used in multi-source remote sensing and have become the main method for remote sensing image analysis [9]. The most basic way to use convolutional neural networks for image processing is to feed a large number of images and their corresponding labels into the network and extract image features through supervised training [10]. The accuracy and efficiency of such end-to-end networks are greatly improved. Nevertheless, the supervised training process consumes a large amount of time, memory, and annotated data. Consequently, the most commonly used image classification approach in deep learning is to transfer a network pretrained on a large-scale annotated dataset to the downstream classification task for finetuning [11].
A pretrained network provides initial model parameters, which helps to achieve higher task accuracy and speeds up model convergence [12]. The most widely used pretrained networks are based on ImageNet [13], a large labeled dataset in the computer vision (CV) field. An ImageNet pretrained network learns generalized feature representations from a mass of annotated natural images [14]. Many CNN classification models pretrained on the ImageNet dataset have been widely finetuned on remote-sensing scene images [15]. However, because of the huge difference in imaging principles, remote-sensing images and natural images differ greatly in color and texture features; this is especially true of the data distributions of SAR images and natural optical images [16]. The prior knowledge brought by transferring a network pretrained on natural optical images is therefore not reliable, and systematic bias is introduced [17]. Finetuning remote-sensing images with an ImageNet pretrained network may introduce strong class-discriminative biases for shapes [18] and textures [19] from the pretraining dataset [20]. These problems are difficult to solve by using remote-sensing data only for finetuning. Therefore, the most direct solution is to construct a remote-sensing annotated dataset of the same size as the ImageNet dataset and then train a general-purpose pretrained network on it.
The number of remote-sensing images we can obtain increases exponentially, but annotating them is time-consuming and requires a great deal of expert knowledge [21,22]. In addition, there are many types of remote-sensing images, and the imaging mechanisms of different sensors differ [23]. Even among SAR images, the data distributions of images from different wavebands and imaging angles differ [24]. Hence, the idea of training a general-purpose network on such a complete remote-sensing dataset is not tenable. The key to solving these problems is how to train on large datasets without the need for annotations.
In recent years, self-supervised learning has provided a new idea for utilizing unlabeled data [25]. Self-supervised learning trains a neural network to solve an artificially designed pretext task by constructing pseudo-labels from unlabeled samples, thereby driving the model to learn features [17]. The pretext task plays the most important role in self-supervised learning and should prompt the neural network to learn meaningful image representations. Self-supervised learning can be divided into contrastive self-supervised learning (CSSL) and generative self-supervised learning (GSSL) according to the pretext task [26]. The core concept of GSSL is feature learning by restoring artificially damaged images to the original images, whereas CSSL learns features by contrasting the latent representations of different augmented views. Consequently, GSSL focuses heavily on pixel-level information, which leads to redundancy and wasted memory. Since CSSL focuses on extracting features that are easily distinguishable between categories, it consumes less memory than GSSL and is widely used by researchers [27,28]. CSSL methods can be divided into those with negative samples and those without negative samples. Typical CSSL methods with negative samples include MOCO [29], SimCLR [30] and CMC [31], while typical CSSL methods without negative samples include BYOL [32] and SimSiam [33]. A self-supervised pretrained neural network can then be used in downstream supervised tasks [16]. Accordingly, self-supervised learning can be used to develop pretrained networks on unlabeled multi-source remote-sensing data.
The strategy of multi-source image fusion can combine the information of different types of images obtained from different sensors to improve the accuracy and richness of the information contained in the same area [34]. Stojnić and Risojević proposed applying the CMC structure to multi-source remote-sensing images [17]. However, no study has shown which CSSL framework is more suitable for multi-source images. Meanwhile, BYOL and SimSiam achieve superior performance on single-source images by using a Siamese network and discarding negative samples during training. The Siamese structure of BYOL and SimSiam can better utilize the consistency information brought by registered SAR-optical images, so that the network learns the shared features of SAR-optical images. Consequently, we apply SAR-optical images to the CSSL framework without negative samples. The features extracted from SAR-optical remote-sensing images by the networks with and without negative samples are compared experimentally, and the pretrained networks are transferred to the downstream single-source scene classification task to compare their accuracy.
In fact, self-supervised pretrained networks are widely used [12]. In the downstream transfer task from optical to SAR images in the remote-sensing field, domain adaptation methods such as DANN [35], DAN [36] and CDAN [37] are usually applied to improve transfer accuracy because of the large difference between the distributions of optical and SAR images. However, in domain adaptation, using an ImageNet pretrained network as the feature extractor to align the data distributions of optical and SAR images is suboptimal and degrades the transfer accuracy. In this paper, a contrastive self-supervised pretrained network based on registered SAR-optical images is used to replace the ImageNet pretrained network for extracting features of optical and SAR images, and we verify whether the accuracy of the domain adaptation transfer task can be improved.
To address the aforementioned issues, we conducted extensive experiments on six popular datasets. The three main contributions are summarized as follows:
(1)
We introduce the CSSL method with negative samples and the CSSL method without negative samples into multi-source remote-sensing image applications simultaneously, and we systematically compare the effectiveness of the two different frameworks for remote-sensing image classification for the first time.
(2)
We determine through analysis that the registered SAR-optical images have structural consistency and can be used as natural contrastive samples that can drive the Siamese network without negative samples to learn shared features of SAR-optical images.
(3)
We explore the application of self-supervised pretrained networks to the downstream domain adaptation task of optical transfer to SAR for the first time.

2. Methods

We address the problem of improving unlabeled SAR-optical data utilization in a contrastive self-supervised fashion. This paper applies the two frameworks of CSSL to SAR-optical images. The CSSL framework with negative samples requires the encoder to decrease the distance between views of the same instance mapped into the latent space while increasing the distance to other instances (see Figure 1a) [38]. The other framework is CSSL without negative samples, which tasks the encoder with predicting different augmented views of the input samples to learn features (see Figure 1b). For single-source images, multiple views often have to be simulated by random augmentation (see Figure 2a). Multi-source remote-sensing images naturally provide multiple views, especially registered multi-source scene images: images obtained by different sensors over the same scene have strong structural consistency, forming natural positive sample pairs, as shown in Figure 2b.
In this section, we introduce different contrastive frameworks on SAR-optical data augmentation strategies (see Figure 1c). We propose to explore which contrastive framework can effectively improve the utilization of SAR-optical images. We also investigate the principles of those methods and propose a novel application scenario of SAR-optical contrastive self-supervised pretraining network on domain adaptation tasks.

2.1. Multi-Source Contrastive Self-Supervised Method with Negative Samples

In a set of data, different augmented views of the same image constitute positive sample pairs, and augmented views of different images constitute negative sample pairs. The purpose of contrastive learning is to reduce the feature distance between positive samples and increase the feature distance between negative samples, thereby learning instance-level features.
Given a dataset $X$ consisting of a collection of images $\{x_1, x_2, \dots, x_n\}$, the pairs $\{x_i, x_i^{+}\}$, $i = 1, 2, \dots, n$, are positive pairs, where $x_i^{+}$ is an augmented view of $x_i$, and the pairs $\{x_i, x_j\}$, $i \neq j$, are negative pairs. In the contrastive loss, $f(\cdot)$ is a feature extractor implemented as a neural network. The contrastive loss is given in Equation (1):
$$L = -\,\mathbb{E}_{x}\left[ \log \frac{\exp\!\left(f(x_i)^{\top} f(x_i^{+})\right)}{\exp\!\left(f(x_i)^{\top} f(x_i^{+})\right) + \sum_{j=1}^{n-1} \exp\!\left(f(x_i)^{\top} f(x_j)\right)} \right] \tag{1}$$
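To make Equation (1) concrete, the following is a minimal PyTorch sketch of an InfoNCE-style contrastive loss that uses the other samples in the batch as negatives. The function name, the batch size, and the feature dimension are illustrative assumptions; the paper's actual implementation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feat_anchor: torch.Tensor, feat_positive: torch.Tensor) -> torch.Tensor:
    """Minimal InfoNCE-style loss in the spirit of Equation (1).

    feat_anchor, feat_positive: (N, D) features f(x_i) and f(x_i^+).
    For anchor i, feat_positive[i] is the positive; all feat_positive[j], j != i,
    act as the n-1 negatives in the denominator.
    """
    logits = feat_anchor @ feat_positive.t()                 # f(x_i)^T f(x_j^+), shape (N, N)
    targets = torch.arange(feat_anchor.size(0), device=feat_anchor.device)
    # Row-wise cross-entropy implements -log(exp(pos) / sum_j exp(sim_ij)).
    return F.cross_entropy(logits, targets)

# Usage with random stand-in features (hypothetical shapes).
f_x = torch.randn(256, 128)    # f(x_i) for a batch of 256 images
f_xp = torch.randn(256, 128)   # f(x_i^+) for their augmented views
loss = contrastive_loss(f_x, f_xp)
```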
CMC (contrastive multi-view coding) is a multi-view contrastive method. In the multi-source contrastive setting, images of the same scene are regarded as positive sample pairs, while images of different scenes from different sources are regarded as negative sample pairs. As shown in Figure 3, $f_{\theta_1}(\cdot)$ and $f_{\theta_2}(\cdot)$ are feature extractors with the same structure, and $p_{\theta_1}(\cdot)$ and $p_{\theta_2}(\cdot)$ are projectors with the same structure but different parameters $\theta_1$ and $\theta_2$. Given a dataset of optical images $X_1$ and a dataset of SAR images $X_2$ consisting of a collection of samples $\{x_1^i, x_2^i\}_{i=1}^{N}$, we take $\{x_1^i, x_2^i\}$ as positive sample pairs and $\{x_1^i, x_2^j\}_{i \neq j}$ as negative pairs. The features of the images are extracted as $y_1^i = f_{\theta_1}(x_1^i)$ and $y_2^i = f_{\theta_2}(x_2^i)$, and then projected as $z_1^i = p_{\theta_1}(y_1^i)$ and $z_2^i = p_{\theta_2}(y_2^i)$.
In this paper, ResNet-18 [15] is selected as the feature encoder, and a fully connected linear layer is used as the projection network. After the feature encoder extracts image features, the feature vectors are mapped into the latent space, and the distance between feature vectors is measured by the loss function.
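As a rough illustration of the encoder and projection head described above, the sketch below wraps a torchvision ResNet-18 backbone with a fully connected projector; the 128-dimensional projection size and the class name are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EncoderProjector(nn.Module):
    """ResNet-18 feature encoder followed by a fully connected linear projector."""

    def __init__(self, proj_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)      # trained from scratch, no ImageNet weights
        feat_dim = backbone.fc.in_features     # 512 for ResNet-18
        backbone.fc = nn.Identity()            # keep only the convolutional encoder
        self.encoder = backbone
        self.projector = nn.Linear(feat_dim, proj_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.encoder(x)                    # image features y
        z = self.projector(y)                  # projected vectors z compared by the loss
        return z
```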
Cosine similarity is used in the discriminating function to measure the similarity between samples. The discriminating function $h(\cdot)$ is trained to give a high value for positive pairs and a low value for negative pairs and is defined with the temperature parameter $\tau$ as:
$$h\!\left(\{x_1^i, x_2^i\}\right) = \exp\!\left(\frac{z_1^i \cdot z_2^i}{\left\| z_1^i \right\| \left\| z_2^i \right\|} \cdot \frac{1}{\tau}\right) \tag{2}$$
The loss function of CMC can be defined as:
$$L_{x_1, x_2} = -\,\mathbb{E}_{\{x_1^1, x_2^1, \dots, x_2^{M+1}\}}\left[ \log \frac{h\!\left(\{x_1^1, x_2^1\}\right)}{\sum_{j=1}^{M+1} h\!\left(\{x_1^1, x_2^j\}\right)} \right] \tag{3}$$
where the view $x_1$ is the anchor and the $M$ negative samples are all drawn from $x_2$. Therefore, the overall loss on the training set can be defined as:
$$L(x_1, x_2) = L_{x_1, x_2} + L_{x_2, x_1} \tag{4}$$
By minimizing this loss function, the distance between positive sample pairs decreases, while the distance between negative sample pairs increases.
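As an illustration of Equations (2)-(4), the sketch below computes a symmetric SAR-optical CMC loss with in-batch negatives. The temperature value and the assumption that the negatives come from the same batch are choices made for the example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def cmc_loss(z_opt: torch.Tensor, z_sar: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric SAR-optical CMC loss (Equations (2)-(4)) with in-batch negatives.

    z_opt, z_sar: (N, D) projected features z_1^i and z_2^i of registered pairs.
    """
    # Cosine similarity scaled by 1/tau implements the critic h of Equation (2).
    z_opt = F.normalize(z_opt, dim=1)
    z_sar = F.normalize(z_sar, dim=1)
    logits = (z_opt @ z_sar.t()) / tau                      # (N, N), diagonal = positives
    targets = torch.arange(z_opt.size(0), device=z_opt.device)
    loss_12 = F.cross_entropy(logits, targets)              # optical as anchor, Equation (3)
    loss_21 = F.cross_entropy(logits.t(), targets)          # SAR as anchor
    return loss_12 + loss_21                                # Equation (4)
```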

2.2. Multi-Source Contrastive Self-Supervised Method without Negative Samples

The contrastive learning method without negative samples not only eliminates the impact of the number of negative samples on performance but also greatly reduces memory consumption. In contrastive learning methods without negative samples, asymmetric Siamese networks are generally used to map the features from different augmentations of the same image for mutual prediction. We apply SAR-optical images to the CSSL framework without negative samples, as shown in Figure 4. The network extracts features from optical and SAR images registered over the same scene and drives the different branches of the Siamese network to predict each other, thereby learning the shared features of the optical and SAR images.
The SAR-optical multi-source variants of BYOL and SimSiam are shown in Figure 5 and Figure 6; they replace the original single-source input with an optical-SAR multi-source input structure. BYOL and SimSiam use different parameter update strategies.
Multi-source BYOL works as follows. Remote-sensing images from different sources are fed into the online network and the target network, respectively. The encoders $f_\theta$ and $f_\xi$ extract features from the two views and produce the representations $y_\theta$ and $y_\xi$. These representations are mapped by the projectors $g_\theta$ and $g_\xi$ into the vectors $z_\theta$ and $z_\xi$. ResNet-18 is used as the encoder, and the projector and predictor are constructed from multi-layer perceptron (MLP) networks [39]. The online branch passes its projection through an additional MLP predictor, finally yielding the vector $q_\theta(z_\theta)$, whereas the target branch outputs the vector $z_\xi$. The vector obtained from the online network predicts the vector of the target network, and the loss is calculated as:
$$L_{\theta, \xi} = 2 - 2 \cdot \frac{\left\langle q_\theta(z_\theta),\, z_\xi \right\rangle}{\left\| q_\theta(z_\theta) \right\|_2 \cdot \left\| z_\xi \right\|_2} \tag{5}$$
where $\| \cdot \|_2$ denotes the $\ell_2$-norm. Similarly, we exchange the online network inputs with the target network inputs and define the resulting loss as $L_{\xi, \theta}$; the total loss of BYOL can then be defined as:
$$L_{BYOL} = L_{\theta, \xi} + L_{\xi, \theta} \tag{6}$$
Only the online network parameters $\theta$ are updated by an optimizer for each calculated loss $L_{BYOL}$, while the target network parameters $\xi$ are updated by an exponential moving average that depends on $\theta$:
$$\theta \leftarrow \mathrm{optimizer}\!\left(\theta,\, \nabla_\theta L_{BYOL},\, \eta\right) \tag{7}$$
$$\xi \leftarrow \omega\, \xi + (1 - \omega)\, \theta \tag{8}$$
where optimizer(·) in Equation (7) denotes the chosen optimizer: the online parameters $\theta$ are updated from the gradient of the loss $L_{BYOL}$ with respect to $\theta$ and the set learning rate $\eta$. In this paper, we used the SGD (stochastic gradient descent) optimizer [40]. The $\omega$ in Equation (8) denotes the target decay rate.
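The following is a minimal sketch of the multi-source BYOL objective and parameter updates described by Equations (5)-(8). The decay rate value of 0.996 is an assumption borrowed from the original BYOL paper, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def byol_pair_loss(q_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Equation (5): 2 - 2 * cosine similarity between q_theta(z_theta) and z_xi."""
    q = F.normalize(q_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)   # no gradient flows into the target branch
    return (2 - 2 * (q * z).sum(dim=1)).mean()

def byol_total_loss(q_opt, z_sar, q_sar, z_opt) -> torch.Tensor:
    """Equation (6): symmetrized loss with the optical/SAR inputs swapped between branches."""
    return byol_pair_loss(q_opt, z_sar) + byol_pair_loss(q_sar, z_opt)

@torch.no_grad()
def ema_update(target_net: torch.nn.Module, online_net: torch.nn.Module, omega: float = 0.996):
    """Equation (8): xi <- omega * xi + (1 - omega) * theta, with target decay rate omega."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(omega).add_(p_o, alpha=1.0 - omega)
```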
SimSiam differs from BYOL in that its two branches share parameters and there is no momentum update; gradients from the loss flow only through the online (predictor) branch, with a stop-gradient applied to the other branch.
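For comparison, below is a minimal sketch of the SimSiam objective as commonly implemented: both branches share the same encoder and projector, and the asymmetry comes only from the predictor and the stop-gradient (detach) on the opposite branch. The 0.5 weighting follows the original SimSiam formulation rather than details given in this paper.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p_opt: torch.Tensor, z_sar: torch.Tensor,
                 p_sar: torch.Tensor, z_opt: torch.Tensor) -> torch.Tensor:
    """Symmetrized negative cosine similarity with stop-gradient on the target side.

    p_*: predictor outputs of the shared online branch; z_*: projector outputs.
    """
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p_opt, z_sar) + 0.5 * d(p_sar, z_opt)
```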

2.3. Domain Adaptation Based on Multi-Source Contrastive Self-Supervised Pretraining

When there is a large gap between the data distributions of the source domain and the target domain, domain adaptation is generally used to align the distributions and thereby improve classification accuracy on the target domain. For the optical-to-SAR transfer task, unsupervised domain adaptation methods are adopted because annotating SAR data is difficult. Conventionally, an ImageNet pretrained network is used as the feature extractor for domain adaptation. This ignores the difference between natural optical images and remote-sensing images and leads to low classification accuracy in the transfer task from optical to SAR images.
In this paper, the SAR-optical contrastive framework without negative samples described in Section 2.2 can, in theory, learn the shared features of SAR-optical images. Therefore, we propose to replace the ImageNet pretrained network with the contrastive self-supervised pretrained network without negative samples to extract the features of optical and SAR images, as shown in Figure 7 [41]. The upper part illustrates the contrastive self-supervised training process of SimSiam. The pretrained feature extractor is transferred to the metric-learning-based domain adaptation method illustrated in the lower part, so that image features are extracted more effectively and the transfer accuracy is improved.
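A possible way to wire this up is sketched below: the domain adaptation network keeps its usual ResNet-18 feature extractor, but its weights are initialized from the SAR-optical self-supervised checkpoint instead of ImageNet. The checkpoint path and the helper name are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Hypothetical checkpoint saved after multi-source BYOL/SimSiam pretraining (Section 2.2).
SSL_CKPT = "simsiam_sar_optical_encoder.pth"

def build_feature_extractor(use_ssl_pretraining: bool = True) -> nn.Module:
    """Feature extractor for the domain adaptation network.

    If use_ssl_pretraining is True, load the SAR-optical contrastive pretrained weights;
    otherwise fall back to the conventional ImageNet initialization.
    """
    backbone = resnet18(weights=None if use_ssl_pretraining else "IMAGENET1K_V1")
    backbone.fc = nn.Identity()
    if use_ssl_pretraining:
        state = torch.load(SSL_CKPT, map_location="cpu")
        backbone.load_state_dict(state, strict=False)  # projector/predictor keys are ignored
    return backbone
```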

3. Experiments

3.1. Datasets

(1) So2Sat LCZ42 [42]: So2Sat LCZ42 (So2Sat) contains 17 categories of data and is composed of 400,673 pairs of corresponding Sentinel-1 SAR and Sentinel-2 multispectral image patches with local-climate-zone (LCZ) labels. The dataset covers 42 city-dense areas on all continents except Antarctica. Sentinel-1 consists of two polar-orbiting satellites equipped with C-band SAR remote-sensing systems with a swath width of less than 400 km. The Sentinel-1 data in the So2Sat dataset contain 8 real-valued bands, while the Sentinel-2 data contain 10 real-valued bands. The SAR-optical image pairs and the corresponding LCZ labels were sampled to 10 m spatial resolution and cropped into 32 × 32 pixel patches. We selected five types of data with obvious characteristics for the experiments: compact high-rise, heavy industry, dense trees, bare soil and sand, and water. A sample diagram of the So2Sat LCZ42 dataset is shown in Figure 8.
(2) SEN1-2 [43]: SEN1-2 is a dataset composed of 282,384 pairs of corresponding Sentinel-1 SAR and Sentinel-2 multispectral image patches without labels, collected from across the globe and throughout all meteorological seasons. The SAR-optical image pairs were sampled to 10 m spatial resolution and cropped into 256 × 256 pixel patches. A sample diagram of the SEN1-2 dataset is shown in Figure 9.
(3) QXS-SARPORT [44]: QXS-SARPORT (QXS) is a dataset composed of 20,000 pairs of SAR and optical images without labels. The SAR patches come from the GaoFen-3 SAR satellite, and the optical patches come from Google Earth. The images were collected over three ports: San Diego, Shanghai, and Qingdao. The SAR-optical image pairs were sampled to 1 m spatial resolution and cropped into 100 × 100 pixel patches. A sample diagram of the QXS dataset is shown in Figure 10. (A minimal loading sketch for such registered SAR-optical pairs is given after this dataset list.)
(4) UCMerced_LandUse [45]: UCMerced_LandUse (UCM) is a dataset composed of 21 types of optical remote-sensing scene images, with an image resolution of 0.3 m and an image size of 256 × 256 pixels. We selected six types of data with obvious characteristics for the experiments: beach, forest, river, dense residential, medium residential, and storage tank. A sample diagram of the UCMerced_LandUse dataset is shown in Figure 11.
(5) AID [46]: AID is a dataset composed of 30 types of optical remote-sensing scene images, with an image resolution of 0.5–0.8 m and an image size of 600 × 600 pixels. We selected the same six types of data with obvious characteristics for the experiments: beach, forest, river, dense residential, medium residential, and storage tank. A sample diagram of the AID dataset is shown in Figure 12.
(6) OpenSarUrban [47]: OpenSarUrban is a Sentinel-1 dataset composed of 10 different target area categories of urban SAR images, with a resolution of 20 m and image size of 100 × 100 pixels. A sample diagram of the OpenSarUrban dataset is shown in Figure 13.
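For completeness, a minimal data-loading sketch for registered SAR-optical pairs such as QXS, SEN1-2, or So2Sat is shown below. The directory layout, file naming, and class name are hypothetical; each of the real datasets ships in its own format.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class SarOpticalPairs(Dataset):
    """Registered SAR-optical patch pairs for contrastive pretraining.

    Assumes a hypothetical layout with matching file names:
        root/opt/<id>.png   and   root/sar/<id>.png
    """

    def __init__(self, root: str, transform=None):
        self.root = root
        self.ids = sorted(os.listdir(os.path.join(root, "opt")))
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        name = self.ids[idx]
        opt = Image.open(os.path.join(self.root, "opt", name)).convert("RGB")
        sar = Image.open(os.path.join(self.root, "sar", name)).convert("RGB")
        if self.transform is not None:
            opt, sar = self.transform(opt), self.transform(sar)
        return opt, sar   # the two views fed to the two Siamese branches
```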

3.2. Experimental Setup

The experiments in this research are divided into three parts. The first part is the training of the CMC, BYOL, and SimSiam contrastive self-supervised feature extractors. The second part evaluates the self-supervised methods by transferring the pretrained networks to downstream single-source SAR and optical classification tasks. The third part transfers the BYOL and SimSiam self-supervised pretrained networks to the downstream task of transferring from optical images to SAR images, replacing the feature extractor originally pretrained on the ImageNet dataset in the domain adaptation network.

3.2.1. Self-Supervised Network Training

The pretraining data for self-supervised learning consist of 20,000 pairs of SAR-optical images. For CMC, the optical and SAR images of the same scene are positive sample pairs, while images of different scenes are negative sample pairs. We used SGD [40] as the optimizer in the experiment: for the CMC training step, the learning rate was 0.05, the weight decay was 0.0004, and the momentum was 0.9. We used a batch size of 256, and the models were trained for 200 epochs. For multi-source BYOL and SimSiam, the SAR-optical image pairs were divided into two branches, one as the online network input and the other as the target network input. SGD with a learning rate of 0.05, a weight decay of 0.0004, and a momentum of 0.9 was adopted as the optimizer, and Gaussian blurring augmentation was used to improve performance.
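The loop below wires the reported hyperparameters (SGD, learning rate 0.05, weight decay 0.0004, momentum 0.9, batch size 256, 200 epochs) into a generic pretraining loop. The loss_fn signature, the data augmentation, and the omission of the BYOL momentum update are simplifying assumptions.

```python
import torch
from torch.utils.data import DataLoader

EPOCHS, BATCH_SIZE = 200, 256   # values reported in Section 3.2.1

def pretrain(model, dataset, loss_fn, device="cuda"):
    """Generic upstream CSSL pretraining loop with the reported SGD settings."""
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                                momentum=0.9, weight_decay=0.0004)
    model.to(device).train()
    for epoch in range(EPOCHS):
        for opt_img, sar_img in loader:
            opt_img, sar_img = opt_img.to(device), sar_img.to(device)
            loss = loss_fn(model, opt_img, sar_img)   # e.g., CMC, BYOL, or SimSiam loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```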

3.2.2. Downstream Classification Task

We assessed the performance of the three SAR-optical self-supervised methods after self-supervised pretraining on the training set of the So2Sat LCZ42 dataset. We first evaluated their representations by training a linear classifier on top of the frozen representation, without updating the network parameters or the batch statistics. We then evaluated the representations by finetuning the networks pretrained on the SEN1-2, So2Sat, and QXS datasets, respectively, to assess whether the features learned on different SAR-optical fusion datasets are generic and ultimately useful in the classification tasks of the single-source So2Sat, UCM, AID, and OpenSarUrban datasets. In the downstream SAR and optical image classification tasks, the total amount of finetuning training data was 6684; the training set size was increased progressively, and the test set contained 6654 images. In detail, we trained the linear classifier for 50 epochs.
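A minimal sketch of the linear evaluation protocol described above: the pretrained encoder is frozen (weights and batch statistics) and only a linear classifier on top of it is trained for 50 epochs. The classifier's optimizer settings and the feature dimension are assumptions for the example.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder: nn.Module, train_loader, num_classes: int,
                      feat_dim: int = 512, epochs: int = 50, device: str = "cuda"):
    """Linear protocol: freeze the pretrained encoder, train only a linear classifier."""
    encoder.to(device).eval()                      # frozen weights and batch statistics
    for p in encoder.parameters():
        p.requires_grad_(False)
    classifier = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.05, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)            # encoder output, no gradient
            loss = criterion(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```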

3.2.3. Downstream Domain Adaptation Task

In this paper, the domain adaptation method based on metric learning and the domain adaptation method based on adversarial learning were adopted. We used a SAR-optical self-supervised pretraining network without negative samples as a feature extractor in the transfer task of optical to SAR images. The source domain data were optical remote-sensing images from So2Sat, and the target domain data were SAR images from So2Sat.

4. Results

We assessed the performance of BYOL's representation after self-supervised pretraining on the training set of the So2Sat dataset. We first evaluated it on both the RGB and SAR images from the So2Sat dataset in a linear evaluation setup, and we then measured its transfer capabilities on other datasets.
Firstly, we analyzed the influence of the single-source and multi-source augmentation strategies in CSSL and explored the effectiveness of multi-source augmentation in CSSL frameworks with and without negative samples, measured by the accuracy of a linear classifier trained on downstream single-source remote-sensing scene images from the So2Sat dataset. Secondly, we investigated which CSSL framework benefits more from the multi-source strategy and why. Thirdly, we measured the transfer capabilities of the CSSL framework without negative samples on other datasets. Finally, we evaluated the use of a previously trained CSSL model without negative samples as the feature extractor in the domain adaptation task of transferring from optical to SAR images.

4.1. Comparison on the Linear Classification Task

The effectiveness of the encoders pretrained by BYOL, SimSiam, CMC, SimCLR, and MOCO [48] was evaluated by performing the linear classification protocol on features extracted in an unsupervised way. We compared self-supervised methods with pure supervised methods and the ImageNet pretrained network, and we also compared single-source CSSL methods with SAR-optical CSSL methods on single-source SAR and optical image classification tasks.
Firstly, we compared the CSSL methods with the ImageNet pretrained network and the purely supervised method on the linear classification downstream task of remote-sensing images. Table 1 lists the CSSL methods together with the accuracy of the ImageNet pretrained and supervised baselines in the last two rows. The CSSL methods achieved higher accuracy than both the ImageNet pretrained network and purely supervised training. Therefore, training the self-supervised models on remote-sensing images is much better than relying on the ImageNet pretrained network or pure supervision.
Secondly, we compared the single-source CSSL methods with negative samples, such as MOCO and SimCLR, with the contrastive methods without negative samples, such as BYOL and SimSiam. In the linear classification downstream task on single-source remote-sensing images, BYOL achieved the best accuracy, while SimSiam, which also uses no negative samples, had low accuracy. We speculate that the parameter update strategy is important when training contrastive learning without negative samples on single-source remote-sensing images. Thirdly, we compared the effectiveness of BYOL without negative samples under single-source and multi-source augmentation strategies. The SAR-optical augmentation strategy greatly improved the linear classification accuracy on single-source optical and SAR images, and SAR-optical augmentation led the upstream feature encoder to learn more effective features.
Fourthly, we compared the effectiveness of the SAR-optical augmentation strategy in the negative-sample method CMC and in the CSSL methods without negative samples, BYOL and SimSiam. In the linear classification results, the SAR-optical methods without negative samples achieved higher classification accuracy. This shows that the SAR-optical methods without negative samples learned more effective features, and the shared features of SAR and optical images learned by BYOL and SimSiam are more conducive to transfer to the downstream single-source classification task.
Finally, we compared the upstream self-supervised training times of the different CSSL methods on a GeForce RTX 3070 GPU with 16 GB of memory, as shown in Table 2. For the single-source methods, we only compared the training time on optical remote-sensing images. Table 2 shows that the BYOL and SimSiam methods without negative samples had longer training times than the methods with negative samples. The reason is that the batch normalization (BN) layers in their MLP networks increase the amount of computation, while MoCo, SimCLR and CMC do not rely on global BN, so their feedforward part, which consumes the most computation, is much faster. However, among the methods without negative samples, the multi-source variant does not consume more time than the single-source variant and obtains better accuracy.

4.2. Comparison on Finetuning Results

We wanted to analyze the influence of the number of images used for finetuning the self-supervised models on our downstream tasks. To do this, we finetuned self-supervised models pretrained on the So2Sat dataset using subsets of 500, 2000, 3500, 5000, and 6654 annotated images from the So2Sat dataset. Figure 14 shows that the self-supervised models achieve higher accuracy than the ImageNet pretrained network, especially with a small number of annotated images.
The results in Table 3 show that the choice of images used to train the BYOL and SimSiam models is crucial to performance on the downstream tasks when the models are used as feature extractors. The results of the downstream finetuning task on the So2Sat dataset show little difference between the pretraining datasets. For the UCM and AID images, the best results were obtained with neural networks pretrained in a supervised fashion. The QXS dataset has a resolution of 1 m, which is much higher than the 10 m resolution of the SEN1-2 and So2Sat datasets. Interestingly, the self-supervised models gave good results when pretrained on images with resolutions similar to those of the downstream task, and the opposite when the spatial resolutions of the pretraining and downstream images differed significantly. On the downstream task of the OpenSarUrban dataset, resolution was also a key factor for the pretraining dataset.

4.3. Visualization

We wanted to verify that the SAR-optical CSSL methods without negative samples had learned shared features by visualizing the image features produced by the convolution layers. We took two SAR-optical image pairs from QXS as input to the pretrained networks and visualized the features extracted after the first convolution layer. Comparing Figure 15b,d with Figure 15c, we can see that the SAR and optical features extracted by each channel of the convolutional layer in the BYOL and SimSiam methods are the same. We can conclude that the features learned by BYOL and SimSiam are shared features of SAR-optical images: BYOL and SimSiam outperformed CMC because the shared features bring better stability and robustness during transfer, as well as higher accuracy in optical and SAR single-source image classification tasks. We also see that BYOL and SimSiam pay more attention to texture features rather than background information, which is beneficial for remote-sensing image scene classification.
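A sketch of the kind of feature-map visualization used here is given below; it assumes a torchvision-style ResNet-18 encoder with a `conv1` attribute and uses matplotlib, which are assumptions about tooling rather than details reported by the authors.

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def show_first_conv_features(encoder, image: torch.Tensor, n_channels: int = 8):
    """Visualize feature maps after the first convolution layer of a ResNet-18 encoder.

    image: a (1, 3, H, W) tensor; encoder.conv1 follows the torchvision ResNet layout.
    """
    encoder.eval()
    fmap = encoder.conv1(image)                    # (1, 64, H/2, W/2) for ResNet-18
    fig, axes = plt.subplots(1, n_channels, figsize=(2 * n_channels, 2))
    for c, ax in enumerate(axes):
        ax.imshow(fmap[0, c].cpu(), cmap="gray")
        ax.axis("off")
    plt.show()
```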

4.4. Transfer Task of Optical to SAR Images

In the previous experiments, we found that BYOL and SimSiam could obtain very robust shared features of optical and SAR images and could simultaneously obtain high accuracy in single-source optical and SAR image classification tasks. Therefore, we applied the multi-source self-supervised pretrained network to the transfer task from optical remote-sensing images to SAR images. Since there is a large difference between the data distributions of optical and SAR images, the common solution is to introduce a domain adaptation algorithm to align the distributions of the source and target domains. Among the domain adaptation algorithms listed in Table 4, DANN and CDAN are based on adversarial learning, while DAN is based on metric learning; source-only is the control group that does not use a domain adaptation algorithm. In the experiment, we replaced the ImageNet pretrained network with the self-supervised pretrained network to extract the features of the source-domain and target-domain images. As seen in Table 4, the self-supervised pretrained networks extract more effective features from both domains and ultimately improve the transfer classification accuracy from optical to SAR images.
We also analyzed the confusion matrices of the classification results for each method, as shown in Figure 16. The confusion matrices show that dense trees and water have the highest transfer accuracy. Through horizontal comparison, we found that the self-supervised contrastive learning network helps the domain adaptation algorithm maximize the classification accuracy of the successfully transferred categories; however, it cannot improve the transfer results for easily confused categories.

5. Conclusions

In this paper, we mainly study the application of self-supervised contrastive learning methods with and without negative samples to multi-source remote-sensing images. We show that if we use a feature extractor trained on the CSSL method as a pretrained network, it outperforms both the ImageNet pretrained network and the purely supervised method. We also verify that the multi-source CSSL method achieves better results than the single-source CSSL method. We prove that CSSL can effectively use the information from SAR-optical remote-sensing images and that the contrastive Siamese network without negative samples can more effectively utilize the consistency information brought by the registered images.
We conclude that the CSSL method without negative samples performs better than the CSSL method with negative samples because it is able to learn the shared features of multi-source remote-sensing images. The shared features learned by upstream self-supervised training bring robustness and stability to downstream tasks and are effective for single-source image classification. However, when finetuning the self-supervised network on different downstream datasets, performance was constrained by the difference in spatial resolution between the pretraining and downstream remote-sensing datasets. We also found that, in the domain adaptation task of transferring from optical to SAR images, using a multi-source CSSL pretrained network without negative samples can greatly improve the transfer accuracy because this pretrained network learns the shared features of optical and SAR images.
Accordingly, for future development, we plan to minimize the finetuning performance degradation caused by data resolution differences for classification tasks and further improve transfer performance from optical images to SAR images.

Author Contributions

Conceptualization, C.L. and H.S.; methodology, C.L.; software, C.L.; validation, H.S. and Y.X.; formal analysis, H.S.; investigation, H.S.; resources, G.K.; data curation, C.L.; writing—original draft preparation, C.L.; writing—review and editing, C.L., H.S. and Y.X.; visualization, C.L.; supervision, H.S.; project administration, G.K.; funding acquisition, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 61971426.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Robinson, C.; Malkin, K.; Jojic, N.; Chen, H.; Qin, R.; Xiao, C.; Schmitt, M.; Ghamisi, P.; Hänsch, R.; Hänsch, N. Global land-cover mapping with weak supervision: Outcome of the 2020 IEEE GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3185–3199. [Google Scholar] [CrossRef]
  2. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big data for remote sensing: Challenges and opportunities. Proc. IEEE 2016, 104, 2207–2219. [Google Scholar] [CrossRef]
  3. Ghamisi, P.; Rasti, B.; Yokoya, N.; Wang, Q.; Hofle, B.; Bruzzone, L.; Bovolo, F.; Chi, M.; Anders, K.; Gloaguen, R.; et al. Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art. IEEE Geosci. Remote Sens. Mag. 2019, 7, 6–39. [Google Scholar] [CrossRef]
  4. Li, X.; Lei, L.; Sun, Y.; Li, M.; Kuang, G. Multimodal Bilinear Fusion Network With Second-Order Attention-Based Channel Selection for Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1011–1026. [Google Scholar] [CrossRef]
  5. Tuia, D.; Volpi, M.; Trolliet, M.; Camps-Valls, G. Semisupervised Manifold Alignment of Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2014, 52, 7708–7720. [Google Scholar] [CrossRef]
  6. Penatti, O.; Nogueira, K.; dos Santos, J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 44–51. [Google Scholar]
  7. Yi, W.; Zeng, Y.; Yuan, Z. Fusion of GF-3 SAR and optical images based on the nonsubsampled contourlet transform. Acta Opt. Sin. 2018, 38, 76–85. [Google Scholar]
  8. Feng, Q.; Yang, J.; Zhu, D.; Liu, J.; Guo, H.; Bayartungalag, B.; Li, B. Integrating Multitemporal Sentinel-1/2 Data for Coastal Land Cover Classification Using a Multibranch Convolutional Neural Network: A Case of the Yellow River Delta. Remote Sens. 2019, 11, 1006. [Google Scholar] [CrossRef]
  9. Wang, D.; Du, B.; Zhang, L. Fully contextual network for hyperspectral scene parsing. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  10. Kim, S.; Song, W.-J.; Kim, S.-H. Double Weight-Based SAR and Infrared Sensor Fusion for Automatic Ground Target Recognition with Deep Learning. Remote Sens. 2018, 10, 72. [Google Scholar] [CrossRef]
  11. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
  12. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef] [Green Version]
  13. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  14. Zhou, H.Y.; Yu, S.; Bian, C.; Hu, Y.; Ma, K.; Zheng, Y. Comparing to learn: Surpassing imagenet pretraining on radiographs by comparing image representations. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Lima, Peru, 4–8 October 2020. [Google Scholar]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  16. Wang, D.; Zhang, J.; Du, B.; Xia, G.S.; Tao, D. An Empirical Study of Remote Sensing Pretraining. arXiv 2022, arXiv:2204.02825. [Google Scholar] [CrossRef]
  17. Stojnic, V.; Risojevic, V. Self-supervised learning of remote sensing scene representations using contrastive multiview coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1182–1191. [Google Scholar]
  18. Kriegeskorte, N. Deep neural networks: A new framework for modelling biological vision and brain information processing. Annu. Rev. Vis. Sci. 2015, 1, 417–446. [Google Scholar] [CrossRef]
  19. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231. [Google Scholar]
  20. Albuquerque, I.; Naik, N.; Li, J.; Keskar, N.; Socher, R. Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv 2020, arXiv:2003.13525. [Google Scholar]
  21. Scheibenreif, L.; Hanna, J.; Mommert, M.; Borth, D. Self-Supervised Vision Transformers for Land-Cover Segmentation and Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1422–1431. [Google Scholar]
  22. Stojnic, V.; Risojevic, V. Evaluation of Split-Brain Autoencoders for High-Resolution Remote Sensing Scene Classification. In Proceedings of the 2018 International Symposium ELMAR, Zadar, Croatia, 16–19 September 2018; pp. 67–70. [Google Scholar]
  23. Gómez-Chova, L.; Tuia, D.; Moser, G.; Camps-Valls, G. Multimodal classification of remote sensing images: A review and future directions. Proc. IEEE 2015, 103, 1560–1584. [Google Scholar] [CrossRef]
  24. Sun, Z.; Dai, M.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. An anchor-free detection method for ship targets in high-resolution SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7799–7816. [Google Scholar] [CrossRef]
  25. Goyal, P.; Mahajan, D.; Gupta, A.; Misra, I. Scaling and benchmarking self-supervised visual representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 20–26 October 2019; pp. 6391–6400. [Google Scholar]
  26. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021. [Google Scholar] [CrossRef]
  27. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2. [Google Scholar] [CrossRef]
  28. Manas, O.; Lacoste, A.; Giró-i-Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9414–9423. [Google Scholar]
  29. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  30. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2574–2582. [Google Scholar]
  31. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Multiview Coding. arXiv 2019, arXiv:1906.05849. [Google Scholar]
  32. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  33. Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15750–15758. [Google Scholar]
  34. Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning representations by maximizing mutual information across views. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  35. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 2096-2030. [Google Scholar]
  36. Long, M.; Cao, Y.; Cao, Z.; Wang, J.; Jordan, M.I. Transferable representation learning with deep adaptation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 3071–3085. [Google Scholar] [CrossRef]
  37. Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Conditional adversarial domain adaptation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  38. Scheibenreif, L.; Mommert, M.; Borth, D. Contrastive self-supervised data fusion for satellite imagery. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 3, 705–711. [Google Scholar] [CrossRef]
  39. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
  40. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010, Paris, France, 22–27 August 2010; pp. 177–186. [Google Scholar]
  41. Deng, W.; Zhao, L.; Kuang, G.; Hu, D.; Pietikäinen, M.; Liu, L. Deep Ladder-Suppression Network for Unsupervised Domain Adaptation. IEEE Trans. Cybern. 2021, 1–15. [Google Scholar] [CrossRef]
  42. Zhu, X.X.; Hu, J.; Qiu, C.; Shi, Y.; Kang, J.; Mou, L.; Wang, Y.; Huang, R.; Li, H.; Sun, Y.; et al. So2Sat LCZ42: A benchmark data set for the classification of global local climate zones [Software and Data Sets]. IEEE Geosci. Remote Sens. Mag. 2020, 8, 76–89. [Google Scholar] [CrossRef]
  43. Schmitt, M.; Hughes, L.H.; Zhu, X.X. The SEN1-2 dataset for deep learning in SAR-optical data fusion. arXiv 2018, arXiv:1807.01569. [Google Scholar] [CrossRef]
  44. Huang, M.; Xu, Y.; Qian, L.; Shi, W.; Zhang, Y.; Bao, W.; Wang, N.; Liu, X.J.; Xiang, X. The QXS-SAROPT dataset for deep learning in SAR-optical data fusion. arXiv 2021, arXiv:2103.08259. [Google Scholar]
  45. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  46. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  47. Zhao, J.; Zhang, Z.; Yao, W.; Datcu, M.; Xiong, H.; Yu, W. OpenSARUrban: A Sentinel-1 SAR image dataset for urban interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 187–203. [Google Scholar] [CrossRef]
  48. da Costa, V.G.T.; Fini, E.; Nabi, M.; Sebe, N.; Ricci, E. Solo-learn: A Library of Self-supervised Methods for Visual Representation Learning. J. Mach. Learn. Res. 2022, 23, 1–6. [Google Scholar]
Figure 1. (a) Overview of model architectures of CSSL with negative samples; the blue balls represent the mapped vectors of positive samples, and the yellow ones represent the mapped vectors of negative samples. (b) Overview of model architectures of CSSL without negative samples; features from homologous data are mapped into space to predict each other. (c) Overview of SAR-optical alignment module.
Figure 2. Schematic diagram of positive pair: (a) single-source augmentation; (b) SAR-optical augmentation.
Figure 3. Schematic representation of the CMC algorithm.
Figure 4. Asymmetric Siamese network.
Figure 5. Structure of BYOL with SAR-optical images.
Figure 6. Structure of SimSiam with SAR-optical images.
Figure 7. Example of the self-supervised pretrained network used in the domain adaptation task of transferring from optical to SAR data.
Figure 8. A sample diagram of the So2Sat LCZ42 dataset.
Figure 9. A sample diagram of the SEN1-2 dataset.
Figure 10. A sample diagram of the QXS dataset.
Figure 11. A sample diagram of the UCM dataset.
Figure 12. A sample diagram of the AID dataset.
Figure 13. A sample diagram of the OpenSarUrban dataset.
Figure 14. Overall accuracy achieved by different methods on test set versus the number of samples used for the training of the finetuning evaluation. (a) The classification accuracy of SAR images in So2Sat dataset; (b) The classification accuracy of RGB images in So2Sat dataset.
Figure 15. Feature visualization. (a) Example of optical and SAR remote-sensing input images; (b) Visualization result of (a) input from multi-source BYOL network; (c) Visualization result of (a) input from CMC network; (d) Visualization result of (a) input from multi-source SimSiam network; (e) Visualization result of (a) input from ImageNet pretrained network.
Figure 16. Confusion matrices obtained by different domain adaptation methods based on different pretrained networks. The more labels of a class that are predicted, the darker the color of the squares in the matrix. The numbers in the squares on the diagonal represent the number of correct predictions for each class label.
Table 1. Overall accuracy (%) obtained by different methods on the test set with a linear classifier.

| Category | Method | So2Sat LCZ42-SAR | So2Sat LCZ42-OPT |
|---|---|---|---|
| Single-source without negative samples | BYOL-SAR | 79.44% | - |
| | BYOL-OPT | - | 91.62% |
| | SimSiam-SAR | 73.85% | - |
| | SimSiam-OPT | - | 86.23% |
| Single-source with negative samples | SimCLR-SAR | 77.23% | - |
| | SimCLR-OPT | - | 86.71% |
| | MOCO-SAR | 74.43% | - |
| | MOCO-OPT | - | 86.51% |
| Multi-source with negative samples | CMC | 74.03% | 87.09% |
| Multi-source without negative samples | Multi-source-BYOL | 81.22% | 92.37% |
| | Multi-source-SimSiam | 80.52% | 91.04% |
| | ImageNet | 72.95% | 84.61% |
| | Supervised | 73.71% | 86.16% |
Table 2. Training time of CSSL methods (h: hours, m: minutes, s: seconds).

| | MOCO | SimCLR | CMC | BYOL (Single-Source) | BYOL (Multi-Source) | SimSiam (Single-Source) | SimSiam (Multi-Source) |
|---|---|---|---|---|---|---|---|
| Upstream CSSL training time | 3 h 37 m | 3 h 05 m | 3 h 17 m | 4 h 12 m | 4 h 13 m | 4 h 12 m | 4 h 14 m |
| Downstream linear classification time | 1 m 17 s | 1 m 17 s | 1 m 17 s | 1 m 17 s | 1 m 17 s | 1 m 17 s | 1 m 17 s |
Table 3. Overall accuracy (%) obtained by BYOL and SimSiam pretrained on different datasets on the downstream test sets (finetuned evaluation).

| Finetuning Initialization | So2Sat LCZ42-RGB | So2Sat LCZ42-SAR | OpenSAR-Urban | UCMerced_LandUse | AID |
|---|---|---|---|---|---|
| BYOL-QXS-SARPORT | 94.02 | 81.15 | 57.35 | 93.33 | 95.75 |
| SimSiam-QXS-SARPORT | 92.38 | 81.69 | 51.13 | 83.33 | 88.92 |
| BYOL-SEN1-2 | 92.73 | 81.27 | 48.52 | 63.33 | 83.96 |
| SimSiam-SEN1-2 | 92.83 | 81.49 | 48.70 | 64.17 | 83.73 |
| BYOL-So2Sat LCZ42 | 93.25 | 81.22 | 53.35 | 86.67 | 90.57 |
| SimSiam-So2Sat LCZ42 | 93.15 | 81.21 | 53.40 | 88.00 | 91.98 |
| ImageNet | 92.58 | 80.64 | 59.07 | 97.50 | 95.87 |
| Supervised | 92.74 | 81.09 | 58.58 | 79.17 | 90.80 |
Table 4. Overall accuracy obtained by DAN, DANN and CDAN domain adaptation methods based on different pretrained networks.

| Initialization Networks | Source-Only | DANN | DAN | CDAN |
|---|---|---|---|---|
| BYOL | 66.04 | 74.88 | 77.67 | 78.60 |
| SimSiam | 57.61 | 79.58 | 80.67 | 81.45 |
| ImageNet | 51.56 | 75.12 | 73.03 | 81.35 |