1. Introduction
The advancement and popularization of sensor technology have promoted the wide application of remote-sensing-image research in human activities. For example, the high-quality spatiotemporal remote sensing images obtained through satellite remote sensing have great research significance in areas [1,2] such as crop monitoring [3], forest monitoring [4], land-cover change monitoring [5], real-time urban disaster monitoring [6], and water resource evaluation [7]. These applications require both the spatial resolution to resolve surface details (the texture and structure of ground objects) and a dense time series of remote sensing images to capture ground changes for accurate classification and identification. In practice, however, unavoidable technical and budget restrictions force trade-offs among the temporal, spatial, and spectral resolutions of Earth observation data, making it difficult to obtain remote sensing images with both high temporal and high spatial resolution [8,9]. In most cases, at least two different data sources are combined to generate high-quality fused images. The first is Landsat 8, whose spectral bands mostly have a spatial resolution of 30 m and a temporal resolution of 16 days; its long-term repeated measurements yield fine images with low-temporal, high-spatial (LTHS) resolution [10]. The other is the MODerate Resolution Imaging Spectroradiometer (MODIS), which covers a large area of the planet at spatial resolutions of 250 to 1000 m across different wavelengths and is acquired daily; it yields coarse images with high-temporal, low-spatial (HTLS) resolution [11]. Because a single data source provides insufficient information, researchers have proposed spatiotemporal fusion algorithms for remote sensing images, which merge images from multiple data sources to obtain fused images with high spatiotemporal resolution. We use Landsat 8 and MODIS data sources to synthesize images with simultaneously high spatial and high temporal resolution. Multiple information sources have thus been shown to yield richer characteristic data than a single source, and many outstanding research results have been achieved, providing theoretical support for later application research.
Since Tsai proposed image super-resolution reconstruction in 1984 [12], many scholars have researched and discussed the topic. Super-resolution reconstruction is a technique for obtaining high-resolution images from low-resolution inputs. Owing to its low cost, short processing period, ample room for improvement, comprehensive coverage, large amount of information, and good durability, the technology has been widely used [13]. High-resolution remote sensing images serve various fields, such as environmental monitoring, urban planning, and emergency rescue [14,15], but obtaining high-resolution images at low cost and in a short time has always been a problem to be solved in remote sensing [16]. In this study, high-quality images were obtained by combining spatiotemporal fusion rules for remote sensing images with super-resolution reconstruction, laying a good foundation for future applications.
After years of development, researchers have proposed two types of spatiotemporal fusion models for remote sensing images: traditional models and models based on deep learning, as described in Section 2. Although some of these models have achieved exemplary application results, significant discrepancies remain between the actual land surface and the collected data. For example, how the HTLS, LTHS, and reference images are selected is very consequential, because all high-frequency information in the modeling process comes from these selected data. If there is a significant difference between the reference image and the predicted image, the final fusion will not reach acceptable quality. Second, real data are often contaminated by shadow (penumbra) and noise, which is theoretically inconsistent with the cleaned, usable datasets. To solve these problems, the quality of the predicted image must be further enhanced.
This research proposes a feedback- and texture-transformer-based spatiotemporal fusion model for remote sensing images, obtained by redesigning the network of the Enhanced Deep Convolutional Spatiotemporal Fusion Network (EDCSTFN) model. Our model has five characteristics: (1) The model needs at least two pairs of MODIS–Landsat images to produce high-quality predictions, and the predicted image information relies entirely on a continuous time series. (2) The transformer was initially applied in natural language processing [17]. As research developed, it abandoned convolution and recursion modules and became wholly based on the self-attention mechanism, which has strong parallelism. Our model employs a dual-branch feedback mechanism and a texture transformer and considers the images' dependence on the time series. Therefore, the network not only starts from the structural similarity of the coarse and fine image pairs but also uses the rich texture information in adjacent images to predict the fine images and improve the quality of image reconstruction. (3) The same dual-branch texture transformer is used so that the predicted image accurately acquires more detailed information. (4) Both branches of the model use a feedback mechanism to refine low-level representations with high-level information. (5) The model evaluates the fusion results with a composite loss function comprising a content loss and a visual loss; the primary purpose is to preserve high-frequency information and make the generated image clearer, as sketched below. In this study, three datasets, Ar Horqin Banner (AHB), Coleambally Irrigation Area (CIA), and Lower Gwydir Catchment (LGC), were selected for comparative analysis against classical fusion models. The results show that the model proposed in this paper improves fusion accuracy, prediction results, and the quality of the fused images.
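To make characteristic (5) concrete, the following is a minimal sketch of one plausible form of such a composite loss, assuming PyTorch. The content term is a plain MSE; the visual term here is a simple image-gradient loss used as a stand-in for a high-frequency-preserving component (EDCSTFN, for instance, pairs a content loss with an SSIM-style vision loss), so the weighting `alpha` and the exact terms are illustrative assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def image_gradients(x):
    # Horizontal and vertical finite differences: a cheap proxy for
    # the high-frequency content of an image batch (B, C, H, W).
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return dx, dy

def composite_loss(pred, target, alpha=0.8):
    # Content loss: pixel-wise fidelity between prediction and reference.
    content = F.mse_loss(pred, target)
    # Visual loss (illustrative): match the gradients so edges and
    # textures, i.e., high-frequency information, are preserved.
    pdx, pdy = image_gradients(pred)
    tdx, tdy = image_gradients(target)
    visual = F.mse_loss(pdx, tdx) + F.mse_loss(pdy, tdy)
    return alpha * content + (1.0 - alpha) * visual
```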
2. Related Works
In machine learning, convolutional neural networks (CNNs) have attracted significant attention [18]. A CNN is a deep feedforward neural network trained and designed using prior knowledge, and it extracts rich feature information through its stacked layers. A classic CNN is composed of an input layer, one or more convolutional layers, and one or more subsampling (pooling) layers [19]. Previous researchers extracted higher-level features by adding convolutional layers while avoiding the overfitting caused by the increase in depth. CNNs have become efficient frameworks for image feature extraction and recognition [20]. As research progressed, CNNs moved from their original use of extracting high-level features for image classification and recognition to the fields of image super-resolution reconstruction and data fusion [21,22].
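As a concrete illustration of this classic layout, here is a minimal PyTorch sketch; the layer widths, the 32 × 32 input, and the ten-class head are purely illustrative assumptions of ours.

```python
import torch
import torch.nn as nn

class ClassicCNN(nn.Module):
    """Input -> (conv -> pool) x 2 -> classifier, mirroring the classic layout."""
    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # subsampling (pooling) layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Two 2x poolings shrink a 32x32 input to 8x8 with 32 channels.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```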
CNNs outperform earlier approaches to image super-resolution, as they do in other computer vision tasks [23]. The Super-Resolution Convolutional Neural Network (SRCNN) first used a three-layer CNN in image SR to learn the complex LR-to-HR mapping. Very Deep Convolutional Networks for super-resolution (VDSR) [24] increases the depth of the CNN to 20 layers to use more contextual information in the LR image and adopts skip connections to overcome the difficulty of optimizing a deep network. Recent studies have used different skip connections to achieve super-resolution image reconstruction. The Super-Resolution Generative Adversarial Network (SRGAN) [25] and Enhanced Deep Residual Networks (EDSR) [26] use residual skip connections [27], which improve the accuracy of the reconstructed image. SRDenseNet [28], built on dense skip connections [29], obtains more characteristic information, and the Residual Dense Network (RDN) [30] incorporates a combination of local/global residual connections and dense skip connections. Experiments have shown that these models perform well in super-resolution image reconstruction, but two problems remain. The first is that, because these architectures use skip connections or a bottom-up combination of hierarchical features, they extract only low-level features; the upper layers receive limited information owing to small receptive fields and a lack of sufficient contextual information, which further limits the network's reconstruction ability. The second is the "space-time contradiction": the spatial and temporal resolutions of remote sensing images mutually constrain each other.
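To illustrate the residual skip connection that SRGAN, EDSR, and RDN share, here is a minimal EDSR-style residual block in PyTorch; the channel width is an illustrative assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """EDSR-style residual block: the skip connection adds the input back
    to the convolved features, which eases optimization in deep SR networks."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual (skip) connection
```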
With the continuous development of deep learning models, CNN-based research has gradually been applied to the spatiotemporal fusion of remote sensing images, although it is still at an early stage. After reviewing the literature in this area, we conclude that existing spatiotemporal fusion algorithms can be divided into five categories: (1) transformation-based, (2) reconstruction-based, (3) Bayesian-based, (4) learning-based, and (5) pan-sharpening-based.
The transformation-based methods mainly adopt mathematical transformation techniques [31], such as the wavelet transform. To integrate multi-source data, the original image pixels are mapped into another abstract space, converting the data from the spatial domain to the frequency domain. This approach has two characteristics: first, it extracts clear high-frequency information from the transformed LTHS image and fuses it with the HTLS image to obtain a high-quality fused image, as sketched below; second, it has spatial generality, as fusion rules can extract different types of features from different spatial images [32].
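As an illustration of that first characteristic, here is a minimal sketch of a wavelet-based fusion step, assuming the PyWavelets library and co-registered single-band arrays of equal size; the fusion rule (approximation coefficients from the coarse image, detail coefficients from the fine image) is a deliberately simple stand-in for the rules used in the cited work.

```python
import pywt  # PyWavelets

def wavelet_fuse(lths_fine, htls_coarse, wavelet="haar"):
    """Toy transform-based fusion: take the high-frequency detail of the
    spatially fine (LTHS) image and the low-frequency approximation of the
    temporally dense coarse (HTLS) image, then invert the transform.
    Both inputs are assumed to be co-registered 2D float arrays."""
    _, details_fine = pywt.dwt2(lths_fine, wavelet)      # (cH, cV, cD) details
    approx_coarse, _ = pywt.dwt2(htls_coarse, wavelet)   # cA approximation
    return pywt.idwt2((approx_coarse, details_fine), wavelet)
```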
Reconstruction-based methods fall into two categories: weight-function-based and unmixing-based. Weight-function methods mainly estimate the predicted image by setting a weight function and combining the image reflectances. Classical methods include the spatial and temporal adaptive reflectance fusion model (STARFM), the spatial temporal adaptive algorithm for mapping reflectance change (STAARCH) [33], and the enhanced STARFM (ESTARFM) [34]. Unmixing methods mainly use spectral unmixing theory and an unmixing algorithm to build the fusion model, reconstructing the corresponding LTHS images from HTLS images. Existing methods include the spatiotemporal reflectance unmixing model (STRUM) [35], flexible spatiotemporal data fusion (FSDAF) [36], unmixing-based data fusion (UBDF) [37], the spatial attraction model (SAM) [38], and the spatiotemporal data fusion algorithm (STDFA) [39].
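The core idea behind the weight-function family can be sketched in a few lines. The version below is a deliberately stripped-down illustration, assuming co-registered NumPy float arrays with the coarse images already resampled to the fine grid; it replaces STARFM's spectral, temporal, and spatial neighbor weighting with a plain local mean, so it should be read as the idea, not the algorithm.

```python
import numpy as np

def weight_function_fuse(fine_t1, coarse_t1, coarse_t2, window=3):
    """Simplified weight-function fusion in the spirit of STARFM:
    the reflectance change observed between the two coarse acquisitions
    is smoothed over a local window and added to the fine reference image.
    Real STARFM weights neighboring pixels by spectral, temporal, and
    spatial distance instead of using a uniform window mean."""
    delta = coarse_t2 - coarse_t1          # temporal change seen by the coarse sensor
    pad = window // 2
    padded = np.pad(delta, pad, mode="edge")
    smoothed = np.empty_like(delta)
    h, w = delta.shape
    for i in range(h):
        for j in range(w):
            smoothed[i, j] = padded[i:i + window, j:j + window].mean()
    return fine_t1 + smoothed              # predicted fine image at t2
```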
Bayesian-based methods integrate the temporal, spatial, and spectral information into a unified framework, so the input images are not strictly limited, in order to achieve the most realistic prediction results. Existing methods include the unified fusion method [40] and the Bayesian fusion method.
Learning-based methods do not require manually set fusion rules; they mainly use existing archived data to train a supervised deep learning model. Existing learning-based fusion methods include the sparse-representation-based spatiotemporal reflectance fusion model (SPSTFM) [41], spatiotemporal fusion using deep convolutional neural networks (STFDCNN) [42], the deep convolutional spatiotemporal fusion network (DCSTFN) [43], the enhanced DCSTFN (EDCSTFN) [44], the two-stream convolutional neural network for spatiotemporal image fusion (STFNet) [45], and the generative adversarial network-based spatiotemporal fusion model (GAN-STFM) [46].
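To give a flavor of this family, here is a minimal two-branch fusion network in PyTorch, loosely in the spirit of DCSTFN/STFNet; the branch depths, channel widths, and the assumption that the coarse input has already been upsampled to the fine grid are all illustrative choices of ours, not the cited architectures.

```python
import torch
import torch.nn as nn

class TwoStreamFusionNet(nn.Module):
    """Minimal two-branch fusion sketch: one branch encodes the coarse
    (HTLS) image at the prediction date, the other the fine (LTHS)
    reference; their features are concatenated and decoded into the
    predicted fine image."""
    def __init__(self, bands=6, feats=32):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(bands, feats, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.coarse_branch = branch()
        self.fine_branch = branch()
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, bands, 3, padding=1),
        )

    def forward(self, coarse_t2, fine_t1):
        # Assumes coarse_t2 was resampled to the fine grid beforehand.
        f = torch.cat([self.coarse_branch(coarse_t2),
                       self.fine_branch(fine_t1)], dim=1)
        return self.decoder(f)
```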
Pan-sharpening-based fusion applies the CNN model to panchromatic and multispectral images. As remote sensing research has deepened, many researchers have proposed pan-sharpening methods; typical ones include the intensity-hue-saturation (IHS) transformation [47,48,49], principal component analysis (PCA) [50,51], the Brovey transform (BT) [52], Laplacian pyramid decomposition [53], the wavelet transform [54], and curvelet transforms at different resolutions [55,56,57].
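Among these, the Brovey transform is simple enough to state in full. The following is a minimal NumPy sketch, assuming the multispectral bands have already been resampled to the panchromatic resolution; the `eps` guard against division by zero is our addition.

```python
import numpy as np

def brovey_sharpen(ms, pan, eps=1e-6):
    """Brovey transform pan-sharpening: each multispectral band is scaled
    by the ratio of the panchromatic band to the sum of the MS bands.
    `ms` has shape (bands, H, W); `pan` has shape (H, W), co-registered
    and resampled to the same grid."""
    total = ms.sum(axis=0) + eps           # per-pixel sum over bands
    return ms * (pan / total)[None, :, :]  # broadcast ratio over bands
```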
Although progress in sensor technology has dramatically improved the accuracy of satellite observation, the following problems remain. First, technical and budget limitations make it impossible to obtain high-temporal, high-spatial-resolution images directly. Second, there is always a trade-off among the temporal, spatial, and spectral resolutions of the observation data, so it is challenging to continuously obtain LTHS and HTLS data pairs for research. Therefore, this research combines the concepts of ConvNets and transformers, using VGGNet and the transformer as backbone networks, which is reflected in two aspects: on the one hand, members of our team have studied transformer-based super-resolution reconstruction and published research results; on the other hand, the model draws on the state of the art in spatiotemporal fusion and super-resolution reconstruction from other papers, adding texture transformers and feedback mechanisms to supplement the input data, extract as much helpful information as possible, and at the same time reduce the model parameters for the best output image quality. We believe this fusion method can provide a reference for future research and has promising practical application prospects.