Remote Sensing Image Processing with Transformers

A special issue of Remote Sensing (ISSN 2072-4292). This special issue belongs to the section "Remote Sensing Image Processing".

Deadline for manuscript submissions: closed (25 December 2023) | Viewed by 9412

Special Issue Editors

Guest Editor
Department of Electrical & Computer Engineering, University of California San Diego, 9500 Gilman Drive, San Diego, CA, USA
Interests: computer vision; image analysis; machine learning; remote sensing; big data processing

Guest Editor
School of Software, Northwestern Polytechnical University, Xi'an 710072, China
Interests: image classification; image segmentation; geophysical image processing; image representation; deep learning; artificial intelligence

Guest Editor
Data Science in Earth Observation, Technical University of Munich (TUM), 80333 Munich, Germany
Interests: computer vision; machine learning; remote sensing

Special Issue Information

Dear Colleagues,

Remote sensing image processing is essential to real-world applications such as urban planning and smart cities. In recent years, deep learning algorithms such as convolutional neural networks have come to dominate the field, driven by their capacity for large-scale learning and by the exponentially growing volume of satellite imagery. Despite these advances in convolutional neural networks, transformers have shown superior performance in natural language processing and RGB image processing through a self-attention strategy that effectively models the relationships between words or image patches.
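
As a concrete illustration of the self-attention strategy mentioned above, the following minimal PyTorch sketch applies scaled dot-product self-attention to a sequence of image-patch tokens. It is an illustrative toy, not any particular paper's implementation:

    # Illustrative sketch: scaled dot-product self-attention over patch tokens.
    import torch
    import torch.nn as nn

    class PatchSelfAttention(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.qkv = nn.Linear(dim, 3 * dim)  # joint query/key/value projection
            self.proj = nn.Linear(dim, dim)     # output projection
            self.scale = dim ** -0.5            # 1/sqrt(d) scaling

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_patches, dim) -- one token per image patch
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale  # pairwise patch affinities
            attn = attn.softmax(dim=-1)
            return self.proj(attn @ v)          # attention-weighted mix of patches

    tokens = torch.randn(1, 256, 64)            # e.g., a 16x16 grid of 64-d patch embeddings
    out = PatchSelfAttention(64)(tokens)        # (1, 256, 64)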

However, the investigation of transformers in remote sensing image processing is still limited. Although transformers have shown their advantages in other fields, challenges remain, including how to apply transformers to high-resolution satellite images, how to make them computationally efficient, and how to better understand their behavior. It is therefore crucial to study the effectiveness and efficiency of transformers in remote sensing image processing.

This Special Issue focuses on advances in remote sensing image processing with transformers. Topics of interest include, but are not limited to:

  • Remote sensing image classification and scene understanding;
  • Object counting in remote sensing images;
  • Smart city applications with transformers;
  • Transfer learning in remote sensing images;
  • Novel transformer architecture designs for remote sensing images;
  • Efficient transformers for remote sensing;
  • Explainable methods for remote scene understanding;
  • Self-supervised learning with transformers for remote sensing.

Dr. Jia Wan
Prof. Dr. Jiangbin Zheng
Dr. Zhitong Xiong
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Remote Sensing is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2700 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • remote sensing
  • transformers
  • image classification
  • object counting
  • transfer learning
  • scene understanding
  • self-supervised learning
  • satellite images
  • deep neural networks

Published Papers (5 papers)


Research

20 pages, 3101 KiB  
Article
An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution
by Wenjian Zhang, Zheng Tan, Qunbo Lv, Jiaao Li, Baoyu Zhu and Yangyang Liu
Remote Sens. 2024, 16(5), 880; https://doi.org/10.3390/rs16050880 - 01 Mar 2024
Cited by 1 | Viewed by 813
Abstract
Transformer models have great potential in the field of remote sensing super-resolution (SR) due to their excellent self-attention mechanisms. However, transformer models are prone to overfitting because of their large number of parameters, especially on the typically small remote sensing datasets. Additionally, the reliance of transformer-based SR models on convolution-based upsampling often leads to mismatched semantic information. To tackle these challenges, we propose an efficient super-resolution hybrid network (EHNet) with an encoder composed of our lightweight convolution module and a decoder composed of an improved Swin Transformer. The encoder features a novel Lightweight Feature Extraction Block (LFEB), which builds on depthwise convolution to provide a more efficient alternative to depthwise separable convolution, and integrates a Cross Stage Partial structure for enhanced feature extraction. For the decoder, built on the Swin Transformer, we propose a sequence-based upsample block (SUB) that operates directly on the transformer's token sequence, using an MLP layer to focus on semantic information; this enhances the model's feature expression ability and improves reconstruction accuracy. Experiments show that EHNet achieves state-of-the-art PSNR of 28.02 and 29.44 on the UCMerced and AID datasets, respectively, and its results are also visually better than those of other existing methods. Its 2.64 M parameters effectively balance model efficiency and computational demands.
(This article belongs to the Special Issue Remote Sensing Image Processing with Transformers)
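
As a rough picture of the sequence-based upsampling idea described in the abstract, the hypothetical PyTorch sketch below expands each token's channels with an MLP and then rearranges the token grid into a higher-resolution map. The module name, shapes, and design details are our assumptions, not the authors' code:

    # Hypothetical sketch of MLP-based upsampling on a token sequence.
    import torch
    import torch.nn as nn

    class SequenceUpsample(nn.Module):
        def __init__(self, dim: int, scale: int):
            super().__init__()
            self.scale = scale
            # the MLP acts on tokens directly, so upsampling stays in sequence space
            self.mlp = nn.Sequential(nn.Linear(dim, dim * scale * scale), nn.GELU())

        def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
            # x: (batch, h*w, dim) token sequence from the transformer decoder
            b, n, d = x.shape
            x = self.mlp(x)                                  # (b, h*w, d*r*r)
            x = x.transpose(1, 2).reshape(b, d * self.scale ** 2, h, w)
            x = nn.functional.pixel_shuffle(x, self.scale)   # (b, d, h*r, w*r)
            return x.flatten(2).transpose(1, 2)              # back to a token sequence

    tokens = torch.randn(1, 8 * 8, 32)
    up = SequenceUpsample(32, 2)(tokens, 8, 8)               # (1, 256, 32)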

18 pages, 43119 KiB  
Article
LPST-Det: Local-Perception-Enhanced Swin Transformer for SAR Ship Detection
by Zhigang Yang, Xiangyu Xia, Yiming Liu, Guiwei Wen, Wei Emma Zhang and Limin Guo
Remote Sens. 2024, 16(3), 483; https://doi.org/10.3390/rs16030483 - 26 Jan 2024
Viewed by 1384
Abstract
Convolutional neural networks (CNNs) and transformers have boosted the rapid growth of object detection in synthetic aperture radar (SAR) images. However, it is still a challenging task because SAR images usually have the characteristics of unclear contours, sidelobe interference, speckle noise, multiple scales, complex inshore backgrounds, etc. More effective feature extraction by the backbone and augmentation in the neck promise a performance increment. In response, we make full use of the advantage of CNNs in extracting local features and of transformers in capturing long-range dependencies, and propose a Swin Transformer-based detector for arbitrary-oriented SAR ship detection. First, we incorporate a convolution-based local perception unit (CLPU) into the transformer structure to establish a powerful backbone. The local-perception-enhanced Swin Transformer (LP-Swin) backbone combines the local information perception ability of CNNs with the global feature extraction ability of transformers to enhance representation learning, extracting object features more effectively and boosting detection performance. Then, we devise a cross-scale bidirectional feature pyramid network (CS-BiFPN) that strengthens the propagation and integration of both location and semantic information, allowing more effective utilization of the features extracted by the backbone and mitigating the problem of multi-scale ships. Moreover, we design a one-stage framework integrating LP-Swin, CS-BiFPN, and the detection head of R3Det for arbitrary-oriented object detection, which provides more precise locations for inclined objects and introduces less background information. On the SAR Ship Detection Dataset (SSDD), ablation studies verify the effectiveness of each component, and comparative experiments show that our detector attains 93.31% mean average precision (mAP), a detection performance comparable to other advanced detectors.
(This article belongs to the Special Issue Remote Sensing Image Processing with Transformers)
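
One common way to realize a convolution-based local perception unit of the kind the abstract describes is a residual depthwise convolution applied to the token grid; the PyTorch sketch below illustrates this reading and is not the authors' exact module:

    # Illustrative sketch: residual depthwise convolution over the token grid.
    import torch
    import torch.nn as nn

    class LocalPerceptionUnit(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            # depthwise 3x3 convolution: cheap, per-channel local mixing
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

        def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
            # x: (batch, h*w, dim) tokens from a transformer stage
            b, n, d = x.shape
            feat = x.transpose(1, 2).reshape(b, d, h, w)
            feat = feat + self.dwconv(feat)        # residual local perception
            return feat.flatten(2).transpose(1, 2)

    tokens = torch.randn(2, 49, 96)                # a 7x7 window of 96-d tokens
    out = LocalPerceptionUnit(96)(tokens, 7, 7)    # same shape, locally enhanced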

26 pages, 29980 KiB  
Article
P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification
by Guanqun Wang, He Chen, Liang Chen, Yin Zhuang, Shanghang Zhang, Tong Zhang, Hao Dong and Peng Gao
Remote Sens. 2023, 15(7), 1773; https://doi.org/10.3390/rs15071773 - 26 Mar 2023
Cited by 11 | Viewed by 2601
Abstract
Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, providing unique labeling information for each acquired remote sensing image. Thanks to the potent global context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, capturing global spatial information alone is insufficient for powerful RSIC performance. Specifically, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, due to the lack of inductive biases, the powerful global spatial context representation capability of ViT requires lengthy training procedures and a large volume of pre-training data. To solve these problems, we propose a hybrid architecture of convolutional neural network (CNN) and ViT, called P2FEViT, which integrates plug-and-play CNN features with ViT to improve RSIC ability. In this paper, the feature representation capabilities of CNN and ViT applied to RSIC are first analyzed. Second, to integrate the advantages of CNN and ViT, we propose a novel approach that embeds CNN features into the ViT architecture, enabling the model to synchronously capture and fuse global context and local information and further improve the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges rapidly with relatively less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and our self-built fine-grained target classification dataset, BIT-AFGR50. The experimental results demonstrate that P2FEViT effectively improves feature description capability and obtains outstanding image classification performance, while significantly reducing ViT's dependence on large-scale pre-training data and accelerating convergence. The code and self-built dataset will be released on our webpage.
(This article belongs to the Special Issue Remote Sensing Image Processing with Transformers)
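
The general idea of embedding CNN features into a ViT token sequence can be sketched as follows in PyTorch: a CNN feature map is pooled and projected into extra tokens that are concatenated with the patch tokens before the transformer encoder. Names and shapes here are assumptions, not the paper's implementation:

    # Hypothetical sketch: CNN feature map -> extra tokens for a ViT.
    import torch
    import torch.nn as nn

    class CNNFeatureTokens(nn.Module):
        def __init__(self, cnn_channels: int, vit_dim: int, grid: int = 4):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(grid)        # grid x grid CNN tokens
            self.proj = nn.Linear(cnn_channels, vit_dim)  # match the ViT embed dim

        def forward(self, cnn_feat: torch.Tensor, patch_tokens: torch.Tensor):
            # cnn_feat: (b, c, h, w) from a CNN backbone stage
            # patch_tokens: (b, n, vit_dim) from the ViT patch embedding
            extra = self.pool(cnn_feat).flatten(2).transpose(1, 2)  # (b, grid*grid, c)
            extra = self.proj(extra)                                # (b, grid*grid, vit_dim)
            # the transformer then attends jointly over local CNN and global ViT tokens
            return torch.cat([patch_tokens, extra], dim=1)

    fused = CNNFeatureTokens(256, 768)(torch.randn(1, 256, 14, 14),
                                       torch.randn(1, 196, 768))    # (1, 212, 768)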

23 pages, 961 KiB  
Article
Remote Sensing Image Scene Classification with Self-Supervised Learning Based on Partially Unlabeled Datasets
by Xiliang Chen, Guobin Zhu and Mingqing Liu
Remote Sens. 2022, 14(22), 5838; https://doi.org/10.3390/rs14225838 - 18 Nov 2022
Cited by 7 | Viewed by 1955
Abstract
In recent years, supervised learning, represented by deep learning, has shown good performance in remote sensing image scene classification due to its powerful feature learning ability. However, it requires large-scale, high-quality labeled datasets, which makes obtaining annotated samples costly. Self-supervised learning can alleviate this problem by using unlabeled data to learn an image feature representation that then transfers to the downstream task. In this study, we use an encoder-decoder structure to construct a self-supervised learning architecture. In the encoding stage, an image mask randomly discards some of the image patches, and the image's feature representation is learned from the remaining patches. In the decoding stage, a lightweight decoder recovers the pixels of the masked patches from the features learned in the encoding stage. We constructed a large-scale unlabeled training set from several public scene classification datasets and Gaofen-2 satellite data to train the self-supervised learning model. For the downstream task, we use the encoder, with the masking removed, as the backbone network for scene classification, fine-tuning the pre-trained weights from the self-supervised encoding stage on two open datasets with complex scene categories, NWPU-RESISC45 and AID. Compared with mainstream supervised and self-supervised learning methods, our proposed method performs better than most state-of-the-art methods on the task of remote sensing image scene classification.
(This article belongs to the Special Issue Remote Sensing Image Processing with Transformers)
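
The masking step the abstract describes follows the generic masked-image-modeling recipe: randomly drop a large fraction of patch tokens, encode only the visible ones, and train a lightweight decoder to reconstruct the missing pixels. Below is a minimal PyTorch sketch of the random masking step (the generic recipe, not the paper's code):

    # Illustrative sketch: random patch masking for masked image modeling.
    import torch

    def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
        # tokens: (batch, num_patches, dim); keep a random subset per image
        b, n, d = tokens.shape
        n_keep = int(n * (1.0 - mask_ratio))
        noise = torch.rand(b, n)                     # one random score per patch
        keep_idx = noise.argsort(dim=1)[:, :n_keep]  # lowest-score patches survive
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        return visible, keep_idx                     # the decoder later fills in the rest

    # the encoder sees only ~25% of the patches, which keeps pre-training cheap
    visible, keep_idx = random_mask(torch.randn(4, 196, 768))
    print(visible.shape)                             # torch.Size([4, 49, 768])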

21 pages, 2206 KiB  
Article
Hyperspectral Image Classification with IFormer Network Feature Extraction
by Qi Ren, Bing Tu, Sha Liao and Siyuan Chen
Remote Sens. 2022, 14(19), 4866; https://doi.org/10.3390/rs14194866 - 29 Sep 2022
Cited by 10 | Viewed by 1761
Abstract
Convolutional neural networks (CNNs) are widely used for hyperspectral image (HSI) classification due to their ability to model the local details of HSI. However, CNNs tend to ignore the global information of HSI and thus lack the ability to establish long-range dependencies; modeling such dependencies is computationally costly and remains challenging. To address this problem, we propose an end-to-end Inception Transformer network (IFormer) that can efficiently generate rich feature maps from HSI data and extract high- and low-frequency information from those maps. First, spectral features are extracted using batch normalization (BN) and a 1D CNN, while the Ghost Module generates more feature maps via low-cost operations to fully exploit the intrinsic information in HSI features, thus improving computational speed. Second, the feature maps are passed to the Inception Transformer through a channel-splitting mechanism, which effectively learns the combined high- and low-frequency information in the feature maps and allows flexible modeling of discriminative information scattered across different frequency ranges. Finally, the HSI features are classified via pooling and linear layers. IFormer is compared with other mainstream algorithms on four publicly available hyperspectral datasets, and the results demonstrate that the proposed algorithm is highly competitive among HSI classification algorithms.
(This article belongs to the Special Issue Remote Sensing Image Processing with Transformers)
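
The channel-splitting mechanism the abstract mentions can be illustrated schematically in PyTorch: part of the channels passes through a cheap convolution branch (high-frequency, local detail) and the rest through attention (low-frequency, global structure). Split sizes and module names are our assumptions, not the paper's design:

    # Illustrative sketch: splitting channels between conv and attention branches.
    import torch
    import torch.nn as nn

    class FrequencyMixer(nn.Module):
        def __init__(self, dim: int, high: int):
            super().__init__()
            self.high = high                  # channels routed to the conv branch
            self.conv = nn.Conv2d(high, high, 3, padding=1, groups=high)
            self.attn = nn.MultiheadAttention(dim - high, num_heads=4, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, height, width) HSI feature map
            hi, lo = x.split([self.high, x.size(1) - self.high], dim=1)
            hi = self.conv(hi)                               # local, high-frequency detail
            b, c, h, w = lo.shape
            seq = lo.flatten(2).transpose(1, 2)              # (b, h*w, c)
            lo = self.attn(seq, seq, seq)[0].transpose(1, 2).reshape(b, c, h, w)
            return torch.cat([hi, lo], dim=1)                # recombine both branches

    out = FrequencyMixer(64, high=32)(torch.randn(1, 64, 9, 9))  # (1, 64, 9, 9)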
