Article

Spectral Swin Transformer Network for Hyperspectral Image Classification

Baisen Liu, Yuanjia Liu, Wulin Zhang, Yiran Tian and Weili Kong
1 Department of Physics and Electronic Engineering, Mudanjiang Normal University, Mudanjiang 157011, China
2 Office of Academic Affairs, Heilongjiang Institute of Technology, Harbin 150001, China
3 Department of Information and Communication Engineering, Harbin Engineering University, Harbin 150009, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(15), 3721; https://doi.org/10.3390/rs15153721
Submission received: 15 June 2023 / Revised: 20 July 2023 / Accepted: 21 July 2023 / Published: 26 July 2023

Abstract

Hyperspectral images (HSIs) are complex images that contain far more spectral information than ordinary images. An increasing number of HSI classification methods use deep learning techniques to process the three-dimensional data. The Vision Transformer model is gradually occupying an important position in the field of computer vision and is being used to replace the CNN structures of networks, but it is still at a preliminary research stage in the HSI field. In this paper, we propose a spectral Swin Transformer network for HSI classification, providing a new approach for the HSI field. The Swin Transformer uses windowed group attention to enhance feature representation, and its shifted-window attention computation takes the contextual information of neighboring windows into account, which helps retain the global features of an HSI and improves classification results. In our experiments, we evaluated the proposed approach on several public hyperspectral datasets and compared it with several methods. The experimental results demonstrate that our model achieved test accuracies of 97.46%, 99.7%, and 99.8% on the IP, SA, and PU public HSI datasets, respectively, when using the AdamW optimizer. Our approach also shows good generalization ability when applied to new datasets. Overall, the proposed approach represents a promising direction for hyperspectral image classification with deep learning techniques.

1. Introduction

Hyperspectral image (HSI) classification uses remote sensing images that contain both spatial and spectral information. HSIs differ from conventional and multispectral images in that they have a much richer spectral dimension: the third dimension of an HSI contains more than two hundred spectral bands, and each channel records the reflectance of the ground objects at a fixed wavelength. Because reflectance spectra vary between materials, these subtle differences can be exploited to improve classification accuracy. HSI has been used in various fields, including agriculture and forestry [1], urban planning and urban mapping [2], mineral exploration [3], and medical hyperspectral imaging [4].
Although neighboring bands in an HSI are highly correlated, each spectral band contains unique information about the materials in the scene. The Hughes phenomenon [5] refers to the degradation of classification performance as the data dimensionality grows while the number of training samples remains fixed. Therefore, appropriate dimensionality reduction techniques need to be applied to avoid the Hughes phenomenon. Methods for processing high-dimensional and redundant hyperspectral data include feature selection, feature extraction, principal component analysis (PCA) [6], and independent component analysis (ICA) [7].
Among traditional machine learning classification methods, supervised classifiers that exploit rich spectral information are very popular. Some of the most prominent are maximum likelihood (ML), nearest neighbor classifiers, decision trees, random forests, and support vector machines (SVMs) [8]. K-nearest neighbor (kNN) was for a long time the most widely used simple HSI classifier. Kuo et al. proposed mapping the data from the input space to a higher-dimensional feature space with a nonlinear transformation and then applying a kernel Gaussian classifier and a kernel k-nearest-neighbor classifier to the mapped images [9]. Following the success of SVMs in various fields of pattern recognition, researchers found that SVMs outperformed traditional supervised classification methods in HSI classification, and HSI classification based on kernel feature selection for SVMs has been investigated further [10,11]. Tarabalka et al. proposed a classification method based on SVMs and Markov random fields (MRFs) [12]; it refines the SVM classification results with the spectral–spatial contextual information obtained from MRF regularization.
The rapid development of deep learning has made HSI classification more efficient than ever before [13]. Adaptive feature extraction, in place of manual feature engineering, avoids the drawbacks of low generalization and weak representation ability. CNNs have long been a central research topic in pattern recognition, and the HSI classification field is no exception [14,15]. Combined spatial–spectral classification approaches make full use of the high-dimensional features of HSI [16]. Zhang et al. [17] proposed a classification framework, the diverse region-based CNN, which encodes semantic context-aware representations to obtain promising features. Wan et al. [18] adopted the graph convolution network (GCN) for HSI classification because it can convolve arbitrarily structured non-Euclidean data and handle irregular image regions represented by a graph's topology. Jiao et al. [19] proposed a fully convolutional network (FCN) using weighted fusion features for HSI classification. A common problem with deep models is that, with a limited number of training samples, the learned models may be suboptimal, especially for images with large intra-class variance and small inter-class variance. Gong et al. [20] addressed this with a CNN with multiscale convolution (MS-CNN) that extracts deep multiscale features from HSI. In general, deep learning-based methods exploit high-level features for HSI classification, but they often degrade the spatial–spectral structure, depend on a large number of training samples, and ignore a large amount of implicitly useful information.
The Transformer caused a stir in natural language processing (NLP) when it was first proposed, and researchers have gradually tapped its potential in computer vision (CV) [21]. He et al. [22] proposed a joint spatial–spectral Transformer classifier that discards the traditional CNN architecture. The shifted window (Swin) Transformer [23] is an attention-based architecture that has achieved state-of-the-art results in image classification tasks. It is a Vision Transformer (ViT) network built on the self-attention mechanism, but it has so far seen little use in HSI classification. We therefore adapted its structure and combined it with a spectral module for HSI classification. PCA is used to reduce the high-dimensional hyperspectral data, which are then arranged into a suitable data cube and fed to the Swin Transformer for training. By incorporating a single-stage Swin Transformer into our network, its attention module can link the contextual information of hyperspectral features. Our aim is to improve the accuracy and robustness of hyperspectral image classification. Extensive experiments show that Vision Transformer-based architectures can achieve good results in hyperspectral image classification and that complex model structures are not required on small hyperspectral datasets. For the experiment source code and instructions on configuring the runtime environment, please refer to Appendix A.
This paper's main contributions are as follows: (1) we applied the advanced Swin Transformer network to HSI classification and adjusted the network structure to reduce its complexity; (2) we used the Swin Transformer to process three-dimensional data, combining the spectral dimension and achieving competitive classification accuracy; (3) we conducted extensive experiments on HSI datasets to demonstrate the model's generalization, including on uncommon airborne HSI datasets, providing insights for other researchers working with such data.

2. Proposed Approach

2.1. Spectral Swin Transformer Network for HSI Classification

Figure 1 shows the structure of the model proposed in this paper. First, we used PCA to reduce the spectral dimensionality of the hyperspectral data at the input layer of the model. The spectra of hyperspectral images are typically high-dimensional and redundant. PCA projects the spectral dimension onto a new low-dimensional space, removing redundant information and extracting the most important features, which makes the data easier to process and classify. This is why PCA plays an important role in hyperspectral image classification: it is an effective feature extraction and data compression method.
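As a rough illustration of this PCA step (not the authors' exact code), the reduction can be sketched with scikit-learn, which is listed among the dependencies in Appendix A; the `apply_pca` helper name, array shapes, and whitening choice are assumptions.

```python
# Minimal sketch: spectral dimensionality reduction of an HSI cube with PCA.
# Assumes the cube is a NumPy array of shape (height, width, bands).
import numpy as np
from sklearn.decomposition import PCA

def apply_pca(hsi_cube, n_components=30):
    h, w, bands = hsi_cube.shape
    # Flatten the spatial dimensions so each pixel becomes one sample of length `bands`.
    flat = hsi_cube.reshape(-1, bands)
    pca = PCA(n_components=n_components, whiten=True)
    reduced = pca.fit_transform(flat)
    # Restore the spatial layout with the reduced spectral dimension.
    return reduced.reshape(h, w, n_components), pca
```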
Then, a data cube was created with three dimensions: width, height, and the reduced spectral dimension. The "create image cubes" function extracts image patches of a specified size from an input image and creates the corresponding labels. It pads the image with zeros, iterates over each pixel, and stores the surrounding image patch in an array; optionally, it removes patches with zero labels and adjusts the remaining labels accordingly. Its main purpose is to provide image patches and their associated labels for further processing. The data features are then augmented, and the cube is divided into patches at the patch layer. The input shape of a model is positively correlated with its computational complexity: a smaller input shape allows higher-resolution hyperspectral images to be processed by the model but may reduce classification accuracy, so the trade-off between input shape and classification accuracy must be considered when working with hyperspectral images.
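A hedged sketch of this patch-extraction step is given below; the function name, argument defaults, and zero-label handling follow the description above but are assumptions rather than the authors' exact implementation.

```python
# Minimal sketch: extract per-pixel image patches ("cubes") and their labels.
import numpy as np

def create_image_cubes(data, labels, patch_size=25, remove_zero_labels=True):
    margin = patch_size // 2
    # Zero-pad the spatial borders so every pixel has a full neighborhood.
    padded = np.pad(data, ((margin, margin), (margin, margin), (0, 0)), mode="constant")
    h, w, _ = data.shape
    patches, patch_labels = [], []
    for r in range(h):
        for c in range(w):
            # Patch centered on pixel (r, c) in the original image.
            patches.append(padded[r:r + patch_size, c:c + patch_size, :])
            patch_labels.append(labels[r, c])
    patches = np.asarray(patches)
    patch_labels = np.asarray(patch_labels)
    if remove_zero_labels:
        # Drop unlabeled pixels (label 0) and shift the remaining labels to start at 0.
        mask = patch_labels > 0
        patches, patch_labels = patches[mask], patch_labels[mask] - 1
    return patches, patch_labels
```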
The Swin Transformer is a computer vision architecture based on ViT [21]. It introduces a shifted-window mechanism that restricts self-attention to fixed-size windows while still allowing the model to learn features that span window boundaries. The hierarchical design with subsampling layers lets the self-attention mechanism take into account both local and global features of the data. The performance of the Swin Transformer varies considerably with the type of dataset: Transformers often train well on large datasets but are less effective with a limited number of training samples, so the Swin Transformer provides configurations with different window sizes for different dataset types and sizes. We refer to the Swin Transformer's tiny universal backbone network in the official Microsoft example (https://github.com/microsoft/Swin-Transformer, accessed on 30 April 2023). To improve computational efficiency, simplify the structure, and adapt it to HSI classification, we removed the redundant linear embedding and patch-merging modules relative to the standard Swin Transformer structure, cutting it down from four stages to one.
Random crop is one of the most commonly used data augmentation methods in deep learning and can improve a model's accuracy and stability. Random flip is a data augmentation method that randomly flips the input during training. Patch extraction further extracts features from the augmented data. The patch-embedding module divides the feature data into patches of a given size (initially set to 2), matching the window-size settings, and produces an embedding for each patch.
The patch-merging module subsamples before the beginning of each stage, reducing the resolution and adjusting the number of channels, which produces the hierarchical design. Finally, a global average pooling layer produces the output features, followed by a softmax activation over the classes. Global average pooling outputs a 1D feature vector; it replaces the fully connected layers of popular CNN models and helps avoid overfitting.
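The output stage described above can be sketched in Keras as follows; this is a minimal illustration of how global average pooling replaces a dense "flatten" stack, and the token count, embedding dimension, and class count used here are illustrative assumptions, not the exact configuration of the proposed network.

```python
# Minimal sketch: classification head after the single Swin stage.
import tensorflow as tf
from tensorflow.keras import layers

num_tokens, embed_dim, num_classes = 64, 64, 16   # illustrative values

tokens = tf.keras.Input(shape=(num_tokens, embed_dim))   # token features from the Swin stage
pooled = layers.GlobalAveragePooling1D()(tokens)          # 1D feature vector, no fully connected stack
outputs = layers.Dense(num_classes, activation="softmax")(pooled)
head = tf.keras.Model(tokens, outputs)
```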

2.2. Swin Transformer Encoder and SW-MSA

Two successive Swin Transformer blocks form the core of the Swin Transformer, introducing shifted-window operations that let windows interact with each other. The computations of two successive Swin Transformer blocks can be expressed as in Equation (1). The input features of this stage, $z^{l-1}$, pass through layer normalization (LN), window multi-head self-attention (W-MSA), and a residual connection to obtain $\hat{z}^{l}$. After passing through LN and a multilayer perceptron (MLP) with another residual connection, the features enter the second block, which uses shifted window multi-head self-attention (SW-MSA).
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \tag{1}$$
The attention weight calculation for each partitioned window is defined in Formula (2), where Q, K, and V are the query, key, and value matrices obtained by linear transformations of the input features and d is their dimension. The dot product of Q and K measures their similarity, but this similarity is not normalized, so a softmax function is applied to the scaled dot products; the resulting output is a mask-like weight matrix with all values between 0 and 1. Multiplying this weight matrix by V yields the filtered V features.
$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{2}$$
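As a small illustration of Formula (2), a generic scaled dot-product attention can be written as below; this sketch assumes Q, K, and V are tensors of shape (batch, tokens, d) and deliberately omits the window partitioning, masking, and multi-head splitting of the full W-MSA module.

```python
# Minimal sketch of scaled dot-product attention (Formula (2)).
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    d = tf.cast(tf.shape(k)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d)   # similarity of Q and K
    weights = tf.nn.softmax(scores, axis=-1)                   # normalized weights in [0, 1]
    return tf.matmul(weights, v)                               # filtered V features
```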
Shifted window multi-head self-attention (SW-MSA) allows self-attention computations to cross the boundaries of the four windows in the window layer. A schematic diagram of the shifted-window offset process is shown in Figure 2. In standard self-attention, every patch in the feature map must attend to all other patches. The shifted W-MSA module first divides the feature map into multiple windows of size M × M, applies a cyclic shift to offset the windows, and then computes self-attention within each window. By setting suitable masks, the amount of computation inside each window is reduced (Figure 2 shows the four windows inside the window layer).
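A hedged sketch of the cyclic shift and window partition used by SW-MSA is shown below; it assumes a concrete feature map of shape (batch, H, W, C) with H and W divisible by the window size M, and the helper name and tensor layout are illustrative rather than the authors' exact code.

```python
# Minimal sketch: cyclic shift followed by non-overlapping window partitioning.
import tensorflow as tf

def shift_and_partition(x, window_size=2, shift_size=1):
    # x: feature map of shape (batch, H, W, C); H and W must be divisible by window_size.
    _, h, w, c = x.shape
    m = window_size
    # Cyclically shift the map so pixels near window borders share a window
    # with their former neighbors in the next attention step.
    shifted = tf.roll(x, shift=[-shift_size, -shift_size], axis=[1, 2])
    # Partition into non-overlapping M x M windows.
    windows = tf.reshape(shifted, (-1, h // m, m, w // m, m, c))
    windows = tf.transpose(windows, (0, 1, 3, 2, 4, 5))
    return tf.reshape(windows, (-1, m * m, c))   # (batch * num_windows, M*M, C)
```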

2.3. Parametric Analysis

To verify the effectiveness of the proposed model for HSI classification, several important parameters were selected, and a controlled-variable approach was used to examine their impact on the final test accuracy. The Swin Transformer architecture is based on ViT, so module parameters that are not present in ViT were selected to demonstrate the effectiveness of the Swin Transformer for HSI classification. The window size and shift stride are particularly important parameters, and the number of attention heads also affects the computational complexity of the model.
In the early feature-processing stage of the proposed model, the patch size and the number of retained spectral dimensions were determined experimentally. Larger values increase the real-time computational load of the model, while smaller values lengthen the overall computation time. In addition, different HSI datasets require different patch input sizes, with the optimal range lying between 5 and 25; the reduced spectral dimension ranges from 10 to 30.
We take the Indian Pines HSI dataset as an example. Table 1 shows that the model's accuracy decreases when the number of Swin Transformer (S-T) blocks is reduced or when no S-T block is used. Changing the number of attention heads has minimal impact on accuracy; the default value is 8. The default attention window size is 2, and the default shifting window stride is 1. Larger window sizes and shift strides may cause the model to overlook some local features, ultimately lowering the classification accuracy.

3. Experimental Results

3.1. Dataset Description

The Indian Pines, Salinas, and University of Pavia datasets, which are publicly available on the Internet, were used as part of the data for this experiment. To demonstrate the generalization ability of our model, we also used other publicly available hyperspectral datasets from various regions of China. These include three hyperspectral datasets from different areas of Wuhan City, Hubei Province (Honghu, Hanchuan, and Longkou) [24,25,26,27]. Additionally, we used the Matiwan Village hyperspectral dataset [28] from the Xiongan (XA) New Area, Hebei Province. Most of these datasets contain various types of crops as the main objects of study. Information on how to obtain the data is provided in the data availability statement.
The Indian Pines (IP) dataset was collected by the AVIRIS sensor over a test site in northwest Indiana. The image is 145 × 145 pixels with a spatial resolution of 20 m. After removing 20 water absorption bands, 200 bands remain. Sixteen feature classes are labeled in the available ground truth.
The Salinas (SA) dataset was collected by AVIRIS in 1998 at Salinas Valley, California, USA. After removing 20 bands with a low signal-to-noise ratio (SNR), 204 bands were used for the experiment. The HSI includes 512 × 217 pixels with a spatial resolution of 3.7 m. Sixteen common categories are labeled in the ground truth.
The University of Pavia (PU) dataset is part of the hyperspectral imagery acquired over Pavia, Italy, in 2003. The dataset has 103 spectral bands after the noisy bands are removed, and it contains 42,776 labeled pixels covering nine categories.
The WHU-Hi dataset comprises three separate unmanned aerial vehicle (UAV)-based hyperspectral datasets: WHU-Hi-Longkou (LK), WHU-Hi-HanChuan (HC), and WHU-Hi-HongHu (HU). Compared to satellite and airborne hyperspectral platforms, UAV-based hyperspectral systems can acquire high-spatial-resolution hyperspectral images.
The first dataset, WHU-Hi-HongHu, was collected in Honghu City on 20 November 2017, using a Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro UAV platform. The experimental area is a complex agricultural scene with many crop types, including Chinese cabbage, cabbage, Brassica chinensis, and small Brassica chinensis. The dataset contains 270 bands ranging from 400 to 1000 nm, with a spatial resolution of 0.043 m.
The second dataset, WHU-Hi-HanChuan, was collected in Hanchuan on 17 June 2016, using a similar imaging sensor mounted on a Leica Aibot X6 UAV V1 platform. The study area was a rural–urban fringe zone with buildings, water, and cultivated land, and contains seven crop species: strawberry, cowpea, soybean, sorghum, water spinach, watermelon, and greens. The dataset contains 274 bands ranging from 400 to 1000 nm, with a spatial resolution of 0.109 m. Notably, due to the low solar elevation angle during data collection, there are many shadow-covered areas in the image.
The third dataset, WHU-Hi-LongKou, was collected in Longkou Town on 17 July 2018, using a Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro UAV platform. The study area is a simple agricultural scene with six crop species: corn, cotton, sesame, broadleaf soybean, narrow-leaf soybean, and rice. The dataset contains 270 bands ranging from 400 to 1000 nm, with a spatial resolution of 0.463 m.
The aerial hyperspectral remote sensing image of Matiwan Village in the XA New Area covers a spectral range of 400–1000 nm with 250 bands, an image size of 3750 × 1580 pixels, and a spatial resolution of 0.5 m. Through field investigation of the ground object categories, a total of 19 annotated classes were identified in this image, consisting primarily of economic crops.

3.2. Training Details and Evaluation Indicators

This work used a laptop computer equipped with an NVIDIA GeForce GTX 1060 graphics card (model Y7000, Lenovo, China) and a cloud server with an NVIDIA RTX 3090 graphics card (provider: AutoDL, China; https://www.autodl.com/home, accessed on 30 April 2023), which were used for the small-scale and high-resolution datasets, respectively. The datasets were divided into a training set comprising 30% of the data and a testing set comprising 70% of the data. The learning rate was set to 0.001.
The initial patch size for feature extraction was set to 25, a crucial parameter that affects the computational time and the load on the CPU and GPU. A larger value demands more from the hardware, while a smaller value results in a longer computation time, so a balance must be struck between the initial patch size and training performance. A patch size of 25 performed well on most datasets, but it was not suitable for the high-resolution XA dataset, which required a smaller patch size so that the computation could fit on our cloud server.
In the Swin Transformer model, we used initial parameters with a patch size of 2 × 2, a dropout rate of 0.03, 8 attention heads, an embedding dimension of 64, an MLP hidden dimension of 256, a window size of 2, and a shift-window stride of 1. The model uses TensorFlow 2.9 and Python 3.8. In this study, the AdamW optimizer and the stochastic gradient descent (SGD) optimizer were used to train the proposed Swin Transformer model, with categorical cross-entropy as the loss function. Because we found during the experiments that the choice of optimizer affects the final accuracy of the model, we also report a comparison of the results obtained with the different optimizers.
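A minimal sketch of this training configuration is given below; the `compile_and_train` wrapper, batch size, and weight-decay value are assumptions, and depending on the TensorFlow release AdamW is exposed as `tf.keras.optimizers.AdamW`, `tf.keras.optimizers.experimental.AdamW`, or via TensorFlow Addons.

```python
# Minimal sketch: compiling and fitting a Keras model with AdamW and
# categorical cross-entropy, matching the setup described above.
import tensorflow as tf

def compile_and_train(model, x_train, y_train, x_test, y_test, epochs=100):
    # AdamW location varies by TensorFlow version; TF 2.9+ provides the
    # experimental namespace used here.
    optimizer = tf.keras.optimizers.experimental.AdamW(learning_rate=1e-3,
                                                       weight_decay=1e-4)
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",   # labels are one-hot encoded
                  metrics=["accuracy"])
    return model.fit(x_train, y_train,
                     validation_data=(x_test, y_test),
                     batch_size=256, epochs=epochs)
```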
This experiment used three prevalent evaluation metrics: average accuracy (AA), overall accuracy (OA), and the kappa coefficient. Average accuracy is the mean of the per-class classification accuracies (recalls) and is calculated with Equations (3) and (4).
$$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}} \tag{3}$$

$$\text{AA} = \frac{\sum_{i=1}^{n} \text{Recall}_{i}}{n} \times 100 \tag{4}$$

$$\text{OA} = \frac{\text{True positive} + \text{True negative}}{\text{Total number of pixels}} \times 100 \tag{5}$$

$$\text{Kappa} = \frac{p_{0} - p_{e}}{1 - p_{e}}, \quad p_{0} = \frac{\sum_{i} x_{ii}}{N}, \quad p_{e} = \frac{\sum_{i} x_{i+}\, x_{+i}}{N^{2}} \tag{6}$$
Overall accuracy is calculated with Equation (5). Kappa (6) is a statistical measure of the agreement between the ground truth map and the classification map. $p_{0}$ is the observed agreement, i.e., the classification consistency between the ground truth and the classifier. $p_{e}$ is the expected chance agreement, i.e., the level of agreement expected by chance given the classifier's predicted distribution, where $x_{i+}$ and $x_{+i}$ are the marginal totals of row $i$ and column $i$, $x_{ii}$ is the number of observations in row $i$ and column $i$, and $N$ is the total number of observations.
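For illustration, the three metrics can be computed from a confusion matrix as sketched below, using scikit-learn (listed in Appendix A); the `evaluate` helper and the assumption that `y_true` and `y_pred` are 1D integer label arrays are ours.

```python
# Minimal sketch: OA (Eq. 5), AA (Eqs. 3-4), and kappa (Eq. 6) from predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum() * 100                 # overall accuracy
    recall_per_class = np.diag(cm) / cm.sum(axis=1)    # per-class recall
    aa = recall_per_class.mean() * 100                 # average accuracy
    kappa = cohen_kappa_score(y_true, y_pred)          # chance-corrected agreement
    return oa, aa, kappa
```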

3.3. Classification Results of Public Hyperspectral Image Datasets

The classification results of the Indian Pines (IP), Pavia University (PU), and Salinas (SA) hyperspectral datasets with our model are shown in Figure 3. It can be observed that the proposed model achieved relatively high test accuracies on all three datasets, and the loss converged faster on the SA and PU datasets. This corresponds to the results of the parameter analysis discussed in Section 2: the spectral dimension of the IP dataset was set to 30, while the spectral dimension of the SA and PU datasets was set to 15. The spectral dimension of the WHU-Hi datasets was also set to 15, which yielded good performance, as seen in Figure 4. The XA dataset has a higher resolution, so a smaller input data cube had to be used when creating its input shape. Even so, the classification performance did not improve significantly, with an average test accuracy of only about 48%, as shown in Figure 5.

3.4. Classification Results for Each Object Class in All HSI Datasets

Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the classification results obtained with AdamW for each class, including precision, recall, and F1 score. Precision is the proportion of correctly classified samples among all samples predicted as a given class. The F1 score is a statistical measure of a classification model's accuracy that takes both precision and recall into account; it can be viewed as a weighted average of precision and recall, with values between 0 and 1.

3.5. Classification Maps and Comparisons

3.5.1. Classification Maps of Chinese Dataset: (WHU-HI and XA Datasets)

To demonstrate the generalization ability of our model, we also tested it on other uncommon hyperspectral datasets, including three datasets from Wuhan and the XA dataset. For the Honghu dataset in Wuhan, which has a relatively high resolution, we reduced the patch size from 25 to 17 and set the spectral dimension to 15. For the XA dataset, which has the highest resolution, both the patch size and spectral dimension were reduced to 5 and 10, respectively.
As shown in Figure 6, Figure 7 and Figure 8, to highlight the differences in classification performance, we include the results of different optimizers on the WHU-Hi datasets. It can be observed that the AdamW optimizer produces significantly better classification maps than the SGD optimizer; the choice of optimizer is therefore crucial in deep learning models and has a significant impact on the final results. Because hyperspectral datasets from China are still relatively scarce, there is little published research on classification models for them, so no comparison with other models is given for these datasets. Since the XA dataset is too large for our proposed model to produce full classification maps, Figure 9 only shows the original image and the ground truth map of the XA dataset. In this paper, we only discuss the results obtained by training our model on these hyperspectral datasets; access information is given in the data availability statement.

3.5.2. Comparison of Our Proposed Model with Other Models on IP, SA, and PU Hyperspectral Datasets

Unlike the current mainstream CNN architectures, this study uses the Swin Transformer to build a novel HSI classification model. To verify the effectiveness of the proposed method, we compare its classification results with the classical HSI classification models SVM [29], Baseline NN [29], 1D CNN [30], 2D CNN [31], and 3D CNN [32], and with the mainstream models 3D FCN [33], spatial–spectral 3D CNN (S-S 3D CNN) [34], ViT [21,35], and spatial–spectral ViT (SST) [22]. The code for the comparative experiments was obtained from an open-source code collection on GitHub [29] (https://github.com/nshaud/DeepHyperX, accessed on 30 April 2023). The comparative experiments used a learning rate of 0.001 with 30% of the samples for training and 70% for testing, the same as our model; all other parameters were left at the defaults specified in the comparison code. The comparative results presented below are based on 100 training iterations and include the results obtained with the AdamW optimizer in our model.
Figure 10, Figure 11 and Figure 12 show the classification maps of the ground truth samples for the IP, PU, and SA hyperspectral datasets, respectively, for the various models, together with the real ground sample images and the correct classification answers. By comparison, we found that the 3D neural network-based models achieved good results because they perform feature convolution in three dimensions simultaneously; however, our model achieved similarly excellent results without three-dimensional convolution. The ground object details in the SA and PU classification results were well classified and superior to those of the classical machine learning methods. In the case of the IP dataset, however, our model did not achieve the expected results, which we attribute to the input shape of the model.
Figure 13 presents a comparison of the classification results of various algorithms for the IP, PU, and SA datasets, using models referenced in the experimental setup section, with all parameters except for the learning rate set to their default values. Our model was configured with a patch size of 25 and spectral dimensions of 30 and 15, with the latter used for datasets with higher resolution (>500). The results show that our model achieved classification accuracy comparable to other state-of-the-art models in the PU and SA datasets when using the AdamW optimizer.
Table 9, Table 10 and Table 11 provide a comparison of the OA, AA, and kappa results for each algorithm, revealing that our model achieved good classification accuracy and competitive performance. Our model achieved significant advantages in the SA and PU datasets. It outperforms traditional machine learning (ML) methods and is also highly competitive in deep learning (DL) methods and Vision Transformers. Almost all models perform better in the SA and PU datasets compared to the IP dataset.

4. Conclusions

This paper mainly introduces a Swin Transformer network for processing three-dimensional HSI data. By reducing the spectral dimensionality through PCA in the preliminary stage and combining spectral information to create a three-dimensional data cube, it mitigates the high-dimensionality and complexity of high-spectral-resolution data. Finally, the unique shift window and global self-attention computation in the Swin Transformer are combined for high-spectral-resolution image classification. Experimental results demonstrate competitive performance on most small-scale HSI datasets.
In the experiments, we observed that a less complex network structure is sufficient for handling small-scale HSI datasets. Many studies focus on adding more modules to tackle different complex problems, but in HSI classification tasks, if the early-stage feature processing is done properly, even a simple structure can yield good results, as also demonstrated by the HybridSN [36] architecture. However, such a structure is limited when handling ultra-high-resolution HSI datasets, performing poorly on images whose spatial size exceeds 1000 pixels. How to handle such large-scale HSI datasets is a question for future research. The proposed model also provides a useful starting point for HSI classification models based on the Vision Transformer.

Author Contributions

Conceptualization, B.L. and W.Z.; methodology, Y.L.; software, Y.L.; validation, Y.L. and Y.T.; resources, Y.L.; data curation, Y.L. and W.K.; writing—original draft preparation, Y.L.; writing—review and editing, B.L. and W.Z.; visualization, Y.L. and Y.T.; supervision, B.L. and W.Z.; project administration, Y.L.; funding acquisition, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the Natural Science Foundation of Heilongjiang Province for Key projects, China (grant no. ZD2021F004), the Postdoctoral Scientific Research Developmental Fund of Heilongjiang Province, China (grant no. LBH-Q18110).

Data Availability Statement

Publicly available datasets were analyzed in this study. The following websites were all accessed on 30 April 2023. These data can be found here: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes. Xuzhou hyperspectral dataset: https://ieee-dataport.org/documents/xuzhou-hyspex-dataset#files. Wuhan hyperspectral dataset: http://rsidea.whu.edu.cn/resource_sharing.htm. For the XA hyperspectral dataset, please contact the corresponding author for details.

Acknowledgments

The authors would like to thank the peer researchers who made their source codes available to the whole community, as well as the open sources of the benchmark HSI datasets. The authors would like to thank the team led by Zhong Yanfei of Wuhan University, China, who collected the Wuhan hyperspectral dataset, the contributors to the Xuzhou dataset, and the team that collected the hyperspectral dataset from Matiwan Village, XA, Chinese Academy of Sciences.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HSI	hyperspectral image
PCA	principal component analysis
ICA	independent component analysis
CNN	convolutional neural network
FCN	fully convolutional network
GCN	graph convolution network
MS-CNN	multiscale convolutional neural network
Swin	shifted window
LN	layer normalization
W-MSA	window multi-head self-attention
MLP	multilayer perceptron
IP	Indian Pines
SA	Salinas
PU	University of Pavia
UAV	unmanned aerial vehicle
WHU-Hi	Wuhan University hyperspectral image
LK	WHU-Hi-LongKou
HC	WHU-Hi-HanChuan
HU	WHU-Hi-HongHu
XA	Xiongan New Area
SGD	stochastic gradient descent

Appendix A

The source code for our proposed method can be found at https://github.com/MinatoRyu007/Swin-HSI (accessed on 30 April 2023). The code runs in Python 3.8+ with Jupyter Notebook and requires the following Python packages: tensorflow-gpu 2.9.0+, scikit-learn 1.2.2+, numpy 1.23.5+, matplotlib 3.7.1+, scipy 1.9.1+, and spectral 0.23.1+.

References

1. Teke, M.; Deveci, H.S.; Haliloğlu, O.; Gürbüz, S.Z.; Sakarya, U. A short survey of hyperspectral remote sensing applications in agriculture. In Proceedings of the 2013 6th International Conference on Recent Advances in Space Technologies (RAST), Istanbul, Turkey, 12–14 June 2013; pp. 171–176.
2. Shafri, H.Z.; Taherzadeh, E.; Mansor, S.; Ashurov, R. Hyperspectral remote sensing of urban areas: An overview of techniques and applications. Res. J. Appl. Sci. Eng. Technol. 2012, 4, 1557–1565.
3. Bedini, E. The use of hyperspectral remote sensing for mineral exploration: A review. J. Hyperspectral Remote Sens. 2017, 7, 189–211.
4. Lu, G.; Fei, B. Medical hyperspectral imaging: A review. J. Biomed. Opt. 2014, 19, 010901.
5. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
6. Rodarmel, C.; Shan, J. Principal component analysis for hyperspectral image classification. Surv. Land Inf. Sci. 2002, 62, 115–122.
7. Du, H.; Qi, H.; Wang, X.; Ramanath, R.; Snyder, W.E. Band selection using independent component analysis for hyperspectral image processing. In Proceedings of the 32nd Applied Imagery Pattern Recognition Workshop, Washington, DC, USA, 15–17 October 2003; pp. 93–98.
8. Gogineni, R.; Chaturvedi, A. Hyperspectral image classification. Process. Anal. Hyperspectral Data 2019.
9. Kuo, B.C.; Yang, J.M.; Sheu, T.W.; Yang, S.W. Kernel-based KNN and Gaussian classifiers for hyperspectral image classification. In Proceedings of the IGARSS 2008—2008 IEEE International Geoscience and Remote Sensing Symposium, Boston, MA, USA, 7–11 July 2008; pp. II-1006–II-1008.
10. Mercier, G.; Lennon, M. Support vector machines for hyperspectral image classification with spectral-based kernels. In Proceedings of the IGARSS 2003, 2003 IEEE International Geoscience and Remote Sensing Symposium (IEEE Cat. No. 03CH37477), Toulouse, France, 21–25 July 2003; Volume 1, pp. 288–290.
11. Kuo, B.C.; Ho, H.H.; Li, C.H.; Hung, C.C.; Taur, J.S. A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 7, 317–326.
12. Tarabalka, Y.; Fauvel, M.; Chanussot, J.; Benediktsson, J.A. SVM- and MRF-based method for accurate classification of hyperspectral images. IEEE Geosci. Remote Sens. Lett. 2010, 7, 736–740.
13. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709.
14. Lee, H.; Kwon, H. Going deeper with contextual CNN for hyperspectral image classification. IEEE Trans. Image Process. 2017, 26, 4843–4855.
15. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2016, 219, 88–98.
16. Mei, X.; Pan, E.; Ma, Y.; Dai, X.; Huang, J.; Fan, F.; Du, Q.; Zheng, H.; Ma, J. Spectral-spatial attention networks for hyperspectral image classification. Remote Sens. 2019, 11, 963.
17. Zhang, M.; Li, W.; Du, Q. Diverse region-based CNN for hyperspectral image classification. IEEE Trans. Image Process. 2018, 27, 2623–2634.
18. Wan, S.; Gong, C.; Zhong, P.; Du, B.; Zhang, L.; Yang, J. Multiscale dynamic graph convolutional network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3162–3177.
19. Jiao, L.; Liang, M.; Chen, H.; Yang, S.; Liu, H.; Cao, X. Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5585–5599.
20. Gong, Z.; Zhong, P.; Yu, Y.; Hu, W.; Li, S. A CNN with multiscale convolution and diversified metric for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3599–3618.
21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
22. He, X.; Chen, Y.; Lin, Z. Spatial-spectral Transformer for hyperspectral image classification. Remote Sens. 2021, 13, 498.
23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 10012–10022.
24. Zhao, B.; Zhong, Y.; Zhang, L.; Huang, B. The Fisher kernel coding framework for high spatial resolution scene classification. Remote Sens. 2016, 8, 157.
25. Zhao, B.; Zhong, Y.; Xia, G.S.; Zhang, L. Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2015, 54, 2108–2123.
26. Zhu, Q.; Zhong, Y.; Zhao, B.; Xia, G.S.; Zhang, L. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 2016, 13, 747–751.
27. Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF. Remote Sens. Environ. 2020, 250, 112012.
28. Cen, Y.; Zhang, L.; Zhang, X.; Wang, Y.; Qi, W.; Tang, S.; Zhang, P. Aerial hyperspectral remote sensing classification dataset of Xiongan New Area (Matiwan Village). J. Remote Sens. 2020, 24, 1299–1306. Available online: https://www.ygxb.ac.cn/en/article/doi/10.11834/jrs.20209065/ (accessed on 1 May 2023).
29. Audebert, N.; Le Saux, B.; Lefèvre, S. Deep learning for classification of hyperspectral data: A comparative review. IEEE Geosci. Remote Sens. Mag. 2019, 7, 159–173.
30. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619.
31. Liu, B.; Yu, X.; Zhang, P.; Tan, X.; Yu, A.; Xue, Z. A semi-supervised convolutional neural network for hyperspectral image classification. Remote Sens. Lett. 2017, 8, 839–848.
32. Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434.
33. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 3322–3325.
34. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67.
35. Ayas, S.; Tunc-Gormus, E. SpectralSWIN: A spectral-Swin Transformer network for hyperspectral image classification. Int. J. Remote Sens. 2022, 43, 4025–4044.
36. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281.
Figure 1. The proposed model's Swin Transformer structure.
Figure 2. Self-attention computation in shifted window partitioning.
Figure 3. Accuracy and loss curves of the IP, SA, and PU datasets in our model, trained for 100 epochs.
Figure 4. Accuracy and loss curves of the WHU-Hi datasets in our model, trained for 100 epochs.
Figure 5. Accuracy and loss curves of the XA dataset in our model, trained for 100 epochs.
Figure 6. Classification maps of the HU hyperspectral dataset.
Figure 7. Classification maps of the LK hyperspectral dataset.
Figure 8. Classification maps of the HC hyperspectral dataset.
Figure 9. XA dataset ground sample and ground truth information.
Figure 10. Comparison of classification maps between various algorithms and the proposed algorithm on the IP hyperspectral dataset.
Figure 11. Comparison of classification maps between various algorithms and the proposed algorithm on the PU hyperspectral dataset.
Figure 12. Comparison of classification maps between various algorithms and the proposed algorithm on the SA hyperspectral dataset.
Figure 13. Comparison of model validation accuracy on the IP, PU, and SA datasets.
Table 1. Analysis of the effect of parameter settings on the final accuracy of the proposed model (Indian Pines dataset). Each column varies one parameter while the other parameters keep their default values.

Value	Attention Heads	Attention Window	Shifting Window	S-T Block
0	–	–	–	88.61%
1	97.13%	98.21%	97.46%	96.19%
2	97.48%	97.46%	97.37%	97.46%
4	97.63%	98.46%	97.54%	–
8	97.46%	–	–	–
Table 2. IP dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Alfalfa	0.79	0.97	0.87	32
Corn—no-till	0.97	0.95	0.96	1000
Corn—min-till	0.97	0.98	0.98	581
Corn	0.96	0.98	0.97	166
Grass–pasture	0.95	0.96	0.95	338
Grass–trees	0.96	0.98	0.97	511
Grass—pasture-mowed	0.75	0.45	0.56	20
Hay—windrowed	1.00	1.00	1.00	335
Oats	1.00	0.29	0.44	14
Soybean—no-till	0.96	0.95	0.96	680
Soybean—min-till	0.98	0.99	0.99	1719
Soybean—clean	0.94	0.96	0.95	415
Wheat	0.97	0.99	0.98	143
Woods	1.00	1.00	1.00	886
Buildings–Grass–Trees–Drives	1.00	1.00	1.00	270
Stone–Steel–Towers	0.87	0.63	0.73	65
Table 3. SA dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Broccoli_green_weeds_1	1.00	1.00	1.00	1406
Broccoli_green_weeds_2	1.00	1.00	1.00	2608
Fallow	1.00	1.00	1.00	1383
Fallow_rough_plow	1.00	1.00	1.00	976
Fallow_smooth	1.00	1.00	1.00	1875
Stubble	1.00	1.00	1.00	2771
Celery	1.00	1.00	1.00	2505
Grapes_untrained	1.00	1.00	1.00	7890
Soil_vineyard_develop	1.00	1.00	1.00	4342
Corn_senesced_green_weeds	1.00	1.00	1.00	2295
Lettuce_romaine_4wk	1.00	1.00	1.00	748
Lettuce_romaine_5wk	1.00	1.00	1.00	1349
Lettuce_romaine_6wk	1.00	1.00	1.00	641
Lettuce_romaine_7wk	1.00	1.00	1.00	749
Vineyard_untrained	1.00	1.00	1.00	5088
Vineyard_vertical_trellis	1.00	1.00	1.00	1265
Table 4. PU dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Asphalt	1.00	1.00	1.00	4642
Meadows	1.00	1.00	1.00	13,055
Gravel	1.00	1.00	1.00	1469
Trees	1.00	1.00	1.00	2145
Painted metal sheets	1.00	1.00	1.00	942
Bare Soil	1.00	1.00	1.00	3520
Bitumen	1.00	1.00	1.00	931
Self-Blocking Bricks	1.00	1.00	1.00	2577
Shadows	1.00	1.00	1.00	663
Table 5. HC dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Strawberry	1.00	1.00	1.00	31,315
Cowpea	1.00	1.00	1.00	15,927
Soybean	1.00	1.00	1.00	7201
Sorghum	1.00	1.00	1.00	3747
Water spinach	1.00	1.00	1.00	840
Watermelon	1.00	0.97	0.98	3173
Greens	1.00	1.00	1.00	4132
Trees	1.00	1.00	1.00	12,585
Grass	1.00	1.00	1.00	6628
Red roof	1.00	1.00	1.00	7361
Gray roof	1.00	1.00	1.00	11,838
Plastic	1.00	1.00	1.00	2575
Bare soil	0.99	0.99	0.99	6381
Road	1.00	1.00	1.00	12,992
Bright object	1.00	1.00	1.00	795
Water	1.00	1.00	1.00	52,781
Table 6. HU dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Red roof	1.00	1.00	1.00	9829
Road	1.00	0.98	0.99	2458
Bare soil	1.00	1.00	1.00	15,275
Cotton	1.00	1.00	1.00	114,300
Cotton firewood	1.00	1.00	1.00	4353
Rape	1.00	1.00	1.00	31,190
Chinese cabbage	1.00	1.00	1.00	16,872
Pak choi	1.00	1.00	1.00	2838
Cabbage	1.00	1.00	1.00	7573
Tuber mustard	1.00	1.00	1.00	8676
Brassica parachinensis	1.00	1.00	1.00	7711
Brassica chinensis	1.00	1.00	1.00	6268
Small Brassica chinensis	1.00	1.00	1.00	15,755
Lactuca sativa	0.99	1.00	1.00	5149
Celtuce	1.00	0.99	0.99	701
Film covered lettuce	1.00	1.00	1.00	5083
Romaine lettuce	1.00	1.00	1.00	2107
Carrot	0.99	1.00	0.99	2252
White radish	1.00	1.00	1.00	6098
Garlic sprouts	1.00	1.00	1.00	2440
Broad bean	1.00	0.99	1.00	930
Tree	1.00	1.00	1.00	2828
Table 7. LK dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Corn	1.00	1.00	1.00	24,158
Cotton	1.00	1.00	1.00	5862
Sesame	1.00	1.00	1.00	2122
Broadleaf soybean	1.00	1.00	1.00	44,248
Narrow-leaf soybean	1.00	1.00	1.00	2906
Rice	1.00	1.00	1.00	8298
Water	1.00	1.00	1.00	46,939
Roads and houses	0.99	0.99	0.99	4987
Mixed weed	1.00	0.98	0.99	3660
Table 8. XA dataset classification results, including class, precision, recall, and F1 score.

Class	Precision	Recall	F1 Score	Support
Acer negundo Linn	0.34	0.10	0.15	157,953
Willow	0.34	0.41	0.37	126,536
Elm	0.00	0.00	0.00	10,747
Paddy	0.47	0.52	0.49	316,501
Chinese Pagoda Tree	0.39	0.47	0.43	332,914
Fraxinus chinensis	0.37	0.45	0.40	118,539
Koelreuteria paniculata	0.00	0.00	0.00	16,313
Water	0.33	0.00	0.00	115,953
Bare land	0.00	0.00	0.00	26,886
Paddy stubble	0.35	0.15	0.21	135,681
Robinia pseudoacacia	0.00	0.00	0.00	3928
Corn	0.67	0.00	0.00	41,416
Pear	0.62	0.93	0.74	718,559
Soya	0.00	0.00	0.00	5006
Alamo	0.39	0.05	0.10	63,750
Vegetable field	0.00	0.00	0.00	20,404
Sparsewood	0.00	0.00	0.00	1047
Meadow	0.47	0.54	0.50	295,253
Peach	0.00	0.00	0.00	45,860
Building	0.15	0.00	0.00	20,731
Table 9. Classification accuracy comparison between the proposed model and other models on the IP dataset with 30% training data (×100%).

Methods	Average Accuracy	Overall Accuracy	Kappa
ML Models:
  SVM	0.317	0.528	0.427
  Baseline NN	0.654	0.753	0.713
DL Models:
  1D CNN	0.203	0.449	0.324
  3D CNN	0.663	0.775	0.737
  3D FCN	0.545	0.702	0.658
  S-S 3D CNN	0.744	0.779	0.749
ViT Models:
  ViT	0.789	0.718	0.680
  S-S ViT	0.667	0.887	0.864
Ours	0.880	0.972	0.969
Table 10. Comparison of the classification accuracy between the proposed model and other models on the SA dataset with 30% training data (×100%).

Methods	Average Accuracy	Overall Accuracy	Kappa
ML Models:
  SVM	0.317	0.528	0.427
  Baseline NN	0.908	0.917	0.908
DL Models:
  1D CNN	0.809	0.832	0.812
  3D CNN	0.879	0.911	0.901
  3D FCN	0.904	0.939	0.932
  S-S 3D CNN	0.908	0.947	0.941
ViT Models:
  ViT	0.753	0.781	0.709
  S-S ViT	0.931	0.942	0.937
Ours	0.999	0.999	0.999
Table 11. Comparison of the classification accuracy between the proposed model and other models on the PU dataset with 30% training data (×100%).

Methods	Average Accuracy	Overall Accuracy	Kappa
ML Models:
  SVM	0.635	0.834	0.771
  Baseline NN	0.852	0.961	0.948
DL Models:
  1D CNN	0.585	0.781	0.696
  3D CNN	0.857	0.954	0.939
  3D FCN	0.867	0.976	0.968
  S-S 3D CNN	0.876	0.969	0.959
ViT Models:
  ViT	0.769	0.802	0.701
  S-S ViT	0.836	0.927	0.908
Ours	0.999	0.999	0.999
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
