1. Introduction
Underwater acoustic target recognition is an active research topic in underwater acoustic signal processing. The complex underwater environment, background noise, and sound scattering make the task highly challenging. Artificial intelligence has advanced the field: machine learning [1] and deep learning [2,3,4] have both been applied to underwater acoustic target recognition in recent years.
In the past decade, convolutional neural networks (CNNs) [5,6] have been widely used for end-to-end deep learning in many fields, as the inductive biases inherent to CNNs, such as spatial locality and translation equivariance, are believed to be helpful. To capture long-range global information, a recent trend is to add an attention mechanism to the network architecture. Such attention architectures have achieved good performance in audio classification tasks such as audio event classification [7,8], speech command recognition [9], and emotion recognition [10]. Attention-based models have also been successful in the vision domain [11,12,13], so it is reasonable to ask whether they can be equally useful for underwater acoustic target recognition.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences [8,14]. Some attention mechanisms are used in conjunction with other network architectures [15]. The Transformer is a model architecture that relies entirely on an attention mechanism to draw global dependencies between input and output; instead of combining attention with other architectures, it achieves strong performance using self-attention alone.
In this paper, we introduce a Transformer-based recognition model and use experiments to study the impact of different feature extraction methods on the accuracy. The contributions of this paper can be summarized as follows:
We propose a spectrogram Transformer model (STM) for underwater acoustic target recognition, in which underwater audio is specially processed to fit the model. To the best of our knowledge, this is the first work to introduce the Transformer into the underwater acoustic target recognition field.
We compare the performance of different feature extraction methods (short-time Fourier transform [16], Filter Bank [17], Mel Frequency Cepstral Coefficients [18]) for ResNet18 [19] and STM. The experimental results show that FBank features are more suitable for deep learning models; in particular, with AudioSet pre-training, their recognition accuracy is significantly higher than that of the other features.
Combined with pre-training, under the two dataset partitioning methods, STM achieves 97.7% and 89.9% recognition accuracy, which is 13.7% and 50% higher than the CNN model. STM also outperforms the state-of-the-art model CRNN-9 [20] by 3.1% and ResNet18 by 1.8%.
2. Related Work
In this section, we will introduce the Transformer model and deep learning for underwater acoustic target recognition.
2.1. Transformer Model
Drawing on how the human brain copes with information overload, artificial intelligence improves the information-processing ability of neural networks in two ways: (1) additional external memory, which optimizes the memory structure of the network to increase its information storage capacity; and (2) attention, which filters out large amounts of irrelevant information through a top-down selection mechanism.
In order to select information related to a specific task from N input vectors $X = [x_1, \ldots, x_N]$, we introduce a task-related query vector $q$ and calculate the correlation between each input vector and the query vector through a scoring function $s(\cdot,\cdot)$. The attention distribution of the $n$th input vector $x_n$ is:
$$\alpha_n = \operatorname{softmax}\big(s(x_n, q)\big) = \frac{\exp\big(s(x_n, q)\big)}{\sum_{j=1}^{N}\exp\big(s(x_j, q)\big)}$$
Summarizing the input information of all vectors, the attention over $X$ is:
$$\operatorname{att}(X, q) = \sum_{n=1}^{N} \alpha_n x_n$$
The Transformer [21] was proposed by Vaswani et al. in 2017 and is built from self-attention mechanisms. The Transformer adopts a key–value pair attention mechanism, where $(K, V) = [(k_1, v_1), \ldots, (k_N, v_N)]$ represents the input information. Given the task-related query vector $q$, the attention function is:
$$\operatorname{att}\big((K, V), q\big) = \sum_{n=1}^{N} \frac{\exp\big(s(k_n, q)\big)}{\sum_{j=1}^{N}\exp\big(s(k_j, q)\big)}\, v_n$$
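As a concrete illustration, the following PyTorch sketch implements the key–value attention function above, using a scaled dot-product as the scoring function $s(\cdot,\cdot)$; the choice of scoring function and all names and dimensions here are illustrative assumptions, not details taken from the text.

```python
import torch
import torch.nn.functional as F

def key_value_attention(q, K, V):
    """Key-value attention over N input pairs.

    q: (d,)   task-related query vector
    K: (N, d) keys describing the input information
    V: (N, d) values to be summarized
    """
    d = K.size(-1)
    scores = K @ q / d ** 0.5          # s(k_n, q): one score per input vector
    alpha = F.softmax(scores, dim=-1)  # attention distribution over the N inputs
    return alpha @ V                   # weighted sum of the values

# toy usage: 5 input vectors of dimension 8
K, V, q = torch.randn(5, 8), torch.randn(5, 8), torch.randn(8)
out = key_value_attention(q, K, V)     # shape (8,)
```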
Because multi-head attention improves model parallelism, the Transformer has achieved good performance in the field of natural language processing (NLP). Subsequently, the BERT [22] algorithm, used to generate word vectors, achieved significant improvements on 11 NLP tasks.
With the development and application of the Transformer in NLP, it has also been introduced into the field of computer vision. The earliest work is the image classification model ViT [11] proposed by Google, which divides the image into multiple image blocks and uses the standard Transformer encoder structure for classification. In the audio field, Gong et al. proposed the AST model [23], a convolution-free, purely attention-based sound recognition model.
2.2. Deep Learning for Underwater Acoustic Target Recognition
From the current research status of deep learning applied to underwater target recognition, the research content mainly involves three aspects:
The first concerns data scarcity: in the field of underwater recognition, data collection and dataset construction are difficult for military confidentiality and security reasons, so samples are scarce. Therefore, existing samples are used as fully as possible, and data augmentation techniques are applied to produce enough samples to meet the large data requirements of deep learning.
The second is the application of approaches from mainstream deep learning fields (such as computer vision and natural language processing), in which complex deep neural network structures are optimized and designed so that feature extraction relies solely on the network itself. For example, Hu [24] and Sun [25] use the original time-domain signal as the input of the model.
The third concerns data preprocessing before the neural network input. Because the collected data are heavily polluted by environmental noise and limited by the recording format, the audio samples are denoised and spectrally transformed. The purpose of such operations, similar to feature engineering, is to make the samples as clean as possible with salient features, which better suits the subsequent feature extraction of deep neural networks [20,25,26]. Our work also follows this approach.
3. Materials and Methods
This section describes the overall architecture of STM, the feature extraction method, and some implementation details.
3.1. System Overview
For a long time, underwater acoustic target recognition has relied mainly on sonar operators for manual identification and interpretation. Training a proficient sonar operator takes a long time, and operators are easily affected by the external environment and their mental state. With the development of artificial intelligence, researchers have applied machine learning and deep learning to underwater acoustic target recognition. The Transformer model has proved its superiority in Natural Language Processing (NLP), Computer Vision (CV), and speech recognition. The spectrogram of a sound obtained by feature extraction is essentially a single-channel image, so similar spectrogram processing can also be used for underwater acoustic target recognition.
A Transformer consists of an encoder and a decoder. Since STM is used for classification, it uses only the Transformer encoder. As shown in Figure 1, the encoder is composed of a stack of multiple identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization.
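The following minimal PyTorch sketch shows one such encoder layer with residual connections followed by layer normalization; the dimensions (d_model = 768, d_ff = 3072), dropout, and GELU activation are illustrative assumptions in the spirit of ViT, not values stated in this paper.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer, as described for Figure 1 (a sketch)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                # x: (batch, seq_len, d_model)
        # sub-layer 1: multi-head self-attention + residual connection + layer norm
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # sub-layer 2: position-wise feed-forward + residual connection + layer norm
        x = self.norm2(x + self.ff(x))
        return x
```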
Figure 2 shows the architecture of the STM system. The STM model was implemented in three steps:
Step 1. We perform feature extraction on the audio signal to obtain the spectrogram as the input of the model.
Step 2. The spectrogram is split into a sequence of patches, and each patch embedding is summed with a positional embedding.
Step 3. The resulting sequence is fed into the Transformer encoder, and the classification result is obtained from an MLP layer after training.
3.2. Feature Extraction
In the underwater acoustic target recognition task, short-time Fourier transform (STFT), Mel filterbank (FBank), and Mel Frequency cepstral coefficients (MFCC) are commonly used in the audio feature extraction stage.
STFT is a natural extension of the Fourier transform that addresses signal non-stationarity by applying windows for segmented analysis. The STFT can be represented as
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\omega m}$$
where $x(m)$ is the input signal, $w(n)$ is the window function, and $X(n, \omega)$ is a two-dimensional function of time $n$ and frequency $\omega$.
A filter bank (FBank) is a system that divides the input signal $x(n)$ into a set of analysis signals, each of which corresponds to a different region in the spectrum of $x(n)$. It can be represented as
$$\operatorname{FBank}(m) = \sum_{k=0}^{K-1} |X(k)|^2\, H_m(k)$$
where $H_m(k)$ is the $m$th window function (Mel filter) and $X(k)$ is the FFT of the input signal $x(n)$.
MFCC takes human perceptual sensitivity at different frequencies into account by converting the conventional frequency scale to the Mel scale. It can be computed from the FBank features as
$$\operatorname{MFCC}(c) = \sum_{m=1}^{M} \log\big(\operatorname{FBank}(m)\big)\cos\!\Big(\frac{\pi c\,(m - 0.5)}{M}\Big), \quad c = 1, \ldots, C$$
where $M$ is the number of Mel filters and $C$ is the number of MFCCs.
Although STFT retains the most comprehensive time-frequency information, it cannot highlight the spectral characteristics of the signal well. FBank and MFCC, which are designed according to human hearing, can both highlight spectral features, but the DCT (discrete cosine transform) step in MFCC filters out part of the signal information and also increases the amount of computation.
Figure 3 shows the different spectrograms obtained by these three feature extraction methods. To get a better classification recognition accuracy, we used the spectrograms obtained by the three feature extraction methods as the input of the model, and analyzed and compared the result after training.
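As a hedged illustration, the sketch below computes the three feature types with torchaudio. The file name and the number of cepstral coefficients are assumptions, while the FFT size, the 128 Mel filters, the Hanning window, and the 10 ms stride follow the settings listed later in Section 3.4.

```python
import torch
import torchaudio
import torchaudio.transforms as T

# illustrative file name; each sample is a 5 s clip as described in Section 4.1
waveform, sample_rate = torchaudio.load("ship_clip_5s.wav")
hop = int(0.010 * sample_rate)                       # 10 ms stride

stft = T.Spectrogram(n_fft=254, hop_length=hop, window_fn=torch.hann_window)
fbank = T.MelSpectrogram(sample_rate=sample_rate, n_fft=254, hop_length=hop,
                         n_mels=128, window_fn=torch.hann_window)
mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=40,    # 40 coefficients is an assumed value
              melkwargs={"n_fft": 254, "hop_length": hop, "n_mels": 128})

stft_spec = stft(waveform)    # (channel, freq, time) linear spectrogram
fbank_spec = fbank(waveform)  # (channel, n_mels, time) Mel filterbank energies
mfcc_feat = mfcc(waveform)    # (channel, n_mfcc, time) cepstral coefficients
```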
3.3. Patch and Position Embedding
After feature extraction, we obtain a 2D spectrogram, but the input of the Transformer encoder is a 1D token sequence. To handle the 2D spectrogram, we split it into a sequence of N 16 × 16 patches with an overlap of 6 in both the time and frequency dimensions. Then, we flatten each 16 × 16 patch into a 1D patch embedding using a linear projection layer. As the Transformer model cannot capture the order of the input patches, STM adds a trainable positional embedding to each patch so that the model can capture the features of the entire spectrogram. In addition, we add a class token (CLS) at the beginning of the patch sequence. We illustrate this process in Figure 4; the output corresponding to the CLS token represents the classification result of the input spectrogram [11].
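A minimal PyTorch sketch of this patch and position embedding step is given below. Only the 16 × 16 patch size and the overlap of 6 (i.e., a stride of 10) follow the text; the embedding dimension and the patch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a spectrogram into overlapping 16x16 patches and embed them (a sketch)."""
    def __init__(self, embed_dim=768, num_patches=600):
        super().__init__()
        # 16x16 patches with overlap 6 -> stride 10; the conv acts as the linear projection
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=16, stride=10)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # num_patches must match the spectrogram size; 600 follows Section 3.4
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, spec):                       # spec: (B, 1, freq, time)
        x = self.proj(spec)                        # (B, D, F', T')
        x = x.flatten(2).transpose(1, 2)           # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend the CLS token
        return x + self.pos_embed[:, : x.size(1)]  # add trainable positional embeddings
```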
3.4. Model Details
We use PyTorch to build, train, and validate our proposed method. In the feature extraction stage, we use a 254-point FFT for STFT, 128 Mel filters, Hanning windows, and a stride of 10 ms for FBank and MFCC. In the convolutional patch-splitting layer, setting the size of each patch to 16 × 16 and the overlap in the time and frequency dimensions to 6 results in a sequence of 600 one-dimensional patch embeddings. The Transformer encoder receives these 600 inputs and has 12 layers and 12 attention heads. In the final Multilayer Perceptron (MLP) layer, we use the sigmoid function to process the data and obtain the final classification labels.
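The following sketch assembles the encoder and the MLP head with these settings (12 layers, 12 heads, sigmoid output, 5 classes). It consumes the token sequence produced by the PatchEmbed sketch in Section 3.3; the embedding dimension of 768 and the GELU feed-forward activation are assumptions rather than values stated above.

```python
import torch.nn as nn

class STMClassifier(nn.Module):
    """Encoder and classification head of STM (a sketch)."""
    def __init__(self, n_classes=5, embed_dim=768, depth=12, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                  nn.Linear(embed_dim, n_classes),
                                  nn.Sigmoid())    # sigmoid output, as described above

    def forward(self, tokens):                     # tokens: (B, 601, D), CLS token first
        x = self.encoder(tokens)
        return self.head(x[:, 0])                  # classify from the CLS token
```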
3.5. Data Augmentation
The limited size of the dataset is a common problem in underwater acoustic target recognition and may lead to overfitting during model training. To expand the dataset, we augment it with time masking and frequency masking [27]. The two methods are implemented as follows:
Time masking: the mean value is used to mask $t$ consecutive time steps $[t_0, t_0 + t)$ in the time domain, where $t \in [0, T)$, $T$ is the masking parameter, and $\tau$ is the number of time steps of the spectrogram. During execution, $t$ and $t_0 \in [0, \tau - t)$ are randomly selected within their ranges of values.
Frequency masking: the mean value is used to mask $f$ consecutive frequency bins $[f_0, f_0 + f)$ in the frequency domain, where $f \in [0, F)$, $F$ is the masking parameter, and $v$ is the number of frequency bins of the spectrogram. During execution, $f$ and $f_0 \in [0, v - f)$ are randomly selected within their ranges of values.
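A minimal implementation of the two masking operations, using the spectrogram mean as the mask value as described above, could look as follows; the maximum mask widths T and F shown here are illustrative values, not those used in our experiments.

```python
import torch

def time_mask(spec, T=40):
    """Mask t consecutive time frames with the spectrogram mean (spec: (freq, time))."""
    tau = spec.size(1)
    t = torch.randint(0, T + 1, (1,)).item()
    t0 = torch.randint(0, max(tau - t, 1), (1,)).item()
    spec = spec.clone()
    spec[:, t0:t0 + t] = spec.mean()
    return spec

def freq_mask(spec, F=30):
    """Mask f consecutive frequency bins with the spectrogram mean."""
    v = spec.size(0)
    f = torch.randint(0, F + 1, (1,)).item()
    f0 = torch.randint(0, max(v - f, 1), (1,)).item()
    spec = spec.clone()
    spec[f0:f0 + f, :] = spec.mean()
    return spec
```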
3.6. Pretraining
One drawback of the Transformer model is that it requires a large amount of data for training. In image classification, its performance exceeds that of CNNs only when the training data contains more than 14 M images [11]. Considering that spectrograms and images share a certain similarity, and that previous work [23,28,29,30] has confirmed the feasibility of transfer learning from image tasks to audio tasks, we pre-train our model using two existing trained networks: a ViT model trained on the ImageNet dataset and an AST model trained on the AudioSet dataset.
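A simple way to realize this transfer is to copy every pre-trained tensor whose name and shape match into STM and leave the rest randomly initialized. The sketch below illustrates this idea only; the checkpoint path is hypothetical, and the adaptation of positional embeddings between image and spectrogram inputs is not shown.

```python
import torch
import torch.nn as nn

def load_pretrained(model: nn.Module, ckpt_path: str) -> int:
    """Copy matching pre-trained tensors into `model`; mismatched tensors
    (e.g., the patch projection or the classification head) keep their
    random initialization. Returns the number of tensors transferred."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in state and v.shape == state[k].shape}
    state.update(matched)
    model.load_state_dict(state)
    return len(matched)

# hypothetical usage with an ImageNet-trained ViT checkpoint
# n = load_pretrained(stm_model, "vit_imagenet.pth")
```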
4. Experiments and Analysis
In this section, we introduce the dataset and the data processing methods. In addition, we analyze the experimental results.
4.1. Dataset and Data Processing
The dataset used in our study is ShipsEar [31], which was recorded using digitalHyd SR-1 recorders at the port of Vigo, Spain, in the autumn of 2012 and the summer of 2013, with a total of 90 sound recordings covering 11 types of vessel targets. The duration of each recording varies from 15 s to 10 min. According to vessel size, the recordings can be divided into 5 categories (4 vessel classes and 1 background noise class). The division method and the number of files are described in detail in Table 1. Since it was proposed, ShipsEar has been widely used in various studies and has become a benchmark in the field of underwater acoustic target recognition.
In our experiments, we split the sound files into data samples with a time interval of 5 s. After removing some useless sound clips, 2303 data samples were obtained. When training the model, we used the following two dataset division methods:
- (1)
In the first method, each 5 s sound clip was regarded as an independent sample, and all the samples were randomly divided into the training set, validation set, and test set according to the ratio of 7:1.5:1.5. Dataset A was obtained by this data division, and the number of samples in our training set, validation set, and test set were 1614, 346, and 343, respectively.
- (2)
In the second division method, the original sound files were sorted in chronological order. The first 70% was the training set, and the remaining was divided into the validation set and the test set according to the ratio of 1:1. Dataset B was obtained by this data division, and the number of samples in our training set, validation set, and test set were 1611, 346, and 346, respectively.
The training set in the dataset partition was the data sample used for model training and fitting. The validation set was a separate set of samples that can be used to tune the model’s hyperparameters and make an initial assessment of the model’s capabilities. The test set was used to evaluate the generalization ability of the final model. All of the 2303 data samples participated in the experiment.
In particular, it should be pointed out here that Dataset B was more challenging in recognition tasks than Dataset A, and was also more suitable for practical applications. This is because Dataset A does not consider the impact of the time dimension when dividing the data, while Dataset B is divided on a time-series basis, just like the real world, using samples of objects that have appeared in the past to predict the class of upcoming objects.
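For clarity, the two division methods can be sketched as follows, assuming the 5 s clips are stored as a list that is already ordered chronologically (variable names are illustrative).

```python
import random

def split_random(samples, seed=0):
    """Dataset A: shuffle all 5 s clips and split them 70/15/15."""
    samples = samples[:]                          # list of (clip, label) pairs
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

def split_chronological(samples):
    """Dataset B: keep clips in recording order; first 70% for training,
    the remaining 30% split 1:1 into validation and test."""
    n = len(samples)
    n_train = int(0.70 * n)
    n_val = n_train + (n - n_train) // 2
    return samples[:n_train], samples[n_train:n_val], samples[n_val:]
```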
4.2. Experimental Setup
On the basis of the above dataset division, we first use the modified ResNet18 model to experiment with the three feature extraction methods and preliminarily determine the best one; we then further verify the three methods on the STM model to determine the final feature extraction method; finally, we use this feature to conduct a comparative experimental analysis of the CNN, ResNet18, and STM models.
4.3. Implementation Details
The model was trained on a workstation configured with an Nvidia RTX3090 GPU, two Xeon 4210R CPUs, and 128 GB of memory. During training, we set the batch size to 24 and the maximum number of epochs to 100. The cross-entropy loss function and the Adam optimizer were used. Different initial learning rates were used for the case of pre-training with the AudioSet dataset and for all other cases. We saved the model with the highest recognition accuracy and used the test set to verify the final recognition accuracy of this model.
In addition, the specific parameters of the CNN model and ResNet18 model used in the comparative experiments are as follows:
For the CNN model, we used two convolutional layers (first convolutional layer: in_channels (1), out_channels (64), kernel_size (5), stride (1), padding (2), Tanh activation, max pooling (2); second convolutional layer: in_channels (64), out_channels (256), kernel_size (5), stride (1), padding (2), Tanh activation, max pooling (2)), one dropout layer set to 0.1, and two fully connected layers with a softmax activation function.
For the ResNet18 model, we used the basic ResNet18 network structure. Since the spectrogram obtained from the underwater acoustic recordings is single-channel, we only modified the number of input channels to 1; the other structures were not changed.
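A PyTorch sketch of the two baselines described above is given below; the hidden size of the first fully connected layer in the CNN is an assumption, as it is not stated above.

```python
import torch.nn as nn
from torchvision.models import resnet18

# CNN baseline, following the layer description above
cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=5, stride=1, padding=2), nn.Tanh(), nn.MaxPool2d(2),
    nn.Conv2d(64, 256, kernel_size=5, stride=1, padding=2), nn.Tanh(), nn.MaxPool2d(2),
    nn.Dropout(0.1),
    nn.Flatten(),
    nn.LazyLinear(512),          # hidden size 512 is an assumed value
    nn.Linear(512, 5),
    nn.Softmax(dim=1),
)

# ResNet18 baseline with the first convolution changed to accept a single-channel
# spectrogram; all other layers keep the standard structure
res = resnet18(num_classes=5)
res.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
```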
4.4. Experimental Results
This section presents a few numerical results. We comparatively analyze the recognition accuracy of STM and other classification models with different feature extraction methods, pre-training datasets and data augmentation.
4.4.1. Comparison of Feature Extraction Methods and Pre-Training
We compared the recognition accuracy of different feature extraction methods on ResNet18 and our model.
Table 2 shows the recognition accuracy of the three feature extraction methods using the ResNet18 model. The spectrogram obtained by STFT achieves only 76.9% and 60.1% recognition accuracy on Datasets A and B. FBank and MFCC obtain relatively good results, reaching a classification accuracy of 93.8% on Dataset A, while their performance on Dataset B is slightly worse, at 88.2% and 85.5%, respectively.
From the results in Table 2, it can be seen that the FBank feature obtains relatively good results under both data division methods, reaching a recognition accuracy of 88.2% on Dataset B, which is closer to the actual application scenario; this is 28.1% and 2.7% higher than the other two feature extraction methods. Is the FBank feature then also the most suitable feature for our model? To answer this, we compared the three feature extraction methods in our model.
Table 3 shows the recognition accuracy of the three feature extraction methods in our model. The spectrogram obtained by STFT achieves only 66.8% and 51.4% recognition accuracy on Datasets A and B. FBank and MFCC obtain better results, with classification accuracies of 85.7% and 86.8% on Dataset A, while their performance on Dataset B is worse, at 63.0% and 76.0%, respectively.
With the addition of ImageNet pre-training, the recognition accuracy of all three feature extraction methods improves to a certain extent. The improvement of FBank and MFCC is particularly obvious: they reach 94.1% and 81.2%, and 93.8% and 82.3%, on Dataset A and Dataset B, respectively. However, the STFT spectrogram only reaches 83.1% and 61.3%.
Since the AudioSet pre-training model uses FBank features in the initial training, in this experiment, the FBank feature extraction method achieved the best performance of 97.7% on Dataset A and 89.9% on Dataset B, and the performance of the other two features decreased slightly.
Experimental results show that relatively good results can be obtained by using FBank and MFCC features in our model. In the case of using AudioSet pre-training, FBank has significantly higher recognition accuracy. This corresponds to what was mentioned in the feature extraction section above: STFT can retain the most comprehensive time-frequency information, but it cannot highlight the spectral characteristics of the signal well. FBank and MFCC can highlight spectral features based on human hearing design, but the DCT (discrete cosine transform) in the MFCC method filters out part of the signal information.
4.4.2. Comparison of Classification Models
In order to better analyze the results, we compare published results that also use the ShipsEar dataset with our experimental results. The specific comparisons are shown in Table 4. Hong [26] used 3D fusion features and the ResNet18 model and achieved a recognition accuracy of 94.3% on Dataset A. Liu [20] used a 3D Mel spectrogram and a convolutional recurrent neural network to obtain 94.6% and 87.3% recognition accuracy on Dataset A and Dataset B, respectively. Our CNN model obtained recognition accuracies of 84.0% and 39.9% on Dataset A and Dataset B, respectively. The ResNet18 model improved performance to a certain extent, reaching 94.3% and 88.1%, respectively; these results are already on par with CRNN-9 and even exceed its performance on Dataset B. The comparison shows that, when we select the FBank feature with AudioSet pre-training, we obtain 97.7% recognition accuracy on Dataset A, which is the best performance reported so far, and 89.9% on Dataset B, surpassing the ResNet18 model.
Comparing the performance of the various models on Dataset A, we can see that our model trained from scratch exceeds the Baseline by 6.5%, but performs worse than the CNN, ResNet18, and CRNN-9 in [20] and the ResNet in [26]. As mentioned in [11], Transformer performance is inferior to that of CNNs when the training set is insufficient (less than 14 M samples). After pre-training, which introduces prior knowledge, the performance of the model improves qualitatively, finally exceeding the Baseline by 22.3%, the CNN by 13.7%, ResNet18 by 4%, and CRNN-9 by 3.1%. Considering the setting that is closer to the actual application scenario, the experimental results on Dataset B show that, with pre-training, our model still outperforms the ResNet18 model by 1.8%.
4.4.3. Comparison of Data Augmentation
To expand the ShipsEar dataset, we used online data augmentation to perform time and frequency masking on the spectrogram. The masking is random, so the effective expansion factor of the dataset equals the number of training epochs.
Table 5 shows some of our experimental results with online data augmentation. After 100 epochs of training, which is equivalent to expanding the training set by 100 times, the recognition accuracy on Dataset A reaches 98.8%, which is 1.1% higher than without data augmentation. The performance on Dataset B is roughly the same as before (reduced by 0.1%). Although data augmentation increases the diversity of the training samples to a certain extent, the sound signal is a time series and only the existing signal is randomly masked, so the model may not obtain a large performance boost on unseen test data.
4.4.4. Results Analysis
To clearly show the prediction behavior for each type of underwater acoustic target after passing through our proposed model, we calculated the confusion matrices for the two data division methods. The confusion matrices of the final classification results are shown in
Figure 5.
In order to better evaluate the entire model, we use the recognition accuracy, precision, recall, and F1-score as performance evaluation indicators. Based on the confusion matrix, each indicator is calculated as:
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i} n_{ii}, \quad \mathrm{Precision}_i = \frac{n_{ii}}{\sum_{j} n_{ji}}, \quad \mathrm{Recall}_i = \frac{n_{ii}}{\sum_{j} n_{ij}}, \quad \mathrm{F1}_i = \frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$
In the above formulas, $N$ represents the total number of samples, and $n_{ij}$ represents the number of samples of class $i$ that are predicted to be of class $j$.
When averaging each indicator over classes, we use the macro method, i.e., the unweighted mean over the $C$ classes:
$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Precision}_i, \quad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Recall}_i, \quad \mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{F1}_i$$
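The following sketch computes these indicators from a confusion matrix using the macro average; it is an illustration of the formulas above rather than the exact evaluation script.

```python
import numpy as np

def macro_metrics(conf):
    """Accuracy and macro-averaged precision/recall/F1 from a confusion matrix
    whose entry conf[i, j] counts class-i samples predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    accuracy = np.trace(conf) / conf.sum()
    precision = np.diag(conf) / np.clip(conf.sum(axis=0), 1e-12, None)  # per-class, column sums
    recall = np.diag(conf) / np.clip(conf.sum(axis=1), 1e-12, None)     # per-class, row sums
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return accuracy, precision.mean(), recall.mean(), f1.mean()
```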
The detailed experimental results for Dataset A and Dataset B are shown in Table 6 and Table 7. The support column in the tables represents the number of samples of each class in the test set. Combining the two confusion matrix diagrams, we can find that the overall recognition performance for Class A and Class B targets is relatively poor. Since Class A and Class B are each composed of a mixture of multiple vessel types, their extracted features are relatively complex, and their recognition accuracy is lower than that of the other three classes. The confusion matrices reflect the same pattern.
Figure 6 shows the validation accuracy during the 100-epoch training process. We can see that, with or without pre-training, on both Dataset A and Dataset B the model reaches its final recognition accuracy within about 20 epochs; the later training is essentially fine-tuning. On the whole, moving from training from scratch to ImageNet and then AudioSet pre-training, the validation accuracy of the model becomes higher and higher and the model performs better and better, which is consistent with the experimental results we obtained on the validation set.
It is particularly important to note that, due to the limited number of samples in the ShipsEar dataset, we adopted two different dataset division methods in the experiments. If the influence of the time dimension is not considered, there is a certain degree of crossover between the training, validation, and test sets, which affects the results of the entire model. Considering the influence of the time dimension, the partitioning method of Dataset B is obviously closer to practical application, that is, existing data are used to predict the class of upcoming targets. The most ideal division method would be to train and validate on whole sound recordings without slicing them, which is a direction for future research.
5. Conclusions
In this paper, we propose STM, a Transformer-based underwater acoustic target recognition model. It achieves recognition accuracies of 97.7% and 89.9%, better than the performance of the state-of-the-art CRNN-9 model in [20] and of ResNet18. This result shows that our proposed method can provide good technical support for underwater target recognition systems. In addition, we analyze the performance of three feature extraction methods (STFT, FBank, MFCC) with ResNet18 and STM. The experimental results confirm that FBank achieves the highest recognition accuracy on both ResNet18 and STM, and that, with AudioSet pre-training, it achieves the best performance at this stage.
At the same time, there is still much worthwhile research to build on our work, such as using feature fusion to improve the performance of the model according to the complementarity of different features, or using generative adversarial networks or diffusion models to generate simulated data samples to expand the dataset and improve the robustness and generalization ability of the model. These will be the directions of our future work.