1. Introduction
Underwater acoustic target recognition is an active research topic in underwater acoustic signal processing. The complex underwater environment, background noise, and sound scattering make the task highly challenging. Artificial intelligence has advanced the field: machine learning [1] and deep learning [2,3,4] have both been applied to underwater acoustic target recognition in recent years.
In the past decade, convolutional neural networks (CNNs) [5,6] have been widely used for end-to-end deep learning in many fields, as the inductive biases inherent to CNNs, such as spatial locality and translation equivariance, are believed to be helpful. To capture long-range global information, a recent trend is to add an attention mechanism to the network architecture. Such attention architectures have achieved good performance in audio classification tasks such as audio event classification [7,8], speech command recognition [9], and emotion recognition [10]. Attention-based models have also been successful in the vision domain [11,12,13], so it is reasonable to ask whether they can be equally useful for underwater acoustic target recognition.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing dependencies to be modeled without regard to their distance in the input or output sequences [8,14]. Some attention mechanisms are used in conjunction with other network architectures [15]. The Transformer is a model architecture that relies entirely on an attention mechanism to draw global dependencies between input and output; instead of combining attention with other architectures, it achieves strong performance using self-attention alone.
In this paper, we introduce a Transformer-based recognition model and use experiments to study the impact of different feature extraction methods on the accuracy. The contributions of this paper can be summarized as follows:
We propose a spectrogram Transformer model (STM) for underwater acoustic target recognition, in which underwater audio is specially processed to fit the model. To the best of our knowledge, this is the first work to introduce the Transformer into the underwater acoustic target recognition field.
We compare the performance of different feature extraction methods (short-time Fourier transform [16], Filter Bank [17], Mel Frequency Cepstral Coefficients [18]) for ResNet18 [19] and STM. The experimental results show that FBank features are more suitable for deep learning models; in particular, with AudioSet pre-training, their recognition accuracy is significantly higher than that of the other features.
Combined with pre-training, under the two dataset partitioning methods, STM achieves 97.7% and 89.9% recognition accuracy, which is 13.7% and 50% higher than the CNN model. STM also outperforms the state-of-the-art model CRNN-9 [20] by 3.1% and ResNet18 by 1.8%.
2. Related Work
In this section, we will introduce the Transformer model and deep learning for underwater acoustic target recognition.
2.1. Transformer Model
Drawing on how the human brain copes with information overload, artificial intelligence improves the information-processing ability of neural networks in two ways: (1) additional external memory, which optimizes the memory structure of the network to increase its information storage capacity; and (2) attention, which filters out large amounts of irrelevant information through a top-down selection mechanism.
In order to select information related to a specific task from N input vectors $X = [x_1, \ldots, x_N]$, we introduce a task-related query vector $q$ and calculate the correlation between each input vector and the query vector through a scoring function $s(\cdot,\cdot)$. The attention distribution of the $n$th input vector $x_n$ is:
$$\alpha_n = \operatorname{softmax}\big(s(x_n, q)\big) = \frac{\exp\big(s(x_n, q)\big)}{\sum_{j=1}^{N}\exp\big(s(x_j, q)\big)}$$
Summarizing the input information of all vectors, the attention over $X$ is:
$$\operatorname{att}(X, q) = \sum_{n=1}^{N} \alpha_n x_n$$
The Transformer [21] was proposed by Vaswani et al. in 2017 and is built from self-attention mechanisms. The Transformer adopts a key–value pair attention mechanism, where $(K, V) = [(k_1, v_1), \ldots, (k_N, v_N)]$ represents the input information. Given the task-related query vector $q$, the attention function is:
$$\operatorname{att}\big((K, V), q\big) = \sum_{n=1}^{N} \frac{\exp\big(s(k_n, q)\big)}{\sum_{j=1}^{N}\exp\big(s(k_j, q)\big)}\, v_n$$
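As a concrete illustration, the following PyTorch sketch implements the key–value attention function above, using a scaled dot-product as the scoring function $s(\cdot,\cdot)$; the choice of scoring function and all names and dimensions here are illustrative assumptions, not details taken from the text.

```python
import torch
import torch.nn.functional as F

def key_value_attention(q, K, V):
    """Key-value attention over N input pairs.

    q: (d,)   task-related query vector
    K: (N, d) keys describing the input information
    V: (N, d) values to be summarized
    """
    d = K.size(-1)
    scores = K @ q / d ** 0.5          # s(k_n, q): one score per input vector
    alpha = F.softmax(scores, dim=-1)  # attention distribution over the N inputs
    return alpha @ V                   # weighted sum of the values

# toy usage: 5 input vectors of dimension 8
K, V, q = torch.randn(5, 8), torch.randn(5, 8), torch.randn(8)
out = key_value_attention(q, K, V)     # shape (8,)
```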
Because multi-head attention improves model parallelism, the Transformer has achieved good performance in the field of natural language processing (NLP). Subsequently, the BERT [22] algorithm, used to generate word vectors, achieved significant improvements on 11 NLP tasks.
With the development and application of the Transformer in NLP, it has also been introduced into the field of computer vision. The earliest work is the image classification model ViT [11] proposed by Google, which divides the image into multiple image blocks and uses the standard Transformer encoder structure for classification. In the audio field, Gong et al. proposed the AST model [23], a convolution-free, purely attention-based sound recognition model.
2.2. Deep Learning for Underwater Acoustic Target Recognition
From the current research status of deep learning applied to underwater target recognition, the research content mainly involves three aspects:
The first concerns data scarcity: in the field of underwater recognition, data collection and dataset construction are difficult for military confidentiality and security reasons, so samples are scarce. Therefore, existing samples are used as fully as possible, and data augmentation techniques are applied to produce enough samples to meet the large data requirements of deep learning.
The second is the application of approaches from mainstream deep learning fields (such as computer vision and natural language processing), in which complex deep neural network structures are optimized and designed so that feature extraction relies solely on the network itself. For example, Hu [24] and Sun [25] use the original time-domain signal as the input of the model.
The third concerns data preprocessing before the neural network input. Because the collected data are heavily polluted by environmental noise and limited by the recording format, the audio samples are denoised and spectrally transformed. The purpose of such operations, similar to feature engineering, is to make the samples as clean as possible with salient features, which better suits the subsequent feature extraction of deep neural networks [20,25,26]. Our work also follows this approach.
3. Materials and Methods
This section describes the overall architecture of STM, the feature extraction method, and some implementation details.
3.1. System Overview
For a long time, underwater acoustic target recognition has relied mainly on sonar operators for manual identification and interpretation. Training a proficient sonar operator takes a long time, and operators are easily affected by the external environment and their mental state. With the development of artificial intelligence, researchers have applied machine learning and deep learning to underwater acoustic target recognition. The Transformer model has proved its superiority in Natural Language Processing (NLP), Computer Vision (CV), and speech recognition. The spectrogram of a sound obtained by feature extraction is essentially a single-channel image, so similar spectrogram processing can also be used for underwater acoustic target recognition.
A Transformer consists of an encoder and a decoder. Since STM is used for classification, it uses only the Transformer encoder. As shown in Figure 1, the encoder is composed of a stack of multiple identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization.
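The following minimal PyTorch sketch shows one such encoder layer with residual connections followed by layer normalization; the dimensions (d_model = 768, d_ff = 3072), dropout, and GELU activation are illustrative assumptions in the spirit of ViT, not values stated in this paper.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer, as described for Figure 1 (a sketch)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                # x: (batch, seq_len, d_model)
        # sub-layer 1: multi-head self-attention + residual connection + layer norm
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # sub-layer 2: position-wise feed-forward + residual connection + layer norm
        x = self.norm2(x + self.ff(x))
        return x
```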
Figure 2 shows the architecture of the STM system. The STM model was implemented in three steps:
Step 1. We perform feature extraction on the audio signal to obtain the spectrogram as the input of the model.
Step 2. The spectrogram is split into a sequence of patches, and each patch embedding is summed with a positional embedding.
Step 3. The resulting sequence is fed into the Transformer encoder, and the classification result is obtained from an MLP layer after training.
3.2. Feature Extraction
In the underwater acoustic target recognition task, short-time Fourier transform (STFT), Mel filterbank (FBank), and Mel Frequency cepstral coefficients (MFCC) are commonly used in the audio feature extraction stage.
STFT is a natural extension of the Fourier transform that addresses signal non-stationarity by applying windows for segmented analysis. The STFT can be represented as
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\omega m}$$
where $x(m)$ is the input signal, $w(n)$ is the window function, and $X(n, \omega)$ is a two-dimensional function of time $n$ and frequency $\omega$.
A filter bank (FBank) is a system that divides the input signal $x(n)$ into a set of analysis signals, each of which corresponds to a different region in the spectrum of $x(n)$. It can be represented as
$$\operatorname{FBank}(m) = \sum_{k=0}^{K-1} |X(k)|^2\, H_m(k)$$
where $H_m(k)$ is the $m$th window function (Mel filter) and $X(k)$ is the FFT of the input signal $x(n)$.
MFCC takes human perceptual sensitivity at different frequencies into account by converting the conventional frequency scale to the Mel scale. It can be computed from the FBank features as
$$\operatorname{MFCC}(c) = \sum_{m=1}^{M} \log\big(\operatorname{FBank}(m)\big)\cos\!\Big(\frac{\pi c\,(m - 0.5)}{M}\Big), \quad c = 1, \ldots, C$$
where $M$ is the number of Mel filters and $C$ is the number of MFCCs.
Although STFT retains the most comprehensive time-frequency information, it cannot highlight the spectral characteristics of the signal well. FBank and MFCC, which are designed according to human hearing, can both highlight spectral features, but the DCT (discrete cosine transform) step in MFCC filters out part of the signal information and also increases the amount of computation.
Figure 3 shows the different spectrograms obtained by these three feature extraction methods. To get a better classification recognition accuracy, we used the spectrograms obtained by the three feature extraction methods as the input of the model, and analyzed and compared the result after training.
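As a hedged illustration, the sketch below computes the three feature types with torchaudio. The file name and the number of cepstral coefficients are assumptions, while the FFT size, the 128 Mel filters, the Hanning window, and the 10 ms stride follow the settings listed later in Section 3.4.

```python
import torch
import torchaudio
import torchaudio.transforms as T

# illustrative file name; each sample is a 5 s clip as described in Section 4.1
waveform, sample_rate = torchaudio.load("ship_clip_5s.wav")
hop = int(0.010 * sample_rate)                       # 10 ms stride

stft = T.Spectrogram(n_fft=254, hop_length=hop, window_fn=torch.hann_window)
fbank = T.MelSpectrogram(sample_rate=sample_rate, n_fft=254, hop_length=hop,
                         n_mels=128, window_fn=torch.hann_window)
mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=40,    # 40 coefficients is an assumed value
              melkwargs={"n_fft": 254, "hop_length": hop, "n_mels": 128})

stft_spec = stft(waveform)    # (channel, freq, time) linear spectrogram
fbank_spec = fbank(waveform)  # (channel, n_mels, time) Mel filterbank energies
mfcc_feat = mfcc(waveform)    # (channel, n_mfcc, time) cepstral coefficients
```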
3.3. Patch and Position Embedding
After feature extraction, we obtain a 2D spectrogram, but the input of the Transformer encoder is a 1D token sequence. To handle the 2D spectrogram, we split it into a sequence of N 16 × 16 patches with an overlap of 6 in both the time and frequency dimensions. Then, we flatten each 16 × 16 patch into a 1D patch embedding using a linear projection layer. As the Transformer model cannot capture the order of the input patches, STM adds a trainable positional embedding to each patch so that the model can capture the features of the entire spectrogram. In addition, we add a class token (CLS) at the beginning of the patch sequence. We illustrate this process in Figure 4; the output corresponding to the CLS token represents the classification result of the input spectrogram [11].
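A minimal PyTorch sketch of this patch and position embedding step is given below. Only the 16 × 16 patch size and the overlap of 6 (i.e., a stride of 10) follow the text; the embedding dimension and the patch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a spectrogram into overlapping 16x16 patches and embed them (a sketch)."""
    def __init__(self, embed_dim=768, num_patches=600):
        super().__init__()
        # 16x16 patches with overlap 6 -> stride 10; the conv acts as the linear projection
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=16, stride=10)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # num_patches must match the spectrogram size; 600 follows Section 3.4
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, spec):                       # spec: (B, 1, freq, time)
        x = self.proj(spec)                        # (B, D, F', T')
        x = x.flatten(2).transpose(1, 2)           # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)             # prepend the CLS token
        return x + self.pos_embed[:, : x.size(1)]  # add trainable positional embeddings
```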
3.4. Model Details
We use PyTorch to build, train, and validate our proposed method. In the feature extraction stage, we use a 254-point FFT for STFT, 128 Mel filters, Hanning windows, and a stride of 10 ms for FBank and MFCC. In the convolutional patch-splitting layer, setting the size of each patch to 16 × 16 and the overlap in the time and frequency dimensions to 6 results in a sequence of 600 one-dimensional patch embeddings. The Transformer encoder receives these 600 inputs and has 12 layers and 12 attention heads. In the final Multilayer Perceptron (MLP) layer, we use the sigmoid function to process the data and obtain the final classification labels.
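The following sketch assembles the encoder and the MLP head with these settings (12 layers, 12 heads, sigmoid output, 5 classes). It consumes the token sequence produced by the PatchEmbed sketch in Section 3.3; the embedding dimension of 768 and the GELU feed-forward activation are assumptions rather than values stated above.

```python
import torch.nn as nn

class STMClassifier(nn.Module):
    """Encoder and classification head of STM (a sketch)."""
    def __init__(self, n_classes=5, embed_dim=768, depth=12, n_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           dim_feedforward=4 * embed_dim,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                  nn.Linear(embed_dim, n_classes),
                                  nn.Sigmoid())    # sigmoid output, as described above

    def forward(self, tokens):                     # tokens: (B, 601, D), CLS token first
        x = self.encoder(tokens)
        return self.head(x[:, 0])                  # classify from the CLS token
```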
3.5. Data Augmentation
The limited size of the dataset is a common problem in underwater acoustic target recognition and may lead to overfitting during model training. To expand the dataset, we augment it with time masking and frequency masking [27]. The two methods are implemented as follows:
Time masking: the mean value is used to mask $t$ consecutive time steps $[t_0, t_0 + t)$ in the time domain, where $t \in [0, T)$, $T$ is the masking parameter, and $\tau$ is the number of time steps of the spectrogram. During execution, $t$ and $t_0 \in [0, \tau - t)$ are randomly selected within their ranges of values.
Frequency masking: the mean value is used to mask $f$ consecutive frequency bins $[f_0, f_0 + f)$ in the frequency domain, where $f \in [0, F)$, $F$ is the masking parameter, and $v$ is the number of frequency bins of the spectrogram. During execution, $f$ and $f_0 \in [0, v - f)$ are randomly selected within their ranges of values.
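A minimal implementation of the two masking operations, using the spectrogram mean as the mask value as described above, could look as follows; the maximum mask widths T and F shown here are illustrative values, not those used in our experiments.

```python
import torch

def time_mask(spec, T=40):
    """Mask t consecutive time frames with the spectrogram mean (spec: (freq, time))."""
    tau = spec.size(1)
    t = torch.randint(0, T + 1, (1,)).item()
    t0 = torch.randint(0, max(tau - t, 1), (1,)).item()
    spec = spec.clone()
    spec[:, t0:t0 + t] = spec.mean()
    return spec

def freq_mask(spec, F=30):
    """Mask f consecutive frequency bins with the spectrogram mean."""
    v = spec.size(0)
    f = torch.randint(0, F + 1, (1,)).item()
    f0 = torch.randint(0, max(v - f, 1), (1,)).item()
    spec = spec.clone()
    spec[f0:f0 + f, :] = spec.mean()
    return spec
```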
3.6. Pretraining
One drawback of the Transformer model is that it requires a large amount of data for training. In image classification, its performance exceeds that of CNNs only when the training data contains more than 14 M images [11]. Considering that spectrograms and images share a certain similarity, and that previous work [23,28,29,30] has confirmed the feasibility of transfer learning from image tasks to audio tasks, we pre-train our model using two existing trained networks: a ViT model trained on the ImageNet dataset and an AST model trained on the AudioSet dataset.
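A simple way to realize this transfer is to copy every pre-trained tensor whose name and shape match into STM and leave the rest randomly initialized. The sketch below illustrates this idea only; the checkpoint path is hypothetical, and the adaptation of positional embeddings between image and spectrogram inputs is not shown.

```python
import torch
import torch.nn as nn

def load_pretrained(model: nn.Module, ckpt_path: str) -> int:
    """Copy matching pre-trained tensors into `model`; mismatched tensors
    (e.g., the patch projection or the classification head) keep their
    random initialization. Returns the number of tensors transferred."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in state and v.shape == state[k].shape}
    state.update(matched)
    model.load_state_dict(state)
    return len(matched)

# hypothetical usage with an ImageNet-trained ViT checkpoint
# n = load_pretrained(stm_model, "vit_imagenet.pth")
```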
4. Experiments and Analysis
In this section, we introduce the dataset and the data processing methods. In addition, we analyze the experimental results.
4.1. Dataset and Data Processing
The dataset used in our study is ShipsEar [31], which was recorded using digitalHyd SR-1 recorders at the port of Vigo, Spain, in the autumn of 2012 and the summer of 2013, with a total of 90 sound recordings covering 11 types of vessel targets. The duration of each recording varies from 15 s to 10 min. According to vessel size, the recordings can be divided into 5 categories (4 vessel classes and 1 background noise class). The division method and the number of files are described in detail in Table 1. Since it was proposed, ShipsEar has been widely used in various studies and has become a benchmark in the field of underwater acoustic target recognition.
In our experiments, we split the sound files into data samples with a time interval of 5 s. After removing some useless sound clips, 2303 data samples were obtained. When training the model, we used the following two dataset division methods:
- (1)
In the first method, each 5 s sound clip was regarded as an independent sample, and all the samples were randomly divided into the training set, validation set, and test set according to the ratio of 7:1.5:1.5. Dataset A was obtained by this data division, and the number of samples in our training set, validation set, and test set were 1614, 346, and 343, respectively.
- (2)
In the second division method, the original sound files were sorted in chronological order. The first 70% was the training set, and the remaining was divided into the validation set and the test set according to the ratio of 1:1. Dataset B was obtained by this data division, and the number of samples in our training set, validation set, and test set were 1611, 346, and 346, respectively.
The training set in the dataset partition was the data sample used for model training and fitting. The validation set was a separate set of samples that can be used to tune the model’s hyperparameters and make an initial assessment of the model’s capabilities. The test set was used to evaluate the generalization ability of the final model. All of the 2303 data samples participated in the experiment.
In particular, it should be pointed out here that Dataset B was more challenging in recognition tasks than Dataset A, and was also more suitable for practical applications. This is because Dataset A does not consider the impact of the time dimension when dividing the data, while Dataset B is divided on a time-series basis, just like the real world, using samples of objects that have appeared in the past to predict the class of upcoming objects.
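For clarity, the two division methods can be sketched as follows, assuming the 5 s clips are stored as a list that is already ordered chronologically (variable names are illustrative).

```python
import random

def split_random(samples, seed=0):
    """Dataset A: shuffle all 5 s clips and split them 70/15/15."""
    samples = samples[:]                          # list of (clip, label) pairs
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

def split_chronological(samples):
    """Dataset B: keep clips in recording order; first 70% for training,
    the remaining 30% split 1:1 into validation and test."""
    n = len(samples)
    n_train = int(0.70 * n)
    n_val = n_train + (n - n_train) // 2
    return samples[:n_train], samples[n_train:n_val], samples[n_val:]
```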
4.2. Experimental Setup
On the basis of the above dataset division, we first use the modified ResNet18 model to experiment with the three feature extraction methods and preliminarily determine the best one; we then further verify the three methods on the STM model to determine the final feature extraction method; finally, we use this feature to conduct a comparative experimental analysis of the CNN, ResNet18, and STM models.
4.3. Implementation Details
The model was trained on a workstation configured with an Nvidia RTX3090 GPU, two Xeon 4210R CPUs, and 128 GB of memory. During training, we set the batch size to 24 and the maximum number of epochs to 100. The cross-entropy loss function and the Adam optimizer were used. Different initial learning rates were used for the case of pre-training with the AudioSet dataset and for all other cases. We saved the model with the highest recognition accuracy and used the test set to verify the final recognition accuracy of this model.
In addition, the specific parameters of the CNN model and ResNet18 model used in the comparative experiments are as follows:
For the CNN model, we used two convolutional layers (first convolutional layer: in_channels (1), out_channels (64), kernel_size (5), stride (1), padding (2), Tanh activation, max pooling (2); second convolutional layer: in_channels (64), out_channels (256), kernel_size (5), stride (1), padding (2), Tanh activation, max pooling (2)), one dropout layer set to 0.1, and two fully connected layers with a softmax activation function.
For the ResNet18 model, we used the basic ResNet18 network structure. Since the spectrogram obtained from the underwater acoustic recordings is single-channel, we only modified the number of input channels to 1; the other structures were not changed.
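A PyTorch sketch of the two baselines described above is given below; the hidden size of the first fully connected layer in the CNN is an assumption, as it is not stated above.

```python
import torch.nn as nn
from torchvision.models import resnet18

# CNN baseline, following the layer description above
cnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=5, stride=1, padding=2), nn.Tanh(), nn.MaxPool2d(2),
    nn.Conv2d(64, 256, kernel_size=5, stride=1, padding=2), nn.Tanh(), nn.MaxPool2d(2),
    nn.Dropout(0.1),
    nn.Flatten(),
    nn.LazyLinear(512),          # hidden size 512 is an assumed value
    nn.Linear(512, 5),
    nn.Softmax(dim=1),
)

# ResNet18 baseline with the first convolution changed to accept a single-channel
# spectrogram; all other layers keep the standard structure
res = resnet18(num_classes=5)
res.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
```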
4.4. Experimental Results
This section presents a few numerical results. We comparatively analyze the recognition accuracy of STM and other classification models with different feature extraction methods, pre-training datasets and data augmentation.
4.4.1. Comparison of Feature Extraction Methods and Pre-Training
We compared the recognition accuracy of different feature extraction methods on ResNet18 and our model.
Table 2 shows the recognition accuracy of the three feature extraction methods using the ResNet18 model. The spectrogram obtained by STFT achieves only 76.9% and 60.1% recognition accuracy on Datasets A and B. FBank and MFCC obtain relatively good results, reaching a classification accuracy of 93.8% on Dataset A, while their performance on Dataset B is slightly worse, at 88.2% and 85.5%, respectively.
From the results in Table 2, it can be seen that the FBank feature obtains relatively good results under both data division methods, reaching a recognition accuracy of 88.2% on Dataset B, which is closer to the actual application scenario; this is 28.1% and 2.7% higher than the other two feature extraction methods. Is the FBank feature then also the most suitable feature for our model? To answer this, we compared the three feature extraction methods in our model.
Table 3 shows the recognition accuracy of the three feature extraction methods in our model. The spectrogram obtained by STFT achieves only 66.8% and 51.4% recognition accuracy on Datasets A and B. FBank and MFCC obtain better results, with classification accuracies of 85.7% and 86.8% on Dataset A, while their performance on Dataset B is worse, at 63.0% and 76.0%, respectively.
With the addition of ImageNet pre-training, the recognition accuracy of all three feature extraction methods improves to a certain extent. The improvement of FBank and MFCC is particularly obvious: they reach 94.1% and 81.2%, and 93.8% and 82.3%, on Dataset A and Dataset B, respectively. However, the STFT spectrogram only reaches 83.1% and 61.3%.
Since the AudioSet pre-training model uses FBank features in the initial training, in this experiment, the FBank feature extraction method achieved the best performance of 97.7% on Dataset A and 89.9% on Dataset B, and the performance of the other two features decreased slightly.
Experimental results show that relatively good results can be obtained by using FBank and MFCC features in our model. In the case of using AudioSet pre-training, FBank has significantly higher recognition accuracy. This corresponds to what was mentioned in the feature extraction section above: STFT can retain the most comprehensive time-frequency information, but it cannot highlight the spectral characteristics of the signal well. FBank and MFCC can highlight spectral features based on human hearing design, but the DCT (discrete cosine transform) in the MFCC method filters out part of the signal information.
4.4.2. Comparison of Classification Models
In order to better analyze the results, we compare published results that also use the ShipsEar dataset with our experimental results. The specific comparisons are shown in Table 4. Hong [26] used 3D fusion features and the ResNet18 model and achieved a recognition accuracy of 94.3% on Dataset A. Liu [20] used a 3D Mel spectrogram and a convolutional recurrent neural network to obtain 94.6% and 87.3% recognition accuracy on Dataset A and Dataset B, respectively. Our CNN model obtained recognition accuracies of 84.0% and 39.9% on Dataset A and Dataset B, respectively. The ResNet18 model improved performance to a certain extent, reaching 94.3% and 88.1%, respectively; these results are already on par with CRNN-9 and even exceed its performance on Dataset B. The comparison shows that, when we select the FBank feature with AudioSet pre-training, we obtain 97.7% recognition accuracy on Dataset A, which is the best performance reported so far, and 89.9% on Dataset B, surpassing the ResNet18 model.
Comparing the performance of the various models on Dataset A, we can see that our model trained from scratch exceeds the Baseline by 6.5%, but performs worse than the CNN, ResNet18, and CRNN-9 in [20] and the ResNet in [26]. As mentioned in [11], Transformer performance is inferior to that of CNNs when the training set is insufficient (less than 14 M samples). After pre-training, which introduces prior knowledge, the performance of the model improves qualitatively, finally exceeding the Baseline by 22.3%, the CNN by 13.7%, ResNet18 by 4%, and CRNN-9 by 3.1%. Considering the setting that is closer to the actual application scenario, the experimental results on Dataset B show that, with pre-training, our model still outperforms the ResNet18 model by 1.8%.
4.4.3. Comparison of Data Augmentation
To expand the ShipsEar dataset, we used online data augmentation to perform time and frequency masking on the spectrogram. The masking is random, so the effective expansion factor of the dataset equals the number of training epochs.
Table 5 shows some of our experimental results with online data augmentation. After 100 epochs of training, which is equivalent to expanding the training set by 100 times, the recognition accuracy on Dataset A reaches 98.8%, which is 1.1% higher than without data augmentation. The performance on Dataset B is roughly the same as before (reduced by 0.1%). Although data augmentation increases the diversity of the training samples to a certain extent, the sound signal is a time series and only the existing signal is randomly masked, so the model may not obtain a large performance boost on unseen test data.
4.4.4. Results Analysis
To clearly show the prediction behavior for each type of underwater acoustic target after passing through our proposed model, we calculated the confusion matrices for the two data division methods. The confusion matrices of the final classification results are shown in
Figure 5.
In order to better evaluate the entire model, we use the recognition accuracy, precision, recall, and F1-score as performance evaluation indicators. Based on the confusion matrix, each indicator is calculated as:
$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i} n_{ii}, \quad \mathrm{Precision}_i = \frac{n_{ii}}{\sum_{j} n_{ji}}, \quad \mathrm{Recall}_i = \frac{n_{ii}}{\sum_{j} n_{ij}}, \quad \mathrm{F1}_i = \frac{2\,\mathrm{Precision}_i\,\mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$
In the above formulas, $N$ represents the total number of samples, and $n_{ij}$ represents the number of samples of class $i$ that are predicted to be of class $j$.
When averaging each indicator over classes, we use the macro method, i.e., the unweighted mean over the $C$ classes:
$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Precision}_i, \quad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Recall}_i, \quad \mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{F1}_i$$
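The following sketch computes these indicators from a confusion matrix using the macro average; it is an illustration of the formulas above rather than the exact evaluation script.

```python
import numpy as np

def macro_metrics(conf):
    """Accuracy and macro-averaged precision/recall/F1 from a confusion matrix
    whose entry conf[i, j] counts class-i samples predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    accuracy = np.trace(conf) / conf.sum()
    precision = np.diag(conf) / np.clip(conf.sum(axis=0), 1e-12, None)  # per-class, column sums
    recall = np.diag(conf) / np.clip(conf.sum(axis=1), 1e-12, None)     # per-class, row sums
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return accuracy, precision.mean(), recall.mean(), f1.mean()
```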
The detailed experimental results for Dataset A and Dataset B are shown in Table 6 and Table 7. The support column in the tables represents the number of samples of each class in the test set. Combining the two confusion matrix diagrams, we can find that the overall recognition performance for Class A and Class B targets is relatively poor. Since Class A and Class B are each composed of a mixture of multiple vessel types, their extracted features are relatively complex, and their recognition accuracy is lower than that of the other three classes. The confusion matrices reflect the same pattern.
Figure 6 shows the validation accuracy during the 100-epoch training process. We can see that, with or without pre-training, on both Dataset A and Dataset B the model reaches its final recognition accuracy within about 20 epochs; the later training is essentially fine-tuning. On the whole, moving from training from scratch to ImageNet and then AudioSet pre-training, the validation accuracy of the model becomes higher and higher and the model performs better and better, which is consistent with the experimental results we obtained on the validation set.
It is particularly important to note that, due to the limited number of samples in the ShipsEar dataset, we adopted two different dataset division methods in the experiments. If the influence of the time dimension is not considered, there is a certain degree of crossover between the training, validation, and test sets, which affects the results of the entire model. Considering the influence of the time dimension, the partitioning method of Dataset B is obviously closer to practical application, that is, existing data are used to predict the class of upcoming targets. The most ideal division method would be to train and validate on whole sound recordings without slicing them, which is a direction for future research.
5. Conclusions
In this paper, we propose STM, a Transformer-based underwater acoustic target recognition model. It achieves recognition accuracies of 97.7% and 89.9%, better than the performance of the state-of-the-art CRNN-9 model in [20] and of ResNet18. This result shows that our proposed method can provide good technical support for underwater target recognition systems. In addition, we analyze the performance of three feature extraction methods (STFT, FBank, MFCC) with ResNet18 and STM. The experimental results confirm that FBank achieves the highest recognition accuracy on both ResNet18 and STM, and that, with AudioSet pre-training, it achieves the best performance at this stage.
At the same time, there is still much worthwhile research to build on our work, such as using feature fusion to improve the performance of the model according to the complementarity of different features, or using generative adversarial networks or diffusion models to generate simulated data samples to expand the dataset and improve the robustness and generalization ability of the model. These will be the directions of our future work.