Article

An Optimized Class Incremental Learning Network with Dynamic Backbone Based on Sonar Images

1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
2 Ocean Institute of Northwestern Polytechnical University, Taicang 215400, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(9), 1781; https://doi.org/10.3390/jmse11091781
Submission received: 7 July 2023 / Revised: 15 August 2023 / Accepted: 8 September 2023 / Published: 12 September 2023
(This article belongs to the Section Ocean Engineering)

Abstract

Class incremental learning with sonar images introduces a new dimension to underwater target recognition. Directly applying networks designed for optical images to our constructed sonar image dataset (SonarImage20) results in significant catastrophic forgetting. To address this problem, our study carefully selects the Dynamically Expandable Representation (DER)—recognized for its superior performance—as the baseline. We combine the intrinsic properties of sonar images with deep learning theories and optimize both the backbone and the class incremental training strategies of DER. The culmination of this optimization is the introduction of DER-Sonar, a class incremental learning network tailored for sonar images. Evaluations on SonarImage20 underscore the power of DER-Sonar. It outperforms competing class incremental learning networks with an impressive average recognition accuracy of 96.30%, a significant improvement of 7.43% over the baseline.

1. Introduction

In recent years, the application of deep learning techniques to underwater target recognition has gained significant traction. Researchers have primarily focused on two types of images: underwater optical images and sonar images. Underwater optical images suffer from the significant propagation loss of light underwater [1], which makes them suitable only for near-range, high-brightness scenes. In contrast, sonar images leverage the favorable propagation characteristics of sound underwater, making them suitable for a variety of complex underwater environments [2]. Numerous studies have shown that deep learning networks for sonar-image-based underwater target recognition can achieve high recognition accuracy [3]. However, a common limitation of these studies is the use of a one-shot training approach: all class data are input together, so the resulting networks cannot incrementally learn new classes. If only new class data are used to continue training, the network largely forgets the knowledge of the old classes and its recognition rate declines sharply, which is called the catastrophic forgetting problem [4]. Retraining on all data instead not only increases the training time but also raises potential privacy and security concerns for the old class data.
Class incremental learning (CIL) trains networks by using new class data or a mix of new class data and a small portion of old class data [5]. It combines a backbone for categorization with class incremental training strategies to mitigate the decay in recognition accuracy that catastrophic forgetting causes on the old class data. Given the scarcity of sonar images [6], most CIL studies focus on optical images [7]. These studies, built on richer optical image data, use knowledge-distillation-based and dynamic-network-based methods. Knowledge-distillation-based networks use knowledge distillation to keep network parameters from gradually drifting toward the features of the new class data during training; examples include Learning without Forgetting (LwF) [8], Incremental Classifier and Representation Learning (iCaRL) [9], and Pooled outputs distillation Network (PodNet) [10]. These networks have shown promising results on small-scale image datasets such as MNIST and CIFAR100. Dynamic-network-based networks, on the other hand, continuously extend new backbone networks to learn new class data, providing an independent parameter space for each class's features. Networks of this kind, such as Dynamically Expandable Representation (DER) [11], Feature boOSTing and comprEssion for class incRemental learning (FOSTER) [12], and Memory efficient Expandable MOdel (MEMO) [13], have excelled on large-scale datasets such as ImageNet100 and ImageNet1K. However, there is a critical difference between sonar and optical images. While both reflect target contours and spatial distribution characteristics, sonar imaging is affected by underwater environmental noise and reverberation interference, which yields images with low resolution, blurred contours, and low contrast [14]. CIL for sonar images therefore requires a backbone with stronger extraction capabilities for subtle features, as well as class incremental training strategies tailored to the characteristics of sonar images, because directly applying existing class incremental learning networks (CILNs) to sonar images may degrade performance.
To investigate the problem of CIL for sonar images, we validate CILNs proposed for optical images on the sonar image dataset. Subsequently, considering the differences between sonar images and optical images, as well as CIL theories and deep learning theories, we propose a CILN suitable for sonar images. The main contributions of this study are as follows:
  • Several representative CILNs designed for optical images are evaluated on the constructed sonar image dataset (SonarImage20). The results show that all of them suffer from the catastrophic forgetting problem, confirming that networks designed for optical images do not transfer directly to sonar images. The best-performing network, DER, is selected as the baseline for subsequent improvements.
  • An optimized backbone (OptResNet) based on ResNet18 is proposed to replace the backbone of the baseline. This modification improves the network’s feature extraction capabilities and recognition accuracy while significantly reducing the number of parameters.
  • A series of new strategies is introduced to optimize the class incremental training strategies of the baseline, enhancing the stability and robustness of network training. A novel CILN (DER-Sonar) that integrates the optimized backbone with the optimized class incremental training strategies is proposed, effectively addressing the catastrophic forgetting problem on the sonar image dataset.

2. Dataset

Sonar images are scarce, and few public sonar image datasets offer both a rich set of classes and a sufficient number of samples. We therefore surveyed the sonar image literature and found three public datasets: Marine Debris Watertank [15], Marine Debris Turntable [16], and Seabed Objects-KLSG [17]; the first two consist of sonar images acquired by forward-looking sonar devices, and the last consists of sonar images acquired with side-scan sonar devices. We remove duplicate classes and classes with too few samples from the three datasets and merge the rest into a dataset with 20 classes and 5000 samples, which we call SonarImage20. In total, 80% of the dataset is randomly selected for training and 20% for validation. RandomResizedCrop and RandomHorizontalFlip are used to preprocess the sonar images, and the image size is standardized to 112 × 112; SonarImage20 is shown in Figure 1.
For CIL, a specific protocol must be set for learning the different classes of the dataset. Two protocols are common: in the first, learning starts from scratch and all 20 classes are divided into four incremental stages; in the second, 10 classes are trained first and the remaining 10 classes are divided into two incremental stages. In this paper, we adopt the first, more challenging protocol and select 40 images of old classes (1% of the training set, two per class) to combine with the new class images for network training.
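To make the preprocessing and protocol concrete, the sketch below shows one way they could be expressed in code; the torchvision transforms match those named above, while the class ordering and loader details are illustrative assumptions.

```python
# A minimal sketch of the preprocessing and four-stage protocol, assuming
# torchvision; dataset loading and class ordering are illustrative.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(112),     # standardize images to 112 x 112
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

all_classes = list(range(20))                              # SonarImage20 classes
stages = [all_classes[i:i + 5] for i in range(0, 20, 5)]   # four CIL stages of 5 classes
exemplars_per_class = 2   # 40 old-class images in total, about 1% of the training set
```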

3. Methods

In this section, we introduce the concept of CIL and evaluate the performance of different CILNs on SonarImage20 in Section 3.1. Then, we present the composition of the baseline in Section 3.2. Finally, we provide a comprehensive and detailed explanation of the optimized backbone and the optimized class incremental training strategies of DER-Sonar in Section 3.3. We delve into the feasibility of these improvements to address the catastrophic forgetting problem posed by sonar images.

3.1. CIL Concept and Performance Evaluation for SonarImage20

The CILN, in phase $t$, receives a new set of data $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, where $n_t$ is the number of samples in $\mathcal{D}_t$, $x_i^t \in \mathcal{X}_t$ is an input image, and $y_i^t \in \mathcal{Y}_t$ is the corresponding label. We save a part of the old data as the exemplar set $\mathcal{R}_t$, a subset of $\bigcup_{s=1}^{t-1} \mathcal{D}_s$. The CILN is trained on $\tilde{\mathcal{D}}_t = \mathcal{D}_t \cup \mathcal{R}_t$ and needs to perform well in the label space of all visible classes $\mathcal{Y} = \bigcup_{s=1}^{t} \mathcal{Y}_s$ during inference; in particular, $\mathcal{Y}_t \cap \mathcal{Y}_m = \emptyset$ for $t \neq m$.
In this study, we investigate the CIL of SonarImage20 by evaluating several representative CILNs, namely LwF, iCaRL, PodNet, DER, FOSTER, and MEMO. The first three networks are based on knowledge distillation while the latter three networks are based on a dynamic network. Figure 2 illustrates that all CILNs exhibit a significant decline in recognition accuracy at each stage of learning new class data, indicating the occurrence of catastrophic forgetting. Despite the fact that all CILNs suffer from the catastrophic forgetting problem, DER stands out as it achieves the highest classification recognition accuracy across all four stages. This result is similar to the results obtained by some researchers conducting CIL studies on optical images [18], which show that DER and MEMO can achieve the best recognition accuracy performance on most datasets and that their performance is very close to each other, with only minor differences in performance for different protocol settings. Consequently, we select DER as the baseline for further investigation.

3.2. Baseline

In the following, we introduce the main components of DER: the backbone and the class incremental training strategies.

3.2.1. Backbone

DER uses ResNet, proposed by He et al. [19], as the backbone for extracting image features. ResNet addresses the challenge of training deep neural networks by introducing a novel residual framework. DER adopts ResNet18 as the backbone for ImageNet and ResNet34 for CIFAR100. ImageNet images are 224 × 224 and CIFAR100 images are 32 × 32; since the image size of SonarImage20 (112 × 112) is closer to that of ImageNet, we select ResNet18 as the backbone for DER.
The architecture of ResNet18 consists of an input image downsampling layer, a set of basic blocks, and fully connected (FC) layers. The fundamental building blocks, known as basic blocks, play a pivotal role in enabling the network to learn residual mappings. This mechanism effectively mitigates challenges such as gradient vanishing or exploding, as well as the decline in recognition performance with increasing network depth. Detailed illustrations of ResNet18 and the basic block are shown in Figure 3 and Figure 4, respectively.

3.2.2. Class Incremental Training Strategies

When encountering a new dataset containing different classes, DER takes several steps. First, it freezes the parameters of the backbone trained on the previous class data. Next, it extends a new backbone to acquire the feature information specific to the new dataset. Finally, it aggregates the output features of all backbones by using an FC layer. This FC layer has an input dimension equal to the sum of the output dimensions of the old and new backbones, $d_{\text{old}} + d_{\text{new}}$, and an output dimension matching the total number of classes $|\mathcal{Y}|$. The output can be represented as follows:
$$f(x) = W_{\text{new}}^{T} \cdot \left[ \phi_{\text{old}}(x), \phi_{\text{new}}(x) \right] \tag{1}$$
where $\phi_{\text{old}}(x)$ represents the backbone trained on old class data, $\phi_{\text{new}}(x)$ represents the newly extended backbone, $[\cdot, \cdot]$ represents feature aggregation (concatenation), and $W_{\text{new}}^{T} \in \mathbb{R}^{(d_{\text{old}} + d_{\text{new}}) \times |\mathcal{Y}|}$ represents the weights of the FC layer; Figure 5 illustrates the CIL process of DER.
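As a reading aid, the aggregation step can be sketched in PyTorch as follows; the module and dimension names are assumptions for illustration rather than the exact implementation.

```python
# Sketch of DER's feature aggregation: frozen old backbone, new backbone,
# and one FC layer over the concatenated features (Equation (1)).
import torch
import torch.nn as nn

class DERHead(nn.Module):
    def __init__(self, d_old: int, d_new: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(d_old + d_new, num_classes)  # W_new

    def forward(self, feat_old: torch.Tensor, feat_new: torch.Tensor) -> torch.Tensor:
        # [phi_old(x), phi_new(x)] -> logits over all visible classes
        return self.fc(torch.cat([feat_old, feat_new], dim=1))

# Before training a new stage, the old backbone is frozen:
#   for p in old_backbone.parameters(): p.requires_grad = False
```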
During the training process, DER uses three weighted cross-entropy losses: train loss, auxiliary loss, and sparsity loss. Train loss serves as the primary loss for network training, calculated from the output of $\tilde{\mathcal{D}}_t$ through the network. Auxiliary loss maps all old classes to a single label during the loss calculation, facilitating the rapid learning of new class features. Sparsity loss introduces sparsity into the new backbone by masking certain neurons, thereby reducing the number of parameters. In practice, while sparsity loss reduces the number of parameters to a certain extent, it also reduces the recognition accuracy of the network; therefore, we only use train loss and auxiliary loss. Assuming that the input image $x_i^t$ passes through the network to yield a predicted label $y$, with the true label denoted by $y_i^t$ and the predicted probability denoted by $p_f(y = y_i^t \mid x_i^t)$, the loss of DER is formulated as follows:
$$\mathcal{L}_{\text{DER}} = \mathcal{L}_{\text{train}} + \lambda \cdot \mathcal{L}_{\text{aux}} = -\frac{1}{|\tilde{\mathcal{D}}_t|} \sum_{i=1}^{|\tilde{\mathcal{D}}_t|} \log p_f\left(y = y_i^t \mid x_i^t\right) - \frac{\lambda}{|\tilde{\mathcal{D}}_t|} \left( \sum_{i=1}^{|\mathcal{D}_t|} \log p_f\left(y = y_i^t \mid x_i^t\right) + \sum_{i=1}^{|\mathcal{R}_t|} \log p_f\left(y = y_{|\mathcal{Y}_t|+1} \mid x_i^t\right) \right) \tag{2}$$
where $\lambda$ is the weighting coefficient that controls the auxiliary loss; by default, $\lambda = 0$ for $t = 1$ and $\lambda = 1$ for $t > 1$.
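A hedged sketch of how the two retained loss terms combine is given below; the remapping of all old classes to a single auxiliary label happens outside this function and is assumed here.

```python
# Sketch of Equation (2): train loss plus weighted auxiliary loss.
# aux_targets must already collapse every old class into one extra label.
import torch.nn.functional as F

def der_loss(logits, aux_logits, targets, aux_targets, lam: float):
    l_train = F.cross_entropy(logits, targets)        # primary classification loss
    l_aux = F.cross_entropy(aux_logits, aux_targets)  # speeds up learning new classes
    return l_train + lam * l_aux                      # lam = 0 for t = 1, else 1
```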

3.3. Optimized Class Incremental Learning Network (DER-Sonar)

Sonar images and optical images have significant differences in their characteristics, making it impractical to directly apply CILNs designed for optical images. To address this problem, we propose an optimized CILN (DER-Sonar) specifically designed for sonar images. In detail, we optimize both the backbone and the class incremental training strategies of DER, considering the size, resolution, and feature distribution specific to sonar images.

3.3.1. Optimized Backbone

We make the following improvements to the backbone of DER based on the characteristics of sonar images:
(1) Improvements in the convolutional layers. For low-resolution sonar images, using multiple smaller convolutional kernels instead of a single large kernel allows the network to focus on finer image features and effectively reduces the number of parameters. Compared to the commonly used 3 × 3 kernels, 1 × 1 kernels lack the ability to induce local feature biases and are therefore usually used for dimensionality reduction or expansion. On the other hand, 2 × 2 kernels do induce local feature biases and are suitable for even-sized images. However, several studies have shown that even-sized convolutional kernels suffer from feature drift due to the lack of a pixel center point.
To address the problem of feature drift caused by using even-sized convolutional kernels, we introduce a new type of even-sized convolution kernel called Symmetric Padding Convolution (SPConv) [20]. SPConv divides the feature map into four groups and applies padding in different directions, as shown in Figure 6. This approach allows certain channels of the network to utilize information from specific directions, enabling the network to leverage information from different directions and alleviating the problem of feature drift to some extent. Therefore, we adopt SPConv as the replacement for 3 × 3 convolutional kernels.
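A minimal sketch of the symmetric-padding idea, following [20], is shown below: the channels are split into four groups, each padded toward a different corner before a shared 2 × 2 convolution, so that no single direction dominates. The grouping and module structure are simplified assumptions.

```python
# Symmetric padding for a 2x2 convolution: four channel groups, four corners.
# Assumes in_ch is divisible by 4; output keeps the input resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPConv2x2(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=2)
        # F.pad order is (left, right, top, bottom): one corner per group.
        self.pads = [(1, 0, 1, 0), (0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 0, 1)]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        groups = torch.chunk(x, 4, dim=1)               # four channel groups
        padded = [F.pad(g, p) for g, p in zip(groups, self.pads)]
        return self.conv(torch.cat(padded, dim=1))      # (H+1, W+1) -> (H, W)
```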
Before introducing the specific improvement scheme using SPConv to replace 3 × 3 convolutional kernels, we present the concept of a receptive field. A receptive field refers to the size of the mapping region on the input feature map corresponding to a pixel point on the output feature map of a convolutional layer, and the calculation formula for the receptive field is shown below:
$$F_{n-1} = \left(F_n - 1\right) \cdot S + K \tag{3}$$
here, $F_n$ represents the number of pixels in the output feature map, initialized to 1 for the last layer in a series of convolutional layers; $F_{n-1}$ denotes the size of the corresponding mapped region on the input feature map; $S$ represents the stride; and $K$ represents the kernel size. Figure 7 illustrates the schematic diagram of the receptive field.
In ResNet18, the input image downsampling layer uses a 7 × 7 convolutional kernel with a stride of 2 for dimensionality expansion and resolution downsampling. We replace the 7 × 7 kernel with a 3 × 3 kernel with a stride of 2 followed by two 2 × 2 SPConv layers. The calculation process for the receptive field is as follows:
$$\begin{aligned} F_4 &= 1 \\ F_3 &= (F_4 - 1) \cdot 1 + 2 = 2 \\ F_2 &= (F_3 - 1) \cdot 1 + 2 = 3 \\ F_1 &= (F_2 - 1) \cdot 2 + 3 = 7 \end{aligned} \tag{4}$$
As the derivation of Equation (4) shows, after passing through a 3 × 3 convolutional kernel with a stride of 2 and two 2 × 2 SPConv layers, the receptive field corresponding to a single pixel is 7, the same as that of a 7 × 7 convolutional kernel with a stride of 2. Moreover, the theoretical computational complexity of the improved input image downsampling layer is reduced from 7 × 7 = 49 to 3 × 3 + 2 × (2 × 2) = 17.
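The derivation is easy to verify mechanically; the helper below simply applies Equation (3) backward through a stack of (stride, kernel) pairs.

```python
# Receptive-field check for the replacement stem: a 3x3 stride-2 convolution
# followed by two 2x2 SPConv layers, compared against a single 7x7 kernel.
def receptive_field(layers):
    f = 1  # one pixel at the output of the last layer
    for stride, kernel in reversed(layers):
        f = (f - 1) * stride + kernel  # Equation (3)
    return f

print(receptive_field([(2, 3), (1, 2), (1, 2)]))  # -> 7
print(receptive_field([(2, 7)]))                  # -> 7, the original stem
```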
The basic block of ResNet18 consists of two stacked 3 × 3 convolutional layers, where the first 3 × 3 kernels have a stride of 2 and the rest have a stride of 1. We retain the 3 × 3 kernels with a stride of 2 and replace each 3 × 3 kernel with a stride of 1 with two 2 × 2 SPConv layers. According to Equation (3), the receptive field corresponding to a single pixel in the output layer remains unchanged. Additionally, the theoretical computational complexity of the improved basic block is reduced from 2 × 3 × 3 = 18 to 4 × 2 × 2 = 16.
(2) Improvements in the normalization layer and activation function. Inspired by ConvNeXt's improvements to ResNet [21], we reduce the usage of normalization layers and activation functions: normalization is performed only after the first convolutional layer of the input image downsampling layer and of the basic block, while the activation function is applied only before the last convolutional layer of each.
For the normalization layer, we replace BatchNorm with GroupNorm [22]. BatchNorm normalizes more effectively with large batch sizes, but our batch size is set to 32, much smaller than the 256, 512, or 1024 commonly used on ImageNet; GroupNorm normalizes each sample individually, independent of the batch size. Additionally, compared to LayerNorm [23], which also normalizes individual samples, GroupNorm allows flexible customization by dividing each sample's channels into multiple groups for normalization. After extensive experimentation, we settle on 64 channels per group.
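In PyTorch terms, the swap reduces to replacing nn.BatchNorm2d with nn.GroupNorm configured for 64 channels per group, as in the sketch below.

```python
# GroupNorm with 64 channels per group, independent of the batch size of 32.
import torch.nn as nn

def make_norm(channels: int) -> nn.GroupNorm:
    return nn.GroupNorm(num_groups=max(channels // 64, 1), num_channels=channels)

norm = make_norm(128)  # two groups of 64 channels each
```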
For the activation function, ReLU truncates negative inputs to zero, so the corresponding neurons stop receiving gradient updates, a phenomenon known as neuron death. In contrast, the Swish [24] and Mish [25] activation functions preserve negative inputs, effectively avoiding the neuron death caused by ReLU. Furthermore, both Swish and Mish are unbounded above, bounded below, smooth, and nonmonotonic. The curves of ReLU, Swish, and Mish are shown in Figure 8.
Being unbounded above prevents saturation, the lower bound enhances regularization, smoothness facilitates optimization, and nonmonotonicity improves generalization performance. After replacing ReLU with Swish and with Mish in turn, experimental observations indicate that Mish yields better results; we therefore adopt Mish as the activation function. The functional forms of ReLU, Swish, and Mish are shown below:
$$\begin{aligned} \text{ReLU}(x) &= \max(0, x) \\ \text{Swish}(x) &= \frac{x}{1 + e^{-x}} \\ \text{Mish}(x) &= x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right) \end{aligned} \tag{5}$$
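For reference, these functions map directly onto PyTorch, which provides Swish (as SiLU) and Mish as built-in modules in recent versions; an explicit Mish is shown alongside for clarity.

```python
# Equation (5) in code; softplus(x) = ln(1 + e^x).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    return x * torch.tanh(F.softplus(x))

relu, swish, built_in_mish = nn.ReLU(), nn.SiLU(), nn.Mish()  # SiLU == Swish
```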
(3) Attention mechanism module. Although ResNet exploits the inductive bias of convolutions so that the network learns local features from different parts of the feature map, it does not utilize the global feature information of the image. Therefore, we introduce the Squeeze-and-Excitation Network (SENet) attention module [26]. Figure 9 illustrates the process of the SENet. First, the squeeze operation performs average pooling on the feature map of each channel to obtain channel-level global features. Then, the excitation operation performs a scale transformation on the global features by using two 1 × 1 convolutional kernels to adjust the channel dimensionality: after the dimensionality expansion, ReLU is applied for nonlinear activation, and after the dimensionality reduction, Sigmoid constrains the output values between 0 and 1. This produces attention weights for each channel, which are multiplied with the input feature maps to obtain the output feature maps. Analyzing the distribution of the squeezed global feature values for sonar images, we find that a significant number of them are negative. The original SENet uses ReLU, which sets all negative feature values to zero, discarding the global feature information of the corresponding channels. Therefore, we replace ReLU with Mish, which preserves the information carried by negative feature values. Essentially, the SENet lets the model focus on the channels with the highest information content while suppressing unimportant channel features. In this study, to strike a balance between recognition accuracy and computational efficiency, we add the SENet only to the third and fourth stacks of basic blocks.
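A minimal SE block with the ReLU-to-Mish swap might look as follows; the reduce-then-expand ordering follows the original design in [26], and the reduction ratio r = 16 is a common default assumed here.

```python
# Squeeze-and-Excitation block with Mish in place of ReLU, so that negative
# channel-level global features are not discarded.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),     # 1x1 conv: adjust dimensionality
            nn.Mish(),                                 # keeps negative feature values
            nn.Conv2d(channels // r, channels, 1),     # 1x1 conv: restore dimensionality
            nn.Sigmoid(),                              # attention weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))               # channel-wise reweighting
```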
(4) Improvements in the overall structure. ResNet18 uses two main components for feature extraction: an input image downsampling layer and four sets of pairwise stacked basic blocks ([2, 2, 2, 2]). Given that the size of the images in SonarImage20 (112 × 112) is half that of the images in ImageNet (224 × 224), we remove the max pooling operation used in the input image downsampling layer of ResNet18. This ensures that sonar images maintain the same resolution as ImageNet images during subsequent feature extraction. While keeping the overall number of basic blocks unchanged, we modify the four sets of pairwise stacked basic blocks ([2, 2, 2, 2]) into four sets of unevenly stacked basic blocks ([1, 2, 4, 1]). This change increases the number of mid- to high-dimensional basic blocks and enhances the deep feature extraction capability of the network.
Incorporating the improvements mentioned above, we propose a refined variant of ResNet18 called OptResNet. The overall architecture of OptResNet is shown in Figure 10, while the optimized basic block structure of OptResNet is shown in Figure 11.
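A structural sketch of the resulting layout is shown below: a stem without max pooling and the [1, 2, 4, 1] stage configuration. The conv_block placeholder stands in for the full optimized basic block of Figure 11 (SPConv, GroupNorm, Mish, SE) and is a deliberate simplification.

```python
# Stage layout of OptResNet: no max pooling after the stem, [1, 2, 4, 1] blocks.
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    # placeholder for the optimized basic block of Figure 11
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.GroupNorm(max(out_ch // 64, 1), out_ch),
        nn.Mish(),
    )

def make_stage(in_ch: int, out_ch: int, n_blocks: int, stride: int) -> nn.Sequential:
    blocks = [conv_block(in_ch, out_ch, stride)]
    blocks += [conv_block(out_ch, out_ch, 1) for _ in range(n_blocks - 1)]
    return nn.Sequential(*blocks)

stem = conv_block(3, 64, stride=2)        # downsampling stem, no max pooling
stages = nn.Sequential(
    make_stage(64, 64, 1, 1),             # [1, 2, 4, 1] instead of [2, 2, 2, 2]
    make_stage(64, 128, 2, 2),
    make_stage(128, 256, 4, 2),           # SENet attached here in the full model
    make_stage(256, 512, 1, 2),           # and here
)
```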

3.3.2. Optimized Class Incremental Training Strategies

Based on the characteristics of the feature distribution in sonar images, the latest theories in deep learning, and the shortcomings of the original training strategies of DER, we make the following improvements to the original class incremental training strategies of DER:
(1) Improvements in the scheduler. The original scheduler used in DER is the MultiStepLR scheduler. It starts with a constant initial learning rate at the beginning of network training and sets several milestones. When the epoch reaches a milestone, the learning rate is multiplied by a predetermined factor (usually ranging from 0.01 to 0.5) to gradually decrease the learning rate. The MultiStepLR scheduler can lead to the following problems: (1) an excessively large initial learning rate, resulting in unstable parameters in the early phase of the network, and (2) a steep slope of the learning rate reduction, causing significant fluctuations in the gradients.
At the beginning of network training, the parameters are unstable and the gradients for network updates are large. Setting the initial learning rate too high can lead to unstable parameter updates, while setting it too low can easily trap the gradients in local optima. To address this problem, WarmUp sets a very small learning rate at the beginning of training and increases it linearly with each epoch until the desired learning rate is reached. WarmUp ensures that the parameter distribution of the network remains stable during the initial training phase, which facilitates stable updates of the network parameters in subsequent stages.
During training, the loss approaches the global minimum as the number of epochs increases. At this point, a smaller learning rate is needed to bring the network parameters closer to this optimal point. The CosineAnneal scheduler is a common learning rate adjustment strategy that reduces the learning rate by using a cosine function [27]. The learning rate reduction process involves a slow decrease, followed by an accelerated decrease, and finally a slow descent to the preset minimum learning rate. This strategy ensures the stability of the gradient descent in the network.
The WarmUpCosineAnneal scheduler combines the learning rate adjustment strategies of WarmUp and the CosineAnneal scheduler. It uses WarmUp to gradually increase the learning rate during the initial training phase, ensuring the stability of the network parameters in the early phases. During the training process, it uses the CosineAnneal scheduler to gradually decrease the learning rate and guide the network parameters towards the global optimum. The schematic representation of the MultiStepLR scheduler and the WarmUpCosineAnneal scheduler for learning rate adjustment is shown in Figure 12a and Figure 12b, respectively. In this study, we replace the original MultiStepLR scheduler in DER with the WarmUpCosineAnneal scheduler.
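One common realization of such a schedule is a LambdaLR that ramps the rate up linearly and then decays it along a cosine curve, as sketched below; the warmup length is an illustrative assumption.

```python
# WarmUpCosineAnneal as a LambdaLR: linear warmup, then cosine decay to zero.
import math
import torch

def warmup_cosine(optimizer, warmup_epochs: int, total_epochs: int):
    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                     # linear ramp-up
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```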
(2) Dynamic loss. As shown in Equation (2), the weighting coefficient $\lambda$ of the auxiliary loss is a constant that remains unchanged across CIL stages. During class incremental learning, the auxiliary loss accelerates the feature learning of the newly added backbone on new class data. However, as the number of learned old classes grows, keeping the weighting coefficients of the train loss and auxiliary loss fixed causes the parameters of the network's FC layers to overly favor the features of the new classes, decreasing the recognition accuracy for the old classes. To address this problem, we propose a dynamic loss to replace the original loss. The dynamic loss adjusts the weighting coefficients of the train loss and the auxiliary loss across the CIL stages as new classes are added, suppressing the drift of the FC-layer parameters toward the new class data. The proposed dynamic loss is formulated in Equation (6):
$$\mathcal{L}_{\text{dynamic}} = 2 \left( \frac{N_{\text{stage}}}{N_{\text{stage}} + 1} \cdot \mathcal{L}_{\text{train}} + \frac{1}{N_{\text{stage}} + 1} \cdot \mathcal{L}_{\text{aux}} \right) \tag{6}$$
where $N_{\text{stage}}$ is the number of CIL stages. The initial multiplication by 2 ensures that the weighting coefficients of the train loss and the auxiliary loss sum to 2, thus keeping the overall magnitude of the dynamic loss consistent with the original loss formulation.
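In code, the dynamic loss reduces to two stage-dependent coefficients, as in the short sketch below.

```python
# Equation (6): stage-dependent weighting of train and auxiliary losses.
def dynamic_loss(l_train, l_aux, n_stage: int):
    w_train = 2 * n_stage / (n_stage + 1)      # grows as more stages are learned
    w_aux = 2 / (n_stage + 1)                  # shrinks correspondingly
    return w_train * l_train + w_aux * l_aux   # the two weights always sum to 2
```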
(3) Data augmentation. Because acquiring sonar images in underwater environments is difficult and costly, multiple samples of the same target are commonly collected from different angles within the same scene. As a result, the samples of a given target are highly similar, and using them for network training carries a considerable risk of overfitting. Data augmentation counteracts the scarcity or excessive similarity of training data by modifying the feature distribution of the original data, thereby reducing the risk of overfitting. The primary augmentation methods divide into geometric transformations and pixel transformations: geometric transformations include flipping, rotating, cropping, and scaling, while pixel transformations include adjusting brightness and saturation and adding Gaussian noise. Choosing appropriate data augmentation techniques can effectively improve training performance on different datasets; however, it introduces a substantial and complex hyperparameter-tuning burden.
Auto data augmentation, which adaptively selects augmentation techniques based on the spatial distribution of the data, reduces this workload and can significantly improve network performance. The most commonly used methods are AutoAugment [28], RandAugment [29], and TrivialAugment [30]. AutoAugment uses a reinforcement learning search to discover an augmentation policy on a subset of the dataset and transfers it to the target dataset; RandAugment uses two hyperparameters, $N$ and $M$, to randomly determine the number and intensity, respectively, of augmentation operations applied to each sample; TrivialAugment randomly selects a single augmentation operation and its intensity for each sample. We apply these three methods to SonarImage20, and TrivialAugment achieves the highest improvement in recognition accuracy. Therefore, we select TrivialAugment as our data augmentation method.
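TrivialAugment is available in torchvision (version 0.12 and later) as TrivialAugmentWide and slots directly into the preprocessing pipeline, as sketched below; it is applied before tensor conversion because it operates on PIL images.

```python
# TrivialAugment in the training pipeline: one random operation and strength
# per image, with no augmentation hyperparameters to tune.
import torchvision.transforms as T

train_transform = T.Compose([
    T.TrivialAugmentWide(),
    T.RandomResizedCrop(112),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```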
(4) Improvements in the classifier. In general, convolutional neural networks (CNNs) flatten the feature maps of the last layer into a one-dimensional vector, which is passed through FC layers to obtain output values for each class; the label with the highest output value is the predicted result. This process amounts to a linear classifier that multiplies the feature vector by the weights of the FC layers $\omega_t$:
$$y^* = \arg\max \left( \omega_t \cdot \varphi(x) \right) \tag{7}$$
where $y^*$ represents the predicted result and $\varphi(x)$ represents the feature vector obtained by the CNN.
However, the weights of the FC layers $\omega_t$ change as new class features are added and gradually shift toward those new features. This biases the prediction results toward the new class data, causing the old class data to be forgotten. The NCM (Nearest Class Mean) classifier does not rely on the probability values output by the FC layers [31]. Instead, it computes the distance between the feature vector of the input image and the mean feature vectors of the exemplars of each class, and the label of the closest class mean is chosen as the prediction. This mitigates the feature bias introduced by the FC layers and reduces the tendency of predictions to favor the new classes. The procedure is described in Algorithm 1.
Algorithm 1 Nearest Class Mean Classifier
Input: $x$ // input image
Require: $P = (P_1, P_2, \dots, P_t)$ // exemplar set
Require: $\varphi$ // weights of the network
for $i = 1, 2, \dots, t$ do
    $\mu_i \leftarrow \frac{1}{|P_i|} \sum_{p \in P_i} \varphi(p)$ // mean of exemplars
end for
$y^* = \arg\min_{i=1,2,\dots,t} \| \varphi(x) - \mu_i \|$ // nearest prototype
Output: $y^*$ // prediction label
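A tensor-level sketch of Algorithm 1 follows; it assumes the exemplar features have already been extracted with the current network and uses the Euclidean norm as the distance.

```python
# Nearest Class Mean prediction over precomputed exemplar features.
import torch

def ncm_predict(feat: torch.Tensor, exemplar_feats: list) -> int:
    # exemplar_feats[i] is an (n_i, d) tensor of class i's exemplar features
    means = torch.stack([f.mean(dim=0) for f in exemplar_feats])  # class prototypes
    dists = torch.norm(feat.unsqueeze(0) - means, dim=1)          # distance to each mean
    return int(torch.argmin(dists))                               # label of nearest prototype
```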

4. Experiments

In the following, we introduce the implementation details in Section 4.1. Then, we present the comparative experimental results obtained by using DER-Sonar and other representative CILNs on SonarImage20 in Section 4.2. Finally, we perform ablation experiments to evaluate the individual impact of each improvement integrated into DER-Sonar in Section 4.3.

4.1. Implementation Details

We conduct extensive experiments on a workstation with two RTX 3090 GPUs running Linux to verify the effectiveness of DER-Sonar on SonarImage20; all reported results are averaged over five runs.
DER-Sonar and the compared CILNs are implemented with PyTorch [32] and PyCIL [33]. Since SonarImage20 is much smaller than ImageNet, the batch size is reduced from the usual 128 to 32. For DER-Sonar, the optimizer is SGD with a momentum of 0.9 and a weight decay of 2 × 10−4, the scheduler is the WarmUpCosineAnneal scheduler described in Section 3.3.2, and the number of epochs is set to 200. For all compared CILNs, we use ResNet18 as the backbone, set up the optimizer and scheduler according to the original papers, and tune the learning rate and epochs to the best of our ability on SonarImage20.
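For concreteness, the stated optimizer settings correspond to the sketch below; the model and the initial learning rate are placeholders, since the latter is tuned per experiment.

```python
# SGD with momentum 0.9 and weight decay 2e-4, batch size 32, 200 epochs.
import torch
import torch.nn as nn

model = nn.Linear(512, 20)  # placeholder standing in for DER-Sonar
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr value is an assumption
                            momentum=0.9, weight_decay=2e-4)
epochs, batch_size = 200, 32
```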

4.2. Comparative Experiment

We conduct comparative experiments on SonarImage20 to evaluate DER-Sonar against other representative CILNs, with DER serving as the baseline. The compared CILNs use a linear classifier (LC), while DER-Sonar uses the NCM classifier, whose predictions, as mentioned in Section 3.3.2, are directly influenced by the quality of the exemplar set; to assess DER-Sonar more completely, we therefore also run DER-Sonar with LC. The recognition accuracy of all CILNs at each stage is depicted in Figure 13.
Here, the upper bound represents the theoretical limit of CILNs: it uses ResNet18 trained on all class data at each stage and initializes the network parameters of each new stage with those trained in the previous stage. In the first stage, none of the networks performs CIL. We perform extensive hyperparameter tuning to ensure a recognition accuracy of 100% for all CILNs in the first stage, guaranteeing optimal initial backbone parameters and fairness in the subsequent stages. In the second and third stages, DER-Sonar with LC demonstrates the best performance, while in the fourth stage, DER-Sonar with the NCM classifier performs best. Overall, DER-Sonar achieves the highest accuracy at each stage.
We summarize the average recognition accuracy over the four stages and the training time for all networks in Table 1. DER-Sonar achieves a significantly higher average recognition accuracy than all other networks, an improvement of 7.43% over the baseline. Compared to the upper bound, our training time is only 78.98% of its duration while the recognition accuracy is only 3.3% lower. Moreover, our training process uses only the data of the newly added classes plus a minimal number of exemplars from the previous classes. Although our training time is higher than that of the other CILNs, this is because functions used in our code, such as SPConv, SENet, and the dynamic loss, are not yet optimized in PyTorch; in fact, our improvements significantly reduce the number of network parameters, as detailed in Section 4.3.1. The training time depends on several factors, including the number of network parameters, the optimization of the function library, and the operating platform, and we expect a significant reduction once the functions we use are optimized in future PyTorch updates.

4.3. Ablation Experiment

In this section, we conduct a series of ablation experiments on DER-Sonar to investigate the impact of each improvement in the backbone (OptResNet) and the class incremental training strategy on the recognition accuracy, model parameter size, and training time.

4.3.1. Backbone

Every step of improvement in OptResNet is built upon the previous step’s improvements. Figure 14 illustrates the recognition accuracy achieved at each stage of improvement, demonstrating consistent improvements across all four stages.
Additionally, we investigate the impact of each improvement on the parameter count and training time, as shown in Table 2. Our improvements in the convolutional layers and overall architecture effectively reduce the parameter count while significantly improving the average recognition accuracy. Similarly, our improvements in the normalization layers, activation functions, and the introduction of attention mechanisms ensure an improved average recognition accuracy while maintaining a relatively low or unchanged parameter count. Compared to the baseline, OptResNet achieves a 5.05% increase in the average recognition accuracy while significantly reducing the parameter count by 30%. It is worth noting that although the training time has nearly doubled, this is not a result of our model improvements, but rather a consequence of the lack of optimization of the functions used in our code, as discussed in Section 4.2.

4.3.2. Class Incremental Training Strategies

Based on the improvements in the backbone, we conduct ablation experiments for all the improvements made in the class incremental training strategies. Since the recognition accuracies of the backbone improvements are obtained by using the WarmUpCosineAnneal scheduler, it is not annotated separately in the following figure and table. As shown in Figure 15, each of our improvements to the class incremental training strategies increases the recognition accuracy of the network. Among them, the NCM classifier yields a lower recognition accuracy than LC in stages 2 and 3; however, it achieves the highest recognition accuracy in stage 4, demonstrating the best recognition performance over all classes of SonarImage20.
Since the improvements in the class incremental training strategies do not change the number of network parameters, we only examine the impact of each improvement on the training time. As shown in Table 3, the improvements in the class incremental training strategies based on the OptResNet result in an increase in training time of 301 s, with an average recognition accuracy improvement of 2.38%. Among them, the dynamic loss has the highest time overhead. However, the calculation of the dynamic loss only involves integer operations and does not involve complex floating-point calculations, suggesting that the additional time overhead is due to optimization problems.

5. Discussion

In our study, we construct a sonar image dataset named SonarImage20 and use it to evaluate different CILNs proposed for optical images. The results show that all networks suffer from severe catastrophic forgetting. Among them, we select the best-performing network, DER, as the baseline. Considering the differences between sonar images and optical images as well as deep learning theories, we optimize its backbone and class incremental training strategies, thus proposing a novel CILN named DER-Sonar.
For the backbone of DER-Sonar, we propose a neural network called OptResNet, built upon ResNet18. We replace odd-sized convolutional kernels with SPConv; incorporate attention mechanisms into the basic blocks; and optimize the normalization layers, activation functions, and overall architecture. Compared to ResNet18, OptResNet achieves an average recognition accuracy improvement of 5.05% while reducing the network parameters by 30%. For the class incremental training strategies of DER-Sonar, we optimize them after comprehensively considering various factors: we propose a dynamic loss as the loss function, use the NCM classifier instead of the linear classifier, adopt the WarmUpCosineAnneal scheduler for network training, and enrich the training data with an auto data augmentation method. On top of the backbone improvements, the average recognition accuracy is further increased by 2.38%.
In comparison to the existing CILNs, DER-Sonar demonstrates an absolute superiority in terms of recognition accuracy, effectively addressing the catastrophic forgetting problems. Although it introduces an additional training time overhead, we expect a significant reduction in training time with further optimization of the functions used in subsequent development environments.

6. Conclusions

In this paper, we introduce the novel DER-Sonar. This innovative CILN not only addresses the CIL challenges in sonar images but also significantly outperforms its counterparts. While DER-Sonar has demonstrated its effectiveness, several avenues remain open for exploration and improvement:
  • Optimization for training time: although DER-Sonar introduces an additional training time overhead, we anticipate a significant reduction in this aspect with further refinement of the methods used, especially in evolving development environments.
  • Dataset expansion: SonarImage20 provides a robust foundation, but expanding it with more diverse underwater scenarios can further test and refine the capabilities of DER-Sonar.
  • Investigate transfer learning: investigating how DER-Sonar can benefit from transfer learning, especially from models trained on large optical image datasets, could be a promising direction.
  • Address real-time challenges: since underwater target recognition often requires real-time processing, future iterations of DER-Sonar could focus on optimizing for real-time deployment in underwater vehicles.
In conclusion, while our research has made significant strides in the field of sonar-image-based class incremental learning, the road ahead is filled with opportunities for innovation and refinement. We remain committed to pushing the boundaries of what is possible in this field.

Author Contributions

Conceptualization, X.C.; methodology, X.C.; software, X.C.; validation, X.C. and H.L.; formal analysis, X.C.; investigation, X.C. and H.L.; data curation, X.C.; writing—original draft preparation, X.C.; writing—review and editing, H.L.; supervision, H.L.; project administration, H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 61971354.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yu, X.; Xing, X.; Zheng, H.; Fu, X.; Huang, Y.; Ding, X. Man-Made Object Recognition from Underwater Optical Images Using Deep Learning and Transfer Learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1852–1856. [Google Scholar] [CrossRef]
  2. Neupane, D.; Seok, J. A Review on Deep Learning-Based Approaches for Automatic Sonar Target Recognition. Electronics 2020, 9, 1972. [Google Scholar] [CrossRef]
  3. Chen, R.; Chen, Y. Improved Convolutional Neural Network YOLOv5 for Underwater Target Detection Based on Autonomous Underwater Helicopter. J. Mar. Sci. Eng. 2023, 11, 989. [Google Scholar] [CrossRef]
  4. Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.D.; van de Weijer, J. Class-Incremental Learning: Survey and Performance Evaluation on Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5513–5533. [Google Scholar] [CrossRef] [PubMed]
  5. Belouadah, E.; Popescu, A.; Kanellos, I. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Netw. 2021, 135, 38–54. [Google Scholar] [CrossRef] [PubMed]
  6. Xu, H.; Yang, L.; Long, X. Underwater Sonar Image Classification with Small Samples Based on Parameter-based Transfer Learning and Deep Learning. In Proceedings of the 2022 Global Conference on Robotics, Artificial Intelligence and Information Technology (GCRAIT), Chicago, IL, USA, 30–31 July 2022; pp. 304–307. [Google Scholar] [CrossRef]
  7. Irfan, M.; Jiangbin, Z.; Iqbal, M.; Masood, Z.; Arif, M.H.; ul Hassan, S.R. Brain inspired lifelong learning model based on neural based learning classifier system for underwater data classification. Expert Syst. Appl. 2021, 186, 115798. [Google Scholar] [CrossRef]
  8. Li, Z.; Hoiem, D. Learning without Forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2935–2947. [Google Scholar] [CrossRef] [PubMed]
  9. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  10. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–102. [Google Scholar]
  11. Yan, S.; Xie, J.; He, X. DER: Dynamically Expandable Representation for Class Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3014–3023. [Google Scholar]
  12. Wang, F.Y.; Zhou, D.W.; Ye, H.J.; Zhan, D.C. Foster: Feature boosting and compression for class-incremental learning. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 398–414. [Google Scholar]
  13. Zhou, D.W.; Wang, Q.W.; Ye, H.J.; Zhan, D.C. A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  14. Kim, J.; Song, S.; Yu, S.C. Denoising auto-encoder based image enhancement for high resolution sonar image. In Proceedings of the 2017 IEEE Underwater Technology (UT), Busan, Republic of Korea, 21–24 February 2017; pp. 1–5. [Google Scholar] [CrossRef]
  15. Singh, D.; Valdenegro-Toro, M. The Marine Debris Dataset for Forward-Looking Sonar Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, QC, Canada, 10–17 October 2021; pp. 3741–3749. [Google Scholar]
  16. Preciado-Grijalva, A.; Wehbe, B.; Firvida, M.B.; Valdenegro-Toro, M. Self-Supervised Learning for Sonar Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 1499–1508. [Google Scholar]
  17. Du, X.; Sun, Y.; Song, Y.; Sun, H.; Yang, L. A Comparative Study of Different CNN Models and Transfer Learning Effect for Underwater Object Classification in Side-Scan Sonar Images. Remote Sens. 2023, 15, 593. [Google Scholar] [CrossRef]
  18. Zhou, D.W.; Wang, Q.W.; Qi, Z.H.; Ye, H.J.; Zhan, D.C.; Liu, Z. Deep class-incremental learning: A survey. arXiv 2023, arXiv:2302.03648. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  20. Wu, S.; Wang, G.; Tang, P.; Chen, F.; Shi, L. Convolution with even-sized kernels and symmetric padding. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2019; Volume 32. [Google Scholar]
  21. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  22. Wu, Y.; He, K. Group Normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  23. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  24. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  25. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  27. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  28. Lim, S.; Kim, I.; Kim, T.; Kim, C.; Kim, S. Fast AutoAugment. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2019; Volume 32. [Google Scholar]
  29. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 702–703. [Google Scholar]
  30. Müller, S.G.; Hutter, F. TrivialAugment: Tuning-Free Yet State-of-the-Art Data Augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 774–782. [Google Scholar]
  31. He, J.; Zhu, F. Exemplar-Free Online Continual Learning. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 541–545. [Google Scholar] [CrossRef]
  32. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2019. [Google Scholar]
  33. Zhou, D.W.; Wang, F.Y.; Ye, H.J.; Zhan, D.C. PyCIL: A Python toolbox for class-incremental learning. Sci. China Inf. Sci. 2023, 66, 197101. [Google Scholar] [CrossRef]
Figure 1. SonarImage20.
Figure 2. Recognition accuracy for all CILNs on SonarImage20.
Figure 3. Structure of ResNet18.
Figure 4. Structure of basic block.
Figure 5. CIL process of DER.
Figure 6. Process of SPConv.
Figure 7. Receptive field.
Figure 8. Curve plots of different activation functions.
Figure 9. Process of the SENet.
Figure 10. Structure of OptResNet.
Figure 11. Structure of optimized basic block.
Figure 12. Scheduler for learning rate adjustment. (a) MultiStepLR. (b) WarmUpCosineAnneal.
Figure 13. Recognition accuracy of all CILNs.
Figure 14. Recognition accuracy of all improvements in backbone.
Figure 15. Recognition accuracy of all improvements in class incremental training strategies.
Table 1. Average recognition accuracy and the training time for all CILNs.

CILN | Average Recognition Accuracy/% | Training Time/s
LwF | 67.13 | 1391
iCaRL | 79.34 | 1796
PodNet | 85.36 | 2817
DER (baseline) | 88.87 | 2412
FOSTER | 84.13 | 2665
MEMO | 88.01 | 1624
Ours_LC | 96.18 | 4085
Ours_NCM | 96.30 | 4093
Upper bound | 99.60 | 5182
Table 2. Average recognition accuracy, parameter count, and training time for improvements in backbone.

Backbone | Average Recognition Accuracy/% | Parameter Count/Million | Training Time/s
ResNet18 (baseline) | 88.87 | 11.821 | 2412
ResNet18 + SPConv | 91.06 | 10.515 | 3982
ResNet18 + SPConv + GN & Mish | 91.35 | 10.505 | 3716
ResNet18 + SPConv + GN & Mish + SENet | 91.68 | 10.505 | 3842
OptResNet (with all improvements) | 93.92 | 8.275 | 3792
Upper bound | 99.60 | 11.821 | 5182
Table 3. Average recognition accuracy and the training time for improvements in class incremental training strategy.

Class Incremental Training Strategy | Average Recognition Accuracy/% | Training Time/s
Baseline | 88.87 | 2412
OptResNet | 93.92 | 3792
OptResNet + dynamic loss | 94.94 | 4070
OptResNet + dynamic loss + data augment | 96.18 | 4085
OptResNet + dynamic loss + data augment + NCM | 96.30 | 4093
Upper bound | 99.60 | 5182