1. Introduction
Computer vision is widely recognized as a critical component of artificial intelligence, as it enables machines to “see” and understand the physical world. Over the past two decades, the field has developed rapidly, with numerous new theories and methods yielding significant progress on its core problems. With the advancement of deep learning, image classification in particular has been widely studied and successfully applied in domains such as medical image analysis, autonomous driving, security monitoring, and garbage classification. Jakub Kufel [
1] indicated that artificial intelligence demonstrates promising outcomes in the field of medicine. Particularly in radiation therapy, studies by Liesbeth Vandewinckele [
2], Guangqi Li [
3], Jakub Kufel [
4], and Krithika Rangarajan [
5] have all shown that image classification technology can assist doctors in identifying patient conditions. The potential of image classification technology is enormous.
Historically, image classification relied on statistical learning techniques such as Bayesian classifiers and K-Nearest Neighbors for feature extraction and pattern recognition. However, these approaches struggled with large datasets and did not scale to more complex tasks. The advent of artificial neural networks, particularly CNNs, revolutionized image classification, providing a flexible and powerful method for training and inference on large datasets. CNNs have become the gold standard for image classification and have driven significant advances in image processing and computer vision. Limitations in the structure and learning algorithms of early neural network models hindered their ability to address complex image processing problems. However, in the late 1980s and early 1990s, LeCun [
6] first proposed a convolutional neural network model combining convolution operation, pooling operation, and nonlinear activation function, which can effectively process image data. In the 2012 ImageNet large-scale visual recognition challenge, AlexNet [
7] achieved a significant result, surpassing traditional machine learning methods. Since then, several well-established CNN models, including VGGNet [
8], GoogLeNet [
9], ResNet [
10], and DenseNet [
11], have shown impressive performance on various image classification tasks. However, these models are computationally expensive and require significant memory resources, making them challenging to deploy on mobile devices. To address this issue, researchers have developed lightweight models such as SqueezeNet [
12], MobileNet [
13], ShuffleNet [
14], and EfficientNet [
15], which are designed to be easily deployable on mobile devices.
Many techniques exist for improving model performance, including enhancing the model structure, augmenting the training data, and adopting transfer learning. The success of the Vision Transformer [
16] has drawn attention to the utility of attention mechanisms for feature extraction. The attention mechanism is a technique for selectively attending to the most informative regions of an image while disregarding irrelevant ones. This approach is inspired by the human visual system’s ability to efficiently analyze and comprehend complex scenes through selective attention, and its potential has led researchers to apply it in computer vision systems to enhance their performance. Specifically, the attention mechanism can be viewed as a dynamic selection process that adaptively weights input features according to their importance, enabling the network to focus on the most informative and relevant aspects of the input. Attention mechanisms in computer vision began with the Recurrent Attention Model (RAM) [
17], which combined deep neural networks with attention mechanisms to recurrently predict important regions in an image and update the entire network. The initial stage heavily relied on recurrent neural networks (RNNs) for implementing the attention mechanism. Subsequently, the advent of the SENet [
18], which introduced channel attention, marked the beginning of a new stage. ECANet [
19] and Convolutional Block Attention Module (CBAM) [
20] were representative works in this phase. Finally, the concept of self-attention mechanism was initially introduced by Transformer [
21] in natural language processing and then successfully applied to computer vision by Vision Transformer. Series of models based on Vision Transformers such as Tokens-to-Token Vision Transformer [
22], Pyramid Vision Transformer [
23], Swin Transformer [
24], Convolutional Vision Transformer [
25], Vision Outlooker [
26], and CoAtNet [
27] demonstrate the immense potential of attention mechanisms.
Global average pooling (GAP) has played an important role in previous attention mechanisms. However, it suffers from significant information loss, as it fails to preserve the spatial structure and positional information of the feature maps; consequently, it cannot capture subtle differences and local structures in different regions of the image. Additionally, since GAP processes the entire image uniformly, it cannot differentiate between important and unimportant regions, so the model assigns equal importance to features from all regions rather than focusing on the areas most relevant to the classification task. Moreover, GAP produces a single, spatially invariant global feature representation that lacks diversity, preventing the model from capturing multiple local feature patterns or information at different scales and thus restricting its expressive capability.
The proposed attention mechanism no longer relies solely on GAP; instead, it combines GAP with global max pooling (GMP) in a binary pooling approach. GMP extracts the most salient features from the image or feature maps: by max pooling over the entire feature map, only the most important feature in each channel is retained while less significant ones are suppressed. This reduces redundancy and highlights the crucial elements of the image, yielding more discriminative features. Furthermore, channel attention is incorporated to enhance the extraction of image feature vectors. It is implemented by applying fully connected layers and point convolutions after the pooling operation, allowing the model to focus on the important features in each channel and thereby improving the feature representation.
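As a concrete illustration, the block described above can be sketched in PyTorch. The exact way the two pooled descriptors are combined, the layer ordering, and the bias settings are assumptions; the text specifies only binary pooling (GAP + GMP) followed by fully connected layers, a point convolution, and channel-wise rescaling:

```python
import torch
import torch.nn as nn

class BinaryPoolAttention(nn.Module):
    """Illustrative sketch of the described channel attention block:
    binary pooling (GAP + GMP), fully connected layers with expansion
    ratio r, a pointwise convolution, and sigmoid gating."""

    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels * r),   # dimensionality expansion
            nn.ReLU(inplace=True),
            nn.Linear(channels * r, channels),   # project back to C channels
        )
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)  # point conv
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gap = x.mean(dim=(2, 3))          # global average pooling -> (b, c)
        gmp = x.amax(dim=(2, 3))          # global max pooling -> (b, c)
        w = self.fc(gap + gmp)            # aggregate the two descriptors
        w = self.pw(w.view(b, c, 1, 1))   # pointwise convolution
        return x * self.gate(w)           # rescale input channels
```

Summing the GAP and GMP descriptors keeps the parameter count independent of the pooling choice; concatenation would be an equally plausible reading of the text.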
Our work contributions can be summarized as follows:
To address the issue of insufficient information in GAP, a binary pooling operation is proposed. This approach effectively enhances the representational capacity of the convolutional network.
The proposed channel attention mechanism leads to improved image feature representation.
Experimental results demonstrate that the proposed method outperforms other attention mechanisms, yielding superior performance on ImageNet.
The rest of the paper is structured as follows.
Section 2 presents a review of related work.
Section 3 illustrates how the model is constructed. In
Section 4, comprehensive experiments are conducted on ImageNet to evaluate the effectiveness of the proposed method, which achieves excellent results. Finally, this work is summarized in
Section 5.
4. Experiments
In this section, the experimental setup is first described and the effectiveness of the proposed method on image classification tasks is studied. Then, ablation studies of the model are presented. Finally, the learned attention weights are visualized.
4.1. ImageNet Classification
The evaluation of the proposed method is conducted on the ImageNet [
41] classification dataset, which consists of 1000 classes. The models are trained on the 1.28 million training images and evaluated on the 50k test images. For the training set, a 224 × 224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. For the test set, images are resized to 256 × 256 and center-cropped to 224 × 224.
Setup: To evaluate the performance of the proposed attention block on ImageNet, two popular CNN architectures are used as backbone networks: ResNet18 and ResNet50. Regarding the hyperparameters, in ResNet18, the learning rate was initialized to 0.1 and decayed every 30 epochs; an SGD optimizer with weight decay, a momentum of 0.9, and a batch size of 256 was used. In ResNet50, the learning rate was initialized to 0.05 and decayed every 30 epochs; an SGD optimizer with weight decay, a momentum of 0.9, and a batch size of 128 was used.
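For reference, the described schedule maps onto standard PyTorch components roughly as follows. The decay factor (0.1) and the weight decay value (1e-4) are assumed, since the text does not state them:

```python
import torch
import torch.nn as nn

# Hypothetical optimizer/schedule mirroring the described ResNet-18 recipe.
# The decay factor (0.1) and weight decay (1e-4) are assumptions.
model = nn.Conv2d(3, 64, 7)  # stand-in for the ResNet-18 backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):   # advance the schedule (training loop omitted)
    optimizer.step()      # placeholder for one epoch of training
    scheduler.step()      # lr: 0.1 -> 0.01 -> 0.001 -> 0.0001
```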
The proposed method is also compared with other prominent networks with attention, including SENet [
18], CBAM [
20], SRM [
37], FcaNet [
40], and ECANet [
19]. Evaluation metrics included the number of parameters, floating-point operations (FLOPs), and top-1/top-5 accuracy. No augmentation techniques such as mixup [
42], cutout [
43], cutmix [
44], etc. or label regularization such as label smoothing [
45] are adopted in the implementation. All networks were trained for 100 epochs on an NVIDIA RTX 3090 GPU and an Intel i5-12400F CPU using the PyTorch framework.
Comparison of results on ResNet:
Figure 3 shows the training curves after inserting the proposed attention blocks into ResNet18 and ResNet50. Early in training, before the model has learned effective representations of the data, both training and test accuracy are low and both losses are high. As training progresses, training and test accuracy gradually improve while the losses decrease, indicating that the model is learning to extract features and patterns from the data and fitting the training data better. From about the 60th epoch, the gains in training and test accuracy become marginal and the losses stabilize, suggesting that the model has learned the general features in the data and generalizes well to the test data.
Table 1 shows the comparison of results on the ResNet18 and ResNet50 backbones. The following observations can be made: (1) On the ResNet18 backbone, the proposed method achieves a higher top-1 accuracy than all other models, a 2.02% improvement over the baseline. (2) On the larger ResNet50 backbone, the proposed method significantly increases the parameter count because of the dimensionality expansion operation in the fully connected layer; nevertheless, with only a limited increase in computational cost, it still achieves a higher top-1 accuracy than all other models, a 0.63% improvement over the baseline.
Comparing ResNet-18 with ResNet-18 plus the proposed attention block: for 224 × 224 image inputs, ResNet-18 requires approximately 1.82 GFLOPs. Our attention block employs binary pooling in the pooling stage, two fully connected layers and a pointwise convolution in the aggregation stage, and finally an inexpensive scaling operation. Overall, when setting the expansion ratio r to 8, the new ResNet-18 requires approximately 1.83 GFLOPs, a relative increase of 0.55% over the original ResNet-18. In exchange for this slight additional computational burden, the new ResNet-18 achieves higher accuracy. In addition, with a mini-batch size of 256, training a batch on ResNet-18 takes 216 ms, while on the new ResNet-18 it takes 240 ms. We believe this is a reasonable time overhead, which could potentially be reduced further if the pooling and small inner-product operations were better optimized in PyTorch. Considering its contribution to the model’s performance, the minor additional computational cost of the attention block is acceptable.
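The quoted overhead figures can be reproduced with a quick calculation:

```python
# Relative FLOPs and per-batch time overhead of the attention-augmented
# ResNet-18, using the figures quoted above.
base_gflops, new_gflops = 1.82, 1.83
base_ms, new_ms = 216, 240

flops_overhead = (new_gflops - base_gflops) / base_gflops * 100
time_overhead = (new_ms - base_ms) / base_ms * 100

print(f"FLOPs overhead: {flops_overhead:.2f}%")  # ≈ 0.55%
print(f"Time overhead:  {time_overhead:.1f}%")   # ≈ 11.1%
```

The time overhead is proportionally larger than the FLOPs overhead, which is consistent with the remark that the pooling and small inner-product operations are not yet well optimized.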
4.2. CIFAR-10 and CIFAR-100
Experiments are also conducted on two classic small datasets, CIFAR-10 and CIFAR-100 [
46], each consisting of 50,000 training and 10,000 test RGB images of size 32 × 32 pixels. These datasets are labeled with 10 and 100 classes, respectively. The attention blocks were integrated into the ResNet-18 and ResNet-50 architectures. During training, the images were randomly flipped horizontally and zero-padded with four pixels on each side before undergoing random 32 × 32 cropping. Mean and standard deviation normalization was also applied. The training hyperparameters, such as batch size, initial learning rate, and weight decay, were set following the recommendations in the original paper. The performance on CIFAR-10 and CIFAR-100 is shown in
Table 2 and
Table 3. It can be seen that in each table, the new ResNet outperformed the baseline architectures, indicating that the benefits of the proposed attention block are not limited to the ImageNet dataset.
4.3. Ablation Studies
4.3.1. Expansion Rate
The expansion rate
r in the fully connected layer is an important hyperparameter that affects the capacity and computational cost of the attention block. To study its impact, experiments with various values of
r are performed on ResNet-18. The comparison in
Table 4 suggests that the performance does not improve monotonically with increasing capacity, likely because the channel interdependencies in the attention block can lead to overfitting on the dataset. Specifically, it is found that setting
r = 8 achieved a good balance between accuracy and complexity. Therefore, this value was used in all experiments.
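To see why r directly drives the block’s cost, the parameter count of the two fully connected layers can be computed as a function of r; the C → rC → C layer shapes are an assumption based on the described expansion:

```python
# Parameter count of the two fully connected layers in the attention
# block as a function of the expansion rate r, assuming layer shapes
# C -> r*C -> C with biases. The shapes are an assumption.
def fc_params(channels: int, r: int) -> int:
    expand = channels * (channels * r) + channels * r    # C -> rC layer
    project = (channels * r) * channels + channels       # rC -> C layer
    return expand + project

for r in (2, 4, 8, 16):
    print(r, fc_params(512, r))  # cost grows linearly with r
```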
4.3.2. Activation Function
The activation function is a nonlinear function in neural networks that introduces nonlinear transformations to increase the expressive power and fitting capacity of the network. It processes the input of a neuron and generates an output signal, which serves as the input to the next layer of neurons. Commonly used activation functions include Sigmoid, ReLU, and Tanh. Different activation functions affect the attention block differently, as shown in
Table 5. Compared to Sigmoid, Tanh leads to a slight decrease in accuracy, while ReLU results in a significant decrease. Therefore, the Sigmoid activation function is selected.
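The preference for Sigmoid is consistent with its role as a gate: it maps arbitrary scores into (0, 1), whereas Tanh can emit negative gates and ReLU zeroes out negative scores entirely while being unbounded above. A quick check:

```python
import torch

# Output ranges of the three candidate activations on sample scores.
scores = torch.tensor([-2.0, 0.0, 3.0])
print(torch.sigmoid(scores))  # all values in (0, 1): usable as gates
print(torch.tanh(scores))     # values in (-1, 1): can flip feature signs
print(torch.relu(scores))     # hard zero for negatives, unbounded above
```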
4.4. Data Visualization
It is worth mentioning that our work employs a novel binary pooling operation, together with fully connected and point convolution layers, to obtain the attention weights, which ensures comprehensive feature-vector extraction from images. To illustrate this attention mechanism more intuitively, we examine the learned attention weights. Using Grad-CAM++ [
47], the attention weights of the model pre-trained on the ImageNet training set are visualized in
Figure 4. The redder the color, the greater the attention weight. The red regions are concentrated on the targets, indicating that the attention mechanism used in this work truly focuses on them.
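For readers who want to reproduce such maps, a minimal sketch of plain Grad-CAM on a tiny stand-in CNN is shown below; the paper uses Grad-CAM++ [47], which refines the channel-weighting step, so this is an illustrative simplification rather than the exact method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal Grad-CAM sketch (Grad-CAM++ additionally refines the channel
# weights). The tiny CNN is a stand-in for the pre-trained model.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
acts, grads = {}, {}
conv = model[0]
conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 32, 32)
logits = model(x)
logits[0, logits.argmax()].backward()       # backprop the top class score

weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # channel importance
cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted activations
cam = cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```

Upsampling `cam` to the input resolution and overlaying it as a heat map yields figures like those referenced above.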
5. Conclusions
This paper proposes a novel attention block for image classification. Compared to previous research on attention, our work utilizes a more advanced pooling operation, and, to enhance feature-vector extraction, fully connected and point convolution layers are adopted to fully aggregate the feature maps. Experiments on ImageNet demonstrate the effectiveness and superior performance of the proposed model, and its benefits are not limited to ImageNet alone: it is expected to show promising results in medical applications such as disease diagnosis, organ segmentation, and skin anomaly detection. Because of limitations in acquiring datasets, validating the model on medical imaging will be a focus of future research. There is also room for further improvement. For example, to avoid information loss and noise amplification, the expansion dimensionality was chosen coarsely, which significantly increased the number of parameters in the attention block, although the added computational complexity is acceptable; considerable optimization is possible in the fully connected layers. Furthermore, the proposed attention block is channel-based; incorporating spatial attention may yield even better results and is another direction for future work.