Article

Fusing Context Features and Spatial Attention to Improve Object Detection

Tianjia Liu, Jinsong Wu, Xuze Luo and Guangquan Xu
1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China
3 Department of Electrical Engineering, University of Chile, Santiago 8370451, Chile
4 Tianjin Key Laboratory of Advanced Networking (TANK), College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(7), 4250; https://doi.org/10.3390/app13074250
Submission received: 23 January 2023 / Revised: 23 March 2023 / Accepted: 24 March 2023 / Published: 27 March 2023
(This article belongs to the Special Issue Technologies and Services of AI, Big Data, and Network for Smart City)

Abstract

Context features are mostly used to determine the boundary of a target, which allows one to better locate an object. In this paper, we propose the fusion of the spatial attention mechanism and contextual features to simulate the recognition of objects based on the human eye, thereby improving the detection accuracy of detectors. We chose the PASCAL VOC2007+2012 general dataset to test the generality of our method and examined the improved accuracy of our proposed detector on various targets. Our method showed improved accuracy for small targets and partially overlapping targets. Our proposed model improved the detector’s accuracy by 3.34%.

1. Introduction

With the development of object detection, the existing object detection algorithms have been classified into two-stage and single-stage algorithms. The representative model for two-stage algorithms is Faster RCNN [1], which generates a number of candidate regions and sends them into the network for object detection. The representative model for one-stage algorithms is YOLO. YOLO-v1 [2] partitions the input image into S × S grids and predicts N bounding boxes for each grid. YOLO-v2 to YOLO-v4 [3,4,5] generate anchor boxes through K-means clustering and predict the offset of the bounding box relative to the prior box. YOLO-v5 preprocesses the input image using mosaic data augmentation, adaptive image scaling, and other mechanisms, and it can compute the best anchor box values adaptively during training. YOLO-v6 [6] can provide models of different sizes owing to the design of the reparameterizable EfficientRep backbone and RepPAN neck. YOLO-v7 [7] proposed a highly aggregated network, re-parameterized convolution, auxiliary head detection, and dynamic label allocation to achieve an optimal trade-off between speed and accuracy. Two-stage object detection aims to enhance detection ability by selecting candidate boxes that closely overlap the ground-truth boxes after two rounds of refinement. Therefore, the detection accuracy of two-stage object detection is high, but the detection speed is low and does not meet real-time requirements. Single-stage detection algorithms omit the process of generating candidate boxes and directly extract image features to predict the classification and location of objects through the network, which saves time but lowers the detection accuracy. In recent years, one-stage object detection has been continually improved, surpassing two-stage object detection in terms of accuracy and speed, and has been widely used in industry. Currently, there are two main approaches to improving the accuracy of object detection algorithms: enhancing general-purpose algorithms and specifically targeting complex objects. For instance, [8,9,10,11,12] propose more optimized algorithms for all objects, while [13,14] design algorithms for complex objects that are difficult to detect accurately.
Most two-stage and one-stage object detection algorithms are based on anchor boxes. An anchor box is an a priori box with a predefined aspect ratio in the detection algorithm. In object detection, multiple anchor boxes are first generated, and then the category and offset are predicted for each anchor box. Next, each anchor box is adjusted according to the predicted offset to obtain the predicted bounding box. Finally, the required predicted bounding boxes are screened for output. The size, aspect ratio, and number of anchor boxes have an important influence on the performance of anchor-based detectors. When selecting the size of the anchor box, the aim is for the anchor box to fit the size of the object itself as closely as possible. Therefore, YOLO-9000 [3] was proposed, which uses K-means [4] instead of the manual selection method to determine the average sizes of the objects, so that the anchor boxes fit the objects more closely; however, the closer the anchor box is to the object, the greater the loss of surrounding environment information. Context features can better help us determine the boundaries of objects and better localize them.
For example, birds and airplanes appear as small colored or white objects in a blue sky. Comparing the blue background information with the white object information helps determine the boundaries and positions of the objects. At the same time, context features can also help determine the category of an object: if the background is a blue sky, the target may be a bird or an airplane; if the background is a beach, the object may be a human or a vehicle. Context features thus help us further narrow the range of candidate information during feature extraction. Adding context features [15] to target detectors can help improve the object detection accuracy. Meanwhile, context features cannot be used alone. Just as with human eyes, if the background is treated as being as important as the object, then most of the attention may be paid to the relationship between the background and the object rather than to the properties of the object itself; this observation is also a source of the attention mechanism [16]. Therefore, we ignore most of the background information and only fuse a small part of the background information with the object information for object detection.
This paper proposes the following three contributions:
  • Improving the YOLO-v3 model to better extract the features of small objects;
  • Adding context features to improve the detection accuracy of overlapping objects;
  • Integrating an attention mechanism to better detect difficult targets.

2. Related Works

Object detection algorithms based on deep learning are categorized into anchor-based and anchor-free detection algorithms. An anchor-free detection algorithm does not rely on predefined anchor boxes; instead, it learns object locations directly during network training according to the features of the dataset. CornerNet [10] outputs the prediction box by combining two sets of heat maps, in which a CNN predicts the top-left and bottom-right corner points. The works [10,17] locate objects based on key points. FCOS predicts, for each center point, the offsets to the top, bottom, left, and right boundaries of the object and the category to which the point belongs. YOLOx [18] predicts object locations by calculating the offsets of the coordinates from the upper-left corner of the grid cell together with the height and width of the prediction box. FoveaBox [19] predicts a category-sensitive semantic map for each possible target object and generates category-agnostic bounding boxes at each position that may contain a target. Keypoint-based anchor-free detection algorithms also need to pair the key points, which reduces the detection speed. Moreover, if two targets overlap and share the same center point, anchor-free detection algorithms may detect only one of them, so missed and false detections occur.
The scale of an anchor box is manually defined. By adjusting the anchor settings, the anchors can cover as many target objects as possible, which improves the accuracy of target prediction. For target detection, the method of selecting anchors is an important factor affecting the accuracy and speed of the detector. The target detector RCNN [1] applies deep learning to target detection, using the Selective Search method to select candidate boxes, which relies on exhaustive search and sends about 2000 candidate boxes into the detector; this improves the accuracy at the cost of considerably increasing the detection time. YOLO considers the image as a whole for detection and was the first method to follow an anchor-free design in place of an anchor selection process, which greatly improved the detection speed. SSD [20] uses fixed anchor box sizes for target detection and adds a feature pyramid, greatly improving the detection accuracy, while YOLO-v2 [3] also uses the feature pyramid method and employs K-means to design the anchor boxes, achieving a higher detection accuracy. GA-RPN [21] locates possible target center points and then generates optimal anchor boxes guided by the semantic features at those points, which greatly improves the detection performance of the target detector.
However, YOLO-v2 still has shortcomings: the value of k is uncertain, and when the object sizes assigned to a cluster are widely scattered, the cluster mean may not represent them well; thus, the obtained anchor boxes may not suit the corresponding detector. Therefore, we manually adjusted the anchor boxes obtained by K-means in order to improve the accuracy of the detector.
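To make the anchor selection step described above concrete, the following is a minimal sketch of IoU-based K-means anchor clustering in the spirit of YOLO-v2/YOLO-9000; the function names, the NumPy implementation, and the default of nine clusters are our own assumptions rather than the code used in this paper.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between box sizes and anchor sizes, both given as (w, h) pairs
    and assumed to share the same centre (boxes: (B, 2), anchors: (K, 2))."""
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) box sizes with the distance d = 1 - IoU used by YOLO-v2."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign each ground-truth box to the anchor with the highest IoU.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new_anchors = np.array([boxes[assign == i].mean(axis=0)
                                if np.any(assign == i) else anchors[i]
                                for i in range(k)])
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

Clustering the (w, h) pairs with d = 1 - IoU avoids the bias toward large boxes that a Euclidean distance would introduce; the resulting centroids can then be adjusted by hand, as done here.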

3. Attention Mechanism

The attention mechanism itself was inspired by humans' visual attention to objects. We apply an attention mechanism to target detection so that certain important features of an object affect our recognition of that object more strongly. For example, the features of the long trunk and large ears of an elephant dominate our perception of the visual information of the elephant, and an attention mechanism is adopted to increase the weight of these features. Humans tend to focus more on recognized objects; while they try to ignore the background, they cannot completely dismiss it. For example, when we observe birds flying in the sky, we neither pay attention to how blue the sky is nor count how many clouds there are, but the visual information we receive still tells us that the birds are flying in the sky; thus, the background information of the sky also has a certain impact on our recognition of the birds. We integrated such contextual information into the feature information of objects with a low weight, in the form of spatial attention, to improve the model's detection of objects. For example, [22,23] incorporate attention mechanisms to enhance detection accuracy. This study adopted the spatial attention mechanism SAM [16]. We let the detector imitate a human when detecting objects: we fused the target features and context features extracted in the feature layer, applied a spatial attention mechanism, and then sent the result to the enhanced feature extraction network for further feature extraction. The addition of a BiFPN module instead of an FPN (feature pyramid network) module could improve feature extraction, strengthening in particular the extraction of small-target features; the BiFPN module could also increase the fusion of the context features with the originally extracted features and better exploit the advantages of the context features.
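To make the weighted fusion behind BiFPN concrete, the following is a minimal sketch of the fast normalised fusion proposed with EfficientDet [9]; the class name and the assumption that all inputs already have identical shapes are ours, and this is not the exact block used in our network.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalised fusion: learn one non-negative scalar weight
    per input feature map and blend the maps with the normalised weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        # feats: list of tensors with identical shapes (B, C, H, W)
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * fi for wi, fi in zip(w, feats))
```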

4. Context Feature

The pixels in an image do not exist independently. There are relationships between neighboring pixels. A large number of pixels are related to each other to produce objects in an image. Therefore, context features refer to the relationships between pixels and surrounding pixels. Inspired by [24,25], we believe that context features contain context information that can fully utilize the constraints and dependencies between objects and their background, as well as between objects themselves, in contrast to the general features extracted by convolution. This improves the detection performance of the target detector.
In earlier work, the fusion of context features and an attention mechanism was used to detect small targets, and FA-SSD was proposed; to address the low detection accuracy for small objects, low-level and high-level features were fused within the attention mechanism. YOLO-v3 [4] uses an FPN as the feature extraction network, which introduces fewer features in the shallow layers, while it employs residual (ResNet) convolution to better extract feature information. We used YOLO-v3 as the foundation upon which to add context and spatial attention mechanisms, eliminating complicated processing steps for the feature fusion of different layers. Moreover, we extracted the context from the same layer so as to process less information and improve the detection speed of the detector.

5. Proposed Methods

5.1. Context Features

We introduced a multi-scale dilated convolution module to extract context features in the original network model, increased the receptive field while maintaining the resolution to make full use of context information, and applied a spatial attention mechanism after fusing the extracted context features and object features.
$$A_M = \sum_{i=1}^{M} \omega_i p_{x_i} + \sum_{j=1}^{N} \omega_j p_{x_j} = \sum_{k=1}^{N} \omega_k p_{x_k}, \qquad p_{x_i} \in CF,\; p_{x_j} \in FM,$$
where $A_M \in \mathbb{R}^{H \times W \times C}$ represents the extracted attention map, $CF \in \mathbb{R}^{H \times W \times C}$ the context features, $FM \in \mathbb{R}^{H \times W \times C}$ the feature map, $p_{x_i}$ the value of the i-th pixel, and N the number of pixels in the image. Each attention map we extracted was obtained by concatenating the context features with the original feature map and applying a weight $\omega_k$.
First, we used the K-means method to determine the k average sizes of the objects and then divided each of the k average sizes into n sizes. Here, we took k = 4 and n = 3; that is, we divided the object sizes into 4 large average sizes and then divided each large average size into 3 small sizes. We used 4 feature layers for feature extraction: small objects were extracted in the shallow layers, while large objects were extracted in the deep layers, and anchor boxes of 3 sizes were used to locate the objects at each grid point of each feature layer. The adoption of the enhanced feature extraction FPN (feature pyramid network) helped avoid the problem of insufficient semantic information in shallow layers and their inability to detect small objects, while also avoiding the decrease in detection speed caused by fusing too many feature layers together. The FPN better integrates deep semantic information with shallow information, increasing the amount of extracted feature information and thereby improving detection accuracy. We extracted the feature information of the objects through ordinary convolution and then extracted the context feature information through multi-scale dilated convolution. The structure of the multi-scale dilated convolution process is shown in Figure 1. Multi-scale features were extracted using dilated convolutions with a kernel size of 3 × 3 and dilation rates of 1, 3, and 5, respectively, and these multi-scale features were fused to obtain features containing context information. We sent the context feature information and the object feature information into the enhanced feature extraction network for further feature extraction. In Figure 2, the blue part represents the object features, and the red part represents the context features. The two sets of features were concatenated along the channel dimension, and the concatenated features were sent to the enhanced feature extraction network to obtain information-rich features.
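The multi-scale dilated convolution branch can be sketched in PyTorch as follows; the channel widths, the BatchNorm/LeakyReLU choice, and the concatenation followed by a 1 × 1 fusion convolution are assumptions for illustration, not the paper's exact layer configuration.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedContext(nn.Module):
    """Context extraction with parallel 3x3 dilated convolutions
    (dilation rates 1, 3, 5) whose outputs are fused into one feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for d in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        # padding == dilation with a 3x3 kernel keeps the spatial resolution,
        # so the enlarged receptive fields can be fused pixel by pixel.
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(ctx)
```

The output of such a module is what we concatenate with the object features along the channel dimension before enhanced feature extraction.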

5.2. Spatial Attention Mechanism

We used a spatial attention mechanism to extract target and context features from the network input for better target recognition. This mechanism assigned different weights to the features of the target itself and its surrounding context, thereby achieving the more precise extraction of the target.
$$M_F = \sigma\!\left(f^{3\times 3}\!\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right) = \sigma\!\left(f^{3\times 3}\!\left([F_{avg}; F_{max}]\right)\right),$$
where $\sigma$ is the sigmoid function used to obtain weight coefficients in the range [0, 1]; $f^{3\times 3}$ indicates that the convolution operation is performed with a 3 × 3 convolution kernel; and $[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]$ means that average and maximum pooling operations are carried out on the feature F and the resulting features are concatenated along the channel dimension.
The SAM [26] module was used to implement our spatial attention mechanism. We sent the context features extracted from the feature extraction network to the SAM module, whose structure is shown in Figure 3. First, maximum pooling and average pooling operations were performed on the input features $F \in \mathbb{R}^{C \times H \times W}$ to obtain two feature maps with a channel number of 1. The two feature maps were concatenated to obtain a feature map with a channel number of 2. Then, a convolution with a 3 × 3 kernel was used to extract features, and a sigmoid function was used to calculate the spatial weights of the extracted features, yielding a spatial attention map $M \in \mathbb{R}^{1 \times H \times W}$. Finally, the spatial attention map M and the features F were multiplied pixel by pixel to highlight the target features of F. In order to gather as much semantic information as possible, the spatial attention mechanism was added to the blue module in Figure 4: the deep semantic information was sent to the SAM module and then fused with the shallow information, and the spatial attention mechanism was applied again to the fused features to better extract the feature information of the object.
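A minimal PyTorch sketch of this CBAM-style spatial attention module is given below; the class name and the 3 × 3 default kernel follow the formula above, while everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Pool the input over the channel axis, convolve the two pooled maps,
    and rescale the input with the resulting sigmoid weight map."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f):
        avg = torch.mean(f, dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = torch.max(f, dim=1, keepdim=True)  # channel-wise max pooling
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * weight                          # pixel-by-pixel re-weighting of F
```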

5.3. Improved YOLO-v3

We took YOLO-v3 as the base network and made certain modifications to it; we then added the context feature and attention mechanism modules to the modified network to improve the detection accuracy. We added a feature extraction layer with dimensions of 104 × 104 to improve feature extraction for small target objects and attached it to the enhanced feature extraction network, splicing it with the feature layers that had been compressed several times. YOLO-v3 uses the enhanced feature extraction FPN to extract the features of objects: the width and height are compressed to 13 × 13 through continuous top-down convolution in the backbone extraction network; the 13 × 13 feature layer is then processed from the bottom up, its width and height are expanded by upsampling, and it is spliced and convolved with the other feature maps. A SAM module was applied between each convolutional layer where deep and shallow features were fused and the preceding convolutional layer, and the target features and contextual features were extracted in the last convolutional layer.
$$\begin{aligned} p_7^{out} &= \mathrm{SAM}(p_7^{in}) \\ p_6^{out} &= \mathrm{SAM}(\mathrm{upsample}(p_7^{out}) + p_6^{in}) \\ p_5^{out} &= \mathrm{SAM}(\mathrm{upsample}(p_6^{out}) + p_5^{in}) \\ p_4^{out} &= \mathrm{SAM}(\mathrm{upsample}(p_5^{out}) + p_4^{in}), \end{aligned}$$
where $p_7^{out}$ is obtained by passing the input of the 7th convolutional layer of the backbone network through the SAM module; $p_6^{out}$ is obtained by upsampling $p_7^{out}$, fusing it with the input $p_6^{in}$, and passing the result through the SAM module; and $p_5^{out}$ and $p_4^{out}$ are obtained analogously from $p_6^{out}$ and $p_5^{out}$, respectively.
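The top-down pathway above can be sketched as follows, reusing the SpatialAttention module sketched in Section 5.2; the assumption that the four inputs already share one channel width (for example after 1 × 1 convolutions) and the nearest-neighbour upsampling are ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class SAMTopDown(nn.Module):
    """Each deeper output is upsampled, added to the shallower input,
    and passed through spatial attention, as in the equations above."""
    def __init__(self):
        super().__init__()
        # SpatialAttention is the module sketched in Section 5.2.
        self.sam = nn.ModuleList([SpatialAttention() for _ in range(4)])

    def forward(self, p4, p5, p6, p7):
        p7_out = self.sam[0](p7)
        p6_out = self.sam[1](F.interpolate(p7_out, scale_factor=2) + p6)
        p5_out = self.sam[2](F.interpolate(p6_out, scale_factor=2) + p5)
        p4_out = self.sam[3](F.interpolate(p5_out, scale_factor=2) + p4)
        return p4_out, p5_out, p6_out, p7_out
```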
YOLO-v3 uses ResNet53 as the backbone feature extraction network. We replaced the residual block with an inverted residual block, which reduces the loss of information: whereas the ordinary residual block compresses and then expands the features, the inverted residual block uses a linear activation function in the low-dimensional layers to prevent the features from being destroyed. Depthwise separable convolution was used as the convolution method to reduce the number of parameters as much as possible and improve the detection speed of the model. We used (5) as the overall loss function for detection, where the confidence loss was calculated using the softmax function, and the loss between the predicted box l and the ground-truth box g was calculated using the smooth L1 function. In addition, we regressed the offsets of the center coordinates ($c_x$, $c_y$) and the height (h) and width (w) of the bounding box for localization.
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right),$$
where
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{j}^{p}\right) - \sum_{i \in Neg}^{N} x_{ij}^{p} \log\left(\hat{c}_{j}^{p}\right), \qquad \hat{c}_{j}^{p} = \frac{\exp\left(c_{j}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)},$$
$$L_{loc}(x, l, g) = \sum_{i \in Pos,\, m \in \{cx, cy, w, h\}}^{N} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right).$$
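As a rough illustration of this loss, the sketch below combines a softmax confidence term with a smooth-L1 localisation term and normalises by the number of positive anchors; the tensor shapes, the omission of hard negative mining, and the default alpha = 1 are our own simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, loc_preds, loc_targets, pos_mask, alpha=1.0):
    """cls_logits: (A, C+1) class scores per anchor; cls_targets: (A,) class indices
    (0 = background); loc_preds, loc_targets: (A, 4) offsets (cx, cy, w, h);
    pos_mask: (A,) boolean mask of anchors matched to a ground-truth box."""
    num_pos = pos_mask.sum().clamp(min=1).float()
    # Confidence loss: cross_entropy applies log-softmax internally.
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")
    # Localisation loss: only anchors matched to an object contribute.
    l_loc = F.smooth_l1_loss(loc_preds[pos_mask], loc_targets[pos_mask], reduction="sum")
    return (l_conf + alpha * l_loc) / num_pos
```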

6. Performance

6.1. Evaluation Metrics

Generally speaking, precision (P) and recall (R) are two conflicting measures: improving P tends to decrease R, and vice versa. The relationship between the P and R of the same algorithm can be represented by the p(r) curve shown in Figure 5. Ideally, both P and R approach 1, and therefore the area under the curve, which is referred to as the average precision (AP) of the detection algorithm, also approaches 1. In object detection, the mainstream evaluation metric is the mean average precision (mAP), as it considers the precision and recall of a model simultaneously and evaluates the performance of the model more comprehensively.
Evaluation metrics are used to assess the performance of a model. In Table 1, we provide a brief introduction to the four types of detection results (TP, FP, TN, and FN), and the commonly used metrics are presented in Table 2. In this paper, we chose P, R, mAP, and FPS as the evaluation metrics.
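For reference, AP and mAP can be computed from ranked detections roughly as follows; the all-point interpolation and the TP/FP bookkeeping convention are our own assumptions rather than the exact evaluation code used here.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class, given per-detection
    confidence scores, a 1/0 true-positive flag per detection, and the number
    of ground-truth boxes for that class."""
    order = np.argsort(-np.asarray(scores))
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(1.0 - flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # Make precision monotonically non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

def mean_average_precision(per_class_ap):
    """mAP is simply the mean of the per-class AP values."""
    return float(np.mean(per_class_ap))
```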

6.2. Experimental Settings

Dataset: We tested our model on PASCAL VOC2007 and VOC2012, combining the two datasets as VOC2007+2012. The combined dataset was partitioned into training, validation, and test sets with a ratio of 7:1.5:1.5. The VOC dataset comprises 20 categories, with VOC2007 containing 9963 pictures and VOC2012 containing 11,540 pictures. We combined these two datasets to reach a total of 21,503 pictures as our final dataset and, according to the proportions mentioned above, separated the validation and test sets to train and test the model.
Implementation details: We implemented our network model in Python using the PyTorch framework and tested it on an RTX 2080 on the server. In order to speed up training, that is, to speed up the convergence of the model, we partitioned the training period into a freezing phase and an unfreezing phase. We trained our model for 50 freeze epochs and 30 unfreeze epochs depending on how well the model converged. According to the computing power of the server, we set the batch size of the freeze phase to eight, the batch size of the unfreeze phase to four, and the IoU threshold to 0.5 based on the usual standard.
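The two-phase schedule can be summarised by a small configuration sketch such as the one below; the dictionary layout and the model.backbone attribute are hypothetical and only illustrate the freeze/unfreeze logic described above.

```python
# A minimal sketch of the two-phase training schedule; optimiser settings,
# learning rates, and data loading are omitted and would be assumptions anyway.
train_schedule = {
    "freeze":   {"epochs": 50, "batch_size": 8, "freeze_backbone": True},
    "unfreeze": {"epochs": 30, "batch_size": 4, "freeze_backbone": False},
    "iou_threshold": 0.5,
}

def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze the backbone so the early epochs train only the new heads
    (assumes the model exposes its backbone as `model.backbone`)."""
    for p in model.backbone.parameters():
        p.requires_grad = trainable
```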

6.3. Experiments

We combined the VOC2007 and VOC2012 data to test our model. As Table 3 shows, compared with a traditional two-stage detector, our model had a faster detection speed and higher accuracy, as we used a one-stage detector based on YOLO-v3 that eliminates the stage of generating candidate regions and thus improves the detection speed. The accuracy of our proposed model also outperformed that of one-stage detectors such as SSD and YOLO-v3. Moreover, because depthwise separable convolution was used, the number of unnecessary parameters was reduced, and the detection speed was improved. We also compared the model with CenterNet, which adopts an anchor-free design, and our proposed model again had a higher detection accuracy. Compared with the other models, ours also achieved a higher R and P. The R and P values of our model for each target class are shown in Table 4, which indicates that, with sufficient training data, our model produced good predictive results.

6.4. Ablation Study

In Table 5, we present the results of the ablation experiments conducted to determine the impact of each of our modifications on the model. We combined the VOC2007 and VOC2012 datasets to test the model. With ResNet53 as the backbone, we first replaced the ordinary convolutions with depthwise separable convolutions, which reduced the number of parameters and improved the detection speed of the model while only slightly decreasing its detection accuracy. Adding the context features alone produced only a modest improvement, whereas the accuracy improved greatly after the attention mechanism was added. Thus, only the combination of the attention mechanism and the context features achieved the best detection results.

6.5. Example Visualization of Experimental Results

We compared YOLO-v3 with our proposed target detection algorithm; the comparison results are shown in Figure 6. We found that the network with the attention mechanism and context feature extraction module detected the targets better. Our algorithm improved the accuracy of detecting small target objects, and partially overlapping large and small targets were also better detected.

7. Conclusions

In order to further improve the accuracy conferred by the context mechanism, we proposed combining it with a spatial attention mechanism so as to exploit its advantages. We tested our context feature module on the basis of the YOLO-v3 model. After adding the context features, the detection accuracy of our model improved to a certain extent, increasing from 82.34% to 83.73%. We based detection not only on the target itself but also on the background surrounding the target, which produced a better target detector and improved our model's detection accuracy (mAP) to 85.68%. It can be seen that the proposed addition of the context and spatial attention mechanism modules improved the accuracy of the detector.

Author Contributions

Conceptualization, T.L. and J.W.; methodology, T.L., J.W. and X.L.; software, T.L. and X.L.; validation, T.L. and X.L.; investigation, T.L., J.W. and G.X.; supervision, J.W.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the China Guangxi Science and Technology Plan Project (Guangxi Science and Technology Base and Talent Special Project) under grant 2022AC20001; the National Key R&D Program of China under grant 2022YFB3102100; the National Science Foundation of China under grants U22B2027, 62172297, 62102262, and 61902276; the Tianjin Intelligent Manufacturing Special Fund Project under grant 20211097; the Chile CONICYT FONDECYT Regular project under grant 1181809; and Chile CONICYT FONDEF under grant ID16I10466.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Neural Information Processing Systems 2015 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28.
  2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified Real-Time Object Detection. arXiv 2015, arXiv:1506.02640.
  3. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  4. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  5. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  6. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLO-v6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  7. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLO-v7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
  8. Fan, D.P.; Li, T.; Lin, Z.; Ji, G.P.; Zhang, D.; Cheng, M.M.; Fu, H.; Shen, J. Re-thinking co-salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4339–4354.
  9. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
  10. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
  11. Lin, L.; Lin, W.; Huang, S. Group object detection and tracking by combining RPCA and fractal analysis. Soft Comput. 2018, 22, 231–242.
  12. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
  13. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640.
  14. Loh, Y.P.; Liang, X.; Chan, C.S. Low-light image enhancement using Gaussian Process for features retrieval. Signal Process. Image Commun. 2019, 74, 175–190.
  15. Kaya, E.C.; Alatan, A.A. Improving proposal-based object detection using convolutional context features. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1308–1312.
  16. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 6688–6697.
  17. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859.
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
  19. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398.
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
  21. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2965–2974.
  22. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 1314–1324.
  23. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  24. Fan, Q.; Fan, D.P.; Fu, H.; Tang, C.K.; Shao, L.; Tai, Y.W. Group collaborative learning for co-salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12288–12298.
  25. Liu, Y.; Han, J.; Zhang, Q.; Shan, C. Deep salient object detection with contextual information guidance. IEEE Trans. Image Process. 2019, 29, 360–374.
  26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Figure 1. The structure of the multi-scale dilated convolution process.
Figure 2. Combined feature map.
Figure 3. SAM attention mechanism module.
Figure 4. Improved YOLO-v3 attention mechanism fusion model.
Figure 5. Precision–recall curve.
Figure 6. Comparison of visual experimental results.
Table 1. A brief introduction to TP, FP, TN, and FN.

Detection Result | Abbreviation | Definition
True positive | TP | correctly predicted positive samples
False positive | FP | incorrectly predicted positive samples
True negative | TN | correctly predicted negative samples
False negative | FN | incorrectly predicted negative samples
Table 2. Commonly used model evaluation metrics.

Evaluation Metric | Abbreviation | Computing Method | Definition
Accuracy | A | A = (TP + TN) / (TP + TN + FP + FN) | The proportion of correct predictions made by the model among all predictions.
Precision | P | P = TP / (TP + FP) | The proportion of true-positive samples among all samples predicted as positive by the model.
Recall | R | R = TP / (TP + FN) | The proportion of samples correctly predicted by the model among all positive samples.
Average precision | AP | AP = ∫_0^1 P(r) dr | The area under the precision–recall curve.
Mean average precision | mAP | mAP = (1/k) Σ_{i=1}^{k} AP_i | The average of the AP values over all classes.
Frames per second | FPS | - | The number of pictures that can be processed per second.
Table 3. The experimental results of all models.

Method | Backbone | Input Size | Precision | Recall | mAP | FPS
Faster RCNN | ResNet-101 | 600 × 600 | 77.10 | 64.63 | 73.41 | 10.31
FCOS | Normal convolution | 600 × 600 | 76.38 | 67.06 | 74.53 | 14.53
SSD512 | VGG16 | 512 × 512 | 84.15 | 67.58 | 76.80 | 17.21
CenterNet | Hourglass-104 convolution | 600 × 600 | 85.88 | 70.11 | 80.23 | 12.80
YOLO-v3 | ResNet53 | 412 × 412 | 90.26 | 73.83 | 82.34 | 16.95
Our model | ResNet53+ | 412 × 412 | 91.67 | 76.97 | 85.68 | 19.89
Table 4. The R and P values for all classes.

Class | Airplane | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow
R | 83.33 | 85.92 | 73.17 | 61.47 | 54.09 | 83.06 | 82.17 | 83.61 | 59.37 | 53.16
P | 99.01 | 96.83 | 96.77 | 95.71 | 85.00 | 100 | 93.81 | 95.03 | 79.74 | 58.33
Class | Dining Table | Dog | Horse | Motorbike | Person | Potted Plant | Sheep | Sofa | Train | TV Monitor
R | 63.89 | 88.70 | 86.47 | 86.05 | 82.98 | 40.44 | 28.42 | 76.47 | 83.47 | 80.45
P | 80.23 | 87.80 | 95.83 | 97.37 | 93.86 | 88.10 | 100 | 85.71 | 91.82 | 91.45
Table 5. Results of ablation study.

ResNet53+ | Depthwise Separable Convolution | Context Feature | SAM | mAP | FPS
✓ | × | × | × | 82.34 | 16.95
✓ | ✓ | × | × | 82.13 | 22.01
✓ | ✓ | ✓ | × | 83.73 | 20.60
✓ | ✓ | ✓ | ✓ | 85.68 | 19.89

✓ represents the addition of the corresponding module, while × represents the absence of the corresponding module.