Article

Improved YOLOv7 for Small Object Detection Algorithm Based on Attention and Dynamic Convolution

College of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710311, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(16), 9316; https://doi.org/10.3390/app13169316
Submission received: 17 July 2023 / Revised: 7 August 2023 / Accepted: 14 August 2023 / Published: 16 August 2023

Abstract

The rapid advancement of deep learning has significantly accelerated progress in object detection. However, detecting small targets remains challenging because their features are easily lost and they are highly sensitive to size variations. In this paper, we address these challenges by building on the latest version of the You Only Look Once (YOLOv7) model. Our approach enhances YOLOv7 to improve feature preservation and minimize feature loss during network processing. We improved the Spatial Pyramid Pooling, Cross-Stage Partial Channel (SPPCSPC) module by combining the ideas of feature separation and merging. To mitigate missed detections in small target scenarios and reduce the impact of noise, we strategically incorporated the Coordinate Attention for Efficient Mobile Network Design (CA) module. Additionally, we introduced a dynamic convolution module to address the false and missed detections caused by large variations in target size, enhancing the network’s robustness. Experimental validation was conducted on the FloW-Img sub-dataset provided by Orca Tech. The results demonstrate that our enhanced YOLOv7 model outperforms the original network, markedly reducing missed detections and reaching a mean Average Precision (mAP) of 81.1%, a 5.2 percentage point improvement over the baseline YOLOv7 model. The new model also compares favorably in several respects with recent small-target-detection algorithms such as FCOS and VFNet.

1. Introduction

Object detection involves the classification and localization of targets within images or videos. Its significance has grown greatly in recent years due to its diverse array of applications. Additionally, object detection serves as the foundation for other advanced tasks within the realm of computer vision, such as target tracking and target segmentation. The task encompasses two primary components: identifying the target’s class within an image and accurately determining its spatial location [1]. Early conventional object-detection methods, including the Viola–Jones detector [2] and the Histogram of Oriented Gradients (HOG) [3], relied on manually constructed features. However, these approaches suffered from slow processing speeds, reduced accuracy, and inadequate performance on unfamiliar datasets. The introduction of Convolutional Neural Networks (CNNs) has ushered in a transformative era in object detection.
This study delves into target detection methods facilitated by fundamental deep learning models [4]. Notably, target detection methodologies can be broadly categorized into two major classifications: the two-stage model and the one-stage model. The former employs a CNN to generate a set of prospective candidate regions, subsequently accomplishing tasks related to classification and localization. On the other hand, the latter adopts a regression approach, directing input images through a CNN and directly yielding detection results thereafter.
Small Object Detection (SOD) has a relatively short history compared with other computer vision tasks. In 2014, the widely used MS COCO dataset defined small targets as objects occupying an area of less than 32 × 32 px. A deep-learning-based small-target-detection network was proposed in 2016 [5]; by contributing a tiny-target-detection dataset and its evaluation metrics, that work defined, in relative terms, targets whose bounding box covers 0.05% to 0.58% of the image area as small targets, laying a foundation for subsequent research on small target detection. The literature [6] proposed an upsampling-based technique that achieved better results in small target detection. In 2018, the literature [7] applied a deconvolutional RCNN to small target detection in remote sensing. Subsequently, researchers proposed many small-target-detection models based on Faster RCNN [8], SSD [9], and the YOLO series. ReDet [10], Oriented Bounding Boxes [11], and Box Boundary-Aware Vectors [12] improve the identification of tiny targets by rotating the prediction box and the detector, but they target only remote sensing scenarios. TPH-YOLOv5 [13] adds target-detection layers and employs a transformer prediction Head integrated with the CBAM attention module [14], which successfully improves the network’s detection of tiny targets, but it is prone to missed detections in non-dense scenes. For YOLO-Z [15], although a series of operations such as replacing PAFPN with Bi-FPN and expanding the Neck layer fuses the shallow and middle features well, it is not applicable to scenarios with large variations in target size. Fatih Cagatay Akyon, Sinan Onur Altinuc et al. [16] addressed the difficulty of detecting small targets when an image contains very little detail and too little information can be extracted, proposing Slicing Aided Hyper Inference (SAHI) to assist detection.
In order to better cope with the problems of modeling difficulty and insufficient training due to the insufficient data volume in small target detection, Gong Cheng et al. [17] constructed the SODA-D dedicated dataset for small target detection in driving behavior and the SODA-A dedicated dataset for small target detection for airborne scenarios, which provide a solid foundation for future research on small target detection.
Addressing the aforementioned challenges, this paper introduces an enhanced version of the You Only Look Once v7 (YOLOv7) target-detection model. Section 2 provides the foundational knowledge necessary for understanding the context and scope of this work. In Section 3, we delve into the refined components of the benchmark model and elucidate the rationale underlying each enhancement. The subsequent Section 4 offers a comprehensive exploration of the experimental setup, design considerations, achieved outcomes, and real-world performance of the novel model. In conclusion, we recapitulate the entirety of this endeavor, identifying limitations and charting pathways for future advancements.
This paper makes contributions that enhance the efficacy of small target detection, primarily in the following respects. First, we combine the ideas of feature separation and merging to refine the Spatial Pyramid Pooling, Cross-Stage Partial Channel (SPPCSPC) module, introducing a new module, the Spatial Pyramid Pooling-Fast, Cross-Stage Partial Channel (SPPFCSPC) module. This module preserves the receptive field of the YOLOv7 model while ensuring precise localization of small targets regardless of their sizes. Consequently, it amplifies the extraction of effective features while mitigating missed detections.
To mitigate the challenges of false positives and missed detections in small target detection, we introduce the Coordinate Attention for efficient mobile network design (CA) mechanism. This attention mechanism is a remarkable innovation capable of simultaneously incorporating inter-channel relationships and positional information across extensive distances. It adeptly allocates attention to multiple feature elements in parallel, resulting in reduced information loss during small target detection and superior achievement of feature optimization.
In the context of small target detection, we introduce a dynamic convolution module whose kernel parameters adapt to the input as target sizes change. This adaptation enhances the model’s expressive capacity and contributes to elevated detector performance, while fortifying the network’s robustness and responsiveness to varying target sizes.

2. Related Work

2.1. YOLOv7

The overall structure of the YOLOv7 model [18] is broadly similar to that of its predecessor, YOLOv5 [19]. It is primarily divided into four components: input, Backbone, Neck, and Head.
The Backbone serves as the foundation for feature extraction within YOLOv7. It performs the initial extraction of features for target detection, culminating in the generation of feature layers. The feature layers extracted by the Backbone play a pivotal role in the subsequent network construction, earning them the name “effective feature layers”. The Backbone feature extraction network of YOLOv7 capitalizes on the E-ELAN module, whose final stacking module comprises four branches. This configuration yields a denser residual structure through multiple stacks, a distinction that keeps optimization simple and permits accuracy gains through increased network depth.
The YOLOv7 Neck module undertakes the critical task of fusing the effective feature layers generated previously. This module represents a noteworthy evolution in the feature-extraction network of YOLOv7, strategically devised to address the challenges arising from varying target sizes in deep learning. Moreover, it effectively tackles the issue of image noise. Notably, YOLOv7 retains the PANet structure of its predecessors; this not only extends the inherent characteristics of the architecture for improved synergy, but also involves an additional round of feature downsampling, thereby achieving comprehensive feature fusion.
The components of YOLOv7 responsible for classification and regression are collectively referred to as the YOLO Head. The Backbone and Feature Pyramid Network (FPN) contribute improved effective feature layers, and in the Head module, each feature layer is equipped with its corresponding parameters. The subsequent process maps the feature map to a collection of feature points, which are then combined with the prior frames. This ensures that every prior frame corresponds to multiple feature channels, forming the fundamental premise of the YOLO Head. In essence, the YOLO Head evaluates the feature points, thereby establishing the association between prior frames and target entities. Analogous to its earlier iterations, YOLOv7 executes regression and classification through a 1 × 1 convolution featuring decoupled Heads.
In order to forecast object cases matching the previous frame, the entire YOLOv7 network performs the following tasks: input image processing; feature extraction; feature improvement; and prediction. Figure 1 depicts its particular network structure.
This layer-by-layer transfer structure leads to a loss of information between layers. In addition, because the detection boxes of the YOLOv7 model are fixed in scale, it has certain defects for both small targets and multi-target detection with large size variations. Its own SPPCSPC module also has some shortcomings in both the speed and accuracy of processing information. Therefore, we designed a new network structure with an improved SPPFCSPC module, as described in Section 3.

2.2. Attentional Mechanisms

The application of attention mechanisms in deep learning has increased recently, with excellent performance in image detection, speech processing, and natural language processing [20,21,22,23]. In the realm of computer vision, attention mechanisms have also had a significant impact, particularly on target identification. Among them, the Squeeze-and-Excitation Network (SENet) [22] and the Convolutional Block Attention Module (CBAM) [23] are the most representative.
The SENet is an attention model that enhances the expressiveness of the model by introducing a “squeeze-and-excitation” module. It learns a weight for each channel of the base model, which is used to adjust the importance of that channel in the next layer, making the model focus more on important features. The SE attention mechanism thus directly enhances the expressiveness and performance of the model, and its small number of parameters makes it easy to train and deploy; however, computing a weight for each channel adds computational cost and may increase model complexity.
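For illustration, the following is a minimal PyTorch sketch of a squeeze-and-excitation block as described above; the layer sizes and the reduction ratio of 16 are common defaults rather than values taken from this paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: global-average 'squeeze',
    a two-layer 'excitation' MLP, then per-channel reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: one scalar per channel
        w = self.fc(w).view(b, c, 1, 1)   # excitation: learned channel weights
        return x * w                      # reweight the input feature map
```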
The Convolutional Block Attention Module (CBAM) is an attention mechanism for convolutional neural networks that enhances the network’s representation and generalization capabilities and excels at tasks such as image classification, target recognition, and image segmentation. Its benefit is that it applies attention along both the channel and spatial dimensions, making the network pay closer attention to critical features while suppressing less-essential ones. Its apparent disadvantage is that it lacks spatial information at different levels to enrich the feature space, considers only local regions, and fails to capture long-range dependencies.
In this paper, we conducted controlled variable experiments on the Backbone, Neck, and Head of YOLOv7 and found that much medium and shallow texture and contour information, which is important for small target detection, is somewhat lost in all three parts of the feature extraction process, with the Backbone network having the most-serious loss. This situation has a certain degree of impact on small target detection, and it easily results in target missed detection.
Therefore, this paper starts by enhancing the network’s attention to small targets, taking into account the relationships within the input features to reduce the occurrence of missed detections.

2.3. Study of Improvements Related to YOLOv7

As the latest model in the YOLO series, YOLOv7 has increasingly been used as a benchmark for improvements aimed at specific effects, and many classical results on small target detection have been built on it. For example, Yi et al. [24] proposed an underwater small-target-detection method based on YOLOv7; to reduce the information loss caused by underwater detection, an SE attention mechanism was added to the model, together with the EIoU loss function and the FPN structure, to make detection more accurate. To solve the small-target-detection problem arising in satellite surveillance, Yu et al. [25] proposed an improved YOLOv7 algorithm based on HorNet convolution and the BoTNet attention mechanism, which overcame the size changes and strong noise interference that arise in ultra-small target detection. In addition, many other research results based on YOLOv7 improvements have informed this research.

3. Improvement of YOLOv7 Object-Detection Model

3.1. SPPCSPC Module Improvements

The SPPCSPC module is separated into two primary components in the YOLOv7 network. The role of the SPP part is to increase the receptive field, allowing the algorithm to adapt to images of different resolutions; this is achieved by maximum pooling over different receptive fields. As can be seen from Figure 2, in the first branch of the SPPCSPC module, the feature map is subjected to four maximum pooling operations with kernel sizes of 5, 9, 13, and 1. These four different maximum pooling layers make it possible to distinguish targets of different sizes and better localize small targets.
The CSP structure divides the whole SPPCSPC module into two parts. The first part performs regular processing of the feature map to ensure the speed of image processing. The second part is the aforementioned SPP module, which uses more than one group of large pooling kernels to process the feature maps, making YOLOv7 adaptable to targets of various sizes with better generalization.
As shown in Figure 2, the SPP module processes the input feature map four times, capturing target sizes in maximum pooling channels of different scales in order to localize and process the targets. However, the disadvantage of the SPP module is obvious: when the feature map is processed in different pooling channels, targets that do not correspond to a given channel scale are ignored, so features are missing from the final result and accuracy suffers. This is particularly serious for small targets, which can be missed entirely during processing. In addition, because the feature map is divided into four branches here and pooling is performed once for each branch, the processing speed of the YOLOv7 algorithm is slow.
Therefore, the SPPFCSPC module was designed with reference to the SPPF structure in earlier YOLO versions (as shown in Figure 3). In the new SPPFCSPC module, the four pooling channels of the original SPP module are integrated so that, each time pooling is performed, the result is passed both to the next pooling layer and to the final concatenation layer. Because of this continuous pooling process, the receptive field of the YOLOv7 model is preserved to the maximum extent, ensuring the localization of small targets of various sizes and thus improving small target detection. In addition, because the feature map passed into each pooling layer has already been processed by the previous pooling layer, the processing speed of that pooling layer improves to a certain extent, and the overall processing speed of the YOLOv7 model improves accordingly.
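The pooling core of this idea can be sketched in PyTorch as follows. This is a simplified illustration of the sequential (SPPF-style) pooling only; the surrounding CSP convolution branches of the full SPPFCSPC module are omitted, and the 5 × 5 kernel is an assumption based on the usual SPPF design.

```python
import torch
import torch.nn as nn

class SPPFCore(nn.Module):
    """Chained max-pooling as used in SPPF-style modules: one 5x5 pool applied
    three times in sequence. Two chained 5x5 pools cover a 9x9 window and three
    cover a 13x13 window, so the concatenated output matches the receptive
    fields of the parallel 5/9/13 SPP branches while reusing intermediate
    results."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        y1 = self.pool(x)    # effective 5x5 window
        y2 = self.pool(y1)   # effective 9x9 window
        y3 = self.pool(y2)   # effective 13x13 window
        return torch.cat([x, y1, y2, y3], dim=1)
```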

3.2. YOLOv7 Introduces the CA

Among the current mainstream attention mechanisms, the SE attention mechanism is computationally expensive, and CBAM focuses mainly on the aggregation of spatial information, leading to a less-than-optimal effect. Based on this, the coordinate attention mechanism (Coordinate Attention for efficient mobile network design) [26] is introduced in this paper to enhance the model. The CA module not only captures inter-channel relationships effectively, but also considers the positional information of the feature space, encoding channel correlations and long-range dependencies through precise position information. This remedies the deficiencies of the SE and CBAM modules: it efficiently reduces missed and false detections and helps to locate and recognize targets more precisely.
Channel attention mechanisms use global pooling for encoding, which converts the global information of each channel into a scalar and thus discards a large amount of important spatial information; this is very detrimental to the detection of small targets. To address this problem, CA replaces global pooling with two one-dimensional directional pooling operations, one along the width and one along the height. The CA module is shown in Figure 4.
For a feature map X with input dimensions C × H × W, the input is first pooled globally along the width and height using pooling kernels of dimensions (H, 1) and (1, W), yielding the horizontal-direction feature z^h of size C × 1 × H and the vertical-direction feature z^w of size C × 1 × W. The outputs for channel c are, respectively:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
Then, the two spatially oriented feature maps are concatenated, reduced in dimension using a 1 × 1 convolutional kernel F1, and passed through batch normalization and a non-linear activation function δ:
$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
The resulting feature map f has size C/r × 1 × (W + H), where r is the compression factor. f is then split along the spatial dimension into two tensors corresponding to the vertical and horizontal directions, and each is up-dimensioned with a 1 × 1 convolutional kernel to match the original number of channels. After processing with the Sigmoid activation function σ, the attention weights in the height and width directions are produced independently. The procedure is as follows:
$$g^h = \sigma\left(F_h\left(f^h\right)\right)$$
$$g^w = \sigma\left(F_w\left(f^w\right)\right)$$
Finally, the obtained attention weights are multiplied with the original feature map X_c(i, j) to obtain the feature map Y_c(i, j) with attention applied:
$$Y_c(i, j) = X_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
The new feature map obtained after multiplying the two feature maps takes into account both spatial features and channel features, which can more effectively increase the algorithm model’s performance for small target detection.
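A minimal PyTorch sketch of the coordinate attention computation described by the formulas above is given below. The reduction ratio and the choice of activation for δ are assumptions; the official implementation may differ in these details.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Sketch of coordinate attention: directional average pooling along width
    and height, a shared 1x1 reduction F1 with BN and a non-linearity delta,
    then separate 1x1 expansions F_h and F_w followed by Sigmoid to obtain the
    attention weights g^h and g^w."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)  # assumed choice for delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # B x C x H x 1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # B x C x W x 1
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # B x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # B x C x 1 x W
        return x * g_h * g_w
```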

3.3. YOLOv7 Introduces Dynamic Convolution

Dynamic convolution [27] seeks a suitable balance between the number of network layers and the computational cost in order to better improve the expressive ability of the model. Its essence is that the convolution is not a single kernel but an aggregation of several convolutional kernels that are dynamically adjusted according to the input, so that appropriate parameters are selected for feature extraction in a targeted manner. Small targets of various sizes therefore obtain better feature extraction, avoiding many missed and false detections. The general dynamic perceptron is shown in Figure 5.
As can be seen from Figure 5, the output is as follows:
$$y = g\left(\tilde{W}^{T}(x)\,x + \tilde{b}(x)\right)$$
$$\tilde{W}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{W}_k$$
$$\tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\,\tilde{b}_k$$
$$\mathrm{s.t.}\quad 0 \le \pi_k(x) \le 1, \quad \sum_{k=1}^{K} \pi_k(x) = 1,$$
where $\tilde{W}_k$, $\tilde{b}_k$, and g denote the weights, biases, and activation function, respectively, and $\pi_k$ denotes the attention weights, which are not fixed but vary with the input. The dynamic perceptron thus involves attention weight calculation and dynamic weight fusion, whose cost is small compared with that of the perceptron itself, i.e.,
$$O\left(\tilde{W}^{T}x + \tilde{b}\right) \gg O\left(\sum_{k=1}^{K}\pi_k(x)\tilde{W}_k\right) + O\left(\sum_{k=1}^{K}\pi_k(x)\tilde{b}_k\right) + O\left(\pi(x)\right)$$
where O(·) denotes the amount of computation.
Based on the above, the dynamic convolution structure, which follows the same process as the dynamic perceptron, is shown in Figure 6. Before the BN and ReLU layers, the dynamic convolution layer holds K convolutional kernels of the same size; these are fused under the guidance of the attention weights to obtain the kernel parameters actually used by that layer. Meanwhile, global average pooling is performed first in order to obtain the global spatial features from which the attention is computed. As a result, the originally fixed convolutional kernel becomes one that is dynamically selected based on the input, significantly improving the feature expression ability.
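The following PyTorch sketch illustrates this aggregation of K parallel kernels with input-dependent attention weights. The value K = 4, the attention network layout, and the use of a plain softmax (the cited work additionally anneals a softmax temperature during training) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """K parallel kernels are fused per input with attention weights pi_k(x)
    (softmax makes them non-negative and sum to one); the fused kernel is then
    applied as an ordinary convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, K=4, reduction=4):
        super().__init__()
        self.K, self.out_ch, self.k = K, out_ch, kernel_size
        self.weight = nn.Parameter(
            torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        self.attn = nn.Sequential(          # global average pooling + small MLP
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(4, in_ch // reduction)), nn.ReLU(inplace=True),
            nn.Linear(max(4, in_ch // reduction), K))

    def forward(self, x):
        b, c, h, w = x.shape
        pi = torch.softmax(self.attn(x), dim=1)                     # B x K
        weight = torch.einsum('bk,koivu->boivu', pi, self.weight)   # per-sample kernels
        bias = torch.einsum('bk,ko->bo', pi, self.bias)
        # Run all samples at once as a grouped convolution (one group per sample).
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       weight.reshape(b * self.out_ch, c, self.k, self.k),
                       bias.reshape(b * self.out_ch),
                       padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, h, w)
```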
On the basis of the aforementioned enhancements, Figure 7 depicts the framework of the YOLOv7-CA dynamic model proposed in this paper. The improved portion is depicted in the figure’s red box.

4. Experiments

The experimental environment consisted of Windows 10, Python 3.7, and PyTorch 1.7.1. Table 1 displays the relevant hardware configuration and model parameters; the models were trained for 300 epochs.

4.1. Dataset

For our experiments, we employed the floating garbage dataset captured from the vantage point of an unmanned boat, curated by Orca Tech. This dataset is a pioneering endeavor: it is the first floating waste detection dataset collected from the perspective of an unmanned boat in a real inland river environment. Our primary source of data was the FloW-Img [28] sub-dataset, notable for its emphasis on small target identification. In fact, over half of the targets in this dataset have dimensions smaller than 32 × 32 px, making it apt for studies that focus on the detection of small targets.
The dataset comprises 20,000 images, meticulously captured across diverse lighting and wave conditions, as well as varying orientations and perspectives. To accommodate the requisite experimental specifications, we meticulously partitioned the dataset into distinct training, validation, and test sets. This partitioning adhered to an 8:1:1 ratio, effectively ensuring a robust experimental foundation. For a visual depiction of the dataset, please refer to Figure 8.
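A simple way to produce the 8:1:1 split is sketched below; the folder name, file extension, and the YOLO-style split lists are assumptions about the on-disk layout rather than part of the dataset itself.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("FloW_Img/images").glob("*.jpg"))  # assumed layout
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],   # remaining ~10%
}
for name, files in splits.items():
    # One image path per line, the format expected by YOLO-style training code.
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in files))
```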

4.2. Evaluation Metrics

In this study, we compared the efficacy of the improved network models across various image types under a consistent experimental environment. Our objective was to assess the impact on detection accuracy, false detections, and missed detections. For this purpose, we used the following metrics: the precision–recall (P–R) curve, the number of parameters, the computational complexity, the mean Average Precision (mAP), and the F1-score.
Precision measures the proportion of samples predicted as positive that are actually positive. Precision is defined as follows:
$$P = \frac{TP}{TP + FP}$$
where TP is the number of samples predicted as positive that are in fact positive and FP is the number of samples predicted as positive that are not.
Recall measures the proportion of actual positive samples that are correctly predicted as positive, i.e., whether the detection targets are found at all. Recall is defined as follows:
$$R = \frac{TP}{TP + FN}$$
in which FN denotes the number of samples that are actually positive but are predicted to be negative.
The mAP is used to assess the overall effectiveness of the algorithm by averaging the Average Precision (AP) values for all categories. As stated in its definition, the mAP is as follows:
$$AP = \int_{0}^{1} P \, \mathrm{d}R$$
$$mAP = \frac{1}{c}\sum_{j=1}^{c} AP_j$$
The F1-score, also known as the Balanced F-Score (BFS), is defined as the harmonic mean of precision and recall. Its formula is as follows:
$$F1\text{-}score = \frac{2 \times precision \times recall}{precision + recall}$$
The F1-score metric combines the results of the precision and recall outputs. The value of the F1-score ranges from 0 to 1, with 1 representing the best output of the model and 0 representing the worst output of the model.
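As a small worked example of these definitions, the helper below computes precision, recall, and the F1-score from raw detection counts (which detections count as TP depends on the IoU matching threshold used by the evaluator).

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1-score from true positives, false positives
    and false negatives, following the formulas above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 11 of 13 targets found, with 1 false alarm.
print(precision_recall_f1(tp=11, fp=1, fn=2))  # approx (0.917, 0.846, 0.880)
```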
The parameter count signifies the aggregate count of parameters that are necessary for learning within a deep neural network. Within the framework of deep learning, each individual neuron possesses a weight, and these weights are iteratively determined through the training process. The total number of parameters in a deep neural network is an amalgamation of all these weights. This encompasses the weights that interconnect the inputs and outputs, alongside the inclusion of the bias terms associated with all neurons.
Computation, in the context of deep learning models, pertains to the quantity of floating-point operations essential for executing both forward and backward propagation. Typically, a sequence of multiplication followed by addition is treated as a single operation, with emphasis on the greater computational load posed by multiplication. Within deep learning, the computational effort of a neural network is primarily defined by the total of the convolution, multiplication, and addition operations. Given the substantial computational demands inherent in deep neural networks, these models require robust computational resources for both the training and inference tasks.
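For reference, parameter counts such as those reported later in Table 2 can be obtained directly from a PyTorch model as sketched below; the toy network here is only a stand-in, not the YOLOv7 definition.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of learnable parameters (weights and biases)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stand-in example; the real count would be taken on the YOLOv7 model object.
toy = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
print(f"{count_parameters(toy) / 1e6:.4f} M parameters")
```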

4.3. Experimental Results and Comparative Analysis

In this section, the efficacy of introducing both the attention mechanism and the dynamic convolutional module is rigorously verified through ablation tests. These tests served to assess the impact of the individual modules on the algorithm’s efficiency, all conducted under identical conditions. The outcomes of these ablation tests are meticulously documented in Table 2. Notably, the incorporation of the improved SPPFCSPC module yielded a significant boost in the efficiency of the enhanced algorithm.
Table 2 offers insightful comparative data, demonstrating the performance improvements achieved through various attention mechanisms. Notably, the Coordinate Attention (CA) module emerged as the most impactful, yielding a 2.9 percentage point gain in mAP over the baseline, higher than that obtained with the SE and CBAM modules. The improved Spatial Pyramid Pooling-Fast, Cross-Stage Partial Channel (SPPFCSPC) module on its own contributed a noteworthy increase of 3.0 percentage points in the mean Average Precision (mAP). This underlines the efficacy of the proposed combination of feature structures, especially in facilitating the identification and recognition of small targets.
Furthermore, the incorporation of the dynamic convolution module presented in this study significantly bolstered the original model’s adaptability to targets of varying sizes, raising the mAP by 3.5 percentage points. Ultimately, the enhanced YOLOv7-CA Dynamic algorithm achieved a 5.2 percentage point improvement in the mAP. Remarkably, this improvement was realized with only about a 10% increase in computational cost, reinforcing the algorithm’s efficacy in experimental comparisons with the original counterpart.
Figure 9 visually demonstrates the thermal image comparison between the baseline YOLOv7 model and its variations after integrating distinct attention mechanisms. Upon the observation of the figure, a notable trend emerges. Upon incorporating the Coordinate Attention (CA) mechanism, the model exhibited a heightened emphasis on the target under consideration. Consequently, it adeptly captured precise positional information, thereby enhancing both the localization and identification accuracy for small targets. This effect was distinctly valuable in bolstering the overall model accuracy.

4.4. Experimental Comparison of YOLOv7 Network Model and Improved Network Model

Figure 10 compares the P–R curves for the detection of tiny objects floating on the water’s surface before and after the improvement. The AP value for the bottle class is given by the area enclosed by the P–R curve and the coordinate axes. If, after training on the data, the P–R curve of learner A completely envelops the P–R curve of another learner B, it can be asserted that A performs better than B. The figure clearly shows that, on the small target dataset, the P–R curve of the improved YOLOv7 network model completely envelops that of the baseline model. Therefore, the new model proposed in this paper performs better than the benchmark model in small target detection.
The effectiveness of the YOLOv7-CA dynamic network model was comprehensively evaluated across three distinct scenarios representing real-world images. These scenarios encompassed target-dense images, small target images with intricate backgrounds, and images featuring ultra-small targets. The detection outcomes of both the fundamental YOLOv7 network model and the YOLOv7-CA dynamic network model are showcased through Figure 11, Figure 12 and Figure 13.
In Figure 11, which pertains to target-intensive images, the original image hosts a total of 13 targets. The initial network model successfully detected 11 out of these 13 targets, resulting in 2 missed detections. In contrast, the improved network model attained a perfect detection rate, capturing all 13 targets.
Moving to Figure 12, which depicts small target images set against complex backgrounds, the original network model faltered in accurately detecting the small targets and erroneously generated false detections due to the background interference. The enhanced network model surmounted these challenges and effectively identified the small targets, even in the presence of intricate background interference.
The ultra-small target image illustrated in Figure 13, characterized by a minuscule target box size of 0.05 × 0.04, posed a substantial challenge. In this context, the original network model experienced missed detections. Notably, the improved network model continued to excel, successfully detecting all the ultra-small targets despite their small proportions.

4.5. Improving the YOLOv7 Network Model versus Other Network Models

The efficacy of the enhanced network model presented in this research, the YOLOv7-CA dynamic network model, was assessed by contrasting it with other network models. For this comparative analysis, we maintained consistency in the configuration environment and initial training settings across the models. The results are presented in Table 3. It is evident that the YOLOv7-CA dynamic network model outperformed the other conventional network models in terms of the mean Average Precision (mAP) for images of identical sizes. This enhanced performance establishes its superior suitability for scenarios involving the identification of small targets.

5. Conclusions

This research introduced an enhanced YOLOv7-CA dynamic detection model to address the challenge of recognizing small targets effectively. By incorporating the principles of separation and merging, we improved the SPPCSPC module into the SPPFCSPC module to extract intricate details from images more thoroughly. Additionally, we introduced the CA module, synergizing channel attention with spatial attention. Building upon these advancements, we further introduced a dynamic convolution module to counteract the false and missed detections that stem from significant variations in small target sizes and pronounced background interference. This dynamic convolution module adapts to diverse target sizes and backgrounds, consequently substantially enhancing detection accuracy. Empirical evidence corroborates that the improved YOLOv7-CA dynamic network model surpasses both standard classical target-detection network models and the original network model in terms of accurate target detection.
Nonetheless, it is important to acknowledge the limitations inherent in the model presented in this study. For instance, the dataset’s sample size remains relatively small, and the diversity of background variations among the targets is limited. Moreover, the model we proposed is characterized by a considerable number of parameters and computations, which can demand a significant amount of computational resources for its practical implementation.
Given the relatively modest number of categories encompassed by the dataset used in this paper, our future endeavors will be directed towards its augmentation. We are committed to progressively expanding the detection categories within the dataset, thereby broadening the spectrum of detectable entities and augmenting the algorithm’s overall applicability. Moreover, we envisage future efforts focused on lightweighting the model. This strategic initiative aims to optimize the model for reduced computational demands during real-world deployment, effectively bolstering the algorithm’s performance and applicability across diverse practical scenarios. By doing so, we anticipate a heightened capability of the model to address various small-target-detection tasks.

Author Contributions

Conceptualization, K.L., Y.W. and Z.H.; data curation, K.L., Y.W. and Z.H.; methodology, K.L. and Y.W.; writing—original draft, K.L. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shaanxi Province, China (No. 2020JM-499 and No. 2020JQ-684) and the National Natural Science Foundation of China (No. 61803294).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this paper can be downloaded from the website: http://www.orca-tech.cn/datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gao, X.B.; Mo, M.J.C.; Wang, H.T.; Leng, J. Recent advances in small object detection. J. Data Acquis. Process. 2021, 36, 391–417. (In Chinese) [Google Scholar]
  2. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. I–511–I–518. Available online: http://ieeexplore.ieee.org/document/990517/ (accessed on 1 January 2023).
  3. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  4. Gu, Y.L.; Zong, X.X. A review of object detection study based on deep learning. Mod. Inf. Technol. 2022, 6, 76–81. (In Chinese) [Google Scholar]
  5. Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small object detection. In Proceedings of the IEEE International Conference on Computer Vision, Las Vegas, NV, USA, 27–30 June 2016; pp. 214–230. [Google Scholar]
  6. Krishna, H.; Jawahar, C.V. Improving small object detection. In Proceedings of the 4th IAPR Conference on Pattern Recognition, Nanjing, China, 26–29 November 2017; pp. 340–345. [Google Scholar]
  7. Zhang, W.; Wang, S.; Thachan, S.; Chen, J.; Qian, Y. Deconv RCNN for small object detection on remote sensing images. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 2483–2486. [Google Scholar]
  8. Zhao, J.K.; Sun, J.; Han, R.; Chen, S. Object detection based on improved Faster RCNN for remote sensing image. Comput. Appl. Softw. 2022, 39, 192–196+290. (In Chinese) [Google Scholar]
  9. Jia, K.X.; Ma, Z.H.; Zhu, R.; Li, Y. Attention-mechanism based light single shot multiBox detector modelling improvement for small object detection on the sea surface. J. Image Graph. 2022, 27, 1161–1175. (In Chinese) [Google Scholar]
  10. Han, J.; Ding, J.; Xue, N.; Xia, G.-S. ReDet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2785–2794. [Google Scholar]
  11. Zand, M.; Etemad, A.; Greenspan, M. Oriented bounding boxes for small and freely rotated objects. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  12. Yu, D.; Xu, Q.; Guo, H.; Xu, J.; Lu, J.; Lin, Y.; Liu, X. Anchor-free arbitrary oriented object detector using box boundary-aware vectors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2535–2545. [Google Scholar] [CrossRef]
  13. Zhu, X.K.; Lü, S.C.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  14. Fu, H.X.; Song, G.Q.; Wang, Y.C. Improved YOLOv4 marine target detection combined with CBAM. Symmetry 2021, 13, 623. [Google Scholar] [CrossRef]
  15. Benjumea, A.; Teeti, I.; Cuzzolin, F.; Bradley, A. YOLO-Z: Improving Small Object Detection in YOLOv5 for Autonomous Vehicles. 2022. Available online: https://arxiv.org/abs/2112.11798 (accessed on 1 January 2023).
  16. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970. [Google Scholar] [CrossRef]
  17. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 1–20. [Google Scholar] [CrossRef] [PubMed]
  18. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Realtime Object Detectors. 2022. Available online: https://arxiv.org/abs/2207.02696 (accessed on 1 January 2023).
  19. Song, Q.; Li, S.; Bai, Q.; Yang, J.; Zhang, X.; Li, Z.; Duan, Z. Object detection method for grasping robot based on improved YOLOv5. Micromachines 2021, 12, 1273. [Google Scholar] [CrossRef] [PubMed]
  20. Zhou, T.; Li, J.; Wang, S.; Tao, R.; Shen, J. Matnet: Motion-attentive transition network for zero-shot video object segmentation. IEEE Trans. Image Process. 2020, 29, 8326–8338. [Google Scholar] [CrossRef] [PubMed]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  24. Yi, W.; Wang, B. Research on Underwater Small Target Detection Algorithm Based on Improved YOLOv7. IEEE Access 2023, 11, 66818–66827. [Google Scholar] [CrossRef]
  25. Yu, C.; Feng, Z.; Wu, Z.; Wei, R.; Song, B.; Cao, C. HB-YOLO: An Improved YOLOv7 Algorithm for Dim-Object Tracking in Satellite Remote Sensing Videos. Remote Sens. 2023, 15, 3551. [Google Scholar] [CrossRef]
  26. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  27. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE Press: New York, NY, USA, 2020; pp. 11027–11036. [Google Scholar]
  28. Cheng, Y.; Zhu, J.; Jiang, M.; Fu, J.; Pang, C.; Wang, P.; Sankaran, K.; Onabola, O.; Liu, Y.; Liu, D.; et al. FloW: A Dataset and Benchmark for Floating Waste Detection in Inland Waters. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10933–10942. [Google Scholar]
  29. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2016; Volume 9905. [Google Scholar] [CrossRef]
  30. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Proc. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  32. Zhang, X.; Guo, W.; Xing, Y.; Wang, W.; Yin, H.; Zhang, Y. AugFCOS: Augmented fully convolutional one-stage object detection network. Pattern Recognit. 2023, 134, 109098. [Google Scholar] [CrossRef]
  33. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8510–8519. [Google Scholar] [CrossRef]
  34. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3510–3519. [Google Scholar]
Figure 1. YOLOv7 network architecture diagram.
Figure 2. SPPCSPC module structure diagram.
Figure 3. SPPFCSPC module structure diagram.
Figure 4. CA operation.
Figure 5. Dynamic perceptron.
Figure 6. Dynamic convolution structure diagram.
Figure 7. YOLOv7-CA dynamic structure diagram.
Figure 8. FloW-Img sub-dataset.
Figure 9. Thermal image comparison of different attention mechanisms.
Figure 10. P–R curve before and after improvement.
Figure 11. Dense target image detection effect comparison.
Figure 12. Complex background image detection effect comparison.
Figure 13. Comparison of the effect of ultra-small target image detection.
Table 1. Experiment-related hardware configuration and model parameters.

Device | Configuration | Parameter | Parameter Value
GPU | RTX3090 × 4 | Image size/pixels | 720 × 720
CPU | Intel Xeon Gold 5218 × 2 | Learning rate | 0.01
CUDA | 11.0 | Optimizer | Adam
CuDNN | 10.2 | Batch size | 32
Table 2. Ablation experiments.

Algorithm | Params (MB) | FLOPs (GB) | mAP (%) | F1-Score (%)
YOLOv7 | 37.6 | 106.5 | 75.9 | 74.2
+SE | 37.6 | 106.5 | 77.6 | 74.9
+CBAM | 37.6 | 106.5 | 77.5 | 76.7
+CA | 37.7 | 106.7 | 78.8 | 77.3
+SPPFCSPC | 37.6 | 106.5 | 78.9 | 77.9
+Dynamic | 51.4 | 117.0 | 79.4 | 78.6
+CA+Dynamic | 51.5 | 117.2 | 80.2 | 79.4
+CA+SPPFCSPC | 51.5 | 117.2 | 79.5 | 78.3
+Dynamic+SPPFCSPC | 51.5 | 117.2 | 80.5 | 78.9
YOLOv7-CA Dynamic | 51.5 | 117.2 | 81.1 | 79.5
Table 3. Comparison with other algorithms.

Algorithm | Image Size (px) | mAP@0.5 | mAP@0.5:0.9
SSD [29] | 720 × 720 | 0.696 | 0.298
Faster RCNN [30] | 720 × 720 | 0.402 | 0.151
YOLOv3 [31] | 720 × 720 | 0.453 | 0.187
YOLOv4 | 720 × 720 | 0.537 | 0.206
YOLOv5 | 720 × 720 | 0.561 | 0.243
YOLOv7 | 720 × 720 | 0.759 | 0.331
FCOS [32] | 720 × 720 | 0.437 | 0.164
VFNet [33] | 720 × 720 | 0.551 | 0.237
TOOD [34] | 720 × 720 | 0.511 | 0.219
Ours | 720 × 720 | 0.811 | 0.381


