Article

A Novel Crop Pest Detection Model Based on YOLOv5

Software College, Jiangxi Agricultural University, Nanchang 330045, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(2), 275; https://doi.org/10.3390/agriculture14020275
Submission received: 6 December 2023 / Revised: 14 January 2024 / Accepted: 28 January 2024 / Published: 8 February 2024
(This article belongs to the Special Issue Advances in Integrated Pest Management Strategies)

Abstract

The damage caused by pests to crops results in reduced crop yield and compromised quality. Accurate and timely pest detection plays a crucial role in helping farmers to defend against and control pests. In this paper, a novel crop pest detection model named YOLOv5s-pest is proposed. Firstly, we design a hybrid spatial pyramid pooling fast (HSPPF) module, which enhances the model's capability to capture multi-scale receptive field information. Secondly, we design a new convolutional block attention module (NCBAM) that highlights key features, suppresses redundant features, and improves detection precision. Thirdly, the recursive gated convolution ($g^3Conv$) is introduced into the neck, which extends the potential of the self-attention mechanism to explore feature representations in arbitrary-order space, enhancing model capacity and detection capability. Finally, we replace the non-maximum suppression (NMS) in the post-processing part with Soft-NMS, which alleviates the problem of missed detections in crowded and dense scenes. The experimental results show that the mAP@0.5 (mean average precision at an intersection over union (IoU) threshold of 0.5) of YOLOv5s-pest achieves 92.5% and the mAP@0.5:0.95 (mean average precision over IoU thresholds from 0.5 to 0.95) achieves 72.6% on the IP16 dataset. Furthermore, we also validate our proposed method on other datasets, and the outcomes indicate that YOLOv5s-pest is also effective in other detection tasks.

1. Introduction

The frequent infestations of pests pose a substantial threat to agricultural productivity, leading to considerable economic losses [1]. Consequently, identifying effective strategies to combat pest and disease issues, enhance crop yields, and minimize economic losses in agriculture has emerged as a paramount concern [2,3,4]. In the past, pest detection tasks required agricultural experts to work in the field. However, the large agricultural planting area and diverse pest species led to labor-intensive and time-consuming work and a shortage of agricultural experts [5,6]. As a result, there was an urgent need to research pest detection technologies to meet the growing demands of the market. However, pest detection based on machine learning (ML) required manual feature design, which was not only inefficient, cumbersome, and error-prone, but also extremely susceptible to background and lighting conditions [7]. Therefore, the use of ML techniques to achieve accurate and robust pest detection continued to face significant challenges.
In recent years, convolutional neural networks (CNNs), a branch of deep learning (DL), have brought new solutions to these challenges and have attracted significant attention [8]. Several excellent classical object detection models based on CNNs have been proposed by researchers, including two-stage models such as RCNN [9], Fast-RCNN [10], Faster-RCNN [11], and Mask-RCNN [12], and one-stage models such as the you only look once (YOLO) series [13,14,15,16,17] and the single shot multibox detector (SSD) [18]. These models greatly improve feature extraction ability, detection speed, and precision. However, due to the unique characteristics of samples and different practical application requirements, existing models may perform poorly in certain scenarios, making the detection task still challenging.
In order to enhance the overall performance of detection models, domestic and foreign scholars have improved existing models or proposed new models. To achieve real-time monitoring of orchard pests, Zhang et al. [19] compared the placement of ghost modules in different locations. The mean average precision (mAP) of their proposed method was 1.5% higher than that of YOLOv5 and required fewer parameters and GFLOPs. To realize a rapid, lightweight passion fruit pest detection model, Li et al. [20] added a convolutional block attention module (CBAM) to the YOLOv5 model's neck to enhance target object detection and used a point-line distance intersection over union (PLDIoU) loss function to reduce calculation and detection time. To expand the training set and prevent overfitting, they applied a hybrid online data augmentation algorithm. The mAP of their model was 1.59% higher than that of YOLOv5 while retaining its lightweight and real-time capabilities. To achieve precise detection of unopened bolls, Zhang et al. [21] incorporated the squeeze and excitation (SE) attention module into the end of the cross-stage partial (CSP) module and modified the residual block in CSP to a dense block structure inspired by DenseNet. They replaced the path aggregation network (PAN) module with a bidirectional feature pyramid network (BiFPN) to achieve advanced feature fusion performance. Their experiments showed that their proposed method is superior in terms of detection precision, speed, model size, and computational cost compared with YOLOv3, SSD, Faster-RCNN, and YOLOv5. To address the issues of low efficiency and unreliable cotton pest detection, Zhang et al. [22] used efficient channel attention (ECA) to mitigate the impact of complex backgrounds, adopted the focal loss function in YOLOX to address precision loss caused by sample imbalance, and utilized the hard swish activation function to achieve better performance. The experimental findings demonstrated that their model's mAP increased by 11.5%, 21.17%, 9.34%, 10.22%, and 8.33% compared to Faster R-CNN, SSD, YOLOv3, YOLOv4, and YOLOv5, respectively. Xiang et al. [23] addressed the issue of feature extraction for small pests by introducing the CAC3 module into the backbone network. They enhanced feature extraction in the neck with the ConvNext module. The mAP@0.5 of their proposal reached 91.9% on the teddy cup pest dataset, representing an almost 8% improvement over the original model. Yang et al. [24] replaced part of the backbone network with CSPResNeXt and the ELAN-W module with the VoVGSCSP module. This modification reduced model parameters and computational workload while maintaining the original precision and improving detection speed on a maize pest dataset. Chen et al. [25] introduced a refinement method for pest detection based on a two-stage detection framework. They proposed the context feature enhancement module, the region of interest (RoI) feature fusion module, and the non-task separation module to optimize the two-stage algorithm. This approach obtained features at various scales, adaptively weighted and fused pest features, and decoupled the features of the classification and localization networks, further enhancing pest detection precision. The mAP of their proposed model reached 72.7% on the SimilarPest5 dataset, outperforming other state-of-the-art object detection methods. Dai et al. [26] incorporated the Swin transformer (SWinTR) and transformer (C3TR) mechanisms into the backbone network of YOLOv5m to increase the receptive field and capture more global features. They replaced spatial pyramid pooling (SPP) with ResSPP to extract more features. Finally, they replaced the three C3 outputs in the neck with SWinTR and the concatenate (Concat) operation with WConcat to enhance the network's feature fusion capabilities. Experimental results showed that the mAP of the proposed model reached 96.4% on the pest dataset. Moreover, it outperformed the original YOLOv3, YOLOv4, and YOLOv5m models.
From the above literature, object detection plays a crucial role in the prevention and control of plant diseases and pests. It enables the identification and localization of pests on crops, facilitating the prompt implementation of appropriate control measures. However, while classic models have been rigorously trained, validated, and tested on datasets like COCO and VOC, their efficacy might be suboptimal when applied to agricultural pest scenarios. To address this issue, we create the IP16 dataset from the IP102 dataset specifically for agricultural pest detection. This dataset encompasses a series of challenges, including variations in target sizes and complex background noise interference. In order to help farmers achieve accurate and efficient pest detection, we propose a high-performance YOLOv5s-pest model. Our method focuses on the enhancement of the model's feature extraction capabilities, thereby improving the model's performance in pest detection tasks. The contributions of this paper are as follows:
We design the hybrid spatial pyramid pooling fast (HSPPF) module, replacing the SPPF module at the end of the YOLOv5 backbone and enhancing the backbone's ability to extract pest information.
Our innovation, the C3NCBAM module, replaces the conventional C3 module, amplifying the model’s feature extraction capability without incurring a significant surge in parameters or computational demands.
Leveraging the $g^3Conv$ module, we optimize certain Conv modules in the model's neck, augmenting the model's modeling capability and refining its pest detection precision.
We employ the soft non-maximum suppression (Soft-NMS) to replace the NMS to minimize the missed detection rate of pests.
Furthermore, we also create the IP16 dataset, a resourceful repository tailored for training and validating pest detection algorithms, to validate our proposed method. From the experimental results, it can be concluded that our approach, YOLOv5s-pest, achieves the best balance between precision and computational complexity.
For the convenience of the readers, the subsequent sections of this paper unfold as follows: Section 2 provides a comprehensive exposition of the materials and methods employed in our study. Section 3 presents the results and discussion, where we report the findings from our experimental analyses and engage in a detailed discourse. Section 4 concludes the paper.

2. Materials and Methods

2.1. Data Acquisition

Currently, there is a limited availability of publicly accessible datasets for pest detection. In this paper, we create a pest dataset named IP16 by selecting 14 classes of pests from the IP102 [27] dataset. These pests have complex backgrounds and diverse characteristics, and are mainly found in eight common crop categories: rice, wheat, beet, alfalfa, corn, citrus, mango, and vine. They represent typical pests of common crops. We divide wireworm and rice leaf roller into larvae and adults, resulting in 16 classes, as shown in Figure 1. Then, we use the LabelImg (version 1.8.6, accessed on 12 January 2023) software to annotate the pests in the images. After selecting and labeling images of the 16 pest classes, we obtain 2696 raw images containing pests of various sizes, occluded targets, and complex backgrounds.

2.2. Data Augmentation

To avoid the validation set containing training set images, we partition the raw images into a raw training set and a raw validation set at a ratio of 8:2 before data augmentation, as illustrated in Figure 2a. To address the class imbalance in the data and improve the model's generalization ability, we apply data augmentation methods to the raw training set and raw validation set. The data augmentation methods are divided into simple augmentation and image fusion, where the simple augmentation methods involve adding Gaussian noise, adding salt and pepper noise, rotating images, and scaling images.
Image fusion is used for the combination of different images, which can combine different raw images or images enhanced through simple augmentation. We select two or four images from the same class to fuse into one multi-objective image, or select images from two or four classes to fuse into one image with multiple classes of pests.
After the above operations (simple augmentation and image fusion), we obtain the training set and the validation set, containing 8960 and 2240 images, respectively, for a total of 11,200 images, which constitute the required dataset IP16. Each class of IP16 has at least 2000 instances, as shown in Table 1. Taking the training set as an example, the augmentation process of two images in the training set is shown in Figure 2b.
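For illustration, the simple augmentation operations described above (Gaussian noise, salt-and-pepper noise, rotation, and scaling) could be implemented along the following lines. This is a minimal NumPy/OpenCV sketch with illustrative parameter values, not the exact pipeline used to build IP16; in practice, the bounding-box annotations must be transformed together with the images when rotating or scaling.

```python
import cv2
import numpy as np

def add_gaussian_noise(img, sigma=10.0):
    # Add zero-mean Gaussian noise to a uint8 image.
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def add_salt_pepper_noise(img, amount=0.01):
    # Flip a random fraction of pixels to black or white.
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < amount / 2] = 0
    out[mask > 1 - amount / 2] = 255
    return out

def rotate(img, angle=15.0):
    # Rotate about the image centre, keeping the original canvas size
    # (box coordinates must be rotated with the same matrix).
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

def scale(img, factor=1.2):
    # Uniformly rescale the image (box coordinates must be rescaled accordingly).
    h, w = img.shape[:2]
    return cv2.resize(img, (int(w * factor), int(h * factor)))
```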

2.3. YOLOv5 Model

We choose version 7.0 of YOLOv5 for the experiments in this paper. The network structure of YOLOv5 is divided into Backbone, Neck, and Prediction. The Backbone is responsible for extracting image features. The main task of the Neck is to fuse the features provided by the Backbone. The Prediction part takes the three feature maps of sizes 20 × 20, 40 × 40, and 80 × 80 generated by the Neck and uses them to determine the positions of objects in the image and classify them. The YOLOv5 model is composed of modules including Conv, CBS, C3, Upsample, SPPF, and Concat, where CBS is Conv + batch normalization (BN) + sigmoid linear unit (SiLU). SiLU is a commonly used activation function. SPPF consists of CBS and three maximum pooling (MaxPool) layers, where the kernel size (k) of MaxPool is 5. The function of Concat is to concatenate channels along a specific dimension. The purpose of Upsample is to increase the resolution of the feature maps. The structure of the YOLOv5 model and its modules is depicted in Figure 3.
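For reference, the CBS block and the SPPF module described above can be sketched in PyTorch as follows. The structure (a 1 × 1 CBS, three chained 5 × 5 MaxPool layers, concatenation, and a final 1 × 1 CBS) mirrors Figure 3, but this is a simplified re-implementation rather than the Ultralytics source code.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    # Conv + BatchNorm + SiLU, the basic block of YOLOv5.
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    # Spatial pyramid pooling (fast): three chained 5x5 MaxPools, then concat.
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, k=1)
        self.cv2 = CBS(c_hidden * 4, c_out, k=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```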

2.4. HSPPF Module

The primary purpose of SPP [28] is to generate fixed-size output, effectively capturing information across various receptive fields through MaxPool operations with k = 1, 5, 9, and 13. The parallel pooling structure enhances the flexibility and performance of the model. The detailed structure of the SPP module is illustrated in Figure 4a.
Ultralytics proposes a variant called SPPF based on SPP. Compared to SPP, SPPF also has different large receptive fields and fixed-size output, but it significantly reduces computation, thereby enhancing the speed of the model. The detailed structure of the SPPF module is illustrated in Figure 4b.
We propose the HSPPF to replace SPPF in this paper. This module integrates both average pooling (AvgPool) and MaxPool across various receptive fields. AvgPool retains more background information about the image and emphasizes the overall feature information during downsampling. MaxPool retains more texture information, selecting the most discriminative features and discarding irrelevant information. The HSPPF module enhances detection precision by fusing features from pooling operations with different receptive fields. Therefore, we utilize the HSPPF module to further enhance the performance of the model. The detailed structure of the HSPPF module is depicted in Figure 4c.
Given an input $x \in \mathbb{R}^{H \times W \times C}$, the HSPPF module works as follows: it first obtains $F_c = CBS(x) \in \mathbb{R}^{H \times W \times 0.5C}$ through the first CBS. Subsequently, after traversing three MaxPool layers, it obtains $F_m^{(0)}$, $F_m^{(1)}$, and $F_m^{(2)}$. Likewise, after passing through three AvgPool layers, it obtains $F_a^{(0)}$, $F_a^{(1)}$, and $F_a^{(2)}$. Finally, it concatenates $F_c$, $F_m^{(0)}$, $F_m^{(1)}$, $F_m^{(2)}$, $F_a^{(0)}$, $F_a^{(1)}$, and $F_a^{(2)}$, resulting in $F_{ConCat} \in \mathbb{R}^{H \times W \times 3.5C}$, as shown in Equations (1)–(3). This process aggregates feature maps through MaxPool and AvgPool operations with various receptive fields and concatenates them to obtain abundant features that combine the advantages of both pooling techniques.
$$F_{ConCat} = \Bigl[F_c,\; \bigl\{F_m^{(i)}\bigr\}_{0 \le i \le 2},\; \bigl\{F_a^{(i)}\bigr\}_{0 \le i \le 2}\Bigr] \tag{1}$$

$$F_m^{(i)} = \begin{cases} F_m(F_c), & i = 0, \\ F_m\bigl(F_m^{(i-1)}\bigr), & 1 \le i \le 2. \end{cases} \tag{2}$$

$$F_a^{(i)} = \begin{cases} F_a(F_c), & i = 0, \\ F_a\bigl(F_a^{(i-1)}\bigr), & 1 \le i \le 2. \end{cases} \tag{3}$$

where $F_m^{(i)}$ is the output through MaxPool, $F_a^{(i)}$ is the output through AvgPool, and $[\cdot,\cdot,\cdot]$ is the channel concatenation operation.
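A minimal PyTorch sketch of the HSPPF module following Equations (1)–(3) is given below. The 5 × 5 pooling kernel and the final 1 × 1 CBS that projects the 3.5C concatenated channels back to the output dimension are assumptions made for illustration; Figure 4c defines the exact layout.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    # Conv + BatchNorm + SiLU.
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class HSPPF(nn.Module):
    # Hybrid SPPF: chained MaxPool and AvgPool branches, Equations (1)-(3).
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2                        # F_c has 0.5C channels
        self.cv1 = CBS(c_in, c_hidden, k=1)
        self.maxpool = nn.MaxPool2d(k, 1, k // 2)
        self.avgpool = nn.AvgPool2d(k, 1, k // 2)
        # 7 branches of 0.5C channels each = 3.5C; project back to c_out (assumed).
        self.cv2 = CBS(c_hidden * 7, c_out, k=1)

    def forward(self, x):
        fc = self.cv1(x)
        fm, fa = [fc], [fc]
        for _ in range(3):
            fm.append(self.maxpool(fm[-1]))         # F_m(0..2)
            fa.append(self.avgpool(fa[-1]))         # F_a(0..2)
        # F_ConCat = [F_c, F_m(0..2), F_a(0..2)]
        return self.cv2(torch.cat([fc] + fm[1:] + fa[1:], dim=1))
```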

2.5. NCBAM

Attention mechanisms commonly used today include SE [29], CBAM [30], ECA [31], and coordinate attention (CA) [32]. SE focuses solely on channel importance and overlooks spatial information. CBAM adds a spatial attention module (SAM) after the channel attention module (CAM), thereby achieving a dual mechanism that includes both channel information and spatial information. Nonetheless, it still falls short of capturing long-range dependencies within the feature maps. Both ECA and CA build upon SE's emphasis on channel relationships.
In this paper, we propose a new attention mechanism, called NCBAM. This module comprises two components: an improved coordinate attention module (ICAM) and a SAM. It makes up for CA's lack of spatial attention and CBAM's inability to capture long-range correlations in feature maps by emphasizing important information in both the channel and spatial dimensions while suppressing non-essential information. The combination of NCBAM and C3 enhances the model's complexity, improves its feature extraction capability, and enhances precision while maintaining detection speed. Assume the input feature is $x = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the channel, height, and width of the input feature, respectively; $y = [y_1, y_2, \ldots, y_C] \in \mathbb{R}^{C \times H \times W}$ is the output of ICAM, while $M_s$ is the output of the gray dotted box of SAM. The detailed structure of the new convolutional block attention module (NCBAM) is depicted in Figure 5.

2.5.1. ICAM

CA, as a novel and effective attention mechanism, captures cross-channel information as well as direction-aware and position-sensitive information, thereby improving the model's ability to detect pests accurately. The encoding process of CA is divided into two steps: coordinate information embedding and coordinate attention generation. To enhance the robustness of the model and obtain rich features, we add channel shuffle [33] to the coordinate attention generation stage to reshuffle the channels and reduce the influence of channel order. The improved attention mechanism is called ICAM; its detailed structure is shown in Figure 6.
Coordinate information embedding: To address the problem that global pooling easily loses position information, the input feature $x$ is decomposed into two one-dimensional feature encoding processes (along the X and Y directions). This decomposition generates a pair of direction-aware feature maps along the two spatial directions, so that CA captures long-range dependencies while preserving precise location information. Assume that the $c$-th channel of $x$ is represented by $x_c(i, j)$. Each channel is encoded along the horizontal and vertical coordinates using pooling kernels of size $(H, 1)$ and $(1, W)$, as shown in Equations (4) and (5).
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le j < W} x_c(h, j) \tag{4}$$

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le i < H} x_c(i, w) \tag{5}$$

where $z_c^h(h)$ is the output of the $c$-th channel at height $h$, and $z_c^w(w)$ is the output of the $c$-th channel at width $w$.
Coordinate attention generation: This step aims to effectively leverage the precise position information and the global receptive fields obtained in the previous step. To bolster the model's robustness and derive richer feature maps, we add channel shuffle to rearrange the channels. The process of coordinate attention generation is as follows:
① It first concatenates $z^h$ and $z^w$ from the previous step. Then, it applies the convolution function $f_{1 \times 1}$ with $k = 1$ to reduce the channels, as depicted in Equation (6).

$$f = s\Bigl(\delta\bigl(f_{1 \times 1}\bigl(\bigl[z^h, z^w\bigr]\bigr)\bigr)\Bigr), \quad f \in \mathbb{R}^{\frac{C}{r} \times 1 \times (H + W)} \tag{6}$$

where $\delta$ is the non-linear activation function, $r$ is the reduction ratio, $s$ is the channel shuffle operation, $[\cdot,\cdot]$ is the concatenation operation along a certain dimension, and $f$ is the output after the Concat, $f_{1 \times 1}$, $\delta$, and $s$ operations.
② Then, it splits $f$ along the horizontal and vertical directions into two separate tensors $f^h$ and $f^w$. Subsequently, two convolution functions $F_h$ and $F_w$ with $k = 1$ are applied to transform $f^h$ and $f^w$ into tensors with the same number of channels as the input $x$, as shown in Equations (7) and (8).

$$g^h = \sigma\bigl(F_h\bigl(f^h\bigr)\bigr) \tag{7}$$

$$g^w = \sigma\bigl(F_w\bigl(f^w\bigr)\bigr) \tag{8}$$

where $\sigma$ is the sigmoid function.
③ The $c$-th channel of $g^h$ is $g_c^h(i)$, and that of $g^w$ is $g_c^w(j)$. Finally, the output of the $c$-th channel of ICAM is given in Equation (9).

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \tag{9}$$
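The ICAM computation in Equations (4)–(9) can be sketched in PyTorch as follows. The reduction ratio, the number of shuffle groups, and the use of h-swish as the non-linearity $\delta$ are assumptions made for illustration (the original CA uses h-swish); the exact configuration follows Figure 6.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=4):
    # Rearrange channels across groups to reduce the influence of channel order.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class ICAM(nn.Module):
    # Improved coordinate attention: CA with channel shuffle in the
    # attention-generation stage (Equations (4)-(9)).
    def __init__(self, channels, reduction=32, groups=4):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # z^h: B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # z^w: B x C x 1 x W
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.Hardswish()                       # delta (assumed h-swish)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)
        self.groups = groups

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                             # B x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)         # B x C x W x 1
        f = self.act(self.conv1(torch.cat([z_h, z_w], dim=2)))
        f = channel_shuffle(f, self.groups)              # s(.) in Equation (6)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))            # Equation (7)
        g_w = torch.sigmoid(self.conv_w(f_w).permute(0, 1, 3, 2))  # Equation (8)
        return x * g_h * g_w                             # Equation (9)
```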

2.5.2. SAM

To exploit the spatial relationships among features, we add the SAM after the ICAM. The SAM module works as follows: it first uses MaxPool and AvgPool to pool the feature maps $y$ from ICAM. Then, it concatenates the channels from MaxPool and AvgPool, applies a convolution function $f^{7 \times 7}$ with $k = 7$, and activates the result using the sigmoid function, yielding the spatial attention feature map $M_s$. These feature maps effectively capture spatial information. The detailed structure of the SAM is illustrated in Figure 7. $M_s$ is given in Equation (10).

$$M_s(y) = \sigma\Bigl(f^{7 \times 7}\bigl(\bigl[F_a(y), F_m(y)\bigr]\bigr)\Bigr) \tag{10}$$

where $\sigma$ refers to the sigmoid function.
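A minimal PyTorch sketch of the SAM in Equation (10) is shown below. Multiplying the resulting map $M_s$ back onto the ICAM output $y$ follows the usual CBAM design and is assumed here.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    # Spatial attention: channel-wise average and max maps, a 7x7 conv,
    # then a sigmoid (Equation (10)).
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, y):
        avg_map = torch.mean(y, dim=1, keepdim=True)     # F_a(y): B x 1 x H x W
        max_map, _ = torch.max(y, dim=1, keepdim=True)   # F_m(y): B x 1 x H x W
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return y * m_s                                   # reweight the ICAM output (assumed)
```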

2.5.3. C3NCBAM

C3 utilizes a residual structure for learning, which facilitates the training of deeper networks while maintaining good performance. In our study, we incorporate the NCBAM module into C3, resulting in the creation of a new module called C3NCBAM. This module combines the advantages of C3 and NCBAM, reduces the loss of feature information, and improves the feature extraction ability and detection precision of the model. The detailed structure of C3NCBAM is depicted in Figure 8.
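Conceptually, NCBAM is the composition of ICAM followed by SAM, and C3NCBAM wraps this attention around the C3 module. A schematic sketch is given below; it reuses the ICAM and SAM classes sketched above and YOLOv5's standard C3 implementation, and placing the attention after the C3 output is an assumption made for illustration (Figure 8 defines the exact insertion point).

```python
import torch.nn as nn

class NCBAM(nn.Module):
    # NCBAM = ICAM (channel/coordinate branch) followed by SAM (spatial branch).
    # ICAM and SAM are the classes sketched in Sections 2.5.1 and 2.5.2.
    def __init__(self, channels):
        super().__init__()
        self.icam = ICAM(channels)
        self.sam = SAM()

    def forward(self, x):
        return self.sam(self.icam(x))

class C3NCBAM(nn.Module):
    # C3 with NCBAM attached; C3 refers to the standard YOLOv5 C3 module,
    # and the attention placement here (after the C3 output) is an assumption.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c3 = C3(c_in, c_out, n)
        self.att = NCBAM(c_out)

    def forward(self, x):
        return self.att(self.c3(x))
```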

2.6. $g^3Conv$ Module

The Conv operator of YOLOv5 has a limitation: the receptive field of small convolution kernels is restricted, and global information can only be captured by stacking multiple layers. However, increasing the number of layers can lead to a loss of fine-grained features. The transformer's self-attention mechanism enables the model to effectively capture global information, thereby overcoming the limitations of convolutional operations with small kernels and enhancing the model's expressive power.
With the great success of new vision transformer models [34,35,36,37,38], Rao et al. [39] introduced the input-adaptive, long-range, and high-order spatial interaction characteristics of vision transformers into CNNs and proposed the recursive gated convolution ($g^nConv$) to enable high-order spatial interaction through gated convolution and recursive design. The $g^nConv$ exhibits high flexibility and customizability, allowing the extension of the second-order interactions in self-attention to any order without introducing excessive computation. As a result, the $g^nConv$ effectively integrates the strengths of vision transformers and CNNs.
In pest detection tasks, the YOLOv5 model lacks global information and the capability to effectively model long-range dependencies, which limits the effectiveness of pest detection to some extent. To address these limitations, we integrate the $g^3Conv$ (when $n = 3$, $g^nConv$ becomes $g^3Conv$) into YOLOv5, combining the advantages of CNNs and transformers to improve the effectiveness of pest detection.
The $g^nConv$ module works as follows: assume the input is $x \in \mathbb{R}^{H \times W \times C}$; the output channels of the first Conv ($k = 3$, stride $= 2$, $p = 1$) are $2 \times C$, and the feature size becomes $\bigl(\frac{H + 2p - k}{stride} + 1, \frac{W + 2p - k}{stride} + 1\bigr)$. For convenience, this size is denoted as $(H_1, W_1)$; if the result is not an integer, it is rounded down. The output channels of the last Conv are $C$. To ensure that higher-order interactions do not introduce significant computational overhead, this module sets the number of channels in each order to be an exponential multiple of 2. The $g^nConv$ achieves $n$-order spatial interactions. The channel dimension $C_i$ of each order is shown in Equation (11):

$$C_i = \frac{C}{2^{\,n-i-1}}, \quad 0 \le i \le n-1 \tag{11}$$
When $n = 1$, the structure of $g^1Conv$ is illustrated in Figure 9. Assume $\phi_{in}$ is a projection producing the features $p_0$ and $\{q_i\}_{i=0}^{n-1}$, and $\phi_{out}$ is a projection applied to $\{p_i\}_{i=1}^{n}$ to perform channel mixing, where $p_0$ is the feature from the first split and $\{q_i\}_{i=0}^{n-1}$ stores the recursive branch features $q_0, q_1, \ldots, q_{n-1}$ from the split of the depth-wise convolution (DWConv). $\{p_i\}_{i=1}^{n}$ stores the results of multiplying two features, and $i$ is increased by 1 after each order. $y$ is the result of multiplying two features at the last recursion step of the recursive body. The output of the recursive body of $g^1Conv(x)$ is shown in Equation (12).

$$\begin{aligned} \bigl[p_0^{H_1 \times W_1 \times C},\; q_0^{H_1 \times W_1 \times C}\bigr] &= \phi_{in}(x) \in \mathbb{R}^{H_1 \times W_1 \times 2C}, \\ p_1 &= d(q_0) \odot p_0 \in \mathbb{R}^{H_1 \times W_1 \times C}, \\ y &= \phi_{out}(p_1) \in \mathbb{R}^{H_1 \times W_1 \times C}, \end{aligned} \tag{12}$$
When $n > 1$, the $g^nConv$ further enhances model capacity by introducing higher-order interactions. The $g^nConv$ module performs the gated convolution recursively, as shown in Equation (13). The structure of the $g^3Conv$ is depicted in Figure 10.
$$\begin{aligned} \bigl[p_0^{H_1 \times W_1 \times C_0},\; q_0^{H_1 \times W_1 \times C_0},\; \ldots,\; q_{n-1}^{H_1 \times W_1 \times C_{n-1}}\bigr] &= \phi_{in}(x) \in \mathbb{R}^{H_1 \times W_1 \times \bigl(C_0 + \sum_{0 \le i \le n-1} C_i\bigr)}, \\ g_i &= \begin{cases} \mathrm{Identity}, & i = 0, \\ \mathrm{Linear}(C_{i-1}, C_i), & 1 \le i \le n-1, \end{cases} \\ p_{i+1} &= d_i(q_i) \odot g_i(p_i) / \alpha \in \mathbb{R}^{H_1 \times W_1 \times C_i}, \quad i = 0, 1, \ldots, n-1. \end{aligned} \tag{13}$$
where $\{d_i\}$ is a set of depth-wise convolutions, $\{g_i\}$ matches the channel dimensions between different orders, and $1/\alpha$ is a scale factor that stabilizes training. Finally, the output of the recursive body of $g^nConv$ is $\phi_{out}(p_n)$.
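A PyTorch sketch of the recursive gated convolution is shown below. It follows the HorNet $g^nConv$ design [39] and the description above, including the stride-2, 3 × 3 input projection stated in the text; the 7 × 7 depth-wise kernel and the scale value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GnConv(nn.Module):
    # Recursive gated convolution (g^nConv), sketched after HorNet and the
    # description above; g^3Conv corresponds to order=3.
    def __init__(self, dim, order=3, dw_kernel=7, scale=1.0):
        super().__init__()
        self.order = order
        # C_i = C / 2^(n-i-1), Equation (11): [C/4, C/2, C] for n = 3.
        self.dims = [dim // 2 ** (order - i - 1) for i in range(order)]
        # phi_in: 3x3, stride-2 projection to 2C channels (as stated in the text).
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), dw_kernel,
                                padding=dw_kernel // 2, groups=sum(self.dims))
        # g_i: 1x1 convs matching channel dimensions between orders.
        self.pws = nn.ModuleList(
            [nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
             for i in range(order - 1)]
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)   # phi_out
        self.scale = scale                                   # 1/alpha, stabilises training

    def forward(self, x):
        fused = self.proj_in(x)                              # 2C channels, H1 x W1
        p0, q = torch.split(fused, (self.dims[0], sum(self.dims)), dim=1)
        q = self.dwconv(q) * self.scale
        q_list = torch.split(q, self.dims, dim=1)            # q_0 ... q_{n-1}
        p = p0 * q_list[0]                                   # first-order interaction
        for i in range(self.order - 1):
            p = self.pws[i](p) * q_list[i + 1]               # higher-order interactions
        return self.proj_out(p)

# Example: a 1 x 64 x 80 x 80 input yields a 1 x 64 x 40 x 40 output.
out = GnConv(64)(torch.randn(1, 64, 80, 80))
```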

2.7. Soft-NMS

In the post-processing stage of object detection tasks, NMS is a critical step that aims to eliminate redundant bounding boxes representing the same object [40], thereby improving the model's precision and efficiency. The NMS algorithm arranges the candidate boxes in descending order based on their scores, selects the candidate box with the highest score, and deletes other candidate boxes whose overlap with it exceeds a threshold. This process is repeated for the remaining candidate boxes. The major issue of this greedy algorithm is that overlapping targets can lead to the deletion of candidate boxes belonging to other targets. In addition, the threshold of NMS is not easy to determine. If a low threshold is applied, it may result in an increase in the miss rate and a drop in average precision. If a high threshold is applied, the increase in false positives (FP) will surpass the increase in true positives (TP) by a considerable margin. In real scenarios, pest distributions are dense, so using the traditional NMS algorithm to process the detection results easily causes adjacent pests to be missed. Aiming at the shortcomings of the traditional NMS algorithm, the Soft-NMS algorithm [41] optimizes NMS by decaying the detection scores. The Soft-NMS algorithm has the following advantages: ① it addresses the issue of missed detections that can occur when objects are in close proximity to each other; ② it does not have additional hyperparameters; ③ relative to NMS, it has the same computational complexity $O(N^2)$, where $N$ is the number of detection boxes.
Assume that the detection box with the highest current score is $M$, $N_t$ is the NMS threshold, $B = \{b_i, 1 \le i \le N\}$ is the set of detection boxes, $D$ is the set of final detection boxes, and $S = \{s_i, 1 \le i \le N\}$ is the set of scores of $\{b_i, 1 \le i \le N\}$. The Soft-NMS algorithm has two forms, with linear and Gaussian penalty functions, as shown in Equations (14) and (15).

$$s_i = \begin{cases} s_i, & iou(M, b_i) < N_t \\ s_i\bigl(1 - iou(M, b_i)\bigr), & iou(M, b_i) \ge N_t \end{cases} \tag{14}$$

$$s_i = s_i\, e^{-\frac{iou(M, b_i)^2}{\sigma}}, \quad \forall b_i \notin D \tag{15}$$

where $iou(M, b_i)$ is the IoU value of $M$ and $b_i$, and $\sigma$ is the parameter of the Gaussian penalty function.
In real scenarios, pest images are mostly crowded and dense, with multiple overlapping targets. Therefore, using the Soft-NMS algorithm can reduce missed detections and improve pest detection performance.
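For clarity, a minimal NumPy sketch of the Soft-NMS procedure with both penalty functions (Equations (14) and (15)) is given below. The threshold values are illustrative, and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
import numpy as np

def soft_nms(boxes, scores, iou_thresh=0.3, sigma=0.5, score_thresh=0.001, method="gaussian"):
    """Soft-NMS sketch (Bodla et al. [41]); boxes: (N, 4) arrays of x1, y1, x2, y2."""
    boxes = boxes.copy().astype(np.float32)
    scores = scores.copy().astype(np.float32)
    n = len(boxes)
    indices = np.arange(n)

    for i in range(n):
        # Move the remaining box with the highest score to position i (this is M).
        top = i + np.argmax(scores[i:])
        boxes[[i, top]] = boxes[[top, i]]
        scores[[i, top]] = scores[[top, i]]
        indices[[i, top]] = indices[[top, i]]

        # IoU of box i against all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[i + 1:, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[i + 1:, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[i + 1:, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[i + 1:, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[i + 1:, 2] - boxes[i + 1:, 0]) * (boxes[i + 1:, 3] - boxes[i + 1:, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)

        # Decay overlapping scores instead of deleting the boxes.
        if method == "gaussian":                          # Equation (15)
            scores[i + 1:] *= np.exp(-(iou ** 2) / sigma)
        else:                                             # linear, Equation (14)
            scores[i + 1:] *= np.where(iou >= iou_thresh, 1.0 - iou, 1.0)

    # Keep the original indices of boxes whose decayed score survives.
    return indices[scores > score_thresh]
```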

2.8. YOLOv5s-Pest Model

To improve the performance of the model in detecting pests, we propose the YOLOv5s-pest model by implementing several improvements to the YOLOv5 model. These improvements include designing the HSPPF and C3NCBAM modules, introducing the $g^3Conv$ module into the neck, and replacing the NMS module in the post-processing stage with Soft-NMS. The structure of the YOLOv5s-pest model is depicted in Figure 11.

2.9. Evaluation Indicators

To verify the performance of the model, we use precision (P), recall (R), and mAP as evaluation indicators, as shown in Equations (16)–(19). In addition, we also use parameters, layers, and GFLOPs (giga floating-point operations) to measure model size, model depth, and computational cost.
$$P = \frac{TP}{TP + FP} \tag{16}$$

$$R = \frac{TP}{TP + FN} \tag{17}$$

$$AP = \int_0^1 P(R)\, dR \tag{18}$$

$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N} \tag{19}$$
where $TP$ (true positives) represents the number of pests correctly identified as belonging to their corresponding classes, $FP$ (false positives) signifies the count of pests not belonging to a class but identified as such, and $FN$ (false negatives) indicates the number of pests belonging to a class but erroneously identified as not belonging to it. $N$ is the number of pest classes. In Equation (18), $AP$ is the average precision, which is the area under the P-R curve. In Equation (19), $AP_i$ is the $AP$ of the $i$-th class, and $mAP$ is the mean average precision over all classes. In this study, mAP includes mAP@0.5 and mAP@0.5:0.95, where mAP@0.5 denotes the mean average precision over all classes at IoU = 0.5, and mAP@0.5:0.95 denotes the mean average precision over all classes for IoU thresholds increasing from 0.5 to 0.95 at intervals of 0.05.
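As an illustration, the per-class AP in Equation (18) can be computed as the area under the precision-recall curve using the standard all-point interpolation. The sketch below uses illustrative precision/recall values, not results from this study.

```python
import numpy as np

def average_precision(recall, precision):
    # Area under the precision-recall curve (Equation (18)),
    # using the standard all-point interpolation.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

def mean_average_precision(ap_per_class):
    # Equation (19): mean of per-class AP values.
    return float(np.mean(ap_per_class))

# Example with illustrative precision/recall points for one class.
recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([1.0, 0.9, 0.8, 0.6])
print(average_precision(recall, precision))
```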

3. Results and Discussion

3.1. Training Environment

The hardware and software configuration parameters of experiments are shown in Table 2.

3.2. Hyperparameter Configuration

Given our constraints on computational resources, we set the model’s hyperparameters as follows: the image resolution is set to 640 × 640. During training, we set the maximum number of epochs to 300 with a batch size of 16. Additionally, we configure the momentum at 0.937, initialize the learning rate to 0.01, and set the weight decay to 0.0005.

3.3. Selection of Baseline Model

YOLOv5 is divided into five versions according to different depth coefficients (the number of residual blocks) and width coefficients (the number of channels) of the model: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. To select the best version, each of the five versions is trained on the IP16 dataset.
As shown in Table 3, the experimental results indicate that the larger the depth and width coefficients, the stronger the feature extraction and fusion capabilities of the model and the higher the precision, but also the deeper the layers, the larger the parameters, and the larger the GFLOPs. Although YOLOv5n has the fewest parameters and lowest GFLOPs, the mAP@0.5:0.95 of YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x is 4.6%, 7.6%, 8.7%, and 10.3% higher than that of YOLOv5n, respectively, and the P, R, and mAP of YOLOv5n are significantly lower than those of the other models. Compared with YOLOv5m, YOLOv5l, and YOLOv5x, the P, R, mAP@0.5, and mAP@0.5:0.95 of YOLOv5s are close, but its parameters and GFLOPs are relatively small. Therefore, after considering the requirements for detection precision and real-time performance, we select YOLOv5s as the baseline model.

3.4. Ablation Experiments

To validate the effectiveness of the HSPPF, C3NCBAM, $g^3Conv$, and Soft-NMS modules, we integrate these modules into the baseline model, as delineated in Table 4.
These experiments demonstrate that, after adding the HSPPF module, the mAP@0.5 is 0.8% higher than that of YOLOv5s. After adding the C3NCBAM module, the P, R, mAP@0.5, and mAP@0.5:0.95 are 2.1%, 2.2%, 1.8%, and 0.7% higher than those of YOLOv5s, respectively. After adding $g^3Conv$, the mAP@0.5:0.95 is 0.7% higher than that of YOLOv5s. After the introduction of Soft-NMS, the mAP@0.5:0.95 is 5.5% higher than that of YOLOv5s. It can be concluded that the HSPPF module effectively obtains more information by combining AvgPool and MaxPool. The NCBAM module emphasizes important information while suppressing non-essential details, and the structure of C3NCBAM is more conducive to extracting features than C3. Compared to Conv, the $g^3Conv$ module enhances the model's ability to model complex spatial interactions and makes it easier to capture long-term dependencies. Soft-NMS reduces missed detections and false positives, improving overall detection precision. These improvement methods are very effective in improving precision, recall, and mean average precision.

3.5. Comparison Experiments

3.5.1. Comparison of Different Attention Mechanisms

To verify the effectiveness of the NCBAM attention mechanism, we compare NCBAM with different attention mechanisms. To ensure fairness of the comparison, each attention is added to the same location in C3 as the C3NCBAM module. The comparison results are shown in Table 5.
The experimental results demonstrate that our proposed NCBAM module outperforms the SE, CBAM, ECA, and CA modules. Specifically, the mAP@0.5 is 1.1%, 1%, 1.3%, and 0.6% higher than that of SE, CBAM, ECA, and CA, respectively, and the mAP@0.5:0.95 is 0.7% higher than that of SE and CBAM. It can be concluded that, compared with SE, CBAM, ECA, and CA, the NCBAM is more effective in extracting the important characteristics of pests.

3.5.2. Comparison of Different Methods

To validate the detection performance of YOLOv5s-pest, we compare it with several methods, such as YOLOv5s-transformer, YOLOv5s-bifpn, and YOLOv5m-ghost from the YOLOv5 hub, YOLOv5-Lite-g from version 1.4 of YOLOv5-Lite [42], SE-YOLOv5s [43], MG-YOLO [44], YOLOv7-tiny [45], YOLOXs [46], and YOLOv8s [47]. To ensure a fair comparison, all models have similar computational complexity. The comparison results are shown in Table 6.
The experimental results demonstrate that the mAP@0.5 of YOLOv5s-pest is 92.5% and its mAP@0.5:0.95 is 72.6%. Compared to YOLOv5s, YOLOv5s-transformer, YOLOv5s-bifpn, YOLOv5m-ghost, YOLOv5-lite-g, SE-YOLOv5s, MG-YOLO, YOLOv7-tiny, YOLOXs, and YOLOv8s, the mAP@0.5 of YOLOv5s-pest is 2.3%, 1.5%, 1.3%, 7%, 1.8%, 7.5%, 1.7%, 2.1%, 1.6%, and 2% higher than that of these models, respectively, and the mAP@0.5:0.95 is 6.2%, 7.3%, 6.5%, 11.7%, 6.6%, 13.3%, 7.5%, 6.2%, 6.7%, and 2.5% higher, respectively. In terms of the comparison results of the above methods, our proposed method, YOLOv5s-pest, achieves outstanding performance. After improving the backbone, neck, and post-processing of the baseline model, P, R, and mAP increase greatly while parameters and GFLOPs increase only slightly. Hence, YOLOv5s-pest can be chosen as the preferred option for pest detection.

3.5.3. Comparison of YOLOv5s and YOLOv5s-Pest on Different Datasets

To further verify the effectiveness of YOLOv5s-pest, we perform comparative experiments on the insect [48], pest24 [49], and pascal visual object classes (VOC) 2007 + 2012 [50] datasets; the results are shown in Table 7.
As can be seen from Table 7, the mAP@0.5:0.95 of YOLOv5s-pest on the insect, pest24, and pascal VOC datasets is 64.1%, 42.9%, and 66.5%, respectively, which is 1.4%, 3.4%, and 6.9% higher than that of YOLOv5s. This experiment proves that our model also achieves a certain degree of improvement on other datasets. In the future, we can try to apply our proposed method to more detection tasks.

3.6. Training Curve Analysis

The training curves of mAP for the different methods on IP16 are shown in Figure 12. During training, we train the models using pretrained weights provided by the official YOLOv5 repository, which are trained on large public datasets. These pretrained weights include yolov5n.pt, yolov5s.pt, yolov5m.pt, yolov5l.pt, and yolov5x.pt. The training curves of YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x are depicted in Figure 12a. All methods achieve fast convergence and their best performance within 300 epochs.
The training curves of mAP for our custom methods, which add HSPPF, C3NCBAM, $g^3Conv$, Soft-NMS, C3SE, C3CBAM, C3ECA, and C3CA, respectively, are shown in Figure 12b,c. The mAP values of our custom methods are relatively low in the first few epochs. As the number of epochs increases, the detection performance of the models improves through learning; the mAP values gradually increase and eventually surpass the performance of YOLOv5s.
The training curves of mAP for the different methods in the comparison experiments are shown in Figure 12d. We use different pretrained weights to train these methods: yolov5s.pt (YOLOv5s-transformer, YOLOv5s-bifpn, SE-YOLOv5s, and MG-YOLO), yolov5m.pt (YOLOv5m-ghost), v5lite-g.pt (YOLOv5-Lite-g), yolov7-tiny.pt (YOLOv7-tiny), yolox_s.pth (YOLOXs), and yolov8s.pt (YOLOv8s). The results clearly demonstrate that the training curve of mAP of YOLOv5s-pest is higher than that of the other methods, and YOLOv5s-pest converges in fewer epochs than the other methods.

3.7. Visualization of Attention Mechanisms

To visualize the reasoning process, we employ gradient-weighted class activation mapping (Grad-CAM) [51] to generate heat maps. The heat maps obtained after applying different attention mechanisms are shown in Figure 13, where different columns correspond to different attention mechanisms. After adding NCBAM, the red region on the target is larger than with SE, CBAM, ECA, and CA. Therefore, from the visual results, it can be concluded that NCBAM has a stronger ability to filter redundant features and extract key features than SE, CBAM, ECA, and CA.
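For reference, Grad-CAM can be sketched with PyTorch hooks as below. This is a classification-style sketch that assumes the model returns per-class scores; for a detector such as YOLOv5, the scalar that is back-propagated must instead be chosen from the detection head (e.g., a class confidence), so the snippet illustrates the principle rather than the exact visualization pipeline used here.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Weight the target layer's activations by the spatially averaged gradients
    of the chosen score, then ReLU and normalise to obtain a heat map."""
    activations, gradients = [], []

    def fwd_hook(module, inp, out):
        activations.append(out)

    def bwd_hook(module, grad_in, grad_out):
        gradients.append(grad_out[0])

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    output = model(image)                 # assumed to return per-class scores
    output[0, class_idx].backward()
    h1.remove()
    h2.remove()

    act, grad = activations[0], gradients[0]           # B x C x h x w
    weights = grad.mean(dim=(2, 3), keepdim=True)      # global-average-pooled gradients
    cam = F.relu((weights * act).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-9)   # B x 1 x H x W heat map
```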

3.8. Discussion

In agricultural production, pests are a continuous concern as they pose potential threats to crop yield and quality. DL has garnered significant attention and application in the domain of pest monitoring [52,53]. Recent studies [54,55,56,57,58,59,60] exhibit the YOLO series' efficacy in addressing diverse crop-related challenges. However, CNN-based DL approaches have limitations in practical pest detection scenarios [61], restricting their real-world applicability. Under complex backgrounds and in the presence of different pest species, we propose YOLOv5s-pest (HSPPF + C3NCBAM + $g^3Conv$ + Soft-NMS) to improve detection precision, where the HSPPF module is strategically crafted to fortify the model's capacity to extract multi-scale receptive field information within feature maps. Additionally, the incorporation of the NCBAM serves to emphasize pertinent features while concurrently suppressing redundant information, thereby enhancing detection precision without compromising computational speed. This model not only demonstrates significant improvement in detecting multiple pest species, but also has scalability to other object detection tasks. However, a limitation of this approach is that it slightly increases the model's parameters and GFLOPs.
Future research should focus on developing more sustainable, efficient, and environmentally friendly pest detection strategies to ensure the sustainability and stability of agricultural production. Lightweight networks provide efficient and resource-saving solutions for pest detection.

4. Conclusions

Crop pest detection plays a critical role in modern agriculture [62,63]. With the advancement of technology, advanced techniques such as ML and DL have been extensively utilized for crop monitoring and management. Traditional manual methods for pest detection have been laborious and time-consuming [64]. Pest detection approaches based on ML [65,66,67] rely heavily on handcrafted feature extractors and intricate rule-based systems. DL techniques, particularly the YOLO series, facilitate rapid feature extraction from vast datasets, ensuring timely and accurate pest detection. The complexities that exist in real-world agricultural settings [68], such as intricate backgrounds, overlapping, and occlusion, present considerable challenges for effective pest detection. To address these, we propose YOLOv5s-pest. To validate the effectiveness of our modules and model, we create a dataset called IP16, which has complex backgrounds and scenarios where pests overlap and occlude each other. Upon individually integrating HSPPF, C3NCBAM, $g^3Conv$, and Soft-NMS into YOLOv5, distinct improvements are observed in the P, R, and mAP on the IP16 dataset. This proves the effectiveness of each of these modules in enhancing the model's capabilities. The P, R, mAP@0.5, and mAP@0.5:0.95 of YOLOv5s-pest, which integrates innovative modules (HSPPF, C3NCBAM, and $g^3Conv$) and strategies (Soft-NMS), achieve 94.4%, 87.1%, 92.5%, and 72.6%, respectively. These novel additions endow the model with distinct advantages, setting it apart in terms of efficiency and performance within the realm of pest detection, and demonstrating exceptional prowess in navigating challenges such as intricate agricultural backgrounds, overlapping, and occlusion scenarios. Consequently, YOLOv5s-pest can serve as a technical reference and tool for pest control, assist farmers in taking timely preventive measures, provide innovative and effective solutions for pest control in agriculture, and contribute to sustainable agricultural development.
Given the pivotal role of on-site cameras or agricultural robots in pest detection, we can deploy our model across various platforms, such as cloud servers, local servers, and embedded devices. Such deployment not only contributes to the enhancement of agricultural productivity and food safety but also facilitates environmental conservation efforts. By leveraging our model across diverse platforms, we can achieve broader coverage in pest monitoring and provide more effective solutions for the agricultural sector, thereby contributing to the realization of sustainable agricultural development goals.
As DL methodologies in pest detection continue to advance, future work will emphasize the development of lightweight, real-time monitoring systems. Anticipating this evolution, we aim to develop more intelligent, scalable, and easily deployable pest detection systems, further driving innovation and sustainable development in agriculture.

Author Contributions

Conceptualization, W.Y. and X.Q.; Methodology, X.Q.; Software, X.Q.; Validation, X.Q.; Formal analysis, X.Q.; Resources, X.Q.; Data curation, W.Y. and X.Q.; Writing—original draft, X.Q.; Writing—review & editing, W.Y. and X.Q.; Visualization, X.Q.; Project administration, W.Y.; Funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62366018, and the Natural Science Foundation of Jiangxi Province, grant numbers 20212BAB212005 and 20224BAB202015.

Data Availability Statement

Publicly available dataset (IP102) can be found here: https://github.com/xpwu95/IP102 (accessed on 23 February 2023). Our dataset (IP16) can be found here: https://github.com/join1233-lab/v5-pest (accessed on 5 December 2023). Additional supplements are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, C.; Chen, H.; Ma, Z.; Zhang, T.; Yang, C.; Su, H.; Chen, H. Pest-YOLO: A model for large-scale multi-class dense and tiny pest detection and counting. Front. Plant Sci. 2022, 13, 973985. [Google Scholar] [CrossRef]
  2. Singh, A.; Dhiman, N.; Kar, A.K.; Singh, D.; Purohit, M.P.; Ghosh, D.; Patnaik, S. Advances in controlled release pesticide formulations: Prospects to safer integrated pest management and sustainable agriculture. J. Hazard. Mater. 2020, 385, 121525. [Google Scholar] [CrossRef]
  3. Abate, T.; van Huis, A.; Ampofo, J.K.O. Pest Management Strategies in Traditional Agriculture: An African Perspective. Annu. Rev. Entomol. 2000, 45, 631–659. [Google Scholar] [CrossRef]
  4. Gonzalez de Santos, P.; Ribeiro, A.; Fernandez Quintanilla, C.; Lopez Granados, F.; Brandstoetter, M.; Tomic, S.; Pedrazzi, S.; Peruzzi, A.; Pajares, G.; Kaplanis, G.; et al. Fleets of robots for environmentally-safe pest control in agriculture. Precis. Agric. 2016, 18, 574–614. [Google Scholar] [CrossRef]
  5. Liu, C.K.; Zhai, Z.Q.; Zhang, R.Y.; Bai, J.; Zhang, M. Field pest monitoring and forecasting system for pest control. Front. Plant Sci. 2022, 13, 990965. [Google Scholar] [CrossRef] [PubMed]
  6. Li, W.; Zhu, T.; Li, X.; Dong, J.; Liu, J. Recommending Advanced Deep Learning Models for Efficient Insect Pest Detection. Agriculture 2022, 12, 1065. [Google Scholar] [CrossRef]
  7. Aladhadh, S.; Habib, S.; Islam, M.; Aloraini, M.; Aladhadh, M.; Al-Rawashdeh, H.S. An Efficient Pest Detection Framework with a Medium-Scale Benchmark to Increase the Agricultural Productivity. Sensors 2022, 22, 9749. [Google Scholar] [CrossRef] [PubMed]
  8. Li, W.; Wang, D.; Li, M.; Gao, Y.; Wu, J.; Yang, X. Field detection of tiny pests from sticky trap images using deep learning in agricultural greenhouse. Comput. Electron. Agric. 2021, 183, 106048. [Google Scholar] [CrossRef]
  9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  15. Redmon, J.; Farhadi, A. Yolov3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 9 November 2022).
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  19. Zhang, Y.; Cai, W.; Fan, S.; Song, R.; Jin, J. Object Detection Based on YOLOv5 and GhostNet for Orchard Pests. Information 2022, 13, 548. [Google Scholar] [CrossRef]
  20. Li, K.S.; Wang, J.C.; Jalil, H.; Wang, H. A fast and lightweight detection algorithm for passion fruit pests based on improved YOLOv5. Comput. Electron. Agric. 2023, 204, 107534. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Yang, G.; Liu, Y.; Wang, C.; Yin, Y. An Improved YOLO Network for Unopened Cotton Boll Detection in the Field. J. Intell. Fuzzy Syst. 2022, 42, 2193–2206. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Ma, B.; Hu, Y.; Li, C.; Li, Y. Accurate cotton diseases and pests detection in complex background based on an improved YOLOX model. Comput. Electron. Agric. 2022, 203, 107484. [Google Scholar] [CrossRef]
  23. Xiang, Q.; Huang, X.; Huang, Z.; Chen, X.; Cheng, J.; Tang, X. Yolo-Pest: An Insect Pest Object Detection Algorithm via CAC3 Module. Sensors 2023, 23, 3221. [Google Scholar] [CrossRef] [PubMed]
  24. Yang, S.; Xing, Z.; Wang, H.; Dong, X.; Gao, X.; Liu, Z.; Zhang, X.; Li, S.; Zhao, Y. Maize-YOLO: A New High-Precision and Real-Time Method for Maize Pest Detection. Insects 2023, 14, 278. [Google Scholar] [CrossRef] [PubMed]
  25. Chen, H.; Wang, R.; Du, J.; Chen, T.; Liu, H.; Zhang, J.; Li, R.; Zhou, G. Feature Refinement Method Based on the Two-Stage Detection Framework for Similar Pest Detection in the Field. Insects 2023, 14, 819. [Google Scholar] [CrossRef]
  26. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A New Pest Detection Method Based on Improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef]
  27. Wu, X.; Zhan, C.; Lai, Y.K.; Cheng, M.M.; Yang, J. IP102: A Large-scale Benchmark Dataset for Insect Pest Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8779–8788. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; p. 11211. [Google Scholar]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  34. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. arXiv 2022, arXiv:2112.01527. [Google Scholar]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  36. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. arXiv 2022, arXiv:211109883. [Google Scholar]
  37. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. arXiv 2021, arXiv:2012.12877. [Google Scholar]
  38. Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview Transformers for Video Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3323–3333. [Google Scholar]
  39. Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.-N.; Lu, J. HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions. arXiv 2022, arXiv:2207.14284. [Google Scholar]
  40. Chu, J.; Zhang, Y.; Li, S.; Leng, L.; Miao, J. Syncretic-NMS: A Merging Non-Maximum Suppression Algorithm for Instance Segmentation. IEEE Access 2020, 8, 114705–114714. [Google Scholar] [CrossRef]
  41. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS--Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5562–5570. [Google Scholar]
  42. Chen, X.; Gong, Z. YOLOv5-Lite: Lighter, Faster and Easier to Deploy. Zenodo. 2021. Available online: https://github.com/ppogg/YOLOv5-Lite (accessed on 25 May 2023).
  43. Zhang, Z.D.; Zhang, B.; Lan, Z.C. FINet: An Insulator Dataset and Detection Benchmark Based on Synthetic Fog and Improved YOLOv5. IEEE Trans. Instrum. Meas. 2022, 71, 6006508. [Google Scholar] [CrossRef]
  44. Li, K.; Zhu, X.; Qiao, C.; Zhang, L.; Gao, W.; Wang, Y. The Gray Mold Spore Detection of Cucumber Based on Microscopic Image and Deep Learning. Plant Phenomics 2023, 5, 0011. Available online: https://github.com/D715-ky/fungal-spore-detection (accessed on 25 May 2023). [CrossRef]
Figure 1. Sample images of the 16 classes of pests collected: (a) alfalfa plant bug, (b) aphids, (c) locustoidea, (d) mole cricket, (e) oides decempunctata, (f) papilio xuthus, (g) penthaleus major, (h) potosia brevitarsis, (i) rice leaf roller (larvae), (j) rice leaf roller (adult), (k) rice leafhopper, (l) salurnis marginella, (m) serica orientalis, (n) thrips, (o) wireworm (larvae), and (p) wireworm (adult).
Figure 2. (a) Schematic of the dataset: images selected and labeled from the IP102 dataset form the raw IP16, and data augmentation is then applied to generate IP16, which serves as the dataset for our experiments. (b) Examples of data augmentation applied to two images in the training set.
Figure 3. The structure of the YOLOv5 model, with black dashed boxes indicating the detailed structure of the modules within YOLOv5. The yellow cuboids are the feature maps at different scales generated by the neck, which are 80 × 80, 40 × 40, and 20 × 20 from top to bottom.
Figure 4. The structures of (a) SPP, (b) SPPF, and (c) HSPPF modules.
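For readers comparing the pooling blocks in Figure 4, the following is a minimal PyTorch sketch of the standard YOLOv5 SPPF block, in which three chained 5 × 5 max-poolings reproduce the receptive fields of the parallel 5/9/13 poolings of SPP. It is an illustrative baseline only: the channel widths and the `Conv` helper are simplified assumptions, and the proposed HSPPF module extends this design with additional branches whose exact layout is not reproduced here.

```python
import torch
import torch.nn as nn


class Conv(nn.Module):
    """Simplified Conv-BN-SiLU block in the style of YOLOv5."""
    def __init__(self, c1, c2, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class SPPF(nn.Module):
    """Spatial pyramid pooling (fast): three chained 5x5 max-poolings
    approximate the parallel 5/9/13 poolings of the original SPP."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))


if __name__ == "__main__":
    # quick shape check: a 20 x 20 backbone feature map keeps its spatial size
    out = SPPF(512, 512)(torch.randn(1, 512, 20, 20))
    print(out.shape)  # torch.Size([1, 512, 20, 20])
```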
Figure 5. The structure of the NCBAM.
Figure 6. Detailed structure of the ICAM.
Figure 7. Detailed structure of the SAM.
Figure 8. Detailed structure of the C3NCBAM module.
Figure 9. Detailed structure of the g 1 C o n v module.
Figure 10. The detailed structure of the g 3 C o n v module.
Figure 11. The structure of YOLOv5s-pest; the red dotted boxes mark the positions of the HSPPF, C3NCBAM, and g 3 C o n v modules in our proposed method.
Figure 12. The mAP curves of different models during training. (a) The mAP training curves of YOLOv5 at different depths and widths; (b) the mAP training curves of the models with HSPPF, C3NCBAM, g 3 C o n v , and Soft-NMS added; (c) the mAP training curves of the models with different attention mechanisms; (d) the mAP training curves of the different methods in the comparison experiments.
Figure 13. Visualization of the heat maps produced by models with different attention mechanisms, where red and blue regions denote high and low weights, respectively.
Table 1. The statistics of the number of instances per class.

Class | Instances | Class | Instances
alfalfa plant bug | 2000 | rice leaf roller (larvae) | 2216
aphids | 2474 | rice leaf roller (adult) | 2187
locustoidea | 2333 | rice leafhopper | 2123
mole cricket | 2029 | salurnis marginella | 2016
oides decempunctata | 2597 | serica orientalis | 2370
papilio xuthus | 2235 | thrips | 2136
penthaleus major | 2375 | wireworm (larvae) | 2592
potosia brevitarsis | 2330 | wireworm (adult) | 2107
Table 2. Training environment.

Configuration | Parameter
Operating System | 64-bit Windows 10
CPU | Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz
GPU | NVIDIA GeForce RTX 2080 Ti
CUDA | 10.1
cuDNN | 7.6.5
Deep Learning Framework | PyTorch 1.8.1
Table 3. Comparison results of different versions of YOLOv5.

Model | Layers | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters | GFLOPs
YOLOv5x | 445 | 0.966 | 0.909 | 0.939 | 0.721 | 86,318,749 | 204.9
YOLOv5l | 368 | 0.939 | 0.89 | 0.928 | 0.705 | 46,219,069 | 108.5
YOLOv5m | 291 | 0.931 | 0.885 | 0.92 | 0.694 | 20,931,933 | 48.4
YOLOv5s | 214 | 0.92 | 0.856 | 0.902 | 0.664 | 7,062,781 | 16.1
YOLOv5n | 214 | 0.913 | 0.834 | 0.885 | 0.618 | 1,785,565 | 4.3
Table 4. Results of the ablation experiment after integrating HSPPF, C3NCBAM, g 3 C o n v , and Soft-NMS.

Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters | GFLOPs
YOLOv5s | 0.92 | 0.856 | 0.902 | 0.664 | 7,062,781 | 16.1
YOLOv5s + HSPPF | 0.937 | 0.857 | 0.91 | 0.665 | 7,455,997 | 16.4
YOLOv5s + C3NCBAM | 0.941 | 0.878 | 0.92 | 0.671 | 7,091,213 | 16.2
YOLOv5s + g 3 C o n v | 0.934 | 0.87 | 0.909 | 0.671 | 7,967,741 | 17.2
YOLOv5s + Soft-NMS | 0.93 | 0.864 | 0.919 | 0.719 | 7,062,781 | 16.1
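As background for the Soft-NMS row in Table 4, the sketch below implements Gaussian Soft-NMS on plain NumPy arrays: rather than discarding every box whose IoU with the selected box exceeds a hard threshold, the scores of overlapping boxes are decayed smoothly, which reduces missed detections when objects overlap heavily. The `sigma` and `score_thresh` values are illustrative defaults, not the settings used in this paper.

```python
import numpy as np


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of kept boxes in the order they were selected.
    """
    boxes = boxes.astype(np.float64).copy()
    scores = scores.astype(np.float64).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        best = np.argmax(scores[idxs])   # position of the current best box
        i = idxs[best]
        keep.append(i)
        idxs = np.delete(idxs, best)
        if len(idxs) == 0:
            break
        # IoU of the selected box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[i, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[i, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[i, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[idxs, 2] - boxes[idxs, 0]) * (boxes[idxs, 3] - boxes[idxs, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Gaussian decay: heavily overlapping boxes are suppressed softly
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        idxs = idxs[scores[idxs] > score_thresh]
    return keep
```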
Table 5. Comparison results before and after integrating different attention mechanisms.

Model | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters | GFLOPs
YOLOv5s | 0.92 | 0.856 | 0.902 | 0.664 | 7,062,781 | 16.1
YOLOv5s + C3SE | 0.928 | 0.869 | 0.909 | 0.664 | 7,086,461 | 16.1
YOLOv5s + C3CBAM | 0.934 | 0.87 | 0.91 | 0.664 | 7,087,245 | 16.1
YOLOv5s + C3ECA | 0.937 | 0.866 | 0.907 | 0.668 | 7,062,805 | 16.1
YOLOv5s + C3CA | 0.939 | 0.873 | 0.914 | 0.668 | 7,090,429 | 16.1
YOLOv5s + C3NCBAM | 0.941 | 0.878 | 0.92 | 0.671 | 7,091,213 | 16.2
Table 6. Comparison results of different methods.

Method | P | R | mAP@0.5 | mAP@0.5:0.95 | Parameters | GFLOPs
YOLOv5s | 0.92 | 0.856 | 0.902 | 0.664 | 7,062,781 | 16.1
YOLOv5s-transformer | 0.931 | 0.854 | 0.91 | 0.653 | 7,063,037 | 15.9
YOLOv5s-bifpn | 0.937 | 0.867 | 0.912 | 0.661 | 7,128,317 | 16.3
YOLOv5m-ghost | 0.908 | 0.807 | 0.855 | 0.609 | 8,586,789 | 18.7
YOLOv5-lite-g | 0.931 | 0.859 | 0.907 | 0.66 | 5,500,541 | 15.9
SE-YOLOv5s | 0.9 | 0.804 | 0.85 | 0.593 | 7,282,621 | 16.5
MG-YOLO | 0.933 | 0.871 | 0.908 | 0.651 | 6,523,702 | 14.6
YOLOv7-tiny | 0.927 | 0.875 | 0.904 | 0.664 | 6,055,578 | 13.3
YOLOX-s | - | - | 0.909 | 0.659 | 8,943,487 | 26.8
YOLOv8s | 0.918 | 0.858 | 0.905 | 0.701 | 11,166,560 | 28.8
YOLOv5s-pest | 0.944 | 0.871 | 0.925 | 0.726 | 8,389,389 | 17.6
Table 7. Comparison results of YOLOv5s and YOLOv5s-pest on different datasets.

Dataset | Train | Val | Classes | Method | P | R | mAP@0.5 | mAP@0.5:0.95
Insect | 1693 | 245 | 7 | YOLOv5s | 0.753 | 0.853 | 0.875 | 0.627
Insect | 1693 | 245 | 7 | YOLOv5s-pest | 0.802 | 0.83 | 0.869 | 0.641
Pest24 | 12,701 | 5077 | 24 | YOLOv5s | 0.69 | 0.62 | 0.636 | 0.395
Pest24 | 12,701 | 5077 | 24 | YOLOv5s-pest | 0.724 | 0.59 | 0.664 | 0.429
VOC | 16,551 | 4952 | 20 | YOLOv5s | 0.815 | 0.775 | 0.831 | 0.596
VOC | 16,551 | 4952 | 20 | YOLOv5s-pest | 0.819 | 0.785 | 0.843 | 0.665