Article

YOLO-Sp: A Novel Transformer-Based Deep Learning Model for Achnatherum splendens Detection

Yuzhuo Zhang, Tianyi Wang, Yong You, Decheng Wang, Dongyan Zhang, Yuchan Lv, Mengyuan Lu and Xingshan Zhang
1 College of Engineering, China Agricultural University, Beijing 100083, China
2 College of Agricultural Unmanned System, China Agricultural University, Beijing 100193, China
3 National Engineering Research Center for Agro-Ecological Big Data Analysis & Application, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
Agriculture 2023, 13(6), 1197; https://doi.org/10.3390/agriculture13061197
Submission received: 15 May 2023 / Revised: 30 May 2023 / Accepted: 1 June 2023 / Published: 4 June 2023
(This article belongs to the Special Issue Sensor-Based Precision Agriculture)

Abstract: The growth of Achnatherum splendens (A. splendens) inhibits the growth of dominant grassland herbaceous species, resulting in a loss of grassland biomass and a worsening of the grassland ecological environment. It is therefore crucial to monitor the dynamic development of A. splendens adequately. This study proposes a transformer-based A. splendens detection model named YOLO-Sp that works on ground-based visible spectrum proximal sensing images. YOLO-Sp achieved AP values of 98.4% in object detection and 95.4% in image segmentation for A. splendens, outperforming previous SOTA algorithms. The research indicated that the Transformer has great potential for monitoring A. splendens: under identical training settings, the AP value of YOLO-Sp was more than 5% higher than that of YOLOv5. The model's average accuracy was 98.6% in trials conducted at real test sites. The experiments revealed that factors such as the amount of light, the degree of grass growth, and the camera resolution affect detection accuracy. This study could contribute to monitoring and assessing grass plant biomass in grasslands.

1. Introduction

Typical grasslands are indispensable for their global ecological, economic, and social values [1]. In a healthy grassland ecosystem, all types of livestock can flourish, producing high-quality meat and milk. As the largest green vegetation and biological resource, natural grassland covers about 400 million hm2 of land in China [2]. However, the grassland ecology of China is facing challenges: knowledge about grassland resource protection is lacking, a vicious cycle of grassland destruction and degradation is ongoing, and livestock production efficiency needs to be improved [3,4]. In typical Inner Mongolia grasslands, A. splendens grows in large quantities [5]. Its resistance to drought, cold, alkaline, and salt stress is exceptional [6]. Due to its robust root system, A. splendens has an advantage in absorbing water. Nevertheless, A. splendens is unsuitable for animals as a primary food source because of its low nutritional content and high leaf fiber content [7]. Jiang et al. discovered that A. splendens could affect soil hydrological parameters and the dynamics of soil water and salt [8]. The water absorption mechanism of the A. splendens root system facilitates the passage of salt ions and leads to salt buildup. Yang et al. [9] also found that A. splendens can alter soil microbiological properties and harm plant production. This results in excessive growth of A. splendens at the expense of other forage grasses. In other words, the proliferation of A. splendens could influence the development of dominant forage grasses in grasslands, eventually leading to a decrease in grassland biomass and deterioration of grasslands. However, if the growth of A. splendens is effectively understood and controlled, its positive effects become more prominent. A. splendens serves as emergency feed and indicates water sources for cattle in pastoral grasslands [10], and it has become an indicator of watershed climate change, human activity, and the biological environment of grasslands [6]. Therefore, effectively quantifying the biomass of A. splendens and monitoring its dynamic changes is very meaningful to the study of grassland degradation.
Using UAV remote sensing to acquire spectral images and laser point cloud data has proven effective and efficient in detecting and estimating vegetation height and coverage [11,12,13]. Some researchers have also studied assessing grass plant biomass based on remote sensing data. Guo Y et al. investigated the capacity of hyperspectral measures to quantify plant invader coverage and the impact of senescent plant coverage [14]. In addition to using satellite remote sensing imagery data directly, vegetation indices such as NDVI and EVI produced from MODIS data are commonly used to assess vegetation coverage. Zha Y et al. used satellite remote sensing data to systematically monitor grassland cover changes near Qinghai Lake in Western China and derived the NDVI to quantify grassland cover variations between 1987 and 2000 [15]. Converse RL et al. evaluated the grassland cover of the Sevilleta National Wildlife Refuge in 2009, 2014, and 2019 using satellite remote sensing data and confirmed that multi-endmember spectral mixture analysis could be used effectively to monitor semiarid grassland and shrub systems in New Mexico [16]. Although UAV and satellite remote sensing can measure vegetation height and coverage quickly and efficiently, shortcomings still exist. For instance, aerial remote sensing is restricted by weather conditions, and excessive cloud coverage can abort a survey mission. For satellite data, clouds can block the reflectance the sensor receives from the ground, while clouds introduce noise and color variance into UAV data [15]. For UAV remote sensing in particular, precipitation and wind can also affect the mission. Moreover, the sensors carried by UAVs, such as multi-line laser radar and multispectral and hyperspectral sensors, are prohibitively expensive for grassland management and research and are of limited utility [17,18]. Besides, it is difficult to statistically detect changes in overall biomass and partial grassland cover under the same conditions, such as atmospheric and seasonal conditions, because two sets of satellite images must be employed [15]. Although regular RGB cameras can be utilized in some situations, researchers have yet to use them to produce good detection results for detecting and estimating the height of grass growth on grasslands, including A. splendens. Typical grasslands often have a temperate, semiarid continental climate with frequent periods of severe wind [19], which makes UAV surveys very challenging. In addition, assessing grassland in a particular area generally involves delineating individual sample points for estimating the height and cover of grass. Thus, ground proximal sensing-based detection is preferable for this investigation.
With image pre-processing and data fusion, RGB imagery data can be used to derive biomass information [20,21,22,23]. Therefore, delineating the outer contours of A. splendens can also prepare for subsequent biomass estimation. Gebhardt S et al. detected the broad-leaved dock (Rumex obtusifolius L., R.o.), a weed in European grasslands, by converting RGB images into grayscale images and segmenting them [24]. In their experiments, the average R.o. detection rate ranged from 71% to 95% over 108 images containing more than 3600 objects. Petrich L et al. proposed a method based on UAV visible light images to locate and detect the poisonous Colchicum autumnale on grassland [25]. This method relied on a convolutional neural network to find flowers and achieved an accuracy rate of 88.6% in test experiments. Wang L et al. used four semantic segmentation algorithms to detect woody plants invading grassland ecosystems in images collected by UAVs [26]. Their research shows that the ResNet-based algorithm had the highest comprehensive accuracy and that segmentation performance for eastern redcedar (ERC) decreased with decreasing resolution. In Gallmann J et al.'s study, drone-based images were used to effectively detect different flowering plants in a meadow, with precision and recall close to 90% using the Faster R-CNN algorithm [27]. The above research indicates that detecting A. splendens using visible light images and deep learning is practical.
Deep learning has become a prevalent approach in image processing tasks, and convolutional neural networks (CNNs), with good robustness, are widely used in deep learning algorithms [28,29,30,31,32]. Nonetheless, when the Transformer idea was introduced to the realm of computer vision, there was much expectation about its development, and its applicability has proven comparable to that of CNN models. In 2020, the Facebook AI-proposed DETR (DEtection TRansformer) system established a new paradigm for object detection [33]. This approach performs end-to-end detection without requiring non-maximum suppression (NMS) post-processing. Several specialists and academics have explored the possibilities of the Transformer in vision and offered novel detection techniques. Afterwards, the transformer-based Swin Transformer network obtained the top performance (state of the art, SOTA) on various vision benchmarks [34]. Several specialists have started incorporating the Transformer model into deep learning algorithms to accomplish detection. Lin et al. [35] conducted rapid and accurate monitoring of the emergence rate of peanut seedlings by integrating the Transformer and CSNet models. Wang Dandan and He Dongjian [36] segmented apple instances against a complicated backdrop by proposing an attention method. Olenskyj AG et al. [37] compared object detection, CNN regression, and Transformer models, demonstrating that the Transformer can provide accurate grape yield estimates. These professionals and academics have used the Transformer for visual inspection tasks in the industrial and agricultural sectors. However, there are few related studies on the application of the Transformer to detecting grassland plants, especially A. splendens. Consequently, this research investigates whether the Transformer model is superior to CNNs for detecting images of A. splendens.
The flexibility of a tracked robot's movement and its ability to acquire target information from multiple angles align with this paper's research, so this study equipped a crawler robot with a camera to achieve ground detection. To sum up, this study aims to realize ground monitoring of A. splendens and propose an effective detection method based on deep learning. This study proposes a new deep learning detection method for A. splendens based on improving the backbone, neck, and head parts of the traditional YOLOv5. In addition, this study verifies whether the Transformer model is superior to the CNN model in grassland vegetation detection through the detection and segmentation of A. splendens.

2. Materials and Methods

2.1. Materials

The trial location was situated in Baiyinxile Ranch, Xilinhot City, Xilin Gol League, Inner Mongolia, China, at 116°42′ E longitude and 43°37′ N latitude (Figure 1). This region has a semiarid grassland climate. Roughly 80% of the yearly precipitation occurs between June and September, amounting to about 350 mm annually. At this time of year, high temperatures create conditions of high temperature and humidity favorable for plant development. The development of A. splendens in the experimental region is shown in Figure 2. During the testing stage, A. splendens was in the flowering and fruiting phenological phase. At this stage, A. splendens had a high vertical growth capacity, and its height was usually above 150 cm. Its leaves were linear or tubular, hard in texture, long, and thin. The leaves were dark green with a glossy surface. Its stalks grew prostrate or erect, were usually light green, and had fine hairs on the surface.

2.2. Data Collection

In August 2022, data were gathered in the experimental area. As a result of the grassland's challenging topography, a crawler chassis was chosen as the mobile robot carrier (Figure 3a). An STM32 microcontroller (STMicroelectronics, Geneva, Switzerland) was responsible for the mobile robot's basic control; it transmitted signals to the motors that drive the crawler wheels. The main board, an NVIDIA Jetson TX1 (NVIDIA Corp., Santa Clara, CA, USA), acquired the inertial measurement unit (IMU), odometer, and other sensor data sent by the STM32. In this research, the NVIDIA Jetson TX1 primarily acquired the image data sent by the camera, processed and evaluated the images, and then calculated the grass height of A. splendens in the region. The camera used in this work was the MYNT AI D1000-50 (Slightech, Wuxi, China; Figure 3b), a binocular stereo depth camera capable of obtaining color and depth information. The camera's product parameters are listed in Table 1. The camera can obtain clear images, which is conducive to subsequent detection tasks, and feeds depth information to the main control terminal of the mobile robot. The mobile robot captured one thousand photographs in the testing area. PyTorch 1.8 (Meta Corp., Menlo Park, CA, USA) with CUDA 10.2 (NVIDIA Corp., Santa Clara, CA, USA) was the framework employed. The robot covered an area of 5 m2 per minute in the grassland of the test area and could obtain image information of 1 m2 of A. splendens from all directions per minute.
Nine test sites with different coverage in the experimental area were selected to evaluate the proposed detection models. In the test experiment, the camera angle was changed ten times at the same test site to obtain the data, and each set of data was collected ten times for the experiment. At each test site, we collected images from different orientations. A total of 4500 images were collected across all test sites, and the ratio of the training set to the validation set was 7:3. Finally, after image augmentation, the training and validation set was increased to 5000 images.
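For illustration, a 7:3 split like the one described above can be reproduced with a few lines of Python; the directory layout and file extension used here are assumptions for illustration only.

```python
import random
from pathlib import Path

# A minimal sketch of the 7:3 training/validation split described above.
# The directory name "dataset/images" and the .jpg extension are assumptions.
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)                      # fixed seed for a reproducible split
random.shuffle(images)
split = int(0.7 * len(images))
train_images, val_images = images[:split], images[split:]
print(f"training images: {len(train_images)}, validation images: {len(val_images)}")
```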

2.3. The Research Method for the Detection of A. splendens

2.3.1. Model Preparation

As shown in Figure 4, the Transformer encoder block consisted of embedded patches, LayerNorm, Dropout, Multi-Head Attention, and an MLP. Each Transformer encoder block contained two sublayers: the first was a Multi-Head Attention layer, and the second (the MLP) was a fully connected layer. A residual dropout connection was used around each sublayer. The Transformer encoder block added the ability to capture different local information and could exploit the self-attention mechanism to mine the potential of feature representations.
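For illustration, the block described above can be expressed as a minimal PyTorch sketch; the embedding dimension, number of heads, MLP ratio, and dropout rate used here are illustrative assumptions rather than the settings of YOLO-Sp.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Minimal sketch of the encoder block in Figure 4:
    LayerNorm -> Multi-Head Attention -> residual dropout connection ->
    LayerNorm -> MLP (fully connected) -> residual dropout connection."""

    def __init__(self, embed_dim=256, num_heads=8, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * mlp_ratio),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim * mlp_ratio, embed_dim),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: embedded patches of shape (sequence_length, batch, embed_dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)            # self-attention over all patches
        x = x + self.drop(attn_out)                 # first residual dropout connection
        x = x + self.drop(self.mlp(self.norm2(x)))  # second residual dropout connection
        return x

# Example: 196 patch tokens, batch of 2, 256-dimensional embeddings.
y = TransformerEncoderBlock()(torch.randn(196, 2, 256))
```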
After optimizing and improving the Transformer encoder block, the Swin block achieved SOTA results in the image segmentation and detection fields. The overall flow of the module is shown in Figure 5. The module first performs LayerNorm on the feature map, then determines from the shift_size parameter whether the feature map needs to be shifted and divides it into windows. The module calculates the attention and uses a mask to distinguish window attention from shifted-window attention, which limits the content visible to each position in the attention. After the windows are merged, the previous shift operation is restored by a reverse shift. Finally, the module is completed through a LayerNorm + fully connected layer and a residual dropout connection.
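The control flow described above can be sketched as follows; this simplified version uses torch.roll for the cyclic shift and omits the attention mask and relative position bias of the full Swin Transformer, so it illustrates the flow in Figure 5 rather than reproducing the exact module used in YOLO-Sp.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlockSketch(nn.Module):
    """Control flow of the Swin block in Figure 5: LayerNorm -> (optional cyclic
    shift) -> window attention -> merge windows -> reverse shift -> residual ->
    LayerNorm + MLP -> residual. The attention mask and relative position bias
    of the full Swin Transformer are omitted for brevity."""

    def __init__(self, dim=96, num_heads=3, window_size=7, shift_size=0):
        super().__init__()
        self.ws, self.shift = window_size, shift_size
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, H, W, C) feature map, H and W divisible by window_size
        B, H, W, C = x.shape
        shortcut = x
        h = self.norm1(x)
        if self.shift > 0:  # shift the feature map for shifted-window attention (SW-MSA)
            h = torch.roll(h, shifts=(-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(h, self.ws).transpose(0, 1)  # (ws*ws, num_windows*B, C)
        attn_out, _ = self.attn(win, win, win)              # attention within each window
        h = window_reverse(attn_out.transpose(0, 1), self.ws, H, W)
        if self.shift > 0:  # restore the previous shift operation (reverse shift)
            h = torch.roll(h, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + h
        x = x + self.mlp(self.norm2(x))                     # LayerNorm + fully connected layer
        return x

# Example: a shifted block on a 56 x 56 x 96 feature map.
out = SwinBlockSketch(shift_size=3)(torch.randn(1, 56, 56, 96))
```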
CBAM is a straightforward yet efficient attention module (Figure 6). It is a lightweight, plug-and-play module that can be trained end-to-end and is compatible with CNN architectures. Given a feature map, CBAM sequentially infers an attention map along two independent dimensions, channel and space, and then multiplies the attention map with the input feature map to conduct adaptive feature refinement.
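A minimal PyTorch sketch of CBAM is given below; the reduction ratio of 16 and the 7 × 7 spatial kernel are the common defaults of the original CBAM design [38] and are assumed here rather than taken from the YOLO-Sp configuration.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal sketch of CBAM [38]: a channel attention map followed by a spatial
    attention map, each multiplied with the feature map for adaptive refinement."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # shared MLP for the channel attention map
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # convolution over the concatenated average/max maps for spatial attention
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # channel attention: aggregate spatial information by average and max pooling
        ca = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        x = x * ca
        # spatial attention: aggregate channel information by mean and max
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * sa
```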

2.3.2. The Proposal of the YOLO-Sp Model in This Study

This study proposed a new deep learning model called YOLO-Sp. By introducing the Swin block, the CBAM module, CIoU Loss, VFL Loss, and a decoupled head, this study optimized and adjusted the classic YOLOv5 model. As seen in Figure 7, network adjustments were made in the backbone, neck, and head parts of YOLOv5. Some convolution blocks and CSP bottleneck blocks in YOLOv5 were replaced with Transformer encoder blocks, which employ the self-attention mechanism to exploit the potential of feature representation and enhance the capacity to gather diverse local information. The Swin block in the backbone can therefore accommodate the size-adaptive output of SPP. In the neck section, the network also implements the Transformer concept. The CBAM module successively infers the attention map in two independent dimensions (channel and space), which is multiplied with the input feature map for adaptive feature optimization. The connection between the Swin block and the CBAM module can efficiently convey robust semantic features through the attention mechanism during upsampling and downsampling. The coupled head refers to the head of YOLOv5. Following the head concept of YOLOX, in this study the coupled head was replaced with a decoupled head to increase convergence speed, while the number of channels of the regression head was altered as necessary.
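As a concrete illustration of the decoupled head adopted from YOLOX, the following PyTorch sketch separates the classification and regression branches after a shared stem; the channel width, activation, and number of anchors per location are illustrative assumptions rather than the exact YOLO-Sp settings.

```python
import torch
import torch.nn as nn

def conv_bn_act(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.SiLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    """YOLOX-style decoupled head: a 1x1 stem followed by separate classification
    and regression branches, in place of the single coupled prediction layer of
    YOLOv5."""

    def __init__(self, in_channels, num_classes=1, num_anchors=3, hidden=256):
        super().__init__()
        self.stem = conv_bn_act(in_channels, hidden, k=1)
        self.cls_branch = nn.Sequential(conv_bn_act(hidden, hidden), conv_bn_act(hidden, hidden))
        self.reg_branch = nn.Sequential(conv_bn_act(hidden, hidden), conv_bn_act(hidden, hidden))
        self.cls_pred = nn.Conv2d(hidden, num_anchors * num_classes, 1)  # class scores
        self.box_pred = nn.Conv2d(hidden, num_anchors * 4, 1)            # box regression
        self.obj_pred = nn.Conv2d(hidden, num_anchors, 1)                # objectness

    def forward(self, x):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.box_pred(reg_feat), self.obj_pred(reg_feat)

# Example: predictions on an 80 x 80 neck feature map with 512 channels.
cls, box, obj = DecoupledHead(512)(torch.randn(1, 512, 80, 80))
```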
CIoU Loss was used as the regression loss of the designed model. CIoU takes the aspect ratio of the bounding box into the loss function, further improving regression accuracy. The penalty term of CIoU adds an impact factor αv to the penalty term of DIoU; this factor encourages the aspect ratio of the predicted box to fit the aspect ratio of the ground truth box. The IoU is defined in Equation (1), and the penalty term is described in Equation (2). In Equation (2), α is a trade-off parameter and v measures the consistency of the aspect ratio; their expressions are given in Equations (3) and (4), respectively. The CIoU loss is calculated by Equation (5).
$$\mathrm{IoU}(R, R^{gt}) = \frac{\left|R \cap R^{gt}\right|}{\left|R \cup R^{gt}\right|} \tag{1}$$

$$\mathcal{R}_{CIoU} = \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{2}$$

$$\alpha = \frac{v}{\left(1 - \mathrm{IoU}\right) + v} \tag{3}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{4}$$

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{5}$$
where b and b^{gt} represent the center points of the predicted box and the ground truth box, respectively; α is the trade-off parameter; v is the parameter measuring the consistency of the aspect ratio; ρ denotes the Euclidean distance between the two center points; and c represents the diagonal distance of the smallest closed area that can contain both the predicted box and the ground truth box.
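For illustration, Equations (1)–(5) can be combined into a short function; the sketch below works on corner-format (x1, y1, x2, y2) boxes and is an illustrative implementation rather than the exact training code of YOLO-Sp.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss following Equations (1)-(5); pred and target are (N, 4) tensors
    of boxes in (x1, y1, x2, y2) format."""
    # Intersection over union, Equation (1)
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v (Equation (4)) and trade-off weight alpha (Equation (3))
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    # Equation (5)
    return 1 - iou + rho2 / c2 + alpha * v
```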
VFL Loss was used as the classification loss, calculated by Equation (6). The main improvement of VFL is an asymmetric weighting operation. For positive samples, q is the IoU between the predicted bounding box and the ground truth; for negative samples, q = 0. For positive samples, the focal down-weighting is not used; instead, ordinary BCE with an adaptive IoU weight is applied to highlight the primary examples. For negative samples, the standard focal loss (FL) weighting is applied. VFL is thus simpler than QFL, and its main features are the asymmetric weighting of positive and negative samples and the emphasis on positive samples as the primary samples. Thus, this study introduced VFL Loss into the model's classification loss calculation.
$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q\log(p) + (1 - q)\log(1 - p)\right), & q > 0 \\ -\alpha p^{\gamma}\log(1 - p), & q = 0 \end{cases} \tag{6}$$
where p is the predicted classification score; q is the target score (the IoU with the ground truth for positive samples and 0 for negative samples); α is a weight parameter; and γ is an adjustable focusing parameter.
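A compact sketch of Equation (6) is given below; the target tensor q follows the definition above (the IoU with the ground truth for positive samples, 0 for negative samples), and the default values of α and γ are assumptions rather than the settings used in this study.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    """Varifocal loss following Equation (6): pred_logits are raw classification
    logits and target_score is q (IoU for positives, 0 for negatives)."""
    p = torch.sigmoid(pred_logits)
    pos = (target_score > 0).float()
    # Positives: BCE weighted by the IoU-aware target q (no focal down-weighting).
    # Negatives: standard focal weighting alpha * p^gamma.
    weight = target_score * pos + alpha * p.detach().pow(gamma) * (1 - pos)
    loss = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * loss).sum()
```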

2.4. Experimental Evaluation Index

The test set was used to evaluate the performance of the detection model once training was complete. Precision, recall, and F1-score were used as evaluation indicators and are described in Equations (7)–(9). AP and mAP reflect the average accuracy and overall effect of the model and were calculated by Equations (10) and (11).
$$\mathrm{precision} = \frac{T_{positive}}{T_{positive} + F_{positive}} \tag{7}$$

$$\mathrm{recall} = \frac{T_{positive}}{T_{positive} + F_{negative}} \tag{8}$$

$$F1\text{-}score = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{9}$$

$$AP = \frac{1}{n}\sum_{i=1}^{n} \mathrm{precision}(i) \tag{10}$$

$$mAP = \frac{\sum_{i=1}^{k} AP_i}{k} \tag{11}$$
where n is the number of IoU thresholds; k is the number of categories; and $T_{positive}$, $F_{positive}$, and $F_{negative}$ are true positives (correct detections), false positives (false detections), and false negatives (misses), respectively.
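Equations (7)–(10) reduce to a few lines of code; the counts used in the example call are illustrative only and are not results from this study.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from Equations (7)-(9)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(precisions):
    """AP as the mean precision over the n evaluated IoU thresholds (Equation (10));
    mAP (Equation (11)) averages the per-class AP values in the same way."""
    return sum(precisions) / len(precisions)

# Illustrative numbers only: 98 correct detections, 2 false detections, 1 missed instance.
p, r, f1 = detection_metrics(tp=98, fp=2, fn=1)
print(f"precision={p:.3f}, recall={r:.3f}, F1={f1:.3f}")
```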

3. Results

The number of epochs was set to 200 to obtain the ideal model weights, and the best weights were updated every ten iterations. After completing the tuning tests, this study set the parameters at all levels (learning rate = 0.01, batch size = 8, momentum = 0.94, mosaic = 1.0, warmup momentum = 0.8, weight decay = 0.0005, and warmup epochs = 3.0). Figure 8 depicts the training outcomes of the model's first 50 epochs following parameter optimization and tuning. The first row in Figure 8 shows the loss during the training phase and the second the loss during the validation phase, while the third and fourth rows illustrate the evolution of the metrics during the procedure. The training loss decreased abruptly around epoch 40 and approached zero thereafter, and the validation loss converged toward zero after about 20 epochs. Before the fifth epoch there were significant swings, which stabilized progressively after the tenth epoch. The other indicators behaved like the validation loss before five epochs. From val Box and Box, it could be determined that the bounding boxes derived from the model weights after 20 epochs of training were already rather precise. Precision and recall fluctuated greatly before 100 epochs but converged to 1.0 after 100 epochs, and after approximately 10 epochs mAP50 converged to 1.0, suggesting the model was generalized and stable. As the IoU threshold increased, the mAP value of the model varied strongly before epoch 40 but converged to 1.0 after epoch 40. In conclusion, the experimental training results demonstrated that the model's average accuracy was outstanding.
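For reference, the training configuration listed at the beginning of this section can be collected into a single dictionary in the style of a YOLOv5 hyperparameter file; only the values come from the text, while the key names and arrangement are assumptions.

```python
# Training configuration reported above, arranged for readability.
train_config = {
    "epochs": 200,            # total training epochs
    "batch_size": 8,
    "lr0": 0.01,              # initial learning rate
    "momentum": 0.94,
    "weight_decay": 0.0005,
    "warmup_epochs": 3.0,
    "warmup_momentum": 0.8,
    "mosaic": 1.0,            # mosaic augmentation probability
    "save_period": 10,        # best weights refreshed every ten iterations
}
```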
To comprehensively assess the performance of the YOLO-Sp model presented in this study, it was compared with the mainstream image segmentation models indicated in the Introduction and with image segmentation models using a CNN as the backbone (Table 2). Each model was trained with the same settings as the YOLO-Sp model in the previous step, and the best weights were selected for comparative analysis. Since the properties of grassland vegetation in RGB images are difficult to identify, the extracted characteristics of A. splendens were often confused with the surrounding environment. The AP values of BlendMask and similar algorithms did not surpass 90%, but BlendMask fared better than the other CNN algorithms, with an AP value of 86.3%. As demonstrated in Table 2, the AP value of conventional models after transplanting the Transformer was greatly enhanced. Although Swin increased the model size and inference time, the AP value after transplanting it into the backbone exceeded that of most conventional CNN models. The AP value of the YOLO-Sp-X proposed in this research was 95.4%, giving the best segmentation performance for A. splendens images. The model with the best overall performance was YOLO-Sp-M, whose AP value reached 95.2% despite its relatively small model size. This also shows that the Swin block is well suited for transplanting into conventional image segmentation models. Additionally, the AP value of Cascade R-CNN with a Swin backbone surpassed 90%, although the resulting Cascade R-CNN models were enormous: the model trained with the Swin-Small backbone was over 1 GB. Comparing Mask R-CNN with Cascade R-CNN led to the conclusion that the performance of conventional networks can be enhanced by transplanting the Swin block. In this study, the addition of the CBAM module as a complement to the attention did an outstanding job of ensuring compatibility with CNN networks.
To report the performance of the different YOLO-Sp models in terms of mean average precision (mAP) at a confidence threshold of 0.5 (Table 3), this study evaluated each model on a test dataset and recorded the precision and recall for each class. The mAP was calculated as the average of the precision–recall curve across all categories. The results show that YOLO-Sp models with larger backbones (i.e., YOLO-Sp-M, YOLO-Sp-L, and YOLO-Sp-X) achieved higher mAP scores than those with smaller backbones (i.e., YOLO-Sp-N and YOLO-Sp-S). Specifically, YOLO-Sp-X achieved the highest mAP score of 95.4%, followed by YOLO-Sp-M at 95.2%, YOLO-Sp-L at 94.6%, YOLO-Sp-S at 86.7%, and YOLO-Sp-N at 81.4%. Overall, these results suggest that using larger backbones can significantly improve the performance of the YOLO-Sp object detection algorithm, especially for detecting small or low-contrast objects. However, it is essential to balance the trade-off between accuracy and computational cost when selecting a model for a specific application. Lastly, the decoupled head improved the convergence speed, enhancing the overall performance. Figure 9 shows the image segmentation effect of YOLO-Sp-M.
This study also compared YOLO-Sp with the existing popular YOLO series algorithm models for object detection (Table 4). All models were trained with the same settings as the YOLO-Sp model above, and following iteration, the appropriate weights were selected for comparative analysis. The bounding box could locate A. splendens with reasonable accuracy, a task that is simpler than image segmentation. The YOLO-Sp models outperformed the other YOLO models. YOLO-Sp-X had an AP value of 98.4%, making it the model with the highest AP value, while YOLO-Sp-M demonstrated the best overall performance, achieving a superior AP value while preserving the balance between model size and pre-processing time. Compared to the YOLOv5 baseline, the average AP value of each YOLO-Sp model was boosted by more than 5%. As demonstrated in Figure 10, the detection effect of YOLO-Sp-X is illustrated by the bounding boxes surrounding whole A. splendens plants.
Finally, the model's real-world detection accuracy was evaluated. The best weights generated by training YOLO-Sp-X, which showed good comprehensive performance in the above experiments, were selected for testing. Additionally, 1800 images were chosen and separated into nine groups for detection. As illustrated in Table 5, results were computed by counting the positive and negative samples for each group. The precision values achieved by the algorithm ranged from 97.5% to 99.5%, with an average precision of 98.6%. This indicates that the algorithm could accurately detect objects in the images with high confidence. However, it is worth noting that the algorithm's precision was not consistent across all test groups, with some test groups achieving higher precision values than others. To better understand the algorithm's performance, its recall and F1-score were also evaluated: recall measures the ability of the algorithm to detect all positive instances, while the F1-score combines precision and recall. The recall values achieved by the algorithm ranged from 97.5% to 99.5%, with an average recall of 98.7%. The F1-scores achieved by the algorithm ranged from 98.2% to 99.5%, with an average F1-score of 98.9%. Overall, the results suggest that the detection algorithm could accurately detect objects in the images with high precision and recall. However, it is essential to note that various factors, such as lighting, image quality, and object orientation, may influence the algorithm's performance. In the nine experimental groups, the accuracy on data gathered under insufficient sunshine or camera backlight was lower than under normal conditions (sufficient light). Therefore, further testing and analysis may be necessary to fully evaluate the algorithm's performance in real-world scenarios.

4. Discussion

It can be seen from Figure 8 that both the loss and the metrics fluctuated greatly in the early epochs. The likely reason is that the modules introduced into the model described in this study had to be adapted to the original network. While these modules are designed to enhance specific aspects of the model's functionality, they also introduce new parameters or alter existing ones, potentially disrupting the delicate balance between the model's components. As a result, the initial stages of training may involve a period of instability as the model adjusts to these changes and attempts to optimize its performance. In the early stages of training, the model may make significant changes to these parameters to find an optimal configuration, resulting in large fluctuations in loss and metrics. As training progresses and the model becomes more familiar with the data, it converges to a more stable configuration with smaller changes between iterations. We believe that compatibility with the CNN network has been greatly enhanced by introducing the CBAM module. Simultaneously, the decoupled head boosts the generalization introduced by the Swin block and increases the convergence speed of the model.
From the multiple experiments conducted in this paper, compared to existing SOTA deep learning models [28,29,30,31,32,33,34,39,40,41], the YOLO-Sp model suggested in this work produced superior results for A. splendens object detection and image segmentation. This demonstrates that the Transformer possesses greater potential for identifying A. splendens. This advantage may also apply to other plants native to temperate meadow steppe or mild temperate grassland. The diverse field conditions of grasslands necessitate substantial training data for grassland feature extraction. The Transformer adapts effectively to large data sets, and it is evident from this study that the model's performance improves as the data grow. The Transformer gains a deeper understanding of the interactions between the learnt characteristics, making it better suited to grassland environments. The current study investigated the distribution of attention in the model and found that each attention head may learn to perform its task differently when sensing A. splendens with varying basal coverage. The Swin block module is highly portable and compatible with the conventional Conv module (the Swin block and Conv extract the global data in the upper and lower layers of the YOLO-Sp features). CNNs are well suited for image-based tasks that involve identifying spatial patterns, while Transformers are better suited for tasks that involve sequential data and long-range dependencies. Combining these two models allows the resulting hybrid model to capture both spatial and sequential information, allowing it to perform well on a wide range of tasks. The combination of Transformer and CNN can produce superior network structure generalization. The proposed YOLO-Sp incorporates the locality, translation invariance, and hierarchy of CNNs, and it additionally retains the Transformer's ability to capture long-range semantic dependencies. Future advancements in detecting temperate grassland vegetation represented by A. splendens may hinge on developing a model that integrates the Transformer and CNN even more effectively. The proposed YOLO-Sp is a specific example of this approach to detecting temperate grassland vegetation, suggesting potential for future advancements in this area by further integrating these models.
In this study, after labelling the data, it was discovered that herbaceous plants such as A. splendens make this task challenging. Performing rectangle labelling for target identification on millions of data points is arduous, but the rectangular frame is relatively precise and accessible. In contrast, polygon labelling for segmentation tasks on slender plants is difficult and imprecise, resulting in low labor efficiency. This study also investigated other labelling techniques (Figure 11). Consequently, research into algorithms that can achieve effective detection results through simple labelling may become popular. Deep learning needs a large amount of data and improved labelling techniques.
Using intelligent robots for grassland resource surveys could significantly improve the efficiency and accuracy of these surveys, leading to better management of grassland resources and, ultimately, a more sustainable ecosystem. However, factors such as camera resolution, shooting distance, and weather can all impact image capture quality, further complicating plant detection tasks. Table 5 shows that the lowest accuracy was obtained under insufficient sunlight and backlight conditions. This finding may be useful for grassland plant detection, where the detection of slender plants such as A. splendens is challenging due to the intermingling of plant pixels with pixels of the surrounding environment. While the results show some potential for real-time intelligent estimation of grass plant biomass, further research is needed to improve the accuracy and detection speed of the model. By testing the performance of different object detection models in grassland environments, the study sheds light on the strengths and limitations of various models, highlighting areas for future research and improvement. These findings may help inform the development of more robust and adaptable object detection models for use in diverse environmental settings. One way to improve detection speed is to use lightweight networks, such as ShuffleNet [42] and MobileNets [43], as the backbone network. Some scholars apply these networks to large-scale detection algorithms to improve detection speed [44,45]. However, how to better combine a lightweight network with the Transformer still needs to be explored through extensive experiments.
A. splendens is distributed in the northwestern and northeastern provinces of China, Inner Mongolia, Shanxi, and Hebei. It grows on slightly alkaline grasslands and sandy slopes at altitudes of 500–900 m. A. splendens has strong vitality, is resistant to drought, salt, and alkali, and can still grow vigorously on barren land where other plants cannot survive [5]. The research results of this paper can be used to locate A. splendens and monitor its growth; mechanized root cutting and harvesting are the tasks that follow detection, and they are not easy to control precisely for A. splendens. We also note the importance of interdisciplinary research combining expertise in computer science, ecology, and agriculture. For example, this study focuses on the flowering and fruiting stage of A. splendens, whose external phenotype may make it more amenable to detection. Other grassland vegetation passes through multiple phenological phases of development, and it may be necessary to use different algorithms for detection at various stages of different plants. Researchers can develop solutions to complex environmental challenges, such as monitoring and managing grassland resources, by working collaboratively. This type of research provides practical benefits and contributes to a more holistic understanding of the interactions between technology and the environment. The results demonstrate the potential for real-time intelligent estimation of grass plant biomass, offering hope for more efficient and accurate grassland resource surveys. Moreover, the study highlights the importance of interdisciplinary research, which can lead to more effective solutions to environmental challenges. Overall, this research represents a modest step forward in developing intelligent robots for grassland resource surveys, and it may contribute to the broader field of computer vision and machine learning.

5. Conclusions

To intelligently monitor the growth of A. splendens, the monitoring method proposed in this paper can be implemented by a ground mobile robot with a binocular depth camera. Ground-based detection is more weatherproof than drones, and binocular depth cameras are less expensive than multi-line lidar and hyperspectral sensors. This paper proposed a new Transformer-based model called YOLO-Sp. Based on YOLOv5, it improved and optimized the backbone, neck, and head parts. Although the Swin block increases the computational complexity of the original model, the Transformer has a better ability to acquire global information. The introduction of the CBAM module makes the model better integrated with the CNN network, fully absorbing the advantages of CNNs' locality, translation invariance, and hierarchy. The improved convergence speed of the decoupled head also gives YOLO-Sp the best overall performance. For detecting A. splendens, YOLO-Sp achieved better results than contemporaneous SOTA deep models when comparing AP values under the same training conditions, as well as in terms of actual measurement accuracy. However, the model also has room for improvement: it does not adapt well to changes in light intensity, and its inference speed should be improved. A. splendens is relatively slender in the image, and its pixels are easily mixed with those of the surrounding environment, which has much to do with its growth condition. Our follow-up work will focus on improving the accuracy and extending the method to the detection of other plants in grassland ecosystems. This work may contribute to the development of intelligent detection of grassland plant biomass.

Author Contributions

Y.Z.: Methodology, validation, writing original draft, and formal analysis. D.W.: Funding acquisition and review and editing. Y.Y.: Funding acquisition and project administration. D.Z.: Review and editing. T.W.: Funding acquisition and review and editing. M.L.: Data acquisition and curation. Y.L.: Data acquisition and curation. X.Z.: Data acquisition and curation. All authors contributed significantly to the manuscript and approved it for publication. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2021YFD1300503) and the Science and Technology Plan of Inner Mongolia Autonomous Region (2022YFSJ0039). The work presented in this study was partially supported by the China Agriculture Research System Double First-Class Construction Project, the earmarked fund for CARS (CARS-34), and the 2115 Talent Cultivation and Development Support Program of China Agricultural University.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data will be made publicly available when the article is accepted for publication.

Acknowledgments

The authors would like to acknowledge the College of Engineering at China Agricultural University and Xilingol Mengzhiyuan Animal Husbandry Company. The authors gratefully appreciate the reviewers who provided helpful suggestions for this manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Yang, Y.; Wang, Z.; Li, J.; Gang, C.; Zhang, Y.; Zhang, Y.; Odeh, I.; Qi, J. Comparative assessment of grassland degradation dynamics in response to climate variation and human activities in China, Mongolia, Pakistan and Uzbekistan from 2000 to 2013. J. Arid. Environ. 2016, 135, 164–172. [Google Scholar] [CrossRef]
  2. Kang, L.; Han, X.; Zhang, Z.; Sun, O.J. Grassland Ecosystems in China: Review of Current Knowledge and Research Advancement. Philos. Trans. R. Soc. B Biol. Sci. 2007, 362, 997–1008. [Google Scholar] [CrossRef]
  3. Malchair, S.; De Boeck, H.; Lemmens, C.; Merckx, R.; Nijs, I.; Ceulemans, R.; Carnol, M. Do climate warming and plant species richness affect potential nitrification, basal respiration and ammonia-oxidizing bacteria in experimental grasslands? Soil Biol. Biochem. 2010, 42, 1944–1951. [Google Scholar] [CrossRef]
  4. Yu, C.; Zhang, Y.; Claus, H.; Zeng, R.; Zhang, X.; Wang, J. Ecological and Environmental Issues Faced by a Developing Tibet. Environ. Sci. Technol. 2012, 46, 1979–1980. [Google Scholar] [CrossRef]
  5. Tsui, Y. Achnatherum splendens, a plant of industrial importance. J. Bot. Soc. China 1951, 5, 123–125. [Google Scholar]
  6. Irfan, M.; Dawar, K.; Fahad, S.; Mehmood, I.; Alamri, S.; Siddiqui, M.H.; Saud, S.; Khattak, J.Z.K.; Ali, S.; Hassan, S.; et al. Exploring the potential effect of Achnatherum splendens L.–derived biochar treated with phosphoric acid on bioavailability of cadmium and wheat growth in contaminated soil. Environ. Sci. Pollut. Res. 2022, 29, 37676–37684. [Google Scholar] [CrossRef]
  7. Koyama, A.; Yoshihara, Y.; Jamsran, U.; Okuro, T. Role of tussock morphology in providing protection from grazing for neighbouring palatable plants in a semi-arid Mongolian rangeland. Plant Ecol. Divers. 2015, 8, 163–171. [Google Scholar] [CrossRef]
  8. Jiang, Z.-Y.; Li, X.-Y.; Wu, H.-W.; Zhang, S.-Y.; Zhao, G.-Q.; Wei, J.-Q. Linking spatial distributions of the patchy grass Achnatherum splendens with dynamics of soil water and salt using electromagnetic induction. Catena 2017, 149, 261–272. [Google Scholar] [CrossRef]
  9. Yang, C.; Li, K.; Sun, J.; Ye, W.; Lin, H.; Yang, Y.; Zhao, Y.; Yang, G.; Wang, Z.; Liu, G.; et al. The spatio-chronological distribution of Achnatherum splendens influences soil bacterial communities in degraded grasslands. Catena 2022, 209, 105828. [Google Scholar] [CrossRef]
  10. Ni, J.; Cheng, Y.; Wang, Q.; Ng, C.W.W.; Garg, A. Effects of vegetation on soil temperature and water content: Field monitoring and numerical modelling. J. Hydrol. 2019, 571, 494–502. [Google Scholar] [CrossRef]
  11. Ekhtari, N.; Glennie, C.; Fernandez-Diaz, J.C. Classification of Airborne Multispectral Lidar Point Clouds for Land Cover Mapping. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2068–2078. [Google Scholar] [CrossRef]
  12. Han, W.; Zhao, S.; Feng, X.; Chen, L. Extraction of multilayer vegetation coverage using airborne LiDAR discrete points with intensity information in urban areas: A case study in Nanjing City, China. Int. J. Appl. Earth Obs. Geoinf. 2014, 30, 56–64. [Google Scholar] [CrossRef]
  13. Arenas-Corraliza, I.; Nieto, A.; Moreno, G. Automatic mapping of tree crowns in scattered-tree woodlands using low-density LiDAR data and infrared imagery. Agrofor. Syst. 2020, 94, 1989–2002. [Google Scholar] [CrossRef]
  14. Guo, Y.; Graves, S.; Flory, S.L.; Bohlman, S. Hyperspectral Measurement of Seasonal Variation in the Coverage and Impacts of an Invasive Grass in an Experimental Setting. Remote Sens. 2018, 10, 784. [Google Scholar] [CrossRef] [Green Version]
  15. Zha, Y.; Gao, J. Quantitative detection of change in grass cover from multi-temporal TM satellite data. Int. J. Remote Sens. 2011, 32, 1289–1302. [Google Scholar] [CrossRef]
  16. Converse, R.L.; Lippitt, C.D.; Lippitt, C.L. Assessing Drought Vegetation Dynamics in Semiarid Grass- and Shrubland Using MESMA. Remote. Sens. 2021, 13, 3840. [Google Scholar] [CrossRef]
  17. Malmstrom, C.M.; Butterfield, H.S.; Planck, L.; Long, C.W.; Eviner, V.T. Novel fine-scale aerial mapping approach quantifies grassland weed cover dynamics and response to management. PLoS ONE 2017, 12, e0181665. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Reinermann, S.; Asam, S.; Kuenzer, C. Remote Sensing of Grassland Production and Management—A Review. Remote Sens. 2020, 12, 1949. [Google Scholar] [CrossRef]
  19. Wang, Y.; Xue, M.; Zheng, X.; Ji, B.; Du, R.; Wang, Y. Effects of environmental factors on N2O emission from and CH4 uptake by the typical grasslands in the Inner Mongolia. Chemosphere 2005, 58, 205–215. [Google Scholar] [CrossRef]
  20. Castro, W.; Junior, J.M.; Polidoro, C.; Osco, L.P.; Gonçalves, W.; Rodrigues, L.; Santos, M.; Jank, L.; Barrios, S.; Valle, C.; et al. Deep Learning Applied to Phenotyping of Biomass in Forages with UAV-Based RGB Imagery. Sensors 2020, 20, 4802. [Google Scholar] [CrossRef]
  21. Schreiber, L.V.; Amorim, J.G.A.; Guimarães, L.; Matos, D.M.; da Costa, C.M.; Parraga, A. Above-ground Biomass Wheat Estimation: Deep Learning with UAV-based RGB Images. Appl. Artif. Intell. 2022, 36, 2055392. [Google Scholar] [CrossRef]
  22. Morgan, G.R.; Wang, C.; Morris, J.T. RGB Indices and Canopy Height Modelling for Mapping Tidal Marsh Biomass from a Small Unmanned Aerial System. Remote Sens. 2021, 13, 3406. [Google Scholar] [CrossRef]
  23. Poley, L.G.; McDermid, G.J. A Systematic Review of the Factors Influencing the Estimation of Vegetation Aboveground Biomass Using Unmanned Aerial Systems. Remote Sens. 2020, 12, 1052. [Google Scholar] [CrossRef] [Green Version]
  24. Gebhardt, S.; Schellberg, J.; Lock, R.; Kühbauch, W. Identification of broad-leaved dock (Rumex obtusifolius L.) on grassland by means of digital image processing. Precis. Agric. 2006, 7, 165–178. [Google Scholar] [CrossRef]
  25. Petrich, L.; Lohrmann, G.; Neumann, M.; Martin, F.; Frey, A.; Stoll, A.; Schmidt, V. Detection of Colchicum autumnale in drone images, using a machine-learning approach. Precis. Agric. 2020, 21, 1291–1303. [Google Scholar] [CrossRef]
  26. Wang, L.; Zhou, Y.; Hu, Q.; Tang, Z.; Ge, Y.; Smith, A.; Awada, T.; Shi, Y. Early Detection of Encroaching Woody Juniperus virginiana and Its Classification in Multi-Species Forest Using UAS Imagery and Semantic Segmentation Algorithms. Remote Sens. 2021, 13, 1975. [Google Scholar] [CrossRef]
  27. Gallmann, J.; Schüpbach, B.; Jacot, K.; Albrecht, M.; Winizki, J.; Kirchgessner, N.; Aasen, H. Flower Mapping in Grasslands With Drones and Deep Learning. Front. Plant Sci. 2022, 12, 3304. [Google Scholar] [CrossRef] [PubMed]
  28. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar] [CrossRef] [Green Version]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870v3. [Google Scholar]
  31. Chen, X.; Girshick, R.; He, K.; Dollar, P. TensorMask: A Foundation for Dense Object Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  32. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9156–9165. [Google Scholar] [CrossRef] [Green Version]
  33. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12346 LNCS, pp. 213–229. [Google Scholar] [CrossRef]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  35. Lin, Y.; Chen, T.; Liu, S.; Cai, Y.; Shi, H.; Zheng, D.; Lan, Y.; Yue, X.; Zhang, L. Quick and accurate monitoring peanut seedlings emergence rate through UAV video and deep learning. Comput. Electron. Agric. 2022, 197, 106938. [Google Scholar] [CrossRef]
  36. Wang, D.; He, D. Fusion of Mask RCNN and attention mechanism for instance segmentation of apples under complex background. Comput. Electron. Agric. 2022, 196, 106864. [Google Scholar] [CrossRef]
  37. Olenskyj, A.G.; Sams, B.S.; Fei, Z.; Singh, V.; Raja, P.V.; Bornhorst, G.M.; Earles, J.M. End-to-end deep learning for directly estimating grape yield from ground-based imagery. Comput. Electron. Agric. 2022, 198, 107081. [Google Scholar] [CrossRef]
  38. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef] [Green Version]
  39. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696v1. [Google Scholar]
  40. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430v2. [Google Scholar]
  41. YOLOv5 Models. Available online: https://Github.com/Ultralytics/Yolov5 (accessed on 15 December 2022).
  42. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  43. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  44. Wan, X.; Song, H.; Luo, L.; Li, Z.; Sheng, G.; Jiang, X. Pattern Recognition of Partial Discharge Image Based on One-dimensional Convolutional Neural Network. In Proceedings of the 2018 Condition Monitoring and Diagnosis (CMD), Perth, WA, Australia, 23–26 September 2018; pp. 1–4. [Google Scholar] [CrossRef]
  45. Gomes, R.; Rozario, P.; Adhikari, N. Deep Learning optimization in remote sensing image segmentation using dilated convolutions and ShuffleNet. In Proceedings of the 2021 IEEE International Conference on Electro Information Technology (EIT), Mt. Pleasant, MI, USA, 14–15 May 2021; pp. 244–249. [Google Scholar] [CrossRef]
Figure 1. Location of the experimental area.
Figure 2. A. splendens images captured in the experimental area. (a) Image from the UAV perspective. (b) Image from the ground mobile robot perspective.
Figure 3. The equipment used to collect data. (a) Ground mobile robot. (b) Binocular stereo depth camera.
Figure 4. Backbone diagram of Transformer encoder block. MLP, multilayer perceptron [33].
Figure 5. Backbone diagram of Swin block. LN, LayerNorm. MLP, multilayer perceptron. W-MSA, window attention. SW-MSA, shifted window attention [34].
Figure 6. Backbone diagram of CBAM [38].
Figure 7. The architecture of the YOLO-Sp model.
Figure 8. Training results of YOLO-Sp. (Box: CIoU and VFL average loss; val Box: validation dataset bounding box loss; mAP50: mean mAP with a threshold greater than 0.5; mAP50-95: mean mAP with thresholds ranging from 0.5 to 0.95).
Figure 9. Image segmentation effects of A. splendens at different test sites.
Figure 10. Object detection effects of A. splendens at different test sites.
Figure 11. Different labelling effects for object detection and image segmentation. (a) Polygon labelling for segmentation. (b) Rectangle labelling for object detection.
Table 1. The product parameters of the binocular stereo depth camera.

Item | Product Parameters
Model | MYNT EYE D1000-50/Color
Frame Rate | Up to 60 FPS
Depth Resolution | On chip: 1280 × 720, 640 × 480
Pixel Size | 3.75 × 3.75 μm
Baseline | 120.0 mm
Visual Angle | D: 70°, H: 64°, V: 38°
Focal Length | 3.9 mm
Working Distance | 0.49–10 m
IMU Frequency | 200 Hz
Weight | 152 g
Table 2. Detection results of different algorithms in image segmentation.

Algorithm | Model/Backbone | Model Size (MB) | Inference Time (s) | AP (%)
YOLO-Sp | YOLO-Sp-N | 50.4 | 0.26 | 81.4
YOLO-Sp | YOLO-Sp-S | 120.5 | 0.38 | 86.7
YOLO-Sp | YOLO-Sp-M | 207.3 | 0.45 | 95.2
YOLO-Sp | YOLO-Sp-L | 420.8 | 0.73 | 94.6
YOLO-Sp | YOLO-Sp-X | 740.5 | 1.10 | 95.4
Mask R-CNN | ResNet34 | 230.4 | 0.51 | 77.1
Mask R-CNN | ResNet50 | 333 | 0.73 | 82.3
Mask R-CNN | ResNet101 | 474.8 | 0.92 | 82.7
Mask R-CNN | Swin-Tiny | 554 | 1.07 | 91.4
Mask R-CNN | Swin-Small | 771.6 | 1.45 | 92.1
Mask R-CNN | Swin-Base | 1021.1 | 1.50 | 90.8
Cascade R-CNN | ResNet34 | 404 | 0.84 | 78.2
Cascade R-CNN | ResNet50 | 517 | 1.12 | 82.5
Cascade R-CNN | ResNet101 | 665 | 1.30 | 83.1
Cascade R-CNN | Swin-Tiny | 951.4 | 1.48 | 87.1
Cascade R-CNN | Swin-Small | 1180.3 | 1.67 | 90.3
Cascade R-CNN | Swin-Base | 1600.4 | 2.11 | 88.6
Faster R-CNN | ResNet101 | 427.5 | 0.80 | 76.5
BlendMask | ResNet101 | 454.2 | 0.85 | 86.3
YOLACT | ResNet101 | 448.8 | 0.77 | 83.2
SOLO | ResNet101 | 438 | 0.73 | 84.5
TensorMask | ResNet101 | 624 | 1.48 | 82.4
Table 3. The mAP results of different YOLO-Sp models.

Model | mAP50 | mAP60 | mAP70 | mAP80 | mAP90
YOLO-Sp-N | 81.2% | 79.4% | 75.5% | 67.8% | 50.1%
YOLO-Sp-S | 86.5% | 85.1% | 81.9% | 74.8% | 56.3%
YOLO-Sp-M | 94.9% | 93.8% | 91.4% | 86.4% | 73.4%
YOLO-Sp-L | 94.2% | 93.1% | 90.3% | 83.6% | 68.6%
YOLO-Sp-X | 95.0% | 94.1% | 91.6% | 84.6% | 68.1%
Table 4. Detection results of different algorithms in object detection.

Algorithm | Model | Model Size (MB) | Pre-Process Time (s) | AP (%)
YOLO-Sp | YOLO-Sp-N | 15.8 | 0.15 | 83.6
YOLO-Sp | YOLO-Sp-S | 20.5 | 0.24 | 88.7
YOLO-Sp | YOLO-Sp-M | 52.4 | 0.33 | 98.2
YOLO-Sp | YOLO-Sp-L | 137.8 | 0.38 | 98.1
YOLO-Sp | YOLO-Sp-X | 270.8 | 0.43 | 98.4
YOLOv7 | YOLOv7-tiny | 14.2 | 0.18 | 85.5
YOLOv7 | YOLOv7-X | 100.2 | 0.31 | 94.8
YOLOv7 | YOLOv7-E6 | 240.4 | 0.42 | 95.6
YOLOX | YOLOX-S | 22.6 | 0.30 | 85.7
YOLOX | YOLOX-M | 50.1 | 0.32 | 90.2
YOLOX | YOLOX-L | 124.7 | 0.36 | 93.4
YOLOX | YOLOX-X | 235.5 | 0.40 | 94.5
YOLOv5 | YOLOv5-N | 10.6 | 0.20 | 79.3
YOLOv5 | YOLOv5-S | 18.8 | 0.23 | 85.9
YOLOv5 | YOLOv5-M | 40.4 | 0.28 | 92.6
YOLOv5 | YOLOv5-L | 95.2 | 0.34 | 92.8
YOLOv5 | YOLOv5-X | 210.5 | 0.39 | 92.2
Table 5. Final accuracy test for model YOLO-Sp-X.

Test Group | Image Size (px) | Precision (%) | Recall (%) | F1-Score
1 | 640 × 640 | 99.0 | 100.0 | 99.5
2 | 640 × 640 | 98.0 | 98.9 | 98.4
3 | 640 × 640 | 99.0 | 100.0 | 99.5
4 | 640 × 640 | 99.5 | 99.5 | 99.5
5 | 640 × 640 | 97.5 | 97.4 | 97.4
6 | 640 × 640 | 99.5 | 99.5 | 99.5
7 | 640 × 640 | 98.0 | 98.9 | 98.4
8 | 640 × 640 | 99.0 | 100.0 | 99.5
9 | 640 × 640 | 97.5 | 97.4 | 97.4