Deep Learning and Computer Vision for Object Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 October 2023) | Viewed by 11171

Special Issue Editors


Guest Editor
Department of Mathematical, Physical and Computer Sciences, University of Parma, 43124 Parma, Italy
Interests: computer science; feature extraction; deep learning; meta-learning; computer vision

Guest Editor
Department of Mathematical, Physical and Computer Sciences, University of Parma, 43124 Parma, Italy
Interests: computer science; bioinformatics; computational biology; parallel computing; graph theory; data integration

Guest Editor
Department of Mathematical, Physical and Computer Sciences, University of Parma, 43124 Parma, Italy
Interests: big data; data analysis; health data analysis; data mining; information retrieval; machine learning; deep learning

Special Issue Information

Dear Colleagues,

In the last decade, we have witnessed the increasing significance of deep learning techniques and deep neural network architectures in artificial intelligence (AI) research, especially in the field of computer vision. These methods have contributed to important advances in image processing and pattern recognition (e.g., object detection), becoming a de facto standard for such tasks. Deep learning for computer vision is still a fast-growing scientific branch, as shown by recent work on transformers and ConvNet models. However, applying these cutting-edge technologies usually requires a large amount of well-balanced data, which often needs to be labeled, as well as a great deal of computational resources. The task of object recognition, that is, the identification of specific objects within an image or frame sequence, aims to localize and classify items of interest in a wide range of applications, from mobile games to industrial vision systems integrated into production lines. At present, deep learning techniques are often involved in object recognition. However, such a wide variety of application domains calls for innovative approaches that can work under few-data conditions, in contrast with classical deep learning requirements.

This Special Issue aims to explore recent advances and trends in the use of deep learning and computer vision methods for object recognition, and seeks original contributions that point out possible ways to deal with scarce and heterogeneous input data, as well as the variability of input domains. This includes but is not limited to meta-learning techniques, one-shot or few-shot learning, data augmentation, and fast or real-time object detection.

Dr. Eleonora Iotti
Dr. Vincenzo Bonnici
Dr. Flavio Bertini
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • object detection
  • meta-learning
  • computer vision
  • artificial intelligence

Published Papers (9 papers)


Research

20 pages, 6038 KiB  
Article
A Lightweight Forest Pest Image Recognition Model Based on Improved YOLOv8
by Tingyao Jiang and Shuo Chen
Appl. Sci. 2024, 14(5), 1941; https://doi.org/10.3390/app14051941 - 27 Feb 2024
Viewed by 730
Abstract
In response to the shortcomings of traditional pest detection methods, such as inadequate accuracy and slow detection speeds, a lightweight forestry pest image recognition model based on an improved YOLOv8 architecture is proposed. Initially, given the limited availability of real deep forest pest image data in the wild, data augmentation techniques, including random rotation, translation, and Mosaic, are employed to expand and enhance the dataset. Subsequently, the traditional Conv (convolution) layers in the neck module of YOLOv8 are replaced with lightweight GSConv, and the Slim Neck design paradigm is utilized for reconstruction to reduce computational costs while preserving model accuracy. Furthermore, the CBAM attention mechanism is introduced into the backbone network of YOLOv8 to enhance the feature extraction of crucial information, thereby improving detection accuracy. Finally, WIoU is employed as a replacement for the traditional CIoU to enhance the overall performance of the detector. The experimental results demonstrate that the improved model exhibits a significant advantage in the field of forestry pest detection, achieving precision and recall rates of 98.9% and 97.6%, respectively, surpassing the current mainstream network models.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)
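As a rough illustration of the Mosaic augmentation the abstract mentions, the sketch below tiles four images into a single 2x2 composite, so a detector trained on the result sees objects at shifted positions and in altered contexts. It is a deliberately simplified toy (fixed grid, no random scaling, no bounding-box remapping), not the implementation used in the paper.

```python
def mosaic(images, out_size):
    """Tile four equally sized images (given as H x W grids of pixel
    values) into one 2x2 composite of size out_size x out_size."""
    half = out_size // 2
    canvas = [[0] * out_size for _ in range(out_size)]
    # Quadrant origins: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (oy, ox) in zip(images, offsets):
        for y in range(half):
            for x in range(half):
                canvas[oy + y][ox + x] = img[y][x]
    return canvas
```

In practice the source images are randomly scaled and cropped before tiling, and the ground-truth boxes are remapped into the composite's coordinates.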

16 pages, 2661 KiB  
Article
Collaborative Encoding Method for Scene Text Recognition in Low Linguistic Resources: The Uyghur Language Case Study
by Miaomiao Xu, Jiang Zhang, Lianghui Xu, Wushour Silamu and Yanbing Li
Appl. Sci. 2024, 14(5), 1707; https://doi.org/10.3390/app14051707 - 20 Feb 2024
Viewed by 491
Abstract
Current research on scene text recognition primarily focuses on languages with abundant linguistic resources, such as English and Chinese. In contrast, there is relatively limited research dedicated to low-resource languages. Advanced methods for scene text recognition often employ Transformer-based architectures. However, the performance of Transformer architectures is suboptimal when dealing with low-resource datasets. This paper proposes a Collaborative Encoding Method for Scene Text Recognition in the low-resource Uyghur language. The encoding framework comprises three main modules: the Filter module, the Dual-Branch Feature Extraction module, and the Dynamic Fusion module. The Filter module, consisting of a series of upsampling and downsampling operations, performs coarse-grained filtering on input images to reduce the impact of scene noise on the model, thereby obtaining more accurate feature information. The Dual-Branch Feature Extraction module adopts a parallel structure combining Transformer encoding and Convolutional Neural Network (CNN) encoding to capture local and global information. The Dynamic Fusion module employs an attention mechanism to dynamically merge the feature information obtained from the Transformer and CNN branches. To address the scarcity of real data for natural scene Uyghur text recognition, this paper conducted two rounds of data augmentation on a dataset of 7267 real images, resulting in 254,345 and 3,052,140 scene images, respectively. This process partially mitigated the issue of insufficient Uyghur language data, making low-resource scene text recognition research feasible. Experimental results demonstrate that the proposed collaborative encoding approach achieves outstanding performance. Compared to baseline methods, our collaborative encoding approach improves accuracy by 14.1%.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)
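The Dynamic Fusion module described above merges the CNN and Transformer branches with attention weights. A minimal scalar-gate sketch of that idea (the real module is a learned attention block; `w` here is merely a stand-in for its learned parameters) might look like:

```python
import math

def dynamic_fusion(cnn_feat, trf_feat, w):
    """Gate two feature vectors: a sigmoid of a learned score decides how
    much of the CNN branch vs. the Transformer branch survives the merge.
    `w` is a hypothetical learned projection over the concatenated features."""
    score = sum(wi * xi for wi, xi in zip(w, cnn_feat + trf_feat))
    g = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in (0, 1)
    return [g * a + (1.0 - g) * b for a, b in zip(cnn_feat, trf_feat)]
```

With `w` set to zeros the gate is 0.5 and the fusion degenerates to a plain average of the two branches; training moves the gate toward whichever branch is more informative.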

16 pages, 34705 KiB  
Article
DFP-Net: A Crack Segmentation Method Based on a Feature Pyramid Network
by Linjing Li, Ran Liu, Rashid Ali, Bo Chen, Haitao Lin, Yonglong Li and Hua Zhang
Appl. Sci. 2024, 14(2), 651; https://doi.org/10.3390/app14020651 - 12 Jan 2024
Viewed by 555
Abstract
Timely detection of defects is essential for ensuring the safe and stable operation of concrete buildings. Automatic segmentation of cracks on concrete surfaces is challenging due to the high diversity of crack appearance, the fine detail involved, and the unbalanced proportion of crack pixels to background pixels. In this work, the Double Feature Pyramid Network is designed for high-precision crack segmentation, reaching the state of the art with the following key contributions. First, considering the diversity of crack shapes, the network constructs a feature pyramid containing three feature extraction backbones to extract global feature maps from input images at three scales. In particular, because single-pixel-wide crack regions pose the biggest challenge, a targeted feature pyramid based on high-resolution input is added to extract adequate shallow semantic information. Finally, a cascade feature fusion unit is designed to aggregate the extracted multi-dimensional feature maps and obtain the final prediction. The superior performance of this method over existing crack detection methods has been verified in extensive experiments, with a Pixel Accuracy of 65.99%, an Intersection over Union of 44.71%, and a Recall of 62.95%, providing a reliable and efficient solution for the health monitoring and maintenance of concrete structures.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

12 pages, 1946 KiB  
Article
A Large-Class Few-Shot Learning Method Based on High-Dimensional Features
by Jiawei Dang, Yu Zhou, Ruirui Zheng and Jianjun He
Appl. Sci. 2023, 13(23), 12843; https://doi.org/10.3390/app132312843 - 30 Nov 2023
Viewed by 697
Abstract
Large-class few-shot learning has a wide range of applications in many fields, such as the medical, power, security, and remote sensing fields. At present, many few-shot learning methods for fewer-class scenarios have been proposed, but little research has been performed for large-class scenarios. In this paper, we propose a large-class few-shot learning method called HF-FSL, which is based on high-dimensional features. Recent theoretical research shows that if the distribution of samples in a high-dimensional feature space meets the conditions of compactness within classes and dispersion between classes, a large-class few-shot learning method has better generalization ability. Inspired by this theory, the basic idea is to use a deep neural network to extract high-dimensional features and unitize them to project the samples onto a hypersphere. A global orthogonal regularization strategy can then be used to make samples of different classes on the hypersphere as orthogonal as possible, achieving compactness within classes and dispersion between classes in the high-dimensional feature space. Experiments on Omniglot, Fungi, and ImageNet demonstrate that the proposed method can effectively improve recognition accuracy in large-class FSL problems.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)
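The unitization and orthogonality idea can be sketched directly: project feature vectors onto the unit hypersphere, then penalize the squared cosine similarity between pairs, which is zero exactly when the representatives are mutually orthogonal. This is a simplified stand-in for the paper's global orthogonal regularizer, applied here to raw vectors rather than network activations.

```python
import math

def unitize(v):
    """Project a feature vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def orthogonal_penalty(feats):
    """Mean squared cosine similarity over all pairs of unitized features.
    Minimizing this pushes the representatives toward mutual orthogonality,
    i.e., dispersion between classes on the hypersphere."""
    units = [unitize(v) for v in feats]
    total, pairs = 0.0, 0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            cos = sum(a * b for a, b in zip(units[i], units[j]))
            total += cos * cos
            pairs += 1
    return total / pairs
```

Orthogonal vectors score 0, collinear ones score 1, so adding this penalty to the training loss spreads class prototypes out across the sphere.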

22 pages, 9343 KiB  
Article
Foreign-Object Detection in High-Voltage Transmission Line Based on Improved YOLOv8m
by Zhenyue Wang, Guowu Yuan, Hao Zhou, Yi Ma and Yutang Ma
Appl. Sci. 2023, 13(23), 12775; https://doi.org/10.3390/app132312775 - 28 Nov 2023
Cited by 3 | Viewed by 911
Abstract
The safe operation of high-voltage transmission lines ensures the power grid's security. Various foreign objects attached to the transmission lines, such as balloons, kites and nesting birds, can significantly affect the safe and stable operation of high-voltage transmission lines. With the advancement of computer vision technology, periodic automatic inspection of foreign objects is efficient and necessary. Existing detection methods have low accuracy because foreign objects attached to the transmission lines are complex, including occlusions, diverse object types, significant scale variations, and complex backgrounds. In response to the practical needs of the Yunnan Branch of China Southern Power Grid Co., Ltd., this paper proposes an improved YOLOv8m-based model for detecting foreign objects on transmission lines. Experiments are conducted on a dataset collected from Yunnan Power Grid. The proposed model enhances the original YOLOv8m by incorporating a Global Attention Module (GAM) into the backbone to focus on occluded foreign objects, replacing the SPPF module with the SPPCSPC module to augment the model's multiscale feature extraction capability, and introducing the Focal-EIoU loss function to address the imbalance between high- and low-quality samples. These improvements accelerate model convergence and enhance detection accuracy. The experimental results demonstrate that our proposed model achieves a 2.7% increase in mAP_0.5, a 4% increase in mAP_0.5:0.95, and a 6% increase in recall.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

17 pages, 1825 KiB  
Article
Three-Dimensional Human Pose Estimation with Spatial–Temporal Interaction Enhancement Transformer
by Haijian Wang, Qingxuan Shi and Beiguang Shan
Appl. Sci. 2023, 13(8), 5093; https://doi.org/10.3390/app13085093 - 19 Apr 2023
Cited by 1 | Viewed by 1623
Abstract
Three-dimensional human pose estimation is a hot research topic in the field of computer vision. In recent years, significant progress has been made in estimating 3D human pose from monocular video, but there is still much room for improvement in this task owing to the issues of self-occlusion and depth ambiguity. Some previous work has addressed the above problems by investigating spatio-temporal relationships and has made great progress. Based on this, we further explored the spatio-temporal relationship and propose a new method, called STFormer. Our whole framework consists of two main stages: (1) extracting features independently from the temporal and spatial domains; (2) modeling the communication of information across domains. The temporal dependencies are injected into the spatial domain to dynamically modify the spatial structure relationships between joints, and the results are then used to refine the temporal features. After the preceding steps, both spatial and temporal features are strengthened, and the final estimated pose is more precise. We conducted substantial experiments on the well-known Human3.6M dataset, and the results indicated that STFormer outperforms recent methods with an input of nine frames. Compared to PoseFormer, our method reduces the MPJPE by 2.1%. Furthermore, we performed numerous ablation studies to analyze and prove the validity of the constituent modules of STFormer.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

17 pages, 5396 KiB  
Article
Research on Vehicle Re-Identification Algorithm Based on Fusion Attention Method
by Peng Chen, Shuang Liu and Simon Kolmanič
Appl. Sci. 2023, 13(7), 4107; https://doi.org/10.3390/app13074107 - 23 Mar 2023
Viewed by 1058
Abstract
The specific task of vehicle re-identification is to quickly and correctly match the same vehicle across different scenarios. To address the problems of inter-class similarity and environmental interference in vehicle images from complex scenes, a fusion attention method for vehicle re-identification is put forward, based on the idea of capturing distinguishing detail features. First, the vehicle image is preprocessed to better restore the image's attributes. The processed image is then fed into ResNet50 to extract features from the second and third layers, respectively. Feature fusion is then carried out through a two-layer attention mechanism, producing a network model, named SDLAU-Reid, that focuses on local detail features while also constructing global features. In the training process, a random erasing data augmentation strategy is adopted to improve robustness. The experimental results show that the mAP and rank-k indicators of the model on VeRi-776 and VehicleID are better than those of existing vehicle re-identification algorithms, which verifies the algorithm's effectiveness.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

15 pages, 3855 KiB  
Article
LC-YOLO: A Lightweight Model with Efficient Utilization of Limited Detail Features for Small Object Detection
by Menghua Cui, Guoliang Gong, Gang Chen, Hongchang Wang, Min Jin, Wenyu Mao and Huaxiang Lu
Appl. Sci. 2023, 13(5), 3174; https://doi.org/10.3390/app13053174 - 01 Mar 2023
Cited by 6 | Viewed by 2121
Abstract
The limited computing resources on edge devices such as Unmanned Aerial Vehicles (UAVs) mean that lightweight object detection algorithms based on convolutional neural networks require significant development. However, lightweight models are challenged by small targets with few available features. In this paper, we propose an LC-YOLO model that uses detailed information about small targets in each layer to improve detection performance. The model is improved from a one-stage detector and contains two optimization modules: Laplace Bottleneck (LB) and Cross-Layer Attention Upsampling (CLAU). The LB module is proposed to enhance shallow features by integrating prior information into the convolutional neural network and maximizing knowledge sharing within the network. CLAU is designed for the pixel-level fusion of deep features and shallow features. Under the combined action of these two modules, the LC-YOLO model achieves better detection performance on the small object detection task. The LC-YOLO model, with 7.30M parameters, achieves an mAP of 94.96% on the remote sensing dataset UCAS-AOD, surpassing the YOLOv5l model with 46.61M parameters. The tiny version of LC-YOLO, with 1.83M parameters, achieves 94.17% mAP, which is close to YOLOv5l. Therefore, the LC-YOLO model can replace many heavyweight networks to complete high-precision small-target detection tasks under limited computing resources, as in the case of mobile edge chips such as UAV onboard chips.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)

20 pages, 26421 KiB  
Article
Deep Learning-Based Algorithm for Recognizing Tennis Balls
by Di Wu and Aiping Xiao
Appl. Sci. 2022, 12(23), 12116; https://doi.org/10.3390/app122312116 - 26 Nov 2022
Cited by 2 | Viewed by 1881
Abstract
In this paper, we adjust the hyperparameters of the training model based on gradient estimation theory, optimize the structure of the model based on the loss function of the Mask R-CNN convolutional network, and propose a scheme that helps a tennis-ball-picking robot perform target recognition, improving the robot's ability to acquire and analyze image information. Suitable image samples of tennis balls are collected and used to train the Mask R-CNN convolutional network, outputting an algorithmic model dedicated to recognizing tennis balls; the final values of the various loss functions after gradient descent are recorded, the iteration graph of the model is drawn, and the behavior of the neural network at different iteration levels is observed. Finally, this improved and optimized algorithm is compared with other tennis ball recognition algorithms. The experimental results show that the improved algorithm based on Mask R-CNN recognizes tennis balls with 92% accuracy between iteration levels 30 and 35, offering higher accuracy and recognition distance than other tennis ball recognition algorithms and confirming the feasibility and applicability of the optimized algorithm.
(This article belongs to the Special Issue Deep Learning and Computer Vision for Object Recognition)