Article

A Lightweight Modified YOLOv5 Network Using a Swin Transformer for Transmission-Line Foreign Object Detection

1 School of Electric Power, Yinchuan University of Energy, Yinchuan 750100, China
2 School of Physics and Electronic Science, Changsha University of Science and Technology, Changsha 410114, China
3 School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha 410114, China
4 School of Foreign Languages, Yinchuan University of Energy, Yinchuan 750100, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3904; https://doi.org/10.3390/electronics12183904
Submission received: 16 August 2023 / Revised: 8 September 2023 / Accepted: 14 September 2023 / Published: 15 September 2023

Abstract

Transmission lines are often located in complex environments and are susceptible to the presence of foreign objects. Failure to promptly address these objects can result in accidents, including short circuits and fires. Existing foreign object detection networks face several challenges, such as high levels of memory consumption, slow detection speeds, and susceptibility to background interference. To address these issues, this paper proposes a lightweight detection network based on deep learning, namely YOLOv5 with an improved version of CSPDarknet and a Swin Transformer (YOLOv5-IC-ST). YOLOv5-IC-ST was developed by incorporating the Swin Transformer into YOLOv5, thereby reducing the impact of background information on the model. Furthermore, the improved CSPDarknet (IC) enhances the model’s feature-extraction capability while reducing the number of parameters. To evaluate the model’s performance, a dataset specific to foreign objects on transmission lines was constructed. The experimental results demonstrate that compared to other single-stage networks such as YOLOv4, YOLOv5, and YOLOv7, YOLOv5-IC-ST achieves superior detection results, with a mean average precision (mAP) of 98.4%, a detection speed of 92.8 frames per second (FPS), and a compact model size of 10.3 MB. These findings highlight that the proposed network is well suited for deployment on embedded devices such as UAVs.

1. Introduction

The electric power industry is fundamental to the development of the national economy, and transmission lines are an essential part of it. Because transmission lines run through complex and variable environments, a variety of foreign objects can easily become attached to them. Foreign objects on transmission lines (such as tree branches and wind-blown debris) may lead to line short circuits, fires, or even blackouts and other safety accidents. Detecting foreign objects allows potential safety hazards to be identified and removed at an early stage, ensuring the safe operation of transmission lines. The detection of foreign objects on transmission lines is therefore significant for ensuring the safe and reliable operation of the power system, reducing faults and power outages, improving economic and operation-and-maintenance efficiency, and realizing intelligent management. This is of great value to the sustainable development of the power industry, users, and society.
The traditional method of inspection is both time-consuming and labor-intensive, and it often fails to comprehensively cover the entire transmission line. To address these issues, some researchers have turned to machine learning techniques for the detection of foreign objects on transmission lines. Yao et al. [1] pioneered the use of the slice method to segment images, followed by the use of Support Vector Machines (SVMs) for object detection. This approach yielded a commendable foreign object detection rate of 91.6%. Additionally, other scholars have leveraged the Canny edge detection operator to extract edge information from images. Subsequently, a straight-line detection method was employed to identify these lines, and mathematical morphology processing was applied to finalize the detection [2,3]. While these methods are capable of distinguishing the presence or absence of foreign objects, they fall short of categorizing the various types of foreign objects. Moreover, they lack real-time capabilities.
In recent years, the rapid advancement of deep learning has brought about significant improvements in both the accuracy and speed of target-detection algorithms compared to traditional machine learning approaches. Current deep-learning-based target detection methods can be broadly classified into two main categories. The first category encompasses two-stage detection algorithms, exemplified by R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6]. Two-stage detection algorithms break down the target detection task into two distinct stages: candidate region generation, followed by target classification and localization. In the candidate-region-generation stage, the algorithm generates a series of candidate boxes or regions that may contain targets; during the classification and localization stage, these candidates undergo further classification and localization. The two-stage design typically yields higher detection accuracy and lower false-positive rates, but its computational complexity renders it less suitable for deployment on resource-constrained embedded devices. The other category comprises single-stage detection algorithms, represented by the YOLO series [7,8,9,10], SSD [11], and CenterNet [12]. Single-stage detection algorithms predict the location and category of the target directly from the image, completing detection in one pass without an additional candidate-region-generation step, so they are usually faster and more efficient. Du et al. [13] proposed BV-YOLOv5S, based on YOLOv5S, to realize real-time detection of pavement defects. Li et al. [14] developed a lightweight convolutional neural network called WearNet to enable the real-time detection of scratches on sliding metal parts.
Previous studies have shown that it is possible to accomplish the detection of foreign objects on transmission lines via deep learning techniques [15,16]. However, the existing algorithms suffer from high levels of memory consumption and slow detection speeds, and they are easily affected by background elements. The purpose of this paper is to provide an innovative strategy for the detection of foreign objects on transmission lines to achieve fast and accurate detection on embedded devices. The main contributions of this paper are shown below.
(1) YOLOv5-IC-ST is proposed based on an improved version of CSPDarknet. This improvement integrates a Swin Transformer and an attention mechanism into CSPDarknet, amplifying the network's capacity to emphasize detected targets while diminishing the influence of background elements on its performance. Additionally, conventional convolutions are replaced by the Ghost module, effectively reducing the overall number of network parameters. This modification plays a pivotal role in enabling the deployment of YOLOv5-IC-ST on resource-constrained embedded devices.
(2) A new attention mechanism called the Unified Attention Mechanism (UAM) is introduced. It combines three distinct components: a spatial attention module, a channel attention module, and a scale attention module. Unlike conventional attention mechanisms, the UAM can concurrently prioritize channel, spatial, and scale information. This holistic attention mechanism significantly bolsters the network's capacity to capture and exploit pertinent features across diverse dimensions, which translates into improved performance across a range of tasks.
(3) The loss function is optimized by replacing aspect-ratio regression with direct width and height regression, which suppresses uncorrelated anchors. The new loss function improves the ability to locate and detect small targets.
The structure of this paper is summarized in the following way: In Section 2, an introduction to the pertinent literature concerning the detection of foreign objects on transmission lines is presented. Section 3 elaborates the research methodology employed in this study, encompassing aspects such as dataset composition, enhancements to the network architecture, and the optimization of the loss function. In Section 4, a comprehensive account of the experiments and outcomes is provided. This includes insights into the experimental setup, evaluation metrics, and comparative experiments. Finally, Section 5 encapsulates the entire paper through a succinct summary.

2. Related Work

Zhou et al. [15] replaced ResNet [17] and the FPN in Cascade R-CNN [16] with a deeply aggregated feature-extraction network and an efficient weighted, bi-directional feature-fusion network to detect anti-vibration hammer defects on transmission lines while reducing the computational cost of the model. Wu et al. [18] proposed PWR-YOLOv5, a corrosion component detection method based on the YOLOv5 algorithm; skip-layer full connectivity and an adaptive-feature-fusion factor were introduced in the deep feature-fusion process to extract features at different stages. PWR-YOLOv5 achieves a mAP of 95.37% and a detection speed of 64.9 FPS. Wang et al. [19] conducted a practical examination of a DPM (Deformable Parts Model), Faster R-CNN, and SSD to measure the models' capability to detect foreign objects on transmission lines; the experimental results show that the SSD algorithm achieves a mAP of 85.2% and a detection speed of 26 FPS. Liu et al. [20] realized the segmentation of foreign object images in transmission line channels by combining image-enhancement techniques with meta-learning techniques in U-Net [21]. Liu et al. [22] embedded an attention mechanism into YOLOv3 to detect foreign objects on transmission lines; the method can effectively detect foreign objects with a mAP of 88.79% and a detection speed of 55.18 FPS. Song et al. [23] optimized the YOLOv4 algorithm by adding k-means clustering and DIoU-NMS methods, achieving a detection accuracy of 81.72%.
However, most of these studies can only detect the presence of transmission-line defects, or a single defect type, and cannot categorize the types of defects. Moreover, the current algorithm models are large, which is not conducive to real-time detection. Therefore, while further improving the accuracy of the model, we designed a network capable of real-time detection based on the YOLOv5 algorithm, which provides theoretical support for subsequent deployment on an aerial platform.

3. Materials and Methods

3.1. Dataset and Data Preparation

Due to the absence of a publicly available dataset specifically tailored for foreign object detection on high-voltage transmission lines, a bespoke dataset was created for this paper. Through the search engines Bing, Google, and Baidu, 1028 images in four different categories were obtained: 239 nests, 355 kites, 239 balloons, and 195 plastics. Figure 1 shows some of the images in the dataset. Label calibration was performed using a labeling tool, as illustrated in Figure 2, which depicts the dataset calibration process. To bolster the model's robustness and mitigate the impacts of occlusion and the varying lighting conditions typical of real-world scenarios, the dataset underwent augmentation. This augmentation involved the introduction of noise, adjustments in brightness, cutout operations, rotation, cropping, scaling, and other image-enhancement techniques, as outlined in Figure 3, which illustrates the dataset enhancement approach. As a result of these image enhancements, the dataset was expanded to a total of 4893 images. The dataset was then divided into three subsets, the training set, validation set, and test set, at a ratio of 8:1:1. The training set was used for model fitting, enabling the development of the detection model. The validation set was used to assess the effectiveness of the model fitting, thereby guiding the optimization of hyperparameters. Finally, the test set served as the final arbiter, validating the model's performance.
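As an illustration of the augmentation operations described above, the following minimal sketch applies Gaussian noise, a brightness adjustment, and a cutout to an 8-bit RGB image stored as a NumPy array. The function names and parameter values are ours, chosen for illustration rather than taken from the original pipeline.

```python
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 15.0) -> np.ndarray:
    """Add zero-mean Gaussian noise to an HxWx3 uint8 image."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_brightness(img: np.ndarray, factor: float = 1.3) -> np.ndarray:
    """Scale pixel intensities to simulate different lighting conditions."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def cutout(img: np.ndarray, size: int = 60) -> np.ndarray:
    """Mask a random square region to mimic partial occlusion."""
    out = img.copy()
    h, w = img.shape[:2]
    y = np.random.randint(0, max(1, h - size))
    x = np.random.randint(0, max(1, w - size))
    out[y:y + size, x:x + size] = 0
    return out
```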

3.2. YOLOv5-IC-ST Network Structure

Figure 4 illustrates the architectural layout of YOLOv5-IC-ST. The YOLOv5-IC-ST network can be dissected into four fundamental components: the input layer, the improved CSPDarknet, the neck module, and the prediction mechanism. The input layer comprises three key operations: mosaic data augmentation, image size manipulation, and dynamic anchor frame computation. Mosaic data augmentation combines four distinct images to enrich the image background. For image size processing, the approach appends a minimal black border to original images of different dimensions and uniformly scales them to a consistent size of 640 × 640 × 3. In the dynamic anchor frame calculation, the initial anchor frame is compared with the predicted frame output, the discrepancy is quantified, and the parameters are updated in reverse iteratively; this process refines the parameters to ascertain the most suitable anchor frame values. The enhanced CSPDarknet integrates a UAM (Unified Attention Mechanism) module and a Swin Transformer module. This integration enriches the model's ability to emphasize channel and spatial information while mitigating the influence of intricate backgrounds. The incorporation of the Ghost and C3Ghost modules counteracts the expansion of weight parameters induced by the attention mechanisms, effectively reducing computational overhead while enhancing the learning efficiency of the entire convolutional neural network. Within the prediction component, feature maps at three distinct scales, (80, 80), (40, 40), and (20, 20), are extracted to facilitate the detection of foreign objects on transmission lines.
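For reference, the sketch below illustrates the idea behind mosaic data augmentation described above: four images are composed around a random centre point on a single 640 × 640 canvas. Bounding boxes would have to be remapped accordingly; this simplified version, with a hypothetical function name, only handles the image content.

```python
import numpy as np

def mosaic4(images, out_size: int = 640) -> np.ndarray:
    """Combine four HxWx3 uint8 images into one mosaic image."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey padding
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)        # random mosaic centre
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    # destination quadrants on the canvas: (x1, y1, x2, y2)
    quads = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, quads):
        h, w = y2 - y1, x2 - x1
        crop = img[:min(h, img.shape[0]), :min(w, img.shape[1])]
        canvas[y1:y1 + crop.shape[0], x1:x1 + crop.shape[1]] = crop
    return canvas
```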

3.3. Attention Mechanism

The integration of the attention mechanism facilitates the model’s selective focus on meaningful regions within the image, effectively filtering out noise and irrelevant elements. This functionality significantly refines the model’s ability to grasp the nuances of the image, resulting in improved accuracy. Considering the intricate and complex backgrounds prevalent in the transmission-line foreign object dataset, the incorporation of the attention mechanism takes on heightened importance. This augmentation equips the model with the capability to effectively counteract background interference, thereby enhancing its prowess in extracting salient features from the image. This, in turn, amplifies the model’s comprehension of the image content, ultimately elevating its accuracy and performance.
This paper introduces a novel attention mechanism termed the UAM, which synergizes the spatial attention module, channel attention module, and scale attention mechanism. Unlike other attention mechanisms [24,25,26,27], the UAM stands out by simultaneously attending to channel, spatial, and scale information. The UAM is constructed as a sequential arrangement of the channel attention module, spatial attention module, and scale attention module. The process commences with the output result of the convolutional layer traversing the channel attention module to acquire a weighted outcome. This output then proceeds through the spatial attention module and the scale attention module to ultimately yield a definitive weighted output. The specific configuration of the UAM is illustrated in Figure 5. In this context, F denotes the feature map following the convolutional layer, while H and W represent the height and width of the feature map, respectively. Furthermore, C signifies the number of channels present in the input feature map. The channel attention module’s purpose lies in compressing the feature map within the spatial dimension to generate one-dimensional vector pre-operations. In this module, the global spatial information of F undergoes compression via Max Pool and Avg Pool to engender two feature maps, denoted S1 and S2, both of which possess dimensions of 1 × 1 × C. Subsequently, two one-dimensional feature maps emerge through multilayer perception (MLP). These maps are then normalized, resulting in the creation of a weighted feature map, denoted MCH. The spatial attention module, on the other hand, compresses the channel dimension through average pooling and maximum pooling. In this module, a 1 × 1 × 1 convolutional operation activates F, and the outcome is subjected to the sigmoid function, culminating in the generation of a weighted feature map, MSP. The scale attention module begins with a global pooling step, followed by Fc, which then undergoes activation through the relu function. The output subsequently encounters another round of Fc, culminating in the final output after passing through the sigmoid activation function, yielding the weighted feature map, MSC. In a parallel arrangement, MCH, MSP, and MSC are amalgamated through element-wise summation, ultimately generating the output feature map, F^, after executing the sigmoid activation function. This intricate architecture is illustrated in a graphical representation in Figure 5.
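Since Figure 5 is not reproduced here, the following PyTorch sketch is only one possible reading of the UAM described above: a CBAM-style channel branch (max/avg pooling plus a shared MLP), a spatial branch over channel-pooled maps, and an SE-style scale branch, fused by element-wise summation followed by a sigmoid gate. The 7 × 7 spatial kernel and the reduction ratio are assumptions for illustration, not values given in the paper.

```python
import torch
import torch.nn as nn

class UAMSketch(nn.Module):
    """Hypothetical sketch of the Unified Attention Mechanism (UAM)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # channel attention: pooled descriptors through a shared MLP
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        # spatial attention: convolution over channel-wise avg/max maps (kernel size assumed)
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # scale attention: SE-style FC -> ReLU -> FC
        self.scale = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C, H, W)
        b, c, _, _ = f.shape
        avg = f.mean(dim=(2, 3))                           # (B, C) average-pooled descriptor
        mx = f.amax(dim=(2, 3))                            # (B, C) max-pooled descriptor
        m_ch = (self.mlp(avg) + self.mlp(mx)).view(b, c, 1, 1)        # M_CH
        pooled = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)
        m_sp = self.spatial(pooled)                        # M_SP, (B, 1, H, W)
        m_sc = self.scale(avg).view(b, c, 1, 1)            # M_SC
        gate = torch.sigmoid(m_ch + m_sp + m_sc)           # element-wise summation + sigmoid
        return f * gate                                    # weighted output feature map F^
```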

3.4. Swin Transformer

In tandem with the advancement of deep learning, the Transformer architecture has rapidly evolved within the realm of computer vision. Consequently, a growing number of Transformers have found applications across diverse computer vision tasks, such as image classification and target detection [28,29,30]. Notably, the Swin Transformer has emerged as a notable contender, adeptly homing in on the global characteristics of feature maps while maintaining a lower level of complexity [31]. The Swin Transformer module comprises several integral components: a Multi-Layer Perceptron (MLP), LayerNorm normalization (LN), Window Multi-head Self-Attention (W-MSA), and Shifted-Window-based Multi-head Self-Attention (SW-MSA). An overview of the structure of the Swin Transformer module is presented in Figure 6. This module undertakes the bulk of global feature processing, culminating in the output formulations delineated in Equations (1) through (4), where $\hat{z}^{l}$ is the feature output of the (S)W-MSA mechanism in block $l$, $z^{l}$ is the feature output of the MLP in block $l$, and $z^{l-1}$ is the output of block $l-1$.
$$\hat{z}^{l} = \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1} \quad (1)$$
$$z^{l} = \text{MLP}\big(\text{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \quad (2)$$
$$\hat{z}^{l+1} = \text{SW-MSA}\big(\text{LN}(z^{l})\big) + z^{l} \quad (3)$$
$$z^{l+1} = \text{MLP}\big(\text{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1} \quad (4)$$
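To make Equations (1)–(4) concrete, the sketch below shows a simplified Swin-style block in PyTorch: LayerNorm followed by window-based self-attention with a residual connection, then LayerNorm followed by an MLP with a residual connection. For brevity it omits the relative position bias and the attention mask used with shifted windows in the original Swin Transformer, so it is an illustrative approximation rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Simplified block following Eqs. (1)-(4): (S)W-MSA + residual, then MLP + residual."""
    def __init__(self, dim: int, num_heads: int, window_size: int = 7, shifted: bool = False):
        super().__init__()
        self.ws = window_size
        self.shift = window_size // 2 if shifted else 0
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size
        B, H, W, C = x.shape
        shortcut = x
        z = self.norm1(x)
        if self.shift:   # cyclic shift turns W-MSA into SW-MSA
            z = torch.roll(z, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # partition into non-overlapping windows and attend within each window
        z = z.view(B, H // self.ws, self.ws, W // self.ws, self.ws, C)
        z = z.permute(0, 1, 3, 2, 4, 5).reshape(-1, self.ws * self.ws, C)
        z, _ = self.attn(z, z, z)
        z = z.reshape(B, H // self.ws, W // self.ws, self.ws, self.ws, C)
        z = z.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:   # reverse the cyclic shift
            z = torch.roll(z, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + z                        # Eq. (1) / (3)
        return x + self.mlp(self.norm2(x))      # Eq. (2) / (4)
```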

3.5. Improved CSPDarknet

An enhanced CSPDarknet architecture was constructed through the amalgamation of various critical modules, each contributing to its improved performance. These modules encompass the CBL module, Ghost module, C3Ghost module, UAM module, and Swin Transformer. The CBL (Convolutional–BatchNorm–LeakyReLU) module stands as a fundamental building block, encompassing convolution, batch normalization, and a leaky ReLU activation function. Distinctively, the C3Ghost module segments the input into two segments. One segment undergoes CBL processing prior to traversing the residual component, eventually concluding with a convolution operation. Meanwhile, the other segment directly encounters convolution operations. Subsequently, these two processed segments are concatenated through a concatenation operation. This novel composition, coupled with the inclusion of other modules like the Ghost module, UAM module, and Swin Transformer, collectively brings forth the enhanced CSPDarknet architecture.
To reduce the amount of computation and the number of parameters, Huawei proposed GhostNet and GhostConv in 2020 [32]. Figure 7 shows the Ghost module network structure. Unlike conventional convolution, the computation of GhostConv is performed in two parts. First, an intrinsic feature map with m channels (fewer than the desired number of output channels) is obtained using a regular convolution with k × k kernels; then, s feature maps are generated from each intrinsic feature map using cheap linear operations. Finally, the n output feature maps are obtained by concatenating the two groups, where n = m · s. In Equations (5) and (6), q1 and q2 are the floating-point operations of regular convolution and GhostConv, respectively. The input feature map size is c · h · w, the convolution kernel is n · k · k, and the output feature map size is n · h · w. Additionally, d · d is the kernel size of the linear operations, and s ≪ c.
$$q_{1} = n \cdot h \cdot w \cdot c \cdot k \cdot k \quad (5)$$
$$q_{2} = \frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h \cdot w \cdot d \cdot d \quad (6)$$
Equation (7) compares the computational cost of conventional convolution and GhostConv. Equation (8) compares their numbers of parameters. By combining Equations (7) and (8), it can be concluded that when the values of k and d are similar, the number of parameters and the computational cost of feature extraction for Ghost convolution are both about 1/s of those of standard convolution.
$$r_{s} = \frac{n \cdot h \cdot w \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot h \cdot w \cdot d \cdot d} = \frac{c \cdot k \cdot k}{\frac{1}{s} \cdot c \cdot k \cdot k + \frac{s-1}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \quad (7)$$
$$r_{c} = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \quad (8)$$
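As a rough PyTorch sketch of the Ghost convolution described above, a primary convolution produces the m intrinsic feature maps and a cheap depthwise convolution generates the remaining ghost maps, which are then concatenated. The ratio s = 2, the 3 × 3 cheap-operation kernel, and the SiLU activation are assumptions chosen for illustration, not values stated in the paper.

```python
import torch
import torch.nn as nn

class GhostConvSketch(nn.Module):
    """Illustrative Ghost convolution: primary conv + cheap depthwise ops, concatenated."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 2, d: int = 3):
        super().__init__()
        m = c_out // s                                   # intrinsic feature maps, n = m * s
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, m, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(m), nn.SiLU())
        self.cheap = nn.Sequential(                      # d x d depthwise "linear" operations
            nn.Conv2d(m, m * (s - 1), d, padding=d // 2, groups=m, bias=False),
            nn.BatchNorm2d(m * (s - 1)), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)                              # m intrinsic maps
        return torch.cat([y, self.cheap(y)], dim=1)      # n = m * s output channels
```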

3.6. EIoU Loss Function

The native localization loss function within the YOLOv5 algorithm employs the CIoU (Complete Intersection over Union) metric [33]. The computation of the CIoU metric is articulated through Equations (9)–(11), where α is a trade-off parameter and v measures the consistency of the aspect ratio. It is apparent from the formulation that the CIoU loss function encompasses the overlap area, the centroid distance, and the aspect ratio of the bounding box regression. However, the genuine discrepancy in the aspect ratio and its correlation with the confidence level are not captured by this formulation. As a result, the CIoU metric frequently encounters challenges in achieving precise localization, particularly when confronted with small target objects.
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \quad (9)$$
$$\alpha = \frac{v}{(1 - IoU) + v} \quad (10)$$
$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2} \quad (11)$$
In pursuit of enhanced localization precision, this study employs the EIoU (Efficient Intersection over Union) metric [34] for the computation of the localization loss. In contrast to the CIoU approach, the EIoU approach splits the aspect-ratio influence factor, independently evaluating the width and height of the target frame and the anchor frame. The EIoU loss function comprises three integral components: overlap loss, center distance loss, and width and height losses. The EIoU metric supervises the backpropagation process by leveraging the genuine differences in width and height between the predicted frame and the annotated frames. This strategy yields an optimal resolution for the loss function, thus amplifying localization accuracy, particularly in scenarios involving small target objects.
The EIoU metric is defined as shown in Equation (12), where b^gt, w^gt, and h^gt are the center, width, and height of the real box, respectively, and b, w, and h are the center, width, and height of the prediction box, respectively. c is the diagonal of the smallest enclosing rectangle of the real box and the prediction box, and C_w and C_h are the width and height of this enclosing rectangle, respectively.
$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \quad (12)$$
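A direct translation of Equation (12) into code might look like the following sketch for boxes in (x1, y1, x2, y2) format; this is our illustrative implementation, not the authors' code.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """EIoU loss per Eq. (12) for (N, 4) boxes given as (x1, y1, x2, y2)."""
    # overlap (IoU) term
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # smallest enclosing box (width C_w, height C_h, squared diagonal c^2)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between box centres
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 + \
           ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    # width and height difference terms
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    return 1 - iou + rho2 / c2 + (wp - wt) ** 2 / (cw ** 2 + eps) + (hp - ht) ** 2 / (ch ** 2 + eps)
```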

4. Results and Discussion

This section of the study encompasses a series of comparison experiments involving YOLOv3-tiny, YOLOv7-tiny, and other related models to substantiate the performance of the proposed model. Additionally, ablation experiments were conducted to ascertain and validate the extent of the enhancements to the model. All experiments were conducted in a Windows 10 environment with an Intel i5-12400 CPU and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The neural network framework employed was PyTorch version 1.11. YOLOv5s was used as the pre-trained model. The training epoch count was set to 300; the initial learning rate was 0.001, changed to 0.01 after 50 epochs and to 0.1 after 100 epochs; the optimizer was Adam with a beta1 value of 0.937.
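For readers who wish to reproduce the setup, the settings listed above can be summarized as a configuration dictionary; the key names below are hypothetical and serve only as a convenient summary of the reported hyperparameters.

```python
# Hypothetical summary of the reported training configuration.
train_cfg = {
    "pretrained_weights": "yolov5s.pt",                 # YOLOv5s used as the pre-trained model
    "epochs": 300,
    "optimizer": "Adam",
    "adam_beta1": 0.937,
    "learning_rate": {0: 1e-3, 50: 1e-2, 100: 1e-1},    # as reported: 0.001, then 0.01, then 0.1
    "input_size": 640,
    "framework": "PyTorch 1.11",
    "device": "NVIDIA GeForce RTX 3090 (24 GB)",
}
```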

4.1. Evaluation Metrics

In this study, three key evaluation metrics were adopted: precision (P), recall (R), and mean average precision (mAP). The Intersection over Union (IoU) metric, defined in Equation (13), quantifies the extent of the overlap between the predicted frame and the reference frame; a prediction was deemed accurate when its IoU surpassed a threshold of 0.5. Precision, as articulated in Equation (14), gauges the ratio of accurate instances within the retrieved outcomes, effectively representing the model's accuracy; a higher precision value corresponds to less noise in the retrieved outcomes. Recall, as expressed in Equation (15), measures the proportion of pertinent instances retrieved from the ground truth, elucidating the model's coverage; higher recall values signify an enhanced ability to retrieve relevant instances and, consequently, greater search comprehensiveness. Average precision (AP), presented in Equation (16), evaluates the model's detection efficacy for a specific class of targets. The mean average precision (mAP), as defined in Equation (17), averages the AP values across all target classes, offering a comprehensive assessment of the model's detection performance.
$$IoU(box_{gt}, box_{p}) = \frac{|box_{gt} \cap box_{p}|}{|box_{gt} \cup box_{p}|} \quad (13)$$
$$P = \frac{TP}{TP + FP} \quad (14)$$
$$R = \frac{TP}{TP + FN} \quad (15)$$
$$AP = \frac{\sum_{i=1}^{n} P_{i}}{n} \quad (16)$$
$$mAP = \frac{\sum_{i=1}^{k} AP_{i}}{k} \quad (17)$$
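The metrics in Equations (13)–(17) can be computed with a few lines of code; the sketch below (our own helper functions, shown for illustration) evaluates the IoU of a single box pair and precision, recall, and mAP from accumulated counts.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2), Eq. (13)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts, Eqs. (14)-(15)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values, Eq. (17)."""
    return sum(ap_per_class) / len(ap_per_class)
```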

4.2. Performance Evaluation

The model-training process entails three distinct loss functions: bounding box loss, objectness loss, and classification loss, each serving a specific purpose in fine-tuning the model's performance. Bounding box loss quantifies the disparity between the model's predicted bounding box and the actual bounding box, guiding the model's localization accuracy. Objectness loss evaluates the accuracy of the model's confidence estimate that a target is present within a given bounding box; each bounding box prediction is accompanied by a confidence score denoting the likelihood of a target object's presence within that box. Classification loss assesses the accuracy of the model's predictions regarding the target's category, since in target detection the model's role extends beyond predicting bounding box locations and target confidence to predicting the target's specific category. The trajectory of the training loss functions for YOLOv5-IC-ST is depicted in Figure 8; the model was trained over the course of 290 iterations, contributing to its refinement and optimization.
Figure 9 and Figure 10 showcase the precision–recall curves for YOLOv5 and YOLOv5-IC-ST, respectively. Notably, the mean average precision (mAP) for YOLOv5 amounts to 0.940, whereas the enhanced YOLOv5-IC-ST attains a superior mAP of 0.984. The mAP increased by 4.4%, fully demonstrating the effectiveness of the improvement. These comparisons underscore the enhanced performance achieved by YOLOv5-IC-ST. Figure 11 and Figure 12 further depict the visual detection results for YOLOv5 and YOLOv5-IC-ST, respectively. YOLOv5-IC-ST shows better localization and identification for both smaller bird nests and larger plastics. The YOLOv5-IC-ST algorithm precisely identifies foreign objects, showcasing its robust detection capabilities; these visual results substantiate its effectiveness in significantly bolstering the detection accuracy for foreign objects.

4.3. Comparison of Attentional Mechanisms

Comparative experiments on attention mechanisms were conducted to validate the performance of the UAM. Various attention modules, including SE (Squeeze-and-Excitation), CA (Class-Agnostic segmentation network attention), ECA (Efficient Channel Attention), CBAM, and the UAM, were used for the comparison. Table 1 shows the results of the comparison experiments. Adding an attention mechanism module can significantly improve the accuracy of the model, although it increases the number of model parameters. The CA and ECA attention modules each add a relatively high number of parameters, which is not conducive to deploying the model on embedded devices, while yielding comparatively low mAP values. Compared to SE and CBAM, the UAM has a slightly higher number of parameters but the highest mAP value. This is because the UAM attends to channel, spatial, and scale information while limiting the growth in parameters through its parallel structure.

4.4. Comparison of the Performance of Different Detection Algorithms

To validate the effectiveness of the proposed algorithm, the authors of this paper conducted comparative experiments involving YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOv8s, and YOLOv7-tiny [35]. The results, presented in Table 2, demonstrate that the algorithm introduced in this paper outperforms the other models in terms of accuracy, detection speed, and model size. Specifically, the proposed algorithm achieves a mean average precision (mAP) of 98.4%, a detection speed of 92.8 frames per second (FPS), and a compact model size of 10.30 MB. These results indicate that the algorithm presented in this paper offers superior performance compared to the alternative models in the evaluation metrics of accuracy, detection speed, and model size. This highlights the significance of the proposed algorithm for effective and efficient detection tasks.

4.5. Ablation Experiment

In order to comprehensively assess the efficacy of the enhanced modules, a series of ablation experiments were conducted, and their outcomes are presented in Table 3. By isolating and testing individual improvements, a comprehensive understanding of their impact on the overall performance was gained. The baseline YOLOv5, without any enhancements, yielded a mean average precision (mAP) of 94.0% with a model size of 14.46 MB. Substituting the regular convolution with GhostConv notably reduced the model size to 8.86 MB, albeit with a marginal decline in accuracy to 93.8%. The introduction of the Swin Transformer, in contrast, yielded a 2.0% increase in the mAP, a testament to its adeptness at mitigating the influence of complex backgrounds. Furthermore, the incorporation of the Unified Attention Mechanism (UAM) exerted a pronounced influence, elevating the mAP to 98.2%; this underscores the UAM's capability to concurrently prioritize channel, spatial, and scale information. Following the loss function optimization, a further 0.2% increase in the mAP was observed while the model size remained at 10.30 MB: the EIoU loss function, although it does not affect the model's weight, distinctly fortified its proficiency in accurately pinpointing small targets. Collectively, these ablation experiments showcase the potency of each enhancement in refining the performance of YOLOv5-IC-ST. The remarkable reduction in model size, coupled with the enhanced detection accuracy and the capability to handle intricate backgrounds, accentuates the pivotal role of the proposed modifications.

5. Conclusions

In response to the intricate task of detecting foreign objects on transmission lines, this paper introduces a streamlined detection network named YOLOv5-IC-ST. YOLOv5-IC-ST encompasses a collection of pivotal enhancements, featuring a novel lightweight backbone network, the integration of the Swin Transformer, and the incorporation of the Unified Attention Mechanism (UAM). The UAM focuses on channel, spatial, and scale information simultaneously through a tandem structure. These enhancements collectively engender a substantial elevation in YOLOv5-IC-ST’s performance, culminating in a remarkable augmentation in the localization of small target anomalies, the effective suppression of background interference, and accelerated processing speeds. A comparative analysis with other networks demonstrates that the network proposed in this paper outperforms them in the detection of foreign objects in transmission lines. YOLOv5-IC-ST achieves an impressive mean average precision (mAP) of 98.4%, operates at a detection speed of 92.8 frames per second (FPS), and has a compact model size of 10.30 MB. Collectively, these advantages in detection accuracy, speed, and model size make YOLOv5-IC-ST highly suitable for deployment on embedded devices. In the future, we plan to expand the dataset by including more categories of foreign objects, enriching its diversity. Additionally, to ensure the model’s robustness and effectiveness in real-world scenarios, we intend to collect images of real scenes that encompass occlusion, variations in lighting conditions, and other complex situations. This proactive approach will contribute to further improving the model’s performance and applicability.

Author Contributions

Conceptualization, D.Z.; methodology, D.Z.; software, Z.Z.; validation, Z.Z.; formal analysis, D.Z.; investigation, Z.W.; data curation, N.Z.; writing—original draft preparation, D.Z.; writing—review and editing, D.Z.; visualization, Z.Z.; project administration, D.Z.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a project on undergraduate teaching quality from the Yinchuan University of Energy, grant number 2021-JG-X-02, the Ningxia Hui Autonomous Region college students’ innovation and entrepreneurship training program, grant number S202213820006, the Open Fund of the Key Laboratory of Highway Engineering of Ministry of Education, grant number kfj220201, and the Open Research Fund of Hunan Provincial Key Laboratory of Flexible Electronic Materials Genome Engineering, grant number 202019.

Data Availability Statement

Requests for the dataset used in this paper may be made to the email address remindn@163.com. The dataset may be used for scholarly communication only.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yao, N.; Hong, G.; Guo, Y.; Zhang, T. The Detection of Extra Matters on the Transmission Lines Based on the Filter Response and Appearance. In Proceedings of the 2014 Seventh International Symposium on Computational Intelligence and Design, Hangzhou, China, 13–14 December 2014; pp. 542–545. [Google Scholar]
  2. Bhujade, R.M.; Adithya, V.; Hrishikesh, S.; Balamurali, P. Detection of power-lines in complex natural surroundings. Comput. Sci. 2013, 101–108. [Google Scholar] [CrossRef]
  3. Tong, W.-G.; Li, B.-S.; Yuan, J.-S.; Zhao, S.-T. Transmission line extraction and recognition from natural complex background. In Proceedings of the 2009 International Conference on Machine Learning and Cybernetics, Baoding, China, 12–15 July 2009; pp. 2473–2477. [Google Scholar]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef] [PubMed]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  9. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  12. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  13. Du, F.-J.; Jiao, S.-J. Improvement of lightweight convolutional neural network model based on YOLO algorithm and its research in pavement defect detection. Sensors 2022, 22, 3537. [Google Scholar] [CrossRef] [PubMed]
  14. Li, W.; Zhang, L.; Wu, C.; Cui, Z.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Technol. 2022, 123, 1999–2015. [Google Scholar] [CrossRef] [PubMed]
  15. Zhou, F.; Wen, G.; Qian, G.; Ma, Y.; Pan, H.; Liu, J.; Li, J. A high-efficiency deep-learning-based antivibration hammer defect detection model for energy-efficient transmission line inspection systems. Int. J. Antennas Propag. 2022, 2022, 3867581. [Google Scholar] [CrossRef]
  16. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Wu, J.; Sun, Y.; Wang, X. Corrosion detection method of transmission line components in mining area based on multiscale enhanced fusion. Mob. Inf. Syst. 2022, 2022, 3867581. [Google Scholar] [CrossRef]
  19. Wang, B.; Wu, R.; Zheng, Z.; Zhang, W.; Guo, J. Study on the method of transmission line foreign body detection based on deep learning. In Proceedings of the 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2), Beijing, China, 26–28 November 2017; pp. 1–5. [Google Scholar]
  20. Liu, X.; Chen, X.; Cao, S.; Gou, J.; Wang, H. An Algorithm for Recognition of Foreign Objects in Transmission Lines with Small Samples. In Proceedings of the 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March 2022; pp. 2026–2030. [Google Scholar]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  22. Liu, P.; Zhang, Y.; Zhang, K.; Zhang, P.; Li, M. An Improved YOLOv3 Algorithm and Intruder Detection on Transmission Line. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022; pp. 5736–5741. [Google Scholar]
  23. Song, Y.; Zhou, Z.; Li, Q.; Chen, Y.; Xiang, P.; Yu, Q.; Zhang, L.; Lu, Y. Intrusion detection of foreign objects in high-voltage lines based on YOLOv4. In Proceedings of the 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; pp. 1295–1300. [Google Scholar]
  24. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  26. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5217–5226. [Google Scholar]
  27. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  28. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
  29. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  30. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  31. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  32. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  33. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  34. Zhang, Y.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. arXiv 2021, arXiv:2101.08158. [Google Scholar] [CrossRef]
  35. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Figure 1. Example images from the dataset: (a) plastic, (b) a balloon, (c) a nest, and (d) a kite.
Figure 2. The process of calibrating the dataset.
Figure 3. Example of image enhancement.
Figure 4. YOLOv5-IC-ST network structure.
Figure 5. Convolutional block attention module architecture.
Figure 6. The structure of the Swin Transformer module.
Figure 7. Ghost module structure.
Figure 8. The loss function during training.
Figure 9. The precision–recall curve for YOLOv5.
Figure 10. The precision–recall curve for YOLOv5-IC-ST.
Figure 11. The detection results for YOLOv5.
Figure 12. The detection results for YOLOv5-IC-ST.
Table 1. Results of the attentional mechanism comparison experiment.

Attention | Parameter Size (MB) | mAP (%)
None      | 9.67                | 94.41
+SE       | 9.88                | 96.88
+CA       | 10.13               | 95.10
+ECA      | 10.22               | 95.12
+CBAM     | 10.12               | 97.53
+UAM      | 10.30               | 98.40
Table 2. Comparison of experimental results with different algorithms.

Algorithm   | mAP (%) | Detection Speed (FPS) | Model Size (MB)
YOLOv3-tiny | 92.4    | 66.3                  | 32.17
YOLOv4-tiny | 93.2    | 86.4                  | 22.98
YOLOv7-tiny | 95.1    | 88.6                  | 24.45
YOLOv8s     | 96.7    | 90.6                  | 22.34
YOLOv5s     | 94.0    | 90.8                  | 14.46
Ours        | 98.4    | 92.8                  | 10.30
Table 3. Results of ablation experiments.

YOLOv5 | GhostConv | Swin Transformer | UAM | EIoU | mAP (%) | Model Size (MB)
✓      |           |                  |     |      | 94.0    | 14.46
✓      | ✓         |                  |     |      | 93.8    | 8.86
✓      | ✓         | ✓                |     |      | 95.8    | 9.67
✓      | ✓         | ✓                | ✓   |      | 98.2    | 10.30
✓      | ✓         | ✓                | ✓   | ✓    | 98.4    | 10.30
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
