1. Introduction
In recent years, new energy electric vehicles have gradually gained a large share of the global automobile market owing to their low energy consumption and low carbon emissions. At the same time, the manufacturing process of new energy power batteries has attracted widespread attention [1]. Defects inevitably appear on the poles of new energy power batteries during laser welding [2], posing great potential hazards to the production and safe use of the batteries. Therefore, accurately detecting these laser welding defects is of great significance. Automatic optical inspection (AOI) systems are often used in industrial production to detect and process weld defects [3,4]. An AOI system mainly consists of three modules: an image acquisition module, an image processing module, and an image analysis module. The image acquisition module primarily consists of a CMOS industrial camera and an LED light source [5]. The image analysis module is the most important link in the AOI system and is the core module that determines whether the system can efficiently identify weld defects; it usually relies on a defect detection algorithm to identify laser welding defects. Accordingly, for the safe production of new energy power batteries, a highly efficient laser welding defect detection algorithm for battery poles must be designed. With the rapid development of convolutional neural network (CNN) research and image processing algorithms, deep learning-based target detection has advanced considerably and has gradually been adopted in AOI systems for defect detection. Deep learning-based target detection algorithms fall into two categories: two-stage and one-stage algorithms. One of the first two-stage target detection algorithms was the R-CNN proposed by Girshick et al. [6] in 2014. To improve R-CNN, Girshick et al. [7] subsequently proposed Fast R-CNN in 2015. In the same year, Ren et al. [8] proposed Faster R-CNN, which further improved detection speed. Among one-stage target detection algorithms, Redmon et al. [9] first proposed You Only Look Once (YOLO) in 2016, and the Single Shot Multibox Detector (SSD) was proposed by Liu et al. [10]. One-stage algorithms recast target detection as a regression problem without generating candidate boxes, greatly increasing model speed and making it feasible to deploy target detection algorithms in industry. Subsequent algorithms such as YOLOv2 [11], YOLOv3 [12], YOLOv4 [13], and YOLOX [14] have considerably improved detection accuracy and speed.
Among the aforementioned algorithms, Faster R-CNN is the most mature of the two-stage target detection algorithms and is widely used in industrial production. Kaihua Zhang et al. [15] proposed an improved Faster R-CNN algorithm that generates anchor boxes by clustering and combines transfer learning with ResNet-101 to address the inefficiency and safety shortcomings of lithium battery connector solder joint inspection methods. Min-jae Jung et al. [16] proposed an automatic weld defect detection method based on Faster R-CNN to address the costly manual inspection of weld quality in the shipbuilding industry. Moyun Liu et al. [17] improved the YOLO algorithm by designing an enhanced multi-scale feature module so that the extracted feature maps represent richer information, achieving a better balance between accuracy and speed on an X-ray weld defect dataset. Yu-Ting Li et al. [18] used ResNet-101 as a classifier on top of the YOLOv2 algorithm to automatically detect solder defects in the printed circuit board (PCB) dual in-line package (DIP) process, reducing the cost of manual inspection. Jiexin Zheng et al. [19] proposed a YOLOv3-based steel surface defect detection algorithm that uses MobileNet as the backbone network and performs well on the NEU-DET dataset. Srihari M et al. [20] proposed an effective casting defect detection model based on an improved YOLOv4. Although these early one-stage detection algorithms achieved the basic accuracy required for industrial defect detection, their detection speed still needed improvement.
With the introduction of the YOLOv5 algorithm, high accuracy and fast detection finally became achievable for industrial defect detection. Meng Zhang et al. [21] addressed the complex backgrounds and difficult defect detection in solar cell images based on the YOLOv5 algorithm, but the increased model complexity led to a decrease in detection speed. Zhuang Li et al. [22] proposed a two-stage industrial defect detection framework based on an improved YOLOv5 model and Optimized-Inception-ResnetV2 to accurately identify small target defects on steel surfaces; however, the detection accuracy still fell short of industrial production requirements. Dingming Yang et al. [23] proposed an improved YOLOv5 pipeline weld defect detection algorithm, which effectively improved detection efficiency and basically met the accuracy and speed requirements of industrial defect detection. Although the above methods have basically achieved industrial defect detection, some problems remain unsolved, chiefly how to improve detection accuracy while keeping the detection speed fast enough and how to detect small target defects reliably.
Meanwhile, with the rapid development of industrial automation, higher requirements have been placed on defect detection speed and accuracy. To meet the rapidly growing demand for detecting laser welding defects on lithium battery poles, we developed a YOLOv5-based algorithm as the image analysis module of an AOI system. We did not use the officially provided pre-trained weights; all network models were trained from scratch. We compared the improved model with several influential algorithms such as YOLOv7 [24], YOLOv6 [25], and YOLOX, and the results show that our model achieves high accuracy and can meet the industrial demand for real-time detection.
2. Laser Welding Defect Detection Model for Lithium Battery Pole
2.1. YOLOv5 Algorithm
The official YOLOv5 source code provides four network models with increasing network depth and feature map width, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. To meet industrial requirements for defect detection algorithms, such as real-time detection and easy deployment, we chose YOLOv5s, which has the smallest model size and the fastest detection speed, as the base architecture.
Figure 1 shows the network structure of YOLOv5, which mainly includes four parts: Input, Backbone, Neck, and Head.
Input includes three modules: adaptive image scaling, Mosaic data enhancement, and adaptive anchor box calculation. When images of different sizes are fed into the network, the algorithm first normalizes the image size to 640 × 640 by adaptive image scaling. Mosaic data enhancement then randomly selects four images for cropping, scaling, and arrangement before they are fed into the network for training. The YOLOv5 algorithm provides nine prior anchor boxes obtained on the MS COCO dataset; before training starts, it calculates the best possible recall between the dataset annotations and the prior anchor boxes, and when this recall is below 0.98, it uses the K-means algorithm to re-cluster anchor boxes that better fit the dataset.
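As a concrete illustration of the anchor check described above, the following NumPy sketch computes a best-possible-recall style metric between dataset boxes and a set of anchors; the function name, the ratio threshold of 4, and the array shapes are illustrative assumptions rather than the exact YOLOv5 implementation.

```python
import numpy as np

def best_possible_recall(gt_wh, anchor_wh, ratio_thr=4.0):
    """gt_wh: (N, 2) ground-truth box widths/heights after resizing to 640 x 640.
    anchor_wh: (M, 2) anchor widths/heights.
    A box counts as recallable if some anchor matches both its width and height
    within the ratio threshold."""
    r = gt_wh[:, None, :] / anchor_wh[None, :, :]   # (N, M, 2) size ratios
    match = np.minimum(r, 1.0 / r).min(axis=2)      # worst of width/height agreement
    best = match.max(axis=1)                        # best anchor per ground-truth box
    return float((best > 1.0 / ratio_thr).mean())

# If the recall falls below 0.98, the anchors are re-clustered (e.g., k-means on
# gt_wh) so that they better fit the dataset before training starts.
```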
Backbone is the backbone network used to extract features from the input image; it mainly consists of Conv Batch Normalization SiLU (CBS), CSP bottleneck with 3 Conv (C3), and Spatial Pyramid Pooling Fast (SPPF) modules. Compared with previous versions, the latest YOLOv5 release uses a CBS block with a 6 × 6 convolutional kernel to replace the original Focus module in the first layer of the network, reducing the training time cost. The first CBS block is followed by four CBS blocks and four C3 modules, and the SPPF module forms the last layer of the backbone. The 640 × 640 × 3 input image is first halved in resolution and quadrupled in channels by the 6 × 6 convolution to obtain a 320 × 320 × 12 feature map, which is then fed into the CBS and C3 modules. Finally, the feature map passes through the SPPF module to extract high-level semantic information.
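For reference, a minimal PyTorch-style sketch of the CBS block and the SPPF module described above is given below; the chained 5 × 5 max-pooling follows the public YOLOv5 design, while the class names and default arguments here are our own shorthand rather than the official code.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv -> BatchNorm -> SiLU block used throughout the backbone."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three chained 5x5 max-pools whose outputs
    are concatenated, equivalent to parallel pooling with 5/9/13 kernels."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1, 0)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1, 0)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```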
YOLOv5, like YOLOv4, still uses the feature pyramid network and path aggregation network (FPN-PAN) structure in its Neck. The FPN structure transmits semantic information from top to bottom through up-sampling, while the PAN structure transmits location information from bottom to top through down-sampling. Combining the two structures allows the network to fuse more feature information, constituting a multi-scale feature fusion module that retains both large-scale and small-scale target features. For a 640 × 640 input image, the Head outputs feature map grids of 20 × 20, 40 × 40, and 80 × 80, which are used to predict large, medium, and small targets, respectively. Three anchor boxes are used for prediction at each scale, and finally the prediction boxes with the highest confidence are kept by non-maximum suppression.
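A quick way to see this multi-scale prediction layout is to print the tensor shape each detection head produces for a 640 × 640 input; the number of defect classes below is only a placeholder.

```python
# Each head predicts on a grid of 640/stride cells, with 3 anchors per cell and
# (5 + nc) values per anchor: box (4) + objectness (1) + class scores (nc).
nc = 3  # placeholder number of defect classes (assumption)
for stride in (8, 16, 32):
    g = 640 // stride
    print(f"stride {stride:2d}: head output shape (N, 3, {g}, {g}, {5 + nc})")
```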
2.2. The Improved YOLOv5 Model
The laser welding defects of a battery pole are irregular in shape, random in location, and varied in size; in addition, there are often many small target defects. In this case, the original YOLOv5 model cannot fully meet the detection requirements, exhibiting low detection accuracy and a high missed detection rate. The improved YOLOv5 network model is shown in Figure 2. First, to improve the algorithm's ability to detect small target defects, we improved the backbone network by replacing the original 3 × 3 convolutional kernels with 6 × 6 convolutional kernels and replacing the SPPF module in the last layer of the backbone with our SPPSE module. We then introduced the lightweight convolutional neural network Re-Parameterization of Visual Geometry Group (RepVGG) [26] in the layer above each of the three detection heads. Finally, the original loss function CIoU [27] was replaced with SIoU [28].
2.2.1. The Improved CBS Module
To improve the feature extraction capability of the backbone network for small welding defect targets, we increased the convolutional kernel size in the CBS module from 3 × 3 to 6 × 6. Larger convolutional kernels effectively enlarge the receptive field, capture more contextual information, and enable the backbone network to extract small low-level features more efficiently, improving the model's ability to detect small targets. The improved CBS module is shown in Figure 3; it consists of a 6 × 6 convolutional layer, a batch normalization (BN) layer, and a SiLU activation layer connected in series. BN significantly improves convergence without requiring other forms of regularization; SiLU is unbounded above, bounded below, smooth, and non-monotonic, and outperforms ReLU in deep models.
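The improved block can be expressed compactly in PyTorch; the output channel count and stride below are illustrative assumptions, while the 6 × 6 kernel, BN, and SiLU follow the description above.

```python
import torch
import torch.nn as nn

class CBS6(nn.Module):
    """Improved CBS block: 6x6 convolution followed by BatchNorm and SiLU.
    With stride 2 and padding 2, a 640x640 input is downsampled to 320x320."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=6, stride=stride, padding=2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)
print(CBS6(3, 32)(x).shape)   # torch.Size([1, 32, 320, 320])
```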
2.2.2. The SPPSE Module
Spatial Pyramid Pooling (SPP) [29] was first proposed to avoid the incomplete cropping and shape distortion of image objects caused by the region cropping and scaling operations of the R-CNN algorithm, to avoid the repeated feature extraction of images by convolutional neural networks, to greatly accelerate the generation of candidate boxes, and to save computational cost. To adapt the model to images of different resolutions, we combined the Spatial Pyramid Pooling Cross Stage Partial Conv (SPPCSPC) module of YOLOv7 with the Squeeze and Excitation Network (SENet) [30] attention mechanism and named the result the SPPSE module, shown in Figure 4a. In the SPPSE module, the input features are split into two branches: one undergoes a convolution operation and the other passes through the SPP structure, and the two branches are finally fused by Concat. This design roughly halves the computation, speeds up model inference, and improves accuracy. Meanwhile, to select the key information for the current task and improve the efficiency and accuracy of image information processing, we added the SENet attention mechanism at the top of the SPPSE module, which suppresses useless information from different channels and enhances the focus on the target region.
As shown in Figure 4b, the SENet attention mechanism consists of squeeze and excitation operations. The model determines the importance of each channel feature from the interrelationships between channels and then assigns a different weight to each channel, so that the channels that matter more to the result receive more attention.
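The following PyTorch sketch puts Section 2.2.2 together: a plain squeeze-and-excitation layer and an SPPCSPC-style two-branch block with SE applied at its output. Channel widths, the 5/9/13 pooling kernels, and the exact position of the SE layer are assumptions based on the YOLOv7 SPPCSPC design and the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: global-average-pool each channel (squeeze),
    pass it through a small bottleneck MLP (excitation), and rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: (b, c)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w                      # reweight the feature map

class SPPSE(nn.Module):
    """Sketch of the SPPSE module: one plain convolution branch and one SPP
    branch are concatenated, then SE attention reweights the fused features."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_out // 2
        conv = lambda ci, co, k: nn.Sequential(
            nn.Conv2d(ci, co, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(co), nn.SiLU())
        self.branch1 = conv(c_in, c_hidden, 1)                  # shortcut branch
        self.pre = nn.Sequential(conv(c_in, c_hidden, 1),
                                 conv(c_hidden, c_hidden, 3),
                                 conv(c_hidden, c_hidden, 1))   # SPP branch, pre-pool convs
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)])
        self.post = nn.Sequential(conv(c_hidden * 4, c_hidden, 1),
                                  conv(c_hidden, c_hidden, 3))
        self.out = conv(c_hidden * 2, c_out, 1)
        self.se = SELayer(c_out)

    def forward(self, x):
        y1 = self.branch1(x)
        y2 = self.pre(x)
        y2 = self.post(torch.cat([y2] + [p(y2) for p in self.pools], dim=1))
        return self.se(self.out(torch.cat([y1, y2], dim=1)))
```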
2.2.3. The Improved RepVGG Module
RepVGG was proposed in 2021 based on the idea of re-parameterization. Its core technology is to enhance the feature extraction capability by re-parameterizing the structure and improving the inference speed by using a multi-branch design during training and a single-branch procedure during inference. The training and inference schematic of the RepVGG module is shown in
Figure 5.
As shown in Figure 5a, the RepVGG module uses a multi-branch structure in the training phase, consisting of a 3 × 3 convolution branch, a 1 × 1 convolution branch, and an identity branch. The features obtained from the different receptive fields of these branches are summed, and the block is stacked repeatedly to deepen the network, enhancing the model's ability and efficiency in extracting feature information. At inference time, the trained model is converted into the inference model shown in Figure 5b, which is equivalent to a plain stack of directly connected VGG-style layers and therefore improves the inference speed. Meanwhile, to improve the detection accuracy of the model, we replaced the original ReLU activation function with SiLU.
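A minimal training-time RepVGG block, with SiLU substituted for ReLU as described above, can be sketched as follows; the stride-1, equal-channel case is shown, and the algebraic fusion of the three branches into a single 3 × 3 convolution at deployment is only indicated in the comments.

```python
import torch
import torch.nn as nn

class RepVGGBlock(nn.Module):
    """Training-time RepVGG block: 3x3 conv, 1x1 conv, and identity branches
    (each followed by BatchNorm) are summed and passed through SiLU.
    At inference, the three branches can be merged into a single 3x3 convolution
    (structural re-parameterization), so the deployed model is a plain stack of
    3x3 conv + activation layers."""
    def __init__(self, channels):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, 1, 0, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_id = nn.BatchNorm2d(channels)   # identity branch
        self.act = nn.SiLU()                        # SiLU instead of the original ReLU

    def forward(self, x):
        return self.act(self.branch3x3(x) + self.branch1x1(x) + self.branch_id(x))
```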
2.2.4. The Improved Loss Function
The loss function of the YOLOv5 model is shown in Equation (1), and it consists of three components: the confidence loss function $L_{obj}$, the classification loss function $L_{cls}$, and the bounding box regression loss function $L_{box}$:

$$Loss = L_{obj} + L_{cls} + L_{box} \quad (1)$$
The YOLOv5 source code uses CIoU as the bounding box regression loss function. CIoU takes into account the overlap area, center point distance, and aspect ratio of the prediction box and the ground truth box, but its aspect ratio term cannot truly reflect the real difference between the widths and heights of the two boxes, so we chose to use SIoU instead. The SIoU loss function redefines the penalty metric to address the mismatch between the ground truth box and the prediction box and takes the vector angle between the required regressions into account.
The SIoU [28] loss function mainly contains the following four parts: Angle cost $\Lambda$, Distance cost $\Delta$, Shape cost $\Omega$, and IoU cost. Angle cost is defined by Equation (2):

$$\Lambda = 1 - 2\sin^{2}\!\left(\arcsin(x) - \frac{\pi}{4}\right) \quad (2)$$

where $x = \frac{c_h}{\sigma} = \sin(\alpha)$; $\sigma$ is the distance between the center point of the prediction box and that of the ground truth box; $c_h$ is the vertical distance between the center point of the prediction box and that of the ground truth box.
Considering the Angle cost defined above, the Distance cost is defined by Equation (3):

$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_{t}}\right) \quad (3)$$

where $\gamma = 2 - \Lambda$, $\rho_{x} = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^{2}$, and $\rho_{y} = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^{2}$; $b_{c_x}^{gt}$ and $b_{c_y}^{gt}$ are the horizontal and vertical coordinates of the center point of the ground truth box; $b_{c_x}$ and $b_{c_y}$ are the horizontal and vertical coordinates of the center point of the prediction box; $c_w$ and $c_h$ in Equation (3) are the width and height of the smallest enclosing box of the prediction box and the ground truth box. Shape cost is defined by Equation (4):
$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_{t}}\right)^{\theta} \quad (4)$$

where $\omega_{w} = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}$ and $\omega_{h} = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$; $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box; $w$ and $h$ are the width and height of the prediction box.
The value of $\theta$ determines how much attention should be paid to the Shape cost and is obtained by a genetic algorithm; it differs across datasets and ranges from 2 to 4. IoU cost is defined by Equation (5):

$$IoU = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \quad (5)$$
where $B^{gt}$ is the area of the ground truth box and $B$ is the area of the prediction box. The final total loss function SIoU is then defined by Equation (6):

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \quad (6)$$
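Equations (2)-(6) can be combined into a single bounding box loss routine. The sketch below assumes boxes in (x1, y1, x2, y2) format and uses the width and height of the smallest enclosing box for the Distance cost, as defined above; it is an illustrative implementation rather than the exact training code.

```python
import math
import torch

def siou_loss(pred, target, theta=4.0, eps=1e-7):
    """SIoU bounding box loss for boxes given as (x1, y1, x2, y2) tensors.
    theta is the Shape cost exponent (2-4, tuned per dataset)."""
    # widths, heights, and center coordinates of both boxes
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2

    # IoU cost, Equation (5)
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Angle cost, Equation (2): sigma is the center distance, dy its vertical component
    dx, dy = torch.abs(cx2 - cx1), torch.abs(cy2 - cy1)
    sigma = torch.sqrt(dx ** 2 + dy ** 2) + eps
    angle = 1 - 2 * torch.sin(torch.asin(dy / sigma) - math.pi / 4) ** 2

    # Distance cost, Equation (3), normalized by the enclosing box width/height
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    gamma = 2 - angle
    rho_x, rho_y = (dx / (cw + eps)) ** 2, (dy / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost, Equation (4)
    omega_w = torch.abs(w1 - w2) / torch.max(w1, w2).clamp(min=eps)
    omega_h = torch.abs(h1 - h2) / torch.max(h1, h2).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    # Equation (6): final SIoU loss
    return 1 - iou + (dist + shape) / 2
```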