2.1. Image CZS Preprocessing
An attention mechanism models the relationship between the input and the output of an algorithm, regardless of distance [22]. For defect inspection, the angles of industrial cameras are relatively fixed, so the possible defect regions can be predefined and features extracted only from those specific regions. The self-attention mechanism calculates a semantic representation of a sequence by associating its different positions. Adding this image preprocessing is equivalent to adding a self-attention mechanism to the YOLOv3 preprocessing stage. The CZS operations are shown in Figure 2. The blue box represents the cutting region, and the green box and the red box represent two kinds of defect markup regions. The color boxes numbered 1 to 8 in the original image on the left side correspond to the splicing regions of the image on the right side. All color boxes are the main areas of concern for defect detection. The entire process involves three steps: cut predefined regions from the original image, zoom the regions to the same size, and splice the regions together to form a new image, named CZS operations for short.
The cutting operation takes a small square box containing a defect region as the cutting region. The box is slightly larger than the smallest box containing the defect region, so as to ensure fault-tolerant positioning for images of the same type. Define the width and height of the original image as Wo and Ho, respectively. In the markup, the width of the defect region is annotated as a ratio rw of the original image width, and thus the absolute width of the defect region is wd = rw·Wo. Similarly, define the height, x-coordinate and y-coordinate ratios of the defect region's center point as rh, rx and ry, so that its height, x-coordinate and y-coordinate are hd = rh·Ho, xd = rx·Wo and yd = ry·Ho. Define the width, height, x-coordinate and y-coordinate of the cutting box's top left corner as wc, hc, xc and yc, respectively. They are obtained as follows:

wc = hc = α·max(wd, hd)    (1)

xc = min(max(xd - wc/2, 0), Wo - wc)    (2)

yc = min(max(yd - hc/2, 0), Ho - hc)    (3)

The α is the expansion coefficient, which takes a value between 1 and 2. This means that the cutting box is 1- to 2-fold larger than the smallest box containing the defect region. Formulas (2) and (3) mean that when the defect region is close to the boundary of the original image, the top left corner coordinates of the cutting box are clamped so that the box stays consistent with (i.e., inside) the original image.
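The cutting step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a YOLO-style normalized annotation (width/height ratios and center-point coordinate ratios), a square cutting box whose side is α times the longer side of the defect region, and clamping at the image boundary. The function name and signature are illustrative, not the paper's implementation.

```python
def cutting_box(rw, rh, rx, ry, Wo, Ho, alpha=1.5):
    """Compute a square cutting box around one defect region.

    (rw, rh, rx, ry): YOLO-style normalized annotation, i.e. the width and
    height ratios and the center-point coordinate ratios of the defect region.
    alpha (between 1 and 2) expands the box beyond the smallest enclosing box.
    Returns (xc, yc, side): top-left corner and side length in pixels.
    """
    wd, hd = rw * Wo, rh * Ho          # absolute defect size
    xd, yd = rx * Wo, ry * Ho          # absolute defect center
    side = alpha * max(wd, hd)         # square box side (assumed form)
    # Clamp the top-left corner so the box stays inside the image
    xc = min(max(xd - side / 2, 0), Wo - side)
    yc = min(max(yd - side / 2, 0), Ho - side)
    return xc, yc, side
```

For example, a defect centered near the left edge of the image yields a box clamped to x = 0.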
The zooming operation scales all cutting boxes on an image to the same size, so that they can be fully held by a new 416 × 416 image (416 × 416 pixels is the standard image size processed by YOLOv3). According to YOLOv3, images will be scaled to 416 pixels in width (W) and height (H) before being processed. Suppose the number of cutting boxes on an image is n. Then the number of boxes that can be held in a row (or a column) of the new image, m, is calculated by Formula (4). The target side length s that a cutting box should be scaled to is calculated by Formula (5), and the scaling factor β by Formula (6):

m = ceil(sqrt(n))    (4)

s = floor(416/m)    (5)

β = s/wc    (6)

The sqrt function returns the square root of the passed argument. The ceil function returns the smallest integer not less than the passed argument, and the floor function returns the largest integer not greater than the passed argument. Here wc is the side length of the square cutting box defined above.
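The grid layout of Formulas (4), (5) and (6) can be sketched as follows, with `math.sqrt`, `math.ceil` and `math.floor` playing the roles of the sqrt, ceil and floor functions (function names are illustrative):

```python
import math

def grid_layout(n, new_size=416):
    """Grid layout for n cutting boxes inside a new_size x new_size image.

    Returns (m, s): the number of boxes per row (or column), Formula (4),
    and the target side length each box is scaled to, Formula (5).
    """
    m = math.ceil(math.sqrt(n))
    s = math.floor(new_size / m)
    return m, s

def scale_factor(s, box_side):
    """Scaling factor beta for a square cutting box, Formula (6)."""
    return s / box_side
```

With the eight regions of Figure 2, `grid_layout(8)` gives a 3 × 3 grid of 138-pixel cells, which matches the nine-cell layout on the right side of the figure.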
The splicing operation combines the scaled cutting boxes from the original image into a new image. Splicing mainly consists of two mapping tasks: one maps each cutting box to its splicing region; the other maps the actual defect markup region to the splicing region. For the first task, we first sort the cutting boxes from small to large according to their yc value. If the yc values of two cutting boxes are the same, then we sort them from small to large according to their xc value. Suppose that a cutting box is ranked as k, k = 0, 1, ..., n - 1. Then the width and height of the box in the new image are both s, and the x-coordinate and y-coordinate of the box's top left corner in the new image are xk = (k % m)·s and yk = (k // m)·s. Operator // represents integer (floor) division and % represents the remainder (modulo) operation.
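The row-major placement can be sketched as follows, assuming the boxes are sorted by the y- and then x-coordinate of their top left corners (the sort keys are inferred from the text, and the 0-based rank is an assumption):

```python
def splice_positions(boxes, m, s):
    """Map each cutting box to its top-left corner in the new image.

    boxes: list of (xc, yc) top-left corners in the original image.
    m: boxes per row; s: cell side length in the new image.
    Boxes are sorted by yc, then xc; rank k is placed at
    ((k % m) * s, (k // m) * s).  Returns {box_index: (xk, yk)}.
    """
    order = sorted(range(len(boxes)),
                   key=lambda i: (boxes[i][1], boxes[i][0]))
    return {i: ((k % m) * s, (k // m) * s) for k, i in enumerate(order)}
```

For three boxes in a 2-column grid, the first row fills left to right before the second row starts.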
For the second task, the width, height, x-coordinate and y-coordinate of the center point of an actual defect markup region, expressed as ratios of the new image size, are rw' = β·wd/416, rh' = β·hd/416, rx' = (xk + β·(xd - xc))/416 and ry' = (yk + β·(yd - yc))/416.
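The remapping of a markup annotation into the new image's normalized coordinates can be sketched as below. Symbol names are illustrative; `box` and `cell` are the top left corners of the cutting box in the original image and of its splicing region in the new image, respectively.

```python
def remap_annotation(rw, rh, rx, ry, Wo, Ho, box, cell, beta, new_size=416):
    """Re-express one defect markup annotation relative to the new image.

    (rw, rh, rx, ry): original YOLO-style annotation ratios.
    box  = (xc, yc): top-left of the cutting box in the original image.
    cell = (xk, yk): top-left of the splicing region in the new image.
    Returns the new (width, height, center-x, center-y) ratios.
    """
    wd, hd = rw * Wo, rh * Ho          # absolute defect size
    xd, yd = rx * Wo, ry * Ho          # absolute defect center
    xc, yc = box
    xk, yk = cell
    return (beta * wd / new_size,
            beta * hd / new_size,
            (xk + beta * (xd - xc)) / new_size,
            (yk + beta * (yd - yc)) / new_size)
```

As a sanity check, a box cut at the origin, placed at the origin and scaled by β = 1 leaves the annotation unchanged.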
In addition to the above algorithms, the generated new image may contain some blank regions. We fill them with 0 or 255 values, shown as the square labelled 9 in the right-side image of Figure 2. Then, the CZS preprocessing of the image is finished.
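The fill-and-paste step can be illustrated with a plain-Python single-channel canvas (in practice a library such as NumPy or OpenCV would be used; this sketch only shows the idea):

```python
def blank_canvas(new_size=416, fill=255):
    """A new_size x new_size single-channel image pre-filled with `fill`
    (0 or 255).  Cells not covered by any cutting box keep this value,
    like the square labelled 9 in Figure 2."""
    return [[fill] * new_size for _ in range(new_size)]

def paste(canvas, patch, xk, yk):
    """Paste a scaled cutting box (a list of pixel rows) at top-left (xk, yk)."""
    for dy, row in enumerate(patch):
        for dx, v in enumerate(row):
            canvas[yk + dy][xk + dx] = v
    return canvas
```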
2.2. Tailoring the Backbone Network
According to the size of the inspection target, the YOLOv3 backbone network can be tailored to detect the defect regions more efficiently.
As shown in
Figure 3, the backbone network of classical YOLOv3 includes 53 layers, hence the name Darknet-53. Among them, Convolutional is a convolution layer, Residual is the skip-connection layer of a residual network, Avgpool is an average pooling layer, and Connected is the fully connected layer. The labels ×1, ×2, ×8, ×8 and ×4 indicate that the corresponding steps are repeated 1, 2, 8, 8 and 4 times, respectively. Note that the five repeated steps correspond to five down-sampling stages. Additionally, the outputs of the last three stages (×8, ×8 and ×4) correspond to the classification prediction feature maps (YOLO layers) at three resolution scales: 52 × 52, 26 × 26 and 13 × 13. The final feature maps of classical YOLOv3 thus come in three sizes; the 52 × 52 resolution better supports detecting tiny objects, while the 13 × 13 resolution is more suitable for identifying larger objects.
Classical YOLOv3 is designed for general object detection, covering both large and small objects. The distance between the camera and the photographed object also affects the size of the object to be recognized. Vision-based defect inspection, however, differs from this general setting. The angle of an industrial camera is relatively fixed, and the shape and size of the defects to be inspected are also relatively fixed. Therefore, based on the fixed shape and size of the defect, only the corresponding resolution networks need to be retained, instead of all three scales (52 × 52, 26 × 26 and 13 × 13). For example, at a production site for automobile rubber and plastic parts, visible defects are commonly of moderate size and clearly distinguishable, so the inspection network for such defects does not require a very high resolution. In the field of silicon chip solder joint quality inspection, on the other hand, defect inspection of welding points needs high precision: the solder joint layout on the chip is very fine and tiny, so a very high-resolution network is needed for identification. In short, the network structure can be optimized according to the targeted inspection task. Tailoring the YOLOv3 backbone network can be based on the following formulas.
The if (condition) statement is the basic conditional control structure: the tailor function is executed only when the given condition holds. The tailor function deletes the part of the input network used to identify a certain scale. The yolo52×52 denotes the part of YOLOv3 that inspects the smallest targets (the finest, 52 × 52 feature map), while yolo13×13 denotes the part that inspects the largest targets (the coarsest, 13 × 13 feature map). The every function quantifies over every inspected target. The N represents the full number of inspected targets on an image. The wi and hi represent the width and height of a target, and W and H represent the image's width and height.
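Since the exact threshold formulas are not reproduced here, the following sketch uses hypothetical relative-size thresholds (`small_thr`, `large_thr`) purely to illustrate the tailoring condition; the branch names mirror yolo52×52, yolo26×26 and yolo13×13:

```python
def tailor_scales(targets, W, H, small_thr=0.1, large_thr=0.5):
    """Decide which YOLO scale branches to keep for a fixed inspection task.

    targets: list of (wi, hi) sizes of all inspected targets on an image.
    small_thr / large_thr are hypothetical relative-size thresholds:
    if every target is larger than small_thr of the image, the fine
    52x52 branch can be tailored off; if every target is smaller than
    large_thr, the coarse 13x13 branch can be tailored off.
    """
    scales = {"yolo52x52", "yolo26x26", "yolo13x13"}
    if all(wi / W > small_thr and hi / H > small_thr for wi, hi in targets):
        scales.discard("yolo52x52")   # no tiny targets: drop the fine scale
    if all(wi / W < large_thr and hi / H < large_thr for wi, hi in targets):
        scales.discard("yolo13x13")   # no large targets: drop the coarse scale
    return scales
```

With only moderate-size targets, both the finest and the coarsest branches are tailored off, leaving the 26 × 26 scale.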
In the case study of Section 4, the 52 × 52 fine-resolution scale network is shrunk; that is, the third down-sampling stage of the backbone network is partially deleted. In classical YOLOv3, this stage consists of eight rounds of repetition. In our algorithm, seven rounds are tailored off, removing 14 convolutional layers in total. Thus, the backbone network is condensed from 53 layers to 39 layers, and our algorithm's backbone turns into Darknet-39.
2.3. Data Augmentation
In order to enhance attention and improve the recognition accuracy of deep learning networks, it is also necessary to implement data augmentation to expand the dataset. The whole process of data augmentation is shown in
Figure 4. There are mainly two strategies.
Strategy 1: Add random noise to the defect markup regions of the original image, turning them from normal into noisy or faulty ones. As shown in Figure 5, a rectangular cover is used to simulate noise or missing faults on the surface of a markup region; its position, size and color can be set randomly. From left to right, Figure 5a is normal; in Figure 5b a rectangle covers 1/3 of a markup region, which is equivalent to adding some noise, so training should ensure that the network can still recognize such a markup region; in Figure 5c,d rectangles completely cover one or two markup regions to simulate missing faults.
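Strategy 1 can be sketched as follows. The rectangle here covers a random horizontal slice of the markup region; the function name, the `frac` parameter and the single-channel list-of-rows image representation are illustrative assumptions:

```python
import random

def cover_with_rectangle(img, region, frac=1.0, value=0, rng=None):
    """Simulate noise or missing faults by covering part of a markup region.

    img: single-channel image as a list of pixel-row lists.
    region: (x, y, w, h) of the markup region.
    frac: fraction of the region's width to cover, e.g. 1/3 to add noise
    (Figure 5b) or 1.0 to simulate a missing fault (Figure 5c,d).
    The horizontal position of the cover inside the region is random.
    """
    rng = rng or random.Random()
    x, y, w, h = region
    cw = max(1, int(w * frac))                 # cover width in pixels
    cx = x + rng.randrange(0, w - cw + 1)      # random offset inside region
    for row in img[y:y + h]:
        row[cx:cx + cw] = [value] * cw
    return img
```

Passing a seeded `random.Random` makes the augmentation reproducible across runs.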
Strategy 2: Rotate the image by 90°, 180° and 270°, and flip each result horizontally. Each image is thereby expanded into 8 images by rotation and flipping, as shown in Figure 6. In the original picture, there are a total of 8 regions to be inspected, divided into two types represented by red boxes and green boxes, respectively.
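Strategy 2 (four rotations, each optionally flipped horizontally, giving 8 images) can be sketched in plain Python on a list-of-rows image; a real pipeline would use a library such as NumPy or PIL:

```python
def rot90(img):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def flip_h(img):
    """Flip a list-of-rows image horizontally."""
    return [row[::-1] for row in img]

def augment(img):
    """Expand one image into 8: rotations of 0/90/180/270 degrees,
    each kept as-is and also flipped horizontally."""
    out = []
    cur = img
    for _ in range(4):
        out.append(cur)
        out.append(flip_h(cur))
        cur = rot90(cur)
    return out
```

For an image with no rotational symmetry, all 8 variants are distinct, matching the eightfold expansion of the dataset.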
The dataset enhancement strategy not only improves the quality of training but also effectively reduces manpower consumption. In this study, there are only 630 original photos, whose defect markup regions had to be annotated manually. A Python script then automatically completes the image preprocessing and dataset enhancement, so that the final dataset size for training and testing reaches 40,320 photos.