Article

MRS-Transformer: Texture Splicing Method to Remove Defects in Solid Wood Board

School of Computer Science, Changzhou University, Changzhou 312000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 7006; https://doi.org/10.3390/app13127006
Submission received: 11 May 2023 / Revised: 30 May 2023 / Accepted: 7 June 2023 / Published: 10 June 2023
(This article belongs to the Special Issue AI Applications in the Industrial Technologies)

Abstract

Defects formed during wood growth affect the quality and grade of wood products. Current research on wood texture defects focuses mainly on defect localization and ignores the splicing problem of maintaining texture consistency. In this paper, we designed the MRS-Transformer network and introduced image inpainting to the field of solid wood board splicing. First, we proposed an asymmetric encoder-decoder based on Vision Transformer, in which the encoder uses a fixed mask (M) strategy, discarding the masked patches and taking only the unmasked visible patches as input to reduce model computation. Second, we designed a reverse Swin (RS) module with multi-scale characteristics as the decoder to adjust the size of the divided image patches and complete the restoration from coarse to fine. Finally, we proposed a weighted L2 (MSE, mean square error) loss, which assigns different weights to the unmasked region according to its distance from the defective region, allowing the model to make full use of the valid pixels when repairing the masked region. To demonstrate the effectiveness of the designed modules, we used MSE (mean square error), LPIPS (learned perceptual image patch similarity), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity) to measure the quality of the wood texture images generated by the model, and FLOPs (floating point operations) to measure its computational complexity, and we designed the corresponding ablation experiments. The results show that the MSE, LPIPS, PSNR, and SSIM of the wood images restored by the MRS-Transformer reached 0.0003, 0.154, 40.12, and 0.9173, and the GFLOPs were 20.18. Compared with images generated by the Vision Transformer, the MSE and LPIPS were reduced by 51.7% and 30%, the PSNR and SSIM were improved by 12.2% and 7.5%, and the GFLOPs were reduced by 38%. To verify the superiority of the MRS-Transformer, we compared it with the image inpainting algorithms Deepfill v2 and TG-Net: the MSE was 47.0% and 66.9% lower, the LPIPS was 60.6% and 42.5% lower, the FLOPs were 70.6% and 53.5% lower, the PSNR was 16.1% and 26.2% higher, and the SSIM was 7.3% and 5.8% higher, respectively. The MRS-Transformer repairs a single image in 0.05 s, nearly five times faster than Deepfill v2 and TG-Net. The experimental results demonstrate that the RSwin module effectively alleviates the sense of fragmentation caused by dividing images into patches, the proposed weighted L2 loss improves the semantic consistency at the edges of the missing regions and makes the generated wood texture more detailed and coherent, and the adopted asymmetric encoder-decoder effectively reduces the computational effort of the model and speeds up training.

1. Introduction

During growth and processing, wood develops a variety of defects, such as warts, dead knots, insect eyes, and sawing injuries, which not only affect the physical properties of wood products, such as their mechanical properties, but also affect their aesthetics and reduce the grade of the products. Defects in wood are an important criterion for assessing wood's quality and commercial value. With the rapid development of sensor and computer technologies, relevant non-destructive testing technologies, such as laser technology, infrared technology, and machine vision technology, have been gradually applied to the field of wood inspection. Existing studies mostly focus on the detection and localization of defects [1,2], but in the splicing of solid wood boards after the defects have been located, the visual consistency of the wood texture after splicing is rarely considered. We therefore introduced image inpainting techniques to reconstruct the texture of the defective part of the wood; a matching algorithm can then find wood with a texture similar to the defective area for splicing, improving the use value of the wood.
The difficulty of image inpainting lies in maintaining the consistency of the image's texture and structure while generating detailed, realistic textures in the missing parts. The main image-generation algorithms are currently the VAE (Variational Auto-Encoder) [3], the DDPM (Denoising Diffusion Probabilistic Model) [4], and the GAN (Generative Adversarial Network) [5]. In recent years, reconstructing the texture of the missing part of an image with a combination of GAN and CNN [6] has been the main research direction of image inpainting, but the existing inpainting networks still produce obvious marginal effects and visually incoherent textures when reconstructing defective wood texture, resulting in low restoration accuracy and long training times. The irregularity and complexity of wood texture further increase the difficulty of inpainting, making it hard for existing models to achieve the desired effect. In recent years, the Transformer [7] has achieved great success with its powerful feature extraction ability, solving the problem of the limited receptive field of CNN models. However, Transformer models applied to the vision field still have shortcomings: on the one hand, the Transformer uses global attention, which makes the computation very large; on the other hand, the Vision Transformer (ViT) [8] divides the image into multiple non-overlapping patches when used for image inpainting and therefore cannot perform pixel-level modeling; in particular, when the ViT model is used as a decoder, the restored result shows a clear sense of fragmentation and obvious marginal effects. To improve on existing models, we propose three strategies.
  • We designed an asymmetric encoder-decoder structure; the encoder takes only the unmasked visual patches as input, while the input to the decoder is the output of the encoder with the masked patches—this is achieved to reduce encoder calculations without losing wood image texture information.
  • We designed the RSwin module with the multi-scale feature as a decoder and flexibly adjusted the size of the divided image patches to realize the inpainting of the wood texture image from coarse to fine and to solve the sense of fragmentation between the image patches.
  • We proposed a weighted L2 loss, which assigns different weights to the unmasked region according to the distance from the defective region; the closer the distance to the missing region, the larger the weight assigned, which achieves the full utilization of the effective texture features at the edge of the wood defects, and solves problems such as semantic incoherence at the connection between the masked region and the unmasked region.

2. Related Work

Many advances have been made in studies related to the detection of defects in solid wood boards [9,10,11,12]. However, little research has been carried out on maintaining the texture consistency of wood board splicing [13]. We proposed introducing image inpainting into the field of solid wood board splicing, transforming the search for boards with similar textures into a wood defect texture reconstruction problem.
Research on image inpainting broadly falls into two categories: traditional methods and deep learning methods. Traditional methods are either diffusion-based [14,15] or patch-based [16,17,18]; they propagate adjacent undamaged information to defective locations or fill in missing regions by copying information from similar background patches in internal or external search spaces. Although traditional methods can patch low-level features, such as texture, color, and shape, they cannot generate semantically meaningful content. In contrast, deep learning methods train deep convolutional neural networks to infer the content of the missing regions, generating high-level semantic features. Pathak et al. [6] designed the first neural network model for image inpainting, which used an encoder-decoder structure where the encoder maps an image with missing regions to a low-dimensional feature space and the decoder uses that space to construct the output image. However, due to the information bottleneck in the channel-wise fully connected layer, the recovered region of the output image usually contains visual artifacts and appears blurry. Iizuka et al. [19] replaced the channel-wise fully connected layer with dilated convolution layers and proposed global and local discriminators to improve the quality of inpainting. However, the relatively large dilation factor makes the resulting filters very sparse and significantly increases training time. Yang et al. [20] used the output of a contextual encoder as the network input and improved the inpainting quality by minimizing feature differences in the image background, but this approach requires iteratively solving a multi-scale optimization problem, which increases the computation to some extent. Yu et al. [21,22] and Chen et al. [13] adopted a two-step approach to image inpainting, first restoring the rough outline of the image with a coarse generator and then reconstructing finer texture with a fine generator combined with attention. Yu et al. proposed gated convolution in Deepfill v2 [22] to replace the vanilla convolution in Deepfill v1 [21], addressing the problem that vanilla convolution treats all pixels as valid; gated convolution generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location in all layers, which improves the inpainting results. Chen et al. [13] proposed normalizing the foreground and background separately to improve the texture generation capability in the missing region. However, these two methods only use the attention layer in the fine generator, do not capture global modeling relationships well, and are prone to texture incoherence. Nazeri et al. [23] first predicted the edges of the missing regions and then repaired the image based on the predicted edge information. Guo et al. [24] divided the generator into a dual-branch structure with an edge generator and an image generator and finally performed image fusion. Both of these methods use edge texture features to assist the inpainting, but edge generation cannot reconstruct high-density texture well, so texture generation is poor when a large area of texture is missing from the image, which affects the final inpainting result.
Although CNN+GAN-based image inpainting methods have made many advances and proposed targeted solutions, the inpainting task relies heavily on the known feature information of the image, and a CNN can only rely on dilated convolution for long-distance relationship modeling, which makes true global relationship modeling difficult to achieve.
The Transformer was originally used in the field of NLP (natural language processing), a branch of artificial intelligence concerned with the techniques and methods by which computers interact with and process human natural language; its goal is to enable computers to understand, interpret, generate, and manipulate natural language text or speech data. The Transformer has found widespread application in NLP tasks such as machine translation, speech recognition, and text generation, using global relational modeling to extract features, and its breakthrough in NLP has attracted the interest of researchers in computer vision (CV) because of its powerful global modeling capabilities. Because visual data are spatially and temporally continuous, ordinary Transformer models cannot directly meet the requirements of computer vision tasks. The ViT strategy of dividing images into patches of the same size and using the patches as the basic unit for model training makes it possible to use Transformer models for computer vision tasks. Many scholars have also applied Transformer models to image inpainting. Wan et al. [25] used ViT for coarse inpainting first and then a CNN for fine reconstruction. Yu et al. [26] used a multi-generator network in which one generator was a traditional CNN encoder and the other downscaled the image and split it into patches for a ViT model; finally, the outputs of both generators were decoded by a CNN model. Dong et al. [27] used a Transformer to repair the structure and incrementally added the structure to a subsequent CNN for texture generation. Li et al. [28] used Transformer models for both the encoder and the decoder, updating the mask region to improve the inpainting effect, but still needed convolutional layers to process the image at the start and end, failing to exploit the full capability of the Transformer. Although these studies have attempted to use the Transformer for image inpainting, they still take CNN+GAN as the basic idea and add the Transformer to the model as an auxiliary or branch module. The Transformer uses self-attention for feature extraction, which is computationally demanding and requires a large amount of memory during training, making it challenging to train the model effectively. Many scholars have attempted to optimize the Transformer structure by using local attention instead of global attention; for example, Liu et al. [29] used shifted-window attention to narrow the scope of the attention computation, and He et al. [30] used a random mask operation that selects only a small amount of data as input. Using local attention reduces model computation but also reduces the performance of the Transformer, which derives much of its superior performance from its global modeling capability. The redundancy in image data is very large compared with natural language data, so removing redundancy is the key to using the Transformer in computer vision.
Unlike other image inpainting tasks, wood defect texture inpainting requires the following considerations: the wood texture is irregular, and the texture around a defect is deformed by its presence, so the masked area for a defect must be larger than in other inpainting tasks to ensure the consistency of the completed texture. In addition, the texture reconstruction at the location of the wood defects prepares for later splicing, which is carried out in long strips, so the mask should be rectangular. Considering these two characteristics, and the fact that the masked defective region can only provide location information, we designed an asymmetric encoder-decoder structure inspired by the MAE. The encoder takes only the unmasked visible patches as input, while the input to the decoder is the output of the encoder together with the masked patches; the different input dimensions of the encoder and decoder form an asymmetric structure, and inputting only the unmasked patches reduces the encoder computation. ViT is trained with image patches as the basic unit and cannot model at the pixel level, so we designed the multi-scale RSwin module as the decoder to flexibly adjust the size of the divided image patches and generate wood texture features from coarse to fine. In addition, to address the incoherence between the texture generated in the inpainting task and the background region, we proposed a weighted L2 loss that assigns different weights to the unmasked region according to its distance from the defective region; the closer a pixel is to the missing region, the larger the weight assigned, which enhances the coherence between the restored missing regions and the features of the known regions.

3. Methods

3.1. Vision Transformer

ViT applies the Transformer to computer vision tasks and has attracted the attention of many researchers in the field due to its simplicity, effectiveness, and scalability. In NLP tasks such as machine translation and speech recognition, words or speech signals are processed as a sequence and input into the Transformer. Since images are two-dimensional, the image must be converted into a one-dimensional sequence before input. ViT divides the image into multiple equal-sized, non-overlapping patches, projects each patch to a fixed-length vector, adds a positional embedding to each vector, and then inputs the sequence into the Transformer for training. Each Transformer block consists of a multi-head self-attention layer, a multi-layer perceptron (MLP), and two LayerNorm layers. The architecture of wood texture restoration using the ViT model is shown in Figure 1.

3.2. MRS-Transformer

For the use of ViT as the base framework for image inpainting, we optimized three aspects: the masking strategy (MASK), the encoder input form (Encoder), and the decoder design (Decoder).

3.2.1. Mask

The image inpainting task requires masking the image to simulate defective areas in order to train the network model, and the complete image is reconstructed by learning features from the unmasked areas. In industrial scenarios where solid wood boards are spliced together, the defective regions of the boards are removed as regular rectangles; therefore, the MRS-Transformer uses a rectangular mask. In addition, the direction of wood texture growth can be distorted by the presence of defects, and all defective areas and the surrounding distorted grain must be masked to keep the visual texture consistent after inpainting, which makes the masked area much larger than in other image inpainting tasks. Therefore, this study adopted the asymmetric encoder-decoder structure of the MAE, where the encoder takes only the unmasked visible patches as input, while the input to the decoder is the output of the encoder together with the masked patches; the larger the masked area, the fewer patches are input to the encoder and the less computation the encoder requires. However, the less unmasked area is retained, the more difficult it is to reconstruct the texture. After several experiments, it was found that when the mask covers about 50% of the image, the masked area is maximized while the completion accuracy is maintained and the computational effort is minimized.
Because the size and location of wood texture defects vary, we proposed training the model with a fixed mask size to improve the repair accuracy. The procedure is as follows: locate the defect, determine the mask range, crop out a suitable wood texture image, expand or scale the defect image to a fixed size, input it into the model to obtain the completed texture image, and finally map the result back to the original position in the image by the inverse operation, as shown in Figure 2.
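A minimal sketch of this crop-mask-restore pipeline is given below. The crop margin, the square shape of the mask, and the use of OpenCV bilinear resizing are illustrative assumptions rather than the exact settings used in the paper; only the 256 × 256 input size and the roughly 50% masked area follow the text.

```python
# Sketch of the fixed-mask preprocessing: crop around the located defect, resize to the
# model input size, build a central rectangular mask, and map the restored crop back.
import cv2
import numpy as np

TARGET = 256          # model input size (from the paper)
MASK_FRAC = 0.5       # masked area is about 50% of the crop (from the paper)

def crop_and_mask(board_img, defect_box, margin=1.0):
    """Crop a square region centred on the defect and build a central rectangular mask.
    defect_box = (x, y, w, h) from a defect detector; margin is an assumed padding factor."""
    x, y, w, h = defect_box
    cx, cy = x + w // 2, y + h // 2
    half = int(max(w, h) * (1 + margin))           # crop comfortably around the defect
    H, W = board_img.shape[:2]
    x0, x1 = max(cx - half, 0), min(cx + half, W)
    y0, y1 = max(cy - half, 0), min(cy + half, H)
    crop = board_img[y0:y1, x0:x1]
    crop = cv2.resize(crop, (TARGET, TARGET), interpolation=cv2.INTER_LINEAR)

    mask = np.zeros((TARGET, TARGET), dtype=np.uint8)
    side = int(TARGET * np.sqrt(MASK_FRAC))        # central rectangle covering ~50% of the area
    s = (TARGET - side) // 2
    mask[s:s + side, s:s + side] = 1               # 1 = masked (defective) pixels
    return crop, mask, (x0, y0, x1, y1)

def paste_back(board_img, restored_crop, box):
    """Map the restored 256 x 256 texture back to the original location on the board."""
    x0, y0, x1, y1 = box
    resized = cv2.resize(restored_crop, (x1 - x0, y1 - y0), interpolation=cv2.INTER_LINEAR)
    out = board_img.copy()
    out[y0:y1, x0:x1] = resized
    return out
```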

3.2.2. Encoder

The encoder consists of several Transformer blocks, each consisting of a multi-head self-attention layer and a multi-layer perceptron (MLP) connected through LayerNorm layers and residual connections. Multi-head self-attention is the core of the Transformer; it is through this mechanism that global modeling of the image is achieved. The attention is formulated as:
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where Q, K, and V are the query, key, and value matrices, respectively, and $\sqrt{d_k}$ is a scaling factor, with $d_k$ the dimension of the key vectors.
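The sketch below illustrates the scaled dot-product attention of the equation above for a single head; the head splitting, projection weights, and dropout of the actual encoder blocks are omitted, and the token count of 128 simply reflects the roughly 50% of the 256 patches that remain unmasked.

```python
# Minimal single-head self-attention corresponding to the attention equation above.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, num_tokens, dim)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity between all token pairs
    attn = F.softmax(scores, dim=-1)                # attention weights
    return attn @ v                                 # weighted sum of the values

# Example: 128 visible (unmasked) patch tokens with embedding dimension 768
x = torch.randn(1, 128, 768)
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q = K = V = x
print(out.shape)                                    # torch.Size([1, 128, 768])
```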

3.2.3. Decoder

Although self-attention achieves global relationship modeling by calculating the similarity between all image patches, it also consumes an enormous amount of computation. Since the masked image patches cannot provide information for image inpainting, feeding them into the network only increases the computation and occupies extra memory. This study therefore designed an asymmetric encoder-decoder structure, shown in Figure 3, in which the encoder takes only the unmasked patches as input; the decoder takes the output of the encoder together with the masked patches as input, and the decoder output is the repaired image.
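A minimal sketch of this asymmetric input handling is given below: the encoder sees only the unmasked patch tokens, and the decoder receives the encoder output scattered back into the full sequence, with a shared learnable mask token in the masked positions (an MAE-style arrangement). The encoder and decoder here are placeholders, not the 12-layer ViT encoder and 8-layer RSwin decoder of the paper.

```python
# Sketch of the asymmetric encoder-decoder token flow (MAE-style gather/scatter).
import torch
import torch.nn as nn

class AsymmetricInpainter(nn.Module):
    def __init__(self, num_patches=256, dim=768):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # shared token for masked patches
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.Identity()   # placeholder for the 12-layer ViT encoder
        self.decoder = nn.Identity()   # placeholder for the 8-layer RSwin decoder

    def forward(self, tokens, mask):
        # tokens: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked patch
        B, N, D = tokens.shape
        tokens = tokens + self.pos
        visible = tokens[~mask].reshape(B, -1, D)        # keep only the unmasked patches
        encoded = self.encoder(visible)                  # encoder runs on roughly half the tokens

        full = self.mask_token.expand(B, N, D).clone()   # start from mask tokens everywhere
        full[~mask] = encoded.reshape(-1, D)             # scatter the encoded tokens back
        return self.decoder(full + self.pos)             # decoder sees all N positions
```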
Because the ViT decoder uses image patches as its input and output, patches that are too large reduce the inpainting accuracy, while patches that are too small increase the computation and prolong the repair time. To address the drawback that the ViT model cannot adaptively adjust the size of the divided patches, we designed a new decoder module, RSwin-Transformer, for wood texture image inpainting, as shown in Figure 4.
The module retains the Swin Transformer's shifted-window multi-head self-attention (W-MSA and SW-MSA), which reduces the computational complexity of the model while preserving its global modeling characteristics. The computational complexities of ordinary multi-head self-attention (MSA) and window multi-head self-attention (W-MSA) are:
$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C,$$
$$\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC,$$
where $h \times w$ is the number of patches into which the image is divided, M is the size of the attention window, and C is the number of channels. The equations show that the window attention mechanism makes the computational complexity of the model grow linearly, rather than quadratically, with the image size.
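To make the saving concrete, a short worked example follows; the window size is not reported in this section, so M = 8 is assumed purely for illustration, together with the 64 × 64 × 48 feature map produced by the Patch Diverging layer described next:

$$\Omega(\mathrm{MSA}) = 4\cdot(64\cdot 64)\cdot 48^{2} + 2\cdot(64\cdot 64)^{2}\cdot 48 \approx 1.65\times 10^{9},$$
$$\Omega(\text{W-MSA}) = 4\cdot(64\cdot 64)\cdot 48^{2} + 2\cdot 8^{2}\cdot(64\cdot 64)\cdot 48 \approx 6.3\times 10^{7},$$

i.e., roughly a 26-fold reduction, because the quadratic term $2(hw)^{2}C$ is replaced by the linear term $2M^{2}hwC$.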
The RSwin module improves on the Swin multi-scale design by placing a Patch Diverging layer in the decoder, the principle of which is shown in Figure 5, where H is the height of the feature map, W is its width, and C is its number of channels. As the network deepens, the spatial size of the output is enlarged by a factor of r while the number of channels is reduced by a factor of $r^{2}$. This allows the image to be represented at a larger scale and the basic patches in the window attention to be at a smaller size, effectively reducing the fragmented feeling between patches and enhancing the accuracy of image inpainting.
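The rearrangement described above (spatial size × r, channels ÷ r²) matches the shape behaviour of a pixel-shuffle operation; the sketch below reproduces the shape changes for r = 4 using torch.nn.PixelShuffle. Treating Patch Diverging as a pixel shuffle is an assumption made only to illustrate the shapes, not a claim about the paper's exact implementation.

```python
# Sketch of a Patch Diverging layer as a pixel-shuffle rearrangement:
# spatial size grows by r while the channel count shrinks by r^2.
import torch
import torch.nn as nn

class PatchDiverging(nn.Module):
    def __init__(self, r=4):
        super().__init__()
        self.shuffle = nn.PixelShuffle(r)   # (B, C, H, W) -> (B, C/r^2, rH, rW)

    def forward(self, x):
        return self.shuffle(x)

pd = PatchDiverging(r=4)
x = torch.randn(1, 768, 16, 16)             # decoder input: 16 x 16 tokens, 768 channels
y = pd(x)                                   # -> (1, 48, 64, 64), after the first RSwin group
z = pd(y)                                   # -> (1, 3, 256, 256), the output image size
print(y.shape, z.shape)
```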

3.2.4. Weighted L2 Loss

Most traditional image inpainting algorithms calculate the loss only over the defective part or over the whole image, which leads to semantic inconsistencies at the connection between the defective and non-defective regions. According to prior knowledge, the closer the valid pixels in the unmasked area are to the location of the defect, the more helpful they are for the repair. We designed a weighted L2 loss, which assigns different weights to the unmasked region according to its distance from the defective region. First, we set up a matrix of the same size as the original image and set the pixels in the masked region to zero; then, for each pixel in the unmasked region, we calculated the Euclidean distance to its nearest masked pixel to obtain a distance matrix. With $\alpha$ (set to 0.999) as the base and the distance matrix as the exponent, a weight matrix was obtained with values in the range $[0, 1]$. This weight matrix provides the weights of the corresponding pixels when calculating the L2 loss. The weight corresponding to the defective region of the image was one; the closer a pixel is to the defect, the closer its weight is to one, and the further away it is, the smaller the weight. The formula is as follows:
$$\mathrm{loss} = \frac{\sum_{i=0}^{n}\sum_{j=0}^{n}\left(I_{r}-I_{o}\right)_{ij}^{2}\,\alpha^{A_{ij}}}{\sum_{i=0}^{n}\sum_{j=0}^{n}\alpha^{A_{ij}}},$$
where $(I_{r}-I_{o})^{2}$ is the squared pixel-wise error between the restored image and the original image, $\alpha^{A}$ is the weight matrix, and the denominator averages the weighted losses.
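A sketch of this weighted L2 loss is shown below: the distance-based weight matrix α^A is built once per mask and then used to average the squared error. SciPy's Euclidean distance transform is used here to obtain the per-pixel distance to the nearest masked pixel; the original implementation may compute the distances differently.

```python
# Sketch of the weighted L2 loss with a distance-based weight matrix alpha**A.
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def make_weight_matrix(mask, alpha=0.999):
    """mask: (H, W) array with 1 = masked (defective) pixels, 0 = known pixels."""
    # Distance from every known pixel to its nearest masked pixel (0 inside the mask),
    # so the weight alpha**distance is 1 in the defect and decays with distance.
    dist = distance_transform_edt(mask == 0)
    return torch.from_numpy(alpha ** dist).float()

def weighted_l2_loss(restored, original, weights):
    """restored, original: (B, C, H, W) tensors; weights: (H, W) tensor with values in [0, 1]."""
    sq_err = (restored - original) ** 2
    return (sq_err * weights).sum() / (weights.sum() * restored.size(0) * restored.size(1))

# Example with a rectangular strip mask covering roughly half of a 256 x 256 image
mask = np.zeros((256, 256), dtype=np.uint8)
mask[64:192, :] = 1
w = make_weight_matrix(mask)
loss = weighted_l2_loss(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256), w)
```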
The overall MRS-Transformer is shown in Figure 6. Given a solid wood board image with defects, we cropped the image to a suitable size with the defect at the center according to the fixed masking strategy and resized it to the standard size expected by the model. The unmasked patches were then converted to vectors via a linear mapping layer and input into an encoder consisting of a twelve-layer ViT. The output of the encoder was fed into a decoder, consisting of an eight-layer RSwin module, together with the masked patches mapped to vectors. The reconstructed wood texture image was finally mapped back to the original position in the image.
In the experiment, the image size was set to 256 × 256 and the patch size to 16 × 16, and each patch was mapped to a dimension of 768. After passing through the encoder, the feature map input to the decoder had a size of 16 × 16 × 768. To achieve a more detailed inpainting result, a small scaling factor in the Patch Diverging layer of the RSwin is desirable; in the experiment, a scaling factor of 4 was used. After the first group of RSwin decoder layers, the Patch Diverging layer transformed the feature map to a size of 64 × 64 × 48, and after the second group, the Patch Diverging layer restored the feature map to the original image size of 256 × 256 × 3. The scaling factor of 4 was determined experimentally to be the optimal value: if the scaling factor were set to 2, four groups of RSwin modules would be required and the feature map of the final RSwin module would be 128 × 128 × 12, making the computational complexity of the RSwin too high to train the model on the available hardware.

3.3. Qualitative Evaluation of Texture Generation

Four image quality evaluation metrics, MSE (mean square error), PSNR (peak signal-to-noise ratio), SSIM (structural similarity), and LPIPS (learned perceptual image patch similarity), together with FLOPs (floating point operations) as a measure of computational complexity, were used as quantitative metrics.
MSE is the mean of the squared differences between the original image X and the generated image Y:
$$\mathrm{MSE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(X(i,j)-Y(i,j)\right)^{2},$$
where H and W are the dimensions of the image, and $X(i,j)$ and $Y(i,j)$ are the pixel values of the original and generated images at location $(i,j)$. For pixel values normalized to [0, 1], the MSE also lies in [0, 1], with smaller values indicating less image distortion.
PSNR (peak signal-to-noise ratio) is a full-reference image quality evaluation metric:
$$\mathrm{PSNR} = 10\log_{10}\frac{\left(2^{n}-1\right)^{2}}{\mathrm{MSE}},$$
where n is the color depth of each pixel in the image, taken here as 8 bits. PSNR is measured in dB, and the larger the value, the smaller the image distortion.
SSIM (structural similarity) is also a full-reference image quality evaluation metric, which measures image similarity in terms of luminance, contrast, and structure:
$$\mathrm{SSIM}(x,y) = l(x,y)\,c(x,y)\,s(x,y),$$
where $l(x,y)$, $c(x,y)$, and $s(x,y)$ are the luminance, contrast, and structure comparison functions, respectively.
LPIPS (learned perceptual image patch similarity) is a full-reference image quality evaluation metric that is better aligned with human perception; the lower the LPIPS value, the more similar the two images, and the higher the value, the greater the difference. LPIPS is expressed as:
$$\mathrm{LPIPS} = \frac{1}{n}\sum_{i=1}^{n}\omega_{i}\left\lVert \phi_{i}(x)-\phi_{i}(y)\right\rVert_{2},$$
where x and y are the two images to be compared, $\phi_{i}$ is a feature extractor in a deep convolutional neural network, n is the number of feature maps, $\omega_{i}$ is the weight of the corresponding feature map, and $\lVert\cdot\rVert_{2}$ denotes the L2 norm.
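The sketch below shows one way the four image-quality metrics could be computed for an original/restored pair using common libraries (scikit-image for MSE, PSNR, and SSIM, and the `lpips` package for LPIPS); the paper does not state which implementations were actually used.

```python
# Sketch of metric computation for one original/restored image pair.
import numpy as np
import torch
import lpips
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

def evaluate_pair(original, restored):
    """original, restored: (H, W, 3) float arrays with values in [0, 1]."""
    mse = mean_squared_error(original, restored)
    psnr = peak_signal_noise_ratio(original, restored, data_range=1.0)
    ssim = structural_similarity(original, restored, channel_axis=-1, data_range=1.0)

    # LPIPS expects (B, 3, H, W) tensors scaled to [-1, 1]
    loss_fn = lpips.LPIPS(net='alex')
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = loss_fn(to_t(original), to_t(restored)).item()
    return {"MSE": mse, "PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```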

4. Experiments

4.1. Experimental Preparation

4.1.1. Data Acquisition

The experiments used an OscarF810CIRF industrial camera to capture 3000 wood texture images, of which 500 were defective. The 2500 defect-free images were expanded to 10,000 by data augmentation methods such as rotation, brightness transformation, mirroring, and contrast transformation, and divided into a training set and a validation set at a ratio of 8:2. To improve the model training speed, the high-resolution images were uniformly resized to 256 × 256 pixels. Some samples are shown in Figure 7.
The size and defect location of the wood texture images with defects collected in the dataset varied. The defect locations were positioned at the center of the intercepted samples and processed uniformly to 256 × 256 pixels. The processed partial samples are shown in Figure 8.

4.1.2. Experimental Environment and Key Parameters

The image acquisition device used was the OscarF810CIRF industrial camera, which had a resolution of 2048 × 2048. To capture the images, an LED light array with a uniform light panel was utilized to illuminate the surface of the board. This ensured strict control over the brightness of the light source and guaranteed uniform lighting, minimizing the effects of shadows and overexposure. The captured images were uniformly processed to a size of 256 × 256.
The experimental setup was as follows: the operating system was Ubuntu 20.04, the deep learning framework was PyTorch (Python 3.9), and the GPU was a Tesla V100 with 32 GB of memory.
The main training parameters were as follows: the batch size was set to 32, and the Adam optimizer was employed with a learning rate of $1.8 \times 10^{-4}$. The model was trained for a total of 1200 epochs until it converged to its optimal state.
In the experiments, the image size was 256 × 256 pixels, and the images were divided into non-overlapping patches of 16 × 16 pixels at the encoding stage, resulting in N = 256 patches. The dimension of each patch embedding was set to 768, which is the total number of pixel values in the three channels of each patch. The encoder was a ViT model with a depth of twelve layers. The decoder consisted of two groups of RSwin modules, each four layers deep. After each group of RSwin layers, a Patch Diverging layer reduced the size of the divided patches by a factor of 4, enlarged the spatial size of the feature map by a factor of 4, and reduced the number of channels by a factor of 16.
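For reference, a minimal sketch of the reported training configuration (Adam, learning rate 1.8 × 10⁻⁴, batch size 32, up to 1200 epochs) is given below. The one-layer convolutional stand-in model, the dummy data, and the plain MSE criterion are placeholders; only the hyperparameters follow the paper.

```python
# Sketch of the training setup with the paper's hyperparameters and placeholder components.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)           # placeholder for the MRS-Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1.8e-4)  # Adam, lr = 1.8e-4 (from the paper)
criterion = nn.MSELoss()                                     # the paper uses the weighted L2 loss

# Dummy data standing in for (masked image, ground-truth image) pairs
data = TensorDataset(torch.rand(64, 3, 256, 256), torch.rand(64, 3, 256, 256))
loader = DataLoader(data, batch_size=32, shuffle=True)       # batch size 32 (from the paper)

for epoch in range(1200):                                    # 1200 epochs reported in the paper
    for masked, target in loader:
        loss = criterion(model(masked), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```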

4.2. Experiments on Texture Generation of Defective Solid Wood Boards

In order to evaluate the model's effectiveness in repairing the texture of wood samples with defects, we conducted texture generation experiments on the defective regions. First, the location of the defect was identified on the original image of the solid wood board. Then, the cropping range was determined based on the size of the defect and the deformation of the surrounding texture. Finally, the repaired texture was mapped back to the original position in the image. Figure 9 shows that the MRS-Transformer generates a natural and consistent wood texture that meets the requirement of visual texture consistency for solid wood veneer splicing. Afterward, a texture-matching algorithm can be used to find a similar, suitable solid wood board from a database for splicing.

4.3. Ablation Experiments

Ablation experiments were conducted to demonstrate the effectiveness of the proposed improvements to the model.
Group A experiments were trained directly on the dataset using the ViT model as the encoder and decoder, with L2 loss as the loss function, twelve layers for the encoder, and eight layers for the decoder.
Group B experiments introduced an asymmetric encoder-decoder structure based on Group A, where the encoder was fed with only unmasked image patches and the decoder structure remained unchanged, using L2 loss as the loss function.
Group C replaced the decoder with the proposed RSwin module based on Group B, with twelve layers in the encoder and eight layers in the decoder, using L2 loss as the loss function.
Group D, the final proposed algorithm, utilized weighted L2 loss as the loss function based on Group C, with the same encoder and decoder as in Group C.
The results of the experiments are presented in Table 1, which employed four image evaluation metrics and model computation volume to measure the model’s effectiveness from various perspectives. It can be observed that, after the improvements to the input method of the encoder in Group B, the three metrics of MSE, PSNR, and SSIM did not change significantly, and although the LPIPS metric was worse than in Group A, the computational effort of the model decreased by 28.3%. After the redesign of the RSwin module as a decoder in Group C, the other three metrics did not change significantly, and the LPIPS metric decreased by 20.6%, indicating an enhancement of image details from a human eye visual perspective, and the computational effort decreased by 26.3%. Group D, which utilized weighted L2 loss based on Group C, achieved a significant improvement in completing the edges of missing regions of wood, as indicated by the 51.7% and 34.2% decreases in MSE and LPIPS compared with Group C, respectively. Furthermore, the PSNR and SSIM metrics were 12.2% and 7.5% higher than Group C, respectively.
In addition to objective metrics, visual images at each stage of improvement were chosen as subjective references to make the experimental results more intuitive. As shown in Figure 10, through the comparison of the completed images after four experiments, it can be clearly observed that Group D (MRS-Transformer) has the closest texture structure and semantic information to the original image. Compared with Group B using an asymmetric encoder-decoder structure, the patchy segmentation feeling of Group C’s completed results was significantly reduced after introducing the RSwin structure, and the completed texture was more coherent and detailed. Group D introduced weighted L2 loss, and the patchy segmentation feeling was almost eliminated. The edge connections also became coherent semantically, and the texture connections became more natural.

4.4. Comparative Experiments

To validate the effectiveness of the proposed MRS-Transformer, we compared it with two state-of-the-art image inpainting algorithms, namely Deepfill v2 and TG-Net, on the solid wood boards dataset. Both these algorithms are based on the GAN model and use a two-stage inpainting approach, where the coarse generative network completes the missing region contours and the fine generative network produces refined results.
The main distinction lies in the network structures used in the coarse and refinement stages of Deepfill v2 and TG-Net.
Deepfill v2 employs a 16-layer hourglass network with gated convolutions for both stages, consisting of four encoder layers, eight bottleneck layers, and four decoder layers. Each gated convolution layer performs two convolution operations, one on the image features and the other on the mask features.
TG-Net utilizes a simpler architecture. The coarse generator employs a 16-layer CNN network, with eight upsampling layers and eight downsampling layers. In the refinement stage, TG-Net employs a 32-layer U-Net structure, where each group consists of two layers. The first eight groups perform downsampling, while the latter eight groups handle upsampling. Additionally, the refinement network incorporates an attention mechanism with ten layers.
In terms of computational complexity, the models’ FLOPs (Floating Point Operations) are used as a measure. Convolution operations are the primary source of computational workload for both Deepfill v2 and TG-Net. The choice of convolutional layer parameters significantly affects the computational complexity of the models. The formula for calculating the computational complexity of a convolutional layer is as follows:
$$\Omega(\mathrm{Conv}) = \left(2K^{2}C_{\mathrm{in}}\right)C_{\mathrm{out}}HW,$$
where H is the height of the input feature map; W is the width of the input feature map;  C in  is the number of input channels;  C out  is the number of output channels; and K is the size of the convolutional kernel.
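A small helper implementing this formula is sketched below; the example layer dimensions are illustrative only and are not taken from the Deepfill v2 or TG-Net architectures.

```python
# FLOPs of one convolutional layer: (2 * K^2 * C_in) * C_out * H * W.
def conv_flops(k, c_in, c_out, h, w):
    return 2 * k * k * c_in * c_out * h * w

# Example: a 3x3 convolution with 64 -> 64 channels on a 256 x 256 feature map
print(conv_flops(3, 64, 64, 256, 256) / 1e9, "GFLOPs")   # ~4.83 GFLOPs
```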
The computation in the Transformer mainly comes from the matrix multiplication involved in calculating attention scores. Due to the two-dimensional nature of images, the number of patches increases quadratically with the image size. As a result, the computational complexity of ViT grows quadratically with the image size. In the MRS model, the encoder discards the input of masked patches, reducing the computational complexity by almost half. The decoder utilizes a windowed attention mechanism, which calculates attention scores within fixed-size windows. This leads to a linear increase in computational complexity, significantly reducing the overall computation burden of the model. As a result, the MRS algorithm achieves lower computational complexity compared to the other two algorithms while delivering superior restoration performance.
The experimental results, shown in Table 2, indicate that the MRS-Transformer achieved a superior performance with MSE, PSNR, LPIPS, and SSIM values of 0.0003, 40.12, 0.154, and 0.9173, respectively. Compared to Deepfill v2 and TG-Net, MRS-Transformer reduced the MSE metric by 47.0% and 66.9%, and the LPIPS metric by 60.6% and 42.5%, respectively. Meanwhile, our algorithm achieved higher PSNR and SSIM metrics, with improvements of 16.1% and 26.2%, and 7.3% and 5.8%, respectively. Moreover, MRS-Transformer reduced the GFLOPs metric by 70.6% and 53.5%, respectively, and completed a single wood texture image in only 0.05s, which is nearly five times faster than the other two algorithms.
To further evaluate and compare the performance of the proposed algorithm with other methods, we conducted a visual analysis of the completed images. As shown in Figure 11, we displayed the original image in the first column, followed by the masked image with 50% of the central region masked in the second column, and the completed images generated by different methods in the remaining columns. To ensure the credibility of the experimental results, we did not apply any post-processing to the completed images. The comparison results indicate that the proposed method can generate more realistic and coherent textures for the missing regions of solid wood boards, achieving superior semantic consistency at the edges and joints of the missing regions. In contrast, the other two algorithms failed to remove the artifacts at the edges and maintain semantic consistency at the joints, resulting in incomplete and incoherent wood textures.
The experimental results show that the MRS-Transformer algorithm surpasses the DeepFill v2 and TG-Net algorithms in both objective metrics and visual results. Because the solid wood board texture completion task involves large masked areas, it poses challenges for image generation; this study addresses them by optimizing the completion process, resulting in superior performance compared to the other two algorithms. Although the gated convolution proposed by DeepFill v2 allows the network to perform dynamic feature selection and partially mitigates the limitations of vanilla convolution, the Transformer model still performs better at learning effective features and modeling long-range dependencies, leading to superior completion results. Similarly, the TG-Net algorithm normalizes the foreground and background separately to enhance the weight of mask-region features, but this approach can produce incoherent texture between the masked and background regions as well as color differences. In contrast, this study proposed an L2 loss weighted by the distance from the missing region, with weights gradually decreasing from the mask region to the background region, resulting in more coherent generated texture features.

5. Conclusions

To address the problem of defect removal and texture splicing in solid wood boards, this paper introduced an image texture inpainting method and designed the MRS-Transformer for wood texture reconstruction. We employed an asymmetric encoder-decoder architecture in which only the unmasked patches are used as encoder input, significantly reducing the computational effort. The decoder uses our proposed RSwin module, which employs multi-scale features to achieve coarse-to-fine texture completion and improve the inpainting accuracy. To address incoherent texture, semantic inconsistency, and artifacts at edge connections, we proposed a weighted L2 loss function, which proved highly effective. We compared our method with the existing methods Deepfill v2 and TG-Net and demonstrated the feasibility and superiority of our approach. The experimental results show that the proposed model provides a feasible solution for visually consistent splicing and completion of solid wood boards with defects.
Although the MRS-Transformer has achieved good results in the defect repair of solid wood boards, it still has limitations. On the one hand, fixing the defects in the center of the image is suitable for most wood images with defects. However, for edge defects in the wood, important surrounding information necessary for repair is missing, which affects the quality of the repair. On the other hand, due to the diverse types of wood defects, a fixed masking strategy is not suitable for all types of wood defects, especially for elongated defects, where the current masking strategy would result in the loss of significant texture information that aids in the repair process.
In the next stage, we will consider the issue of repairing edge defects in wood and address the contradiction between masking strategies and different types of defects. We will propose better solutions to handle the specific scenarios encountered in wood defect repair.

Author Contributions

Conceptualization, Y.Z. and X.L.; methodology, X.L.; software, H.Y.; validation, X.L., Y.Z. and H.L.; formal analysis, H.L.; investigation, X.L.; resources, X.L.; data curation, H.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L.; visualization, X.L.; supervision, X.L.; project administration, X.L.; funding acquisition, Y.Z. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Research Startup Fund of Changzhou University (Grant No. ZMF21020362). We would like to express our sincere gratitude for this support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data prepared and analyzed in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MSE: Mean Square Error
LPIPS: Learned Perceptual Image Patch Similarity
SSIM: Structural Similarity
PSNR: Peak Signal to Noise Ratio
FLOPs: Floating Point Operations
CNN: Convolutional Neural Network
ViT: Vision Transformer
MAE: Mask AutoEncoder
NLP: Natural Language Processing
L2 Loss: Mean Square Error
MLP: Multi-Layer Perceptron

References

  1. Wang, B.; Yang, C.; Ding, Y.; Qin, G. Detection of Wood Surface Defects Based on Improved YOLOv3 Algorithm. BioResources 2021, 16, 6766–6780. [Google Scholar] [CrossRef]
  2. Fang, Y.; Guo, X.; Chen, K.; Zhou, Z.; Ye, Q. Accurate and Automated Detection of Surface Knots on Sawn Timbers Using YOLO-V5 Model. BioResources 2021, 16, 5390–5406. [Google Scholar] [CrossRef]
  3. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  4. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 24–28 July 2022; pp. 1–10. [Google Scholar]
  5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  6. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544. [Google Scholar]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  9. Ke, Z.-N.; Zhao, Q.-J.; Huang, C.-H.; Ai, P.; Yi, J.-G. Detection of wood surface defects based on particle swarm-genetic hybrid algorithm. In Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Nanning, China, 3–5 October 2016; pp. 375–379. [Google Scholar]
  10. Yang, Y.; Wang, H.; Jiang, D.; Hu, Z. Surface detection of solid wood defects based on SSD improved with ResNet. Forests 2021, 12, 1419. [Google Scholar] [CrossRef]
  11. He, T.; Liu, Y.; Yu, Y.; Zhao, Q.; Hu, Z. Application of deep convolutional neural network on feature extraction and detection of wood defects. Measurement 2020, 152, 107357. [Google Scholar] [CrossRef]
  12. Lopes, D.; Bobadilha, G.; Grebner, K.M. A Fast and Robust Artificial Intelligence Technique for Wood Knot Detection. Bioresources 2020, 15, 9351–9361. [Google Scholar] [CrossRef]
  13. Chen, J.; Ge, Y.; Wang, Q.; Zhang, Y. TG-Net: Reconstruct visual wood texture with semantic attention. Comput. Graph. 2022, 102, 546–553. [Google Scholar] [CrossRef]
  14. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  15. Ballester, C.; Bertalmio, M.; Caselles, V.; Sapiro, G.; Verdera, J. Filling-in by joint interpolation of vector fields and gray levels. IEEE Trans. Image Process. 2001, 10, 1200–1211. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Ding, D.; Ram, S.; Rodríguez, J.J. Image inpainting using nonlocal texture matching and nonlinear filtering. IEEE Trans. Image Process. 2018, 28, 1705–1719. [Google Scholar] [CrossRef] [PubMed]
  17. Lee, J.H.; Choi, I.; Kim, M.H. Laplacian patch-based image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2727–2735. [Google Scholar]
  18. Le Meur, O.; Gautier, J.; Guillemot, C. Examplar-based inpainting based on local geometry. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 3401–3404. [Google Scholar]
  19. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  20. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6721–6729. [Google Scholar]
  21. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  22. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4471–4480. [Google Scholar]
  23. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  24. Guo, X.; Yang, H.; Huang, D. Image inpainting via conditional texture and structure dual generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 14134–14143. [Google Scholar]
  25. Wan, Z.; Zhang, J.; Chen, D.; Liao, J. High-fidelity pluralistic image completion with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4692–4701. [Google Scholar]
  26. Yu, Y.; Zhan, F.; Wu, R.; Pan, J.; Cui, K.; Lu, S.; Ma, F.; Xie, X.; Miao, C. Diverse image inpainting with bidirectional and autoregressive transformers. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, China, 20–24 October 2021; pp. 69–78. [Google Scholar]
  27. Dong, Q.; Cao, C.; Fu, Y. Incremental transformer structure enhanced image inpainting with masking positional encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11358–11368. [Google Scholar]
  28. Li, W.; Lin, Z.; Zhou, K.; Qi, L.; Wang, Y.; Jia, J. MAT: Mask-Aware Transformer for Large Hole Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10758–10768. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. He, K.; Sun, J. Statistics of patch offsets for image completion. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 16–29. [Google Scholar]
Figure 1. The architecture of wood texture restoration with Vision Transformer.
Figure 2. Steps for restoring texture in defective solid wood board images.
Figure 3. Architecture of the asymmetric encoder-decoder.
Figure 4. The architecture of Decoder (RSwin) and principles of W-MSA and SW-MSA.
Figure 5. The principle of Patch Diverging.
Figure 6. The architecture of MRS-Transformer (texture splicing method to remove defects in the solid wood board).
Figure 7. A sample of data augmentation.
Figure 8. Some samples with texture defects.
Figure 9. Effect of defective wood reconstruction.
Figure 10. Comparison of results of ablation experiments.
Figure 11. Comparison of restoration results with DeepFill v2 and TG-Net for MRS-Transformer.
Table 1. Quantitative results of ablation experiments.

Models    MSE−      PSNR+    LPIPS−    SSIM+     GFLOPs−
A         0.0007    35.24    0.220     0.8448    38.41
B         0.0007    35.20    0.297     0.8486    27.41
C         0.0007    33.71    0.234     0.8499    20.18
D         0.0003    40.12    0.154     0.9173    20.18

− Lower is better. + Higher is better.
Table 2. Quantitative results of different methods.

Models        MSE−      PSNR+    LPIPS−    SSIM+     GFLOPs−    Time−
Deepfill v2   0.0006    33.65    0.391     0.8504    68.86      0.24 s
TG-Net        0.0010    29.60    0.268     0.8639    43.48      0.27 s
MRS           0.0003    40.12    0.154     0.9173    20.18      0.05 s

− Lower is better. + Higher is better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
