This section details our proposed approach. First, we discuss the proposed two-stage loss function. Second, we propose a global and local Markovian discriminator, named GL-PatchGAN. Last, we present an improved coarse-to-fine generative adversarial network architecture for image inpainting.
3.1. Proposed Two-Stage Loss Function
Selecting a proper intermediate state in coarse-to-fine networks is very important and can largely improve the quality of the restored images. Inspired by References [17,22], a loss function based on the Sobel operator is proposed for the two stages of the network in this paper.
Given the network prediction result and the ground truth image, our two-stage Sobel loss is defined as:
where G is a Gaussian filtering operation for noise removal. In order to obtain image information at different scales in the two stages, different Gaussian kernels are used in our network. The horizontal and vertical Sobel convolution kernels have a strong ability to capture image gradients, so they are used in our loss function to extract the image structure in the coarse network and to obtain the image contour in the refinement network.
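The exact formula is not reproduced here, but the idea can be sketched as follows: smooth both images with a Gaussian kernel, apply the horizontal and vertical Sobel kernels, and penalize the distance between the resulting gradient maps. The L1 distance and the function names below are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """Naive 'valid'-mode 2-D cross-correlation for small kernels."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel of the given size and std. deviation."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def sobel_loss(pred, gt, size, sigma):
    """L1 distance between Sobel gradients of Gaussian-smoothed images
    (an illustrative form of the two-stage Sobel loss)."""
    g = gaussian_kernel(size, sigma)
    pred_s, gt_s = conv2d(pred, g), conv2d(gt, g)
    loss = 0.0
    for k in (SOBEL_X, SOBEL_Y):
        loss += np.abs(conv2d(pred_s, k) - conv2d(gt_s, k)).mean()
    return loss
```

Calling `sobel_loss` with a large kernel and strong smoothing emphasizes coarse structure; with a small kernel, finer contours dominate the gradient maps.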
In order to better recover the image structure and texture components, other edge extraction operators, such as the Laplace, Roberts, and Prewitt operators, were also tested in the loss function, and comparison experiments were conducted on the CelebA-HQ [24] dataset. The network setup is identical except for the operator used in the loss function.
As shown in Table 1, the loss function based on the Sobel operator performs better than the others for the following reasons. Compared to the Sobel operator, the Roberts operator has a smaller kernel size, and hence a smaller receptive field. The Laplace operator uses only a single convolution kernel and does not extract the horizontal and vertical gradients of images separately. Besides, the Prewitt operator tends to smooth images when extracting image gradients. Therefore, we apply the Sobel operator in our loss function for better performance.
In this paper, the proposed loss function is utilized at both stages, with Gaussian convolution kernels of different sizes and standard deviations. In the coarse network, a Gaussian kernel with strong filtering ability is used, while in the refinement network a Gaussian kernel with relatively slight filtering ability is utilized. Thus, the network tends to focus on the image structure in the coarse network and on the details of the restored image in the refinement network.
Gaussian convolution kernels with different parameters lead to different results. Suppose s and σ are the size and the standard deviation of the Gaussian convolution kernel, respectively. The larger s or σ is, the stronger the filtering ability will be; thus, more detailed information tends to be eliminated, but stronger filtering also blurs the images. To find the best parameters, we compared the results of convolution kernels with different sizes at the different stages of the network. The experiments were performed on 2000 test images taken from the CelebA-HQ [24] dataset with 10–20% irregular masks. The visual comparison results are shown in Figure 1, while the quantitative evaluation results are shown in Table 2.
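As a minimal illustration of how s and σ control filtering strength, the sketch below builds a normalized Gaussian kernel and measures the weight it assigns to the center pixel; a lower center weight means the output depends less on the original pixel, i.e., stronger smoothing. The helper names are ours.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel of the given size and std. deviation."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def center_weight(size, sigma):
    """Weight the kernel assigns to the center pixel.
    Lower value => stronger filtering (more averaging of neighbors)."""
    return gaussian_kernel(size, sigma)[size // 2, size // 2]
```

With these helpers, `center_weight(5, 1)` is smaller than `center_weight(3, 1)`, matching the choice of a stronger filter (s = 5) at stage 1 and a lighter one (s = 3) at stage 2.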
The first row of Figure 1 shows the original and input images, while the second and third rows show the inpainting results at stage 1 and stage 2 with different parameters, respectively. We can see from Figure 1 and Table 2 that the best inpainting result is achieved using the parameters s = 5, σ = 1 at stage 1, and s = 3, σ = 1 at stage 2. Therefore, these parameters are selected in the subsequent experiments for better inpainting effects.
To verify the effectiveness of our proposed loss function, we performed experiments with and without it at the different stages; the results are shown in Figure 2, where Figure 2a–f are the ground truth images, the input masked images, the results without and with our loss function at stage 1, and the results without and with our loss function at stage 2, respectively. We can see from Figure 2 that the network with our proposed loss function achieves better restoration of structural information, as shown in Figure 2c,d, because our loss function helps the network focus on the image contour at stage 1. In addition, the network with our proposed loss function better recovers the image details at stage 2, as shown in Figure 2e,f, because the selected Gaussian kernels in the loss function help remove slight noise and preserve good textures.
Given an input image with holes, an initial binary mask M (0 for holes), the network prediction, and the ground truth image, our pixel reconstruction loss at the two stages of the network is defined as:
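The exact reconstruction loss is not reproduced here; a common form is an L1 penalty computed separately over the hole region (where M = 0) and the valid region, sketched below with illustrative function names and weights.

```python
import numpy as np

def reconstruction_loss(pred, gt, mask, hole_weight=1.0, valid_weight=1.0):
    """L1 reconstruction loss split over hole (mask == 0) and valid
    (mask == 1) regions. The relative weights are placeholders, not the
    paper's values."""
    hole = 1.0 - mask
    l1 = np.abs(pred - gt)
    hole_term = (hole * l1).sum() / max(hole.sum(), 1.0)
    valid_term = (mask * l1).sum() / max(mask.sum(), 1.0)
    return hole_weight * hole_term + valid_weight * valid_term
```

Normalizing each term by its region's pixel count keeps the loss scale stable as the mask ratio varies (e.g., across 10–20% irregular masks).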
3.2. Proposed GL-PatchGAN
Since adversarial loss is widely used to improve the visual quality of restored images, PatchGAN [25] is also used in this paper, as both a global discriminator and a local discriminator. We denote its adversarial loss as:
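The exact adversarial formulation is not reproduced here; one common choice for PatchGAN-style discriminators, used by the gated-convolution inpainting network of [16], is the hinge loss averaged over the patch score map. The sketch below uses illustrative function names and assumes that formulation.

```python
import numpy as np

def d_hinge_loss(real_scores, fake_scores):
    """Hinge discriminator loss, averaged over the PatchGAN score map so
    that every local patch contributes to the objective."""
    return (np.maximum(0.0, 1.0 - real_scores).mean()
            + np.maximum(0.0, 1.0 + fake_scores).mean())

def g_adv_loss(fake_global_scores, fake_local_scores):
    """Generator adversarial loss combining the global and local
    discriminators' score maps (the GL part of GL-PatchGAN)."""
    return -(fake_global_scores.mean() + fake_local_scores.mean())
```

Because each entry of the score map corresponds to an image patch, averaging enforces realism locally everywhere, while the two discriminators cover local detail and global appearance.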
The adversarial loss of our generator is defined as:
where the inputs to the discriminators are the generated image and the image obtained by replacing the undamaged region with the corresponding pixels of the original image, and E represents the mean of the requested item over the data. Unlike other two-stage networks, the adversarial loss is only applied to the final result in this paper; thus, the network parameters and training time can be reduced. Our final loss is defined as the weighted sum of the pixel reconstruction, two-stage Sobel, and adversarial losses.
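The compositing step and a generic weighted combination can be sketched as follows; the weight values are placeholders, since the paper's settings are not given here.

```python
import numpy as np

def composite(pred, gt, mask):
    """Replace the undamaged region (mask == 1) with the original pixels,
    keeping the prediction only inside the holes (mask == 0)."""
    return mask * gt + (1.0 - mask) * pred

def total_loss(l_rec, l_sobel, l_adv, w_rec=1.0, w_sobel=1.0, w_adv=0.1):
    # Weights are illustrative placeholders, not the paper's values.
    return w_rec * l_rec + w_sobel * l_sobel + w_adv * l_adv
```

Feeding the composited image (rather than the raw prediction) to the discriminators means they only judge the inpainted content against its real surroundings.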
3.3. Our Model Architecture
This paper improves the coarse-to-fine network proposed by Yu et al. [16] with the proposed two-stage loss and GL-PatchGAN loss. Our proposed framework is shown in Figure 4.
In our proposed framework, an encoder-decoder network with the proposed loss function is used in both the coarse and refinement networks. As mentioned above, this helps the coarse network restore the overall image structure and helps the refinement network restore the image details. Besides, adversarial loss is applied in the refinement network to guarantee realistic inpainting results. Using the proposed GL-PatchGAN, the discriminator attends to both the global appearance and the local details of the final result.
In our coarse-to-fine network, the coarse network first produces a preliminary result with a complete contour. As the skeleton of an image, a proper contour tends to lead to a plausible final result. In the refinement network, dilated convolution [26] and contextual attention [15] are used to capture the relationships between long-distance pixels. Besides, gated convolution [16] is used in both the coarse and refinement networks, which overcomes the unsuitability of vanilla convolution for image hole-filling. However, compared to some one-stage networks, a coarse-to-fine network tends to increase the number of network parameters and the training time.
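The key idea of gated convolution is to learn a soft mask jointly with the features, so each layer can down-weight invalid hole pixels that vanilla convolution would treat the same as valid ones. A minimal sketch of the gating step, assuming the two parallel convolutions have already produced a feature map and a gating map:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_output(feature_map, gating_map):
    """Gated convolution output: a feature activation modulated by a
    learned soft gate in [0, 1]. Where the gate is near 0 (e.g., over
    holes early in the network), the features are suppressed."""
    return np.tanh(feature_map) * sigmoid(gating_map)
```

In the full network, both maps are outputs of learned convolutions over the same input, so the gate adapts per pixel and per channel rather than using the fixed binary mask of the input.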
Different from existing two-stage networks, in which separate training is necessary at the different stages, our network can be trained end-to-end. Moreover, our network can adjust the intermediate inpainting state freely by changing the relevant parameters; thus, the complexity of training is reduced.