Article

A Reference-Guided Double Pipeline Face Image Completion Network

Hongrui Liu, Shuoshi Li, Hongquan Wang and Xinshan Zhu
1 School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
2 State Key Laboratory of Digital Publishing Technology, Beijing 100871, China
* Author to whom correspondence should be addressed.
Electronics 2020, 9(11), 1969; https://doi.org/10.3390/electronics9111969
Submission received: 25 October 2020 / Revised: 13 November 2020 / Accepted: 14 November 2020 / Published: 22 November 2020
(This article belongs to the Section Computer Science & Engineering)

Abstract

The existing face image completion approaches cannot rationally complete damaged face images whose identity information is completely lost because it is obscured by a center mask. Hence, in this paper, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is designed within the framework of the generative adversarial network (GAN); it completes the identity information of damaged images using reference images with the same identity. To reasonably integrate the identity information of the reference image into the completed image, the reference image is decoupled into identity features (e.g., the contours of the eyes, eyebrows and nose) and pose features (e.g., the orientation of the face and the positions of the facial features), and the resulting identity features are then fused with the pose features of the damaged image. Specifically, a lightweight landmark predictor is used to extract the pose features; an identity extraction module is designed to compress and globally extract the identity features of the reference image; and an identity transfer module is proposed to effectively fuse the identity and pose features by performing identity rendering on different receptive fields. Furthermore, quantitative and qualitative evaluations are conducted on the public dataset CelebA-HQ. Compared to state-of-the-art methods, the evaluation metrics peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) and L1 loss are improved by 2.22 dB, 0.033 and 0.79%, respectively. The results indicate that RG-DP-FICN generates completed images with a reasonable identity and achieves a superior completion effect compared to existing completion approaches.

1. Introduction

Face image completion is a research hotspot in computer vision, with extensive applications in the film industry and detection. Face image completion should effectively restore a reasonable identity to a damaged image, i.e., ensure that the completed image has the same identity as the original image. Traditional image completion approaches are mainly based on diffusion or patches. Diffusion-based approaches [1,2] iteratively propagate low-level features from known regions to unknown areas along the mask boundaries and can complete small holes with weak structure. Patch-based approaches [3,4,5,6,7,8] search for similar patches in the same image or dataset to fill the unknown areas. Both kinds of approach assume that the unknown areas share similar content with the known regions; they can therefore directly match, copy and realign known patches to complete the unknown areas, but they cannot predict unique content absent from the known regions. Thus, these methods are only appropriate for filling natural images with similar textures. Face images, however, have a stronger topological structure than natural images and hence cannot be completed by such traditional approaches.
Presently, approaches based on deep learning and the generative adversarial network (GAN) [9] have attracted considerable research interest in face image completion. In such approaches, a trainable completion network, built mainly on convolutional neural networks, acts as the generator and is trained adversarially against a discriminator to generate natural and realistic completed images. A context encoder trained with a generative adversarial loss was proposed by Pathak et al. [10], based on which Iizuka et al. [11] introduced local and global discriminant losses. Yu et al. [12] put forward a coarse-to-fine completion scheme that obtains a high-definition completion effect through long-range feature transfer using a semantic attention layer. Moreover, some scholars proposed partial convolution [13] and gated convolution [14] to enhance the network's robustness to the mask shape. Given the problem of a single completion result, Zheng et al. [15] mapped damaged images to a probability distribution utilizing a modified variational encoder and a GAN, so that multiple completion results can be constructed for one damaged image by sampling the probability distribution.
The completion results of the dataset-guided face image completion schemes [10,11,12,13,14,15] are natural and realistic on face images whose identity information is completely lost. However, the results are essentially average faces over the entire dataset, because their identities are uncertain. Therefore, Li et al. [16] proposed a semantic parsing loss, effectively enhancing the similarity between the topological structure of the completion result and that of the original image. Some scholars [17,18,19,20,21] used landmarks and edges as prior knowledge to guide the completion process. Compared with the approach of [16], the visual quality of the completion results of such approaches is greatly improved and training is more stable. However, these approaches can only provide posture information to guide the completion procedure, and the completion results still have no certain identity.
To solve the identity uncertainty problem, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is proposed in this paper to guide the entire completion process using the identity information of reference images. The aim of the network is to transfer the identity information from the reference images to completed images, thereby improving the identity plausibility of the completed images.
The key points of this paper are as follows:
(1) A reference-guided completion network is used to ensure the rationality of the identity of the completion results.
(2) A double-pipeline GAN is proposed to realize the decoupling and fusion of identity and posture features.
(3) An attention feature fusion module is designed to restore the background information lost during fusion.

2. Materials and Methods

2.1. Dataset and Evaluation Metrics

All the experiments in this paper use the high-quality human face dataset CelebA-HQ [22]. This dataset contains 30,000 high-quality face images from 6217 identities. It is processed for training and testing as follows. First, the images with the same identity are grouped, yielding 6217 image lists, each containing a different number of face images. Then, the lists containing only one image are eliminated. Two images are selected in order from each list as the original image and the reference image, respectively, giving 28,299 image pairs. Ultimately, 27,299 image pairs are chosen as the training set, while the remaining 1000 image pairs form the test set. All images are scaled to 256 × 256 × 3.
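For illustration, the pairing procedure can be sketched as follows; this is not the authors' released code, and the record format, the rule of pairing consecutive images and the split order are assumptions.

```python
import random
from collections import defaultdict

def build_pairs(records, test_size=1000, seed=0):
    """Group (image_path, identity) records and form (original, reference) pairs.

    `records` is an assumed iterable of (path, identity) tuples. Identities with
    only one image are discarded; images of the same identity are paired in
    order, and a fixed number of pairs is held out for testing.
    """
    by_identity = defaultdict(list)
    for path, identity in records:
        by_identity[identity].append(path)

    pairs = []
    for paths in by_identity.values():
        if len(paths) < 2:                        # drop single-image identities
            continue
        pairs.extend(zip(paths[:-1], paths[1:]))  # (original, reference) in order

    random.Random(seed).shuffle(pairs)            # split order is an assumption
    return pairs[test_size:], pairs[:test_size]   # training pairs, test pairs
```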
The evaluation metrics include the peak signal-to-noise ratio (PSNR) [23], the structural similarity index (SSIM) [23] and the L1 loss. These metrics measure the pixel-level difference and overall similarity between the completed image and the original image. Higher values of PSNR and SSIM indicate better performance, while for the L1 loss, lower is better. The metrics are calculated as follows:
MSE = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( I_g(i,j) - I_s(i,j) \right)^2    (1)
PSNR = 10 \times \log_{10} \left( \frac{(2^n - 1)^2}{MSE} \right)    (2)
SSIM = l \times c \times s, \quad l = \frac{2 \mu_g \mu_s + C_1}{\mu_g^2 + \mu_s^2 + C_1}, \quad c = \frac{2 \sigma_g \sigma_s + C_2}{\sigma_g^2 + \sigma_s^2 + C_2}, \quad s = \frac{\sigma_{gs} + C_3}{\sigma_g \sigma_s + C_3}    (3)
L1\ loss = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| I_g(i,j) - I_s(i,j) \right|    (4)
where I_s represents the original image, I_g represents the completed image, and i and j denote the pixel coordinates in I_s and I_g; H and W denote the height and width of the image, respectively; l, c and s measure the similarity of I_s and I_g in terms of luminance, contrast and structure, respectively; μ_s, μ_g and σ_s, σ_g denote the means and standard deviations of I_s and I_g, and σ_gs denotes their covariance; C_1, C_2 and C_3 are constants introduced to avoid division by zero, generally taken as C_1 = (K_1 × L)^2, C_2 = (K_2 × L)^2 and C_3 = (K_3 × L)^2, with K_1 = 0.01, K_2 = 0.03 and L = 255.
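A minimal NumPy sketch of these metrics is given below for illustration, assuming 8-bit images in [0, 255]. Note that the commonly reported SSIM averages the index over local windows (e.g., skimage.metrics.structural_similarity), whereas the function below transcribes Equation (3) with whole-image statistics, and the choice C_3 = C_2/2 is an assumption.

```python
import numpy as np

def psnr(img_g, img_s, n_bits=8):
    """PSNR following Equations (1) and (2)."""
    g, s = img_g.astype(np.float64), img_s.astype(np.float64)
    mse = np.mean((g - s) ** 2)
    return 10.0 * np.log10((2 ** n_bits - 1) ** 2 / mse)

def l1_loss(img_g, img_s):
    """Mean absolute pixel difference following Equation (4)."""
    return np.mean(np.abs(img_g.astype(np.float64) - img_s.astype(np.float64)))

def ssim_global(img_g, img_s, L=255, K1=0.01, K2=0.03):
    """Direct transcription of Equation (3) using whole-image statistics."""
    g, s = img_g.astype(np.float64), img_s.astype(np.float64)
    mu_g, mu_s = g.mean(), s.mean()
    sig_g, sig_s = g.std(), s.std()
    sig_gs = ((g - mu_g) * (s - mu_s)).mean()
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2                                   # assumed value for C3
    l = (2 * mu_g * mu_s + C1) / (mu_g ** 2 + mu_s ** 2 + C1)
    c = (2 * sig_g * sig_s + C2) / (sig_g ** 2 + sig_s ** 2 + C2)
    st = (sig_gs + C3) / (sig_g * sig_s + C3)
    return l * c * st
```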

2.2. Double-Pipeline GAN

Considering the identity uncertainty problem in the completed image, a reference image I_a with the same identity as the damaged image I_m is incorporated to guide the completion procedure. In this study, a reference-guided double-pipeline face image completion network (RG-DP-FICN) is proposed (Figure 1). Two pipelines (reconstruction and completion) are constructed in this network to reasonably incorporate the identity information of the reference image. To realize the identity transfer from the reference image to the damaged image, the posture features (e.g., the orientation of the face and the positions of the facial features) and the identity features (e.g., the contours of the eyes, eyebrows and nose) of the reference image are decoupled through this double-pipeline design, and the identity features of the reference image are fused with the posture features of the damaged image as follows.
The reconstruction pipeline possesses all the information on the reference image, which is jointly obtained by the encoder E and the identity encoder E_id. The reconstruction process is as follows. First, the damaged reference image I_ma and the topological structure of the reference image L_a are encoded by the encoder E to obtain the posture feature z_ma. Then, z_ma and the identity feature z_id extracted by E_id are fused by the identity transfer module to obtain z_ua. Ultimately, z_ua is fed into the decoder F to obtain the reconstructed image I_rec (Equations (5) and (6)). Therefore, by supervising the similarity between the reconstructed image I_rec generated by this pipeline and the reference image I_a, the decoupling of the posture feature z_ma and the identity feature z_id can be effectively realized. As for the completion pipeline, the damaged image I_m and the topological structure L_g are encoded by the encoder E to obtain z_m. Then, the identity feature z_id obtained by decoupling the reference image and the posture feature z_m of the damaged image are fused by the identity transfer module to obtain z_u. Ultimately, z_u is fed into the decoder F to generate the completed image I_g with a certain identity (Equations (7) and (8)).
I_{rec} = G(I_{ma}, L_a, E_{id}(I_a))    (5)
I_{ma} = I_a \times M    (6)
I_g = G(I_m, L_g, E_{id}(I_a))    (7)
I_m = I_s \times M    (8)
where I_s represents the original image; M is a mask; and L_a and L_g denote the face topological structures obtained by connecting the landmarks of I_a and I_m predicted by the landmark predictor [20].
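The two forward passes of Equations (5)–(8) can be summarized with the following PyTorch-style sketch; the module interfaces (E, E_id, the identity transfer module, the decoder F and the landmark predictor), the channel-wise concatenation of image and landmark map, and the convention that the mask M equals 1 on known pixels are assumptions for illustration, not the released implementation.

```python
import torch

def forward_pipelines(E, E_id, transfer, F, landmarks, I_s, I_a, M):
    """One forward pass of the reconstruction and completion pipelines.

    E: posture encoder, E_id: identity encoder, transfer: identity transfer
    module, F: decoder, landmarks: landmark predictor [20]. M is a binary mask
    (1 = known pixel); all interfaces here are assumed for illustration.
    """
    z_id = E_id(I_a)                              # shared 512 x 1 x 1 identity code

    # Reconstruction pipeline: mask the reference image and rebuild it
    I_ma = I_a * M                                # Equation (6)
    L_a = landmarks(I_a)                          # topological structure of I_a
    z_ma = E(torch.cat([I_ma, L_a], dim=1))       # posture feature of I_ma
    z_ua = transfer(z_ma, z_id)                   # fuse posture and identity
    I_rec = F(z_ua)                               # Equation (5)

    # Completion pipeline: complete the damaged original image
    I_m = I_s * M                                 # Equation (8)
    L_g = landmarks(I_m)                          # predicted topological structure
    z_m = E(torch.cat([I_m, L_g], dim=1))
    z_u = transfer(z_m, z_id)
    I_g = F(z_u)                                  # Equation (7)

    return I_rec, I_g
```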
The generator G is trained to enhance the visual quality of I_g and the rationality of its topological structure through adversarial training with two discriminators D_1 and D_2. For D_1, the true sample and the false sample are the image pairs (L_g, I_s) and (L_g, I_g), respectively. For D_2, the true sample and the false sample are the image pairs (L_a, I_a) and (L_a, I_rec), respectively. Such a sample-pair design effectively enhances the image quality of I_g and I_rec and ensures the rationality of their topological structures.

2.3. Identity Encoder and Identity Transfer Module

Fusing the identity information of the reference image with the original information of the damaged image is a challenging problem. First, reference images present different visual appearances due to gender, lighting conditions and makeup, so it is difficult to extract the identity information. Second, faces differ in posture and in the proportions and positions of the facial features in the image, so it is difficult to rationally fuse the identity information. To solve this problem, an identity encoder and an identity transfer module are designed in this paper (Figure 2) to extract the identity information and fuse the features, respectively. Next, the transfer process in the completion pipeline is introduced as an example.
In the identity encoder, residual blocks and a self-attention module [24] are first used to extract and compress the identity information of the reference image (256 × 256 × 3) into a feature map (512 × 4 × 4). Then, the feature map is mapped into the identity feature z_id (512 × 1 × 1) through a fully connected layer. As a global operation, each unit of the identity feature z_id is a weighted sum of the 512 × 4 × 4 feature values, which allows each unit in z_id to reason about the identity of the reference image. Hence, z_id can well represent the identity of the reference image.
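A rough PyTorch sketch of an identity encoder with this shape progression is given below (residual down-sampling from 256 × 256 × 3 to 512 × 4 × 4, a SAGAN-style self-attention layer [24], and a fully connected mapping to the 512 × 1 × 1 code z_id); the layer counts, channel widths and block designs are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ResDown(nn.Module):
    """Simple residual block that halves spatial resolution (assumed design)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class SelfAttention(nn.Module):
    """SAGAN-style self-attention [24] applied to the 4 x 4 feature map."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c // 8, 1)
        self.k = nn.Conv2d(c, c // 8, 1)
        self.v = nn.Conv2d(c, c, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # B x N x C'
        k = self.k(x).flatten(2)                        # B x C' x N
        attn = torch.softmax(q @ k, dim=-1)             # B x N x N
        v = self.v(x).flatten(2)                        # B x C x N
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out

class IdentityEncoder(nn.Module):
    """256 x 256 x 3 reference image -> 512 x 1 x 1 identity code z_id (sketch)."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512, 512]        # 256 -> 4 over six halvings
        self.blocks = nn.Sequential(*[ResDown(a, b) for a, b in zip(chans, chans[1:])])
        self.attn = SelfAttention(512)
        self.fc = nn.Linear(512 * 4 * 4, 512)           # global mapping to z_id

    def forward(self, x):
        f = self.attn(self.blocks(x))                   # B x 512 x 4 x 4
        z_id = self.fc(f.flatten(1))                    # each unit sees all 512*4*4 values
        return z_id.view(-1, 512, 1, 1)
```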
In the identity transfer module, z_m and z_ma, which represent high-dimensional semantics, are globally adjusted step by step using style rendering blocks to realize identity transfer. The identity transfer module consists of four style rendering blocks, each of which contains two AdaIN layers. The specific structure of the style rendering block is shown in Figure 3; in each block, adaptive instance normalization (AdaIN) [25] is applied twice, as follows:
f'_{i,j} = \frac{f_{i,j} - u(f_{i,j})}{\sigma(f_{i,j})} \times \beta_{i,j} + \alpha_{i,j}    (9)
where f_{i,j} represents the input feature, u(f_{i,j}) and σ(f_{i,j}) denote the mean and standard deviation of f_{i,j}, respectively, (α_{i,j}, β_{i,j}) are the AdaIN affine parameters obtained by mapping the identity feature z_id through a fully connected layer, f'_{i,j} is the rendered feature, i indicates the i-th style rendering block, and j indicates the j-th AdaIN layer of each style rendering block.
To improve the rendering effect, the receptive field is expanded using dilated convolution [26] in the style rendering blocks; thus, identity rendering over various receptive fields is achieved by applying the style rendering blocks successively. The entire style rendering process is completed by four style rendering blocks. The AdaIN affine parameters required by the style rendering blocks are obtained by mapping the identity feature z_id extracted by the identity encoder. Thus, the adjusted code z_u has a rational identity. Similarly, the code z_ua of the reconstruction pipeline also possesses a reasonable identity, since it uses the same identity transfer operation as the completion pipeline.
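Under this description, one style rendering block can be sketched as follows: the two AdaIN operations follow Equation (9), a dilated convolution [26] enlarges the receptive field, and the affine parameters are predicted from z_id by fully connected layers. The residual connection, the layer ordering and the dilation schedule across the four blocks are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def adain(f, alpha, beta, eps=1e-5):
    """Equation (9): normalize f per channel, rescale by beta, shift by alpha."""
    mu = f.mean(dim=(2, 3), keepdim=True)
    sigma = f.std(dim=(2, 3), keepdim=True) + eps
    return (f - mu) / sigma * beta + alpha

class StyleRenderingBlock(nn.Module):
    """One of the four style rendering blocks (assumed structure)."""
    def __init__(self, channels=512, z_dim=512, dilation=2):
        super().__init__()
        self.affine1 = nn.Linear(z_dim, 2 * channels)   # predicts (alpha, beta)
        self.affine2 = nn.Linear(z_dim, 2 * channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)

    def forward(self, f, z_id):
        z = z_id.flatten(1)                             # B x z_dim
        a1, b1 = self.affine1(z).chunk(2, dim=1)
        a2, b2 = self.affine2(z).chunk(2, dim=1)
        shape = (-1, f.size(1), 1, 1)
        h = torch.relu(self.conv1(adain(f, a1.view(shape), b1.view(shape))))
        h = torch.relu(self.conv2(adain(h, a2.view(shape), b2.view(shape))))
        return f + h                                    # residual connection (assumed)

class IdentityTransfer(nn.Module):
    """Four style rendering blocks with increasing dilation (assumed schedule)."""
    def __init__(self, channels=512, z_dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(
            [StyleRenderingBlock(channels, z_dim, dilation=d) for d in (1, 2, 4, 8)]
        )

    def forward(self, z_m, z_id):
        f = z_m
        for block in self.blocks:
            f = block(f, z_id)                          # render identity step by step
        return f                                        # z_u with rendered identity
```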

2.4. Attention Feature Fusion Module

After identity transfer by the identity transfer module, the codes z_u and z_ua have a reasonable identity. However, AdaIN in the identity transfer module globally adjusts the feature map during transfer, which leads to the loss of background information. Therefore, an attention feature fusion module is designed in this paper to combine the undamaged features containing background information before transfer with the identity-rendered code, thereby restoring the background information of the face image. For instance, in the completion pipeline, the features f_m and f_u, the latter obtained from z_u by sampling, are fused.
Figure 4 shows the structure of the attention feature fusion module. First, the attention score δ is calculated from f_u (Equation (10)). Then, f_m and f_u are fused under the guidance of δ (Equations (11)–(14)). Ultimately, y_m and y_u are concatenated and passed through a residual block to obtain the fusion feature f_g (Equation (15)). Similarly, the fusion feature f_rec is obtained by fusing f_ma and f_ua using this module.
\delta_{i,j} = \frac{\exp(s_{i,j})}{\sum_{i=1}^{N} \exp(s_{i,j})}, \quad s_{i,j} = P(f_u^i)^T Q(f_u^j)    (10)
where N denotes the number of pixels in f_u, and P and Q are 1 × 1 convolutions.
c_m^j = \sum_{i=1}^{N} \delta_{j,i} \, O(f_m^i)    (11)
c_u^j = \sum_{i=1}^{N} \delta_{j,i} \, O(f_u^i)    (12)
y_m = \gamma_m \times (1 - M) \times c_m + M \times f_m    (13)
y_u = \gamma_u \times c_u + f_u    (14)
f_g = S(y_m, y_u)    (15)
where O represents a 1 × 1 convolution, γ_m and γ_u denote balance parameters, and S represents a residual block.
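A PyTorch sketch of the attention feature fusion module under Equations (10)–(15) is given below; the channel widths, the softmax normalization axis, the initialization of γ_m and γ_u, and the form of the residual block S are assumptions, and M is assumed to be the mask down-sampled to the feature resolution (1 on known pixels).

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    """Sketch of the attention feature fusion module (Equations (10)-(15))."""
    def __init__(self, channels):
        super().__init__()
        self.P = nn.Conv2d(channels, channels // 8, 1)   # 1 x 1 convolutions P, Q, O
        self.Q = nn.Conv2d(channels, channels // 8, 1)
        self.O = nn.Conv2d(channels, channels, 1)
        self.gamma_m = nn.Parameter(torch.zeros(1))      # balance parameters (init assumed)
        self.gamma_u = nn.Parameter(torch.zeros(1))
        self.S = nn.Sequential(                          # residual-style merge of [y_m, y_u]
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, f_m, f_u, M):
        b, c, h, w = f_u.shape
        # Equation (10): attention scores computed from f_u only
        p = self.P(f_u).flatten(2).transpose(1, 2)       # B x N x C'
        q = self.Q(f_u).flatten(2)                       # B x C' x N
        delta = torch.softmax(p @ q, dim=1)              # B x N x N

        # Equations (11)-(12): aggregate O(f) under the attention map
        o_m = self.O(f_m).flatten(2)                     # B x C x N
        o_u = self.O(f_u).flatten(2)
        c_m = (o_m @ delta).view(b, c, h, w)
        c_u = (o_u @ delta).view(b, c, h, w)

        # Equations (13)-(14): M is the down-sampled mask, shape B x 1 x h x w
        y_m = self.gamma_m * (1 - M) * c_m + M * f_m
        y_u = self.gamma_u * c_u + f_u

        # Equation (15): concatenate and merge through the residual block S
        return self.S(torch.cat([y_m, y_u], dim=1))
```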
To verify the effectiveness of the attention feature fusion module, the channels of the feature maps f_u and f_g, before and after fusion, are summed and visualized in Figure 5. It can be observed that the information before fusion is mainly concentrated in the face area, whereas the background information is restored after fusion, indicating that the module can effectively fuse the features.

2.5. Loss Function

The loss functions are designed according to the respective tasks of the completion pipeline and the reconstruction pipeline: the reconstructed image should closely resemble the reference image, and the generated image should be realistic and natural with reasonable semantics.
First, a reconstruction loss is introduced for the reconstruction pipeline to decrease the pixel-level difference between I_rec and I_a. The completion pipeline has no complete face image information, so forcing the same reconstruction loss on it would lead to erroneous training. Therefore, the spatially discounted reconstruction loss [12] is introduced for the completion pipeline, in which the spatial discount weight W_sd applies weaker pixel-level supervision to pixels farther from the nearest known pixel. The reconstruction loss and the spatially discounted reconstruction loss are written as:
L_{app}^{r} = \| I_{rec} - I_a \|_1    (16)
L_{app}^{g} = \| W_{sd} \times (I_g - I_s) \|_1    (17)
W_{sd} = \gamma^{l}    (18)
where ‖·‖_1 denotes the L1 norm, W_sd represents the weight of each pixel, l is the distance from the pixel to the nearest known pixel, and γ is set to 0.9 in all experiments.
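A small sketch of the spatially discounted weight and the two reconstruction losses is given below; the use of a Euclidean distance transform for l and the mean reduction of the L1 norms are assumptions.

```python
import torch
from scipy.ndimage import distance_transform_edt

def spatial_discount_weight(mask, gamma=0.9):
    """W_sd = gamma ** l (Equation (18)) for a binary NumPy mask (1 = known pixel).

    l is the distance from each unknown pixel to the nearest known pixel; known
    pixels get l = 0 and hence full weight 1. The exact distance definition in
    the paper is not specified, so Euclidean distance is assumed here.
    """
    l = distance_transform_edt(mask == 0)            # distance to nearest known pixel
    return torch.from_numpy(gamma ** l).float()

def reconstruction_losses(I_rec, I_a, I_g, I_s, mask, gamma=0.9):
    """L_app^r (Equation (16)) and the discounted L_app^g (Equation (17))."""
    l_app_r = (I_rec - I_a).abs().mean()
    w_sd = spatial_discount_weight(mask, gamma).to(I_g.device)
    l_app_g = (w_sd * (I_g - I_s)).abs().mean()
    return l_app_r, l_app_g
```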
Then, the perceptual loss [27] is introduced for the reconstruction pipeline to decrease the perceptual difference between I_rec and I_a. The distance between the outputs of a pre-trained network at different layers for the reconstructed image I_rec and the reference image I_a is penalized to enhance their perceptual similarity. The perceptual loss is computed from the feature maps ϕ_i(·) output by a VGG19 network pre-trained on the ImageNet dataset [28], as follows:
L_{pc} = \sum_{i} \frac{1}{N_i} \| \phi_i(I_g) - \phi_i(I_s) \|_1    (19)
where ϕ_i(I_g) and ϕ_i(I_s) denote the feature maps of the i-th layer corresponding to I_g and I_s, respectively, and N_i is the number of elements in ϕ_i(·).
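A hedged sketch of this perceptual loss with a frozen VGG19 feature extractor is shown below; the specific layer taps (roughly the relu*_1 activations) are an assumption, since the paper only states that several layers of the pre-trained network are used.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG19Perceptual(nn.Module):
    """Perceptual loss (Equation (19)) using VGG19 features pre-trained on ImageNet."""
    def __init__(self, layer_ids=(1, 6, 11, 20, 29)):   # assumed relu*_1 taps
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                      # frozen feature extractor
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, img_g, img_s):
        loss = 0.0
        for f_g, f_s in zip(self.features(img_g), self.features(img_s)):
            loss = loss + (f_g - f_s).abs().mean()       # (1/N_i) * L1 over feature map i
        return loss
```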
Ultimately, to decrease the difference in data distribution between the generated image and the real image and to enhance the visual quality of the generated image, discriminant losses are introduced for the two pipelines. The discriminant loss for the completion pipeline is designed based on LSGAN [29] (Equation (20)), which effectively improves the authenticity of the completed image I_g. Inspired by [30], the LSGAN-based discriminant loss is modified for the reconstruction pipeline (Equation (21)). This design encourages the original features D_2(I_a, L_a) and the reconstructed features D_2(I_rec, L_a) in the discriminator to be close together, so it effectively enhances the semantic similarity between I_rec and the reference image I_a while ensuring the authenticity of the reconstructed image I_rec.
L_{adv}^{D_1} = \| D_1(I_g, L_g) - 1 \|_1    (20)
L_{adv}^{D_2} = \| D_2(I_{rec}, L_a) - D_2(I_a, L_a) \|_1    (21)
The total losses of the completion pipeline and the reconstruction pipeline are defined as follows:
L_g = \lambda_{app} \times L_{app}^{g} + \lambda_{adv} \times L_{adv}^{D_1}    (22)
L_r = \lambda_{app} \times L_{app}^{r} + \lambda_{adv} \times L_{adv}^{D_2} + \lambda_{pc} \times L_{pc}    (23)
where λ_app, λ_adv and λ_pc denote the weighting factors.
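The adversarial and total losses of Equations (20)–(23) can be combined as in the following sketch; the discriminator interface (image and landmark map concatenated along the channel axis) and the mean reduction are assumptions.

```python
import torch

def generator_losses(D1, D2, I_g, L_g, I_rec, I_a, L_a,
                     l_app_g, l_app_r, l_pc,
                     lam_app=1.0, lam_adv=0.1, lam_pc=0.1):
    """Total generator-side losses (Equations (20)-(23)); a hedged sketch only."""
    # Equation (20): push D1's score on the completed image towards the real label 1
    adv_d1 = (D1(torch.cat([I_g, L_g], dim=1)) - 1).abs().mean()

    # Equation (21): pull D2's response to I_rec towards its response to I_a
    adv_d2 = (D2(torch.cat([I_rec, L_a], dim=1))
              - D2(torch.cat([I_a, L_a], dim=1))).abs().mean()

    loss_g = lam_app * l_app_g + lam_adv * adv_d1                  # Equation (22)
    loss_r = lam_app * l_app_r + lam_adv * adv_d2 + lam_pc * l_pc  # Equation (23)
    return loss_g, loss_r
```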

3. Results and Discussion

3.1. Implementation Details

Our proposed model is implemented in PyTorch v1.2.0. The model is optimized using the Adam optimizer [31] with decay rates β_1 = 0 and β_2 = 0.9 and a learning rate of 0.0001. The weights λ_app, λ_adv and λ_pc of the loss functions are set to 1, 0.1 and 0.1, respectively. On a single NVIDIA 2080Ti (11 GB), we trained our model on CelebA-HQ for three days with a batch size of 4. The convergence of the loss functions and of the test-set evaluation metrics is shown in Figure 6 and Figure 7, respectively. We found that the model achieved the best results at the 38th epoch.
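The stated hyper-parameters correspond roughly to the following PyTorch optimizer setup; whether the two discriminators share a single optimizer is an assumption.

```python
import torch

# Assumed training setup matching the stated hyper-parameters:
# Adam with beta1 = 0, beta2 = 0.9, learning rate 1e-4, batch size 4.
def make_optimizers(generator, discriminator1, discriminator2,
                    lr=1e-4, betas=(0.0, 0.9)):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(
        list(discriminator1.parameters()) + list(discriminator2.parameters()),
        lr=lr, betas=betas,
    )
    return opt_g, opt_d
```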

3.2. Completion Consistency

A comparison is made between the RG-DP-FICN proposed in this paper and generative image inpainting with contextual attention (CA) [12], generative landmark guided face inpainting (LaFIn) [20], and pluralistic image completion (PIC) [15] on the same dataset, CelebA-HQ. Note that CA, LaFIn and PIC provide models pre-trained on CelebA-HQ; these methods are tested using the code and network weights released by their original authors.
To visually demonstrate the superiority of the face image completion approach proposed in this paper, a qualitative evaluation is performed on the completion results of this approach, CA, LaFIn and PIC (Figure 8). The completion results of PIC and CA are essentially average faces over the entire dataset, since their identities are uncertain. Thus, although their completion results possess a certain realism, the identity clearly differs from the ground truth. For LaFIn, landmarks are introduced as prior knowledge of the topological structure; nevertheless, its completion results still lack a reasonable identity owing to the absence of identity guidance, which manifests particularly as contours of the eyes, eyebrows and nose that differ from those in the original image. For RG-DP-FICN, a reference image with the same identity is introduced to guide the completion process. The identity transfer module transfers the identity of the reference image to the damaged image by fusing the decoupled identity features with the posture features of the damaged image. Moreover, the background information lost during identity transfer is restored by the attention feature fusion module. In conclusion, RG-DP-FICN generates completed images with a reasonable topological structure and identity compared to the above three schemes.
To objectively demonstrate the superiority of the face image completion approach proposed in this paper, the completion results of this approach, CA, LaFIn and PIC are assessed quantitatively. The quantitative evaluation results are given in Table 1. As can be seen from Table 1, LaFIn outperforms PIC and CA in most cases, as it employs landmark information to guide the completion process. The PSNR and SSIM of RG-DP-FICN are far higher than those of the other schemes, while its L1 loss is lower. It is inferred that, in comparison with existing advanced schemes, the RG-DP-FICN proposed in this paper has a stronger capability for face image completion.

3.3. Completion Diversity

To further validate the model's identity transfer capability, reference face images of different identities are chosen for damaged images with the center mask (Figure 9 and Figure 10). Although the mask completely covers the facial attributes, the completion results show obvious differences in the position and contour of the eyes and the shape of the nose depending on the reference face image. It can be seen that the completion network in this paper effectively guides the identity of the completion result.

4. Conclusions

Considering the problem of identity uncertainty in the results of face image completion, RG-DP-FICN is proposed in this paper. First, an identity transfer module is designed to realize identity guidance of the face image completion results. Then, an attention feature fusion module is designed to effectively restore the background information of the face image. In comparison with various advanced approaches, the completion results of RG-DP-FICN are more realistic and natural in subjective visual effect and possess significantly enhanced objective pixel-level similarity and overall similarity. Hence, the proposed network effectively solves the identity uncertainty problem.

Author Contributions

Conceptualization, H.L. and X.Z.; methodology, H.L., X.Z., S.L.; validation, H.W. and S.L.; data analysis, H.W.; investigation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L., H.W. and S.L.; visualization, S.L. and H.W.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (Grant No. 61972282), and by the Opening Project of State Key Laboratory of Digital Publishing Technology (Grant No. Cndplab-2019-Z001).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  2. Elad, M.; Starck, J.L.; Querre, P.; Donoho, D.L. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Appl. Comput. Harmon. Anal. 2005, 19, 340–358. [Google Scholar] [CrossRef] [Green Version]
  3. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  4. Bertalmio, M.; Vese, L.; Sapiro, G.; Osher, S. Simultaneous structure and texture image inpainting. IEEE Trans. Image Process. 2003, 12, 882–889. [Google Scholar] [CrossRef] [PubMed]
  5. Criminisi, A.; Pérez, P.; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef] [PubMed]
  6. Huang, J.B.; Kang, S.B.; Ahuja, N.; Kopf, J. Image completion using planar structure guidance. ACM Trans. Graph. (TOG) 2014, 33, 1–10. [Google Scholar] [CrossRef]
  7. Darabi, S.; Shechtman, E.; Barnes, C.; Goldman, D.B.; Sen, P. Image melding: Combining inconsistent images using patch-based synthesis. ACM Trans. Graph. (TOG) 2012, 31, 1–10. [Google Scholar] [CrossRef]
  8. Hays, J.; Efros, A.A. Scene completion using millions of photographs. ACM Trans. Graph. (TOG) 2007, 26, 4. [Google Scholar] [CrossRef]
  9. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. 2014. Available online: https://papers.nips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf (accessed on 16 November 2020).
  10. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
  11. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  12. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5505–5514. [Google Scholar]
  13. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  14. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4471–4480. [Google Scholar]
  15. Zheng, C.; Cham, T.J.; Cai, J. Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1438–1447. [Google Scholar]
  16. Li, Y.; Liu, S.; Yang, J.; Yang, M.H. Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3911–3919. [Google Scholar]
  17. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.Z.; Ebrahimi, M. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv 2019, arXiv:1901.00212. [Google Scholar]
  18. Xiong, W.; Yu, J.; Lin, Z.; Yang, J.; Lu, X.; Barnes, C.; Luo, J. Foreground-aware image inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5840–5848. [Google Scholar]
  19. Song, L.; Cao, J.; Song, L.; Hu, Y.; He, R. Geometry-aware face completion and editing. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Volume 33, pp. 2506–2513. [Google Scholar]
  20. Yang, Y.; Guo, X. Generative Landmark Guided Face Inpainting. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Nanjing, China, 16–18 October 2020; pp. 14–26. [Google Scholar]
  21. Jo, Y.; Park, J. SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1745–1753. [Google Scholar]
  22. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  23. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Okinawa, Japan, 16–18 April 2019; pp. 7354–7363. [Google Scholar]
  25. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  26. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  27. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711. [Google Scholar]
  28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  29. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar]
  30. Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2745–2754. [Google Scholar]
  31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Overview of RG-DP-FICN with two parallel pipelines. The reconstruction pipeline (blue line), consisting of the generator G and the discriminator D_2, possesses all the information on the reference image I_a and is used only for training. The completion pipeline (yellow line), consisting of the generator G and the discriminator D_1, is used for both training and testing. The two pipelines share the same generator G and identity encoder E_id; the generator consists of the encoder E and the decoder F. The structure of the residual encoder and residual decoder is the same as in [15]; the identity encoder and identity transfer modules are further described in Section 2.3; the attention feature fusion module is further described in Section 2.4.
Figure 2. The identity encoder and identity transfer module.
Figure 3. The style rendering block.
Figure 4. The attention feature fusion module. The module contains three 1 × 1 convolutional layers, O, P, Q, to extract features pixel by pixel.
Figure 5. The feature maps before and after attention feature fusion. x and y represent the coordinates of the feature map, and the color bar represents the range of the feature map values.
Figure 6. The curve of loss function convergence.
Figure 7. The curve of evaluation metrics convergence.
Figure 8. The qualitative evaluation on CelebA-HQ datasets: (a) masked input; (b) ground-truth; (c) results of CA; (d) results of PIC; (e) results of LaFIn; (f) results of RG-DP-FICN.
Figure 9. The completion results of a masked IMAGE6 using different reference face images: (a) masked input; (b) ground-truth; (c) reference face images; (d) results of RG-DP-FICN.
Figure 10. The completion results of a masked IMAGE8 using different reference face images: (a) masked input; (b) ground-truth; (c) reference face images; (d) results of RG-DP-FICN.
Table 1. The quantitative results on the CelebA-HQ test set consisting of 1000 face images.

Method        PSNR     SSIM     L1 Loss (%)
CA [12]       24.39    0.863    2.88
PIC [15]      24.86    0.871    2.65
LaFIn [20]    26.22    0.905    2.37
Our method    28.44    0.938    1.58
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
