Article
Peer-Review Record

Portrait Reification with Generative Diffusion Models

Appl. Sci. 2023, 13(11), 6487; https://doi.org/10.3390/app13116487
by Andrea Asperti *, Gabriele Colasuonno and Antonio Guerra
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 18 April 2023 / Revised: 22 May 2023 / Accepted: 23 May 2023 / Published: 25 May 2023
(This article belongs to the Special Issue Applications of Deep Learning and Artificial Intelligence Methods)

Round 1

Reviewer 1 Report

The paper is dedicated to the problem of replacing painters' abstract faces in paintings with real human faces. It makes extensive use of an embedding technique for generative diffusion models developed in another paper by the same main author. The originality is the combination of the DDIM models with complex pre- and post-processing, which includes face detection, head position estimation and cropping in the preprocessing, and super-resolution, face segmentation and colour correction in the post-processing. One of the really interesting aspects is the successful training and application of such a complex algorithm.

Here are some remarks or hints for improvement:

1. In 4.4 the effect of the crop on the facial expression is described. Why not work with vector arithmetic in the latent space to reduce the smile effect, as for VAEs?

2. It might be interesting to comment on which painting styles this approach is applicable to. It seems that expressionist and some impressionist styles might be excluded.

3. The application is well done, but the paper would benefit if the authors could mention where these techniques have practical applications.

4. For some images the references are missing, for example Fig. 5; also, a source identification for the original paintings used would be great.

5. There are minor editing errors, missing blanks, etc.

The quality of English is acceptable; proofreading by a native speaker always improves the quality.

Author Response

Thank you for the accurate reviews and the valuable suggestions.

Q1. In 4.4 the effect of the crop on the facial expression is described. Why not work with vector arithmetic in the latent space to reduce the smile effect, as for VAEs?

A1. This is indeed an interesting question and a direction worth exploring further. However, according to some experiments we did, it is extremely difficult to find significant vector directions that work on the totality (or a large portion) of the data. This is presumably due to the dimension of the latent space (equal to the dimension of the visible space). In other words, the arithmetic for semantic manipulation of the latent space of diffusion models could possibly be non-linear.
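For concreteness, the VAE-style latent arithmetic the reviewer refers to can be sketched as follows. This is a toy NumPy illustration with random stand-in latents; the function names, dimensions, and data are hypothetical and not taken from the article:

```python
import numpy as np

def attribute_direction(latents_with, latents_without):
    """Difference of means between latents of images with and without
    the attribute (e.g. 'smiling'), as commonly done for VAEs."""
    return latents_with.mean(axis=0) - latents_without.mean(axis=0)

def edit_latent(z, direction, strength=1.0):
    """Move a latent code along the attribute direction."""
    return z + strength * direction

# Toy example with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
smiling = rng.normal(1.0, 0.1, size=(100, 64))   # latents of "smiling" faces
neutral = rng.normal(0.0, 0.1, size=(100, 64))   # latents of "neutral" faces
d = attribute_direction(smiling, neutral)
z_edited = edit_latent(neutral[0], d, strength=-0.5)  # reduce the "smile"
```

The difficulty mentioned above is precisely that, in the high-dimensional latent space of diffusion models, a single global direction `d` of this kind does not seem to exist.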

Q2. It might be interesting to comment on which painting styles this approach is applicable to. It seems that expressionist and some impressionist styles might be excluded.

A2. Yes, indeed. We added a comment in the conclusions.

Q3. The application is well done, but the paper would benefit if the authors could mention where these techniques have practical applications.

A3. Alas, we don't know. For us, it just has a ludic nature, as many similar applications for image processing do. On the other hand, it can eventually contribute to improving our knowledge of the latent space of diffusion models, which is still largely unexplored.

Q4. For some images the references are missing, for example Fig. 5; also, a source identification for the original paintings used would be great.

A4. We used images randomly taken from public repositories. For each painting there are many different copies available, and we did not keep track of the source, sorry. Figure 5 has been replaced with an original picture.

Q5. There are minor editing errors, missing blanks, etc.

A5. We read the article carefully and tried to correct all errors.

Reviewer 2 Report

 

- Authors should clearly indicate the difference between the work they present in this article and the one described in the paper [1]. Since both works are by the same authors, it is essential to clearly highlight the contribution of the new article.

 

- It is essential to describe the practical utility of the proposed method (indicate real applications in which it may be interesting).

 

- Lines 31-33:

The authors indicate that 'the overall process is quite complex'. Therefore, Figure 4 does not seem adequate to summarize this complex process. It would be necessary to include a block diagram where the sequence of operations carried out by the proposed method can be seen, including more information and not only the successive images that are obtained. In other words, a scheme must be included that visually shows the successive operations that are applied to the image and that the authors describe in the different sections of the article.

 

- CelebA, CelebHQ, CelebMaskHQ: they are used several times in the article, so they need to be described in more detail.

 

- The content of algorithms 1 and 2 should be modified to make it easier to understand. Although the authors indicate that these algorithms include pseudocode, the notation used does not conform too closely to what would be the traditional concept of pseudocode, which allows operations to be translated into a programming language. Indeed, both algorithms combine mathematical notation with the enumeration of very simplified operations. Authors should expand the description so that the reader can easily reproduce the operations described.

 

- LINE 91: the authors used a network with 10 diffusion steps. They must indicate the reason why they selected this value.

 

- Figure 8: Various values of the factor used to generate the images are indicated in this figure. Authors should indicate the criteria used to determine the appropriate value for that factor.

 

- Figure 9 is too small. I can't see its content.

 

- The description given in the article does not include any equation. That is, the authors use words to describe what they have done, but it would also be interesting to have the applied equations (at least in the fundamental steps).

 

- References: Authors should review this section, as the same format is not used in all references and there are some errors.

 

- The authors could consider the option of replacing the references published in ArXiv by other similar ones published in journals.

 

 

- The text included in some images is too small.

- Acronyms must be defined only once in the article.

- The authors should carefully read the text, as there are some wrong words.


Author Response

Thank you for the accurate reviews and the valuable suggestions.

Q1: Authors should clearly indicate the difference between the work they present in this article and the one described in the paper [1]. 

A1: We added a paragraph in the introduction clarifying the contribution of the new article and the novelties with respect to [1]. Roughly, the examples in [1] were based on manual crops, while here we automate the process, developing a self-contained, user-friendly application.

Q2: It is essential to describe the practical utility of the proposed method (indicate real applications in which it may be interesting).

A2: From our point of view, the application just has a ludic nature, as many applications of image processing frequently have (e.g., what is the practical utility of, say, inceptionism?). From a more scientific point of view, the application paves the way to interesting investigations of the latent space of diffusion models, still largely unexplored.

Q3: The authors indicate that 'the overall process is quite complex'. Therefore, Figure 4 does not seem adequate to summarize this complex process. It would be necessary to include a block diagram where the sequence of operations carried out by the proposed method can be seen, including more information and not only the successive images that are obtained.

A3: We modified the figure, including a scheme that visually shows the successive operations applied to the image.

Q4: CelebA, CelebHQ, CelebMaskHQ: they are used several times in the article, so they need to be described in more detail.

A4: We integrated a description of the datasets.

Q5: The content of algorithms 1 and 2 should be modified to make it easier to understand. Although the authors indicate that these algorithms include pseudocode, the notation used does not conform too closely to what would be the traditional concept of pseudocode, which allows operations to be translated into a programming language. Indeed, both algorithms combine mathematical notation with the enumeration of very simplified operations. Authors should expand the description so that the reader can easily reproduce the operations described.

A5: The pseudocode conforms to the tradition of this particular domain, which has a strong probabilistic foundation. Sampling is a basic operation. We believe that trying to rephrase it in different ways would only create confusion. We already added comments on the left. We expanded the discussion in the article.

Q6: LINE 91: the authors used a network with 10 diffusion steps. They must indicate the reason why they selected this value.

A6: 10 steps is the standard for DDIM. We added an explanation in the article.
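For reference, the deterministic sampling loop that makes such a small number of DDIM steps viable can be sketched as follows. This is a simplified NumPy illustration with a dummy denoiser and a made-up noise schedule, not the actual trained network:

```python
import numpy as np

def ddim_sample(eps_theta, alpha_bar, x_T, steps=10):
    """Deterministic (eta = 0) DDIM sampling with few steps.

    eps_theta(x, t) -> predicted noise; alpha_bar[t] is the cumulative
    noise schedule. Only `steps` timesteps of the full schedule are
    visited, which is why so few iterations suffice.
    """
    T = len(alpha_bar) - 1
    timesteps = np.linspace(T, 0, steps + 1).round().astype(int)
    x = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps = eps_theta(x, t)
        # Predict the clean image x0 from the current noisy x_t.
        x0 = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # Deterministic step to the previous timestep.
        x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1 - alpha_bar[t_prev]) * eps
    return x

# Toy run: a linear alpha_bar schedule and a zero "network" stand-in.
alpha_bar = np.linspace(1.0, 0.01, 1001)
eps_theta = lambda x, t: np.zeros_like(x)
out = ddim_sample(eps_theta, alpha_bar, np.random.randn(8, 8), steps=10)
```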

Q7: Figure 8: Various values of the factor used to generate the images are indicated in this figure. Authors should indicate the criteria used to determine the appropriate value for that factor.

A7: During the calibration of the crops we did experiments with small variations in their size and observed the phenomenon. As we say in the article, the point is just to stress the sensitivity of the process to small variations in the conditions.

Q8: Figure 9 is too small. I can't see its content.

A8: We changed the figure.

Q9: The description given in the article does not include any equation. That is, the authors use words to describe what they have done, but it would also be interesting to have the applied equations (at least in the fundamental steps).

A9: We could add a theoretical introduction to diffusion models, if required, but this can be easily found in plenty of articles. We added some equations for the relevant preprocessing and post-processing operations, or pseudocode when possible.

Q10: References: Authors should review this section, as the same format is not used in all references and there are some errors.

A10. We revised references.

Reviewer 3 Report

Contributions:

This study presents an application of generative diffusion techniques for reifying human portraits in artistic paintings. My comments are given below:

 

  1. The contents are complicated. Please provide a flowchart to describe the proposed system.
  2. (Page 3) The authors should explicitly introduce Algorithm 1.
  3. (Page 3) In Algorithm 1, q and epsilon are not defined. What is the loss function?
  4. (Line 77 on page 4) Why do you select U-net?
  5. (Page 5) Which algorithm do you use for face detection? What is the accuracy rate? The statement is too superficial.
  6. (Page 5) Which algorithm do you use for head pose estimation? What is the accuracy rate?
  7. (Page 6) The algorithm for cropping is unclear. Please provide the pseudo-code and the parameters you used in the experiments.
  8. (Page 6) The estimated bounding box is inaccurate in the first image of Fig. 7.
  9. (Page 7) How does self-attention work in this work?
  10. (Page 8) Why do you use residual blocks in Fig. 9?
  11. (Page 8) Why do you use U-net for face segmentation? I think YOLOnet outperforms U-net.
  12. (Page 9) What are the accuracy rate and recall rate for face segmentation?
  13. Figures 1 to 3 can be combined.
  14. (Page 2) The caption of Fig. 4 is too redundant. Some statements can be moved to the context. Figure 10 also has the same problem.
  15. (Lines 51 and 52 on page 3) The symbols zx and x can be removed.
  16. (Line 64 on page 3) The paragraph can be combined with the previous paragraph. Line 99 on page 4 also has the same problem.

The quality of the English language should be improved.

 

Author Response

Thank you for the accurate reviews and the valuable suggestions.

Q1: The contents are complicated. Please provide a flowchart to describe the proposed system.

A1: We expanded Figure 4 and improved the discussion.

Q2: (Page 3) The authors should explicitly introduce Algorithm 1.

A2: Done.

Q3: (Page 3) In Algorithm 1, q and epsilon are not defined. What is the loss function?

A3: q was a mistake, thanks for spotting it. epsilon_theta is the denoising network, defined a few lines above, and epsilon is the noise defined at step 5. The error is explicitly given in step 7: it is the distance between the actual noise and the noise predicted by the network.
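For illustration, the training step described above (corrupt x0 with known noise, then penalize the prediction error of step 7) can be sketched as follows. This is toy NumPy code; the names and the schedule are illustrative, not taken from Algorithm 1 verbatim:

```python
import numpy as np

def diffusion_training_step(eps_theta, x0, alpha_bar, rng):
    """One step of the standard denoising objective: corrupt x0 with
    known noise eps, then penalize the network's noise prediction."""
    t = rng.integers(1, len(alpha_bar))             # random timestep
    eps = rng.standard_normal(x0.shape)             # the noise added (step 5)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    loss = np.mean((eps - eps_theta(x_t, t)) ** 2)  # step 7: prediction error
    return loss

# Toy run: a zero predictor yields a loss around E[eps^2] = 1.
rng = np.random.default_rng(0)
alpha_bar = np.linspace(1.0, 0.01, 1001)
x0 = rng.standard_normal((8, 8))
loss = diffusion_training_step(lambda x, t: np.zeros_like(x), x0, alpha_bar, rng)
```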

Q4: (Line 77 on page 4) Why do you select U-net? 

A4: It is the network used by all works on diffusion models, and a de facto standard at the moment.

Q5: Which algorithm do you use for face detection? What is the accuracy rate? The statement is too superficial. 

A5: We explained in more detail the algorithm we used, the reason for our choice,  and its accuracy.

Q6: (Page 5) Which algorithm do you use for head pose estimation? What is the accuracy rate?

A6: We improved the Head pose estimation subsection adding more details on the algorithm and its accuracy.

Q7: (Page 6) The algorithm for cropping is unclear. Please provide the pseudo-code and the parameters you used in the experiments.

A7: We updated the section adding more information and the pseudo-code.

Q8: (Page 6) The estimated bounding box is inaccurate in the first image of Fig. 7.

A8. Thanks for the question. The crop is not based on the bounding box of the face: it must conform to the training data of the CelebA dataset, where faces are aligned with respect to the position of the eyes, resulting in crops of the kind shown in the figure. We hope that the changes made to the text throughout the article helped to clarify the point.
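As a rough illustration of eye-based alignment, a crop box can be derived from the eye coordinates rather than from a face bounding box. The helper and scale factor below are hypothetical, not the exact CelebA alignment parameters:

```python
import numpy as np

def eye_aligned_crop(eye_left, eye_right, scale=2.2):
    """Crop box from eye coordinates (hypothetical parameters).

    The box is centered on the midpoint between the eyes and sized
    proportionally to the inter-ocular distance, so it does NOT
    coincide with the face bounding box."""
    eye_left, eye_right = np.asarray(eye_left, float), np.asarray(eye_right, float)
    center = (eye_left + eye_right) / 2
    size = scale * np.linalg.norm(eye_right - eye_left)
    x0, y0 = center - size / 2
    return int(x0), int(y0), int(size), int(size)

# Eyes at (100, 120) and (160, 120): a 132x132 box centered at (130, 120).
box = eye_aligned_crop((100, 120), (160, 120))  # → (64, 54, 132, 132)
```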

Q9, Q10: How does self-attention work in this work? - Why do you use residual blocks in Fig. 9?

A9, A10: The self-attention mechanism is meant to capture long-range dependencies in the image, resulting in a noticeable performance improvement. Residual blocks are a traditional technique, mitigating the vanishing gradient problem and accelerating the training process.

Q11. Why do you use U-net for face segmentation? I think YOLOnet outperforms U-net.

A11. U-net and YOLOnet are both popular and effective deep learning models. We preferred U-net over YOLOnet to avoid introducing additional software. Segmentation is not a crucial component of the pipeline, and the accuracy we get is more than adequate for our purposes.

Q12. What are the accuracy rate and recall rate for face segmentation?

A12. We achieved an accuracy of 96.78% and a recall of 97.60%. We added this information in the Face segmentation subsection.
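The two figures are the usual pixel-wise metrics for binary segmentation; a minimal sketch, with a toy 4-pixel example rather than our data:

```python
import numpy as np

def pixel_metrics(pred, target):
    """Pixel-wise accuracy and recall for binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    accuracy = (pred == target).mean()
    recall = (pred & target).sum() / target.sum()  # TP / (TP + FN)
    return accuracy, recall

# Toy example: one false negative at the second pixel.
pred   = np.array([1, 0, 1, 0])
target = np.array([1, 1, 1, 0])
acc, rec = pixel_metrics(pred, target)  # acc = 0.75, rec = 2/3
```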

Q13-Q16 are editing suggestions, and we tried to take them into account.

 

Reviewer 4 Report

This paper presents reification of human portraits in artistic paintings. The algorithm is based on Denoising Diffusion Implicit Models (DDIM). A handful of results are shown to highlight the effectiveness of the proposed extension.

Based on the current results, the paper may be accepted after a few modifications. Overall, the presentation needs to be improved. Author may add a separate paragraph in Sec 1 to list the key contributions/findings.

 

It would be good to discuss the relevance of iterative image denoising algorithm in this framework.

 

It may be relevant to comment whether color-space filtering based post-processing may help.

 

Does image cropping have the same technical challenge as image downscaling [R1]? Please comment.

===================

[R1] "Image downscaling via co-occurrence learning," Journal of Visual Communication and Image Representation, vol. 91, pp. 103766, 2023.

Needs to be improved.

Author Response

Thank you for the accurate reviews and the valuable suggestions.

Q1: Author may add a separate paragraph in Sec 1 to list the key contributions/findings.

A1: We added a paragraph listing the key contributions.

Q2: It would be good to discuss the relevance of iterative image denoising algorithm in this framework.

A2: This was already briefly discussed at the bottom of page 1: "This is particularly effective for reverse diffusion techniques, since they have a larger sample diversity and introduce much less artifacts in the generative process than different generative techniques". We expanded the point.

Q3: It may be relevant to comment whether color-space filtering based post-processing may help.

A3: We added a comment in the Color correction subsection. Briefly, the LAB color space makes it easier to manipulate the color information separately from the brightness information, making the color correction more reliable.
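The kind of correction described (matching chromatic statistics in LAB while preserving lightness) can be sketched as follows. This assumes the images are already converted to LAB (the conversion itself is not shown) and is an illustrative simplification, not our exact procedure:

```python
import numpy as np

def match_color_lab(src_lab, ref_lab):
    """Shift/scale the chromatic a and b channels of src to match the
    reference statistics, leaving the L (lightness) channel untouched.
    Expects H x W x 3 arrays already converted to LAB."""
    out = src_lab.astype(float).copy()
    for c in (1, 2):  # a and b channels only
        s_mu, s_sd = src_lab[..., c].mean(), src_lab[..., c].std()
        r_mu, r_sd = ref_lab[..., c].mean(), ref_lab[..., c].std()
        out[..., c] = (src_lab[..., c] - s_mu) * (r_sd / (s_sd + 1e-8)) + r_mu
    return out

# Toy LAB arrays standing in for the generated face and its context.
rng = np.random.default_rng(0)
src_lab = rng.normal([60.0, 15.0, -10.0], 3.0, size=(8, 8, 3))
ref_lab = rng.normal([60.0, 0.0, 0.0], 6.0, size=(8, 8, 3))
corrected = match_color_lab(src_lab, ref_lab)
```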

Q4: Does image cropping have the same technical challenge as image downscaling [R1]? Please comment.

A4: We are not sure we understand the question. The pre-processing phase also requires downscaling, since all faces must eventually have the same dimension of 64x64 pixels.
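For illustration, downscaling to 64x64 can be as simple as block averaging; this is a stand-in sketch, whereas an actual pipeline would typically use a library resampling filter (e.g. bilinear or Lanczos):

```python
import numpy as np

def downscale_mean(img, out_size=64):
    """Downscale a square image to out_size x out_size by block
    averaging (a simple stand-in for library resampling filters)."""
    h, w = img.shape[:2]
    assert h == w and h % out_size == 0, "square, integer-factor input"
    f = h // out_size
    return img.reshape(out_size, f, out_size, f, *img.shape[2:]).mean(axis=(1, 3))

face = np.random.rand(256, 256, 3)   # stand-in for a cropped face
small = downscale_mean(face)         # shape (64, 64, 3)
```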

Round 2

Reviewer 2 Report

 

There are 3 basic recommendations that the authors have not taken into account:

- point 9

- point 3: the graph has been modified but no additional information is included to allow a proper understanding of the sequence of operations

- point 5: the authors have not made any changes. On the other hand, in their new version of the article they have included two other algorithms whose description has a totally different format than the one included in the initial version of the article. This makes the description not homogeneous. On the other hand, many of the operations represented in these two new algorithms would be easier to understand if mathematical notation were used.

 

On the other hand, it seems logical that the solution proposed by the authors has some practical utility.

Correct

Author Response

It is not true that the recommendations made by the reviewer have not been taken into account: they have been either partially implemented, or we explained our motivations for doing otherwise. We believe that the reviewer cannot simply reiterate his/her requests without entering into the merits of our reply.

Specifically:

Point 9. The reviewer was generically complaining that the article did not include any equation, without further specification. We added some equations in the part of the article that, according to a different reviewer, seemed to be weaker, namely the description of the pre-processing and post-processing phases. The article does not contain original material on diffusion models, and we prefer to avoid giving this wrong impression. It is just an application meant to showcase a recent embedding technique for diffusion models introduced by the first author, in conjunction with other researchers. This seems to be in line with the journal scope, and with this specific special issue.

Point 3. Q: "The graph has been modified but no additional information is included to allow a proper understanding of the sequence of operations."

A: The caption has not been modified since, according to another reviewer, it was already too long. However, in our opinion, the text contains enough information to follow the sequence of operations. All steps are mentioned and briefly described in the text. This is just an introduction: the rest of the article describes the full application pipeline, with details on all the different phases.

Point 5. Q: "The authors have not made any changes. On the other hand, in their new version of the article they have included two other algorithms whose description has a totally different format than the one included in the initial version of the article. This makes the description not homogeneous. On the other hand, many of the operations represented in these two new algorithms would be easier to understand if mathematical notation were used."

A: As we already said in our reply during the first round, the pseudocode conforms to the tradition of the domain of diffusion models. It is simple and clear. A more detailed pseudocode would be extremely similar to the code, and since the code is open source we see no reason to include it inside the article. We really believe this is the right way to present it, and in the absence of precise motivations for doing otherwise, we stick to our decision. Pseudocode for the algorithms in the other sections was explicitly required by the third reviewer. We removed one of them, which had little significance.

Reviewer 3 Report

The authors have improved the quality of this paper. I think it can be accepted for publication.

The English writing should be further improved.

Author Response

Thank you for your appreciation.

We tried to improve the quality of the English language, also with the help of editing software.
