Vision Transformer in Industrial Visual Inspection
Round 1
Reviewer 1 Report
1. What is the main question addressed by the research? Comparison of the state-of-the-art transformer models with various assumptions.
2. Do you consider the topic original or relevant in the field? Does it address a specific gap in the field? Accepted. The gaps have been addressed accordingly. However, it can be improved.
3. What does it add to the subject area compared with other published material? Yes.
4. What specific improvements should the authors consider regarding the methodology? What further controls should be considered? Please check my comments.
5. Are the conclusions consistent with the evidence and arguments presented and do they address the main question posed? Yes.
6. Are the references appropriate? Yes.
COMMENTS:
1. Abstract - good
2. Keywords - too many keywords. It is recommended to write 5 or 6 keywords only
3. Problem statements - well-described
4. Related works - good connection of past research
5. Methodology - The compared transformers models must be described accordingly. The characteristic, specification and feature of transformer models must be explained.
6. Results - Have been well described and explained with high quality figures and tables.
7. Conclusion - has been well explained
8. Reference - suitable references have been selected
Author Response
Dear Reviewer,
We would like to thank you for your valuable feedback on our work. We have addressed your two main remarks:
Comment 2
Reviewer’s Comment
“Keywords - too many keywords. It is recommended to write 5 or 6 keywords only”
Our Rebuttal
We reduced the number of keywords to six by removing those that overlapped topically with one another, as well as some that we considered lower priority.
Comment 5
Reviewer’s Comment
“Methodology - The compared transformers models must be described accordingly. The characteristic, specification and feature of transformer models must be explained.”
Our Rebuttal
We enhanced the description of the utilized transformer models in section 2.2 with more detailed information as well as architecture visualizations for all three of them.
Yours sincerely,
Tobias Meisen, Richard Meyes and Nils Hütten
Author Response File: Author Response.pdf
Reviewer 2 Report
The problem studied in this paper is the introduction of vision transformers in industrial visual inspection. The opinions on the paper are as follows:
1. The "introduction" is not enough to highlight the visual inspection problems studied in the paper, but rather to highlight the contributions of this paper and highlight their own innovation, not only to introduce computer vision and visual inspection problems.
2. The content of "relevant work" is not comprehensive enough. You can state your own argument through the model architecture, including natural language processing, computer vision and other related content, to make its content more comprehensive and specific.
3. In the "Learning Task Description" section, more detailed and specific examples should be listed to highlight the importance of learning tasks.
4. In terms of "experiments and results", the experimental setup is not comprehensive enough. It is necessary to conduct multi-level comparison of experimental data and conduct a comprehensive analysis based on the effects of some experiments.
5. In terms of "conclusion and prospect", reorganize the language, highlight the main content and contribution of this paper, and prospect and develop the effect of visual transformer.
Author Response
Dear Reviewer,
We would like to thank you for your valuable feedback on our work. We did our best to address your remarks.
Comment 1
Reviewer’s Comment
“The ‘introduction’ is not enough to highlight the visual inspection problems studied in the paper, but rather to highlight the contributions of this paper and highlight their own innovation, not only to introduce computer vision and visual inspection problems.”
Our Rebuttal
In the introduction, we highlighted the similarities and differences between our work and the referenced VI papers by Wang and Liu to clarify our contribution. For example, we noted that two of our use cases have strong similarities with theirs, but that the character recognition case goes beyond what the SOTA addresses in terms of characteristics and complexity. In addition, we now state that two of the three vision transformer models we used had not been applied to VI before, which is an additional contribution.
Comment 2
Reviewer’s Comment
“The content of "relevant work" is not comprehensive enough. You can state your own argument through the model architecture, including natural language processing, computer vision and other related content, to make its content more comprehensive and specific.”
Our Rebuttal
In chapter two, we mainly focused on enhancing the description of the utilized transformer models in section 2.2 with more detailed information as well as architecture visualizations for all three of them.
Comment 3
Reviewer’s Comment
“In the "Learning Task Description" section, more detailed and specific examples should be listed to highlight the importance of learning tasks.”
Our Rebuttal
In chapter three, we added more background information to highlight why the covered tasks are important for the maintenance of rail freight cars.
Comment 4
Reviewer’s Comment
“In terms of "experiments and results", the experimental setup is not comprehensive enough. It is necessary to conduct multi-level comparison of experimental data and conduct a comprehensive analysis based on the effects of some experiments.”
Our Rebuttal
Unfortunately, we are not able to conduct additional experiments to enhance the comprehensiveness of the experimental setup, due to the time constraint of 10 days for the revision given by the editor. Nevertheless, we are grateful for your feedback, and it will be considered in our future work.
Comment 5
Reviewer’s Comment
“In terms of "conclusion and prospect", reorganize the language, highlight the main content and contribution of this paper, and prospect and develop the effect of visual transformer.”
Our Rebuttal
Similar to the introduction, we highlighted our contribution of applying vision transformers to VI tasks of higher complexity than the SOTA, as well as employing models that had not been applied in this area before.
Yours sincerely,
Tobias Meisen, Richard Meyes and Nils Hütten
Author Response File: Author Response.pdf