Article
Peer-Review Record

HBCA: A Toolchain for High-Accuracy Branch-Fused CNN Accelerator on FPGA with Dual-Decimal-Fused Technique

Electronics 2023, 12(1), 192; https://doi.org/10.3390/electronics12010192
by Zhengjie Li 1, Lingli Hou 2, Xinxuan Tao 1, Jian Wang 1 and Jinmei Lai 1,*
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 31 October 2022 / Revised: 13 December 2022 / Accepted: 26 December 2022 / Published: 30 December 2022
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, Volume II)

Round 1

Reviewer 1 Report

I would recommend a more detailed conclusions chapter. The results obtained during the research are very good, and they would support more detailed and refined conclusions.

Apart from this, it is very good work that deserves to be published.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes a framework to improve accuracy in FPGA-based CNN accelerators. The experimental results demonstrate that the accuracy loss is within 0.1% of the quantized model and that the power efficiency is better than that of other related FPGA-based accelerators. The paper is interesting, but the English is very poor, making the paper hard to understand. Some concerns to further refine the paper are listed below.

(1) The abstract should be concise and simple. The current abstract is too wordy and contains many grammatical mistakes. Please revise it.

(2) The paper is about a framework (toolchain) used to generate high-accuracy CNN accelerators, but in Section 2 much critical information is missing. For example: how are HDL codes generated from the CNN algorithm according to kernel size and stride? How is the generated HDL code verified automatically? How is the parallelism explored? From Fig. 1, how is the new bitstream generated from the first bitstream? And so on.
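For context on what such HDL generation typically looks like: the sketch below is a generic, hypothetical illustration of template-based Verilog generation, not the authors' HBCA toolchain. The module name `conv_pe` and its parameters are invented; the point is only that each design point (kernel size, stride, parallelism) is substituted into a template rather than written by hand.

```python
# Hypothetical illustration of template-based HDL generation.
# This is NOT the authors' toolchain; module/parameter names are invented.
CONV_TEMPLATE = """\
module conv_pe #(
    parameter KERNEL_SIZE = {k},
    parameter STRIDE      = {s},
    parameter PARALLELISM = {p}
) (
    input         clk,
    input  [7:0]  pixel_in,
    output [15:0] acc_out
);
// ... generated datapath would go here ...
endmodule
"""

def generate_conv_module(kernel_size: int, stride: int, parallelism: int) -> str:
    """Fill the template with one design point from the exploration space."""
    return CONV_TEMPLATE.format(k=kernel_size, s=stride, p=parallelism)

# One design point: 3x3 kernel, stride 1, 16-wide PE array
hdl = generate_conv_module(kernel_size=3, stride=1, parallelism=16)
```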

(3) The Introduction is not well organized, and the reviewer cannot understand what the authors' purpose is.

(4) Section 2.2 is confusing and cannot be understood. Please provide more detail and describe it clearly. What do S1, S2, and S3 mean? How is Equation (4) derived?

(5) In line 233, please explain clearly how S1·S2/(S3 × 9) can be transformed into multiply and shift operations.
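For reference, the transformation the reviewer asks about is the standard rescaling trick used in integer-only quantized inference: the real-valued factor S1·S2/(S3·9) is approximated as an integer multiplier plus a right shift, so no divider is needed in hardware. A minimal sketch, with hypothetical values for S1, S2, and S3:

```python
from math import frexp

def quantize_multiplier(real_m: float, bits: int = 15):
    """Approximate real_m as m0 * 2**(-shift), with m0 an integer."""
    m, e = frexp(real_m)          # real_m = m * 2**e, with 0.5 <= m < 1
    m0 = round(m * (1 << bits))   # fixed-point mantissa
    shift = bits - e              # total right shift to apply
    return m0, shift

def rescale(acc: int, m0: int, shift: int) -> int:
    """Apply the scale to an integer accumulator: one multiply, one shift."""
    return (acc * m0) >> shift

S1, S2, S3 = 0.05, 0.02, 0.004    # hypothetical quantization scales
M = S1 * S2 / (S3 * 9)            # real combined scale, about 0.0278
m0, sh = quantize_multiplier(M)
y = rescale(1000, m0, sh)         # integer result close to 1000 * M
```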

(6) In Equations (10) to (15), what is Bias?

(7) In Section 3.1, the kernel transformation matrix is multiplied by 24, so the calculation results should be divided by 24; how do you handle this?

(8) What is the PE architecture? What is the latency of each line in the EWMM module?

(9) In Section 4, please analyze in more depth why the performance of the proposed accelerator is better than that of other accelerators.

(10) In Section 4: many factors affect the clock frequency of an FPGA system, not simply the fabrication technology; design techniques also matter. As for why the accelerator based on the Arria 10 can run at a much higher clock frequency, I think a critical factor is that it provides hardened native DSP blocks, which can run at a much higher frequency than the Virtex-7 on the VC709.

(11) Major revisions of the English are required to refine the paper. I am sorry that I am not a native English speaker; please ask a native English speaker to refine it.

(a) In line 10, "FPGA become popular" -> "FPGAs become popular"

(b) In line 11, "RepVGG propose re-parameterization…" -> "RepVGG proposed the re-parameterization…"

(c) In line 15, "it is seldom support by FPGA accelerator" -> "it is seldom supported by FPGA accelerators"

(d) In line 15, "Winograd use less…" -> "Winograd uses less…"

(e) In line 18, "which support more" -> "which supported more…"

(f) In line 22, "…compare…" -> "…compared…"

(g) …

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

The article proposes a toolchain that can support multiple CNN models and the exploration of various tradeoffs.

Request to authors:

1. Please improve the language structure used to express the ideas.

2. The abstract comes up short in establishing the scope of the paper, the research problems, etc. It is not clear what problems the authors are addressing.

3. A summary of results in the Introduction would be helpful for understanding the effectiveness of the proposed method.

4. It is not clear how using the proposed toolchain reduces design-exploration time, power, resources, etc. How does the toolchain reduce development time?

A rewrite of the article by improving the flow of the content will help readers better appreciate the work.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Although the paper's writing and English have improved, the English is still very poor, making the paper hard to understand. The reviewer therefore strongly suggests asking a native English speaker to refine the English. Some concerns to further refine the paper are listed below.

(1) The abstract should be concise and show only your main idea and results. Much of the content of the revised abstract is not required. Please revise it.

(2) In the Introduction, the authors repeatedly mention CNNs on edge devices, but the paper is about a toolchain for FPGA-based CNN accelerators and has no relation to edge devices. The reviewer thinks this content is not required; otherwise, the authors should show evidence and evaluation results of the proposal applied to edge devices.

(3) In Fig. 1, how is a different CNN architecture described and input to the HDL generator? Which parts of the 1st bitstream are updated to avoid re-synthesis and re-implementation? From Figure 5, it seems that only the parameters and the instruction memory are updated. Even so, the BRAM should be re-generated using the IP core generator, and the whole system should be synthesized again. In addition, if the kernel size and stride are changed, does the system not need to be synthesized and implemented again? The reviewer is unclear about how the new bit file is generated.

(4) In Figure 5, if only the contents of the BRAM are updated, the computation engine (e.g., the PE array) is not changed. Then how is the computation performed for different kernel sizes using the same PE array?

(5) In Table 4, how is the throughput calculated? 2816 DSPs are used in the case of VGG, and they run at 100 MHz, so the computational throughput is about 2816 × 100 M = 281.6 GOP/s, but your result is about 869 GOP/s. Furthermore, both VGG and RepInception use the same number of DSPs and the same clock frequency, but their throughputs are different. Please explain the reason.
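For reference, the peak-throughput estimate behind this comment can be reproduced as follows; it assumes each DSP issues one operation per clock cycle (counting a multiply-accumulate as two operations would double the figure):

```python
# Sanity check of the DSP-bound peak-throughput estimate (Table 4, VGG case).
dsps = 2816                 # DSP blocks reported as used
clk_hz = 100e6              # 100 MHz clock
ops_per_dsp_cycle = 1       # set to 2 if a MAC counts as multiply + add
peak_gops = dsps * clk_hz * ops_per_dsp_cycle / 1e9
# 281.6 GOP/s at one op per cycle; 563.2 GOP/s counting a MAC as two ops
```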

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

Thank you for addressing the concerns raised about the first version of the article. I appreciate the additional details and explanations. However, please work further on the sentence structure and the flow of ideas in the sections. More editing of the article is required.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Thank you for taking the effort to revise the article.
