Peer-Review Record

An Approach for Matrix Multiplication of 32-Bit Fixed Point Numbers by Means of 16-Bit SIMD Instructions on DSP

Electronics 2023, 12(1), 78; https://doi.org/10.3390/electronics12010078

by Ilia Safonov *, Anton Kornilov and Daria Makienko

Reviewer 1:

Reviewer 2:

Reviewer 3:

Reviewer 4:

Reviewer 5: Anonymous

Submission received: 11 November 2022 / Revised: 21 December 2022 / Accepted: 22 December 2022 / Published: 25 December 2022

(This article belongs to the Special Issue Feature Papers in Computer Science & Engineering)

Round 1

Reviewer 1 Report

Title: An approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP.

The authors have presented an approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP. In my view, the practical analysis supports the presented mathematical method, and the paper is relevant to the special issue. However, the authors should address the following concerns:

1. The writing of the paper needs considerable improvement in grammar, spelling, and presentation. There are grammatical and punctuation errors that the authors should correct.

2. The use of personal pronouns should be totally avoided; this is academic writing, not a storybook.

3. The authors are advised to be precise in the abstract and to structure it as follows: 1) Background 2) Aim/Objective 3) Methodology 4) Results 5) Conclusion. Write 2-4 lines for each and merge everything into one paragraph (200-300 words) without any subheadings.

4. The introduction does not clearly present the problem the paper aims to solve, and the contributions are not well stated; as a result, the paper is very difficult to follow. The motivation and contribution of the paper should be stated more clearly at the end of the introduction, immediately before the structure of the paper.

5. The related work section is very short; an updated and complete literature review should be conducted. Some recent papers that studied similar problems could be discussed to help the readers.

6. The concept of matrix multiplication of 32-bit fixed point numbers should be briefly discussed, with an explanation of how it is used in the proposed model, so that readers can follow the work more easily.

7. It would be worth stating the advantages of the chosen models over others.

8. The authors seem to neglect some important findings in the results achieved in the paper. Please elaborate and explain the results in more detail.

9. The paper should contain a conclusion section in which the authors elaborate on their work to help readers understand it. The results of the study should also be explained in the conclusion. Future directions for this study would help readers who want to work in this area.

I really appreciate the style of presentation of this paper, but the authors need to address the above-mentioned points for possible publication in the journal. I therefore recommend a major revision.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper describes an approach for dense matrix multiplication of 32-bit signed integer and fixed point numbers by means of vector instructions for 16-bit unsigned integers. This is important for porting programs designed to run on a personal computer to a DSP system.
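
As an illustration of the idea summarized in this report, the textbook decomposition of 32-bit operands into 16-bit halves can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' DSP implementation: the helper names (`split16`, `matmul`, `matmul32_via_16`) are hypothetical, the high half is kept signed while only the low half is unsigned, and plain Python integers stand in for SIMD lanes and wide accumulators.

```python
def split16(x):
    """Split a signed 32-bit integer into a signed high half and an unsigned low half."""
    assert -2**31 <= x < 2**31, "operand must fit in 32 bits"
    return x >> 16, x & 0xFFFF  # x == hi * 2**16 + lo


def matmul(A, B):
    """Reference dense matrix product on lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def matmul32_via_16(A, B):
    """Multiply 32-bit matrices using four products of 16-bit matrices (hypothetical helper)."""
    A_hi = [[split16(x)[0] for x in row] for row in A]
    A_lo = [[split16(x)[1] for x in row] for row in A]
    B_hi = [[split16(x)[0] for x in row] for row in B]
    B_lo = [[split16(x)[1] for x in row] for row in B]
    # Four partial products; every operand fits in 16 bits.
    HH, HL = matmul(A_hi, B_hi), matmul(A_hi, B_lo)
    LH, LL = matmul(A_lo, B_hi), matmul(A_lo, B_lo)
    # Recombine with the weights 2**32, 2**16 and 2**0.
    return [[(HH[i][j] << 32) + ((HL[i][j] + LH[i][j]) << 16) + LL[i][j]
             for j in range(len(B[0]))] for i in range(len(A))]


A = [[-2**31, 123456789], [2**31 - 1, -7]]
B = [[65536, -1], [-98765, 2**31 - 1]]
assert matmul32_via_16(A, B) == matmul(A, B)
```

Each of the four partial products involves only 16-bit operands, which is what makes 16-bit multiply instructions applicable; the accumulator width and the all-unsigned variant mentioned in the report are deliberately not modeled here.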

Author Response

Thanks for your comments!

Reviewer 3 Report

The article is basically well written in terms of content and language. Unfortunately, it describes a rather specific solution for a specific (and even unusual) VLIW processor, about which the authors do not write much.
The work has almost no scientific contribution. It can be useful for optimizing code for matrix multiplication, but the results are very difficult to relate to the state of the art.

Many of the references are relatively old (older than 5 years).

Detailed comments:
Line 37: "This DSP is a prototype, and it is not available on a market at present time, however a qualified person is able to generalize our approaches for many modern signal and graphical processors." - this is something of a drawback of the paper.
Line 62: "Section 3 is described approaches..." - to be corrected
Line 157: "each concrete calculation" - concrete - "definition of adjective: existing in a material or physical form;" -> use another word
Line 216: "Figure 1 demonstrates the number of cycles (that can be considered as processing time)" - what do you mean by "cycle"? Is the multiplication or addition performed in a given number of cycles? It is not clear.
Line 228: "One can find a foundation of the arithmetic of numbers with a fixed point in the [4]." -> "One can find a foundation of the arithmetic of numbers with a fixed point in [4]."
Line 230: "For product of fixed-point numbers ... by the number of bits of the fractional part." In fixed-point numbers we assume an integer part and a fractional part with a fixed position of the point. There are many notations with various positions of the point, e.g. Q16 and Q32, where there is a fractional part only. Do you assume a given notation or present a general idea? This is important because in the case of integers the result will be much longer than the operands, while with fractions at most the accuracy will drop and the result will remain a fraction (for example, in line 292 you write about an 80-bit outcome).
Some information on the notation is given in lines 308/311, but before that point the reader does not know this.
Line 239: "We analyzed our program using a simulator and a profiler in order to explore the causes of the delays. This showed that this approach does not employ the VLIW pipeline effectively; only a few instructions executed simultaneously." - Again, you present just your solution for a given (atypical) VLIW processor. Did you use any type of scheduler? It might improve your results.
Line 253: "four matrix multiplication" - does it mean four multiplications or four matrices? To be corrected.
Figures 3 and 4 use different symbol characters. To be unified.
Line 294: "It is worth to note, we can skip calculation of the terms outside black rectangle in Figure 3." - why?
Line 297: "You should not ignore S for exact integer computations with an 80-bit result." What do you mean by "S"? If you have 32-bit integer factors, the result is no longer than 64 bits. Longer accumulators are often used to avoid saturation when accumulating the results, but you do not mention this at all.
Figure 5: Subtruction -> Subtraction
Figure 5: The results are only true for your specific VLIW processor. It is not known how fast it performs addition, multiplication, or register shifts (or copying from/to memory). If you add these details, the results can be generalized more easily. (I found some information in line 377 - it should be given where you present the first results.)
Line 323: having several DSP -> having several DSPs
Line 328: There are two optimization tricks, maybe there are more tricks, but you propose just two.
Line 343: "it is possible to find a more optimal way" - "Optimal" is a very strong word. Something can be optimal (under a given optimization parameter) or not. It can't be less or more optimal.
6. Approaches for parallel implementation - Ideas presented in this section are not supported by any results or calculations. In line 368 you write "Parallel implementation on several IP cores allows to achieve much faster processing speed." - again, no proof. The idea of parallelization has many limits and additional costs.
Line 379: "because total number of operations is a higher in comparison" -> "because total number of operations is higher in comparison"
Line 393: "An instruction requires a cycle of a processor." - Any instruction? Typically, a VLIW processor can perform more than one instruction per cycle.
Line 404: "This is pretty good." - please avoid such general conclusions.
Line 409: dense and sparce -> dense and sparse
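
The remarks above on fixed-point notation (line 230) and on result width (line 297) can be made concrete with a small numeric sketch. It is only an illustration under stated assumptions: the helper names `to_q`, `q_mul` and `signed_bits_for_sum` are hypothetical, the Q15 format is chosen arbitrarily, and nothing here reproduces the formats or accumulator sizes actually used in the manuscript.

```python
def to_q(x, frac_bits):
    """Convert a real number to a fixed-point integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def q_mul(a, b, frac_bits):
    """Multiply two fixed-point numbers that share the same format."""
    # The raw product carries 2 * frac_bits fractional bits, so shift back.
    # The right shift truncates toward negative infinity, which costs accuracy.
    return (a * b) >> frac_bits


def signed_bits_for_sum(n_terms):
    """Signed bit width that holds a sum of n_terms worst-case 32x32-bit products."""
    worst_product = (-2**31) * (-2**31)  # 2**62, the largest possible magnitude
    return (n_terms * worst_product).bit_length() + 1  # +1 for the sign bit


half, quarter = to_q(0.5, 15), to_q(0.25, 15)
assert q_mul(half, quarter, 15) == to_q(0.125, 15)  # fraction stays a fraction
assert signed_bits_for_sum(1) == 64     # a single product fits in 64 bits
assert signed_bits_for_sum(1024) == 74  # accumulating 1024 products needs more
```

A single product of 32-bit factors fits in 64 bits, whereas accumulating many such products, as a matrix row-by-column sum does, requires extra headroom that grows with the logarithm of the number of terms.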

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

The paper is well written and discusses the approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP.

1. The paper needs to show the instruction flow and give more details on the SIMD instructions and the DSP.

2. A benchmark against other related works is needed.

3. Only one performance parameter, the number of cycles, has been examined. A single parameter will not show the full performance picture; it would be better to examine another performance parameter or to reflect the number of cycles in the title.

4. The paper is good and worth publishing.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

The authors should consider the following references to enrich the Introduction and compare their contributions:

· P. Kamranfar, S. A. Shahabi, G. Vazhbakht and Z. Navabi, "Configurable Systolic Matrix Multiplication," 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems, 2014, pp. 336-341, doi: 10.1109/VLSID.2014.64.

· H. De Silva, J. L. Gustafson and W. -F. Wong, "Making Strassen Matrix Multiplication Safe," 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 2018, pp. 173-182, doi: 10.1109/HiPC.2018.00028.

· M. Shanmugakumar, V. S. M. Srinivasavarma and S. Noor Mahammad, "Energy Efficient Hardware Architecture for Matrix Multiplication," 2020 IEEE 4th Conference on Information & Communication Technology (CICT), 2020, pp. 1-6, doi: 10.1109/CICT51604.2020.9312050.

· C. Misra, S. Bhattacharya and S. K. Ghosh, "Stark: Fast and Scalable Strassen’s Matrix Multiplication Using Apache Spark," in IEEE Transactions on Big Data, vol. 8, no. 3, pp. 699-710, 1 June 2022, doi: 10.1109/TBDATA.2020.2977326.

1. The authors selected a DSP IP core, but they provided only brief information about its features. Please provide more details of this computing platform and a reference for the IP employed.

2. Figure 1 shows the number of cycles required for matrix multiplication. However, it is not specified what assumptions are made. For example, a multiplication of two numbers does not necessarily take one clock cycle, because the values must first be loaded from memory into the CPU's internal registers. Once the multiplication is computed, the result needs to be stored back in memory. All of this takes more than one clock cycle and therefore increases the total processing time. Consequently, Figure 1 does not represent the total processing time; it only represents the number of operations needed.

3. Which instrument or tool did you use for cycle measurements? It would be helpful to explain this. You could use a table with the setup specifications, parameters, and assumptions.

4. In lines 239-240, the authors mentioned that they use a simulator and a profiler, but they don’t provide information about these tools.

5. Is the simulator used cycle accurate?

6. Figure 5 is cropped.

7. In Figure 5, change the word “Subtruction”

8. In order to have a clearer comparison, you can use a semi-log graph for the cycle counting.

9. In lines 329-331, the authors claim that “thereat loading data from local memory to registers is performed several times faster”. Could you explain what type of local memory you are referring to (RAM, cache, or TCM)?

10. In lines 361-362, the authors mention “copying from the global to the local memory”. Is the “global memory” RAM? Is the “local memory” cache or TCM?

11. If you are using a DSP with cache memory, how does the size of the cache memory affect the processing time due to cache misses? Is Figure 1 still valid?

12. To consolidate the results and show a clear comparison with other works in terms of computational complexity, etc., could you provide a table with this information? See, for example, the recent paper of M. Shanmugakumar et al. (the complete reference is above). Another example is Table 5 of C. Misra et al. (the complete reference is above).
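
The distinction drawn in point 2 between operation counts and elapsed cycles can be illustrated with a back-of-the-envelope cost model. It is a rough sketch only: the function names and all latency parameters are hypothetical, not measurements of the DSP in the manuscript, and the model assumes a naive triple loop with the accumulator held in a register and no overlap between memory accesses and arithmetic, which a VLIW pipeline would partly hide.

```python
def naive_matmul_counts(n):
    """Operation counts for a naive n x n matrix product (triple loop)."""
    return {
        "mac": n**3,        # one multiply-accumulate per inner iteration
        "loads": 2 * n**3,  # one element of each input matrix per iteration
        "stores": n**2,     # one store per output element
    }


def estimated_cycles(n, mac_cycles=1, load_cycles=1, store_cycles=1):
    """Sequential cycle estimate; latencies are hypothetical parameters."""
    c = naive_matmul_counts(n)
    return (c["mac"] * mac_cycles
            + c["loads"] * load_cycles
            + c["stores"] * store_cycles)


# Counting only arithmetic versus also charging for memory traffic.
assert estimated_cycles(4, load_cycles=0, store_cycles=0) == 64
assert estimated_cycles(4) == 208
```

Even with unit latencies, charging for loads and stores more than triples the estimate in this toy model, which is why the measurement assumptions requested in points 3 to 5 matter when interpreting a cycle-count figure.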

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Title: An approach for matrix multiplication of 32-bit fixed point numbers by means of 16-bit SIMD instructions on DSP.

The authors have greatly improved the manuscript by answering most of my concerns, but the authors still need to work on these minor comments before acceptance for publication.

1. The comments at the end of the conclusion, “Among other findings of this paper we would like to emphasize the following:”, should be moved from the conclusion section to the end of the introduction, before the paper organization.

2. The authors need to include the limitations of the proposed system in the conclusion section.

3. The authors should mention the future scope of the present work in the conclusion section.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The article has been significantly improved, the authors positively took into account all general and detailed comments.

Author Response

Thanks for your review! It allowed us to improve the quality of our paper.
