Article
Peer-Review Record

Arabic Emotional Voice Conversion Using English Pre-Trained StarGANv2-VC-Based Model

Appl. Sci. 2022, 12(23), 12159; https://doi.org/10.3390/app122312159
by Ali H. Meftah 1,*, Yousef A. Alotaibi 1 and Sid-Ahmed Selouani 2
Reviewer 1:
Reviewer 2:
Submission received: 10 October 2022 / Revised: 31 October 2022 / Accepted: 22 November 2022 / Published: 28 November 2022
(This article belongs to the Special Issue Audio, Speech and Language Processing)

Round 1

Reviewer 1 Report

The paper is interesting; an experimental design and evaluation were performed, and the results are promising and give insights for future work.

The following are the main suggestions for further improvement; others are left in the attached file.

- revise the paper and references for consistent format and style

- add the main results to the abstract

- remove the period appearing before citations (e.g., ".[ ]")

- cite the claim "...and autoencoder are the most common techniques used in EVC frameworks."

- define "ASR" and "F0" at first use in the introduction section

- check https://ieeexplore.ieee.org/document/7073218 for related preliminary work

- Figure 1: check the caption of the StarGANv2-VC framework

- revise the table and figure captions to be more expressive

- revise the tables' presentation for better readability

- the dataset has 5 emotion classes, while your experiments use 4; why?

- "Due to the significant effort and resources required for subjective evaluation" ... to mention/describe this approach details at least , even if not considered, so reader can understand the challenge

- Figure 9 is not cited/referenced in the text

Comments for author File: Comments.pdf

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper discusses Arabic emotional speech and its conversion from one emotion to another using a model trained for another language. A convolutional recurrent neural network (CRNN) is proposed for this purpose. The following points need to be considered:

- The summary of the results needs to be added to the abstract.

- It is not clear to me how you trained the model. Also, did you use the Arabic language for training? The title is a little bit confusing.

- Figures 2 and 3 are not that clear; for example, what does 20 represent in Figure 2, and what do the colors indicate? Also, the histograms in Figure 3 need an explanation.

- How does the following statement get classified?

ʔassaadaat batˤalul ħarbi wassalaam

- Table 5 and Figures 5, 6, 7, and 8 need more elaboration.

- The results do not seem that good. Is there a reason for that?

- Summarize the results in the conclusion section.

- Overall, the work is okay, but the description of the technical work may need more effort.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I do not have any further comment. 
