Article
Peer-Review Record

Multi-Scale Feature Learning for Language Identification of Overlapped Speech

Appl. Sci. 2023, 13(7), 4235; https://doi.org/10.3390/app13074235
by Zuhragvl Aysa, Mijit Ablimit * and Askar Hamdulla
Submission received: 15 February 2023 / Revised: 21 March 2023 / Accepted: 23 March 2023 / Published: 27 March 2023
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

The paper presents a combination of a squeeze-and-excitation network with Res2Net for the language identification task. The proposed method is shown to improve performance compared to the baseline systems. However, the writing is very poor. It requires proofreading: most of the time, the authors were not able to write proper English sentences that can be easily understood.
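
For orientation, the squeeze-and-excitation (SE) mechanism named above can be sketched as follows. This is a minimal illustration of the generic SE idea (global average pooling per channel, a small two-layer gating network, then channel-wise rescaling), not the manuscript's SE-Res2Net implementation. The function name, the weight matrices `w1` and `w2`, and the omission of bias terms are assumptions made for brevity.

```python
import math

def squeeze_excite(feature_maps, w1, w2):
    """Apply a generic squeeze-and-excitation gate.

    feature_maps: list of C channels, each a flat list of activations.
    w1: hypothetical reduction weights, shape (C/r) x C.
    w2: hypothetical expansion weights, shape C x (C/r).
    Returns (rescaled feature maps, per-channel gates).
    """
    # Squeeze: global average pooling turns each channel into one number.
    z = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, step 2: fully connected layer followed by a sigmoid,
    # giving one gate in (0, 1) per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every activation in a channel by that channel's gate.
    scaled = [[g * x for x in channel] for g, channel in zip(gates, feature_maps)]
    return scaled, gates

# Two channels, reduced to one hidden unit and expanded back to two gates.
maps = [[1.0, 3.0], [2.0, 2.0]]
scaled, gates = squeeze_excite(maps, w1=[[0.5, 0.5]], w2=[[1.0], [-1.0]])
```

With these illustrative weights the first channel receives the larger gate, so it is attenuated less than the second.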

- Abbreviations in the title. What does SE stand for in the title? If possible, please avoid abbreviations in the title.

- Abbreviations in the abstract. If you want to use abbreviations in the abstract, you should spell them out the first time they appear.

- Line 43: "In the training process, the feature reflects the distinguish ability of the language; then, a classification model is trained 2."
  This sentence is grammatically incorrect. There are many grammatical errors throughout the manuscript, which make it difficult to read. Please proofread the paper before resubmission.

- Line 51: "In this paper, a language identification method based on the SE-Res2Net network is proposed to improve ..."
  What is the SE-Res2Net network? What does SE stand for? What is the reference?

- Line 53: "Experiments are conducted on the AP17-OLR dataset and a multilingual cocktail party dataset."
  What is the AP17-OLR dataset? What do AP and OLR stand for? What is the reference?

- Line 63: "..., and the feature can reflect the distinguish ability of the language; then ..."
  Why are you always using the phrase "distinguish ability"? What does it mean? It is grammatically incorrect.

- Line 94: "Later, Gonzalez-Dominguez et al. [10] proposed a long ..."
  Reference [10] does not contain an author named Gonzalez-Dominguez. There is a Gonzalez-Dominguez in references 2 and 9.

- Line 121: "In this paper, we take CNN-CBAM-BiLSTM [20] model ..."
  What is CBAM? This is the first time it is mentioned, and you have not stated what the abbreviation stands for.

- Line 232: "Solves the problem of degradation of deep networks, which are difficult to train."
  What does this sentence mean? It is grammatically incorrect.

- Line 385: "The performance evaluation metrics used in the experiments are also Accuracy, Precision, Recall and F1 score values."
  Why are you using "also" in this sentence? This is the first sentence in the first paragraph in a subsection. Why would you write it this way?

- Line 478: "Task 2: In this section, we experiment with CNN-CBAM-BiLSTM on a multilingual cocktail party dataset with a target language weight of 1.2 and a non-target language weight of 1, when the overlap is set to 100%, at which point the model accuracy is 64%, and the recognition results of the language identification network are shown in Table 3."
  It is very difficult to comprehend the structure of this sentence. Further, what do a target language weight of 1.2 and a non-target language weight of 1 mean? How are you using weights for languages? How is the overlap defined?

- Line 506: "We use Focal Loss to solve this problem effectively and improve the performance of the model."
  What is focal loss? Where is the formulation defined? This is the first time it is mentioned. Is there a reference for it?
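
On the last point above: focal loss is commonly defined, following Lin et al. (2017), as FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below illustrates that standard definition only; it is not the formulation used in the manuscript, and the function names, example probabilities, and the default values of `gamma` and `alpha` are assumptions chosen for illustration.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single sample.

    probs: predicted class probabilities (assumed to sum to 1).
    target: index of the true class.
    gamma: focusing parameter; gamma = 0 reduces to cross-entropy.
    alpha: class-balancing weight (a single scalar here for simplicity).
    """
    # Clamp to avoid log(0) for degenerate predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    # (1 - p_t)**gamma shrinks the loss of well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(probs, target):
    """Plain cross-entropy, expressed as focal loss with gamma = 0."""
    return focal_loss(probs, target, gamma=0.0, alpha=1.0)

easy = [0.9, 0.05, 0.05]  # true class 0 is predicted confidently
hard = [0.2, 0.5, 0.3]    # true class 0 is predicted poorly

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

The ratio of focal loss to cross-entropy is much smaller for the easy sample than for the hard one, which is how the loss shifts training emphasis toward hard or under-represented samples.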

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

In this paper, an algorithm for language detection is presented. This algorithm is a modification of the CNN-CBAM-BiLSTM network. The main difference in the proposed algorithm (the SE-Res2Net-CBAM-BiLSTM network) is the feature extraction module.

This paper is interesting, but some parts can be improved:

1. On page 2, lines 78-79, a reference is required.

2. There is no reference to or mention of Transformer networks. It would be desirable for the authors to mention this type of network, since it is one of the most popular in natural language processing. The authors are not asked to make an experimental comparison with the Transformer network, only to mention articles that use it for language detection.

3. It is desirable to explain the CNN-CBAM-BiLSTM network, in particular the CBAM and BiLSTM modules in section 3. 

4. Improve the resolution of Figure 1.

5. In section "4.2 Net work parameters" (lines 375, 377), a description of the platform used is presented. It is desirable to include a description of the platform's CPU.

6. Why, in Figures 8 and 9, does the number of epochs run only up to 30 and not cover the entire X axis? Is it a scaling problem on the X axis?

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

The manuscript entitled “Multi-scale SE-Res2Net-based language identification on overlapped speech” by Aysa et al. aims to develop a multi-scale language identification feature extraction technique for Oriental languages. The orientation of this work is very interesting: due to the popularity of the mobile internet, people publish diverse media content such as multilingual and multi-dialectal audio and video.

However, there are several points to improve before it is considered for publication:

0. Abstract: The accuracy achieved with the new technique should be shown, not the % improvement.

1. Introduction: It does not specify the relevance and popularity of this research area. For easier reading, it would be better to divide it into two subsections (one for language identification techniques and a new one for the importance of this research area). In addition, the following must be referenced:

·       SE-Res2Net network in line 51 and AP17-OLR dataset in line 53.

2. Related Work: For the readers' benefit, it would be better to add a table with the characteristics of the different language identification techniques. Also, the following must be referenced:

·       Lines from 58 to 69 and lines from 75 to 78.

3. New section: It is very hard for the reader to follow the methodology. Perhaps a new section (or subsection) with an explanatory diagram or figure would be helpful (including the tasks). The methodology used needs to be better explained and justified in order to fully understand the results obtained.

4.1. Data collection:

·       Further justification should be given as to why only Oriental languages have been selected and no other Western languages. Would it be valid for other languages as well?

·       It does not specify/reference why speech data were divided according to a 7:2:1 ratio (line 365).

·       Justification should be given on why the original data is cut by 4s (line 370). Would it not seriously affect the meaning of the words/sentences? Shouldn't the ultimate goal be not only to detect the language but also the meaning? Is it a limitation?

5. Conclusions and Future Work: The possible use or non-use of this technique in other Western languages as well as for the meaning of the words/sentences should be discussed.

In sum, with the above suggestions and corrections it may be an interesting paper to be published.
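
To make the data-preparation points under 4.1 concrete, the following is a minimal sketch of a 7:2:1 split and of cutting a recording into fixed 4 s segments. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the random shuffling with a fixed seed, the mapping of 7:2:1 to train/validation/test, and the choice to drop a trailing remainder shorter than 4 s are all hypothetical.

```python
import random

def split_7_2_1(items, seed=0):
    """Shuffle items and split them 7:2:1 (assumed train/validation/test)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remaining items
    return train, val, test

def cut_into_segments(samples, sample_rate, seconds=4.0):
    """Cut a waveform into non-overlapping fixed-length segments.

    A trailing remainder shorter than one segment is dropped (an assumption).
    """
    seg_len = int(sample_rate * seconds)
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]

train, val, test = split_7_2_1(range(100))
segments = cut_into_segments(list(range(10)), sample_rate=1, seconds=4.0)
```

Because segments are cut at fixed positions, words can be split across segment boundaries, which is the concern the reviewer raises about the 4 s cut.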

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

The submitted manuscript is devoted to the problem of language identification. The research aims to increase the accuracy of voice language identification in various complex acoustic scenarios. The proposed approach uses the feature extraction technique of multi-scale language identification based on a multilingual data set. Experiments showed that the accuracy of the proposed model increased by about 3% on the oriental language dataset and 11% on the multilingual test dataset. The relevance and novelty of the work are at a high level. However, the information in the article can be considered poor quality and therefore raises several questions.

1. Very unsatisfactory design of the article! There are many misspellings and inaccuracies that make it difficult to understand the text. For example, in the Introduction section, references to literature sources are given simply by numbers (without parentheses).

2. In the Introduction section, the article’s contributions should be highlighted more strictly, and the article’s structure should be improved.

3. It needs to be clarified what the last paragraph of the Related Work Section (Lines 121-125) is dedicated to.

4. Line 109 - what does the expression [1314] mean? A comma may be needed.

5. In Section 3, the figures are incorrectly shown (the figure is on one page, and the caption of the figure is on the other), and the formula is incorrectly inserted into the text - it is impossible to read and understand them. There are references to related works without parentheses.

6. In Section 4, Table 2 contains rows of different heights; the reference to the table is numbered (Line 395), unlike in the table header.

7. Moreover, Section 4 should be renamed as Experiments and Discussion.

8. Tables 3, 4, and 5 contain some artifacts at the beginning.

9. In the Conclusion section, the proposed method has no limitations. It is desirable to pay attention to limitations in the Discussion as well.

To sum up, the submitted manuscript might be accepted after minor revisions based on the reviewer’s comments.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The paper proposes using an SE-Res2Net module to improve a baseline CNN-CBAM-BiLSTM language identification network.

I thank you for the response and the effort to revise the paper. However, I strongly suggest that the authors proofread the manuscript. There are still many grammatical errors and strangely constructed sentences that make it difficult to read.

Several points that I found:

- On abstract: "Experiments on Oriental language data sets and the multilingual cocktail party dataset showed that the model showed a significant improvement in recognition ..."

--> Please revise the two uses of "showed"; the repetition confuses readers. Further, in the abstract, it is better to use the present tense, as you do in the following sentences. Why are you using the past tense in this sentence?

- Line 36: "... in the field of brilliant speech processing, ..."

--> What is "brilliant speech processing"? This is the first time I have ever heard such a term.

- Line 38: "The emergence of deep learning has provided new opportunities for the development of language identification [4]. And in recent years, language identification in realistic noisy environments has received more attention."

--> Don't use "And" to start a sentence. Merge these into one sentence, or revise the second sentence.

- Line 226: "Using the residual structure allows the network to be deeper, converge faster and optimize more easily, while having fewer parameters and less complexity compared to previous models. Solves the degradation problem of deep networks that is difficult to train."

--> This was also my comment in the first review. I thought the authors had revised this point. The second sentence, starting from "Solves the degradation ...", is grammatically incorrect. What or who solves the degradation problem? There is no subject. This is very basic sentence composition.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
