Peer-Review Record

Speech GAU: A Single Head Attention for Mandarin Speech Recognition for Air Traffic Control

Aerospace 2022, 9(8), 395; https://doi.org/10.3390/aerospace9080395

by Shiyu Zhang^†, Jianguo Kong^†, Chao Chen, Yabin Li and Haijun Liang^*

Reviewer 1: Pavol Kurdel

Reviewer 2: Douglas O'Shaughnessy

Submission received: 29 May 2022 / Revised: 19 July 2022 / Accepted: 21 July 2022 / Published: 22 July 2022

(This article belongs to the Special Issue Application of Multidisciplinary Optimization and Artificial Intelligence Techniques to Aerospace Engineering)

Round 1

Reviewer 1 Report

Dear authors,

Thank you for this experimental solution for improving the intelligibility of radio communication between aircraft crews and the ATC controller. The article is shaped somewhat unusually for the field of end processing of radio communication. It does not state exactly where your equipment would be installed in the onboard radio station system, or which circuit of the station would improve radio communication as a result. However, I recognize that assessing the speed and correctness of speech pronunciation is an important problem for ATC. It would therefore be appropriate to show the safety benefits of the improvements from your experiment in this article.

At the same time, I ask you to improve the introduction, where it is very difficult to understand what you want to improve and what the benefit would be of adding a detection and filtering element to the speech communication chain for the recognition and intelligibility of the speech of the ATC controller or aircraft crews.

In the introduction, the abbreviations need to be defined so that the reader knows exactly what the authors mean, or the English aviation terminology should be given alongside the abbreviations used. Although a list of abbreviations appears at the end of the article, the reader gets confused. Please try to improve this in another way. The introduction also outlines the history of speech recording and recognition; it would be appropriate to add an illustration or figure that clearly explains the historical changes and the current state.

As written, the current state recedes into the background behind the historical description of the issue.

In Chapter 2, the authors describe the speech models, but again they do not show how, and by what means, each model is represented.

Line 126 - Figure 1 shows speech recognition systems and the difference between them. Please specify the LM system and what type it is. Please also describe the acoustic model (AM) in more detail in the article, as it is an important parameter for voice intelligibility. Since I do not use FM modulation in communication conversations, what are the errors and how are they distinguished in your system?

I also ask the authors to make this figure more comprehensible: what is decoding, and what is signal detection?

Lines 130–137 - please rewrite this paragraph; it is incomprehensible to the reader.

I ask the authors to add a figure to subsection 3.1:1 that documents the test described in this subsection.

Line 164 - you describe the form of Aishell; please add a model of its shape or a laboratory result.

Line 189 - the image is illegible; not even a caption is visible. Please correct it.

Line 202 - I don’t know if it’s a script error, but Figure 5 is missing. It is not visible at all.

In subsection 3.4, formula (5) has a summation sign in the denominator, and it is not clear to me what is being summed. Could you add this or explain the relationship?

In formula (7), are the subscripts alpha and beta hyperparameters or something else? The text does not explain what each index is or what it means in the formula.

Author Response

Thank you for your suggestions.

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper deals with end-to-end automatic speech recognition (ASR); this study presents an “E2E framework ResNet–GAU–CTC” for Mandarin ASR for air traffic control. It uses transfer learning and data augmentation.

The CER was 11.1% on an expanded Aishell corpus, and was 8.0% on the ATC corpus.
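For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between the reference transcript and the recognized transcript, divided by the reference length. The sketch below assumes that standard definition; the function names and example strings are illustrative and are not taken from the paper, which may compute CER with a different toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character out of five.
print(cer("上升到八千", "上升到九千"))
```

Under this definition a CER of 8.0% corresponds to roughly eight character edits per hundred reference characters.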

The work is reasonable, and the writing is in excellent English. However, the only real novelty is applying a neural network model to Mandarin ATC speech. All the techniques (CNN, LSTM, data augmentation, etc.) are standard in the ASR field. The authors simply combined them in different ways, without any motivation for their various architecture choices. There is a serious lack of justification and explanation for the choices made, and there is no comparison of the authors’ models with any others in the literature.

Specific points:

The abstract has numerous undefined acronyms (ResNet, GAU, CTC, ATC, CER). Indeed, this continues in the text as well: PCVC, GMM-HMM, AM, PM, LM, LVCSR, DNN, etc. This is poor technique. Only at the very end of the paper does one notice the list of abbreviations.

..CTC introduces blank and generates repeated tokens .. - this is what CTC always does; it seems inappropriate to put such explanation in the abstract.

..In the 1950s, Bell Labs developed an ASR system that was able to identify ten spoken digits. - I am not sure one needs to go so far back in ASR history, as a part of the summary of the relevant literature. The history is extremely limited, jumping immediately to 1990.

..used GMM to model the probabilities observed in speech and HMM to model the time sequence of speech. - this is not very helpful to the naive reader who knows little about these terms.

..introduced blank labels .. - in the context of this brief history of ASR, this comment makes little sense; what is the value of these “blank labels”?

..Hinton et al. demonstrated the implementation of LVCSR [2] .. - this is a summary review article, not a research discovery article

..path aggregation .. - this entire ASR history summary is replete with unexplained technical terms, making this section (2nd paragraph of section 1) of little use

..A. modei et al. [5], ..->

..Amodei et al. [5], ..

..theoretical study of speech recognition .. - this is all experimental work, not theory

..domain. the framework .. ->

..domain. The framework ..

..based on the ATC dataset .. - which dataset? No citation is given

..This significantly improved the robustness of ASR for noisy audio signals. - this statement appears to apply to ref. 13, which is a minor paper from 10 years ago dealing with “boosting”, which is rarely discussed any more.

The review in section 2 (first paragraph) is a listing of individual, relevant papers, but there is no attempt to explain how each paper may relate to the others.

Fig. 4a is noted as "Mel spectrogram of ATC corpus.” This is misleading, as it only represents ONE utterance, and the comparison with Fig. 4b is invalid, as the two sentences are surely different.

..function [26] and include .. ->

..function [26] and included ..

..methods of speech signal, .. ->

..methods of speech signals, ..

..time- frequency domain .. ->

..time and frequency domain ..

..invariance of the convolution is used to overcome the diversity of the speech signal. - explain more or better; this is the sort of vague explanation one finds far too often in DNN ASR. It is more a hope than an explanation.

..obtain the highly purified features, .. - in what sense are these “highly purified”? What is this purification process? How to judge success?

..flatten layer plays a bridging role .. - bridging in what sense? How? Again, how to measure success here? Is all this complex architecture simply ad hoc? e.g., try something and see if it improves CER?

..make the backward and forward propagation of information smoother. - “smoother” in what sense? Why is smoother better?

..linear variation in the feature space .. - what is this?

Do not capitalize when an equation does not end a sentence, e.g., after eq. 5.

..to train network. ->

..to train the network.

..traditional greedy algorithm. - give a citation here

..of Aishell corpus and improved Aishell .. ->

..of the Aishell corpus and the improved Aishell ..

..(hereinafter referred .. ->

..(hereafter referred ..

..corpus). the training, .. ->

..corpus). The training, ..

..recordings of North China .. ->

..recordings of the North China ..

In line 262, why is the batch size negative?

..four stacked layers each of BiLSTM and BiGRU models, and 512 hidden units per layer. In addition, there were 24 stacked layers of MHSA+GLU, with each layer using 8 heads of self-attention .. - any motivation for these specific choices of hyperparameters?
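One of the specific points above notes that introducing blanks and generating repeated tokens is simply what CTC always does. As an illustration of that standard behaviour, the sketch below shows the usual CTC collapse rule: merge consecutive repeats, then drop blanks. The blank symbol `_` and the function name are assumptions chosen for illustration; this is not the authors' decoder.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for illustration


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to an output sequence.

    Step 1: merge runs of identical consecutive tokens.
    Step 2: remove blank tokens.
    A blank between two identical tokens keeps both of them.
    """
    merged = [token for token, _ in groupby(path)]
    return [token for token in merged if token != blank]


# The blank between the two runs of "l" is what preserves the double letter.
print("".join(ctc_collapse("hh_e_ll_lo")))
```

Without the blank, the repeated frames `ll` and a genuine double letter would be indistinguishable, which is why every CTC model emits blanks and repeats.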

Author Response

Thank you for your suggestions.

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Dear Authors,

Thank you for incorporating the comments; the changes have clearly improved the submitted article. Please check the author numbering.

(This concerns the numbering in the submitted article: I see some characters that are not in the Aerospace template.)

Author Response

Thanks again for your suggestions and guidance.

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

A few more minor changes are needed:

..In this works, we .. ->

..In this work, we ..

..efficiency. we propose .. ->

..efficiency. We propose ..

..many research has been conducted .. ->

..much research has been conducted ..

..of several ATC speechs .. ->

..of several ATC utterances ..

..spectrogram of several ATC speeches. ->

..spectrogram of several ATC utterances.

..during PCVC. Wang et al. proposed .. ->

..during PCVC, Wang et al proposed ..

..transformation [23] . A effect combination of .. ->

..transformation [23]. An effective combination of ..

..The length of the input .. ->

..The lengths of the input ..

Author Response

Thanks again for your suggestions and guidance.

Please see the attachment.

Author Response File: Author Response.pdf
