Article
Peer-Review Record

An Automatic Speaker Clustering Pipeline for the Air Traffic Communication Domain

Aerospace 2023, 10(10), 876; https://doi.org/10.3390/aerospace10100876
by Driss Khalil 1,*, Amrutha Prasad 1,2, Petr Motlicek 1,2, Juan Zuluaga-Gomez 1,3, Iuliia Nigmatulina 1,4, Srikanth Madikeri 1 and Christof Schuepbach 5
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 30 April 2023 / Revised: 25 September 2023 / Accepted: 26 September 2023 / Published: 10 October 2023

Round 1

Reviewer 1 Report

In this work, an automatic speaker clustering pipeline is proposed, but the framework is very simple. The pipeline aims to solve the problem that conversations between controllers and pilots are often collected in a single channel, making it difficult to train automatic systems on either group separately. However, an air traffic controller on duty typically remains at the position for about two hours, and these communications are recorded; the pilot's speech can also be obtained from the aircraft voice recorder. Given this, the motivation of this work lacks importance for an air traffic control agency. The proposed pipeline comprises four stages: speech segment separation, automatic speech recognition (ASR), speaker role classification, and speaker clustering. The authors should focus on the difficulties of one stage. For instance, the greatest challenge should be ASR for air traffic controllers. The recognition accuracy is expected to be higher than 99% to ensure the safety of an automatic system. Using the speech recognition results, a potential collision risk could be found earlier under the current aircraft situation. Furthermore, how to detect errors against the strict standard phraseology of air talk is also an interesting issue.

Minor editing of English language required

Author Response

Thank you for your comments and feedback. We appreciate the time and effort you have put into reviewing our work. In response to your comments, we would like to address each point raised and provide clarifications where necessary.

The main objective of the proposed pipeline is to address the challenges of developing automatic speech recognition and speaker clustering systems when conversations between controllers and pilots are recorded in a single channel. We acknowledge that the framework may appear simple on the surface, but it involves several complex tasks that require careful consideration.

 

Regarding the stages of our pipeline, we agree that each stage poses unique challenges. Among them, automatic speech recognition (ASR) for air traffic controllers is indeed a complex task. Achieving an accuracy rate higher than 99% for a pipeline that includes ASR would require near-perfect performance at every step of the pipeline, which is difficult to achieve (i.e., the typical error rate of human transcription is around 5%, and this error rate would increase further for noisy and complex data such as ATC). This specific domain presents several challenges. The data collected during air traffic control communications are often noisy (i.e., radio communication through low-quality radio receivers in the case of ATCO2 data), and factors such as background noise, accents, variations in speech patterns, and domain-specific terminology pose significant difficulties in achieving near-perfect recognition accuracy.
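As a back-of-the-envelope illustration of why end-to-end accuracy above 99% is out of reach (a sketch under the assumption of four independent stages, not a figure from the paper): even if every stage matched the roughly 95% accuracy of human transcription, the compounded pipeline accuracy would fall to about 81%.

```python
# Hypothetical illustration: compounded accuracy of a four-stage pipeline
# (SAD, ASR, role classification, clustering), assuming independent errors
# and a 95% per-stage accuracy, in line with the human transcription
# error rate mentioned above.
stage_accuracies = [0.95, 0.95, 0.95, 0.95]

end_to_end = 1.0
for acc in stage_accuracies:
    end_to_end *= acc

print(f"End-to-end accuracy: {end_to_end:.1%}")  # -> 81.5%
```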

 

Furthermore, we would like to highlight that our focus is on automatic (speaker) clustering of the pilots rather than the air traffic controllers themselves. The distinction is important, as it allows us to analyze pilots' behavior and communication patterns for improved situational awareness and incident detection. The clustering process involves dealing with complex and noisy data, which introduces additional challenges. We have employed robust techniques and algorithms to handle these challenges, but it is important to note that achieving perfect clustering accuracy is not always feasible in such data-intensive scenarios. In total, 929 pilots from the ATCO2 data and 189 pilots from the LDC data were used to test the clustering algorithm. We are not aware of any other state-of-the-art technology providing higher accuracy than the one presented in the paper.

Reviewer 2 Report

This paper examines voice processing for Air Traffic Management (ATM), specifically speaker clustering. It uses a pipeline of (i) speech activity detection (SAD), (ii) automatic speech recognition (ASR), (iii) text-based speaker role classification, and (iv) unsupervised speaker clustering. It applies agglomerative hierarchical clustering (AHC) on the ATCO2 corpus and the Linguistic Data Consortium Air Traffic Control corpus (LDC-ATCC). For speaker clustering, it obtains an accuracy of 70% on the LDC dataset and 50% on the noisier ATCO2 dataset.
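For illustration, here is a minimal sketch of how such a four-stage pipeline composes; all function names (`detect_speech`, `transcribe`, `classify_role`, `cluster_speakers`) are hypothetical placeholders, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    audio: bytes          # raw audio of one speech segment
    text: str = ""        # ASR hypothesis
    role: str = ""        # "ATCO" or "pilot"
    cluster_id: int = -1  # assigned speaker cluster

def run_pipeline(recording, detect_speech, transcribe, classify_role, cluster_speakers):
    # (i) speech activity detection: split the single channel into segments
    segments = [Segment(audio=a) for a in detect_speech(recording)]
    # (ii) ASR: generate a transcript for each segment
    for seg in segments:
        seg.text = transcribe(seg.audio)
    # (iii) text-based speaker role classification (ATCO vs. pilot)
    for seg in segments:
        seg.role = classify_role(seg.text)
    # (iv) unsupervised clustering, applied to the pilot segments only
    pilots = [s for s in segments if s.role == "pilot"]
    for seg, cid in zip(pilots, cluster_speakers([s.audio for s in pilots])):
        seg.cluster_id = cid
    return segments
```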

 

The paper is well presented.  The work seems well done.

 

Specific points:

 

..generate the text for an audio segments, .. ->

..generate the text for audio segments, ..

 

..the pilot’s real identities .. ->

..the pilots’ real identities ..

 

..Researchers are actively working on developing a SAD system that can accurately operate in noisy environments. The approach used is based on [4] .. - too narrow a declaration; this implies a single approach for SAD.

 

Also, why cite an arXiv paper (ref.4) for this area, when there are so many refereed journal papers?

 

..robust to language variability. - of what relevance is language to SAD?  All languages use very similar phoneme sets.

 

..Conventional logistic regression and majority voting were employed to combine decisions from different languages. - no citations or details given.

 

For state-of-the-art in ASR, the text cites 3 papers from about 15 years ago.

 

..Current state-of-the-art systems .. - why cite 2 references for the same system? i.e., wav2Vec(2)?

 

..In [19], these groups of words/sentences have similar grammatical properties. Their work .. - ref. 19 is a survey; so to whom does “their” refer?

 

..than the ATCo recordings ..->

 ..than the ATCO recordings ..

 

..formed cluster k: .. - use italics for all math symbols

 

..gold annotations .. - you mean reliable and true?

 

..results includes, ASR, speaker .. ->

..results include ASR, speaker ..

 

..The authors in [39], the authors propose .. ->

..The authors in [39] propose ..

 

..Word Error Rate (WER) metric.  - why redefine WER well after its earlier use?

 

..(i.e. ATCO .. ->

..(i.e., ATCO ..

(Placing the comma also helps with better spacing)

 

..then it decay linearly.  ->

..then it decays linearly.

 

..and, gradient accumulation .. ->

..and gradient accumulation ..

 

..truth of pilot identities were .. ->

..truth of pilot identities was ..

 

..got the same id .. - I would avoid use of “id”

 

..accuracies of 70% and 50% .. - are the authors truly satisfied with these levels of accuracy?

 

..incorporate language identification (LID) as prior .. - is not English universally used for ATC?

 

 

good quality

Author Response

Thank you for your comments and feedback. We appreciate the time and effort you have put into reviewing our work. In response to your comments, we would like to address each point raised and provide clarifications where necessary.

..robust to language variability. - of what relevance is language to SAD?  All languages use very similar phoneme sets.

Language can play a big role in SAD systems, especially when dealing with diverse linguistic environments. Even though the languages share the same phoneme sets, there can still be variations in pronunciation, accent, dialect, and language style across different languages. These variations can have an effect on the acoustic characteristics of speech.

SAD systems need to take into account all these variations to ensure an accurate system that can detect speech segments across the different languages. This means that they may need to be trained on diverse linguistic data to capture all the patterns and characteristics specific to each language.

 



Also, why cite an arXiv paper (ref.4) for this area, when there are so many refereed journal papers?

 

We would like to address the concerns raised regarding our choice of reference for the SAD (Speech Activity Detection) component in our work. We appreciate your valuable feedback and would like to clarify the rationale behind citing the arXiv paper [4] in our manuscript.

Firstly, we acknowledge the importance of citing peer-reviewed journal papers for establishing credibility and reliability. However, we believe that in this particular case, the arXiv paper [4] holds significant relevance to our research and justifies its inclusion.

The arXiv paper [4] is authored by researchers from the esteemed institution, IDIAP, and presents an approach that aligns closely with our SAD methodology. Furthermore, one of the authors of [4] has contributed to our work, adding further expertise and credibility to our approach.




..accuracies of 70% and 50% .. - are the authors truly satisfied with these levels of accuracy?



While accuracies of 70% and 50% may appear modest at first glance, it is important to consider the unique challenges and constraints of our research context. The ground-truth generation (i.e., speaker labeling) process was based on certain hypotheses related to callsign and date information. As a result, the evaluation system classified pilot speech as incorrect if it deviated from the expected callsign or date, even if the underlying speaker remained the same. This strict criterion inherently affected the overall accuracy rates.

 

Furthermore, it is crucial to note that our dataset spans a specific time range, approximately seven months from October 2020 to May 2021. This limited time frame introduces additional complexities, as variations in callsigns or dates within this period were evaluated as inaccuracies. Overall, from a high-level point of view, being able to correctly cluster 50% of the speakers (across 929 speakers for ATCO2 and 189 for LDC data), we consider this performance to be of possible interest to air-traffic management authorities.
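To make the evaluation criterion concrete, here is a sketch (under our own assumptions, not the authors' exact scoring code) of how cluster-level accuracy can be computed against callsign-derived labels: each cluster is assigned the majority callsign among its segments, and a segment counts as correct if its cluster's majority callsign matches its own label.

```python
from collections import Counter

def clustering_accuracy(cluster_ids, callsign_labels):
    """Fraction of segments whose cluster's majority callsign
    matches the segment's own callsign-derived label."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, callsign_labels):
        by_cluster.setdefault(cid, []).append(label)
    majority = {cid: Counter(labels).most_common(1)[0][0]
                for cid, labels in by_cluster.items()}
    correct = sum(majority[cid] == label
                  for cid, label in zip(cluster_ids, callsign_labels))
    return correct / len(callsign_labels)

# Toy example (hypothetical callsigns): one of five segments is misclustered.
print(clustering_accuracy([0, 0, 1, 1, 1],
                          ["DLH4AB", "DLH4AB", "SWR52K", "SWR52K", "DLH4AB"]))  # 0.8
```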

 

For state-of-the-art in ASR, the text cites 3 papers from about 15 years ago...Current state-of-the-art systems .. - why cite 2 references for the same system? i.e., wav2Vec(2)

 

Hybrid HMM-DNN systems were considered state-of-the-art until the development of transformer-based systems. We cite two references because they describe two distinct systems: the original wav2vec and wav2vec 2.0, which extends the original with self-supervised learning over quantized latent speech representations.

 

incorporate language identification (LID) as prior .. - is not English universally used for ATC? 

 

English is commonly used as the international language for Air Traffic Control (ATC) communications. However, there are cases where non-English languages may be used in certain regions or specific situations (in countries where English is not the primary language).

 

Reviewer 3 Report

The manuscript proposes a pipeline for clustering speakers in air traffic control (ATC) communications. This pipeline is shown by performance data and discussions to be able to accurately identify the speakers as well as to assign speech segments to the actual speakers by combining speech activity detection (SAD), also known as voice activity detection (VAD), automatic speech recognition (ASR), speaker role classification based on automatic transcripts and unsupervised speaker clustering.

 

Further, the authors have identified that the speech uttered by the pilots is harder to identify than what is uttered by the controllers due to the noise incorporated in the radio transmission to the recording site. Also, they have tested the system with two corpora, namely, ATCO2 and LDC-ATC, and have obtained better performance with the latter, which is less noisy.

 

Overall, the manuscript is well-written providing plenty of description, data and comments. However, there seems to be some confusion in the description of the algorithm for grouping similar data points as described below.

 

Steps 3 and 4 in the description of the algorithm use k for a distance in 3 and for an index in 4. In 3, since it involves an averaging operation, k will hardly be an integer while in 4, where k is an index, it is supposed to be an integer. Further, what do i' and j' stand for in step 4?

 

The manuscript proposes an important and useful system, which is presented with plenty of performance data and discussions about them as well as instructions about further research. However, some algorithms have to be described more clearly as I point out in my comments to the authors.

 

Author Response

Thank you for your comments and feedback. We appreciate the time and effort you have put into reviewing our work.  In response to your comment, we would like to address the point raised and provide a clarification where necessary.

Steps 3 and 4 in the description of the algorithm use k for a distance in 3 and for an index in 4. In 3, since it involves an averaging operation, k will hardly be an integer while in 4, where k is an index, it is supposed to be an integer. Further, what do i' and j' stand for in step 4? 

 

You're correct; the use of the variable "k" in steps 3 and 4 can be confusing due to its different roles.

In step 3, "k" represents the distance value associated with the newly formed cluster.
In step 4, "k" is an index representing the newly formed cluster, while "i^" and "j^" represent the indices of the objects selected for merging. We have updated this in section 2.4.

Round 2

Reviewer 1 Report

(1) In the revised version, the authors state that "we specifically target the clustering of pilot data collected through very high-frequency (VHF) receivers". Why not use the data collected from an air traffic control agency? In that database, each air-talk communication includes a clear callsign, and a voice detection system could then decide which pilot was speaking. (2) An accuracy of 70% on the LDC dataset was recorded. This may make sense for personal communication equipment, but for an air traffic controller speech recognition system, this detection rate is too low. The authors should revise the proposed method to improve the accuracy.

Author Response

Thank you for your comments and feedback. We appreciate the time and effort you have put into reviewing our work.

In response to your comments:

In our revised version, we specifically targeted the clustering of pilot data collected through very high-frequency (VHF) receivers because it allowed us to work with readily available data sources. While we acknowledge the potential benefits of using data from the air traffic control agency, obtaining such data can be legally complex and require access to the operational control rooms of the air navigation service providers (ANSPs), which may not always be feasible.

 

On the other hand, there are alternative methods for collecting voice communication, such as through activities like LiveATC and ATCO2, which rely on volunteers using VHF radio receivers to capture voice communication freely available on VHF radio channels. This approach readily provides us with thousands of unlabeled transmissions, even though the data are noisier compared to controlled datasets.

 

Regarding the usage of callsigns from the air traffic control agency's database, we understand that it could potentially improve the accuracy of speaker identification. However, in our case, we worked with a dataset that provided transcriptions without explicit callsign information. Therefore, we utilized the available data and extracted callsigns from the transcriptions to use as speaker IDs for our clustering algorithm.

Regarding the use of the LDC dataset, we want to emphasize that the only available information in this dataset is the transcription. Therefore, our approach involved extracting callsigns from the transcriptions and using them as speaker IDs. We acknowledge that this may not provide a perfect solution for speaker identification, but it was a reasonable approach given the limitations of the dataset. By grouping the speech segments based on callsigns, we aimed to create clusters corresponding to individual pilots.

Since the dataset comprises 189 different speakers (callsigns), accurately determining speaker identities becomes more challenging. To validate the reliability of our approach, we conducted tests on a sample of the data to verify whether pilots using the same callsigns were indeed the same individuals. The results of these tests confirmed our assumption that pilots using the same callsigns in the available data were, in fact, the same individuals. However, we acknowledge that with a larger dataset containing 1165 audio files, there may be cases where multiple pilots use the same callsign, leading to potential ambiguity in speaker identification. Despite this added complexity, we leveraged the best available data and applied our clustering algorithm based on callsigns to address the speaker clustering problem in the ATC domain.
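As an illustration of how a callsign can be recovered from a verbalized transcript and used as a speaker ID, here is a simplified sketch; the word lists and the matching heuristic are our own assumptions, and real callsign extraction in ATC tooling is considerably more involved.

```python
import re

# Reduced word lists for illustration (full ICAO alphabet and airline
# designator tables would be used in practice).
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
LETTERS = {"alfa": "A", "bravo": "B", "charlie": "C", "delta": "D"}
AIRLINES = {"lufthansa": "DLH", "swiss": "SWR", "speedbird": "BAW"}

def extract_callsign(transcript):
    """Map a spoken callsign such as 'lufthansa four alfa bravo'
    to a compact ID such as 'DLH4AB' (simplified illustration)."""
    words = re.findall(r"[a-z]+", transcript.lower())
    for i, w in enumerate(words):
        if w in AIRLINES:
            suffix = ""
            for tok in words[i + 1:]:
                if tok in DIGITS:
                    suffix += DIGITS[tok]
                elif tok in LETTERS:
                    suffix += LETTERS[tok]
                else:
                    break
            if suffix:
                return AIRLINES[w] + suffix
    return None

print(extract_callsign("lufthansa four alfa bravo descend flight level one two zero"))
# -> DLH4AB
```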

 

While achieving high accuracy in the ATC domain is a challenging task due to the constraints and limitations posed by the data, our proposed method provides a valuable contribution to speaker clustering in air traffic management communications. We recognize the unique challenges of the ATC domain, such as the limited ground-truth information and the presence of multiple speakers sharing the same VHF radio channel. In light of these challenges, our approach takes into account the available data sources, such as transcriptions, and aims to achieve meaningful clustering results for pilots' speech segments. Additionally, we are not aware of any other state-of-the-art technology providing higher accuracy in the ATC field than the one presented in the paper.




In addition to evaluating the speaker clustering model on ATC domain data, we also tested it on a separate dataset from the telephony domain. This dataset comprised telephony data with real ground truth (the ROXSD data), consisting of 481 recordings. It is important to note that not all of the recordings in this dataset contained intelligible speech, as some of them were failed or interrupted calls. Moreover, a small portion of the files was recorded in mono format. For the experiment on the telephony dataset, we extracted 925 embeddings for the mono channels from the stereo recordings, which formed the test set. The model achieved an impressive accuracy of 92.8% on this telephony dataset. This result highlights the robustness and effectiveness of our proposed speaker clustering algorithm.

 

Round 3

Reviewer 1 Report

I would like to thank the authors for their effort to revise the paper and the overall quality of the manuscript has been improved compared to the original version. However, I still have a few concerns to be tackled as follows:

1. As indicated in the conclusion section, the accuracies reported for the LDC-ATCC and ATCO2 datasets feature a relatively large deviation (20%). Although ATCO2 is reported with significantly more speakers and a higher noise level, such a phenomenon may still lead to confusion regarding the generalization ability of the presented pipeline in the real world. The authors should elaborate on this difference and its potential side effects on the actual implementation of the complete system.

2. I would suggest the authors underscore the unique contributions of this work at the end of Section 1 after summarizing the advantages and disadvantages of some similar works.

3. Some typos, infelicities and inconsistencies need to be corrected. For instance, on Page 1, both "ATCo" and "ATCO" are adopted. On page 8, "selfsupervised" should be "self-supervised". On page 2 (Lines 42-45), the challenges regarding rapid exchange and short duration of utterance can be merged into one sentence.

Author Response

Thank you for your comments and feedback. We appreciate the time and effort you have put into reviewing our work. In response to your comments, we would like to address each point raised and provide clarifications where necessary. 



1- 

We appreciate the reviewer's observation regarding the deviation in accuracy between the LDC-ATCC and ATCO2 datasets. This deviation is indeed important to analyze, and we have elaborated on this point in the revised manuscript. Specifically, we now explain that the variation in accuracy can be attributed to the increased noise level in the ATCO2 dataset (due to processing VHF radio data), which adversely affects various components of the pipeline, including the automatic speech recognition (ASR) system, and in turn the overall performance. We have also shown that while the overall accuracy figures include all pipeline stages, the speaker clustering accuracy alone, when verified against the speaker role classification ground truth, exhibits higher consistency.

 

2-
We have added a dedicated passage at the end of Section 1, as recommended by the reviewer, in which we summarize the contributions of our work. It describes how our method covers all the steps in analyzing ATC communication (voice activity detection, speech recognition, speaker role detection, and speaker clustering) and highlights that, unlike past works, which usually analyzed only one part of the process, our approach addresses the complete ATC communication pipeline. This addition helps readers better understand our contributions.

 

3- We appreciate the reviewer's keen eye for detail. We have carefully proofread the manuscript to address the reported typographical errors and inconsistencies.
