Article

Safety Aspects of Supporting Apron Controllers with Automatic Speech Recognition and Understanding Integrated into an Advanced Surface Movement Guidance and Control System

1 German Aerospace Center (DLR), Institute of Flight Guidance, Lilienthalplatz 7, 38108 Braunschweig, Germany
2 ATRiCS Advanced Traffic Solutions GmbH, Am Flughafen 7, 79108 Freiburg im Breisgau, Germany
3 Fraport AG, Frankfurt Airport Services Worldwide, 60547 Frankfurt am Main, Germany
* Author to whom correspondence should be addressed.
Aerospace 2023, 10(7), 596; https://doi.org/10.3390/aerospace10070596
Submission received: 16 May 2023 / Accepted: 19 June 2023 / Published: 29 June 2023

Abstract: The information air traffic controllers (ATCos) communicate via radio telephony is valuable for digital assistants to provide additional safety. Yet, ATCos have to enter this information manually. Assistant-based speech recognition (ABSR) has proven to be a lightweight technology that automatically extracts and successfully feeds the content of ATC communication into digital systems without additional human effort. This article explains how ABSR can be integrated into an advanced surface movement guidance and control system (A-SMGCS). The described validations were performed in the complex apron simulation training environment of Frankfurt Airport with 14 apron controllers in a human-in-the-loop simulation in summer 2022. The integration significantly reduces the workload of controllers and increases safety as well as overall performance. Based on a word error rate of 3.1%, the command recognition rate was 91.8% with a callsign recognition rate of 97.4%. This performance was enabled by the integration of A-SMGCS and ABSR: the command recognition rate improves by more than 15% absolute by considering A-SMGCS data in ABSR.

1. Introduction

This article is an extended version of [1].
In air traffic control (ATC), there is a permanent need to increase the efficiency of handling air and ground traffic. This need exists especially at highly frequented airports such as Frankfurt. However, increasing efficiency must never come at the expense of safety. An important approach to increasing efficiency and safety on the ground is to support ground traffic controllers’ decision making through digitization and automation, with new digital assistant systems that are integrated in or interoperate with advanced surface movement guidance and control systems (A-SMGCS) [2]. Currently, A-SMGCS already monitor data from different sensors and are designed to enable controllers to guide traffic more safely without reducing the capacity of traffic guidance.

1.1. Motivation

The most advanced digital assistants in apron control today already have access to a large number of sensors. Together with manual inputs from the controller, the digital assistants are able to detect potentially dangerous situations and warn the controller about them. In contrast, voice communication between controllers and pilots, one of the most important sources of information in ATC, has not yet been used by these assistants. Whenever information from voice communication needs to be digitized, controllers are burdened with the additional task of entering this information into the ATC system. Research results show that up to one third of controllers’ working time is spent on these manual entries [3]. This leads to a reduction in overall efficiency, as controllers spend less time optimizing traffic flow [4]. The amount of time spent on manual inputs will even increase in the coming years, as future regulations require more alerting functions to be implemented and, hence, more inputs, especially in apron control; see, e.g., Commission Implementing Regulation (EU) 2021/116 [5].
Assistant-based speech recognition (ABSR) has already shown in the past that it is possible to significantly reduce manual input from controllers by recognizing and understanding the controller–pilot communication and automatically providing the required inputs to digital assistants [6]. ABSR technology has been continuously developed in several projects, and possible fields of application have been identified in the context of research prototypes. So far, there has been no integration into a commercial system to demonstrate that ABSR can also meet the corresponding requirements for safety and usability in a system network commonly used in aviation. In the project STARFiSH (Safety and Artificial Intelligence Speech Recognition), funded by the German Federal Ministry of Education and Research, a powerful artificial intelligence (AI)-based speech recognition system was integrated into a modern A-SMGCS for apron control [1]. The solution was intended to reduce the additional workload of controllers as much as possible by using speech recognition and understanding capabilities. At the same time, the solution was intended to be objectively safe and to be rated as reliable, easy to use, and safe by the controllers.

1.2. Related Work

This section outlines related work for automatic speech recognition (Section 1.2.1) and understanding applications (Section 1.2.2) in air traffic control and closes with related work on how both are used to automatically fill digital flight strips or radar labels with voice information from the controller–pilot communication in Section 1.2.3.

1.2.1. Early Work on Speech Recognition in Air Traffic Control

Speech recognition in general has a long history of development. It started in 1952, when Davis, Biddulph, and Balashek of Bell Laboratories built a single-speaker digit recognition system called “Audrey” [7]. Over the last 70 years, technological advances have led to dramatic improvements in the field of speech recognition. An overview of the first four decades is provided by, e.g., Juang and Rabiner [8]. Connolly of the FAA was one of the first to describe the steps of using automatic speech recognition (ASR) in the air traffic management (ATM) domain [9]. In the late 1980s, a first approach to incorporating speech technologies in ATC training was reported [10]. Such developments led to enhanced ASR systems, which are used in ATC training simulators to replace expensive simulation pilots, e.g., at the FAA [11,12], DLR [13], MITRE [14], and DFS [15].
The challenges with ASR in ATC today go beyond basic training scenarios, where standard procedures and ICAO phraseology [16] are often followed very closely. Modern ASR applications have to recognize experienced controllers, who deviate more often from the mentioned standards. Furthermore, ASR applications also go beyond the scope of simulation and training. ASR is, for example, used to obtain more objective feedback on controllers’ workload [17,18]. A good overview of the integration of ASR in ATC is provided in the two papers by Nguyen and Holone [19,20].

1.2.2. Speech Recognition and Understanding Applications in Air Traffic Control

In the recent past, research projects developed prototypical applications with speech recognition and understanding for all ATC domains from en route [21], via approach [4], to tower and ground [22]. These prototypes support controllers in maintaining aircraft radar labels [23] and flight strips [22] to reduce workload, recognize and highlight aircraft callsigns [24], build safety nets for tower control [25], or even offer automatic readback error detection with reports to controllers [26,27]. The systems have matured to recognize words and meanings of real-life controller and pilot utterances even beyond the simulated environments [28]. The rules of how a speech understanding system can annotate the meaning conveyed with ATC radio transmissions are defined in an ontology that was agreed between major European air traffic management stakeholders in 2018 [29]. With this ontology, different word sequences can be mapped to unique word sequence meanings, e.g., the word sequences “lufthansa zero seven tango taxi via november eight and november to stand victor one five eight” and “zero seven tango via november eight november victor one five eight” both correspond to the same annotation in the ontology “DLH07T TAXI VIA N8 N, DLH07T TAXI TO V158”.

1.2.3. Related Work for Pre-Filling Flight Strips and Radar Labels

The information which air traffic controllers communicate via radio telephony is valuable for digital assistants to provide additional safety. Yet, controllers are usually burdened with entering this information manually. Assistant-based speech recognition (ABSR) has been shown to be a lightweight technology that automatically extracts ATC communication content without additional human workload and successfully feeds digital systems [6]. DLR, together with Austro Control, DFS, and other European air navigation service providers, has demonstrated that pre-filling radar labels supported by automatic speech recognition and understanding reduces air traffic controllers’ workload [3] and increases flight efficiency with respect to flight time and kerosene consumption. Fuel burn can be reduced by 60 L per aircraft in the approach phase [4]. DLR, Austro Control, Thales, and the air navigation service provider of the Czech Republic have redesigned this exercise with a commercial off-the-shelf speech recognizer and an industrial radar screen. The exercise results clearly showed that speech recognition, i.e., obtaining the sequence of words from a voice signal, is not enough [30]. Speech understanding is needed to provide information for flight strips and radar labels.
Recently, DLR and Austro Control analyzed the safety aspects of using speech recognition and understanding for pre-filling radar label contents. They investigated how many of the verbally spoken approach controller commands, with and without speech recognition, were finally entered into the ATC system and how many errors were made, not recognized, or not corrected by the air traffic controllers. Despite manual corrections of commands, even with speech recognition and understanding support, about 4% of the spoken commands were still not correctly entered into the system. However, this result, which is initially alarming from a safety point of view, is quickly put into perspective when considering that roughly 10% of the verbally spoken commands are incorrectly entered or not entered at all into the system if no speech recognition and understanding support is available. More details are provided in [23]. The results show that speech recognition and understanding [31] is far from being perfect, but a system without speech recognition and understanding seems to be even further away.
The main input sources for this paper, which describes the results of transferring the support tool for approach controllers to apron controllers, were two DLR studies: one from 2015 for the Dusseldorf approach [4] and a recent one for the Vienna approach control [23]. It was expected that the good results of command recognition in the approach area would translate one-to-one to the correctness and completeness of inputs in the apron area. Previous projects have already taken first steps towards using speech recognition and understanding in a tower or apron environment, which included, for example, the prediction of potential controller commands [32]. The actual use of speech recognition and understanding was then further investigated in a multiple remote tower setup [33]. In the process, relevant information for digital flight strips was automatically derived and entered from the given verbal commands. The tower environment already covered many of the command types relevant for ground/apron traffic, such as taxi, hold short, and pushback instructions.

1.3. Paper Structure

Section 2 summarizes the use case of supporting apron controllers and the iterative software development approach, and introduces the Software Failure Modes, Effects, and Criticality Analysis. Section 3 describes the final version of the evaluation system. Section 4 explains the validation of the developed application. Section 5 presents the validation results before Section 6 and Section 7 finalize the paper with discussions and conclusions.

2. Materials and Methods

2.1. Application Use Case of Supporting Apron Controllers

Initially, the apron controller, shown as “ATCo” in Figure 1, issues a command to the pilot by radio. Without ABSR (Figure 1, left), the controller enters this command into the A-SMGCS manually before, after, or in parallel to the radio call so that the system can provide automation functions. With ABSR (Figure 1, right), an ABSR system automatically generates, based on the radio call, a data packet including metadata from the command, which is sent to the A-SMGCS. The A-SMGCS executes valid commands and highlights the changes together with the associated aircraft symbol. No system interaction by the controller is required unless an error has occurred. If the automatic speech recognition fails, the controller needs to manually correct or enter the command in the same way as without the ABSR system.
The automatic recognition of voice commands issued by controllers to pilots addresses the problem that controllers are less able to keep an eye on the traffic while they enter the information from the radio call that modern A-SMGCS support functions require. This includes, for example, entering taxi routes so that compliance can be monitored automatically.
From a technical point of view, the sequence of actions is:
  • Commands given via voice by the controller to the pilot are recorded as an audio data stream (A/D conversion of utterances).
  • The audio stream is divided into sections by detecting individual transmissions in the audio data.
  • Speech-to-text (S2T) transformation is applied on the resulting audio sections. S2T is based on neural networks trained with freely available data as well as with domain-specific recorded audio data for the target environment.
  • Relevant ATC concepts are automatically extracted from the S2T transcription using rule-based algorithms on a previously defined ontology and traffic data fed from the A-SMGCS.
  • High-level system commands are generated from the extracted ATC instructions, using rules that interpret operational necessities according to the current traffic situation, and are fed into the system.
  • The changes to the system state resulting from the high-level system commands are presented to the human operators.
  • Human operators can correct or undo the automatic inputs.
We explain these steps by an example:
  • The apron controller is continuously speaking to the pilots with some gaps in between, e.g., “… to seven five seven from the left… lufthansa four two two good morning behind opposite air france three twenty one continue november eight lima hold short lima six … austrian one foxtrot behind the passing”. The gaps occur either because no further action is required or due to the verbal response of the (simulation) pilot, which is not available to the ABSR.
  • The audio stream sections are detected, and one continuous transmission could then be “lufthansa four two two good morning behind opposite air france three twenty one continue november eight lima hold short lima six”.
  • Let us assume that the result of S2T contains some errors and results in the word sequence: “lufthansa four to two good morning behind opposite air frans three twenty one continue november eight lima holding short lima six” (recognition errors: “to” instead of “two”, “frans” instead of “france”, and “holding” instead of “hold”).
  • The relevant ATC instructions, extracted by ABSR even with the errors from S2T, would be:
    • DLH422 GREETING;
    • DLH422 GIVE_WAY AFR A321 OPPOSITE;
    • DLH422 CONTINUE TAXI;
    • DLH422 TAXI VIA N8 L;
    • DLH422 HOLD_SHORT L6.
  • The GREETING is ignored by the A-SMGCS. For the GIVE_WAY instruction, the A-SMGCS may determine that the A321 approaching from the opposite direction has the callsign AFR2AD. A symbol is generated in the human machine interface (HMI) of the apron controllers (and the simulation pilots), showing that DLH422 is waiting until the AFR2AD has passed. The CONTINUE instruction is executed after the give-way situation is resolved. The route along the taxiways N8 and L is shown. A hold short (stop) is displayed before taxiway L6.
  • In summary, the following visual output is shown to the apron controller:
    • The aircraft symbol of DLH422 is highlighted;
    • A GIVE_WAY symbol between the two aircraft;
    • The taxi route via N8 and L;
    • A HOLD_SHORT symbol (stop) at L6.
  • The apron controller can accept or reject the three options above (the give-way, the taxi route, and the hold short) or can change some or all of them; a minimal data-structure sketch of these extracted instructions follows below.
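To make the hand-over between speech understanding and the A-SMGCS more tangible, the following is a minimal, hypothetical Python sketch of how the instructions extracted in the example above could be represented as structured data. The field names and parsing logic are illustrative assumptions, not the actual TowerPad™ interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AtcInstruction:
    """One extracted ATC concept in the ontology format, e.g. 'DLH422 HOLD_SHORT L6'."""
    callsign: str                      # e.g. "DLH422"
    command: str                       # e.g. "TAXI", "HOLD_SHORT", "GIVE_WAY"
    values: List[str] = field(default_factory=list)   # e.g. ["N8", "L"]
    qualifier: Optional[str] = None    # e.g. "VIA", "TO"

def parse_annotation(line: str) -> AtcInstruction:
    """Split a flat annotation string into its parts (illustrative only)."""
    tokens = line.split()
    callsign, command, rest = tokens[0], tokens[1], tokens[2:]
    qualifier = rest[0] if rest and rest[0] in {"VIA", "TO"} else None
    values = rest[1:] if qualifier else rest
    return AtcInstruction(callsign, command, values, qualifier)

# The instructions extracted in the example above:
for line in ["DLH422 GIVE_WAY AFR A321 OPPOSITE",
             "DLH422 TAXI VIA N8 L",
             "DLH422 HOLD_SHORT L6"]:
    print(parse_annotation(line))
```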
For the controller, almost all processing steps are invisible; the technology remains in the background. From the human operator’s point of view, the sequence of actions is as follows:
  • The callsign addressed in the controller’s radio call is highlighted at the corresponding aircraft symbol in the A-SMGCS (DLH422 in the above example).
  • Once the commands to the pilot are fully uttered, they are converted into corresponding system commands that would otherwise have to be manually entered, e.g., a taxi route.
  • The result of the command input is displayed to the controller (and/or simulation pilot) in the A-SMGCS. Wherever possible, the visualization corresponds to the same visualization that would have resulted from a manual entry.
  • Special case: If an error in the data processing causes the wrong command to be sent and therefore the wrong effects (or none) to be displayed, the human operator must manually correct the command or enter it into the system. Depending on the type of command, dedicated buttons are offered for this purpose.

2.2. Application Development

The solution was created in four main iterations following the spiral model of software development [34]. For this purpose, an ABSR system was integrated with the A-SMGCS system TowerPad™ (see Figure 1) by iterating the following steps:
  • Technical and operational requirements were determined;
  • Software and interfaces were developed, implemented, and tested;
  • Progress was validated by users in realistic operational scenarios in Fraport’s training simulator;
  • Results were analyzed to derive new requirements.
In the end, the system was intensively validated in realistic simulations with apron controllers and evaluated based on recorded data and defined metrics. The safety aspects detailed in the next subsection were addressed and focused on in the third iteration.

2.3. Safety Considerations

In aviation, a system can only be approved for operation if its impact on safety has been thoroughly assessed. This is even more important if it uses technology that is new and for which the currently available safety assessments are not necessarily suitable. For the use of artificial intelligence-based methods, discussions are taking place in the community regarding how the safety of AI methods can be verified or demonstrated. These discussions happen independently of air traffic management application areas.
However, there is a way out of the dilemma of the lack of approved testing and verification methods, which we saw in the STARFiSH project as a possibility to safely operate a system with AI-based speech recognition and understanding. If the AI system can be encapsulated in such a way that safety-critical outputs cannot have an immediate impact on real-world operation and must always be approved by the user of the system prior to implementation, safety will be verified during operation. However, manual checking of commands means additional effort that one does not want to impose on users for system inputs that cannot have any safety-critical consequences. Thus, it was important to identify which commands have effects that are safety-critical from an operational view.
In order to determine which system inputs are safety-critical in this sense and which are not, a safety analysis based on the classification in Figure 2 must be performed that first determines, independently of the solution, which system inputs are potentially safety-relevant because they can endanger operational safety. For this analysis, we followed the EUROCONTROL “safety assessment methodology” (SAM) and applied the SFMECA methodology that is at the core of the “functional hazard analysis” (FHA).
SFMECA (Software Failure Modes, Effects, and Criticality Analysis) [35] is a formalized method of risk assessment and subsequent identification of mitigation measures. It is a bottom-up method that analyzes so-called failure modes and their effects to identify (hidden) hazards at the system level. Using the standardized structure and presentation of the process and results specified by the SFMECA, a team of experts used predefined and individual “failure modes” to analyze which safety-relevant effects could be caused by the software and what their causes were for the functional requirements in the project. The requirements were grouped according to features and pre-filtered according to the evaluation dimensions:
  • Safety-criticality;
  • Criticality for the work of the controller;
  • Risk due to potential software development errors.
Then the error cases (in categories “functionality”, “timing”, “sequencing”, and “data, error handling”) were quantitatively evaluated with their respective “root cause”, checking for 26 common causes plus specific functional errors, e.g., “misdetection of callsign under own jurisdiction”.
The evaluation criteria were the severity of the effects, their probability of occurrence, and the probability of timely detection of the error case. Each of these criteria was evaluated in ten gradations for each failure mode and its cause, and a risk priority number (RPN) was calculated. For sufficiently high RPNs, the SFMECA provides steps to be defined on how to reduce the risk with mitigation actions. In the project, the mitigation action envisaged was to let the controller decide on such commands instead of executing the commands directly.
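For orientation, and assuming the conventional FMECA scoring (the article itself does not spell out the formula), the risk priority number combines the three criteria as

$$\mathrm{RPN} = S \times O \times D, \qquad S, O, D \in \{1, \dots, 10\},$$

where $S$ rates the severity of the effect, $O$ the probability of occurrence, and $D$ how unlikely timely detection is (hard-to-detect failures receive a high $D$). The RPN thus ranges from 1 to 1000, and failure modes with sufficiently high values trigger the mitigation steps described above.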
Section 5.7.1 presents the results of the SFMECA and even shows that, in our use case, the distinction between good and bad recognitions is not needed.

3. Description of Evaluation System

The final validation trials were conducted on five consecutive days in the apron simulator in Frankfurt in summer 2022. All necessary data were recorded, subsequently processed, evaluated, and documented in accordance with the agreed validation concept.
For the trials, an evaluation system was created that allowed us to test the hypotheses set out in the project description and to adapt them to the experience gained. The following section describes the final evaluation system as it was integrated into the Fraport apron simulator.

3.1. Technical Integration into the Simulator

While the validation system was necessary to perform the final validation trials, its design was developed in iterations. From the start of the project and as a very basic technical integration, it was used to test the planned system’s architecture and functionality. It was also used to record speech and validation data necessary for the iterative improvement of artificial intelligence (AI)-based speech recognition and understanding during each training session. The actual speech data were recorded, and the A-SMGCS position data, flight plan data, and commands entered by the simulation pilots were logged. Additionally, the recorded speech data were transcribed (a word-for-word transcript of the uttered speech) and annotated (information on the contained commands in the defined ontology). The recorded data were used to train the speech recognition models and to adapt the algorithms for speech understanding and callsign prediction. The data used for training and adaptation contained 19 h of audio data without silence from 14,567 single utterances, aligned with the corresponding transcriptions. Furthermore, around 8.5 h of the transcribed data, corresponding to 7132 utterances, were annotated in the defined ontology. Both the transcription and annotation processes were based on automatic pre-transcriptions/pre-annotations generated by the speech recognition and understanding components in the quality available within the different iterations. The manual verification and correction of the pre-transcripts were executed by a human expert from Fraport who is familiar with the airport layout, the procedures, and so on. The pre-annotations were verified and corrected by experts from DLR who are familiar with the defined ontology and its components.
The part of the Fraport simulator that was used in the project consists of a simulation room for the apron controllers and a control room for the simulation pilots (see Figure 3).
In operational mode, there are two different workstation types. The Movement Controller (MC) workstation guides the aircraft; English is spoken on the flight frequency. In Frankfurt, three Movement workstations are usually manned, named East, Center, and West (see Figure 3 and Figure 4). In addition to the Movement workstations, there are Operational Safety Controllers (OSC). These workstations guide the tugs and assign the follow-me vehicles; German is spoken on these frequencies. For training, usually the MC and two OSC are assigned and split across two different rooms. It was decided not to use OSC during the simulation, so that five instead of three simulation days were possible with the same effort from the involved apron controllers. Everything was located in one simulation room.
In the simulator environment, simulation pilots act as counterparts for the controllers. They sit in the simulation pilot room (left part in Figure 3). The task of the simulation pilots is to move the aircraft as instructed by the controller and to provide readback of uttered commands. Each simulation pilot is in control of the same aircraft as the corresponding controller and, therefore, controls several aircraft. The simulation pilots, like the controllers, are assigned to designated work areas (East, Center, and West). Thus, a controller always talks to the same simulation pilot during a simulation session and vice versa. Three MC workstations and three active simulation pilot positions were evaluated with ABSR support.
To use ABSR in the simulator like in real operations, various data had to be exchanged between the simulator software (ATRiCS AVATOR™), the A-SMGCS (ATRiCS TowerPad™), and DLR’s ABSR system (see Figure 5).
The surveillance data in ASTERIX CAT 20 format as well as flight plans and Collaborative Decision Making (CDM) times were sent to the ABSR system and evaluated by the callsign prediction module. The audio recordings were first processed by the voice activity detection, then by the speech-to-text component, and finally by the speech understanding component. These results (ATC concepts) were forwarded as commands to the A-SMGCS and the simulation pilot workstations; the latter control the traffic and radar simulator and its visualization. This was performed by means of a transmission control protocol/internet protocol (TCP/IP) connection. Another interface was used to transmit flight plan data from the simulator to the ABSR system. The main interface was from the ABSR system to the A-SMGCS. Here, the recognized commands were passed to the A-SMGCS for visual display on the simulation pilot workstations and on the apron controller workstations. The interfaces and software programs as well as traffic scenarios were successfully tested in the first iteration of the project.
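As an illustration of this interface, the following is a minimal sketch of how a recognized command could be forwarded over a TCP/IP connection. The JSON field names, host, and port are purely hypothetical, since the article does not describe the actual message format of the TowerPad™ interface.

```python
import json
import socket
from typing import List

def send_recognized_command(host: str, port: int, callsign: str,
                            command: str, values: List[str]) -> None:
    """Forward one extracted ATC concept to the A-SMGCS over TCP/IP.

    The message layout below is a hypothetical example; the real interface
    between the ABSR system and the TowerPad(TM) is not specified in the text.
    """
    message = {
        "callsign": callsign,   # e.g. "DLH422"
        "command": command,     # e.g. "HOLD_SHORT"
        "values": values,       # e.g. ["L6"]
    }
    with socket.create_connection((host, port)) as connection:
        connection.sendall((json.dumps(message) + "\n").encode("utf-8"))

# Example call (hypothetical host/port of the A-SMGCS command interface):
# send_recognized_command("a-smgcs.local", 9000, "DLH422", "HOLD_SHORT", ["L6"])
```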
After the first iteration, new functions were integrated and tested in the simulator and the ABSR system in short intervals. In this way, it was possible to quickly check whether the interaction of the software worked and whether the adjustments represented an improvement or had no significant or even negative effects and should, therefore, be removed again.
Similar to the development process, an iterative approach was also taken to evaluate the results. An evaluation basis was defined and tested during the iterations and continuously improved. During these tests, it was determined that an objective measure of the cognitive load placed on controllers by system inputs was needed. Using eye-tracking sensors would have been one way to measure how often the controller’s gaze is on the implemented ABSR output, e.g., to provide manual input and to determine how often the simulated traffic can be observed from an outside view. However, after further experiments, the decision was made to use a much less complex measurement method by means of secondary tasks for the participating controllers, see Section 4.4.
In the simulations, different traffic situations were used as scenarios. Scenarios from 30 min to 60 min were tested, as well as high, medium, and low traffic. After various tests, the length of 30 min and very high traffic density seemed to be most suitable to validate or falsify the validation hypotheses in the final validation trials. Two scenarios were created for the final validation trial. One scenario included runway operating direction 25, another one, operating direction 07. The two different operating directions indicate the direction in which the parallel runway system in Frankfurt is used, i.e., the direction in which aircraft take off and land. The direction depends on the weather, in particular on the wind, since landings should be made against the wind direction if possible. During operating direction 25, the runways 25 left (25 L) and 25 right (25 R) were used for inbounds/arrivals. The runways 18 and 25 center (25 C) were used for outbounds/departures. During operating direction 07, inbounds used 07 L and 07 R, whereas outbounds used 07 C and 18, i.e., in both scenarios, two inbound and two outbound runways were in use. On the ground, the operating directions affect the taxi guidance since the aircraft are then guided on other taxiways to the stand or runway. Accordingly, the ABSR and the integration of the systems could be tested in different situations. Consequently, the results should be more general and transferable to other traffic scenarios and other airports.

3.2. Assistant-Based Speech Recognition

The core of the ABSR system implemented in the STARFiSH project mainly consists of three modules (see Figure 6), which perform the conversion of the audio signal into recognized word sequences (speech recognition), the prediction of the relevant callsigns (callsign prediction), and the extraction of the semantic meaning of apron controller commands (speech understanding).
The only mandatory input signal to the system is the voice radio of the apron controller. To improve the recognition quality of the ABSR system, radar and flight plan information is also provided by the A-SMGCS. These data allow the generation of relevant contextual information, such as the list of aircraft callsigns that are currently relevant for operations per area of responsibility, which can be directly integrated into the recognition process of the ABSR system. In addition to the three central modules, the project also had to implement and integrate voice activity detection for the ABSR system for technical reasons (see Section 3.2.1); this component determines when a controller’s radio transmission starts and when it ends. Figure 6 provides an overview of the interfaces between the core modules. The following sections describe each module in more detail.

3.2.1. Voice Activity Detection

The goal of the ABSR system in STARFiSH is to recognize and understand the uttered commands of apron controllers. Since the audio signal is transmitted as a continuous stream of data from the voice communication system to the ABSR system, even when there is no speech at all, the system needs a way to detect the points in time a dedicated radio transmission has started and ended. Therefore, a signal is required that indicates the beginning and end of a radio message to the ABSR system. The most precise signal for this purpose would be the so-called push-to-talk (PTT) signal. This signal is triggered by controllers each time they push or release the button on the microphone they are using to start or end a radio transmission to a pilot. However, for technical reasons, the PTT signal could not be accessed for use in this project. To compensate for this problem, STARFiSH uses voice activity detection, i.e., the acoustic signal is analyzed to determine when a transmission begins and ends. The start and end of radio transmissions are detected based on the duration of previously detected silence states and a probability of reaching the end of the voice signal. Five predefined rules from Kaldi for detecting the end of a segment online were used without further adaptation [36].
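To illustrate the endpointing idea, the following is a deliberately simplified, energy-based sketch of silence-duration endpointing. It is not the Kaldi rule set referenced above, and the threshold values are arbitrary assumptions.

```python
import numpy as np

def detect_segments(frame_energies: np.ndarray, frame_ms: float = 10.0,
                    energy_threshold: float = 1e-3,
                    max_trailing_silence_ms: float = 500.0):
    """Very simplified, energy-based endpointing (illustrative only).

    `frame_energies` holds the mean energy of consecutive audio frames. A segment
    starts with the first frame above the energy threshold and ends once the
    trailing silence exceeds `max_trailing_silence_ms` -- loosely analogous to
    the silence-duration criterion described above, not the actual Kaldi rules.
    """
    segments, start, silence_ms = [], None, 0.0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i                 # beginning of a radio transmission
            silence_ms = 0.0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms >= max_trailing_silence_ms:
                segments.append((start, i))   # end of the transmission
                start, silence_ms = None, 0.0
    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments
```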

3.2.2. Speech Recognition, i.e., Speech-to-Text (Transcriptions)

As soon as the voice activity detection detects the beginning of a radio transmission, the audio signal is forwarded to the S2T component, and the recognition process immediately starts converting the audio signal into word sequences. This means that the speech recognition system starts the recognition process as soon as the apron controller begins the radio transmission. The system then continuously provides intermediate recognitions until the controller reaches the end of the radio transmission. For example, a controller might say the following:
“lufthansa three charlie foxtrot taxi alfa six two alfa via november one one november november eight at november eight give way to the company A three twenty from the right”.
Let us assume that this sentence could be recognized and output by the S2T component in the following increments:
  • “lufthansa three charlie”;
  • “lufthansa three charlie foxtrot taxi alfa six two alfa via”;
  • “lufthansa three charlie foxtrot taxi alfa six two alfa via november one one november november eight”;
  • “lufthansa three charlie foxtrot taxi alfa six two alfa via november one one november november eight at november eight give way to”;
  • “lufthansa three charlie foxtrot taxi alfa six two alfa via november one one november november eight at november eight give way to the company A three twenty from the right”.
The speech recognition engine is implemented as a hybrid of a deep neural network and a hidden Markov model (HMM). The acoustic model is a convolutional neural network combined with a factorized time-delay neural network (CNN-TDNNF), with six convolution layers and fifteen factorized time-delay layers. Overall, the model has around 31 M trainable parameters. The whole model is trained with the so-called “lattice-free maximum mutual information” (LF-MMI) objective function. The system follows the standard chain LF-MMI training recipe [37] of Kaldi [38], which uses high-resolution Mel frequency cepstral coefficients and i-vectors as input features. A typical 3-gram language model was trained and adapted using domain-specific data.
Starting from a base model, the speech recognition engine was continuously improved with new training data during the course of this project. An integration of context knowledge from callsign predictions was also implemented and contributes to the improvement of the recognition performance.

3.2.3. Speech Understanding

When a word sequence is transmitted from the S2T component, it is analyzed by the speech understanding module and converted into relevant ATC concepts, as originally defined in an ontology [29]. According to this ontology, the above word sequence “lufthansa three charlie foxtrot taxi alfa six two alfa via november one one november november eight at november eight give way to company A three twenty from the right” is transformed into the following commands:
  • DLH3CF TAXI TO A62A;
  • DLH3CF TAXI VIA N11 N N8;
  • DLH3CF GIVE_WAY DLH A320 RIGHT WHEN AT N8.
In total, this radio transmission contains three commands. Here, the pilot of the aircraft with the callsign DLH3CF was instructed to taxi to parking position A62A via the taxiways N11, N, and N8. When arriving at taxiway N8, the pilot of the aircraft must give way to a Lufthansa (DLH), which is from the same company as the pilot addressed, has the aircraft type A320, and is coming from the right (RIGHT), before being allowed to continue taxiing.
In the case of intermediate detections from the speech recognition engine, the speech understanding module is able to provide early recognition of the callsign or, if required, early recognition of subsequent commands. The speech understanding implementation is based on a rule-based algorithm that identifies the relevant parts step by step and converts them into ATC commands. For more information, see [39].
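As an illustration of how such intermediate detections could be exploited, the following sketch picks the callsign out of partial hypotheses as soon as its spoken form appears, so that the aircraft symbol could be highlighted before the transmission ends. The spoken forms and exact-prefix matching are simplified assumptions; the actual module additionally uses callsign predictions and fuzzy matching (see Section 3.2.4).

```python
from typing import Dict, Iterable, Optional

def earliest_callsign(partial_hypotheses: Iterable[str],
                      spoken_forms: Dict[str, str]) -> Optional[str]:
    """Return the first callsign whose spoken form starts a partial hypothesis.

    Simplified sketch: exact prefix matching only, whereas the real module
    tolerates recognition errors (see the Levenshtein matching in Section 3.2.4).
    """
    for hypothesis in partial_hypotheses:
        for spoken, callsign in spoken_forms.items():
            if hypothesis.startswith(spoken):
                return callsign
    return None

increments = [
    "lufthansa three charlie",
    "lufthansa three charlie foxtrot taxi alfa six two alfa via",
]
# The callsign is already available after the second increment, long before
# the complete transmission has been uttered.
print(earliest_callsign(increments, {"lufthansa three charlie foxtrot": "DLH3CF"}))
```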
The speech understanding module does not only convert the word sequences into ATC concepts but also makes an initial decision as to whether the extracted commands might be erroneous. Potentially erroneous commands are caused either by an erroneous interpretation of the rule-based algorithm, by an already erroneous word sequence due to a misrecognition of the speech recognition engine, or by misleading formulations of the controller. The decision whether a command could be erroneous is based on simple heuristic rules that determine which commands can occur together in a radio transmission. Here are some of the rules, explained by examples:
  • It is logically not possible that an aircraft is instructed in a single radio transmission to taxi to two different target positions, e.g., a “TAXI TO” to two different parking positions, runways, or both in one transmission is impossible. Therefore, the module would automatically discard all “TAXI TO” commands within the transmission. Of course, with more information, it might be possible in some cases to determine which of the target positions is the correct one and only neglect one of the “TAXI TO” clearances, but that would require quite complex knowledge about the airport infrastructure to be implemented within the speech understanding component. The target application, on the other hand, which receives information from speech understanding, usually already has the required knowledge about the airport and therefore is more suitable to handle this task.
  • A similar example would be a “TURN LEFT” and a “TURN RIGHT” command within one transmission and no other command in between, which is also impossible and would therefore be neglected for the same reasons.
  • A less obvious example is the recognition of a “PUSHBACK” and a “TAXI TO” command in one transmission. Theoretically, this might seem possible, but these commands do not appear together in practice, and if they do, the error is usually a wrongly extracted “TAXI TO”. Therefore, the heuristic says to always neglect the “TAXI TO” in this case.
However, the examples above also show that erroneous commands can only be detected if the error case is predefined. Therefore, confidence measures have additionally been implemented for the speech understanding output and are used to reduce possible false recognitions. These confidence measures can also be applied to the error cases listed above instead of neglecting the erroneous commands, but this requires that the application receiving the information is able to handle them; the application then has to determine which command to neglect or not. In the end, all errors that are not detected, either by speech understanding or by the application, have to be handled manually by the apron controller in charge.
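A minimal sketch of such co-occurrence heuristics, assuming a simple dictionary representation of extracted commands (the project’s actual rule set and data structures are not published in this form):

```python
from typing import Dict, List

def filter_implausible(commands: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Discard command combinations that cannot occur in one radio transmission.

    Illustrative sketch of the heuristics described above; each command is a
    dict such as {"type": "TAXI TO", "value": "A62A"}.
    """
    types = [c["type"] for c in commands]
    kept = list(commands)

    # Two different TAXI TO targets are contradictory: discard all of them and
    # leave the decision to the receiving application.
    taxi_to_targets = {c["value"] for c in commands if c["type"] == "TAXI TO"}
    if len(taxi_to_targets) > 1:
        kept = [c for c in kept if c["type"] != "TAXI TO"]

    # TURN LEFT and TURN RIGHT in one transmission (simplified here: regardless
    # of intervening commands) cannot both hold.
    if "TURN LEFT" in types and "TURN RIGHT" in types:
        kept = [c for c in kept if c["type"] not in ("TURN LEFT", "TURN RIGHT")]

    # PUSHBACK together with TAXI TO: experience says the TAXI TO is usually
    # the wrongly extracted one, so it is the one that is dropped.
    if "PUSHBACK" in types and "TAXI TO" in types:
        kept = [c for c in kept if c["type"] != "TAXI TO"]

    return kept

print(filter_implausible([{"type": "PUSHBACK", "value": ""},
                          {"type": "TAXI TO", "value": "25R"}]))
# -> only the PUSHBACK command remains
```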
Analogous to speech recognition, speech understanding was also continuously developed and adapted based on new information. Just as in speech recognition, an integration of context knowledge from callsign prediction takes place and contributes to the improvement of recognition performance.

3.2.4. Callsign Prediction

Callsign prediction receives both radar and flight plan information from the A-SMGCS. The module uses these data to determine which callsigns may be part of a radio transmission in the near future. The radar information is used in the first step to obtain an overview of the available callsigns in the airport area. However, since many aircraft are in the airport area, but not all will be actively participating in taxiing traffic in the near future, the module also uses flight plan information dynamically provided by the A-SMGCS to determine more precisely which of the available callsigns will be addressed in the near future. For this purpose, the responsible controller position, the target startup approval time (TSAT), the actual take off time (ATOT), the actual landing time (ALDT), and the actual in block time (AIBT) are extracted from the flight plan. All relevant callsigns are forwarded to the speech recognition module (callsign boosting) and the speech understanding module to include the callsigns in the process of recognition and understanding. More information on the technique of callsign boosting, used within the speech recognition module to enhance recognition, can be found in [40,41]. The integration of callsign predictions in the speech understanding module transforms the callsigns into possible word sequences and calculates the closest match to the recognized word sequence based on the Levenshtein distance [42] to determine the correct callsign.
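A compact sketch of this matching step follows. The edit distance itself is standard; the hard-coded spoken forms and the threshold-free selection are simplifications of what the module actually does.

```python
from typing import Dict, Optional

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def best_matching_callsign(recognized: str,
                           predicted: Dict[str, str]) -> Optional[str]:
    """Pick the predicted callsign whose spoken form is closest to the
    recognized words (illustrative sketch of the matching step; the actual
    module also uses distance thresholds and context not shown here).
    """
    if not predicted:
        return None
    return min(predicted, key=lambda spoken: levenshtein(recognized, spoken))

predicted = {
    "lufthansa four two two": "DLH422",
    "air france two alfa delta": "AFR2AD",
}
spoken = best_matching_callsign("lufthansa four to two good morning", predicted)
print(predicted[spoken])   # -> DLH422 despite the misrecognized "to"
```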

3.2.5. Concept Interpretation

The final stage of integrating an ABSR system into an A-SMGCS represents the testing for operational plausibility, interpretation, and implementation of the extracted concepts or commands. Figure 7 shows the running integration of ABSR into the A-SMGCS at one of the simulation pilot stations.
Testing and interpretation are necessary prior to implementation for two reasons:
  • Controller instructions via voice convey exactly the information that is necessary and sufficient for the addressed pilot in the current traffic situation. Globally, however, these instructions can be ambiguous. It is, therefore, necessary for an information technology system to unambiguously identify the addressed pilot and to make assumptions about his/her contextual knowledge in order to be able to exclude ambiguities from this perspective. A GIVE_WAY command from the right could identify several aircraft that approach from the right at the same time or consecutive taxiway crossings. The system has to determine the correct one that is implied from the traffic context.
  • The extracted concepts may be erroneous. Either the controller has made a mistake, so that the verbal instruction does not correspond to what would be advised in the current traffic situation, or errors have occurred in the recording of the speech, the pause recognition, the conversion to text, or the speech understanding, so that the extracted concept is erroneous and should not be implemented.
These two sources of error cannot be distinguished. The task of this module is to admit only those commands into the assistance system that are plausible and fit into the current traffic context. If inappropriate commands are delivered by the ABSR system, the user must be given the opportunity to manually correct the error. This is technically implemented by the following steps, which are detailed in Appendix B:
  • Preprocessing;
  • Highlight the aircraft symbol on the basis of the recognized callsign;
  • Trigger multiple actions based on a single command;
  • Discard commands incompatible with the traffic situation;
  • Correctly interpret context-dependent commands;
  • Complete incomplete commands from the current traffic situation;
  • Convert commands;
  • Deal with detected errors;
  • Deal with undetected errors and identify error sources.

3.3. Usability Considerations

A key element to the successful implementation of automation features is the user interface. Since automation reduces the necessary interactions with the system, users may miss automatically executed actions. It is thus essential that users are still able to see all system states that are relevant for safe operations. In addition, it must be possible to quickly analyze and correct errors in the event of automation failure. This is a necessary requirement of operational safety, especially in aviation, where errors can lead to accidents.
In the STARFiSH project, the automation functions as well as the user interface were first implemented in a purely functional way and then analyzed in operation with the end users to iteratively overcome the challenges. This involved asking questions such as: Is the right information available, and is the right information being perceived? Is the user interface not overloaded with information, i.e., is important information also perceived more easily than less important information? Are delays sufficiently low?
Using various elicitation techniques (observation, brainstorming, and interviews), user requirements were thus determined in an iterative manner, and new target formulations were achieved. In total, 30 users participated in 29 evaluation days, distributed over the four iterations in the course of this project.

3.3.1. Visualization of the Automation Actions (Feedback)

In iteration 1, the main focus was to be able to check the technical integration of the systems, i.e., to investigate the question of whether recognized commands reach the human operator and are available in time from an operational point of view. Therefore, the following were implemented first:
  • Display the recognized transcriptions and the resulting annotations (ATC commands) on the side of the ABSR output log, in order to be able to compare the output of the ABSR system with the received data on the TowerPad™.
  • Log data at the interfaces of the ABSR system and on the working position computers of the controllers and simulation pilots, in order to be able to analyze, after the simulation runs, whether the correct commands arrive in time.
  • Log the commands provided by the ABSR system in chronological order on the working position computers to give users and researchers a way to observe and verify the results of the speech inputs independently of the implementation of the commands.
During preliminary testing, it became apparent that displaying each recognized command in a table to support easy troubleshooting did not add value to the operational users of the system but was distracting. Therefore, the user interface was designed so that commands generate specific visual feedback which is integrated into the workflow. In terms of position and design, this resulted in symbolic displays specifically adapted to the command or very compact dialogs. Although the display as a table was still extended for troubleshooting, it was no longer visible at the controllers’ working positions during the trials and was positioned on a second screen outside the focus of the users at the simulation pilot workstations. It was only used for evaluation and development.
Starting with iteration 3, commands were directly translated into visible actions of the system. For some actions, it was possible to use the same visual feedback to the human operator that is used for manual input, for example:
  • Change a route;
  • HOLD_SHORT command;
  • GIVE_WAY command.
There are fundamental advantages to displaying the same feedback in the HMI regardless of the input method (by speech recognition and understanding, or by mouse or touch gesture), as there is less need for training. On the other hand, users should be able to identify if the source of a change in the user interface is the speech recognition and understanding component. This was implemented by the following user interface features:
  • Highlighting of the addressed aircraft symbols without disturbing user touch or mouse input, executed in parallel, additionally multi-highlighting when several commands are executed in quick succession.
  • Feedback for changes, which are scarcely visible when executed manually, such as the transfer of an aircraft to another working position.

3.3.2. Manual Error Correction

The simulation experiments showed that for some actions, an undo is ambiguous and not without side effects, e.g., when changing a taxi route. For these actions, it was easier for human operators to select the desired function directly without prior “undo”, thereby implicitly overriding the wrong action.

4. Validation Trials

This section presents the preparation and results of the validation trials. All simulations took place in Fraport’s training simulator, which had been retrofitted for the experiments and tests.

4.1. Pre-Simulations

During the pre-simulations, the individual parts of the system and their integration were tested, and exemplary evaluations of the simulation runs were carried out in order to determine methods for the final validation trial. The basic structure, i.e., the architecture, remained constant after the initial integration tests.
Controllers and pilots, in their corresponding positions, speak to each other on the same radio frequency. The ABSR system operates on the voice recordings of the controllers, and the command implementation takes place independently in the two instances of the A-SMGCS for the controllers and simulation pilots, respectively (see Figure 5). For both groups, ABSR support is enabled either for all working positions or for none.
The simulations in the first iterations served the dual purpose of obtaining feedback from the human operators and testing the technical integration. In later iterations, the validation methods themselves were tested as well, i.e., exemplary evaluations of the simulation runs were performed. For example, the hypotheses regarding the reduction of taxi times were discarded, since taxi times did not differ significantly.
It was also explored what the scope of traffic should and should not be, and which additional tasks are suitable to challenge the attention of the users without tying up the support team too much.

4.2. Validation Plan

Four different combinations of the ABSR support were investigated, as shown in Table 1:

4.2.1. Validation Hypotheses

The following hypotheses were tested during the final validation trials:
H1 (H-C-less_input): Automatic documentation (conditions JC and CP) reduces the total number of manual inputs to guide taxiing traffic at the controller’s working position compared to full manual input (conditions NO and JP).
H2 (H-P-less_input): Automatic command recognition for simulation pilots (conditions JP and CP) reduces the total number of manual inputs to guide the taxiing traffic of simulation pilots compared to full manual input (conditions NO and JC).
H3 (H-C-more_cog_res): Automatic documentation (conditions JC and CP) increases the controller’s free cognitive resources compared to full manual input (conditions JP and NO).
H4 (H-C-less_workload): Automatic documentation (conditions JC and CP) reduces the workload of the controller compared to full manual input (conditions JP and NO).
H5 (H-C-sit_aw_ok): Automatic documentation (conditions JC and CP) does not limit the controller’s situational awareness compared to full manual input (conditions JP and NO).
H6 (H-C-conf): The controller’s confidence in command entry automation (conditions JC and CP) is above average.
H7 (H-P-conf): The simulation pilot’s confidence in command entry automation (conditions JP and CP) is above average.
H8 (H-E-CmdRR): The command extraction rate (JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (command extraction rate for simulation-relevant commands >90%).
H9 (H-E-CmdER): The command extraction error rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (command extraction error rate for simulation-relevant commands <5%).
H10 (H-E-CsgRR): The callsign extraction rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (>97%).
H11 (H-E-CsgER): The callsign extraction error rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (callsign extraction error rate <2%).

4.2.2. Independent Variables

The independent variables (IV) of the final validation trials were as follows:
  • (IV-Input): Documentation on the controller’s HMI by ABSR vs. manual input (JC and CP vs. JP and NO).
  • (IV-Control): Control of the simulation by ABSR for the controller’s utterances vs. full manual input (JP and CP vs. JC and NO).

4.2.3. Dependent Variables

The dependent variables of the final validation trials are listed in Appendix A. The respective results are each compared between the different operational conditions within a scenario.

4.3. Execution of the Final Validation Trials

The final validation trials took place from 27 June to 1 July 2022 in the apron simulator in Frankfurt. The number of simultaneously active users was three controllers and three simulation pilots. For the final trials, 14 controllers who had enough experience with the A-SMGCS were recruited (see Figure 8). On each day, a new team of controllers was on site (one controller participated twice). Half of the participants had already gained first experience with the system in one of the many pre-simulations. The other half had their first contact with the ABSR system during the final trials.
Two different traffic scenarios were prepared for the final validation trial: one for runway operating direction (OD) 25 and one for OD 07. The simulation scenarios generated from these were 30 min long each. Table 2 shows the number of aircraft movements in total and the projected number of aircraft at each of the three working areas: East, Center, and West.
In order to generate a heavy workload, the amount of traffic in the scenarios was increased compared to the usual traffic at Frankfurt Airport so that the controllers were as busy as possible all the time. The heaviest workload with respect to radio frequency usage and number of commands was expected at the Center position, followed by the East working position. At the West working position, the load was expected to be lower, even if the numbers in Table 2 suggest otherwise. This is partially due to the type of movement (pushback: aircraft must be pushed from the parking position with a tug, etc.) and the size of the area. In the West, there were significantly fewer pushbacks, because there were fewer nose-in positions (so most aircraft could leave the parking stand forward under their own power). The number of commands and utterances per position for all runs are shown in Table 3.
The realistic maximum traffic volume in real operations was 106 movements per 60 min in 2019. This amount was used in the simulation trials for half an hour. This increase compensated for the fact that there were no tows on the apron and none of the other secondary activities that would otherwise occur in reality and that could not all be represented in the simulator.
Each simulation day began with a briefing of the controllers and simulation pilots involved. In this briefing, the controllers and simulation pilots were educated on the concept of ABSR and its interaction with the controller input interface and the simulation pilot’s working station. It was explained to them how to make manual entries in the systems and when these are required (when no ABSR support is active for the respective station or when a correction is necessary).
In addition, the controllers and simulation pilots were informed about the schedule, and the questionnaires were introduced. Afterwards, training for controllers and simulation pilots took place, in which all operational conditions were explained and tried out. During the training run, the controllers also exercised the secondary task for measuring mental load after a short introduction on how to perform it. This secondary task is discussed in more detail in Section 4.4.
The teams of three controllers and three simulation pilots remained at their working positions throughout the different simulation runs and operational conditions (OC). On each day, there were different runs for each of the two operating directions OD07 and OD25. The teams were always the same for the same OD.
When evaluating the influence of ABSR support on the work of the controllers, the ABSR system was always active for the simulation pilots, i.e., it was only switched on and off for the controllers. In addition, two simulation runs were carried out in OD25 in which the ABSR system was always active on the controllers’ side and was switched on and off for the simulation pilots. In this way, the influence of the ABSR system on the simulation pilots’ activity was analyzed as well. Thus, six runs were performed per day. After each run, the controllers filled out a questionnaire. At the end of the day, an additional questionnaire was filled out, and the impressions, comments, and hints of the controllers and simulation pilots were recorded in a non-formal debriefing session with all participants.
In the runs in which the ABSR system was alternately on or off for the controllers, the secondary task was performed by the controllers. The task started 10 min after the start of the run and stopped 10 min later, simultaneously for all three working positions. The assignment of operational conditions and simulation scenarios to the simulation runs is given in Table 4.
Table 5 below shows the order of the training and experimental runs for each controller team of three persons. The order of the runs was changed each day so that learning effects occurring during a day average out in the evaluation.

4.4. Objective Workload Measurement by a Secondary Task

The questionnaires reflect the subjective experiences of the controllers, which one might argue to be the most important measure for most operationally deployed systems. Nevertheless, we wanted to obtain more objective data that would confirm or reject our hypothesis that the proposed system reduces workload.
To measure mental load, we used a secondary task that required similar skills to the main task (controlling traffic), namely mental focus, English language proficiency, color recognition, and quick orientation on the user interface, and yet that was simple enough to be performed in parallel with the main task.
In the pre-simulations, subjects were asked to sort decks of playing cards as a secondary task to measure free mental capacity and were then asked which one to four cards were missing, as described in [3]. However, this task required too much manual effort and therefore ran the risk of introducing errors in data recording and execution, so we chose a largely automated approach for the final validation trials using the application described below. This greatly reduced the physical and mental workload of the simulation support team and the susceptibility to errors.
For the secondary task, 10 min after the start of each simulation run, each controller (and, once, additionally the simulation pilots in parallel) was asked to complete as many Stroop tasks [43] as possible during the following 10 min in addition to their main task. For this purpose, the controllers were provided with tablet PCs (6× Samsung A8) running an application for executing consecutive Stroop tasks [44]. The application recorded the execution time and duration of each task as well as its correctness. A high number of correctly executed Stroop tasks suggests available mental capacity that is not needed for the main task.
The atomic Stroop task is as follows: when the start button is pressed, a word denoting a color is displayed, but in a color different from the one the word stands for. The user must select, from a set of seven buttons all labelled with color words in black, the button whose label matches the display color. The order of these buttons changes in a pseudo-random way at each repetition of the task. In Figure 9, the color word “ORANGE” is displayed in blue, so the button to press is the one labelled “BLUE”.
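To make the task concrete, the following minimal Python sketch (illustrative only; not the application from [44]) generates one atomic Stroop trial and checks an answer:

```python
import random

COLORS = ["RED", "GREEN", "BLUE", "ORANGE", "YELLOW", "PURPLE", "BROWN"]

def new_stroop_trial():
    """Return (word, display_color, buttons) for one atomic Stroop task."""
    word = random.choice(COLORS)
    # The display color must differ from the color the word stands for.
    display_color = random.choice([c for c in COLORS if c != word])
    buttons = random.sample(COLORS, len(COLORS))  # pseudo-random button order
    return word, display_color, buttons

def is_correct(display_color, pressed_button):
    """The correct button is the one labelled with the display color."""
    return pressed_button == display_color

# Example analogous to Figure 9: the word "ORANGE" shown in blue requires pressing "BLUE".
word, display_color, buttons = new_stroop_trial()
print(word, display_color, is_correct(display_color, display_color))
```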

5. Validation Results

5.1. Speech Recognition and Understanding Performance

Section 5.1.1 focuses on speech recognition, and Section 5.1.2 focuses on speech understanding performance results.

5.1.1. Speech-to-Text Accuracy (Speech Recognition)

A first indication of the quality of the ABSR system is provided by the so-called word error rate (WER) of the S2T component. The WER is calculated based on the Levenshtein distance [42] between the word sequence recognized by the S2T component and the actual spoken word sequence (gold transcription). This involves counting how many words of the actual word sequence have been substituted (S), deleted (D), or additionally inserted (I) in the recognized word sequence. All three counts are then added and divided by the number of words of the actual word sequence (N). Table 6 shows the WER of the developed S2T component in the final validation trials based on the verbal utterances of 14 apron controllers.
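Written as a formula, the definition above reads:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

For example (illustrative numbers only), if a gold transcription contains N = 20 words and the recognized word sequence has one substitution, one deletion, and no insertion, the WER is (1 + 1 + 0)/20 = 10%.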
WER was evaluated for two different modes. Online recognition measures what recognition performance the S2T component achieved during the final validation trials in summer 2022. This means that these results contain a certain number of errors that are not induced by the S2T component but by voice activity detection (VAD), due to the missing PTT signal. In order to determine how large the influence of VAD is and what improvement can be expected by accessing the PTT signal, the offline recognition after summer 2022 was used to subsequently evaluate what the system would have recognized if the audio stream had been perfectly split by PTT. It can be seen that offline recognition again brings a significant improvement over online recognition, with a WER of 3.1% compared to 5.0%.
It is also interesting to observe that the average WER of the female apron controllers (2.6% and 3.7%, respectively) was better than that of the male apron controllers (3.3% and 5.5%, respectively). However, out of the 14 apron controllers, only four were female. An unpaired t-test of the 24 runs with female apron controllers versus the 62 runs with male controllers yields a statistically significant difference, with a p-value of 0.02%.
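As a sketch of how such a group comparison can be computed (illustrative Python with made-up per-run WER values, not the project data; variable names are hypothetical):

```python
from scipy import stats

# Illustrative per-run WER values in percent; the actual 24 vs. 62 run values are not reproduced here.
wer_female_runs = [2.4, 3.1, 2.8, 2.2, 2.9]
wer_male_runs = [5.1, 4.3, 6.0, 5.5, 4.8, 5.9]

# Unpaired (independent two-sample) t-test comparing the two groups of runs.
t_stat, p_value = stats.ttest_ind(wer_female_runs, wer_male_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```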
The question of what WER is good enough for the intended purpose often arises. This question cannot be answered in a general way, because in the end, it is irrelevant how many words are recognized correctly. What is important is the ability of the system to extract the meaning behind the recognized words and finally to implement it appropriately in the application. Some errors on the word level can change the meaning of an utterance, while others have no influence at all. Therefore, it is not possible to define a general threshold for the WER, but a low WER allows conclusions on the quality of the implemented ABSR system.

5.1.2. Text-to-Concept Accuracy (Speech Understanding)

The performance of speech understanding is evaluated by comparing the commands automatically extracted by the system with the correct commands manually created and verified by human experts (gold annotations). The evaluation is based on three metrics: command recognition rate, command error rate, and command rejection rate. The command recognition rate is defined as the number of correctly recognized commands divided by the total number of commands actually given. A command is considered correctly recognized if and only if all elements of the command, such as command type, callsign, value, unit, qualifier, condition, etc., as defined in the ontology, are correctly recognized. The command error rate is the number of incorrectly extracted commands divided by the total number of commands actually given. The command rejection rate is the number of actual commands that were not extracted at all or were rejected by the system for some reason, again divided by the total number of commands actually given. Table 7 below illustrates the metrics defined above with an example. The example also shows that the sum of the recognition, error, and rejection rates can exceed 100%.
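The following simplified Python sketch illustrates how the three rates relate and why they can sum to more than 100%; it treats commands as plain strings and ignores the refinements (e.g., multiple callsigns per utterance) covered by the full metric definitions in [45]:

```python
def understanding_rates(gold_cmds, extracted_cmds):
    """Compute command recognition, error, and rejection rates (simplified sketch)."""
    gold, extracted = set(gold_cmds), set(extracted_cmds)
    n_given = len(gold)                 # all three rates are divided by the number of given commands
    recognized = gold & extracted       # every element of the command matches the gold annotation
    errors = extracted - gold           # extracted, but wrong in at least one element
    rejected = gold - extracted         # given, but not extracted at all
    return (len(recognized) / n_given,
            len(errors) / n_given,
            len(rejected) / n_given)

# Example: 3 given commands, 2 recognized, 1 rejected, plus 1 erroneous extraction
rec, err, rej = understanding_rates(
    ["DLH4YE TAXI VIA L", "DLH4YE TAXI TO V106", "DLH4YE GREETING"],
    ["DLH4YE TAXI VIA L", "DLH4YE TAXI TO V106", "DLH4YE TAXI TO V108"],
)
# rec = 2/3, err = 1/3, rej = 1/3 -> the three rates sum to 133%, i.e., more than 100%
```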
In a similar manner to the extraction rates for commands, the extraction rates for callsigns are determined separately. Again, there is a callsign recognition rate, error rate, and rejection rate. For each utterance, each callsign is considered only once, unless several different callsigns are extracted from the same utterance (“break break” utterances). Therefore, in the above example from Table 7, three callsigns are considered. Detailed information on the defined metrics can be found in [45]. Table 8 illustrates the performance of speech understanding based on the above-explained metrics, i.e., it contains the number of radio telephony utterances and commands as well as the recognition, error, and rejection rates for full commands and callsigns.
Recognition rates of 91.8% and 88.7% are obtained when speech understanding is applied offline (simulated PTT) and online (VAD), respectively. The improvement in speech understanding for offline recognition comes from the better word-level recognition and from the fact that the offline data do not include radio transmissions that were incorrectly split by VAD, which allows for a better interpretation of the content. Similarly, the recognition of aircraft callsigns in offline recognition is better than in online recognition, with recognition rates of 97.4% and 95.2%, respectively. The last two rows of Table 8 show the influence of the predicted callsigns on the recognition performance of ABSR. With context information available, the recognition rate increases by 15.5% absolute overall and by 16.3% absolute on the callsign level.
Table 9 shows the rates of offline recognition for different command types. The table lists only the most common or important command types that are relevant to the application.
Thus, speech understanding has error rates below 4% and recognition rates in the range of roughly 87% to 98%, depending on the command type, with the exception of the GIVE_WAY command. The reason for the worse extraction results for the GIVE_WAY command, marked in red in Table 9, is its very complex nature, i.e., it can be uttered in many different ways, not all of which have been modeled so far.

5.2. Interaction Count

To determine the number of manual HMI interactions needed at the A-SMGCS with and without ABSR support, we recorded the HMI interactions at each position and per simulation run, counted them, and categorized them into 48 different task types, such as “edit route”, “clear pushback”, and “select label”. The expectation was, of course, that the number of interactions would be significantly lower when ABSR was available. However, it was already apparent from the pre-simulations that, without ABSR support, the controllers would not make all the required inputs, nor would they correct every error made by the ABSR system: since the simulation pilots control the simulation, omitting an input has no direct consequences for the controller as long as the simulation pilot still follows the voice instructions.
Therefore, the numbers for the simulation pilots are more meaningful, since for them, any omitted input or correction leads to a delay or incorrect behavior in the simulator. On the other hand, not all interactions of the simulation pilots can be replaced by ABSR input, because the pilot initiates the communication with the controller and needs information for this, which is only available after selecting the aircraft symbol via a mouse click.
Therefore, even with a perfect ABSR, the total number of interactions of the simulation pilots can never be zero. Without ABSR, their number of interactions is higher than that of the controllers, and the reduction in interactions for the simulation pilots is also not as pronounced as for the controllers. Figure 10 shows the remaining portion of manual actions needed by the controllers and simulation pilots when supported by ABSR for the most frequent interactions. A strong reduction in manual effort is apparent for both.

5.3. Workload, NASA TLX

The NASA Task Load Index (TLX) has been used for decades in different variants to assess perceived workload in six different aspects [46]. We used a simple unweighted questionnaire procedure in which a mark between 0 (very low) and 20 (very high) is entered for each aspect. “The task” here refers to the task of apron control in the simulator while operating the A-SMGCS.
This questionnaire was completed by the users at the controllers’ working positions after each simulation run. The scores were aggregated by position, OD, and by use of the ABSR (or not). The six questions are as follows:
  • Mental Demand: How mentally demanding was the task?
  • Physical Demand: How physically demanding was the task?
  • Temporal Demand: How hurried or rushed was the pace of the task?
  • Performance: How successful were you in accomplishing what you were asked to do?
  • Effort: How hard did you have to work to accomplish your level of performance?
  • Frustration: How insecure, discouraged, irritated, stressed, and annoyed were you?
The question about the subjects’ own performance (see list above) requires a reversal of the scale values: “how successful?” suggests a high value if the subjects were satisfied with their performance. This is also how the controllers expressed themselves in the feedback rounds, but it was not consistently evident in the questionnaires; in some cases, conspicuously low values were given here, even though the values for the other aspects were also very low. We therefore have to assume that not all controllers understood this question correctly or answered it as expected, and we consequently excluded the performance aspect from the evaluation.
Table 10 below shows the average values by working position and overall, separately for the baseline runs (without ABSR) and the solution runs (with ABSR). Columns “α” show the statistical significance of a t-test.
In general, the workload on the working position “West” was significantly lower than on the other two, while it was estimated to be slightly higher on East than on Center. At OD25, the workload is slightly higher at West and at Center, while the workload for East is estimated to be highest at OD07 (not shown in the table). On average, the workload is slightly higher for OD07, and only with regard to the time aspect is OD25 experienced as somewhat more stressful. Thus, although the working position makes a big difference, the OD does not have a noticeable effect on the average values for all positions.
Especially for physical demand, the value decreases by almost four points. Significance tests (paired t-tests) on the data confirm that the differences between the runs with and without ABSR are not random. At the West position, however, the results for mental demand and frustration do not decrease in a statistically significant way. The overall results with respect to workload reduction are highly statistically significant: we obtained a p-value of 0.06%, i.e., if we had repeated the experiments with all 15 participants 1000 times, we would expect the workload without ABSR support to be rated lower than with ABSR support in only six cases.

5.4. Evaluation of Stroop Tests as Secondary Task

The results of the secondary task point in the expected direction: at the Center and West positions, subjects were able to correctly perform noticeably more tasks in parallel with their work when ABSR support was active (see Figure 11).
From a statistical point of view, the variance at the East working position was too high to make a reliable statement for this position. At Center and West, the figures support the hypothesis, but strictly speaking, the effect falls short of the usually required statistical significance due to the relatively small total number of experiments. The qualitative observations made during the simulation runs support our decision to use the application and indicate that the secondary task used here is an objective measure of the available cognitive capacity:
  • “If the Stroop tasks are done while R/T [radio telephony] must be done, selecting the correct button takes longer.”
  • “More complicated routes increase the error rate [in Stroop tasks].”

5.5. Situational Awareness, SHAPE-SASHA

The SHAPE Situational Awareness questionnaire (SASHA) [47] was used to assess the controllers’ situational awareness during the simulation runs. The test persons marked their assessment of the aspects on a Likert scale between 0 (“never”) and 6 (“always”). The negatively formulated statements (marked with an “*” below) were inverted for the evaluation, i.e., their averages are calculated as 6 minus the raw average value. The statements in detail were as follows:
  • In the previous working period(s), …
    I was ahead of the traffic.
    I started to focus on a single problem or a specific area of the sector.*
    There was the risk of forgetting something important […].*
    I was able to plan and to organize my work as I wanted.
    I was surprised by an event I did not expect […].*
    I had to search for an item of information.*
The subjects filled out this questionnaire after each simulation run. After evaluation, the situational awareness of the controllers is generally found to be good with and without ABSR (see Table 11). The average values over all simulation runs, positions, and aspects are above 4. The most important message is that with ABSR support, situational awareness increases on average over all aspects and at each position.
Further analysis of the values not shown in the above table reveals some differences in situational awareness (SA) related to the working environment. The working position area has an influence on SA, i.e., at the West position, the value was 4.8, while East (4.1) and Center (4.2) have lower values. The OD makes a very small difference, with 4.3 for OD07 and 4.4 for OD25, respectively. However, SA is significantly lower (one whole scale point) at the Center position for OD25 and at the East position for OD07, regardless of the use of ABSR. These two working positions gain half a point with ABSR support, but this is not as clearly reflected at the West position. These results fit the NASA TLX results: where workload is lower, situational awareness is higher.

5.6. Confidence in Automation, SHAPE-SATI

To assess trust in the automatic entry of commands into the controllers’ or simulation pilots’ HMI, we used the SHAPE Automation Trust Index (SATI) questionnaire. Each participant completed this questionnaire once at the end of the simulation day, with the request to focus on the effects of the ABSR system. This allowed us to evaluate 15 questionnaires from the controllers and 6 from the simulation pilots. The items, which could be answered on a 7-point Likert scale from 0 (“never”) to 6 (“always”), were as follows:
  • In the previous working period(s), I felt that …
    The system was useful.
    The system was reliable.
    The system worked accurately.
    The system was understandable.
    The system worked robustly (in difficult situations, with invalid inputs, etc.).
    I was confident when working with the system.
The test persons probably could not fully distinguish whether trust or distrust was triggered by the ABSR system or by other automation features of the not yet completely familiar system. Hence, the results must be considered with this restriction in mind. The overall impression turned out to be very positive, as shown in Table 12.
The controllers consider the system almost always useful. Regarding accuracy and robustness, the confidence is lowest but still high (>4). The simulation pilots are slightly more skeptical, but overall trust in the system is well above average.

5.7. Results with Respect to Safety

5.7.1. Software Failure Modes, Effects, and Criticality Analysis (SFMECA)

The risk analysis based on the SFMECA showed that no error case would lead to an increased or unacceptable risk, so that no classification into good and bad recognitions is needed, as mentioned in Section 2.3, although our implementation of speech understanding provides this information. This result was not expected. However, it can be explained as follows: the apron control is not responsible for the runways, i.e., the areas are excluded where wrong decisions have particularly severe effects and where the possibility of detecting errors quickly is reduced (due to higher speeds). There was also no indirect risk of causing distractions through false alarms and thus endangering situational awareness, since no automatic alarm functions (with which the controllers were familiar) were available in the project. Based on the safety analysis, it was therefore possible to decide to implement all commands directly in the A-SMGCS, except those identified as nonsensical by plausibility checks.
For the most critical command “Handover”, it was decided to always offer an undo function. A mistakenly executed handover of a flight to another working position would cause all subsequent commands to be discarded: the aircraft would be assigned to another working position and therefore be unavailable for incoming commands at the actual working position. However, the error is very easy to detect with the implemented visualization, and we offered a one-click solution to undo it.
In addition to considering errors of the AI-based ABSR, this project also discussed and considered the safety issues of introducing automation in general. In addition to the direct effects of automation errors, increased automation can affect safety in the following ways:
  • Indirect impact due to automation errors (too many disruptive errors, either due to a lack of recognition or incorrect recognition);
  • Lack of visibility of the automation result (a loss of “situational awareness”);
  • Lack of flexibility (no possibility of correction or override by the user and therefore a loss of control);
  • Overconfidence/complacency.
The approaches in this project for addressing these risks were the following:
  • Achieve sufficient recognition rates and sufficiently low recognition error rates to prevent potential overload from occurring in the first place.
  • Make the results visible enough for users to retain situational awareness at all times.
  • Allow human operators to make corrections to automation errors in order to remain in control.
  • Assessments of risk by overconfidence through safety considerations: what can happen if automation errors are not corrected?
This was validated in multiple ways: (1) indirectly, by selecting particularly challenging simulation scenarios that go beyond the usual in terms of traffic density, by evaluating the required number of interactions with the user interface, and by measuring cognitive load using secondary tasks; and (2) directly through test subjects filling out standardized questionnaires on situational awareness and trust in automation.

5.7.2. Feedback from the Test Subjects on Safety

From simulations at the beginning of the project, in which significantly more errors happened, and significantly fewer commands were available for automatic execution, the following feedback was obtained:
  • Since the speech recognizer still makes mistakes and you have to check if everything is correct whenever you are spoken to, you are less free in your timing. One also expects that, e.g., the callsign is highlighted. If that doesn’t happen, you’re wondering why it didn’t work.
The feedback on safety gradually became less negative as the project progressed and the number of errors decreased. At the beginning, there were definite impairments of a smooth workflow, because the controllers had to wait for the implementation or were inclined to always check its correctness. When most commands worked and the error rate had dropped significantly, there were no more comments suggesting a negative impact or reduced safety. This confirms the work on the safety of the overall validation system and the analysis from the safety assessment.
After the simulation runs, subjects were always questioned about safety. The following responses (translated analogously by the authors) are representative of the sentiments:
  • You could always see if there were errors or not.
  • The delay is fine. You can already talk to the next pilot or you get the indication during the readback. That’s sufficient.
  • The errors were very few. They couldn’t put us in critical situations.
  • Here, the aircraft are controlled very directly because the simulator directly implements commands [with voice recognition enabled] [including errors]. A pilot would not do that. That’s why it [emerging situations] would be less critical in real life.
  • If something takes too long, you leave it out.—If the pilot executes it correctly, it’s okay.—If incorrectly detected, the worst thing that can happen is false alarms.

5.7.3. Summary of All Feedback Collected

Feedback was consistently positive toward the solution with ABSR support. The controllers were mostly surprised that the system worked so well, even though it is still at a research stage. It was emphasized that it made no qualitative difference to the ABSR system (1) whether the controller spoke quickly or slowly, or (2) whether the controller strictly adhered to International Civil Aviation Organization (ICAO) phraseology [16] in his/her utterances or deviated from it to a greater or lesser extent, for example due to increased traffic density and high radio frequency use. The system was very well received because it did not require the controllers to change: they could simply speak as they were accustomed to, and the correct action still occurred in most cases. The controllers said that this left more time to keep an eye on the traffic instead of staring at the display.
It was also noted that with ABSR support, the controllers sometimes instruct different taxi routes than when they have to input the route manually: If a route is pre-selected by the system, then it is easier to follow it than to change it manually. But if the controllers can simply use speech to change the route instead of having to enter it manually, then they are more likely to change the route, e.g., to shorten the aircraft’s taxiing time.
The controllers as well as the simulation pilots indicated that the workload decreases significantly with ABSR support. The best feedback was for the working position West. Here, almost everything was correctly recognized for everybody. Recognition was also good for the East and Center positions, but there were also minor misrecognitions.
There were hardly any critical voices. Rather, there were suggestions on how to make the system even better, e.g., that the recognition should be faster and more accurate, so that there would be even fewer false recognitions. The command types “Hold Abeam” and “Pushback Abeam”, for example, could not be implemented within the resources of the STARFiSH project. Over the days, the feedback from the controllers involved was qualitatively repetitive, so it became apparent that the different controllers had the same good experiences with the system.

5.8. Results with Respect to Validation Hypotheses

In Section 4.2.1, we formulated Hypotheses H1 to H11. The results used to validate or falsify these hypotheses were presented in the sections above. This subsection summarizes the results with respect to each hypothesis.

5.8.1. Hypotheses with Respect to “Number of Manual Inputs”

The results with respect to these hypotheses are presented in Section 5.2 in Figure 10. The number of inputs from the simulation pilots (dependent variable, DV-Input-H-P-less_input) is reduced by a factor of 2.5, and the number of manual inputs of the apron controllers (dependent variable, DV-Input-H-C-less_input) is even reduced by a factor of more than 6. Therefore, we mark the following two hypotheses as validated.
H1. 
(H-C-less_input): Automatic documentation (conditions JC and CP) reduces the total number of manual inputs to guide taxiing traffic at the controllers’ working position compared to full manual input (conditions NO and JP). Validated
H2. 
(H-P-less_input): Automatic command recognition for the simulation pilots (conditions JP and CP) reduces the total number of manual inputs to guide the taxiing traffic of the simulation pilots compared to full manual input (conditions NO and JC). Validated

5.8.2. Hypothesis with Respect to “Free Cognitive Resources of Apron Controller”

The results with respect to this hypothesis are presented in Section 5.4 in Figure 11. The number of correct Stroop tasks increased for the West and Center positions and did not decrease for the East position. Therefore, we mark the following hypothesis as partially validated.
H3. 
(H-C-more_cog_res): Automatic documentation (conditions JC and CP) increases the controller’s free cognitive resources compared to full manual input (conditions JP and NO). Partially Validated

5.8.3. Hypothesis with Respect to “Apron Controller Workload Reduction”

The results with respect to this hypothesis are presented in Section 5.3 in Table 10. The workload was reduced on average by 2.2 scale units on the 20-unit NASA TLX scale. Therefore, we mark the following hypothesis as validated.
H4. 
(H-C-less_workload): Automatic documentation (conditions JC and CP) reduces the workload of the controller compared to full manual input (conditions JP and NO). Validated

5.8.4. Hypothesis with Respect to “Apron Controller’s Situational Awareness”

The results with respect to this hypothesis are presented in Section 5.5 in Table 11. The situational awareness over all three positions and over both operating directions increased from 4.2 to 4.6 (maximum value of 6.0). The lowest effect was measured for the West position with an increase of 0.1 scale points. However, situational awareness was already high without ABSR support at this position (4.8). Therefore, we mark the following hypothesis as validated.
H5. 
(H-C-sit_aw_ok): Automatic documentation (conditions JC and CP) does not limit the controller’s situational awareness compared to full manual input (conditions JP and NO). Validated

5.8.5. Hypotheses with Respect to “Apron Controller’s Confidence”

The results with respect to these hypotheses are presented in Section 5.6 in Table 12. The average value for the apron controllers was 4.6 and that for the simulation pilots was 4.5. These values are clearly above the scale midpoint of 3.0, and the lowest individual value for both groups (4.2) is also well above it. Therefore, we mark both of the following hypotheses as validated.
H6. 
(H-C-conf): Controller confidence in command entry automation (conditions JC and CP) is above average. Validated
H7. 
(H-P-conf): Simulation pilot’s confidence in command entry automation (conditions JP and CP) is above average. Validated

5.8.6. Hypotheses with Respect to “Automatic Speech Understanding for Complete Commands”

The results with respect to these hypotheses are presented in Section 5.1.2 in Table 8. Assuming the availability of push-to-talk, we measured an average command recognition rate of 91.2%, which is above the threshold of 90%. The command recognition error rate of 3.2% is also better than the threshold of 5%. Therefore, we mark both of the following hypotheses as validated.
H8. 
(H-E-CmdRR): The command extraction rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (command extraction rate for simulation-relevant commands >90%). Validated
H9. 
(H-E-CmdER): The command extraction error rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (command extraction error rate for simulation-relevant commands <5%). Validated
The results with respect to callsign recognition are also presented in Table 8. The callsign recognition rate of 97.4% is better than the threshold of 97%, and the callsign recognition error rate of 1.3% is also better than the threshold of 2%. Therefore, we mark the following hypotheses both as validated.
H10. 
(H-E-CsgRR): The callsign extraction rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (>97%). Validated
H11. 
(H-E-CsgER): The callsign extraction error rate (conditions JC, JP, and CP) in the apron environment is comparable to the quality already achieved in the approach domain by ABSR (callsign extraction error rate <2%). Validated

6. Discussion

The STARFiSH project was, of course, subject to some restrictions that determined what could be researched within the given time and budget. This section, therefore, discusses possibilities for improvements that could be addressed in the future and highlights some aspects that proved to be useful within this project.
The SFMECA (see Section 2.3 and Section 5.7) produced no RPNs that mandated mitigation actions. This was due to the environment in which the project was executed: Areas of responsibility for the apron control did not include runways and no automatic alerting functions were implemented at the baseline A-SMGCS. For an environment without these limitations, the SFMECA is expected to produce different results and additional challenges for usability and safety. In addition, while the SFMECA itself was chosen as a proven instrument, there are suggestions for amendments to the methodology which could be used in order to address its specific shortcomings [48].
The results in Section 5.1.1 and Section 5.1.2 show that the use of voice activity detection significantly degrades the overall performance. The push-to-talk signal should therefore be used whenever possible. Especially in an operational scenario, voice activity detection should not be considered as an alternative, since the push-to-talk signal is in use anyway, and technical access should not be an issue. Nevertheless, in non-operational scenarios where, for technical reasons, the push-to-talk signal might not be available, more modern approaches to voice activity detection based on neural network architectures could be exploited [49].
In Section 3.2.3, the rule-based algorithm for speech understanding was mentioned. This approach offers very precise control over what is extracted and how the extraction itself takes place. The disadvantage of this method, on the other hand, is that every adaptation has to be programmed manually, which can create a lot of effort. Future projects could ease the adaptation process by fine-tuning pre-trained language models such as BERT [50], which could then recognize the different elements of the ontology [51].
The iterative approach taken for the development of the whole system, and also for the training and improvement of the speech recognition and understanding modules, proved to be very useful throughout this project. The different prototypes made it possible to involve the apron controllers (end users) at an early stage of development and to incorporate their feedback into subsequent prototypes. This not only improved the system itself but also made the controllers feel involved and interested in the system and in what can be achieved with such a technology. The iterative improvement of the speech recognition components was also useful with respect to the transcription and annotation process of the recorded data: as the recognition performance of these components improved over the iterations, the manual work to correct and verify transcriptions and annotations could be reduced.
One of the next steps should be to move the developed system from the simulation into an operational environment to see how big the difference to real world operations is and what obstacles have to be overcome. A first step could be to run the system in shadow mode so that it does not interfere with the operating systems, but operational experts could monitor how the system would react.

7. Conclusions

The STARFiSH project was the first to implement a speech recognition and understanding system for a complex apron environment at Frankfurt Airport. DLR’s ABSR system was successfully coupled with the commercial A-SMGCS system from ATRiCS, i.e., a previously prototypical technology from a scientific environment was integrated into a commercial system that is available on the market. The solution was iteratively improved and finally tested in validation trials with 14 different apron controllers in 29 simulation runs in the tower simulator of Fraport. A total of 43 h of validation data (radar, audio, HMI inputs, etc.) were recorded and subsequently analyzed.
A main objective of the STARFiSH project was to prepare the usage of an artificial intelligence-powered speech recognition and understanding system in the safety-critical environment of the ops room at a major European hub airport. The formal method SFMECA (Software Failure Modes, Effects, and Criticality Analysis) for risk assessment and subsequent identification of mitigation measures was applied, with the very encouraging result that no error case would lead to an increased or unacceptable risk. At the same time, it could be shown that such an AI-equipped application can be operated safely in aviation and, moreover, does not have a negative impact on the controllers’ situational awareness.
When supported by ABSR, the controllers made more than six times fewer manual entries into the A-SMGCS. This already includes the correction of wrong or missing recognitions from the speech recognition and understanding support. A recognition rate of 91.8% on the command level was observed, i.e., the callsign, the command type, the command values, e.g., taxi routes, and the command conditions were correctly extracted in 91.8% of the cases.

Author Contributions

Conceptualization, M.K. and H.H.; methodology, H.H.; software, M.K., H.H., O.O., S.S. (Shruthi Shetty) and H.E.; validation, M.K., H.W., M.M. and S.S. (Susanne Schacht); formal analysis, H.H.; investigation, H.H. and M.K.; resources, H.W. and S.S. (Susanne Schacht); data curation, H.W., M.K., H.H. and O.O.; writing—original draft preparation, M.K., O.O. and H.H.; writing—review and editing, O.O., H.W., M.M., H.E., S.S. (Shruthi Shetty) and S.S. (Susanne Schacht); visualization, H.H., M.K., H.E., O.O., M.M. and S.S. (Susanne Schacht); supervision, M.K.; project administration, M.K.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

The project STARFiSH was funded by the German Federal Ministry of Education and Research, under support code 01IS20017C.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank the Fraport controllers for their participation in this study.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of this study, in the collection, analyses, or interpretation of the data, in the writing of the manuscript, and in the decision to publish the results.

Appendix A

The following dependent variables of the final validation trials are considered. The respective results of a dependent variable are each compared between the different operational conditions within a scenario.

Appendix A.1. DV-Input: Number of Manual Inputs for Control by Controllers/Simulation Pilots

The manual inputs are counted. Since the inputs are identifiable by type, certain types can be highlighted if it turns out that some types occur particularly frequently or infrequently. The total count is compared between simulation runs with and without ABSR support.
These dependent variables are used to validate/falsify the following hypotheses:
  • DV-Input-H-C-less_input (ABSR for the controllers reduces the number of manual inputs).
  • DV-Input-H-P-less_input (ABSR for the simulation pilots reduces the number of manual inputs).

Appendix A.2. DV-Cog-Res: Measurement of Cognitive Resources by Secondary Task

The cognitive resources are measured by means of a secondary task, i.e., a task the test subject (controller or simulation pilot) performs during a scenario in parallel to the main task and is only allowed to perform when no mental resources are needed for the main task. The secondary task consists of performing a repeated Stroop test in a web application; see Section 4.4. The number of correctly completed tests in a given time period is a measure of free cognitive resources. For this purpose, the responses per item are categorized as correct/wrong, and the number per time is plotted as a histogram and compared between simulation runs with and without ABSR support. These dependent variables are used to validate/falsify the following hypothesis:
  • DV-Cog-Res-H-C-more_cog_res (more free cognitive resources of the controller due to ABSR).

Appendix A.3. DV-Workload Scoring by NASA TLX

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-Workload-H-C-less_workload (less controller workload due to ABSR).

Appendix A.4. DV-Sit-Aw Scoring According to SHAPE-SASHA

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-Sit-Aw-H-C-sit_aw_ok (situational awareness of the controller).

Appendix A.5. DV-Trust: Scoring According to SHAPE-SATI

These dependent variables are used to validate/falsify the following hypotheses:
  • DV-Trust-H-C-conf (automation trust of the controller).
  • DV-Trust-H-P-conf (automation trust of the simulation pilot).

Appendix A.6. DV-CmdRR: Command Extraction Rate

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-CmdRR-H-E-CmdRR (comparable command extraction rate as in the approach environment).

Appendix A.7. DV-CmdER: Command Extraction Error Rate

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-CmdER-H-E-CmdER (comparable command extraction error rate as in the approach environment).

Appendix A.8. DV-CsgRR: Callsign Extraction Rate

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-CsgRR-H-E-CsgRR (comparable callsign extraction rate as in the approach environment).

Appendix A.9. DV-CsgER: Callsign Extraction Error Rate

This dependent variable is used to validate/falsify the following hypothesis:
  • DV-CsgER-H-E-CsgER (comparable callsign extraction error rate as in the approach environment).

Appendix B

The task of the module “Concept Interpretation” is to transfer only those commands into the assistance system that are plausible and fit into the current traffic context. In the following, we describe the steps already mentioned in Section 3.2.5 in more detail.

Appendix B.1. Preprocessing

The modules described in Section 3.2.2 and Section 3.2.3 generate data telegrams for the respective assigned working position. These data telegrams contain, among other things, the extracted ATC concepts with the semantics according to the ontology for the annotation of ATC utterances. An example of the logical content of a data telegram is presented in Box A1.
Box A1. Logical content example of a data telegram which has to be preprocessed for the A-SMGCS.
Sender: MC East
Callsign: DLH4YE
Command: GREETING
Command: TAXI (TO) V106
Command: TAXI (VIA) L
The interpretation of such a data telegram requires several checking steps, depending on the commands contained, before one or more inputs can be safely made to the A-SMGCS. Basically, this step checks whether the working position assigned to the sender is authorized to make entries for the callsign or whether another working position is responsible for the aircraft of this callsign.
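A minimal sketch of this preprocessing step is given below (illustrative Python; the data structure and the responsibility lookup are hypothetical simplifications of the actual A-SMGCS interfaces):

```python
from dataclasses import dataclass

@dataclass
class DataTelegram:
    """Simplified stand-in for the data telegrams of Box A1 (illustrative only)."""
    sender: str      # working position, e.g., "MC East"
    callsign: str    # e.g., "DLH4YE"
    commands: list   # extracted ATC concepts according to the ontology

def responsible_position(callsign, responsibility_db):
    """Hypothetical lookup of the working position currently responsible for a callsign."""
    return responsibility_db.get(callsign)

def preprocess(telegram, responsibility_db):
    """Accept the telegram only if the sender is responsible for the callsign."""
    if responsible_position(telegram.callsign, responsibility_db) != telegram.sender:
        return None  # another working position is responsible; do not enter commands
    return telegram.commands

# Example based on Box A1
telegram = DataTelegram("MC East", "DLH4YE",
                        ["GREETING", "TAXI (TO) V106", "TAXI (VIA) L"])
commands = preprocess(telegram, {"DLH4YE": "MC East"})
```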

Appendix B.2. Highlighting the Aircraft Symbol on the Basis of the Recognized Callsign

If the basic check of this preprocessing is successful, the corresponding aircraft symbol is highlighted at the assigned working position to inform the human operator for which flight a command has been recognized.

Appendix B.3. Checking and Interpretation

Depending on the command type and the characteristics of the command received, the following checks and computation steps are performed.

Appendix B.3.1. Triggering Multiple Actions Based on a Single Command

Some commands require that multiple actions are triggered by the same command in the correct order. For example, a TAXI command should trigger a TAXI clearance in the system and create a taxi route. It may also be necessary to modify an existing route and cancel stop instructions in certain situations.
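The following sketch illustrates such an expansion for a TAXI command (illustrative Python; the flight object, action names, and ordering rules are assumptions, not the actual A-SMGCS interface):

```python
from types import SimpleNamespace

def actions_for_taxi_command(flight):
    """Illustrative expansion of a single TAXI command into ordered A-SMGCS actions."""
    actions = []
    if flight.has_active_stop_instruction:
        actions.append("CANCEL_STOP")      # stop instructions must be lifted first
    if flight.has_route:
        actions.append("MODIFY_ROUTE")     # adapt the existing route to the new clearance
    else:
        actions.append("CREATE_ROUTE")     # create a taxi route along the cleared path
    actions.append("SET_TAXI_CLEARANCE")   # finally, set the clearance in the system
    return actions

# Example: a flight with a stop instruction and no route yet
flight = SimpleNamespace(has_active_stop_instruction=True, has_route=False)
print(actions_for_taxi_command(flight))  # ['CANCEL_STOP', 'CREATE_ROUTE', 'SET_TAXI_CLEARANCE']
```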

Appendix B.3.2. Discarding Commands Incompatible with the Traffic Situation

It may happen that a command is received that does not make sense in the current traffic situation, e.g., a TAXI command for an inbound flight that contains a runway as the destination of the route. If possible, such cases are detected by checking a set of rules. The command is then ignored, and a message is displayed to the human operator. The cause of an incompatible command cannot be determined here: it could be an error of the controller, or it could originate in the ABSR system.
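A minimal sketch of one such plausibility rule (illustrative Python; function and parameter names are hypothetical, and the real system evaluates a whole set of such rules before a command is implemented):

```python
def plausible_taxi_destination(flight_direction, destination, runways):
    """Illustrative rule: an inbound flight must not be routed to a runway."""
    if flight_direction == "inbound" and destination in runways:
        return False  # ignore the command and show a message to the human operator
    return True

# Example: a recognized "TAXI TO RUNWAY 25C" for an arrival is discarded
ok = plausible_taxi_destination("inbound", "25C", {"25C", "25R", "07L"})
print(ok)  # False
```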

Appendix B.3.3. Correctly Interpret Context-Dependent Commands

Some commands must be interpreted differently depending on the flight plan data or other circumstances identified by the A-SMGCS. For example, if a special routing procedure is set for a flight in the system depending on certain conditions in the database, the system must assign a different route than in the normal case. The same utterance by the controller, therefore, leads to different results in the system depending on the data situation.

Appendix B.3.4. Completing Incomplete Commands from Current Traffic Situation

There are commands that do not contain all the necessary information to be able to implement them directly. For example, the command “GIVE_WAY … A320 RIGHT” needs further analysis, assuming that there is more than one A320 aircraft moving at the airport. The transcribed part “from the right” can be ambiguous, so it needs to be determined algorithmically which aircraft is probably meant from the controller’s and pilot’s perspective. In such cases, configuration tables, algorithms, state machines, and rules stored in the code are used to generate the correct command appropriate to the traffic situation.

Appendix B.3.5. Conversion of Commands

Once a command has been established by the previous checks, it can be implemented. This means that the command is entered into the A-SMGCS, i.e., the internal system state is changed to reflect the command. For the controller, this results in visual feedback on the working position. For example, the callsign that the controller addressed is highlighted near the corresponding aircraft symbol on the ground situation display. A change in the route is illustrated by colored lines, and changes in clearances are indicated on the label of the corresponding flight.

Appendix B.3.6. Dealing with Detected Errors

If a command does not pass one of the plausibility checks, an error message is displayed. This gives the human operator the option to check the situation and either ignore it or correct it.

Appendix B.3.7. Undetected Errors and Identification of Error Sources

It is not possible to identify for each command whether it is operationally correct. It is also not possible to determine whether a detected error originates from the ABSR system or was made by the controller. The controller, as the user of the system, must therefore observe the output of the A-SMGCS and anticipate, detect, and correct error situations not detected by the system. In the training of the controllers, this behavior is trained specifically and repeatedly, since errors of the human actors involved (e.g., the pilots) and the electronic systems must always be expected. It is always necessary to make a trade-off between the recognition rate and the recognition error rate. For example, a 0% error rate can be achieved simply by discarding every recognized command.

References

  1. Kleinert, M.; Shetty, S.; Helmke, H.; Ohneiser, O.; Wiese, H.; Maier, M.; Schacht, S.; Nigmatulina, I.; Sarfjoo, S.S.; Motlicek, P. Apron Controller Support by Integration of Automatic Speech Recognition with an Advanced Surface Movement Guidance and Control System. In Proceedings of the 12th SESAR Innovation Days, Budapest, Hungary, 5–8 December 2022. [Google Scholar]
  2. International Civil Aviation Organization (ICAO). Advanced Surface Movement Control and Guidance Systems (ASMGCS) Manual, Doc 9830 AN/452, 1st ed.; International Civil Aviation Organization (ICAO): Montréal, QC, Canada, 2004. [Google Scholar]
  3. Helmke, H.; Ohneiser, O.; Mühlhausen, T.; Wies, M. Reducing Controller Workload with Automatic Speech Recognition. In Proceedings of the 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, USA, 25–29 September 2016. [Google Scholar]
  4. Helmke, H.; Ohneiser, O.; Buxbaum, J.; Kern, C. Increasing ATM Efficiency with Assistant Based Speech Recognition. In Proceedings of the 12th USA/Europe Air Traffic Management Research and Development Seminar (ATM2017), Seattle, WA, USA, 26–30 June 2017. [Google Scholar]
  5. European Commission. L 36/10; Commission Implementing Regulation (EU) 2021/116 of 1 February 2021 on the Establishment of the Common Project One Supporting the Implementation of the European Air Traffic Management Master Plan Provided for in Regulation (EC) No 550/2004 of the European Parliament and of the Council, Amending Commission Implementing Regulation (EU) No 409/2013 and Repealing Commission Implementing Regulation (EU) No 716/2014. Official Journal of the European Union: Luxembourg, 1 February 2021.
  6. Helmke, H.; Rataj, J.; Mühlhausen, T.; Ohneiser, O.; Ehr, H.; Kleinert, M.; Oualil, Y.; Schulder, M. Assistant-Based Speech Recognition for ATM Applications. In Proceedings of the 11th USA/Europe Air Traffic Management Research and Development Seminar (ATM2015), Lisbon, Portugal, 23–26 June 2015. [Google Scholar]
  7. Davis, K.H.; Biddulph, R.; Balashek, S. Automatic recognition of spoken digits. J. Acoust. Soc. Am. 1952, 24, 637–642. [Google Scholar] [CrossRef]
  8. Juang, B.H.; Rabiner, L.R. Automatic speech recognition–a brief history of the technology development. Ga. Inst. Technol. Atlanta Rutgers Univ. Univ. Calif. St. Barbar. 2005, 1, 67. [Google Scholar]
  9. Connolly, D.W. Voice Data Entry in Air Traffic Control; Report N93-72621; National Aviation Facilities Experimental Center: Atlantic City, NJ, USA, 1977. [Google Scholar]
  10. Hamel, C.; Kotick, D.; Layton, M. Microcomputer System Integration for Air Control Training; Special Report SR89-01; Naval Training Systems Center: Orlando, FL, USA, 1989. [Google Scholar]
  11. FAA. National Aviation Research Plan (NARP); FAA: Washington, DC, USA, 2012. [Google Scholar]
  12. Updegrove, J.A.; Jafer, S. Optimization of Air Traffic Control Training at the Federal Aviation Administration Academy. Aerospace 2017, 4, 50. [Google Scholar] [CrossRef] [Green Version]
  13. Schäfer, D. Context-Sensitive Speech Recognition in the Air Traffic Control Simulation. Eurocontrol EEC Note No. 02/2001. Ph.D. Thesis, University of Armed Forces, Munich, Germany, 2001. [Google Scholar]
  14. Tarakan, R.; Baldwin, K.; Rozen, R. An automated simulation pilot capability to support advanced air traffic controller training. In Proceedings of the 26th Congress of the International Council of the Aeronautical Sciences, Anchorage, Alaska, 14–19 September 2008. [Google Scholar]
  15. Ciupka, S. Siris big sister captures DFS (original German title: Siris große Schwester erobert die DFS). Transmission 2012, 1. [Google Scholar]
  16. Doc 4444 ATM/501; ATM (Air Traffic Management): Procedures for Air Navigation Services. International Civil Aviation Organization (ICAO): Montréal, QC, Canada, 2007.
  17. Cordero, J.M.; Dorado, M.; de Pablo, J.M. Automated speech recognition in ATC environment. In Proceedings of the 2nd International Conference on Application and Theory of Automation in Command and Control Systems (ATACCS’12), London, UK, 29–31 May 2012; IRIT Press: Toulouse, France, 2012; pp. 46–53. [Google Scholar]
  18. Cordero, J.M.; Rodríguez, N.; de Pablo, J.M.; Dorado, M. Automated Speech Recognition in Controller Communications applied to Workload Measurement. In Proceedings of the 3rd SESAR Innovation Days, Stockholm, Sweden, 26–28 November 2013. [Google Scholar]
  19. Nguyen, V.N.; Holone, H. N-best list re-ranking using syntactic score: A solution for improving speech recognition accuracy in Air Traffic Control. In Proceedings of the 2016 16th International Conference on Control, Automation and Systems (ICCAS), Gyeongju, Republic of Korea, 16–19 October 2016; pp. 1309–1314. [Google Scholar]
  20. Nguyen, V.N.; Holone, H. N-best list re-ranking using syntactic relatedness and syntactic score: An approach for improving speech recognition accuracy in Air Traffic Control. In Proceedings of the 2016 16th International Conference on Control, Automation and Systems (ICCAS 2016), Gyeongju, Republic of Korea, 16–19 October 2016; pp. 1315–1319. [Google Scholar]
  21. Helmke, H.; Kleinert, M.; Shetty, S.; Ohneiser, O.; Ehr, H.; Arilíusson, H.; Simiganoschi, T.S.; Prasad, A.; Motlicek, P.; Veselý, K.; et al. Readback Error Detection by Automatic Speech Recognition to Increase ATM Safety. In Proceedings of the 14th USA/Europe Air Traffic Management Research and Development Seminar (ATM2021), Virtual, 20–24 September 2021. [Google Scholar]
  22. Ohneiser, O.; Helmke, H.; Shetty, S.; Kleinert, M.; Ehr, H.; Murauskas, Š.; Pagirys, T.; Balogh, G.; Tønnesen, A.; Kis-Pál, G.; et al. Understanding Tower Controller Communication for Support in Air Traffic Control Displays. In Proceedings of the 12th SESAR Innovation Days, Budapest, Hungary, 5–8 December 2022. [Google Scholar]
  23. Helmke, H.; Kleinert, M.; Ahrenhold, N.; Ehr, H.; Mühlhausen, T.; Ohneiser, O.; Motlicek, P.; Prasad, A.; Zuluaga-Gomez, J. Automatic Speech Recognition and Understanding for Radar Label Maintenance Support Increases Safety and Reduces Air Traffic Controllers’ Workload. In Proceedings of the 15th USA/Europe Air Traffic Management Research and Development Seminar (ATM2023), Savannah, GA, USA, 5–9 June 2023. [Google Scholar]
  24. García, R.; Albarrán, J.; Fabio, A.; Celorrio, F.; de Oliveira, C.P.; Bárcena, C. Automatic Flight Callsign Identification on a Controller Working Position: Real-Time Simulation and Analysis of Operational Recordings. Aerospace 2023, 10, 433. [Google Scholar] [CrossRef]
  25. Chen, S.; Kopald, H.D.; Elessawy, A.; Levonian, Z.; Tarakan, R.M. Speech inputs to surface safety logic systems. In Proceedings of the IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), Prague, Czech Republic, 13–17 September 2015. [Google Scholar]
  26. Chen, S.; Kopald, H.D.; Chong, R.; Wei, Y.; Levonian, Z. Read back error detection using automatic speech recognition. In Proceedings of the 12th USA/Europe Air Traffic Management Research and Development Seminar (ATM2017), Seattle, WA, USA, 26–30 June 2017. [Google Scholar]
  27. Helmke, H.; Ondřej, K.; Shetty, S.; Arilíusson, H.; Simiganoschi, T.S.; Kleinert, M.; Ohneiser, O.; Ehr, H.; Zuluaga-Gomez, J.-P.; Smrz, P. Readback Error Detection by Automatic Speech Recognition and Understanding—Results of HAAWAII project for Isavia’s Enroute Airspace. In Proceedings of the 12th SESAR Innovation Days, Budapest, Hungary, 5–8 December 2022. [Google Scholar]
  28. Zuluaga-Gomez, J.-P.; Sarfjoo, S.S.; Prasad, A.; Nigmatulina, I.; Motlicek, P.; Ondřej, K.; Ohneiser, O.; Helmke, H. BERTRAFFIC: BERT-based joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications. In Proceedings of the 2022 IEEE Spoken Language Workshop Technology Workshop (SLT 2022), Doha, Qatar, 9–12 January 2023. [Google Scholar]
  29. Helmke, H.; Slotty, M.; Poiger, M.; Herrer, D.F.; Ohneiser, O.; Vink, N.; Cerna, A.; Hartikainen, P.; Josefsson, B.; Langr, D.; et al. Ontology for transcription of ATC speech commands of SESAR 2020 solution PJ.16-04. In Proceedings of the IEEE/AIAA 37th Digital Avionics Systems Conference (DASC), London, UK, 23–27 September 2018. [Google Scholar]
  30. Kleinert, M.; Helmke, H.; Moos, S.; Hlousek, P.; Windisch, C.; Ohneiser, O.; Ehr, H.; Labreuil, A. Reducing Controller Workload by Automatic Speech Recognition Assisted Radar Label Maintenance. In Proceedings of the 9th SESAR Innovation Days, Athens, Greece, 2–6 December 2019. [Google Scholar]
  31. Lin, Y. Spoken Instruction Understanding in Air Traffic Control: Challenge, Technique, and Application. Aerospace 2021, 8, 65. [Google Scholar] [CrossRef]
  32. Ohneiser, O.; Helmke, H.; Kleinert, M.; Siol, G.; Ehr, H.; Hobein, S.; Predescu, A.-V.; Bauer, J. Tower Controller Command Prediction for Future Speech Recognition Applications. In Proceedings of the 9th SESAR Innovation Days, Athens, Greece, 2–5 December 2019. [Google Scholar]
  33. Ohneiser, O.; Sarfjoo, S.; Helmke, H.; Shetty, S.; Motlicek, P.; Kleinert, M.; Ehr, H.; Murauskas, Š. Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021. [Google Scholar]
  34. Boehm, B. A Spiral Model of Software Development and Enhancement. IEEE Comput. 1988, 21, 61–72. [Google Scholar] [CrossRef]
  35. Neufelder, A.M. Effective Application of Software Failure Modes Effects Analysis; Quanterion Solutions, Incorporated: New York, NY, USA, 2017. [Google Scholar]
  36. Povey, D. Online Endpoint Recognition. 2013. Available online: https://github.com/kaldi-asr/kaldi/blob/master/src/online2/online-endpoint.h (accessed on 15 May 2023).
  37. Povey, D.; Peddinti, V.; Galvez, D.; Ghahremani, P.; Manohar, V.; Na, X.; Wang, Y.; Khudanpur, S. Purely sequence-trained neural networks for ASR based on lattice-free MMI. Interspeech 2016, 2016, 2751–2755. [Google Scholar]
  38. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, Waikoloa, Big Island, HI, USA, 11–15 December 2011. [Google Scholar]
  39. Kleinert, M.; Helmke, H.; Shetty, S.; Ohneiser, O.; Ehr, H.; Prasad, A.; Motlicek, P.; Harfmann, J. Automated Interpretation of Air Traffic Control Communication: The Journey from Spoken Words to a Deeper Understanding of the Meaning. In Proceedings of the IEEE/AIAA 40th Digital Avionics Systems Conference (DASC), Virtual, 3–7 October 2021. [Google Scholar]
  40. Nigmatulina, I.; Zuluaga-Gomez, J.; Prasad, A.; Sarfjoo, S.S.; Motlicek, P. A two-step approach to leverage contextual data: Speech recognition in air-traffic communications. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
  41. Zuluaga-Gomez, J.; Nigmatulina, I.; Prasad, A.; Motlicek, P.; Vesely, K.; Kocour, M.; Szöke, I. Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems. Interspeech 2021, 2021, 3296–3300. [Google Scholar]
  42. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
  43. Stroop, J.R. Studies of interference in serial verbal reactions. J. Exp. Psychol. 1935, 18, 643–662. [Google Scholar] [CrossRef]
  44. Maier, M. Workload-Gauge. Available online: https://github.com/MathiasMaier/workload-gauge (accessed on 15 May 2023).
  45. Helmke, H.; Shetty, S.; Kleinert, M.; Ohneiser, O.; Prasad, A.; Motlicek, P.; Cerna, A.; Windisch, C. Measuring Speech Recognition Understanding Performance in Air Traffic Control Domain Beyond Word Error Rates. In Proceedings of the 11th SESAR Innovation Days, Virtual, 7–9 December 2021. [Google Scholar]
  46. Hart, S.G. NASA-TASK LOAD INDEX (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society, San Francisco, CA, USA, 16–20 October 2006; Volume 50, pp. 904–908. [Google Scholar]
  47. Dehn, D.M. Assessing the Impact of Automation on the Air Traffic Controller: The SHAPE Questionnaires. Air Traffic Control Q. 2008, 16, 127–146. [Google Scholar] [CrossRef]
  48. Di Nardo, M.; Murino, T.; Osteria, G.; Santillo, L.C. A New Hybrid Dynamic FMECA with Decision-Making Methodology: A Case Study in an Agri-Food Company. Appl. Syst. Innov. 2022, 5, 45. [Google Scholar] [CrossRef]
  49. Mihalache, S.; Burileanu, D. Using Voice Activity Detection and Deep Neural Networks with Hybrid Speech Feature Extraction for Deceptive Speech Detection. Sensors 2022, 22, 1228. [Google Scholar] [CrossRef] [PubMed]
  50. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  51. Zuluaga-Gomez, J.; Vesely, K.; Szöke, I.; Motlicek, P.; Kocour, M.; Rigault, M.; Choukri, K.; Prasad, A.; Sarfjoo, S.S.; Nigmatulina, I.; et al. ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications. arXiv 2022, arXiv:2211.04054. [Google Scholar]
Figure 1. Interaction between human operators and digital systems in ground control without an ABSR component (left) and with an ABSR component (right).
Figure 2. Naïve safety classification of ATC commands with respect to safety criticality and recognition quality, as suggested before the project was carried out.
Figure 3. Simulation rooms for controllers and simulation pilots at Fraport. The left part shows the working positions of the three simulation pilots. The right part shows the simulation environment for the three apron controllers.
Figure 4. Areas of responsibility for the East (yellow), Center (red), and West (green) workstations for the Frankfurt apron control.
Figure 5. Simulation setup for validation trials with ABSR in an A-SMGCS.
Figure 6. Modules of assistant-based speech recognition.
Figure 7. ABSR output log on the left screen and airport map for the simulation pilot on the right (photo © Fraport).
Figure 8. Photo (© Fraport) of final validation trial setup with A-SMGCS in front of the controller.
Figure 9. Example Stroop task.
Figure 10. Proportion of remaining manual HMI interactions (by type) of controllers and simulation pilots in runs with ABSR support, compared to runs without ABSR support.
Figure 11. Average number of correct Stroop tests per working position with and without ABSR.
Table 1. Different combinations of ABSR support investigated during validation trials.
Condition Name | Operational Conditions
NO | No ABSR support; manual input, i.e., the baseline scenario. The controllers manually enter the spoken commands via mouse into the controller's HMI of the TowerPad™. Simulation pilots control taxi traffic at their working positions by manual input via mouse and keyboard. This corresponds to the established mode of operation without ABSR.
JC | Use of ABSR support just for controllers, i.e., automatic command recognition support for controllers plus manual correction if ABSR fails. Commands spoken by the controller are processed by the ABSR system and transmitted to the controller's working position, where they are automatically entered for the controller. The controller receives feedback on the recognized commands via the controller's HMI and can correct errors via a mouse. No support by ABSR for the simulation pilots.
JP | Use of ABSR support just for simulation pilots, i.e., automatic command recognition and control for pilots. The commands spoken by the controllers are processed by the ABSR system, transmitted to the working position of the responsible simulation pilot, and automatically executed as control commands for the simulation pilot. The simulation pilot receives feedback on the recognized commands via the simulation pilot's HMI and can correct errors via a mouse and keyboard. No support by ABSR for the controllers.
CP | Use of ABSR support for both controllers and simulation pilots, as described individually for the JC and JP conditions.
Table 2. Number of (#) aircraft in certain areas per operating direction.
Traffic Scenario | # Aircraft | # Arriving Aircraft | # Departing Aircraft | # Expected Aircraft East | # Expected Aircraft Center | # Expected Aircraft West
OD25 | 106 | 46 | 60 | 59 | 61 | 63
OD07 | 106 | 46 | 60 | 57 | 45 | 59
Table 3. Number of commands (# Cmds) and utterances (# Utterances) at each of the three working positions.
Working Position | # Cmds | % of All | # Utterances | % of All
Center | 5858 | 38% | 2437 | 38%
West | 4376 | 28% | 1654 | 26%
East | 5235 | 34% | 2262 | 36%
Table 4. Simulation runs with different operational conditions and simulation scenarios.
Simulation Run Name with OD | Operational Conditions and Traffic Scenario
T | Training of fully manual input and ABSR-supported input with manual corrections at controllers' and simulation pilots' working positions.
CP25 | ABSR support for controllers and simulation pilots.
JC25 | ABSR support just for controllers.
JP25 | ABSR support just for simulation pilots.
NO25 | No ABSR support.
CP07 | ABSR support for controllers and simulation pilots.
JP07 | ABSR support just for simulation pilots.
Table 5. Simulation runs per controller team and day.
Team 1 | Team 2 | Team 3 | Team 4 | Team 5
Day 1 | Day 2 | Day 3 | Day 4 | Day 5
T | T | T | T | T
NO25 | CP25 | JP25 | JP25 | JP07
JC25 | JP25 | NO25 | CP25 | CP07
JP25 | JC25 | CP25 | JP07 | CP25
CP25 | NO25 | JC25 | CP07 | JP25
JP07 | CP07 | JP07 | NO25 | JC25
CP07 | JP07 | CP07 | JC25 | NO25
Table 6. Word error rate of recognized word sequences from the S2T component.
Recognition Mode | WER
Offline (PTT signal simulated) | 3.1% (male: 3.3%, female: 2.6%)
Online (voice activity detection) | 5.0% (male: 5.5%, female: 3.7%)
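Table 6 reports the word error rate (WER) of the recognized word sequences. WER is conventionally derived from the Levenshtein distance [42] between the recognized word sequence and the reference transcription, normalized by the number of reference words. The following minimal Python sketch illustrates that computation; the function names and the example transcription are illustrative only and are not taken from the validation data.

```python
def levenshtein(ref_words, hyp_words):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn hyp_words into ref_words (dynamic programming)."""
    m, n = len(ref_words), len(hyp_words)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]


def word_error_rate(reference, hypothesis):
    """WER = edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)


# Illustrative example (not from the validation data): one word of the
# reference transcription is missing in the recognized hypothesis.
print(word_error_rate("lufthansa six nine five taxi via november ten",
                      "lufthansa six nine five taxi via november"))  # 0.125
```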
Table 7. Example for speech understanding metrics.
Actual Commands | Recognized Commands | Contribution to Metric
DLH695 TURN RIGHT | DLH695 TURN LEFT | ⊖ (wrong command recognized)
DLH695 TAXI VIA N10 N | DLH695 TAXI VIA N10 N | ⊕ (correctly recognized)
(no command spoken) | DLH695 TAXI TO V162 | ⊖ (command recognized although never spoken)
AUA1F PUSHBACK | AUA1F NO_CONCEPT | ○ (spoken command rejected)
CCA644 NO_CONCEPT | CCA644 NO_CONCEPT | ⊕ (correctly recognized)
Recognition rate (⊕) = 2/4 = 50% | Error rate (⊖) = 2/4 = 50% | Rejection rate (○) = 1/4 = 25%
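For clarity, the following Python sketch recomputes the rates of Table 7 from the example above. It encodes our reading of the counting convention (cf. [45]): all rates are normalized by the number of actual commands, a recognized command without a spoken counterpart counts towards the error rate, and a NO_CONCEPT output for a spoken command counts as a rejection. The data structure and variable names are illustrative and not the format used by the ABSR system.

```python
NO_CONCEPT = "NO_CONCEPT"

# Actual vs. recognized commands of the Table 7 example; None marks a
# recognized command for which nothing was actually spoken.
pairs = [
    ("DLH695 TURN RIGHT",     "DLH695 TURN LEFT"),
    ("DLH695 TAXI VIA N10 N", "DLH695 TAXI VIA N10 N"),
    (None,                    "DLH695 TAXI TO V162"),
    ("AUA1F PUSHBACK",        "AUA1F NO_CONCEPT"),
    ("CCA644 NO_CONCEPT",     "CCA644 NO_CONCEPT"),
]

n_actual = sum(1 for actual, _ in pairs if actual is not None)
correct = errors = rejected = 0
for actual, recognized in pairs:
    if actual is None:
        errors += 1        # command recognized although never spoken
    elif recognized.endswith(NO_CONCEPT) and not actual.endswith(NO_CONCEPT):
        rejected += 1      # spoken command rejected (no usable output)
    elif actual == recognized:
        correct += 1       # callsign and command correctly extracted
    else:
        errors += 1        # wrong recognition

print(f"Recognition rate: {correct / n_actual:.0%}")   # 50%
print(f"Error rate:       {errors / n_actual:.0%}")    # 50%
print(f"Rejection rate:   {rejected / n_actual:.0%}")  # 25%
```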
Table 8. Recognition (RecR), error (ErrR), and rejection (RejR) rate for commands (Cmds) and callsigns (Csgn) in [%] and absolute numbers of utterances (# Utterances) and commands (# Commands).
Recognition Mode | # Utterances | # Commands | Cmds RecR | Cmds ErrR | Cmds RejR | Csgn RecR | Csgn ErrR | Csgn RejR
Offline (PTT signal simulated) | 5495 | 13,251 | 91.8 | 3.2 | 5.4 | 97.4 | 1.3 | 1.3
Online (voice activity detection) | 5432 | 13,168 | 88.7 | 4.3 | 7.5 | 95.2 | 2.3 | 2.4
Offline (no callsign prediction used) | | | 76.3 | 10.5 | 13.7 | 81.1 | 9.6 | 9.3
Delta to context | | | 15.5 | −7.3 | −8.3 | 16.3 | −8.3 | −8.0
“Delta to context” is the difference between the “Offline (PTT signal simulated)” row and the “Offline (no callsign prediction used)” row.
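The last two rows of Table 8 quantify how much ABSR benefits from A-SMGCS context, i.e., from knowing which callsigns are currently plausible on the apron. The sketch below is only a simplified illustration of this idea under our own assumptions, not the integration used in the trials: a recognized callsign is snapped onto the closest callsign known from surveillance data and rejected if no candidate is sufficiently similar. The callsigns, threshold, and function name are hypothetical.

```python
import difflib

def snap_to_known_callsign(recognized, known_callsigns, cutoff=0.75):
    """Return the surveillance callsign most similar to the recognized
    one, or None if no candidate is similar enough (the command is then
    rejected rather than attributed to the wrong aircraft)."""
    matches = difflib.get_close_matches(recognized, known_callsigns,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical surveillance picture and ABSR outputs:
known = ["DLH695", "AUA1F", "CCA644", "BAW912"]
print(snap_to_known_callsign("DLH659", known))  # 'DLH695' (digit twist corrected)
print(snap_to_known_callsign("KLM1AB", known))  # None (no plausible match)
```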
Table 9. Recognition (RecR), error (ErrR), and rejection (RejR) rate for specific command types in [%] and absolute number of commands (# Cmds) of that type.
Command Type | # Cmds | RecR | ErrR | RejR
TAXI VIA | 2922 | 86.9 | 3.9 | 9.1
HOLD_SHORT | 1837 | 89.3 | 0.8 | 9.9
TAXI TO | 1406 | 89.0 | 1.1 | 9.9
CONTACT_FREQUENCY | 1387 | 95.7 | 0.7 | 3.6
CONTINUE TAXI | 1102 | 95.4 | 0.0 | 4.6
GIVE_WAY | 728 | 69.6 | 10.2 | 20.3
CONTACT | 672 | 98.4 | 0.3 | 1.3
PUSHBACK | 663 | 92.3 | 1.2 | 6.5
TURN | 359 | 89.2 | 3.9 | 6.9
HOLD_POSITION | 223 | 93.4 | 0.0 | 6.6
The worst ErrR overall is observed for GIVE_WAY commands.
Table 10. NASA TLX questionnaire results of the controllers on perceived workload.
Workload | West Base | West Sol | West α | Center Base | Center Sol | Center α | East Base | East Sol | East α | All Base | All Sol | All α
Mental Demand [MD] | 7.9 | 6.4 | 14.5% | 14.4 | 12.4 | 1.1% | 14.7 | 13.8 | 10.3% | 12.3 | 10.8 | 1.3%
Physical Demand [PD] | 7.6 | 3.3 | 0.4% | 9.4 | 5.6 | 0.3% | 10.6 | 7.9 | 2.5% | 9.2 | 5.6 | 3 × 10⁻⁴
Temporal Demand [TD] | 8.1 | 5.9 | 4.4% | 12.8 | 10.7 | 2.4% | 14.1 | 13.1 | 6.2% | 11.7 | 9.9 | 0.3%
Effort [EF] | 8.6 | 5.5 | 2.1% | 12.6 | 10.5 | 1.3% | 14.5 | 12.4 | 1.6% | 11.9 | 9.5 | 1 × 10⁻³
Frustration [FR] | 4.0 | 3.4 | 23.9% | 6.8 | 3.5 | 1.0% | 7.2 | 5.0 | 6.1% | 6.0 | 4.0 | 0.2%
ALL | 7.2 | 4.9 | 2.5% | 11.2 | 8.5 | 0.2% | 12.2 | 10.4 | 1.4% | 10.2 | 8.0 | 6 × 10⁻⁴
Minimal α values: shaded in green for 0% ≤ α < 5%, in light green for 5% ≤ α < 10%, and in yellow otherwise (α ≥ 10%). As the numbers show no evidence that results with ASRU support are worse, no further color coding is needed.
Table 11. Results of SASHA questionnaire with and without ABSR support.
Situational Awareness | Without ABSR (W/C/E) | With ABSR (W/C/E)
ahead of traffic | 4.4 (5.2/4.0/4.0) | 4.7 (5.4/4.2/4.6)
focus single problem | 4.3 (4.9/3.9/4.2) | 4.3 (4.2/4.3/4.5)
risk of forgetting | 3.9 (4.8/3.5/3.4) | 4.4 (4.9/4.1/4.3)
able to plan | 4.1 (4.7/3.9/3.6) | 4.6 (4.9/4.6/4.3)
surprised by event | 4.5 (5.1/4.5/4.0) | 4.9 (5.4/4.7/4.7)
search information | 3.9 (4.1/3.8/3.9) | 4.4 (4.5/4.7/4.0)
ALL | 4.2 (4.8/3.9/3.8) | 4.6 (4.9/4.4/4.4)
Table 12. Results of SATI questionnaire.
Automation Trust | Controllers (15) | Simulation Pilots (6)
useful | 5.1 | 4.6
reliable | 4.5 | 4.2
accurate | 4.3 | 4.2
understandable | 4.9 | 5.3
robust | 4.2 | 4.2
confident | 4.8 | 4.3
ALL | 4.6 | 4.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
