Future Speech Interfaces with Sensors and Machine Intelligence

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (10 September 2022) | Viewed by 39025

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor: Prof. Bruce Denby
Electronics Department, Sorbonne Université, 75005 Paris, France
Interests: signal processing; telecommunications; machine learning

Guest Editor: Dr. Tamás Gábor Csapó
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Interests: speech synthesis; speech analysis; vocoding; ultrasound-based tongue movement analysis; deep learning methods applied to speech technologies

Guest Editor: Dr. Michael Wand
Dalle Molle Institute for Artificial Intelligence USI-SUPSI, Viganello, Switzerland
Interests: machine learning; biosignal processing; silent speech processing; neural networks

Special Issue Information

Dear Colleagues,

Speech is the most spontaneous and natural means of communication; all the more so today, with the reality of instantaneous spoken communication to anywhere in the world. Speech is also becoming the preferred modality for interacting with mobile or fixed electronic devices. However, speech has certain drawbacks:

  • Lack of privacy: speech can be easily captured by a third party;
  • Lack of inclusivity: speech communication is problematic for the speech-impaired;
  • Lack of robustness: speech understanding degrades rapidly in noisy conditions, both for humans and for machines;
  • Difficulty of creating high-performance speech-based man–machine interfaces.

The past decade has seen increased interest in expanding the capabilities of speech interfaces, using solutions such as multi-modal input, novel acoustic and non-acoustic speech sensors, and sophisticated machine-learning techniques for speech recognition, speech synthesis, and natural language processing. The Special Issue “Future Speech Interfaces with Sensors and Machine Intelligence” strives to assemble, in a single volume, contributions from a wide range of speech-related fields, in order to broaden the scope of research in future speech interfaces and to profit from the cross-fertilization that comes from bringing together ideas from disparate domains. Contributions are solicited in the following fields:

  • Multimodal and silent speech interfaces;
  • Articulatory-to-acoustic mapping via “direct synthesis” or “recognition and synthesis”;
  • Lip reading applications;
  • Novel acoustic and non-acoustic sensors for speech processing;
  • Interfaces for enhanced speech inclusivity;
  • Advanced machine learning techniques for future speech interfaces;
  • Generative Adversarial Networks in speech research;
  • Neural Vocoders for future speech interfaces;
  • Current state of the art in speech-based man–machine interfaces.

Prof. Bruce Denby
Dr. Tamás Gábor Csapó
Dr. Michael Wand
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multimodal speech
  • silent speech interfaces
  • lip reading
  • speech sensors
  • generative adversarial networks
  • neural vocoders

Published Papers (12 papers)

Editorial

6 pages, 247 KiB  
Editorial
Future Speech Interfaces with Sensors and Machine Intelligence
by Bruce Denby, Tamás Gábor Csapó and Michael Wand
Sensors 2023, 23(4), 1971; https://doi.org/10.3390/s23041971 - 10 Feb 2023
Cited by 3 | Viewed by 1729
Abstract
Speech is the most spontaneous and natural means of communication. Speech is also becoming the preferred modality for interacting with mobile or fixed electronic devices. However, speech interfaces have drawbacks, including a lack of user privacy; non-inclusivity for certain users; poor robustness in noisy conditions; and the difficulty of creating complex man–machine interfaces. To help address these problems, the Special Issue “Future Speech Interfaces with Sensors and Machine Intelligence” assembles eleven contributions covering multimodal and silent speech interfaces; lip reading applications; novel sensors for speech interfaces; and enhanced speech inclusivity tools for future speech interfaces. Short summaries of the articles are presented, followed by an overall evaluation. The success of this Special Issue has led to its being re-issued as “Future Speech Interfaces with Sensors and Machine Intelligence-II” with a deadline in March of 2023. Full article

Research

13 pages, 1425 KiB  
Article
Optimizing the Ultrasound Tongue Image Representation for Residual Network-Based Articulatory-to-Acoustic Mapping
by Tamás Gábor Csapó, Gábor Gosztolya, László Tóth, Amin Honarmandi Shandiz and Alexandra Markó
Sensors 2022, 22(22), 8601; https://doi.org/10.3390/s22228601 - 8 Nov 2022
Cited by 3 | Viewed by 1618
Abstract
Within speech processing, articulatory-to-acoustic mapping (AAM) methods can apply ultrasound tongue imaging (UTI) as an input. (Micro)convex transducers are mostly used, which provide a wedge-shape visual image. However, this process is optimized for the visual inspection of the human eye, and the signal is often post-processed by the equipment. With newer ultrasound equipment, now it is possible to gain access to the raw scanline data (i.e., ultrasound echo return) without any internal post-processing. In this study, we compared the raw scanline representation with the wedge-shaped processed UTI as the input for the residual network applied for AAM, and we also investigated the optimal size of the input image. We found no significant differences between the performance attained using the raw data and the wedge-shaped image extrapolated from it. We found the optimal pixel size to be 64 × 43 in the case of the raw scanline input, and 64 × 64 when transformed to a wedge. Therefore, it is not necessary to use the full original 64 × 842 pixels raw scanline, but a smaller image is enough. This allows for the building of smaller networks, and will be beneficial for the development of session and speaker-independent methods for practical applications. AAM systems have the target application of a “silent speech interface”, which could be helpful for the communication of the speaking-impaired, in military applications, or in extremely noisy conditions. Full article
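
As an illustration of the input pipeline discussed in this abstract, the following sketch downsamples a raw 64 × 842 scanline frame to 64 × 43 and passes it through a small residual regressor. The network, its channel counts, and the 25-dimensional spectral target are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of the input pipeline discussed above: a raw ultrasound
# scanline frame (64 x 842 echo samples) is downsampled to 64 x 43 and fed to
# a small residual regressor. The network, channel counts, and 25-dimensional
# spectral target are illustrative assumptions, not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return F.relu(x + self.conv2(F.relu(self.conv1(x))))

class TinyResNetAAM(nn.Module):
    """Maps one ultrasound frame to a vector of spectral parameters."""
    def __init__(self, n_targets: int = 25):
        super().__init__()
        self.stem = nn.Conv2d(1, 16, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(16), ResidualBlock(16))
        self.head = nn.Linear(16 * 64 * 43, n_targets)

    def forward(self, x):                    # x: (batch, 1, 64, 43)
        h = self.blocks(F.relu(self.stem(x)))
        return self.head(h.flatten(1))

raw = torch.rand(8, 1, 64, 842)              # batch of raw scanline frames
small = F.interpolate(raw, size=(64, 43), mode="bilinear", align_corners=False)
print(TinyResNetAAM()(small).shape)          # torch.Size([8, 25])
```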

27 pages, 6147 KiB  
Article
Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications
by Sanghun Jeon and Mun Sang Kim
Sensors 2022, 22(20), 7738; https://doi.org/10.3390/s22207738 - 12 Oct 2022
Cited by 3 | Viewed by 2144
Abstract
Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user–system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks. Full article
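
A minimal sketch of the fusion idea described here: a word-embedding vector derived from the speech API transcript is concatenated with a visual feature vector and classified jointly. All dimensions, layer sizes, and the number of command classes are assumptions chosen for demonstration.

```python
# Illustrative late-fusion classifier in the spirit of the approach described
# above: an API-derived word embedding and a visual feature vector are
# concatenated and classified together. Sizes are assumptions only.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=300, visual_dim=256, n_classes=50):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, audio_vec, visual_vec):
        return self.classifier(torch.cat([audio_vec, visual_vec], dim=-1))

model = AudioVisualFusion()
audio_vec = torch.rand(4, 300)    # e.g., word2vec embedding of the API result
visual_vec = torch.rand(4, 256)   # e.g., output of a lip-reading encoder
print(model(audio_vec, visual_vec).argmax(dim=-1))  # predicted command ids
```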

16 pages, 3112 KiB  
Article
Speaker Adaptation on Articulation and Acoustics for Articulation-to-Speech Synthesis
by Beiming Cao, Alan Wisler and Jun Wang
Sensors 2022, 22(16), 6056; https://doi.org/10.3390/s22166056 - 13 Aug 2022
Cited by 7 | Viewed by 1864
Abstract
Silent speech interfaces (SSIs) convert non-audio bio-signals, such as articulatory movement, to speech. This technology has the potential to recover the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Articulation-to-speech (ATS) synthesis is an algorithm design of SSI that has the advantages of easy-implementation and low-latency, and therefore is becoming more popular. Current ATS studies focus on speaker-dependent (SD) models to avoid large variations of articulatory patterns and acoustic features across speakers. However, these designs are limited by the small data size from individual speakers. Speaker adaptation designs that include multiple speakers’ data have the potential to address the issue of limited data size from single speakers; however, few prior studies have investigated their performance in ATS. In this paper, we investigated speaker adaptation on both the input articulation and the output acoustic signals (with or without direct inclusion of data from test speakers) using the publicly available electromagnetic articulatory (EMA) dataset. We used Procrustes matching and voice conversion for articulation and voice adaptation, respectively. The performance of the ATS models was measured objectively by the mel-cepstral distortions (MCDs). The synthetic speech samples were generated and are provided in the supplementary material. The results demonstrated the improvement brought by both Procrustes matching and voice conversion on speaker-independent ATS. With the direct inclusion of target speaker data in the training process, the speaker-adaptive ATS achieved a comparable performance to speaker-dependent ATS. To our knowledge, this is the first study that has demonstrated that speaker-adaptive ATS can achieve a non-statistically different performance to speaker-dependent ATS. Full article
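
The two adaptation ingredients named in the abstract can be illustrated compactly: Procrustes alignment of articulatory point sets and the mel-cepstral distortion (MCD) used for objective evaluation. The array shapes below are placeholders, and the MCD convention of excluding the 0th coefficient is a common choice that may differ from the paper's exact setup.

```python
# Sketch of Procrustes alignment of articulatory point sets and of the
# mel-cepstral distortion (MCD) metric. Data here are random placeholders.
import numpy as np
from scipy.spatial import procrustes

# Align a new speaker's EMA sensor positions to a reference speaker's space.
reference = np.random.rand(100, 2)          # 100 samples of (x, y) positions
new_speaker = reference @ np.array([[0.8, -0.2], [0.2, 0.8]]) + 0.1
ref_std, aligned, disparity = procrustes(reference, new_speaker)
print(f"Procrustes disparity after alignment: {disparity:.4f}")

def mel_cepstral_distortion(mc_ref: np.ndarray, mc_syn: np.ndarray) -> float:
    """MCD in dB between reference and synthesized mel-cepstra (frames x dims),
    excluding the 0th (energy) coefficient, as is conventional."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

mc_ref = np.random.rand(200, 25)
mc_syn = mc_ref + 0.05 * np.random.randn(200, 25)
print(f"MCD: {mel_cepstral_distortion(mc_ref, mc_syn):.2f} dB")
```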

20 pages, 770 KiB  
Article
Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition
by Wentao Yu, Steffen Zeiler and Dorothea Kolossa
Sensors 2022, 22(15), 5501; https://doi.org/10.3390/s22155501 - 23 Jul 2022
Cited by 1 | Viewed by 1333
Abstract
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture—the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture. Full article
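
A rough sketch of decision fusion with reliability measures, in the spirit of the DFN: per-stream posteriors are concatenated with time-variant reliability features and mapped to fused posteriors. The state count, reliability dimension, and layer sizes are assumptions, not the published architecture.

```python
# Minimal fusion-network sketch: audio and video stream posteriors plus
# reliability features are mapped to fused log-posteriors. All sizes are
# assumptions for illustration, not the published DFN configuration.
import torch
import torch.nn as nn

class DecisionFusionSketch(nn.Module):
    def __init__(self, n_states=512, n_reliability=10):
        super().__init__()
        in_dim = 2 * n_states + n_reliability   # audio + video + reliability
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, n_states),
        )

    def forward(self, audio_logp, video_logp, reliability):
        fused = torch.cat([audio_logp, video_logp, reliability], dim=-1)
        return torch.log_softmax(self.net(fused), dim=-1)   # fused log-posteriors

T, n_states = 120, 512
audio_logp = torch.randn(T, n_states)
video_logp = torch.randn(T, n_states)
reliability = torch.randn(T, 10)     # e.g., frame-wise SNR estimates, confidences
print(DecisionFusionSketch()(audio_logp, video_logp, reliability).shape)
```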

16 pages, 485 KiB  
Article
FlexLip: A Controllable Text-to-Lip System
by Dan Oneață, Beáta Lőrincz, Adriana Stan and Horia Cucu
Sensors 2022, 22(11), 4104; https://doi.org/10.3390/s22114104 - 28 May 2022
Cited by 1 | Viewed by 2499
Abstract
The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed with some of them reaching close-to-natural performances in constrained tasks. In this paper, we tackle a subissue of the text-to-video generation problem, by converting the text into lip landmarks. However, we do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each of its components, while also ensuring the fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system by taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models, and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model. Full article
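
The modularity claim can be made concrete with a small composition sketch: a text-to-speech stage and a speech-to-lip stage, each independently replaceable. The callables below are dummy stand-ins, not the FlexLip models.

```python
# Sketch of the modular text-to-lip idea: two independently replaceable
# stages composed into one pipeline. The stand-in functions are assumptions.
from typing import Callable
import numpy as np

TextToSpeech = Callable[[str], np.ndarray]        # text -> acoustic features
SpeechToLip = Callable[[np.ndarray], np.ndarray]  # acoustic features -> lip landmarks

def make_text_to_lip(tts: TextToSpeech, s2l: SpeechToLip) -> Callable[[str], np.ndarray]:
    """Compose the two modules; either one can be swapped independently."""
    def text_to_lip(text: str) -> np.ndarray:
        return s2l(tts(text))
    return text_to_lip

# Dummy stand-ins: 80-dim mel frames, 20 lip landmarks as (x, y) pairs.
dummy_tts: TextToSpeech = lambda text: np.zeros((10 * len(text.split()), 80))
dummy_s2l: SpeechToLip = lambda mel: np.zeros((mel.shape[0], 20, 2))

pipeline = make_text_to_lip(dummy_tts, dummy_s2l)
print(pipeline("open the door").shape)   # (frames, landmarks, 2)
```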

27 pages, 6424 KiB  
Article
End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC
by Sanghun Jeon and Mun Sang Kim
Sensors 2022, 22(9), 3597; https://doi.org/10.3390/s22093597 - 9 May 2022
Cited by 7 | Viewed by 2417
Abstract
Concomitant with the recent advances in deep learning, automatic speech recognition and visual speech recognition (VSR) have received considerable attention. However, although VSR systems must identify speech from both frontal and profile faces in real-world scenarios, most VSR studies have focused solely on frontal face pictures. To address this issue, we propose an end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives (frontal, 30°, 45°, and 60°). The encoder uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, and the decoder uses cascaded local self-attention connectionist temporal classification to collect the details of local contextual information in the immediate vicinity, which results in a substantial performance boost and speedy convergence. To compare the performance of the proposed model for experiments on the OuluVS2 dataset, the dataset was divided into four different perspectives, and the obtained performance improvement was 3.31% (0°), 4.79% (30°), 5.51% (45°), 6.18% (60°), and 4.95% (mean), respectively, compared with the existing state-of-the-art performance, and the average performance improved by 9.1% compared with the baseline. Thus, the suggested design enhances the performance of multi-view VSR and boosts its usefulness in real-world applications. Full article
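
A generic spatial attention gate of the kind referred to above can be sketched as follows (CBAM-style channel pooling followed by a convolution); the paper's exact module may differ.

```python
# Illustrative spatial attention module that reweights CNN feature maps so
# that informative mouth regions are emphasized. Generic sketch only.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (batch, channels, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)  # (batch, 1, H, W)
        max_pool = x.amax(dim=1, keepdim=True)  # (batch, 1, H, W)
        attention = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attention                    # reweighted feature map

features = torch.rand(2, 64, 50, 100)       # CNN features of a mouth-region crop
print(SpatialAttention()(features).shape)   # torch.Size([2, 64, 50, 100])
```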

21 pages, 3791 KiB  
Article
End-to-End Lip-Reading Open Cloud-Based Speech Architecture
by Sanghun Jeon and Mun Sang Kim
Sensors 2022, 22(8), 2938; https://doi.org/10.3390/s22082938 - 12 Apr 2022
Cited by 4 | Viewed by 3222
Abstract
Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google’s trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications. Full article
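
One way to picture the word2vec enrichment step described here: map the API's recognized word and the command vocabulary into the embedding space and resolve to the closest command. The pretrained model path is an assumption; any word2vec binary compatible with gensim would work.

```python
# Sketch of enriching a cloud speech-API transcript with word2vec semantics:
# the recognized keyword is compared against the command vocabulary in
# embedding space, so near-misses from the API can still be resolved.
# The model path is an assumption, not taken from the paper.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)   # assumed local path

commands = ["start", "stop", "left", "right", "feed"]

def resolve_keyword(api_word: str) -> str:
    """Return the command whose embedding is closest to the API's output word."""
    scores = [kv.similarity(api_word, c) if api_word in kv and c in kv else 0.0
              for c in commands]
    return commands[int(np.argmax(scores))]

print(resolve_keyword("begin"))   # likely resolves to "start"
```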

27 pages, 6450 KiB  
Article
Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut
by Alan Wrench and Jonathan Balch-Tomes
Sensors 2022, 22(3), 1133; https://doi.org/10.3390/s22031133 - 2 Feb 2022
Cited by 13 | Viewed by 3973
Abstract
Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints, trained using DeepLabCut and evaluated on unseen speakers and systems. Tongue surface contours interpolated from estimated and hand-labelled keypoints produced an average mean sum of distances (MSD) of 0.93, s.d. 0.46 mm, compared with 0.96, s.d. 0.39 mm, for two human labellers, and 2.3, s.d. 1.5 mm, for the best performing edge detection algorithm. A pilot set of simultaneous electromagnetic articulography (EMA) and ultrasound recordings demonstrated partial correlation among three physical sensor positions and the corresponding estimated keypoints and requires further investigation. The accuracy of the estimating lip aperture from a camera video was high, with a mean MSD of 0.70, s.d. 0.56 mm compared with 0.57, s.d. 0.48 mm for two human labellers. DeepLabCut was found to be a fast, accurate and fully automatic method of providing unique kinematic data for tongue, hyoid, jaw, and lips. Full article
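
The contour comparison reported above (mean sum of distances between estimated and hand-labelled tongue contours) can be approximated with a nearest-neighbour distance averaged over both directions; the implementation below is generic and may not match the paper's exact definition.

```python
# Generic mean-sum-of-distances style comparison between an estimated and a
# reference contour: for each point on one contour, take the distance to the
# nearest point on the other, then average over both directions.
import numpy as np

def mean_nearest_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Average distance from each point in contour `a` (N x 2, in mm)
    to its nearest neighbour in contour `b` (M x 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # N x M distances
    return float(d.min(axis=1).mean())

def msd(estimated: np.ndarray, reference: np.ndarray) -> float:
    return 0.5 * (mean_nearest_distance(estimated, reference)
                  + mean_nearest_distance(reference, estimated))

ref = np.column_stack([np.linspace(0, 60, 100), 10 * np.sin(np.linspace(0, 3, 100))])
est = ref + np.random.normal(scale=0.5, size=ref.shape)
print(f"MSD: {msd(est, ref):.2f} mm")
```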

17 pages, 821 KiB  
Article
Exploring Silent Speech Interfaces Based on Frequency-Modulated Continuous-Wave Radar
by David Ferreira, Samuel Silva, Francisco Curado and António Teixeira
Sensors 2022, 22(2), 649; https://doi.org/10.3390/s22020649 - 14 Jan 2022
Cited by 11 | Viewed by 3249
Abstract
Speech is our most natural and efficient form of communication and offers a strong potential to improve how we interact with machines. However, speech communication can sometimes be limited by environmental (e.g., ambient noise), contextual (e.g., need for privacy), or health conditions (e.g., laryngectomy), preventing the consideration of audible speech. In this regard, silent speech interfaces (SSI) have been proposed as an alternative, considering technologies that do not require the production of acoustic signals (e.g., electromyography and video). Unfortunately, despite their plentitude, many still face limitations regarding their everyday use, e.g., being intrusive, non-portable, or raising technical (e.g., lighting conditions for video) or privacy concerns. In line with this necessity, this article explores the consideration of contactless continuous-wave radar to assess its potential for SSI development. A corpus of 13 European Portuguese words was acquired for four speakers and three of them enrolled in a second acquisition session, three months later. Regarding the speaker-dependent models, trained and tested with data from each speaker while using 5-fold cross-validation, average accuracies of 84.50% and 88.00% were respectively obtained from Bagging (BAG) and Linear Regression (LR) classifiers, respectively. Additionally, recognition accuracies of 81.79% and 81.80% were also, respectively, achieved for the session and speaker-independent experiments, establishing promising grounds for further exploring this technology towards silent speech recognition. Full article
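
The evaluation protocol described here, word classification with 5-fold cross-validation, can be reproduced in outline with scikit-learn; the radar feature extraction is replaced by random placeholder data, and the classifier settings are assumptions.

```python
# Illustrative speaker-dependent word recognition setup: 5-fold cross-
# validation with a Bagging classifier over fixed-length feature vectors.
# Random data stands in for the radar features purely for demonstration.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, reps, n_features = 13, 10, 200
X = rng.normal(size=(n_words * reps, n_features))   # one row per utterance
y = np.repeat(np.arange(n_words), reps)             # word labels

clf = BaggingClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
```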

20 pages, 4596 KiB  
Article
Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition
by Sanghun Jeon, Ahmed Elsharkawy and Mun Sang Kim
Sensors 2022, 22(1), 72; https://doi.org/10.3390/s22010072 - 23 Dec 2021
Cited by 16 | Viewed by 4834
Abstract
In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications. Full article
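
A skeleton of the general architecture family described above (3D-convolutional front end, two-layer bidirectional GRU, CTC training objective) is sketched below; channel counts, input resolution, and character inventory are illustrative assumptions.

```python
# Skeleton of a sentence-level lipreading model: 3D-conv front end, two-layer
# bidirectional GRU, CTC output. Sizes are assumptions, not the paper's model.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_chars: int = 28):          # characters + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 4, 4)),     # pool space, keep time
        )
        self.gru = nn.GRU(input_size=32 * 12 * 25, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, n_chars)

    def forward(self, video):                        # (batch, 1, T, 50, 100)
        f = self.frontend(video)                     # (batch, 32, T, 12, 25)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)      # (batch, T, 32*12*25)
        out, _ = self.gru(f)
        return self.fc(out).log_softmax(-1)          # (batch, T, n_chars) for CTC

video = torch.rand(2, 1, 75, 50, 100)                # 75 frames of 50x100 mouth crops
log_probs = LipReader()(video)
targets = torch.randint(1, 28, (2, 20))              # dummy character targets
loss = nn.CTCLoss()(log_probs.permute(1, 0, 2),      # CTC expects (T, batch, classes)
                    targets,
                    input_lengths=torch.full((2,), 75),
                    target_lengths=torch.full((2,), 20))
print(log_probs.shape, loss.item())
```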

28 pages, 24101 KiB  
Article
A Transformer-Based Neural Machine Translation Model for Arabic Dialects That Utilizes Subword Units
by Laith H. Baniata, Isaac. K. E. Ampomah and Seyoung Park
Sensors 2021, 21(19), 6509; https://doi.org/10.3390/s21196509 - 29 Sep 2021
Cited by 14 | Viewed by 7611
Abstract
Languages that allow free word order, such as Arabic dialects, are of significant difficulty for neural machine translation (NMT) because of many scarce words and the inefficiency of NMT systems to translate these words. Unknown Word (UNK) tokens represent the out-of-vocabulary words for the reason that NMT systems run with vocabulary that has fixed size. Scarce words are encoded completely as sequences of subword pieces employing the Word-Piece Model. This research paper introduces the first Transformer-based neural machine translation model for Arabic vernaculars that employs subword units. The proposed solution is based on the Transformer model that has been presented lately. The use of subword units and shared vocabulary within the Arabic dialect (the source language) and modern standard Arabic (the target language) enhances the behavior of the multi-head attention sublayers for the encoder by obtaining the overall dependencies between words of input sentence for Arabic vernacular. Experiments are carried out from Levantine Arabic vernacular (LEV) to modern standard Arabic (MSA) and Maghrebi Arabic vernacular (MAG) to MSA, Gulf–MSA, Nile–MSA, Iraqi Arabic (IRQ) to MSA translation tasks. Extensive experiments confirm that the suggested model adequately addresses the unknown word issue and boosts the quality of translation from Arabic vernaculars to Modern standard Arabic (MSA). Full article
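
The two central ideas, a shared source/target subword vocabulary and a Transformer encoder-decoder, can be sketched compactly; the toy vocabulary size, random token ids, and model depth below are assumptions standing in for a trained WordPiece model and the full system.

```python
# Minimal Transformer NMT sketch with one embedding table shared between the
# dialect source side and the MSA target side. All sizes are toy assumptions.
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256):
        super().__init__()
        # One subword embedding table shared by source and target sides.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        return self.out(self.transformer(src, tgt, tgt_mask=tgt_mask))

src = torch.randint(0, 8000, (2, 17))   # subword ids of a dialect sentence
tgt = torch.randint(0, 8000, (2, 15))   # subword ids of the MSA translation
print(TinyNMT()(src, tgt).shape)        # torch.Size([2, 15, 8000])
```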
