Deep Learning for Speech Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 August 2023) | Viewed by 6077

Special Issue Editors


Guest Editor
Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
Interests: deep learning; machine learning; artificial intelligence; speech processing

Guest Editor
Speech & Multimodal Interfaces Laboratory, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia
Interests: speech and multimodal interfaces

Guest Editor
School of Computer Science, College of Engineering, University of Seoul, Seoul, Korea
Interests: artificial intelligence; speech recognition; speaker recognition

Special Issue Information

Dear Colleagues,

As deep learning technologies mature, their performance continues to improve across many application areas, and in some fields it now matches or even surpasses human performance. The same progress is evident in speech processing, where numerous research papers appear every year not only on traditional classification problems but also on generation problems. This Special Issue is dedicated to state-of-the-art research articles, as well as tutorials and reviews, in the rapidly evolving field of deep learning for speech and language processing.

Prof. Dr. Dongsuk Yook
Prof. Dr. Alexey Karpov
Prof. Dr. Ha-Jin Yu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • emotion recognition
  • language models
  • machine translation
  • neural vocoders
  • source separation
  • speaker and language recognition
  • speech analysis
  • speech assessment and measurement
  • speech coding
  • speech detection
  • speech enhancement
  • speech recognition
  • speech representation
  • speech synthesis
  • spoken language processing
  • voice conversion

Published Papers (4 papers)

Research

14 pages, 3012 KiB  
Article
Bidirectional Representations for Low-Resource Spoken Language Understanding
by Quentin Meeus, Marie-Francine Moens and Hugo Van hamme
Appl. Sci. 2023, 13(20), 11291; https://doi.org/10.3390/app132011291 - 14 Oct 2023
Viewed by 638
Abstract
Speech representation models lack the ability to efficiently store semantic information and require fine tuning to deliver decent performance. In this research, we introduce a transformer encoder–decoder framework with a multiobjective training strategy, incorporating connectionist temporal classification (CTC) and masked language modeling (MLM) objectives. This approach enables the model to learn contextual bidirectional representations. We evaluate the representations in a challenging low-resource scenario, where training data is limited, necessitating expressive speech embeddings to compensate for the scarcity of examples. Notably, we demonstrate that our model’s initial embeddings outperform comparable models on multiple datasets before fine tuning. Fine tuning the top layers of the representation model further enhances performance, particularly on the Fluent Speech Commands dataset, even under low-resource conditions. Additionally, we introduce the concept of class attention as an efficient module for spoken language understanding, characterized by its speed and minimal parameter requirements. Class attention not only aids in explaining model predictions but also enhances our understanding of the underlying decision-making processes. Our experiments cover both English and Dutch languages, offering a comprehensive evaluation of our proposed approach.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
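
To make the multiobjective training concrete, the sketch below shows, in PyTorch, how a transformer encoder-decoder can be trained with a weighted sum of a CTC loss on the encoder output and a masked-language-modeling loss on the decoder output. This is an illustrative sketch rather than the authors' implementation; the class name, all dimensions, the vocabulary size, and the equal loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class CTCMLMSpeechModel(nn.Module):
    """Toy encoder-decoder trained jointly with CTC and MLM objectives."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=1000, blank_id=0):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=2,
            batch_first=True,
        )
        self.ctc_head = nn.Linear(d_model, vocab_size)  # frame-level token scores
        self.mlm_head = nn.Linear(d_model, vocab_size)  # masked-token prediction
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, feats, feat_lens, ctc_targets, target_lens,
                masked_tokens, mlm_labels):
        # Encode speech features; the encoder output serves both objectives.
        enc = self.transformer.encoder(self.input_proj(feats))            # (B, T, D)

        # CTC branch: frame-wise log-probabilities over the token vocabulary.
        log_probs = self.ctc_head(enc).log_softmax(-1).transpose(0, 1)    # (T, B, V)
        loss_ctc = self.ctc_loss(log_probs, ctc_targets, feat_lens, target_lens)

        # MLM branch: the decoder attends to the encoder output without a causal
        # mask (bidirectional context) and must recover the masked tokens.
        dec = self.transformer.decoder(self.embed(masked_tokens), enc)    # (B, L, D)
        logits = self.mlm_head(dec)
        loss_mlm = self.mlm_loss(logits.view(-1, logits.size(-1)), mlm_labels.view(-1))

        # Equal weighting is an illustrative choice, not the paper's setting.
        return 0.5 * loss_ctc + 0.5 * loss_mlm
```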

20 pages, 753 KiB  
Article
The Impact of Data Pre-Processing on Hate Speech Detection in a Mix of English and Hindi–English (Code-Mixed) Tweets
by Khalil Al-Hussaeni, Mohamed Sameer and Ioannis Karamitsos
Appl. Sci. 2023, 13(19), 11104; https://doi.org/10.3390/app131911104 - 09 Oct 2023
Viewed by 1027
Abstract
Due to the increasing reliance on social network platforms in recent years, hate speech has risen significantly among online users. Governments and social media platforms face the challenging responsibility of controlling, detecting, and removing massively growing hateful content as early as possible to prevent future criminal acts, such as cyberviolence and real-life hate crimes. Twitter is used globally by people from various backgrounds and nationalities; it contains tweets posted in different languages, including code-mixed language, such as Hindi–English. Due to the informal format of tweets with variations in spelling and grammar, hate speech detection is especially challenging in code-mixed text. In this paper, we tackle the critical issue of hate speech detection on social media, with a focus on a mix of English and Hindi–English (code-mixed) text messages on Twitter. More specifically, we aim to evaluate the impact of data pre-processing on hate speech detection. Our method first performs 10-step data cleansing; then, it builds a detection method based on two architectures, namely a convolutional neural network (CNN) and a combination of CNN and long short-term memory (LSTM) algorithms. We tune the hyperparameters of the proposed model architectures and conduct extensive experimental analysis on real-life tweets to evaluate the performance of the models in terms of accuracy, efficiency, and scalability. Moreover, we compare our method with a closely related hate speech detection method from the literature. The experimental results suggest that our method results in improved accuracy and a significantly improved runtime. Among our best-performing models, CNN-LSTM improved accuracy by nearly 2% and decreased the runtime by almost half.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
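
As a rough illustration of the approach described above, the sketch below pairs a few representative regex cleansing steps with a small Keras CNN-LSTM classifier. The cleaning rules are typical examples rather than the paper's exact 10-step procedure, and the vocabulary size, sequence length, and layer widths are assumptions.

```python
import re
import tensorflow as tf
from tensorflow.keras import layers

def clean_tweet(text: str) -> str:
    """A few typical cleansing steps for raw (possibly code-mixed) tweets."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # remove mentions and hashtags
    text = re.sub(r"[^a-z\u0900-\u097F\s]", " ", text)    # keep Latin and Devanagari letters
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

def build_cnn_lstm(vocab_size=20000, seq_len=50, embed_dim=100):
    """Binary classifier: embedding -> Conv1D -> max pooling -> LSTM -> sigmoid."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,), dtype="int32"),
        layers.Embedding(vocab_size, embed_dim),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_cnn_lstm()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
print(clean_tweet("RT @user check https://t.co/xyz #breaking !!!"))
```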

15 pages, 938 KiB  
Article
MiniatureVQNet: A Light-Weight Deep Neural Network for Non-Intrusive Evaluation of VoIP Speech Quality
by Elhard James Kumalija and Yukikazu Nakamoto
Appl. Sci. 2023, 13(4), 2455; https://doi.org/10.3390/app13042455 - 14 Feb 2023
Cited by 2 | Viewed by 1271
Abstract
In IP audio systems, audio quality is degraded by environmental noise, poor network quality, and encoding–decoding algorithms. Therefore, there is a need for continuous automatic quality evaluation of the transmitted audio. Speech quality monitoring in VoIP systems enables autonomous system adaptation. Furthermore, there are diverse IP audio transmitters and receivers, from high-performance computers and mobile phones to low-memory and low-computing-capacity embedded systems. This paper proposes MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications based on a lightweight deep neural network (DNN) model. The proposed model can predict the audio quality independent of the source of degradation, whether noise or network, and is light enough to run in embedded systems. Two variations of the proposed MiniatureVQNet model were evaluated: a model trained on a dataset that contains environmental noise only, referred to as MiniatureVQNet-Noise, and a second model trained on both noise and network distortions, referred to as MiniatureVQNet-Noise-Network. The proposed MiniatureVQNet model outperforms the traditional P.563 method in terms of accuracy on all tested network conditions and environmental noise parameters. The mean squared error (MSE) of the models with respect to the PESQ score for ITU-T P.563, MiniatureVQNet-Noise, and MiniatureVQNet-Noise-Network was 2.19, 0.34, and 0.21, respectively. The performance of both the MiniatureVQNet-Noise-Network and MiniatureVQNet-Noise models depends on the noise type for SNRs greater than 0 dB and less than 10 dB. In addition, training on a noise- and network-distorted speech dataset improves the model prediction accuracy under all VoIP environment distortions compared to training the model on a noise-only dataset.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
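
The sketch below illustrates the single-ended (non-intrusive) idea in general terms: a small PyTorch network maps log-mel features of the received signal alone to a scalar quality score and is trained with MSE against PESQ labels. It is not the MiniatureVQNet architecture; the network shape and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyQualityNet(nn.Module):
    """Maps a log-mel spectrogram of the received audio to a scalar quality score."""

    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)  # lightweight recurrent encoder
        self.head = nn.Linear(hidden, 1)                      # scalar quality estimate

    def forward(self, logmel):                 # logmel: (batch, frames, n_mels)
        _, h = self.rnn(logmel)                # h: (1, batch, hidden), last hidden state
        return self.head(h.squeeze(0)).squeeze(-1)            # (batch,)

model = TinyQualityNet()
features = torch.randn(8, 200, 40)             # dummy batch of log-mel features
pesq_labels = torch.rand(8) * 3.5 + 1.0        # dummy PESQ-like targets in [1.0, 4.5]
loss = nn.functional.mse_loss(model(features), pesq_labels)
print(loss.item())
```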

18 pages, 972 KiB  
Article
Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection
by Md. Anwar Hussen Wadud, Mohammed Alatiyyah and M. F. Mridha
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109 - 22 Dec 2022
Cited by 8 | Viewed by 2270
Abstract
A crucial element of computer-assisted pronunciation training (CAPT) systems is the mispronunciation detection and diagnosis (MDD) technique. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. These prior texts have been fully exploited by conventional approaches, such as forced alignment and extended recognition networks, for model development or for enhancing system performance. The incorporation of prior texts into model training has recently been attempted using end-to-end (E2E)-based approaches, and preliminary results indicate efficacy. However, attention-based end-to-end models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data will frequently impair their effectiveness in MDD. To solve this problem, we propose an MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce estimation time while maintaining accuracy levels similar to those of traditional E2E neural models. Unlike autoregressive models, NAR models can generate parallel token sequences by accepting parallel inputs instead of left-to-right forward computation. To further enhance the effectiveness of MDD, we develop a pronunciation model superimposed on our approach’s NAR end-to-end models. To test the effectiveness of our strategy against some of the best end-to-end models, we use the publicly accessible L2-ARCTIC and SpeechOcean English datasets for training and testing, where the proposed model achieves better results than other existing models.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
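
The following toy sketch conveys the non-autoregressive idea rather than the paper's model: phone posteriors for all positions come from one parallel forward pass, with no left-to-right decoding or beam search, and mispronunciations are then flagged by aligning the recognized phones against the canonical sequence. The model shape is an assumption, and the alignment uses Python's difflib purely for brevity.

```python
import difflib
import torch
import torch.nn as nn

class NARPhoneRecognizer(nn.Module):
    """Emits phone posteriors for every frame in one parallel (non-autoregressive) pass."""

    def __init__(self, feat_dim=80, d_model=128, n_phones=70):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_phones)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        # Single forward pass: no left-to-right decoding or beam search.
        return self.head(self.encoder(self.proj(feats)))      # (batch, frames, n_phones)

def flag_mispronunciations(canonical, hypothesis):
    """Align recognized phones against the canonical sequence; report mismatched spans."""
    sm = difflib.SequenceMatcher(a=canonical, b=hypothesis)
    return [(canonical[i1:i2], hypothesis[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

posteriors = NARPhoneRecognizer()(torch.randn(1, 120, 80))    # all positions at once
print(flag_mispronunciations(["DH", "AH", "K", "AE", "T"],    # canonical "the cat"
                             ["D", "AH", "K", "AE", "T"]))    # learner substituted D for DH
```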
