Deep Learning for Speech Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 August 2023) | Viewed by 6077

Special Issue Editors


Guest Editor
Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
Interests: deep learning; machine learning; artificial intelligence; speech processing

Guest Editor
Speech & Multimodal Interfaces Laboratory, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), St. Petersburg, Russia
Interests: speech and multimodal interfaces

Guest Editor
School of Computer Science, College of Engineering, University of Seoul, Seoul, Korea
Interests: artificial intelligence; speech recognition; speaker recognition

Special Issue Information

Dear Colleagues,

As deep learning technologies mature, their performance continues to improve across many application areas, and in some fields it now matches or even surpasses human performance. The same progress is evident in speech processing, where numerous research papers appear every year not only on traditional classification problems but also on generation problems. This Special Issue is dedicated to state-of-the-art research articles, as well as tutorials and reviews, in the rapidly evolving field of deep learning for speech and language processing.

Prof. Dr. Dongsuk Yook
Prof. Dr. Alexey Karpov
Prof. Dr. Ha-Jin Yu
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • emotion recognition
  • language models
  • machine translation
  • neural vocoders
  • source separation
  • speaker and language recognition
  • speech analysis
  • speech assessment and measurement
  • speech coding
  • speech detection
  • speech enhancement
  • speech recognition
  • speech representation
  • speech synthesis
  • spoken language processing
  • voice conversion

Published Papers (4 papers)

Research

14 pages, 3012 KiB  
Article
Bidirectional Representations for Low-Resource Spoken Language Understanding
by Quentin Meeus, Marie-Francine Moens and Hugo Van hamme
Appl. Sci. 2023, 13(20), 11291; https://doi.org/10.3390/app132011291 - 14 Oct 2023
Viewed by 638
Abstract
Speech representation models lack the ability to efficiently store semantic information and require fine tuning to deliver decent performance. In this research, we introduce a transformer encoder–decoder framework with a multiobjective training strategy, incorporating connectionist temporal classification (CTC) and masked language modeling (MLM) objectives. This approach enables the model to learn contextual bidirectional representations. We evaluate the representations in a challenging low-resource scenario, where training data is limited, necessitating expressive speech embeddings to compensate for the scarcity of examples. Notably, we demonstrate that our model’s initial embeddings outperform comparable models on multiple datasets before fine tuning. Fine tuning the top layers of the representation model further enhances performance, particularly on the Fluent Speech Commands dataset, even under low-resource conditions. Additionally, we introduce the concept of class attention as an efficient module for spoken language understanding, characterized by its speed and minimal parameter requirements. Class attention not only aids in explaining model predictions but also enhances our understanding of the underlying decision-making processes. Our experiments cover both English and Dutch languages, offering a comprehensive evaluation of our proposed approach.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
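
To make the multiobjective training concrete, the sketch below shows, in PyTorch, how a transformer encoder-decoder can be trained with a weighted sum of a CTC loss on the encoder output and a masked-language-modeling loss on the decoder output. This is an illustrative sketch rather than the authors' implementation; the class name, all dimensions, the vocabulary size, and the equal loss weighting are assumptions.

```python
import torch
import torch.nn as nn

class CTCMLMSpeechModel(nn.Module):
    """Toy encoder-decoder trained jointly with CTC and MLM objectives."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=1000, blank_id=0):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=4, num_decoder_layers=2,
            batch_first=True,
        )
        self.ctc_head = nn.Linear(d_model, vocab_size)  # frame-level token scores
        self.mlm_head = nn.Linear(d_model, vocab_size)  # masked-token prediction
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, feats, feat_lens, ctc_targets, target_lens,
                masked_tokens, mlm_labels):
        # Encode speech features; the encoder output serves both objectives.
        enc = self.transformer.encoder(self.input_proj(feats))            # (B, T, D)

        # CTC branch: frame-wise log-probabilities over the token vocabulary.
        log_probs = self.ctc_head(enc).log_softmax(-1).transpose(0, 1)    # (T, B, V)
        loss_ctc = self.ctc_loss(log_probs, ctc_targets, feat_lens, target_lens)

        # MLM branch: the decoder attends to the encoder output without a causal
        # mask (bidirectional context) and must recover the masked tokens.
        dec = self.transformer.decoder(self.embed(masked_tokens), enc)    # (B, L, D)
        logits = self.mlm_head(dec)
        loss_mlm = self.mlm_loss(logits.view(-1, logits.size(-1)), mlm_labels.view(-1))

        # Equal weighting is an illustrative choice, not the paper's setting.
        return 0.5 * loss_ctc + 0.5 * loss_mlm
```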

20 pages, 753 KiB  
Article
The Impact of Data Pre-Processing on Hate Speech Detection in a Mix of English and Hindi–English (Code-Mixed) Tweets
by Khalil Al-Hussaeni, Mohamed Sameer and Ioannis Karamitsos
Appl. Sci. 2023, 13(19), 11104; https://doi.org/10.3390/app131911104 - 09 Oct 2023
Viewed by 1027
Abstract
Due to the increasing reliance on social network platforms in recent years, hate speech has risen significantly among online users. Governments and social media platforms face the challenging responsibility of controlling, detecting, and removing massively growing hateful content as early as possible to prevent future criminal acts, such as cyberviolence and real-life hate crimes. Twitter is used globally by people from various backgrounds and nationalities; it contains tweets posted in different languages, including code-mixed language, such as Hindi–English. Due to the informal format of tweets with variations in spelling and grammar, hate speech detection is especially challenging in code-mixed text. In this paper, we tackle the critical issue of hate speech detection on social media, with a focus on a mix of English and Hindi–English (code-mixed) text messages on Twitter. More specifically, we aim to evaluate the impact of data pre-processing on hate speech detection. Our method first performs 10-step data cleansing; then, it builds a detection method based on two architectures, namely a convolutional neural network (CNN) and a combination of CNN and long short-term memory (LSTM) algorithms. We tune the hyperparameters of the proposed model architectures and conduct extensive experimental analysis on real-life tweets to evaluate the performance of the models in terms of accuracy, efficiency, and scalability. Moreover, we compare our method with a closely related hate speech detection method from the literature. The experimental results suggest that our method results in improved accuracy and a significantly improved runtime. Among our best-performing models, CNN-LSTM improved accuracy by nearly 2% and decreased the runtime by almost half.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
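
As a rough illustration of the approach described above, the sketch below pairs a few representative regex cleansing steps with a small Keras CNN-LSTM classifier. The cleaning rules are typical examples rather than the paper's exact 10-step procedure, and the vocabulary size, sequence length, and layer widths are assumptions.

```python
import re
import tensorflow as tf
from tensorflow.keras import layers

def clean_tweet(text: str) -> str:
    """A few typical cleansing steps for raw (possibly code-mixed) tweets."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                  # remove mentions and hashtags
    text = re.sub(r"[^a-z\u0900-\u097F\s]", " ", text)    # keep Latin and Devanagari letters
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

def build_cnn_lstm(vocab_size=20000, seq_len=50, embed_dim=100):
    """Binary classifier: embedding -> Conv1D -> max pooling -> LSTM -> sigmoid."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,), dtype="int32"),
        layers.Embedding(vocab_size, embed_dim),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])

model = build_cnn_lstm()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
print(clean_tweet("RT @user check https://t.co/xyz #breaking !!!"))
```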

15 pages, 938 KiB  
Article
MiniatureVQNet: A Light-Weight Deep Neural Network for Non-Intrusive Evaluation of VoIP Speech Quality
by Elhard James Kumalija and Yukikazu Nakamoto
Appl. Sci. 2023, 13(4), 2455; https://doi.org/10.3390/app13042455 - 14 Feb 2023
Cited by 2 | Viewed by 1271
Abstract
In IP audio systems, audio quality is degraded by environmental noise, poor network quality, and encoding–decoding algorithms. Therefore, there is a need for continuous automatic quality evaluation of the transmitted audio. Speech quality monitoring in VoIP systems enables autonomous system adaptation. Furthermore, there are diverse IP audio transmitters and receivers, from high-performance computers and mobile phones to low-memory and low-computing-capacity embedded systems. This paper proposes MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications based on a lightweight deep neural network (DNN) model. The proposed model can predict the audio quality independent of the source of degradation, whether noise or network, and is light enough to run in embedded systems. Two variations of the proposed MiniatureVQNet model were evaluated: a model trained on a dataset that contains environmental noise only, referred to as MiniatureVQNet-Noise, and a second model trained on both noise and network distortions, referred to as MiniatureVQNet-Noise-Network. The proposed MiniatureVQNet model outperforms the traditional P.563 method in terms of accuracy on all tested network conditions and environmental noise parameters. The mean squared error (MSE) of the models with respect to the PESQ score for ITU-T P.563, MiniatureVQNet-Noise, and MiniatureVQNet-Noise-Network was 2.19, 0.34, and 0.21, respectively. The performance of both the MiniatureVQNet-Noise-Network and MiniatureVQNet-Noise models depends on the noise type for SNRs greater than 0 dB and less than 10 dB. In addition, training on a noise- and network-distorted speech dataset improves the model prediction accuracy under all VoIP environment distortions compared to training the model on a noise-only dataset.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
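
The sketch below illustrates the single-ended (non-intrusive) idea in general terms: a small PyTorch network maps log-mel features of the received signal alone to a scalar quality score and is trained with MSE against PESQ labels. It is not the MiniatureVQNet architecture; the network shape and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyQualityNet(nn.Module):
    """Maps a log-mel spectrogram of the received audio to a scalar quality score."""

    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)  # lightweight recurrent encoder
        self.head = nn.Linear(hidden, 1)                      # scalar quality estimate

    def forward(self, logmel):                 # logmel: (batch, frames, n_mels)
        _, h = self.rnn(logmel)                # h: (1, batch, hidden), last hidden state
        return self.head(h.squeeze(0)).squeeze(-1)            # (batch,)

model = TinyQualityNet()
features = torch.randn(8, 200, 40)             # dummy batch of log-mel features
pesq_labels = torch.rand(8) * 3.5 + 1.0        # dummy PESQ-like targets in [1.0, 4.5]
loss = nn.functional.mse_loss(model(features), pesq_labels)
print(loss.item())
```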

18 pages, 972 KiB  
Article
Non-Autoregressive End-to-End Neural Modeling for Automatic Pronunciation Error Detection
by Md. Anwar Hussen Wadud, Mohammed Alatiyyah and M. F. Mridha
Appl. Sci. 2023, 13(1), 109; https://doi.org/10.3390/app13010109 - 22 Dec 2022
Cited by 8 | Viewed by 2270
Abstract
A crucial element of computer-assisted pronunciation training (CAPT) systems is the mispronunciation detection and diagnosis (MDD) technique. The provided transcriptions can act as a teacher when evaluating the pronunciation quality of finite speech. These prior texts have been fully exploited by conventional approaches, such as forced alignment and extended recognition networks, for model development or for enhancing system performance. The incorporation of prior texts into model training has recently been attempted using end-to-end (E2E)-based approaches, and preliminary results indicate efficacy. However, attention-based end-to-end models have shown lower speech recognition performance because multi-pass left-to-right forward computation constrains their practical applicability in beam search. In addition, end-to-end neural approaches are typically data-hungry, and a lack of non-native training data will frequently impair their effectiveness in MDD. To solve this problem, we propose an MDD technique that uses non-autoregressive (NAR) end-to-end neural models to greatly reduce estimation time while maintaining accuracy levels similar to those of traditional E2E neural models. Unlike autoregressive models, NAR models can generate parallel token sequences by accepting parallel inputs instead of left-to-right forward computation. To further enhance the effectiveness of MDD, we develop a pronunciation model superimposed on our approach’s NAR end-to-end models. To test the effectiveness of our strategy against some of the best end-to-end models, we use the publicly accessible L2-ARCTIC and SpeechOcean English datasets for training and testing, where the proposed model achieves better results than other existing models.
(This article belongs to the Special Issue Deep Learning for Speech Processing)
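
The following toy sketch conveys the non-autoregressive idea rather than the paper's model: phone posteriors for all positions come from one parallel forward pass, with no left-to-right decoding or beam search, and mispronunciations are then flagged by aligning the recognized phones against the canonical sequence. The model shape is an assumption, and the alignment uses Python's difflib purely for brevity.

```python
import difflib
import torch
import torch.nn as nn

class NARPhoneRecognizer(nn.Module):
    """Emits phone posteriors for every frame in one parallel (non-autoregressive) pass."""

    def __init__(self, feat_dim=80, d_model=128, n_phones=70):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_phones)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        # Single forward pass: no left-to-right decoding or beam search.
        return self.head(self.encoder(self.proj(feats)))      # (batch, frames, n_phones)

def flag_mispronunciations(canonical, hypothesis):
    """Align recognized phones against the canonical sequence; report mismatched spans."""
    sm = difflib.SequenceMatcher(a=canonical, b=hypothesis)
    return [(canonical[i1:i2], hypothesis[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

posteriors = NARPhoneRecognizer()(torch.randn(1, 120, 80))    # all positions at once
print(flag_mispronunciations(["DH", "AH", "K", "AE", "T"],    # canonical "the cat"
                             ["D", "AH", "K", "AE", "T"]))    # learner substituted D for DH
```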
