Advances in Speech and Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 November 2023) | Viewed by 30723

Special Issue Editors

Dr. Ying Shen
School of Software Engineering, Tongji University, Shanghai 201804, China
Interests: speech signal processing; speech synthesis

Dr. Cunhang Fan
Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
Interests: speech enhancement; speech recognition; speech signal processing

Dr. Ya Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Interests: speech synthesis; affective computing; dialog system

Special Issue Information

Dear Colleagues, 

Speech and language processing techniques play an important role in current intelligent interaction systems. As one of the most attractive research fields in artificial intelligence, this area has witnessed rapid growth in recent years. On the one hand, new applications, such as smart agents and voice-guided driving directions, continue to emerge to meet users' requirements. On the other hand, inspired by deep learning techniques, novel algorithms are being proposed to solve the complicated problems that occur in the real world. Together, these applications and algorithms are profoundly reshaping human–computer interaction. The purpose of this Special Issue is to present advanced research in the field of speech and language processing. Papers that address innovative applications and algorithms related to speech and language processing are welcome in this issue.

Dr. Ying Shen
Dr. Cunhang Fan
Dr. Ya Li 
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech recognition
  • speech synthesis
  • speech enhancement
  • speaker diarization
  • speaker recognition and verification
  • multimodal learning in speech applications
  • machine learning for CL/NLP
  • natural language generation
  • machine translation

Published Papers (17 papers)


Research


28 pages, 660 KiB  
Article
Improving End-to-End Models for Children’s Speech Recognition
by Tanvina Patel and Odette Scharenborg
Appl. Sci. 2024, 14(6), 2353; https://doi.org/10.3390/app14062353 - 11 Mar 2024
Viewed by 688
Abstract
Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of annotated children’s speech data available. We aim to improve CSR in the often-occurring scenario in which no children’s speech data are available for training the Automatic Speech Recognition (ASR) systems. Traditionally, Vocal Tract Length Normalization (VTLN) has been widely used in hybrid ASR systems to address the acoustic mismatch and variability in children’s speech when training models on adults’ speech. Meanwhile, End-to-End (E2E) systems often use data augmentation methods to create child-like speech from adults’ speech. For adult-speech-trained ASRs, we investigate the effectiveness of two augmentation methods, speed perturbation and spectral augmentation, along with VTLN, in an E2E framework for the CSR task, comparing these across Dutch, German, and Mandarin. We applied VTLN at different stages (training/test) of the ASR and conducted age and gender analyses. Our experiments showed highly similar patterns across the languages: speed perturbation and spectral augmentation yielded significant performance improvements, while VTLN provided further improvements and maintained recognition performance on adults’ speech (depending on when it was applied). Additionally, VTLN improved performance for both male and female speakers and was particularly effective for younger children. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
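
The augmentation recipe described in the abstract (speed perturbation plus spectral masking) can be sketched in a few lines. The snippet below is a minimal illustration using torchaudio, not the authors' implementation; the mask sizes and perturbation factor are arbitrary placeholders.

```python
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

def speed_perturb(wave: torch.Tensor, sample_rate: int, factor: float) -> torch.Tensor:
    # Kaldi-style speed perturbation by resampling: factor > 1.0 yields shorter,
    # faster (and higher-pitched) audio when played back at the original rate.
    return F.resample(wave, orig_freq=sample_rate, new_freq=int(sample_rate / factor))

# SpecAugment-style masking on a log-Mel spectrogram of shape (..., n_mels, n_frames)
spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=8),
    T.TimeMasking(time_mask_param=20),
)

if __name__ == "__main__":
    sr = 16000
    wave = torch.randn(1, sr)                      # 1 s of dummy audio
    fast = speed_perturb(wave, sr, 1.1)            # ~0.91 s when played back at 16 kHz
    mel = T.MelSpectrogram(sample_rate=sr)(wave)   # (1, n_mels, n_frames)
    masked = spec_augment(mel.log1p())
    print(fast.shape, masked.shape)
```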

15 pages, 3274 KiB  
Article
Unmasking Nasality to Assess Hypernasality
by Ignacio Moreno-Torres, Andrés Lozano, Rosa Bermúdez, Josué Pino, María Dolores García Méndez and Enrique Nava
Appl. Sci. 2023, 13(23), 12606; https://doi.org/10.3390/app132312606 - 23 Nov 2023
Viewed by 679
Abstract
Automatic evaluation of hypernasality has traditionally been computed using monophonic signals (i.e., combining nose and mouth signals). This study aimed to examine whether nose signals can increase the accuracy of hypernasality evaluation. Using a conventional microphone and a Nasometer, we recorded monophonic, mouth, and nose signals. Three main analyses were performed: (1) comparing the spectral distance between oral/nasalized vowels in monophonic, nose, and mouth signals; (2) assessing the accuracy of Deep Neural Network (DNN) models trained with nose, mouth, and monophonic signals in classifying oral/nasal sounds and vowel/consonant sounds; (3) analyzing the correlation between DNN-derived nasality scores and expert-rated hypernasality scores. The distance between oral and nasalized vowels was highest in the nose signals. Moreover, DNN models trained on nose signals performed best in nasal/oral classification (accuracy: 0.90), but were slightly less precise in vowel/consonant differentiation (accuracy: 0.86) than models trained on the other signals. A strong Pearson’s correlation (0.83) was observed between nasality scores from DNNs trained with nose signals and human expert ratings, whereas DNNs trained on mouth signals showed a weaker correlation (0.36). We conclude that mouth signals partially mask the nasality information carried by nose signals. Significance: the accuracy of hypernasality assessment tools may improve by analyzing nose signals. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
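
As a small illustration of the third analysis, the sketch below computes an utterance-level nasality score from hypothetical frame-wise DNN outputs and correlates it with simulated expert ratings. The `nasality_score` definition and all numbers are assumptions, not the paper's.

```python
import numpy as np
from scipy.stats import pearsonr

def nasality_score(frame_nasal_prob: np.ndarray, threshold: float = 0.5) -> float:
    """Utterance-level nasality score: fraction of frames a classifier marks as nasal.
    Illustrative definition only; the paper derives its scores from DNN outputs."""
    return float((frame_nasal_prob > threshold).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical per-utterance DNN scores and perceptual hypernasality ratings
    dnn_scores = np.array([nasality_score(rng.random(200)) for _ in range(30)])
    expert = rng.random(30) * 4                        # e.g. a 0-4 perceptual scale
    r, p = pearsonr(dnn_scores, expert)
    print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```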

15 pages, 1019 KiB  
Article
Environment-Aware Knowledge Distillation for Improved Resource-Constrained Edge Speech Recognition
by Arthur Pimentel, Heitor R. Guimarães, Anderson Avila and Tiago H. Falk
Appl. Sci. 2023, 13(23), 12571; https://doi.org/10.3390/app132312571 - 22 Nov 2023
Viewed by 759
Abstract
Recent advances in self-supervised learning have allowed automatic speech recognition (ASR) systems to achieve state-of-the-art (SOTA) word error rates (WER) while requiring only a fraction of the labeled data needed by their predecessors. Notwithstanding, while such models achieve SOTA results in matched train/test scenarios, their performance degrades substantially when tested in unseen conditions. To overcome this problem, strategies such as data augmentation and/or domain adaptation have been explored. Available models, however, are still too large to be considered for edge speech applications on resource-constrained devices; thus, model compression tools, such as knowledge distillation, are needed. In this paper, we propose three innovations on top of the existing DistilHuBERT distillation recipe: optimizing the prediction heads, employing a targeted data augmentation method for different environmental scenarios, and employing a real-time environment estimator to choose between compressed models for inference. Experiments on the LibriSpeech dataset, corrupted with varying noise types and reverberation levels, show the proposed method outperforming several benchmark methods, both original and compressed, by as much as 48.4% and 89.2% in word error reduction rate in extremely noisy and reverberant conditions, respectively, while reducing the number of parameters by 50%. Thus, the proposed method is well suited for resource-constrained edge speech recognition applications. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
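
For readers unfamiliar with the DistilHuBERT recipe the paper builds on, the sketch below shows a generic multi-head distillation loss (L1 plus a log-sigmoid cosine term) between a student's hidden states and selected teacher layers. Dimensions, the number of heads, and the `lam` weight are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadDistiller(nn.Module):
    """Toy DistilHuBERT-style prediction heads: the student's last hidden state is
    projected by separate heads towards selected teacher layers."""
    def __init__(self, student_dim=768, teacher_dim=768, n_heads=3, lam=1.0):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(student_dim, teacher_dim) for _ in range(n_heads))
        self.lam = lam

    def forward(self, student_hidden, teacher_layers):
        # student_hidden: (B, T, student_dim); teacher_layers: list of (B, T, teacher_dim)
        loss = 0.0
        for head, target in zip(self.heads, teacher_layers):
            pred = head(student_hidden)
            l1 = F.l1_loss(pred, target)
            cos = F.cosine_similarity(pred, target, dim=-1).mean()
            loss = loss + l1 - self.lam * torch.log(torch.sigmoid(cos))
        return loss

if __name__ == "__main__":
    B, T, D = 2, 50, 768
    student = torch.randn(B, T, D)
    teachers = [torch.randn(B, T, D) for _ in range(3)]
    print(MultiHeadDistiller()(student, teachers).item())
```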

23 pages, 4689 KiB  
Article
Orthogonalization of the Sensing Matrix Through Dominant Columns in Compressive Sensing for Speech Enhancement
by Vasundhara Shukla and Preety D. Swami
Appl. Sci. 2023, 13(15), 8954; https://doi.org/10.3390/app13158954 - 04 Aug 2023
Viewed by 725
Abstract
This paper introduces a novel speech enhancement approach called dominant columns group orthogonalization of the sensing matrix (DCGOSM) in compressive sensing (CS). DCGOSM optimizes the sensing matrix using particle swarm optimization (PSO), ensuring separate basis vectors for speech and noise signals. By utilizing an orthogonal matching pursuit (OMP) based CS signal reconstruction with this optimized matrix, noise components are effectively avoided, resulting in lower noise in the reconstructed signal. The reconstruction process is accelerated by iterating only through the known speech-contributing columns. DCGOSM is evaluated against various noise types using speech quality measures such as SNR, SSNR, STOI, and PESQ. Compared to other OMP-based CS algorithms and deep neural network (DNN)-based speech enhancement techniques, DCGOSM demonstrates significant improvements, with maximum enhancements of 42.54%, 62.97%, 27.48%, and 8.72% for SNR, SSNR, PESQ, and STOI, respectively. Additionally, DCGOSM outperforms DNN-based techniques by 20.32% for PESQ and 8.29% for STOI. Furthermore, it reduces recovery time by at least 13.2% compared to other OMP-based CS algorithms. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
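
The reconstruction stage relies on orthogonal matching pursuit; a plain-vanilla OMP sketch is given below for reference. It omits the paper's key contribution, namely the PSO-optimized, group-orthogonalized sensing matrix and the restriction of the search to speech-contributing columns.

```python
import numpy as np

def omp(A: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Orthogonal Matching Pursuit: recover a k-sparse x from y = A x."""
    m, n = A.shape
    residual, support = y.copy(), []
    x = np.zeros(n)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))   # column most correlated with residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((64, 256)); A /= np.linalg.norm(A, axis=0)
    x_true = np.zeros(256); x_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
    x_hat = omp(A, A @ x_true, k=5)
    print("max abs error:", np.abs(x_hat - x_true).max())
```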

17 pages, 1757 KiB  
Article
End-to-End Mispronunciation Detection and Diagnosis Using Transfer Learning
by Linkai Peng, Yingming Gao, Rian Bao, Ya Li and Jinsong Zhang
Appl. Sci. 2023, 13(11), 6793; https://doi.org/10.3390/app13116793 - 02 Jun 2023
Cited by 1 | Viewed by 1990
Abstract
As an indispensable module of computer-aided pronunciation training (CAPT) systems, mispronunciation detection and diagnosis (MDD) techniques have attracted a lot of attention from academia and industry over the past decade. To train robust MDD models, this technique requires massive human-annotated speech recordings which are usually expensive and even hard to acquire. In this study, we propose to use transfer learning to tackle the problem of data scarcity from two aspects. First, from audio modality, we explore the use of the pretrained model wav2vec2.0 for MDD tasks by learning robust general acoustic representation. Second, from text modality, we explore transferring prior texts into MDD by learning associations between acoustic and textual modalities. We propose textual modulation gates that assign more importance to the relevant text information while suppressing irrelevant text information. Moreover, given the transcriptions, we propose an extra contrastive loss to reduce the difference of learning objectives between the phoneme recognition and MDD tasks. Conducting experiments on the L2-Arctic dataset showed that our wav2vec2.0 based models outperformed the conventional methods. The proposed textual modulation gate and contrastive loss further improved the F1-score by more than 2.88% and our best model achieved an F1-score of 61.75%. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
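
The textual modulation idea can be pictured as a learned sigmoid gate over aligned text features. The sketch below is a generic PyTorch stand-in with assumed shapes and fusion rule (`TextModulationGate` is a hypothetical name), not the exact gate proposed in the paper.

```python
import torch
import torch.nn as nn

class TextModulationGate(nn.Module):
    """A sigmoid gate that lets acoustic frames decide how much aligned text
    information to admit; irrelevant text is suppressed, relevant text is kept."""
    def __init__(self, acoustic_dim=768, text_dim=256):
        super().__init__()
        self.gate = nn.Linear(acoustic_dim + text_dim, text_dim)
        self.proj = nn.Linear(text_dim, acoustic_dim)

    def forward(self, acoustic, text):
        # acoustic: (B, T, Da); text: (B, T, Dt), already aligned to the T frames
        g = torch.sigmoid(self.gate(torch.cat([acoustic, text], dim=-1)))
        return acoustic + self.proj(g * text)

if __name__ == "__main__":
    fused = TextModulationGate()(torch.randn(2, 40, 768), torch.randn(2, 40, 256))
    print(fused.shape)  # torch.Size([2, 40, 768])
```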

14 pages, 2596 KiB  
Article
Audio–Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem
by Zhanbo Shi, Lin Zhang and Dongqing Wang
Appl. Sci. 2023, 13(10), 6056; https://doi.org/10.3390/app13106056 - 15 May 2023
Cited by 2 | Viewed by 1732
Abstract
Locating the sound source is one of the most important capabilities of robot audition. In recent years, single-source localization techniques have increasingly matured. However, localizing and tracking specific sound sources in multi-source scenarios, which is known as the cocktail party problem, is still unresolved. In order to address this challenge, in this paper, we propose a system for dynamically localizing and tracking sound sources based on audio–visual information that can be deployed on a mobile robot. Our system first locates specific targets using pre-registered voiceprint and face features. Subsequently, the robot moves to track the target while keeping away from other sound sources in the surroundings instructed by the motion module, which helps the robot gather clearer audio data of the target to perform downstream tasks better. Its effectiveness has been verified via extensive real-world experiments with a 20% improvement in the success rate of specific speaker localization and a 14% reduction in word error rate in speech recognition compared to its counterparts. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
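
A minimal sketch of the target-matching step, assuming cosine similarity against pre-registered voiceprint and face embeddings with an arbitrary fusion weight and threshold; the paper's actual localization, tracking, and motion control are far richer than this.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def match_target(voice_emb, face_emb, enrolled_voice, enrolled_face, w_audio=0.5):
    """Fuse voiceprint and face similarities to decide whether the observed person
    is the pre-registered target. The 0.5 weight and 0.6 threshold are placeholders."""
    score = w_audio * cosine(voice_emb, enrolled_voice) + (1 - w_audio) * cosine(face_emb, enrolled_face)
    return score, score > 0.6

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    enrolled_v, enrolled_f = rng.standard_normal(192), rng.standard_normal(512)
    noisy_v = enrolled_v + 0.1 * rng.standard_normal(192)   # same speaker, perturbed
    noisy_f = enrolled_f + 0.1 * rng.standard_normal(512)
    print(match_target(noisy_v, noisy_f, enrolled_v, enrolled_f))
```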

15 pages, 1345 KiB  
Article
GERP: A Personality-Based Emotional Response Generation Model
by Ziyi Zhou, Ying Shen, Xuri Chen and Dongqing Wang
Appl. Sci. 2023, 13(8), 5109; https://doi.org/10.3390/app13085109 - 19 Apr 2023
Viewed by 1481
Abstract
It is important for chatbots to communicate with users emotionally. However, most emotional response generation models generate responses simply based on a specified emotion, neglecting the impact of the speaker’s personality on emotional expression. In this work, we propose a novel model named GERP to generate emotional responses based on a pre-defined personality. GERP simulates the emotion conversion process that humans undergo during conversation to make the chatbot more anthropomorphic. GERP adopts the OCEAN model to precisely define the chatbot’s personality and generates responses containing the emotion predicted from that personality. Specifically, to select the most appropriate response, a proposed beam evaluator was integrated into GERP. A Chinese sentiment vocabulary and a Chinese emotional response dataset were constructed to facilitate the emotional response generation task. The effectiveness and superiority of the proposed model over five baseline models were verified by the experiments. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)

13 pages, 838 KiB  
Article
Syllable-Based Multi-POSMORPH Annotation for Korean Morphological Analysis and Part-of-Speech Tagging
by Hyeong Jin Shin, Jeongyeon Park and Jae Sung Lee
Appl. Sci. 2023, 13(5), 2892; https://doi.org/10.3390/app13052892 - 23 Feb 2023
Cited by 2 | Viewed by 1014
Abstract
Various research approaches have attempted to solve the length difference problem between the surface form and the base form of words in the Korean morphological analysis and part-of-speech (POS) tagging task. The compound POS tagging method is a popular approach, which tackles the problem using annotation tags. However, a dictionary is required for the post-processing to recover the base form and to dissolve the ambiguity of compound POS tags, which degrades the system performance. In this study, we propose a novel syllable-based multi-POSMORPH annotation method to solve the length difference problem within one framework, without using a dictionary for the post-processing. A multi-POSMORPH tag is created by combining POS tags and morpheme syllables for the simultaneous POS tagging and morpheme recovery. The model is implemented with a two-layer transformer encoder, which is lighter than the existing models based on large language models. Nonetheless, the experiments demonstrate that the performance of the proposed model is comparable to, or better than, that of previous models. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
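
The core idea, one composite label per surface syllable that encodes both the recovered morpheme syllables and their POS tags, can be illustrated as follows. The delimiter convention and the example analysis are illustrative assumptions, not the paper's tag inventory.

```python
from typing import List, Tuple

def make_multi_posmorph_tags(syllable_analyses: List[List[Tuple[str, str]]]) -> List[str]:
    """Join, per surface syllable, the (morpheme-syllable, POS) pairs it should recover
    into one composite label, so tagging and morpheme recovery happen in one step."""
    return ["+".join(f"{syl}/{pos}" for syl, pos in pairs) for pairs in syllable_analyses]

# One surface syllable may expand into two morpheme syllables with their own POS tags,
# which is exactly the length-difference problem a per-syllable joint label side-steps.
print(make_multi_posmorph_tags([[("하", "VV"), ("였", "EP")], [("다", "EF")]]))
# ['하/VV+였/EP', '다/EF']
```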

14 pages, 3232 KiB  
Article
Deep Learning-Based Acoustic Echo Cancellation for Surround Sound Systems
by Guoteng Li, Chengshi Zheng, Yuxuan Ke and Xiaodong Li
Appl. Sci. 2023, 13(3), 1266; https://doi.org/10.3390/app13031266 - 17 Jan 2023
Cited by 1 | Viewed by 2901
Abstract
Surround sound systems that play back multi-channel audio signals through multiple loudspeakers can improve augmented reality and have been widely used in many multimedia communication systems. It is common for a hands-free speech communication system to suffer from the acoustic echo problem, and the echo needs to be canceled or suppressed completely. This paper proposes a deep learning-based acoustic echo cancellation (AEC) method to recover the desired near-end speech from the microphone signals in surround sound systems. The ambisonics technique was adopted to record the surround sound for reproduction. To achieve a better generalization capability against different loudspeaker layouts, the compressed complex spectra of the first-order ambisonic signals (B-format) were sent to the neural network directly as the input features, instead of using the ambisonic decoded signals (D-format). Experimental results in both simulated and real acoustic environments showed the effectiveness of the proposed algorithm in surround AEC; it outperformed other competing methods in terms of speech quality and the amount of echo reduction. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
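
The input feature, a power-law compressed complex spectrum of each B-format channel, can be sketched as below. The compression exponent and the per-channel stacking are assumptions rather than the paper's exact configuration.

```python
import numpy as np
import librosa

def compressed_complex_spectrum(x: np.ndarray, n_fft=512, hop=256, power=0.3):
    """Power-law compressed complex spectrum: compress the magnitude, keep the phase.
    Applied to each channel of a (channels, samples) first-order ambisonic signal."""
    feats = []
    for ch in x:
        spec = librosa.stft(ch, n_fft=n_fft, hop_length=hop)   # (F, T) complex
        mag, phase = np.abs(spec), np.exp(1j * np.angle(spec))
        feats.append((mag ** power) * phase)
    return np.stack(feats)                                      # (C, F, T) complex

if __name__ == "__main__":
    sr = 16000
    bformat = np.random.randn(4, sr).astype(np.float32)   # dummy W, X, Y, Z channels
    print(compressed_complex_spectrum(bformat).shape)
```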

25 pages, 850 KiB  
Article
Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: An Overview
by Wenqiang Du, Yikeremu Maimaitiyiming, Mewlude Nijat, Lantian Li, Askar Hamdulla and Dong Wang
Appl. Sci. 2023, 13(1), 326; https://doi.org/10.3390/app13010326 - 27 Dec 2022
Cited by 4 | Viewed by 2878
Abstract
With the emergence of deep learning, the performance of automatic speech recognition (ASR) systems has remarkably improved. Especially for resource-rich languages such as English and Chinese, commercial usage has been made feasible in a wide range of applications. However, most languages are low-resource languages, presenting three main difficulties for the development of ASR systems: (1) the scarcity of data; (2) the uncertainty in writing and pronunciation; (3) the individuality of each language. Uyghur, Kazakh, and Kyrgyz, taken as examples, are all low-resource languages involving clear geographical variation in their pronunciation, and each possesses its own unique acoustic properties and phonological rules. On the other hand, they all belong to the Turkic branch of the Altaic language family, so they share many commonalities. This paper presents an overview of speech recognition techniques developed for Uyghur, Kazakh, and Kyrgyz, with the purposes of (1) highlighting the techniques that are specifically effective for each language and generally effective for all of them and (2) discovering the important factors in promoting speech recognition research on low-resource languages, through a comparative study of the development paths of these three neighboring languages. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)

14 pages, 3997 KiB  
Article
A Full Loading-Based MVDR Beamforming Method by Backward Correction of the Steering Vector and Reconstruction of the Covariance Matrix
by Jing Zhou and Changchun Bao
Appl. Sci. 2023, 13(1), 285; https://doi.org/10.3390/app13010285 - 26 Dec 2022
Cited by 2 | Viewed by 1473
Abstract
In order to improve the performance of the diagonal loading-based minimum variance distortionless response (MVDR) beamformer, a full loading-based MVDR beamforming method is proposed in this paper. Different from the conventional diagonal loading methods, the proposed method combines the backward correction of the steering vector of the target source and the reconstruction of the covariance matrix. Firstly, based on the linear combination, an appropriate full loading matrix was constructed to correct the steering vector of the target source backward. Secondly, based on the spatial sparsity of the sound sources, an appropriate loading matrix was constructed to further suppress interferences. Thirdly, the spatial response power was utilized to derive a more accurate direction of arrival (DOA) of the target source, which is helpful for obtaining a more accurate steering vector of the target source and a more effective covariance matrix iteratively. The simulation results show that the proposed method can effectively suppress interferences and noise. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
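
As the baseline the paper improves upon, the textbook diagonal-loading MVDR beamformer is sketched below; the paper replaces the scaled identity with a structured full loading matrix and iteratively corrects the steering vector, which this sketch does not do.

```python
import numpy as np

def mvdr_weights(R: np.ndarray, d: np.ndarray, loading: float = 1e-3) -> np.ndarray:
    """Diagonal-loading MVDR: w = (R + dI)^{-1} d / (d^H (R + dI)^{-1} d)."""
    Rl = R + loading * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
    Rinv_d = np.linalg.solve(Rl, d)
    return Rinv_d / (d.conj() @ Rinv_d)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    M = 6                                              # microphones
    snaps = rng.standard_normal((M, 2000)) + 1j * rng.standard_normal((M, 2000))
    R = snaps @ snaps.conj().T / snaps.shape[1]        # sample covariance matrix
    d = np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(20)))  # ULA steering vector
    w = mvdr_weights(R, d)
    print(abs(w.conj() @ d))                           # distortionless constraint, = 1
```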

22 pages, 2743 KiB  
Article
Effective Dereverberation with a Lower Complexity at Presence of the Noise
by Fengqi Tan, Changchun Bao and Jing Zhou
Appl. Sci. 2022, 12(22), 11819; https://doi.org/10.3390/app122211819 - 21 Nov 2022
Cited by 2 | Viewed by 1231
Abstract
Adaptive beamforming and deconvolution techniques have shown effectiveness for reducing noise and reverberation. The minimum variance distortionless response (MVDR) beamformer is the most widely used adaptive beamformer, whereas multichannel linear prediction (MCLP) is an excellent approach for deconvolution. Solving the problem in which noise and reverberation occur together is a challenging task. In this paper, the MVDR beamformer and MCLP are effectively combined for noise reduction and dereverberation. Specifically, the MCLP coefficients are estimated by a Kalman filter, while an MVDR filter based on the complex Gaussian mixture model (CGMM) is used to enhance the speech corrupted by reverberation and noise and to estimate the power spectral density (PSD) of the target speech required by the Kalman filter. The final enhanced speech is obtained by the Kalman filter. Furthermore, a complexity reduction method for the Kalman filter is also proposed based on the Kronecker product. Compared to two advanced algorithms that are very effective for removing reverberation, the integrated sidelobe cancellation and linear prediction (ISCLP) method and the weighted prediction error (WPE) method, the proposed algorithm shows better performance and lower complexity. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
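
To make the MCLP component concrete, the sketch below runs a single unweighted least-squares pass of delayed multichannel linear prediction on one frequency bin. The paper instead estimates these coefficients with a Kalman filter and couples them to a CGMM-based MVDR, so treat this only as a conceptual reference.

```python
import numpy as np

def mclp_dereverb(Y: np.ndarray, order: int = 10, delay: int = 3, eps: float = 1e-6):
    """Delayed multichannel linear prediction for one STFT bin: predict the late
    reverberation of the reference channel from delayed frames of all channels
    and subtract it (WPE-style, single least-squares pass)."""
    C, T = Y.shape                      # channels x frames (complex STFT, one bin)
    cols = []
    for t in range(T):
        past = [Y[:, t - delay - k] if t - delay - k >= 0 else np.zeros(C, complex)
                for k in range(order)]
        cols.append(np.concatenate(past))
    X = np.stack(cols, axis=1)          # (C*order, T) stacked delayed observations
    y = Y[0]                            # reference channel
    g = np.linalg.solve(X @ X.conj().T + eps * np.eye(C * order), X @ y.conj())
    return y - g.conj() @ X             # dereverberated reference-channel frames

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    Y = rng.standard_normal((4, 200)) + 1j * rng.standard_normal((4, 200))
    print(mclp_dereverb(Y).shape)       # (200,)
```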

17 pages, 2844 KiB  
Article
Packet Loss Concealment Based on Phase Correction and Deep Neural Network
by Qiang Ji, Changchun Bao and Zihao Cui
Appl. Sci. 2022, 12(19), 9721; https://doi.org/10.3390/app12199721 - 27 Sep 2022
Cited by 3 | Viewed by 1661
Abstract
In a packet-switching network, the performance of packet loss concealment (PLC) is often affected by inaccurate estimation of the phase spectrum of the speech signal in the lost packet. In order to solve this problem, two PLC methods for the scenario of continuous packet loss are proposed in this paper based on phase correction. One is based on waveform similarity overlap-add (WSOLA) and a deep neural network (DNN), and the other is based on the Griffin–Lim algorithm (GLA) and a DNN. In the first method, considering the correlation of adjacent frames of the speech signal, the periodicity of the speech signal is well retained by the WSOLA method so that the phase spectrum of the lost signal is recovered. Combined with the DNN’s prediction of the amplitude spectrum of the lost signal, the performance of PLC in the case of continuous packet loss is effectively improved. In the second method, the phase spectrum of the lost signal is modified iteratively, using the amplitude spectra estimated via the DNN and the consistency of the Fourier transform, so that the phase spectra match the amplitude spectra; this also achieves effective PLC. The experimental results show that the proposed PLC methods provide better speech quality than the reference methods. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
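
The GLA-based variant rests on alternating STFT/ISTFT projections that make a fixed (DNN-predicted) amplitude spectrum and an evolving phase spectrum consistent. The sketch below shows that iteration with librosa, using a clean magnitude as a stand-in for the DNN output.

```python
import numpy as np
import librosa

def gla_phase_fill(target_mag: np.ndarray, init_phase: np.ndarray,
                   n_fft=512, hop=128, iters=30) -> np.ndarray:
    """Griffin-Lim-style iteration: keep the target magnitude fixed and refine the
    phase by alternating ISTFT/STFT projections until the two are consistent."""
    spec = target_mag * np.exp(1j * init_phase)
    for _ in range(iters):
        wave = librosa.istft(spec, hop_length=hop)
        reproj = librosa.stft(wave, n_fft=n_fft, hop_length=hop)
        spec = target_mag * np.exp(1j * np.angle(reproj))   # keep magnitude, update phase
    return librosa.istft(spec, hop_length=hop)

if __name__ == "__main__":
    y = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000).astype(np.float32)
    S = librosa.stft(y, n_fft=512, hop_length=128)
    mag = np.abs(S)                                          # pretend this came from a DNN
    rec = gla_phase_fill(mag, np.zeros_like(mag))
    print(rec.shape)
```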

12 pages, 811 KiB  
Article
Entropy-Based Dynamic Rescoring with Language Model in E2E ASR Systems
by Zhuo Gong, Daisuke Saito and Nobuaki Minematsu
Appl. Sci. 2022, 12(19), 9690; https://doi.org/10.3390/app12199690 - 27 Sep 2022
Viewed by 1147
Abstract
Language models (LMs) have played crucial roles in automatic speech recognition (ASR), whether as an essential part of a conventional ASR system composed of an acoustic model and an LM, or as an integrated model that enhances the performance of novel end-to-end ASR systems. With the development of machine learning and deep learning, language modeling has made great progress in natural language processing applications. In recent years, efforts have been made to bring the advantages of novel LMs to ASR. The most common way to apply such integration is still shallow fusion, because it can be implemented with essentially zero overhead while obtaining significant improvements. Our method further enhances the applicability of shallow fusion, removing the need for hyperparameter tuning while maintaining similar performance. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
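
Shallow fusion itself is a one-line score combination; the sketch below adds an entropy-scaled LM weight to convey the flavor of dynamic, tuning-free rescoring. The specific weighting rule is an assumption and not necessarily the one proposed in the paper.

```python
import numpy as np

def shallow_fusion_step(asr_logprobs: np.ndarray, lm_logprobs: np.ndarray,
                        base_lambda: float = 0.3) -> np.ndarray:
    """One decoding step of shallow fusion: score = log P_asr + lambda * log P_lm,
    with lambda scaled by the normalized entropy of the ASR distribution
    (more uncertain acoustics -> lean more on the LM)."""
    p = np.exp(asr_logprobs)
    entropy = -(p * asr_logprobs).sum()
    lam = base_lambda * entropy / np.log(len(asr_logprobs))
    return asr_logprobs + lam * lm_logprobs

if __name__ == "__main__":
    vocab = 10
    asr = np.log(np.random.dirichlet(np.ones(vocab)))
    lm = np.log(np.random.dirichlet(np.ones(vocab)))
    print(int(shallow_fusion_step(asr, lm).argmax()))
```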

20 pages, 4886 KiB  
Article
Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neutral Network
by Juan Li, Xueying Zhang, Lixia Huang, Fenglian Li, Shufei Duan and Ying Sun
Appl. Sci. 2022, 12(19), 9518; https://doi.org/10.3390/app12199518 - 22 Sep 2022
Cited by 12 | Viewed by 2299
Abstract
In the era of artificial intelligence, realizing smooth communication between people and machines has become a widely pursued goal. The Mel spectrogram is a common representation used in speech emotion recognition, focusing on the low-frequency part of speech. In contrast, the inverse Mel (IMel) spectrogram, which focuses on the high-frequency part, is proposed here so that emotions can be analyzed comprehensively. Because the convolutional neural network-stacked sparse autoencoder (CNN-SSAE) can extract deep optimized features, a Mel-IMel dual-channel complementary structure is proposed. In the first channel, a CNN is used to extract the low-frequency information of the Mel spectrogram. The other channel extracts the high-frequency information of the IMel spectrogram. This information is passed to an SSAE to reduce the number of dimensions and obtain the optimized information. Experimental results show that the highest recognition rates achieved on the EMO-DB, SAVEE, and RAVDESS datasets were 94.79%, 88.96%, and 83.18%, respectively. The recognition rate of the two spectrograms combined was higher than that of either single spectrogram, which shows that the two spectrograms are complementary, and adding the SSAE after the CNN to obtain the optimized information further improved the recognition rate, which demonstrates the effectiveness of the CNN-SSAE network. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
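
One simple way to realize an "inverse Mel" companion to the Mel spectrogram is to mirror the Mel filterbank along the frequency axis so that the narrow filters cover high frequencies. The sketch below does exactly that with librosa, as an interpretation of the IMel idea rather than the paper's exact filterbank.

```python
import numpy as np
import librosa

def mel_and_imel(y: np.ndarray, sr: int = 16000, n_fft: int = 512, n_mels: int = 64):
    """Return a log-Mel spectrogram plus an 'inverse Mel' version obtained by flipping
    the Mel filterbank across frequency (narrow filters end up at high frequencies)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2                 # power spectrogram (F, T)
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, F)
    mel = fb @ S
    imel = fb[:, ::-1] @ S                                        # mirrored filters
    return np.log1p(mel), np.log1p(imel)

if __name__ == "__main__":
    y = librosa.tone(440, sr=16000, duration=1.0)
    mel, imel = mel_and_imel(y)
    print(mel.shape, imel.shape)
```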

15 pages, 12136 KiB  
Article
Target Speaker Extraction by Fusing Voiceprint Features
by Shidan Cheng, Ying Shen and Dongqing Wang
Appl. Sci. 2022, 12(16), 8152; https://doi.org/10.3390/app12168152 - 15 Aug 2022
Cited by 3 | Viewed by 1937
Abstract
Accurately separating clean speech for different speakers in multispeaker scenarios is a critical problem. However, in most cases, smart devices such as smartphones interact with only one specific user. As a consequence, the speech separation models adopted by these devices only have to extract the target speaker’s speech. A voiceprint, which reflects the speaker’s voice characteristics, provides prior knowledge for target speech separation. Therefore, how to efficiently integrate voiceprint features into existing speech separation models to improve their performance on target speech separation is an interesting problem that has not been fully explored. This paper attempts to address this issue, and our contributions are as follows. First, two different voiceprint features (i.e., MFCCs and the d-vector) are explored for performance enhancement in three speech separation models. Second, three different feature fusion methods are proposed to efficiently fuse the voiceprint features with the magnitude spectrograms originally used in the speech separation models. Third, a target speech extraction method which utilizes the fused features is proposed for two speaker-independent models. Experiments demonstrate that the speech separation models integrated with voiceprint features using the three feature fusion methods can effectively extract the target speaker’s speech. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
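
The simplest of the fusion options, tiling an utterance-level voiceprint over time and concatenating it to the magnitude spectrogram, looks roughly as follows. Dimensions are illustrative; the paper also studies MFCC-based voiceprints and other fusion schemes.

```python
import torch

def concat_fuse(mag_spec: torch.Tensor, d_vector: torch.Tensor) -> torch.Tensor:
    """Tile the utterance-level voiceprint (d-vector) over time and concatenate it to
    every frame of the magnitude spectrogram before the separation network."""
    B, F, T = mag_spec.shape                 # batch, frequency bins, frames
    vp = d_vector.unsqueeze(-1).expand(B, d_vector.shape[-1], T)
    return torch.cat([mag_spec, vp], dim=1)  # (B, F + d, T)

if __name__ == "__main__":
    fused = concat_fuse(torch.rand(2, 257, 100), torch.rand(2, 256))
    print(fused.shape)  # torch.Size([2, 513, 100])
```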

Review


18 pages, 833 KiB  
Review
Code-Switching in Automatic Speech Recognition: The Issues and Future Directions
by Mumtaz Begum Mustafa, Mansoor Ali Yusoof, Hasan Kahtan Khalaf, Ahmad Abdel Rahman Mahmoud Abushariah, Miss Laiha Mat Kiah, Hua Nong Ting and Saravanan Muthaiyah
Appl. Sci. 2022, 12(19), 9541; https://doi.org/10.3390/app12199541 - 23 Sep 2022
Cited by 3 | Viewed by 3878
Abstract
Code-switching (CS) in spoken language is where the speech has two or more languages within an utterance. It is an unsolved issue in automatic speech recognition (ASR) research as ASR needs to recognise speech in bilingual and multilingual settings, where the accuracy of ASR systems declines with CS due to pronunciation variation. There are very few reviews carried out on CS, with none conducted on bilingual and multilingual CS ASR systems. This study investigates the importance of CS in bilingual and multilingual speech recognition systems. To meet the objective of this study, two research questions were formulated, which cover both the current issues and the direction of the research. Our review focuses on databases, acoustic and language modelling, and evaluation metrics. Using selected keywords, this research has identified 274 papers and selected 42 experimental papers for review, of which 24 (representing 57%) have discussed CS, while the rest look at multilingual ASR research. The selected papers cover many well-resourced and under-resourced languages, and novel techniques to manage CS in ASR systems, which are mapping, combining and merging the phone sets of the languages experimented with in the research. Our review also examines the performance of those methods. This review found a significant variation in the performance of CS in terms of word error rates, indicating an inconsistency in the ability of ASRs to handle CS. In the conclusion, we suggest several future directions that address the issues identified in this review. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
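
Since the review's central quantitative observation concerns the spread of word error rates across code-switching ASR systems, a reference implementation of WER is included below; the code-switched example sentence is invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words: (S + D + I) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(1, len(ref))

print(wer("saya suka this movie", "saya suka the movie"))  # 0.25 on a code-switched utterance
```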
