Advanced Technology in Speech and Acoustic Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Electrical, Electronics and Communications Engineering".

Deadline for manuscript submissions: 30 June 2024 | Viewed by 9337

Special Issue Editors

Dr. Dong Wang
Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Interests: speech and acoustic signal processing
Dr. Andrew Abel
Computer and Information Sciences, University of Strathclyde, Glasgow SC015263, UK
Interests: image processing; signal processing; speech processing; multimodal speech filtering

Special Issue Information

Dear Colleagues,

We live in a world of sound: the sounds of nature, of animals, and of humans. Uncovering information from sound waves remains an active research field, with challenges such as event detection, speech recognition, speaker recognition, and emotion recognition, to name a few. Conventional approaches are mostly based on statistical models, while the new generation of technologies extensively uses neural models as the backbone, partly enabled by the accumulation of large-scale data. In recent years, numerous novel methods have been proposed, such as end-to-end modeling, effective data augmentation, large-scale pre-training, and the smart design of architectures and loss functions. These new techniques have brought significant performance improvements, which in turn have substantially extended the application scope of the technology.

Despite these tremendous successes, many challenges remain. For instance, it is still hard to detect events that appear only rarely in the training data; speech recognition performance for minority languages is still weak; and recognizing speakers in speech corrupted by interfering noise or competing speakers remains difficult. Some of these challenges can be solved by the prudent employment of existing methods, while others call for new theories, algorithms, and computational models.

This special issue will focus on novel theories and methods as well as new experimental findings in various domains of speech and acoustic signal processing, including but not limited to speech enhancement, experimental acoustics, speech/speaker/emotion recognition, multimodal signal processing, and speech synthesis. 

Dr. Dong Wang
Dr. Andrew Abel
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (9 papers)


Research

13 pages, 2067 KiB  
Article
A Dual-Branch Speech Enhancement Model with Harmonic Repair
Appl. Sci. 2024, 14(4), 1645; https://doi.org/10.3390/app14041645 - 18 Feb 2024
Viewed by 360
Abstract
Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Because previous studies lack structures dedicated to harmonic fitting, and because of the limited receptive field of traditional convolutions, the auditory quality of the enhanced speech inevitably declines, which in turn degrades subsequent tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, which uses a harmonic repair network for denoising, followed by a real-imaginary dual-branch structure for restoration. This approach fully utilizes harmonic overtones to match the original harmonic distribution of the speech. The dual-branch stage then restores the speech, specifically optimizing its auditory quality for the human ear. Experiments show that HRLF-Net significantly improves speech intelligibility and quality and effectively restores harmonic information.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

30 pages, 4582 KiB  
Article
Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features
Appl. Sci. 2024, 14(2), 798; https://doi.org/10.3390/app14020798 - 17 Jan 2024
Viewed by 476
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth-region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
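As a rough illustration of the Gabor approach, the sketch below builds a small bank of 2D Gabor kernels in NumPy and pools the filter responses of a mouth-region image into a low-dimensional feature vector. All parameter values (kernel size, wavelength, number of orientations) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gabor_kernel(ksize=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5):
    """Real part of a 2D Gabor kernel; parameter values are illustrative."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates to the filter orientation theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / lambd)
    return envelope * carrier

def gabor_features(image, n_orientations=4):
    """Filter an image at several orientations (circular convolution via
    FFT) and pool each response to its mean absolute energy."""
    feats = []
    for k in range(n_orientations):
        kern = gabor_kernel(theta=k * np.pi / n_orientations)
        kpad = np.zeros_like(image, dtype=float)
        kpad[:kern.shape[0], :kern.shape[1]] = kern
        response = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kpad)))
        feats.append(np.abs(response).mean())
    return np.array(feats)
```

A real pipeline would apply this per video frame to a cropped mouth region and feed the resulting sequence to the LSTM model.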

16 pages, 1637 KiB  
Article
Exploring Multi-Stage GAN with Self-Attention for Speech Enhancement
Appl. Sci. 2023, 13(16), 9217; https://doi.org/10.3390/app13169217 - 14 Aug 2023
Cited by 1 | Viewed by 928
Abstract
Multi-stage or multi-generator generative adversarial networks (GANs) have recently been demonstrated to be effective for speech enhancement. However, existing multi-generator GANs for speech enhancement use only convolutional layers for synthesising clean speech signals, and this reliance on the convolution operation may mask the temporal dependencies within the signal sequence. This study explores self-attention as a way to address the temporal-dependency issue in multi-generator speech enhancement GANs and improve their enhancement performance. We empirically study the effect of integrating a self-attention mechanism into the convolutional layers of the multiple generators in multi-stage or multi-generator speech enhancement GANs, specifically the ISEGAN and DSEGAN networks. The experimental results show that introducing a self-attention mechanism into ISEGAN and DSEGAN improves their speech enhancement quality and intelligibility across the objective evaluation metrics. Furthermore, we observe that adding self-attention to the ISEGAN's generators not only improves its enhancement performance but also bridges the performance gap between the ISEGAN and the DSEGAN with a smaller model footprint. Overall, our findings highlight the potential of self-attention in improving the enhancement performance of multi-generator speech enhancement GANs.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
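To make the temporal-dependency argument concrete, here is a minimal single-head scaled dot-product self-attention step in NumPy: unlike a convolution, every output frame is a weighted mixture of all input frames. The learned Q/K/V projections and multiple heads that a real generator layer would use are omitted; this is a sketch of the mechanism only.

```python
import numpy as np

def self_attention(x):
    """x: (T, d) sequence of frame features. Returns (T, d) where each
    frame attends to every other frame, not just a local window."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (T, T) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over time
    return weights @ x                              # global temporal mixing
```

Each row of `weights` sums to one, so the layer re-expresses every frame in terms of the whole sequence, which is exactly the long-range dependency a fixed convolutional receptive field cannot capture.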

12 pages, 1810 KiB  
Article
Amplitude and Phase Information Interaction for Speech Enhancement Method
Appl. Sci. 2023, 13(14), 8025; https://doi.org/10.3390/app13148025 - 9 Jul 2023
Viewed by 712
Abstract
To improve the speech enhancement ability of the FullSubNet model, an improved method, FullSubNet-pMix, is proposed. Specifically, a pMix module is added to the full-band frequency-domain processing structure, enabling information interaction between the amplitude spectrum and the phase spectrum. At the same time, the hyperparameters used in training are optimized so that the full-band and sub-band structures of the system work together more effectively. Experiments on selected test sets show that the proposed method independently improves the speech enhancement effect of the model and outperforms the original model on all four evaluation metrics: WB-PESQ, NB-PESQ, STOI, and SI-SDR. Therefore, the FullSubNet-pMix method proposed in this paper can effectively enhance the model's ability to extract and use speech information. The impact of different loss functions on training performance was also verified.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
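For readers unfamiliar with the two spectra involved, the NumPy fragment below factors one windowed frame's complex spectrum into the amplitude and phase components that the pMix module allows to interact. The signal, sampling rate, and frame length are arbitrary example values.

```python
import numpy as np

fs = 16000                                              # example sampling rate
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 440 * t) * np.hanning(512)   # one windowed frame

spec = np.fft.rfft(frame)        # complex spectrum of the frame
magnitude = np.abs(spec)         # amplitude spectrum
phase = np.angle(spec)           # phase spectrum

# Amplitude and phase jointly determine the signal: recombining is lossless,
# so an enhancement model that fixes only the magnitude leaves phase errors in.
reconstructed = magnitude * np.exp(1j * phase)
```

This is why letting the two spectra exchange information, rather than enhancing magnitude alone, can matter for perceived quality.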

25 pages, 1378 KiB  
Article
Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths
Appl. Sci. 2023, 13(13), 7579; https://doi.org/10.3390/app13137579 - 27 Jun 2023
Cited by 1 | Viewed by 1623
Abstract
Speech is critical for interpersonal communication, but not everyone has fluent communication skills. Speech disfluency, including stuttering and interruptions, affects not only emotional expression but also clarity of expression for people who stutter. Existing methods for detecting speech disfluency rely heavily on annotated data, which can be costly to obtain. Moreover, these methods have not considered variable-length disfluent speech, which limits their scalability. To address these limitations, this paper proposes an automated method for detecting speech disfluency that can help individuals improve their communication skills and assist therapists in tracking the progress of stuttering patients. The proposed method focuses on detecting four types of disfluency features using single-task detection and utilizes embeddings from the pre-trained wav2vec2.0 model, together with convolutional neural network (CNN) and Transformer models, for feature extraction. The model's scalability is improved by accounting for variable-length disfluent speech and modifying the model based on the entropy invariance of attention mechanisms; this scalability across languages and utterance lengths enhances the method's practical applicability. Experiments demonstrate that the model outperforms baseline models on both English and Chinese datasets, showing its universality and scalability in real-world applications.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
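The abstract does not spell out the entropy-invariance modification; one common form of the idea is sketched below: rescale the attention logits with the current sequence length so that the attention distribution keeps roughly the same entropy when test utterances are longer or shorter than those seen in training. The exact formula, the reference length, and all names here are assumptions for illustration only, not the paper's method.

```python
import numpy as np

def length_scaled_attention(q, k, v, train_len=500):
    """Scaled dot-product attention whose logit scale grows with log(T)
    relative to a reference training length (an illustrative reading of
    entropy-invariant attention scaling for variable-length input)."""
    T, d = q.shape
    scale = np.log(T) / (np.log(train_len) * np.sqrt(d))
    logits = (q @ k.T) * scale
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over time
    return w @ v
```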

16 pages, 3386 KiB  
Article
Multi-Level Attention-Based Categorical Emotion Recognition Using Modulation-Filtered Cochleagram
Appl. Sci. 2023, 13(11), 6749; https://doi.org/10.3390/app13116749 - 1 Jun 2023
Cited by 1 | Viewed by 808
Abstract
Speech emotion recognition is a critical component for achieving natural human–robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception, which contains a multi-dimensional spectral–temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-level attention network to extract high-level emotional feature representations from the modulation-filtered cochleagram. Our approach utilizes channel-level and spatial-level attention modules to generate emotional saliency maps of channel and spatial feature representations, capturing the significant emotional channels and feature space from the 3D convolution feature maps, respectively. Furthermore, we employ a temporal-level attention module to capture significant emotional regions from the concatenated feature sequence of the emotional saliency maps. Our experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrate that the modulation-filtered cochleagram significantly improves the prediction performance of categorical emotion compared to the other evaluated features. Moreover, our emotion recognition framework achieves an unweighted accuracy of 71% in categorical emotion recognition, comparable to several existing approaches. In summary, our study demonstrates the effectiveness of the modulation-filtered cochleagram in speech emotion recognition, and our proposed multi-level attention framework provides a promising direction for future research in this field.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
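The channel-level attention module can be pictured as a squeeze-and-excitation-style gate over the 3D convolution feature maps. The NumPy sketch below uses random stand-in weights in place of the paper's learned layers, so the shapes, bottleneck ratio, and weight values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(fmap, w1, w2):
    """Squeeze-and-excitation-style channel attention over a feature map
    fmap of shape (C, T, F): pool each channel globally, pass through a
    small bottleneck, and rescale channels by sigmoid gates in (0, 1)."""
    squeezed = fmap.mean(axis=(1, 2))              # (C,) global average pool
    hidden = np.maximum(0.0, w1 @ squeezed)        # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # per-channel gates
    return fmap * gates[:, None, None]             # emphasize salient channels

C = 8
fmap = rng.standard_normal((C, 10, 12))            # toy (channel, time, freq)
w1 = rng.standard_normal((C // 2, C))              # stand-in learned weights
w2 = rng.standard_normal((C, C // 2))
out = channel_attention(fmap, w1, w2)
```

Spatial-level and temporal-level attention follow the same gate-and-rescale pattern over different axes of the representation.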

10 pages, 471 KiB  
Article
Replay Speech Detection Based on Dual-Input Hierarchical Fusion Network
Appl. Sci. 2023, 13(9), 5350; https://doi.org/10.3390/app13095350 - 25 Apr 2023
Viewed by 1156
Abstract
Speech anti-spoofing is a crucial aspect of speaker recognition systems and has received a great deal of attention in recent years. Deep neural networks have achieved satisfactory results on datasets whose training and testing data share similar distributions, but their generalization ability is limited on datasets with different distributions. In this paper, we propose a novel dual-input hierarchical fusion network (HFN) to improve the generalization ability of our model. The network has two inputs, the original speech signal and its time-reversed counterpart, which increases the volume and diversity of the training data. The hierarchical fusion model (HFM) enables a more thorough fusion of information from different input levels and improves model performance by fusing the two inputs after speech feature extraction. We evaluated the results on the ASVspoof 2021 PA (Physical Access) dataset: the proposed system achieved an Equal Error Rate (EER) of 24.46% and a minimum tandem Detection Cost Function (min t-DCF) of 0.6708 on the test set. Compared with the four baseline systems in the ASVspoof 2021 competition, the proposed system's min t-DCF values decreased by 28.9%, 31.0%, 32.6%, and 32.9%, and its EERs decreased by 35.7%, 38.1%, 45.4%, and 49.7%, respectively.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
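The Equal Error Rate quoted above is the operating point where false acceptances of spoofed trials balance false rejections of genuine ones. A minimal score-based computation is sketched below; it is an illustration, not the official ASVspoof scoring tool, and the threshold sweep is deliberately simple.

```python
import numpy as np

def equal_error_rate(target_scores, spoof_scores):
    """EER: sweep candidate thresholds and return the point where the
    false-acceptance rate on spoofed trials is closest to the
    false-rejection rate on genuine trials."""
    thresholds = np.sort(np.concatenate([target_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))      # closest crossing of the curves
    return (far[idx] + frr[idx]) / 2
```

With perfectly separated score distributions this returns 0; heavily overlapping distributions push it toward 0.5 (chance).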

16 pages, 3905 KiB  
Article
Multi-Scale Feature Learning for Language Identification of Overlapped Speech
Appl. Sci. 2023, 13(7), 4235; https://doi.org/10.3390/app13074235 - 27 Mar 2023
Cited by 4 | Viewed by 918
Abstract
Language identification is the front end of multilingual speech-processing tasks. This study aims to enhance the accuracy of language identification in complex acoustic environments by proposing a multi-scale feature extraction method that replaces the baseline feature extraction network with a multi-scale feature extraction network (SE-Res2Net-CBAM-BILSTM). A multilingual cocktail-party dataset was simulated, and comparative experiments were conducted with various models. The experimental results show that the proposed model achieved language identification accuracies of 97.6% on an Oriental language dataset and 75% on the multilingual cocktail-party dataset. Furthermore, comparative experiments show that our model outperformed three other models in accuracy, recall, and F1 score. Finally, a comparison of different loss functions shows that model performance was better when using focal loss.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

14 pages, 2654 KiB  
Article
Multi-Scale Channel Adaptive Time-Delay Neural Network and Balanced Fine-Tuning for Arabic Dialect Identification
Appl. Sci. 2023, 13(7), 4233; https://doi.org/10.3390/app13074233 - 27 Mar 2023
Viewed by 1188
Abstract
The time-delay neural network (TDNN) can consider multiple frames of information simultaneously, making it particularly suitable for dialect identification. However, previous TDNN architectures have focused on only one aspect of either the temporal or the channel information, lacking a unified optimization for both domains. We believe that extracting appropriate contextual information and enhancing channels are critical for dialect identification. Therefore, in this paper, we propose a novel approach that uses the ECAPA-TDNN from the speaker recognition domain as the backbone network and introduces a new multi-scale channel adaptive module (MSCA-Res2Block) to construct a multi-scale channel adaptive time-delay neural network (MSCA-TDNN). The MSCA-Res2Block is capable of extracting multi-scale features, further enlarging the receptive field of the convolutional operations. We evaluated the proposed method on the ADI17 Arabic dialect dataset, employing a balanced fine-tuning strategy to address the imbalance of dialect datasets and Z-score normalization to eliminate score-distribution differences among dialects. Our system achieved an average cost performance (Cavg) of 4.19% and an accuracy of 94.28%. Compared to ECAPA-TDNN, our model showed a 22% relative improvement in Cavg, and it outperformed the state-of-the-art single-network model reported in the ADI17 competition. Its Cavg also held an advantage over the best-performing multi-network hybrid system in the competition.
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)
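Z-score normalization, as used here to remove score-distribution differences between dialects, simply maps each dialect's scores to zero mean and unit variance so that one decision threshold is comparable across dialects. A minimal sketch:

```python
import numpy as np

def z_norm(scores):
    """Map one dialect's raw scores to zero mean and unit variance, so
    score distributions from different dialects become comparable."""
    return (scores - scores.mean()) / scores.std()
```

In practice the mean and standard deviation would be estimated per dialect on held-out scores and then applied to test scores.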
