Advanced Technology in Speech and Acoustic Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Electrical, Electronics and Communications Engineering".

Deadline for manuscript submissions: 30 June 2024 | Viewed by 11292

Special Issue Editors

Dr. Dong Wang
Guest Editor
Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Interests: speech and acoustic signal processing

Dr. Andrew Abel
Guest Editor
Computer and Information Sciences, University of Strathclyde, Glasgow SC015263, UK
Interests: image processing; signal processing; speech processing; multimodal speech filtering

Special Issue Information

Dear Colleagues,

We live in a world of sound: the sounds of nature, of animals, and of humans. Uncovering information from sound waves remains an active research field, with challenges such as event detection, speech recognition, speaker recognition, and emotion recognition, to name a few. Conventional approaches are mostly based on statistical models, while the new generation of technologies extensively uses neural models as the backbone, partly due to the accumulation of large-scale data. In recent years, numerous novel methods have been proposed, such as end-to-end modeling, effective data augmentation, large-scale pre-training, and careful design of architectures and loss functions. These new techniques have brought significant performance improvements, which in turn have substantially extended the application scope of speech and acoustic technologies.

Despite these tremendous successes, many challenges remain. For instance, it is still difficult to detect events that appear only rarely in the training data; speech recognition performance for low-resource languages is still weak; and recognizing speakers in the presence of interfering noise or competing speakers remains hard. Some of these challenges can be addressed by the prudent application of existing methods, while others call for new theories, algorithms, and computational models.

This special issue will focus on novel theories and methods as well as new experimental findings in various domains of speech and acoustic signal processing, including but not limited to speech enhancement, experimental acoustics, speech/speaker/emotion recognition, multimodal signal processing, and speech synthesis. 

Dr. Dong Wang
Dr. Andrew Abel
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (9 papers)

Research

13 pages, 2067 KiB  
Article
A Dual-Branch Speech Enhancement Model with Harmonic Repair
by Lizhen Jia, Yanyan Xu and Dengfeng Ke
Appl. Sci. 2024, 14(4), 1645; https://doi.org/10.3390/app14041645 - 18 Feb 2024
Viewed by 637
Abstract
Recent speech enhancement studies have mostly focused on completely separating noise from human voices. Owing to the lack of specific structures for harmonic fitting in previous studies and the limitations of the traditional convolutional receptive field, there is an inevitable decline in the auditory quality of the enhanced speech, which in turn degrades downstream tasks such as speech recognition and speaker identification. To address these problems, this paper proposes a Harmonic Repair Large Frame enhancement model, called HRLF-Net, which uses a harmonic repair network for denoising, followed by a real-imaginary dual-branch structure for restoration. This approach fully utilizes harmonic overtones to match the original harmonic distribution of the speech. The subsequent dual-branch stage then restores the speech, specifically optimizing its auditory quality for the human ear. Experiments show that with HRLF-Net, the intelligibility and quality of speech are significantly improved, and harmonic information is effectively restored.
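
As an illustration only, and not the architecture of HRLF-Net itself, the general real-imaginary dual-branch idea can be sketched in a few lines of PyTorch; the STFT settings and branch layers below are assumptions chosen for brevity.

import torch
import torch.nn as nn

# Minimal dual-branch sketch: one small convolutional branch per spectrogram
# component (real and imaginary), recombined and inverted back to a waveform.
class DualBranchEnhancer(nn.Module):
    def __init__(self, n_fft=512, hop=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.real_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
        self.imag_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, wav):                              # wav: (batch, samples)
        win = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=win, return_complex=True)
        real = self.real_branch(spec.real.unsqueeze(1)).squeeze(1)
        imag = self.imag_branch(spec.imag.unsqueeze(1)).squeeze(1)
        est = torch.complex(real, imag)                  # enhanced complex spectrum
        return torch.istft(est, self.n_fft, self.hop, window=win, length=wav.shape[-1])

enhanced = DualBranchEnhancer()(torch.randn(2, 16000))   # two 1 s clips at 16 kHz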

30 pages, 4582 KiB  
Article
Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features
by Zhongping Dong, Yan Xu, Andrew Abel and Dong Wang
Appl. Sci. 2024, 14(2), 798; https://doi.org/10.3390/app14020798 - 17 Jan 2024
Viewed by 811
Abstract
In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNNs) or Generative Adversarial Networks (GANs). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.
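
Purely as a hedged sketch of this kind of pipeline (not the authors' code), the flow below filters mouth crops with a small Gabor bank, pools each response to a low-dimensional vector, and maps the frame sequence to mel-spectrogram frames with an LSTM; the kernel parameters, pooling, and layer sizes are all assumptions.

import cv2
import numpy as np
import torch
import torch.nn as nn

def gabor_features(frame, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter one grayscale mouth crop with a tiny Gabor bank and pool to a vector."""
    feats = []
    for theta in thetas:
        # ksize, sigma, theta, lambd, gamma (illustrative values)
        kernel = cv2.getGaborKernel((15, 15), 3.0, theta, 8.0, 0.5)
        response = cv2.filter2D(frame.astype(np.float32), -1, kernel)
        feats.append([response.mean(), response.std()])   # crude 2-number summary
    return np.concatenate(feats)                           # 8-dim vector per frame

class VisualToSpectrogram(nn.Module):
    def __init__(self, feat_dim=8, n_mels=80):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.proj = nn.Linear(128, n_mels)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        return self.proj(h)             # predicted mel frames: (batch, frames, n_mels)

frames = [np.random.rand(48, 96) for _ in range(25)]        # one second of mouth crops
feats = torch.tensor(np.stack([gabor_features(f) for f in frames]),
                     dtype=torch.float32).unsqueeze(0)
mel = VisualToSpectrogram()(feats)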

16 pages, 1637 KiB  
Article
Exploring Multi-Stage GAN with Self-Attention for Speech Enhancement
by Bismark Kweku Asiedu Asante, Clifford Broni-Bediako and Hiroki Imamura
Appl. Sci. 2023, 13(16), 9217; https://doi.org/10.3390/app13169217 - 14 Aug 2023
Cited by 1 | Viewed by 1082
Abstract
Multi-stage or multi-generator generative adversarial networks (GANs) have recently been demonstrated to be effective for speech enhancement. The existing multi-generator GANs for speech enhancement use only convolutional layers for synthesising clean speech signals. This reliance on the convolution operation may mask the temporal dependencies within the signal sequence. This study explores self-attention to address the temporal dependency issue in multi-generator speech enhancement GANs and improve their enhancement performance. We empirically study the effect of integrating a self-attention mechanism into the convolutional layers of the multiple generators in multi-stage or multi-generator speech enhancement GANs, specifically the ISEGAN and DSEGAN networks. The experimental results show that introducing a self-attention mechanism into ISEGAN and DSEGAN leads to improvements in speech enhancement quality and intelligibility across the objective evaluation metrics. Furthermore, we observe that adding self-attention to the ISEGAN's generators not only improves its enhancement performance but also bridges the performance gap between the ISEGAN and the DSEGAN with a smaller model footprint. Overall, our findings highlight the potential of self-attention for improving the enhancement performance of multi-generator speech enhancement GANs.
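
For illustration only (this is not the ISEGAN/DSEGAN code), the idea of placing self-attention between a generator's convolutions, and chaining two such generators to imitate multi-stage refinement, might look roughly like the following; all widths, kernel sizes, and strides are assumptions.

import torch
import torch.nn as nn

class AttentiveGenerator(nn.Module):
    """Toy 1-D enhancement generator with self-attention between its convolutions."""
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.down = nn.Conv1d(1, channels, kernel_size=31, stride=8, padding=15)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.up = nn.ConvTranspose1d(channels, 1, kernel_size=32, stride=8, padding=12)

    def forward(self, noisy):                         # noisy: (batch, 1, samples)
        h = torch.relu(self.down(noisy))              # (batch, C, T/8)
        seq = h.transpose(1, 2)                       # (batch, T/8, C) for attention
        attended, _ = self.attn(seq, seq, seq)        # every frame attends to all others
        h = h + attended.transpose(1, 2)              # residual connection
        return torch.tanh(self.up(h))

# Chaining two generators imitates a two-stage (multi-generator) refinement.
g1, g2 = AttentiveGenerator(), AttentiveGenerator()
enhanced = g2(g1(torch.randn(2, 1, 16384)))           # ~1 s of 16 kHz audio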

12 pages, 1810 KiB  
Article
Amplitude and Phase Information Interaction for Speech Enhancement Method
by Qiuyu Yu and Ruohua Zhou
Appl. Sci. 2023, 13(14), 8025; https://doi.org/10.3390/app13148025 - 9 Jul 2023
Viewed by 816
Abstract
In order to improve the speech enhancement ability of the FullSubNet model, an improved method, FullSubNet-pMix, is proposed. Specifically, a pMix module is added to the full-band frequency-domain processing structure, which realizes information interaction between the amplitude spectrum and the phase spectrum. At the same time, the hyperparameters used in training are optimized so that the full-band and sub-band structures of the system can work together more effectively. Experiments are carried out on selected test sets. The results show that the proposed method independently improves the speech enhancement performance of the model, outperforming the original model on the four evaluation metrics of WB-PESQ, NB-PESQ, STOI, and SI-SDR. Therefore, the FullSubNet-pMix method proposed in this paper can effectively enhance the model's ability to extract and exploit speech information. The impact of different loss functions on training performance was also examined.
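
A hypothetical reading of such an amplitude-phase interaction, written as a stand-alone PyTorch module rather than the published pMix layer, is sketched below; the joint linear mixing and residual paths are assumptions.

import torch
import torch.nn as nn

class MagPhaseMix(nn.Module):
    """Let magnitude and phase streams exchange information through one learned mixing layer."""
    def __init__(self, n_freq=257):
        super().__init__()
        self.mix = nn.Linear(2 * n_freq, 2 * n_freq)   # joint transform of both streams

    def forward(self, magnitude, phase):               # each: (batch, time, n_freq)
        joint = torch.cat([magnitude, phase], dim=-1)
        mixed = torch.tanh(self.mix(joint))
        mag_out, phase_out = mixed.chunk(2, dim=-1)
        # Residual paths keep the original spectra recoverable.
        return magnitude + mag_out, phase + phase_out

spec = torch.stft(torch.randn(2, 16000), 512, 128,
                  window=torch.hann_window(512), return_complex=True)
mag, phase = spec.abs().transpose(1, 2), spec.angle().transpose(1, 2)
mag_hat, phase_hat = MagPhaseMix()(mag, phase)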

25 pages, 1378 KiB  
Article
Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths
by Jiajun Liu, Aishan Wumaier, Dongping Wei and Shen Guo
Appl. Sci. 2023, 13(13), 7579; https://doi.org/10.3390/app13137579 - 27 Jun 2023
Cited by 1 | Viewed by 2008
Abstract
Speech is critical for interpersonal communication, but not everyone has fluent communication skills. Speech disfluency, including stuttering and interruptions, affects not only emotional expression but also clarity of expression for people who stutter. Existing methods for detecting speech disfluency rely heavily on annotated data, which can be costly to obtain. Additionally, these methods have not considered variable-length disfluent speech, which limits their scalability. To address these limitations, this paper proposes an automated method for detecting speech disfluency that can help individuals improve their communication skills and assist therapists in tracking the progress of stuttering patients. The proposed method focuses on detecting four types of disfluency using single-task detection and utilizes embeddings from the pre-trained wav2vec2.0 model, along with convolutional neural network (CNN) and Transformer models, for feature extraction. The model's scalability is improved by accounting for variable-length disfluent speech and modifying the model based on the entropy invariance of attention mechanisms. This scalability across languages and utterance lengths enhances the method's practical applicability. The experiments demonstrate that the model outperforms baseline models on both English and Chinese datasets, demonstrating its universality and scalability in real-world applications.
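
A rough sketch of this kind of pipeline, with frozen wav2vec2.0 embeddings feeding a small convolutional head for a single disfluency class, is shown below; the public checkpoint name, head design, and mean-pooling over variable-length inputs are assumptions, and the entropy-invariant attention scaling described in the paper is not reproduced here.

import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

class DisfluencyHead(nn.Module):
    """Tiny classifier over wav2vec2.0 frame embeddings for one disfluency type."""
    def __init__(self, dim=768, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, 128, kernel_size=5, padding=2)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, hidden):                       # hidden: (batch, frames, dim)
        h = torch.relu(self.conv(hidden.transpose(1, 2)))
        return self.fc(h.mean(dim=-1))               # mean pooling copes with variable length

audio = [torch.randn(16000 * 3).numpy()]             # one 3 s clip at 16 kHz
inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = backbone(inputs.input_values).last_hidden_state
logits = DisfluencyHead()(hidden)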

16 pages, 3386 KiB  
Article
Multi-Level Attention-Based Categorical Emotion Recognition Using Modulation-Filtered Cochleagram
by Zhichao Peng, Wenhua He, Yongwei Li, Yegang Du and Jianwu Dang
Appl. Sci. 2023, 13(11), 6749; https://doi.org/10.3390/app13116749 - 1 Jun 2023
Cited by 1 | Viewed by 921
Abstract
Speech emotion recognition is a critical component of natural human–robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception that contains a multi-dimensional spectral–temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-level attention network to extract high-level emotional feature representations from the modulation-filtered cochleagram. Our approach uses channel-level and spatial-level attention modules to generate emotional saliency maps of the channel and spatial feature representations, capturing the emotionally salient channels and spatial regions of the 3D convolution feature maps, respectively. Furthermore, we employ a temporal-level attention module to capture significant emotional regions from the concatenated feature sequence of the emotional saliency maps. Our experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrate that the modulation-filtered cochleagram significantly improves categorical emotion prediction compared to the other evaluated features. Moreover, our emotion recognition framework achieves an unweighted accuracy of 71% in categorical emotion recognition, comparable to several existing approaches. In summary, our study demonstrates the effectiveness of the modulation-filtered cochleagram for speech emotion recognition, and our proposed multi-level attention framework provides a promising direction for future research in this field.
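
Purely to illustrate channel-level and spatial-level attention over 3D convolution feature maps (not the paper's exact modules), here is a compact sketch with assumed tensor shapes for a modulation-filtered cochleagram-like input.

import torch
import torch.nn as nn

class ChannelAttention3d(nn.Module):
    """Reweight channels using statistics pooled over all spatial/temporal axes."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (batch, C, mod, freq, time)
        w = self.fc(x.mean(dim=(2, 3, 4)))         # squeeze to one value per channel
        return x * w[:, :, None, None, None]

class SpatialAttention3d(nn.Module):
    """Highlight salient positions with a single 3D convolution over the channel mean."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size=7, padding=3)

    def forward(self, x):
        w = torch.sigmoid(self.conv(x.mean(dim=1, keepdim=True)))
        return x * w

# Assumed input: 8 modulation bands x 64 frequency channels x 100 frames.
feats = nn.Conv3d(1, 16, kernel_size=3, padding=1)(torch.randn(2, 1, 8, 64, 100))
feats = SpatialAttention3d()(ChannelAttention3d(16)(feats))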

10 pages, 471 KiB  
Article
Replay Speech Detection Based on Dual-Input Hierarchical Fusion Network
by Chenlei Hu, Ruohua Zhou and Qingsheng Yuan
Appl. Sci. 2023, 13(9), 5350; https://doi.org/10.3390/app13095350 - 25 Apr 2023
Viewed by 1297
Abstract
Speech anti-spoofing is a crucial aspect of speaker recognition systems and has received a great deal of attention in recent years. Deep neural networks have achieved satisfactory results on datasets whose training and testing data share similar distributions, but their generalization ability is limited on datasets with different distributions. In this paper, we propose a novel dual-input hierarchical fusion network (HFN) to improve the generalization ability of our model. The network has two inputs (the original speech signal and its time-reversed counterpart), which increases the volume and diversity of the training data. The hierarchical fusion model (HFM) enables a more thorough fusion of information from different input levels and improves model performance by fusing the two inputs after speech feature extraction. We evaluated the results on the ASVspoof 2021 PA (Physical Access) dataset, and the proposed system achieved an Equal Error Rate (EER) of 24.46% and a minimum tandem Detection Cost Function (min t-DCF) of 0.6708 on the test set. Compared with the four baseline systems in the ASVspoof 2021 competition, the proposed system's min t-DCF values were reduced by 28.9%, 31.0%, 32.6%, and 32.9%, and its EERs were reduced by 35.7%, 38.1%, 45.4%, and 49.7%, respectively.
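
The published HFN is not reproduced here, but the dual-input idea itself, feeding the waveform and its time-reversed copy through a feature extractor and fusing the two embeddings before classification, can be sketched as follows; the shared encoder and all layer sizes are assumptions.

import torch
import torch.nn as nn

class DualInputDetector(nn.Module):
    def __init__(self, emb=64):
        super().__init__()
        self.encoder = nn.Sequential(                    # shared front end (an assumption)
            nn.Conv1d(1, 32, 16, stride=8), nn.ReLU(),
            nn.Conv1d(32, emb, 16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(2 * emb, 2)          # bona fide vs. replay

    def forward(self, wav):                              # wav: (batch, samples)
        forward_in = wav.unsqueeze(1)
        reversed_in = torch.flip(forward_in, dims=[-1])  # the second, time-reversed input
        e1 = self.encoder(forward_in).squeeze(-1)
        e2 = self.encoder(reversed_in).squeeze(-1)
        return self.classifier(torch.cat([e1, e2], dim=-1))

scores = DualInputDetector()(torch.randn(4, 64000))      # four 4 s utterances at 16 kHz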

16 pages, 3905 KiB  
Article
Multi-Scale Feature Learning for Language Identification of Overlapped Speech
by Zuhragvl Aysa, Mijit Ablimit and Askar Hamdulla
Appl. Sci. 2023, 13(7), 4235; https://doi.org/10.3390/app13074235 - 27 Mar 2023
Cited by 4 | Viewed by 1073
Abstract
Language identification is the front end of multilingual speech-processing tasks. This study aims to enhance the accuracy of language identification in complex acoustic environments by proposing a multi-scale feature extraction method, which replaces the baseline feature extraction network with a multi-scale feature extraction network (SE-Res2Net-CBAM-BILSTM). A multilingual cocktail party dataset was simulated, and comparative experiments were conducted with various models. The experimental results show that the proposed model achieved language identification accuracies of 97.6% on an Oriental language dataset and 75% on the multilingual cocktail party dataset. Furthermore, comparative experiments show that our model outperformed three other models in accuracy, recall, and F1 score. Finally, a comparison of different loss functions shows that the model performed better when using focal loss.
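
As a simplified stand-in for the SE-Res2Net-CBAM-BILSTM front end (not the authors' network), the sketch below combines parallel convolutions at several scales with squeeze-excitation channel weighting and a bidirectional LSTM over time; all dimensions are assumptions.

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch=40, out_ch=64):
        super().__init__()
        self.paths = nn.ModuleList(
            [nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)])
        self.se = nn.Sequential(nn.Linear(3 * out_ch, out_ch), nn.ReLU(),
                                nn.Linear(out_ch, 3 * out_ch), nn.Sigmoid())

    def forward(self, x):                                 # x: (batch, feats, frames)
        h = torch.cat([torch.relu(p(x)) for p in self.paths], dim=1)
        w = self.se(h.mean(dim=-1))                       # squeeze-excitation reweighting
        return h * w.unsqueeze(-1)

class LanguageID(nn.Module):
    def __init__(self, n_langs=10):
        super().__init__()
        self.block = MultiScaleBlock()
        self.bilstm = nn.LSTM(192, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, n_langs)

    def forward(self, fbank):                             # fbank: (batch, 40, frames)
        h, _ = self.bilstm(self.block(fbank).transpose(1, 2))
        return self.fc(h.mean(dim=1))

logits = LanguageID()(torch.randn(2, 40, 300))            # ~3 s of 40-dim filterbanks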

14 pages, 2654 KiB  
Article
Multi-Scale Channel Adaptive Time-Delay Neural Network and Balanced Fine-Tuning for Arabic Dialect Identification
by Qibao Luo and Ruohua Zhou
Appl. Sci. 2023, 13(7), 4233; https://doi.org/10.3390/app13074233 - 27 Mar 2023
Viewed by 1354
Abstract
The time-delay neural network (TDNN) can consider multiple frames of information simultaneously, making it particularly suitable for dialect identification. However, previous TDNN architectures have focused on only one aspect, either temporal or channel information, lacking a unified optimization of both domains. We believe that extracting appropriate contextual information and enhancing channels are critical for dialect identification. Therefore, in this paper, we propose a novel approach that uses the ECAPA-TDNN from the speaker recognition domain as the backbone network and introduce a new multi-scale channel adaptive module (MSCA-Res2Block) to construct a multi-scale channel adaptive time-delay neural network (MSCA-TDNN). The MSCA-Res2Block is capable of extracting multi-scale features, further enlarging the receptive field of the convolutional operations. We evaluated the proposed method on the ADI17 Arabic dialect dataset, employing a balanced fine-tuning strategy to address the imbalance of the dialect dataset and Z-score normalization to eliminate score distribution differences among dialects. In experimental validation, our system achieved an average cost performance (Cavg) of 4.19% and a 94.28% accuracy rate. Compared to ECAPA-TDNN, our model showed a 22% relative improvement in Cavg. Furthermore, our model outperformed the state-of-the-art single-network model reported in the ADI17 competition, and compared with the best-performing multi-network hybrid system in the competition, our Cavg also compared favorably.
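
Two of the simpler ingredients mentioned above, per-dialect Z-score normalization of classifier scores and class-balanced sampling for fine-tuning on an imbalanced dialect set, can be sketched as follows; this is one plausible reading of those steps, not the authors' implementation.

import torch
from torch.utils.data import WeightedRandomSampler

def z_norm_scores(scores):
    """scores: (n_utts, n_dialects) raw scores; normalize each dialect column."""
    mean = scores.mean(dim=0, keepdim=True)
    std = scores.std(dim=0, keepdim=True).clamp_min(1e-6)
    return (scores - mean) / std

def balanced_sampler(labels):
    """Draw each dialect with probability inversely proportional to its frequency."""
    counts = torch.bincount(labels)
    weights = 1.0 / counts[labels].float()
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

normed = z_norm_scores(torch.randn(1000, 17))     # ADI17 covers 17 dialect classes
sampler = balanced_sampler(torch.randint(0, 17, (1000,)))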
