Audio, Speech and Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (29 February 2024) | Viewed by 36575

Special Issue Editors


Prof. Dr. Kai Yu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Interests: speech recognition and synthesis; spoken dialog system

Dr. Yan Song
National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China
Interests: language recognition; audio analysis and retrieval

Dr. Ya Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Interests: speech synthesis; affective computing; dialog system

Special Issue Information

Dear Colleagues,

The 17th National Conference on Man–Machine Speech Communication (NCMMSC), the largest and most influential event on speech signal processing in China, will be hosted by the Chinese Information Processing Society of China and the China Computer Federation, and co-organized by iFLYTEK Co., Ltd., the University of Science and Technology of China, and the National Engineering Research Center for Speech and Language Information Processing. NCMMSC also serves as the annual academic meeting of the Technical Committee on Speech Dialogue and Auditory Processing of the China Computer Federation (CCF TFSDAP). As an important stage for experts, scholars, and researchers in this field to share their ideas, research results, and experiences, NCMMSC will strongly promote continued progress and development in this field.

Papers published in the Special Issue on “National Conference on Man–Machine Speech Communication (NCMMSC2022)” will focus on the topics of speech recognition, synthesis, enhancement, and coding, as well as experimental phonetics, speech prosody analysis, pathological speech analysis, speech analysis, acoustic scene classification, and human–computer dialogue understanding.

Details about NCMMSC2022 can be found at:

http://ncmmsc2022.ustc.edu.cn

Prof. Dr. Kai Yu
Dr. Yan Song
Dr. Ya Li
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech and speaker recognition
  • speech synthesis and voice conversion
  • speech coding and enhancement
  • language recognition
  • speech emotion recognition
  • acoustic scene classification
  • voice detection and speech separation
  • phonetics and phonology
  • language model

Published Papers (22 papers)


Research

11 pages, 894 KiB  
Article
Improved Convolutional Neural Network–Time-Delay Neural Network Structure with Repeated Feature Fusions for Speaker Verification
by Miaomiao Gao and Xiaojuan Zhang
Appl. Sci. 2024, 14(8), 3471; https://doi.org/10.3390/app14083471 - 19 Apr 2024
Viewed by 265
Abstract
The development of deep learning has greatly promoted progress in speaker verification (SV). Studies show that both convolutional neural networks (CNNs) and dilated time-delay neural networks (TDNNs) achieve advanced performance in text-independent SV, due to their ability to sufficiently extract local features and temporal contextual information, respectively, and the combination of the two has achieved even better results. However, we found a serious gridding effect when applying the 1D-Res2Net-based dilated TDNN proposed in ECAPA-TDNN to SV, which indicates discontinuity and loss of local information in frame-level features. To achieve high-resolution processing for speaker embeddings, we improve the CNN–TDNN structure with the proposed repeated multi-scale feature fusions. Through the proposed structure, we can effectively improve the channel utilization of the TDNN and achieve higher performance with the same number of TDNN channels. Unlike previous studies, which all convert CNN features to TDNN features directly, we also study the latent space transformation between CNN and TDNN to achieve efficient conversion. Our best method obtains 0.72 EER and 0.0672 MinDCF on the VoxCeleb-O test set, and the proposed method performs better in cross-domain SV without additional parameters or computational complexity. Full article
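
For context, a minimal PyTorch sketch of the kind of dilated TDNN layer (kernel size 3 with dilation, as in ECAPA-TDNN-style blocks) in which the gridding effect described above can arise; the channel sizes and dilation are illustrative assumptions, and the paper's repeated feature-fusion structure is not reproduced here.

```python
# Minimal sketch (not the paper's model): a dilated 1D convolution of the kind
# used in ECAPA-TDNN-style TDNN blocks. Channel sizes and dilation are
# illustrative assumptions.
import torch
import torch.nn as nn

class DilatedTDNNBlock(nn.Module):
    def __init__(self, channels: int = 512, dilation: int = 3):
        super().__init__()
        # A kernel of size 3 with dilation d covers frames {t-d, t, t+d};
        # stacking layers with the same dilation repeatedly skips the frames
        # in between, which is the "gridding" effect mentioned above.
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.norm(self.act(self.conv(x)))

frames = torch.randn(2, 512, 200)    # dummy frame-level features
out = DilatedTDNNBlock()(frames)     # same length as the input
print(out.shape)                     # torch.Size([2, 512, 200])
```
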
(This article belongs to the Special Issue Audio, Speech and Language Processing)

18 pages, 3325 KiB  
Article
Enhancing Insect Sound Classification Using Dual-Tower Network: A Fusion of Temporal and Spectral Feature Perception
by Hangfei He, Junyang Chen, Hongkun Chen, Borui Zeng, Yutong Huang, Yudan Zhaopeng and Xiaoyan Chen
Appl. Sci. 2024, 14(7), 3116; https://doi.org/10.3390/app14073116 - 08 Apr 2024
Viewed by 503
Abstract
In the modern field of biological pest control, especially in the realm of insect population monitoring, deep learning methods have made further advancements. However, due to the small size and elusive nature of insects, visual detection is often impractical. In this context, the recognition of insect sound features becomes crucial. In our study, we introduce a classification module called the “dual-frequency and spectral fusion module (DFSM)”, which enhances the performance of transfer learning models in audio classification tasks. Our approach combines the efficiency of EfficientNet with the hierarchical design of the Dual Towers, drawing inspiration from the way the insect neural system processes sound signals. This enables our model to effectively capture spectral features in insect sounds and form multiscale perceptions through inter-tower skip connections. Through detailed qualitative and quantitative evaluations, as well as comparisons with leading traditional insect sound recognition methods, we demonstrate the advantages of our approach in the field of insect sound classification. Our method achieves an accuracy of 80.26% on InsectSet32, surpassing existing state-of-the-art models by 3 percentage points. Additionally, we conducted generalization experiments using three classic audio datasets. The results indicate that DFSM exhibits strong robustness and wide applicability, with minimal performance variations even when handling different input features. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

9 pages, 366 KiB  
Article
Speech Audiometry: The Development of Lithuanian Bisyllabic Phonemically Balanced Word Lists for Evaluation of Speech Recognition
by Vija Vainutienė, Justinas Ivaška, Vytautas Kardelis, Tatjana Ivaškienė and Eugenijus Lesinskas
Appl. Sci. 2024, 14(7), 2897; https://doi.org/10.3390/app14072897 - 29 Mar 2024
Viewed by 419
Abstract
Background and Objectives: Speech audiometry employs standardized materials, typically in the language spoken by the target population. Language-specific nuances, including phonological features, influence speech perception and recognition. The material of speech audiometry tests for the assessment of word recognition comprises lists of words that are phonemically or phonetically balanced. As auditory perception is influenced by a variety of linguistic features, it is necessary to develop test materials for the listener’s mother tongue. The objective of our study was to compose and evaluate new lists of Lithuanian words to assess speech recognition abilities. Materials and Methods: The main criteria for composing new lists of Lithuanian words included the syllable structure and frequency, the correlation between consonant and vowel phonemes, the frequency of specific vowel and consonant phonemes, word familiarity and rate. The words for the new lists were chosen from the Frequency Dictionary of Written Lithuanian according to the above criteria. Word recognition was assessed at different levels of presentations. The word list data were analyzed using a linear mixed-effect model for repeated measures. Results: Two hundred bisyllabic words were selected and organized into four lists. The results showed no statistically significant difference between the four sets of words. The interaction of the word list and presentation level was not statistically significant. Conclusions: Monaural performance functions indicated good inter-list reliability with no significant differences between the word recognition scores on the different bisyllabic word lists at each of the tested intensities. The word lists developed are equivalent, reliable and can be valuable for assessing speech recognition in a variety of conditions, including diagnosis, hearing rehabilitation and research. Full article
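
As an illustration of the analysis described, a hypothetical sketch of a repeated-measures linear mixed-effects model in Python using statsmodels; the column names, factor levels, and synthetic data are assumptions, not the study's material.

```python
# Hypothetical sketch of a repeated-measures analysis like the one described:
# word-recognition scores modeled with a linear mixed-effects model, with
# list, presentation level, and their interaction as fixed effects and a
# random intercept per listener. Column names and the synthetic data are
# illustrative assumptions, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
listeners = np.repeat(np.arange(20), 4 * 3)            # 20 listeners
lists = np.tile(np.repeat(["L1", "L2", "L3", "L4"], 3), 20)
levels = np.tile([20, 35, 50], 4 * 20)                 # dB presentation levels
scores = 50 + 0.8 * levels + rng.normal(0, 5, size=listeners.size)

df = pd.DataFrame({"listener": listeners, "word_list": lists,
                   "level": levels, "score": scores})

model = smf.mixedlm("score ~ C(word_list) * level", df, groups=df["listener"])
result = model.fit()
print(result.summary())   # list effect and list x level interaction terms
```
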
(This article belongs to the Special Issue Audio, Speech and Language Processing)

17 pages, 1141 KiB  
Article
Improving the Robustness of DTW to Global Time Warping Conditions in Audio Synchronization
by Jittisa Kraprayoon, Austin Pham and Timothy J. Tsai
Appl. Sci. 2024, 14(4), 1459; https://doi.org/10.3390/app14041459 - 10 Feb 2024
Viewed by 466
Abstract
Dynamic time warping estimates the alignment between two sequences and is designed to handle a variable amount of time warping. In many contexts, it performs poorly when confronted with two sequences of different scale, in which the average slope of the true alignment path in the pairwise cost matrix deviates significantly from one. This paper investigates ways to improve the robustness of DTW to such global time warping conditions, using an audio–audio alignment task as a motivating scenario of interest. We modify a dataset commonly used for studying audio–audio synchronization in order to construct a benchmark in which the global time warping conditions are carefully controlled, and we evaluate the effectiveness of several strategies designed to handle global time warping. Among the strategies tested, there is a clear winner: performing sequence length normalization via downsampling before invoking DTW. This method achieves the best alignment accuracy across a wide range of global time warping conditions, and it maintains or reduces the runtime compared to standard usages of DTW. We present experiments and analyses to demonstrate its effectiveness in both controlled and realistic scenarios. Full article
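
A minimal NumPy sketch of the winning strategy, equalizing sequence lengths by downsampling before running standard DTW; the DTW routine below is a plain textbook implementation and the feature dimensions are illustrative.

```python
# Minimal NumPy sketch of the strategy the paper finds most robust:
# equalize sequence lengths by downsampling before running plain DTW.
# The DTW below is a textbook O(N*M) implementation, not the paper's code.
import numpy as np

def dtw_cost(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (N, d), Y: (M, d) feature sequences; returns accumulated cost."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N, M]

def downsample(X: np.ndarray, target_len: int) -> np.ndarray:
    """Keep target_len frames, evenly spaced (simple length normalization)."""
    idx = np.linspace(0, len(X) - 1, target_len).round().astype(int)
    return X[idx]

X = np.random.randn(1000, 12)   # e.g. features of a slow rendition
Y = np.random.randn(400, 12)    # a much faster rendition of the same piece
X_ds = downsample(X, len(Y))    # equalize lengths before alignment
print(dtw_cost(X_ds, Y))
```
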
(This article belongs to the Special Issue Audio, Speech and Language Processing)

18 pages, 4741 KiB  
Article
Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet
by Qing-Dao-Er-Ji Ren, Lele Wang, Wenjing Zhang and Leixiao Li
Appl. Sci. 2024, 14(2), 625; https://doi.org/10.3390/app14020625 - 11 Jan 2024
Viewed by 468
Abstract
The core challenge of speech synthesis technology is how to convert text information into an audible audio form to meet the needs of users. In recent years, the quality of speech synthesis based on end-to-end speech synthesis models has been significantly improved. However, due to the characteristics of the Mongolian language and the lack of an audio corpus, the Mongolian speech synthesis model has achieved few results, and there are still some problems with the performance and synthesis quality. First, the phoneme information of Mongolian was further improved and a Bang-based pre-training model was constructed to reduce the error rate of Mongolian phonetic synthesized words. Second, a Mongolian speech synthesis model based on Ghost and ILPCnet was proposed, named the Ghost-ILPCnet model, which was improved based on the Para-WaveNet acoustic model, replacing ordinary convolution blocks with stacked Ghost modules to generate Mongolian acoustic features in parallel and improve the speed of speech generation. At the same time, the improved vocoder ILPCnet had a high synthesis quality and low complexity compared to other vocoders. Finally, a large number of data experiments were conducted on the proposed model to verify its effectiveness. The experimental results show that the Ghost-ILPCnet model has a simple structure, fewer model generation parameters, fewer hardware requirements, and can be trained in parallel. The average subjective opinion score of its synthesized speech reached 4.48 and the real-time rate reached 0.0041. It ensures the naturalness and clarity of synthesized speech, speeds up the synthesis speed, and effectively improves the performance of the Mongolian speech synthesis model. Full article
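
For illustration, a hedged 1D adaptation of a GhostNet-style Ghost module (a primary convolution plus cheap depthwise convolutions, concatenated); the kernel sizes, channel ratio, and use of 1D convolutions are assumptions, and the exact Ghost-ILPCnet configuration is not reproduced.

```python
# Illustrative 1D adaptation of a GhostNet-style "Ghost module": a primary
# convolution generates part of the channels and cheap depthwise convolutions
# generate the remaining "ghost" channels. Kernel sizes and the ratio are
# assumptions; the exact Ghost-ILPCnet configuration is not reproduced here.
import torch
import torch.nn as nn

class GhostModule1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, kernel: int = 3):
        super().__init__()
        primary_ch = out_ch // ratio
        cheap_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv1d(in_ch, primary_ch, kernel, padding=kernel // 2),
            nn.BatchNorm1d(primary_ch), nn.ReLU())
        # Depthwise convolution: one cheap filter per primary channel.
        self.cheap = nn.Sequential(
            nn.Conv1d(primary_ch, cheap_ch, kernel, padding=kernel // 2,
                      groups=primary_ch),
            nn.BatchNorm1d(cheap_ch), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)

x = torch.randn(4, 80, 200)              # (batch, mel channels, frames)
print(GhostModule1d(80, 256)(x).shape)   # torch.Size([4, 256, 200])
```
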
(This article belongs to the Special Issue Audio, Speech and Language Processing)

22 pages, 814 KiB  
Article
Neural Network-Based Approach to Detect and Filter Misleading Audio Segments in Classroom Automatic Transcription
by Jorge Hewstone and Roberto Araya
Appl. Sci. 2023, 13(24), 13243; https://doi.org/10.3390/app132413243 - 14 Dec 2023
Viewed by 1106
Abstract
Audio recording in classrooms is a common practice in educational research, with applications ranging from detecting classroom activities to analyzing student behavior. Previous research has employed neural networks for classroom activity detection and speaker role identification. However, these recordings are often affected by background noise that can hinder further analysis, and the literature has only sought to identify noise with general filters and not specifically designed for classrooms. Although the use of high-end microphones and environmental monitoring can mitigate this problem, these solutions can be costly and potentially disruptive to the natural classroom environment. In this context, we propose the development of a novel neural network model that specifically detects and filters out problematic audio sections in classroom recordings. This model is particularly effective in reducing transcription errors, achieving up to a 96% success rate in filtering out segments that could lead to incorrect automated transcriptions. The novelty of our work lies in its targeted approach for low-budget, aurally complex environments like classrooms, where multiple speakers are present. By allowing the use of lower-quality recordings without compromising analysis capability, our model facilitates data collection in natural educational settings and reduces the dependency on expensive recording equipment. This advancement not only demonstrates the practical application of specialized neural network filters in challenging acoustic environments but also opens new avenues for enhancing audio analysis in educational research and beyond. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

11 pages, 1736 KiB  
Article
Benefits of Auditory Training with an Open-Set Sentences-in-Babble-Noise
by Ayelet Barda, Yair Shapira and Leah Fostick
Appl. Sci. 2023, 13(16), 9126; https://doi.org/10.3390/app13169126 - 10 Aug 2023
Cited by 1 | Viewed by 891
Abstract
Auditory training (AT) has limited generalization to non-trained stimuli. Therefore, in the current study, we tested the effect of stimuli similar to those used in daily life: sentences in background noise. The sample consisted of 15 Hebrew-speaking adults aged 61–88 years with bilateral hearing impairment who engaged in computerized auditory training at home four times per week over a two-month period. Significant improvements were observed in sentence comprehension (Hebrew AzBio (HeBio) sentences test) with both four-talker-babble-noise (4TBN) and speech-shaped-noise (SSN) and in word comprehension (consonant-vowel-consonant (CVC) words test), following one month of AT. These improvements were sustained for two months after completing the AT. No evidence of spontaneous learning was observed in the month preceding training, nor was there an additional training effect in the additional month. Participants’ baseline speech perception abilities predicted their post-training speech perception improvements in the generalization tasks. The findings suggest that top-down generalization occurs from sentences to words and from babble noise to SSN and quiet conditions. Consequently, synthetic training tasks focusing on sentence-level comprehension accompanied by multi-talker babble noise should be prioritized. Moreover, an individualized approach to AT has demonstrated effectiveness and should be considered in both clinical and research settings. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

17 pages, 450 KiB  
Article
Cascade Speech Translation for the Kazakh Language
by Zhanibek Kozhirbayev and Talgat Islamgozhayev
Appl. Sci. 2023, 13(15), 8900; https://doi.org/10.3390/app13158900 - 02 Aug 2023
Viewed by 1226
Abstract
Speech translation systems have become indispensable in facilitating seamless communication across language barriers. This paper presents a cascade speech translation system tailored specifically for translating speech from the Kazakh language to Russian. The system aims to enable effective cross-lingual communication between Kazakh and Russian speakers, addressing the unique challenges posed by these languages. To develop the cascade speech translation system, we first created a dedicated speech translation dataset ST-kk-ru based on the ISSAI Corpus. The ST-kk-ru dataset comprises a large collection of Kazakh speech recordings along with their corresponding Russian translations. The automatic speech recognition (ASR) module of the system utilizes deep learning techniques to convert spoken Kazakh input into text. The machine translation (MT) module employs state-of-the-art neural machine translation methods, leveraging the parallel Kazakh-Russian translations available in the dataset to generate accurate translations. By conducting extensive experiments and evaluations, we have thoroughly assessed the performance of the cascade speech translation system on the ST-kk-ru dataset. The outcomes of our evaluation highlight the effectiveness of incorporating additional datasets for both the ASR and MT modules. This augmentation leads to a significant improvement in the performance of the cascade speech translation system, increasing the BLEU score by approximately 2 points when translating from Kazakh to Russian. These findings underscore the importance of leveraging supplementary data to enhance the capabilities of speech translation systems. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

22 pages, 666 KiB  
Article
Listeners’ Spectral Reallocation Preferences for Speech in Noise
by Olympia Simantiraki and Martin Cooke
Appl. Sci. 2023, 13(15), 8734; https://doi.org/10.3390/app13158734 - 28 Jul 2023
Viewed by 608
Abstract
Modifying the spectrum of recorded or synthetic speech is an effective strategy for boosting intelligibility in noise without increasing the speech level. However, the wider impact of changes to the spectral energy distribution of speech is poorly understood. The present study explored the influence of spectral modifications using an experimental paradigm in which listeners were able to adjust speech parameters directly with real-time audio feedback, allowing the joint elicitation of preferences and word recognition scores. In two experiments involving full-bandwidth and bandwidth-limited speech, respectively, listeners adjusted one of eight features that altered the speech spectrum, and then immediately carried out a sentence-in-noise recognition task at the chosen setting. Listeners’ preferred adjustments in most conditions involved the transfer of speech energy from the sub-1 kHz region to the 1–4 kHz range. Preferences were not random, even when intelligibility was at the ceiling or constant across a range of adjustment values, suggesting that listener choices encompass more than a desire to maintain comprehensibility. Full article
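
As a rough illustration of the preferred adjustment, a sketch that attenuates energy below 1 kHz, boosts the 1–4 kHz band, and rescales to keep the overall level constant; the gain values and FFT-based implementation are assumptions, not the study's method.

```python
# Illustrative sketch (not the study's method) of the kind of adjustment the
# listeners preferred: attenuate energy below 1 kHz, boost the 1-4 kHz band,
# then rescale so the overall RMS level of the signal is unchanged.
import numpy as np

def reallocate_spectrum(x: np.ndarray, sr: int, low_gain_db: float = -6.0,
                        mid_gain_db: float = +6.0) -> np.ndarray:
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    gains = np.ones_like(freqs)
    gains[freqs < 1000] = 10 ** (low_gain_db / 20)                      # cut sub-1 kHz
    gains[(freqs >= 1000) & (freqs <= 4000)] = 10 ** (mid_gain_db / 20)  # boost 1-4 kHz
    y = np.fft.irfft(X * gains, n=len(x))
    # Keep the overall speech level constant (no net level increase).
    y *= np.sqrt(np.mean(x ** 2) / (np.mean(y ** 2) + 1e-12))
    return y

sr = 16000
x = np.random.randn(sr)          # stand-in for one second of speech
y = reallocate_spectrum(x, sr)
print(np.sqrt(np.mean(x ** 2)), np.sqrt(np.mean(y ** 2)))  # nearly equal RMS
```
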
(This article belongs to the Special Issue Audio, Speech and Language Processing)

12 pages, 2813 KiB  
Article
Contributions of Temporal Modulation Cues in Temporal Amplitude Envelope of Speech to Urgency Perception
by Masashi Unoki, Miho Kawamura, Maori Kobayashi, Shunsuke Kidani, Junfeng Li and Masato Akagi
Appl. Sci. 2023, 13(10), 6239; https://doi.org/10.3390/app13106239 - 19 May 2023
Viewed by 862
Abstract
We previously investigated the perception of noise-vocoded speech to determine whether the temporal amplitude envelope (TAE) of speech plays an important role in the perception of linguistic information as well as non-linguistic information. However, it remains unclear if these TAEs also play a role in the urgency perception of non-linguistic information. In this paper, we comprehensively investigated whether the TAE of speech contributes to urgency perception. To this end, we compared noise-vocoded stimuli containing TAEs identical to those of original speech with those containing TAEs controlled by low-pass or high-pass filtering. We derived degrees of urgency from a paired comparison of the results and then used them as a basis to clarify the relationship between the temporal modulation components in TAEs of speech and urgency perception. Our findings revealed that (1) the perceived degrees of urgency of noise-vocoded stimuli are similar to those of the original, (2) significant cues for urgency perception are temporal modulation components of the noise-vocoded stimuli higher than the modulation frequency of 6 Hz, (3) additional significant cues for urgency perception are temporal modulation components of the noise-vocoded stimuli lower than the modulation frequency of 8 Hz, and (4) the TAE of the time-reversal speech is not likely to contain important cues for the perception of urgency. We therefore conclude that temporal modulation cues in the TAE of speech are a significant component in the perception of urgency. Full article
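
A sketch of the envelope manipulation described, assuming a Hilbert-transform TAE and Butterworth modulation filters at the 6–8 Hz cutoffs discussed; this is not the noise vocoder used in the study.

```python
# Sketch of the kind of manipulation described: extract the temporal amplitude
# envelope (TAE) with the Hilbert transform, then low-pass or high-pass filter
# its modulation content around the 6-8 Hz region discussed in the paper.
# This is only the envelope step, not the study's noise vocoder.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def temporal_amplitude_envelope(x: np.ndarray) -> np.ndarray:
    return np.abs(hilbert(x))

def filter_modulations(env: np.ndarray, sr: int, cutoff_hz: float,
                       kind: str = "lowpass") -> np.ndarray:
    sos = butter(4, cutoff_hz, btype=kind, fs=sr, output="sos")
    return sosfiltfilt(sos, env)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # toy AM signal
env = temporal_amplitude_envelope(x)
env_lp = filter_modulations(env, sr, cutoff_hz=6.0, kind="lowpass")    # keep slow modulations
env_hp = filter_modulations(env, sr, cutoff_hz=8.0, kind="highpass")   # keep fast modulations
print(env_lp.shape, env_hp.shape)
```
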
(This article belongs to the Special Issue Audio, Speech and Language Processing)

13 pages, 1367 KiB  
Article
JSUM: A Multitask Learning Speech Recognition Model for Jointly Supervised and Unsupervised Learning
by Nurmemet Yolwas and Weijing Meng
Appl. Sci. 2023, 13(9), 5239; https://doi.org/10.3390/app13095239 - 22 Apr 2023
Cited by 1 | Viewed by 1124
Abstract
In recent years, the end-to-end speech recognition model has emerged as a popular alternative to the traditional Deep Neural Network–Hidden Markov Model (DNN-HMM). This approach maps acoustic features directly onto text sequences via a single network architecture, significantly streamlining the model construction process. However, the training of end-to-end speech recognition models typically necessitates a significant quantity of supervised data to achieve good performance, which poses a challenge in low-resource conditions. The use of unsupervised representations significantly reduces this necessity. Recent research has focused on end-to-end techniques employing joint Connectionist Temporal Classification (CTC) and attention mechanisms, with some also concentrating on unsupervised representation learning. This paper proposes a joint supervised and unsupervised multi-task learning model (JSUM). Our approach leverages the unsupervised pre-trained wav2vec 2.0 model as a shared encoder that integrates the joint CTC-Attention network and the generative adversarial network into a unified end-to-end architecture. Our method provides a new low-resource language speech recognition solution that optimally utilizes supervised and unsupervised datasets by combining CTC, attention, and generative adversarial losses. Furthermore, our proposed approach is suitable for both monolingual and cross-lingual scenarios. Full article
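
A sketch of the standard joint CTC–attention objective that the model builds on; the interpolation weight and tensor shapes are assumptions, and the wav2vec 2.0 encoder and adversarial loss added in the paper are not reproduced.

```python
# Sketch of the standard joint CTC-attention objective: an interpolation of a
# CTC loss and an attention-decoder cross-entropy loss. The paper further adds
# a generative-adversarial loss on top of a wav2vec 2.0 encoder, which this
# sketch does not reproduce. The weight lam and the shapes are illustrative.
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss(ignore_index=-100)

def joint_loss(log_probs, enc_lens, targets, target_lens,
               decoder_logits, decoder_targets, lam: float = 0.3):
    """log_probs: (T, batch, vocab) log-softmax outputs for CTC;
    decoder_logits: (batch, L, vocab) attention-decoder outputs."""
    l_ctc = ctc_loss(log_probs, targets, enc_lens, target_lens)
    l_att = ce_loss(decoder_logits.transpose(1, 2), decoder_targets)
    return lam * l_ctc + (1.0 - lam) * l_att

T, B, V, L = 50, 2, 30, 10
log_probs = torch.randn(T, B, V).log_softmax(-1)
targets = torch.randint(1, V, (B, L))
loss = joint_loss(log_probs, torch.full((B,), T), targets,
                  torch.full((B,), L), torch.randn(B, L, V), targets)
print(loss.item())
```
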
(This article belongs to the Special Issue Audio, Speech and Language Processing)

14 pages, 363 KiB  
Article
Two-Stage Single-Channel Speech Enhancement with Multi-Frame Filtering
by Shaoxiong Lin, Wangyou Zhang and Yanmin Qian
Appl. Sci. 2023, 13(8), 4926; https://doi.org/10.3390/app13084926 - 14 Apr 2023
Viewed by 1718
Abstract
Speech enhancement has been extensively studied and applied in the fields of automatic speech recognition (ASR), speaker recognition, etc. With the advances of deep learning, attempts to apply Deep Neural Networks (DNN) to speech enhancement have achieved remarkable results and the quality of enhanced speech has been greatly improved. In this study, we propose a two-stage model for single-channel speech enhancement. The model has two DNNs with the same architecture. In the first stage, only the first DNN is trained. In the second stage, the second DNN is trained to refine the enhanced output from the first DNN, while the first DNN is frozen. A multi-frame filter is introduced to help the second DNN reduce the distortion of the enhanced speech. Experimental results on both synthetic and real datasets show that the proposed model outperforms other enhancement models not only in terms of speech enhancement evaluation metrics and word error rate (WER), but also in its superior generalization ability. The results of the ablation experiments also demonstrate that combining the two-stage model with the multi-frame filter yields better enhancement performance and less distortion. Full article
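
A skeleton of the two-stage training scheme, assuming simple placeholder networks: stage one trains the first DNN, and stage two freezes it and trains the second DNN to refine its output; the actual architectures and the multi-frame filter are not reproduced.

```python
# Skeleton of the two-stage training scheme described (architectures, data,
# and the multi-frame filter itself are placeholders, not the paper's code):
# stage 1 trains the first network; stage 2 freezes it and trains a second
# network of the same architecture to refine the stage-1 output.
import torch
import torch.nn as nn

def make_dnn(dim: int = 257) -> nn.Module:        # e.g. STFT magnitude bins
    return nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

dnn1, dnn2 = make_dnn(), make_dnn()
loss_fn = nn.MSELoss()
noisy, clean = torch.randn(8, 257), torch.randn(8, 257)   # dummy batch

# Stage 1: train the first enhancement network only.
opt1 = torch.optim.Adam(dnn1.parameters(), lr=1e-3)
opt1.zero_grad(); loss_fn(dnn1(noisy), clean).backward(); opt1.step()

# Stage 2: freeze DNN-1, train DNN-2 to refine its output.
for p in dnn1.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(dnn2.parameters(), lr=1e-3)
with torch.no_grad():
    stage1_out = dnn1(noisy)
opt2.zero_grad(); loss_fn(dnn2(stage1_out), clean).backward(); opt2.step()
```
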
(This article belongs to the Special Issue Audio, Speech and Language Processing)

15 pages, 8433 KiB  
Article
Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network
by Ala Saleh Alluhaidan, Oumaima Saidani, Rashid Jahangir, Muhammad Asif Nauman and Omnia Saidani Neffati
Appl. Sci. 2023, 13(8), 4750; https://doi.org/10.3390/app13084750 - 10 Apr 2023
Cited by 10 | Viewed by 10745
Abstract
Speech emotion recognition (SER) is the process of predicting human emotions from audio signals using artificial intelligence (AI) techniques. SER technologies have a wide range of applications in areas such as psychology, medicine, education, and entertainment. Extracting relevant features from audio signals is a crucial task in the SER process to correctly identify emotions. Several studies on SER have employed short-time features such as Mel frequency cepstral coefficients (MFCCs), due to their efficiency in capturing the periodic nature of audio signals. However, these features are limited in their ability to correctly identify emotion representations. To solve this issue, this research combined MFCCs and time-domain features (MFCCT) to enhance the performance of SER systems. The proposed hybrid features were given to a convolutional neural network (CNN) to build the SER model. The hybrid MFCCT features together with the CNN outperformed both MFCCs and time-domain (t-domain) features on the Emo-DB, SAVEE, and RAVDESS datasets by achieving accuracies of 97%, 93%, and 92%, respectively. Additionally, the CNN achieved better performance than the machine learning (ML) classifiers that were recently used in SER. The proposed features have the potential to be widely applied to several types of SER datasets for identifying emotions. Full article
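
To illustrate the hybrid-feature idea, a librosa-based sketch that stacks MFCCs with simple time-domain features (zero-crossing rate and RMS energy) per frame; the exact composition of the paper's MFCCT features may differ, and the file path below is hypothetical.

```python
# Illustrative feature-extraction sketch: MFCCs combined with simple
# time-domain features (zero-crossing rate and RMS energy) per frame, stacked
# into one matrix that a CNN could consume. The exact composition of the
# paper's MFCCT features may differ; this only shows the hybrid idea.
import numpy as np
import librosa

def hybrid_features(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    zcr = librosa.feature.zero_crossing_rate(y)              # (1, frames)
    rms = librosa.feature.rms(y=y)                           # (1, frames)
    frames = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
    return np.vstack([mfcc[:, :frames], zcr[:, :frames], rms[:, :frames]])

# feats = hybrid_features("some_utterance.wav")   # hypothetical file; shape: (15, frames)
# feats[None, None] could then be fed to a 2D CNN as a single-channel image.
```
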
(This article belongs to the Special Issue Audio, Speech and Language Processing)

14 pages, 1653 KiB  
Article
Grammar-Supervised End-to-End Speech Recognition with Part-of-Speech Tagging and Dependency Parsing
by Genshun Wan, Tingzhi Mao, Jingxuan Zhang, Hang Chen, Jianqing Gao and Zhongfu Ye
Appl. Sci. 2023, 13(7), 4243; https://doi.org/10.3390/app13074243 - 27 Mar 2023
Cited by 2 | Viewed by 1462
Abstract
For most automatic speech recognition systems, many unacceptable hypothesis errors still make the recognition results absurd and difficult to understand. In this paper, we introduce the grammar information to improve the performance of the grammatical deviation distance and increase the readability of the hypothesis. The reinforcement of word embedding with grammar embedding is presented to intensify the grammar expression. An auxiliary text-to-grammar task is provided to improve the performance of the recognition results with the downstream task evaluation. Furthermore, the multiple evaluation methodology of grammar is used to explore an expandable usage paradigm with grammar knowledge. Experiments on the small open-source Mandarin speech corpus AISHELL-1 and large private-source Mandarin speech corpus TRANS-M tasks show that our method can perform very well with no additional data. Our method achieves relative character error rate reductions of 3.2% and 5.0%, a relative grammatical deviation distance reduction of 4.7% and 5.9% on AISHELL-1 and TRANS-M tasks, respectively. Moreover, the grammar-based mean opinion score of our method is about 4.29 and 3.20, significantly superior to the baseline of 4.11 and 3.02. Full article
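
A minimal sketch of reinforcing word embeddings with grammar embeddings, assuming a simple additive combination of token and POS-tag embeddings; the paper's actual fusion and the dimensions used here are illustrative assumptions.

```python
# Minimal sketch of reinforcing token embeddings with grammar information:
# a token embedding and a part-of-speech (grammar) embedding are summed into
# one representation. The paper's actual fusion and the dimensions here are
# assumptions for illustration only.
import torch
import torch.nn as nn

class GrammarReinforcedEmbedding(nn.Module):
    def __init__(self, vocab: int = 4000, n_pos_tags: int = 40, dim: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(n_pos_tags, dim)

    def forward(self, token_ids: torch.Tensor, pos_ids: torch.Tensor) -> torch.Tensor:
        return self.token_emb(token_ids) + self.pos_emb(pos_ids)

tokens = torch.randint(0, 4000, (2, 12))   # (batch, sequence length)
pos_tags = torch.randint(0, 40, (2, 12))   # POS tag id per token
print(GrammarReinforcedEmbedding()(tokens, pos_tags).shape)  # (2, 12, 256)
```
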
(This article belongs to the Special Issue Audio, Speech and Language Processing)

12 pages, 1273 KiB  
Article
Comparative Study for Multi-Speaker Mongolian TTS with a New Corpus
by Kailin Liang, Bin Liu, Yifan Hu, Rui Liu, Feilong Bao and Guanglai Gao
Appl. Sci. 2023, 13(7), 4237; https://doi.org/10.3390/app13074237 - 27 Mar 2023
Viewed by 1315
Abstract
Low-resource text-to-speech synthesis is a very promising research direction. Mongolian is the official language of the Inner Mongolia Autonomous Region and is spoken by more than 10 million people worldwide. Mongolian, as a representative low-resource language, has a relative lack of open-source datasets for its TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for related researchers. In this work, we invited three Mongolian announcers to record topic-rich speeches. Each announcer recorded 10 h of Mongolian speech, and the whole dataset was 30 h in total. In addition, we built two baseline systems based on state-of-the-art neural architectures, including a multi-speaker Fastspeech 2 model with HiFi-GAN vocoder and a full end-to-end VITS model for multi-speakers. On the system of FastSpeech2+HiFi-GAN, the three speakers scored 4.0 or higher on both naturalness evaluation and speaker similarity. In addition, the three speakers achieved scores of 4.5 or higher on the VITS model for naturalness evaluation and speaker similarity scores. The experimental results show that the published MnTTS2 dataset can be used to build robust Mongolian multi-speaker TTS models. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

12 pages, 2548 KiB  
Article
An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario
by Bing Yin, Shutong Niu, Haitao Tang, Lei Sun, Jun Du, Zhenhua Ling and Cong Liu
Appl. Sci. 2023, 13(7), 4100; https://doi.org/10.3390/app13074100 - 23 Mar 2023
Viewed by 1193
Abstract
Robust speech recognition in real world situations is still an important problem, especially when it is affected by environmental interference factors and conversational multi-speaker interactions. Supplementing audio information with other modalities, such as audio–visual speech recognition (AVSR), is a promising direction for improving speech recognition. The end-to-end (E2E) framework can learn information between multiple modalities well; however, the model is not easy to train, especially when the amount of data is relatively small. In this paper, we focus on building an encoder–decoder-based end-to-end audio–visual speech recognition system for use under realistic scenarios. First, we discuss different pre-training methods which provide various kinds of initialization for the AVSR framework. Second, we explore different model architectures and audio–visual fusion methods. Finally, we evaluate the performance on the corpus from the first Multi-modal Information based Speech Processing (MISP) challenge, which is recorded in a real home television (TV) room. By system fusion, our final system achieves a 23.98% character error rate (CER), which is better than the champion system of the first MISP challenge (CER = 25.07%). Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

13 pages, 647 KiB  
Article
DIR: A Large-Scale Dialogue Rewrite Dataset for Cross-Domain Conversational Text-to-SQL
by Jieyu Li, Zhi Chen, Lu Chen, Zichen Zhu, Hanqi Li, Ruisheng Cao and Kai Yu
Appl. Sci. 2023, 13(4), 2262; https://doi.org/10.3390/app13042262 - 09 Feb 2023
Cited by 2 | Viewed by 1975
Abstract
Semantic co-reference and ellipsis always lead to information deficiency when parsing natural language utterances with SQL in a multi-turn dialogue (i.e., conversational text-to-SQL task). The methodology of dividing a dialogue understanding task into dialogue utterance rewriting and language understanding is feasible to tackle this problem. To this end, we present a two-stage framework to complete conversational text-to-SQL tasks. To construct an efficient rewriting model in the first stage, we provide a large-scale dialogue rewrite dataset (DIR), which is extended from two cross-domain conversational text-to-SQL datasets, SParC and CoSQL. The dataset contains 5908 dialogues involving 160 domains. Therefore, it not only focuses on conversational text-to-SQL tasks, but is also a valuable corpus for dialogue rewrite study. In experiments, we validate the efficiency of our annotations with a popular text-to-SQL parser, RAT-SQL. The experiment results illustrate 11.81 and 27.17 QEM accuracy improvement on SParC and CoSQL, respectively, when we eliminate the semantic incomplete representations problem by directly parsing the golden rewrite utterances. The experiment results of evaluating the performance of the two-stage frameworks using different rewrite models show that the efficiency of rewrite models is important and still needs improvement. Additionally, as a new benchmark of the dialogue rewrite task, we also report the performance results of different baselines for related studies. Our dataset will be publicly available once this paper is accepted. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)

16 pages, 1071 KiB  
Article
Multi-Hypergraph Neural Networks for Emotion Recognition in Multi-Party Conversations
by Haojie Xu, Cheng Zheng, Zhuoer Zhao and Xiao Sun
Appl. Sci. 2023, 13(3), 1660; https://doi.org/10.3390/app13031660 - 28 Jan 2023
Cited by 1 | Viewed by 1526
Abstract
Emotion recognition in multi-party conversations (ERMC) is becoming increasingly popular as an emerging research topic in natural language processing. Recently, many approaches have been devoted to exploiting inter-dependency and self-dependency among participants. However, these approaches remain inadequate in terms of inter-dependency due to the fact that the effects among speakers are not individually captured. In this paper, we design two hypergraphs to deal with inter-dependency and self-dependency, respectively. To this end, we design a multi-hypergraph neural network for ERMC. In particular, we combine average aggregation and attention aggregation to generate hyperedge features, which can allow utterance information to be better utilized. The experimental results show that our method outperforms multiple baselines, indicating that further exploitation of inter-dependency is of great value for ERMC. In addition, we also achieved good results on the emotional shift issue. Full article
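
A sketch of forming a hyperedge feature from its member utterance nodes by combining average aggregation with attention aggregation; the scoring function and the equal-weight mix of the two aggregations are assumptions.

```python
# Sketch of the hyperedge-feature idea described: a hyperedge's feature is
# formed from its member utterance nodes by combining a plain average with an
# attention-weighted aggregation. The scoring function and the way the two
# aggregations are mixed are illustrative assumptions.
import torch
import torch.nn as nn

class HyperedgeAggregator(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # attention score per member node

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        """node_feats: (n_members, dim) features of utterances in one hyperedge."""
        avg = node_feats.mean(dim=0)
        attn = torch.softmax(self.score(node_feats), dim=0)   # (n_members, 1)
        weighted = (attn * node_feats).sum(dim=0)
        return 0.5 * (avg + weighted)                          # combined hyperedge feature

members = torch.randn(5, 128)                 # five utterances in a hyperedge
print(HyperedgeAggregator()(members).shape)   # torch.Size([128])
```
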
(This article belongs to the Special Issue Audio, Speech and Language Processing)

12 pages, 789 KiB  
Article
LWMD: A Comprehensive Compression Platform for End-to-End Automatic Speech Recognition Models
by Yukun Liu, Ta Li, Pengyuan Zhang and Yonghong Yan
Appl. Sci. 2023, 13(3), 1587; https://doi.org/10.3390/app13031587 - 26 Jan 2023
Cited by 1 | Viewed by 1022
Abstract
Recently, end-to-end (E2E) automatic speech recognition (ASR) models have achieved promising performance. However, existing models tend to adopt increasing model sizes and suffer from expensive resource consumption for real-world applications. To compress E2E ASR models and obtain smaller model sizes, we propose a comprehensive compression platform named LWMD (light-weight model designing), which consists of two essential parts: a light-weight architecture search (LWAS) framework and a differentiable structured pruning (DSP) algorithm. On the one hand, the LWAS framework adopts the neural architecture search (NAS) technique to automatically search light-weight architectures for E2E ASR models. By integrating different architecture topologies of existing models together, LWAS designs a topology-fused search space. Furthermore, combined with the E2E ASR training criterion, LWAS develops a resource-aware search algorithm to select light-weight architectures from the search space. On the other hand, given the searched architectures, the DSP algorithm performs structured pruning to reduce parameter numbers further. With a Gumbel re-parameterization trick, DSP builds a stronger correlation between the pruning criterion and the model performance than conventional pruning methods. An attention-similarity loss function is further developed for better performance. On two Mandarin datasets, Aishell-1 and HKUST, the compression results are well evaluated and analyzed to demonstrate the effectiveness of the LWMD platform. Full article
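
A sketch of a Gumbel-softmax keep/drop gate of the kind a differentiable structured-pruning scheme relies on, assuming per-channel granularity and a fixed temperature; the paper's pruning criterion and losses are not reproduced.

```python
# Sketch of a Gumbel-softmax "keep/drop" gate for differentiable structured
# pruning. The gate granularity (per output channel) and temperature are
# assumptions; the paper's actual criterion and losses are not reproduced.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGate(nn.Module):
    def __init__(self, n_channels: int, tau: float = 1.0):
        super().__init__()
        # Two logits per channel: index 0 = keep, index 1 = drop.
        self.logits = nn.Parameter(torch.zeros(n_channels, 2))
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time). Sample a near-binary keep indicator per
        # channel; gradients flow to the logits through the soft sample.
        gate = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)[:, 0]
        return x * gate.view(1, -1, 1)

x = torch.randn(4, 64, 100)
print(ChannelGate(64)(x).shape)   # torch.Size([4, 64, 100])
```
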
(This article belongs to the Special Issue Audio, Speech and Language Processing)

13 pages, 817 KiB  
Article
Explore Long-Range Context Features for Speaker Verification
by Zhuo Li, Zhenduo Zhao, Wenchao Wang, Pengyuan Zhang and Qingwei Zhao
Appl. Sci. 2023, 13(3), 1340; https://doi.org/10.3390/app13031340 - 19 Jan 2023
Cited by 1 | Viewed by 1558
Abstract
Multi-scale context information, especially long-range dependency, has been shown to be beneficial for speaker verification (SV) tasks. In this paper, we propose three methods to systematically explore long-range context SV feature extraction based on ResNet and analyze their complementarity. Firstly, the Hierarchical-split block (HS-block) is introduced to enlarge the receptive fields (RFs) and extract long-range context information over the feature maps of a single layer, where the multi-channel feature maps are split into multiple groups and then stacked together. Then, by analyzing the contribution of each location of the convolution kernel to SV, we find that the traditional convolution with a square kernel is not effective for long-range feature extraction. Therefore, we propose the cross convolution kernel (cross-conv), which replaces the original 3 × 3 convolution kernel with a 1 × 5 and 5 × 1 convolution kernel. Cross-conv further enlarges the RFs with the same FLOPs and parameters. Finally, the Depthwise Separable Self-Attention (DSSA) module uses an explicit sparse attention strategy to capture effective long-range dependencies globally in each channel. Experiments are conducted on VoxCeleb and CnCeleb to verify the effectiveness and robustness of the proposed system. Experimental results show that the combination of the HS-block, cross-conv, and the DSSA module achieves better performance than any single method, which demonstrates the complementarity of these three methods. Full article
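
A sketch of the cross-convolution idea, replacing a 3 × 3 kernel with a 1 × 5 and a 5 × 1 kernel; whether the two branches are summed or chained is not specified in the abstract, so summation is an assumption here.

```python
# Sketch of the cross-convolution idea: the 3x3 kernel is replaced by a 1x5
# and a 5x1 kernel whose outputs are combined, giving a cross-shaped receptive
# field that is wider along frequency and time. Whether the paper sums or
# chains the two branches is an assumption here (this sketch sums them).
import torch
import torch.nn as nn

class CrossConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.horizontal = nn.Conv2d(in_ch, out_ch, kernel_size=(1, 5), padding=(0, 2))
        self.vertical = nn.Conv2d(in_ch, out_ch, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time); output keeps the spatial size.
        return self.horizontal(x) + self.vertical(x)

spec = torch.randn(2, 32, 80, 200)          # dummy feature maps
print(CrossConv(32, 32)(spec).shape)        # torch.Size([2, 32, 80, 200])
```
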
(This article belongs to the Special Issue Audio, Speech and Language Processing)

12 pages, 3795 KiB  
Article
HierTTS: Expressive End-to-End Text-to-Waveform Using a Multi-Scale Hierarchical Variational Auto-Encoder
by Zengqiang Shang, Peiyang Shi, Pengyuan Zhang, Li Wang and Guangying Zhao
Appl. Sci. 2023, 13(2), 868; https://doi.org/10.3390/app13020868 - 08 Jan 2023
Cited by 3 | Viewed by 1984
Abstract
End-to-end text-to-speech (TTS) models that directly generate waveforms from text are gaining popularity. However, existing end-to-end models are still not natural enough in their prosodic expressiveness. Additionally, previous studies on improving the expressiveness of TTS have mainly focused on acoustic models. There is a lack of research on enhancing expressiveness in an end-to-end framework. Therefore, we propose HierTTS, a highly expressive end-to-end text-to-waveform generation model. It deeply couples the hierarchical properties of speech with hierarchical variational auto-encoders and models multi-scale latent variables, at the frame, phone, subword, word, and sentence levels. The hierarchical encoder encodes the speech signal from fine-grained features into coarse-grained latent variables. In contrast, the hierarchical decoder generates fine-grained features conditioned on the coarse-grained latent variables. We propose a staged KL-weighted annealing strategy to prevent hierarchical posterior collapse. Furthermore, we employ a hierarchical text encoder to extract linguistic information at different levels and act on both the encoder and the decoder. Experiments show that our model performs closer to natural speech in prosody expressiveness and has better generative diversity. Full article
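
A sketch of a staged KL-weight annealing schedule of the kind mentioned for preventing posterior collapse; the stage lengths and maximum weight are assumptions.

```python
# Sketch of a staged KL-weight annealing schedule of the kind mentioned for
# preventing hierarchical posterior collapse: the KL weight is held at zero
# for a warm-up stage, ramped up linearly in a second stage, then held at its
# final value. Stage lengths and the maximum weight are assumptions.
def kl_weight(step: int, warmup_steps: int = 10_000,
              ramp_steps: int = 40_000, max_weight: float = 1.0) -> float:
    if step < warmup_steps:                       # stage 1: reconstruction only
        return 0.0
    if step < warmup_steps + ramp_steps:          # stage 2: linear ramp
        return max_weight * (step - warmup_steps) / ramp_steps
    return max_weight                             # stage 3: full KL term

for s in (0, 10_000, 30_000, 60_000):
    print(s, round(kl_weight(s), 3))
# The weight would multiply each level's KL divergence in the ELBO:
# loss = reconstruction + kl_weight(step) * sum(kl_terms)
```
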
(This article belongs to the Special Issue Audio, Speech and Language Processing)

21 pages, 3645 KiB  
Article
Arabic Emotional Voice Conversion Using English Pre-Trained StarGANv2-VC-Based Model
by Ali H. Meftah, Yousef A. Alotaibi and Sid-Ahmed Selouani
Appl. Sci. 2022, 12(23), 12159; https://doi.org/10.3390/app122312159 - 28 Nov 2022
Cited by 1 | Viewed by 1671
Abstract
The goal of emotional voice conversion (EVC) is to convert the emotion of a speaker’s voice from one state to another while maintaining the original speaker’s identity and the linguistic substance of the message. Research on EVC in the Arabic language is well behind that conducted on languages with a wider distribution, such as English. The primary objective of this study is to determine whether Arabic emotions may be converted using a model trained for another language. In this work, we used an unsupervised many-to-many non-parallel generative adversarial network (GAN) voice conversion (VC) model called StarGANv2-VC to perform an Arabic EVC (A-EVC). The latter is realized by using pre-trained phoneme-level automatic speech recognition (ASR) and fundamental frequency (F0) models in the English language. The generated voice is evaluated by prosody and spectrum conversion in addition to automatic emotion recognition and speaker identification using a convolutional recurrent neural network (CRNN). The results of the evaluation indicated that male voices were scored higher than female voices and that the evaluation score for the conversion from neutral to other emotions was higher than the evaluation scores for the conversion of other emotions. Full article
(This article belongs to the Special Issue Audio, Speech and Language Processing)