Automatic Speech Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (20 November 2022) | Viewed by 33671

Special Issue Editor


Guest Editor
Department of Electronics and Information Engineering, Beihang University, Beijing 100191, China
Interests: affective computing; pattern recognition; human–computer interaction

Special Issue Information

Dear Colleagues,

Major progress continues to be reported on both the technology and the applications of automatic speech recognition (ASR). However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as sensitivity to the environment (background noise) or weak representation of grammatical and semantic knowledge. Current research also highlights deficiencies in dealing with the variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. Many factors affect speech realization: regional, sociolinguistic, or related to the environment or the speaker themselves. These create a wide range of variations that may not be modeled correctly (speaker, gender, emotion, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. We are interested in articles that explore robust ASR systems. Potential topics include, but are not limited to, the following:

  • Automatic speech segmentation and phoneme detection;
  • Automatic speech recognition with noised speech;
  • Automatic speech translation;
  • Automatic classification of emotions in speech;
  • Multimodal speech recognition with video or physiological signals.

Dr. Lijiang Chen
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech recognition
  • noise reduction
  • emotion recognition
  • environment independence
  • acoustic model

Published Papers (10 papers)


Editorial


2 pages, 157 KiB  
Editorial
Special Issue on Automatic Speech Recognition
by Lijiang Chen
Appl. Sci. 2023, 13(9), 5389; https://doi.org/10.3390/app13095389 - 26 Apr 2023
Cited by 3 | Viewed by 1027
Abstract
With the rapid development of artificial intelligence and deep learning technology, automatic speech recognition technology is experiencing new vitality [...]
(This article belongs to the Special Issue Automatic Speech Recognition)

Research


19 pages, 3063 KiB  
Article
An Electroglottograph Auxiliary Neural Network for Target Speaker Extraction
by Lijiang Chen, Zhendong Mo, Jie Ren, Chunfeng Cui and Qi Zhao
Appl. Sci. 2023, 13(1), 469; https://doi.org/10.3390/app13010469 - 29 Dec 2022
Cited by 3 | Viewed by 1314
Abstract
The extraction of a target speaker from mixtures of different speakers has attracted extensive attention and research. Previous studies have proposed several methods, such as SpeakerBeam, that tackle this speech extraction problem using clean speech from the target speaker as auxiliary information. However, clean speech cannot be obtained immediately in most cases. In this study, we addressed this problem by extracting features from the electroglottographs (EGGs) of target speakers. An EGG is a laryngeal function detection technology that can measure the impedance and condition of the vocal cords. Because of the way they are collected, EGGs have excellent anti-noise performance and can be obtained in rather noisy environments. To obtain clean speech of target speakers from mixtures of different speakers, we used deep learning methods with EGG signals as additional information to extract the target speaker. In this way, we could extract the target speaker from mixtures of different speakers without needing clean speech from the target speaker. Based on the characteristics of EGG signals, we developed an EGG_auxiliary network to train a speaker extraction model under the assumption that EGG signals carry information about the speech signals. Additionally, we took the correlations between EGGs and speech signals in silent and unvoiced segments into consideration to develop a new network involving EGG preprocessing. We achieved improvements in the scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 0.89 dB on the Chinese Dual-Mode Emotional Speech Database (CDESD) and 1.41 dB on the EMO-DB dataset. In addition, our methods alleviate the poor performance observed when the target and interfering speakers are of the same gender, narrowing the gap between same-gender and different-gender conditions, and the greatly reduced precision under low-SNR circumstances.
(This article belongs to the Special Issue Automatic Speech Recognition)
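
The abstract above reports results in terms of SI-SDRi. As a reference only (not taken from the paper; the function names and the small epsilon guard are our own), the following minimal Python sketch shows how scale-invariant SDR and its improvement over the unprocessed mixture are commonly computed:

import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """SI-SDR in dB between an estimated and a reference waveform."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference (scale-invariant target).
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-8)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((target @ target) / (noise @ noise + 1e-8))

def si_sdri(estimate: np.ndarray, mixture: np.ndarray, reference: np.ndarray) -> float:
    """Improvement of the estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)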

13 pages, 1864 KiB  
Article
Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network
by Sakshi Dua, Sethuraman Sambath Kumar, Yasser Albagory, Rajakumar Ramalingam, Ankur Dumka, Rajesh Singh, Mamoon Rashid, Anita Gehlot, Sultan S. Alshamrani and Ahmed Saeed AlGhamdi
Appl. Sci. 2022, 12(12), 6223; https://doi.org/10.3390/app12126223 - 19 Jun 2022
Cited by 38 | Viewed by 3383
Abstract
Deep learning-based models have shown significant results in speech recognition and numerous vision-related tasks. The performance of the speech-to-text model presented here depends on the hyperparameters chosen in this work, which shows that convolutional neural networks (CNNs) can model raw and tonal speech signals with performance on par with existing recognition systems. This study extends the CNN-based approach to uncommon (tonal) speech signals using a purpose-built database. The main objective of this work was to develop a speech-to-text recognition system to recognize the tonal speech signals of Gurbani hymns using a CNN. The CNN model, with six layers of 2D convolution and 2D max pooling and a 256-unit dense layer, was built with Google's TensorFlow, and Praat was used for speech segmentation. Feature extraction employed the MFCC technique, which captures standard speech features as well as features of the background music. Our study reveals that the CNN-based method for identifying tonal speech sentences, augmented with instrumental knowledge, performs better than existing, conventional approaches. The experimental results demonstrate the performance of the present CNN architecture, with an 89.15% accuracy rate and a 10.56% WER for continuous, large-vocabulary sentences of speech signals with different tones.
(This article belongs to the Special Issue Automatic Speech Recognition)
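
As a rough illustration of the kind of architecture the abstract describes (stacked 2D convolution and max-pooling layers over MFCC features, a 256-unit dense layer, built with TensorFlow), here is a hedged Keras sketch; the filter counts, kernel sizes, input shape, and number of output classes are illustrative assumptions, not the paper's actual configuration:

import tensorflow as tf

NUM_CLASSES = 30            # assumption: number of target sentence classes
INPUT_SHAPE = (40, 200, 1)  # assumption: 40 MFCC coefficients x 200 frames

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=INPUT_SHAPE))
    for filters in (32, 32, 64, 64, 128, 128):   # six Conv2D + pooling blocks
        model.add(tf.keras.layers.Conv2D(filters, (3, 3), activation="relu",
                                         padding="same"))
        model.add(tf.keras.layers.MaxPooling2D((2, 2), padding="same"))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(256, activation="relu"))
    model.add(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model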

20 pages, 1254 KiB  
Article
Multi-Label Extreme Learning Machine (MLELMs) for Bangla Regional Speech Recognition
by Prommy Sultana Hossain, Amitabha Chakrabarty, Kyuheon Kim and Md. Jalil Piran
Appl. Sci. 2022, 12(11), 5463; https://doi.org/10.3390/app12115463 - 27 May 2022
Cited by 6 | Viewed by 2754
Abstract
Extensive research has been conducted in the past to determine age, gender, and words spoken in Bangla speech, but no work has been conducted to identify the regional language spoken by the speaker in Bangla speech. Hence, in this study, we create a dataset containing 30 h of Bangla speech in seven regional Bangla dialects, with the goal of detecting and categorizing synthesized Bangla speech. To categorize the regional dialect spoken in the Bangla speech and determine its authenticity, the proposed model combines a Stacked Convolutional Autoencoder (SCAE) and a sequence of Multi-Label Extreme Learning Machines (MLELMs). The SCAE creates a detailed feature map by identifying the spatially and temporally salient qualities of the MFEC input data. The feature map is then sent to the MLELM networks to generate soft labels and then hard labels. As aging causes physiological changes in the brain that alter the processing of aural information, the model takes age class into account while generating dialect class labels, which increases classification accuracy from 85% (without age class) to 95% (with age class). The classification accuracy for synthesized Bangla speech labels is 95%. The proposed methodology also works well with English-language audio sets.
(This article belongs to the Special Issue Automatic Speech Recognition)
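
The classifier family named in the abstract, the extreme learning machine, has a very compact formulation: a fixed random hidden layer whose output weights are solved in closed form. The sketch below is our own simplification (not the paper's MLELM pipeline or its SCAE front end); thresholded scores stand in for the soft-to-hard label step:

import numpy as np

class ELM:
    """Minimal multi-label extreme learning machine (illustrative only)."""

    def __init__(self, n_hidden: int = 512, seed: int = 0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X: np.ndarray) -> np.ndarray:
        return np.tanh(X @ self.W + self.b)   # fixed random projection

    def fit(self, X: np.ndarray, Y: np.ndarray) -> "ELM":
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights via least squares (Moore-Penrose pseudo-inverse).
        self.beta = np.linalg.pinv(H) @ Y
        return self

    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        scores = self._hidden(X) @ self.beta        # soft labels
        return (scores >= threshold).astype(int)    # hard labels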

18 pages, 961 KiB  
Article
Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
by Lijiang Chen, Jie Ren, Xia Mao and Qi Zhao
Appl. Sci. 2022, 12(9), 4338; https://doi.org/10.3390/app12094338 - 25 Apr 2022
Cited by 6 | Viewed by 1690
Abstract
Speech emotion recognition (SER) is an important component of affective computing and signal processing. Recently, many works have applied abundant acoustic features and complex model architectures to enhance performance, but these works sacrifice the portability of the model. To address this problem, we propose a model utilizing only the fundamental frequency from electroglottograph (EGG) signals. EGG signals are a type of physiological signal that directly reflects the movement of the vocal cords. Under the assumption that different acoustic features share similar representations of the internal emotional state, we propose cross-modal emotion distillation (CMED) to train the EGG-based SER model by transferring robust speech emotion representations from the log-Mel-spectrogram-based model. Using cross-modal emotion distillation, we increase recognition accuracy from 58.98% to 66.80% on the S70 subset of the Chinese Dual-mode Emotional Speech Database (CDESD, 7 classes) and from 32.29% to 42.71% on the EMO-DB (7 classes) dataset, which shows that our proposed method achieves results comparable to the human subjective experiment and realizes a trade-off between model complexity and performance.
(This article belongs to the Special Issue Automatic Speech Recognition)
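
Cross-modal distillation of the kind outlined above typically combines a hard-label term with a soft-target term that pulls the student (EGG-based) model toward the teacher (log-Mel-spectrogram-based) model's softened output distribution. The following numpy sketch of such a loss is illustrative only; the temperature, weighting, and exact form used in CMED may differ:

import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha weights the soft (teacher) term against the hard-label term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) at temperature T, scaled by T^2 as is conventional.
    soft = (p_teacher * (np.log(p_teacher + 1e-12) -
                         np.log(p_student + 1e-12))).sum(axis=-1).mean() * T * T
    # Standard cross-entropy of the student on the ground-truth emotion labels.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1.0 - alpha) * hard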

15 pages, 2315 KiB  
Article
Hybrid Dilated and Recursive Recurrent Convolution Network for Time-Domain Speech Enhancement
by Zhendong Song, Yupeng Ma, Fang Tan and Xiaoyi Feng
Appl. Sci. 2022, 12(7), 3461; https://doi.org/10.3390/app12073461 - 29 Mar 2022
Cited by 6 | Viewed by 1519
Abstract
In this paper, we propose a fully convolutional neural network based on recursive recurrent convolution for monaural speech enhancement in the time domain. The proposed network is an encoder-decoder structure using a series of hybrid dilated modules (HDMs). The encoder creates low-dimensional features of a noisy input frame. In the HDM, dilated convolution is used to expand the receptive field of the network, while standard convolution compensates for the local information that the dilated convolution under-utilizes. The decoder reconstructs the enhanced frames. The recursive recurrent convolutional network uses a GRU to avoid an excessive number of training parameters and an overly complex structure. State-of-the-art results are achieved on two commonly used speech datasets.
(This article belongs to the Special Issue Automatic Speech Recognition)
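
A hybrid dilated module as described, pairing a dilated convolution for a wide receptive field with a standard convolution that recovers local detail, can be sketched as follows in Keras; the filter counts, kernel sizes, strides, and the additive fusion are assumptions for illustration, not the authors' exact design:

import tensorflow as tf

def hybrid_dilated_module(x: tf.Tensor, filters: int = 64,
                          dilation_rate: int = 4) -> tf.Tensor:
    # Dilated branch: wide receptive field over the time axis.
    dilated = tf.keras.layers.Conv1D(filters, 3, padding="same",
                                     dilation_rate=dilation_rate,
                                     activation="relu")(x)
    # Standard branch: makes up for under-utilized local information.
    local = tf.keras.layers.Conv1D(filters, 3, padding="same",
                                   activation="relu")(x)
    return tf.keras.layers.Add()([dilated, local])

# Toy encoder-decoder usage on a 1-second, 16 kHz waveform frame.
inp = tf.keras.Input(shape=(16000, 1))
h = tf.keras.layers.Conv1D(64, 16, strides=4, padding="same",
                           activation="relu")(inp)          # encoder
h = hybrid_dilated_module(h)
out = tf.keras.layers.Conv1DTranspose(1, 16, strides=4,
                                      padding="same")(h)    # decoder
model = tf.keras.Model(inp, out)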

27 pages, 496 KiB  
Article
A Rule-Based Grapheme-to-Phoneme Conversion System
by Piotr Kłosowski
Appl. Sci. 2022, 12(5), 2758; https://doi.org/10.3390/app12052758 - 07 Mar 2022
Cited by 5 | Viewed by 2761
Abstract
This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of Polish texts. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to the corresponding strings of phonemes in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created from an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora, and the method presented here enables the creation of such corpora.
(This article belongs to the Special Issue Automatic Speech Recognition)
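
The core of a rule-based grapheme-to-phoneme converter is a longest-match-first scan over the orthographic string. The toy sketch below illustrates the idea with a handful of simplified Polish grapheme rules; these are not Steffen-Batóg's rules or the TransFon rule set, and real Polish G2P needs context-dependent rules (e.g., voicing assimilation) that this omits:

RULES = {  # grapheme -> phoneme (illustrative, simplified subset)
    "sz": "ʃ", "cz": "tʃ", "rz": "ʒ", "ch": "x", "w": "v",
    "ł": "w", "a": "a", "e": "ɛ", "o": "ɔ", "k": "k", "t": "t",
}

def g2p(word: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(word):
        # Longest-match-first: try two-letter graphemes before single letters.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += length
                break
        else:
            phonemes.append(word[i])  # pass unknown characters through
            i += 1
    return phonemes

print(g2p("szach"))  # -> ['ʃ', 'a', 'x']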

18 pages, 2634 KiB  
Article
Harris Hawks Sparse Auto-Encoder Networks for Automatic Speech Recognition System
by Mohammed Hasan Ali, Mustafa Musa Jaber, Sura Khalil Abd, Amjad Rehman, Mazhar Javed Awan, Daiva Vitkutė-Adžgauskienė, Robertas Damaševičius and Saeed Ali Bahaj
Appl. Sci. 2022, 12(3), 1091; https://doi.org/10.3390/app12031091 - 21 Jan 2022
Cited by 25 | Viewed by 3083
Abstract
Automatic speech recognition (ASR) is an effective technique that can convert human speech into text or computer actions. ASR systems are widely used in smart appliances, smart homes, and biometric systems. Signal processing and machine learning techniques are incorporated to recognize speech. However, traditional systems have low performance in noisy environments. In addition, accents and local differences negatively affect an ASR system's performance when analyzing speech signals. To overcome these issues, a precise speech recognition system was developed to improve system performance. This paper uses speech data from the jim-schwoebel voice datasets, processed with Mel-frequency cepstral coefficients (MFCCs). The MFCC algorithm extracts the valuable features used to recognize speech. A sparse auto-encoder (SAE) neural network is used for classification, and a hidden Markov model (HMM) is used to make the recognition decision. Network performance is optimized by applying the Harris Hawks optimization (HHO) algorithm to fine-tune the network parameters. The fine-tuned network can effectively recognize speech in a noisy environment.
(This article belongs to the Special Issue Automatic Speech Recognition)
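
The MFCC front end mentioned above is a standard step that can be reproduced with common tooling. The sketch below uses librosa purely as an illustration; the sampling rate, number of coefficients, and normalisation step are assumptions rather than details from the paper, and the resulting matrix would then feed a classification stage such as the sparse auto-encoder network:

import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13, sr: int = 16000) -> np.ndarray:
    """Return an (n_mfcc, frames) matrix of normalised MFCCs for one utterance."""
    y, sr = librosa.load(path, sr=sr)                      # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Per-coefficient mean/variance normalisation is a common extra step.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)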

Review


22 pages, 1693 KiB  
Review
Arabic Automatic Speech Recognition: A Systematic Literature Review
by Amira Dhouib, Achraf Othman, Oussama El Ghoul, Mohamed Koutheair Khribi and Aisha Al Sinani
Appl. Sci. 2022, 12(17), 8898; https://doi.org/10.3390/app12178898 - 05 Sep 2022
Cited by 12 | Viewed by 5950
Abstract
Automatic Speech Recognition (ASR), also known as Speech-To-Text (STT) or computer speech recognition, has recently been an active field of research. This study charts the field by performing a Systematic Literature Review (SLR) to give insight into the ASR studies proposed, especially for the Arabic language. The purpose is to highlight research trends in Arabic ASR and guide researchers to the most significant studies published over the ten years from 2011 to 2021. This SLR addresses seven specific research questions related to the toolkits used for developing and evaluating Arabic ASR, the supported types of Arabic, the feature extraction and classification techniques used, the type of speech recognition, the performance of Arabic ASR, the gaps facing researchers, and directions for future research. Across five databases, 38 studies met our inclusion criteria. Our results show that different open-source toolkits support Arabic speech recognition; the most prominent were the KALDI, HTK, and CMU Sphinx toolkits. A total of 89.47% of the retained studies cover Modern Standard Arabic, whereas 26.32% are dedicated to different Arabic dialects. MFCC and HMM were the most used feature extraction and classification techniques, respectively: 63% of the papers were based on MFCCs and 21% on HMMs. The review also shows that the performance of Arabic ASR systems depends mainly on criteria related to the availability of resources, the techniques used for acoustic modeling, and the datasets used.
(This article belongs to the Special Issue Automatic Speech Recognition)

26 pages, 2025 KiB  
Review
Automatic Speech Recognition (ASR) Systems for Children: A Systematic Literature Review
by Vivek Bhardwaj, Mohamed Tahar Ben Othman, Vinay Kukreja, Youcef Belkhier, Mohit Bajaj, B. Srikanth Goud, Ateeq Ur Rehman, Muhammad Shafiq and Habib Hamam
Appl. Sci. 2022, 12(9), 4419; https://doi.org/10.3390/app12094419 - 27 Apr 2022
Cited by 25 | Viewed by 6104
Abstract
Automatic speech recognition (ASR) is one way to transform acoustic speech signals into text. Over the last few decades, an enormous amount of research has been done in the area of speech recognition (SR). However, most studies have focused on building ASR systems for adult speech. The recognition of children's speech was neglected for some time, which means that the field of children's SR research is wide open. Children's SR is a challenging task due to the large variation in children's articulatory, acoustic, physical, and linguistic characteristics compared to adult speech. The field has therefore become a very attractive area of research, and it is important to understand where the main focus of attention lies and which methods are most widely used for extracting acoustic features, as well as which acoustic models, speech datasets, and SR toolkits are used during the recognition process. ASR systems and interfaces are extensively used and integrated into various real-life applications, such as search engines, the healthcare industry, biometric analysis, car systems, the military, aids for people with disabilities, and mobile devices. A systematic literature review (SLR) is presented in this work, extracting the relevant information from 76 research papers published from 2009 to 2020 in the field of ASR for children. The objective of this review is to shed light on the trends of research in children's speech recognition and to analyze the potential of trending techniques to recognize children's speech.
(This article belongs to the Special Issue Automatic Speech Recognition)
