Automatic Speech Recognition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (20 November 2022) | Viewed by 33671

Special Issue Editor


Guest Editor
Department of Electronics and Information Engineering, Beihang University, Beijing 100191, China
Interests: affective computing; pattern recognition; human–computer interaction

Special Issue Information

Dear Colleagues,

Major progress continues to be reported on both the technology and the applications of automatic speech recognition (ASR). However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as sensitivity to the environment (background noise) or weak representation of grammatical and semantic knowledge. Current research also highlights deficiencies in dealing with the variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. Many factors affect speech realization: regional, sociolinguistic, or related to the environment or the speaker themselves. These create a wide range of variations that may not be modeled correctly (speaker, gender, emotion, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. We are interested in articles that explore robust ASR systems. Potential topics include, but are not limited to, the following:

  • Automatic speech segmentation and phoneme detection;
  • Automatic speech recognition with noised speech;
  • Automatic speech translation;
  • Automatic classification of emotions in speech;
  • Multimodal speech recognition with video or physiological signals.

Dr. Lijiang Chen
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech recognition
  • noise reduction
  • emotion recognition
  • environment independence
  • acoustic model

Published Papers (10 papers)


Editorial


2 pages, 157 KiB  
Editorial
Special Issue on Automatic Speech Recognition
by Lijiang Chen
Appl. Sci. 2023, 13(9), 5389; https://doi.org/10.3390/app13095389 - 26 Apr 2023
Cited by 3 | Viewed by 1027
Abstract
With the rapid development of artificial intelligence and deep learning technology, automatic speech recognition technology is experiencing new vitality [...]
(This article belongs to the Special Issue Automatic Speech Recognition)

Research


19 pages, 3063 KiB  
Article
An Electroglottograph Auxiliary Neural Network for Target Speaker Extraction
by Lijiang Chen, Zhendong Mo, Jie Ren, Chunfeng Cui and Qi Zhao
Appl. Sci. 2023, 13(1), 469; https://doi.org/10.3390/app13010469 - 29 Dec 2022
Cited by 3 | Viewed by 1314
Abstract
The extraction of a target speaker from mixtures of different speakers has attracted extensive attention and research. Previous studies have proposed several methods, such as SpeakerBeam, that tackle this speech extraction problem using clean speech from the target speaker as auxiliary information. However, clean speech cannot be obtained immediately in most cases. In this study, we addressed this problem by extracting features from the electroglottographs (EGGs) of target speakers. An EGG is a laryngeal function detection technology that can measure the impedance and condition of the vocal cords. Because of the way they are collected, EGGs have excellent anti-noise performance and can be obtained in rather noisy environments. To obtain clean speech of target speakers from mixtures of different speakers, we used deep learning methods with EGG signals as additional information to extract the target speaker. In this way, we could extract the target speaker from mixtures of different speakers without needing clean speech from the target speaker. Based on the characteristics of EGG signals, we developed an EGG_auxiliary network to train a speaker extraction model under the assumption that EGG signals carry information about the speech signals. Additionally, we took the correlations between EGGs and speech signals in silent and unvoiced segments into consideration to develop a new network involving EGG preprocessing. We achieved improvements in the scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 0.89 dB on the Chinese Dual-Mode Emotional Speech Database (CDESD) and 1.41 dB on the EMO-DB dataset. In addition, our methods alleviate the poor performance observed when the target and interfering speakers are of the same gender, narrowing the gap between same-gender and different-gender conditions, and the greatly reduced precision under low-SNR circumstances.
(This article belongs to the Special Issue Automatic Speech Recognition)
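
The abstract above reports results in terms of SI-SDRi. As a reference only (not taken from the paper; the function names and the small epsilon guard are our own), the following minimal Python sketch shows how scale-invariant SDR and its improvement over the unprocessed mixture are commonly computed:

import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """SI-SDR in dB between an estimated and a reference waveform."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference (scale-invariant target).
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-8)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((target @ target) / (noise @ noise + 1e-8))

def si_sdri(estimate: np.ndarray, mixture: np.ndarray, reference: np.ndarray) -> float:
    """Improvement of the estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)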

13 pages, 1864 KiB  
Article
Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network
by Sakshi Dua, Sethuraman Sambath Kumar, Yasser Albagory, Rajakumar Ramalingam, Ankur Dumka, Rajesh Singh, Mamoon Rashid, Anita Gehlot, Sultan S. Alshamrani and Ahmed Saeed AlGhamdi
Appl. Sci. 2022, 12(12), 6223; https://doi.org/10.3390/app12126223 - 19 Jun 2022
Cited by 38 | Viewed by 3383
Abstract
Deep learning-based models have shown significant results in speech recognition and numerous vision-related tasks. The performance of the speech-to-text model presented here depends on the hyperparameters chosen in this work, which shows that convolutional neural networks (CNNs) can model raw and tonal speech signals with performance on par with existing recognition systems. This study extends the CNN-based approach to uncommon (tonal) speech signals using a purpose-built database. The main objective of this work was to develop a speech-to-text recognition system to recognize the tonal speech signals of Gurbani hymns using a CNN. The CNN model, with six layers of 2D convolution and 2D max pooling and a 256-unit dense layer, was built with Google's TensorFlow, and Praat was used for speech segmentation. Feature extraction employed the MFCC technique, which captures standard speech features as well as features of the background music. Our study reveals that the CNN-based method for identifying tonal speech sentences, augmented with instrumental knowledge, performs better than existing, conventional approaches. The experimental results demonstrate the performance of the present CNN architecture, with an 89.15% accuracy rate and a 10.56% WER for continuous, large-vocabulary sentences of speech signals with different tones.
(This article belongs to the Special Issue Automatic Speech Recognition)
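
As a rough illustration of the kind of architecture the abstract describes (stacked 2D convolution and max-pooling layers over MFCC features, a 256-unit dense layer, built with TensorFlow), here is a hedged Keras sketch; the filter counts, kernel sizes, input shape, and number of output classes are illustrative assumptions, not the paper's actual configuration:

import tensorflow as tf

NUM_CLASSES = 30            # assumption: number of target sentence classes
INPUT_SHAPE = (40, 200, 1)  # assumption: 40 MFCC coefficients x 200 frames

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=INPUT_SHAPE))
    for filters in (32, 32, 64, 64, 128, 128):   # six Conv2D + pooling blocks
        model.add(tf.keras.layers.Conv2D(filters, (3, 3), activation="relu",
                                         padding="same"))
        model.add(tf.keras.layers.MaxPooling2D((2, 2), padding="same"))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(256, activation="relu"))
    model.add(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model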

20 pages, 1254 KiB  
Article
Multi-Label Extreme Learning Machine (MLELMs) for Bangla Regional Speech Recognition
by Prommy Sultana Hossain, Amitabha Chakrabarty, Kyuheon Kim and Md. Jalil Piran
Appl. Sci. 2022, 12(11), 5463; https://doi.org/10.3390/app12115463 - 27 May 2022
Cited by 6 | Viewed by 2754
Abstract
Extensive research has been conducted in the past to determine age, gender, and words spoken in Bangla speech, but no work has been conducted to identify the regional language spoken by the speaker in Bangla speech. Hence, in this study, we create a dataset containing 30 h of Bangla speech in seven regional Bangla dialects, with the goal of detecting and categorizing synthesized Bangla speech. To categorize the regional dialect spoken in the Bangla speech and determine its authenticity, the proposed model combines a Stacked Convolutional Autoencoder (SCAE) and a sequence of Multi-Label Extreme Learning Machines (MLELMs). The SCAE creates a detailed feature map by identifying the spatially and temporally salient qualities of the MFEC input data. The feature map is then sent to the MLELM networks to generate soft labels and then hard labels. As aging causes physiological changes in the brain that alter the processing of aural information, the model takes age class into account while generating dialect class labels, which increases classification accuracy from 85% (without age class) to 95% (with age class). The classification accuracy for synthesized Bangla speech labels is 95%. The proposed methodology also works well with English-language audio sets.
(This article belongs to the Special Issue Automatic Speech Recognition)
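
The classifier family named in the abstract, the extreme learning machine, has a very compact formulation: a fixed random hidden layer whose output weights are solved in closed form. The sketch below is our own simplification (not the paper's MLELM pipeline or its SCAE front end); thresholded scores stand in for the soft-to-hard label step:

import numpy as np

class ELM:
    """Minimal multi-label extreme learning machine (illustrative only)."""

    def __init__(self, n_hidden: int = 512, seed: int = 0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X: np.ndarray) -> np.ndarray:
        return np.tanh(X @ self.W + self.b)   # fixed random projection

    def fit(self, X: np.ndarray, Y: np.ndarray) -> "ELM":
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights via least squares (Moore-Penrose pseudo-inverse).
        self.beta = np.linalg.pinv(H) @ Y
        return self

    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        scores = self._hidden(X) @ self.beta        # soft labels
        return (scores >= threshold).astype(int)    # hard labels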

18 pages, 961 KiB  
Article
Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
by Lijiang Chen, Jie Ren, Xia Mao and Qi Zhao
Appl. Sci. 2022, 12(9), 4338; https://doi.org/10.3390/app12094338 - 25 Apr 2022
Cited by 6 | Viewed by 1690
Abstract
Speech emotion recognition (SER) is an important component of affective computing and signal processing. Recently, many works have applied abundant acoustic features and complex model architectures to enhance performance, but these works sacrifice the portability of the model. To address this problem, we propose a model utilizing only the fundamental frequency from electroglottograph (EGG) signals. EGG signals are a type of physiological signal that directly reflects the movement of the vocal cords. Under the assumption that different acoustic features share similar representations of the internal emotional state, we propose cross-modal emotion distillation (CMED) to train the EGG-based SER model by transferring robust speech emotion representations from the log-Mel-spectrogram-based model. Using cross-modal emotion distillation, we increase recognition accuracy from 58.98% to 66.80% on the S70 subset of the Chinese Dual-mode Emotional Speech Database (CDESD, 7 classes) and from 32.29% to 42.71% on the EMO-DB (7 classes) dataset, which shows that our proposed method achieves results comparable to the human subjective experiment and realizes a trade-off between model complexity and performance.
(This article belongs to the Special Issue Automatic Speech Recognition)
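
Cross-modal distillation of the kind outlined above typically combines a hard-label term with a soft-target term that pulls the student (EGG-based) model toward the teacher (log-Mel-spectrogram-based) model's softened output distribution. The following numpy sketch of such a loss is illustrative only; the temperature, weighting, and exact form used in CMED may differ:

import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha weights the soft (teacher) term against the hard-label term."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) at temperature T, scaled by T^2 as is conventional.
    soft = (p_teacher * (np.log(p_teacher + 1e-12) -
                         np.log(p_student + 1e-12))).sum(axis=-1).mean() * T * T
    # Standard cross-entropy of the student on the ground-truth emotion labels.
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1.0 - alpha) * hard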

15 pages, 2315 KiB  
Article
Hybrid Dilated and Recursive Recurrent Convolution Network for Time-Domain Speech Enhancement
by Zhendong Song, Yupeng Ma, Fang Tan and Xiaoyi Feng
Appl. Sci. 2022, 12(7), 3461; https://doi.org/10.3390/app12073461 - 29 Mar 2022
Cited by 6 | Viewed by 1519
Abstract
In this paper, we propose a fully convolutional neural network based on recursive recurrent convolution for monaural speech enhancement in the time domain. The proposed network is an encoder-decoder structure using a series of hybrid dilated modules (HDMs). The encoder creates low-dimensional features of a noisy input frame. In the HDM, dilated convolution is used to expand the receptive field of the network, while standard convolution compensates for the local information that the dilated convolution under-utilizes. The decoder reconstructs the enhanced frames. The recursive recurrent convolutional network uses a GRU to avoid an excessive number of training parameters and an overly complex structure. State-of-the-art results are achieved on two commonly used speech datasets.
(This article belongs to the Special Issue Automatic Speech Recognition)
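
A hybrid dilated module as described, pairing a dilated convolution for a wide receptive field with a standard convolution that recovers local detail, can be sketched as follows in Keras; the filter counts, kernel sizes, strides, and the additive fusion are assumptions for illustration, not the authors' exact design:

import tensorflow as tf

def hybrid_dilated_module(x: tf.Tensor, filters: int = 64,
                          dilation_rate: int = 4) -> tf.Tensor:
    # Dilated branch: wide receptive field over the time axis.
    dilated = tf.keras.layers.Conv1D(filters, 3, padding="same",
                                     dilation_rate=dilation_rate,
                                     activation="relu")(x)
    # Standard branch: makes up for under-utilized local information.
    local = tf.keras.layers.Conv1D(filters, 3, padding="same",
                                   activation="relu")(x)
    return tf.keras.layers.Add()([dilated, local])

# Toy encoder-decoder usage on a 1-second, 16 kHz waveform frame.
inp = tf.keras.Input(shape=(16000, 1))
h = tf.keras.layers.Conv1D(64, 16, strides=4, padding="same",
                           activation="relu")(inp)          # encoder
h = hybrid_dilated_module(h)
out = tf.keras.layers.Conv1DTranspose(1, 16, strides=4,
                                      padding="same")(h)    # decoder
model = tf.keras.Model(inp, out)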

27 pages, 496 KiB  
Article
A Rule-Based Grapheme-to-Phoneme Conversion System
by Piotr Kłosowski
Appl. Sci. 2022, 12(5), 2758; https://doi.org/10.3390/app12052758 - 07 Mar 2022
Cited by 5 | Viewed by 2761
Abstract
This article presents a rule-based grapheme-to-phoneme conversion method and algorithm for Polish. The fundamental grapheme-to-phoneme conversion rules were developed by Maria Steffen-Batóg and presented in her set of monographs dedicated to the automatic grapheme-to-phoneme conversion of Polish texts. The author used these previously developed rules and independently developed the grapheme-to-phoneme conversion algorithm. The algorithm has been implemented as a software application called TransFon, which allows the user to convert any text in Polish orthography to the corresponding strings of phonemes in phonemic transcription. Using TransFon, a phonemic Polish language corpus was created from an orthographic corpus. The phonemic language corpus allows statistical analysis of the Polish language, as well as the development of phoneme- and word-based language models for automatic speech recognition using statistical methods. The developed phonemic corpus opens up further opportunities for research to improve automatic speech recognition in Polish. The development of statistical methods for speech recognition and language modelling requires access to large language corpora, including phonemic corpora, and the method presented here enables the creation of such corpora.
(This article belongs to the Special Issue Automatic Speech Recognition)
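
The core of a rule-based grapheme-to-phoneme converter is a longest-match-first scan over the orthographic string. The toy sketch below illustrates the idea with a handful of simplified Polish grapheme rules; these are not Steffen-Batóg's rules or the TransFon rule set, and real Polish G2P needs context-dependent rules (e.g., voicing assimilation) that this omits:

RULES = {  # grapheme -> phoneme (illustrative, simplified subset)
    "sz": "ʃ", "cz": "tʃ", "rz": "ʒ", "ch": "x", "w": "v",
    "ł": "w", "a": "a", "e": "ɛ", "o": "ɔ", "k": "k", "t": "t",
}

def g2p(word: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(word):
        # Longest-match-first: try two-letter graphemes before single letters.
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += length
                break
        else:
            phonemes.append(word[i])  # pass unknown characters through
            i += 1
    return phonemes

print(g2p("szach"))  # -> ['ʃ', 'a', 'x']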

18 pages, 2634 KiB  
Article
Harris Hawks Sparse Auto-Encoder Networks for Automatic Speech Recognition System
by Mohammed Hasan Ali, Mustafa Musa Jaber, Sura Khalil Abd, Amjad Rehman, Mazhar Javed Awan, Daiva Vitkutė-Adžgauskienė, Robertas Damaševičius and Saeed Ali Bahaj
Appl. Sci. 2022, 12(3), 1091; https://doi.org/10.3390/app12031091 - 21 Jan 2022
Cited by 25 | Viewed by 3083
Abstract
Automatic speech recognition (ASR) is an effective technique that can convert human speech into text or computer actions. ASR systems are widely used in smart appliances, smart homes, and biometric systems. Signal processing and machine learning techniques are incorporated to recognize speech. However, traditional systems have low performance in noisy environments. In addition, accents and local differences negatively affect an ASR system's performance when analyzing speech signals. To overcome these issues, a precise speech recognition system was developed to improve system performance. This paper uses speech data from the jim-schwoebel voice datasets, processed with Mel-frequency cepstral coefficients (MFCCs). The MFCC algorithm extracts the valuable features used to recognize speech. A sparse auto-encoder (SAE) neural network is used for classification, and a hidden Markov model (HMM) is used to make the recognition decision. Network performance is optimized by applying the Harris Hawks optimization (HHO) algorithm to fine-tune the network parameters. The fine-tuned network can effectively recognize speech in a noisy environment.
(This article belongs to the Special Issue Automatic Speech Recognition)
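
The MFCC front end mentioned above is a standard step that can be reproduced with common tooling. The sketch below uses librosa purely as an illustration; the sampling rate, number of coefficients, and normalisation step are assumptions rather than details from the paper, and the resulting matrix would then feed a classification stage such as the sparse auto-encoder network:

import numpy as np
import librosa

def extract_mfcc(path: str, n_mfcc: int = 13, sr: int = 16000) -> np.ndarray:
    """Return an (n_mfcc, frames) matrix of normalised MFCCs for one utterance."""
    y, sr = librosa.load(path, sr=sr)                      # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Per-coefficient mean/variance normalisation is a common extra step.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)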

Review


22 pages, 1693 KiB  
Review
Arabic Automatic Speech Recognition: A Systematic Literature Review
by Amira Dhouib, Achraf Othman, Oussama El Ghoul, Mohamed Koutheair Khribi and Aisha Al Sinani
Appl. Sci. 2022, 12(17), 8898; https://doi.org/10.3390/app12178898 - 05 Sep 2022
Cited by 12 | Viewed by 5950
Abstract
Automatic Speech Recognition (ASR), also known as Speech-To-Text (STT) or computer speech recognition, has recently been an active field of research. This study charts the field by performing a Systematic Literature Review (SLR) to give insight into the ASR studies proposed, especially for the Arabic language. The purpose is to highlight research trends in Arabic ASR and guide researchers to the most significant studies published over the ten years from 2011 to 2021. This SLR addresses seven specific research questions related to the toolkits used for developing and evaluating Arabic ASR, the supported types of Arabic, the feature extraction and classification techniques used, the type of speech recognition, the performance of Arabic ASR, the gaps facing researchers, and directions for future research. Across five databases, 38 studies met our inclusion criteria. Our results show that different open-source toolkits support Arabic speech recognition; the most prominent were the KALDI, HTK, and CMU Sphinx toolkits. A total of 89.47% of the retained studies cover Modern Standard Arabic, whereas 26.32% are dedicated to different Arabic dialects. MFCC and HMM were the most used feature extraction and classification techniques, respectively: 63% of the papers were based on MFCCs and 21% on HMMs. The review also shows that the performance of Arabic ASR systems depends mainly on criteria related to the availability of resources, the techniques used for acoustic modeling, and the datasets used.
(This article belongs to the Special Issue Automatic Speech Recognition)

26 pages, 2025 KiB  
Review
Automatic Speech Recognition (ASR) Systems for Children: A Systematic Literature Review
by Vivek Bhardwaj, Mohamed Tahar Ben Othman, Vinay Kukreja, Youcef Belkhier, Mohit Bajaj, B. Srikanth Goud, Ateeq Ur Rehman, Muhammad Shafiq and Habib Hamam
Appl. Sci. 2022, 12(9), 4419; https://doi.org/10.3390/app12094419 - 27 Apr 2022
Cited by 25 | Viewed by 6104
Abstract
Automatic speech recognition (ASR) is one way to transform acoustic speech signals into text. Over the last few decades, an enormous amount of research has been done in the area of speech recognition (SR). However, most studies have focused on building ASR systems for adult speech. The recognition of children's speech was neglected for some time, which means that the field of children's SR research is wide open. Children's SR is a challenging task due to the large variation in children's articulatory, acoustic, physical, and linguistic characteristics compared to adult speech. The field has therefore become a very attractive area of research, and it is important to understand where the main focus of attention lies and which methods are most widely used for extracting acoustic features, as well as which acoustic models, speech datasets, and SR toolkits are used during the recognition process. ASR systems and interfaces are extensively used and integrated into various real-life applications, such as search engines, the healthcare industry, biometric analysis, car systems, the military, aids for people with disabilities, and mobile devices. A systematic literature review (SLR) is presented in this work, extracting the relevant information from 76 research papers published from 2009 to 2020 in the field of ASR for children. The objective of this review is to shed light on the trends of research in children's speech recognition and to analyze the potential of trending techniques to recognize children's speech.
(This article belongs to the Special Issue Automatic Speech Recognition)
