Article

Audio Augmentation for Non-Native Children’s Speech Recognition through Discriminative Learning

School of Electronics Engineering, VIT-AP University, Amaravati 522237, India
* Author to whom correspondence should be addressed.
Entropy 2022, 24(10), 1490; https://doi.org/10.3390/e24101490
Submission received: 31 August 2022 / Revised: 8 October 2022 / Accepted: 14 October 2022 / Published: 19 October 2022
(This article belongs to the Special Issue Information-Theoretic Approaches in Speech Processing and Recognition)

Abstract

Automatic speech recognition (ASR) for children is a rapidly evolving field, as children increasingly interact with virtual assistants such as Amazon Echo, Cortana, and other smart speakers, and it has advanced human–computer interaction in recent generations. Furthermore, non-native children exhibit a diverse range of reading errors during second language (L2) acquisition, such as lexical disfluency, hesitations, intra-word switching, and word repetitions, which have not yet been addressed and cause ASR systems to struggle with non-native children’s speech. The main objective of this study is to develop a non-native children’s speech recognition system built on feature-space discriminative models, namely feature-space maximum mutual information (fMMI) and boosted feature-space maximum mutual information (fbMMI). Combining these models with speed perturbation-based data augmentation of the original children’s speech corpora yields effective performance. The corpus covers different speaking styles of children, including read speech and spontaneous speech, in order to investigate the impact of non-native children’s L2 speaking proficiency on speech recognition systems. The experiments revealed that feature-space MMI models with steadily increasing speed perturbation factors outperform traditional ASR baseline models.

1. Introduction

Automatic speech recognition (ASR) technology for adults has become popular in many applications owing to the availability of large volumes of data and substantial computational capability, and recent studies have shown that ASR systems can reach quality standards similar to human transcription for some tasks [1]. However, while ASR systems have proven their ability to recognize English spoken by adults or by children with a native accent, they continue to struggle with the coexisting variability of non-native and children’s speaking characteristics, which includes the following:
  • A lack of large amounts of training data for non-native children’s speech [2];
  • Reading miscues, such as lexical disfluency, hesitations, intra-word switching (combining two languages within a single word), incomplete words, filled interruptions, word repetitions, word boundary errors, and the adoption of some native-language sounds and phonology.
Because of its prominence as the language of formal instruction, English has become a significant mode of communication among Indian children [3]. However, the aforementioned reading miscues can significantly degrade ASR efficacy [4,5,6]. Therefore, ASR systems should continue to improve so that they can process children’s speech from non-native populations more effectively. A key application field is the automatic assessment of L2 speaking proficiency, where ASR difficulties are exacerbated by speakers with low levels of speaking ability, particularly children. One related application for detecting and correcting pronunciation errors is “SPIRE-fluent”, a self-learning application for teaching oral fluency to L2 English learners [7]; however, its targeted users are students of class IX and above, job-seeking graduates, and similar groups. ASR technology developed by SoapBox Labs is used to assist children by transcribing their speech, in their native accent, as they read a story; the transcription is then compared with the text of the reading passage, and a fluency algorithm reports the reliability of the student’s reading attempt [8].
This motivates us to focus the research on non-native children, for whom most existing children’s speech corpora are costly and restricted in terms of accessibility. A considerable quantity of data from Chinese speakers of English (e.g., SpeechOcean [9]) is publicly available, whereas datasets for Indian L2 speakers remain limited. As a result, the proposed work collected speech corpora covering different voice tasks from children and made them publicly available. Speed perturbation-based data augmentation is investigated to deal with the lack of training data from Indian children speaking English. By addressing the aforementioned gaps, the present work can act as a forerunner for state-of-the-art ASR on non-native children’s speech.
The remainder of this article is organized as follows: Section 2 provides a brief overview of existing non-native children’s speech recognition using several state-of-the-art models. Section 3 describes the collected children’s speech database, which is followed by the proposed state-of-the-art models, including the mathematical intuition behind the discriminative models, in Section 4. The experimental procedure covering the corpus setup and artificial data creation is presented in Section 5, and the experimental evaluations for different styles of children’s speech are reported in Section 6. Finally, Section 7 outlines the conclusions and future expansion of the proposed work.

2. Related Work

Children’s automatic speech recognition is a subject of great interest these days, as children are increasingly comfortable interacting with digital personal assistants such as Amazon Alexa, Microsoft Cortana, and Google Home. Children stand to benefit the most from technologies such as automated reading assessment [10] and interactive reading tutors [11], which help them learn both their first and second languages with minimal support from instructors. Another application is accent classification of native and non-native UK children’s speech, using reliable cues extracted from five-year-old pre-school children’s speech [12]. A further use of children’s speech is the classification of paralinguistic speech cues for adults and children using a phoneme-based model [13,14]. Since 2020, Interspeech has run a challenge [15] to advance research on non-native children’s speech recognition technology, which still falls short due to immature vocal tracts, disfluencies [16], ungrammatical syntactic structure arising from limited L2 proficiency [17,18], the use of out-of-vocabulary (OOV) words [19], and, often, a lack of publicly accessible data.
Many efforts are underway to enhance the efficiency of non-native children’s speech recognition, including data augmentation, robust feature extraction techniques, transfer learning, multi-task learning, and so on. For children aged 11 to 15 years from Arab countries, China, France, and Germany, a word error rate (WER) of 13.4% was obtained using bidirectional long short-term memory (BiLSTM)–recurrent neural network (RNN) models on read speech, picture narration, and spontaneous speech [20]. An investigation of transfer learning with Italian, German, English, and Swedish children aged between 9 and 10 years using a deep neural network (DNN) model achieved a WER of 14.2% for Italian children speaking English and 15% for German children speaking English [18]. Additionally, prosody- and spectrogram-based data augmentation was employed on the Trentino Language Testing (TLT) school corpus of non-native Italian children’s read speech (ages 9–16 years), incorporating factored time-delay neural networks (TDNN-F) and a convolutional neural network (CNN) along with BiLSTM and vocal-tract-length normalization (VTLN), and attained a WER of 18.71% [21]. Subsequently, Mel-frequency cepstral coefficients (MFCC) and i-vectors were extracted and then speed- and spectrogram-perturbed to boost the effectiveness of factored time-delay neural networks combined with a convolutional neural network (CNN-TDNN-F), resulting in a relative improvement (RI) of 17.76% on the same TLT corpus [22]. Furthermore, using spectrogram augmentation with a TDNN-F and LSTM on the TLT Italian children’s speech corpus [23], an RI of 17.87% was achieved compared to [21] and a 10.7% RI compared to [22]. Moreover, all available children’s speech corpora, such as OGI Kids, MyST, CU, and CMU Kids (ages 5 to 16), were augmented with speed perturbation, room impulse responses (RIR), babble noise, and non-speech noise and used to train a CNN-TDNN-F model, achieving a WER of 16.59% with minimum Bayes-risk (MBR) decoding [24]. Finally, CNN-TDNN-F and ESPnet models were investigated with pitch, speed, tempo, volume, and reverberation perturbations on a Mandarin children’s speech corpus from the SLT 2021 challenge, containing read and conversational speech from 4–16 year old children, yielding a character error rate (CER) of 16.48% [25].
Neural networks, with their large number of parameters, struggle with data scarcity more than conventional methods, especially in low-data scenarios. As a result, the proposed work, which targets children aged 7–12 years, employs feature-space discriminatively trained models combined with various folds of speed perturbation-based data augmentation.

3. Children Speech Corpora

A collection of non-native children’s speech was developed because no publicly available corpus of non-native children’s speech suitable for the planned analysis existed at the time of writing. Data collection was therefore required, and the procedure is discussed below.

3.1. Participants

A total of 20 children, 11 female and 9 male, in the concrete-operational stage [26] and aged between 7 and 12 years, were considered for data collection. All the children are non-native English speakers and are bilingual, i.e., they speak Telugu, an Indian regional language, alongside English. At school, all the students speak English with age-appropriate proficiency. All the children and their parents gave their approval for participation in the research.

3.2. Data Collection

To assess children’s pronunciation and communication skills, a sentence repetition task and a picture narration task were devised for data collection. As the study targets students in primary and lower middle classes, around the ages of 7–12 years, simple sentences and pictures were used. In the “read speech” task, children repeat sentences framed around their daily routine (e.g., “I will wake up early in the morning and pray to God”) to measure phonetic ability. To evaluate working memory and attention, the spoken digits from 1 to 15 were included and had to be produced without repetitions or mistakes. To make the recording setting child-friendly, a few well-known rhymes were also added, such as “rain rain go away, come again another day”. Each sentence lasts 2–3 s.
In the picture narration task, children are asked to describe black-and-white pictures of objects, such as a pencil, umbrella, or cap, which are available on the SurveyLex cloud [27]. This speaking task, referred to as “spontaneous speech”, can be used to evaluate how fluently and effectively non-native children speak English. The activity has the potential to reveal children’s higher-order thought associations, vocabulary, and attentiveness, alongside disfluencies, false starts, pause lengths, and other issues. The length of each utterance varies depending on how well the child can communicate in English.

3.3. Data Processing

The open-source SurveyLex [28] platform was deployed on a device placed 2 feet from the children to record their speech. Students were allowed to sit in front of a computer in a relaxed manner. All audio files were recorded in .wav format with a stereo channel, a sampling rate of 44.1 kHz, and 16 bits per sample. To capture variability in words and sentences, each survey was taken as many times as the child could manage, up to a maximum of 10 times per child. A total of 199.2 min (3.32 h) of data was recorded. The audio files were recorded during class days, because robotic and computer applications are required to cope relatively well in classroom (i.e., real-world) environments, which can often be noisy [29]. As a result, the corpus was recorded in this natural setting, which involves doors closing, bells ringing, and other children talking in neighboring rooms. This realistic representation of a minimal practical noise level in a natural scenario allows recognition accuracy to be evaluated with greater veracity.
Both read and spontaneous speech data were carefully transcribed, and all raw audio files, as well as transcription, may be found at Non-Native Children Speech Mini Corpus [30].

4. Proposed System Overview

The collected children’s speech is stereo at a sampling rate of 44.1 kHz and is down-sampled to a 16 kHz single-channel signal using SoX [31]. Initially, the proposed model is trained with the re-sampled children’s speech and is then further experimented with using speed perturbation-based data augmentation in order to handle the data scarcity, as demonstrated in Figure 1. The most popular front-end feature, MFCC, is employed with a frame length of 25 ms and an overlap of 10 ms, followed by the application of cepstral mean and variance normalization (CMVN).
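As a rough illustration of this front end, the sketch below (not the authors’ exact Kaldi recipe) loads an utterance as 16 kHz mono, extracts 13-dimensional MFCCs with a 25 ms window and 10 ms shift, and applies per-utterance CMVN; the file name is hypothetical.

```python
# A minimal sketch of the described preprocessing: 16 kHz mono resampling,
# 13-dim MFCCs with a 25 ms window and 10 ms shift, and per-utterance CMVN.
import librosa
import numpy as np

SR = 16000                      # target sampling rate after down-sampling
N_FFT = int(0.025 * SR)         # 25 ms analysis window -> 400 samples
HOP = int(0.010 * SR)           # 10 ms frame shift     -> 160 samples

def extract_mfcc_cmvn(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Load audio as 16 kHz mono, compute MFCCs, apply per-utterance CMVN."""
    y, sr = librosa.load(wav_path, sr=SR, mono=True)   # resamples 44.1 kHz -> 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=N_FFT, hop_length=HOP)
    # CMVN: zero mean, unit variance over time for each cepstral coefficient
    mean = mfcc.mean(axis=1, keepdims=True)
    std = mfcc.std(axis=1, keepdims=True) + 1e-8
    return (mfcc - mean) / std

feats = extract_mfcc_cmvn("child_utt.wav")   # shape: (n_mfcc, n_frames)
```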
The recognition pipeline is composed of two parts, acoustic modeling (AM) and language modeling (LM): the former maps a sequence of acoustic feature vectors to suitable phonetic units, and the latter transduces the phonetic units into proper sentences. The baseline (non-augmented) data and the augmented feature vectors are initially trained using acoustic models, namely a context-independent monophone model (mono) and context-dependent triphone models (tri1, tri2) with delta and double-delta features on top of the MFCCs. Linear discriminant analysis with maximum likelihood linear transformation (LDA-MLLT) (tri3) is then used to further transform the feature dimension, and the features are aligned using feature-space maximum likelihood linear regression (fMLLR). To reduce the significant deviation between word representations in the training and test models, the output of tri3 is fed to discriminative models, namely maximum mutual information (MMI), boosted MMI (bMMI), feature-space MMI (fMMI), and boosted feature-space MMI (fbMMI), as shown in Figure 1.
For language modeling, conventional 3-gram language models (LM) [32] in the form of weighted finite-state transducers (WFST) [33] are used; the SRI Language Modeling toolkit (SRILM), an open-source language modeling toolkit developed at SRI’s Speech Technology and Research Laboratory, is used to rescore the acoustic model outputs. Finally, the test data are used to evaluate the performance of the proposed model, and the decoding results are reported in terms of word error rate (WER).
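To illustrate what the 3-gram LM stage estimates, here is a toy maximum-likelihood trigram counter; it is not SRILM (which additionally applies smoothing and builds WFSTs), and the example sentences are hypothetical transcript lines.

```python
# Illustrative sketch only: a maximum-likelihood 3-gram estimator showing what
# the LM stage models.
from collections import Counter

def train_trigram(sentences):
    """Count trigrams/bigrams and return an MLE probability lookup."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(words) - 2):
            bi[tuple(words[i:i + 2])] += 1
            tri[tuple(words[i:i + 3])] += 1
    def prob(w3, w1, w2):
        hist = bi[(w1, w2)]
        return tri[(w1, w2, w3)] / hist if hist else 0.0
    return prob

p = train_trigram(["this is a umbrella", "it protects us from rain"])
print(p("a", "this", "is"))   # P(a | this is) = 1.0 in this toy corpus
```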

4.1. Acoustic Models

For decades, Mel-frequency cepstral coefficients (MFCCs) have been the most extensively employed acoustic feature vectors in ASR engines. A dense representation of the data is created by extracting MFCC feature vectors from the raw audio with a frame size of 25 ms. MFCC features have been studied extensively owing to their ability to summarize the speech amplitude spectrum in a cosine form on a non-linear Mel scale. The primary goal of speech recognition is to find the best possible sequence of words from the raw audio input using a language model. Denoting the acoustic observations by $X$ and a word sequence by $W_r$, the acoustic model is formed as shown in Equation (1),

$$\tilde{W}_r = \underset{W_r}{\arg\max}\; P(X \mid W_r)\, P(W_r), \tag{1}$$

where $\tilde{W}_r$ is the word hypothesis, $P(X \mid W_r)$ is the acoustic model, and $P(W_r)$ is the language model.
A Gaussian mixture model (GMM) may be used to estimate the distribution of phone features, while a hidden Markov model (HMM) models the transitions between phones and the associated observations. The HMM comprises hidden states and observables; the Baum–Welch method [34] can estimate the HMM parameters given the training set, and the Viterbi algorithm is used to determine the most probable sequence of hidden states.
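The following minimal sketch shows Viterbi decoding over a small discrete HMM; the transition and emission tables are toy values, not the parameters of the trained GMM-HMM system.

```python
# Viterbi decoding over a discrete HMM, illustrating how the most probable
# hidden-state sequence is recovered from observations.
import numpy as np

def viterbi(log_init, log_trans, log_emit, observations):
    """log_init: (S,), log_trans: (S, S), log_emit: (S, V); returns best state path."""
    S = log_init.shape[0]
    T = len(observations)
    delta = np.full((T, S), -np.inf)       # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)      # back-pointers
    delta[0] = log_init + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (previous, current)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, observations[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 3-symbol example
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(log_init, log_trans, log_emit, [0, 1, 2]))
```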
Finally, two important metrics were used to assess system performance [35]: the word error rate (WER), which is the number of substitutions ($S$), deletions ($D$), and insertions ($I$) divided by the total number of words in the reference utterance ($N$), as given in Equation (2),

$$\%\mathrm{WER} = \frac{S + D + I}{N} \times 100, \tag{2}$$

and the percentage relative improvement (%RI), which divides the absolute change from the original value ($O$) to the new value ($N$) by the original value, as shown in Equation (3),

$$\%\mathrm{RI} = \frac{|N - O|}{O} \times 100. \tag{3}$$
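Both metrics are straightforward to compute; the sketch below uses a standard edit-distance alignment for WER and assumes the RI is reported as a positive percentage when the WER drops, which matches the values in Table 5.

```python
# Sketch of the two evaluation metrics defined in Equations (2) and (3).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub, del, ins
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

def relative_improvement(old_wer: float, new_wer: float) -> float:
    """Percentage relative improvement of the new WER over the original WER."""
    return 100.0 * abs(new_wer - old_wer) / old_wer

print(wer("it is a umbrella", "it is umbrella"))   # one deletion -> 25.0
print(relative_improvement(2.26, 2.01))            # ~11.06 (read speech fMMI, Table 5)
```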

4.2. Discriminative Models

The ultimate goal of discriminative modeling is to enhance the quality of ASR by properly training the statistical parameters of HMM. Although there are several others, MMI is one of the most extensively used discriminative techniques. Despite the fact that several researchers have reported significant improvements in native adult and child speech recognition using discriminative models [36,37,38], persistent improvements in non-native child speech recognition using discriminative models are still elusive. The key contribution of this article is the use of feature-space MMI and boosted feature-space MMI, which reliably outperform MMI and boosted MMI and are likely to be effective methods for discriminative training.

4.2.1. Maximum Mutual Information (MMI)

MMI aims at maximizing the posterior distribution, or likelihood of the data, for the correct word sequence while lowering the total likelihood of the data over all possible word sequences in the lexicon. In a nutshell, the MMI objective function makes the correct word hypothesis more likely whilst making the incorrect word hypotheses less likely [39,40]. The MMI objective is obtained by optimizing the mutual information over a sequence of observations $X$ with respect to the model parameters $\lambda$, where $W_r$ is the correct word transcription corresponding to the HMM of the $r$th utterance, as shown in Equation (4),

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log \frac{P(X_r, W_r)}{P(X_r)\,P(W_r)} = \sum_{r=1}^{R} \left[ \log \frac{P(X_r, W_r)}{P(X_r)} - \log P(W_r) \right] = \sum_{r=1}^{R} \left[ \log \frac{P_{\lambda}(X_r \mid W_r)^{k}\, P(W_r)}{\sum_{\hat{W}} P_{\lambda}(X_r \mid \hat{W})^{k}\, P(\hat{W})} - \log P(W_r) \right], \tag{4}$$

where $P(W_r)$ is the language model probability of the word sequence and $k$ is a scalable fudge factor used in correcting the word estimations.
MMI can be further simplified as shown in Equation (5) by treating $P(\hat{W})$ as independent of the model $\lambda$; the posterior probability is then improved simply by increasing the likelihood of the correct hypothesis (numerator lattice) and decreasing the likelihood of the competing word sequences $\hat{W}$ (denominator lattice). Figure 2 illustrates an example lattice generated for a possible word hypothesis.

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \underset{\lambda}{\arg\max} \sum_{r=1}^{R} \log \frac{P_{\lambda}(X_r \mid W_r)\, P(W_r)}{\sum_{\hat{W}} P_{\lambda}(X_r \mid \hat{W})\, P(\hat{W})} \tag{5}$$
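As a toy numerical illustration of Equation (5), the per-utterance MMI term can be computed as the numerator (reference-path) log-score minus the log-sum of the denominator-lattice scores; the scores below are hypothetical.

```python
# Toy sketch of the MMI criterion: numerator log-score minus the log-sum of
# all competing-path scores in the denominator lattice.
import numpy as np
from scipy.special import logsumexp

def mmi_objective(num_logscores, den_logscores):
    """num_logscores[r]: log P(X_r|W_r) + log P(W_r) for the reference path.
    den_logscores[r]: array of log-scores for all paths in the denominator lattice."""
    return sum(num - logsumexp(den)
               for num, den in zip(num_logscores, den_logscores))

# Two utterances; each denominator lattice also contains the reference path.
num = [-42.0, -55.0]
den = [np.array([-42.0, -47.5, -50.1]), np.array([-55.0, -56.2])]
print(mmi_objective(num, den))   # closer to 0 means the reference path dominates
```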

4.2.2. Boosted Maximum Mutual Information (bMMI)

In practice, MMI has a number of issues, including difficulty in maximizing the objective function, the high computational cost of that maximization, and poor generalization to unseen data. The language model term is therefore altered by the inclusion of a boosting parameter. This boosting factor, b = 0.05, enhances the likelihood of hypotheses containing more wrong words $\hat{W}$, which introduces more confusable competitors [41]. Compared to MMI, bMMI involves only a small exponential change, as shown in Equation (6),

$$\mathcal{F}_{\mathrm{bMMI}}(\lambda) = \underset{\lambda}{\arg\max} \sum_{r=1}^{R} \log \frac{P_{\lambda}(X_r \mid W_r)\, P(W_r)}{\sum_{\hat{W}} P_{\lambda}(X_r \mid \hat{W})\, P(\hat{W})\, e^{-b\, A(W_r, \hat{W})}}, \tag{6}$$

where $A(W_r, \hat{W})$ is the accuracy of the word sequence $\hat{W}$ with respect to $W_r$; equivalently, each arc in the generated lattice has $b$ times its accuracy subtracted from its log-likelihood.
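A corresponding toy computation of one utterance’s bMMI term subtracts $b$ times the path accuracy from each denominator log-score before the log-sum, as in Equation (6); the accuracies and scores below are hypothetical.

```python
# Toy sketch of the boosted criterion: each denominator path's log-score has
# b * accuracy(W_r, W_hat) subtracted before the log-sum.
import numpy as np
from scipy.special import logsumexp

def bmmi_utterance_term(num_logscore, den_logscores, den_accuracies, b=0.05):
    """One utterance's contribution to the bMMI objective."""
    boosted = den_logscores - b * den_accuracies   # low-accuracy paths are boosted most
    return num_logscore - logsumexp(boosted)

den_scores = np.array([-42.0, -47.5, -50.1])
den_acc = np.array([10.0, 7.0, 4.0])               # reference path has full accuracy
print(bmmi_utterance_term(-42.0, den_scores, den_acc))
```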

4.2.3. Feature-Space Maximum Mutual Information (fMMI)

The objective function of feature-space discriminative training is similar to that of MMI, except that a feature transformation is optimized: a very high-dimensional feature vector $f_\tau$ (the Gaussian posteriors) is projected by a global matrix $G$. The transformed features are mapped back into the original feature space by adding them to the original features $a_\tau$ through the linear transformation $T: a \rightarrow b$ given by $b_\tau = a_\tau + G f_\tau$.
The feature transformation is estimated using gradient descent [42] instead of the Baum–Welch technique [34] to maximize the feature-space MMI objective function, as indicated in Equation (7),

$$G_{ij} := G_{ij} + \alpha_{ij}\, \frac{\partial \mathcal{F}_{\mathrm{MMI}}(\lambda)}{\partial G_{ij}}, \tag{7}$$

where $\alpha_{ij}$ is the learning rate; the objective function of fMMI in Equation (8) is obtained from Equation (7) as

$$\mathcal{F}_{\mathrm{fMMI}}(\lambda) = \frac{\partial \mathcal{F}_{\mathrm{MMI}}(\lambda)}{\partial G_{ij}} = \sum_{\tau=1}^{T} \frac{\partial \mathcal{F}_{\mathrm{MMI}}(\lambda)}{\partial b_{\tau i}}\, f_{\tau j}. \tag{8}$$
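The sketch below illustrates Equations (7) and (8) with random tensors: the transform $b_\tau = a_\tau + G f_\tau$ and one gradient step on $G$, assuming the gradient of the objective with respect to the transformed features has already been accumulated from lattice statistics.

```python
# Minimal sketch of the feature-space transform and one gradient step.
# All tensors here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 13, 100, 50            # MFCC dim, high-dim posterior size, frames

a = rng.standard_normal((T, D))  # original features a_tau
f = rng.random((T, H))           # high-dimensional (Gaussian-posterior-like) vectors f_tau
G = np.zeros((D, H))             # global projection matrix, initialized to zero

def transform(a, f, G):
    """b_tau = a_tau + G f_tau for every frame."""
    return a + f @ G.T

# Suppose dF/db (gradient of the MMI objective w.r.t. transformed features)
# has been accumulated from numerator/denominator lattice statistics:
dF_db = rng.standard_normal((T, D))

# Equation (8): dF/dG_ij = sum_tau dF/db_tau_i * f_tau_j
dF_dG = dF_db.T @ f
# Equation (7): gradient step with a (scalar here) learning rate alpha
alpha = 1e-3
G += alpha * dF_dG

b = transform(a, f, G)
print(b.shape)                   # (50, 13): transformed features
```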

4.2.4. Boosted Feature-Space Maximum Mutual Information (fbMMI)

The training procedure of boosted feature-space MMI (fbMMI) is analogous to fMMI training, as addressed in the previous section, except that the feature transformation is optimized with reference to the bMMI objective function, as shown in Equation (9). The learning procedure and statistics accumulation in fbMMI are equivalent to those in fMMI; the significant difference is that, while running the forward–backward algorithm on the lattices, the raw phone accuracy must be estimated for each arc in addition to the acoustic and linguistic scores. The objective function of boosted feature-space MMI is given as

$$\mathcal{F}_{\mathrm{fbMMI}}(\lambda) = \frac{\partial \mathcal{F}_{\mathrm{bMMI}}(\lambda)}{\partial G_{ij}} = \sum_{\tau=1}^{T} \frac{\partial \mathcal{F}_{\mathrm{bMMI}}(\lambda)}{\partial b_{\tau i}}\, f_{\tau j}. \tag{9}$$

5. Experimental Setup

Children’s read speech, spontaneous speech, and a combination of both read and spontaneous speech have been used in the experimental setup of a non-native speech recognition system. For this purpose, discriminatively trained models are developed using the Kaldi toolkit [43]. The following is a more detailed description of children’s speech corpora and speed perturbation-based audio augmentation.

5.1. Data Interpretation

As shown in Table 1, the collected speech corpus is categorized into three parts to evaluate ASR performance across different styles of children’s speech. Children’s reading abilities are critical for enhancing comprehension and understanding. Furthermore, non-native children who are acquiring a second language such as English are known to exhibit a variety of reading miscues: lexical disfluency, hesitations, intra-word switching (combining two languages within a single word), word repetitions, and the adoption of some native-language sounds and phonology are among them. Because such miscues have not been addressed in standard children’s databases so far, ASR systems struggle to recognize non-native children’s speech. The sub-word transcriptions shown in Table 2 are especially beneficial for detecting pronunciation errors and hesitations in non-native children’s speech [44].
The datasets in this study are intended to address the aforementioned problems of children’s ASR. They therefore include both read and spontaneous speech, with an orthogonal split (i.e., no speaker overlap) of 70% for training and 30% for testing. The training set of the read speech corpus consists of 1.82 h of data from 15 children, comprising 1585 utterances with total and unique word counts of 18,154 and 155, respectively. The 0.65 h of test data consists of 5 speakers with 588 utterances, of which the total words are 6769 and the unique words are 123. Four distinct images are used to collect spontaneous speech, resulting in 0.67 h of training data that includes 569 utterances with 6906 total words and 131 unique words. There are 156 utterances in the 0.18 h of spontaneous test data, including 1896 total words and 82 unique words. As one can see, spontaneous speech has a high word count even though the number of utterances is limited. Finally, as presented in Table 1, the combined speech set is composed of both read and spontaneous speech, with an orthogonal split of 2.65 h of training data and 0.67 h of test data. The training set contains 2274 utterances with a total of 26,443 words and 256 unique words, whereas the test set includes 624 utterances with a total of 7286 words and 183 unique words.
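For reference, statistics such as those in Table 1 can be derived from plain-text transcriptions with a few lines of code; the sketch below assumes Kaldi-style files with one `utt-id transcription` line per utterance, which is an assumption rather than the exact format of the released corpus.

```python
# Sketch: count utterances, total words, and unique words from a transcription file.
from pathlib import Path

def corpus_stats(transcription_file: str):
    utterances, total_words, vocab = 0, 0, set()
    for line in Path(transcription_file).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split()
        if not parts:
            continue
        words = parts[1:]          # drop the utterance id
        utterances += 1
        total_words += len(words)
        vocab.update(words)
    return {"utterances": utterances,
            "total_words": total_words,
            "unique_words": len(vocab)}

print(corpus_stats("train/text"))   # e.g., the read-speech training split
```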

5.2. Speed Perturbation

Due to limited available training data, children’s ASR systems do not perform well with non-native accents. To tackle the problem of data scarcity, the suggested approach collects high-quality children’s speech and augments it with speed perturbation. The mathematical intuition behind speed perturbation, which resamples the original data in time domain via Sox, is as follows.
If children’s speech is represented in the time domain as $S(t)$, then warping the time axis of the audio signal by a factor of $\beta$ alters the children’s speaking rate, giving $y(t) = S(\beta t)$. The spectral-domain representation of $S(\beta t)$ is $\beta^{-1}\hat{S}(\beta^{-1}\omega)$. If $\beta < 1$, the speech signal is stretched in time (slower speech) and its spectral energy shifts toward lower frequencies; for $\beta > 1$, the signal is compressed and the energy shifts toward higher frequencies.
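A minimal sketch of this resampling view of speed perturbation is shown below, assuming soundfile and SciPy are available; the actual pipeline uses SoX via the standard Kaldi recipe, and the file names are hypothetical.

```python
# Speed perturbation as plain resampling, mirroring y(t) = S(beta * t):
# resample to len/beta samples and keep the original sampling rate, so
# beta = 0.9 yields slower, longer speech.
import soundfile as sf
from scipy.signal import resample

def speed_perturb(in_wav: str, out_wav: str, beta: float) -> None:
    """Write a copy of in_wav with its speed changed by the factor beta."""
    samples, sr = sf.read(in_wav)
    n_out = int(round(len(samples) / beta))   # beta < 1 -> more samples -> longer audio
    sf.write(out_wav, resample(samples, n_out), sr)

# 3-fold style augmentation: keep the original plus 90% and 110% speed copies
for beta in (0.9, 1.1):
    speed_perturb("child_utt.wav", f"child_utt_sp{beta}.wav", beta)
```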
The experiment is conducted with various folds (1, 3, 5, 7) that stretch or squeeze the duration of speech utterance without influencing the linguistic content of the children’s speech, as shown in Figure 3. The standard Kaldi routine [45] is used to adjust the speed of children’s speech in the proposed work. The baseline data before speed perturbation are interpreted as the Fold-1. By adjusting the speed of the original training data to 90% and 110%, two extra copies of the identical training data were created. This results in a 3-fold training set.
In addition to the standard recipes, 5-fold and 7-fold training sets were developed. With a 5-fold speed perturbation, the speed is changed to 80%, 90%, 110%, and 120% of the original rate, augmenting the training data five times. Similarly, 7-fold speed perturbation data were created by modifying the speed to 80%, 85%, 90%, 95%, 110%, and 115% of the original rate, which is labeled as such in Table 3.

6. Experimental Results and Discussion

Considering the different styles of children’s speech, several investigations were carried out by adjusting the speed perturbation factors and examining ASR performance in all aspects. Acoustic models are first used to train the baseline and speed-perturbed data, followed by the discriminative models, namely MMI, boosted MMI, feature-space MMI, and boosted feature-space MMI. This section gives a detailed description of the performance evaluation of non-native children’s speech recognition for the different speech styles.

6.1. Performance Analysis of Proposed Models on Read Speech Task

The GMM-HMM acoustic model is deployed and evaluated using 0.65 h (588 utterances) of baseline children’s read speech data. As shown in Table 4, the proposed models are first experimented with without introducing any speed perturbation factors; these are referred to as the baseline results. Monophone (mono) training used 500 total Gaussians with a silence-boosting factor of 1.25. This was followed by triphone (Tri1) training with 500 tied states (senones) and 2000 cluster leaves and triphone (Tri2) modeling with 600 senones and 2500 leaves. A minimum WER of 2.22% is achieved by the MFCC features when combined with the delta and double-delta features. Further, using 700 tied states and 3000 cluster leaves, linear discriminant analysis with maximum likelihood linear transformation (LDA-MLLT) (tri3) is applied to the features, which are subsequently modeled using the discriminative approaches, yielding a WER of 2.13% with boosted feature-space MMI, a relative improvement of 4% from the discriminative training. These baseline results are compared with the 3-fold, 5-fold, and 7-fold speed-perturbed data, as shown in Figure 4a. The error rates of MMI, bMMI, fMMI, and fbMMI decrease significantly as the number of folds accumulates, as depicted in Figure 4b. Compared to MMI and bMMI, the recognition performance of fMMI and fbMMI is enhanced at the 7-fold augmentation, with RIs of 11.06% and 3.7%, respectively.

6.2. Performance Analysis of Proposed Models on Spontaneous Speech Task

The GMM-HMM acoustic model uses the same configuration of Gaussians, tied states, and cluster leaves. Without speed perturbation, Tri1 and Tri2 achieve lower WERs than fMMI and fbMMI, because the models are trained with relatively little data in the spontaneous speech task. Owing to their L2 proficiency [46], children tend to include long pauses, repeated words, and speech crossovers in spontaneous speech tasks, which results in a performance drop-off in ASR compared with the read speech task. Long pauses can be mitigated with speed-based data augmentation, and the recognition performance of the feature-space discriminative techniques can then be optimized.
After augmentation, the 0.67 h of baseline training data is 7-fold perturbed to 4.95 h, resulting in a decrease in WER from 17.14% to 15.77% for fMMI and from 17.09% to 15.51% for fbMMI, as shown in Figure 5a. Relative improvements of 7.99% and 9.24% are achieved for fMMI and fbMMI, respectively, on spontaneous speech, as depicted in Figure 5b.

6.3. Performance Analysis of Proposed Models on Combined Speech Task

As shown in the preceding sections, there is a significant variation in ASR performance between the read speech and spontaneous speech of non-native children under speed perturbation-based augmentation. To make the ASR a generic model for any type of speech, the read and spontaneous speech datasets are therefore merged and explored with the proposed models.
With the 7-fold data augmentation, 2.65 h of training data is increased to approximately 20 h of combined speech. The proposed models are evaluated on 0.67 h of held-out data without augmentation, and the best baseline WER, 2.81%, is obtained with fbMMI. All the discriminative models performed similarly on this type of speech, as seen in Figure 6a. With the 5-fold speed perturbation, the MMI and bMMI models perform better than with the 7-fold one, whereas the fMMI and fbMMI models perform similarly under the 5-fold and 7-fold augmentation. As shown in Figure 6b and presented in Table 5, fMMI and fbMMI obtain performance improvements of 11.37% and 11.74%, respectively, with the 7-fold speed perturbation.

6.4. Comparative Analysis of Earlier State of the Art on Non-Native Children Speech Recognition

As reported in Table 6, the majority of non-native children’s speech recognition research has centered on languages such as Italian, German, Chinese, Swedish, and others, but ASR for Indian children who speak Telugu as their first language and English as a second has yet to be examined. In comparison to previous state-of-the-art models, this paper is a forerunner in addressing the issues faced by L2 learners while using ASR systems and has therefore developed a discriminatively trained non-native children’s ASR.

7. Conclusions

In this paper, a corpus of non-native children acquiring English as a second language was collected for a children’s speech recognition system and analyzed in a limited-data scenario. Speed perturbation-based data augmentation with various folds (1, 3, 5, and 7) was applied to the original data to handle data scarcity. Through a comprehensive set of feature-space discriminative training experiments, the combined use of read and spontaneous speech demonstrated the promising performance of non-native children’s English ASR. The results revealed that systems trained on pooled 7-fold speed perturbation-based data augmentation outperformed the baseline (1-fold) models, with relative improvements of 11.37% and 11.74% for fMMI and fbMMI, respectively. The performance of non-native children’s ASR may be improved significantly in the near future by developing more advanced DNN-HMM or end-to-end ASR systems, and the approach may also benefit other resource-poor ASR applications.

Author Contributions

Conceptualization, K.R. and M.B.; methodology, K.R.; software, K.R.; validation, K.R. and M.B.; formal analysis, K.R.; investigation, K.R.; resources, K.R. and M.B.; data curation, K.R.; writing—original draft preparation, K.R.; writing—review and editing, M.B.; visualization, M.B.; supervision, M.B.; project administration, M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The open-access data that support the findings of this study are made publicly available in the Kaggle repository, https://doi.org/10.34740/KAGGLE/DS/2160743, accessed on 9 May 2022. More details about the data collection are given in Section 3.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.L.; Stolcke, A.; Yu, D.; Zweig, G. Toward human parity in conversational speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2410–2423. [Google Scholar] [CrossRef]
  2. Park, S.; Culnan, J. A comparison between native and non-native speech for automatic speech recognition. J. Acoust. Soc. Am. 2019, 145, 1827. [Google Scholar] [CrossRef]
  3. Pandey, K.K.; Jha, S. Exploring the interrelationship between culture and learning: The case of English as a second language in India. Asian Englishes 2021, 1–17. [Google Scholar] [CrossRef]
  4. O’Brien, M.G.; Derwing, T.M.; Cucchiarini, C.; Hardison, D.M.; Mixdorff, H.; Thomson, R.I.; Strik, H.; Levis, J.M.; Munro, M.J.; Foote, J.A.; et al. Directions for the future of technology in pronunciation research and teaching. J. Second Lang. Pronunc. 2018, 4, 182–207. [Google Scholar] [CrossRef] [Green Version]
  5. Mulholland, M.; Lopez, M.; Evanini, K.; Loukina, A.; Qian, Y. A comparison of ASR and human errors for transcription of non-native spontaneous speech. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 5855–5859. [Google Scholar]
  6. Kovtun, V.; Kovtun, O.; Semenov, A. Entropy-Argumentative Concept of Computational Phonetic Analysis of Speech Taking into Account Dialect and Individuality of Phonation. Entropy 2022, 24, 1006. [Google Scholar] [CrossRef]
  7. Yarra, C.; Srinivasan, A.; Gottimukkala, S.; Ghosh, P.K. SPIRE-fluent: A Self-Learning App for Tutoring Oral Fluency to Second Language English Learners. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 968–969. [Google Scholar]
  8. Kelly, A.C.; Karamichali, E.; Saeb, A.; Veselỳ, K.; Parslow, N.; Deng, A.; Letondor, A.; O’Regan, R.; Zhou, Q. Soapbox Labs Verification Platform for Child Speech. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 486–487. [Google Scholar]
  9. Zhang, J.; Zhang, Z.; Wang, Y.; Yan, Z.; Song, Q.; Huang, Y.; Li, K.; Povey, D.; Wang, Y. Speechocean762: An open-source non-native English speech corpus for pronunciation assessment. arXiv 2021, arXiv:2104.01378. [Google Scholar]
  10. Evanini, K.; Wang, X. Automated speech scoring for non-native middle school students with multiple task types. In Proceedings of the INTERSPEECH, Lyon, France, 25–29 August 2013; pp. 2435–2439. [Google Scholar]
  11. Mostow, J. Why and how our automated reading tutor listens. In Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training (ISADEPT), Stockholm, Sweden, 6–8 June 2012; pp. 43–52. [Google Scholar]
  12. Radha, K.; Bansal, M.; Shabber, S.M. Accent Classification of Native and Non-Native Children using Harmonic Pitch. In Proceedings of the 2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP), Vijayawada, India, 12–14 February 2022; pp. 1–6. [Google Scholar] [CrossRef]
  13. Bansal, M.; Sircar, P. Phoneme Based Model for Gender Identification and Adult-Child Classification. In Proceedings of the 2019 13th International Conference on Signal Processing and Communication Systems (ICSPCS), Surfers Paradise, Australia, 16–18 December 2019; pp. 1–7. [Google Scholar] [CrossRef]
  14. Bansal, M.; Sircar, P. AFM Signal Model for Digit Recognition. In Proceedings of the 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 25–27 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 354–358. [Google Scholar]
  15. Gretter, R.; Matassoni, M.; Falavigna, G.D.; Keelan, E.; Leong, C.W. Overview of the INTERSPEECH TLT2020 shared task on ASR for non-native children’s speech. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
  16. Li, Q.; Russell, M.J. An analysis of the causes of increased error rates in children’s speech recognition. In Proceedings of the Seventh International Conference on Spoken Language Processing, Denver, CO, USA, 16–20 September 2002. [Google Scholar]
  17. Shivakumar, P.G.; Georgiou, P. Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations. Comput. Speech Lang. 2020, 63, 101077. [Google Scholar] [CrossRef] [Green Version]
  18. Matassoni, M.; Gretter, R.; Falavigna, D.; Giuliani, D. Non-native children speech recognition through transfer learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6229–6233. [Google Scholar]
  19. Laptev, A.; Andrusenko, A.; Podluzhny, I.; Mitrofanov, A.; Medennikov, I.; Matveev, Y. Dynamic acoustic unit augmentation with bpe-dropout for low-resource end-to-end speech recognition. Sensors 2021, 21, 3063. [Google Scholar] [CrossRef]
  20. Qian, Y.; Evanini, K.; Wang, X.; Lee, C.M.; Mulholland, M. Bidirectional LSTM-RNN for Improving Automated Assessment of Non-Native Children’s Speech. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1417–1421. [Google Scholar]
  21. Kathania, H.; Singh, M.; Grósz, T.; Kurimo, M. Data augmentation using prosody and false starts to recognize non-native children’s speech. arXiv 2020, arXiv:2008.12914. [Google Scholar]
  22. Lo, T.H.; Chao, F.A.; Weng, S.Y.; Chen, B. The NTNU system at the interspeech 2020 non-native Children’s speech ASR challenge. arXiv 2020, arXiv:2005.08433. [Google Scholar]
  23. Knill, K.M.; Wang, L.; Wang, Y.; Wu, X.; Gales, M.J. Non-Native Children’s Automatic Speech Recognition: The INTERSPEECH 2020 Shared Task ALTA Systems. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 255–259. [Google Scholar]
  24. Shahin, M.A.; Lu, R.; Epps, J.; Ahmed, B. UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 265–268. [Google Scholar]
  25. Chen, G.; Na, X.; Wang, Y.; Yan, Z.; Zhang, J.; Ma, S.; Wang, Y. Data Augmentation For Children’s Speech Recognition–The “Ethiopian” System For The SLT 2021 Children Speech Recognition Challenge. arXiv 2020, arXiv:2011.04547. [Google Scholar]
  26. Ghazi, S.R.; Ullah, K. Concrete operational stage of Piaget’s cognitive development theory: An implication in learning general science. Gomal Univ. J. Res. [GUJR] 2015, 31, 78–89. [Google Scholar]
  27. SurveyLex. Available online: http://neurolex.co/uploads/ (accessed on 1 January 2022).
  28. Schwoebel, J. SurveyLex. Available online: https://www.surveylex.com/ (accessed on 1 January 2022).
  29. Fernando, S.; Moore, R.K.; Cameron, D.; Collins, E.C.; Millings, A.; Sharkey, A.J.; Prescott, T.J. Automatic recognition of child speech for robotic applications in noisy environments. arXiv 2016, arXiv:1611.02695. [Google Scholar]
  30. Radha, K.; Bansal, M. Non-Native Children Speech Mini Corpus. Available online: https://doi.org/10.34740/KAGGLE/DS/2160743 (accessed on 9 May 2022).
  31. Bagwell, C. SoX (Sound eXchange). Available online: http://sox.sourceforge.net/SoX/Resampling (accessed on 5 February 2022).
  32. Goodman, J.T. A bit of progress in language modeling. Comput. Speech Lang. 2001, 15, 403–434. [Google Scholar] [CrossRef]
  33. Mohri, M.; Pereira, F.; Riley, M. Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing; Springer: Berlin/Heidelberg, Germany, 2008; pp. 559–584. [Google Scholar]
  34. Ben-Yishai, A.; Burshtein, D. A discriminative training algorithm for hidden Markov models. IEEE Trans. Speech Audio Process. 2004, 12, 204–217. [Google Scholar] [CrossRef]
  35. Morris, A.C.; Maier, V.; Green, P. From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In Proceedings of the Eighth International Conference on Spoken Language Processing, Jeju Island, Korea, 4–8 October 2004. [Google Scholar]
  36. Dua, M.; Aggarwal, R.K.; Biswas, M. Discriminatively trained continuous Hindi speech recognition system using interpolated recurrent neural network language modeling. Neural Comput. Appl. 2019, 31, 6747–6755. [Google Scholar] [CrossRef]
  37. Lu, C.; Tang, C.; Zhang, J.; Zong, Y. Progressively Discriminative Transfer Network for Cross-Corpus Speech Emotion Recognition. Entropy 2022, 24, 1046. [Google Scholar] [CrossRef]
  38. Hasija, T.; Kadyan, V.; Guleria, K.; Alharbi, A.; Alyami, H.; Goyal, N. Prosodic Feature-Based Discriminatively Trained Low Resource Speech Recognition System. Sustainability 2022, 14, 614. [Google Scholar] [CrossRef]
  39. Gillick, D.; Wegmann, S.; Gillick, L. Discriminative training for speech recognition is compensating for statistical dependence in the HMM framework. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 4745–4748. [Google Scholar]
  40. Heigold, G.; Ney, H.; Schluter, R.; Wiesler, S. Discriminative training for automatic speech recognition: Modeling, criteria, optimization, implementation, and performance. IEEE Signal Process. Mag. 2012, 29, 58–69. [Google Scholar] [CrossRef]
  41. Povey, D.; Kanevsky, D.; Kingsbury, B.; Ramabhadran, B.; Saon, G.; Visweswariah, K. Boosted MMI for model and feature-space discriminative training. In Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, 30 March–4 April 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 4057–4060. [Google Scholar]
  42. Seide, F.; Fu, H.; Droppo, J.; Li, G.; Yu, D. On parallelizability of stochastic gradient descent for speech DNNS. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 235–239. [Google Scholar] [CrossRef]
  43. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011; IEEE Signal Processing Society: Piscataway, NJ, USA, 2011. [Google Scholar]
  44. Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8132–8136. [Google Scholar]
  45. Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  46. Hulstijn, J.H. Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-language assessment. Lang. Assess. Q. 2011, 8, 229–249. [Google Scholar] [CrossRef] [Green Version]
  47. Kathania, H.K.; Kadiri, S.R.; Alku, P.; Kurimo, M. Using data augmentation and time-scale modification to improve asr of children’s speech in noisy environments. Appl. Sci. 2021, 11, 8420. [Google Scholar] [CrossRef]
Figure 1. Proposed discriminatively trained non-native children speech recognition using speed perturbation-based audio augmentation.
Figure 2. An example word lattice of spontaneous speech utterance: “It is a umbrella. It protects us from rain”.
Figure 3. Speed perturbation-based data augmentation.
Figure 4. Read speech: (a) word error rate (%) for baseline (fold-1) and synthetic (fold-3, 5, 7) data with the discriminative techniques and (b) relative WER improvement (%) of fMMI and fbMMI from baseline to 7-fold speed-perturbed data.
Figure 5. Spontaneous speech: (a) word error rate (%) for baseline (fold-1) and synthetic (fold-3, 5, 7) data with the discriminative techniques and (b) relative WER improvement (%) of fMMI and fbMMI from baseline to 7-fold speed-perturbed data.
Figure 6. Combined speech: (a) word error rate (%) for baseline (fold-1) and synthetic (fold-3, 5, 7) data with the discriminative techniques and (b) relative WER improvement (%) of fMMI and fbMMI from baseline to 7-fold speed-perturbed data.
Table 1. Details of non-native children speech corpora.

| Dataset | Split | Utterances | Total Words | Unique Words | Duration (hours) |
|---|---|---|---|---|---|
| Read Speech | Train | 1585 | 18,154 | 155 | 1.82 |
| Read Speech | Test | 588 | 6769 | 123 | 0.65 |
| Spontaneous Speech | Train | 569 | 6906 | 131 | 0.67 |
| Spontaneous Speech | Test | 156 | 1896 | 82 | 0.18 |
| Combined Speech | Train | 2274 | 26,443 | 256 | 2.65 |
| Combined Speech | Test | 624 | 7286 | 183 | 0.67 |
Table 2. Examples of disfluency in non-native children speech corpus.

| Type of Disfluency | Example |
|---|---|
| Word Repetitions | this is a umbrella it is it is useful when it rains |
| Word Fragments | there was a sa- sad dog he did not have any friend |
| Intra-Word Switching | everyone in my schooluu likes icecream |
| Hesitations | on his way home he hmm he crossed a river and saw another dog |
| Ungrammatical Words | this is a pencil we will use it when when we should write our home works |
Table 3. Audio augmented data of children training set with different speed perturbation factors. Different number of hours (#Hrs) and the respective utterances (#Tokens) for training are used.

| Dataset | Fold | Perturb. Factors | #Hrs | #Tokens |
|---|---|---|---|---|
| Read Speech | 1 | Baseline | 1.82 | 1.58 K |
| Read Speech | 3 | 0.9, 1.0, 1.1 | 5.51 | 4.75 K |
| Read Speech | 5 | 0.8, 0.9, 1.0, 1.1, 1.2 | 9.32 | 7.92 K |
| Read Speech | 7 | 0.8, 0.85, 0.9, 0.95, 1.0, 1.1, 1.15 | 13.46 | 11.1 K |
| Spontaneous Speech | 1 | Baseline | 0.67 | 0.56 K |
| Spontaneous Speech | 3 | 0.9, 1.0, 1.1 | 2.03 | 1.7 K |
| Spontaneous Speech | 5 | 0.8, 0.9, 1.0, 1.1, 1.2 | 3.43 | 2.84 K |
| Spontaneous Speech | 7 | 0.8, 0.85, 0.9, 0.95, 1.0, 1.1, 1.15 | 4.95 | 3.98 K |
| Combined Speech | 1 | Baseline | 2.65 | 2.27 K |
| Combined Speech | 3 | 0.9, 1.0, 1.1 | 8 | 6.82 K |
| Combined Speech | 5 | 0.8, 0.9, 1.0, 1.1, 1.2 | 13.53 | 11.37 K |
| Combined Speech | 7 | 0.8, 0.85, 0.9, 0.95, 1.0, 1.1, 1.15 | 19.53 | 15.9 K |
Table 4. Experimental results of acoustic (Mono, Tri1, Tri2, Tri3) and discriminative (MMI, bMMI, fMMI, fbMMI) models in %WER for original and synthetic data with different styles of children’s speech.

| Type of Speech | Fold | Mono | Tri1 | Tri2 | Tri3 | MMI | bMMI | fMMI | fbMMI |
|---|---|---|---|---|---|---|---|---|---|
| Read Speech | 1 | 2.87 | 2.30 | 2.22 | 2.66 | 2.39 | 2.42 | 2.26 | 2.13 |
| Read Speech | 3 | 2.60 | 2.67 | 2.22 | 2.25 | 2.11 | 2.20 | 2.07 | 2.07 |
| Read Speech | 5 | 2.93 | 2.35 | 2.07 | 2.33 | 2.28 | 2.26 | 2.06 | 2.05 |
| Read Speech | 7 | 2.90 | 2.16 | 2.06 | 2.42 | 2.16 | 2.33 | 2.01 | 2.05 |
| Spontaneous Speech | 1 | 19.51 | 15.82 | 16.17 | 17.25 | 17.25 | 17.35 | 17.14 | 17.09 |
| Spontaneous Speech | 3 | 22.73 | 18.67 | 16.09 | 17.09 | 16.98 | 16.98 | 16.72 | 16.56 |
| Spontaneous Speech | 5 | 21.78 | 16.77 | 16.51 | 15.98 | 16.35 | 16.24 | 15.98 | 15.56 |
| Spontaneous Speech | 7 | 21.89 | 16.24 | 15.55 | 15.82 | 16.03 | 16.09 | 15.77 | 15.51 |
| Combined Speech | 1 | 3.43 | 2.85 | 2.96 | 3.03 | 2.84 | 2.83 | 2.90 | 2.81 |
| Combined Speech | 3 | 3.46 | 2.99 | 2.68 | 2.72 | 2.85 | 2.84 | 2.79 | 2.72 |
| Combined Speech | 5 | 3.69 | 2.83 | 2.64 | 2.92 | 2.59 | 2.61 | 2.58 | 2.57 |
| Combined Speech | 7 | 3.72 | 2.85 | 2.90 | 2.79 | 2.61 | 2.73 | 2.57 | 2.48 |
Table 5. Comparison of baseline and speed perturbation on different types of children’s speech. The relative improvement is computed between the baseline (fold-1) and the 7-fold system.

| Type of Speech | Fold | #Hrs | fMMI WER (%) | fbMMI WER (%) | fMMI Rel. Improvement (%) | fbMMI Rel. Improvement (%) |
|---|---|---|---|---|---|---|
| Read Speech | 1 | 1.82 | 2.26 | 2.13 | - | - |
| Read Speech | 7 | 13.46 | 2.01 | 2.05 | 11.06 | 3.7 |
| Spontaneous Speech | 1 | 0.67 | 17.14 | 17.09 | - | - |
| Spontaneous Speech | 7 | 4.95 | 15.77 | 15.51 | 7.99 | 9.24 |
| Combined Speech | 1 | 2.65 | 2.90 | 2.81 | - | - |
| Combined Speech | 7 | 19.53 | 2.57 | 2.48 | 11.37 | 11.74 |
Table 6. Comparative analysis of earlier state-of-the-art results.

| Year/Author | Augmentation Type | Dataset Type | Front-End Approach | State-of-the-Art Model | Performance |
|---|---|---|---|---|---|
| 2017 [20] | No augmentation | English read, picture narration, and spontaneous speech (11–15 years) from native Arabic, Chinese, French, German, and many other children speaking English | MFCC | BiLSTM-RNN | WER of 13.4% is obtained |
| 2018 [18] | No augmentation | Italian, German, English, and Swedish children aged 9–10 years | MFCC | DNN | Non-native adaptation was used in transfer learning and results in 14.2% WER for Italian and 15% for German children speaking English |
| 2020 [21] | Prosody-based augmentation, spectrogram augmentation | TLT non-native corpus—English read speech from native Italian children (9–16 years) | MFCC | TDNN+BiLSTM+VTLN | WER of 18.71% with spectrogram augmentation |
| 2020 [22] | Speed perturbation, spectrogram perturbation | TLT non-native corpus—English read speech from native Italian children (9–16 years) | MFCC + i-vectors, CMVN, VTLN | TDNN-F, CNN-TDNN-F | WER is 17.59% with semi-supervised learning |
| 2020 [23] | Spectrogram augmentation | TLT non-native English, German, and Italian children corpus—spontaneous speech (9–16 years) | MFCC | TDNN-F+LSTM | WER is 15.7% by combining all systems independent of grade |
| 2020 [24] | Speed perturbation, room impulse response (RIR), babble noise, non-speech noise | OGI, MyST, CU, CMU (5–16 years) read and spontaneous speech | MFCC | GMM-HMM, CNN+TDNN-F | WER is 16.59% with min. Bayes-risk decoding |
| 2020 [25] | Pitch, speed, volume, tempo, reverberation perturbations | SLT Mandarin children read and conversational speech (4–16 years) | MFCC | CNN-TDNN-F, ESPnet | CER is 16.48% by combining all types of perturbations |
| 2021 [47] | Data augmentation, time-scale modification | Clean adult speech WSJCAM0 (train data), noisy children’s speech PF-STAR (test data) | MFCC | DNN-HMM | WER is 14.88% by all types of data augmentation in combined system |
| Proposed work | Speed perturbation: 3-way, 5-way, and 7-way | 7–12 years of English read and spontaneous speech from native Indian (Telugu) children | MFCC, CMVN | GMM-HMM, MMI, bMMI, fMMI, fbMMI | WER for different styles of speech: 2.01% (read), 15.51% (spontaneous), 2.48% (combined) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
