Article

Research on a Mongolian Text to Speech Model Based on Ghost and ILPCnet

1 School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China
2 College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(2), 625; https://doi.org/10.3390/app14020625
Submission received: 14 December 2023 / Revised: 6 January 2024 / Accepted: 7 January 2024 / Published: 11 January 2024
(This article belongs to the Special Issue Audio, Speech and Language Processing)

Featured Application

Mongolian speech synthesis.

Abstract

The core challenge of speech synthesis technology is converting text information into audible speech that meets users' needs. In recent years, the quality of speech synthesized by end-to-end models has improved significantly. However, owing to the characteristics of the Mongolian language and the lack of an audio corpus, research on Mongolian speech synthesis remains limited, and problems with performance and synthesis quality persist. First, the Mongolian phoneme information was further refined and a Bang-based pre-training model was constructed to reduce the word error rate of synthesized Mongolian speech. Second, a Mongolian speech synthesis model based on Ghost and ILPCnet, named the Ghost-ILPCnet model, was proposed. It improves on the Para-WaveNet acoustic model by replacing ordinary convolution blocks with stacked Ghost modules, generating Mongolian acoustic features in parallel and increasing the speed of speech generation. At the same time, the improved ILPCnet vocoder offers high synthesis quality and low complexity compared with other vocoders. Finally, extensive experiments were conducted to verify the effectiveness of the proposed model. The experimental results show that the Ghost-ILPCnet model has a simple structure, fewer parameters, and lower hardware requirements, and can be trained in parallel. The mean opinion score of its synthesized speech reached 4.48 and the real-time factor reached 0.0041. The model ensures the naturalness and clarity of the synthesized speech, accelerates synthesis, and effectively improves the performance of Mongolian speech synthesis.

1. Introduction

The main goal of speech synthesis technology is to convert text information into sound so that it can serve various voice applications and improve the user experience, bringing convenience to daily life. At this stage, text-to-speech (TTS) technology has become an integral part of human–computer interaction, has attracted attention across many application fields in the scientific community [1], and has achieved remarkable results.
Over the past few decades, speech synthesis models have continued to evolve. Among them, concatenative (speech-splicing), acoustic-parameter, and statistical parametric speech synthesis models are the most common. However, these models are complex to build and implement, require extensive linguistic knowledge, and demand a large investment of time and effort. Compared with human speech, the rhythm and clarity of their synthesized audio are often unstable and sound unnatural. In 2006, Hinton et al. proposed deep learning (DL), a new machine learning architecture [2] that effectively combines deep neural networks with unsupervised layer-by-layer initialization, greatly improving the efficiency and accuracy of a new generation of artificial intelligence.
In recent years, deep learning has become a key research direction of the new generation of artificial intelligence. Deep learning models have been widely applied in related fields, such as visual processing and natural language processing, and have achieved major breakthroughs. In 2016, the WaveNet [3] model was proposed. It captures audio sampling points through a specially designed CNN, converts the sequence into linear features, and automatically generates audio waveforms. It not only solved the problem of modeling audio waveforms at a high time resolution, but its synthesized speech also surpassed most speech synthesis models at the time, which was groundbreaking. In 2017, Sotelo et al. proposed a new end-to-end speech synthesis model, Char2Wav [4]. Unlike traditional speech synthesis models, it can be trained directly on input text, output acoustic features, and finally obtain audio through SampleRNN [5] with a manually set time window size. Since then, end-to-end speech synthesis has been widely recognized by researchers for its efficiency and accuracy. In the same year, the Baidu team launched DeepVoice [6], an end-to-end speech synthesis system built entirely from deep neural networks (DNNs) and capable of generating efficient and accurate speech. It mainly relies on a grapheme-to-phoneme model [7] and uses an audio synthesis component with a three-layer bidirectional encoder and a three-layer unidirectional decoder with gated recurrent units (GRUs). This model has no complicated processing pipeline, minimizes manual intervention, and laid the foundation for true end-to-end speech synthesis that directly maps characters to speech. Subsequently, DeepVoice2 [8] and DeepVoice3 [9] were proposed, with significantly improved audio quality compared to DeepVoice; DeepVoice3 achieved end-to-end synthesis directly from text to speech. In 2020, Peng et al. proposed ParaNet, a feedforward neural network (FNN) model [10] that generates mel spectrograms in parallel from input text sequences, greatly improving computational efficiency and accuracy. ParaNet synthesizes speech 46 times faster than DeepVoice3, and its output is clearer and more natural.
With the continuous development of natural language processing technology, speech synthesis for mainstream languages such as Chinese and English has made great progress, and speech synthesis for minority languages has also received more and more attention. Mongolian, a typical low-resource language and one of China's minority languages, is widely used in the Inner Mongolia Autonomous Region and has a long history and rich cultural heritage. Because limited technical resources are available for Mongolian informatization work at home and abroad, the development of its information processing technology lags behind. In order to promote intelligent life in areas where Mongolian is the main language, experts and scholars have carried out many active explorations and practices. In 2017, Liu Rui et al. applied a deep neural network-based acoustic model to Mongolian speech synthesis and built a complete Mongolian speech synthesis system based on the characteristics of Mongolian [11], providing a reference for subsequent Mongolian speech synthesis research. In 2019, Liu Zhinan proposed an end-to-end Mongolian speech synthesis model [12] based on an encoder–decoder structure with an attention mechanism. It used a hybrid method combining rules and statistics to convert Mongolian letters to phonemes, which effectively reduced the word error rate and phoneme error rate. In 2022, Liu Rui et al. proposed MonTTS [13], a completely non-autoregressive, real-time, high-fidelity Mongolian speech synthesis model based on FastSpeech2, which further improved the efficiency and quality of Mongolian speech synthesis.
The abovementioned research shows that, with the continuous expansion of Mongolian speech synthesis research, the synthesized speech quality of the speech synthesis model has been greatly improved. However, due to the characteristics of the Mongolian language and the lack of an audio corpus, the Mongolian speech synthesis model still has problems, such as a poor real-time synthesis rate, large model parameters, complex structure, mispronunciation, and missing words in the synthesized speech. This paper continues to explore Mongolian speech synthesis technology and strives to improve the quality and speed of synthesized speech. This not only promotes the development of Mongolian intelligent information processing technology, but also creates a better, more efficient, more convenient, and faster user experience. Focusing on the abovementioned issues, this paper proposes to build a Mongolian phoneme pre-training model based on Bang and a speech synthesis model suitable for traditional Mongolian, named Ghost-ILPCnet, and conducts a large number of data experiments to verify its effectiveness.

2. Para-WaveNet Mongolian Speech Synthesis Model

In recent years, ParaNet has been widely used for English and Chinese speech synthesis. Since Mongolian writing and pronunciation are more complex than those of English, some complex pronunciation rules require a large amount of training data to learn. Especially when certain rules appear infrequently in the training data, it is difficult for the neural network to learn them fully. Coupled with the lack of a Mongolian audio corpus, this leads to errors in the synthesized speech. If ParaNet is applied directly to Mongolian speech synthesis with raw Mongolian text as input, word skipping and wrong words will occur. Phonemes are the most basic units of pronunciation, such as the initial consonants and finals in Mongolian, and they directly reflect pronunciation attributes. Therefore, using phoneme sequences as inputs can effectively avoid these errors.
In order to solve the abovementioned problems, this paper improves the front-end module of the ParaNet acoustic model and adds a Mongolian character-to-phoneme module. By feeding the converted phoneme sequence into the acoustic model, the pronunciation of the text can be simulated more accurately, and the <phoneme sequence, speech> data pairs can be used to match the acoustic spectrum of the speech more effectively. These changes greatly reduce the word error rate of the synthesized speech and further improve the sound quality of synthesized Mongolian speech. The overall structure of the model is shown in Figure 1.
Traditional Mongolian letters have different written forms at the beginning, middle, and end of a word; for example, a letter may be written as “ᠶᠠ‍ᠶᠠ‍” at the beginning of a word, as “᠊ᠶᠠ᠊ ” in the middle, and as “᠊ᠶ᠎ᠠ” at the end. This variation increases the complexity of training a Mongolian speech synthesis model, and the generated speech also suffers from problems such as word skipping and missing words. Therefore, to handle the complex writing of Mongolian, the front-end module that converts Mongolian characters into phonemes first converts the Mongolian characters into their corresponding Latin forms [14], and then converts the Latin-form characters into phonemes through preprocessing. The process of converting Latin characters into phonemes is called grapheme-to-phoneme (G2P) conversion. The processing flow is shown in Figure 2.
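For illustration, a minimal sketch of this two-stage front end is given below. The transliteration and grapheme-to-phoneme tables are hypothetical placeholders (only the example word and phonemes quoted in Section 3.1 are taken from the paper) and do not reflect the actual conversion rules used in our system.

```python
# Illustrative sketch of the two-stage front end: traditional Mongolian text
# -> Latin transliteration -> phoneme sequence (G2P). The mapping tables are
# hypothetical placeholders, not the actual conversion rules.

MONGOLIAN_TO_LATIN = {
    "ᠣᠳᠣ": "vdv",              # example word from Section 3.1 ("now")
}

LATIN_G2P = {
    "vdv": ["ɔ", "d", "ɔ:"],    # phoneme sequence from Section 3.1
}


def text_to_phonemes(mongolian_words):
    """Convert a list of Mongolian words into a flat phoneme sequence."""
    phonemes = []
    for word in mongolian_words:
        latin = MONGOLIAN_TO_LATIN.get(word)
        if latin is None:
            continue            # a real system falls back to transliteration rules
        phonemes.extend(LATIN_G2P.get(latin, []))
    return phonemes


print(text_to_phonemes(["ᠣᠳᠣ"]))   # ['ɔ', 'd', 'ɔ:']
```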
In the vocoder part, the WaveNet vocoder is used to generate the final waveform and synthesize Mongolian speech. WaveNet is a type of convolutional neural network (CNN). In WaveNet, the CNN takes the original signal as the input and outputs one synthetic sample at a time by sampling from the SoftMax distribution of the signal value.

3. Mongolian Speech Synthesis Model Based on Ghost-ILPCnet

The speech synthesized by the Para-WaveNet Mongolian speech synthesis model has basically achieved intelligible effects, but there are still problems in the quality and performance of the speech synthesis, making it difficult to achieve a practical application. Based on the abovementioned reasons, this paper starts from the perspective of Mongolian language characteristics, improves the Para-WaveNet model, and proposes a model suitable for Mongolian speech synthesis named Ghost-ILPCnet.
Firstly, the Bang-based Mongolian phoneme pre-training model [15] is added to the front-end processing part of the Para-WaveNet Mongolian speech synthesis model. With larger-scale phoneme pre-training, better representations can be extracted from the phoneme sequence, which effectively reduces the word error rate. The encoder converts phoneme sequences into latent feature representations, providing the keys and values of the phoneme representations. Secondly, in the attention block of the encoder–decoder structure, STFT loss and adversarial loss are used to optimize the training steps, replacing the training of knowledge distillation (teacher–student) models and convolution networks. Because density distillation in the conventional teacher–student framework is no longer required, the entire model can be trained easily, even with a small number of parameters. At the same time, the Ghost module replaces the traditional convolution block, which improves the speed and stability of speech generation by the ParaNet model. Finally, a vocoder suitable for Mongolian speech synthesis, ILPCnet, is proposed to replace WaveNet. The overall structure of the model is shown in Figure 3. The three modules, the Mongolian phoneme pre-training model, the encoder–decoder, and the vocoder, are detailed in Section 3.1, Section 3.2 and Section 3.3.

3.1. Mongolian Phoneme Pre-Training Model Based on Bang

For the cases where the generated audio had reduced naturalness and insufficient authenticity, we further analyzed the causes and found that Mongolian contains glyphs with the same shape but different pronunciations and meanings; that is, the same letters have different meanings and pronunciations in different contexts. For example, when the Mongolian word “ᠣᠳᠣ” appears in a context describing tense, it means “now”; its transliterated Latin form is “vdv” and the corresponding phoneme sequence is [ɔ d ɔ:]. When it appears in a context describing the sky, it means “star”; its Latin form is “vd” and the corresponding phoneme sequence is [ɔ d]. Therefore, in the front-end processing part, this paper proposes building a Mongolian phoneme pre-training model based on Bang to further strengthen the training of Mongolian phonemes, so that the status of a Mongolian word in different contexts can be determined.
Bang consists of a multi-layer stacked Transformer encoder using a self-attention mechanism and a multi-layer stacked Transformer decoder using a cross-stream visible multi-stream self-attention mechanism [16]. The Mongolian phoneme pre-training model is based on the Bang structure and uses the input sequence $X = \{x_1, x_2, \dots, x_{|X|}\}$ to generate a predicted target sequence $Y = \{y_1, y_2, \dots, y_{|Y|}\}$. Its purpose is to achieve an accurate alignment of Mongolian phonemes and acoustic features [17] to reduce the rates of word errors and word omissions in synthesized Mongolian speech [18].
In the Bang pre-training model, for each phoneme $y_t$ in the original Mongolian target sequence $Y$, the predicted Mongolian target $\hat{Y}$ considers masking ([MASK]) its $i$ previous phonemes to optimize the entire output sequence. Masked language modeling (MLM) and next sentence prediction (NSP) are the two important pre-training tasks in Bang [19]. The workflow is shown in Figure 4. M-S (mainstream) denotes the stream containing the real Mongolian phonemes, while P-S (predicting stream) predicts Mongolian phonemes and contains [M]. In order to obtain the preceding real-word and [MASK] phoneme information, [M] in a P-S uses the information in the M-S and in the preceding P-S for the attention calculation. Taking the prediction of $y_4$ as an example, the specific calculation process is as follows.
First prediction stream for $y_4$: the M-S and the first P-S are used; [MASK] attends to $y_1, y_2, y_3$ in the M-S, and $y_4$ is predicted with the conditional probability $P(y_4 \mid y_1, y_2, y_3)$. All the phoneme information in the first P-S is predicted autoregressively with the complete preceding information.
Second prediction stream for $y_4$: the [M] of $y_4$ in the second P-S attends to the true values $y_1, y_2$ from the M-S and to the [M] of $y_3$ from the first P-S. $y_3$ in the first prediction stream and $y_4$ in the second prediction stream are generated with the conditional probability $P(y_3, y_4 \mid y_1, y_2)$. As the stream index increases, more of the preceding information is masked, and generation moves from autoregressive towards non-autoregressive.
The last row shows that $y_4$ in the fourth P-S is ultimately predicted in a fully non-autoregressive manner. At this point, the predicted [M] of $y_4$ in the fourth P-S is calculated from the [M] of $y_1$ in the first P-S, the [M] of $y_2$ in the second P-S, and the [M] of $y_3$ in the third P-S. The next Mongolian phoneme can therefore be predicted quickly without considering the surrounding phoneme information.
Assuming that the length of the target sequence is $|Y| = n$, Bang sets up $n$ prediction streams. Prefixes of any length, with each word replaced by [M], are then predicted in parallel within the same time step, effectively speeding up prediction.
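To make the masking pattern concrete, the following toy sketch shows, at the data level, which context is visible when predicting $y_4$ in each prediction stream (cf. Figure 4). It illustrates the masking idea only and is not the Bang implementation.

```python
# Data-level illustration of the Bang prediction streams when predicting y4
# (cf. Figure 4): in stream k, the k-1 phonemes immediately before the target
# are replaced by [M], so stream 1 is fully autoregressive and stream n is
# fully non-autoregressive. This sketches the masking pattern only.

def visible_context(target_sequence, target_index, stream):
    """Return the context seen when predicting position target_index in a stream."""
    context = []
    for i in range(target_index):
        if i >= target_index - (stream - 1):   # last (stream - 1) positions masked
            context.append("[M]")
        else:
            context.append(target_sequence[i])
    return context


y = ["y1", "y2", "y3", "y4"]
for k in range(1, len(y) + 1):
    print(f"stream {k}: predict y4 from {visible_context(y, 3, k)}")
# stream 1: ['y1', 'y2', 'y3']    -> P(y4 | y1, y2, y3)
# stream 2: ['y1', 'y2', '[M]']   -> P(y3, y4 | y1, y2)
# stream 4: ['[M]', '[M]', '[M]'] -> fully non-autoregressive
```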
Fine-tuning [20] the Mongolian phoneme information combined with contextual information through the Bang pre-training model not only reduces the training complexity of the Mongolian speech synthesis model, but also reduces the probability of wrong words and missing words in the generated speech, making the generated speech more accurate and natural, greatly improving the quality of the generated speech. The data experiment and result analysis of the Mongolian phoneme pre-training model are given in Section 4.4.

3.2. Encoder–Decoder

In the encoder–decoder part, this paper uses multi-resolution STFT to assist loss optimization training and incorporates the Ghost module into the encoder and decoder of the improved Para-WaveNet model to improve the training efficiency of the Mongolian acoustic model.

3.2.1. Multi-Resolution STFT Auxiliary Loss

Since the Para-WaveNet Mongolian speech synthesis model generates speech slowly, has a long training cycle, and has a high real-time factor, this paper proposes using STFT loss and adversarial loss in the attention block of the encoder–decoder to optimize the training steps, instead of training a knowledge distillation model and convolutional network; this improves the speed and stability of speech generation by the Para-WaveNet model. Adversarial training requires linear prediction parameters to convert glottal excitations into audio waveforms. Therefore, this paper uses multiple STFT auxiliary losses to optimize the training steps. The principle is shown in Equation (1):
$L_s(G) = \mathbb{E}_{z \sim p(z),\, x \sim p_{data}}\left[\, L_{sc}(x, \hat{x}) + L_{mag}(x, \hat{x}) \,\right]$
where $\hat{x}$ denotes the generated audio, $L_{sc}$ the spectral convergence loss, $|STFT(\cdot)|$ the STFT magnitude, and $L_{mag}$ the log STFT magnitude loss, as defined in Equations (2) and (3):
$L_{sc}(x, \hat{x}) = \dfrac{\left\|\, |STFT(x)| - |STFT(\hat{x})| \,\right\|_F}{\left\|\, |STFT(x)| \,\right\|_F}$
$L_{mag}(x, \hat{x}) = \dfrac{1}{N} \left\| \log|STFT(x)| - \log|STFT(\hat{x})| \right\|_1$
where $\|\cdot\|_F$ denotes the Frobenius norm and $\|\cdot\|_1$ denotes the L1 norm. The multi-resolution loss is the superposition of single-resolution STFT losses computed with different parameters (number of FFT points, window size, and frame shift), where $M$ is the number of STFT losses. The multi-resolution STFT loss is shown in Equation (4):
$L_{aux}(G) = \dfrac{1}{M} \sum_{m=1}^{M} L_s^{(m)}(G)$
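A minimal PyTorch sketch of the multi-resolution STFT auxiliary loss in Equations (2)–(4) is given below. The FFT-size/frame-shift/window-length triples are common default settings for this type of loss, not necessarily the exact values used in our experiments.

```python
import torch


def stft_magnitude(x, fft_size, hop_size, win_length):
    """|STFT(x)| for a batch of 1-D waveforms of shape (batch, samples)."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, n_fft=fft_size, hop_length=hop_size,
                      win_length=win_length, window=window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)


def single_stft_loss(x, x_hat, fft_size, hop_size, win_length):
    """Spectral convergence + log STFT magnitude loss (Equations (2) and (3))."""
    s = stft_magnitude(x, fft_size, hop_size, win_length)
    s_hat = stft_magnitude(x_hat, fft_size, hop_size, win_length)
    l_sc = torch.norm(s - s_hat, p="fro") / torch.norm(s, p="fro")
    l_mag = torch.nn.functional.l1_loss(torch.log(s_hat), torch.log(s))
    return l_sc + l_mag


def multi_resolution_stft_loss(x, x_hat,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Average of single-resolution losses (Equation (4)); the resolution
    triples here are illustrative defaults."""
    losses = [single_stft_loss(x, x_hat, *r) for r in resolutions]
    return sum(losses) / len(losses)


# usage: real_audio and generated_audio are (batch, samples) float tensors
# loss = multi_resolution_stft_loss(real_audio, generated_audio)
```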

3.2.2. Ghost Module

Research shows that, in ordinary deep convolutional neural networks, the number of convolution channels and network parameters is usually increased in order to obtain richer feature information [21]. However, this also brings other problems, such as increased computational complexity and slower speech synthesis. In addition, the features generated by the convolutions contain a considerable amount of redundant information, and some channel features are highly similar to each other [22]. Therefore, a variety of linear operations are needed to process these features and improve computational efficiency.
Compared with the fully convolutional neural network, the Ghost module used in this paper was a unique neural network, which was composed of a convolution and linear transformation. The convolution part could effectively reduce the number of parameters and calculations by reducing the number of output channels, while the linear transformation part dynamically adjusted the number of channels to ensure that the output was consistent with the dimensions of the Mel spectrogram. Especially when the sentences were long, the Ghost module was used to completely parallelize the model training process, which greatly improved the training efficiency. The principle of the Ghost module is shown in Figure 5.
Suppose that $c$ is the number of channels of the input phoneme sequence $X$ and $n$ is the number of channels of the output acoustic feature $Y$. To obtain $n$ output channels, the input $X$ first passes through an ordinary convolution layer with kernel size $k$ to obtain an intrinsic (ontology) feature sequence $Y'$ with $m$ channels, where $m$ is much smaller than $n$. Let $n = m \cdot s$; that is, after a number of cheap linear operations, $s$ feature channels are generated from each intrinsic channel, the so-called ‘phantom’ feature channels. The calculation process is as follows:
$y_{ij} = \Phi_{i,j}\left(y'_i\right), \quad i = 1, \dots, m, \; j = 1, \dots, s$
In Equation (5), $y'_i$ is the feature of the $i$-th channel of $Y'$ and $\Phi_{i,j}$ denotes the $j$-th linear operation applied to $y'_i$. After the linear processing, $Y = [y_{11}, y_{12}, \dots, y_{ms}]$, containing $n$ feature channels, is obtained as the output of the Ghost module. To achieve this, the intrinsic feature sequence is treated directly as part of the output feature sequence. Each intrinsic channel therefore undergoes $s-1$ linear transformations, giving a total of $n(s-1)/s$ linear transformations and enabling more efficient computation. The linear transformations can be implemented with grouped convolutions, where the number of groups is $n/s$. Each intrinsic channel uses $s-1$ convolution kernels of size $d$ to obtain $s-1$ outputs, and the outputs of these grouped convolutions are concatenated with the identity mapping to obtain the final output, as shown in Table 1.
Comparing the parameters and operations of an ordinary convolution and the Ghost module, when the kernel size of the ordinary convolution is $k$, the sizes of $k$ and $d$ are similar, and $s$ is much smaller than $c$, the ratios of parameter count and computation are as shown in Equations (6) and (7), respectively.
$r_{para} = \dfrac{c \cdot k \cdot n}{c \cdot k \cdot \frac{n}{s} + d \cdot \frac{n}{s} \cdot (s-1)} \approx \dfrac{s \cdot c}{c + s - 1} \approx s$
$r_{cal} = \dfrac{c \cdot k \cdot n \cdot l}{c \cdot k \cdot \frac{n}{s} \cdot l + d \cdot \frac{n}{s} \cdot (s-1) \cdot l} \approx \dfrac{s \cdot c}{c + s - 1} \approx s$
In Equation (7), $l$ is the length of the output sequence. Therefore, $s$ can be regarded as the parameter compression ratio or computational speedup ratio. From the comparison in Table 1, it can be concluded that using the Ghost module greatly reduces the number of model parameters and improves training speed. In addition, to prevent vanishing or exploding gradients, the stacked Ghost modules are connected with residual connections; their structure is shown in Figure 6.
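The following PyTorch sketch shows a 1-D Ghost module of the kind described above, with the intrinsic channels produced by an ordinary convolution and the phantom channels produced by cheap grouped convolutions. The kernel sizes and the ratio $s$ are illustrative choices, not the exact configuration listed in Table 6.

```python
import math

import torch
import torch.nn as nn


class GhostModule1d(nn.Module):
    """1-D Ghost module sketch: an ordinary convolution produces the intrinsic
    (ontology) channels, and cheap grouped convolutions generate the remaining
    'phantom' channels (Equation (5)). Hyperparameters are illustrative."""

    def __init__(self, in_channels, out_channels, kernel_size=3, ratio=2, cheap_kernel=3):
        super().__init__()
        self.out_channels = out_channels
        intrinsic = math.ceil(out_channels / ratio)       # m = n / s intrinsic channels
        phantom = intrinsic * (ratio - 1)                 # m * (s - 1) phantom channels
        self.primary = nn.Sequential(
            nn.Conv1d(in_channels, intrinsic, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm1d(intrinsic),
            nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv1d(intrinsic, phantom, cheap_kernel,
                      padding=cheap_kernel // 2, groups=intrinsic, bias=False),
            nn.BatchNorm1d(phantom),
            nn.ReLU(inplace=True))

    def forward(self, x):                                 # x: (batch, channels, length)
        y_intrinsic = self.primary(x)                     # ordinary convolution
        y_phantom = self.cheap(y_intrinsic)               # cheap linear operations
        y = torch.cat([y_intrinsic, y_phantom], dim=1)    # identity + phantom features
        return y[:, :self.out_channels, :]


# usage: a residually connected Ghost stack, as in Figure 6
# block = GhostModule1d(256, 256)
# out = x + block(x)    # x: (batch, 256, length)
```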
The acoustic model proposed in this paper adopted a feed-forward network structure that could iteratively refine the attention alignment between the text and Mel spectrogram in a layer-by-layer manner, where the encoder and decoder used several Ghosts as the backbone. In the model, the input phoneme sequence entered the front-end module and passed through the Bang pre-training and Ghost modules to achieve effective predictions of the alignment between the phonemes and acoustic features, and the acoustic features were input into the decoder for adversarial training. Finally, the Mel spectrogram was generated in parallel. This method greatly improved the speed of speech generation by the Mongolian acoustic model, reduced the word error rate and missing word rate of the generated speech, and improved the performance of the Mongolian acoustic model.

3.3. ILPCnet Vocoder

In the vocoder part, this paper proposes a vocoder that combines digital signal processing (DSP) and a neural network (NN), called the ILPCnet vocoder, in order to improve the speed of generating Mongolian speech while ensuring high quality [23]; it adds noise and filtering to the Mongolian frequency bands. The structure is simple and intuitive and greatly reduces the number of generated parameters. It can synthesize high-quality speech in real time on an ordinary CPU, reducing the hardware requirements of the model. The workflow is shown in Figure 7.
ILPCnet consists of two sub-networks, an upsampling network and a waveform generation network, which match the time resolution of the input acoustic features to the sampling rate of the speech signal and autoregressively generate waveform samples. Its advantage is that linear prediction coefficients [24] are incorporated into the neural network, making the structure simpler and easier to implement.
In the upsampling network, two 1 × 3 convolutional layers are first used to extract the local context of the acoustic features, and the context vector is constructed from the current, the two previous, and the two following frames. It is then concatenated with the acoustic features so that the context vector is dominated by the current frame information. A fully connected layer maps the dimensions of the context vector to the input dimensions of the waveform generation network. Finally, the output of the fully connected layer is upsampled to the waveform time resolution through a transposed convolutional layer. In the source–filter structure, the excitation is predicted by the neural network, while the linear prediction filter is calculated with DSP methods. Specifically, each sample $x$ is the sum of the prediction term $p$ from the filter and the excitation $e$; the excitation $e$ is predicted by the neural network and $p$ is computed directly with the DSP method. The calculation process is shown in Equations (8) and (9):
$x_n = e_n + p_n$
$p_n = \sum_{i=1}^{M} \alpha_i \, x_{n-i}$
where $x_n$, $e_n$, and $p_n$ denote the $n$-th speech sample, excitation, and intermediate prediction term, respectively; $\alpha_i$ is the $i$-th linear prediction coefficient and $M$ is the prediction order. The linear prediction filter represents the spectral components of the Mongolian speech signal well, and the WaveRNN framework is used to generate the remaining components of the speech signal effectively, further improving the quality of the synthesized speech.
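As a simple illustration of Equations (8) and (9), the following NumPy sketch reconstructs a waveform from a given excitation sequence and a set of linear prediction coefficients. In ILPCnet the excitation is predicted frame by frame by the neural network, whereas here it is simply an input array; the coefficients used in the example are hypothetical.

```python
import numpy as np


def lp_synthesize(excitation, lpc_coeffs):
    """Reconstruct a waveform from an excitation signal and linear prediction
    coefficients, following Equations (8) and (9): p_n = sum_i a_i * x_{n-i},
    x_n = e_n + p_n. Illustrative only."""
    M = len(lpc_coeffs)
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        # prediction term from the M previously reconstructed samples
        p_n = sum(lpc_coeffs[i] * x[n - 1 - i] for i in range(M) if n - 1 - i >= 0)
        x[n] = excitation[n] + p_n
    return x


# toy usage with hypothetical 2nd-order coefficients
excitation = np.random.randn(16000) * 0.01
waveform = lp_synthesize(excitation, lpc_coeffs=[1.2, -0.5])
```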
According to the distribution relationship between the sampling points and excitations in Figure 7, a neural network is used to predict $w$, $\mu$, and $s$. The specific prediction process is given in Equations (10) and (11):
$p(x_n \mid x_{<n}, h) = \sum_{i=1}^{N} w_{n,i} \, \dfrac{1}{\sqrt{2\pi}\, s_{n,i}} \exp\!\left( -\dfrac{(x_n - \mu_{n,i})^2}{2 s_{n,i}^2} \right)$

$w_{n,i}^{x} = w_{n,i}^{e}, \quad \mu_{n,i}^{x} = \mu_{n,i}^{e} + p_n, \quad s_{n,i}^{x} = s_{n,i}^{e}$
where $x_n$, $x_{<n}$, $h$, and $N$ denote the current Mongolian speech sample, the previous speech samples, the conditional acoustic features, and the number of mixture components, respectively; $[w_{n,i}, \mu_{n,i}, s_{n,i}]$ is the $i$-th set of mixture parameters, consisting of gain, mean, and scale components; and the superscripts $e$ and $x$ denote the excitation and speech domains, respectively.
Then, the autoregressive neural vocoder is used to predict the mixture parameters of the excitation signal, and the prediction term $p_n$ is calculated; the mixture parameters of the Mongolian speech signal can then be obtained as shown in Equation (12):
$w_n = \mathrm{softmax}\!\left(z_n^{w}\right), \quad \mu_n = z_n^{\mu} + p_n, \quad s_n = \exp\!\left(z_n^{s}\right)$
In the above formula, $[z_n^{w}, z_n^{\mu}, z_n^{s}]$ denotes the output vector of the neural vocoder connected to the gain, mean, and scale nodes of the MoG distribution, respectively.
When training the network, the MoG likelihood is computed using the mixture parameters from Equation (12), and the negative log-likelihood (NLL) loss in Equation (13) is minimized to optimize the weights. Because the sum of constant terms guarantees the accuracy of the linear prediction, the weights of the neural network can be trained successfully with the standard backpropagation procedure.
$L_{nll} = -\sum_{n} \log p\!\left(x_n \mid x_{<n}, h\right)$
In the waveform generation network, a hyperbolic tangent activation is first applied to the context vector to match its dynamic range to the waveform domain, which is bounded by [−1, 1]. The concatenation of the context vector and the previous waveform sample then passes through two gated recurrent unit (GRU) layers, followed by a fully connected layer that generates the output vector $[z_n^{w}, z_n^{\mu}, z_n^{s}]$. Finally, the LP-MDN model is used to calculate the mixture parameters of the speech distribution, as shown in Equation (12).
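A minimal PyTorch sketch of the parameter mapping in Equations (11) and (12) and of the NLL loss in Equations (10) and (13) is given below, assuming frame-aligned tensors of network outputs. It illustrates the computation rather than reproducing the ILPCnet implementation.

```python
import math

import torch
import torch.nn.functional as F


def mog_parameters(z_w, z_mu, z_s, p):
    """Map vocoder outputs to speech-domain MoG parameters (Equations (11)-(12)).
    z_* have shape (batch, time, N); p is the LP prediction term, shape (batch, time)."""
    w = F.softmax(z_w, dim=-1)            # gains
    mu = z_mu + p.unsqueeze(-1)           # shift the excitation means by p_n
    s = torch.exp(z_s)                    # scales
    return w, mu, s


def mog_nll(x, w, mu, s):
    """Negative log-likelihood of samples x under the mixture of Gaussians
    (Equations (10) and (13)). x has shape (batch, time)."""
    x = x.unsqueeze(-1)
    log_component = (torch.log(w + 1e-9)
                     - torch.log(s + 1e-9)
                     - 0.5 * math.log(2.0 * math.pi)
                     - (x - mu) ** 2 / (2.0 * s ** 2))
    log_prob = torch.logsumexp(log_component, dim=-1)     # log p(x_n | x_<n, h)
    return -log_prob.mean()
```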

4. Data Experiment

4.1. Experimental Data

The Mongolian speech–text dataset [25] (NMLR-Mon2Chs ST) was used to train the network model. The dataset was recorded by 36 Mongolian speakers, aged between 20 and 25 years, all from Hohhot, Inner Mongolia Autonomous Region, and includes both audio and text. The dataset is 2.68 GB in size and contains 21,478 audio files of 16 kHz mono speech, with a valid duration of 25 h. An example is shown in Table 2.
The Mongolian text was converted into the corresponding Latin sequences using the Mongolian data processing method described in Section 2, with each processed text sequence corresponding one to one to an audio file. The processed dataset was then randomly divided into training, test, and validation sets of 16,800, 2100, and 2100 samples, respectively. All text sequences in the experiments used the Latin sequences converted from the Mongolian text. Some examples are shown in Table 3.
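For reference, a minimal sketch of such a random split is shown below; the function and its seed are illustrative and are not part of the released dataset tooling.

```python
import random


def split_dataset(pairs, train_n=16800, test_n=2100, val_n=2100, seed=42):
    """Randomly split (latin_text, audio_path) pairs into training, test, and
    validation sets with the sizes used in Section 4.1. Illustrative only."""
    assert len(pairs) >= train_n + test_n + val_n
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    train = pairs[:train_n]
    test = pairs[train_n:train_n + test_n]
    val = pairs[train_n + test_n:train_n + test_n + val_n]
    return train, test, val
```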

4.2. Model Training Platform

According to the training requirements of the Mongolian speech synthesis model Ghost-ILPCnet, the model training environment presented in this paper is shown in Table 4. In addition, a series of installation packages were required in this experiment, including numpy, inflect, matplotlib, librosa, scipy, Unidecode, nltk, etc.

4.3. Indicators for Model Performance Evaluation

Generally speaking, voice quality assessment methods can be divided into two types: subjective and objective. This study used a combination of subjective and objective evaluation methods to evaluate the performance of the speech synthesis model more accurately and comprehensively.
Mean opinion score (MOS). The MOS uses a five-level scale: the better the voice quality, the higher the score [26]. It reflects the naturalness and clarity of synthesized speech in a simple, direct, and efficient way. A MOS below 3 means that most listeners are not satisfied with the quality of the synthesized speech, indicating that its quality is not ideal; a MOS above 4 means that the majority of listeners approve of the quality of the synthesized speech, as shown in Table 5.
The real-time factor (RTF) was used as an objective measure of the real-time performance of speech synthesis. It reflects the time the model needs to synthesize speech per unit of speech duration. The value range of the RTF is [0, 1]; the smaller the RTF, the better the real-time performance and interactivity of the speech synthesis model. The RTF is calculated according to Equation (14):
$RTF = \dfrac{\text{time required to synthesize a piece of speech}}{\text{duration of that piece of speech}}$
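A small helper of the following form can be used to measure the RTF of Equation (14) for any text-to-waveform callable; the function and its arguments are illustrative and not part of an existing evaluation toolkit.

```python
import time


def real_time_factor(synthesize, text, audio_duration_seconds):
    """RTF = synthesis time / duration of the synthesized speech (Equation (14)).
    `synthesize` is any text-to-waveform callable; argument names are placeholders."""
    start = time.perf_counter()
    synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_seconds
```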

4.4. Experiment Process

The main process of the Ghost-ILPCnet model experiment includes three parts: front-end processing, the acoustic model, and the vocoder module. The experimental process is shown in Figure 8.
The Mongolian speech synthesis process is divided into a training module and a synthesis module. In the training module, Mongolian text is input, the Mongolian phoneme information is predicted and aligned through the Bang-based pre-training model, and the phoneme information is trained based on the Ghost acoustic model to obtain the Mel spectrogram. Finally, the Mel spectrogram is trained on the improved ILPCnet vocoder to synthesize the Mongolian speech.
In the synthesis module, Mongolian text is input into the trained acoustic model to generate a Mel spectrogram, and then the Mel spectrogram is input into the trained vocoder to finally synthesize Mongolian speech. Some parameters of the acoustic model based on Ghost are shown in Table 6.

4.5. Data Test for the Model and Results

4.5.1. Training Loss Curve

The training loss value curve can intuitively show the fitting degree of the model. As shown in Figure 9, the model loss value decreases rapidly in the interval of [0, 60 k] for the number of iteration steps. When the number of iteration steps reaches 140 k, the model gradually converges and achieves better training results.

4.5.2. Word Error Rate Analysis

In addition, in a sequence-to-sequence model, the encoder–decoder module can cause a mismatch between the phoneme sequence and the mel spectrogram during training, leading to instabilities such as wrong words and missing words in the synthesized speech. In order to test the word error rate of the Mongolian speech synthesis models, this study selected 30 Mongolian sentences and tested them under the same experimental conditions. Five listeners counted the numbers of missing and wrong words in the test speech set, and these counts were averaged to obtain the final results for each trained model, as shown in Table 7.
From the perspective of the impact of the input on model results, using phoneme sequences as the model input can reduce the error rate of the model to varying degrees and improve the accuracy of the model’s synthesized speech. Therefore, the subsequent models all used phoneme sequences as the model input.
From the perspective of the impact of the multi-resolution STFT loss on the results, the improved Para-WaveNet (phoneme) trained with the multi-resolution STFT loss produced 9 sentences with missing words and 7 sentences with wrong words, for an error rate of 53%, while the Para-WaveNet (phoneme) assisted by a single fixed STFT loss produced 10 sentences with missing words and 7 with wrong words, for an error rate of 56%. Under the same conditions, the error rate of the speech generated by the model was thus reduced by 3%. The experiment then introduced a Transformer-based Mongolian speech synthesis model for comparison. After experimental verification, the Transformer model (phoneme) assisted by the multi-resolution STFT loss produced a speech error rate of 10%, compared with 17% when assisted by a single fixed STFT loss, a reduction of 7%. It can be concluded that the improved Para-WaveNet model assisted by the multi-resolution STFT loss proposed in this article can stably generate clear Mongolian speech and improves the quality of the generated speech.
From the perspective of the impact of the Bang pre-training model on the experimental results, the Ghost-ILPCnet model (phoneme) with the Bang pre-training model produced 1 sentence with missing words and 1 sentence with wrong words, for an error rate of only 6%. The Ghost-ILPCnet model (phoneme) without the Bang pre-training model produced 1 sentence with missing words and 2 sentences with wrong words, for an error rate of 10%; the error rates of the two models thus differ by 4%. By adding the Bang-based pre-training mechanism for phoneme information, contextual information can be used to generate speech signals in parallel more accurately, improving the quality of the speech generated by the Mongolian speech synthesis model.
In summary, taking phoneme sequences as the input, adding the Bang pre-training model, and using the multi-resolution STFT loss to assist the Ghost-ILPCnet Mongolian speech synthesis model alleviate the problem of polyphonic words being pronounced differently in different sentences. This helps the model use contextual phoneme information to generate speech signals in parallel more accurately and stably, ultimately improving the quality of the speech generated by the Mongolian speech synthesis model.

4.5.3. Real-Time Rate Analysis

When the WaveNet vocoder synthesizes speech, its synthesis speed is limited because each audio sample depends on the samples generated in the previous step, which constrains practical speech quality [27]. In the experiment, the real-time factor of speech synthesis was compared between the Ghost-ILPCnet model and the improved Para-WaveNet model. The RTF was used to measure the performance of the speech synthesis model: the smaller the RTF, the better the real-time performance and interactivity of the model. The results are shown in Table 8.
In terms of the vocoder, the RTF of Mel spectrogram + WaveNet was 0.2412, and that of Mel spectrogram + ILPCnet was 0.1921. The experiments show that, under the same conditions, the ILPCnet vocoder converts the Mel spectrogram into a speech waveform with an RTF that is 0.0491 lower than WaveNet. Therefore, introducing ILPCnet as the vocoder of the Ghost acoustic model improves the synthesis speed of the entire system to a certain extent and meets the real-time requirements of the speech synthesis system.
From the perspective of algorithm distillation, comparing the ParaNet model and the improved Para-WaveNet model shows that, without any complex probability density distillation, the improved Para-WaveNet acoustic model generates Mongolian speech with an RTF of 0.0859, which is 0.1027 lower than that of the ParaNet model. The real-time performance of the speech generated by the improved Para-WaveNet acoustic model is therefore greatly improved.
We also compared models with and without non-autoregressive sequence generation, namely the popular Transformer, Tacotron 2, and ParaNet-series models and Ghost-ILPCnet. Since Transformer and Tacotron 2 are autoregressive models, their RTFs for synthesizing Mongolian speech are both below 1, with the lower of the two being 0.9141, which only just meets the requirement for real-time speech synthesis. In contrast, even the slowest non-autoregressive model generates speech at an RTF of 0.1886, so the non-autoregressive models are clearly faster.
Finally, we tested whether adding the Bang pre-training model affects the RTF of the model. When the pre-trained model was added to the Ghost-ILPCnet-based Mongolian speech synthesis model, the RTF was 0.0041, only 0.0020 higher than without it. The RTF of this model nevertheless remains very low among the non-autoregressive models.

4.5.4. MOS Subjective Analysis

Since MOS evaluation is performed by people and different listeners perceive a sentence differently, in order to evaluate the quality of the synthesized speech more accurately, this study randomly selected 50 Mongolian audio clips from the 2100-sample test set and had 15 Mongolian listeners take a listening test, subjectively rating the quality and naturalness of the synthesized speech they heard. The ratings were then averaged to obtain the MOS. Table 9 shows the mean opinion scores of the different models.
The experimental results show that, using real Mel spectrograms as the inputs of WaveNet and ILPCnet, the MOS scores are 4.01 and 4.03, respectively. The comparison shows that the synthesis quality of the two vocoders is almost the same, but the WaveNet model is complex and takes a very long time to train. With comparable synthesized speech quality, the ILPCnet network is lighter and faster.
Comparing autoregressive and non-autoregressive models for generating Mongolian speech, the autoregressive models Transformer and Tacotron 2 reach a MOS of at most 4.45, while the non-autoregressive ParaNet-series models and Ghost-ILPCnet reach a MOS of up to 4.48. With the Mongolian speech synthesis model based on Ghost-ILPCnet proposed in this paper, the MOS reached 4.48 while maintaining synthesis efficiency. This is significantly higher than the other baseline models and differs from the MOS of real speech by only 0.14. The model is thus further improved in terms of synthesis quality, speed, and performance, verifying the effectiveness of the improvements proposed in this paper.

5. Conclusions

This paper studied methods for Mongolian speech synthesis. First, to address the problem that Mongolian characters have the same shape but different pronunciations in different contexts, and that Mongolian letters are written differently at the beginning, middle, and end of words (which prevents the model from recognizing them accurately), a Bang-based Mongolian phoneme pre-training model was constructed in the front-end part of the Mongolian speech synthesis pipeline. This reduced errors such as missing and skipped words in the generated speech and improved the quality of Mongolian speech synthesis.
Because existing Mongolian acoustic models use a large number of convolutional neural network structures, resulting in low training efficiency, we proposed the Mongolian speech synthesis model Ghost-ILPCnet based on Ghost and ILPCnet. This model addresses the low training efficiency, poor real-time generation rate, and unnatural synthesis of current Mongolian speech synthesis models. Experimental verification showed that the naturalness of the speech synthesized by this model is closer to the original audio. However, there is still room for improvement. Because techniques such as sequence distillation and fine-tuning are used in the phoneme pre-training process, the model training time is extended, which affects the speed of speech synthesis to a certain extent; further improvements are therefore needed in future work. At the same time, the current Mongolian speech synthesis model has a single timbre and lacks emotion. Therefore, on the basis of continuously improving the quality of the corpus, how to improve the emotional expressiveness of Mongolian speech synthesis is also the focus of the next stage of research.

Author Contributions

Conceptualization, Q.-D.-E.-J.R. and L.L.; methodology, Q.-D.-E.-J.R., L.W. and L.L.; software, L.W.; validation, L.W.; formal analysis, W.Z.; investigation, L.W.; resources, Q.-D.-E.-J.R.; data curation, L.W.; writing—original draft preparation, Q.-D.-E.-J.R., L.W. and W.Z.; writing—review and editing, W.Z. and L.W.; visualization, L.W.; supervision, Q.-D.-E.-J.R. and W.Z.; project administration, Q.-D.-E.-J.R. and L.W.; funding acquisition, Q.-D.-E.-J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62066035, 62206138), Inner Mongolia Natural Science Foundation (2022MS06013, 2022LHMS06004), Inner Mongolia Science and Technology Program Project (2021GG0140, 2020GG0104), Support Program for Young Scientific and Technological Talents in Inner Mongolia Colleges and Universities (NJYT23059), universities directly under the autonomous region funded by the Fundamental Research Fund Project (JY20220122, JY20220089, RZ2300001739, RZ2300001743, JY20220186), and basic scientific research business expenses of universities directly in the Inner Mongolia Autonomous Region (ZTY2023021, JY20220419).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [China Scientific Data] at [https://doi.org/10.11922/sciencedb.j00001.00345] (accessed on 6 January 2024), reference number [1,2,3,4,5,6,7].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Klatt, D.H. Review of Text-to-speech Conversion for English. J. Acoust. Soc. Am. 1987, 82, 737–793. [Google Scholar] [CrossRef] [PubMed]
  2. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
  3. Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  4. Sotelo, J.; Mehri, S.; Kumar, K.; Santos, J.F.; Kastner, K.; Courville, A.; Bengio, Y. Char2wav: End-to-end Speech Synthesis. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  5. Mehri, S.; Kumar, K.; Gulrajani, I.; Kumar, R.; Jain, S.; Sotelo, J.; Courville, A.; Bengio, Y. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. arXiv 2016, arXiv:1612.07837. [Google Scholar]
  6. Arik, S.O.; Chrzanowski, M.; Coates, A.; Diamos, G.; Gibiansky, A.; Kang, Y.; Li, X.; Miller, J.; Ng, A.; Raiman, J.; et al. Deep Voice: Real-time Neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 195–204. [Google Scholar]
  7. Deri, A.; Knight, K. Grapheme-to-phoneme Models for (almost) any Language. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 399–408. [Google Scholar]
  8. Gibiansky, A.; Arik, S.; Diamos, G.; Miller, J.; Peng, K.; Ping, W.; Raiman, J.; Zhou, Y. Deep voice 2: Multi-speaker Neural text-to-speech. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, LA, USA, 4–9 December 2017; pp. 2962–2970. [Google Scholar]
  9. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep Voice 3: 2000-Speaker Neural Text-to-Speech. arXiv 2017. [Google Scholar] [CrossRef]
  10. Peng, K.; Ping, W.; Song, Z.; Zhao, K. Non-autoregressive Neural text-to-speech. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 7586–7598. [Google Scholar]
  11. Liu, R.; Bao, F.; Gao, G. Mongolian Text-to-Speech System Based on Deep Neural Network. In Proceedings of the National Conference on Man-Machine Speech Communication, Lianyungang, China, 11–13 October 2017; pp. 99–108. [Google Scholar]
  12. Liu, Z. Research on End-to-End Mongolian Speech. Master’s Thesis, Inner Mongolia University, Hohhot, China, 2019. [Google Scholar]
  13. Liu, R.; Kang, S.; Gao, G.; Li, J.; Bao, F. MonTTS: A Real-time and High-fidelity Mongolian TTS Model with Pure Non-autoregressive Mechanism. J. Chin. Inf. Process. 2022, 36, 86–97. [Google Scholar]
  14. Bao, F.L.; Gao, G.L.; Yan, X.L. Research on Grapheme to Phoneme Conversion for Mongolian. Appl. Res. Comput. 2013. [Google Scholar]
  15. Dong, C.; Xie, Y.; Ding, B.; Shen, Y.; Li, Y. Collaborating Heterogeneous Natural Language Processing Tasks via Federated Learning. arXiv 2022, arXiv:2212.05789. [Google Scholar]
  16. Peng, H.; Kasai, J.; Pappas, N.; Yogatama, D.; Wu, Z.; Kong, L.; Schwartz, R.; Smith, N.A. ABC: Attention with Bounded-Memory Control. arXiv 2021, arXiv:2110.02488. [Google Scholar]
  17. Zhang, Y.; Zhu, H.; Wang, Y.; Xu, N.; Li, X.; Zhao, B. A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-Wise Perspective in Angular Space. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 4892–4903. [Google Scholar]
  18. Zheng, W. Pre-Trained Models for Natural Language Processing Editorial. ZTE Technol. J. 2022, 28, 1–2. [Google Scholar]
  19. Yeshambel, T.; Mothe, J.; Assabie, Y. Learned Text Representation for Amharic Information Retrieval and Natural Language Processing. Information 2023, 14, 195. [Google Scholar] [CrossRef]
  20. Chifu, A.-G.; Fournier, S. Sentiment Difficulty in Aspect-Based Sentiment Analysis. Mathematics 2023, 11, 4647. [Google Scholar] [CrossRef]
  21. Jin, W.; Cheng, Y.; Shen, Y.; Chen, W.; Ren, X. A Good Prompt is Worth Millions of Parameters? Low-Resource Prompt-Based Learning for Vision-Language Models. arXiv 2021, arXiv:2110.08484. [Google Scholar]
  22. Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Stimberg, F.; Oord, A.; Dieleman, S.; Kavukcuoglu, K. Efficient Neural Audio Synthesis. arXiv 2018, arXiv:1802.08435. [Google Scholar]
  23. Zhao, L.; Liu, C.; Liu, X.; Zhang, L. Prolate spheroidal wave functions signal time-frequency analysis based on Fourier series. J. Mod. Electron. Technol. 2022, 17, 35–40. [Google Scholar]
  24. Gao, E. Research on the application of linear prediction in speech signal processing. China New Telecommun. 2022, 24, 72–74. [Google Scholar]
  25. Qi, X.; BORJIGIN, B.T.; Sun, Y.; Zhao, X. A dataset of Mongolian-Chinese speech translation. China Sci. Data 2022, 7, 2. [Google Scholar]
  26. Song, N. Research on Sign Language-to-Mandarin/Tibetan Emotional Speech Conversion by Combining Facial Expression Recognition. Master’s Thesis, Northwest Normal University, Lanzhou, China, 2019. [Google Scholar]
  27. Tang, J.; Zhang, L.; Li, J. A real-time robust speech synthesis method based on improved attention mechanism. J. Signal Process. 2022, 3, 527–535. [Google Scholar]
Figure 1. Model of Mongolian text to speech based on improved Para-WaveNet.
Figure 2. Front-end processing.
Figure 3. Structure of Mongolian speech synthesis model based on Ghost-ILPCnet.
Figure 4. Information flow in Bang pre-training.
Figure 5. Ghost module.
Figure 6. Ghost module stacks.
Figure 7. Structure of ILPCnet model.
Figure 8. Experiment process.
Figure 9. Loss curve of Ghost-ILPCnet model.
Table 1. Dimension transform of Ghost module.

Operation | Input | Output
Ordinary convolution | l × c | l × (n/s)
Linear transformation | l × (n/s) | l × (n/s) × (s − 1)
Concat | l × (n/s), l × (n/s) × (s − 1) | l × n
Table 2. Sample of Mongolian dataset.

Audio File Name | Corresponding Mongolian Text
ahei40-0001 | ᠡᠭᠦᠳᠡᠨ ᠲᠢᠩᠬᠢᠮ ᠤᠨ ᠳᠤᠤᠷ᠎ᠠ ᠲᠠᠯ᠎ᠠ ᠳ᠋ᠦ᠍ ᠪᠤᠢ
ahei40-0002 | ᠪᠢ ᠤᠳᠤ ᠲᠠᠨ ᠳ᠋ᠦ᠍ ᠨᠢᠭᠡ ᠬᠡᠰᠡᠭ ᠠᠪᠴᠦ ᠦᠭ᠍ᠭᠦᠶ᠎ᠡ
ahei40-0003 | ᠬᠡᠷᠪᠡ ᠲᠠ ᠪᠠᠰᠠ ᠶᠠᠮᠠᠷ ᠱᠠᠭᠠᠷᠳᠠᠯᠭ᠎ᠠ ᠪᠠᠶᠢᠪᠠᠯ ᠨᠠᠳᠠ ᠳ᠋ᠦ᠍ ᠬᠡᠯᠡᠬᠦ ᠪᠤᠯᠪᠠᠤ
ahei40-0004 | ᠲᠡᠭᠦᠨ ᠳ᠋ᠦ᠍ ᠰᠠᠨᠠᠭ᠎ᠠ ᠵᠤᠪᠠᠬᠤ ᠬᠡᠷᠡᠭ ᠦᠬᠡᠢ
Table 3. Latin sequence after conversion of Mongolian text.

Mongolian Text | Latin Sequence
ᠡᠭᠦᠳᠡᠨ ᠲᠢᠩᠬᠢᠮ ᠤᠨ ᠳᠤᠤᠷ᠎ᠠ ᠲᠠᠯ᠎ᠠ ᠳ᠋ᠦ᠍ ᠪᠤᠢ | u: den tinghimon dvr tald bvi
ᠪᠢ ᠤᠳᠤ ᠲᠠᠨ ᠳ᠋ᠦ᠍ ᠨᠢᠭᠡ ᠬᠡᠰᠡᠭ ᠠᠪᠴᠦ ᠦᠭ᠍ᠭᠦᠶ᠎ᠡ | bi vdv: tand nege heseg abcv ugye
ᠬᠡᠷᠪᠡ ᠲᠠ ᠪᠠᠰᠠ ᠶᠠᠮᠠᠷ ᠱᠠᠭᠠᠷᠳᠠᠯᠭ᠎ᠠ ᠪᠠᠶᠢᠪᠠᠯ ᠨᠠᠳᠠ ᠳ᠋ᠦ᠍ ᠬᠡᠯᠡᠬᠦ ᠪᠤᠯᠪᠠᠤ | herbe ta: bas yamar sha: rdelge baibe: l naded heleh bvlbasv
ᠲᠡᠭᠦᠨ ᠳ᠋ᠦ᠍ ᠰᠠᠨᠠᠭ᠎ᠠ ᠵᠤᠪᠠᠬᠤ ᠬᠡᠷᠡᠭ ᠦᠬᠡᠢ | tegund sana: jvbehv hereg ugei
Table 4. Model training platform parameters.

Name | Parameters
CPU | Intel Core i5-10500 CPU @ 3.10 GHz
GPU | Nvidia Tesla P100
Operating system | Ubuntu 18.04.5
Programming language | Python 3.6.13
Deep learning framework | PyTorch 0.4.1
Table 5. MOS scoring criteria.

MOS Score | Voice-Quality Evaluation Standards
1 | Extremely poor: the pronunciation is not clear, the delay is long, and the speech is impossible to distinguish
2 | Severe distortion, blurred speech, obvious delay, almost indistinguishable
3 | Obvious distortion, some delay, and noise present, but the clarity is acceptable
4 | Distortion is not obvious, the delay is short, some noise, and the voice can be heard clearly
5 | Smooth and natural, short delay, flawless, clear pronunciation
Table 6. Partial parameters of acoustic model based on Ghost.

Parameters | Value
SqueezeExcite | se_ratio: 0.25; reduced_base_chs = None; act_layer = nn.ReLU; gate_fn = hard_sigmoid; divisor = 4
ConvBnAct | stride = 1; act_layer = nn.ReLU
GhostNet | num_classes: 1000; width: 1.0; dropout: 0.3
Bang | DEFAULT_MAX_SOURCE_POSITIONS: 512; DEFAULT_MAX_TARGET_POSITIONS: 512
Table 7. Comparison of the word error rates of each model in 30 sentences.

Model | STFT Loss | Missing Words | Wrong Words | Error Rate
ParaNet (character) | — | 11 | 11 | 60%
ParaNet (phoneme) | — | 10 | 10 | 53%
Para-WaveNet (character) | $L_s^{(1)}$ | 13 | 13 | 66%
Para-WaveNet (phoneme) | $L_s^{(1)}$ | 10 | 10 | 56%
Improved Para-WaveNet (phoneme) | $L_s^{(1)} + L_s^{(2)} + L_s^{(3)}$ | 9 | 9 | 53%
Transformer (phoneme) | $L_s^{(1)}$ | 3 | 3 | 17%
Transformer (phoneme) | $L_s^{(1)} + L_s^{(2)} + L_s^{(3)}$ | 2 | 2 | 10%
Ghost-ILPCnet (phoneme, without Bang pre-training) | $L_s^{(1)} + L_s^{(2)} + L_s^{(3)}$ | 1 | 1 | 10%
Ghost-ILPCnet (phoneme, Bang pre-training) | $L_s^{(1)} + L_s^{(2)} + L_s^{(3)}$ | 1 | 1 | 6%
Table 8. Real-time factors for each model.

Model | RTF
Mel + WaveNet | 0.2412
ParaNet | 0.1886
Improved Para-WaveNet | 0.0859
Transformer | 0.9141
Tacotron 2 | 0.9411
Mel + ILPCnet | 0.1921
Ghost-ILPCnet (without Bang pre-training) | 0.0021
Ghost-ILPCnet (Bang pre-training) | 0.0041
Table 9. MOS scores of different models.

Model | MOS Score
Real Voice | 4.62
Mel + WaveNet | 4.01
ParaNet | 4.32
Improved Para-WaveNet | 4.21
Transformer | 4.34
Tacotron 2 | 4.45
Mel + ILPCnet | 4.03
Ghost-ILPCnet (without Bang pre-training) | 4.22
Ghost-ILPCnet (Bang pre-training) | 4.48