Article

Meta-Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

1 College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, China
2 School of Educational Technology, Northwest Normal University, Lanzhou 730070, China
3 National and Provincial Joint Engineering Laboratory of Learning Analysis Technology in Online Education, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 12185; https://doi.org/10.3390/app122312185
Submission received: 27 October 2022 / Revised: 20 November 2022 / Accepted: 21 November 2022 / Published: 28 November 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The paper proposes a meta-learning-based Mandarin-Tibetan cross-lingual text-to-speech (TTS) system to realize both Mandarin and Tibetan speech synthesis under a unified framework. First, we build two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual baseline TTS: a shared encoder Mandarin-Tibetan cross-lingual TTS and a separate encoder Mandarin-Tibetan cross-lingual TTS. Both baseline systems use a speaker classifier with a gradient reversal layer to disentangle speaker-specific information from the text encoder. At the same time, we design a prosody generator to extract prosodic information from sentences so that syntactic and semantic information is adequately explored. To further improve the synthesized speech quality of the Tacotron2-based Mandarin-Tibetan cross-lingual TTS, we propose a meta-learning-based Mandarin-Tibetan cross-lingual TTS. Based on the separate encoder Mandarin-Tibetan cross-lingual TTS, we use an additional dynamic network to predict the parameters of the language-dependent text encoder, which enables better cross-lingual knowledge sharing in the sequence-to-sequence TTS. Lastly, we synthesize Mandarin or Tibetan speech through this single acoustic model. The baseline experimental results show that the separate encoder Mandarin-Tibetan cross-lingual TTS handles the input of different languages better than the shared encoder Mandarin-Tibetan cross-lingual TTS. The experimental results further show that the proposed meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis method effectively improves the naturalness and speaker similarity of the synthesized speech.

1. Introduction

Speech synthesis (or text-to-speech, TTS) has been widely used in applications such as smart homes, navigation, and audiobooks. However, generating natural and expressive human-like speech remains a challenge for computers.
Generally speaking, there are several approaches for generating speech: (a) formant synthesis [1], (b) unit selection-based speech synthesis [2], (c) hidden Markov model (HMM)-based statistical parametric speech synthesis (SPSS) [3], and (d) deep learning-based speech synthesis [4,5,6,7]. Unit selection-based speech synthesis generates the most natural-sounding speech by concatenating small units of prerecorded waveforms with a unit selection algorithm, but its flexibility is limited. HMM-based SPSS has been the most popular TTS technology since the 1990s. The main advantage of HMM-based SPSS over unit selection-based speech synthesis is its flexibility in changing speech features, speaking styles, and emotions. This flexibility is mainly attributed to applying many techniques for controlling variation in speech, such as adaptation [8], interpolation [9], and eigen-voice. However, the synthesized speech of HMM-based SPSS is muffled compared with natural speech [10].
Since 2006, SPSS-based TTS has improved with the development of deep learning technologies. Deep learning-based statistical parametric acoustic models can be roughly divided into two categories: one adopts restricted Boltzmann machines (RBMs) [11] or deep belief networks (DBNs) [12] to improve the probability density function of HMM states, while the other uses feed-forward neural networks (FNNs), mixture density networks (MDNs) [13], or long short-term memory (LSTM) networks [14] to model the relationship between linguistic features and acoustic parameters. Because deep learning technologies represent the mapping between context-dependent linguistic features and acoustic parameters better than the statistical models commonly used in SPSS (e.g., decision trees, HMMs), they help overcome the loss of detailed characteristics in synthesized speech.
Sequence-to-sequence (seq2seq) neural networks have been successfully applied to various tasks such as machine translation [15,16] and speech recognition [17,18]. In [19], Sotelo et al. proposed a seq2seq acoustic model for speech synthesis named Char2Wav, consisting of a reader and a neural vocoder. The reader is an encoder-decoder model with attention: the encoder is a bidirectional recurrent neural network that encodes the input text into hidden feature representations, and the decoder with attention takes these representations to predict intermediate acoustic representations for the neural vocoder. The neural vocoder is a conditional extension of SampleRNN [20] that generates raw waveform samples from the intermediate representations. Subsequently, seq2seq TTS models represented by Tacotron could be trained directly on <text, speech> pairs, automatically learning the alignment and mapping from characters to spectrogram frames, which are then converted into waveforms by a neural vocoder [21]. Vaswani et al. [22] further proposed the attention-based Transformer to replace complex recurrent or convolutional neural networks in the encoder and decoder. Recent studies show that Transformer-based TTS [6,7] is superior to Tacotron-based TTS in inference speed and robustness. As a result, the synthesized speech quality of state-of-the-art seq2seq TTS is close to that of natural human voices.
The rest of the paper is organized as follows. We first introduce related work in Section 2. Then, we illustrate the baseline in Section 3 and present our meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis framework in Section 4. Subsequently, we explain the experimental data and setup in Section 5. The experimental results and analysis are presented in Section 6. Finally, a brief conclusion and future work are provided in Section 7.

2. Related Work

In cross-lingual speech synthesis, speech in several languages is synthesized in the voice of a monoglot speaker. In the early stages, monolingual or simple multilingual TTS systems were often used to synthesize cross-lingual speech, so switching languages was accompanied by a change of voice. To address this problem, Traber et al. [23] built a polyglot TTS using a bilingual speech corpus recorded by a bilingual speaker. In [24], Mandarin and English context-dependent HMM states were shared, and the mapping was learned from a bilingual dataset recorded by a bilingual speaker. However, such cross-lingual TTS systems depend on constructing a multilingual corpus, which is time-consuming, tedious, and costly. Typically, only data from monoglot speakers in different languages are available. Therefore, modern cross-lingual TTS systems are expected to generate a high-quality same-speaker voice for various languages using only monolingual speech corpora.
Modern cross-lingual speech synthesis mainly faces two challenges. The first is that speaker-specific information and language characteristics are directly correlated, i.e., each speaker speaks only one language, which makes it very hard to maintain a consistent voice when switching languages, especially within an utterance. In [25], the authors attempted to factorize speaker-specific and language-specific information for HMM-based SPSS and showed that speaker and language factorization could disentangle the two. Subsequently, researchers used language-specific and speaker-specific encoder components for seq2seq cross-lingual speech synthesis [26] and used a gradient reversal layer (GRL) to further disentangle speaker-specific and language-specific information [27,28]. The second challenge is resolving the inconsistency between the phonetic systems of different languages. One approach is to keep the input and output of the acoustic model language-dependent while making the middle layers language-independent and shared across languages. In [29,30], the authors proposed multilingual deep learning-based SPSS that shared hidden layers across different languages while keeping the input and output of the acoustic models language-dependent. For cross-lingual seq2seq TTS, researchers used language-dependent text encoders to alleviate the mutual interference of inputs from different languages [26,31], or used a language ID to distinguish inputs of different languages in a language-independent text encoder [31]. Another approach first converts the graphemes of different languages into a uniform representation such as Unicode bytes or the International Phonetic Alphabet (IPA) [32,33]; the models then consume the Unicode bytes or IPA as input.
Meta-learning (or learning-to-learn) [34] is a machine learning paradigm that has gained increasing attention in recent years. As early as [35], the authors adopted a dynamic network to predict a few parameters of a main network: the main network and the dynamic network take the same input image, the main network outputs the prediction results, and the dynamic network outputs the convolutional parameters of the main network. Similarly, the authors of [36,37] used dynamic networks to generate all the weights of convolutional neural networks (CNNs) and LSTMs, and the experimental results showed that these methods outperformed traditional neural network-based methods. In [38], researchers utilized a dynamic network called a contextual parameter generator to generate the convolutional parameters of the text encoder in a seq2seq multilingual TTS. The experimental results showed that the proposed meta-learning-based multilingual TTS could produce natural-sounding multilingual speech for more languages with less training data than previous approaches.
Tibetan has three major dialects: the Lhasa dialect, the Kang dialect, and the Amdo dialect. Because the Lhasa dialect is the standard pronunciation of Tibetan, this paper focuses on Lhasa dialect speech synthesis. Although the three dialects have different pronunciations, they use the same Tibetan characters. A Tibetan character, which combines horizontal and vertical spelling, differs from English, which is spelled fully linearly. A typical Tibetan character consists of seven parts, as shown in Figure 1. In our previous work, we proposed HMM-based [39] and deep learning-based [40] Mandarin-Tibetan cross-lingual SPSS. In [39], we built a Tibetan grapheme-to-phoneme conversion module to convert Tibetan characters into initial and final sequences and used speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis. In [40], we proposed deep learning-based Mandarin-Tibetan cross-lingual speech synthesis to realize both Mandarin and Tibetan speech synthesis under a unified framework. Similarly, Xu et al. [41] proposed an HMM-based monolingual Tibetan SPSS. However, these Mandarin-Tibetan cross-lingual and Tibetan monolingual TTS systems still used SPSS methods, so the voice quality of the synthesized speech needs to be further improved. Subsequently, seq2seq-based Tibetan TTS was proposed in [42,43]. In [42], the authors explored Tacotron2-based Lhasa dialect TTS combined with the WaveNet model. In [43], the authors presented a Tacotron2-based Tibetan multi-dialect speech synthesis model to synthesize the Lhasa and Amdo dialects. Although these seq2seq-based monolingual Tibetan TTS models greatly improve the synthesized voice quality compared with SPSS, they still do not achieve seq2seq-based Mandarin-Tibetan cross-lingual speech synthesis.
In this paper, we implement meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis. Our main contributions are as follows.
(1) We build two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual seq2seq TTS, the shared encoder Mandarin-Tibetan cross-lingual TTS and the separate encoder Mandarin-Tibetan cross-lingual TTS, using a speaker classifier with a gradient reversal layer to disentangle speaker-specific information from the text encoder. At the same time, we design a prosody generator to extract prosodic information from a sentence to explore its syntactic and semantic information adequately.
(2) We propose a meta-learning-based Mandarin-Tibetan cross-lingual TTS to improve the synthesized speech quality of Mandarin and Tibetan. Based on the separate encoder Mandarin-Tibetan cross-lingual TTS, we use an additional dynamic network to predict the parameters of the language-dependent text encoder, which enables better cross-lingual knowledge sharing in the seq2seq TTS.

3. Baseline of Mandarin-Tibetan Cross-Lingual Speech Synthesis

We use two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual TTS as the baseline system. The two baseline systems are called shared encoder Mandarin-Tibetan cross-lingual TTS and separate encoder Mandarin-Tibetan cross-lingual TTS, respectively.
The shared encoder Mandarin-Tibetan cross-lingual TTS is shown in Figure 2. To adapt the original Tacotron2-based TTS to cross-lingual speech synthesis, we modify it as follows for the shared encoder Mandarin-Tibetan cross-lingual TTS.
(1) A residual connection and a batch normalization layer are added to the text encoder of the original Tacotron2, yielding what we call the residual text encoder (a minimal sketch is given after this list). The output of the residual text encoder is given by Equation (1):

$\mathrm{Encoder}_{\mathrm{res}}\big(\{x_j\}_{j=1}^{N}\big) = \mathrm{Encoder}\big(\{x_j\}_{j=1}^{N}\big) + \mathrm{Phone\_embedding}\big(\{x_j\}_{j=1}^{N}\big)$   (1)

where $\{x_j\}_{j=1}^{N}$ denotes the input phoneme sequence, $\mathrm{Encoder}(\{x_j\}_{j=1}^{N})$ denotes the encoder output of the original Tacotron2-based TTS, and $\mathrm{Phone\_embedding}(\{x_j\}_{j=1}^{N})$ denotes the embedding of the input phoneme sequence.
(2) In the Mandarin-Tibetan cross-lingual speech synthesis task, a language embedding and a speaker embedding are added to the Tacotron2-based TTS to distinguish the pronunciation characteristics of different speakers in different languages. In particular, the speaker embedding and the output of the attention LSTM are concatenated as the input of the decoder LSTM to improve the similarity between the synthesized speech and the original speech.
(3) We use an adversarial speaker classifier to disentangle speaker-specific information from the text encoder (a minimal sketch of this classifier is also given after this list). The adversarial speaker classifier consists of a gradient reversal layer, two fully connected layers, and a softmax layer. With the adversarial speaker classifier, the training loss becomes Equation (2):

$\mathrm{LOSS}_{\mathrm{Total}} = \mathrm{LOSS}_{\mathrm{DEC}} + \mathrm{LOSS}_{\mathrm{Speaker\_classifier}}$   (2)

where $\mathrm{LOSS}_{\mathrm{DEC}}$ is the loss between the synthesized speech and the target speech, and $\mathrm{LOSS}_{\mathrm{Speaker\_classifier}}$ is the speaker classification loss of the adversarial speaker classifier.
(4) Chinese and Tibetan characters differ from English and other Roman-character-based languages: because there is no distinct separator between adjacent words, an occasional word segmentation error leads to semantic confusion and prosodic errors in speech synthesis. In [44], experiments show that enhancing the input text by integrating prosodic information (e.g., prosodic word boundaries, prosodic phrase boundaries, HTS-style context-dependent information) significantly improves the naturalness of the synthesized speech in seq2seq TTS. In [45,46,47], text embeddings extracted by the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model are added to Tacotron2-based seq2seq TTS as an additional input. Because these text embedding features contain linguistic and semantic information, they help the speech synthesis system generate more natural speech. To explore syntactic and semantic information adequately, we design a prosody generator to extract the prosodic information from the sentence [48]. The prosody generator includes a text analyzer, a feature vector extraction module, a question set, and a hidden feature extraction module.
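To make modifications (1) and (3) concrete, the following is a minimal PyTorch sketch of the residual text encoder output of Equation (1) and the adversarial speaker classifier with a gradient reversal layer used in Equation (2). The class names, hidden sizes, time pooling, and the two-speaker setting are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AdversarialSpeakerClassifier(nn.Module):
    """Gradient reversal layer -> two fully connected layers -> softmax (modification (3))."""

    def __init__(self, enc_dim=512, hidden_dim=256, n_speakers=2, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_speakers),
        )

    def forward(self, encoder_out):                        # encoder_out: (B, T, enc_dim)
        reversed_feats = GradReverse.apply(encoder_out, self.lambd)
        logits = self.net(reversed_feats.mean(dim=1))      # average-pool over time
        return torch.log_softmax(logits, dim=-1)


def residual_encoder_output(encoder, phone_embedding, phoneme_ids):
    """Modification (1), Equation (1): encoder output plus the phoneme embedding."""
    emb = phone_embedding(phoneme_ids)                     # (B, N, enc_dim)
    return encoder(emb) + emb


# Equation (2): LOSS_Total = LOSS_DEC + LOSS_Speaker_classifier, e.g.
# loss_total = loss_dec + nn.NLLLoss()(speaker_log_probs, speaker_ids)
```

During training, the reversed gradient pushes the text encoder toward representations from which the speaker cannot be predicted, which is what disentangles speaker-specific information from the text encoding.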
The framework of the separate encoder Mandarin-Tibetan cross-lingual TTS, shown in Figure 3, is similar to that of the shared encoder Mandarin-Tibetan cross-lingual TTS. The difference is that Mandarin and Tibetan texts use different residual text encoders. The language ID is used as a mask to select the Mandarin or Tibetan encoder output, so that Mandarin and Tibetan text are encoded into different hidden feature representations.
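A minimal sketch of this language-ID masking follows; it assumes a binary language ID per utterance (0 for Mandarin, 1 for Tibetan) and encoder outputs of identical shape, and the actual batching scheme may differ.

```python
import torch


def select_encoder_output(mandarin_out: torch.Tensor,
                          tibetan_out: torch.Tensor,
                          language_id: torch.Tensor) -> torch.Tensor:
    """mandarin_out, tibetan_out: (B, T, D) residual-encoder outputs for the same batch;
    language_id: (B,) with 0 for Mandarin and 1 for Tibetan."""
    mask = language_id.view(-1, 1, 1).float()   # broadcast over time steps and channels
    return (1.0 - mask) * mandarin_out + mask * tibetan_out
```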

4. The Framework of Meta-Learning-Based Mandarin-Tibetan Cross-Lingual Speech Synthesis

The meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis framework is shown in Figure 4. Based on the separate encoder Mandarin-Tibetan TTS, we use a dynamic network to predict the parameters of the language-dependent text encoders. Because generating all the weights of an RNN [37] is difficult, we replace the bidirectional LSTM of the original text encoder with the fully convolutional encoder from DCTTS [49]. The parameters of the convolutional encoder are then generated by corresponding fully connected (FC) layers, called the dynamic network, to achieve better cross-language knowledge sharing, as sketched below.
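The following is a minimal PyTorch sketch of one such dynamically parameterized 1-D convolution: a fully connected layer (the dynamic network) maps the language embedding to the flattened weights and biases of a convolution in the text encoder, which are then applied with F.conv1d. The channel sizes, kernel size, and the one-language-per-batch assumption are illustrative; the full encoder stacks several such layers as in DCTTS.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv1d(nn.Module):
    """A 1-D convolution whose weights are produced by a dynamic network (an FC layer)
    from the language embedding, instead of being learned directly."""

    def __init__(self, lang_dim=20, in_ch=256, out_ch=256, kernel_size=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        n_params = out_ch * in_ch * kernel_size + out_ch      # conv weights + biases
        self.generator = nn.Linear(lang_dim, n_params)        # the dynamic network

    def forward(self, x, lang_emb):
        # x: (B, in_ch, T); lang_emb: (lang_dim,). A single language per batch is
        # assumed, so one weight set is generated and applied to the whole batch.
        params = self.generator(lang_emb)
        w_end = self.out_ch * self.in_ch * self.k
        weight = params[:w_end].view(self.out_ch, self.in_ch, self.k)
        bias = params[w_end:]
        return F.conv1d(x, weight, bias, padding=self.k // 2)


# Usage sketch: lang_emb is the 20-dimensional embedding of the batch's language.
# layer = DynamicConv1d()
# out = layer(torch.randn(8, 256, 100), torch.randn(20))
```

Because the dynamic network is shared across languages while the generated weights differ per language, the text encoder stays language-dependent, yet the knowledge of how to encode text is shared through the generator.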

5. Experiments

5.1. Experimental Data

We build a Tibetan corpus called Tibetan_Lasa_1 for speech synthesis. Tibetan_Lasa_1 is a standard Lhasa dialect corpus of one female speaker. The utterances are recorded in a professional recording studio with high-fidelity microphones and professional audio workstations, and all recordings are checked to avoid errors introduced during recording. There are 10,000 recordings in total from a 25-year-old female Lhasa dialect broadcaster. The sentences are mainly declarative utterances from news, Tibetan websites, and Tibetan books. All recordings were saved in the Microsoft WAV format (mono-channel, 16-bit depth, sampled at 16 kHz).
In addition, we use the open-source Mandarin speech synthesis dataset from DataBaker. This open-source dataset consists of 10,000 recordings from a female speaker aged 20–30. All recordings were saved in the Microsoft WAV format (mono-channel, 16-bit depth, sampled at 48 kHz) and were down-sampled from 48 kHz to 16 kHz in the experiment.
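As an illustration, such down-sampling can be performed with standard audio tools; the sketch below uses librosa and soundfile as assumed dependencies and is not the exact preprocessing script used in the experiments.

```python
import librosa
import soundfile as sf


def downsample_to_16k(in_path: str, out_path: str) -> None:
    # librosa resamples to the requested rate on load and converts to mono.
    audio, _ = librosa.load(in_path, sr=16000, mono=True)
    # Write a 16-bit mono WAV at 16 kHz to match the Tibetan_Lasa_1 format.
    sf.write(out_path, audio, 16000, subtype="PCM_16")
```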

5.2. Experimental Setup

In the experiment, we selected training corpus from the DataBaker and Tibetan_Lasa_1. First, two sets of 150 utterances were randomly selected from the DataBaker to serve as the Mandarin test set and Mandarin development set, respectively. Similarly, two sets of 150 utterances were randomly selected from the Tibetan_Lasa_1 to serve as the Tibetan test set and Tibetan development set, respectively. Then the remaining Mandarin utterances were marked as {M}, while the remaining Tibetan utterances were marked as {T}.
We build the following four TTS models to verify the effectiveness of the proposed meta-learning-based Mandarin-Tibetan cross-lingual TTS.
  • seq2seq-M or seq2seq-T: Based on the original Tacotron2, we trained Mandarin and Tibetan monolingual TTS models, called seq2seq-M and seq2seq-T, respectively.
  • shared-MT: the shared encoder Mandarin-Tibetan (MT) cross-lingual TTS, trained on training subsets {M} and {T}; its architecture is shown in Figure 2.
  • separate-MT: the separate encoder Mandarin-Tibetan cross-lingual TTS, trained on training subsets {M} and {T}; its architecture is shown in Figure 3.
  • meta-MT: the meta-learning-based Mandarin-Tibetan cross-lingual TTS, trained on training subsets {M} and {T}; its architecture is shown in Figure 4.
The Adam optimizer was used to train the above TTS models. The learning rate started from $10^{-3}$ and was halved every 10k steps. All models were trained until the loss on the development set started increasing. Two NVIDIA Tesla P4 GPUs were used, and the batch size was set to 64. The dynamic network consists of fully connected layers; its input is the 20-dimensional language embedding, and its output dimension is determined by the size of each convolution kernel. The bidirectional LSTM of the prosody generator consists of two LSTM layers with 128 units each. The other parameters of all models are similar to those of the original Tacotron2.
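For reference, the stated learning-rate schedule corresponds to a step decay such as the following sketch; the exact scheduler implementation is not specified here, and the stand-in model is a placeholder.

```python
import torch
import torch.nn as nn

tts_model = nn.Linear(8, 8)  # stand-in for the actual TTS model parameters
optimizer = torch.optim.Adam(tts_model.parameters(), lr=1e-3)
# Halve the learning rate every 10k training steps, starting from 1e-3.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.5)
# In the training loop, optimizer.step() and scheduler.step() are called once per batch.
```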

6. Evaluations and Analysis

To evaluate the effectiveness of the proposed meta-learning-based Mandarin-Tibetan cross-lingual TTS, we evaluate the synthesized speech quality of the above models under different training sets. The evaluations mainly include multilingual speech synthesis evaluation and cross-lingual speech synthesis evaluation.

6.1. Multilingual Speech Synthesis Evaluation

In the multilingual speech synthesis evaluation, we evaluate the quality of speech synthesized in each speaker's native language by the above four TTS models. The evaluation includes objective and subjective evaluations.

6.1.1. Objective Evaluation

We evaluated the Mel-cepstral distortion (MCD) in the objective evaluations. The MCD measures the error between the Mel-frequency cepstral coefficients (MFCCs) of predicted and natural speech samples. Denoting the natural MFCCs of a frame by $y$ and the predicted MFCCs by $\hat{y}$, the MCD of the frame is calculated as

$\mathrm{MCD}(y; \hat{y}) = \| y - \hat{y} \|_2$

where $\| \cdot \|_2$ denotes the $L_2$ norm.
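A minimal NumPy sketch of this frame-level computation follows; it assumes the natural and predicted MFCC sequences have already been time-aligned frame by frame, and it omits any scaling constant that other MCD definitions include, to match the formula above.

```python
import numpy as np


def mcd(mfcc_ref: np.ndarray, mfcc_pred: np.ndarray) -> float:
    """mfcc_ref, mfcc_pred: (n_frames, n_coeffs) arrays, aligned frame by frame."""
    frame_dist = np.linalg.norm(mfcc_ref - mfcc_pred, axis=1)  # per-frame L2 norm
    return float(frame_dist.mean())                            # average over frames
```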
The MCD of the different seq2seq TTS models on the test sets is shown in Table 1. The MCD values of the Mandarin speech synthesized by the Mandarin-Tibetan cross-lingual models shared-MT, separate-MT, and meta-MT increase by 0.119, 0.026, and 0.019, respectively, compared with seq2seq-M. For Tibetan speech synthesis, the baseline Mandarin-Tibetan cross-lingual models shared-MT and separate-MT help improve the voice quality of the synthesized Tibetan speech. The separate-MT has higher modeling accuracy than the shared-MT, which may be because the separate-MT alleviates the mutual interference of inputs from different languages. Furthermore, the meta-MT is slightly better than the separate-MT for both Mandarin and Tibetan speech synthesis, which matches our expectation that the meta-MT should be more flexible.

6.1.2. Subjective Evaluation

For the subjective evaluations, 20 speech samples were randomly selected from the test set. We conducted a mean opinion score (MOS) test and an AB preference test to evaluate the voice quality of the synthesized speech. We invited 30 native Mandarin listeners and 30 native Tibetan listeners as subjects. The Mandarin subjects evaluated the synthesized Mandarin speech samples, while the Tibetan subjects assessed the synthesized Tibetan speech samples.
In the MOS test, subjects were asked to rate the naturalness of the synthesized speech on a 5-point scale. The average MOS scores of the synthesized Mandarin speech samples are shown in Figure 5: seq2seq-M has the highest MOS score, and the MOS scores of meta-MT and separate-MT are slightly higher than that of shared-MT. The average MOS scores of the synthesized Tibetan speech are shown in Figure 6. The meta-MT has slightly higher average MOS scores for both Mandarin and Tibetan speech synthesis than the separate-MT and shared-MT. These results are consistent with the objective results.
In the AB preference test, the sentences were the same in both items within a pair, and each pair of synthesized speech samples was played in random order. The subjects were asked to listen and judge which sample sounded better, or to choose "neutral" if they had no preference. The preference results of the synthesized Mandarin and Tibetan speech samples are shown in Table 2 and Table 3, respectively, and are consistent with the objective and MOS evaluation results.

6.2. Cross-Lingual Speech Synthesis Evaluation

We build a small-scale dataset containing 50 bilingual sentences for the cross-lingual speech synthesis evaluation. Because seq2seq-M and seq2seq-T are monolingual TTS models, the speech samples are generated only by the shared-MT, separate-MT, and meta-MT models, in the Mandarin speaker's and the Tibetan speaker's voices, respectively. We randomly selected 20 synthesized speech samples to conduct MOS and AB preference tests for speech naturalness and cross-language speaker similarity preservation. In addition, 20 Mandarin-Tibetan bilingual speakers were invited to participate in each test.
In the MOS test, subjects were asked to rate the naturalness of the synthesized speech samples on a 5-point scale. The average MOS scores of the synthesized bilingual speech samples in the Mandarin speaker's or the Tibetan speaker's voice are shown in Figure 7. The MOS results of the baselines show that the separate-MT outperforms the shared-MT in cross-lingual speech synthesis, which indicates that the separate-MT benefits from less language interference when encoding the inputs into hidden representations. Furthermore, the MOS results show that our proposed meta-MT model synthesizes more natural speech than the baselines on bilingual sentences. In addition, the bilingual speech synthesized in the Tibetan speaker's voice is more natural than that synthesized in the Mandarin speaker's voice.
In the AB preference test, the listeners were required to make a speaker similarity choice among three options: (1) similar to the Mandarin speaker's voice; (2) no preference; (3) similar to the Tibetan speaker's voice. The AB preference test results are shown in Figure 8. Similar to the previous results, the separate-MT slightly outperforms the shared-MT in cross-lingual speech synthesis. Furthermore, the meta-MT outperforms the baseline TTS by better maintaining both the Tibetan speaker's voice and the Mandarin speaker's voice. Nevertheless, no model maintains the Mandarin speaker's voice as well as the Tibetan speaker's voice, which may be partly due to the lower naturalness of the synthesized Mandarin speech.

7. Conclusions and Future Work

The paper proposes a meta-learning-based Mandarin-Tibetan cross-lingual TTS to realize both Mandarin and Tibetan speech synthesis under a unified framework. We first build two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual baseline TTS: the shared encoder Mandarin-Tibetan cross-lingual TTS and the separate encoder Mandarin-Tibetan cross-lingual TTS. Then, to further improve the synthesized speech quality of the Tacotron2-based Mandarin-Tibetan cross-lingual TTS, we propose the meta-learning-based Mandarin-Tibetan cross-lingual TTS, which uses an additional dynamic network to predict the parameters of the language-dependent text encoder. Finally, we synthesize Mandarin or Tibetan speech through this single acoustic model. The baseline experimental results show that the separate encoder Mandarin-Tibetan cross-lingual TTS handles the input of different languages better than the shared encoder Mandarin-Tibetan cross-lingual TTS. Furthermore, the experimental results show that the proposed meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis method effectively improves the naturalness and speaker similarity of the synthesized speech.
Future work will investigate how to synthesize Mandarin-Tibetan cross-lingual speech with satisfactory voice quality from as little training data as possible. We will also explore improving prosody with other linguistic features (e.g., BERT-derived features).

Author Contributions

Conceptualization, W.Z. and H.Y.; methodology, W.Z.; software, W.Z.; validation, W.Z. and H.Y.; formal analysis, W.Z.; resources, H.Y.; data curation, W.Z. and H.Y.; writing—original draft preparation, W.Z.; writing—review and editing, H.Y.; visualization, W.Z.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported by the research fund from the National Natural Science Foundation of China (Grant No. 62067008, No. 62267008). Additionally, part of this work is also supported by the Science and Technology Program of Gansu Province (Grant No. 20JR10RA095, No. 21JR7RA117) and the University Innovation Foundation of Gansu Province (Grant No. 2022B-091).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The open-source Mandarin speech synthesis dataset from DataBaker can be found at https://www.data-baker.com/data/index/TNtts (accessed on 26 October 2022). For access to Tibetan_Lasa_1, please contact the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Panagiotopoulos, D.; Orovas, C.; Syndoukas, D. Neural network based autonomous control of a speech synthesis system. Intell. Syst. Appl. 2022, 14, 200077. [Google Scholar] [CrossRef]
  2. Hunt, A.J.; Black, A.W. Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, USA, 9 May 1996; pp. 373–376. [Google Scholar]
  3. Tokuda, K.; Nankaku, Y.; Toda, T.; Zen, H.; Yamagishi, J.; Oura, K. Speech synthesis based on hidden Markov models. Proc. IEEE 2013, 101, 1234–1252. [Google Scholar] [CrossRef]
  4. Ze, H.; Senior, A.; Schuster, M. Statistical parametric speech synthesis using deep neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7962–7966. [Google Scholar]
  5. Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Saurous, R.A. Tacotron: Towards end-to-end speech synthesis. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 4006–4010. [Google Scholar]
  6. Li, N.; Liu, S.; Liu, Y.; Zhao, S.; Liu, M. Neural speech synthesis with Transformer network. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6706–6713. [Google Scholar]
  7. Ren, Y.; Ruan, Y.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z. Fastspeech: Fast, robust and controllable text to speech. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 3171–3180. [Google Scholar]
  8. Tachibana, M.; Izawa, S.; Nose, T.; Kobayashi, T. Speaker and style adaptation using average voice model for style control in hmm-based speech synthesis. In Proceedings of the 2008 IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, NV, USA, 31 March–4 April 2008; pp. 4633–4636. [Google Scholar]
  9. Yoshimura, T.; Masuko, T.; Tokuda, K.; Kobayashi, T.; Kitamura, T. Speaker interpolation in hmm-based speech synthesis system. J. Acoust. Soc. Jpn. 1997, 4, 199–206. [Google Scholar]
  10. Ling, Z.; Kang, S.; Zen, H.; Senior, A.; Schuster, M.; Qian, X. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Proc. Mag. 2015, 3, 35–52. [Google Scholar] [CrossRef]
  11. Xiang, Y.; Ling, Z.; Hu, Y.; Dai, L. Modeling spectral envelopes using restricted boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2013, 10, 2129–2139. [Google Scholar]
  12. Kang, S.; Qian, X.; Meng, H. Multi-distribution deep belief network for speech synthesis. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8012–8016. [Google Scholar]
  13. Zen, H.; Senior, A.W. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, 4–9 May 2014; pp. 3844–3848. [Google Scholar]
  14. Song, E.; Soong, F.; Kang, H. Effective spectral and excitation modeling techniques for LSTM-RNN-based speech synthesis systems. IEEE ACM Trans. Audio Speech Lang. Process. 2013, 11, 2152–2161. [Google Scholar] [CrossRef]
  15. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7—9 May 2015. [Google Scholar]
  16. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1412–1421. [Google Scholar]
  17. Chorowski, J.; Bahdanau, D.; Cho, K.; Bengio, Y. End-to-end continuous speech recognition using attention-based recurrent nn: First results. In Proceedings of the 28th Conference on Neural Information Processing Systems Workshop on Deep Learning and Representation Learning, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  18. Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 577–585. [Google Scholar]
  19. Sotelo, J.; Mehri, S.; Kumar, K.; Santos, J.F.; Kastner, K.; Courville, A.C. Char2Wav: End-to-end speech synthesis. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  20. Mehri, S.; Kumar, K.; Gulrajani, I.; Kumar, R.; Jain, S.; Sotelo, J. SampleRNN: An unconditional end-to-end neural audio generation model. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  21. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 4479–4483. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  23. Traber, C.; Huber, K.; Nedir, K.; Pfister, B.; Keller, E.; Zellner, B. From multilingual to polyglot speech synthesis. In Proceedings of the 6th European Conference on Speech Communication and Technology, Budapest, Hungary, 5–9 September 1999. [Google Scholar]
  24. Yao, Q.; Hui, L.; Soong, F.K. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 1231–1239. [Google Scholar]
  25. Zen, H.; Braunschweiler, N.; Buchholz, S.; Gales, M.; Knill, K.; Krstulovic, S. Statistical parametric speech synthesis based on speaker and language factorization. IEEE ACM Trans. Audio Speech Lang. Process. 2012, 20, 1713–1724. [Google Scholar] [CrossRef]
  26. Nachmani, E.; Wolf, L. Unsupervised polyglot text-to-speech. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 7055–7059. [Google Scholar]
  27. Zhang, Y.; Weiss, R.J.; Zen, H.; Wu, Y.; Chen, Z.; Skerry-Ryan, R. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 2668–2672. [Google Scholar]
  28. Cao, Y.; Liu, S.; Wu, X.; Kang, S.; Meng, H. Code-switched speech synthesis using bilingual phonetic posteriorgram with only monolingual corpora. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 April 2020; pp. 7619–7623. [Google Scholar]
  29. Ning, Y.; Wu, Z.; Li, R.; Jia, J.; Cai, L. Learning cross-lingual information with multilingual BLSTM for speech synthesis of low-resource languages. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5545–5549. [Google Scholar]
  30. Fan, Y. Speaker and language factorization in DNN-based TTS synthesis. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech, and Signal Processing, Shanghai, China, 20–25 March 2016; pp. 5540–5544. [Google Scholar]
  31. Cao, Y.; Wu, X.; Liu, S.; Yu, J.; Meng, H.M. End-to-end code-switched TTS with mix of monolingual recordings. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6935–6939. [Google Scholar]
  32. Li, B.; Zhang, Y.; Sainath, T.; Wu, Y.; Chan, W. Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 5621–5625. [Google Scholar]
  33. Zhan, H.; Zhang, H.; Ou, W.; Lin, Y. Improve cross-Lingual text-to-speech synthesis on monolingual corpora with pitch contour information. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 31 August–3 September 2021. [Google Scholar]
  34. Hospedales, T.M.; Antoniou, A.; Micaelli, P.; Storkey, A.J. Meta-learning in neural networks: A Survey. IEEE Trans. Pattern Anal. 2022, 9, 5149–5169. [Google Scholar] [CrossRef]
  35. Klein, B.; Wolf, L.; Afek, Y. A dynamic convolutional layer for short-range weather prediction. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4840–4848. [Google Scholar]
  36. Jia, X.; De Brabandere, B.; Tuytelaars, T.; Gool, L.V. Dynamic filter networks. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2015. [Google Scholar]
  37. Ha, D.; Dai, A.; Le, Q.V. Hypernetworks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  38. Nekvinda, T.; Dušek, O. One model, many languages: Meta-learning for multilingual text-to-speech. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Shanghai, China, 25–29 October 2020; pp. 2972–2976. [Google Scholar]
  39. Yang, H.; Oura, K.; Wang, H.; Gan, Z.; Tokuda, K. Using speaker adaptive training to realize mandarin-tibetan cross-lingual speech synthesis. Multimed. Tools Appl. 2015, 22, 1–16. [Google Scholar] [CrossRef]
  40. Zhang, W.; Yang, H.; Bu, X.; Wang, L. Deep learning for Mandarin-Tibetan cross-lingual speech synthesis. IEEE Access 2019, 7, 167884–167894. [Google Scholar] [CrossRef]
  41. Xu, S.; Yu, H.; Li, G. The influence of context on Tibetan Lhasa speech synthesis. In Proceedings of the IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference, Chongqing, China, 25–26 March 2017; pp. 625–629. [Google Scholar]
  42. Zhao, Y.; Hu, P.; Xu, X.; Wu, L.; Li, X. Lhasa-tibetan speech synthesis using end-to-end model. IEEE Access 2019, 7, 167884–167894. [Google Scholar] [CrossRef]
  43. Xu, X.; Yang, L.; Zhao, Y.; Wang, H. End-to-end speech synthesis for tibetan multidialect. Complexity 2021, 2, 1–8. [Google Scholar] [CrossRef]
  44. Lu, Y.; Dong, M.; Chen, Y. Implementing prosodic phrasing in Chinese end-to-end speech synthesis. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 7050–7054. [Google Scholar]
  45. Li, J.; Wu, Z.; Li, R.; Zhi, P.; Meng, H. Knowledge-based linguistic encoding for end-to-end Mandarin text-to-speech synthesis. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 4494–4498. [Google Scholar]
  46. Hayashi, T.; Watanabe, S.; Toda, T.; Takeda, K.; Livescu, K. Pre-trained text embeddings for enhanced text-to-speech synthesis. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 4430–4434. [Google Scholar]
  47. Xiao, Y.; He, L.; Ming, H.; Soong, F.K. Improving prosody with linguistic and bert derived features in multi-speaker based Mandarin Chinese neural TTS. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6704–6708. [Google Scholar]
  48. Zhang, W.; Yang, Y. Improving sequence-to-sequence Tibetan speech synthesis with prosodic information. ACM Trans. Asian Low-Reso. 2022; submitted. [Google Scholar]
  49. Tachibana, H.; Uenoyama, K.; Aihara, S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 4784–4788. [Google Scholar]
Figure 1. The typical longest Tibetan character. The spelling order of a Tibetan character is prescript, superscript, radical, subscript, vowel, postscript, and post-postscript. A Tibetan character has at least one radical.
Figure 2. The shared encoder Mandarin-Tibetan cross-lingual TTS. The shared residual encoder handles phoneme sequences from different languages with explicit language embedding. In addition, a prosody generator is designed to extract the prosodic information from a sentence. The prosody generator includes a text analyzer, a feature vector extraction module, a question set, and a hidden feature extraction module. In the figure, the input is a Tibetan sentence (left) or a Chinese sentence (right), which means that all houses have been built in the last ten years.
Figure 3. The separate encoder Mandarin-Tibetan TTS. We use separate residual encoders for each language to handle phoneme sequences from different languages with explicit language embedding. In the figure, the input is a Tibetan sentence (left) or a Chinese sentence (right), which means that all houses have been built in the last ten years.
Figure 4. The framework of meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis. We use a dynamic network to predict the parameters of the language-dependent text encoders based on the separate encoder Mandarin-Tibetan TTS. In the figure, the input is a Tibetan sentence (left) or a Chinese sentence (right), which means that all houses have been built in the last ten years.
Figure 5. The average MOS scores of synthesized Mandarin speech samples under 95% confidence intervals.
Figure 6. The average MOS scores of synthesized Tibetan speech samples under 95% confidence intervals.
Figure 7. The average MOS scores of synthesized bilingual speech samples in Mandarin speaker’s or Tibetan speaker’s voices under 95% confidence intervals.
Figure 8. Speaker similarity results of Mandarin speech samples in Tibetan speaker’s (TS) voice generated by shared-MT, separate-MT and meta-MT models, Tibetan speech samples in Mandarin speaker’s (MS) voice generated by shared-MT, separate-MT and meta-MT models. MS-GT and TS-GT are the ground truth of Mandarin speaker and Tibetan speaker, respectively. NP denotes no preference.
Table 1. The MCD values of different seq2seq TTS models on the test sets. The seq2seq-M and seq2seq-T are Mandarin and Tibetan monolingual TTS models, respectively. The shared-MT, separate-MT, and meta-MT are Mandarin-Tibetan cross-language TTS models.

Languages | seq2seq-M | seq2seq-T | shared-MT | separate-MT | meta-MT
Mandarin  | 4.633     | -         | 4.752     | 4.659       | 4.652
Tibetan   | -         | 4.185     | 4.055     | 4.031       | 3.985
Table 2. Subjective AB preference scores (%) of synthesized Mandarin speech samples with p < 0.01.

Pair | seq2seq-M | shared-MT | separate-MT | meta-MT | Neutral
1    | 56.5      | 31.2      | -           | -       | 12.2
2    | -         | 37.8      | 49.8        | -       | 12.5
3    | -         | -         | 22.3        | 66.2    | 11.5
Table 3. Subjective AB preference scores (%) of synthesized Tibetan speech samples with p < 0.01.

Pair | seq2seq-T | shared-MT | separate-MT | meta-MT | Neutral
1    | 19.3      | 70.8      | -           | -       | 9.8
2    | -         | 27.3      | 56.7        | -       | 15.3
3    | -         | -         | 24.7        | 64.8    | 10.5