1. Introduction
Speech is an effective medium for expressing emotions and attitudes through language, and applications of emotion recognition in speech can be found in many areas [1,2]. Extracting and recognizing emotional information from speech signals is an important step toward more natural human-computer interaction.
Speech emotion recognition with deep learning methods aims to extract deep emotion features through artificial neural networks. The majority of speech emotion recognition architectures utilize neural networks such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or their combinations [3,4,5,6,7,8,9,10]. In recent years, in order to obtain higher recognition accuracy, most research has adopted one of two strategies to enrich the emotion information that a single model can obtain.
One strategy is to design and apply more complex architectures, such as deep neural networks (DNNs). In 2018, Tzirakis [11] proposed an end-to-end continuous speech emotion recognition model that extracts features from the raw speech signal with a DNN and stacks a two-layer LSTM on top to capture the contextual information in the data. Also based on a DNN, Sarma [12] investigated different choices of inputs and two different labeling strategies and applied the best combination to the IEMOCAP database.
Another approach to improving accuracy is to consider abundant speech features to model the emotion space. In 2020, Yu [13] proposed a speech emotion recognition model with an attention-LSTM-attention structure, which combined IS09 features and Mel-scaled spectrograms. Issa et al. [14] took Mel-frequency cepstral coefficients (MFCCs), chromagrams, Mel-scaled spectrograms, Tonnetz representations, and spectral contrast features extracted from speech as inputs and achieved 86.1% recognition accuracy on the 7-class EMO-DB dataset using a deep CNN.
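To make the kind of multi-feature input described above concrete, the following is a minimal sketch of how such acoustic features could be extracted and combined with librosa. It is not the exact pipeline of [14]; the file name, sampling rate, and feature dimensions are illustrative assumptions.

```python
# Hedged sketch: extract several common acoustic feature types and pool them
# into one fixed-size vector. Parameter values are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)               # MFCCs
chroma = librosa.feature.chroma_stft(y=y, sr=sr)                 # chromagram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)     # Mel-scaled spectrogram
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)         # spectral contrast
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # Tonnetz

# One simple way to combine them: average over time and concatenate.
features = np.concatenate([f.mean(axis=1)
                           for f in (mfcc, chroma, mel, contrast, tonnetz)])
```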
Although the above research obtained good performance in the speech emotion recognition task, it must be pointed out that such performance comes at the cost of the model's portability. There have been attempts to overcome the problems of huge model size and feature redundancy. In 2021, Muppidi [15] departed from traditional machine-learning methods operating on high-level features in real-valued space and proposed a quaternion convolutional neural network (QCNN) that encodes both features and the network in quaternion space, which not only maintains good speech emotion recognition accuracy but also greatly reduces the size of the model. In order to overcome the problem of feature redundancy in speech emotion recognition, Bandela [16] applied unsupervised feature selection to a combination of INTERSPEECH 2010 paralinguistic features, Gammatone cepstral coefficients (GTCCs), and power-normalized cepstral coefficients (PNCCs). The Feature Selection with Adaptive Structure Learning (FSASL), Unsupervised Feature Selection with Ordinal Locality (UFSOL), and the novel Subset Feature Selection (SuFS) algorithms were used to reduce the dimension of the input features and obtain better SER performance. Although these works have explored ways to simplify the model, they still depend on acoustic features extracted from speech; in other words, clean speech signals are still necessary.
However, real-life applications call for an efficient model that can handle much more complex scenarios, in which clean speech may be hard to obtain. To address these problems and improve the practicability of the model, we seek to design an efficient speech emotion recognition model that achieves comparable performance with fewer and more stable inputs (a single signal) and a simpler model architecture.
The electroglottograph (EGG) is a signal that reflects vocal fold movement by recording the electrical impedance across the glottis with electrodes placed on the throat [17]. The process of generating speech can be abstracted as the source-filter model established by Fant [18], shown in Figure 1. It represents speech signals as the combination of a source and a linear acoustic filter, corresponding to the vocal cords and the vocal tract (soft palate, tongue, nasal cavity, oral cavity, etc.), respectively. Within the source-filter framework, the EGG is a reliable resource for acquiring the periodic source information exactly. Additionally, owing to the way the EGG signal is acquired, it is not affected by mechanical vibrations and acoustic noise, which makes it suitable for real-life applications.
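As a toy illustration of the source-filter idea (not a component of our model), the sketch below synthesizes a vowel-like signal by passing a periodic impulse train, standing in for the glottal source, through an all-pole filter standing in for the vocal tract. The fundamental frequency, formant frequencies, and bandwidths are illustrative assumptions.

```python
# Hedged sketch of the source-filter model: periodic source + all-pole filter.
import numpy as np
from scipy.signal import lfilter

sr = 16000
f0 = 120.0                                    # assumed fundamental frequency (Hz)
source = np.zeros(int(0.5 * sr))
source[::int(sr / f0)] = 1.0                  # impulse train ~ periodic vocal-fold excitation

# Cascade of second-order resonances standing in for two formants of the vocal tract.
formants = [(700, 130), (1220, 70)]           # (center frequency, bandwidth) in Hz
a = np.array([1.0])
for fc, bw in formants:
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * fc / sr
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r ** 2])

speech = lfilter([1.0], a, source)            # filtered source: synthetic vowel-like signal
```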
In 2017, Sunil Kumar [19] showed that the phase of the EGG signal can be used to detect the glottal closure instant (GCI) and glottal opening instant (GOI) within a glottal cycle accurately and robustly, which indicates the strong performance of the EGG signal in exactly extracting the excitation source information of speech. In 2016, we realized text-independent phoneme segmentation combining EGG and speech data, which reflected the superiority of EGG signals in terms of robustness to noise [20].
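For intuition, a much simpler baseline than the phase-based method of [19] locates GCIs at the sharp peaks of the derivative of the EGG (the DEGG). The sketch below illustrates that baseline only; the peak polarity, threshold, and minimum-F0 setting are assumptions that depend on the recording convention.

```python
# Hedged sketch: GCI candidates from peaks of the EGG derivative (DEGG).
# This is a common baseline, not the phase-based method discussed above.
import numpy as np
from scipy.signal import find_peaks

def detect_gci(egg, sr, min_f0=60.0):
    degg = np.diff(egg)                       # derivative of the EGG signal
    # Depending on polarity, closures appear as sharp peaks of +degg or -degg;
    # here we assume negative peaks and enforce a minimum cycle spacing.
    peaks, _ = find_peaks(-degg,
                          distance=max(1, int(sr / min_f0) // 2),
                          height=0.2 * np.max(-degg))
    return peaks / sr                         # GCI times in seconds
```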
As the EGG signal is closely tied to speech production, a multitude of research has been carried out on the EGG and its application in speech-related tasks [21,22,23,24,25,26,27]. Regarding the relationship between the EGG and emotions, several studies have provided a strong basis for utilizing EGG signals in speech emotion recognition tasks. In 2015, Lu [28] found that the EGG can serve to distinguish neutral, happy, and sad emotions. Based on traditional methods, Chen extracted two classes of speech emotional features from EGG and speech, namely the power-law distribution coefficients (PLDC) and the real discrete cosine transform coefficients of the normalized spectra of the EGG and speech signals [29].
In particular, EGG signals have been utilized to help extract emotion features from speech. In 2010, Prasanna et al. [30] analyzed changes in excitation source characteristics across different emotions and observed that the fundamental frequency and the excitation strength are related to emotions. Taking the fundamental frequency extracted from the electroglottograph as the ground truth, they compared features extracted from EGG and speech to verify the effectiveness of extracting excitation source features from speech. Building on this conclusion, Pravena et al. [31] studied and demonstrated the effectiveness of these excitation parameters in identifying emotions in 2017. Their work introduced the emotion-related excitation parameters (strength of excitation, SoE, and instantaneous fundamental frequency, F0) and, combined with MFCC features and GMM models, realized an emotion recognition model based on speech and electroglottograph signals. However, although Pravena proved that EGG signals can help improve performance in the SER task, the approach still relies on information from speech signals and cannot realize recognition based on EGG signals alone.
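As a small illustration of one of these excitation parameters, the instantaneous fundamental frequency can be read off directly from EGG-derived glottal cycles: once GCI times are available (e.g., from the sketch earlier in this section), F0 per cycle is the inverse of the cycle duration. SoE estimation is omitted here, and this sketch is not the exact procedure of [30,31].

```python
# Hedged sketch: instantaneous F0 from successive GCI times (in seconds).
import numpy as np

def instantaneous_f0(gci_times):
    periods = np.diff(gci_times)      # duration of each glottal cycle (s)
    return 1.0 / periods              # instantaneous F0 per cycle (Hz)
```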
Cross-modal distillation aims to improve model performance by transferring supervision and knowledge between modalities. It normally adopts a teacher-student learning mechanism, in which the teacher model is pre-trained on one modality and then guides the student model on another modality toward a similar distribution. The distillation methods usually involve traditional response-level knowledge distillation [32,33,34], which uses the logits as the supervision, and feature-level distillation [35,36], which encourages the student network to learn and imitate the intermediate representations of the teacher network. For the speech emotion recognition task, in 2018, Albanie et al. [37] proposed a method of training a speech emotion recognition model with unlabeled speech data via response-level distillation from a pre-trained facial emotion recognition model given audio-visual pairs. Li et al. [38] proposed a method of training a speech emotion recognition model without any labeled speech emotion data with the help of emotion knowledge from a pre-trained text emotion model. These studies apply cross-modal distillation to speech emotion recognition and inspired our study. However, it must be highlighted that our paper aims to extract emotion information from EGG signals, not speech itself, which is a fundamental difference from the above research.
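To make the two distillation signals above concrete, the following is a minimal PyTorch sketch combining response-level distillation on softened logits with feature-level matching of intermediate representations. The temperature, loss weights, and tensor names are illustrative assumptions, not the exact losses of [32,33,34,35,36] or of this paper.

```python
# Hedged sketch: response-level + feature-level distillation for a student model.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat,
                      labels, T=4.0, alpha=0.5, beta=0.5):
    # Response-level: KL divergence between softened teacher and student outputs.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    # Feature-level: push the student's intermediate features toward the teacher's.
    feat = F.mse_loss(student_feat, teacher_feat)
    # Hard-label supervision on the student's own modality.
    ce = F.cross_entropy(student_logits, labels)
    return ce + alpha * kd + beta * feat
```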
In the present paper, to address the poor noise robustness of inputs extracted from speech in the speech emotion recognition task, we propose an EGG-based speech emotion recognition model. Furthermore, to recover the information latent in the modulation of the vocal tract, we adopt cross-modal emotion distillation (CMED), which transfers robust speech emotion representations from the log-Mel-spectrogram-based model to the EGG-based model.
This paper is organized as follows: Section 2 introduces our materials and methods and presents our proposed model in detail. In Section 3, we present the results of our model and the comparison experiments we have conducted. In Section 4, we discuss our work. Finally, Section 5 provides the conclusions of the present work and highlights the expected future work.
4. Discussion
In Section 3, we conducted a series of experiments to explore the best framework.
From the aspect of feature selection, we compared two classic acoustic features for the teacher model, log-Mel-spectrograms and Mel-spectrograms, and concluded that log-Mel-spectrograms perform better. As for the choice of architecture, comparing ResNet18 with a CRNN for the teacher model indicated that ResNet18 achieves better classification performance as well as faster convergence. For the student model, different numbers of layers were explored, and four was found to be the best depth for the Bi-LSTM. Two experiments on the settings of the hyper-parameters were conducted to verify the best experimental conditions.
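For reference, the sketch below shows a 4-layer bidirectional LSTM classifier of the kind this depth study settles on. The input dimension, hidden size, number of classes, and temporal pooling are illustrative assumptions rather than the exact student configuration used in this paper.

```python
# Hedged sketch of a 4-layer Bi-LSTM student classifier in PyTorch.
import torch
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    def __init__(self, input_dim=40, hidden_dim=128, num_layers=4, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, time, input_dim) EGG frame features
        out, _ = self.lstm(x)              # (batch, time, 2 * hidden_dim)
        pooled = out.mean(dim=1)           # average over time
        return self.fc(pooled)             # class logits
```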
For the evaluation of the results, we adopted unweighted validation accuracy and visualization with t-SNE. Both results on CDESD show that CMED works efficiently and can obtain a comparable result in the Chinese speech emotion recognition task. We realized an improvement from 58.98% to 66.80% via cross-modal emotion distillation on the S70 subset of CDESD, which is an accuracy comparable to human subject evaluation.
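A brief sketch of the two evaluation tools mentioned above is given below: unweighted (class-balanced) accuracy computed as the mean of per-class recalls, and a 2-D t-SNE projection of the learned embeddings. Variable names are placeholders; this is not necessarily the exact evaluation code used in our experiments.

```python
# Hedged sketch: unweighted accuracy and t-SNE visualization of embeddings.
from sklearn.metrics import recall_score
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def unweighted_accuracy(y_true, y_pred):
    # Mean of per-class recalls, i.e., every class contributes equally.
    return recall_score(y_true, y_pred, average="macro")

def plot_tsne(embeddings, labels):
    # labels are assumed to be integer class indices for coloring.
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.show()
```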
To explore the performance of our model on other languages, we conducted experiments on EMO-DB with all the experimental conditions kept the same. With the aid of CMED, we obtained a higher validation accuracy, which improved from 32.29% to 42.71%. As for the final result not reaching a level similar to the teacher model, we speculate that this is due to the insufficiency of the training data and the differing characteristics of different languages. Nevertheless, the large improvement observed still shows that cross-modal emotion distillation helps improve the result regardless of the performance of the student model on its own.