Article

Research on Speech Emotion Recognition Based on Teager Energy Operator Coefficients and Inverted MFCC Feature Fusion

School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai 201418, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3599; https://doi.org/10.3390/electronics12173599
Submission received: 4 July 2023 / Revised: 18 August 2023 / Accepted: 23 August 2023 / Published: 25 August 2023

Abstract

As an important part of daily life, speech has a great impact on the way people communicate. The Mel filter bank used in MFCC extraction processes the low-frequency components of a speech signal well, but it weakens the emotional information contained in the high-frequency part. We used an inverted Mel filter bank to enhance the processing of the high-frequency part of the speech signal, obtained the IMFCC coefficients, and fused them with the MFCC features to obtain I_MFCC. Finally, to characterize emotional traits more accurately, we combined the Teager energy operator coefficients (TEOC) with I_MFCC to obtain TEOC&I_MFCC and fed it into the CNN_LSTM neural network. Experimental results on RAVDESS show that the fusion of Teager energy operator coefficients and I_MFCC yields higher emotion recognition accuracy, with the system achieving 92.99% weighted accuracy (WA) and 92.88% unweighted accuracy (UA).

1. Introduction

Speech emotion recognition has been gaining popularity in the rapidly developing field of human–computer interaction, because a thorough examination of the emotions expressed in speech is essential for raising the sophistication and intelligence of conversational robots and human–computer interaction systems. By assisting individuals in understanding the feelings embedded in speech, we can provide higher-quality services and create a more intelligent and seamless human–computer interaction experience.
Speech emotion recognition in human–computer interaction is increasingly being applied across various domains, enhancing the user experience by recognizing emotional information in speech. This technology proves particularly useful in scenarios that involve human–computer interaction, such as movies and computer-based courses, where it can intelligently respond to a user’s emotions in a timely manner [1]. It also has great potential in intelligent vehicle systems, since it makes it possible to monitor a driver’s emotional state in real time and thus assist driving by helping prevent strong emotional reactions. It further helps psychologists diagnose patients and make more informed choices [2]. In addition, in automatic translation systems incorporating emotion recognition, emotional alignment between communicators plays a crucial role in dialogues. Notably, research indicates that systems utilizing prosody training [3] outperform those lacking prosodic elements in speech recognition. Consequently, speech emotion recognition is widely utilized in call centers and mobile communications [4].
The differences in vocal attributes among individuals, such as pitch, energy, gender, speech rate, and speaking style, introduce variations in speech emotion recognition. Extensive research in the field has explored various techniques using different speech emotion features and deep learning methods. These features include fundamental frequency, energy, duration, speech rate, resonance peaks, linear prediction coefficients (LPC), linear frequency cepstral coefficients (LFCC), Mel-frequency cepstral coefficients (MFCC) [5], and Teager energy operator coefficients (TEOC) [6]. Researchers have abstracted these features into fixed feature vectors that encompass multiple attributes of acoustic parameters, providing accurate representations of mean values, variances, extremes, and variations. This dimensional representation enhances the rationality and ease of processing in speech emotion analysis. The feature extraction process directly impacts the overall accuracy and performance of the system. As multi-feature fusion research advances, it has been discovered that proper feature fusion selection can increase the accuracy of speech emotion recognition [7]. Initially developed for speech recognition, Mel-frequency cepstral coefficients (MFCC) are calculated based on the auditory mechanism of the human ear. By mimicking the physiological structure of the human ear, MFCC can effectively capture the emotional information embedded in speech signals.
In the classification task of speech emotion recognition, traditional methods have employed a range of algorithms, including the Hidden Markov Model (HMM) [8], Support Vector Machines (SVM) [9], the Gaussian Mixture Model (GMM), and others [10,11]. However, in recent years, researchers have increasingly explored the use of neural networks to achieve speech emotion recognition. They have utilized various popular neural network techniques, such as Convolutional Neural Networks (CNN) [12], Recurrent Neural Networks (RNN), Deep Neural Networks (DNN), Artificial Neural Networks (ANN), and Long Short-Term Memory (LSTM). These emerging neural network technologies have found wide application in speech emotion recognition, providing more choices for classification [13,14]. Researchers have discovered that deep learning-based MFCC feature extraction methods exhibit outstanding performance in speech emotion recognition tasks [15,16,17]. In paper [18], it was mentioned that LSTM and CNN can be applied to speech emotion recognition to identify emotions like neutral, anger, sadness, and happiness. Paper [19] proposed an integrated model, CNN_LSTM, which achieved an accuracy of 70.56% using MFCC. In paper [20], the authors used a BGRU network with an attention layer to extract spectrograms and derivatives as features while obtaining a weighted accuracy of 72.83% and an unweighted accuracy of 67.75% on the IEMOCAP dataset. In paper [21], the authors combined MFCC, spectrogram, and embedded high-level auditory information to show good performance. In paper [22], the authors introduced Gaussian-shaped filters (GF) in calculating MFCC and IMFCC instead of traditional triangular mel-scale filters. In paper [23], the authors extracted T_MFCC features and combined them with the Gaussian Mixture Model (GMM) for classifying different emotions in speech, showing improved performance. In paper [24], the authors fused MFCC_IMFCC features and achieved higher rates of emotion recognition. In paper [25], the authors extracted a multidimensional feature vector comprising MFCC, ZCR, HNR, and TEO, and validated it using SVM, demonstrating better speech emotion recognition than using a single feature. These research findings indicate that the adoption of different features and fusion methods can enhance the effectiveness of emotion recognition and provide valuable references for further advancements in this field.
The extraction of speech features can be affected by the noise and interference present in the speech database during the processing of speech signals. As a result, denoising techniques must be applied before feature extraction [26,27]. Voice Activity Detection (VAD) distinguishes speech segments from non-speech segments in a given speech signal, which may be either clean or noisy, and extracts the significant speech elements. VAD can enhance the speech component of the signal, making it simpler to extract the elements needed for speech emotion recognition.
This research provides a novel approach for feature extraction and fusion in light of the aforementioned factors. Specifically, it combines MFCC, IMFCC, and Teager energy operator coefficients (TEOC) to form the fused feature TEOC&I_MFCC. The CNN_LSTM neural network takes this fused feature as input, and the LSTM enables the network to capture contextual information. Experimental verification is conducted on the RAVDESS dataset. The outcomes show that the proposed strategy improves the recognition accuracy, weighted accuracy (WA), and unweighted accuracy (UA). The system also offers a straightforward structure, simple feature extraction, and fewer neural network parameters.
Figure 1 shows a schematic diagram of the proposed speech emotion recognition system.

2. Feature Preprocessing

2.1. The Extraction Process of TEOC

The Teager energy operator (TEO), a nonlinear operator, detects and tracks transient energy variations in a signal, improving data comprehension and processing. H.M. Teager's research provided a novel operator for nonlinear speech modeling that has been widely recognized [28]. For a continuous signal $x(t)$, the operator, denoted $\Psi$, is defined as follows:
$$\Psi[x(t)] = \left[\frac{dx(t)}{dt}\right]^2 - x(t)\,\frac{d^2x(t)}{dt^2}$$
where $dx(t)/dt$ is the first derivative of $x(t)$ and $d^2x(t)/dt^2$ is its second derivative.
Let us first consider the case of a linear oscillator undergoing undamped free vibration. The displacement of the oscillator at time $t$ is $x(t) = A\cos(\omega_c t + \theta)$, where $A$ is the amplitude (the maximum displacement), $\omega_c$ is the angular frequency (relating the frequency and period of the vibration), $t$ is time, and $\theta$ is the initial phase of the vibration. By substituting $x(t)$ into $\Psi[x(t)]$, we obtain:
$$\Psi[x(t)] = \Psi[A\cos(\omega_c t + \theta)] = (A\omega_c)^2$$
Additionally, it is known that the oscillator's instantaneous total energy has the constant value $E = m(A\omega_c)^2/2$, where $m$ is the oscillator's mass. This is only a constant factor $m/2$ away from the $\Psi$ result of the above equation, so the $\Psi$ operator is called the Teager energy operator.
We utilize the TEO to extract parameters from the speech signals after VAD processing. Firstly, through frame-based processing, the audio signal is separated into equal-length frames. Subsequently, the input signal is returned to its original state. Then, the energy of each frame is calculated. By comparing the frame energy with a predetermined threshold, voice activity detection is performed to determine whether it belongs to a speech or non-speech segment. Generally, VAD algorithms categorize the audio signal into voiced, unvoiced, and silent segments, providing a more accurate representation of speech characteristics. These processing steps contribute to improving the accuracy of speech recognition. By incorporating information from different traditional acoustic parameters, we are able to identify and differentiate various speech emotions more accurately.
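To make the frame-energy step above concrete, the following is a minimal Python sketch of energy-based voice activity detection. The frame length, hop size, and relative threshold are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np

def energy_vad(x, frame_len=1024, hop=512, threshold_ratio=0.1):
    """Mark frames whose short-time energy exceeds a relative threshold."""
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    energies = np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                         for i in range(n_frames)])
    threshold = threshold_ratio * energies.max()   # threshold relative to the loudest frame
    return energies > threshold                    # True = speech frame, False = non-speech

# Example: keep only the active frames of a 48 kHz signal `x` before feature extraction.
# mask = energy_vad(x)
# x_speech = np.concatenate([x[i * 512:i * 512 + 1024] for i, keep in enumerate(mask) if keep])
```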
According to the analysis of speech emotion features presented in [29], it has been observed that, compared to neutral speech, speech signals under different emotions exhibit energy shifts in different frequency bands, resulting in the concentration of primary energy in different frequency ranges. This energy distribution difference becomes more significant after the nonlinear transformation by the Teager energy operator. In the frequency domain of speech signals, spectral peaks contribute more to perception, while spectral valleys contribute less [30]. Therefore, the TEO-based nonlinear transformation emphasizes the spectral peak information during high-energy periods, making it easier to differentiate the spectral energy differences between different emotions and thus improving the accuracy of speech emotion recognition. We conducted experiments to validate this idea in subsequent studies.
In this study, we extracted the Teager energy operator coefficients (TEOC) from the speech signals after VAD processing. The Teager energy operator conforms well to the nonlinear nature of speech signals, and its relatively low computational complexity makes it an efficient feature extraction method.
According to Kaiser [31], the Teager energy operator for discrete speech signals is defined as:
$$\Psi[x(n)] = [x(n)]^2 - x(n+1)\,x(n-1)$$
For discrete speech signals, determining the TEO value at a given point requires only the samples immediately before and after $x(n)$. For the speech signals in the dataset, the state transitions between different emotions exhibit nonlinear and non-stationary characteristics. By incorporating the TEO into the nonlinear feature extraction process, the underlying emotional components present in different speech signals can be captured effectively, thereby enhancing the accuracy of speech emotion recognition.
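As an illustration of the discrete definition above, the Teager energy operator can be computed with a few lines of NumPy; replicating the edge samples, which lack one neighbour, is an assumption made here for simplicity.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x(n)] = x(n)^2 - x(n+1) * x(n-1)."""
    x = np.asarray(x, dtype=np.float64)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[2:] * x[:-2]
    psi[0], psi[-1] = psi[1], psi[-2]   # boundary samples have no neighbour on one side
    return psi
```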
The original speech signals, speech signals processed by VAD, and speech signals processed by TEO for four randomly selected audio samples are shown in Figure 2.

2.2. The Extraction Process of MFCC and IMFCC

The feature extraction processes for IMFCC and MFCC are essentially the same; the only difference lies in the response functions of the Mel filters used. The IMFCC and MFCC features complement each other well, and using the high-frequency information captured by IMFCC improves speech emotion recognition and overall system performance. This complementary relationship provides a new direction for the field and an effective approach for enhancing system performance. Figure 3 depicts the flowchart of MFCC and IMFCC feature extraction.
MFCC is obtained as the output of a discrete cosine transform (DCT). It exhibits good robustness and reflects the emotional feature parameters in the signal [32]. However, spectral analysis of the speech signal shows that emotion-related spectra are distributed in both the low-frequency and high-frequency regions. The Mel filter bank structure of MFCC cannot effectively exploit the high-frequency portion of the signal. Therefore, we use an improved variant, the Inverted Mel-frequency cepstral coefficients (IMFCC), obtained by inverting the Mel filters.
The general steps of the extraction of the MFCC and IMFCC are as follows:
  • Pre-emphasis of speech: The speech signal is passed through a high-pass filter $H(z) = 1 - \mu z^{-1}$, where $\mu$ lies between 0.9 and 1.0; $\mu$ is set to 0.97 after comparative experiments. Pre-emphasis compensates for the tendency of the speech production system to suppress the high-frequency components of speech [14].
  • Framing of speech signals: To avoid abrupt changes between adjacent frames, an overlap region is introduced between them. The sampling frequency of the speech signals in this paper is 48 kHz.
  • Hamming window: Suppose the framed signal is $S(n)$, $n = 0, 1, \ldots, N-1$, where $N$ is the frame size. After multiplying by the Hamming window, $S'(n) = S(n) \times W(n)$, where $W(n)$ has the following form:
    $$W(n, a) = (1 - a) - a \cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1$$
    Different values of a will produce different Hamming Windows, and a is generally taken to be 0.46.
  • Fast Fourier transform: After Hamming windowing, each frame undergoes a Fast Fourier Transform (FFT) to determine the energy distribution over the frequency spectrum. The magnitude of the resulting spectrum is then squared to produce the power spectrum of the speech signal.
  • Mel filter bank: The key role of the triangular filters is to smooth the frequency spectrum, emphasizing the resonant peaks of the speech signal and eliminating unnecessary frequency fluctuations. The structure diagrams of the Mel filter bank and the reversed Mel filter bank are shown in Figure 4.
Through this transformation, the structure of the filter bank changes, and the response function undergoes corresponding changes as well:
$$IH_i(k) = H_{p-i+1}\left(\frac{N}{2} - k + 1\right)$$
where $H_i(k)$ is the filter response in the Mel frequency domain, $i$ is the filter index, $k$ is the frequency index, $N$ is the filter length, and $p$ is the number of filters in the bank.
The conversion between the Mel frequency and the actual frequency can be expressed by the following formula:
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
The inverted Mel filter bank introduces the corresponding inverted Mel frequency domain; its relation to the actual frequency can be expressed by the following formula (see the code sketch given after these extraction steps):
$$f_{IMel} = 2195.268 - 2595 \log_{10}\left(1 + \frac{4031.25 - f}{700}\right)$$
Figure 5 illustrates the comparison between Mel frequency and actual frequency:
  • Log energy: Calculate the log energy of each filter bank output as:
    $$s(m) = \ln\left(\sum_{k=0}^{n-1} |X_a(k)|^2 H_m(k)\right), \quad 0 \le m \le M$$
    where $m$ is the filter bank index, $n$ is the number of FFT points, $X_a(k)$ is the amplitude spectrum of the frequency-domain signal, and $H_m(k)$ is the response function of filter bank $m$.
  • Discrete cosine transform: The MFCC coefficients are obtained by applying the discrete cosine transform (DCT) to the log energies:
    $$C(n) = \sum_{m=0}^{M-1} s(m) \cos\left(\frac{\pi n (m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$$
    Performing the DCT on the logarithmic energies yields the $L$-order MFCC coefficients, where $M$ is the number of triangular filters.
  • Extraction of dynamic difference parameters: To capture the dynamic characteristics of speech in addition to the static features extracted by MFCC, differential spectral features are introduced as a complement to the static features. The differential parameters are calculated as:
    $$d_t = \begin{cases} C_{t+1} - C_t, & t < K \\[4pt] \dfrac{\sum_{k=1}^{K} k\,(C_{t+k} - C_{t-k})}{2\sum_{k=1}^{K} k^2}, & \text{otherwise} \\[8pt] C_t - C_{t-1}, & t \ge Q - K \end{cases}$$
    where $d_t$ is the first-order difference of the $t$-th frame, $C_t$ is the $t$-th cepstral coefficient, $Q$ is the order of the cepstral coefficients, and $K$ is the time span of the first-order derivative, which can be 1 or 2. Substituting the first-order results back into the equation yields the second-order difference parameters.
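The sketch referenced above illustrates the two frequency mappings and the filter-bank inversion ($f_{mel}$, $f_{IMel}$, and $IH_i(k)$). It assumes a triangular Mel filter bank is already available as a matrix; the constant 4031.25 is taken directly from the formula above, and the indexing convention of the flip is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)

def hz_to_imel(f, f_max=4031.25):
    """Inverted Mel scale: f_IMel = 2195.268 - 2595 * log10(1 + (f_max - f) / 700)."""
    return 2195.268 - 2595.0 * np.log10(1.0 + (f_max - np.asarray(f, dtype=np.float64)) / 700.0)

def invert_filter_bank(mel_fb):
    """Flip a (p, n_bins) triangular Mel filter bank along both axes,
    i.e. IH_i(k) = H_{p-i+1}(N/2 - k + 1) up to indexing conventions."""
    return np.asarray(mel_fb)[::-1, ::-1]
```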
We denote the $i$-dimensional MFCC parameters extracted from each speech file as $M_i$, the IMFCC parameters as $IM_i$, and the fused MFCC and IMFCC parameters as $I\_M_i$. For feature fusion, we adopt an embedded technique in which the three parameters obtained from the aforementioned steps, namely MFCC, IMFCC, and TEOC, are combined.
The feature fusion approach cross-embeds the 20-dimensional IMFCC features with the 19-dimensional MFCC features. This guarantees that all three types of features are present in every frame, allowing a more accurate depiction of the speech emotion data. The feature dimension is kept relatively modest to reduce the number of parameters and boost training efficiency, while the LSTM neural network gathers contextual information to make more accurate decisions. Finally, the extracted Teager energy operator coefficients (TEOC) are concatenated with $I\_M_i$ to obtain the $TEOC\&I\_M_i$ parameters, which serve as the input for the CNN_LSTM neural network. The feature fusion approach is illustrated in Figure 6.
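A minimal sketch of this fusion step is given below, assuming per-frame 19-dimensional MFCC, 20-dimensional IMFCC, and one TEOC value per frame; the exact interleaving order is an assumption, since the text only specifies that the coefficients are cross-embedded and that TEOC is concatenated afterwards.

```python
import numpy as np

def fuse_features(mfcc, imfcc, teoc):
    """Cross-embed IMFCC (n_frames, 20) with MFCC (n_frames, 19), then append
    TEOC (n_frames, 1) to obtain the fused TEOC&I_MFCC vector per frame."""
    n_frames = mfcc.shape[0]
    i_mfcc = np.empty((n_frames, mfcc.shape[1] + imfcc.shape[1]))
    i_mfcc[:, 0::2] = imfcc   # 20 even-indexed slots
    i_mfcc[:, 1::2] = mfcc    # 19 odd-indexed slots
    return np.concatenate([i_mfcc, teoc], axis=1)   # shape: (n_frames, 40)
```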

3. Model for the Neural Network

The convolutional neural network consists of alternating convolutional and pooling layers, together with batch normalization and dropout layers, to suit the real properties of speech signals. Two important characteristics of CNNs are local connectivity and weight sharing, which effectively reduce the number of parameters and improve the efficiency of feature extraction. In the neural network model built in this paper, we adopt a three-layer structure for the convolutional layers; the specific structural parameters are shown in Table 1.
In the network structure model built in this paper, the output dimension of the LSTM module unit is 32. The output is flattened into a one-dimensional vector by the Flatten layer to prepare for subsequent fully connected layers. Finally, a softmax activation function is applied for multi-class emotion classification. The structure of the LSTM module unit used in this paper is shown in Figure 7.
Specifically, the forget gate is controlled by f ( t ) , which determines how much information from the previous short-term memory and the current input should be forgotten in the long-term memory.
$$f(t) = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
where $W$ denotes the weight matrix of each gate, $b$ the bias vector, and $\sigma$ the sigmoid activation function.
The input gate is controlled by $i(t)$ and consists of two parts. The first part, determined by $\sigma(\cdot)$, decides the values to be updated, while the second part, governed by $g(t)$, determines which values can be added to the long-term memory.
$$i(t) = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$g(t) = \tanh\left(W_g \cdot [h_{t-1}, x_t] + b_g\right)$$
The output gate is controlled by o(t) and is responsible for determining the short-term memory and the output of the current unit based on the previous short-term memory, long-term memory, and input state.
$$o(t) = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$c(t) = f(t) \odot c_{t-1} + i(t) \odot g(t)$$
$$h(t) = o(t) \odot \tanh(c_t)$$
We choose the ELU (Exponential Linear Unit) as the activation function for each convolutional layer. Compared with ReLU, ELU does not suffer from the "dead neuron" problem (where a neuron stops updating because its gradient is zero for negative inputs) and, in some cases, provides higher accuracy than ReLU and its variants. The smooth negative part of ELU allows richer activation patterns to be modeled. Using ELU can reduce training time and improve the accuracy of the neural network.
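Based on the layer settings reported in Tables 1 and 3 (three Conv1D blocks with batch normalization, ELU, max pooling, and dropout, followed by a 32-unit LSTM, a Flatten layer, and a softmax output over eight emotions), a Keras sketch of the classifier could look as follows. The 40-sample input length matches the fused feature dimension, and the resulting parameter counts reproduce those in Table 3, but this remains an illustrative reconstruction rather than the authors' exact code.

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(input_len=40, n_classes=8):
    """CNN_LSTM classifier following the settings of Tables 1 and 3."""
    model = models.Sequential()
    model.add(layers.Input(shape=(input_len, 1)))
    for filters, kernel in [(32, 9), (64, 7), (128, 5)]:
        model.add(layers.Conv1D(filters, kernel, padding='same'))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation('elu'))        # ELU activation as described above
        model.add(layers.MaxPooling1D(2))
        model.add(layers.Dropout(0.25))
    model.add(layers.LSTM(32, return_sequences=True))  # keeps the (5, 32) time dimension
    model.add(layers.Flatten())
    model.add(layers.Dense(n_classes, activation='softmax'))
    return model

# build_cnn_lstm().summary() reproduces the layer shapes and parameter counts of Table 3.
```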

4. Experimental Database and Results

4.1. Database

In this paper, we conducted experiments using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset [33]. The RAVDESS dataset provides a rich collection of emotional speech and song recordings. It is a publicly available database that offers a reliable data foundation for our research. Each file in the dataset has a duration of approximately 3.5 to 5 s and a sampling frequency of 48 kHz. We standardized the length to 5 s during the Voice Activity Detection (VAD) processing step. The distribution of speech files for various emotions in the database used in this paper is shown in Table 2. With equally distributed numbers of male and female speakers, the database is gender-balanced. The speech files are recorded with a North American accent. The third number in the file name represents the emotion label of the speech file. In this example, “01” denotes neutral, “02” calm, “03” happy, “04” sad, “05” angry, “06” fearful, “07” disgust, and “08” surprised. One example of a file name is “03-01-01-01-01-01.wav”, which indicates the following: participant ID is 03 (referring to the third participant), sentence ID is 01 (referring to the first sentence), emotion label is 01 (referring to neutral emotion), intensity label is 01 (referring to low intensity), gender label is 01 (referring to female), and label number is 01 (referring to the original version).
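As a small illustration of this naming scheme, the helper below reads the emotion label from the third hyphen-separated field of a RAVDESS file name; the label-to-emotion mapping simply restates the coding listed above, and the helper itself is a hypothetical convenience function.

```python
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
                    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def emotion_from_filename(path):
    """E.g. '03-01-06-01-02-01.wav' -> 'fearful' (third field is the emotion label)."""
    name = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return RAVDESS_EMOTIONS[name.split("-")[2]]
```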
For the validation experiments on the database, we employed the 10-fold cross-validation method. The dataset was randomly divided into ten subsets. In each experiment, nine subsets were used as the training set to train the neural network, while the remaining subset was used as the validation set to evaluate the performance of the neural network. This approach allowed us to comprehensively validate and assess the performance of the neural network on different subsets of data. The fused feature parameters, TEOC and I_MFCC, were input into the constructed CNN_LSTM neural network for speech emotion recognition.
The model used in this study was built using the TensorFlow framework. The specific configuration of the deep learning neural network parameters is as follows: learning rate of 0.01, batch size of 64, learning rate decay of 0.001, momentum of 0.8, and the number of iterations (epochs) set to 1500. Stochastic Gradient Descent (SGD) was used as the network optimizer.
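A minimal training-loop sketch combining the 10-fold cross-validation described above with the listed hyperparameters (SGD, learning rate 0.01, momentum 0.8, batch size 64, 1500 epochs) is shown below. `X` and `y` are assumed to be the fused features of shape (n_samples, 40, 1) and integer emotion labels; the reported learning-rate decay of 0.001 is omitted here for simplicity and could be added with a learning-rate schedule.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.optimizers import SGD

def cross_validate(X, y, build_model, n_splits=10, epochs=1500):
    """Average validation accuracy over a 10-fold cross-validation split."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = build_model()
        model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.8),
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
        model.fit(X[train_idx], y[train_idx], batch_size=64, epochs=epochs, verbose=0)
        scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    return float(np.mean(scores))
```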

4.2. Comparison of Experimental Results and Analysis

The weighted accuracy (WA) and unweighted accuracy (UA) are used to evaluate performance in this paper: WA is the overall accuracy over all samples, while UA is the average of the per-emotion accuracies. We used the fused TEOC&I_MFCC parameters as the input of the CNN_LSTM neural network; the network structure parameters are shown in Table 3.
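For reference, the two metrics can be computed as sketched below, assuming integer class labels: WA is the plain sample-level accuracy, and UA is the mean of the per-class recalls.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy over all samples (WA)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (UA)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```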
To validate the effectiveness of the proposed feature fusion and neural network model for speech emotion recognition, we conducted experiments on the RAVDESS emotional speech corpus. The experimental results include the model loss curve and model accuracy curve, as shown in Figure 8.
To confirm that our proposed method performs better, we compared the speech emotion recognition accuracy of our approach with that of existing research methods. Table 4 shows the WA and UA values obtained from the existing research and from our experiment. All reference results in Table 4 are based on the RAVDESS database.
Based on the data in Table 4, our proposed method obtains the highest weighted and unweighted accuracy on the RAVDESS emotional speech corpus compared with the techniques in the literature mentioned above.
The extraction processes of MFCC and IMFCC are similar but not identical. To verify the relationship between the Teager energy operator coefficients and signal features such as MFCC and IMFCC, and to confirm the contribution of each module in the proposed model, feature ablation experiments were carried out. The experimental results are provided in Table 5.
In the feature ablation experiments, we conducted a total of five groups of experiments. While keeping the network structure unchanged, we separately validated the following scenarios: MFCC only, IMFCC only, fusion of TEOC and MFCC, fusion of TEOC and IMFCC, and fusion of TEOC, MFCC, and IMFCC. The results showed that, as a single feature, IMFCC captured emotional features better than MFCC. Recognition performance improved when fusing TEOC with IMFCC compared with fusing TEOC with MFCC, which further validated our hypothesis. The fusion of MFCC, IMFCC, and TEOC achieved the highest recognition accuracy, weighted accuracy, and unweighted accuracy. These results confirm the effectiveness of our proposed feature fusion method and neural network structure, and indicate that including additional feature components improves speech emotion recognition.
To test the recognition accuracy of the proposed speech emotion recognition system on the RAVDESS database more thoroughly, we also analyzed the classification methods reported in the literature, such as Self-Supervised Learning (SSL), self-attention, Transformer, and Wav2Vec 2.0, which are currently in extensive use. Table 6 compares and summarizes these classification networks on the same database.
As seen in Table 6, the classification network and feature selection proposed in this article outperform the listed RAVDESS recognition results. Owing to our deliberate selection and incorporation of the more informative TEOC and IMFCC emotion features, the proposed classification network also has advantages in terms of network structure, parameter transfer between networks, and training duration.
During the experiments on the database, we generated a confusion matrix plot for the RAVDESS database, as shown in Figure 9.
As can be seen from the plotted results, surprised, neutral, anger, and calm all have rather high recognition rates, while the confusion matrix indicates comparatively lower recognition rates for disgust, fear, happiness, and sadness. We hypothesize that this is because the energy fluctuations in the speech signals for these emotions are less pronounced and lack prominent emotional energy patterns compared with other emotions. The high recognition rates for commonly detected emotions such as calm and anger are particularly valuable in applications such as intelligent assisted driving, where detecting a driver's agitation-related emotional fluctuations is highly relevant and informative.

5. Conclusions

In this paper, a CNN_LSTM neural network for speech emotion recognition is proposed. The speech signals in the database are preprocessed using voice activity detection, and the MFCC, IMFCC, and Teager energy operator coefficients are extracted. These features are then fused into a new parameter, TEOC&I_MFCC, which serves as the input to the CNN_LSTM neural network. The Teager energy operator is highly sensitive to transient components in the signal, and the IMFCC parameters capture the emotional information contained in the high-frequency part of the speech signal; the feature fusion combines the strengths of both. Comparative experiments on the RAVDESS database demonstrate that our proposed method outperforms similar methods and models in terms of speech emotion recognition accuracy. In particular, it performs well in detecting surprised, neutral, anger, and calm emotions, achieving good weighted and unweighted accuracy. This has practical applications in domains such as intelligent assisted driving and intelligent voice feedback.
In future work, we will explore additional deep learning networks to further enhance the accuracy of speech emotion recognition. For instance, SSL models can adapt their performance by learning from data, perform better on many classification tasks, and incorporate speech emotion features more effectively.

Author Contributions

Methodology, F.W.; software, F.W.; writing—original draft preparation, F.W.; writing—review and editing, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Schuller, B.; Rigoll, G.; Lang, M. Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine-Belief Network Architecture. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, 17–21 May 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 1, pp. I-577–I-580. [Google Scholar]
  2. France, D.J.; Shiavi, R.G.; Silverman, S.; Silverman, M.; Wilkes, M. Acoustical Properties of Speech as Indicators of Depression and Suicidal Risk. IEEE Trans. Biomed. Eng. 2000, 47, 829–837. [Google Scholar] [CrossRef] [PubMed]
  3. Hansen, J.H.L.; Cairns, D.A. ICARUS: Source Generator Based Real-Time Recognition of Speech in Noisy Stressful and Lombard Effect Environments. Speech Commun. 1995, 16, 391–422. [Google Scholar] [CrossRef]
  4. Goos, G.; Hartmanis, J.; van Leeuwen, J.; Hutchison, D.; Kanade, T.; Kittler, J.; Kleinberg, J.M.; Mattern, F.; Mitchell, J.C.; Naor, M.; et al. Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1973. [Google Scholar]
  5. Ks, D.R.; Rudresh, G.S. Comparative Performance Analysis for Speech Digit Recognition Based on MFCC and Vector Quantization. Glob. Transit. Proc. 2021, 2, 513–519. [Google Scholar] [CrossRef]
  6. Alimuradov, A.K. Speech/Pause Segmentation Method Based on Teager Energy Operator and Short-Time Energy Analysis. In Proceedings of the 2021 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), Yekaterinburg, Russia, 13–14 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 45–48. [Google Scholar]
  7. Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Attention Driven Fusion for Multi-Modal Emotion Recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3227–3231. [Google Scholar]
  8. Zhiyan, H.; Jian, W. Speech Emotion Recognition Based on Wavelet Transform and Improved HMM. In Proceedings of the 2013 25th Chinese Control and Decision Conference (CCDC), Guiyang, China, 25–27 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 3156–3159. [Google Scholar]
  9. Rajasekhar, A.; Hota, M.K. A Study of Speech, Speaker and Emotion Recognition Using Mel Frequency Cepstrum Coefficients and Support Vector Machines. In Proceedings of the 2018 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 3–5 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 114–118. [Google Scholar]
  10. Ko, Y.; Hong, I.; Shin, H.; Kim, Y. Construction of a Database of Emotional Speech Using Emotion Sounds from Movies and Dramas. In Proceedings of the 2017 International Conference on Information and Communications (ICIC), Hanoi, Vietnam, 26–28 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 266–267. [Google Scholar]
  11. Han, Z.; Wang, J. Speech Emotion Recognition Based on Gaussian Kernel Nonlinear Proximal Support Vector Machine. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2513–2516. [Google Scholar]
  12. Zhao, J.; Mao, X.; Chen, L. Learning Deep Features to Recognise Speech Emotion Using Merged Deep CNN. IET Signal Proc. 2018, 12, 713–721. [Google Scholar] [CrossRef]
  13. Ying, X.; Yizhe, Z. Design of Speech Emotion Recognition Algorithm Based on Deep Learning. In Proceedings of the 2021 IEEE 4th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 19–21 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 734–737. [Google Scholar]
  14. Zhao, H.; Ye, N.; Wang, R. A Survey on Automatic Emotion Recognition Using Audio Big Data and Deep Learning Architectures. In Proceedings of the 2018 IEEE 4th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC) and IEEE International Conference on Intelligent Data and Security (IDS), Omaha, NE, USA, 3–5 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 139–142. [Google Scholar]
  15. Singh, Y.B.; Goel, S. Survey on Human Emotion Recognition: Speech Database, Features and Classification. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 298–301. [Google Scholar]
  16. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
  17. Kumbhar, H.S.; Bhandari, S.U. Speech Emotion Recognition Using MFCC Features and LSTM Network. In Proceedings of the 2019 5th International Conference on Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 19–21 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–3. [Google Scholar]
  18. Dhavale, M.; Bhandari, S. Speech Emotion Recognition Using CNN and LSTM. In Proceedings of the 2022 6th International Conference On Computing, Communication, Control And Automation ICCUBEA, Pune, India, 26 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–3. [Google Scholar]
  19. Mohan, M.; Dhanalakshmi, P.; Kumar, R.S. Speech Emotion Classification Using Ensemble Models with MFCC. Procedia Comput. Sci. 2023, 218, 1857–1868. [Google Scholar] [CrossRef]
  20. Yan, Y.; Shen, X. Research on Speech Emotion Recognition Based on AA-CBGRU Network. Electronics 2022, 11, 1409. [Google Scholar] [CrossRef]
  21. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23 May 2022. [Google Scholar]
  22. Chakroborty, S.; Saha, G. Improved Text-Independent Speaker Identification Using Fused MFCC & IMFCC Feature Sets Based on Gaussian Filter. Int. J. Signal Process. 2009, 5, 11–19. [Google Scholar]
  23. Bandela, S.R.; Kumar, T.K. Stressed Speech Emotion Recognition Using Feature Fusion of Teager Energy Operator and MFCC. In Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 3–5 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5. [Google Scholar]
  24. Gupta, A.; Gupta, H. Applications of MFCC and Vector Quantization in Speaker Recognition. In Proceedings of the 2013 International Conference on Intelligent Systems and Signal Processing (ISSP), Piscataway, NJ, USA, 1 May 2013. [Google Scholar]
  25. Aouani, H.; Ayed, Y.B. Speech Emotion Recognition with Deep Learning. Procedia Comput. Sci. 2020, 176, 251–260. [Google Scholar] [CrossRef]
  26. Wanli, Z.; Guoxin, L.; Lirong, W. Application of Improved Spectral Subtraction Algorithm for Speech Emotion Recognition. In Proceedings of the 2015 IEEE Fifth International Conference on Big Data and Cloud Computing, Dalian, China, 26–28 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 213–216. [Google Scholar]
  27. Yu, Y.; Kim, Y.-J. A Voice Activity Detection Model Composed of Bidirectional LSTM and Attention Mechanism. In Proceedings of the 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), Baguio City, Philippines, 29 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5. [Google Scholar]
  28. Teager, H.M.; Teager, S.M. Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract. In Speech Production and Speech Modelling; Hardcastle, W.J., Marchal, A., Eds.; Springer: Dordrecht, The Netherlands, 1990; pp. 241–261. ISBN 978-94-010-7414-8. [Google Scholar]
  29. Hui, G.; Shanguang, C.; Guangchuan, S. Emotion Classification of Mandarin Speech Based on TEO Nonlinear Features. In Proceedings of the Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007), Qingdao, China, 30 July–1 August 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 394–398. [Google Scholar]
  30. Strope, B.; Alwan, A. A Model of Dynamic Auditory Perception and Its Application to Robust Word Recognition. IEEE Trans. Speech Audio Process. 1997, 5, 451–464. [Google Scholar] [CrossRef]
  31. Kaiser, J.F. On a Simple Algorithm to Calculate the “energy” of a Signal. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, 3–6 April 1990; IEEE: Piscataway, NJ, USA, 1990; pp. 381–384. [Google Scholar]
  32. Logan, B. Mel Frequency Cepstral Coefficients for Music Modeling. In Proceedings of the International Society for Music Information Retrieval Conference, Plymouth, MA, USA, 23–25 October 2000. [Google Scholar]
  33. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
  34. Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; ISCA: Buenos Aires, Argentina, 2019; pp. 1656–1660. [Google Scholar]
  35. Jalal, M.A.; Loweimi, E.; Moore, R.K.; Hain, T. Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; ISCA: Buenos Aires, Argentina, 2019; pp. 1701–1705. [Google Scholar]
  36. Koo, H.; Jeong, S.; Yoon, S.; Kim, W. Development of Speech Emotion Recognition Algorithm Using MFCC and Prosody. In Proceedings of the 2020 International Conference on Electronics, Information, and Communication (ICEIC), Barcelona, Spain, 19–22 January 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–4. [Google Scholar]
  37. Pratama, A.; Sihwi, S.W. Speech Emotion Recognition Model Using Support Vector Machine Through MFCC Audio Feature. In Proceedings of the 2022 14th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia, 18 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 303–307. [Google Scholar]
  38. Yadav, A.; Vishwakarma, D.K. A Multilingual Framework of CNN and Bi-LSTM for Emotion Classification. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  39. Ayadi, S.; Lachiri, Z. A Combined CNN-LSTM Network for Audio Emotion Recognition Using Speech and Song Attributs. In Proceedings of the 2022 6th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Sfax, Tunisia, 24 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  40. Huang, L.; Shen, X. Research on Speech Emotion Recognition Based on the Fractional Fourier Transform. Electronics 2022, 11, 3393. [Google Scholar] [CrossRef]
  41. Pastor, M.A.; Ribas, D.; Ortega, A.; Miguel, A.; Lleida, E. Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations. Appl. Sci. 2023, 13, 9062. [Google Scholar] [CrossRef]
  42. Yue, P.; Qu, L.; Zheng, S.; Li, T. Multi-Task Learning for Speech Emotion and Emotion Intensity Recognition. In Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1232–1237. [Google Scholar]
  43. Alisamir, S.; Ringeval, F.; Portet, F. Multi-Corpus Affect Recognition with Emotion Embeddings and Self-Supervised Representations of Speech. In Proceedings of the 2022 10th International Conference on Affective Computing and Intelligent Interaction (ACII), Nara, Japan, 18 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
  44. Chaudhari, A.; Bhatt, C.; Krishna, A.; Travieso-González, C.M. Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning. Electronics 2023, 12, 288. [Google Scholar] [CrossRef]
  45. Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset. Appl. Sci. 2021, 12, 327. [Google Scholar] [CrossRef]
  46. Ye, J.; Wen, X.; Wei, Y.; Xu, Y.; Liu, K.; Shan, H. Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar]
Figure 1. Flowchart of speech emotion recognition.
Figure 2. The original speech signal, the VAD processed, and the TEO processed speech signal. (a) The anger speech signal processed by VAD and TEO; (b) the calm speech signal processed by VAD and TEO; (c) the disgust speech signal processed by VAD and TEO; (d) the fear speech signal processed by VAD and TEO.
Figure 3. Flowchart of MFCC and IMFCC feature extraction.
Figure 4. Diagram of the structure of the Mel filter bank (a) and the IMel filter bank (b).
Figure 5. Frequency comparison plots of Mel frequency and IMel frequency.
Figure 6. Feature fusion for MFCC, IMFCC, and TEOC.
Figure 7. Diagram of the architecture of an LSTM.
Figure 8. (a) Model loss curve; (b) model accuracy curve.
Figure 9. The confusion matrix plot of the results of the speech emotion recognition experiment.
Table 1. The CNN structure and parameters.
Convolutional Layer | Structural Parameters
Conv1 | Filters: 32; Kernel_size: 9; Padding: 'same'; Maxpooling: 2; Dropout: 0.25
Conv2 | Filters: 64; Kernel_size: 7; Padding: 'same'; Maxpooling: 2; Dropout: 0.25
Conv3 | Filters: 128; Kernel_size: 5; Padding: 'same'; Maxpooling: 2; Dropout: 0.25
Table 2. Speech statistics for various emotions in the RAVDESS database.
Emotion | WavCount
Neutral | 96
Calm | 192
Happy | 192
Sad | 192
Angry | 192
Fearful | 192
Disgust | 192
Surprise | 192
Total | 1440
Table 3. Parameters of the CNN_LSTM structure used in this paper.
Layer (Type) | Output Shape | Param
Conv1D_1 | (None, 40, 32) | 320
Batch_normalization_1 | (None, 40, 32) | 128
Activation_1 | (None, 40, 32) | 0
Max_pooling1d_1 | (None, 20, 32) | 0
Dropout_1 | (None, 20, 32) | 0
Conv1D_2 | (None, 20, 64) | 14,400
Batch_normalization_2 | (None, 20, 64) | 256
Activation_2 | (None, 20, 64) | 0
Max_pooling1d_2 | (None, 10, 64) | 0
Dropout_2 | (None, 10, 64) | 0
Conv1D_3 | (None, 10, 128) | 41,088
Batch_normalization_3 | (None, 10, 128) | 512
Activation_3 | (None, 10, 128) | 0
Max_pooling1d_3 | (None, 5, 128) | 0
Dropout_3 | (None, 5, 128) | 0
LSTM | (None, 5, 32) | 20,608
Flatten | (None, 160) | 0
Dense | (None, 8) | 1,288
Table 4. Comparison of studies by others based on MFCC.
Paper | Feature | WA | UA
Parry et al. [34] | MFCCs + Mfcs + F0 + log-energy | - | 53.08%
Jalal et al. [35] | augmented by delta and delta-delta | - | 56.2%
Koo et al. [36] | MFCC + delta + delta of acceleration | 64.47% | -
Pratama et al. [37] | MFCC | - | 71.16%
Yadav et al. [38] | MFCCs | - | 73%
Ayadi et al. [39] | MFCC | 73.33% | -
Huang et al. [40] | Frft_MFCC | 79.86% | 79.51%
This paper | TEOC&I_MFCC | 92.99% | 92.88%
Table 5. Ablation experiments on the proposed feature fusion.
Feature | WA | UA
MFCC | 75.64% | 76.46%
IMFCC | 79.92% | 79.79%
TEOC + MFCC | 73.73% | 73.54%
TEOC + IMFCC | 81.80% | 81.67%
TEOC + MFCC + IMFCC | 92.99% | 92.88%
Table 6. Comparison between other classification models and this paper based on RAVDESS.
Paper | Classification Model | WA | UA
Pastor et al. [41] | CNNSelfAtt | - | 70.92%
Yue et al. [42] | Wav2Vec 2.0 | 80.01% | 80.07%
Alisamir et al. [43] | W2V2-FT | - | 82.20%
Chaudhari et al. [44] | SSL + transformer | 86.40% | -
Luna-Jiménez et al. [45] | xlsr-Wav2Vec2.0 + MLP | 86.70% | -
Ye et al. [46] | TIM-Net | 91.93% | 92.08%
This paper | CNN_LSTM | 92.99% | 92.88%