Article

Convolutional Neural Network–Bidirectional Gated Recurrent Unit Facial Expression Recognition Method Fused with Attention Mechanism

1 School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
2 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
3 Beijing Key Laboratory of Robot Bionics and Function Research, Beijing 100044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(22), 12418; https://doi.org/10.3390/app132212418
Submission received: 26 October 2023 / Revised: 12 November 2023 / Accepted: 14 November 2023 / Published: 16 November 2023

Abstract

The relationships among different subregions in facial images and their varying contributions to facial expression recognition indicate that using a fixed subregion weighting scheme would result in a substantial loss of valuable information. To address this issue, we propose a facial expression recognition network called BGA-Net, which combines bidirectional gated recurrent units (BiGRUs) with an attention mechanism. Firstly, a convolutional neural network (CNN) is employed to extract feature maps from facial images. Then, a sliding window cropping strategy is applied to divide the feature maps into multiple subregions. The BiGRUs are utilized to capture the dependencies among these subregions. Finally, an attention mechanism is employed to adaptively focus on the most discriminative regions. When evaluated on CK+, FER2013, and JAFFE datasets, our proposed method achieves promising results.

1. Introduction

Facial expression recognition (FER) is a significant area of research in affective computing. A study by psychologist Mehrabian revealed that language accounts for only 7% of the information conveyed in interpersonal interactions, whereas facial expressions convey as much as 55%. This suggests that facial expressions play a significant role in human emotion judgment [1] and can be applied in a variety of scenarios, including medical treatment [2,3], fatigue driving detection [4,5,6,7], social robots [8,9,10], and other human–computer interaction systems [11,12,13,14].
In fact, investigations by Martinez et al. [15] and Emery et al. [16] have shown associations between various facial expressions and a number of local regions of the face. A few key areas, including the mouth and eyes, capture most of the variation in expressions. For example, when someone is pleased, their mouth corners tend to lift, and when they are unhappy, their mouth tends to be slightly pursed. These observations motivated the investigation and use of key facial regions for feature recognition in this study.
In the area of FER based on deep learning, convolutional neural networks have proven to be an effective technique for extracting local spatial characteristics from images. However, this spatial locality makes it difficult for the model to establish relationships between different facial regions, because each convolutional filter in a CNN operates on a limited region. As a result, some related work has tried to address this problem with strategies such as global pooling [17], deepening the model [18], or enlarging the kernel [19]. Nevertheless, the improvements have been limited.
Neurons in recurrent neural networks (RNNs) can receive information not only from themselves but also from other neurons through a recurrent mechanism, which helps alleviate the inability of CNNs to capture long-range inductive bias [20]. BiGRU is a branch of RNN; it mitigates the vanishing- and exploding-gradient problems in backpropagation and makes it easier to learn the contextual information of a given sequence [21,22]. BiGRU takes sequence data as input and, because it can build global dependencies across a sequence, it is frequently used for tasks such as language generation and speech recognition. There are also global dependencies between different facial regions, so it is well suited to FER.
In summary, BGA-Net is proposed to alleviate the uncertainty of facial expressions. A sliding window cropping strategy fully retains both the facial-region feature information and the global facial feature information, and the dependencies between different facial regions are captured by BiGRU. An attention mechanism assigns different weights to the multiple facial-region features of the input, suppressing the influence of useless and noisy region features. Finally, Focal Loss is introduced [23], which makes the model pay more attention to difficult-to-classify samples through its focusing parameter, thereby enhancing the model's classification performance.
The following is a brief summary of our study’s contributions.
  • A novel facial expression recognition method fusing an attention mechanism is proposed, by which attention to key regions can be enhanced.
  • A sliding window cropping strategy is applied to acquire multiple facial subregion feature vectors and the effects of different scales of windows as well as step size on expression recognition are explored.
  • The BiGRU structure is employed to establish the global dependency between the global region and each facial subregion in two directions, and the effects of different network layers and different input sequence feature dimensions are explored.
  • The Focal Loss function is adopted to cope with difficult samples in the facial expression recognition issue and the effect of hyperparameters on FER is investigated.

2. Related Work

Three basic steps are usually involved in facial expression recognition: face detection, feature extraction, and classification. Among these steps, feature extraction is a crucial component in facial expression recognition. Existing feature extraction methods primarily include traditional approaches and deep learning methods.
Traditional facial expression recognition methods employ handcrafted feature extractors, such as Gabor filters and Local Binary Patterns (LBPs). For instance, Shishir et al. [24] proposed a method that utilizes Gabor wavelets and Learning Vector Quantization (LVQ) for facial expression recognition. They extracted features using Gabor filters and combined them with LVQ for expression classification. Shan et al. [25] combined LBP features with the AdaBoost algorithm for facial expression image representation. Another approach by Sarnarawickrame et al. [26] involved using an Active Shape Model (ASM) and Support Vector Machines (SVMs) for facial expression recognition, investigating its accuracy and effectiveness. Saha et al. [27] fused feature space algorithms with Principal Component Analysis (PCA), achieving good performance in facial expression recognition.
Traditional image feature extraction algorithms heavily rely on handcrafted features, which may lead to the destruction of deep feature information that is essential for correctly classifying images. As a result, FER techniques based on deep learning have become available and shown exceptional recognition performance. These deep-learning-based methods offer more feature extraction flexibility than manually designed approaches. They are capable of extracting high-level features from images, thereby contributing to improved accuracy in facial expression recognition.
Over the years, significant progress has been made in the development of deep neural network methods for facial expression feature extraction. For instance, Rodriguez et al. [28] proposed utilizing a pre-trained VGG16 facial expression recognition network, as introduced by Parkhi et al. [29], as a feature extraction network. The extracted facial expression features were then inputted into an LSTM network. Furthermore, Yang et al. [30] introduced a novel feature disentanglement model called SwapGAN (Generative Adversarial Network) for facial expression recognition. This method achieves a high degree of separation between facial-expression-related features and expression-invariant features, effectively overcoming the interference caused by individual differences in the facial expression recognition process. Tang et al. [31] proposed a dual-channel network based on the Canny edge detector, which utilizes both the original image network and the edge image network to extract features without the need for additional redundant network layers or training. Zhang et al. [32] introduced a novel Erasing Attention Consistency (EAC) method that automatically suppresses noisy samples during the training process. Specifically, they designed an imbalanced framework based on the flip semantic consistency of facial images. Subsequently, they randomly erased a portion of the input images and employed flip attention consistency to prevent the model from excessively focusing on specific features.

3. Proposed Method

Figure 1 shows the proposed facial expression recognition model, which consists of four parts: (1) feature extraction layer; (2) feature vector generation for facial subregion layer; (3) bidirectional attention GRU layer; (4) classification layer.

3.1. Feature Extraction Layer

To obtain effective facial features, we employ a convolutional neural network (CNN) as the feature extractor, consisting primarily of convolutional layers, max pooling layers, and batch normalization (BN) [33] layers. The convolutional layers play a crucial role in extracting local features from the input data. The parameters of the convolutional layers, including kernel size, padding, and stride, determine the size of the extracted feature maps. By convolving the local pixel matrix of the input layer with the corresponding kernel and accumulating the multiplications, we obtain a new pixel value in the feature map. Each convolutional kernel slides over the input feature maps with a specified stride and padding is applied to the input feature maps to increase their size during the convolution process.
The shallow convolutional layers are primarily used to extract low-level features from images, such as edge and spatial contour information. Regardless of the model used, similar information is learned at this stage. Therefore, in this study, we first employed a convolutional layer to extract shallow features from the images. The input channels of the first convolutional layer were set to the 3 channels of the original image. We used 64 3 × 3 convolutional kernels for the convolution operation in this layer. A nonlinear activation function was applied after the convolutional layer to enhance the network’s learning and fitting capabilities. In this study, we utilized the rectified linear unit (ReLU) activation function, as shown in Equation (1).
f(x) = \max(0, x) \qquad (1)
The selection of the ReLU activation function offers several advantages. Compared to traditional activation functions like tanh and sigmoid, ReLU computation is simpler and more computationally efficient, as it avoids exponential calculations. This helps reduce the network’s floating-point operations (FLOPs), enabling the utilization of hardware acceleration and successful training of deep, multi-layer networks with nonlinear activation functions through backpropagation.
To further reduce model complexity, we applied max pooling to the output feature maps of the convolutional layers. The pooling operation downsamples the image by a factor of two while maintaining the same number of channels. Additionally, we introduced batch normalization layers at the end of the feature extraction layer. The purpose of BN is to address the issue of numerical instability in deep neural networks by ensuring that the features of each batch have similar distributions, making the network easier to train.
The specific architecture is depicted in Figure 2, where the blue blocks represent the convolutional layers and the yellow blocks represent the pooling layers.
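To make the arrangement above concrete, the following is a minimal PyTorch sketch of a feature extractor of this kind. Only the first layer (3 input channels, 64 3 × 3 kernels) and the Conv–ReLU–MaxPool–BN pattern are taken from the text; the depth and channel counts of the later blocks are illustrative assumptions, since the full configuration is only given in Figure 2.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal CNN feature extractor: Conv -> ReLU -> MaxPool blocks with BN at the end.
    Only the first layer (3 -> 64 channels, 3x3 kernels) is specified in the paper;
    the remaining blocks are illustrative placeholders."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),    # 64 3x3 kernels on the RGB input
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # downsample by a factor of two
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # assumed deeper block
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.BatchNorm2d(128),                           # BN at the end of the extraction stage
        )

    def forward(self, x):            # x: (batch, 3, 224, 224)
        return self.features(x)      # global feature map F_0
```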

3.2. Feature Vector Generation for Facial Subregion Layer

In contrast to the majority of expression recognition techniques, which use fixed cropping regions, we employ a sliding window cropping strategy [34] to obtain sufficiently rich local regions of the face. The sliding window cropping process takes the global feature map produced by the feature extraction layer as its target and treats the size of the cropped subregions and the step size as hyperparameters, allowing various numbers of facial subregions to be obtained.
It is known that the feature map generated by the feature extraction layer is $F_0$ and the feature maps obtained by sliding window cropping are $F = \{F_1, F_2, \ldots, F_n\}$, which represent different regions of the facial feature map, where n is calculated as follows:

n = \left( \frac{G_0 - scale}{stride} + 1 \right)^2 \qquad (2)

where $G_0$ denotes the size of the global feature map, $scale$ denotes the size of the cropping region, and $stride$ is the step size. After obtaining the corresponding global feature map and facial subregion feature maps, these feature maps are further processed to generate the facial subregion feature vectors; the calculation process is shown below (Algorithm 1):
Algorithm 1 Facial subregion feature vector generation
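Since Algorithm 1 is only available as an image in the source, the sketch below shows one plausible way to generate the subregion feature vectors with a sliding window over the global feature map. The use of Tensor.unfold, the average pooling of each crop into a vector, and the appending of the global feature vector are assumptions rather than the authors' exact procedure.

```python
import torch

def subregion_vectors(feature_map, scale=7, stride=2):
    """Crop a (C, G0, G0) feature map into n = ((G0 - scale)/stride + 1)^2 subregions
    and turn each crop into a feature vector by average pooling (assumed)."""
    c, g0, _ = feature_map.shape
    # unfold over height and width: (C, n_h, n_w, scale, scale)
    patches = feature_map.unfold(1, scale, stride).unfold(2, scale, stride)
    n_h, n_w = patches.shape[1], patches.shape[2]
    patches = patches.reshape(c, n_h * n_w, scale, scale)
    # one vector per subregion: (n, C)
    vectors = patches.mean(dim=(-1, -2)).transpose(0, 1)
    # the global feature map itself is appended as an extra "region" (assumption)
    global_vec = feature_map.mean(dim=(-1, -2)).unsqueeze(0)
    return torch.cat([vectors, global_vec], dim=0)      # (n + 1, C)
```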

3.3. Bidirectional Attention GRU Layer

GRU (gated recurrent unit) is an improved type of RNN that addresses the issues of gradient explosion and vanishing gradients encountered by traditional recurrent neural networks. By introducing update gates and reset gates, GRU effectively mitigates these problems. Compared to Long Short-Term Memory (LSTM), GRU has a lower parameter count, approximately 2/3 of LSTM. This property makes GRU less prone to overfitting and also demonstrates better performance in terms of convergence time and iteration.
To effectively leverage contextual information among different facial regions in facial expression recognition, this study adopts BiGRU to model the feature information of facial subregions. By applying BiGRU to these feature sequences of facial subregions in both forward and backward computations, two different sets of hidden states are obtained. By combining these two sets of states, BiGRU outperforms unidirectional GRU in terms of fitting capability. This combination of forward and backward computations allows for better utilization of the input sequence information, enabling the network to capture subtle variations in facial expressions more effectively.
As shown in Figure 1, the cell structures of the forward-propagating GRU and the backward-propagating GRU are given, and the corresponding computational formulas are given as follows:
z_t^{(p)} = \sigma(W_z^{(p)} x_t + U_z^{(p)} h_{t-1}^{(p)} + b_z^{(p)})  \qquad (3a)
r_t^{(p)} = \sigma(W_r^{(p)} x_t + U_r^{(p)} h_{t-1}^{(p)} + b_r^{(p)})  \qquad (3b)
\hat{h}_t^{(p)} = \varphi(W_h^{(p)} x_t + U_h^{(p)} (h_{t-1}^{(p)} \odot r_t^{(p)}) + b_h^{(p)})  \qquad (3c)
h_t^{(p)} = (1 - z_t^{(p)}) \odot h_{t-1}^{(p)} + z_t^{(p)} \odot \hat{h}_t^{(p)}  \qquad (3d)

z_t^{(b)} = \sigma(W_z^{(b)} x_t + U_z^{(b)} h_{t-1}^{(b)} + b_z^{(b)})  \qquad (4a)
r_t^{(b)} = \sigma(W_r^{(b)} x_t + U_r^{(b)} h_{t-1}^{(b)} + b_r^{(b)})  \qquad (4b)
\hat{h}_t^{(b)} = \varphi(W_h^{(b)} x_t + U_h^{(b)} (h_{t-1}^{(b)} \odot r_t^{(b)}) + b_h^{(b)})  \qquad (4c)
h_t^{(b)} = (1 - z_t^{(b)}) \odot h_{t-1}^{(b)} + z_t^{(b)} \odot \hat{h}_t^{(b)}  \qquad (4d)

h_t = h_t^{(p)} + h_t^{(b)}  \qquad (5)
In the BiGRU formulation, the index t denotes the current step, and the superscripts p and b denote the forward and backward GRU cells, respectively; the reset and update gates of the GRU unit are denoted by r and z. The update gate determines how much information from the preceding step is saved to the current step's memory, whereas the reset gate determines how much information from the preceding step is retained when computing the candidate state.
$W_z$, $W_r$, and $W_h$ are the weights applied to the current input, $U_z$, $U_r$, and $U_h$ are the weights applied to the hidden state of the preceding step, and $b_z$, $b_r$, and $b_h$ are the respective biases.
Specifically, each GRU unit first computes the states of the reset and update gates from the hidden state $h_{t-1}$ of the preceding step and the current input $x_t$. Applying the sigmoid activation function yields a state vector in the 0–1 range. Next, the state vector after forgetting is obtained by computing $h_{t-1} \odot r_t$, where $\odot$ denotes element-wise multiplication. The result is then combined with the weighted input vector $x_t$ and the bias and activated by the nonlinear activation function $\varphi$. Finally, the update gate in the current cell decides which information is kept as the final hidden state.
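Written out in code, a single forward-direction GRU step from Equation (3) looks roughly as follows (a minimal sketch; the choice of tanh for φ and the weight shapes are assumptions).

```python
import torch

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step following Equation (3). W, U, b are dicts with keys 'z', 'r', 'h'
    holding (hidden, input), (hidden, hidden), and (hidden,) tensors, respectively."""
    z = torch.sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])          # update gate, Eq. (3a)
    r = torch.sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])          # reset gate,  Eq. (3b)
    h_hat = torch.tanh(W['h'] @ x_t + U['h'] @ (h_prev * r) + b['h'])   # candidate,   Eq. (3c)
    return (1 - z) * h_prev + z * h_hat                                 # new hidden state, Eq. (3d)
```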
Before the intermediate results are passed to the attention layer, the hidden states of the forward and backward GRU cells are added to obtain the final hidden states, as shown in Equation (5).
(1) Attention mechanism: Motivated by current state-of-the-art attention mechanisms, the attention layer is positioned after the BiGRU layer to enhance the recognition of facial expressions. The feature information of each facial region is adaptively weighted to emphasize the salient information via two fully connected layers and a ReLU activation function.
The attention module's weight values are displayed in red in Figure 1. The state vector s is obtained by weighting the final hidden state $h_t$ of every subregion with its corresponding attention coefficient, computed using Equation (6), and summing the products.
s = \sum_{t=1}^{n} \alpha_t h_t, \qquad
\alpha_t = \frac{\exp(\beta_t)}{\sum_{t=1}^{n} \exp(\beta_t)}, \qquad
\beta_t = w_{t2}\,\mathrm{ReLU}(w_{t1} h_t + b_{t1}) + b_{t2}  \qquad (6)
where $\{w_{t1}, b_{t1}\}$ and $\{w_{t2}, b_{t2}\}$ represent the weight matrices and bias vectors of the two fully connected layers, $\beta_t$ represents the weight for each facial subregion, and $\alpha_t$ signifies the normalized weight of the facial subregion obtained through normalization. A value of 1 for $\alpha_t$ implies that the corresponding facial region contributes the most to the facial expression classification, while a value of 0 indicates that the region is completely irrelevant to the classification. After obtaining the respective weights of the subregions, the feature vectors $h_t$ learned by BiGRU are weighted accordingly. This process yields the weighted feature vector for the t-th subregion. Finally, all the weighted feature vectors are aggregated into a fine-grained global feature representation, which serves as the input to the classification layer for facial expression recognition.
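A compact sketch of the bidirectional GRU followed by the attention pooling of Equation (6) is given below. The layer sizes are illustrative; summing the two directional hidden states rather than concatenating them follows Equation (5).

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    """BiGRU over the sequence of subregion feature vectors, followed by the
    two-layer attention of Eq. (6). Two layers and a hidden size of 512 match
    the paper's best configuration; in_dim and attn_dim are illustrative."""
    def __init__(self, in_dim=128, hidden_dim=512, attn_dim=128):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(hidden_dim, attn_dim)   # {w_t1, b_t1}
        self.fc2 = nn.Linear(attn_dim, 1)            # {w_t2, b_t2}

    def forward(self, regions):                      # regions: (batch, n, in_dim)
        out, _ = self.bigru(regions)                 # (batch, n, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)
        h = fwd + bwd                                # Eq. (5): sum of both directions
        beta = self.fc2(torch.relu(self.fc1(h)))     # (batch, n, 1) region scores
        alpha = torch.softmax(beta, dim=1)           # normalized weights alpha_t
        return (alpha * h).sum(dim=1)                # state vector s, (batch, hidden_dim)
```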

3.4. Classifier Layer

The classification layer is responsible for classifying the weighted features fused by the bidirectional gated recurrent units and the attention mechanism. First, a fully connected layer reduces the feature dimension to the number of categories, yielding a k-dimensional vector, which is then normalized by the softmax function to obtain the final probability vector. Suppose the input to this layer is $Y = \{Y_1, Y_2, \ldots, Y_k\}$; then, the softmax operation is given in Equation (7):

P(Y = i \mid x) = \mathrm{softmax}(Y_i) = \frac{e^{w_i Y_i}}{\sum_{k=1}^{K} e^{w_k Y_k}}  \qquad (7)
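In code, the classification step of Equation (7) reduces to a single fully connected layer followed by softmax; the dimensions below are illustrative.

```python
import torch.nn as nn

# Classification head: a fully connected layer maps the fused feature vector s
# to k = 7 expression categories, and softmax yields the final probability
# vector (Eq. (7)). The input dimension (512) is an assumption.
classifier = nn.Sequential(
    nn.Linear(512, 7),
    nn.Softmax(dim=-1),
)
```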

3.5. Loss Function

The facial expression recognition task can be considered a multi-class classification problem. Most existing work uses the cross-entropy loss function. However, most current datasets exhibit class imbalance, and the imbalance across categories indirectly leads to an imbalance between hard and easy samples. To reduce the weight of easy samples so that the model focuses on classifying hard samples, we use Focal Loss (FL), defined as follows:

FL = -\sum_{i=1}^{N} (1 - p_i)^{\gamma} \, g_i \log(p_i)  \qquad (8)

where $FL$ is the Focal Loss, $N$ is the number of images in the training set, $p_i$ and $g_i$ denote the predicted probability and the true label, respectively, and the focusing parameter $\gamma$ is a non-negative hyperparameter that controls the weights of easy and hard samples. During training, as $\gamma$ increases, the loss function increases the weight of hard samples in the total loss and drastically reduces the loss of easy samples, so that the model focuses more on learning from the hard samples.
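A minimal sketch of the multi-class Focal Loss described above is shown below; it assumes integer class targets and works from raw logits for numerical stability, which is an implementation choice rather than a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=0.5):
    """Multi-class Focal Loss: FL = -sum_i (1 - p_i)^gamma * g_i * log(p_i),
    where p_i is the softmax probability of the true class of sample i.
    gamma = 0.5 is the value the paper found to work best; the sum over the
    batch (rather than the mean) follows Eq. (8)."""
    log_p = F.log_softmax(logits, dim=-1)                        # log p for every class
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)    # log p of the true class
    pt = log_pt.exp()
    return -((1 - pt) ** gamma * log_pt).sum()
```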

4. Experiments

In this section, we perform a series of experiments on three distinct facial expression datasets: JAFFE [35], CK+ [36], and Fer2013 [37], to validate the efficacy of the method presented in this paper.

4.1. Datasets

  • CK+
    The CK+ dataset contains 593 video sequences from 123 subjects. The sequences vary in length from 10 to 20 frames and show the transition from a neutral to a peak expression (Figure 3). From these, the last three frames of 327 video sequences were selected, totaling 981 images.
  • Fer2013
    The Fer2013 dataset originates from the International Conference on Machine Learning (ICML) 2013 challenge. The images are all grayscale images of size 48 × 48, with seven expression classes corresponding to the numbers 0–6: anger, disgust, fear, happiness, sadness, surprise, and neutrality (Figure 4). The training set contains 28,709 images and the test set 3589 images. It should be noted that these images are affected by factors such as occlusion, age, and pose, and the test set also contains some labeling errors, which leads to relatively low test accuracy on this dataset. In addition, human recognition accuracy on this dataset is only about 65%.
  • JAFFE
    The JAFFE dataset collects seven categories of facial expressions from 10 Japanese women, which contain six basic expressions and one neutral expression (Figure 5). There are 213 images in total, each with a resolution of 48 × 48, and the number of images for each expression category is 30 on average. In this experiment, the training set totaled 174 images and the test set was 39 images.

4.2. Data Preprocessing and Augmentation

In order to enhance the model’s generalization capacity, mitigate the occurrence of overfitting and underfitting, and improve its resilience to disturbances such as noise and variations in angles, we performed data preprocessing and augmentation on the facial expression images used in the experiments. Specifically, prior to training, all original images were first resized to a fixed dimension of 256 × 256 pixels. Subsequently, random cropping was applied to obtain images of size 224 × 224 pixels. Moreover, we employed the Five-Crop strategy to further augment each expression image. This strategy involved cropping sub-images of specified sizes from the top-left, top-right, bottom-left, bottom-right, and center regions of the image, resulting in a five-fold increase in the number of image samples.
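In torchvision terms, the preprocessing described above corresponds roughly to the pipeline below; the Five-Crop size and the per-crop tensor conversion are illustrative assumptions, as they are not specified in the text.

```python
import torch
from torchvision import transforms

# Resize to 256x256, random-crop to 224x224, then Five-Crop (top-left, top-right,
# bottom-left, bottom-right, center). The Five-Crop size (200 here) is not given
# in the paper and is an illustrative assumption.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.FiveCrop(200),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])
```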

4.3. Experimental Detail Settings

The experiments were performed on a deep learning server with an NVIDIA A5000 GPU, a Linux operating system, and Python 3.8. By default, the cropping parameters are set to scale = 7 and stride = 2, and the hyperparameter γ is set to 0.5.
The model was trained on the Fer2013 dataset with a batch size of 128 for 250 epochs. The SGD optimizer [38] was used together with softmax for classification, with the momentum set to 0.9 and the initial learning rate to 0.01. A decaying learning-rate schedule was used: once the model had trained for 80 epochs, the learning rate was multiplied by 0.9 every five epochs.
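The Fer2013 optimizer settings translate roughly into the sketch below. Expressing the decay rule (multiply the learning rate by 0.9 every five epochs after epoch 80) with MultiStepLR is an assumed implementation detail.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 7)   # placeholder for BGA-Net

# SGD with momentum 0.9 and initial lr 0.01 (Fer2013 setup). Decaying the lr by a
# factor of 0.9 every five epochs once training passes epoch 80 is expressed here
# with MultiStepLR; the authors do not state which scheduler they used.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=list(range(80, 250, 5)), gamma=0.9)

for epoch in range(250):
    # ... one training epoch over Fer2013 with batch size 128 would go here ...
    scheduler.step()
```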
For training on the CK+ dataset, ten-fold cross-validation was used to avoid chance results: the face samples, covering the seven expression classes, were divided into 10 folds; in each run, one fold served as the test set and the remaining nine as the training set; the procedure was repeated 10 times and the average was taken as the final result. The batch size was set to 32 and the number of epochs to 60; once the model had trained for 20 epochs, the learning rate was multiplied by 0.8 every five epochs; the remaining settings are consistent with the Fer2013 training.
The training on the JAFFE dataset is similar to that on the CK+ dataset and uses the same K-fold cross-validation method, but with K set to 4; the remaining settings are consistent with the CK+ training.
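For reference, a ten-fold evaluation protocol of the kind described for CK+ can be set up as in the sketch below; the use of scikit-learn's StratifiedKFold, and whether the original split was stratified by expression, are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy stand-ins for the 981 CK+ images and their seven expression labels.
labels = np.random.randint(0, 7, size=981)
X = np.zeros((len(labels), 1))          # features are irrelevant to the split itself

accuracies = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, labels):
    # train on train_idx, evaluate on test_idx, then record the fold accuracy
    accuracies.append(0.0)              # placeholder for the measured accuracy
print("mean accuracy over 10 folds:", np.mean(accuracies))
```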

4.4. Ablation Experiment

To illustrate the efficacy of the BGA-Net model, we carried out comprehensive ablation experiments on the three datasets. Specifically, we examined the input dimension, the number of hidden layers, and the hyperparameter γ of the BiGRU on the Fer2013 dataset, and we thoroughly analyzed each BGA-Net module.
We conducted a series of ablation evaluations covering the three BGA-Net modules. First, we used a CNN as the baseline network, to which we added the sliding window cropping strategy, BiGRU, and the attention module in turn. As controls, we also conducted experiments with RNN, LSTM, and GRU.

4.4.1. BiGRU Parameters

Considering the influence of the number of network layers and the input sequence dimension of the hidden layer in the BiGRU on expression feature extraction, a series of experiments was conducted on the Fer2013 dataset; the classification results are shown in Table 1.
The experiments show that, for the same number of network layers, the classification accuracy improves as the input sequence dimension increases, but only up to a limit. With one network layer, the recognition accuracy decreases once the input sequence dimension reaches 512; with two layers, it decreases once the dimension reaches 1024. The results also show that, for the same input sequence dimension, increasing the number of network layers does not necessarily improve facial expression recognition performance, possibly because a single-layer network already extracts sufficient local-region features. For expression classification, the two-layer model with a hidden-layer input dimension of 512 yields the highest recognition rate.

4.4.2. Sliding Window Cropping Parameters

Different scales of image cropping methods will retain different degrees of local features, thus affecting the classification results of the model. In this study, multiple scales were explored on the Fer2013 dataset and the experimental results are shown in Table 2.
The experimental results show that the highest classification accuracy can be obtained when the parameter scale = 7 and the parameter stride = 2, which indicates that this scale maximizes the retention of all important local features in the image. The accuracy of the model may decline as the scale increases because the local features contain more meaningless information. The cropping window cannot ensure that local features are complete at smaller scales, which affects the model’s classification accuracy. In addition, to avoid excessive computation and running time, the case of stride = 1 is not considered in the experiment.

4.4.3. Focal Loss Parameter

In the FER task, class samples are imbalanced, and the similarity of discriminative features between expressions gives rise to hard and easy samples; this paper therefore chooses Focal Loss as the loss function for expression recognition. A series of experiments was conducted on the hyperparameter γ to find a suitable value that enhances the learning of hard-sample features while reducing the loss weight of easy samples.
As can be seen from Table 3, the accuracy is highest when γ = 0.5. As γ increases further, the expression recognition accuracy drops, because the model becomes overly focused on learning features from the small number of hard samples, which lowers the recognition rate on the easy samples that make up the majority and thus degrades overall performance. Similarly, when γ is too small, the network can hardly benefit from learning the features of the hard samples.

4.4.4. Ablation Experiment Results

According to Table 4, our method improves on the baseline network by 1.97%, 4.09%, and 6.21% on the CK+, JAFFE, and Fer2013 datasets, respectively.

4.5. Analysis of Experimental Results

4.5.1. Confusion Matrices

In order to analyze the recognition accuracy of BGA-Net for various types of expressions, the confusion matrices on the Fer2013, CK+, and JAFFE datasets were plotted. In the confusion matrix, the higher the percentage corresponding to the elements on the diagonal, the better the recognition accuracy of the method in the corresponding category.
Based on the confusion matrix of Fer2013 shown in Figure 6a, it can be observed that the recognition rates for the categories of anger, fear, and sadness are relatively lower, while the rates for the categories of happiness and surprise are higher. This phenomenon can be attributed to the fact that, compared to other categories, the facial features in happiness and surprise expressions are more prominent. For instance, in a state of happiness, there are often distinct features such as upturned corners of the mouth and raised eyebrows, while surprise expressions exhibit similar characteristics like widened eyes. On the other hand, anger, fear, and sadness share certain common features, such as open mouths and tense facial expressions. Among these three categories, fear expressions bear more similarities to the other categories, which results in the lowest recognition rate for fear expressions among all categories.
The confusion matrices in Figure 6b,c represent the CK+ dataset and JAFFE dataset, respectively. It can be observed that emotions such as happiness, surprise, and disgust have relatively higher recognition rates, while anger, fear, sadness, and contempt have lower recognition rates. The substantial difference in recognition accuracy compared to the FER2013 dataset is primarily attributed to the data collection process of these two datasets. Both CK+ and JAFFE datasets are captured in controlled laboratory settings with minimal interference from factors such as occlusion, lighting variations, and human-related disturbances. Consequently, the image quality is higher, making the emotions more discernible and easier to recognize.

4.5.2. ROC Curves

ROC curves are another crucial tool for assessing a model’s performance. The ROC curves for the model on the Fer2013, CK+, and JAFFE datasets are shown in Figure 7a to Figure 7c, respectively. Analysis reveals that the confusion matrix’s outcomes and each dataset’s ROC curve match up. The area under the curve for the corresponding category tends to be larger when emotion recognition accuracy is higher. The ROC curve offers a more thorough evaluation of the classifier’s efficiency than the confusion matrix.

4.5.3. Comparison with State-of-the-Art Methods

To validate the superiority of BGA-Net, we compared our method with state-of-the-art approaches, including RSACNN and FERFC. Table 5 presents the accuracy comparison among these different methods.
From Table 5, it is evident that our proposed BGA-Net outperforms the method proposed by Khanbebin and Mehrdad [39], which combines handcrafted features with convolutional neural networks for deep feature extraction, by 1.89% on the Fer2013 dataset. In comparison to the approach introduced by Chang et al. [40], which trains multiple classifiers based on the complexity of test samples to address the complexity of facial expression recognition, our BGA-Net achieves similar recognition performance in an end-to-end manner. Similarly, on the CK+ dataset, our method surpasses the approach proposed by Sun et al. [41], which utilizes multi-layer convolutions to extract feature vectors from different ROI regions, by 9.19%. Furthermore, our method outperforms the fusion of local binary pattern-based facial expression images with the original image proposed by Sun et al. [42] by 0.76%. On the JAFFE dataset, our method achieves a 2.18% improvement over the approach presented by Shen et al. [43], which combines the ResNet18 network structure with a multi-head channel attention mechanism. Overall, our approach consistently delivers the best FER recognition results across all three datasets, thanks to the combination of BiGRU and the attention mechanism in BGA-Net, enabling the model to focus on crucial facial expression regions.
Lastly, we would like to highlight the potential applications of the proposed facial expression recognition method. In the field of human–computer interaction, facial expression recognition can contribute to improving the user experience by adaptively adjusting interfaces and content based on the recognized emotional states of users. In the healthcare domain, it can be utilized to identify patients’ emotional states and tailor treatment plans accordingly, thereby enhancing treatment effectiveness. Additionally, in gaming and entertainment, it can enable the recognition of players’ emotional states and dynamically adjust factors such as sound effects and visuals to elevate the gaming experience for users.
Table 5. Comparison with the accuracy of existing methods.

Dataset   Method                   Accuracy/%   Year
Fer2013   Khanbebin et al. [39]    70.00        2023
          Minaee et al. [44]       70.02        2021
          RSACNN [45]              71.33        2022
          Chang et al. [40]        71.67        2019
          BGA-Net (Ours)           71.89        -
CK+       Sun et al. [41]          87.20        2020
          Pan et al. [46]          93.53        2023
          Meena et al. [47]        95.00        2023
          FERFC [42]               95.60        2021
          BGA-Net (Ours)           96.36        -
JAFFE     FERFC [42]               91.90        2021
          Debnath et al. [48]      92.05        2022
          SSAE-FER [49]            92.50        2023
          Shen et al. [43]         93.33        2023
          BGA-Net (Ours)           95.51        -

5. Conclusions

This paper proposes a facial expression recognition method that integrates BiGRU structures and attention mechanisms. By incorporating a sliding window cropping strategy and attention-based BiGRU, the model enhances the classification performance on the baseline CNN network. Experimental results validate the effectiveness of introducing the Focal Loss loss function in effectively addressing the challenge of difficult samples in facial expression recognition. Additionally, the study explores different scales of windows and strides in the sliding window cropping, various network layers, and hidden layer input feature dimensions in the BiGRU, as well as hyperparameter values in the Focal Loss. Research suggests that, in facial expression recognition tasks, appropriate sliding window sizes, increasing the number of layers in BiGRU, and adjusting input feature dimensions contribute to improving model recognition accuracy.
Notably, this study did not sufficiently address the impact of facial occlusion and head pose on performance. Moreover, the utilization of a limited dataset may hinder the applicability of the proposed method to a broader range of populations and scenarios. We plan to employ more sophisticated data enhancement techniques and test the suggested model’s potential on additional datasets in the future. Furthermore, we will use different CNN structures in future studies to address various research applications because the current CNN structure is relatively simple and we want to obtain a more effective model. We also suggest some future research directions for our proposed model, including industrial defect detection, video pedestrian anomaly detection, and other classification tasks.

Author Contributions

C.T. proposed and implemented the main ideas of this research and contributed to the writing of parts of the paper. D.Z. and Q.T. provided suggestions and made revisions to the manuscript. D.Z. also contributed to writing a portion of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Graduate Education and Teaching Quality Improvement Project (J2022012) and the Graduate Innovation Project (PG2023098) at the Beijing University of Civil Engineering and Architecture.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The study’s authors declare that they have no conflicts of interest to disclose.

References

  1. Mehrabian, A.; Russell, J.A. An Approach to Environmental Psychology; The MIT Press: Cambridge, MA, USA, 1974. [Google Scholar]
  2. Edwards, J.; Jackson, H.J.; Pattison, P.E. Emotion recognition via facial expression and affective prosody in schizophrenia: A methodological review. Clin. Psychol. Rev. 2002, 22, 789–832. [Google Scholar] [CrossRef] [PubMed]
  3. Chu, H.C.; Tsai, W.W.J.; Liao, M.J.; Chen, Y.M. Facial emotion recognition with transition detection for students with high-functioning autism in adaptive e-learning. Soft Comput. 2018, 22, 2973–2999. [Google Scholar] [CrossRef]
  4. Zhu, Z.; Ji, Q. Real time and non-intrusive driver fatigue monitoring. In Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems, Washington, DC, USA, 3–6 October 2004; pp. 657–662. [Google Scholar]
  5. Ji, Q.; Zhu, Z.; Lan, P. Real-time nonintrusive monitoring and prediction of driver fatigue. IEEE Trans. Veh. Technol. 2004, 53, 1052–1068. [Google Scholar] [CrossRef]
  6. Kang, H.B. Various Approaches for Driver and Driving Behavior Monitoring: A Review. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 616–623. [Google Scholar]
  7. Sacco, M.; Farrugia, R.A. Driver fatigue monitoring system using support vector machines. In Proceedings of the 2012 5th International Symposium on Communications, Control and Signal Processing, Rome, Italy, 2–4 May 2012; pp. 1–5. [Google Scholar]
  8. Jeong, S. The Impact of Social Robots on Young Patients’ Socio-Emotional Wellbeing in a Pediatric Inpatient Care Context. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2017. [Google Scholar]
  9. Silva, V.; Queirós, S.; Soares, F.O.; Esteves, J.S.; Matos, D. A Supervised Autonomous Approach for Robot Intervention with Children with Autism Spectrum Disorder. In Proceedings of the ICINCO (2), Prague, Czech Republic, 29–31 July 2019; pp. 497–503. [Google Scholar]
  10. Görür, O.C. Reshaping Human Intentions by Autonomous Sociable Robot Moves through Intention Transients Generated by Elastic Networks Considering Human Emotions. Master’s Thesis, Middle East Technical University, Ankara, Turkey, 2014. [Google Scholar]
  11. Cepeda, C.; Tonet, R.; Osorio, D.N.; Silva, H.P.; Battegay, E.; Cheetham, M.; Gamboa, H. Latent: A flexible data collection tool to research human behavior in the context of web navigation. IEEE Access 2019, 7, 77659–77673. [Google Scholar] [CrossRef]
  12. Musa, N.H.B. Facial Emotion Detection for Educational Purpose Using Image Processing Technique. Bachelor’s Thesis, Universiti Teknologi MARA, Shah Alam, Selangor, 2020; p. 372. [Google Scholar]
  13. Junior, W.T.L.; Silva, J.R. From Licklider to cognitive service systems. Braz. J. Technol. Commun. Cogn. Sci. 2017, 5, 1–15. [Google Scholar]
  14. Chattopadhyay, J.; Kundu, S.; Chakraborty, A.; Banerjee, J.S. Facial expression recognition for human computer interaction. In New Trends in Computational Vision and Bio-Inspired Computing: Selected Works Presented at the ICCVBIC 2018, Coimbatore, India; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1181–1192. [Google Scholar]
  15. Martinez, B.; Valstar, M.F.; Jiang, B.; Pantic, M. Automatic analysis of facial actions: A survey. IEEE Trans. Affect. Comput. 2017, 10, 325–347. [Google Scholar] [CrossRef]
  16. Emery, A.E.; Muntoni, F.; Quinlivan, R. Duchenne Muscular Dystrophy; Oxford University Press: Oxford, UK, 2015. [Google Scholar]
  17. Li, S.; Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215. [Google Scholar] [CrossRef]
  18. Jun, H.; Shuai, L.; Jinming, S.; Yue, L.; Jingwei, W.; Peng, J. Facial Expression Recognition based on VGGNet Convolutional Neural Network. In Proceedings of the 2018 Chinese Automation Congress (CAC), Xi’an, China, 30 November–2 December 2018; pp. 4146–4151. [Google Scholar]
  19. Agrawal, A.; Mittal, N. Using CNN for facial expression recognition: A study of the effects of kernel size and number of filters on accuracy. Vis. Comput. 2020, 36, 405–412. [Google Scholar] [CrossRef]
  20. Huang, Q.; Huang, C.; Wang, X.; Jiang, F. Facial expression recognition with grid-wise attention and visual transformer. Inf. Sci. 2021, 580, 35–54. [Google Scholar] [CrossRef]
  21. Huan, R.H.; Shu, J.; Bao, S.L.; Liang, R.H.; Chen, P.; Chi, K.K. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
  22. Shen, C.; Chen, Y.; Xiao, F.; Yang, T.; Wang, X.; Chen, S.; Tang, J.; Liao, Z. BAT-Net: An enhanced RNA Secondary Structure prediction via bidirectional GRU-based network with attention mechanism. Comput. Biol. Chem. 2022, 101, 107765. [Google Scholar] [CrossRef] [PubMed]
  23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  24. Bashyal, S.; Venayagamoorthy, G.K. Recognition of facial expressions using Gabor wavelets and learning vector quantization. Eng. Appl. Artif. Intell. 2008, 21, 1056–1064. [Google Scholar] [CrossRef]
  25. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816. [Google Scholar] [CrossRef]
  26. Sarnarawickrame, K.; Mindya, S. Facial expression recognition using active shape models and support vector machines. In Proceedings of the 2013 International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 11–15 December 2013; pp. 51–55. [Google Scholar]
  27. Saha, A.; Pradhan, S.N. Facial expression recognition based on eigenspaces and principle component analysis. Int. J. Comput. Vis. Robot. 2018, 8, 190–200. [Google Scholar] [CrossRef]
  28. Rodriguez, P.; Cucurull, G.; Gonzàlez, J.; Gonfaus, J.M.; Nasrollahi, K.; Moeslund, T.B.; Roca, F.X. Deep pain: Exploiting long short-term memory networks for facial expression classification. IEEE Trans. Cybern. 2017, 52, 3314–3324. [Google Scholar] [CrossRef]
  29. Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the BMVC 2015—Proceedings of the British Machine Vision Conference 2015; British Machine Vision Association: Durham, UK, 2015. [Google Scholar]
  30. Yang, L.; Tian, Y.; Song, Y.; Yang, N.; Ma, K.; Xie, L. A novel feature separation model exchange-GAN for facial expression recognition. Knowl.-Based Syst. 2020, 204, 106217. [Google Scholar] [CrossRef]
  31. Tang, X.; Liu, S.; Xiang, Q.; Cheng, J.; He, H.; Xue, B. Facial Expression Recognition Based on Dual-Channel Fusion with Edge Features. Symmetry 2022, 14, 2651. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn from all: Erasing attention consistency for noisy label facial expression recognition. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 418–434. [Google Scholar]
  33. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  34. Qiu, S.; Zhao, G.; Li, X.; Wang, X. Facial expression recognition using local sliding window attention. Sensors 2023, 23, 3424. [Google Scholar] [CrossRef]
  35. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
  36. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  37. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, 3–7 November 2013; Part III 20; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124. [Google Scholar]
  38. Bera, S.; Shrivastava, V.K. Analysis of various optimizers on deep convolutional neural network model in the application of hyperspectral remote sensing image classification. Int. J. Remote Sens. 2020, 41, 2664–2683. [Google Scholar] [CrossRef]
  39. Khanbebin, S.N.; Mehrdad, V. Improved convolutional neural network-based approach using hand-crafted features for facial expression recognition. Multimed. Tools Appl. 2023, 82, 11489–11505. [Google Scholar] [CrossRef]
  40. Chang, T.; Li, H.; Wen, G.; Hu, Y.; Ma, J. Facial expression recognition sensing the complexity of testing samples. Appl. Intell. 2019, 49, 4319–4334. [Google Scholar] [CrossRef]
  41. Sun, X.; Zheng, S.; Fu, H. ROI-attention vectorized CNN model for static facial expression recognition. IEEE Access 2020, 8, 7183–7194. [Google Scholar] [CrossRef]
  42. Sun, K.; Zhang, B.; Chen, Y.; Luo, Z.; Zheng, K.; Wu, H.; Sun, X. The facial expression recognition method based on image fusion and CNN. Integr. Ferroelectr. 2021, 217, 198–213. [Google Scholar] [CrossRef]
  43. Shen, T.; Xu, H. Facial Expression Recognition Based on Multi-Channel Attention Residual Network. CMES-Comput. Model. Eng. Sci. 2023, 135, 539–560. [Google Scholar]
  44. Minaee, S.; Minaei, M.; Abdolrashidi, A. Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors 2021, 21, 3046. [Google Scholar] [CrossRef] [PubMed]
  45. Zhou, L.; Wang, Y.; Lei, B.; Yang, W. Regional Self-Attention Convolutional Neural Network for Facial Expression Recognition. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2256013. [Google Scholar] [CrossRef]
  46. Pan, B.; Hirota, K.; Jia, Z.; Zhao, L.; Jin, X.; Dai, Y. Multimodal emotion recognition based on feature selection and extreme learning machine in video clips. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 1903–1917. [Google Scholar] [CrossRef]
  47. Meena, G.; Mohbey, K.K.; Indian, A.; Khan, M.Z.; Kumar, S. Identifying emotions from facial expressions using a deep convolutional neural network-based approach. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–22. [Google Scholar]
  48. Debnath, T.; Reza, M.M.; Rahman, A.; Beheshti, A.; Band, S.S.; Alinejad-Rokny, H. Four-layer ConvNet to facial emotion recognition with minimal epochs and the significance of data diversity. Sci. Rep. 2022, 12, 6991. [Google Scholar] [CrossRef]
  49. Ahmad, M.; Sanawar, S.; Alfandi, O.; Qadri, S.F.; Saeed, I.A.; Khan, S.; Hayat, B.; Ahmad, A. Facial expression recognition using lightweight deep learning modeling. Math. Biosci. Eng. MBE 2023, 20, 8208–8225. [Google Scholar] [CrossRef]
Figure 1. Framework for the proposed BGA-Net.
Figure 2. Feature extraction layer.
Figure 3. CK+ dataset sample.
Figure 4. Fer2013 dataset sample.
Figure 5. JAFFE dataset sample.
Figure 6. Confusion matrices for the Fer2013, CK+, and JAFFE datasets. True labels are represented by the vertical axis, while predicted labels are represented by the horizontal axis. The prediction accuracy for each category is represented by the elements on the diagonal; higher accuracy is indicated by darker colors. (a) Fer2013, (b) CK+, (c) JAFFE.
Figure 7. ROC curves for the Fer2013, CK+, and JAFFE datasets. (a) Fer2013, (b) CK+, (c) JAFFE.
Table 1. Parameters of BiGRU.

Number of Hidden Layers   Dimension of Hidden Layer Node   Accuracy/%
1                         128                              70.86
1                         256                              71.36
1                         512                              71.02
1                         1024                             71.11
2                         128                              70.33
2                         256                              70.74
2                         512                              71.89
2                         1024                             70.52
Table 2. Parameters of sliding window cropping.

Stride   Scale   Accuracy/%
2        3       70.44
2        5       70.33
2        7       71.89
2        9       70.10
2        11      71.02
3        3       70.08
3        5       70.38
3        7       70.80
3        9       70.38
3        11      70.99
Table 3. The value of γ.

γ            0       0.5     1       2       5
Accuracy/%   70.74   71.89   70.94   70.41   68.51
Table 4. Evaluation of each module in our model on the three datasets. 'SWC' refers to the sliding window cropping strategy.

Network Model     CK+      JAFFE    Fer2013
CNN               94.39%   91.42%   65.68%
CNN+SWC           94.89%   93.45%   66.74%
CNN+SWC+RNN       95.14%   94.34%   68.72%
CNN+SWC+LSTM      94.52%   94.33%   68.92%
CNN+SWC+GRU       95.23%   94.58%   69.43%
CNN+SWC+BiGRU     95.54%   95.23%   70.62%
BGA-Net           96.36%   95.51%   71.89%

