1. Introduction
Human emotions encompass psychological and physiological manifestations linked to subjective feelings, temperament, personality, motivational inclinations, behavioral reactions, and physiological arousal [1,2]. Emotions influence people’s cognition, decision-making, and interpersonal interactions [3]. Positive emotions may contribute to improving daily work efficiency, while negative emotions may disrupt people’s normal lives [4].
Researchers have increasingly focused on brain–computer interfaces, which utilize neural activity produced by the brain to facilitate seamless interactions between users and computers [5]. Concurrently, the emergence of affective artificial intelligence in human–machine interaction (HMI) has attracted growing interest [6]. The integration of emotions into human–computer interactions has rapidly evolved into a multidisciplinary research domain known as affective computing, encompassing computer science, neuroscience, psychology, and cognitive science [7]. Emotion recognition is one of the central issues in affective computing [8], with potential applications in disease assessment, fatigued driving detection, and psychological workload estimation [9].
In the domain of emotion recognition, methods for detecting affective states can be broadly categorized into two groups: those based on nonphysiological signals and those based on physiological signals. The former includes facial expressions [10], speech [11], and gestures [12], while the latter relies on physiological signals such as the electroencephalogram (EEG), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), and respiration (RSP). Compared to nonphysiological signals, which can be easily feigned, physiological signals are considered more objective and reliable in conveying emotions [13]. Among these physiological signals, the EEG is recognized for its excellent temporal resolution, enabling direct emotion recognition by analyzing instantaneous brain activity elicited by emotional stimuli [14]. Recently, there has been growing interest in enhancing brain–computer interfaces by leveraging user emotional state information obtained from EEG, a concept known as the affective brain–computer interface [15].
Many researchers have conducted emotion recognition experiments on various EEG datasets. These datasets are predominantly created using classic emotional induction paradigms to stimulate and record the resulting brainwave signals, for instance, the popular DEAP [2], SEED [7], and MAHNOB-HCI [16] datasets. However, the classic paradigm of emotion induction relies on passive emotion induction in laboratory settings, which differs significantly from real-world contexts and cannot fully encompass the psychological and physiological components of participants [17]. Therefore, eliciting intense and multifaceted emotions in participants is challenging, hindering the understanding of emotional processes. An ideal emotional induction paradigm should immerse participants in more realistic environments and adjust based on their responses, thereby facilitating deeper engagement in EEG-based emotion research.
Virtual reality (VR) serves as a potential medium to bridge the gap between laboratory and real-world environments. VR offers highly immersive, realistic, and complex virtual environments while still allowing experimental control, fully immersing users in the created surroundings and potentially enhancing their emotional experiences [18]. In virtual environments, emotions can be naturally and authentically evoked [19], offering benefits in entertainment, education, and psychotherapy [20]. Owing to its unique immersion and interactivity, VR could be an excellent basis for emotional induction paradigms.
VR scenes can be displayed using helmet-based (head-mounted), desktop-based, or projection-based methods, with head-mounted displays being the most commonly employed. As with traditional imagery, VR imagery encompasses two modes: nonstereoscopic (VR-2D) and stereoscopic (VR-3D). Currently, VR-2D is the predominant mode for presenting VR films, as its production is relatively straightforward. VR-3D live-action films require more expensive filming equipment, a more complex production process, and more stringent playback conditions, but they offer a stronger stereoscopic sensation and more realistic effects than VR-2D films.
Recent studies have compared differences in brain activation during emotion arousal between VR-3D and traditional 2D environments. For example, Yu et al. [21] examined the neural mechanisms underlying the two visual experiences, VR-3D and traditional 2D, based on EEG recorded while subjects viewed positive and negative emotional videos. They discovered that functional brain networks exhibited higher global efficiency in the VR-3D group. Tian et al. [22] investigated the impact of the two visual modes on emotion arousal. They found stronger emotional stimuli and greater emotional arousal in the VR-3D environment, along with higher beta EEG power in VR-3D than in traditional 2D. Xie et al. [23] conducted an emotion induction experiment involving six basic emotions, recording EEG signals while participants watched VR-3D and traditional 2D videos. They found significant differences in the induced discrete emotions between the two modes, with greater brain activation observed in the VR-3D mode. However, these studies have focused on traditional 2D versus VR-3D environments; research on the disparities between the VR-2D mode and the VR-3D mode remains limited.
Given the constraints of prior investigations, the principal aim of this study is to delineate disparities in brain activation patterns between the VR-2D and VR-3D modalities using EEG recorded with head-mounted VR displays. Additionally, we aim to incorporate more pertinent EEG signal information to enhance the accuracy of EEG emotion recognition. The principal research contributions of this paper are as follows: (a) effective EEG stimulation materials were selected and a corresponding dataset was collected; (b) a hybrid model combining Differential Entropy (DE), a Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM) network is proposed.
2. Methods
To investigate the disparities in EEG signals between the VR-2D mode and the VR-3D mode, this study conducted EEG experiments in both modes. The experimental design comprises seven steps. First, EEG data were collected from subjects in the VR-2D and VR-3D modes, together with their subjective questionnaire scores. Second, the EEG signals were preprocessed to obtain clean and high-quality signals. Third, power spectral density (PSD) and DE features were extracted from the preprocessed EEG signals. Fourth, data analysis was performed. Fifth, feature selection was performed through recursive feature elimination. Sixth, machine learning and deep learning were applied to classify the EEG data. Last, feature importance was discussed based on Shapley-value machine learning interpretability.
Figure 1 illustrates the process of VR emotion induction and analysis.
2.1. Participants and Ethics
This study recruited 32 undergraduate and graduate students by posting recruitment advertisements on campus, detailing the study’s theme and criteria such as age range and health status. Throughout the recruitment process, strict adherence to ethical and legal standards was maintained, employing rigorous recruitment procedures. Participants were aged between 21 and 26 years (mean age = 24.23 years, standard deviation = 4.15 years), with an equal gender distribution of 16 males and 16 females. All participants were right-handed and randomly assigned to either the VR-2D or VR-3D group, each consisting of 16 individuals (8 males and 8 females). Medical history information was obtained through questionnaire surveys, confirming no history of psychiatric disorders or brain trauma, and normal or corrected-to-normal vision. Due to signal artifacts, two participants were excluded from the final analysis, resulting in data analysis based on 30 participants (15 males and 15 females). All participants provided voluntary consent and received comprehensive information regarding the research objectives, experimental procedures, and associated risks. Prior to commencement, participants signed written informed consent forms and received appropriate compensation upon completion, with the option to withdraw from the experiment at any time without consequences. Ethical approval for this study was obtained from the local ethics committee.
The selection of undergraduate and graduate students as participants was based on several considerations. Firstly, these two groups typically demonstrate higher academic levels and relatively good physical health, which helps minimize external factors’ interference with experimental results. Additionally, they represent a wide age range and diverse educational backgrounds, enhancing the universality and representativeness of the study findings. Finally, undergraduate and graduate students usually possess strong learning and adaptability skills, enabling them to better understand and comply with experimental requirements and procedures.
The sample size of 32 participants was chosen to balance the statistical power of the experiment against the feasibility of available resources. Although a sample of 32 participants may not capture all individual differences, sample sizes of this order are relatively common in similar experiments and allow significant and reliable results to be obtained.
2.2. Experimental Equipment
The experimental setup, as illustrated in Figure 2, comprises the following components: a computer equipped with an Intel Core i7 processor (Intel Corporation, Santa Clara, CA, USA), an RTX 3060 graphics card (NVIDIA Corporation, Santa Clara, CA, USA), and 16 GB of RAM (Micron Technology, Inc., Boise, ID, USA), and an HTC Vive head-mounted display (HTC Corporation, New Taipei City, Taiwan) for presenting VR videos. The inter-pupillary distance of each participant was measured and the headset adjusted accordingly. EEG signals were collected using the Neuracle EEG Recorder (Borui Kang (Changzhou) Co., Ltd., Changzhou, China), which includes a cap with 32 EEG channels. Electrode placement followed the 10/20 system. To ensure EEG signal quality, the electrodes were filled with conductive gel, and electrode impedance was kept below 5 kΩ throughout the experiment. REF served as the reference electrode, and GND as the ground electrode.
2.3. Measurement Metrics
This experiment utilized the widely employed Self-Assessment Manikin (SAM) for measuring individual emotional responses. SAM, designed by Bradley and Lang [24], is a subjective rating scale that directly measures the emotional valence and arousal levels that humans experience in response to different stimuli, including indices such as valence, arousal, and dominance. In this experiment, scoring focused primarily on valence and arousal, with rating scales ranging from 1 to 9 (the valence scale ranges from unhappy to happy, with higher numbers indicating greater happiness; the arousal scale ranges from calm to excited, with higher numbers indicating stronger excitement).
To ensure the validity and fidelity of the SAM, we conducted an extensive literature review and performed preliminary testing before the experiment. Additionally, detailed usage instructions were provided to participants to ensure their correct understanding and utilization of the scale. Through these measures, we aimed to accurately capture participants’ emotional experiences using SAM in our study.
2.4. Stimuli Selection
Given the deficiency in required expertise [25], the inability to access gold-standard equipment and suitable controlled environments, and the scarcity of VR materials in the field of affective computing [26], the emotional elicitation video stimuli utilized in the experiment were sourced from an immersive VR video public database at Stanford University [27]. This dataset includes 73 immersive VR clips, each assigned a valence and arousal score within the four quadrants of Russell’s emotion model [28]. On this valence–arousal (VA) plane, the four quadrants are HAHV, HALV, LALV, and LAHV.
For the HAHV, LAHV, and LALV quadrants, videos with valence and arousal scores closest to the extreme corners of the quadrants were selected. Because the dataset contains no videos in the HALV quadrant, this experiment adopted emotion-eliciting videos previously utilized by Li et al. [25], who selected 15 of the most viewed horror videos on YouTube. Each video was rated by at least 16 volunteers according to the discrete 9-point SAM scale for valence and arousal. For each video, the normalized arousal and valence scores were calculated by dividing the average score by the standard deviation. Ultimately, the two videos closest to the extreme corner of the HALV quadrant of the VA plane were chosen. The HALV videos used in this paper can be found online at https://www.youtube.com/watch?v=ViLReDIvk_A (accessed on 21 March 2024) and https://www.youtube.com/watch?v=C0Rl4m38gOU (accessed on 21 March 2024). Additionally, 20 volunteers (10 males and 10 females, aged 24.1 ± 2.05 years) were separately recruited to rate the valence and arousal of the VR imagery materials before EEG data collection. Details and evaluation results of the video materials employed in the experiment are shown in Table 1. The content of the videos observed by the VR-2D and VR-3D groups was identical, with a consistent data format: a resolution of 4096 × 2048 pixels, a frame rate of 30 frames per second, and H.264 encoding. Videos were edited using Adobe Premiere 2022. The only difference between the VR-2D group and the VR-3D group was the removal of the parallax between the left-eye and right-eye images in the VR-2D group. As shown in Figure 3, eight VR videos corresponding to the respective quadrants were selected.
2.5. Experimental Procedure
Before the experiment, the experimenters explained the SAM emotional scale to the subjects and how to fill out the self-assessment form. Next, the experimenters fitted the EEG cap on the subjects and applied conductive paste to ensure that the electrode impedance was below 5 kΩ. After the EEG equipment was set up, all participants put on the VR devices. During the experiment, participants were instructed not to move their heads or bodies at random, to avoid errors caused by limb movement and interference between the VR headset and the EEG cap. Before the formal playing of the video clips, subjects were asked to sit quietly in a chair for five minutes, allowing them to adapt to the experimental environment and calm down, thus avoiding experimental errors due to excitement. During the official video playback, the subjects’ EEG signals were synchronously recorded.
Figure 4a illustrates the experimental procedure, consisting of a baseline phase and nine trials. The first trial was a test trial intended to familiarize subjects with the experimental procedure. The specific process for each trial is shown in Figure 4b. After the experiment started, eight VR videos were presented in random order across the remaining eight trials. Each trial included three steps. First, a fixation cross was displayed for five seconds. Second, the subject watched one of the randomly ordered VR videos; each video lasted approximately three minutes. Third, following each viewing, participants subjectively assessed their present emotional state. To mitigate potential artifacts stemming from repeatedly donning the VR head-mounted display (VR HMD) and disrupting the EEG electrode positions, the SAM emotional scale was presented on a virtual screen, enabling subjects to complete it with the VR controller without removing the VR HMD. The scoring and rest period took approximately three minutes, during which subjects could also recover from the preceding emotional induction phase to avoid fatigue. The experiment took approximately one hour to complete; its total duration was kept concise to prevent subject fatigue.
2.6. Data Preprocessing
Clean and high-quality EEG signals were obtained by preprocessing the experimental data with EEGLAB 2021.1, an open-source MATLAB 2021b toolbox renowned for its robust EEG preprocessing, feature extraction, and analysis capabilities [29]. As depicted in Figure 5, the data preprocessing included the following steps: first, importing the raw EEG data into MATLAB; second, locating the electrode positions; third, applying FIR bandpass filtering in the range of 0.1–75 Hz to minimize the introduction of artifacts, and filtering out 48–52 Hz power-line noise; fourth, downsampling the data to 200 Hz; fifth, segmenting the data based on marker labels, dividing each subject’s EEG data into eight trials, each comprising approximately three minutes of video playback (video data). Subsequent preprocessing of each trial involved bad-channel interpolation and bad-segment removal, using independent component analysis to eliminate artifacts such as blinking, eye drift, and electromyographic noise.
To mitigate the impact of video duration on EEG analysis results, the EEG data from all subjects were truncated to equal lengths, with each subject’s EEG data (for each emotional state) having a total duration of 120 s, amounting to 120,000 data points. Additionally, to ensure the experimental data’s reliability, resting-state EEG signals were separately collected for subjects in both modes. The resting-state durations for both modes were 60 s each, totaling 60,000 data points, to check for significant differences in baseline brain activity between the two groups before formally watching the videos. After data preprocessing, purified EEG signals were obtained for the subsequent feature extraction steps.
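The filtering and downsampling steps above can be sketched with SciPy. This is a minimal illustration, not the authors' code; the raw sampling rate of 1000 Hz, the filter lengths, and the helper name `preprocess` are assumptions:

```python
import numpy as np
from scipy import signal

FS_RAW = 1000        # assumed original sampling rate (Hz); not stated in the text
FS_TARGET = 200      # target rate after downsampling (Hz), as in the paper

def preprocess(eeg, fs=FS_RAW):
    """Bandpass 0.1-75 Hz, notch out 48-52 Hz line noise, downsample to 200 Hz.

    eeg: array of shape (n_channels, n_samples).
    """
    # FIR bandpass filter, 0.1-75 Hz, applied zero-phase with filtfilt
    bp = signal.firwin(numtaps=501, cutoff=[0.1, 75.0], pass_zero=False, fs=fs)
    eeg = signal.filtfilt(bp, [1.0], eeg, axis=-1)
    # FIR band-stop filter to remove 48-52 Hz power-line noise
    notch = signal.firwin(numtaps=501, cutoff=[48.0, 52.0], fs=fs)
    eeg = signal.filtfilt(notch, [1.0], eeg, axis=-1)
    # Downsample to 200 Hz (decimate applies its own anti-aliasing filter)
    factor = fs // FS_TARGET
    return signal.decimate(eeg, factor, axis=-1)
```

Artifact removal (ICA, bad-channel interpolation) is omitted here, as it is interactive in EEGLAB.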
2.7. Data Analysis
2.7.1. Questionnaire Analysis
For the data from the SAM scale, one-way analysis of variance was conducted using SPSS 27.0.1 statistical software (IBM Inc., Chicago, IL, USA) [30]. Independent two-sample t-tests were performed, and the EEG PSD was statistically analyzed with correction for multiple comparisons using the Benjamini–Hochberg false discovery rate (BH-FDR). All results were reported at a significance level of 0.05.
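The channel-wise t-tests with Benjamini–Hochberg correction can be sketched as follows. The synthetic data, group sizes, and the helper name `bh_fdr` are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def bh_fdr(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # largest k such that p_(k) <= (k / m) * alpha
    thresh = alpha * (np.arange(1, m + 1) / m)
    below = ranked <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Example: independent two-sample t-tests across several channels
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(10, 30))   # 10 channels x 30 subjects
group_b = rng.normal(1.5, 1.0, size=(10, 30))   # shifted mean -> real effect
pvals = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]
significant = bh_fdr(pvals, alpha=0.05)
```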
2.7.2. EEG Analysis
Utilizing Butterworth bandpass filters, each EEG segment was decomposed into four frequency bands (namely, δ, θ, α, and β). Subsequently, the PSD (in μV²/Hz) [30] features of the EEG signals in each frequency band were computed using the fast Fourier transform (FFT). The PSD can be defined according to Equation (1):

$$\mathrm{PSD}(f) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N} \left| X_k(f) \right|^2 \tag{1}$$

where $X_k(f)$ represents the FFT of the $k$-th EEG segment signal $x_k(n)$, $K$ denotes the number of data segments (in this experiment, $K = 1$), and $N$ indicates the number of data points in each segment (in this experiment, $N = 120{,}000$).
We computed the PSD of each subject across the different frequency bands, which reflects the distribution of signal power in the frequency domain [30,31]. Subsequently, we utilized the PSD to compare differences in brain activation between the two modes.
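A minimal single-segment (K = 1) version of Equation (1), together with a band-power helper, can be sketched as follows; the function names and the example band limits are illustrative:

```python
import numpy as np

def psd_fft(x, fs=200):
    """Single-segment (K = 1) PSD estimate: |X(f)|^2 / N per frequency bin."""
    n = x.size
    X = np.fft.rfft(x)
    psd = (np.abs(X) ** 2) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, psd

def band_power(freqs, psd, lo, hi):
    """Mean PSD within a frequency band, e.g. alpha (8-13 Hz)."""
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].mean()

# Example: a pure 10 Hz sine should concentrate its power in the alpha band
fs = 200
t = np.arange(400) / fs
x = np.sin(2 * np.pi * 10.0 * t)
freqs, psd = psd_fft(x, fs)
alpha_p = band_power(freqs, psd, 8.0, 13.0)
beta_p = band_power(freqs, psd, 13.0, 30.0)
```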
As shown in Figure 6, the following brain regions and channels were selected for analysis in this study to explore the PSD differences in the δ, θ, α, and β frequency bands across different brain areas: the frontal region (FP1, FP2, FZ, F3, F4, FCZ, FC3, and FC4), parietal region (C3, CZ, C4, CP3, CPZ, CP4, PZ, P3, and P4), temporal region (F7, F8, FT7, FT8, T7, T8, TP7, TP8, P7, and P8), and occipital region (OZ, O1, and O2).
To quantitatively assess the differences in EEG signals between the two modes, this study employed Cohen’s d as the measure of effect size, calculated as follows:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}}, \qquad s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}}$$

where $\bar{x}_1$ and $\bar{x}_2$ represent the mean values for a specific brain region under the VR-2D and VR-3D modes, respectively, $n_1$ and $n_2$ denote the sizes of the two sample groups, and $s_1$ and $s_2$ are their respective standard deviations.
Based on the magnitude of Cohen’s d, the effect size can be interpreted as follows: a d value around 0.2 indicates a small effect, around 0.5 a medium effect, and 0.8 or above a large effect [32].
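The pooled-standard-deviation form of Cohen's d described above can be sketched directly; the function name is illustrative:

```python
import math

def cohens_d(x1, x2):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    m1 = sum(x1) / n1
    m2 = sum(x2) / n2
    # unbiased sample variances
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    s_pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled
```

For example, two groups whose means differ by half the pooled standard deviation yield d = 0.5, a medium effect on the scale above.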
Furthermore, to quantitatively assess the extent to which the data support the null hypothesis (H0) versus the alternative hypothesis (H1), this research uses the Bayes factor as a statistical instrument. Within the scope of this study, the null hypothesis stipulates that there are no significant differences in EEG signal activity in any brain region or frequency band between the VR-2D and VR-3D environments. Conversely, the alternative hypothesis posits that at least one brain region exhibits significantly different EEG signal activity in at least one frequency band between the two environments.
By employing the Bayesian information criterion (BIC), as proposed by Jarosz et al. [33], an approximation of the Bayes factor is calculated. This methodology facilitates a direct evaluation of the evidence the data provide for each hypothesis, offering a quantified pathway for analysis.
The level of data support for the alternative hypothesis is stratified based on the inverse of the Bayes factor: an inverse value between 1 and 3 signals weak or merely anecdotal evidence for the alternative hypothesis; between 3 and 10, positive or substantial support; and between 10 and 20, strong support.
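The BIC-based Bayes factor approximation and the evidence stratification above can be sketched as follows; the function names and category labels are illustrative assumptions:

```python
import math

def bf01_from_bic(bic_h0, bic_h1):
    """Approximate Bayes factor BF01 = p(D|H0) / p(D|H1) from BIC values,
    using the standard approximation BF01 ~= exp((BIC_H1 - BIC_H0) / 2)."""
    return math.exp((bic_h1 - bic_h0) / 2.0)

def support_for_h1(bf01):
    """Classify the evidence for H1 from 1 / BF01, per the text's stratification."""
    inv = 1.0 / bf01
    if inv < 1:
        return "evidence favors H0"
    if inv < 3:
        return "weak / anecdotal"
    if inv < 10:
        return "positive / substantial"
    if inv < 20:
        return "strong"
    return "very strong"
```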
2.7.3. Feature Selection
Given that small datasets in multidimensional feature spaces are prone to overfitting, feature selection was conducted to prevent this and enhance EEG-based emotion recognition performance. Recursive feature elimination (RFE) iteratively selects features by progressively considering smaller subsets of features. Initially, an estimator is trained on the full feature set, and the importance of each feature is assessed using attributes such as coef_ or feature_importances_. Subsequently, the least important features are eliminated from the current set. This process iterates recursively on the pruned set until the desired number of features is achieved. Specifically, to increase the number of training samples, the original EEG signals were divided into nonoverlapping segments of length $T$ = 1 s (yielding approximately 14,400 samples per subject), with each segment assigned the same label as the original EEG signal. Next, each segment was decomposed into five frequency bands (namely, δ, θ, α, β, and γ) using Butterworth bandpass filters. DE features were extracted from each frequency band with a 0.5 s window. For each segment, the DE features of the five frequency bands were calculated over the 30 channels, yielding 30 × 5 = 150 features. The RFECV function in scikit-learn [34] was employed to perform RFE within a cross-validation loop to determine the optimal number of features. In this study, the Support Vector Machine (SVM) [35] model was utilized for feature selection. The SVM model was fitted on the 150 features, and the importance of each feature was calculated to determine the optimal number. Subsequently, only the selected DE features were used to classify brain activation states with an SVM. DE features have been suggested to be among the most effective for emotion recognition [36].
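The RFECV-based selection over the 150 DE features can be sketched as below. The synthetic data stand in for real DE features, and the step size is an assumption; the text does not specify it:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the 150 DE features (30 channels x 5 bands)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 150))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only two informative features

# A linear SVM exposes coef_, which RFECV uses to rank feature importance
selector = RFECV(
    estimator=SVC(kernel="linear"),
    step=10,                      # assumed: drop 10 features per iteration
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
selector.fit(X, y)
selected = selector.support_      # boolean mask over the 150 features
```

Only the columns of `X` flagged in `selected` would then be passed to the downstream SVM classifier.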
DE features are used to measure the complexity of EEG signals. The DE feature $h(X)$ is defined as:

$$h(X) = -\int_{-\infty}^{+\infty} f(x) \log f(x) \, dx$$

where $X$ is a random variable and $f(x)$ is the probability density function of $X$. If $X$ follows a Gaussian distribution $N(\mu, \sigma^2)$, the DE feature is calculated using the following closed-form expression:

$$h(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right)$$

where $e$ and $\sigma$ represent Euler’s number and the standard deviation of $X$, respectively.
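Under the Gaussian assumption, the DE of a band-filtered EEG window reduces to a function of its variance, which can be sketched as follows (function names are illustrative):

```python
import math

def de_gaussian(sigma):
    """DE of a Gaussian N(mu, sigma^2): 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def de_from_window(samples):
    """Estimate the DE of one band-filtered window, assuming it is ~ Gaussian:
    plug the window's empirical variance into the closed-form expression."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((v - mean) ** 2 for v in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * var)
```

Note that doubling the standard deviation raises the DE by exactly log 2, so DE grows logarithmically with signal power.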
2.7.4. Machine Learning-Based Classification of Brain Activation States
Because our collected dataset has its own characteristics, we required a classifier with no prior bias toward either the VR-2D or the VR-3D data. The SVM satisfies this requirement, is widely used in classification tasks, and has shown good performance across various domains, so we chose the SVM-based method as the initial classification approach. Subsequently, to further improve the accuracy of EEG emotion recognition, we introduced deep learning methods.
The ratings given by the subjects for each video on the arousal and valence scales (1–9) were used as labels, and these labels were converted into two binary classification problems using a threshold of 5 on both the arousal and valence dimensions. An SVM was employed to classify the DE features of EEG signals in both modes to verify whether the EEG signals under the two modes could be differentiated. Evaluation metrics included average accuracy, precision, recall, specificity, and F1 score. To make optimal use of the data and select the best parameter combination, this study adopted a nested (dual) ten-fold cross-validation method. The outer layer used StratifiedKFold for data partitioning to fully exploit the data, while the inner layer employed a grid search (GS) algorithm to obtain the best SVM parameter combination. The search range for parameter C was set to [1 × 10⁻⁵, 11] and that for gamma to [1 × 10⁻⁵, 1], with both ranges divided into 10 equal parts. The search range for the kernel parameter was set to ‘linear’ and ‘rbf’.
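The nested cross-validation scheme above can be sketched with scikit-learn. To keep the sketch fast, the synthetic data are small and both loops use 3 folds instead of the paper's 10; the parameter ranges mirror those in the text:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: labels depend linearly on the first feature
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = (X[:, 0] > 0).astype(int)

param_grid = {
    "C": np.linspace(1e-5, 11, 10),       # 10 equal parts, as in the text
    "gamma": np.linspace(1e-5, 1, 10),
    "kernel": ["linear", "rbf"],
}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: grid search for the best (C, gamma, kernel);
# outer loop: unbiased estimate of the tuned model's accuracy.
clf = GridSearchCV(SVC(), param_grid, cv=inner, scoring="accuracy")
scores = cross_val_score(clf, X, y, cv=outer, scoring="accuracy")
mean_acc = scores.mean()
```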
2.7.5. Deep Learning-Based Classification of Brain Activation States
To delve deeper into the differences in EEG signals during emotional arousal between the two modes and to effectively integrate the spatial-frequency-temporal information of EEG, we propose a hybrid model combining DE, CNN, and LSTM, namely, the DE-CRNN EEG emotion recognition architecture, illustrated in Figure 7. This architecture comprises six parts: original EEG signals, original EEG trials, EEG segments, the 4D spatial-frequency-temporal structure, the CRNN, and EEG classification results. Below, the details of the 4D spatial-frequency-temporal structure, the CRNN, and the classifier are described in turn.
This study evaluates the performance of the EEG emotion recognition method using the established protocol of Li et al. [37]. Specifically, ten-fold cross-validation was conducted for each subject, with individual performance represented by the average classification accuracy (ACC) and standard deviation (STD). The mean ACC and STD across all subjects signify the final performance of this model.
The DE-CRNN model was trained on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The Adam optimizer was used, with the learning rate and the number of epochs set to 0.001 and 100, respectively.
4D Spatial-Frequency-Temporal Structure
Assume that an original EEG segment is represented as a feature vector $X \in \mathbb{R}^{C \times S}$, where $C$ and $S$ denote the number of channels and the sampling rate of the original EEG signal, respectively. For each EEG segment, DE features for each frequency band are calculated using 0.5 s nonoverlapping windows. Thus, a 3D feature tensor $Y_n \in \mathbb{R}^{C \times f \times 2T}$ is extracted from the original EEG segment, where $n$ indexes the total number of samples, $f$ is the number of frequency bands, and $2T$ represents twice the length of the segment, i.e., the number of temporal slices. In this paper, $f$ is set to 5. To effectively utilize the spatial structure information of the electrodes, all electrode channels are organized into a compact 2D map of size 8 × 6. The 2D maps of the different frequency bands are then stacked into a 3D array to effectively integrate their complementary information. Therefore, the 3D feature tensor is ultimately transformed into a 4D feature tensor $Z_n \in \mathbb{R}^{h \times w \times f \times 2T}$, where $h$ and $w$ are the height and width, respectively, of the 2D map. In this study, $h = 8$ and $w = 6$ are set.
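The channel-to-grid mapping that builds the 4D structure can be sketched in NumPy. The row-by-row electrode layout below is a placeholder; the actual mapping follows the cap's 10/20 positions:

```python
import numpy as np

# Hypothetical 8 x 6 electrode layout for the 30 analyzed channels
# (-1 marks an empty grid slot); the true layout follows 10/20 positions.
GRID = np.full(8 * 6, -1, dtype=int)
GRID[:30] = np.arange(30)          # placeholder assignment, row by row
GRID = GRID.reshape(8, 6)

def to_4d(de_features):
    """Map DE features (channels x bands x temporal slices) onto the 2D grid.

    de_features: array of shape (30, f, 2T) -> returns (8, 6, f, 2T).
    Empty grid slots are zero-filled.
    """
    c, f, t = de_features.shape
    out = np.zeros((8, 6, f, t))
    for i in range(8):
        for j in range(6):
            ch = GRID[i, j]
            if ch >= 0:
                out[i, j] = de_features[ch]
    return out
```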
CRNN
For the 4D spatial-frequency-temporal representation $Z_n$, spatial and frequency information is extracted from each of its temporal slices through a CNN module, as depicted in Figure 7. The module comprises four convolution blocks, a max-pooling layer, and a fully connected layer. The ReLU activation function is applied to all convolution layers. A dropout layer with a 25% dropout probability follows the pooling layer to prevent overfitting and enhance the model’s generalization capability. The output of the dropout layer is flattened and fed into a fully connected layer with 64 units. Therefore, for each temporal slice $z_t$, the final output $s_t$ serves as its representation in space and frequency.
Because the variations between different temporal slices contain valuable temporal information for emotion recognition, an LSTM module is employed to learn the temporal dependencies across the CNN outputs, thereby achieving more accurate emotion classification. The LSTM module takes the output sequence $S = (s_1, s_2, \ldots, s_{2T})$ from the CNN, where $s_t \in \mathbb{R}^{64}$. In Figure 7, an LSTM layer with 64 memory units is utilized to explore the temporal dependencies within the segments. The output of the LSTM layer is calculated as follows:

$$i_t = \sigma(W_i s_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f s_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o s_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c s_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ represents the sigmoid activation function, and $i_t$, $f_t$, $o_t$, $c_t$, and $h_t$ represent the input gate, forget gate, output gate, update cell, and hidden state, respectively. $h_t$ denotes the output of the LSTM unit at time $t$. $W$, $U$, and $b$ are the weight matrices and bias terms, respectively.
The final output is obtained from the last LSTM node. Therefore, the ultimate representation of the EEG segment is $h_{2T}$, which integrates the frequency, spatial, and temporal information of $Z_n$.
Classifier
From the ultimate representation $h_{2T}$ of the EEG segment, a fully connected layer and a softmax activation function are utilized to predict the label of the 4D feature $Z_n$, described as follows:

$$\hat{y} = \mathrm{softmax}\left(W h_{2T} + b\right)$$

where $W$ and $b$ are learnable parameters and $\hat{y}_k$ represents the probability of EEG segment $Z_n$ belonging to emotion category $k$.
4. Discussion
This paper conducted statistical analysis on the SAM scale and the EEG PSD for four emotions and employed machine learning and deep learning techniques to classify EEG-based emotional responses induced under VR-2D and VR-3D modes. The aim was to explore the impact of these two modes on brain activity during the arousal process of different emotions. The following sections provide a more detailed discussion.
4.1. Brain Activity Differences
The study findings demonstrate that, compared with the VR-2D mode, the VR-3D mode exhibits more significant brain activation. The regions with the most substantial differences in brain activation across the four emotions are located primarily in the frontal, temporal, and occipital regions, consistent with findings from previous studies. Perry et al. [38] confirmed that the frontal region is closely related to emotions. The hippocampus, located in the deep grooves of the temporal region, plays a significant role in emotional processing and cognitive functions. The occipital region, which is responsible for visual information processing in the brain, is associated with increased α activity during multimedia presentations (videos and images) [23].
4.2. Brain Activation State Classification
In this study, first, the original EEG signals were segmented into nonoverlapping segments of $T$ = 1 s to increase the number of training samples. Second, recursive feature elimination was employed for feature selection, followed by emotion classification using an SVM with nested ten-fold cross-validation. The results demonstrated effective classification in both modes, with notably higher EEG-based emotion recognition accuracy in the VR-3D mode. Last, to delve deeper into the differences in EEG signals during emotional arousal between the two modes and to effectively integrate the EEG’s spatial-frequency-temporal information, this paper proposed the DE-CRNN model for EEG-based emotion classification. The results indicated more significant brain activation in the VR-3D mode, and the high AUC values for each subject in the ROC curves underscored the robustness of the DE-CRNN model.
4.3. Future Work
This study has some limitations. First, EEG signals are susceptible to interference. To mitigate artifacts such as ocular and myoelectric noise and uphold the integrity of the experimental data, subjects were instructed to refrain from rotating their heads during the tasks. Although this restriction prevented participants from viewing the full 360° VR videos, thereby somewhat attenuating the immersive experience, it was necessary to ensure data validity. Future research should consider validating the outcomes presented here through magnetic resonance imaging (MRI) or functional MRI scans, delving deeper into the distinctions in brain activation across the two VR modalities. Second, the current study is limited to binary classification on the valence and arousal dimensions; future work will explore classifying emotions across four distinct categories. Third, this paper focused on EEG-based emotion classification across all channels and frequency bands. Future studies will separately investigate specific channels and individual frequency bands to identify the key channels and bands for EEG-based emotion recognition. Fourth, the participants in this study consisted of undergraduate and graduate students. In the future, we plan to recruit participants from different age groups to further enhance the generalizability of the experiment. Fifth, although the deep learning model proposed in this study can effectively capture the temporal information of EEG signals, Graph Neural Networks (GNNs) are better suited for handling spatial information. Therefore, in the future, we will consider using GNNs to extract spatial information from EEG signals.