Article

Movie Reviews Classification through Facial Image Recognition and Emotion Detection Using Machine Learning Methods

by Tehseen Mazhar 1, Muhammad Amir Malik 2, Muhammad Asgher Nadeem 3, Syed Agha Hassnain Mohsan 4, Inayatul Haq 5, Faten Khalid Karim 6,* and Samih M. Mostafa 7,*
1 Department of Computer Science, Virtual University of Pakistan, Lahore 54000, Pakistan
2 Department of Computer Science and Software Engineering, International Islamic University, Islamabad 44000, Pakistan
3 Department of Computer Science, University of Sargodha, Sargodha 40100, Pakistan
4 Optical Communications Laboratory, Ocean College, Zhejiang University, Zheda Road 1, Zhoushan 316021, China
5 School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
6 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
7 Computer Science Department, Faculty of Computers and Information, South Valley University, Qena 83523, Egypt
* Authors to whom correspondence should be addressed.
Symmetry 2022, 14(12), 2607; https://doi.org/10.3390/sym14122607
Submission received: 7 November 2022 / Revised: 25 November 2022 / Accepted: 2 December 2022 / Published: 9 December 2022

Abstract

Face recognition technology is a critical component of human-computer interaction (HCI), and affective computing relies heavily on the identification of facial emotions. Facial expression recognition (FER) has numerous applications, from emotion-driven face animation to dynamic assessment, and universities have consequently begun to support FER research in real-world settings. Short video clips are continually uploaded and shared online, building up a large library of videos on a wide range of topics. This enormous volume of movie data appeals to system engineers and to researchers in automatic emotion mining and sentiment analysis. The central idea is that items can be categorized by examining how individuals feel about them. Facial appearance may be simple or complex, but people everywhere continually express their feelings, whether happiness, sadness, or uncertainty, through their faces, and online users can additionally express themselves through a video's editing, music, and subtitles. Noise in video data must frequently be removed before the data can be used, and automatically determining how someone feels in a video is a challenging task that will only become harder over time. This paper therefore shows how facial-recognition-based video analysis and sentiment analysis can support business growth and essential decision-making. We use an emotion detection technique to determine how viewers are affected by what they watch, and machine learning algorithms assess and categorize the emotions expressed in movie reviews. A lightweight machine learning algorithm is proposed for aspect-oriented emotion classification of movie reviews. Experimental results on real and published datasets are compared with several machine learning algorithms, namely Naive Bayes, Support Vector Machine, Random Forest, and CNN. The proposed approach obtained 84.72% accuracy and 79.24% sensitivity, with a specificity of 90.64% and a precision of 90.2%. The proposed method thus significantly increases the accuracy and sensitivity of emotion detection from facial features, and it can handle datasets of different emotions with symmetric characteristics as well as symmetrically designed facial image recognition tasks.

1. Introduction

Videos contain a great deal of opinionated data, and researchers are paying increasing attention to automated extraction techniques that make this data useful for decision-making. Businesses face rising costs and shrinking budgets and are concerned about customer responses, competitors, product market value, and similar issues. At the same time, they are working to retrieve useful information from user-generated social media content, turning e-commerce into social commerce, because customers now rely on recommendations, referrals, reviews, and ratings to make purchase decisions. To mine such user-generated data, sentiment analysis is employed to determine a user's opinion by recognizing facial expressions in user videos about movie reviews.
Opinions can be classified as either positive or negative: happy faces can be treated as positive sentiment, while angry, disgusted, or distressed faces indicate negative emotion. Beyond individual opinions, other facial and nonverbal expressions are also vital to emotion analysis. It is difficult, however, to implement a single method as a generic algorithm, because people differ according to their society, region, environment, and education. Understanding user opinion through emotion detection and analysis is therefore cumbersome for several reasons. (i) Opposite orientations: an opinionated expression may imply opposite orientations in different domains. (ii) Interrogative or conditional expressions: sometimes an expression does not articulate any emotion, as observed in interrogative and conditional situations, for example when a person asks, "Can you please recommend a good movie?" or says, "If I find a good movie that matches my preferences, I will surely watch it." The positive word "good" is present in both examples, but it does not guarantee that the accompanying facial expression carries positive or negative sentiment. (iii) Sarcastic expressions, which are complicated to handle, for example the facial expression of a person who says, "What a great movie! Stopped watching in a day." In addition, some video frames contain no opinionated faces but only objective expressions stating factual information, for example the expression accompanying "watching this movie on TV consumes more electricity"; there is no sentiment word, yet the utterance still corresponds to negative sentiment. These are the challenges this study addresses in devising a method for extracting user sentiments about a movie and scaling them on different emotion scales.
Customers’ reviews of a company’s goods, services, expertise, antiques, etc., are highly valued. Additionally, individual clients are curious about what other clients have to say about particular businesses [1]. Companies have changed their structures partly due to comments and posts from different people [2].
In [3], the interaction between a gripper and an object is represented by the Interaction Bisector Surface (IBS), which has been used successfully to ascertain the spatial relationships between 3D geometric objects; the IBS is the Voronoi diagram of two neighboring 3D geometric items.
Other authors proposed an improved region-growing segmentation method (IRGSM) that combines multi-pixel pit points with point cloud coordinates, designed specifically for detecting distortion zones in images of open-pit mining rock slopes; the technique can identify edited regions of a picture. Compared with the original region-growing segmentation approach, IRGSM reduces the average identification error in the X and Y directions by 13.37% and 11.29%, respectively [4].
Similarly, other authors proposed restoring an image's brightness with Gaussian filtering after the color space has been changed; the lighting distribution parameters are adjusted nonlinearly to boost the image's brightness and increase color accuracy. The suggested method corrects for uneven lighting and produces an overall image that is cleaner and more realistic [5].
Other authors used a discriminative variational autoencoder to learn a shape prior for deciding whether a 3D shape is plausible. A symmetry detection network predicts symmetries that, when applied to the learned form, produce shapes with high probability under the shape prior. They provide a novel way of configuring symmetry that makes end-to-end training easier and allows multiple rotational and reflection symmetries to be detected with learning-based symmetry estimation. To learn the symmetry-aware shape prior, symmetry detection and shape completion are combined into a single technique, so symmetry can be identified more accurately and dependably. Experiments show that the method locates reflection and rotational symmetries efficiently and is robust to difficult conditions such as occlusion by multiple objects and high scanning noise [6].
Facial expressions arise from the dynamics of facial muscles driven by internal perceptions, intentions, personal emotional states, or social relations [7]. Analyzing and judging facial expressions is a natural human ability, and nonverbal communication becomes easier with facial expressions and head movements [8]. Much research has been conducted on facial expression recognition over the last twenty years, and FER has played a vital role in developing interfaces that can identify people's emotional expressions and responses. However, facial expression recognition in social environments remains difficult for computers because of the many challenges posed by real-world situations. Videos contain sequences of images that help reveal people's moods through their emotions, providing a way to evaluate the overall perspective of people at a glance. People are often subjective about the content they read or the movies they watch. Sentiment analysis is a broad concept that includes opinions, feelings, and hypothetical reasoning, and opinion mining is one area where subjectivity is common. Videos can convey emotional information because they contain many vivid images, but noise and poor video quality are barriers to analysis because they can mislead polarity orientation, and emotion detection methods perform poorly under these conditions. Features are extracted from videos after preprocessing, the initial preparation of videos for further processing; video analysis tools provide several preprocessing techniques, such as frame segmentation, noise removal, and normalization [9]. The most common task in video-based subjectivity analysis is recognizing facial polarity from a video clip. Several recognition and analysis approaches have been investigated in the literature. Machine learning algorithms require a training dataset before they can be evaluated on real-world input; Naive Bayes, Multinomial Naive Bayes, Stochastic Gradient Descent, Maximum Entropy, and Random Forest are popular techniques [10]. The six basic emotions are presented in Figure 1.
Emotions greatly influence human behavior. In everyday speech, a feeling is a conscious experience characterized by a strong mental state and a high degree of pleasure or displeasure. Other factors such as temperament, personality, disposition, mood, and motivation are also tied to emotion [12]. Sentiment is also related to a person's cognitive ability. Emotion is a person's physiological and mental state involving subjective experience, behavior, feelings, thoughts, and actions [13].
An emotion can be expressed through a facial expression, a line of speech, or a bodily action, and it can be multi-dimensional: not all instances of the same expression look identical. Everybody feels different intensities of emotion, from mild frustration to blind anger. Emotions change in waves, which means every episode of emotion has a beginning and an end, regardless of how the expression is conveyed. An emotional episode may begin with an internal or external cause and end when that cause is dealt with.
AffectNet and FER (facial emotion recognition) are the datasets used to evaluate the proposed system. FER is a publicly available dataset on the Kaggle website consisting of facial images labeled with emotions. AffectNet is a publicly available dataset for emotion classification; it contains more than 1 million facial images, collected over a long period using emotion-related keywords in different languages, together with annotation data [14]. The AffectNet dataset was developed to explore the relationships among different facial expressions and emotions in region-based facial feature identification and interpretation.
Some findings suggest that what appears on the surface may conceal an undiagnosed and unrecognized mental state. Some feelings are tangible in our movements, apparent in our words, and evident on our faces. Most emotions change in waves: every emotional episode has a beginning and an end, starting when an internal or external trigger is present and ending when the trigger is removed [15].
This paper presents a facial-recognition-based movie review system that detects and classifies emotions using HAAR features. Seven output nodes correspond to the seven facial expressions (neutral, angry, disgusted, fearful, happy, sad, and surprised). Keeping in mind the system's business value, and because the final classification is intended to obtain reviews about a movie, the system is developed to classify movie opinions into binary emotions.

2. Literature Review

The authors in [16] used a fusion strategy with an RNN to detect valence from visual input over the AffectNet and FER datasets, reporting an accuracy of 61.6%. Other researchers combined deep learning (CNN) with handcrafted features based on a bag-of-visual-words model to detect facial emotions over the AffectNet and FER datasets, reporting an accuracy of 59.6% [17]. Deep learning with an ensemble technique was also used for emotion recognition over AffectNet and FER, achieving an accuracy of 59.3% on the AffectNet dataset [18].
Researchers have applied supervised learning methods to the AffectNet dataset for facial emotion recognition and classification; their model achieved an average accuracy of 58.7% [19]. ERMOPI (Emotion Recognition using Meta-learning across Occlusion, Pose, and Illumination) was proposed for detecting emotion over the AffectNet dataset, with a reported accuracy of 68% using automated feature recognition [20]. The authors of [21] used a region attention network for facial expression recognition over AffectNet, achieving 59.5% accuracy.
The authors of [22] used a popular face detector to locate the face in the image. A parametric technique, as in active appearance models (AAM), was then used to model the face. In these models, facial parts (facial points) are located rather than the whole face, an approach that has recently gained much popularity. First, the locations of facial parts such as the lips, nose, and eyes are established; second, facial features are derived from these facial points; third, a feature vector is constructed from the individual features computed in the first step. Finally, principal component analysis (PCA) is applied to this feature vector to obtain a less noisy, compact, and discriminating representation. A classifier then assigns the face either to fixed emotion classes or to the FACS format [22].
Researchers have stated that facial expressions support nonverbal communication: an individual's emotional state is conveyed to observers through facial expressions, which can be adopted voluntarily or involuntarily [23]. A facial expression develops from changes in the position or movement of facial muscles under the skin. Facial expression recognition involves several stages, including face detection, feature extraction, and classification. An image may contain several faces, and part of any face may be covered or occluded by other objects; feature extraction becomes complex when a face is occluded by other faces or by objects such as hair, glasses, hands, and masks [24].
Two main approaches, holistic and feature-based, are used for face detection. The holistic approach considers the face as one entity, whereas the feature-based approach focuses on particular points such as the mouth, eyes, and nose. The face detection stage establishes the presence and location of a face in an image by distinguishing face regions from every other pattern present in the image; face modeling and face segmentation are very useful in accomplishing this task. Several factors affect the face detection process, such as illumination, pose, imaging noise, resolution, and occlusion [23].
Face detection algorithms find the position of an individual's face in an image. Different techniques have been discussed, such as edge detection, skin-color segmentation, and other heuristics [25]; a modern analysis of face detection studies can be found in [25]. The most widely used face detector design is based on a boosting framework over Haar-like features. The cascade classifier analyzes different image locations using a frame or sub-window to detect a face, and AdaBoost is used with the Haar features to train the cascade classifier. Models are trained with different head poses so that faces can be detected in multiple orientations. The VJ (Viola-Jones) face detector has a long training time [26], which motivated a greedy feature selection method: some discriminating features are identified with forward feature selection before the cascade classifier is trained. Energy-based models are also used to locate the face and head pose in the image simultaneously. The authors of [27] introduced a vector boosting method in which the face area is divided into smaller subspaces represented as a tree.
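As an illustration of the cascade-based detection described above (not code from the cited works), the sketch below uses OpenCV's bundled Haar cascade for frontal faces; the image path and detection parameters are only example values.

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (a Viola-Jones style detector).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Read an example frame and convert it to grayscale; the cascade works on intensity values.
image = cv2.imread("frame.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Slide a detection sub-window over the image at multiple scales.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))

# Each detection is an (x, y, w, h) sub-window that can be cropped for expression analysis.
for (x, y, w, h) in faces:
    face_roi = gray[y:y + h, x:x + w]
    print("face at", (x, y), "size", (w, h))
```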
Similarly, other authors suggested using a facial point detector based on a pictorial structure (PS), extending the pictorial structure model with different face poses and using a single framework for head pose inference and for face and fiducial point detection. The suggested face detector gives better results than the VJ face detector, but a pictorial-structure-based detector has the disadvantage of needing initialization from a VJ face detector; the authors compensate for this drawback by using mixture detectors over various poses. Several problems arise in selecting suitable face and facial point detectors [28].
Researchers have found facial points using a Bezier volume deformation tracker, matching two successive frames to obtain motion information. Geometric features are calculated by fitting AAM models to input faces; the authors used a database to compare different AAM techniques, though the method has the disadvantage that facial parts must be initialized manually. They also introduced a geometric descriptor called an emotion image (EI), in which the facial point detector provides input to an undirected map that further derives a visual map. They reported that geometric and appearance features perform similarly, but geometric features depend heavily on accurate location information for the facial points. Facial point detection is accurate under controlled conditions, such as research labs, compared with real-world environments, and if facial parts are not detected properly, the error propagates to the geometric feature representation [29]. A description of appearance features is presented in Figure 2.
Some studies [30,31] proposed crafted algorithms such as active contour and superpixel algorithms as alternatives to CNNs; however, this claim has yet to be tested for emotion recognition from facial features. Other researchers proposed the Syra GAN algorithm as a possible alternative to CNNs for classification with noisy image data.
The authors used PCA to characterize the geometry of faces, on the premise that eigenfaces efficiently provide sufficient information to reconstruct an appearance. In other research, eigenfeatures (eyes, mouth, nose, cheeks, etc.) were used instead of eigenfaces; this approach is less sensitive than the eigenface approach, and with eigenfeatures the recognition rate increased to 95%. The authors of [32] presented a facial feature extraction technique for comparing two faces; the proposed process showed better face recognition performance but still faced problems of illumination, pose, and expression [33].
In real-time systems, feature selection techniques can also be used for object detection; an object detector with high computational efficiency can be trained using eigenvectors [34]. The local binary pattern (LBP) is considered one of the best techniques for describing the texture features of an image. Because of its rotation invariance and fast computation, LBP is widely used in face recognition, image segmentation, texture examination, image retrieval, and related tasks. Moving objects can be detected with LBP by subtracting the background, comparing each pixel's texture value against the target. Key points in the neighborhood are identified by the uniform LBP pattern, and these key points are used to build a mask that helps select color-texture features. The AdaBoost algorithm derives a robust, more accurate learner by combining many weak and inaccurate learners; this is the core idea of boosting in machine learning. AdaBoost is widely used across applications in different fields: weak, simple classifiers are linearly combined to produce a strong classifier that is more accurate and faster. Although AdaBoost is more resistant to overfitting than many machine learning algorithms, it remains sensitive to noisy data and outliers [35].
Researchers have used different support vector machine kernels with feature boosting based on the AdaBoost algorithm; this combination performed better than feature selection using Linear Discriminant Analysis (LDA) or Principal Component Analysis (PCA). Two methods were suggested: the first for automatic Action Unit (AU) recognition in video sequences, and the second for deriving learned emotion categories by classifying AU-coded expressions. The latter is an appropriate method for analyzing temporal sequence patterns; temporal templates and post-face-registration history are maintained from image sequences, and AUs are identified using temporal rules. This method achieved only a 90% recognition rate with 27 AUs [36].
In the same way, Pantic and colleagues located 20 facial points using a wavelet-based GentleBoost template approach; these facial points further define spatiotemporal features, and AdaBoost and SVM approaches select some of these features to determine the presence of AUs. They designed and developed a lie detection system using facial AUs [37], analyzing the expression of AUs on the face during a lie; according to them, people involuntarily show rapid heartbeat, muscle trembling, blood pressure changes, palm sweating, and facial expression variations when lying. The authors of [38] proposed a new model using k-NN and SVM classifiers to identify facial expressions, computing similarity features and applying k-means clustering to apex images; cluster centers are used to measure L2 distance as the similarity score.
Further, the AdaBoost algorithm is applied to temporal patterns for classification. The framework performs better but requires prior knowledge of the frames used to measure the similarity features. In another method, the face is divided into local parts roughly corresponding to the AUs defined by FACS; Haar features are computed on these local parts, and the feature vector is calculated by minimizing the error, then merged into a boosted framework for classification.
Other authors recognized facial expressions using convolutional neural networks (CNNs), training CNN models of different depths on grayscale images and using a combination of HOG features and raw pixel data, with training histories used to evaluate the performance of the developed models [39]. Researchers have demonstrated that facial changes occur in response to personal emotional feelings, intentions, social interaction, discussion, and communication. Analyzing facial expressions is an easy task for almost every person, and in nonverbal communication head movement plays a vital role in expressing a level of understanding. Over the past 20 years, many studies have examined how computers can infer human emotions from facial expressions; facial expression recognition is crucial for user interfaces that can read and react to people's feelings. Yet enabling computers to understand how people feel in a social context is difficult for various reasons. Applications of affective computing include lie detection, intelligent environments, psychiatry, and emotion analysis more broadly; such systems can categorize, analyze, and detect pain, drowsiness, and similar states. Facial animation based on a person's feelings is widely used in real life and in computer games. As a result, FER research in practical settings is essential for both academia and industry and should be actively pursued [40].

3. Proposed Approach

Computer-based emotion detection systems can automatically perform facial feature recognition and emotion detection from images. Earlier research enhanced such computer-aided systems by combining linear and nonlinear transformations with efficient deep learning features to generate more abstract and practical representations for human emotion detection. However, a computerized system is still required to identify user moods accurately, efficiently, and with minimal time complexity. This section describes in detail the proposed hybrid deep learning classification model, which uses Haar features to identify emotional segments in images. The preprocessing techniques used by this system are defined as part of the development process. The suggested approach uses a CNN model to detect emotional elements in images: it constructs a grid the same size as the analyzed image, which is used to classify facial expressions. During preprocessing, all pictures are converted from RGB to grayscale and an adaptive median filter is applied to improve image quality. During CNN training, image initialization and emotion categorization are performed.
Performance assessment indicators such as recall and F1 gauge how well the proposed system functions. Figure 3 depicts a high-level representation of the suggested model. Two sub-volumes are blended into a single picture to create a single grid, which shows the probability that each location in the original image contains an emotional face characteristic.

3.1. Image Preprocessing

The input consists of 512-by-512-pixel JPEG photos, which were preprocessed to extract the facial characteristics needed to meet the research goal. Two datasets, AffectNet and FER, were used in this investigation. The photographs are first converted to grayscale and rescaled; image quality improves after rescaling, and the photos are then used to train the algorithm. The detailed steps are described below.

3.2. Image Bitmap Conversion

In the next phase, the JPEG picture is transformed into a bitmap image. Two picture formats are usually available as input, and the bitmap conversion phase is included so that the system can work with varied input, whether live or offline, 2D or 3D. We used a function provided by the Python Keras library to convert all input pictures into bitmap format. Converting to bitmap makes it possible to operate on both 2D images and 3D data (such as MHD-format volumes) through a single representation, because equivalent characteristics can be extracted from both input types once they are in bitmap form.
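A minimal sketch of this conversion step, assuming the Keras image utilities and the 512 × 512 input size; the exact function the authors used is not specified, so the helper below is illustrative.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def jpeg_to_bitmap_array(path, size=(512, 512)):
    """Load a JPEG, resize it, and return its raw pixel array (the 'bitmap' form used downstream)."""
    img = load_img(path, target_size=size)       # PIL image in RGB
    arr = img_to_array(img)                      # float32 array of shape (512, 512, 3)
    img.save(path.rsplit(".", 1)[0] + ".bmp")    # optionally persist an uncompressed bitmap copy
    return np.asarray(arr, dtype=np.float32)
```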

3.3. Removal of Noise

Noise removal improves the picture and its quality; image filtering methods help improve the quality of the images used in this procedure. Noise degrades picture quality, which lowers the performance of the preprocessing system. An adaptive median filter is used to remove noise: it adjusts pixel values using an area- or cluster-wide weighted average. After noise removal, image quality is improved. The weighted-average filter is intuitive and easy to apply; it reduces density variation, adjusts values toward the cluster mean, and replaces each affected pixel with the weighted average of its neighbors.
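The paper does not give the filter's exact formulation, so the following is a hedged sketch of a neighbourhood weighted-average denoiser, with a plain median filter shown as the usual alternative for impulse noise; the kernel weights are assumptions.

```python
import numpy as np
from scipy import ndimage

def weighted_average_denoise(gray, size=3):
    """Replace each pixel with a neighbourhood weighted average (centre pixel weighted highest)."""
    kernel = np.ones((size, size), dtype=np.float32)
    kernel[size // 2, size // 2] = 2.0            # emphasise the original pixel value
    kernel /= kernel.sum()
    return ndimage.convolve(gray.astype(np.float32), kernel, mode="reflect")

def median_denoise(gray, size=3):
    """Plain median filter, the common choice for salt-and-pepper style noise."""
    return ndimage.median_filter(gray, size=size)
```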

3.4. RGB Image to Gray Scale Conversion

Because the bitmap pictures are in RGB format, they must be converted to grayscale before being passed to the algorithm. Various techniques convert color photos to grayscale; the average-weighted method is used here. The average-RGB approach uses the average R, G, and B values of each pixel in the colored picture to derive its hue, saturation, and brightness; the component averages are obtained and projected to grayscale values.
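A minimal sketch of the channel-averaging conversion described above; the helper name is illustrative.

```python
import numpy as np

def rgb_to_gray_average(rgb):
    """Average the R, G and B components of each pixel to obtain its grayscale intensity."""
    rgb = np.asarray(rgb, dtype=np.float32)
    return rgb.mean(axis=-1)                      # shape (H, W), values in the original 0-255 range
```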

3.5. Image Enhancement

The picture is enhanced in its entirety during the image enhancement process, which lowers opacity to make intensity differences more visible and processable. Gaussian high-pass filtering is used here. This stage selects pixels whose value is lower than the cluster average and adjusts their intensity; modifying the intensity enhances the picture and improves its quality. The method identifies pixels with values at or below the threshold and modifies their brightness and contrast to match the grayscale range, and this gray-toning process improves the image's appearance. Image enhancement is the last step, carried out after rescaling. Gray toning assists in identifying Haar features and facial muscles, since these lie in the regions with the most significant pixel values; finding suitable low and high thresholds is therefore essential for feature segmentation. For this purpose, intensities of 0.41 and 0.42 are used as the low and high thresholds: pixels below the threshold are treated as 0, while pixels above it are treated as 1. The resulting mask is used to extract Haar features for segmentation. The upper and lower bounds are derived from the average intensity of the facial muscle pixels, with the lower bound set to the smallest face-pixel value used in assessing pixel values.
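The exact use of the two thresholds is not fully specified, so the sketch below simply binarises a normalised grey image around the stated values; the helper name and the choice to keep only above-threshold pixels are assumptions.

```python
import numpy as np

def segment_face_regions(gray, high=0.42):
    """Binarise a normalised grey image: pixels at or below the threshold become 0, above become 1."""
    norm = gray.astype(np.float32) / 255.0        # scale intensities to [0, 1]
    mask = (norm > high).astype(np.uint8)         # bright facial-muscle regions kept for Haar features
    return mask
```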

3.6. Dataset for Evaluation of System

FER is one of the datasets used to evaluate the proposed system. The FER dataset consists of 48-by-48-pixel grayscale portraits; the faces have been registered so that each is roughly centered and occupies about the same area, ensuring that faces are comparable across images. The task is to assign each face to one of seven emotion categories: 0, anger; 1, disgust; 2, fear; 3, happiness; 4, sadness; 5, surprise; and 6, neutral. The training set contains 28,709 examples, and the public test set contains 3589 examples. Both datasets are publicly available at www.Kaggle.com (accessed on 10 October 2022) [14].
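As a hedged example of loading this dataset, the sketch below assumes the common Kaggle distribution of FER as a fer2013.csv file with 'emotion' and 'pixels' columns; adjust the file name and column names if your copy differs.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")                   # assumed Kaggle CSV layout

def row_to_image(pixel_string):
    """Decode the space-separated pixel string into a 48x48 grayscale image."""
    return np.array(pixel_string.split(), dtype=np.uint8).reshape(48, 48)

images = np.stack([row_to_image(p) for p in df["pixels"]])
labels = df["emotion"].to_numpy()                 # 0 = anger ... 6 = neutral, as listed above
print(images.shape, labels.shape)
```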
AffectNet consists of authentic facial images gathered from the internet by querying three popular search engines with 1250 emotion-related keywords in six languages (English, German, Spanish, Portuguese, Arabic, and Farsi). About 440 thousand photos were manually annotated for seven facial expressions (categorical model) and for valence and arousal (dimensional model). AffectNet is the most significant database of natural facial expressions, valence, and arousal in wild pictures.
The AffectNet dataset was developed to explore the relationships among facial expressions and emotions in region-based facial feature identification and interpretation. After the proposed learning module is created, it is trained on this dataset. For testing, the data are divided into training and testing portions with an 80:20 ratio (80% of the images for learning and 20% for testing). The system is first evaluated on this standard dataset (AffectNet); after successful training and evaluation, the proposed method is applied to detect people's emotions about movie reviews as an implementation case study. The suggested model is then coded and assessed in MATLAB. Cross-validation proceeds as follows:
  • Split the total dataset into ten divisions; one subset is used as the test set and the remaining nine as training sets.
  • Perform a training experiment with the nine training fractions.
  • After training, evaluate the held-out test fold and record the results.
  • After iterating over all folds, average the results to obtain the outcome over all instances.

4. Proposed Architecture of CNN

When setting up a CNN model, the size of the pooling layers and the filters matters more than the number of convolution layers. Multiple experiments were therefore carried out, each using a different combination of filter widths and convolution depths, and the complexity of the CNN architecture was taken into account. The best CNN architecture for a given input image size can be found with a Python program, but a complex design takes a long time to train.
A single convolution layer does not capture enough detail, yet the model's complexity and training time rise when more than three convolution layers are included. Researchers have reported that, when comparing models with two and three convolution layers, using as many pooling layers as possible gives the best performance in most situations. A dropout value is used in the network to make the model more general. A hidden layer follows the single fully connected layer at the center of the design, and the number of nodes in the hidden layer must be fixed from the start. The suggested model has a similar layout but different components. There were initially only three convolutional layers, and only a 2 × 2 × 2 pooling size could be used because of the design's complexity. A variety of patch sizes were used to identify the ideal structural arrangement and to examine what a "moderate" patch size means. A 3 × 3 × 3 filter could fit three convolutional layers on the surface, and model complexity rose with filter size. Finally, the maximum pooling size used was 2 × 2 × 2. Different patch sizes were examined throughout training, and 24 × 32 × 32 proved to be the best arrangement [41]. The general architecture of the convolutional neural network is presented in Figure 4.
The parameters of the different models are presented in Table 1.
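The following is a rough Keras reading of the Table 1 configuration (two 3-D convolution blocks, max pooling, dropout, one hidden dense layer, and a 2-node output); the filter counts and the hidden-layer activations are not given in the table and are assumptions.

```python
from tensorflow.keras import layers, models

def build_model(patch_shape=(24, 32, 32), fcn_nodes=220, dropout=0.2):
    """Sketch of the Table 1 layout: conv/pool/dropout twice, then a small fully connected head."""
    model = models.Sequential([
        layers.Input(shape=(*patch_shape, 1)),
        layers.Conv3D(32, kernel_size=(3, 5, 5), activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=(3, 3, 3)),
        layers.Dropout(dropout),
        layers.Conv3D(64, kernel_size=(3, 5, 5), activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=(2, 2, 2)),
        layers.Dropout(0.15),
        layers.Flatten(),
        layers.Dense(fcn_nodes, activation="relu"),
        layers.Dense(2, activation="sigmoid"),    # 2-node output for the binary review decision
    ])
    return model

model = build_model()
model.summary()
```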
The graphical representation of sensitivity achieved by the model is presented in Figure 5.
The experiments with the different CNN models and their sensitivities are shown in Table 2.
The sensitivity of each model was observed during evaluation: models using softmax as the activation function showed lower sensitivity, while the sigmoid function achieved better sensitivity. Improving the CNN algorithm's accuracy requires extensive training. Retraining on different parts of the dataset, which contain many images labeled by emotional content, reduces the number of false positives in emotion identification. Once the initial training phase is complete, incorrectly recognized examples are manually filtered and blended into the next training cycle. As a result, the proposed CNN architecture maximizes accuracy while reducing false emotion detections. The CNN training algorithm is presented in Figure 6.

5. Results

The experimental procedures are described in practical detail below. The tests were run in Python, the same platform used for the CNN implementation, and the outcomes were compared.

5.1. Experimentation Setup

The experimental setup used in this research is given below:
  • Tool: Python
  • Libraries used: Pandas, Keras
  • Algorithm: CNN
  • Algorithm feature identification: automatic
  • Dataset: AffectNet
  • Processor used: Intel I-10
  • RAM size: 16 GB

5.2. Mean Squared Error Evaluation

The dataset was divided into 80:20 fractions: eighty percent of the photos were used to train the algorithm and the remaining twenty percent for testing and assessment. During validation, 10-fold cross-validation is used; the validation stage is completed once the algorithm has converged on the most accurate parameter values. Cross-validation is performed as follows.
  • Split the total dataset into ten divisions; one subset is used as the test set and the remaining nine as training sets.
  • Perform a training experiment with the nine training fractions.
  • After training, evaluate the held-out test fold and record the results.
  • After iterating over all folds, average the results to obtain the outcome over all instances (a minimal sketch of this protocol follows this list).
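A minimal sketch of the 10-fold protocol above using scikit-learn's KFold; build_fn is a hypothetical factory returning any classifier with fit/score methods, standing in for the model described in this paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def run_cross_validation(X, y, build_fn, n_splits=10):
    """Train on 9 folds, evaluate on the held-out fold, and average the per-fold scores."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kfold.split(X):
        clf = build_fn()                           # fresh classifier for every fold
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```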
The proposed CNN (convolutional neural network) was trained for 550 rounds; additional training iterations benefit convolutional neural networks because they reduce training error. The learning rate is 0.45 and the momentum is 0.75. During training, the network's precision is evaluated through the quadratic (mean squared) error of its predictions; the observed quadratic error values are shown in the figure below. After the last iteration, the quadratic error is 0.01920, the smallest quadratic error that proved feasible, which illustrates the operation of deep learning. The proposed system's training time is 118 min, and the average testing time for an unknown test input is 19 s; we disregarded the training time because our focus is accuracy. The mean squared error after algorithm training is presented in Figure 7.
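A hedged sketch of the stated training set-up (SGD with learning rate 0.45 and momentum 0.75, quadratic error as the loss); build_model is the hypothetical Table 1 sketch shown earlier, and dummy data with a reduced epoch count keeps the example runnable.

```python
import numpy as np
from tensorflow.keras.optimizers import SGD

# Dummy data shaped like the assumed 24x32x32 patches with a 2-node target.
X_train = np.random.rand(8, 24, 32, 32, 1).astype("float32")
y_train = np.random.randint(0, 2, size=(8, 2)).astype("float32")

model = build_model()                              # hypothetical helper from the earlier sketch
model.compile(optimizer=SGD(learning_rate=0.45, momentum=0.75), loss="mse", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=3, verbose=0)   # the paper trains for 550 rounds
print("training MSE after last epoch:", history.history["loss"][-1])
```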
For the classification techniques described below, true negatives and true positives were observed. A subset of the AffectNet dataset (10,018 instances) was used for testing, and the results are described in the form of a confusion matrix.
According to the results on the testing dataset, the algorithm correctly classified 8487 instances and misclassified 1531. Using the details given in Table 3, the accuracy is calculated, and after various tests the sensitivity and specificity of the proposed emotion detection system were assessed. The algorithm's sensitivity reflects how well it recognizes genuine emotions, while specificity is determined from the false-negative data; the classification algorithms examined below produce both false positives and false negatives. Overall, the system achieved an accuracy of 84.72 percent and a sensitivity of 79.24 percent, with a specificity of 90.64 percent and a precision of 90.2 percent. The results after algorithm testing are shown in Figure 8.
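As a worked check, the reported figures follow from the Table 3 counts when the 450 and 1081 misclassifications are taken as false positives and false negatives, respectively; the assignment below is an assumption that reproduces the published numbers.

```python
# Confusion-matrix counts from Table 3 (assumed assignment of the misclassifications).
TP, TN, FP, FN = 4127, 4360, 450, 1081

accuracy    = (TP + TN) / (TP + TN + FP + FN)     # (4127 + 4360) / 10018 = 0.8472
sensitivity = TP / (TP + FN)                      # 4127 / 5208 = 0.7924
specificity = TN / (TN + FP)                      # 4360 / 4810 = 0.9064
precision   = TP / (TP + FP)                      # 4127 / 4577 = 0.9017

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f}")
```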

5.3. Comparison of State of the Art Results over Affect Net Data Set

A comparison with other state-of-the-art results over Affect Net is presented in Table 4.
The comparison of Results is shown in Figure 9.

6. Conclusions

Several CNN architectural variants were developed during this project, and their effectiveness was assessed. The suggested architecture was developed using a private dataset. After carefully studying the data and comparing it to what was already known, the researchers concluded that the findings were largely valid but that there was still room for improvement. In this investigation, the CNN algorithm achieved an accuracy of 84.7%, a precision of 90.2%, and a specificity of 90.6%. This study was mainly interested in how people felt about going to the cinema: to interpret what moviegoers were expressing, the authors examined the many different types of human emotion and created a training module for facial expression recognition. Studies show that a person's emotional state can be read more readily from facial features, including lines and the arrangement of facial features (treated as HAAR factors). While sentiment analysis for corporate marketing is a growing topic, there has been much less research on image-based categorization to ascertain how people feel about movie reviews. The authors hope this work stimulates further research into emotion recognition to improve photo and video categorization systems. The approach can also ascertain a user's true feelings by examining their images and facial expressions. Without human input, the system will automatically add newly recognized offensive emotions that are not already in its database, and it can train itself on an annotated dataset, improving its learning ability.

7. Future Work

The study can be further expanded by evaluating the proposed algorithm on different datasets and on real-time images of people. This study works on user images and processes them to understand and classify emotions; similar work could be extended to process and classify real-time live video as a future research direction.

Author Contributions

Conceptualization, T.M., F.K.K. and S.M.M.; Data curation, S.A.H.M. and I.H.; Formal analysis, T.M., M.A.M., M.A.N. and S.A.H.M.; Funding acquisition, F.K.K.; Investigation, M.A.M., M.A.N., I.H. and S.M.M.; Methodology, T.M., F.K.K. and S.M.M.; Supervision, F.K.K.; Writing—original draft, M.A.N. and S.M.M.; Writing—review & editing, F.K.K., S.M.M. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

AffectNet and FER are the two datasets used in this paper. Both are publicly available at www.Kaggle.com (accessed on 10 October 2022).

Acknowledgments

This research project was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R300), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. AbdulJabbar, I.A.A.; Yakoob, Z.A. Hybrid Technique to Improve Face Recognition Using Principal Component Analysis and Singular Value Decomposition. System 2019, 2, 3. [Google Scholar]
  2. Alabort-I-Medina, J.; Zafeiriou, S. A Unified Framework for Compositional Fitting of Active Appearance Models. Int. J. Comput. Vis. 2017, 121, 26–64. [Google Scholar] [CrossRef] [Green Version]
  3. Dubey, A.; Jain, V. A review of face recognition methods using deep learning network. J. Inf. Optim. Sci. 2019, 40, 547–558. [Google Scholar] [CrossRef]
  4. Ekman, P. Facial expressions of emotion: An old controversy and new findings. Philos. Trans. R. Soc. B Biol. Sci. 1992, 335, 63–69. [Google Scholar] [CrossRef]
  5. ELLaban, H.A.; Ewees, A.A.; Elsaeed, A.E. A real-time system for facial expression recognition using support vector machines and k-nearest neighbor classifier. Int. J. Comput. Appl. 2017, 159, 23–29. [Google Scholar]
  6. Fekri-Ershad, S.; Ramakrishnan, S. Cervical cancer diagnosis based on modified uniform local ternary patterns and feed forward multilayer network optimized by genetic algorithm. Comput. Biol. Med. 2022, 144, 105392. [Google Scholar] [CrossRef]
  7. Georgescu, M.-I.; Ionescu, R.T.; Popescu, M. Local Learning with Deep and Handcrafted Features for Facial Expression Recognition. IEEE Access 2019, 7, 64827–64836. [Google Scholar] [CrossRef]
  8. Gokalp, O.; Tasci, E.; Ugur, A. A novel wrapper feature selection algorithm based on iterated greedy metaheuristic for sentiment classification. Expert Syst. Appl. 2020, 146, 113176. [Google Scholar] [CrossRef]
  9. Gupta, A.; Thakkar, K.; Gandhi, V.; Narayanan, P. Nose, eyes and ears: Head pose estimation by locating facial keypoints. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
  10. Jaswanth, K.; David, D.S. A novel based 3D facial expression detection using recurrent neural network. In Proceedings of the 2020 International Conference on System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, 3–4 July 2020. [Google Scholar]
  11. Jayanthy, S.; Anishkka, J.; Deepthi, A.; Janani, E. Facial recognition and verification system for accessing patient health records. In Proceedings of the 2019 International Conference on Intelligent Computing and Control Systems (ICCS), Madurai, India, 15–17 May 2019. [Google Scholar]
  12. Khair, A.A.; Zainuddin, Z.; Achmad, A.; Ilham, A.A. Face Recognition in Kindergarten Students using the Principal Component Analysis Algorithm. In Proceedings of the 2019 International Conference on Advanced Mechatronics, Intelligent Manufacture and Industrial Automation (ICAMIMIA), Batu-Malang, Indonesia, 9–10 October 2019. [Google Scholar] [CrossRef]
  13. Khalil, A.; Ahmed, S.G.; Khattak, A.M.; Al-Qirim, N. Investigating Bias in Facial Analysis Systems: A Systematic Review. IEEE Access 2020, 8, 130751–130761. [Google Scholar] [CrossRef]
  14. Kuruvayil, S.; Palaniswamy, S. Emotion recognition from facial images with simultaneous occlusion, pose and illumination variations using meta-learning. J. King Saud Univ.-Comput. Inf. Sci. 2021, 34, 7271–7282. [Google Scholar] [CrossRef]
  15. Li, Q.; Song, D.; Yuan, C.; Nie, W. An image recognition method for the deformation area of open-pit rock slopes under variable rainfall. Measurement 2021, 188, 110544. [Google Scholar] [CrossRef]
  16. Liu, R.; Wang, X.; Lu, H.; Wu, Z.; Fan, Q.; Li, S.; Jin, X. SCCGAN: Style and Characters Inpainting Based on CGAN. Mob. Netw. Appl. 2021, 26, 3–12. [Google Scholar] [CrossRef]
  17. Luo, G.; Yuan, Q.; Li, J.; Wang, S.; Yang, F. Artificial Intelligence Powered Mobile Networks: From Cognition to Decision. IEEE Netw. 2022, 36, 136–144. [Google Scholar] [CrossRef]
  18. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef] [Green Version]
  19. Moolthaisong, K.; Songpan, W. Emotion Analysis and Classification of Movie Reviews Using Data Mining. In Proceedings of the 2020 International Conference on Data Science, Artificial Intelligence, and Business Analytics (DATABIA), Medan, Indonesia, 16–17 July 2020. [Google Scholar] [CrossRef]
  20. Nonis, F.; Dagnes, N.; Marcolin, F.; Vezzetti, E. 3D Approaches and Challenges in Facial Expression Recognition Algorithms—A Literature Review. Appl. Sci. 2019, 9, 3904. [Google Scholar] [CrossRef] [Green Version]
  21. Patel, K.; Mehta, D.; Mistry, C.; Gupta, R.; Tanwar, S.; Kumar, N.; Alazab, M. Facial Sentiment Analysis Using AI Techniques: State-of-the-Art, Taxonomies, and Challenges. IEEE Access 2020, 8, 90495–90519. [Google Scholar] [CrossRef]
  22. Prospero, M.R.; Lagamayo, E.B.; Tumulak, A.C.L.; Santos, A.B.G.; Dadiz, B.G. Skybiometry and AffectNet on facial emotion recognition using supervised machine learning algorithms. In Proceedings of the 2018 International Conference on Control and Computer Vision, Singapore, 15–18 June 2018. [Google Scholar]
  23. Saleem, S.M.; Zeebaree, S.R.; Abdulrazzaq, M.B. Real-life dynamic facial expression recognition: A review. J. Phys. 2021, 1963, 012010. [Google Scholar] [CrossRef]
  24. Salmam, F.Z.; Madani, A.; Kissi, M. Emotion Recognition from Facial Expression Based on Fiducial Points Detection and using Neural Network. Int. J. Electr. Comput. Eng. (IJECE) 2018, 8, 52–59. [Google Scholar] [CrossRef]
  25. Samadiani, N.; Huang, G.; Cai, B.; Luo, W.; Chi, C.-H.; Xiang, Y.; He, J. A Review on Automatic Facial Expression Recognition Systems Assisted by Multimodal Sensor Data. Sensors 2019, 19, 1863. [Google Scholar] [CrossRef] [Green Version]
  26. Scherer, D.; Müller, A.; Behnke, S. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. In Proceedings of the International Conference on Artificial Neural Networks, Thessaloniki, Greece, 15–18 September 2010. [Google Scholar]
  27. Schoneveld, L.; Othmani, A.; Abdelkawy, H. Leveraging recent advances in deep learning for audio-Visual emotion recognition. Pattern Recognit. Lett. 2021, 146, 1–7. [Google Scholar] [CrossRef]
  28. Shafiei, F.; Ershad, S.F. Detection of Lung Cancer Tumor in CT Scan Images Using Novel Combination of Super Pixel and Active Contour Algorithms. Trait. Signal 2020, 37, 1029–1035. [Google Scholar] [CrossRef]
  29. Shalmiya, P.; Thirugnanam, G. Robust facial expression recognition based on convolutional neural network in pose and occlusion. i-Manag. J. Pattern Recognit. 2020, 7, 14. [Google Scholar] [CrossRef]
  30. She, Q.; Hu, R.; Xu, J.; Liu, M.; Xu, K.; Huang, H. Learning high-DOF reaching-and-grasping via dynamic representation of gripper-object interaction. ACM Trans. Graph. 2022, 41, 1–14. [Google Scholar] [CrossRef]
  31. Shetty, C.; Khan, A.; Singh, T.; Kharatmol, K. Movie review prediction system by real time analysis of facial expression. In Proceedings of the 2021 6th International Conference on Communication and Electronics Systems (ICCES), Coimbato, India, 8–10 July 2021. [Google Scholar]
  32. Shi, Y.; Xu, X.; Xi, J.; Hu, X.; Hu, D.; Xu, K. Learning to Detect 3D Symmetry from Single-View RGB-D Images with Weak Supervision. IEEE Trans. Pattern Anal. Mach. Intell. 2022. [Google Scholar] [CrossRef] [PubMed]
  33. Siqueira, H.; Magg, S.; Wermter, S. Efficient Facial Feature Learning with Wide Ensemble-Based Convolutional Neural Networks. Proc. Conf. AAAI Artif. Intell. 2020, 34, 5800–5809. [Google Scholar] [CrossRef]
  34. Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.-F.; Pantic, M. A survey of multimodal sentiment analysis. Image Vis. Comput. 2017, 65, 3–14. [Google Scholar] [CrossRef]
  35. Stöckli, S.; Schulte-Mecklenbeck, M.; Borer, S.; Samson, A.C. Facial expression analysis with AFFDEX and FACET: A validation study. Behav. Res. Methods 2017, 50, 1446–1460. [Google Scholar] [CrossRef] [Green Version]
  36. Tsai, H.-H.; Chang, Y.-C. Facial expression recognition using a combination of multiple facial features and support vector machine. Soft Comput. 2017, 22, 4389–4405. [Google Scholar] [CrossRef]
  37. Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef] [Green Version]
  38. Wang, W.; Chen, Z.; Yuan, X. Simple low-light image enhancement based on Weber–Fechner law in logarithmic space. Signal Process. Image Commun. 2022, 106, 116742. [Google Scholar] [CrossRef]
  39. Wollmer, M.; Weninger, F.; Knaup, T.; Schuller, B.; Sun, C.; Sagae, K.; Morency, L.-P. YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context. IEEE Intell. Syst. 2013, 28, 46–53. [Google Scholar] [CrossRef]
  40. Zhang, H.; Luo, G.; Li, J.; Wang, F.-Y. C2FDA: Coarse-to-Fine Domain Adaptation for Traffic Object Detection. IEEE Trans. Intell. Transp. Syst. 2021, 23, 12633–12647. [Google Scholar] [CrossRef]
  41. Zhao, X.; Li, J.; He, S.; Zhu, C. Geometric conditions for injectivity of 3D Bézier volumes. AIMS Math. 2021, 6, 11974–11988. [Google Scholar] [CrossRef]
  42. Ionescu, R.T.; Khan, F.S.; Georgescu, M.-I.; Shao, L. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Figure 1. Six Basic Emotions [11].
Figure 2. Description of Appearance features.
Figure 3. Flow Chart of Proposed System.
Figure 4. General Architecture of Convolutional Neural Network.
Figure 5. Graphical representation of the sensitivity achieved by the model.
Figure 6. The CNN Training Algorithm.
Figure 7. The Mean Square error after Algorithm training.
Figure 8. The results after Algorithm Testing.
Figure 9. The comparison of Results [7,14,22,27,33].
Table 1. The different Models parameters.

| Name of Parameter | 1st Model | 2nd Model | 3rd Model | 4th Model | 5th Model |
|---|---|---|---|---|---|
| Patch size | 12 × 24 × 24 | 18 × 30 × 30 | 24 × 36 × 36 | 28 × 42 × 42 | 36 × 42 × 42 |
| Size of first convolution layer | 3, 5, 5 | 3, 5, 5 | 3, 5, 5 | 3, 5, 5 | 3, 5, 5 |
| Max pooling layer | 3, 3, 3 | 3, 3, 3 | 3, 3, 3 | 3, 3, 3 | 3, 3, 3 |
| CNN dropout value | 0.15 | 0.15 | 0.20 | 0.20 | 0.20 |
| Second convolution layer size | 3, 5, 5 | 3, 6, 6 | 4, 6, 6 | 3, 5, 5 | 3, 5, 5 |
| Second max pooling layer size | 2, 2, 2 | 2, 2, 2 | 2, 2, 2 | 2, 2, 2 | 2, 2, 2 |
| CNN dropout value for layer 2 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| No. of nodes in first FCN layer | 120 | 150 | 220 | 240 | 300 |
| No. of nodes in second FCN layer | 2 | 2 | 2 | 2 | 2 |
| Activation function used | Softmax | Sigmoid | Sigmoid | Sigmoid | Sigmoid |
Table 2. The Experimentation of Different CNN models.

| Experimented Architecture | Achieved Sensitivity |
|---|---|
| 1st Model | 0.670 |
| 2nd Model | 0.689 |
| 3rd Model | 0.745 |
| 4th Model | 0.737 |
| 5th Model | 0.802 |
Table 3. Confusion Matrix for CNN.

| N = 10,018 | Predicted Yes | Predicted No |
|---|---|---|
| Actual Yes | TP = 4127 | FP = 450 |
| Actual No | FN = 1081 | TN = 4360 |
Table 4. Comparison with other state-of-the-art results over Affect Net.

| Sr. No | Citation | Technique Used | Results |
|---|---|---|---|
| 1 | [16] | RNN | 61.6% |
| 2 | [42] | CNN with handcrafted features | 59.6% |
| 3 | [18] | CNN with automated features | 59.3% |
| 4 | [19] | Supervised learning model with automated features | 58.6% |
| 5 | [20] | ERMOPI | 68% |
| 6 | Our proposed method | CNN | 84.7% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
