Sustainable Artificial Intelligence-Based Twitter Sentiment Analysis on COVID-19 Pandemic

Vaiyapuri, Thavavel; Jagannathan, Sharath Kumar; Ahmed, Mohammed Altaf; Ramya, K. C.; Joshi, Gyanendra Prasad; Lee, Soojeong; Lee, Gangseong

doi:10.3390/su15086404


by Thavavel Vaiyapuri ¹, Sharath Kumar Jagannathan ², Mohammed Altaf Ahmed ³, K. C. Ramya ⁴, Gyanendra Prasad Joshi ⁵,*, Soojeong Lee ⁵ and Gangseong Lee ⁶,*

¹ College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia

² Frank J. Guarini School of Business, Saint Peter’s University, 2641 John F. Kennedy Boulevard, Jersey City, NJ 07306, USA

³ Department of Computer Engineering, College of Computer Engineering & Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia

⁴ Department of EEE, Sri Krishna College of Engineering and Technology, Coimbatore 641008, India

⁵ Department of Computer Science and Engineering, Sejong University, Seoul 05006, Republic of Korea

⁶ Ingenium College, Kwangwoon University, 20 Kwangwoon-ro, Nowon-gu, Seoul 01897, Republic of Korea

* Authors to whom correspondence should be addressed.

Sustainability 2023, 15(8), 6404; https://doi.org/10.3390/su15086404

Submission received: 23 December 2022 / Revised: 30 March 2023 / Accepted: 6 April 2023 / Published: 9 April 2023

(This article belongs to the Special Issue Sustainable Application of Internet of Things and Artificial Intelligence)


Abstract: The COVID-19 outbreak is a disastrous event that has aggravated many psychological and social problems, such as unemployment and depression, owing to abrupt social changes. At the same time, psychologists and social scientists have devoted considerable attention to understanding how people express their sentiments and emotions during the pandemic. As COVID-19 cases rose and strict lockdowns were imposed, people expressed their opinions publicly on social networking platforms. This provides deeper knowledge of human psychology at the time of disastrous events. By using user-generated content from social networking platforms such as Twitter, the sentiments and views of people can be analyzed to assist in designing awareness campaigns and health intervention policies. Recent advances in artificial intelligence (AI) and natural language processing (NLP) have shown remarkable performance in sentiment analysis (SA). This study develops a new Marine Predator Optimization with Natural Language Processing for Twitter Sentiment Analysis (MPONLP-TSA) model for the COVID-19 pandemic. The presented MPONLP-TSA model focuses on recognizing the sentiments expressed in Twitter data during the COVID-19 pandemic. The MPONLP-TSA technique first pre-processes the data to convert it into a useful format. Furthermore, the BERT model is used to derive word vectors. To detect and classify sentiments, a bidirectional recurrent neural network (BiRNN) model is used. Finally, the MPO algorithm is used for optimal hyperparameter tuning, which helps enhance the overall classification performance. The MPONLP-TSA approach is validated experimentally on the COVID-19 tweets dataset from the Kaggle repository. An extensive comparative study shows that the MPONLP-TSA method outperforms current approaches.

Keywords: sustainability; sentiment analysis; low resource language; natural language processing; deep learning; pattern recognition; COVID-19 pandemic

1. Introduction

Automated knowledge extraction proceeds in two steps: shallow knowledge is first extracted autonomously from massive document collections, and cumulative statistics over this automatically collected shallow knowledge then suggest richer semantics [1]. Furthermore, traditional information extraction, semantic annotation, linguistic annotation in natural language processing (NLP), and ontology-based data extraction can support automatic knowledge extraction and NLP for the lexicography of low-resource languages [2,3]. The involvement of people in online social networks (OSNs) surged during the COVID-19 pandemic as normal activities moved online [4]. OSNs were used in several ways: individuals used them to express their thoughts, interact with relatives, hold online meetings, and so on [5,6]. Like other OSNs, the well-known microblogging service Twitter was also affected. It is the most popular social networking platform for interacting with the general public and for raising public health awareness during health crises [7]. As a result, individuals generally spend time on Twitter, and users are very active at any time [8]. Their participation rose during lockdowns as they sought the latest news relating to COVID-19. Meanwhile, they shared their feelings and opinions with their friends. Thus, Twitter data analysis has attracted massive interest from researchers during this pandemic [9,10].

Political communication during the COVID-19 epidemic called for crucial, strong, and efficient sense-making of the crisis [11]. The language used by leaders plays a significant role in framing a chaotic and ambiguous crisis so that deteriorating public spirit can be lifted into collective hope [12]. Researchers have concentrated on the crisis-in-the-moment, wherein the roles and duties of the leaders are taken to be well defined and the unit of study is the progression of the crisis itself, that is, how the crisis is unfolding. However, the COVID-19 pandemic was not transitory but rather an ongoing reality requiring both retrospective and prospective sense-making of the situation by the leaders. It also created an emotionally charged situation in which particular negative emotions (such as frustration, anxiety, and shock) rose quickly, which made decision-making challenging [13].

As the number of mass media users and the volume of data rose, attention turned to natural language processing (NLP) combined with various artificial intelligence (AI) methods for extracting meaningful data effectively [14]. NLP and its applications have had a substantial influence on mass media classification and text analysis; however, the difficulty of determining the inherent meaning of content with NLP approaches, arising from contextual words and phrases and from ambiguity in speech or text, mandates the use of machine learning (ML)-based methods. Sentiment analysis (SA) involves computationally examining people’s sentiments towards a particular aspect or individual. Sentiment classification, subjectivity analysis, and opinion mining are related terms in the literature [15].

SA is performed with a lexicon-based technique, such as linguistic inquiry and word count (LIWC), SentiStrength, affective norms for English words (ANEW), or SentiWordNet; with an ML technique, such as multinomial Naïve Bayes (MNB), Naïve Bayes (NB), multi-layer perceptron (MLP), maximum entropy, random forest (RF), or support vector machine (SVM); or with a hybrid technique that uses both ML and a lexicon [8]. ML computes sentiment polarity via statistical methods that depend heavily on dataset size, deal poorly with intensified and negated sentences, and perform poorly across domains. The lexicon-based method, in contrast, needs manual input of sentiment lexicons and performs well in any domain but fails to cover all informal vocabulary. The hybrid method helps overcome the limits of both methods, thereby improving scalability, performance, and efficiency [15]. The same study reports that a hybrid technique improves precision, sustains stability, and offers better outcomes than a single standard tool or technique.

Recently, deep learning (DL) models have proven useful for sentiment classification. Although several ML and DL models for sentiment classification are available in the literature, classification performance still needs to be improved. As models grow deeper, the number of parameters in DL models rises rapidly, which leads to overfitting. At the same time, hyperparameters have a significant impact on the efficiency of DL models. In particular, hyperparameters such as epoch count, batch size, and learning rate must be selected well to attain effective outcomes. Since trial-and-error hyperparameter tuning is tedious and error-prone, metaheuristic algorithms can be applied instead. Therefore, in this work, we employ the marine predator optimization (MPO) algorithm to select the hyperparameters (i.e., learning rate, batch size, and number of epochs) of the BiRNN model.

This study develops a new Marine Predator Optimization with Natural Language Processing for Twitter Sentiment Analysis (MPONLP-TSA) model for the COVID-19 pandemic. The presented MPONLP-TSA model first pre-processes the data to convert it into a useful format. In addition, the BERT model is used to derive word vectors. To detect and classify sentiments, a bidirectional recurrent neural network (BiRNN) model is used. Finally, the MPO technique is used for optimal hyperparameter tuning, which helps enhance the overall classification performance. The MPO algorithm is mainly inspired by the different foraging strategies of marine predators and the optimal encounter rate strategies between predators and prey. The MPONLP-TSA approach is validated experimentally on the COVID-19 tweet dataset from the Kaggle repository.

2. Related Works

Mostafa [16] proposes an SA model that analyzes the sentiments of students about the learning process during the pandemic using ML techniques and Word2vec. The method first processes the students’ sentiments and selects features through word embeddings, and then applies three ML classifiers: SVM, decision tree (DT), and NB. The authors of [17] introduce a framework that employs deep learning (DL)-based language models through long short-term memory (LSTM) for SA during the upsurge of COVID-19 cases in India. The framework features the LSTM language model, the more recent Bidirectional Encoder Representations from Transformers (BERT) language model, and global vector embeddings. Mohan et al. [18] suggested a hybrid autoregressive integrated moving average (ARIMA) and Prophet model for predicting daily confirmed and cumulative cases. The auto ARIMA function is first used to select the optimum hyperparameter values of the ARIMA method. Afterwards, the modified ARIMA method is used to find the best fit between the test and predicted data in order to identify the optimal combination of model parameters.

Alkhaldi et al. [19] offer a novel sunflower optimization with DL-driven SA and classification (SFODLD-SAC) method for COVID-19 tweets. The method aims to identify the sentiments of people during the COVID-19 pandemic. To this end, the SFODLD-SAC methodology first preprocesses the tweets in several ways, such as stemming and the removal of links, punctuation, usernames, numerals, and stopwords. Additionally, the TF-IDF method is applied to extract valuable features from the pre-processed data. Furthermore, the cascaded recurrent neural network (CRNN) method is used for analyzing and classifying sentiments. In [20], NLP techniques were used for opinion mining to derive positive and negative tweets or sentiments on COVID-19. The authors also examine NLP-related SA using the recurrent neural network (RNN) method with LSTMs. Hossain et al. [21] suggested a DL architecture based on the Bidirectional Gated Recurrent Unit (BiGRU) for this objective. They developed two distinct corpora from labeled and unlabeled COVID-19 tweets and employed the unlabeled corpus to construct an enhanced labeled corpus.

3. The Proposed Model

In this study, a novel MPONLP-TSA method is presented for recognizing the sentiments expressed in Twitter data during the COVID-19 pandemic. The MPONLP-TSA model first pre-processes the data to convert it into a useful format. Next, the BERT model is used to derive word vectors. To detect and classify sentiments, the BiRNN model is used, with its hyperparameters tuned by the MPO algorithm. Figure 1 illustrates the overall working process of the MPONLP-TSA method.

3.1. Data Pre-Processing

Initially, the presented method pre-processes tweets in several ways, namely stemming and the removal of usernames, stopwords, numerals, links, and punctuation.

Removal of links and usernames, which do not affect SA
Removal of punctuation marks and hashtags, and conversion to lower case
Removal of numerals and stopwords

Additionally, stemming is carried out to reduce each term to its root form, which helps reduce the complexity of the text features. Next, the TextBlob method is used to determine the sentiment scores. Then, the BERT method is applied to the pre-processed data to generate a set of feature vectors.
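The cleaning steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list is a tiny illustrative sample, and `naive_stem` is a hypothetical suffix-stripping helper standing in for a real stemmer.

```python
import re
from typing import List

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "in", "on", "and", "for", "at", "this"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_tweet(text: str) -> List[str]:
    """Clean one tweet and return its stemmed tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)                    # remove usernames
    text = text.lower()                                  # convert to lower case
    text = re.sub(r"\d+", " ", text)                     # remove numerals
    text = re.sub(r"[^a-z\s]", " ", text)                # remove punctuation, including '#'
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [naive_stem(tok) for tok in tokens]


tokens = preprocess_tweet("@user Staying home during #COVID19 lockdown! https://t.co/abc 2020")
```

The crude stemmer over-strips some words (for instance, "during" becomes "dur"), which is one reason an established stemmer would normally be preferred.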

3.2. BERT Model

The BERT model is used to derive word vectors once the Twitter data is pre-processed. In standard NLP tasks, the words in text data are commonly represented as discrete values, such as one-hot encodings. The one-hot encoding model assigns a vector to every word in the lexicon [22], and the dimension of the vector equals the number of words in the lexicon. The major benefit of one-hot encoding is its simplicity. However, the one-hot vectors of all words are mutually independent and cannot reflect the relationships among words. Furthermore, if the number of words in the lexicon is huge, the dimension of the word vectors (WVs) becomes very large and the curse of dimensionality arises.

To solve the problems of one-hot encoding, researchers introduced WV embeddings. The basic idea is to represent words as low-dimensional, continuous, dense vectors, so that words with similar meanings are mapped to nearby positions in the vector space. Commonly used WV methods are BERT, GloVe, Word2Vec, and ELMo.

Google introduced BERT, a pre-trained language model in the domain of NLP. It is a truly bidirectional language model and is more efficient than other WV methods.

The proposed method uses BERT to train WVs. Every word $w_{i}$ in $S$ is transformed into a WV $v_{i}$ using BERT, where $v_{i}$ is a 768-dimensional vector. Afterwards, the WV is weighted by the sentiment weight of the word:

$$v_{i}' = v_{i} \cdot \mathrm{senti}(w_{i}) \tag{1}$$

The weighted WV matrix is used as the output of the embedding layer.
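The weighting in Equation (1) can be sketched as follows. The sketch makes two loud assumptions: `fake_bert_vector` returns deterministic random numbers in place of a real BERT embedding, and the `SENTI` table is a hypothetical sentiment-weight lookup, since the text does not specify how the weights are obtained.

```python
import random

EMBED_DIM = 768  # vector size stated in the text

# Hypothetical sentiment weights (assumption); unknown words get a neutral weight of 1.0.
SENTI = {"good": 1.5, "bad": 1.5, "lockdown": 1.2}


def fake_bert_vector(word, dim=EMBED_DIM):
    """Deterministic random vector standing in for a real BERT embedding (assumption)."""
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def weighted_embeddings(words):
    """Return one row per word, v_i' = v_i * senti(w_i), as in Eq. (1)."""
    rows = []
    for w in words:
        v = fake_bert_vector(w)
        s = SENTI.get(w, 1.0)
        rows.append([s * x for x in v])
    return rows


matrix = weighted_embeddings(["good", "home"])  # 2 rows of 768 values each
```

In a real pipeline, the placeholder embedding function would be replaced by the output of a pre-trained BERT model, and the rest of the weighting would stay the same.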

3.3. Sentiment Analysis Using the BiRNN Model

The BiRNN model is used to identify and classify sentiments. An RNN is a variant of neural network (NN) that operates on sequential data and retains its features through the hidden layer [23]. It can process sequences of varying length by using memory and the backpropagation mechanism. Variable-length sentence vectors are mapped to fixed-length sentence vectors by padding or truncating the sequences. An RNN presents a time (state)-based model, which allows it to be viewed as many layers of the same network at different time steps. Each neuron passes its updated output to the neuron at the following time step. Hence, the RNN layer can be used to extract temporal features and long-term dependencies from the text sequence.

The word embedding vectors $\{x_{1}, x_{2}, \dots, x_{n}\}$ are fed step by step into the recurrent layers. At time $t$, the input consists of the word vector $x_{t}$ and the hidden state of the preceding step, $h_{t-1}$; the hidden state at time $t$, $h_{t}$, is the output. $U$, $W$, and $V$ denote the weight matrices. Because a plain RNN of this form suffers from vanishing gradients, we use the LSTM model, which avoids the gradually vanishing gradient by controlling the flow of data and captures long-term dependencies more easily. LSTM is a more complex recurrent unit that uses four distinct layers to control the flow of data. It is built around a gated storage unit (cell) that can add or remove data. Firstly, the ‘forget gate’ determines which data must be discarded from the cell:

$$f_{t} = \sigma(W_{f} \cdot [h_{t-1}, x_{t}] + b_{f}) \tag{2}$$

Next, the ‘input gate’ $i_{t}$ determines which data are to be updated, and a new candidate value vector $G_{t}$ is generated via the $\tanh$ layer:

$$i_{t} = \sigma(W_{i} \cdot [h_{t-1}, x_{t}] + b_{i}) \tag{3}$$

$$G_{t} = \tanh(W_{G} \cdot [h_{t-1}, x_{t}] + b_{G}) \tag{4}$$

The old cell state $S_{t-1}$ is multiplied by $f_{t}$ to discard useless data, and the product of $i_{t}$ and $G_{t}$, the new candidate value, is added to update the old cell state:

$$S_{t} = f_{t} \cdot S_{t-1} + i_{t} \cdot G_{t} \tag{5}$$

Finally, the output value is determined from the cell state $S_{t}$. A sigmoid gate first determines which part of the cell state is output; the cell state is then passed through the $\tanh$ layer and multiplied by the output of the sigmoid gate, which defines the output:

$$O_{t} = \sigma(W_{o} \cdot [h_{t-1}, x_{t}] + b_{o}) \tag{6}$$

$$h_{t} = O_{t} \cdot \tanh(S_{t}) \tag{7}$$
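To make the gate equations concrete, the following is a minimal sketch of a single LSTM time step written directly from Equations (2) to (7). It uses plain Python lists rather than a DL framework; the parameter names (`W_f`, `b_f`, and so on) mirror the symbols in the text, while `init_params` and its small random initial weights are assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, z):
    """Compute W·z + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, S_prev, params):
    """One LSTM step; returns the new hidden state h_t and cell state S_t."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # Eq. (2)
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # Eq. (3)
    G = [math.tanh(a) for a in affine(params["W_G"], params["b_G"], z)]  # Eq. (4)
    S = [ft * sp + it * gt for ft, sp, it, gt in zip(f, S_prev, i, G)]   # Eq. (5)
    O = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # Eq. (6)
    h = [ot * math.tanh(st) for ot, st in zip(O, S)]                     # Eq. (7)
    return h, S


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (illustrative initialisation)."""
    rng = random.Random(seed)
    params = {}
    for gate in ("f", "i", "G", "o"):
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + input_dim)]
            for _ in range(hidden_dim)
        ]
        params["b_" + gate] = [0.0] * hidden_dim
    return params


params = init_params(input_dim=3, hidden_dim=2)
h, S = lstm_step([0.5, -0.2, 0.1], [0.0, 0.0], [0.0, 0.0], params)
```

Because the output gate lies in (0, 1) and $\tanh$ lies in (-1, 1), every component of the hidden state stays strictly inside (-1, 1).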

Here, $W$ denotes a weight matrix and $b$ a bias; $f_{t}$, $i_{t}$, and $O_{t}$ are the values of the forget, input, and output gates of the LSTM; $\tanh$ and $\sigma$ are the hyperbolic tangent and sigmoid functions; and $h_{t}$ and $G_{t}$ represent the hidden-layer and memory representations of the LSTM at time $t$. The amount of information received by the hidden layer is imbalanced across the time steps of the recurrent layer: earlier hidden states are computed from fewer input vectors, whereas the last hidden state is computed from the most. To mitigate this imbalance, the presented method is extended with a bidirectional recurrent layer comprising two opposite recurrent layers that return two hidden-layer sequences, in the forward and backward directions:

$$h_{\mathrm{forward}} = (\vec{h}_{1}, \vec{h}_{2}, \dots, \vec{h}_{n}) \tag{8}$$

$$h_{\mathrm{reverse}} = (\overleftarrow{h}_{1}, \overleftarrow{h}_{2}, \dots, \overleftarrow{h}_{n}) \tag{9}$$

$$h_{t} = (\vec{h}_{t}, \overleftarrow{h}_{t}) \tag{10}$$

Consequently, the document is denoted by $h = \{h_{1}, h_{2}, \dots, h_{n}\}$.
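The bidirectional construction of Equations (8) to (10) can be sketched independently of the cell type. In the sketch below, `toy_step` is a hypothetical stand-in for a recurrent cell (an assumption chosen so the arithmetic is easy to follow); any function mapping an input and the previous state to a new state could be passed instead.

```python
def run_direction(xs, step, h0):
    """Apply h_t = step(x_t, h_{t-1}) over xs and return every hidden state."""
    states, h = [], h0
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(xs, step_fwd, step_bwd, h0):
    """Concatenate forward and backward hidden states at each position."""
    forward = run_direction(xs, step_fwd, h0)               # Eq. (8)
    backward = run_direction(xs[::-1], step_bwd, h0)[::-1]  # Eq. (9), re-aligned to positions 1..n
    return [f + b for f, b in zip(forward, backward)]       # Eq. (10)


def toy_step(x, h):
    """Toy recurrent cell (assumption): half of the previous state plus the input."""
    return [0.5 * hi + xi for hi, xi in zip(h, x)]


states = bidirectional([[1.0], [2.0], [3.0]], toy_step, toy_step, [0.0])
# states == [[1.0, 2.75], [2.5, 3.5], [4.25, 3.0]]
```

Each position now carries a state that has seen everything to its left and one that has seen everything to its right, which is how the bidirectional layer evens out the information available at different time steps.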

3.4. Hyperparameter Tuning

Like other metaheuristic (MH) methods, the MPO [24] begins by assigning random values to a set of solutions within the search space:

$$X = LB + r_{1} \times (UB - LB) \tag{11}$$

where $LB$ and $UB$ are the lower and upper boundaries of the search space and $r_{1} \in [0,1]$ is a random number. Predator and prey are both treated as search agents, because while the predator searches for the prey, the prey searches for its food. Hence, the elite (the matrix of top predators) is updated after every generation. The prey matrix $X$ and the elite matrix are given by
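As a hedged illustration of Equation (11) in the hyperparameter-tuning setting, the sketch below initializes a population over three dimensions (learning rate, batch size, and number of epochs) and decodes a position into usable settings. The bounds in `LB` and `UB` and the rounding in `decode` are assumptions for illustration; the text does not state the search ranges.

```python
import random

# Search bounds are illustrative assumptions, not values from the paper.
LB = [1e-4, 4, 10]    # learning rate, batch size, epochs
UB = [1e-1, 64, 100]


def init_population(n, lb, ub, rng):
    """Eq. (11): X = LB + r1 * (UB - LB), drawn independently per dimension."""
    return [[l + rng.random() * (u - l) for l, u in zip(lb, ub)] for _ in range(n)]


def decode(position):
    """Map a continuous position to hyperparameters (integers are rounded)."""
    lr, batch, epochs = position
    return {"learning_rate": lr, "batch_size": int(round(batch)), "epochs": int(round(epochs))}


rng = random.Random(0)
population = init_population(5, LB, UB, rng)
settings = decode(population[0])
```

Each decoded setting would then be used to train and evaluate the classifier, and the resulting error rate would serve as the fitness of that position.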

$$\mathrm{Elite} = \begin{bmatrix} X_{11}^{1} & X_{12}^{1} & \dots & X_{1d}^{1} \\ X_{21}^{1} & X_{22}^{1} & \dots & X_{2d}^{1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1}^{1} & X_{n2}^{1} & \dots & X_{nd}^{1} \end{bmatrix}, \quad X = \begin{bmatrix} X_{11} & X_{12} & \dots & X_{1d} \\ X_{21} & X_{22} & \dots & X_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \dots & X_{nd} \end{bmatrix} \tag{12}$$

The next step is to update the position of the prey $X$, which is carried out in three phases according to the velocity ratio, emulating the whole relationship between predator and prey. Each phase is discussed in detail below.

Stage 1: High-Velocity Ratio

During this phase, the predator moves very fast compared to the prey $X$. The phase corresponds to the exploration stage and takes place in the first third of the total number of generations (i.e., up to $\frac{1}{3} t_{\max}$). The step $S_{i}$ and the prey $X_{i}$ are updated by the following formulas:

$$S_{i} = R_{B} \otimes (\mathrm{Elite}_{i} - R_{B} \otimes X_{i}), \quad i = 1, 2, \dots, n \tag{13}$$

$$X_{i} = X_{i} + P \cdot R \otimes S_{i} \tag{14}$$

In these equations, $R \in [0,1]$ is a vector of uniform random numbers and $P = 0.5$ is a constant; $R_{B}$ is a vector of random numbers representing Brownian motion; and $\otimes$ denotes component-wise multiplication.
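A minimal sketch of the Stage 1 update in Equations (13) and (14) follows. It assumes that the Brownian component $R_{B}$ is drawn from a standard normal distribution and that the same draw is used in both places where $R_{B}$ appears in Equation (13); neither detail is spelled out in the text.

```python
import random

P = 0.5  # constant from the text


def stage1_update(X, elite, rng):
    """High-velocity phase: Brownian steps, Eqs. (13) and (14)."""
    new_X = []
    for x_i, e_i in zip(X, elite):
        row = []
        for x, e in zip(x_i, e_i):
            r_b = rng.gauss(0.0, 1.0)   # Brownian component R_B (standard normal, assumption)
            r = rng.random()            # uniform component R in [0, 1)
            s = r_b * (e - r_b * x)     # Eq. (13)
            row.append(x + P * r * s)   # Eq. (14)
        new_X.append(row)
    return new_X


X = [[1.0, 2.0], [3.0, 4.0]]
elite = [[0.5, 0.5], [0.5, 0.5]]
X_next = stage1_update(X, elite, random.Random(1))
```

Positions produced this way may leave the search bounds, so a practical implementation would typically clip them back into range after each update.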

Stage 2: Unit Velocity Ratio

During this phase, prey and predator move at the same pace, and the movement mimics the process of searching for food or prey. This phase marks the transition of MPO from exploration to exploitation [25], and both have an equal chance of taking place. The predator performs exploration, whereas the prey carries out exploitation. Brownian motion and Lévy flight are taken to represent the predator and prey movements, respectively. For $\frac{1}{3} t_{\max} < t < \frac{2}{3} t_{\max}$:

$$S_{i} = R_{L} \otimes (\mathrm{Elite}_{i} - R_{L} \otimes X_{i}), \quad i = 1, 2, \dots, n \tag{15}$$

where $R_{L}$ denotes a vector of random numbers drawn from a Lévy distribution. This equation is applied to the first half of the population, which performs exploitation. The second half of the population is updated by

$$S_{i} = R_{B} \otimes (R_{B} \otimes \mathrm{Elite}_{i} - X_{i}), \quad i = 1, 2, \dots, n \tag{16}$$

$$X_{i} = X_{i} + P \cdot CF \otimes S_{i}, \quad CF = \left(1 - \frac{t}{t_{\max}}\right)^{\left(2 \frac{t}{t_{\max}}\right)} \tag{17}$$

Here, $CF$ is the adaptive parameter that controls the step size of the predator's movement, and $t_{\max}$ denotes the total number of generations.
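The behavior of $CF$ and the Brownian update of Equations (16) and (17) can be sketched as below. As before, drawing $R_{B}$ from a standard normal distribution is an assumption, and the function names are hypothetical.

```python
import random

P = 0.5  # constant from the text


def cf(t, t_max):
    """Adaptive step-size factor CF from Eq. (17)."""
    frac = t / t_max
    return (1.0 - frac) ** (2.0 * frac)


def stage2_brownian(x_i, elite_i, t, t_max, rng):
    """Brownian update of one agent, Eqs. (16) and (17)."""
    step = cf(t, t_max)
    out = []
    for x, e in zip(x_i, elite_i):
        r_b = rng.gauss(0.0, 1.0)      # Brownian component R_B (assumption: standard normal)
        s = r_b * (r_b * e - x)        # Eq. (16)
        out.append(x + P * step * s)   # Eq. (17)
    return out


# CF equals 1 at the first generation, 0.5 halfway through, and 0 at the last one.
schedule = [cf(t, 100) for t in (0, 50, 100)]  # [1.0, 0.5, 0.0]
```

Since $CF$ decays to zero, the moves shrink as the run progresses, which shifts the search from exploration towards exploitation.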

Stage 3: Low-Velocity Ratio

This is the final phase of the optimization process and takes place when the predator moves much faster than the prey. It corresponds to the exploitation stage, applies when $t > \frac{2}{3} t_{\max}$, and is expressed by

$$S_{i} = R_{L} \otimes (R_{L} \otimes \mathrm{Elite}_{i} - X_{i}), \quad i = 1, 2, \dots, n \tag{18}$$

$$X_{i} = X_{i} + P \cdot CF \otimes S_{i}, \quad CF = \left(1 - \frac{t}{t_{\max}}\right)^{\left(2 \frac{t}{t_{\max}}\right)} \tag{19}$$
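The Stage 3 update of Equations (18) and (19) needs Lévy-distributed random numbers. The text does not say how $R_{L}$ is generated; the sketch below uses Mantegna's algorithm with exponent 1.5, a common choice in the metaheuristics literature, which should be read as an assumption rather than the authors' setting.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one Lévy-distributed step with Mantegna's algorithm (assumed generator)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def stage3_update(x_i, elite_i, t, t_max, rng, P=0.5):
    """Low-velocity phase for one agent, Eqs. (18) and (19)."""
    frac = t / t_max
    cf = (1.0 - frac) ** (2.0 * frac)   # CF as defined in Eq. (19)
    out = []
    for x, e in zip(x_i, elite_i):
        r_l = levy_step(rng)            # Lévy component R_L
        s = r_l * (r_l * e - x)         # Eq. (18)
        out.append(x + P * cf * s)      # Eq. (19)
    return out


x_next = stage3_update([1.0, 2.0], [0.5, 0.5], t=80, t_max=100, rng=random.Random(0))
```

Lévy steps are mostly small with occasional long jumps, which suits fine-grained search around the elite while still allowing escapes from local optima.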

Eddy Formation and FADs’ Effect

Environmental factors, such as eddy formation and fish aggregating devices (FADs), also affect the behavior of marine predators. Their effect is expressed as follows:

$$X_{i} = \begin{cases} X_{i} + CF \, [X_{\min} + R \otimes (X_{\max} - X_{\min})] \otimes U, & r_{5} < FAD \\ X_{i} + [FAD (1 - r) + r] \, (X_{r_{1}} - X_{r_{2}}), & r_{5} > FAD \end{cases} \tag{20}$$

In Equation (20), $FAD = 0.2$, and $U$ is a binary vector obtained by randomly generating a vector and then converting it to binary with a threshold of 0.2. $r \in [0,1]$ is a random number, and $r_{1}$ and $r_{2}$ are indices of the prey matrix.
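Equation (20) can be sketched as follows. Several details are assumptions because the text leaves them open: $r_{5}$ is taken to be a uniform random number drawn once per agent, the binary vector $U$ is set to 1 where a uniform draw falls below 0.2, and the indices $r_{1}$ and $r_{2}$ are chosen uniformly at random.

```python
import random

FADS = 0.2  # FAD value from the text


def fads_effect(X, lb, ub, cf, rng):
    """Apply Eq. (20) once to every agent in X and return the new population."""
    n = len(X)
    new_X = []
    for x_i in X:
        if rng.random() < FADS:  # r_5 < FAD (r_5 assumed uniform in [0, 1))
            # First case: random point within the bounds, scaled by CF and masked by U.
            row = []
            for x, lo, hi in zip(x_i, lb, ub):
                u = 1.0 if rng.random() < FADS else 0.0  # binary mask U (assumed direction)
                row.append(x + cf * (lo + rng.random() * (hi - lo)) * u)
        else:
            # Second case: move along the difference of two randomly chosen agents.
            r = rng.random()
            r1, r2 = rng.randrange(n), rng.randrange(n)
            scale = FADS * (1.0 - r) + r
            row = [x + scale * (a - b) for x, a, b in zip(x_i, X[r1], X[r2])]
        new_X.append(row)
    return new_X


X = [[1.0, 2.0], [2.0, 1.0], [0.5, 4.0]]
X_next = fads_effect(X, lb=[0.0, 0.0], ub=[5.0, 5.0], cf=0.5, rng=random.Random(3))
```

The first case produces occasional long jumps that help agents escape local optima, while the second case perturbs an agent using the spread of the current population.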

Marine Memory

The marine predator has a memory that remembers the best position attained so far. In each iteration, the fitness of every solution is compared with its fitness in the preceding iteration, and the better of the two is stored in memory.
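The memory step can be sketched as a per-agent comparison. The sketch assumes that lower fitness is better, which matches an error-rate objective; `memory_saving` is a hypothetical name.

```python
def memory_saving(old_X, old_fit, new_X, new_fit):
    """Keep, for each agent, whichever of the old and new solutions has the lower fitness."""
    kept_X, kept_fit = [], []
    for x_old, f_old, x_new, f_new in zip(old_X, old_fit, new_X, new_fit):
        if f_new < f_old:
            kept_X.append(x_new)
            kept_fit.append(f_new)
        else:
            kept_X.append(x_old)
            kept_fit.append(f_old)
    return kept_X, kept_fit


X, fit = memory_saving([[1], [2]], [10.0, 5.0], [[3], [4]], [8.0, 7.0])
# X == [[3], [2]], fit == [8.0, 5.0]
```

With this rule, the stored fitness of an agent can never get worse from one generation to the next.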

The MPO technique uses a fitness function (FF) to reach superior classifier performance. The FF assigns a positive value that indicates the quality of a candidate solution. The classifier error rate, which is to be minimized, is taken as the FF:

$$\mathrm{fitness}(x_{i}) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100 \tag{21}$$
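Equation (21) amounts to the percentage of misclassified samples. A minimal sketch, assuming the true and predicted labels are available as two equally long sequences:

```python
def fitness(y_true, y_pred):
    """Classifier error rate in percent, as in Eq. (21)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true) * 100.0


error = fitness(["pos", "neg", "neu", "pos"], ["pos", "neg", "pos", "pos"])  # 25.0
```

In the tuning loop, `y_pred` would come from a model trained with the hyperparameters encoded by a candidate solution, so a lower value indicates a better candidate.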

4. Performance Validation

The proposed model is simulated using Python 3.6.5 on a PC with an i5-8600k CPU, a GeForce 1050Ti 4 GB GPU, 16 GB RAM, a 250 GB SSD, and a 1 TB HDD. The parameter settings are as follows: learning rate: 0.01; dropout: 0.5; batch size: 5; epoch count: 50; and activation: ReLU. The MPONLP-TSA method is validated experimentally on the COVID-19 tweets dataset from the Kaggle repository [26]. For ease of simulation, a set of 2750 samples with 11 class labels is chosen from the dataset, as depicted in Table 1.

Figure 2 shows the confusion matrices produced by the MPONLP-TSA technique on the entire dataset, 70% of training (TR) data, and 30% of testing (TS) data. The figures show that the MPONLP-TSA method achieves effective classification results on all datasets.

Table 2 shows a detailed result analysis of the MPONLP-TSA methodology on the distinct datasets. The experimental values show that the MPONLP-TSA approach achieves effective outcomes in all aspects. For example, on the entire dataset, the MPONLP-TSA process gained an average $accu_{y}$ value of 99.72%. On 80% of TR data, the MPONLP-TSA methodology attained an average $accu_{y}$ value of 99.70%. Meanwhile, on 20% of TS data, the MPONLP-TSA technique reached an average $accu_{y}$ value of 99.80%.
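Measures such as average accuracy can be derived from a multi-class confusion matrix. The sketch below computes one-vs-rest accuracy, precision, recall, and F-score per class and then averages them over the classes. It assumes rows are true classes and columns are predicted classes, and that the averages are unweighted (macro) averages; the text does not state the exact averaging used.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics for each class of a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # class k predicted as something else
        fp = sum(cm[r][k] for r in range(n)) - tp     # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "f_score": f_score,
        })
    return results


def macro_average(results, key):
    """Unweighted mean of one metric over all classes."""
    return sum(r[key] for r in results) / len(results)


metrics = per_class_metrics([[8, 2], [1, 9]])
avg_accuracy = macro_average(metrics, "accuracy")
```

One-vs-rest accuracy counts true negatives, so with many classes it can be high even when the recall of individual classes is lower; reporting precision, recall, and F-score alongside it gives a fuller picture.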

The training accuracy (TA) and validation accuracy (VA) attained by the MPONLP-TSA method on the test dataset are shown in Figure 3. The MPONLP-TSA methodology attains high values of TA and VA; in particular, the VA is greater than the TA.

The training loss (TL) and validation loss (VL) obtained by the MPONLP-TSA technique on the test dataset are displayed in Figure 4. The MPONLP-TSA technique attains low values of TL and VL; specifically, the VL is lower than the TL.

The precision–recall analysis of the MPONLP-TSA algorithm on the test dataset is depicted in Figure 5. The figure shows that the MPONLP-TSA methodology yields high precision–recall values for all classes.

A detailed ROC analysis of the MPONLP-TSA system on the test dataset is presented in Figure 6. The outcome indicates that the MPONLP-TSA method is able to classify the distinct classes of the test dataset.

Table 3 reports a comparative study of the MPONLP-TSA method with recent techniques [19]. The results show that the SVM and DT techniques have the lowest $accu_{y}$ values, of 90.78% and 90.66%, respectively.

Meanwhile, the RF, extreme gradient boosting (XGBoost), and ensemble methods report slightly higher $accu_{y}$ values of 91.04%, 91.22%, and 93.03%, respectively. Moreover, the sunflower optimization with deep-learning-driven sentiment analysis and classification (SFODLD-SAC) model obtains a reasonable $accu_{y}$ of 99.50%. However, the MPONLP-TSA model shows the best result, with a higher $accu_{y}$ of 99.80%.

To further validate the performance of the MPONLP-TSA model, another dataset of COVID-19 tweets from the Kaggle repository is used (https://www.kaggle.com/gpreda/covid19-tweets, accessed on 11 December 2022). The dataset has 170k tweets with three class labels, namely positive, neutral, and negative. The comparison results of the MPONLP-TSA model with other models [27] on the COVID-19 tweets dataset are reported in Table 4. The experimental results indicate that the RF and SVM models perform poorly. Although the hybrid LSTM-RNN model reaches somewhat improved results, the proposed model outperforms the other existing models with a maximum $accu_{y}$ of 97.59%, $prec_{n}$ of 97.79%, $reca_{l}$ of 98.64%, and $F_{score}$ of 97.45%. From the detailed discussion and results, it can be concluded that the MPONLP-TSA model achieves better classification results than the other techniques.

5. Conclusions

In this study, a novel MPONLP-TSA methodology was presented for recognizing the sentiments expressed in Twitter data during the COVID-19 pandemic. The MPONLP-TSA model first pre-processes the data to convert it into a useful format. Following this, the BERT model is used to derive word vectors. To detect and classify sentiments, the BiRNN model is used. Finally, the MPO technique is used for optimal hyperparameter tuning, which helps enhance the overall classification performance. The MPONLP-TSA system was validated experimentally on the COVID-19 tweet dataset from the Kaggle repository. An extensive comparative study showed that the MPONLP-TSA method outperforms current techniques. In the future, hybrid DL models can be developed to enhance the classifier performance of the MPONLP-TSA system. In addition, sentiments related to socio-political challenges such as wars and pandemics can affect stock prices. The spread of the COVID-19 pandemic continues to take a toll on the economy, and fluctuations in sentiment about the health impacts of the disease can be captured from the microblogging platform Twitter. Therefore, in the future, the proposed model can be extended to stock price prediction.

Author Contributions

Conceptualization, T.V., S.K.J., M.A.A. and K.C.R.; Data curation, K.C.R., G.P.J. and S.L.; Formal analysis, M.A.A., K.C.R. and G.L.; Funding acquisition, G.P.J. and G.L.; Investigation, S.K.J., G.P.J. and G.L.; Methodology, T.V., M.A.A., G.P.J. and S.L.; Project administration, G.P.J., S.L. and G.L.; Resources, G.L.; Software, T.V., M.A.A. and G.L.; Supervision, G.P.J., S.L. and G.L.; Validation, S.K.J., K.C.R., S.L. and G.L.; Visualization, S.K.J. and G.L.; Writing–original draft, T.V.; Writing–review and editing, G.P.J. All authors have read and agreed to the published version of the manuscript.

Funding

The work reported in this paper was conducted during the sabbatical year of Kwangwoon University in 2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated during the current study.

Conflicts of Interest

The authors declare that they have no conflict of interest. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

References

Rahman, M.; Islam, M.N. Exploring the performance of ensemble machine learning classifiers for sentiment analysis of covid-19 tweets. In Sentimental Analysis and Deep Learning; Springer: Singapore, 2022; pp. 383–396.
Chintalapudi, N.; Battineni, G.; Amenta, F. Sentimental analysis of COVID-19 tweets using deep learning models. Infect. Dis. Rep. 2021, 13, 32.
Mishra, R.K.; Urolagin, S.; Jothi, J.A.; Neogi, A.S.; Nawaz, N. Deep learning-based sentiment analysis and topic modeling on tourism during Covid-19 pandemic. Front. Comput. Sci. 2021, 3, 775368.
Costola, M.; Nofer, M.; Hinz, O.; Pelizzon, L. Machine Learning Sentiment Analysis, COVID-19 News and Stock Market Reactions; (No. 288), SAFE Working Paper; SAFE: Frankfurt am Main, Germany, 2020.
Ebadi, A.; Xi, P.; Tremblay, S.; Spencer, B.; Pall, R.; Wong, A. Understanding the temporal evolution of COVID-19 research through machine learning and natural language processing. Scientometrics 2021, 126, 725–739.
Chandrasekaran, R.; Mehta, V.; Valkunde, T.; Moustakas, E. Topics, trends, and sentiments of tweets about the COVID-19 pandemic: Temporal infoveillance study. J. Med. Internet Res. 2020, 22, 1–12.
Khan, R.; Shrivastava, P.; Kapoor, A.; Tiwari, A.; Mittal, A. Social media analysis with AI: Sentiment analysis techniques for the analysis of twitter covid-19 data. Crit. Rev. 2020, 7, 2761–2774.
Naseem, U.; Razzak, I.; Khushi, M.; Eklund, P.W.; Kim, J. COVIDSenti: A large-scale benchmark Twitter data set for COVID-19 sentiment analysis. IEEE Trans. Comput. Soc. Syst. 2021, 8, 1003–1015.
Alamoodi, A.H.; Zaidan, B.B.; Zaidan, A.A.; Albahri, O.S.; Mohammed, K.I.; Malik, R.Q.; Almahdi, E.M.; Chyad, M.A.; Tareq, Z.; Alaa, M. Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review. Expert Syst. Appl. 2021, 167, 114155.
Nemes, L.; Kiss, A. Social media sentiment analysis based on COVID-19. J. Inf. Telecommun. 2021, 5, 1–15.
Cervi, L.; García, F.; Marín-Lladó, C. Populism, Twitter, and covid-19: Narrative, fantasies, and desires. Soc. Sci. 2021, 10, 294.
Rufai, S.R.; Bunce, C. World leaders’ usage of Twitter in response to the COVID-19 pandemic: A content analysis. J. Public Health 2020, 42, 510–516.
Ferrara, E. #COVID-19 on Twitter: Bots, conspiracies, and social media activism. arXiv 2020, arXiv:2004.09531.
Chaves-Montero, A.; Relinque-Medina, F.; Fernández-Borrero, M.Á.; Vázquez-Aguado, O. Twitter, social services and covid-19: Analysis of interactions between political parties and citizens. Sustainability 2021, 13, 2187.
Jena, P.R.; Majhi, R. Are Twitter sentiments during COVID-19 pandemic a critical determinant to predict stock market movements? A machine learning approach. Sci. Afr. 2023, 19, e01480.
Mostafa, L. Egyptian student sentiment analysis using Word2vec during the coronavirus (Covid-19) pandemic. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, Cairo, Egypt, 19–21 October 2020; Springer: Cham, Switzerland, 2020; pp. 195–203.
Chandra, R.; Krishna, A. COVID-19 sentiment analysis via deep learning during the rise of novel cases. PLoS ONE 2021, 16, e0255615.
Mohan, S.; Solanki, A.K.; Taluja, H.K.; Singh, A. Predicting the impact of the third wave of COVID-19 in India using hybrid statistical machine learning models: A time series forecasting and sentiment analysis approach. Comput. Biol. Med. 2022, 144, 105354.
Alkhaldi, N.A.; Asiri, Y.; Mashraqi, A.M.; Halawani, H.T.; Abdel-Khalek, S.; Mansour, R.F. Leveraging Tweets for Artificial Intelligence Driven Sentiment Analysis on the COVID-19 Pandemic. Healthcare 2022, 10, 910.
Alorini, G.; Rawat, D.B.; Alorini, D. LSTM-RNN Based Sentiment Analysis to Monitor COVID-19 Opinions using Social Media Data. In Proceedings of the ICC 2021-IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
Hossain, G.S.; Assaduzzaman, S.; Mynoddin, M.; Sarker, I.H. A Deep Learning Approach for Public Sentiment Analysis in COVID-19 Pandemic. In Proceedings of the 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 24–26 June 2022; pp. 1–6.
Yang, L.; Li, Y.; Wang, J.; Sherratt, R.S. Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 2020, 8, 23522–23530.
Zhou, G.B.; Wu, J.; Zhang, C.L.; Zhou, Z.H. Minimal gated unit for recurrent neural networks. Int. J. Autom. Comput. 2016, 13, 226–234.
Faramarzi, A.; Heidarinejad, M.; Mirjalili, S.; Gandomi, A.H. Marine Predators Algorithm: A Nature-inspired Metaheuristic. Expert Syst. Appl. 2020, 152, 113377.
Al-Qaness, M.A.; Ewees, A.A.; Fan, H.; Abualigah, L.; Abd Elaziz, M. Marine predators algorithm for forecasting confirmed cases of COVID-19 in Italy, USA, Iran and Korea. Int. J. Environ. Res. Public Health 2020, 17, 3520.
Sentiment Analysis of COVID-19 Related Tweets. Available online: https://www.kaggle.com/competitions/sentiment-analysisof-covid-19-related-tweets/data?select=validation.csv (accessed on 11 December 2022).
Singh, C.; Imam, T.; Wibowo, S.; Grandhi, S. A deep learning approach for sentiment analysis of COVID-19 reviews. Appl. Sci. 2022, 12, 3709.

Figure 1. Overall working process of the MPONLP-TSA method.

Figure 2. Confusion matrices of the MPONLP-TSA approach: (a) entire dataset, (b) 80% training (TR) data, and (c) 20% testing (TS) data.

Figure 3. Training accuracy (TA) and validation accuracy (VA) analysis of the MPONLP-TSA methodology.

Figure 4. Training loss (TL) and validation loss (VL) analysis of the MPONLP-TSA methodology.

Figure 5. Precision–recall curve analysis of MPONLP-TSA methodology.

Figure 6. ROC curve analysis of MPONLP-TSA methodology.

Table 1. Dataset details.

Label	Class	No. of Instances
1	Optimistic	250
2	Thankful	250
3	Empathetic	250
4	Pessimistic	250
5	Anxious	250
6	Sad	250
7	Annoyed	250
8	Denial	250
9	Surprise	250
10	Official report	250
11	Joking	250
Total Number of Instances		2750

Table 2. Result analysis of the MPONLP-TSA method with different measures.

Labels	$Accu_{y}$	$Prec_{n}$	$Reca_{l}$	$F_{score}$	$Jacc_{index}$
Entire Dataset
1	99.67	97.25	99.20	98.22	96.50
2	99.71	99.19	97.60	98.39	96.83
3	99.78	97.66	100.00	98.81	97.66
4	99.82	98.42	99.60	99.01	98.03
5	99.64	96.88	99.20	98.02	96.12
6	99.89	98.81	100.00	99.40	98.81
7	99.56	99.58	95.60	97.55	95.22
8	99.78	98.80	98.80	98.80	97.63
9	99.49	96.83	97.60	97.21	94.57
10	99.85	100.00	98.40	99.19	98.40
11	99.75	100.00	97.20	98.58	97.20
Average	99.72	98.49	98.47	98.47	97.00
Training Phase (80%)
1	99.68	97.07	99.50	98.27	96.60
2	99.68	98.94	97.38	98.15	96.37
3	99.73	97.12	100.00	98.54	97.12
4	99.82	98.48	99.49	98.98	97.98
5	99.59	96.17	99.50	97.81	95.71
6	99.95	99.51	100.00	99.75	99.51
7	99.50	99.50	95.19	97.30	94.74
8	99.77	98.97	98.46	98.71	97.46
9	99.50	96.59	98.02	97.30	94.74
10	99.82	100.00	97.99	98.98	97.99
11	99.68	100.00	96.57	98.25	96.57
Average	99.70	98.39	98.37	98.37	96.80
Testing Phase (20%)
1	99.64	98.00	98.00	98.00	96.08
2	99.82	100.00	98.31	99.15	98.31
3	100.00	100.00	100.00	100.00	100.00
4	99.82	98.21	100.00	99.10	98.21
5	99.82	100.00	97.92	98.95	97.92
6	99.64	96.00	100.00	97.96	96.00
7	99.82	100.00	97.62	98.80	97.62
8	99.82	98.21	100.00	99.10	98.21
9	99.45	97.87	95.83	96.84	93.88
10	100.00	100.00	100.00	100.00	100.00
11	100.00	100.00	100.00	100.00	100.00
Average	99.80	98.94	98.88	98.90	97.84
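
The per-class values in Table 2 are consistent with the standard one-vs-rest definitions of these measures. The Python sketch below illustrates that reading; it is not taken from the authors' code. The function name `one_vs_rest_metrics` is hypothetical, and the counts TP = 248, FP = 7, FN = 2 for label 1 are assumptions back-calculated from the reported precision and recall on 250 instances out of 2750, because the confusion matrices (Figure 2) are not reproduced in this text.

```python
def one_vs_rest_metrics(tp, fp, fn, total):
    """Return (accuracy, precision, recall, F-score, Jaccard index) in percent
    for one class treated as positive against all other classes."""
    tn = total - tp - fp - fn                  # everything else is a true negative
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    jaccard = tp / (tp + fp + fn)              # intersection over union
    return tuple(round(100 * m, 2)
                 for m in (accuracy, precision, recall, f_score, jaccard))


# Assumed counts for label 1 (Optimistic) on the entire dataset;
# inferred from Table 2, not stated in the paper.
metrics = one_vs_rest_metrics(tp=248, fp=7, fn=2, total=2750)
print(metrics)
```

Under these assumed counts the function returns 99.67, 97.25, 99.2, 98.22, and 96.5, which agrees with the label 1 row of the Entire Dataset block. It also shows why per-class accuracy stays near 99% even when recall dips: the 2500 instances of the other ten classes dominate the true-negative count.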

Table 3. Comparative analysis of MPONLP-TSA with recent methodologies.

Methods	$Accu_{y}$	$Prec_{n}$	$Reca_{l}$	$F_{score}$
RF Model	91.04	92.32	91.09	91.34
XGboost Model	91.22	91.54	91.65	91.03
SVM Model	90.78	90.22	90.16	90.08
Ensemble Model	93.03	94.16	92.99	93.80
DT Model	90.66	90.65	90.75	90.29
SFODLDSAC Model	99.50	98.15	98.13	98.15
MPONLP-TSA	99.80	98.94	98.88	98.90

Table 4. Results analysis of MPONLP-TSA model with recent models on COVID-19 tweets dataset.

Methods	$Accu_{y}$	$Prec_{n}$	$Reca_{l}$	$F_{score}$
Naive Bayes	66.89	68.35	69.47	68.20
Random Forest	59.84	60.83	61.78	59.90
SVM	62.15	63.36	61.99	63.09
Logistic Regression	70.33	69.23	69.75	66.83
LSTM-RNN	76.32	70.35	78.95	75.73
Hybrid LSTM-RNN	84.42	82.22	82.34	81.49
MPONLP-TSA	97.59	97.79	98.64	97.45

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
