MULDASA: Multifactor Lexical Sentiment Analysis of Social-Media Content in Nonstandard Arabic Social Media

Alwakid, Ghadah; Osman, Taha; Haj, Mahmoud El; Alanazi, Saad; Humayun, Mamoona; Sama, Najm Us

doi:10.3390/app12083806


by Ghadah Alwakid ¹, Taha Osman ², Mahmoud El Haj ³, Saad Alanazi ¹, Mamoona Humayun ⁴,* and Najm Us Sama ⁵

¹ Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka 72311, Saudi Arabia
² School of Science and Technology, Nottingham Trent University, Clifton Lane, Nottingham NG11 8NS, UK
³ School of Computing and Communications, Lancaster University, Clifton Lane, Lancaster LA1 4YW, UK
⁴ Department of Information Systems, College of Computer and Information Sciences, Jouf University, Sakaka 72311, Saudi Arabia
⁵ Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Kota Samarahan 94300, Sarawak, Malaysia
* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(8), 3806; https://doi.org/10.3390/app12083806

Submission received: 12 March 2022 / Revised: 7 April 2022 / Accepted: 8 April 2022 / Published: 9 April 2022

(This article belongs to the Topic Machine and Deep Learning)


Abstract

The semantic complexity of Arabic vocabulary and the shortage of tools and expertise for capturing emotions in Arabic text hinder Arabic sentiment analysis (ASA). Analyzing Arabic dialects that do not follow a standard linguistic framework, such as modern standard Arabic (MSA), makes an already difficult task harder. Here, we define a novel lexical sentiment analysis approach for studying Arabic-language tweets (TTs) from specialized digital media platforms. Several factors, comprising emoji, intensifiers, negations, and other nonstandard expressions such as supplications, proverbs, and interjections, are incorporated into the MULDASA algorithm to improve the precision of opinion classification. Sentiment words found in the content under study are matched to root words in a multidialectal sentiment lexicon (LX) via a simple stemming procedure. Furthermore, a feature–sentiment correlation procedure is incorporated into the proposed technique to exclude opinions that are irrelevant to the domain of concern. As part of our research into Saudi Arabian employability, we compiled a large sample of TTs in 6 different Arabic dialects. The results show that the sentiment classification method is effective and that using all of the factors listed above improves classification: the accuracy of the proposed algorithm improved from 83.84% to 89.80%. Our approach also outperformed two existing research projects that employed a lexical approach for the sentiment analysis of Saudi dialects.

Keywords: sentiment analysis; Arabic NLP; lexical; Saudi dialects; Arabic social media; Twitter

1. Introduction

Across Arab nations, social networking has become immensely popular in recent years as a means of publicly conveying one’s views on a variety of topics. It is estimated that there are more than 11 million active Twitter accounts in the Arab region, and Saudi Arabia has the most active users, with around 2.6 million. As an increasing number of individuals turn to social media sites to share personal views and seek advice, demand for social media analysis intensifies. Sentiment analysis researchers are particularly interested in assessing public mood and spotting emerging trends in this area. Sentiment analysis of Arabic social networking sites is particularly difficult since it must confront Arabic’s complicated semantics, a problem amplified by online posts written in nonstandard Arabic dialects that often do not follow conventional linguistic forms [1]. Using abbreviations and idioms to work around Twitter’s 280-character limit is a common tactic for those who want to express their point quickly [2,3]. The use of abbreviations on Twitter sometimes leads to ambiguity in the interpretation of tweets (TTs), which also frequently contain misspelled words and informal language. In this research, sentiment analysis of particular topics in social-network data is used to understand how views spread, helping social research analysts track the influence of social challenges on individuals’ attitudes, emotions, and emerging trends.

The majority of sentiment analysis techniques are either lexical (linguistic) or machine-learning (ML, statistical) in nature. ML algorithms, including support vector machines (SVM), naïve Bayes (NB), and decision trees (DT), are used to analyze extracted textual features. These approaches are trained on text that has been prelabeled with sentiment polarity. Lexical sentiment analysis requires the creation of a robust sentiment lexicon (LX) for the sentiment classification domain, comprising a set of words with predefined polarities. Once these sentiment terms are identified in a text, their statistical–semantic scores and their placement can be used to estimate the polarity of an entire paragraph. Tremendous human effort is required to manually gather opinion words to create a strong LX in lexical techniques. Techniques for studying modern standard Arabic (MSA) and dialectal Arabic (DA) appear in the literature. Studies on ML techniques use a variety of different ML methodologies [4,5,6,7]. Lexical sentiment analysis for MSA was investigated in existing work [8,9,10,11]. Dialect sentiment analysis, on the other hand, has received minimal research attention. It is difficult to conduct sentiment analysis on Saudi Twitter posts, since there is no gold-labeled dataset or vocabulary that covers all Saudi Arabian dialects, including Hejazi (western area), Najdi (central area), Shamali (northern area), Janubi (southern area), and Sharqawi (southern area, eastern region). This poses a significant barrier to the study of Saudi Arabia’s TTs. Saudi Arabian dialects such as Hejazi and Najdi themselves contain many variations, even within the same country. This shows how diverse Arabic dialects are and is a significant hurdle to NLP efforts in Saudi Arabia. For example, the MSA word nafitha (window) is taqa in the Hejazi dialect but shobak in the Najdi dialect [12,13,14,15].

The Arabic language is morphologically rich, with a significant amount of information about syntactic roles and relationships expressed at the word level. As a result, a single Arabic word may have numerous distinct surface forms. Furthermore, most Arabic names are derived from Arabic adjectives and are frequently mistaken for sentiment words. Arabic words with the same root might have contradictory emotional orientations due to the usage of diacritics and rich morphology, which presents a substantial issue when stemming is used to determine sentiment polarity. The absence of capital letters in Arabic, which elsewhere help to distinguish features such as proper nouns, and the tendency to repeat letters in writing to indicate emotion are also regarded as problems in analyzing the Arabic language. Using Twitter TTs from Saudi Arabia as a case study, we introduce a new approach for dialectal ASA. The proposed method creates a complete multisentiment lexicon of Saudi dialects and uses a light stemming process to match sentiment words in the text to the corresponding base term in the multidialectal sentiment LX. Numerous factors, including emoji, intensifiers, negations, and other nonstandard expressions, are taken into account when classifying TTs according to their emotional content. Additionally, a feature–sentiment correlation technique is used to filter out sentiments that do not relate to the discussed topic. In several experiments, the proposed technique was more accurate at sentiment analysis and addressed the shortcomings of prior research.

The remainder of the paper is organized as follows: we provide a literature review in Section 2 regarding the state of the art of the area under study, our method for multifactorial lexical sentiment analysis is discussed in Section 3, the findings of our sentiment analysis technique are examined in Section 4, and Section 5 provides conclusions and ideas for further study. The study’s acronyms are listed in Table 1 to benefit the reader’s interpretation.

2. Related Work

Arabic is particularly rich morphologically, encoding syntactic features and relationships between words at the word level. The English language has less morphological variation, and sentiment analysis can thus be successfully developed at the sentiment-word level. However, because Arabic has several different surface forms for a single word, directly applying lexical features to sentiment analysis systems results in data sparsity; for example, the word “good” جيد is a root with several forms, such as جيده (feminine singular), جيدات (feminine plural), جيدتين (feminine dual), جيدان (masculine dual), and جيدين (masculine plural). To further complicate matters, the majority of Arabic given names and surnames originate from Arabic adjectives, which can be mistaken for sentiment words; for example, the Arabic name Saead سعيد is also the adjective “happy”. Lastly, in Arabic, a positive (POS) term normally derives from a root with a POS meaning, whereas a negative term derives from a negative root. However, some words with opposite sentiment polarity share the same three-letter root; for example, the words “discrimination” تمييز tamyiz (negative) and “excellent” إمتياز iimtiaz (POS) have opposite emotional orientations but the same Arabic root, ميز miz [16]. There are a number of studies about ASA, and this section focuses primarily on ML and lexical techniques.

2.1. ML Approaches

According to [17], four-tiered polarity was discovered through mining local e-newspaper comments. A total of 815 Arabic comments were divided into 620 for the training dataset and 195 for the test dataset, resulting in an accuracy of 85%. Natural language processing tasks, including machine translation, text categorization, and sentiment analysis, require an enriched corpus for precision and quality, as the authors in [18] reported. In total, 1000 comments from The Voice Facebook account and 1000 from the Al Arabiya Facebook news page were incorporated into the corpus. The authors used Facebook to build the corpus in order to examine dialectal Arabic. Regarding sentiment analysis and cinematic purchase predictions, a corpus is also employed in publications (negative, neutral, and POS). Part-of-speech taggers, tokenizers, vocalizers, and stemmers were used to build the corpus. Conventional labeling, interannotator agreement (IAA), and classifiers such as DT, SVM, NB, and KNN were all used to determine the polarity of the text. The sentiment classification algorithms presented by the authors are not well suited to capturing negation, which has a large effect on sentiment. The preprocessing phase also did not eliminate irrelevant statements.

NB and SVM classifiers were used to analyze comments in Moroccan dialectal Arabic and MSA collected from Facebook. A step that classifies text as MSA or dialectal Arabic was added before sentiment analysis to boost reliability. Light stemming was the primary goal of preprocessing, and the authors applied several light-stemming methods to MSA and to the dialects. Two feature types, TF-IDF and N-grams, were used to classify the data. The SVM classifier achieved an accuracy of 81%, while the NB classifier achieved an accuracy of 78%. Nevertheless, this strategy is challenging for large samples because it must determine whether a TT was written in MSA or dialect during the light-stemming step. The discriminative multinomial naïve Bayes (DMNB) classification model, assisted by several text preprocessing approaches, including normalization, N-gram tokenization, and TF-IDF, was enhanced by [19]. A public Twitter corpus of 2000 Arabic TTs categorized as POS or negative and a fivefold cross-validation procedure were used in that research. Compared to other corpus-based sentiment analysis methodologies, the study’s results showed an improvement. Although the author suggested feature selection as a direction for future research, the report provides minimal information on the classification features. The Jordanian dialect was the topic of [20]. Three polarities were assigned to the dataset: POS, negative, and neutral. Preprocessing of the TTs included cleaning, normalization, tokenization, and stop-word removal. NB and SVM were used in the study as two supervised classifiers. SVM achieved an accuracy of 82.1%. However, there is still room for improvement in light stemming and rooting in dialectal Arabic.

2.2. Lexical Approaches

In an effort to improve current lexical approaches for ASA, a novel lexical technique was proposed in [21]. The LX was constructed in four stages. In the first step, the authors selected 300 root words from the SentiStrength site; in the second step, they added synonyms to the LX. In the third step, a phrase intensity weighting mechanism was applied to the LX to check whether any terms had been omitted after the first two phases. The fourth step expanded the vocabulary by including terms from other Arabic dialects. Sentiment analysis was then conducted using this LX and the basic lexical technique by determining the text’s polarity without taking negation or intensification into consideration. Precision was measured at 70.05% using multiple LX scalability stages, but the approach only covered MSA and did not handle any dialects. One of the three systems presented by [22] was built on top of an improved form of an older lexical technique that could accommodate contextual polarity shifters such as negation and intensification. Using these additional factors, the authors were able to achieve an accuracy rate of 91.75%. The authors in [23] found that the LX-based approach is frequently used for unlabeled data; in this approach, the data are labeled by estimating polarity with sentiment lexica. Sentiment terms and phrases from the LX can be used to gauge the tone of a piece of writing (such as a review). The authors in [24] performed LX-based sentiment classification for Arabic Twitter datasets on the Syrian civil conflict and related issues. Arabic TTs represented as a “bag of words” (BOW) were classified as negative or positive by looking up their sentiment words in an Arabic sentiment dictionary. That article did not analyze dialectal Arabic or other factors that may influence sentiment analysis performance, including intensification and negation.

ASA effectively uses both lexical and ML methodologies, as evidenced by surveying relevant research. We applied a text-based lexical strategy and evaluated the impact of numerous elements such as strength, light stemming, negating, and emoji on analytic precision in evaluating various lexical parameters.

3. Multifactor Lexical Dialectical Arabic Sentiment Analysis (MULDASA)

Lexical sentiment analysis compares the words in TTs against two sentiment LXs (POS and NEG). Sentiment terms are counted in the text to determine the overall polarity of a TT. The most common way to label a TT is to follow a set of rules: if the POS terms in a TT outnumber the NEG terms, the TT is considered to be POS, and vice versa [25]. In order to improve the precision of the sentiment analysis procedure, we took into account a range of additional factors, such as emoji, intensifiers, negations, and nonstandard expressions such as supplications, proverbs, and interjections. We call this approach multifactor.

3.1. Building Sentiment Analysis Corpus

We used the corpus that we created in our previous work on dialectal Arabic stemming [26] as the basis for building our gold-labeled dataset. Around 40,500 TTs were gathered from various hashtags and accounts. Before applying NLP techniques, these TTs were lexically normalized. Following normalization, a gold-standard corpus was created by 7 human annotators, who manually annotated 7000 TTs, labeling the polarity of each TT with its corresponding sentiment (positive or negative), as mentioned in Table 2. Twitter’s API was used to gather TTs on the basis of two factors: first, the location of the user (Saudi Arabia), and second, the dialect (such as Hejazi and Najdi). An equal number of TTs were acquired using the user’s location; this process was sometimes complex since some users restricted their location. TTs were collected using hashtags relevant to the unemployment problem domain, such as السعوديه_للسعوديين (Saudi Arabia for Saudis) and تكامل سوق الاتصالات (telecoms market integration), and went through the necessary processing steps: data collection, preprocessing, normalization, light stemming, and annotation. Full details on the volume of TTs in our corpus are shown in Table 2.

3.2. Domain Analysis and Feature Extraction

Our lexical method is based on a thorough examination of the issue domain’s knowledge. Analyses are crucial in this project since they allow for us to extract domain variables in order to link them with the indicated feelings. Domain knowledge encompasses information about a domain’s surroundings, important ideas, synonyms, ground facts, and linkages between these objects and external relationships that connect concepts from other domains [27]. Conversations with important stakeholders (e.g., people and authorities) and connectivity channels (Twitter posts) are included in our modeled knowledge for our labor problems, as shown in Figure 1. Figure 1 defines the domain variables and relationships existing between these variables to link these variables with the indicated feelings.

3.3. Construction of Arabic Sentiment LX

For dialectal Arabic, this work develops a new domain-specific LX. As previously stated, the Arabic language is composed of MSA and a variety of regional dialects that are often employed in everyday conversation. Arabs from various areas and nations frequently compose their TTs in their local dialects. Saudi Arabia in particular features six distinct dialects. We address this issue by incorporating vocabulary from several Saudi dialects, including Hejazi and Najdi from the western and central regions, Shamali from the north, and Janubi and Sharqawi from the south. The LX used in this study was compiled both manually and automatically by linguists and by native speakers of Arabic and of the Saudi dialects. Much effort was devoted to manually creating and expanding the vocabulary of the dialects. Since colloquial Arabic lacks a consistent vocabulary, the participation of native speakers of diverse dialects is crucial to the development of the LX. Annotators were chosen from a demographic group that uses social media often (e.g., age between 23 and 45). Annotators followed the same instructions and rules, including not permitting bias (such as religious, cultural, or societal beliefs) to affect their work. After the annotation procedure, Cohen’s kappa coefficient [28] was employed to evaluate annotation reliability. This is a statistical measure of inter-rater agreement for qualitative (categorical) items. Because it accounts for agreement by chance, it is regarded as a more reliable indicator than a simple percentage of agreement. The weighted kappa was 0.816, suggesting reliable annotations. The degree of agreement was estimated to be 91.74% [29]. In order to construct the LX, thousands of sentiment expressions were collected from a variety of sources. As a starting point, 1130 MSA sentiment terms were taken from the work of Azmi and Alzanin [17]. For each sentiment word, we compiled a list of MSA synonyms and dialectal equivalents. The Saudi dialects covered included Hejazi, Qassmi, Najdi, Janubi, and Shamali. Three annotators assigned the terms to one of four polarity rates: highly POS (+1), POS (+0.5), NEG (−0.5), or highly NEG (−1), as mentioned in Table 3.

It was then expanded from 1130 words in the sentiment vocabulary to a total of 16,500 words. In this paper, we propose multi-intensity sentiment LXs and different matching techniques to perform sentiment analysis as described below:

POS (P): sentiment score (SC) (intensity) = 0.5
NEG (N): sentiment SC (intensity) = −0.5
Very POS (VP): sentiment SC (intensity) = 1.0
Very NEG (VN): sentiment SC (intensity) = −1.0

This was manually evaluated by 7 annotators who then labelled the polarity of every TT with its sentiment (POS or NEG), which reflects the exact value of the sentiments of the TTs in the corpus. We considered this dual polarity after the annotators’ comments showed that, with respect to controversial issues, most people show critical and strong opinions, and they rarely annotated a TT with a neutral label. This conclusion is corroborated by the comprehensive survey of ASA by [30], where the authors referred to the issue of common POS vs. NEG opinion as binary sentiment analysis (BSA).
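The inter-annotator agreement check described in this section can be illustrated with a short sketch. It computes the plain (unweighted) Cohen's kappa for two annotators, whereas the paper reports a weighted kappa over its own annotation data; the function name `cohen_kappa` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels, not taken from the paper's corpus).
annotator_1 = ["P", "P", "N", "N"]
annotator_2 = ["P", "N", "N", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, which is why kappa is lower than the raw percentage of agreement.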

3.4. Feature–Sentiment Association

By using an association window (the words on either side of the target word) to look for sentiments associated with a concept identified as important in the domain during content modeling, we can exclude opinions that are not related to the given topic. Conventional POS-based referencing techniques, including those proposed by [6,31], cannot be effectively deployed for feature–sentiment identification in dialectal Arabic since it lacks the grammatical rules of MSA. Table 4 illustrates how our feature–sentiment association method works in detail.

Using our original dialectical Arabic light-stemming algorithm [26], word stemming was performed on the original TT (first row in Table 4) to find all POS/NEG sentiments (good, excellent, bad) and target domain features (second row). Then, using a two-word window around the sentiment (two words before and after the sentiment), neighboring (associated) domain features were identified (third row). The two-word window is enough for TTs because of their brief sentences.

The sentiments (tired) تعب and (ruin) خرب were taken into account in the example in Table 4 since they referred to semantic features of the domain. The sentiment (beautiful) جمال is considered to be nonrelevant and was excluded because it refers to جو (weather), which is not a domain feature.
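The two-word association window can be sketched as follows. This is a minimal illustration, assuming the TT has already been tokenized and light-stemmed; the function name `associated_sentiments` and the English placeholder tokens are hypothetical and stand in for the Arabic example of Table 4.

```python
def associated_sentiments(tokens, sentiment_words, domain_features, window=2):
    """Keep only sentiment tokens with a domain feature within `window` words."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok not in sentiment_words:
            continue
        # Up to `window` words before and after the sentiment word.
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if any(n in domain_features for n in neighbours):
            kept.append(tok)
    return kept


# Hypothetical stemmed tokens: "tired" is next to the domain feature "job",
# while "beautiful" only refers to "weather", which is not a domain feature.
tokens = ["job", "search", "tired", "me", "and", "weather", "beautiful"]
result = associated_sentiments(
    tokens,
    sentiment_words={"tired", "beautiful"},
    domain_features={"job", "salary"},
)
print(result)  # ['tired']
```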

3.5. Computing Sentiment SC

The LX was searched for a specific term by using the term-matching approach. If the terms were identical, an SC was provided to the term. The steps mentioned in Algorithm 1 are required to come up with a final SC.

Algorithm 1: TT SC Calculation

Let TT = tweet; LX = lexicon; SC = sentiment score; W = word
1. Inputs: TT, LXs
2. Output: Sentiment SC
3. Set SC←0
4. words←Tokenize(TT)
5. FOR EACH W in words DO
6.   IF W in POS-LX THEN
7.     SC←SC + 0.5
8.   ELSE IF W in VeryPos-LX THEN
9.     SC←SC + 1.0
10.   ELSE IF W in NEG-LX THEN
11.     SC←SC − 0.5
12.   ELSE IF W in VeryNeg-LX THEN
13.     SC←SC − 1.0
14. Label←Classify-TT(SC)
15. RETURN SC and Label (VN, N, P, VP)

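The TT SC (TS) was determined by summing the sentiment scores of the individual words in the TT (WS), as presented in the following equation, where WS_i is the score of the i-th word and n is the number of words in the TT.

TS = \sum_{i=1}^{n} WS_i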
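Algorithm 1 and the summation above can be sketched in a few lines. The class scores follow the values given in Section 3.3; the thresholds inside `classify_tweet` are an assumption, since the text does not define how Classify-TT maps a score to a label, and the English lexicon entries are placeholders.

```python
CLASS_SCORES = {"VP": 1.0, "P": 0.5, "N": -0.5, "VN": -1.0}


def tweet_score(tokens, lexicon):
    """Sum the scores of all tokens found in the lexicon (word -> class)."""
    return sum(CLASS_SCORES[lexicon[t]] for t in tokens if t in lexicon)


def classify_tweet(score):
    """Map a score to a label. Thresholds are assumed, not from the paper."""
    if score >= 1.0:
        return "VP"
    if score > 0:
        return "P"
    if score <= -1.0:
        return "VN"
    if score < 0:
        return "N"
    return None  # no sentiment detected, or POS and NEG cancel out


# Hypothetical lexicon with English placeholder entries.
lexicon = {"good": "P", "excellent": "VP", "bad": "N", "awful": "VN"}
score = tweet_score("the service is good and excellent".split(), lexicon)
print(score, classify_tweet(score))  # 1.5 VP
```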

3.6. Strategies to Enhance the Basic Sentiment Analysis Approach

To enhance the precision of our sentiment analysis process, we tested a variety of strategies including intensification, negation, special phrases, and emoji.

3.6.1. Negations

It is essential for sentiment analysis to identify negation words because they can affect the entire context and orientation of an opinion. The authors in [32] proposed an analysis of negating elements for ASA that takes into account two grammatical structures: أدوات النصب and أدوات الجزم. Two categories of negation particles were established on the basis of these criteria, which identify five key negation elements in Arabic: lan “لن”, maa “ما”, lam “لم”, laa “لا”, and laysa “ليس”. However, their proposed method applies simple grammatical rules that switch polarity only if negation particles follow the sentiment terms. This approach failed in some cases to determine the negation impact. An Arabic Facebook news sentiment analyzer using an ML approach was presented in [33]. The authors classified negations using several ML techniques, but only five MSA negations were studied, and dialectal negations were not analyzed. An extensive rule-based approach was needed in our investigation to deal with the complex forms of Arabic, especially with regard to negated sentiment expressions. The catalog of negation words used throughout Saudi dialects was manually compiled, resulting in a list of 45 words, such as مش ، مو ، ماني ، مارح ، محد ، معاد (msh, mw, mani, marih, mahadun, mueadin).

Negation reverses the polarity of a sentiment, so it must be handled explicitly in sentiment analysis. Consider a statement like “not happy”: the sentiment word is POS, but the statement as a whole is NEG. It is therefore critical to take negation into account when calculating the sentiment SC. To examine negation, a window over the terms in a TT must be considered. Take the TT “I dislike pizza” (literally “I not love pizza”), “انا ما احب البيتزا”: in order to obtain the preceding word for every word in the TT, we set the window size to 1, which yields the pairs “_ I”, “I not”, “not love”, and “love pizza” (“_ انا”, “انا ما”, “ما احب”, “احب البيتزا”). Algorithm 2 gives the process for the TT SC computation with negation.

Algorithm 2: TT-SC Computation with Negation

1. Inputs: TT, LX
2. Output: Sentiment SC
3. Initialize SC←0
4. window_list←generate-Window(TT, 1) // generate a window with size 1
5. FOR EACH previous_W, W in window_list DO
6.   IF W in POS-LX AND previous_W in negation_list THEN
7.     SC←SC − 0.5
8.   ELSE IF W in VeryPos-LX AND previous_W in negation_list THEN
9.     SC←SC − 1.0
10.   ELSE IF W in NEG-LX AND previous_W in negation_list THEN
11.     SC←SC + 0.5
12.   ELSE IF W in VeryNeg-LX AND previous_W in negation_list THEN
13.     SC←SC + 1.0
14. Label←Classify-TT(SC)
15. RETURN SC and Label (VN, N, P, VP)
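A minimal sketch of the negation rule in Algorithm 2 follows. Algorithm 2 as printed lists only the negated branches, so the sketch assumes that a sentiment word without a preceding negation keeps its ordinary lexicon score as in Algorithm 1. The function name and the lexicon entry for احب (with score 0.5) are illustrative assumptions.

```python
def score_with_negation(tokens, lexicon, negations):
    """Sum word scores, flipping the sign when the previous word is a negation.

    `lexicon` maps a word to its score (1.0, 0.5, -0.5 or -1.0).
    """
    score = 0.0
    previous = None  # plays the role of "_" for the first word
    for tok in tokens:
        if tok in lexicon:
            value = lexicon[tok]
            if previous in negations:
                value = -value  # negation reverses polarity
            score += value
        previous = tok
    return score


# The pizza example from the text; the lexicon entry is assumed.
lexicon = {"احب": 0.5}
negations = {"ما", "مو", "مش"}
print(score_with_negation("انا ما احب البيتزا".split(), lexicon, negations))  # -0.5
```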

3.6.2. Determining Sentiment Intensity

The majority of existing research treats the ASA problem as a binary (2-class) classification problem with POS and NEG sentiment. From this perspective, terms, statements, or documents with varying intensities must be grouped into just two categories, POS or NEG [34]. However, this is not the case in real life, where the polarity spectrum of emotions extends from very NEG through NEG, neutral, and POS to very POS. In addition, experts believe that modeling intensity at the word level is critical for improving the performance of NLP solutions, particularly in question answering and contextual inference [35]. As a result, scholars proposed a multiplicative effect obtained by pairing an intensifier (a support word) such as “very” with a polarity adjective such as “good” or “bad”. This helps to assign distinct sentiment values to terms such as “very good”, “good”, “bad”, and “very bad”.

To our knowledge, no investigations on the effectiveness of intensifiers on sentiment polarity have been conducted in Saudi Arabia. Due to the absence of an intensification vocabulary in the research, intensification terms for Saudi dialects were personally acquired by native linguists throughout this analysis. Approximately 33 Saudi intensification words were gathered; Table 5 illustrates some examples.

To analyze sentiment intensity, we employed the gathered intensification words. We used a window over the words in TTs to extract the preceding and following words for each word, since the intensifier, as demonstrated in Table 6, is not always contiguous to the sentiment word in the Arabic used on social sites. Algorithm 3 for the TT SC computation with intensification is shown below.

Algorithm 3: TT-SC Computation with Intensification

1. Inputs: TT, LX
2. Output: Sentiment SC
3. Initialize SC←0
4. window_list←generate-Window(TT, 2) // generate a window with size 2
5. FOR EACH previous_w, w, next_w in window_list DO
6.   IF w in POS-LX THEN
7.     SC←SC + 0.5
8.     IF previous_w in intensification_list OR next_w in intensification_list THEN
9.       SC←SC + 0.5
10.   ELSE IF w in VeryPos-LX THEN
11.     SC←SC + 1.0
12.     IF previous_w in intensification_list OR next_w in intensification_list THEN
13.       SC←SC + 0.5
14.   ELSE IF w in NEG-LX THEN
15.     SC←SC − 0.5
16.     IF previous_w in intensification_list OR next_w in intensification_list THEN
17.       SC←SC − 0.5
18.   ELSE IF w in VeryNeg-LX THEN
19.     SC←SC − 1.0
20.     IF previous_w in intensification_list OR next_w in intensification_list THEN
21.       SC←SC − 0.5
22. Label←Classify-TT(SC)
23. RETURN SC and Label (VN, N, P, VP)
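Algorithm 3 can be sketched compactly by noting that all four branches add 0.5 in the direction of the word's own polarity when an intensifier sits directly before or after it. The function name, the `boost` parameter, and the English placeholder words are assumptions for illustration.

```python
def score_with_intensifiers(tokens, lexicon, intensifiers, boost=0.5):
    """Sum word scores, strengthening a word if a neighbour is an intensifier.

    `lexicon` maps a word to its score (1.0, 0.5, -0.5 or -1.0).
    """
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            continue
        value = lexicon[tok]
        prev_tok = tokens[i - 1] if i > 0 else None
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        if prev_tok in intensifiers or next_tok in intensifiers:
            # Push the score further in its own direction.
            value += boost if value > 0 else -boost
        score += value
    return score


# Hypothetical English placeholders for dialectal sentiment words.
lexicon = {"good": 0.5, "bad": -0.5}
intensifiers = {"very"}
print(score_with_intensifiers(["very", "good"], lexicon, intensifiers))  # 1.0
print(score_with_intensifiers(["bad", "very"], lexicon, intensifiers))   # -1.0
```

Checking both neighbours reflects the point made above that the intensifier may come before or after the sentiment word.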

3.6.3. Emoji

Emoji are small digital images that can be used to convey a variety of different kinds of information on social networking platforms [36]. Emoji have seen a massive boost in popularity in recent years, especially on Twitter, a microblogging platform. An emoji can express emotion more effectively than a word or phrase can because it does not depend on language or a specific context. As a consequence, the identification and classification of emoji are important for the development of sound sentiment analysis programs. A few studies [37] took the use of emoji in ASA into account. The authors in [38] argued that sentiment analysis of microblogs could be improved by using novel nonverbal features rather than relying on NLP techniques alone, which are challenging to apply, mainly when dealing with dialects. In their analysis, they proposed the use of multiple ML methods and 969 emoji features. According to the findings of their experiments, the suggested emoji-based features worked well in accurately identifying sentiment polarity.

The lexical technique and nonverbal aspects are combined throughout this work to generate ASA. Emoji usage and its impact on sentiment analysis can be assessed under this approach. Emoji are used in both POS and NEG settings. Emoji that conveyed four unique moods were emphasized: very POS(VP), POS(P), NEG(N), and very NEG (VN). The experiment used a collection of emoji compiled by [39] that includes 592 different symbols. Human observers reviewed the emoji collection, and manually assigned polarity classes and scores for every emoji. The annotators were instructed to use VP = 1.0, P = 0.5, N = −0.5, and VN = −1, and assigned scores. Every emoji was also given an SC on the basis of how close it was to the mean of annotator scores. With a kappa (K) value of 0.85, the annotators’ ultimate agreement was rated at 91.2%, which was an impressive result. Table 7 and Table 8 provide a breakdown of the emoji that fall under the VP, P, N, and VN categories, and an illustration.
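The emoji scoring step, in which each emoji receives the class score closest to the mean of the annotators' scores, might look like the sketch below. The function name `emoji_class` and the tie-breaking behaviour (the first class in the order VP, P, N, VN wins) are assumptions; the paper does not state how ties are resolved.

```python
CLASS_SCORES = {"VP": 1.0, "P": 0.5, "N": -0.5, "VN": -1.0}


def emoji_class(annotator_scores):
    """Return the (label, score) whose score is nearest the annotators' mean."""
    if not annotator_scores:
        raise ValueError("at least one annotator score is required")
    mean = sum(annotator_scores) / len(annotator_scores)
    # min() keeps the first key on ties, i.e. the order VP, P, N, VN.
    label = min(CLASS_SCORES, key=lambda k: abs(CLASS_SCORES[k] - mean))
    return label, CLASS_SCORES[label]


# Hypothetical annotations for two emoji.
print(emoji_class([1.0, 1.0, 0.5]))     # ('VP', 1.0)
print(emoji_class([-0.5, -0.5, -1.0]))  # ('N', -0.5)
```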

3.6.4. Considering Special Linguistic Phrases Affecting Sentiments

Supplications, interjections, and proverbs are among numerous distinctive sentences in dialectal Arabic that describe emotions, and they have a significant part in sentiment classification. It was critical to competently, precisely, and transparently interpret these specific terms to enhance the created ASA system throughout this research. Innovative methods were developed in the proposed work to handle supplications, interjections, and proverbs in dialectal Arabic in order to retrieve sentiment as correctly as possible from TTs.

a.: Supplications

Supplications are commonly used by Arabs, particularly Saudis, in everyday life. Their own social-media posts reflect this behavior as well. Supplication can be utilized to express both POS and NEG feelings, according to linguistic scholars [40]. Whereas supplications are frequently employed in social networking sites to communicate both POS and NEG thoughts, we are aware of just a few studies that address them in the ASA framework [41]. Over 32% of the TTs in our corpus involved either POS or NEG supplications, demonstrating the significance of supplications in defining emotion. There are numerous types of supplication, including POS intentions (shown in Table 9) and NEG desires (shown in Table 10).

We compiled a list of common supplications from a variety of sources, including the Qur’an and quotations. Supplications were detected in the TTs if they contained either of the terms الله or اللهم, as described in Table 11. Various supplications gathered from the Alkalem Attayeb site [42], together with a few others, were added manually.
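A minimal sketch of this detection step follows. The two marker terms come from the text and the four phrases come from Table 11; the function name `find_supplication`, the whitespace tokenization, the order of the two checks, and the "UNKNOWN" fallback label are assumptions made for illustration.

```python
# Marker terms named in the text
SUPPLICATION_MARKERS = {"الله", "اللهم"}

# Small illustrative lexicon taken from Table 11 (phrase -> polarity)
SUPPLICATION_LEXICON = {
    "الله يوفقك": "POS",
    "بارك الله فيك": "POS",
    "حسبي الله و نعم الوكيل": "NEG",
    "أعوذ بالله": "NEG",
}

def find_supplication(tweet):
    """Return "POS"/"NEG" for a known supplication, "UNKNOWN" if only a
    marker term is present, or None if no supplication is detected."""
    # First look for a full phrase from the lexicon
    for phrase, polarity in SUPPLICATION_LEXICON.items():
        if phrase in tweet:
            return polarity
    # Otherwise fall back to the marker terms (assumed whitespace tokens)
    if SUPPLICATION_MARKERS & set(tweet.split()):
        return "UNKNOWN"
    return None

print(find_supplication("الله يوفقك"))
```

A TT that contains a marker term but no listed phrase is only flagged as a candidate here; deciding its polarity would require a larger lexicon or manual review.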

b.: Proverbs

Proverbs are brief summaries of popular experience and wisdom. A proverbial statement is a traditional saying that is passed down through oral culture and closely resembles a proverb. Idiomatic statements have similar structures, and distinguishing them from proverbial statements can be challenging. The meaning of proverbial and idiomatic statements often cannot be derived from the literal words of the sentence. Some academics also classify proverbs and proverbial statements as idioms [43]. Proverbs are examined in this work because people use them to express their emotions about a topic when posting about it. We manually gathered 200 proverbs in various Saudi dialects. Table 12 gives examples of POS and NEG proverbs, and Table 13 shows a TT containing a proverb.

c.: Interjections

Interjections are mainly used to indicate a negative emotion [44]. Examples include الى متى (ela mita, “until when”), وين حنا (wain hena, “where are we”), من متى (min meta, “since when”), and وش باقي (wish baki, “what remains”).

Interjections are normally accompanied by punctuation marks such as ؟ and !. Approximately 30 interjections were collected manually; Table 14 shows a TT that includes one.

This section discussed lexical multifactor sentiment analysis, which relies on a thorough assessment of contextual domain information. We presented our strategies for developing the Arabic sentiment LX and for computing sentiment scores, followed by the strategies that strengthen the basic sentiment assessment: light Arabic stemming and morphological analysis, negations, sentiment intensity, emoji, and the special linguistic expressions that affect sentiment.

4. Results and Discussion

4.1. Analysis of Experimental Findings

We describe and evaluate the outcomes of our experiments in this section. Experiments using lexical and multifactor sentiment analyses were conducted to assess the performance of the proposed algorithm, which includes the consideration of emoji, intensifiers, and negations. We adopted the standard text-classification metrics of precision (P), recall (R), accuracy (Acc), and F measure (F1) to compare the alternative techniques. The F measure, the harmonic mean of precision and recall, is used to assess overall system effectiveness [45]. In Equations (1)–(4), TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

P = \frac{TP}{TP + FP} \qquad (1)

R = \frac{TP}{TP + FN} \qquad (2)

Acc = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3)

F1 = \frac{2PR}{P + R} \qquad (4)
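Equations (1)–(4) translate directly into code. The sketch below computes the four metrics from confusion-matrix counts and also averages F1 over the POS and NEG classes by treating each class in turn as the positive one. The function names and the example counts are hypothetical, and the code assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute P, R, Acc, and F1 as defined in Equations (1)-(4)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"P": precision, "R": recall, "Acc": accuracy, "F1": f1}

def average_f1(tp, fp, tn, fn):
    """Average F1 over both classes.

    For the second class the roles are swapped: its true positives are
    the first class's true negatives, and FP and FN change places.
    """
    f1_first = classification_metrics(tp, fp, tn, fn)["F1"]
    f1_second = classification_metrics(tn, fn, tp, fp)["F1"]
    return (f1_first + f1_second) / 2

# Hypothetical counts, not taken from the experiments
scores = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print({name: round(value, 4) for name, value in scores.items()})
print(round(average_f1(tp=80, fp=20, tn=90, fn=10), 4))
```

Averaging over both classes matters when the classes are imbalanced, because a classifier can score well on the majority class while performing poorly on the minority class.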

Classification accuracy and the average F-SC across the NEG and POS categories are shown in Table 15 and Figure 2. The experiments were run on a gold-labeled copy of our original records. As noted previously, we tested a variety of factors that could reflect sentiment polarity or strength. Table 15 shows that the strategy incorporating all factors (LX-based baseline + light stemming + polarity + negations + emoji + intensification words) achieved the best classification results, with an accuracy of 89.80% and an F-SC of 86.32%, improvements of roughly 5 and 10 percentage points, respectively, over the baseline. Table 15 further shows that the LX-based baseline alone achieved a good classification accuracy of 84.34% and an F-SC of 76.47%, owing first to the knowledge-based strategy that enabled the collection of domain-specific features and second to the effective development of the Saudi dialect LXs (see Table 16).

Emoji, on the other hand, yielded lower results than the baseline method, with an accuracy of 82.63% and an F-SC of 48.70%. The use of emoji to convey sarcasm may have degraded the classification in some cases. Combining the LX-based and polarity-based classifications achieved an accuracy of 88.94% and an F-SC of 82.14%, as shown in Table 15. Combining the LX-based method with special phrases resulted in an accuracy of 85.39% and an F-SC of 76.99%. When light stemming was combined with the LX-based technique, accuracy and F-SC were 88.99% and 81.16%, respectively.

Negation is a more complicated task: handling most negation phrases, and avoiding misclassifications caused by their inconsistent usage, requires specific rules. The complexity of the Arabic language, and of negation in particular, called for in-depth linguistic analysis and semantic synthesis. Negation handling in the LX-based method was hindered by two problems. The first was the use of special particles, including exception particles, which are common in TTs (e.g., لن يفلح الا المثابر, “nobody is successful unless they work extremely hard”); in such cases, the whole sentence must be analyzed and the exception particle taken into account to determine the correct sentiment. The second was the free word order of Arabic sentences, which caused negation particles and sentiment terms to be mismatched. According to Table 15, combining negation with the LX-based approach gave an accuracy of 79.53% and an F-SC of 57.70%. Thus, with the exception of negation and emoji, every factor improved accuracy over the baseline when added on its own. The highest classification accuracy was obtained by combining all of the components.
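The exception-particle problem can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the rule set used in MULDASA: the particle lists are small subsets, the lexicon entry and its score are hypothetical, and the rule "a negation that is followed by an exception particle does not flip polarity" is a deliberate simplification.

```python
NEGATION_PARTICLES = {"لن", "لا", "لم", "ليس"}  # illustrative subset
EXCEPTION_PARTICLES = {"الا", "إلا"}            # "except / unless"
TOY_LEXICON = {"يفلح": 0.5}                     # hypothetical entry: "succeeds"

def score_with_negation(tokens, lexicon=TOY_LEXICON):
    """Sum lexicon scores, flipping the sign of the sentiment word that
    follows a negation particle unless an exception particle comes later."""
    total = 0.0
    negate = False
    for i, tok in enumerate(tokens):
        if tok in NEGATION_PARTICLES:
            # Assumed rule: "not ... except ..." restricts rather than denies,
            # so the flip is skipped when an exception particle follows.
            negate = not any(t in EXCEPTION_PARTICLES for t in tokens[i + 1:])
        elif tok in lexicon:
            total += -lexicon[tok] if negate else lexicon[tok]
            negate = False  # a negation affects only the next sentiment word
    return total

print(score_with_negation("لن يفلح الا المثابر".split()))
print(score_with_negation("لن يفلح".split()))
```

A naive flip would score the example sentence as NEG; looking ahead for the exception particle keeps it POS. The sketch does not address the second problem, free word order, which would need more than a left-to-right scan.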

4.2. Evaluation against Similar Work on Dialectal ASA

Our LX-based method is primarily predicated on incorporating knowledge of the problem domain into sentiment analysis. It is therefore useful to assess how well our strategy transfers to other problem domains. We compared our technique with two previously published LX-based techniques for Saudi dialects [22,46].

The first collection with which we conducted our experiments was that of [46], which was published in the Journal of Information Science. Its authors selected hashtags covering a variety of social concerns in Saudi Arabia, including #الراتب_مايكفي_الحاجة #alratb_mayikfy_alhaja (“our income is insufficient”), #قيادة_26_اكتوبر #qyadt_26_aktubar (“on 26 October, women will be driving”), and #المحتسبون_للديوان_مجدداً #almhtsbwn_lldywan_mjdda (“sheikhs returned to the ruler to debate the issue of women driving”). The collection contains 1103 annotated Arabic TTs and is not publicly accessible; however, the authors made the data available to us for this study. Sentiment analysis was used to classify the TTs into one of two categories on the basis of their polarity (POS or NEG). The researchers translated SentiWordNet into Arabic to retrieve sentiment terms for their Arabic sentiment LX and then supplemented these terms with a collection of their own. Their sentiment LX contains 1500 words (1000 NEG and 500 POS). The polarity of an entire TT is gauged from the number of POS words it contains. They used regular expressions to build a negation analyzer. Their system takes unannotated TTs and returns annotated TTs; a few of the output TTs, however, still lacked a sentiment annotation. Their semantic method is therefore limited to terms that express sentiment, and the classifier does not correctly categorize a TT if its sentiment terms are missing from the LX. They obtained their best results when light stemming and negation were considered: an accuracy of 67.60%, an F-SC of 78.24%, a precision of 91.74%, and a recall of 67.43%.

The second collection was gathered by the authors of [22], whose findings were published in Procedia Computer Science. In total, 14,806 TTs were manually annotated by selected annotators. The resulting AraSenti-Tweet corpus is available online and is partitioned into training and testing datasets. To generate the sentiment LX “AraSenti-Trans”, they used the MADAMIRA program and retrieved 131,342 words from the TT datasets. Negation was handled by building a list of all negation particles found in the TTs and determining whether a TT contained one of them. It was not taken into account while drawing conclusions. The study reported a precision of 78.38% and a recall of 78.15%.

Table 16 and Figure 3 compare the results of our sentiment analysis technique with those of the strategies of Aldayel et al. and Al-Twairesh et al., using the corpora from their studies.

Our LX-based strategy outperformed the other two, with a 10% accuracy improvement on the Aldayel corpus and a 2% improvement in F-SC over the Al-Twairesh technique. Many factors contribute to our lexical analysis, including intensifiers, negations, proverbs, and interjections, as well as a full multi-intensity LX for the Saudi dialect, and the favorable outcomes of our technique may be due to these features. This comparison suggests that our technique can be applied to other domains.

5. Conclusions and Future Work

Numerous factors contribute to the effectiveness of sentiment analysis, and this research proposed a mechanism for dialectal Arabic that takes them into account. The study extended a sentiment LX with a comprehensive set of multidialectal sentiment alternatives for the target problem domain of unemployment. We employed a feature–sentiment association technique to filter out opinions that were not relevant to the problem domain, and an effective light-stemming approach to associate the expressed sentiments with the corresponding word roots in the multidialectal LX.

We performed experiments to test the accuracy of our multi-intensity lexical sentiment analysis technique. The findings showed that, in conjunction with light stemming, jointly considering emoji, intensification, negation, supplications, and interjections improved the proposed algorithm's sentiment classification of TTs. The algorithm's classification accuracy and F-SC improved from 83.84% to 89.80% and from 73.47% to 84.70%, respectively. Our approach was compared with two existing research studies, and the results showed that MULDASA outperformed both.

Our future work involves investigating the integration of our lexical algorithm with ML techniques in a hybrid approach that could further improve overall classification accuracy. This is particularly useful for problem domains where it is difficult to recognize beforehand a comprehensive set of domain key ideas (features) that could be associated with the sentiment LX. An example of such a domain is “hate speech”, where the feature set and expressed sentiments cover a wider pool of terminology and dynamically change. We are also planning to explore the effectiveness of the proposed approach on neutral text.

Author Contributions

Data curation, G.A.; formal analysis, G.A.; investigation, M.E.H.; supervision, T.O. and M.E.H.; writing—original draft, G.A.; writing—review and editing, S.A., M.H. and N.U.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Furnished on request.

Acknowledgments

The authors acknowledge the Deanship of Scientific Research at Jouf University.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. El-Beltagy, S.R.; Khalil, T.; Halaby, A.; Hammad, M. Combining lexical features and a supervised learning approach for Arabic sentiment analysis. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, Konya, Turkey, 3–9 April 2016; Springer: Cham, Switzerland, 2016.
2. Albogamy, F.; Ramsay, A. POS tagging for Arabic tweets. In Proceedings of the International Conference Recent Advances in Natural Language Processing, Hissar, Bulgaria, 7–9 September 2015.
3. Govindarajan, M. Approaches and Applications for Sentiment Analysis: A Literature Review. In Data Mining Approaches for Big Data and Sentiment Analysis in Social Media; IGI: New York, NY, USA, 2022; pp. 1–23.
4. Duwairi, R.M.; Qarqaz, I. A framework for Arabic sentiment analysis using supervised classification. Int. J. Data Min. Model. Manag. 2016, 8, 369–381.
5. Mehmood, M.; Ayub, E.; Ahmad, F.; Alruwaili, M.; Alrowaili, Z.A.; Alanazi, S.; Rizwan, M.H.; Naseem, S.; Alyas, T. Machine learning enabled early detection of breast cancer by structural analysis of mammograms. Comput. Mater. Contin. 2021, 67, 641–657.
6. Gouda, W.; Almurafeh, M.; Humayun, M.; Jhanjhi, N.Z. Detection of COVID-19 Based on Chest X-rays Using Deep Learning. Healthcare 2022, 10, 343.
7. Al Shamsi, A.A.; Abdallah, S. Text mining techniques for sentiment analysis of Arabic dialects: Literature review. Adv. Sci. Technol. Eng. Syst. J. 2021, 6, 1012–1023.
8. Abd-Elhamid, L.; Elzanfaly, D.; Eldin, A.S. Feature-based sentiment analysis in online Arabic reviews. In Proceedings of the 2016 11th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, 20–21 December 2016.
9. Mataoui, M.H.; Zelmati, O.; Boumechache, M. A proposed lexicon-based sentiment analysis approach for the vernacular Algerian Arabic. Res. Comput. Sci. 2016, 110, 55–70.
10. Al-Twairesh, N.; Al-Khalifa, H.; Alsalman, A.; Al-Ohali, Y. Sentiment analysis of arabic tweets: Feature engineering and a hybrid approach. arXiv 2018, arXiv:1805.08533.
11. Ahmed, A.; Ali, N.; Alzubaidi, M.; Zaghouani, W.; Abd-alrazaq, A.A.; Househ, M. Freely Available Arabic Corpora: A Scoping Review. Comput. Methods Programs Biomed. 2022, 2, 100049.
12. Al-Harbi, W.A.; Emam, A. Effect of Saudi dialect pre-processing on Arabic sentiment analysis. Int. J. Adv. Comput. Technol. 2015, 4, 91–99.
13. Al-Thubaity, A.; Alharbi, M.; Alqahtani, S.; Aljandal, A. A Saudi dialect Twitter Corpus for sentiment and emotion analysis. In Proceedings of the 2018 21st Saudi computer society national computer conference (NCC), Riyadh, Saudi Arabia, 25–26 April 2018.
14. Assiri, A.; Emam, A.; Al-Dossari, H. Towards enhancement of a lexicon-based approach for Saudi dialect sentiment analysis. J. Inf. Sci. 2018, 44, 184–202.
15. Alahmary, R.M.; Al-Dossari, H.Z.; Emam, A.Z. Sentiment analysis of Saudi dialect using deep learning techniques. In Proceedings of the 2019 International Conference on Electronics, Information, and Communication (ICEIC), Auckland, New Zealand, 22–25 January 2019.
16. Alwakid, G.; Osman, T.; Hughes-Roberts, T. Challenges in sentiment analysis for Arabic social networks. Procedia Comput. Sci. 2017, 117, 89–100.
17. Azmi, A.M.; Alzanin, S.M. Aara’ – A system for mining the polarity of Saudi public opinion through e-newspaper comments. J. Inf. Sci. 2014, 40, 398–410.
18. Itani, M.; Roast, C.; Al-Khayatt, S. Corpora for sentiment analysis of Arabic text in social media. In Proceedings of the 2017 8th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 4–6 April 2017.
19. AlSalman, H. An improved approach for sentiment analysis of arabic tweets in twitter social media. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 19–21 March 2020.
20. Atoum, J.O.; Nouman, M. Sentiment analysis of Arabic jordanian dialect tweets. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 256–262.
21. Hasan, A.A.; Fong, A.C. Sentiment analysis based fuzzy decision platform for the Saudi stock market. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 3–5 May 2018.
22. Al-Twairesh, N.; Al-Khalifa, H.; Al-Salman, A.; Al-Ohali, Y. Arasenti-tweet: A corpus for arabic sentiment analysis of saudi tweets. Procedia Comput. Sci. 2017, 117, 63–72.
23. Al-Moslmi, T.; Albared, M.; Al-Shabi, A.; Omar, N.; Abdullah, S. Arabic senti-lexicon: Constructing publicly available language resources for Arabic sentiment analysis. J. Inf. Sci. 2018, 44, 345–362.
24. Aloqaily, A.; Al-Hassan, M.; Salah, K.; Elshqeirat, B.; Almashagbah, M. Sentiment analysis for arabic tweets datasets: Lexicon-based and machine learning approaches. J. Theor. Appl. Inf. Technol. 2020, 98, 612–623.
25. Pak, A.; Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta, 19–21 May 2010.
26. Alwakid, G.; Osman, T.; Hughes-Roberts, T. Towards improved saudi dialectal Arabic stemming. In Proceedings of the 2019 International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 3–4 April 2019.
27. Khalil, H.; Osman, T. Challenges in information retrieval from unstructured arabic data. In Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, Cambridge, UK, 26–28 March 2014.
28. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
29. Carletta, J. Assessing agreement on classification tasks: The kappa statistic. arXiv 1996, arXiv:cmp-lg/9602004.
30. Al-Ayyoub, M.; Nuseir, A.; Alsmearat, K.; Jararweh, Y.; Gupta, B. Deep learning for Arabic NLP: A survey. J. Comput. Sci. 2018, 26, 522–531.
31. Ibrahim, H.S.; Abdou, S.M.; Gheith, M. Sentiment analysis for modern standard Arabic and colloquial. arXiv 2015, arXiv:1505.03105.
32. Duwairi, R.M.; Alshboul, M.A. Negation-aware framework for sentiment analysis in Arabic reviews. In Proceedings of the 2015 3rd International Conference on Future Internet of Things and Cloud, Rome, Italy, 24–26 August 2015.
33. Hamouda, A.; El-taher, F.E.-Z. Sentiment analyzer for arabic comments system. Int. J. Adv. Comput. Sci. Appl. 2013, 99–104.
34. Badaro, G.; Baly, R.; Hajj, H.; El-Hajj, W.; Shaban, K.B.; Habash, N.; Al-Sallab, A.; Hamdi, A. A survey of opinion mining in Arabic: A comprehensive system perspective covering challenges and advances in tools, resources, models, applications, and visualizations. ACM Trans. Asian Low-Resour. Lang. Inf. Processing 2019, 18, 1–52.
35. De Marneffe, M.-C.; Manning, C.D.; Potts, C. “Was it good? It was provocative”. Learning the meaning of scalar adjectives. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 11–16 July 2010.
36. Felbo, B.; Mislove, A.; Søgaard, A.; Rahwan, I.; Lehmann, S. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. arXiv 2017, arXiv:1708.00524.
37. Abdellaoui, H.; Zrigui, M. Using tweets and emojis to build tead: An Arabic dataset for sentiment analysis. Comput. Sist. 2018, 22, 777–786.
38. Al-Azani, S.; El-Alfy, E.-S.M. Combining emojis with Arabic textual features for sentiment classification. In Proceedings of the 2018 9th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 3–5 April 2018.
39. Kralj Novak, P.; Smailović, J.; Sluban, B.; Mozetič, I. Sentiment of emojis. PLoS ONE 2015, 10, e0144296.
40. Mohammad, S. A practical guide to sentiment annotation: Challenges and solutions. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, San Diego, CA, USA, 16 June 2016.
41. Shamsudin, N.F.; Basiron, H.; Sa’aya, Z. Lexical based sentiment analysis-Verb, adverb & negation. J. Telecommun. Electron. Comput. Eng. JTEC 2016, 8, 161–166.
42. الكلم الطيب-موسوعة الفوائد والحكم والأدعية والأذكار والأقوال المأثورة. Available online: Kalemtayeb.com (accessed on 13 July 2019).
43. Ibrahim, H.S.; Abdou, S.M.; Gheith, M. Idioms-proverbs lexicon for modern standard Arabic and colloquial sentiment analysis. arXiv 2015, arXiv:1506.01906.
44. Ortigosa, A.; Martín, J.M.; Carro, R.M. Sentiment analysis in Facebook and its application to e-learning. Comput. Hum. Behav. 2014, 31, 527–541.
45. Khalil, M.I.; Tehsin, S.; Humayun, M.; Jhanjhi, N.Z.; AlZain, M.A. Multi-Scale Network for Thoracic Organs Segmentation. Comput. Mater. Contin. 2022, 70, 3251–3265.
46. Aldayel, H.K.; Azmi, A.M. Arabic tweets sentiment analysis – A hybrid scheme. J. Inf. Sci. 2016, 42, 782–797.

Figure 1. Concept diagram.

Figure 2. Findings of multifactor LX-based sentiment analysis of social-media posts in dialectical Arabic.

Figure 3. Findings from Al-Twairesh and Aldayel corpora using our LX-based technique.

Table 1. Definitions of used acronyms.

Acronym	Used for
TTs	Tweets
ASA	Arabic sentiment analysis
SVM	Support vector machine
MULDASA	Multifactorial lexical sentiment analysis algorithm
MSA	Modern standard Arabic
NB	Naïve Bayes
DA	Dialectical Arabic
IAA	Interannotator agreement
DT	Decision trees
KNN	k-nearest neighbors
DMNB	Discriminative multinomial naïve Bayes
BOW	Bag-of-words
POS	Positive
NEG	Negative
ML	Machine learning
LX	Lexicons

Table 2. Data retrieved from our collection of TTs.

Dataset	NEG TTs	POS TTs
Total TTs	4996	2004
Total words	33,945	16,383
Average word count per TT (tokens)	10.03	7.56
Character count per TT	39.97	58.89

Table 3. An illustration of how a LX is constructed.

Basic Term	Synonym Word	Synonym Polarity	Dialect Word	Dialect Polarity
جيد Good +0.5	حلو رائع	+0.5 +1	زين روعه	+0.5 +0.5
سيئ Bad −0.5	رديء قبيح	−1 −0.5	شين تعيس	−0.5 −1

Table 4. Sentiments in TTs.

(i) Original TT	تعبنا من الانتظار و الواسطات الكثيره خربت علينا .. أستمتعوا بجمال الجو هالفتره
Translation in English	“We are tired of waiting, and the widespread nepotism has ruined things for us .. Enjoy the beautiful weather these days.”
(ii) Stemmed TT	تعب من انتظار و واسطه كثيرخرب علينا استمتع جمال جو فتره
(iii) Feature–sentiment association

Table 5. Samples of intensifier words.

English	Arabic
Very	جدا/وافر/كثير/واجد/وايد/عديد
Absolutely	طبعاً/من قلب/واضح/أكيد
Extremely	مره/بزياده/حيل

Table 6. Sample of TTs containing intensification.

TTs	Translation in English
حددوا ماهي مسببات رفضكم أعداد مره كثيره من المتقدمين للوظيفه	What are the reasons for rejecting so very many of the job applicants?
عيال ديرتنا حيل فازعين لعمل الخير	Residents of our town have a very strong desire to do good.

Table 7. Emojis partial collection.

Label	Emoji	Label	Emoji
VP		N
P		VN

Table 8. TT sample comprising emoji.

TT	هالفرصه ياعيال الوطن و شبابه لا تفوتونها خلك على علم بالسعوده الرهيبه واغتنم الفرص لتسمو بوطنك و تعرف على حقك كمواطن☝
Translation in English	Gain knowledge about the great Saudi scheme, take advantage of the opportunity to promote your country, and gain a better understanding of your privileges. This is a fantastic offer. Don’t be hesitant.
Annotation	POS TT by all annotators
Emoji

Table 9. A POS intention supplication.

TT	يارب السعاده والتوفيق و البشارات التي تسر و سخر لنا الناس الصالحه الذين يتفانون لايجاد الحلول لمشاكلنا و نجد وظائف
Translation	Oh God, bring us joy, unify us, grant us positive news, and let honest individuals strive hard to fix our issues by hiring us.
Annotation	POS-TT by 2 annotators and a NEG-TT by 1 annotator
Source	Individual expression
Distinct phrases	يا رب السعاده/التوفيق/البشارات/سخر لنا الناس الصالحه Ya rab alsaada/altawfik/albasharat/sakhar lana alnas alsaliha

Table 10. Negative intention supplication.

TT	الله يزيل كل المعوقات .. الين متى و حنا مالنا قيمه؟ ?
Translation	Oh God, remove all obstacles. Until when will we remain devalued?
Annotation	NEG-TTs by all annotators
Source	Individual expression
Distinct phrases	الله يبهدلهم

Table 11. Sample of a usual collection of supplications.

POS Sentiment Supplication	NEG Sentiment Supplication
الله يوفقك God help you	حسبي الله و نعم الوكيل God is sufficient for me, and He is the best guardian
بارك الله فيك God bless you	أعوذ بالله I seek refuge in God

Table 12. POS and negative proverbs.

POS Sentiment Proverbs	Translation in English	NEG Sentiment Proverbs	Translation in English
من صبر نال	Person who is patient will be the winner	لامن شاف و لا من درى	Uncensored

Table 13. TT containing a proverb.

TT	تعبت من الكلام ولا حياه لمن تنادي
Translation in English	We’re fed up with begging and shouting for help, but nobody seems to care.
Annotation	NEG-TTs by all annotators
Distinct phrases	ولا حياه لمن تنادي wala hiah liman tanadi

Table 14. TT including an interjection.

TT	ما اسباب عدم الاكتراث لمطالبنا و هي نفسها منذ سنوات. . إلى متى ؟ ؟
Translation in English	What are the reasons for ignoring our demands? They have been the same for years. Until when?
Annotation	NEG-TTs by all annotators
Distinct phrases	إلى متى/؟.

Table 15. Findings of multifactor LX-based sentiment analysis of social-media posts in dialectical Arabic.

Method	Average Accuracy	Average F-SC
LX-based	84.34%	76.47%
LX-based + polarity	88.94%	82.14%
LX-based + light stemming	88.99%	81.16%
LX-based + negation	79.53%	57.70%
LX-based + intensification words	86.37%	77.53%
LX-based + emoji	82.63%	48.70%
LX-based + special phrases	85.39%	76.99%
All improvement strategies (LX-based + light stemming + polarity + negation + emoji + intensification words)	89.80%	86.32%

Table 16. Findings from Al-Twairesh and Aldayel corpora using our LX-based technique.

Research	Corpus	Domain	Accuracy	F-SC	Precision	Recall
Aldayel [46]	1103 TTs	Multidomain (social issues)	78.22%	77.64%	75.49%	77.02%
Al-Twairesh [22]	4700 TTs	Multidomain	78.61%	62.94%	57.30%	63.17%
MULDASA	7000 TTs	Specific domain (unemployment)	89.80%	84.70%	86%	86.65%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

