Article

Linguistic Features and Bi-LSTM for Identification of Fake News

Attar Ahmed Ali, Shahzad Latif, Sajjad A. Ghauri, Oh-Young Song, Aaqif Afzaal Abbasi and Arif Jamal Malik

1. Computer Science Department, Shaheed Zulfikar Ali Bhutto Institute of Science and Technology, Islamabad 44000, Pakistan
2. School of Engineering & Applied Sciences, ISRA University, Islamabad 44000, Pakistan
3. Software Department, Sejong University, Seoul 05006, Republic of Korea
4. Department of Software Engineering, Foundation University Islamabad, Islamabad 44000, Pakistan
* Author to whom correspondence should be addressed.
Electronics 2023, 12(13), 2942; https://doi.org/10.3390/electronics12132942
Submission received: 30 April 2023 / Revised: 21 June 2023 / Accepted: 26 June 2023 / Published: 4 July 2023
(This article belongs to the Special Issue Applications of Deep Learning: Emerging Technologies and Challenges)

Abstract

With the spread of Internet technologies, the use of social media has increased exponentially. Although social media has many benefits, it has become a primary source of disinformation or fake news. The spread of fake news is creating many societal and economic issues, so it has become critical to develop effective methods to detect fake news so that it can be stopped, removed or flagged before spreading. To address the challenge of accurately detecting fake news, this paper proposes a solution called Statistical Word Embedding over Linguistic Features via Deep Learning (SWELDL Fake), which utilizes deep learning techniques to improve accuracy. The proposed model applies a statistical method, principal component analysis (PCA), to textual representations of fake news to identify significant features that can help identify fake news. In addition, word embedding is employed to capture linguistic features, and Bidirectional Long Short-Term Memory (Bi-LSTM) is utilized to classify news as true or fake. We validated our proposed model on a benchmark dataset, called SWELDL Fake, which contains about 72,000 news articles collected from different benchmark datasets. Our model achieved a classification accuracy of 98.52% on fake news, surpassing the performance of state-of-the-art deep learning and machine learning models.

1. Introduction

Social media usage has increased significantly, with more than 2.7 billion active Facebook users worldwide [1]. Individuals are more likely to read the news on social networking sites than in electronic media, since broadcasting news to a large audience is straightforward on social media. Moreover, social media platforms impose fewer censorship requirements on the news broadcast on their sites. On the other hand, these same features also enable the spread of fake news, which is propagated intentionally and contains verifiably false information, among Internet users. Studies show that fake news spreads significantly faster and farther, to a broader audience, than accurate news [2].
Fake news takes various forms, such as rumors and clickbait, i.e., shocking news stories meant to create controversy and boost ad revenue. In addition, there is propaganda: deliberately misleading or deceptive articles designed to advance the writer’s agenda. According to a survey [3], three out of five U.S. citizens think fake news hurts financial decision-making. For example, a fake news story stating that President Obama was injured in a bomb blast at the White House caused a fall of 140 points in the stock market within 6 min, wiping out USD 136.5 billion of the S&P 500 market cap [4]. The authors of [4] proposed a theory-driven framework for diagnosing fake news and evaluated it through analyses of two large datasets.
A statistical study states that 23% of American citizens admitted that they had fallen victim to fake news stories and had shared them on their social media profiles [5]. Another survey conducted in 2017 shows that 67% of American citizens read news on social sites [5]. The fast, real-time spread of fake news has made it increasingly challenging to identify it promptly. People are already taking steps to combat fake news, for example through online fact-checking sites. These sites use experts who manually check suspected fake news, judging its truthfulness based on their experience [4].
In recent years, machine learning (ML) methods for identification and classification have grown rapidly and have been applied in numerous disciplines [6,7,8]. Recurrent neural networks identify fake news by classifying the sequence of social media communications linked to news stories [9]. Detection results and detection time have been improved by combining dimensionality reduction (DR) and deep learning (DL) approaches. The accuracy of word embedding over linguistic features is improving, and the drawbacks of biased classification are diminishing.
To address this problem, the proposed model will utilize a combination of techniques. First, a statistical method, Principal Component Analysis (PCA), will be used on textual representations of fake news to identify essential features. Then, word embedding will be employed to understand linguistic features and Bi-LSTM for news classification.

1.1. Contribution of the Article

This research introduces techniques such as statistical word embedding, deep learning and PCA to enhance fake news detection accuracy. The proposed method is then evaluated against existing techniques to provide a comparative analysis of its performance. The research contributions are as follows:
  • Statistical Word Embedding over Linguistic Features via Deep Learning (SWELDL Fake): The proposed model utilizes statistical word embedding techniques combined with deep learning to enhance the classification accuracy of fake news. This method proposes that linguistic characteristics be utilized to generate word embeddings, which are then integrated into a deep learning architecture to improve the classification model’s performance.
  • The proposed model incorporates principal component analysis on the textual representations of fake news. PCA is a dimensionality-reduction technique that identifies the data’s most significant features or components.
  • The proposed method is evaluated and compared with existing state-of-the-art techniques in fake news detection. By conducting such comparisons, the research aims to demonstrate the effectiveness and superiority of the proposed model in terms of classification accuracy.

1.2. Organization of the Article

The article is structured as follows. Section 2 provides an overview of existing work on fake news detection; it explores the different machine learning algorithms that have been utilized, along with their associated features, and thereby establishes a foundation for the proposed fake news detection framework. Section 3 presents the proposed framework in detail. Section 4 reports the simulation results and conducts a comparative analysis to evaluate the framework’s performance, including quantitative measures, performance metrics and insights derived from the simulations. Section 5 summarizes the main findings and conclusions drawn from the research and outlines potential future directions for research and improvements in fake news detection.

2. Related Work

In [10], the performance of twenty-three supervised machine learning models was evaluated; the results found that the decision tree outperformed the rest, with the significant limitation of requiring a large dataset. Gradient boosting, multinomial Naive Bayes, Decision Tree (DT), Logistic Regression, Random Forest (RF) and linear support vector machine (SVM) were evaluated in [11] for identifying fake news on the basis of text features from news articles.
Term frequency–inverse document frequency and probabilistic context-free grammar features are used to classify reliable news sources in [12]. After labeling, the authors trained different models, namely SVM, stochastic gradient descent (SGD), DT and RF, on additionally pre-processed data. SVM with an RBF kernel, K-NN, RF, Naive Bayes and XGBoost are the classifiers utilized to analyze discriminative features from diverse news sources and content in [13].
An ML method for identifying fake news was proposed using features extracted from news and social media content, but with low accuracy compared to baseline algorithms [14]. The authors in [15] presented a model that combines paralinguistic features, TF-IDF, text features and sentiment-related features with five different machine learning algorithms to identify fake news. To detect fake news in different languages, researchers in [16] proposed a model that trains KNN, RF, NB and SVM on five datasets covering three different languages.
In [17], the authors proposed a method based on sentiment analysis of textual data and user data containing user profile characteristics such as gender, age, follower count and the number of replies per post. The model proposed in [18] classified fake news using user credibility derived from different feature sets, i.e., text features (number of words, question marks and characters per word), user features (account verification status, follower count, tweets and account creation date) and message-based features (root node, propagation path length).
An image caption-based strategy was proposed to improve the model’s capacity to extract semantic information from pictures [19]. To bridge the semantic gap between language and visual concepts, the authors first add image description information to the text. Furthermore, they combine global and object-level information from the images into the final representation to maximize image usage and improve the semantic interaction between images and text.
Some systems successfully categorize news stories as authentic or fraudulent using document embeddings. Machine learning and natural language processing approaches are crucial for practical tools to identify fake news. In [20], various architectures for binary or multi-labeled classification-based fake news detection are discussed.
As indicated in [21], social scientists may analyze false news to determine its precise components using an efficient method that incorporates NLP and latent semantic analysis (LSA) via singular value decomposition (SVD). The authors also investigate the distinctions between authentic and fraudulent news. The effectiveness was demonstrated using a real-world case from the 2016 U.S. presidential election campaign.
In [22], a scientometric analysis of 569 documents from the Scopus database (2012–2022) was conducted. Authorship, collaboration patterns, bibliographic coupling and productivity patterns are examined as general research trends.
To stop the spread of misinformation, the authors of [23] offer an automated system for identifying fake news in which a Multilayer Perceptron classifies fake news over the joint representation. The extracted features are combined using multimodal factorized bilinear pooling to increase their correlation and provide a more precise shared representation.
Table 1 presents the strengths and limitations of related papers on fake news detection. This comparative analysis makes it easier to understand past developments.

3. Proposed Model

SWELDL Fake extracts linguistic features from the dataset and then applies different embedding techniques over these linguistic features to classify fake news, as shown in Figure 1. We employed count vectorization and term frequency-inverse document frequency (TF-IDF) techniques to process and analyze the textual data. Using count vectorization, we converted the text into a numerical representation capturing the frequency of occurrence of each word in a document; this yields a feature matrix that can be used for training machine learning models. Additionally, the TF-IDF method is used, which considers both the term frequency (TF) and the inverse document frequency (IDF) to assign weights to each word in the corpus. By incorporating TF-IDF, we emphasize distinctive and informative terms while downplaying commonly occurring words. These techniques provide a foundation for further analysis and model training to extract meaningful insights from the text. Table 2 lists the symbols used in the LSTM equations.
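To make this step concrete, the sketch below shows how count vectorization and TF-IDF could be applied with scikit-learn; the toy headlines and vectorizer settings are illustrative assumptions, not the exact configuration used in our experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two hypothetical headlines standing in for pre-processed news articles.
corpus = [
    "breaking president injured in blast at the white house",
    "central bank keeps interest rates unchanged this quarter",
]

# Count vectorization: one row per document, one column per vocabulary term,
# holding raw term frequencies.
count_vec = CountVectorizer(stop_words="english")
X_counts = count_vec.fit_transform(corpus)

# TF-IDF: the same vocabulary, but terms occurring in many documents are
# down-weighted, emphasizing distinctive and informative words.
tfidf_vec = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf_vec.fit_transform(corpus)

print(count_vec.get_feature_names_out())
print(X_counts.toarray())
print(X_tfidf.toarray().round(2))
```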

3.1. Pre-Processing

Pre-processing is crucial in detecting fake news, ensuring the data are cleaned by removing repeated words, stop words, URLs, special characters and non-ASCII characters. Techniques like tokenization divide the content into smaller units such as words or symbols and remove punctuation marks. Stop word removal improves the precision of the results by eliminating non-essential words. These pre-processing steps enhance the model’s accuracy and performance in identifying fake news.
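A minimal cleaning sketch of these steps is given below. The paper does not name a specific library; NLTK is used here purely as one possible choice, and the regular expressions are assumptions.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer data and the English stop-word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Clean and tokenize one news article."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # remove URLs
    text = text.encode("ascii", "ignore").decode()       # drop non-ASCII characters
    text = re.sub(r"[^a-z\s]", " ", text)                 # strip punctuation and special characters
    tokens = word_tokenize(text)                          # tokenization
    return [t for t in tokens if t not in STOP_WORDS]     # stop-word removal

print(preprocess("BREAKING!!! Shocking claims, read more at http://example.com"))
```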

3.2. Principal Component Analysis (PCA)

Once the data are preprocessed, a dimension-reduction technique, namely principal component analysis (PCA), is applied to reduce the number of features further. PCA effectively reduces the dimensionality of the data with only a minor trade-off in accuracy. The first step is data standardization (scaling); the covariance matrix is then computed to identify correlated features. Eigenvalues and eigenvectors are computed from the covariance matrix to determine the principal components. The eigenvector with the highest eigenvalue is taken as the first principal component, containing the maximum useful information, and all components are sorted in descending order of their eigenvalues. Equation (1) is used for data standardization, where $A$ is the variable value, $\bar{A}$ is the mean and $\delta$ is the standard deviation.
$S_t = \dfrac{A - \bar{A}}{\delta}$    (1)
Equation (2) is used to find the covariance matrix, where $\bar{x}$ and $\bar{y}$ denote the means of $X$ and $Y$, respectively, and $n$ is the number of data points.
$\mathrm{COV} = \dfrac{\sum (X - \bar{x})(Y - \bar{y})}{n - 1}$    (2)
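The PCA steps above can be sketched directly in NumPy; the toy matrix and the number of retained components are assumptions for illustration (in practice, scikit-learn's PCA performs the equivalent computation).

```python
import numpy as np

# Toy feature matrix standing in for the textual feature representation
# (100 articles, 10 features).
X = np.random.default_rng(0).normal(size=(100, 10))

X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # Equation (1): standardization
cov = np.cov(X_std, rowvar=False)                 # Equation (2): covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]                 # sort components by descending eigenvalue
k = 5                                             # number of retained components (assumed)
components = eigvecs[:, order[:k]]
X_reduced = X_std @ components                    # project data onto the top-k components

print(X_reduced.shape)                            # (100, 5)
```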

3.3. Embedding Layers

Pre-trained word embeddings such as GloVe and Word2Vec enhance classification accuracy and precision in fake news identification by capturing semantic relationships learned from large datasets. These embeddings use statistical approaches to analyze and represent word meanings, improving the understanding of word representations and their contextual significance. In this paper, we used Word2Vec embeddings because their results were better than GloVe’s. Our proposed model focuses solely on text features, without considering additional factors such as time stamps or propagation features.
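As a sketch, Word2Vec vectors can be trained with gensim and assembled into an embedding matrix for a downstream embedding layer; the corpus, vector size and other parameters below are illustrative assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# Tiny stand-in corpus; in practice this would be the pre-processed news articles.
tokenized_docs = [
    ["president", "injured", "blast", "white", "house"],
    ["bank", "keeps", "interest", "rates", "unchanged"],
]

# Train Word2Vec; vector_size, window and min_count are illustrative values.
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1)

# Build an embedding matrix that an embedding layer can be initialized with
# (index 0 is reserved for padding).
vocab = {word: i + 1 for i, word in enumerate(w2v.wv.index_to_key)}
embedding_matrix = np.zeros((len(vocab) + 1, w2v.vector_size))
for word, idx in vocab.items():
    embedding_matrix[idx] = w2v.wv[word]

print(embedding_matrix.shape)
```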

3.4. Bidirectional Long Short-Term Memory (Bi-LSTM)

The term “Long Short-Term Memory” (LSTM) refers to the specific architecture of the recurrent neural network (RNN) designed to address the problem of capturing long-term dependencies in sequential data. The name highlights the LSTM’s ability to retain important information over extended periods while also selectively discarding irrelevant information in the short term. LSTM can be utilized for fake news detection by treating it as a binary classification problem, where the goal is to classify a given news article as either fake or genuine.
An LSTM is organized into memory blocks, recurrent hidden units containing self-connected memory cells. Every memory cell in the LSTM includes several gates, namely input, output and forget gates. The LSTM cell forms the hidden layer of the Long Short-Term Memory network. By gating each memory cell, the LSTM can handle long-term dependencies. The LSTM cell’s architecture is shown in Figure 2.
A Bidirectional Long Short-Term Memory (Bi-LSTM) model is a type of recurrent neural network (RNN) architecture commonly used for sequential data processing, including tasks such as natural language processing (NLP). The Bi-LSTM model extends the traditional LSTM model by considering information from both past and future time steps in the input sequence. In a Bi-LSTM, the input sequence is processed in two directions: it is fed into two separate LSTM networks, one processing the sequence in the original order (forward LSTM) and the other processing it in reverse order (backward LSTM). The outputs of the forward and backward LSTMs are then concatenated at each time step.
The Bi-LSTM gates and state equations are represented as:
  • Cell State:
    $c_t = f_t c_{t-1} + i_t \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$    (3)
  • Input Gate:
    $i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$    (4)
  • Output Gate:
    $o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$    (5)
  • Hidden State:
    $h_t = o_t \tanh(c_t)$    (6)
  • Forget Gate:
    $f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$    (7)
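As a didactic sketch (not the trained model), a single LSTM cell update can be written directly from Equations (3)–(7); the weight shapes and the toy dimensions below are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Equations (3)-(7)."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])   # input gate, Eq. (4)
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])   # forget gate, Eq. (7)
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])   # output gate, Eq. (5)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])  # cell state, Eq. (3)
    h_t = o_t * np.tanh(c_t)                                    # hidden state, Eq. (6)
    return h_t, c_t

# Toy dimensions: input size d = 4, hidden size n = 3, random weights for illustration.
rng = np.random.default_rng(0)
d, n = 4, 3
W = {k: rng.normal(size=(n, d if k.endswith("x") else n))
     for k in ("ix", "ih", "fx", "fh", "ox", "oh", "cx", "ch")}
b = {k: np.zeros(n) for k in ("i", "f", "o", "c")}
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, b)
print(h_t, c_t)
```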
The Bi-LSTM architecture has shown promising results in various NLP tasks, including sentiment analysis, named entity recognition and machine translation. The mathematical model of a Bi-LSTM involves the equations for both the forward and backward LSTMs and the concatenation of their outputs. The backward LSTM equations are defined similarly but with different weight matrices and bias terms. The outcomes of the forward and backward LSTMs at each time step are concatenated to obtain the final result of the Bi-LSTM model. The model parameters (weights and biases) are learned through the training process using techniques such as backpropagation through time (BPTT) and gradient descent. Its ability to capture information from both past and future contexts makes it suitable for tasks where the surrounding context is essential for understanding the input sequence.
Bi-LSTM cells feature several layers at each time step $t$, including an input layer $x_t$, an output layer $h_t$ and a hidden layer $h_{t-1}$. Every cell shares some states with other cells during training and parameter updates, as shown in Figure 3.
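A minimal Keras sketch of such a Bi-LSTM classifier is shown below; the layer sizes, sequence length and vocabulary size are assumed values, and the embedding layer could be initialized with the Word2Vec matrix built earlier rather than trained from scratch.

```python
from tensorflow.keras import layers, models

MAX_LEN = 300        # assumed maximum padded sequence length
VOCAB_SIZE = 50000   # assumed vocabulary size
EMB_DIM = 100        # Word2Vec vector size

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),        # may be seeded with pre-trained Word2Vec weights
    layers.Bidirectional(layers.LSTM(64)),        # forward and backward LSTM outputs concatenated
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # binary output: fake vs. true
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```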

3.5. Evaluation Metrics

The commonly used performance metrics for evaluating classification models include accuracy, precision, recall (sensitivity), and F-score. The accuracy is the proportion of correctly predicted instances (both true positives and negatives) to the total number of instances in the dataset as depicted in Equation (8).
$\mathrm{Accuracy} = \dfrac{tp + tn}{tp + tn + fp + fn}$    (8)
where $tp$ and $tn$ are the true positives and true negatives, respectively, and $fp$ and $fn$ are the false positives and false negatives. Accuracy provides an overall measure of how well the model predicts the correct class labels. However, it may not be the most suitable metric for imbalanced datasets, where the classes have significantly different frequencies.
Precision is defined as the proportion of correctly predicted positive instances to the total number of predicted positive instances, expressed in Equation (9):
$\mathrm{Precision} = \dfrac{tp}{tp + fp}$    (9)
Precision focuses on the quality of positive predictions and indicates how often the model is correct when it predicts a positive instance. A high precision value indicates a low rate of false positives. Recall measures the proportion of correctly predicted positive instances (true positives) to the total number of actual positive instances (true positives and false negatives). It is calculated in Equation (10):
$\mathrm{Recall} = \dfrac{tp}{tp + fn}$    (10)
Recall quantifies the model’s ability to identify positive instances correctly. A high recall value indicates a low rate of false negatives. The F-score (also called F1-score) is a weighted harmonic mean of precision and recall. It provides a balanced measure of precision and recall. The F-score is calculated as follows in Equation (11):
$\mathrm{F\text{-}Score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$    (11)
The F-score combines precision and recall into a single metric, giving equal importance to both. It is useful when there is a trade-off between precision and recall and both aspects need to be considered simultaneously.
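For a quick worked example, the four metrics in Equations (8)–(11) can be computed from hypothetical confusion-matrix counts (these counts are illustrative, not the paper's results).

```python
# Hypothetical confusion-matrix counts.
tp, tn, fp, fn = 90, 85, 10, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)               # Eq. (8)
precision = tp / (tp + fp)                                # Eq. (9)
recall = tp / (tp + fn)                                   # Eq. (10)
f_score = 2 * precision * recall / (precision + recall)   # Eq. (11)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f_score:.3f}")
```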

4. Simulation Results and Discussion

In this section, the performance of various machine learning methods and our proposed approach is tested for the Fake News dataset. The major goal of this research is to verify our suggested model’s validity and to describe how it works in detail. The dataset split for the training and testing/validation is 75% and 25%, respectively.
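The 75/25 split could be produced as in the sketch below, where the placeholder texts and labels stand in for the SWELDL Fake articles and the label coding is an assumption.

```python
from sklearn.model_selection import train_test_split

texts = [f"placeholder article {i}" for i in range(8)]   # stand-ins for news articles
labels = [0, 1, 0, 1, 0, 1, 0, 1]                         # 0 = true, 1 = fake (assumed coding)

# 75% training, 25% testing, stratified so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)
print(len(X_train), len(X_test))   # 6 2
```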

4.1. Dataset Description

The SWELDL Fake dataset is chosen to validate our proposed model because it is the most diverse dataset, a composite of several benchmark datasets including Kaggle, McIntire, Reuters and BuzzFeed Political. It has 72,134 news stories, 35,028 of which are true and 37,106 fake, as shown in Table 3. A comparison of the SWELDL Fake linguistic characteristics with other state-of-the-art datasets is shown in Table 4.
The various datasets containing Real and Fake news are shown in Table 3. The SWELDL Fake dataset contains 35,028 real news and 37,106 fake news items.
The following language characteristics make up the open dataset:
  • The readability index measures the text’s complexity (readability difficulties) using word length, syllable count and sentence length.
  • Emotions, actions, persona and cognition are all described by psycho-linguistic characteristics.
  • Stylistic elements describe a sentence’s style.
  • User credibility characteristics explain the information provided by users.
  • Quantity characteristics describe information in phrases.

4.2. Data Analysis

The system’s specifications are a Windows 10 OS with a 3.20 GHz CPU and 16 GB of RAM. The proposed model comprises several components, each trained on an NVIDIA TITAN X GPU. Figure 4 shows the class distribution of the SWELDL Fake dataset, which is relatively balanced: it contains an almost equal number of fake and true news items. This balance is essential for avoiding biased classification, a significant issue in detecting fake news. Figure 5 shows the size of frequently occurring words.
In our approach, we utilized textual data for analysis, and all the features from the textual data were initially embedded using deep learning methods. The term frequency-inverse document frequency (TF-IDF) and count vectorization techniques pre-process the textual data effectively by extracting informative features and achieving improved results, as shown in Table 5. TF-IDF and count vectorization captured essential information from the text, leading to favorable outcomes when paired with the ML methods.

4.3. Evaluation of Proposed Model

The performance is assessed using the SWELDL Fake dataset, and the evaluation metrics, including accuracy, precision, recall, and F1-score, all surpass the threshold of 98% as shown in Table 6.

4.4. Machine Learning Algorithms’ Accuracy

The performance of the machine learning algorithms on the SWELDL Fake dataset is shown in Figure 6. The highest accuracy score is 95.46% for the logistic regression-based model. The other machine learning algorithms have lower but comparable accuracies, i.e., 95.36%, 94.4%, 93.52% and 87.91% for Stochastic Gradient Descent (SGD), Random Forest (RF), Decision Tree (DT) and Multinomial Naive Bayes (MNB), respectively. MNB and NB have approximately equal accuracy on the SWELDL Fake dataset. The worst accuracy, 62.22%, is obtained by K-nearest neighbor (KNN), due to the high-dimensional feature space, complex relationships and non-linear decision boundaries.
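A sketch of how such a baseline comparison could be run with scikit-learn is given below; it reuses the illustrative split sketched earlier, the hyperparameters are defaults or assumptions, and the scores printed on toy data will not reproduce the reported accuracies.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

baselines = {
    "LR": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "DT": DecisionTreeClassifier(),
    "MNB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

for name, clf in baselines.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)   # TF-IDF features + classifier
    pipe.fit(X_train, y_train)                     # X_train/y_train from the split above
    print(name, pipe.score(X_test, y_test))        # held-out accuracy
```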

4.5. Deep Learning Algorithms’ Accuracy

In Figure 7, a comprehensive comparison of the performance of various deep learning algorithms is presented, specifically focusing on their performance on the SWELDL Fake dataset. Among the evaluated models, the proposed Bi-LSTM model stands out, achieving an impressive accuracy of nearly 98%. In comparison, the chi-square LSTM, RNN, LDA-RNN and NMF-RNN models demonstrate comparatively lower accuracies. This analysis clearly indicates that the Bi-LSTM model outperforms these alternative algorithms in terms of classification accuracy on the SWELDL Fake dataset.

4.6. Comparative Text Classification

In Figure 8, a detailed comparison of the SWELDL Fake model’s accuracy and F1-score with two widely used algorithms, CNN and BERT, is presented. The SWELDL Fake model demonstrates superior performance, achieving a maximum accuracy of 96.73%. In comparison, the CNN and BERT models attain accuracies of 92.48% and 93.79%, respectively. Furthermore, the SWELDL Fake model incorporates various linguistic features, including text structure, syntax, sentiment, grammar and readability evidence. This comprehensive approach contributes to its high accuracy across different aspects of fake news detection. Following the SWELDL Fake model, the BERT model performs relatively well, while the CNN model lags behind in terms of accuracy. Overall, the results suggest that the SWELDL Fake model outperforms both CNN and BERT in accurately classifying fake news. Its comprehensive feature extraction and analysis contribute to its superior performance, making it a promising approach for fake news detection.

4.7. Comparative Analysis

The comparison of the proposed method is presented in Table 7. The authors of [37] examined fake news identification using the Kaggle-EXT dataset of 25,200 items, without linguistic features, with a maximum accuracy of 92%. In [38], linguistic features were used to identify fake news from BuzzFeed and Politifact, with an accuracy of 87.8% for Politifact and 86.4% for BuzzFeed.
In [36], an unbiased dataset consisting of 3404 articles (2004 factual and 1400 fake news items) was employed, achieving an accuracy of up to 95%. The model proposed in [18], with 72,000 news articles, obtained an accuracy of 96.73%.
Our proposed model is evaluated against the study in [18], as it detects fake news with the highest accuracy among the recently published approaches [18,37,38]. The proposed model uses linguistic features and Bi-LSTM and achieves an accuracy of 98.52%, an improvement of 1.78% over the WELFake model. The proposed accuracy is also 3.4% better than the BERT model and 6.35% better than the convolutional neural network model.

4.8. Discussion on Results

One of the primary objectives of this study was to evaluate whether deep learning (DL) models could improve results on datasets where machine learning (ML) models had already achieved their best performance. The findings indicate that while ML models can produce excellent results, even when combined with other ML techniques to improve accuracy further, there may be a point where accuracy plateaus. In such cases, it is advisable to explore DL methods to overcome this bottleneck and potentially achieve higher accuracy. However, it is important to note that the proposed algorithm might not yield satisfactory results on multi-class datasets, as it was not designed for that purpose. Its effectiveness is geared towards binary classification datasets, such as the ISOT dataset. Therefore, when dealing with multi-class datasets, alternative DL approaches specifically tailored for multi-class classification should be considered to achieve optimal results. It is essential to match the characteristics of the dataset and the problem at hand with the appropriate algorithm to ensure accurate and reliable predictions. The proposed algorithm achieved a high accuracy of around 99% on a binary classification dataset; however, when tested on the LIAR dataset, which is a multi-class dataset, the accuracy dropped to less than 70%.

5. Conclusions

The rapid growth of social media has led to the widespread dissemination of fake news, resulting in significant societal and economic consequences. To address this issue, this research paper introduces SWELDL Fake, a proposed solution that leverages deep learning techniques to detect fake news with improved accuracy. The model incorporates statistical word embedding, principal component analysis and Bi-LSTM for classification. Through experimentation on the SWELDL Fake benchmark dataset, SWELDL Fake achieves an impressive classification accuracy of 98.52%, surpassing existing models in deep learning and machine learning. This research presents a promising approach to effectively identify and combat fake news in the era of social media.
In the future, other essential features, such as user context and social context data, could be incorporated alongside the proposed text classification model to further improve fake news detection.

Author Contributions

Conceptualization, A.A.A. (Attar Ahmed Ali); Methodology, A.A.A. (Attar Ahmed Ali); Software, A.A.A. (Attar Ahmed Ali) and S.A.G.; Validation, S.L.; Formal analysis, A.A.A. (Attar Ahmed Ali), S.L. and O.-Y.S.; Investigation, A.A.A. (Attar Ahmed Ali), S.L., S.A.G., A.A.A. (Aaqif Afzaal Abbasi) and A.J.M.; Resources, S.A.G., O.-Y.S. and A.J.M.; Writing—original draft, A.A.A. (Attar Ahmed Ali) and S.A.G.; Writing—review & editing, S.L., O.-Y.S., A.A.A. (Aaqif Afzaal Abbasi) and A.J.M.; Visualization, A.A.A. (Aaqif Afzaal Abbasi) and A.J.M.; Supervision, S.L.; Project administration, A.A.A. (Aaqif Afzaal Abbasi); Funding acquisition, S.A.G. and O.-Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Ministry of Trade, Industry and Energy (MOTIE) and the Korea Institute for Advancement of Technology (KIAT) through the International Cooperative RD program, (Project No. P0016038) an Institute of Information and Communications Technology Planning Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-01188, Non-face-to-face Companion Plant Sales Support System Providing Realistic Experience) and the MSIT (Ministry of Science and ICT), Republic of Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-RS-2022-00156354) supervised by the IITP (Institute for Information Communications Technology Planning and Evaluation) and the faculty research fund of Sejong University in 2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dixon, S. Number of monthly active Facebook users worldwide as of 2nd quarter 2022. Posjećeno 2022, 9, 2022. [Google Scholar]
  2. Siddiqui, S.; Singh, T. Social media its impact with positive and negative aspects. Int. J. Comput. Appl. Technol. Res. 2016, 5, 71–75. [Google Scholar] [CrossRef]
  3. Schiavone, J.; Lynch, J. Fake Financial News Is a Real Threat to Majority of Americans: New AICPA Survey. 2017. Available online: https://www.aicpa.org/press/pressreleases/2017/fake-financial-news-is-a-real-threatto-majority-of-americans-newaicpa-survey (accessed on 21 December 2022).
  4. Zhou, X.; Jain, A.; Phoha, V.V.; Zafarani, R. Fake news early detection: A theory-driven model. Digit. Threat. Res. Pract. 2020, 1, 1–25. [Google Scholar] [CrossRef]
  5. Shearer, E.; Gottfried, J. News use across social media platforms 2017. 2017. [Google Scholar]
  6. Fatima, M.; Ghauri, S.; Mohammad, N.; Adeel, H.; Sarfraz, M. Machine Learning for Masked Face Recognition in COVID-19 Pandemic Situation. Math. Model. Eng. Probl. 2022, 9, 283–289. [Google Scholar] [CrossRef]
  7. Shah, S.I.H.; Alam, S.; Ghauri, S.A.; Hussain, A.; Ansari, F.A. A novel hybrid cuckoo search-extreme learning machine approach for modulation classification. IEEE Access 2019, 7, 90525–90537. [Google Scholar] [CrossRef]
  8. Ghauri, S.A. KNN based classification of digital modulated signals. IIUM Eng. J. 2016, 17, 71–82. [Google Scholar] [CrossRef]
  9. Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.J.; Wong, K.F.; Cha, M. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016. [Google Scholar]
  10. Ozbay, F.A.; Alatas, B. Fake news detection within online social media using supervised artificial intelligence algorithms. Phys. A Stat. Mech. Its Appl. 2020, 540, 123174. [Google Scholar] [CrossRef]
  11. Kaliyar, R.K.; Goswami, A.; Narang, P. Multiclass fake news detection using ensemble machine learning. In Proceedings of the 2019 IEEE 9th International Conference on Advanced Computing (IACC), Tiruchirappalli, India, 13–14 December 2019; pp. 103–107. [Google Scholar]
  12. Gilda, S. Notice of Violation of IEEE Publication Principles: Evaluating machine learning algorithms for fake news detection. In Proceedings of the 2017 IEEE 15th Student Conference on Research and Development (SCOReD), Wilayah Persekutuan Putrajaya, Malaysia, 13–14 December 2017; pp. 110–115. [Google Scholar]
  13. Della Vedova, M.L.; Tacchini, E.; Moret, S.; Ballarin, G.; DiPierro, M.; De Alfaro, L. Automatic online fake news detection combining content and social signals. In Proceedings of the 2018 22nd Conference of Open Innovations Association (FRUCT), Jyvaskyla, Finland, 15–18 May 2018; pp. 272–279. [Google Scholar]
  14. Shabani, S.; Sokhn, M. Hybrid machine-crowd approach for fake news detection. In Proceedings of the 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), Philadelphia, PA, USA, 18–20 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 299–306. [Google Scholar]
  15. Faustini, P.H.A.; Covoes, T.F. Fake news detection in multiple platforms and languages. Expert Syst. Appl. 2020, 158, 113503. [Google Scholar] [CrossRef]
  16. Jiang, T.; Li, J.P.; Haq, A.U.; Saboor, A.; Ali, A. A novel stacking approach for accurate detection of fake news. IEEE Access 2021, 9, 22626–22639. [Google Scholar] [CrossRef]
  17. Castillo, C.; Mendoza, M.; Poblete, B. Information credibility on twitter. In Proceedings of the 20th International World Wide Web Conference, Hyderabad, India, 28 March–1 April 2011; pp. 675–684. [Google Scholar]
  18. Verma, P.K.; Agrawal, P.; Amorim, I.; Prodan, R. WELFake: Word embedding over linguistic features for fake news detection. IEEE Trans. Comput. Soc. Syst. 2021, 8, 881–893. [Google Scholar] [CrossRef]
  19. Liu, P.; Qian, W.; Xu, D.; Ren, B.; Cao, J. Multi-Modal Fake News Detection via Bridging the Gap between Modals. Entropy 2023, 25, 614. [Google Scholar] [CrossRef]
  20. Truică, C.O.; Apostol, E.S. It’s All in the Embedding! Fake News Detection Using Document Embeddings. Mathematics 2023, 11, 508. [Google Scholar] [CrossRef]
  21. Mayopu, R.G.; Wang, Y.Y.; Chen, L.S. Analyzing Online Fake News Using Latent Semantic Analysis: Case of USA Election Campaign. Big Data Cogn. Comput. 2023, 7, 81. [Google Scholar] [CrossRef]
  22. Dhiman, P.; Kaur, A.; Iwendi, C.; Mohan, S.K. A scientometric analysis of deep learning approaches for detecting fake news. Electronics 2023, 12, 948. [Google Scholar] [CrossRef]
  23. Nadeem, M.I.; Ahmed, K.; Li, D.; Zheng, Z.; Alkahtani, H.K.; Mostafa, S.M.; Mamyrbayev, O.; Abdel Hameed, H. EFND: A Semantic, Visual and Socially Augmented Deep Framework for Extreme Fake News Detection. Sustainability 2023, 15, 133. [Google Scholar] [CrossRef]
  24. Umer, M.; Imtiaz, Z.; Ullah, S.; Mehmood, A.; Choi, G.S.; On, B.W. Fake news stance detection using deep learning architecture (CNN-LSTM). IEEE Access 2020, 8, 156695–156706. [Google Scholar] [CrossRef]
  25. Ajao, O.; Bhowmik, D.; Zargari, S. Fake news identification on twitter with hybrid cnn and rnn models. In Proceedings of the 9th International Conference on Social Media and Society, Copenhagen, Denmark, 18–20 July 2018; pp. 226–230. [Google Scholar]
  26. Roy, A.; Basak, K.; Ekbal, A.; Bhattacharyya, P. A deep ensemble framework for fake news detection and classification. arXiv 2018, arXiv:1811.04670. [Google Scholar]
  27. Monti, F.; Frasca, F.; Eynard, D.; Mannion, D.; Bronstein, M.M. Fake news detection on social media using geometric deep learning. arXiv 2019, arXiv:1902.06673. [Google Scholar]
  28. Reis, J.C.; Correia, A.; Murai, F.; Veloso, A.; Benevenuto, F. Supervised learning for fake news detection. IEEE Intell. Syst. 2019, 34, 76–81. [Google Scholar] [CrossRef]
  29. Yuan, C.; Ma, Q.; Zhou, W.; Han, J.; Hu, S. Early detection of fake news by utilizing the credibility of news, publishers and users based on weakly supervised learning. arXiv 2020, arXiv:2012.04233. [Google Scholar]
  30. Liu, Y.; Wu, Y.F.B. Fned: A deep network for fake news early detection on social media. ACM Trans. Inf. Syst. 2020, 38, 1–33. [Google Scholar] [CrossRef]
  31. Li, M.; Clinton, G.; Miao, Y.; Gao, F. Short text classification via knowledge powered attention with similarity matrix based CNN. arXiv 2020, arXiv:2002.03350. [Google Scholar]
  32. Sun, C.; Qiu, X.; Xu, Y.; Huang, X. How to fine-tune bert for text classification? In Proceedings of the Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, 18–20 October 2019; Proceedings 18. Springer: Berlin/Heidelberg, Germany, 2019; pp. 194–206. [Google Scholar]
  33. Alrubaian, M.; Al-Qurishi, M.; Hassan, M.M.; Alamri, A. A credibility analysis system for assessing information on twitter. IEEE Trans. Dependable Secur. Comput. 2016, 15, 661–674. [Google Scholar] [CrossRef]
  34. Verma, Y. Complete Guide To Bidirectional LSTM (With Python Codes). 2021. Available online: https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/ (accessed on 9 February 2023).
  35. Gravanis, G.; Vakali, A.; Diamantaras, K.; Karadais, P. Behind the cues: A benchmarking study for fake news detection. Expert Syst. Appl. 2019, 128, 201–213. [Google Scholar] [CrossRef]
  36. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. Fakenewsnet: A data repository with news content, social context and spatiotemporal information for studying fake news on social media. Big Data 2020, 8, 171–188. [Google Scholar] [CrossRef]
  37. Ahmed, H.; Traore, I.; Saad, S. Detection of online fake news using n-gram analysis and machine learning techniques. In Proceedings of the Intelligent, Secure and Dependable Systems in Distributed and Cloud Environments: First International Conference, ISDDC 2017, Vancouver, BC, Canada, 26–28 October 2017; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 2017; pp. 127–138. [Google Scholar]
  38. Vicario, M.D.; Quattrociocchi, W.; Scala, A.; Zollo, F. Polarization and fake news: Early warning of potential misinformation targets. ACM Trans. Web 2019, 13, 1–22. [Google Scholar] [CrossRef]
  39. Verma, P.K.; Agrawal, P.; Prodan, R. WELFake Dataset for Fake News Detection in Text Data. 2021. Available online: https://zenodo.org/record/4561253 (accessed on 25 June 2023).
  40. Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 759–766. [Google Scholar]
Figure 1. Proposed SWELDL Fake System Model.
Figure 2. LSTM Model [34].
Figure 3. Bi-LSTM Model [34].
Figure 4. Balanced Fake and True News in the Dataset.
Figure 5. Size of Frequently Occurring Words in the Dataset.
Figure 6. Accuracy Comparison of various Machine Learning Algorithms.
Figure 7. Accuracy of Proposed Bi-LSTM on SWELDL Fake Dataset.
Figure 8. Accuracy for Various Dataset.
Table 1. Literature Review of Deep Learning Techniques.

Ref | Methodology | Dataset | Limitations
[10] | DT, LR, RF, CNN + LSTM | BuzzFeed, ISOT, Politics News data | High time complexity
[11] | RF, MN-Naive Bayes, GB, DT, LR & SVM | Kaggle Fakenews Dataset | Low accuracy
[12] | DT, GB, SVM, RF, SGD | - | Focused on pre-processing techniques
[24] | CNN | Fake News Challenge Dataset | High time complexity
[25] | CNN + Bi-LSTM | 1356 news articles | Comparatively low accuracy
[26] | LSTM + CNN | 58,000 tweets | Low accuracy
[9] | CNN + Bi-LSTM | Liar Dataset | Low accuracy
[27] | LSTM, tanh-RNN | Twitter and Weibo microblogs | Higher time complexity
[28] | Graph CNN | - | Only uses content data
[13] | SVM, K-NN, RF, Naive Bayes and XGBoost | 2282 BuzzFeed news articles | Only identified important features
[14] | LR | Real-world dataset | Low accuracy & small dataset
[15] | LR, SVM, RF, GB, Neural Networks | Querying Google | High computational cost
[16] | Naive Bayes (NB), KNN, SVM, RF | Btvlifestyl, FakeOrRealNews, FakeNewsData1, FakeBrCorpus, TwitterBr | Only uses news content features
[29] | CNN, LSTM, GRU, DT, RF, KNN, LR, SVM | ISOT, KDnugget | High computational complexity
[30] | CNN with SMAN attention mechanism | Weibo dataset, Twitter dataset (Twitter15 & Twitter16) | More time complexity
[31] | CNN + PU learning framework | Weibo dataset, Twitter dataset (Twitter15 & Twitter16) | Low accuracy
[32] | CNN | WELFAKE dataset | Low accuracy
[33] | BERT | WELFAKE dataset | Low accuracy
[17] | NB, DT, RF | 489,330 Twitter accounts | Low accuracy
[18] | Classification using user credibility | Twitter dataset | Low F1 score
Table 2. Abbreviation of LSTM Equations.

LSTM Equation Symbol | Symbol Description
$f_t$ | FG
$\sigma$ | LS
$o_t$ | OG
$h_t$ | OL
$h_{t-1}$ | HL
$i_t$ | IG
$c_{t-1}$ | PCOS
$c_t$ | OS
$T$ | NOI
$w_f$ | ESWF
$w_i$ | ISWF
$w_o$ | WF
$s_t$ | IS
$W_s$ | SWF
$b_i$ | ISIC
$b_s$, $b_o$ | SIC
$b_f$ | ICES
Table 3. Datasets containing Real and Fake News.

Dataset | Real News | Fake News
McIntire | 3171 | 3164
Reuters | 21,417 | 23,481
BuzzFeed Political | 53 | 48
SWELDL Fake | 35,028 | 37,106
Table 4. State-of-the-Art Comparison of Linguistic Features.

Linguistic Features | Benjamin Political News [35] | Behind the Cues [36] | Reuters [37] | Fake News Net [38] | Polarization & Fake News [39] | SWELDL Fake Dataset [40]
Readability index
Psycho-linguistic
Stylistic features
User credibility
Quantity features
Table 5. Features in ML & DL Techniques.

Technique | Features
DL Techniques | Word2Vec, Tokenize, BOW, PCA
ML Techniques | PCA, TF-IDF, Count Vectorize
Table 6. Evaluation Metrics.

Metric | SWELDL Fake Dataset
Accuracy | 98.52%
Precision | 98.63%
Recall | 98.89%
F1-Score | 98.75%
Table 7. Comparative Analysis.

Parameter | [37] | [38] | [36] | WELFake [18] | Proposed Model (SWELDL Fake)
Dataset | Kaggle | 1. Politifact; 2. Buzzfeed | 1. Kaggle; 2. Buzzfeed; 3. Politifact; 4. McIntire; 5. WELFake | 1. Kaggle; 2. Buzzfeed; 3. Reuters; 4. McIntire; 5. WELFake | 1. Kaggle; 2. Buzzfeed; 3. Reuters; 4. McIntire; 5. WELFake
Number of news articles | 25,200 | 1. 240; 2. 182 | 1. 23,340; 2. 240; 3. 182; 4. 6310; 5. 3404 | 1. 20,800; 2. 101; 3. 44,898; 4. 6335; 5. 72,134 | 1. 20,800; 2. 101; 3. 44,898; 4. 6335; 5. 72,134
Linguistic features | No | Yes | Yes | Yes | Yes
WE | TF-IDF | No | Word2Vec | CV | Word2Vec
Accuracy | 92% | 87.8% | 95.0% | 96.73% | 98.52%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
