Editorial

2 pages, 163 KiB

Open Access Editorial

Natural Language Processing: Recent Development and Applications

by Kuei-Hu Chang

Appl. Sci. 2023, 13(20), 11395; https://doi.org/10.3390/app132011395 - 17 Oct 2023

Cited by 1 | Viewed by 2336

Natural Language Processing (NLP) can be categorized into the subfields of artificial intelligence (AI) and linguistics [...]

(This article belongs to the Special Issue Natural Language Processing: Recent Development and Applications)

Research

23 pages, 848 KiB

Open Access Article

From Discourse Relations to Network Edges: A Network Theory Approach to Discourse Analysis

by Alexandros Tantos and Kosmas Kosmidis

Appl. Sci. 2023, 13(12), 6902; https://doi.org/10.3390/app13126902 - 07 Jun 2023

Cited by 1 | Viewed by 1332

Abstract

In this paper, we argue that discourse representations can be mapped to networks and analyzed with tools from network theory so that deep properties of discourse structure are revealed. Two discourse-annotated corpora, C58 and STAC, which belong to different discourse types and languages, were compared and analyzed. Various key network indices were computed for the discourse representations of both corpora, and they show the different network profiles of the two discourse types. Moreover, both network motifs and antimotifs were discovered for the discourse networks in the two corpora; these shed light on strong tendencies to build, or to avoid building, discourse relations between utterances for permissible three-node discourse subgraphs. These results may lead to new types of discourse structure rules that draw on the properties of the networks that lie behind discourse representation. Another important aspect is that the second version of the STAC corpus, which includes nonlinguistic discourse units and their relations, exhibits similar trends in terms of network subgraphs compared to its first version. This suggests that the nonlinguistic context has a significant impact on discourse structure.

22 pages, 7570 KiB

Open Access Article

Fine-Tuning BERT-Based Pre-Trained Models for Arabic Dependency Parsing

by Sharefah Al-Ghamdi, Hend Al-Khalifa and Abdulmalik Al-Salman

Appl. Sci. 2023, 13(7), 4225; https://doi.org/10.3390/app13074225 - 27 Mar 2023

Cited by 5 | Viewed by 2889

Abstract

With the advent of pre-trained language models, many natural language processing tasks in various languages have achieved great success. Although some research has been conducted on fine-tuning BERT-based models for syntactic parsing, and several Arabic pre-trained models have been developed, no attention has been paid to Arabic dependency parsing. In this study, we attempt to fill this gap and compare nine Arabic models, fine-tuning strategies, and encoding methods for dependency parsing. We evaluated three treebanks to highlight the best options and methods for fine-tuning Arabic BERT-based models to capture syntactic dependencies in the data. Our exploratory results show that the AraBERTv2 model provides the best scores for all treebanks and confirm that fine-tuning to the higher layers of pre-trained models is required. However, adding neural network layers to those models reduces accuracy. Additionally, we found that the treebanks differ in which encoding techniques give the highest scores. The analysis of the errors on the test examples highlights four issues that have an important effect on the results: parse tree post-processing, contextualized embeddings, erroneous tokenization, and erroneous annotation. This study reveals a direction for future research to achieve enhanced Arabic BERT-based syntactic parsing.

15 pages, 1476 KiB

Open Access Article

A Chinese Few-Shot Text Classification Method Utilizing Improved Prompt Learning and Unlabeled Data

by Tingkai Hu, Zuqin Chen, Jike Ge, Zhaoxu Yang and Jichao Xu

Appl. Sci. 2023, 13(5), 3334; https://doi.org/10.3390/app13053334 - 06 Mar 2023

Cited by 2 | Viewed by 1743

Abstract

Insufficient labeled samples and low generalization performance have become significant natural language processing problems, drawing considerable attention to few-shot text classification (FSTC). Advances in prompt learning have significantly improved the performance of FSTC. However, prompt learning methods typically require the pre-trained language model and tokens of the vocabulary list for model training, while different language models have different token coding structures, making it impractical to build effective Chinese prompt learning methods from previous approaches related to English. In addition, a majority of current prompt learning methods do not make use of existing unlabeled data, thus often leading to unsatisfactory performance in real-world applications. To address the above limitations, we propose a novel Chinese FSTC method called CIPLUD that combines an improved prompt learning method and existing unlabeled data, which are used for the classification of a small amount of Chinese text data. We used the Chinese pre-trained language model to build two modules: the Multiple Masks Optimization-based Prompt Learning (MMOPL) module and the One-Class Support Vector Machine-based Unlabeled Data Leveraging (OCSVM-UDL) module. The former generates prompt prefixes with multiple masks and constructs suitable prompt templates for Chinese labels. It optimizes the random token combination problem during label prediction with joint probability and length constraints. The latter, by establishing an OCSVM model in the trained text vector space, selects reasonable pseudo-label data for each category from a large amount of unlabeled data. After selecting the pseudo-label data, we mixed them with the previous few-shot annotated data to obtain new training data and then repeated the steps of the two modules as an iterative semi-supervised optimization process. The experimental results on the four Chinese FSTC benchmark datasets demonstrate that our proposed solution outperformed other prompt learning methods with an average accuracy improvement of 2.3%.

12 pages, 1539 KiB

Open Access Article

Cross-Lingual Named Entity Recognition Based on Attention and Adversarial Training

by Hao Wang, Lekai Zhou, Jianyong Duan and Li He

Appl. Sci. 2023, 13(4), 2548; https://doi.org/10.3390/app13042548 - 16 Feb 2023

Cited by 3 | Viewed by 1615

Abstract

Named entity recognition aims to extract entities with specific meaning from unstructured text. Currently, deep learning methods have been widely used for this task and have achieved remarkable results, but it is often difficult to achieve good results with little labeled data. To address this problem, this paper proposes a method for cross-lingual entity recognition based on an attention mechanism and adversarial training. The method transfers annotation data from resource-rich languages to low-resource languages for named entity recognition tasks and outputs changing semantic vectors through the attention mechanism to effectively solve the long-sequence semantic dilution problem. To verify its effectiveness, the proposed method is applied to the English–Chinese cross-lingual named entity recognition task based on the WeiboNER data set and the People-Daily2004 data set. The obtained F1 value of the optimal model is 53.22% (a 6.29% improvement compared to the baseline). The experimental results show that the cross-lingual adversarial named entity recognition method proposed in this paper can significantly improve the results of named entity recognition in low-resource languages.

17 pages, 3972 KiB

Open Access Article

A Corpus-Based Word Classification Method for Detecting Difficulty Level of English Proficiency Tests

by Liang-Ching Chen, Kuei-Hu Chang, Shu-Ching Yang and Shin-Chi Chen

Appl. Sci. 2023, 13(3), 1699; https://doi.org/10.3390/app13031699 - 29 Jan 2023

Cited by 4 | Viewed by 2398

Abstract

Many education systems globally adopt an English proficiency test (EPT) as an effective mechanism to evaluate English as a Foreign Language (EFL) speakers’ comprehension levels. Similarly, Taiwan’s military academy developed the Military Online English Proficiency Test (MOEPT) to assess EFL cadets’ English comprehension levels. However, the difficulty level of MOEPT has not been detected to help facilitate future updates of its test banks and improve EFL pedagogy and learning. Moreover, it is almost impossible to carry out any investigation effectively using previous corpus-based approaches. Hence, based on the lexical threshold theory, this research adopts a corpus-based approach to detect the difficulty level of MOEPT. The function word list and Taiwan College Entrance Examination Center (TCEEC) word list (which includes Common European Framework of Reference for Languages (CEFR) A2 and B1 level word lists) are adopted as the word classification criteria to classify the lexical items. The results show that the difficulty level of MOEPT is mainly the English for General Purposes (EGP) type of CEFR A2 level (lexical coverage = 74.46%). The findings presented in this paper offer implications for the academy management or faculty to regulate the difficulty and contents of MOEPT in the future, to effectively develop suitable EFL curriculums and learning materials, and to conduct remedial teaching for cadets who cannot pass MOEPT. By doing so, the overall English comprehension level of EFL cadets is expected to improve.
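
The lexical coverage figure reported above can be read as the share of running words in a text that appear on a reference word list. The following is a minimal sketch of that calculation, assuming simple lowercase tokenization and exact matching with no lemmatization; the function name `lexical_coverage` and the sample inputs are hypothetical, and the study's own procedure may differ.

```python
import re


def lexical_coverage(text, word_list):
    """Return the percentage of tokens in `text` found in `word_list`.

    Assumption: tokens are runs of ASCII letters or apostrophes, compared
    case-insensitively and without lemmatization.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = {word.lower() for word in word_list}
    covered = sum(1 for token in tokens if token in known)
    return 100.0 * covered / len(tokens)


if __name__ == "__main__":
    # Hypothetical sentence and word list, for illustration only.
    sample = "The cat sat on the mat"
    print(round(lexical_coverage(sample, ["the", "cat", "on"]), 2))
```

Counting tokens rather than distinct word types means frequent words weigh more heavily, which matches the usual sense of coverage as the proportion of running text a reader would recognize.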

22 pages, 1141 KiB

Open Access Article

Amharic Speech Search Using Text Word Query Based on Automatic Sentence-like Segmentation

by Getnet Mezgebu Brhanemeskel, Solomon Teferra Abate, Tewodros Alemu Ayall and Abegaz Mohammed Seid

Appl. Sci. 2022, 12(22), 11727; https://doi.org/10.3390/app122211727 - 18 Nov 2022

Cited by 1 | Viewed by 2797

Abstract

More than 7000 languages are spoken in the world today. Amharic is one of the languages spoken in the East African country of Ethiopia. A large amount of speech data is produced every day in different languages as machines become better at processing and gain storage capacity. However, searching for a particular word with its respective time frame inside a given audio file is a challenge. Since Amharic has its own distinguishing characteristics, such as glottal, palatal, and labialized consonants, it is not possible to directly use models that are developed for other languages. A popular approach in developing systems for searching particular information in speech involves using an automatic speech recognition (ASR) module that generates the text version of the speech, in which the word or phrase is then searched based on a text query. However, it is not possible to transcribe a long audio file without segmentation, which in turn affects the performance of the ASR module. In this paper, we report our investigation of the effects of manual and automatic speech segmentation of Amharic audio files in a spiritual domain. We used manual segmentation as a baseline for our investigation and found that sentence-like automatic segmentation resulted in a word error rate (WER) close to the WER achieved on the manually segmented test speech. Based on the experimental results, we propose Amharic speech search using text word query (ASSTWQ) based on automatic sentence-like segmentation. Since we achieved a lower WER using the previously developed speech corpus, which is in a broadcast news domain, together with the in-domain speech corpus, we recommend using both in- and out-of-domain speech corpora to develop the Amharic ASR module. The proposed ASR achieves a WER of 53%, which needs further improvement. Combining two language models (LMs) developed using training text from the two different domains (spiritual and broadcast news) allowed a WER reduction from 53% to 46%. Therefore, we have developed two ASSTWQ systems using the two ASR modules with WERs of 53% and 46%.
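
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition; the function name `word_error_rate` and the placeholder strings are hypothetical, and the abstract does not state which scoring tool the authors used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not data from the study.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted against the reference length, WER can exceed 100% when the hypothesis is much longer than the reference.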

10 pages, 244 KiB

Open Access Article

Syntactic and Semantic Properties of on the Contrary in British University Student Essay Writing: A Corpus-Based Systemic Functional Analysis

by Yan Zhang

Appl. Sci. 2022, 12(20), 10635; https://doi.org/10.3390/app122010635 - 21 Oct 2022

Cited by 1 | Viewed by 1105

Abstract

In contrast to previous descriptions of on the contrary (OTC) as a corrective or replacive marker requiring explicit negation, this study revealed that on more than 50% of the occasions on which the expression was used in British tertiary level student essays, represented by the British Academic Written English (BAWE) corpus, it was not associated with a preceding negation. The frequency information provides a starting point for the qualitative analysis of the two functional types of OTC, i.e., adversative versus replacive. The notions of (topical) theme, rheme, and focus within systemic functional linguistics were proposed as descriptive frameworks to identify the distinct characteristics of each functional type. The replacive type typically employs a preceding negation and a topical theme equal to that of the clause complex preceding the conjunctive, whereas the adversative type is distinguished by the use of a different topical theme and a contrastive rheme. The analysis conducted in this study provided language teachers with a model for helping students comprehend logic semantics expressed by conjunctives by analyzing semantic features of connected clause complexes.

21 pages, 3107 KiB

Open Access Article

Evaluating the Effectiveness of Text Pre-Processing in Sentiment Analysis

by Marco A. Palomino and Farida Aider

Appl. Sci. 2022, 12(17), 8765; https://doi.org/10.3390/app12178765 - 31 Aug 2022

Cited by 18 | Viewed by 3049

Abstract

Practical demands and academic challenges have both contributed to making sentiment analysis a thriving area of research. Given that a great deal of sentiment analysis work is performed on social media communications, where text frequently ignores the rules of grammar and spelling, pre-processing techniques are required to clean the data. Pre-processing is also required to normalise the text before undertaking the analysis, as social media is inundated with abbreviations, emoticons, emojis, truncated sentences, and slang. While pre-processing has been widely discussed in the literature, and it is considered indispensable, recommendations for best practice have not been conclusive. Thus, we have reviewed the available research on the subject and evaluated various combinations of pre-processing components quantitatively. We have focused on the case of Twitter sentiment analysis, as Twitter has proved to be an important source of publicly accessible data. We have also assessed the effectiveness of different combinations of pre-processing components for the overall accuracy of a couple of off-the-shelf tools and one algorithm implemented by us. Our results confirm that the order of the pre-processing components matters and significantly improves the performance of naïve Bayes classifiers. We also confirm that lemmatisation is useful for enhancing the performance of an index, but it does not notably improve the quality of sentiment analysis.
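
One way to see why the order of pre-processing components can matter is to apply the same two steps in both orders. The sketch below is purely illustrative and is not the authors' pipeline: the emoticon table, the function names, and the sample text are all hypothetical.

```python
import re

# Hypothetical emoticon lexicon, for illustration only.
EMOTICONS = {":)": "happy", ":(": "sad"}


def replace_emoticons(text):
    """Replace each known emoticon with a sentiment-bearing word."""
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return text


def strip_punctuation(text):
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)


def run_pipeline(text, steps):
    """Apply the steps in the given order, then collapse whitespace."""
    for step in steps:
        text = step(text)
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "great phone :)"
    print(run_pipeline(sample, [replace_emoticons, strip_punctuation]))
    print(run_pipeline(sample, [strip_punctuation, replace_emoticons]))
```

In the first ordering the emoticon survives as the word "happy"; in the second, punctuation stripping deletes it before it can be recognized, so a downstream classifier never sees that signal.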

21 pages, 2510 KiB

Open Access Article

Hybrid Analytic Hierarchy Process–Artificial Neural Network Model for Predicting the Major Risks and Quality of Taiwanese Construction Projects

by Chien-Liang Lin, Ching-Lung Fan and Bey-Kun Chen

Appl. Sci. 2022, 12(15), 7790; https://doi.org/10.3390/app12157790 - 02 Aug 2022

Cited by 9 | Viewed by 2299

Abstract

Construction projects are associated with risks, which influence projects’ performance and quality. To ensure the on-time completion of construction projects, project managers often use risk assessment and management methods to reduce risks in the project life cycle. Identifying risk factors and the relationship between major risk factors and the quality of construction projects facilitates construction management. In this study, 948 project records of construction inspection from 1993 to 2020 were collected from the Public Construction Management Information System (PCMIS) of the Taiwan central government and used in an expert survey to identify five risk dimensions and 19 major risk factors associated with Taiwanese construction projects. A hybrid of the analytic hierarchy process (AHP) and an artificial neural network (ANN) was employed to develop a model for predicting major risk factors and construction quality. The AHP was used to calculate the weight of major risk factors to verify their influence on construction. The ANN was adopted to extract the features of major risk factors to predict the quality of a construction project. The accuracy of the prediction model was 85%. Project managers can refer to the prediction results obtained with the proposed method to perform effective risk management and devise decision-making strategies for construction management.
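
As a rough illustration of how AHP turns expert judgments into weights, the sketch below derives a priority vector from a pairwise comparison matrix using the row geometric mean method, a common approximation of the principal eigenvector. The abstract does not say which variant the authors used, so the method choice, the function name `ahp_weights`, and the example matrix are assumptions.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric mean method.

    `matrix[i][j]` states how strongly factor i is preferred over factor j;
    a consistent matrix satisfies matrix[j][i] == 1 / matrix[i][j].
    """
    n = len(matrix)
    geometric_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geometric_means)
    return [value / total for value in geometric_means]


if __name__ == "__main__":
    # Hypothetical judgment: factor A is three times as important as factor B.
    comparisons = [
        [1.0, 3.0],
        [1.0 / 3.0, 1.0],
    ]
    print(ahp_weights(comparisons))
```

A full AHP analysis would normally also compute a consistency ratio to check that the pairwise judgments do not contradict one another; that step is omitted here for brevity.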

15 pages, 1275 KiB

Open Access Article

Distantly Supervised Named Entity Recognition with Self-Adaptive Label Correction

by Binling Nie and Chenyang Li

Appl. Sci. 2022, 12(15), 7659; https://doi.org/10.3390/app12157659 - 29 Jul 2022

Cited by 1 | Viewed by 1322

Abstract

Named entity recognition has achieved remarkable success on benchmarks with high-quality manual annotations. Such annotations are labor-intensive and time-consuming, and thus unavailable in real-world scenarios. An emerging interest is to generate low-cost but noisy labels via distant supervision; hence, noisy-label learning algorithms are in demand. In this paper, a unified self-adaptive learning framework termed Self-Adaptive Label cOrrection (SALO) is proposed. SALO adaptively performs a label correction process, in both an implicit and an explicit manner, turning noisy labels into correct ones, thus benefiting model training. The experimental results on four benchmark datasets demonstrated the superiority of SALO over the state-of-the-art distantly supervised methods. Moreover, a better version of the noisy labels was built by ensembling several semantic matching methods. Experiments were carried out and consistent improvements were observed, validating the generalization of the proposed SALO.

19 pages, 360 KiB

Open Access Article

The Saudi Novel Corpus: Design and Compilation

by Tareq Alfraidi, Mohammad A. R. Abdeen, Ahmed Yatimi, Reyadh Alluhaibi and Abdulmohsen Al-Thubaity

Appl. Sci. 2022, 12(13), 6648; https://doi.org/10.3390/app12136648 - 30 Jun 2022

Cited by 2 | Viewed by 2369

Abstract

Arabic has recently received significant attention from corpus compilers. This situation has led to the creation of many Arabic corpora that cover various genres, most notably the newswire genre. Yet, Arabic novels, and specifically those authored by Saudi writers, lack sufficient digital datasets that would enhance corpus linguistic and stylistic studies of these works. Thus, Arabic lags behind English and other European languages in this context. In this paper, we present the Saudi Novels Corpus, built to be a valuable resource for linguistic and stylistic research communities. We specifically present the procedures we followed and the decisions we made in creating the corpus. We describe and clarify the design criteria, data collection methods, process of annotation, and encoding. In addition, we present preliminary results that emerged from the analysis of the corpus content. We consider the work described in this paper as initial steps to bridge the existing gap between corpus linguistics and Arabic literary texts. Further work is planned to improve the quality of the corpus by adding advanced features.

27 pages, 21431 KiB

Open Access Article

Robust Sentimental Class Prediction Based on Cryptocurrency-Related Tweets Using Tetrad of Feature Selection Techniques in Combination with Filtered Classifier

by Saad Awadh Alanazi

Appl. Sci. 2022, 12(12), 6070; https://doi.org/10.3390/app12126070 - 15 Jun 2022

Cited by 1 | Viewed by 1568

Abstract

Individual feelings and reactions are becoming more significant as they help researchers, domain experts, businesses, and other individuals understand the overall response of every individual in specific situations or circumstances. Every pure and compound sentiment can be classified using a dataset, which can be in the form of Twitter text by various Twitter users. Twitter is one of the vital platforms for individuals to participate and share their ideas about different topics; it is also considered to be one of the best-known and largest micro-blogging websites on the Internet. One of the key purposes of this study is to classify pure and compound sentiments based on text related to cryptocurrencies, an innovative way of trading that is flourishing daily. The cryptocurrency market incurs many fluctuations in the coins’ value. A small positive or negative piece of news can change the whole outlook for specific cryptocurrencies. In this paper, individuals’ pure and compound sentiments based on cryptocurrency-related Twitter text are classified. The dataset is collected through the Twitter API. In WEKA, two deployment schemes are compared: first, a single feature selection technique (Tweet to lexicon feature vector) used directly, and second, a tetrad of feature selection techniques (Tweet to lexicon feature vector, Tweet to input lexicon feature vector, Tweet to SentiStrength feature vector, and Tweet to embedding feature vector) used to purify the data for the LibLINEAR (LL) classifier, which contains fast algorithms for linear classification using L2-regularization L2-loss support vector machines (Dual SVM). The LL classifier differs in that it can potentially reduce the sum of the absolute values of errors rather than the sum of the squared errors and is typically much faster. Based on the overall performance parameters, the deployment scheme containing the tetrad of feature selection techniques with the LL classifier is considered the best choice for the purpose of classification. Among machine learning techniques, LL produces effective results and gives an efficient performance compared to other prevailing techniques. The findings of this research would be beneficial for Twitter users as well as cryptocurrency traders.

14 pages, 1472 KiB

Open Access Article

Neural Embeddings for the Elicitation of Jurisprudence Principles: The Case of Arabic Legal Texts

by Nafla Alrumayyan and Maha Al-Yahya

Appl. Sci. 2022, 12(9), 4188; https://doi.org/10.3390/app12094188 - 21 Apr 2022

Cited by 2 | Viewed by 1458

Abstract

In the domain of law and legal systems, jurisprudence principles (JPs) are considered major sources of legislative reasoning by jurisprudence scholars. Generally accepted JPs are often used to support the reasoning for a given jurisprudence case (JC). Although eliciting the JPs associated with a specific JC is a central task of legislative reasoning, it is complex and requires expertise, knowledge of the domain, and significant and lengthy human exertion by jurisprudence scholars. This study aimed to leverage advances in language modeling to support the task of JP elicitation. We investigated neural embeddings, specifically doc2vec architectures, as a representation model for the task of JP elicitation using Arabic legal texts. Four experiments were conducted to evaluate three different architectures for document embedding models for the JP elicitation task. In addition, we explored an approach that integrates task-oriented word embeddings (ToWE) with document embeddings (paragraph vectors). The results of the experiments showed that using neural embeddings for the JP elicitation task is a promising approach. The paragraph vector distributed bag-of-words (PV-DBOW) architecture produced the best results for this task. To evaluate how well the ToWE model performed for the JP elicitation task, a graded relevance ranking measure, discounted cumulative gain (DCG), was used. The model achieved good results with a normalized DCG of 0.9 for the majority of the JPs. The findings of this study have significant implications for the understanding of how Arabic legal texts can be modeled and how the semantics of jurisprudence principles can be elicited using neural embeddings.
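
The graded relevance measure mentioned above can be sketched as follows. This is a minimal illustration of one common DCG formulation (linear gain with a log2 rank discount) and its normalized form; the abstract does not specify which variant was used, and the function names and relevance grades are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in ranked order.

    Assumption: linear gain, with the item at rank r discounted by log2(r + 1).
    """
    return sum(
        relevance / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance grades for a ranked list of retrieved items.
    print(ndcg([3, 2, 1]))
    print(ndcg([1, 3, 2]))
```

A normalized score of 1.0 means the ranking already places items in descending order of relevance; lower values indicate that highly relevant items appear further down the list.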

24 pages, 1076 KiB

Open Access Article

Attention-Based RU-BiLSTM Sentiment Analysis Model for Roman Urdu

by Bilal Ahmed Chandio, Ali Shariq Imran, Maheen Bakhtyar, Sher Muhammad Daudpota and Junaid Baber

Appl. Sci. 2022, 12(7), 3641; https://doi.org/10.3390/app12073641 - 04 Apr 2022

Cited by 17 | Viewed by 5024

Abstract

Deep neural networks have emerged as a leading approach towards handling many natural language processing (NLP) tasks. Deep networks initially conquered the problems of computer vision. However, dealing with sequential data such as text and sound was difficult for such networks, as traditional deep networks are not reliable in preserving contextual information. This may not harm the results in the case of image processing, where the sequence does not matter, but when the data come from text, such networks may produce poor results. Moreover, establishing sentence semantics in a colloquial text such as Roman Urdu is a challenge. Additionally, the sparsity and high dimensionality of data in such informal text pose a significant challenge for building sentence semantics. To overcome this problem, we propose a deep recurrent architecture, RU-BiLSTM, based on bidirectional LSTM (BiLSTM) coupled with word embedding and an attention mechanism for sentiment analysis of Roman Urdu. Our proposed model uses the bidirectional LSTM to preserve the context in both directions and the attention mechanism to concentrate on more important features. Finally, the last dense softmax output layer is used to obtain the binary and ternary classification results. We empirically evaluated our model on two available datasets of Roman Urdu, i.e., RUECD and RUSA-19. Our proposed model outperformed the baseline models on many grounds, and a significant improvement of 6% to 8% is achieved over baseline models.

20 pages, 784 KiB

Open Access Article

Efficient Fake News Detection Mechanism Using Enhanced Deep Learning Model

by Tahir Ahmad, Muhammad Shahzad Faisal, Atif Rizwan, Reem Alkanhel, Prince Waqas Khan and Ammar Muthanna

Appl. Sci. 2022, 12(3), 1743; https://doi.org/10.3390/app12031743 - 08 Feb 2022

Cited by 18 | Viewed by 4777

Abstract

The spreading of accidental or malicious misinformation on social media, specifically in critical situations, such as real-world emergencies, can have negative consequences for society. This facilitates the spread of rumors on social media. On social media, users share and exchange the latest information with many readers, including a large volume of new information every second. However, updated news sharing on social media is not always true. In this study, we focus on the challenges of numerous breaking-news rumors propagating on social media networks rather than long-lasting rumors. We propose new social-based and content-based features to detect rumors on social media networks. Furthermore, our findings show that our proposed features are more helpful in classifying rumors compared with state-of-the-art baseline features. Moreover, we apply bidirectional LSTM-RNN on text for rumor prediction. This model is simple but effective for rumor detection. The majority of early rumor detection research focuses on long-running rumors and assumes that rumors are always false. In contrast, our experiments on rumor detection are conducted on a real-world data set. The results of the experiments demonstrate that our proposed features and different machine learning models perform best when compared to the state-of-the-art baseline features and classifier in terms of precision, recall, and F1 measures.

Published Papers (16 papers)
