Natural Language Processing: Approaches and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 May 2022) | Viewed by 30755

Special Issue Editors


Prof. Dr. Evgeny Nikulchev
Guest Editor

Prof. Dr. Vladimir Borisovich Barakhnin
Guest Editor
1. Federal Research Center for Information and Computational Technologies, Ac. Lavrentiev Ave. 6, 630090 Novosibirsk, Russia
2. Department of Information Technologies, Novosibirsk State University, Pirogova Str. 1, 630090 Novosibirsk, Russia
Interests: natural language processing; poetic text analysis; mathematical linguistics; semiotics; data mining; machine learning; artificial intelligence; self-organizing systems; wave processes in liquids

Special Issue Information

Dear Colleagues,

Natural language document processing is one of the most important tasks in modern information technology. Its results are used in a wide variety of fields: from extracting new information and knowledge from scientific publications, in order to build information-analytical systems supporting scientific activity, to analyzing blogs in order to study consumer ratings of a particular product.

Various approaches are used to solve this problem. The study of the lower structural levels of a text, in particular its syntax and vocabulary, allows the use of classic finite-state algorithms that yield reliably interpretable results, while the study of the higher structural levels, semantics and pragmatics, requires machine learning methods. In addition, algorithms for natural language text processing differ markedly depending on the class to which a particular language belongs: analytical, inflectional, or agglutinative.

All of the above defines a wide range of tasks, and of approaches to their solution, to which this Special Issue is devoted.

Prof. Dr. Evgeny Nikulchev
Prof. Dr. Vladimir Borisovich Barakhnin
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language analysis
  • multilevel text structure
  • finite state algorithms
  • machine learning methods
  • analytical languages
  • inflectional languages
  • agglutinative languages

Published Papers (12 papers)


Research

19 pages, 8555 KiB  
Article
Enhanced Seagull Optimization with Natural Language Processing Based Hate Speech Detection and Classification
by Yousef Asiri, Hanan T. Halawani, Hanan M. Alghamdi, Saadia Hassan Abdalaha Hamza, Sayed Abdel-Khalek and Romany F. Mansour
Appl. Sci. 2022, 12(16), 8000; https://doi.org/10.3390/app12168000 - 10 Aug 2022
Cited by 6 | Viewed by 2110
Abstract
Hate speech has become a hot research topic in the area of natural language processing (NLP) due to the tremendous increase in the usage of social media platforms like Instagram, Twitter, Facebook, etc. The facelessness and flexibility provided through the Internet have made it easier for people to interact aggressively. Furthermore, the massive quantity of increasing hate speech on social media with heterogeneous sources makes it a challenging task. With this motivation, this study presents an Enhanced Seagull Optimization with Natural Language Processing Based Hate Speech Detection and Classification (ESGONLP-HSC) model. The major intention of the presented ESGONLP-HSC model is to identify and classify the occurrence of hate speech on social media websites. To accomplish this, the presented ESGONLP-HSC model involves data pre-processing at several stages, such as tokenization, vectorization, etc. Additionally, the GloVe technique is applied for the feature extraction process. In addition, an attention-based bidirectional long short-term memory (ABLSTM) model is utilized for the classification of social media text into three classes: neutral, offensive, and hate language. Moreover, the ESGO algorithm is utilized as a hyperparameter optimizer to adjust the hyperparameters of the ABLSTM model, which constitutes the novelty of the work. The experimental validation of the ESGONLP-HSC model is carried out, and the results are examined under diverse aspects. The experimental outcomes report the promising performance of the ESGONLP-HSC model over recent state-of-the-art approaches.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

13 pages, 3856 KiB  
Article
Novel Hate Speech Detection Using Word Cloud Visualization and Ensemble Learning Coupled with Count Vectorizer
by Turki Turki and Sanjiban Sekhar Roy
Appl. Sci. 2022, 12(13), 6611; https://doi.org/10.3390/app12136611 - 29 Jun 2022
Cited by 14 | Viewed by 3690
Abstract
A plethora of negative behavioural activities have recently been found on social media. Incidents such as trolling and hate speech on social media, especially on Twitter, have grown considerably. Therefore, the detection of hate speech on Twitter has become an area of interest among many researchers. In this paper, we present a computational framework to (1) examine the computational challenges behind hate speech detection and (2) generate high-performance results. First, we extract features from Twitter data by utilizing a count vectorizer technique. Then, we provide the labeled dataset of constructed features to the adopted ensemble methods, including Bagging, AdaBoost, and Random Forest. After training, we classify new tweet examples into one of two categories, hate speech or non-hate speech. Experimental results show that (1) Random Forest surpassed the other methods, achieving 95% accuracy, and (2) a word cloud displays the most prominent tweets responsible for hateful sentiments.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
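As a rough illustration of the count-vectorizer-plus-ensemble pipeline this abstract describes, the following sketch uses scikit-learn with a hypothetical toy corpus; the tweets, labels, and parameters are illustrative stand-ins, not the authors' data or exact configuration:

```python
# Minimal sketch: count-vectorizer features fed to a Random Forest classifier.
# Toy corpus and labels are hypothetical stand-ins for a real tweet dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

corpus = [
    "i hate you and your kind",
    "you people are disgusting",
    "what a lovely sunny day",
    "great match last night",
]
labels = [1, 1, 0, 0]  # 1 = hate speech, 0 = non-hate speech

# Turn raw tweets into token-count feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Train a Random Forest on the count features.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, labels)

# Classify a new, unseen tweet with the same vectorizer.
new_tweet = vectorizer.transform(["i hate you people"])
print(clf.predict(new_tweet)[0])
```

The same vectorizer must be reused at prediction time so the new tweet is projected into the vocabulary learned during training; the paper additionally compares Bagging and AdaBoost on the same features.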

16 pages, 441 KiB  
Article
EssayGAN: Essay Data Augmentation Based on Generative Adversarial Networks for Automated Essay Scoring
by Yo-Han Park, Yong-Seok Choi, Cheon-Young Park and Kong-Joo Lee
Appl. Sci. 2022, 12(12), 5803; https://doi.org/10.3390/app12125803 - 07 Jun 2022
Cited by 8 | Viewed by 2385
Abstract
In large-scale testing and e-learning environments, automated essay scoring (AES) can relieve the burden upon human raters by replacing human grading with machine grading. However, building AES systems based on deep learning requires a training dataset consisting of essays that are manually rated with scores. In this study, we introduce EssayGAN, an automatic essay generator based on generative adversarial networks (GANs). To generate essays rated with scores, EssayGAN has multiple generators, one for each score range, and one discriminator. Each generator is dedicated to a specific score and can generate an essay rated with that score. Therefore, the generators can focus only on generating a realistic-looking essay that can fool the discriminator, without considering the target score. Although ordinary text GANs generate text on a word basis, EssayGAN generates essays on a sentence basis. Therefore, EssayGAN can compose not only a long essay, by predicting a sentence instead of a word at each step, but also a score-rated essay, by adopting multiple generators dedicated to the target score. Since EssayGAN can generate score-rated essays, the generated essays can be used in the supervised learning process for AES systems. Experimental results show that data augmentation using the generated essays helps to improve the performance of AES systems. We conclude that EssayGAN can generate essays that not only consist of multiple sentences but also maintain coherence between sentences.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

17 pages, 969 KiB  
Article
Parallel Bidirectionally Pretrained Taggers as Feature Generators
by Ranka Stanković, Mihailo Škorić and Branislava Šandrih Todorović
Appl. Sci. 2022, 12(10), 5028; https://doi.org/10.3390/app12105028 - 16 May 2022
Cited by 3 | Viewed by 1476
Abstract
In a setting where multiple automatic annotation approaches coexist and advance separately but none completely solves a specific problem, the key might be in their combination and integration. This paper outlines a scalable architecture for part-of-speech tagging that uses multiple standalone annotation systems as feature generators for a stacked classifier. It also explores automatic resource expansion via dataset augmentation and bidirectional training in order to increase the number of taggers and to maximize the impact of the composite system, which is especially viable for low-resource languages. We demonstrate the approach on a preannotated dataset for Serbian, using nested cross-validation to test and compare standalone and composite taggers. Based on the results, we conclude that, given a limited training dataset, there is a payoff in cutting a percentage of the initial training set and using it to fine-tune a machine-learning-based stacked classifier, especially if it is trained bidirectionally. Moreover, we found a measurable benefit in using multiple tagsets to scale up the architecture further through transfer learning methods.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

16 pages, 322 KiB  
Article
A White-Box Sociolinguistic Model for Gender Detection
by Damián Morales Sánchez, Antonio Moreno and María Dolores Jiménez López
Appl. Sci. 2022, 12(5), 2676; https://doi.org/10.3390/app12052676 - 04 Mar 2022
Cited by 1 | Viewed by 1663
Abstract
Within the area of Natural Language Processing, we approached the author profiling task as a text classification problem. Based on an author's writing style, sociodemographic information such as the author's gender, age, or native language can be predicted. The exponential growth of user-generated data and the development of machine-learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatic-discursive. We implemented a decision-tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, classification accuracy increased by up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
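The white-box argument the abstract makes can be illustrated with a minimal decision-tree sketch; the feature names, values, and labels below are hypothetical stand-ins for the paper's 198 linguistically grouped features:

```python
# Illustrative sketch of an interpretable, tree-based classifier built on
# linguistically grouped features. All values below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each row: [avg_word_length (lexical), subordinate_clauses_per_sentence
# (syntactic), emoji_rate (digital)] -- one example value per feature category.
X = [
    [4.2, 0.8, 0.00],
    [4.5, 1.1, 0.01],
    [5.1, 0.3, 0.12],
    [5.3, 0.4, 0.15],
]
y = [0, 0, 1, 1]  # hypothetical class labels

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Unlike a black-box model, the learned split feature and threshold at the
# root can be read off directly and interpreted linguistically.
print(clf.tree_.feature[0], clf.tree_.threshold[0])
```

This readability of split thresholds is what lets the authors map tree nodes back to the sociolinguistic patterns discussed in the paper.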

18 pages, 2095 KiB  
Article
Determination of the Features of the Author’s Style of A.S. Pushkin’s Poems by Machine Learning Methods
by Vladimir Barakhnin, Olga Kozhemyakina and Irina Grigorieva
Appl. Sci. 2022, 12(3), 1674; https://doi.org/10.3390/app12031674 - 06 Feb 2022
Cited by 1 | Viewed by 1498
Abstract
This paper presents a study of the author's style of A.S. Pushkin based on a comparison of his poetic texts with the texts of poets contemporary with him. The purpose of this study is to determine the features of the author's style of A.S. Pushkin using machine learning methods. The paper describes the construction of several classifications based on different groups of features, as well as a classification based on a combined set of features from the different groups. The quality of all constructed classifications is also analyzed; special attention is paid to the interpretation of the neural network solution and the identification of features of the author's style.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

19 pages, 495 KiB  
Article
Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques
by Md. Saef Ullah Miah, Junaida Sulaiman, Talha Bin Sarwar, Ateeqa Naseer, Fasiha Ashraf, Kamal Zuhairi Zamli and Rajan Jose
Appl. Sci. 2022, 12(3), 1352; https://doi.org/10.3390/app12031352 - 27 Jan 2022
Cited by 9 | Viewed by 2914
Abstract
Given the growth of scientific literature on the web, particularly in materials science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes using the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content. Appropriate textual content means a complete, meaningful sentence from a large chunk of textual content. The process of detecting the beginning and end of a sentence and extracting the text between them as correct sentences is called sentence boundary extraction. The accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. Therefore, this study provides a comparative analysis of different tools for extracting text from PDF documents that are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable among the available techniques that can correctly extract sentences from PDF files as text. The performance of the techniques PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID is presented in terms of precision, recall, F1 score, run time, and memory consumption. The NLTK, spaCy, and Gensim natural language processing (NLP) tools are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score of 93% and consumed the least amount of memory at 46.13 MB.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
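To make the task concrete, a naive sentence-boundary splitter can be sketched in a few lines of pure Python; this is an illustration of the problem itself, not one of the evaluated tools, which handle abbreviations, initials, and ellipses far more robustly:

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Split after '.', '!' or '?' when followed by whitespace and an
    # uppercase letter. Dedicated tools (NLTK, spaCy) are far more robust.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

text = ("Electric double layer capacitors store charge electrostatically. "
        "Their electrodes are typically carbon-based. Do they degrade quickly?")
for sentence in naive_sentence_split(text):
    print(sentence)
```

Note that this baseline breaks on strings like "Dr. Smith", where the period before a capitalized name triggers a false boundary; avoiding such errors is precisely why the paper compares dedicated NLP tools.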

16 pages, 363 KiB  
Article
Transformer-Based Graph Convolutional Network for Sentiment Analysis
by Barakat AlBadani, Ronghua Shi, Jian Dong, Raeed Al-Sabri and Oloulade Babatounde Moctard
Appl. Sci. 2022, 12(3), 1316; https://doi.org/10.3390/app12031316 - 26 Jan 2022
Cited by 15 | Viewed by 5806
Abstract
Sentiment Analysis is an essential research topic in the field of natural language processing (NLP) and has attracted the attention of many researchers in the last few years. Recently, deep neural network (DNN) models have been used for sentiment analysis tasks, achieving promising results. Although these models can analyze sequences of arbitrary length, utilizing them in the feature extraction layer of a DNN increases the dimensionality of the feature space. More recently, graph neural networks (GNNs) have achieved a promising performance in different NLP tasks. However, previous models cannot be transferred to a large corpus and neglect the heterogeneity of textual graphs. To overcome these difficulties, we propose a new Transformer-based graph convolutional network for heterogeneous graphs called Sentiment Transformer Graph Convolutional Network (ST-GCN). To the best of our knowledge, this is the first study to model the sentiment corpus as a heterogeneous graph and learn document and word embeddings using the proposed sentiment graph transformer neural network. In addition, our model offers an easy mechanism to fuse node positional information for graph datasets using Laplacian eigenvectors. Extensive experiments on four standard datasets show that our model outperforms the existing state-of-the-art models.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

18 pages, 12453 KiB  
Article
Latent-Cause Extraction Model in Maritime Collision Accidents Using Text Analytics on Korean Maritime Accident Verdicts
by Taemin Hwang and Ik-Hyun Youn
Appl. Sci. 2022, 12(2), 914; https://doi.org/10.3390/app12020914 - 17 Jan 2022
Cited by 1 | Viewed by 1506
Abstract
Maritime collision accidents occur frequently and result in huge damages. Complex collision accidents, in particular, are associated with worse damages. Complex maritime collision accidents involve other types of accidents in addition to the main accident, such as fires, explosions, capsizes, sinkings, and even casualties. When a maritime accident occurs, the maritime accident verdict covers the surveyed facts from the origin of the accident to its consequences. The survey usually reveals the primary cause of the accident; however, complex causes may remain latent. Therefore, this research aims to apply text analytics to maritime verdicts of collision accident cases to identify the latent causes in complex collision accidents. The proposed method separated the collected corpus into a training dataset and a test dataset. A word propensity database was extracted from the training dataset and applied to sample verdicts of complex maritime collision accidents in the test dataset. The expected results of this research were words that appear only in complex maritime accidents, with a high propensity for additional categories, and the relevant context that explains the latent causes underlying the complexity of the maritime accident. The conclusion suggests that the latent causes derived should be provided to ships to help prevent future complex collision accidents.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

9 pages, 1069 KiB  
Article
Examining the Effect of the Ratio of Biomedical Domain to General Domain Data in Corpus in Biomedical Literature Mining
by Ziheng Zhang, Feng Han, Hongjian Zhang, Tomohiro Aoki and Katsuhiko Ogasawara
Appl. Sci. 2022, 12(1), 154; https://doi.org/10.3390/app12010154 - 24 Dec 2021
Cited by 2 | Viewed by 1987
Abstract
Biomedical terms extracted using Word2vec, the most popular word embedding model in recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems. The objective of this study is to examine how changes in the ratio of the biomedical domain to general domain data in the corpus affect the extraction of similar biomedical terms using Word2vec. We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC, ranging from 0:10 to 10:0, and then Word2vec models were trained on these corpora. The cosine similarities between the biomedical terms obtained from the Word2vec models were then compared in each model. The results indicated that the models trained with both BW and PMC data outperformed the model trained only with medical data. The similarity between the biomedical terms extracted by the Word2vec model increased when the ratio of the biomedical domain to general domain data was 3:7 to 5:5. This study allows NLP researchers to apply Word2vec based on more information and increase the similarity of extracted biomedical terms to improve their effectiveness in NLP applications, such as biomedical information extraction.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
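The comparison metric used throughout the study, cosine similarity between term vectors, can be computed directly; the four-dimensional vectors below are hypothetical stand-ins, as real Word2vec embeddings typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional embeddings for two biomedical terms.
vec_aspirin = [0.8, 0.1, 0.3, 0.4]
vec_ibuprofen = [0.7, 0.2, 0.35, 0.5]
print(round(cosine_similarity(vec_aspirin, vec_ibuprofen), 3))
```

A value near 1.0 indicates near-parallel vectors, i.e., terms the embedding model treats as semantically similar; the study tracks how this value changes as the BW-to-PMC ratio varies.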

14 pages, 5281 KiB  
Article
Development of Intellectual Web System for Morph Analyzing of Uzbek Words
by Davlatyor Mengliev, Vladimir Barakhnin and Nilufar Abdurakhmonova
Appl. Sci. 2021, 11(19), 9117; https://doi.org/10.3390/app11199117 - 30 Sep 2021
Cited by 10 | Viewed by 1908
Abstract
Currently, the Uzbek sector of the Internet is developing actively. In it, as in other national sectors, the most common form of presentation of textual information is semi-structured documents, and working with them presupposes the availability of reliable algorithms for text analysis, including analysis of its lexical characteristics. The article presents an intelligent web application developed for the morphological analysis of words in the Uzbek language. The web application is based on the concept of generation and stem analysis of Uzbek word forms. The well-known Porter algorithm was chosen as the basis for stemming. The morphoanalyzer generates word forms of the Uzbek language based on the division of words into certain classes, taking into account the specifics and structure of this language. For example, nouns can be classified by meaning (related, nominal), by number (singular and plural), by case, and by possessive endings.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
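As a minimal sketch of the Porter-style suffix-stripping idea underlying the stemmer: Uzbek, being agglutinative, stacks affixes onto a stem (e.g., plural "-lar", genitive "-ning"), so stripping proceeds iteratively. The suffix list below is a tiny hypothetical sample, not the analyzer's actual rule set:

```python
# Toy suffix-stripping stemmer for an agglutinative language.
# Suffixes ordered longest-first; a minimal stem length of 3 guards
# against over-stripping. This list is an illustrative sample only.
UZBEK_SUFFIXES = ["larning", "lardan", "larga", "ning", "lar", "dan", "ga", "ni"]

def strip_suffixes(word: str) -> str:
    # Repeatedly strip the longest matching suffix, since agglutinative
    # languages stack several affixes onto one stem.
    changed = True
    while changed:
        changed = False
        for suffix in UZBEK_SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

# "kitoblarning" = stem "kitob" (book) + plural "lar" + genitive "ning"
print(strip_suffixes("kitoblarning"))
```

A production morphoanalyzer would additionally model the noun classes mentioned above and vowel/consonant alternations rather than rely on a flat suffix list.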

12 pages, 1406 KiB  
Article
Comparative Analysis of Reasoning in Russian Classic Poetry
by Mariya Timofeeva
Appl. Sci. 2021, 11(18), 8665; https://doi.org/10.3390/app11188665 - 17 Sep 2021
Cited by 2 | Viewed by 1298
Abstract
The paper considers the pragmatic textual level and focuses on the peculiarities of reasoning realized in lyric verses written by representatives of Russian classic poetry. The investigated material includes verses written by A.K. Tolstoy, K.K. Sluchevsky, and I.F. Annensky. The purposes of the study involve adapting rhetorical structure theory (RST) to poetic texts, annotating these texts, and searching for regularities of poetic reasoning specific to the considered authors. Applying RST to poetic texts was a novel task; the lack of prior experience made it necessary to adapt the method, that is, to elaborate an adequate set of rhetorical relations and to specify two sets of criteria: for segmenting a text and for identifying the relations. The resulting set of relations consists of 34 items. After annotating the texts in terms of the adapted RST, several lines of comparison were investigated. They include collating the frequency spectra of relations and the semantic groups of relations for the three authors, as well as comparing two periods of creativity for A.K. Tolstoy and K.K. Sluchevsky. The results of the comparative investigation revealed certain regularities both in the distribution of isolated relations and in the distribution of semantically grouped relations.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
