Natural Language Processing: Approaches and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 May 2022) | Viewed by 30755

Special Issue Editors


Prof. Dr. Evgeny Nikulchev
Guest Editor

Prof. Dr. Vladimir Borisovich Barakhnin
Guest Editor
1. Federal Research Center for Information and Computational Technologies, Ac. Lavrentiev Ave. 6, 630090 Novosibirsk, Russia
2. Department of Information Technologies, Novosibirsk State University, Pirogova Str. 1, 630090 Novosibirsk, Russia
Interests: natural language processing; poetic text analysis; mathematical linguistics; semiotics; data mining; machine learning; artificial intelligence; self-organizing systems; wave processes in liquids

Special Issue Information

Dear Colleagues,

Natural language document processing is one of the most important tasks in modern information technology. Its results are used in a wide variety of fields: from extracting new information and knowledge from scientific publications, in order to build information-analytical systems supporting scientific activity, to analyzing blogs in order to study consumer ratings of a particular product.

Various approaches are used to solve this problem. The study of the lower structural levels of a text, in particular its syntax and vocabulary, allows the use of classic finite-state algorithms that yield reliably interpretable results, while the study of the higher structural levels, semantics and pragmatics, requires machine learning methods. In addition, algorithms for natural language text processing differ markedly depending on the class to which a particular language belongs: analytical, inflectional, or agglutinative.

All of the above defines a wide range of tasks, and of approaches to their solution, to which this Special Issue is devoted.

Prof. Dr. Evgeny Nikulchev
Prof. Dr. Vladimir Borisovich Barakhnin
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language analysis
  • multilevel text structure
  • finite state algorithms
  • machine learning methods
  • analytical languages
  • inflectional languages
  • agglutinative languages

Published Papers (12 papers)


Research

19 pages, 8555 KiB  
Article
Enhanced Seagull Optimization with Natural Language Processing Based Hate Speech Detection and Classification
by Yousef Asiri, Hanan T. Halawani, Hanan M. Alghamdi, Saadia Hassan Abdalaha Hamza, Sayed Abdel-Khalek and Romany F. Mansour
Appl. Sci. 2022, 12(16), 8000; https://doi.org/10.3390/app12168000 - 10 Aug 2022
Cited by 6 | Viewed by 2110
Abstract
Hate speech has become a hot research topic in the area of natural language processing (NLP) due to the tremendous increase in the usage of social media platforms like Instagram, Twitter, Facebook, etc. The facelessness and flexibility provided through the Internet have made it easier for people to interact aggressively. Furthermore, the massive quantity of increasing hate speech on social media with heterogeneous sources makes it a challenging task. With this motivation, this study presents an Enhanced Seagull Optimization with Natural Language Processing Based Hate Speech Detection and Classification (ESGONLP-HSC) model. The major intention of the presented ESGONLP-HSC model is to identify and classify the occurrence of hate speech on social media websites. To accomplish this, the presented ESGONLP-HSC model involves data pre-processing at several stages, such as tokenization, vectorization, etc. Additionally, the GloVe technique is applied for the feature extraction process. In addition, an attention-based bidirectional long short-term memory (ABLSTM) model is utilized for the classification of social media text into three classes: neutral, offensive, and hate language. Moreover, the ESGO algorithm is utilized as a hyperparameter optimizer to adjust the hyperparameters of the ABLSTM model, which constitutes the novelty of the work. The experimental validation of the ESGONLP-HSC model is carried out, and the results are examined under diverse aspects. The experimental outcomes report the promising performance of the ESGONLP-HSC model over recent state-of-the-art approaches.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

13 pages, 3856 KiB  
Article
Novel Hate Speech Detection Using Word Cloud Visualization and Ensemble Learning Coupled with Count Vectorizer
by Turki Turki and Sanjiban Sekhar Roy
Appl. Sci. 2022, 12(13), 6611; https://doi.org/10.3390/app12136611 - 29 Jun 2022
Cited by 14 | Viewed by 3690
Abstract
A plethora of negative behavioural activities have recently been found on social media. Incidents such as trolling and hate speech on social media, especially on Twitter, have grown considerably. Therefore, the detection of hate speech on Twitter has become an area of interest among many researchers. In this paper, we present a computational framework to (1) examine the computational challenges behind hate speech detection and (2) generate high-performance results. First, we extract features from Twitter data by utilizing a count vectorizer technique. Then, we provide the labeled dataset of constructed features to the adopted ensemble methods, including Bagging, AdaBoost, and Random Forest. After training, we classify new tweet examples into one of two categories, hate speech or non-hate speech. Experimental results show that (1) Random Forest surpassed the other methods, achieving 95% accuracy, and (2) a word cloud displays the most prominent tweets responsible for hateful sentiments.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
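As a rough illustration of the count-vectorizer-plus-ensemble pipeline this abstract describes, the following sketch uses scikit-learn with a hypothetical toy corpus; the tweets, labels, and parameters are illustrative stand-ins, not the authors' data or exact configuration:

```python
# Minimal sketch: count-vectorizer features fed to a Random Forest classifier.
# Toy corpus and labels are hypothetical stand-ins for a real tweet dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

corpus = [
    "i hate you and your kind",
    "you people are disgusting",
    "what a lovely sunny day",
    "great match last night",
]
labels = [1, 1, 0, 0]  # 1 = hate speech, 0 = non-hate speech

# Turn raw tweets into token-count feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Train a Random Forest on the count features.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, labels)

# Classify a new, unseen tweet with the same vectorizer.
new_tweet = vectorizer.transform(["i hate you people"])
print(clf.predict(new_tweet)[0])
```

The same vectorizer must be reused at prediction time so the new tweet is projected into the vocabulary learned during training; the paper additionally compares Bagging and AdaBoost on the same features.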

16 pages, 441 KiB  
Article
EssayGAN: Essay Data Augmentation Based on Generative Adversarial Networks for Automated Essay Scoring
by Yo-Han Park, Yong-Seok Choi, Cheon-Young Park and Kong-Joo Lee
Appl. Sci. 2022, 12(12), 5803; https://doi.org/10.3390/app12125803 - 07 Jun 2022
Cited by 8 | Viewed by 2385
Abstract
In large-scale testing and e-learning environments, automated essay scoring (AES) can relieve the burden upon human raters by replacing human grading with machine grading. However, building AES systems based on deep learning requires a training dataset consisting of essays that are manually rated with scores. In this study, we introduce EssayGAN, an automatic essay generator based on generative adversarial networks (GANs). To generate essays rated with scores, EssayGAN has multiple generators, one for each score range, and one discriminator. Each generator is dedicated to a specific score and can generate an essay rated with that score. Therefore, the generators can focus only on generating a realistic-looking essay that can fool the discriminator, without considering the target score. Although ordinary text GANs generate text on a word basis, EssayGAN generates essays on a sentence basis. Therefore, EssayGAN can compose not only a long essay, by predicting a sentence instead of a word at each step, but also a score-rated essay, by adopting multiple generators dedicated to the target score. Since EssayGAN can generate score-rated essays, the generated essays can be used in the supervised learning process for AES systems. Experimental results show that data augmentation using the generated essays helps to improve the performance of AES systems. We conclude that EssayGAN can generate essays that not only consist of multiple sentences but also maintain coherence between sentences.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

17 pages, 969 KiB  
Article
Parallel Bidirectionally Pretrained Taggers as Feature Generators
by Ranka Stanković, Mihailo Škorić and Branislava Šandrih Todorović
Appl. Sci. 2022, 12(10), 5028; https://doi.org/10.3390/app12105028 - 16 May 2022
Cited by 3 | Viewed by 1476
Abstract
In a setting where multiple automatic annotation approaches coexist and advance separately but none completely solves a specific problem, the key might be in their combination and integration. This paper outlines a scalable architecture for part-of-speech tagging that uses multiple standalone annotation systems as feature generators for a stacked classifier. It also explores automatic resource expansion via dataset augmentation and bidirectional training in order to increase the number of taggers and to maximize the impact of the composite system, which is especially viable for low-resource languages. We demonstrate the approach on a preannotated dataset for Serbian, using nested cross-validation to test and compare standalone and composite taggers. Based on the results, we conclude that, given a limited training dataset, there is a payoff in cutting a percentage of the initial training set and using it to fine-tune a machine-learning-based stacked classifier, especially if it is trained bidirectionally. Moreover, we found a measurable benefit in using multiple tagsets to scale up the architecture further through transfer learning methods.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

16 pages, 322 KiB  
Article
A White-Box Sociolinguistic Model for Gender Detection
by Damián Morales Sánchez, Antonio Moreno and María Dolores Jiménez López
Appl. Sci. 2022, 12(5), 2676; https://doi.org/10.3390/app12052676 - 04 Mar 2022
Cited by 1 | Viewed by 1663
Abstract
Within the area of Natural Language Processing, we approached the author profiling task as a text classification problem. Based on an author's writing style, sociodemographic information such as the author's gender, age, or native language can be predicted. The exponential growth of user-generated data and the development of machine-learning techniques have led to significant advances in automatic gender detection. Unfortunately, gender detection models often become black boxes in terms of interpretability. In this paper, we propose a tree-based computational model for gender detection made up of 198 features. Unlike previous work on gender detection, we organized the features from a linguistic perspective into six categories: orthographic, morphological, lexical, syntactic, digital, and pragmatic-discursive. We implemented a decision-tree classifier to evaluate the performance of all feature combinations, and the experiments revealed that, on average, classification accuracy increased by up to 3.25% with the addition of feature sets. The maximum classification accuracy was reached by a three-level model that combined lexical, syntactic, and digital features. We present the most relevant features for gender detection according to the trees generated by the classifier and contextualize the significance of the computational results with the linguistic patterns defined by previous research in relation to gender.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
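The white-box argument the abstract makes can be illustrated with a minimal decision-tree sketch; the feature names, values, and labels below are hypothetical stand-ins for the paper's 198 linguistically grouped features:

```python
# Illustrative sketch of an interpretable, tree-based classifier built on
# linguistically grouped features. All values below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each row: [avg_word_length (lexical), subordinate_clauses_per_sentence
# (syntactic), emoji_rate (digital)] -- one example value per feature category.
X = [
    [4.2, 0.8, 0.00],
    [4.5, 1.1, 0.01],
    [5.1, 0.3, 0.12],
    [5.3, 0.4, 0.15],
]
y = [0, 0, 1, 1]  # hypothetical class labels

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Unlike a black-box model, the learned split feature and threshold at the
# root can be read off directly and interpreted linguistically.
print(clf.tree_.feature[0], clf.tree_.threshold[0])
```

This readability of split thresholds is what lets the authors map tree nodes back to the sociolinguistic patterns discussed in the paper.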

18 pages, 2095 KiB  
Article
Determination of the Features of the Author’s Style of A.S. Pushkin’s Poems by Machine Learning Methods
by Vladimir Barakhnin, Olga Kozhemyakina and Irina Grigorieva
Appl. Sci. 2022, 12(3), 1674; https://doi.org/10.3390/app12031674 - 06 Feb 2022
Cited by 1 | Viewed by 1498
Abstract
This paper presents a study of the author's style of A.S. Pushkin based on a comparison of his poetic texts with the texts of poets contemporary with him. The purpose of this study is to determine the features of the author's style of A.S. Pushkin using machine learning methods. The paper describes the construction of several classifications based on different groups of features, as well as a classification based on a combined set of features from the different groups. The quality of all constructed classifications is also analyzed; special attention is paid to the interpretation of the neural network solution and the identification of features of the author's style.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

19 pages, 495 KiB  
Article
Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques
by Md. Saef Ullah Miah, Junaida Sulaiman, Talha Bin Sarwar, Ateeqa Naseer, Fasiha Ashraf, Kamal Zuhairi Zamli and Rajan Jose
Appl. Sci. 2022, 12(3), 1352; https://doi.org/10.3390/app12031352 - 27 Jan 2022
Cited by 9 | Viewed by 2914
Abstract
Given the growth of scientific literature on the web, particularly in materials science, acquiring data precisely from the literature has become more significant. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes using the existing scientific literature. Processing and understanding the natural language of scientific literature is the backbone of these systems, which depend heavily on appropriate textual content. Appropriate textual content means a complete, meaningful sentence from a large chunk of textual content. The process of detecting the beginning and end of a sentence and extracting the text between them as correct sentences is called sentence boundary extraction. The accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. Therefore, this study provides a comparative analysis of different tools for extracting text from PDF documents that are available as Python libraries or packages and are widely used by the research community. The main objective is to find the most suitable among the available techniques that can correctly extract sentences from PDF files as text. The performance of the techniques PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID is presented in terms of precision, recall, F1 score, run time, and memory consumption. The NLTK, spaCy, and Gensim natural language processing (NLP) tools are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score of 93% and consumed the least amount of memory at 46.13 MB.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
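To make the task concrete, a naive sentence-boundary splitter can be sketched in a few lines of pure Python; this is an illustration of the problem itself, not one of the evaluated tools, which handle abbreviations, initials, and ellipses far more robustly:

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Split after '.', '!' or '?' when followed by whitespace and an
    # uppercase letter. Dedicated tools (NLTK, spaCy) are far more robust.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

text = ("Electric double layer capacitors store charge electrostatically. "
        "Their electrodes are typically carbon-based. Do they degrade quickly?")
for sentence in naive_sentence_split(text):
    print(sentence)
```

Note that this baseline breaks on strings like "Dr. Smith", where the period before a capitalized name triggers a false boundary; avoiding such errors is precisely why the paper compares dedicated NLP tools.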

16 pages, 363 KiB  
Article
Transformer-Based Graph Convolutional Network for Sentiment Analysis
by Barakat AlBadani, Ronghua Shi, Jian Dong, Raeed Al-Sabri and Oloulade Babatounde Moctard
Appl. Sci. 2022, 12(3), 1316; https://doi.org/10.3390/app12031316 - 26 Jan 2022
Cited by 15 | Viewed by 5806
Abstract
Sentiment Analysis is an essential research topic in the field of natural language processing (NLP) and has attracted the attention of many researchers in the last few years. Recently, deep neural network (DNN) models have been used for sentiment analysis tasks, achieving promising results. Although these models can analyze sequences of arbitrary length, utilizing them in the feature extraction layer of a DNN increases the dimensionality of the feature space. More recently, graph neural networks (GNNs) have achieved a promising performance in different NLP tasks. However, previous models cannot be transferred to a large corpus and neglect the heterogeneity of textual graphs. To overcome these difficulties, we propose a new Transformer-based graph convolutional network for heterogeneous graphs called Sentiment Transformer Graph Convolutional Network (ST-GCN). To the best of our knowledge, this is the first study to model the sentiment corpus as a heterogeneous graph and learn document and word embeddings using the proposed sentiment graph transformer neural network. In addition, our model offers an easy mechanism to fuse node positional information for graph datasets using Laplacian eigenvectors. Extensive experiments on four standard datasets show that our model outperforms the existing state-of-the-art models.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

18 pages, 12453 KiB  
Article
Latent-Cause Extraction Model in Maritime Collision Accidents Using Text Analytics on Korean Maritime Accident Verdicts
by Taemin Hwang and Ik-Hyun Youn
Appl. Sci. 2022, 12(2), 914; https://doi.org/10.3390/app12020914 - 17 Jan 2022
Cited by 1 | Viewed by 1506
Abstract
Maritime collision accidents occur frequently and result in huge damages. Complex collision accidents, in particular, are associated with worse damages. Complex maritime collision accidents involve other types of accidents in addition to the main accident, such as fires, explosions, capsizes, sinkings, and even casualties. When a maritime accident occurs, the maritime accident verdict covers the surveyed facts from the origin of the accident to its consequences. The survey usually reveals the primary cause of the accident; however, complex causes may remain latent. Therefore, this research aims to apply text analytics to maritime verdicts of collision accident cases to identify the latent causes in complex collision accidents. The proposed method separated the collected corpus into a training dataset and a test dataset. A word propensity database was extracted from the training dataset and applied to sample verdicts of complex maritime collision accidents in the test dataset. The expected results of this research were words that appear only in complex maritime accidents, with a high propensity for additional categories, and the relevant context that explains the latent causes underlying the complexity of the maritime accident. The conclusion suggests that the latent causes derived should be provided to ships to help prevent future complex collision accidents.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

9 pages, 1069 KiB  
Article
Examining the Effect of the Ratio of Biomedical Domain to General Domain Data in Corpus in Biomedical Literature Mining
by Ziheng Zhang, Feng Han, Hongjian Zhang, Tomohiro Aoki and Katsuhiko Ogasawara
Appl. Sci. 2022, 12(1), 154; https://doi.org/10.3390/app12010154 - 24 Dec 2021
Cited by 2 | Viewed by 1987
Abstract
Biomedical terms extracted using Word2vec, the most popular word embedding model in recent years, serve as the foundation for various natural language processing (NLP) applications, such as biomedical information retrieval, relation extraction, and recommendation systems. The objective of this study is to examine how changes in the ratio of the biomedical domain to general domain data in the corpus affect the extraction of similar biomedical terms using Word2vec. We downloaded abstracts of 214,892 articles from PubMed Central (PMC) and the 3.9 GB Billion Word (BW) benchmark corpus from the computer science community. The datasets were preprocessed and grouped into 11 corpora based on the ratio of BW to PMC, ranging from 0:10 to 10:0, and then Word2vec models were trained on these corpora. The cosine similarities between the biomedical terms obtained from the Word2vec models were then compared in each model. The results indicated that the models trained with both BW and PMC data outperformed the model trained only with medical data. The similarity between the biomedical terms extracted by the Word2vec model increased when the ratio of the biomedical domain to general domain data was 3:7 to 5:5. This study allows NLP researchers to apply Word2vec based on more information and increase the similarity of extracted biomedical terms to improve their effectiveness in NLP applications, such as biomedical information extraction.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
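The comparison metric used throughout the study, cosine similarity between term vectors, can be computed directly; the four-dimensional vectors below are hypothetical stand-ins, as real Word2vec embeddings typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional embeddings for two biomedical terms.
vec_aspirin = [0.8, 0.1, 0.3, 0.4]
vec_ibuprofen = [0.7, 0.2, 0.35, 0.5]
print(round(cosine_similarity(vec_aspirin, vec_ibuprofen), 3))
```

A value near 1.0 indicates near-parallel vectors, i.e., terms the embedding model treats as semantically similar; the study tracks how this value changes as the BW-to-PMC ratio varies.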

14 pages, 5281 KiB  
Article
Development of Intellectual Web System for Morph Analyzing of Uzbek Words
by Davlatyor Mengliev, Vladimir Barakhnin and Nilufar Abdurakhmonova
Appl. Sci. 2021, 11(19), 9117; https://doi.org/10.3390/app11199117 - 30 Sep 2021
Cited by 10 | Viewed by 1908
Abstract
Currently, the Uzbek sector of the Internet is developing actively. In it, as in other national sectors, the most common form of presentation of textual information is semi-structured documents, and working with them presupposes the availability of reliable algorithms for text analysis, including analysis of its lexical characteristics. The article presents an intelligent web application developed for the morphological analysis of words in the Uzbek language. The web application is based on the concept of generation and stem analysis of Uzbek word forms. The well-known Porter algorithm was chosen as the basis for stemming. The morphoanalyzer generates word forms of the Uzbek language based on the division of words into certain classes, taking into account the specifics and structure of this language. For example, nouns can be classified by meaning (related, nominal), by number (singular and plural), by case, and by possessive endings.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
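As a minimal sketch of the Porter-style suffix-stripping idea underlying the stemmer: Uzbek, being agglutinative, stacks affixes onto a stem (e.g., plural "-lar", genitive "-ning"), so stripping proceeds iteratively. The suffix list below is a tiny hypothetical sample, not the analyzer's actual rule set:

```python
# Toy suffix-stripping stemmer for an agglutinative language.
# Suffixes ordered longest-first; a minimal stem length of 3 guards
# against over-stripping. This list is an illustrative sample only.
UZBEK_SUFFIXES = ["larning", "lardan", "larga", "ning", "lar", "dan", "ga", "ni"]

def strip_suffixes(word: str) -> str:
    # Repeatedly strip the longest matching suffix, since agglutinative
    # languages stack several affixes onto one stem.
    changed = True
    while changed:
        changed = False
        for suffix in UZBEK_SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

# "kitoblarning" = stem "kitob" (book) + plural "lar" + genitive "ning"
print(strip_suffixes("kitoblarning"))
```

A production morphoanalyzer would additionally model the noun classes mentioned above and vowel/consonant alternations rather than rely on a flat suffix list.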

12 pages, 1406 KiB  
Article
Comparative Analysis of Reasoning in Russian Classic Poetry
by Mariya Timofeeva
Appl. Sci. 2021, 11(18), 8665; https://doi.org/10.3390/app11188665 - 17 Sep 2021
Cited by 2 | Viewed by 1298
Abstract
The paper considers the pragmatic textual level and focuses on the peculiarities of reasoning realized in lyric verses written by representatives of Russian classic poetry. The investigated material includes verses written by A.K. Tolstoy, K.K. Sluchevsky, and I.F. Annensky. The purposes of the study involve adapting rhetorical structure theory (RST) to poetic texts, annotating these texts, and searching for regularities of poetic reasoning specific to the considered authors. Applying RST to poetic texts was a novel task; the lack of prior experience made it necessary to adapt the method, that is, to elaborate an adequate set of rhetorical relations and to specify two sets of criteria: for segmenting a text and for identifying the relations. The resulting set of relations consists of 34 items. After annotating the texts in terms of the adapted RST, several lines of comparison were investigated. They include collating the frequency spectra of relations and the semantic groups of relations for the three authors, as well as comparing two periods of creativity for A.K. Tolstoy and K.K. Sluchevsky. The results of the comparative investigation revealed certain regularities both in the distribution of isolated relations and in the distribution of semantically grouped relations.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)
