New Technologies and Applications of Natural Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 July 2023) | Viewed by 20,461

Special Issue Editor


Guest Editor
School of Computer Science and Technology, Shandong University, Jinan 250101, China
Interests: dialogue systems; text summarization; information retrieval

Special Issue Information

Dear Colleagues,

Natural language processing (NLP) has become one of the most prominent topics in science and technology in recent years, and it is among the most active research fields in artificial intelligence. NLP continues to drive advances in language intelligence and is now applied across a wide range of industries, which poses new challenges both for the advancement of NLP and for its applications. This Special Issue is therefore intended for the presentation of new technologies and applications in the field of NLP. It will publish high-quality, original research papers on various NLP tasks, including, but not limited to, dialogue systems, text summarization, question answering, sentiment analysis, event detection, language generation, and language reasoning. We especially encourage submissions that address the opportunities and challenges arising from recent trends in NLP research, including, but not limited to, the following:

  • Pre-trained models and their applications;
  • Transfer learning for NLP;
  • Self-supervised or unsupervised learning for NLP;
  • Reinforcement learning for NLP;
  • Multilingual NLP;
  • Multimodal NLP;
  • Ethics of NLP.

Prof. Dr. Pengjie Ren
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • pre-trained models
  • transfer learning
  • self-supervised learning
  • unsupervised learning
  • reinforcement learning
  • multilingual NLP
  • multimodal NLP
  • ethics of NLP

Published Papers (10 papers)


Research

13 pages, 1353 KiB  
Article
Experimental Study of Morphological Analyzers for Topic Categorization in News Articles
by Sangtae Ahn
Appl. Sci. 2023, 13(19), 10572; https://doi.org/10.3390/app131910572 - 22 Sep 2023
Cited by 2 | Viewed by 933
Abstract
Natural language processing refers to the ability of computers to understand text and spoken words in a manner similar to humans. Recently, various machine learning techniques have been used to successfully encode large amounts of text and decode its feature vectors. However, the understanding of low-resource languages is still at an early stage of research. In particular, Korean, an agglutinative language, needs sophisticated preprocessing steps, such as morphological analysis. Since morphological analysis in preprocessing significantly influences classification results, ideal and optimized morphological analyzers must be used. This study explored five state-of-the-art morphological analyzers for Korean news articles and categorized their topics into seven classes using term frequency–inverse document frequency and light gradient boosting machine frameworks. It was found that a morphological analyzer based on unsupervised learning achieved a computation time of 6 s on 500,899 tokens, which is 72 times faster than the slowest analyzer (432 s). In addition, a morphological analyzer using dynamic programming achieved a topic categorization accuracy of 82.5%, which is 9.4% higher than that achieved using the hidden Markov model (73.1%) and 13.4% higher than the baseline (69.1%) without any morphological analyzer. This study provides insight into how each morphological analyzer extracts morphemes from sentences and affects topic categorization in news articles. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
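To make the categorization pipeline above concrete, the following is a minimal sketch of TF-IDF features feeding a light gradient boosting machine classifier. The toy English documents, labels, and hyperparameters are illustrative assumptions; in the paper, the tokens would come from a Korean morphological analyzer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import lightgbm as lgb

# Toy corpus; in the study, each document is a Korean news article whose
# tokens are morphemes produced by a morphological analyzer.
docs = [
    "central bank raises interest rates again",
    "interest rates and inflation pressure the economy",
    "striker scores twice in the cup final",
    "the team wins the final after extra time",
]
labels = [0, 0, 1, 1]  # 0 = economy, 1 = sports (two of the seven classes)

# Term frequency-inverse document frequency over pre-tokenized text.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
X = vectorizer.fit_transform(docs)

# Light gradient boosting machine classifier; tiny settings for the toy data.
clf = lgb.LGBMClassifier(n_estimators=20, min_child_samples=1)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["inflation pressures the economy"])))
```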

25 pages, 2524 KiB  
Article
Classification of Events in Selected Industrial Processes Using Weighted Key Words and K-Nearest Neighbors Algorithm
by Mateusz Walczak, Aneta Poniszewska-Marańda and Krzysztof Stepień
Appl. Sci. 2023, 13(18), 10334; https://doi.org/10.3390/app131810334 - 15 Sep 2023
Cited by 1 | Viewed by 651
Abstract
The problem of classifying events in industry involves large amounts of accumulated text data, including, among other things, communication between a company and its clients, whose expectations regarding service quality are constantly growing. Currently used solutions for handling incoming requests have numerous disadvantages; they imply additional costs for the company and often a high level of customer dissatisfaction. A partial solution to this problem may be the automation of event classification, for example, by means of an expert IT system. The presented work proposes a solution to the problem of classifying text events. For this purpose, textual descriptions of events were used, collected over many years by companies from many different industries. A large proportion of these text events are various types of problems reported by company customers. As part of this work, a complex text-classification process was constructed using the K-Nearest Neighbors algorithm. The demonstrated classification process uses two newly proposed mechanisms: dynamic extension of the stop list and weighted keywords. Both mechanisms aim to improve classification performance by solving typical problems that occur when using a fixed stop list and a classical keyword-extraction approach based on TF or TF-IDF methods. Finally, the Text Events Categorizer system that implements the proposed classification process is described. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
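As a rough illustration of the process described above, the sketch below extends a stop list dynamically with corpus-wide high-frequency terms, weights keywords with TF-IDF, and classifies with K-Nearest Neighbors. The document-frequency threshold, toy tickets, and labels are assumptions, not the authors' settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

tickets = [
    "the printer driver fails to install on the workstation",
    "cannot install the new printer driver",
    "the invoice total does not match the order",
    "wrong amount billed on the latest invoice",
]
labels = ["hardware", "hardware", "billing", "billing"]

# Dynamic stop-list extension: terms occurring in nearly every event
# description carry no class signal, so they join the stop list.
counts = CountVectorizer().fit(tickets)
doc_freq = (counts.transform(tickets) > 0).sum(axis=0).A1
dynamic_stops = [t for t, df in zip(counts.get_feature_names_out(), doc_freq)
                 if df / len(tickets) > 0.9]

# Weighted keywords via TF-IDF over the reduced vocabulary.
vec = TfidfVectorizer(stop_words=dynamic_stops or None)
X = vec.fit_transform(tickets)

# K-Nearest Neighbors with cosine distance assigns the event class.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, labels)
print(knn.predict(vec.transform(["billed the wrong invoice amount"])))
```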

15 pages, 1342 KiB  
Article
University Student Dropout Prediction Using Pretrained Language Models
by Hyun-Sik Won, Min-Ji Kim, Dohyun Kim, Hee-Soo Kim and Kang-Min Kim
Appl. Sci. 2023, 13(12), 7073; https://doi.org/10.3390/app13127073 - 13 Jun 2023
Viewed by 1977
Abstract
Predicting student dropout from universities is an imperative but challenging task. Numerous data-driven approaches that utilize both student demographic information (e.g., gender, nationality, and high school graduation year) and academic information (e.g., GPA, participation in activities, and course evaluations) have shown meaningful results. Recently, pretrained language models have achieved very successful results in understanding the tasks associated with structured data as well as textual data. In this paper, we propose a novel student dropout prediction framework based on demographic and academic information, using a pretrained language model to capture the relationship between different forms of information. To this end, we first formulate both types of information in natural language form. We then recast the student dropout prediction task as a natural language inference (NLI) task. Finally, we fine-tune the pretrained language models to predict student dropout. In particular, we further enhance the model using a continuous hypothesis. The experimental results demonstrate that the proposed model is effective for the freshmen dropout prediction task. The proposed method exhibits significant improvements of as much as 9.00% in terms of F1-score compared with state-of-the-art techniques. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
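A hedged sketch of the NLI recasting described above: a student record is verbalized into a natural-language premise, a dropout statement serves as the hypothesis, and a pretrained language model scores the pair. The model name, verbalization template, and toy record are illustrative assumptions; the actual system is fine-tuned on labeled pairs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = enrolled, 1 = dropout

# Verbalize demographic and academic information as a premise.
record = {"gender": "female", "gpa": 3.1, "activities": 2}
premise = (f"The student is {record['gender']}, has a GPA of {record['gpa']}, "
           f"and participated in {record['activities']} activities.")
hypothesis = "This student will drop out of university."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is untrained here; fine-tuning on labeled
# (premise, hypothesis, outcome) pairs yields the actual predictor.
print(torch.softmax(logits, dim=-1))
```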

24 pages, 2232 KiB  
Article
Synthetized Multilanguage OCR Using CRNN and SVTR Models for Realtime Collaborative Tools
by Attila Biró, Antonio Ignacio Cuesta-Vargas, Jaime Martín-Martín, László Szilágyi and Sándor Miklós Szilágyi
Appl. Sci. 2023, 13(7), 4419; https://doi.org/10.3390/app13074419 - 30 Mar 2023
Cited by 4 | Viewed by 2593
Abstract
Background: Remote diagnosis using collaborative tools has led to multilingual joint working sessions in various domains, including comprehensive health care, resulting in more inclusive health care services. One of the main challenges is providing a real-time solution for shared documents and presentations on display to improve the efficacy of noninvasive, safe, and far-reaching collaborative models. Classic optical character recognition (OCR) solutions fail when there is a mixture of languages or dialects, or when participants have different technical levels and skills. Due to the risk of misunderstandings caused by mistranslations or the interpreters' lack of domain knowledge, the technological pipeline also needs artificial intelligence (AI)-supported improvements on the OCR side. This study examines the feasibility of machine learning-supported OCR in a multilingual environment. The novelty of our method is that it provides a solution not only for different spoken languages but also for a mixture of technological languages, using an artificially created vocabulary and a custom training data generation approach. Methods: A novel hybrid language vocabulary creation method is utilized in the OCR training process, in combination with convolutional recurrent neural networks (CRNNs) and a single visual model for scene text recognition within the patch-wise image tokenization framework (SVTR). Data: We used a dedicated Python-based data generator built on collaborative tool-based templates to cover and simulate, with high accuracy, the real-life variances of remote diagnosis and co-working collaborative sessions. The generated training datasets ranged from 66 k to 8.5 M samples in size. Twenty-one research results were analyzed. Instruments: Training was conducted using tuned PaddleOCR with CRNN and SVTR modeling and a domain-specific, customized vocabulary. The Weights & Biases (WandB) machine learning (ML) platform was used for experiment tracking, dataset versioning, and model evaluation. Based on the evaluations, the training dataset was adjusted by using a different language corpus and/or applying modifications to the templates. Results: The machine learning models recognized the multilanguage/hybrid texts with high accuracy. The highest precision scores achieved were 90.25%, 91.35%, and 93.89%. Conclusions: Machine learning models for special multilanguages, including languages with an artificially made vocabulary, perform consistently with high accuracy. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
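In the spirit of the custom training data generation described above, here is a simplified sketch that renders synthetic mixed-vocabulary text lines with Pillow and writes a tab-separated label list of the kind PaddleOCR's recognition training consumes. The toy hybrid vocabulary, image sizes, and default font are assumptions, not the authors' generator.

```python
import random
from PIL import Image, ImageDraw

# A toy "hybrid" vocabulary mixing natural-language and technical tokens.
vocab = ["diagnosis", "kubectl", "patient", "localhost:8080", "therapy", "SELECT"]

def render_line(tokens, path):
    """Render one synthetic text line as an image and return its label."""
    text = " ".join(tokens)
    img = Image.new("RGB", (10 * len(text) + 20, 32), "white")
    ImageDraw.Draw(img).text((10, 8), text, fill="black")  # default bitmap font
    img.save(path)
    return text

# Image-label pairs in a tab-separated list, one line per training sample.
with open("labels.txt", "w", encoding="utf-8") as f:
    for i in range(100):
        label = render_line(random.sample(vocab, k=3), f"line_{i:04d}.png")
        f.write(f"line_{i:04d}.png\t{label}\n")
```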

16 pages, 433 KiB  
Article
RoBERTa-GRU: A Hybrid Deep Learning Model for Enhanced Sentiment Analysis
by Kian Long Tan, Chin Poo Lee and Kian Ming Lim
Appl. Sci. 2023, 13(6), 3915; https://doi.org/10.3390/app13063915 - 19 Mar 2023
Cited by 6 | Viewed by 4497
Abstract
This paper proposes a novel hybrid model for sentiment analysis. The model leverages the strengths of both the Transformer model, represented by the Robustly Optimized BERT Pretraining Approach (RoBERTa), and the Recurrent Neural Network, represented by Gated Recurrent Units (GRU). The RoBERTa model provides the capability to project the texts into a discriminative embedding space through its attention mechanism, while the GRU model captures the long-range dependencies of the embedding and addresses the vanishing gradients problem. To overcome the challenge of imbalanced datasets in sentiment analysis, this paper also proposes the use of data augmentation with word embeddings by over-sampling the minority classes. This enhances the representation capacity of the model, making it more robust and accurate in handling the sentiment classification task. The proposed RoBERTa-GRU model was evaluated on three widely used sentiment analysis datasets: IMDb, Sentiment140, and Twitter US Airline Sentiment. The results show that the model achieved an accuracy of 94.63% on IMDb, 89.59% on Sentiment140, and 91.52% on Twitter US Airline Sentiment. These results demonstrate the effectiveness of the proposed RoBERTa-GRU hybrid model in sentiment analysis. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
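A minimal sketch of the hybrid architecture described above: RoBERTa produces contextual token embeddings, a GRU runs over them, and its final hidden state feeds a linear classifier. The GRU hidden size and the use of the last hidden state for pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class RobertaGRU(nn.Module):
    def __init__(self, num_classes=2, gru_hidden=128):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.gru = nn.GRU(self.roberta.config.hidden_size, gru_hidden,
                          batch_first=True)
        self.classifier = nn.Linear(gru_hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Discriminative embeddings from RoBERTa's attention layers.
        hidden = self.roberta(
            input_ids, attention_mask=attention_mask).last_hidden_state
        # The GRU captures long-range dependencies over the embeddings.
        _, h_n = self.gru(hidden)
        return self.classifier(h_n[-1])

tok = RobertaTokenizer.from_pretrained("roberta-base")
batch = tok(["what a wonderful film"], return_tensors="pt")
print(RobertaGRU()(batch["input_ids"], batch["attention_mask"]).shape)  # (1, 2)
```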

20 pages, 6973 KiB  
Article
Sarcasm Detection Based on Adaptive Incongruity Extraction Network and Incongruity Cross-Attention
by Yuanlin He, Mingju Chen, Yingying He, Zhining Qu, Fanglin He, Feihong Yu, Jun Liao and Zhenchuan Wang
Appl. Sci. 2023, 13(4), 2102; https://doi.org/10.3390/app13042102 - 6 Feb 2023
Cited by 1 | Viewed by 1978
Abstract
Sarcasm is a linguistic phenomenon indicating a difference between literal meanings and implied intentions. It is commonly used on blogs, e-commerce platforms, and social media. Its linguistic nature makes it hard to detect, hampering numerous NLP tasks such as opinion mining and sentiment analysis systems. Traditional techniques concentrated mostly on textual incongruity. Recent research demonstrated that adding commonsense knowledge to sarcasm detection is an effective new method. However, existing techniques cannot effectively capture sentence "incongruity" information or take good advantage of external knowledge, resulting in imperfect detection performance. In this work, new modules are proposed for maximizing the utilization of the text, the commonsense knowledge, and their interplay. First, we propose an adaptive incongruity extraction module to compute the distance between each word in the text and the commonsense knowledge. Two adaptive incongruity extraction modules are applied to the text and the commonsense knowledge, respectively, yielding two adaptive incongruity attention matrices. Each word in the sequence thus receives a new representation with enhanced incongruity semantics. Secondly, we propose an incongruity cross-attention module to extract the incongruity between the text and the corresponding commonsense knowledge, allowing us to select useful commonsense knowledge for sarcasm detection. In addition, we propose an improved gate module as a feature fusion module for the text and commonsense knowledge, which determines how much information should be considered. Experimental results on publicly available datasets demonstrate the superiority of our method, achieving state-of-the-art performance on three datasets as well as improved interpretability. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
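A hedged sketch of the cross-attention idea: text tokens act as queries over commonsense-knowledge tokens, so attention weights select the knowledge most relevant to each word. The dimensions and head count are illustrative, and the paper's adaptive incongruity extraction and gate modules are omitted.

```python
import torch
import torch.nn as nn

class IncongruityCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, knowledge):
        # Queries come from the text; keys and values from commonsense
        # knowledge, so the weights pick knowledge useful for sarcasm cues.
        fused, weights = self.attn(query=text, key=knowledge, value=knowledge)
        return fused, weights

text = torch.randn(1, 12, 128)       # 12 text-token representations
knowledge = torch.randn(1, 8, 128)   # 8 commonsense-knowledge representations
fused, w = IncongruityCrossAttention()(text, knowledge)
print(fused.shape, w.shape)  # (1, 12, 128) and (1, 12, 8)
```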

18 pages, 1277 KiB  
Article
SelfCCL: Curriculum Contrastive Learning by Transferring Self-Taught Knowledge for Fine-Tuning BERT
by Somaiyeh Dehghan and Mehmet Fatih Amasyali
Appl. Sci. 2023, 13(3), 1913; https://doi.org/10.3390/app13031913 - 1 Feb 2023
Cited by 1 | Viewed by 1731
Abstract
BERT, the most popular deep learning language model, has yielded breakthrough results in various NLP tasks. However, the semantic representation space learned by BERT has the property of anisotropy. Therefore, BERT needs to be fine-tuned for certain downstream tasks such as Semantic Textual Similarity (STS). To overcome this problem and improve the sentence representation space, some contrastive learning methods have been proposed for fine-tuning BERT. However, existing contrastive learning models do not consider the importance of input triplets in terms of easy and hard negatives during training. In this paper, we propose SelfCCL, a Curriculum Contrastive Learning model that transfers self-taught knowledge for fine-tuning BERT, mimicking the two ways that humans learn about the world around them: contrastive learning and curriculum learning. The former learns by contrasting similar and dissimilar samples. The latter is inspired by the way humans learn from the simplest concepts to the most complex. Our model performs this training by transferring self-taught knowledge: it figures out which triplets are easy or difficult based on previously learned knowledge, and then learns from those triplets in curriculum order using a contrastive objective. We apply our proposed model to the BERT and Sentence-BERT (SBERT) frameworks. The evaluation results of SelfCCL on the standard STS and SentEval transfer learning tasks show that using curriculum learning together with contrastive learning increases average performance to some extent. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
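A rough sketch of the curriculum mechanism described above: a previously trained encoder scores each triplet's difficulty (how close the negative is relative to the positive), triplets are sorted easy to hard, and training proceeds in that order with a contrastive triplet objective. The toy encoder, scoring rule, and margin are assumptions standing in for (S)BERT and the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def difficulty(encoder, anchor, pos, neg):
    """Harder triplets have negatives almost as close as positives."""
    with torch.no_grad():
        a, p, n = encoder(anchor), encoder(pos), encoder(neg)
        return F.cosine_similarity(a, n) - F.cosine_similarity(a, p)

# A toy encoder and random "sentence" vectors stand in for (S)BERT inputs.
encoder = torch.nn.Linear(16, 16)
triplets = [tuple(torch.randn(1, 16) for _ in range(3)) for _ in range(32)]

# Curriculum: order triplets from easiest to hardest using self-taught scores.
triplets.sort(key=lambda t: difficulty(encoder, *t).item())

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
loss_fn = torch.nn.TripletMarginLoss(margin=0.5)
for anchor, pos, neg in triplets:  # one pass in curriculum order
    loss = loss_fn(encoder(anchor), encoder(pos), encoder(neg))
    opt.zero_grad()
    loss.backward()
    opt.step()
```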

19 pages, 1821 KiB  
Article
FA-RCNet: A Fused Feature Attention Network for Relationship Classification
by Jiakai Tian, Gang Li, Mingle Zhou, Min Li and Delong Han
Appl. Sci. 2022, 12(23), 12460; https://doi.org/10.3390/app122312460 - 6 Dec 2022
Cited by 1 | Viewed by 1233
Abstract
Relation extraction is an important task in natural language processing. It plays an integral role in intelligent question-answering systems, semantic search, and knowledge graph work. For this task, previous studies have demonstrated the effectiveness of convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) in relation classification tasks. Recently, due to the superior performance of the pre-trained model BERT, BERT has become the feature extraction module of many relation classification models, and BERT-based work has achieved good results. However, most such work uses only the deepest features, ignoring the important role of shallow-level information in the relation classification task. To address these problems, this paper proposes FA-RCNet (fusion-attention relationship classification network), a relation classification network with feature fusion and an attention mechanism. FA-RCNet fuses shallow-level features with deep-level features, and augments entity features and global features through the attention module so that the resulting feature vector is better suited to the relation classification task. In addition, our model achieves advanced results on both the SemEval-2010 Task 8 dataset and the KBP37 dataset compared to previously published models. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
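To illustrate the shallow-plus-deep idea, the sketch below runs BERT with all hidden states exposed and combines a [CLS] vector from an early layer with one from the last layer before classification. The layer choice and plain concatenation are assumptions that stand in for the paper's fusion and attention modules.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tok(["the engine powers the car"], return_tensors="pt")
with torch.no_grad():
    out = bert(**inputs)

hidden = out.hidden_states                  # embeddings + 12 encoder layers
shallow = hidden[3][:, 0]                   # [CLS] from an early layer
deep = hidden[-1][:, 0]                     # [CLS] from the last layer
fused = torch.cat([shallow, deep], dim=-1)  # shallow-deep feature fusion

classifier = nn.Linear(fused.size(-1), 19)  # 19 = SemEval-2010 Task 8 classes
print(classifier(fused).shape)              # (1, 19)
```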

18 pages, 4709 KiB  
Article
Framework for Handling Rare Word Problems in Neural Machine Translation System Using Multi-Word Expressions
by Kamal Deep Garg, Shashi Shekhar, Ajit Kumar, Vishal Goyal, Bhisham Sharma, Rajeswari Chengoden and Gautam Srivastava
Appl. Sci. 2022, 12(21), 11038; https://doi.org/10.3390/app122111038 - 31 Oct 2022
Cited by 7 | Viewed by 2049
Abstract
Machine Translation (MT) systems are now being improved with the use of an ongoing methodology known as Neural Machine Translation (NMT). Natural language processing (NLP) researchers have shown that NMT systems are unable to deal with out-of-vocabulary (OOV) words and multi-word expressions (MWEs) in text. OOV terms are those not currently included in the vocabulary used by the NMT system. MWEs are phrases that consist of a minimum of two terms but are treated as a single unit. MWEs have great importance in NLP, linguistic theory, and MT systems. In this article, OOV words and MWEs are handled for the Punjabi to English NMT system. A parallel Punjabi to English corpus containing MWEs was developed and used to train different NMT models. Punjabi is a low-resource language, as it lacks a large parallel corpus for building various NLP tools, and this work attempts to improve the accuracy of the Punjabi to English NMT system by using named entities and MWEs in the corpus. The developed NMT models were assessed using human evaluation through adequacy, fluency, and overall rating, as well as automated assessment tools such as the bilingual evaluation understudy (BLEU) and translation error rate (TER) scores. Results show that using word embeddings (WE) and the MWE corpus increased the translation accuracy for the Punjabi to English language pair. The best BLEU scores obtained were 15.45 for the small test set, 43.32 for the medium test set, and 34.5 for the large test set. The best TER scores obtained were 57.34% for the small test set, 37.29% for the medium test set, and 53.79% for the large test set. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
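As a short illustration of the automated assessment mentioned above, the snippet below computes BLEU and TER with the sacrebleu library on a toy hypothesis/reference pair; the sentences are illustrative, not from the paper's test sets.

```python
from sacrebleu.metrics import BLEU, TER

hypotheses = ["the farmer sold the wheat in the market"]
references = [["the farmer sold his wheat at the market"]]  # one reference stream

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")
```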

26 pages, 1325 KiB  
Article
TDO-Spider Taylor ChOA: An Optimized Deep-Learning-Based Sentiment Classification and Review Rating Prediction
by Santosh Kumar Banbhrani, Bo Xu, Pir Dino Soomro, Deepak Kumar Jain and Hongfei Lin
Appl. Sci. 2022, 12(20), 10292; https://doi.org/10.3390/app122010292 - 13 Oct 2022
Cited by 2 | Viewed by 1410
Abstract
Modern review websites, such as Yelp and Amazon, permit users to post online reviews for numerous businesses, services, and products. Online reviews now play an imperative role in shaping the shopping decisions made by customers, affording consumers experience and information regarding product quality. The prevalent method for exploiting this growing body of online reviews is Sentiment Classification, an attractive domain in industrial and academic research. Reviews are useful in various domains, but it is difficult to collect annotated training data. In this paper, an effective Review Rating Prediction and Sentiment Classification approach was developed. Here, a Gated Recurrent Unit (GRU) was employed for the Sentiment Classification process, whereas a Hierarchical Attention Network (HAN) was applied for Review Rating Prediction. Significant features, such as statistical, SentiWordNet, and classification features, were extracted for the Sentiment Classification and Review Rating Prediction processes. Moreover, the GRU was trained with the designed TD-Spider Taylor ChOA approach, and the HAN was trained with the designed Jaya-TDO approach. The experimental results show that the proposed Jaya-TDO technique attained a better performance of 0.9425, 0.9654, and 0.9538, and that TD-Spider Taylor ChOA achieved 0.9524, 0.9698, and 0.9588 in terms of precision, recall, and F-measure. Full article
(This article belongs to the Special Issue New Technologies and Applications of Natural Language Processing)
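A hedged sketch of the SentiWordNet feature extraction mentioned above: each review token is scored by its first SentiWordNet synset, and the aggregated positive/negative scores become input features for the downstream classifiers. The first-sense choice and the aggregation rule are assumptions; the NLTK wordnet and sentiwordnet corpora must be downloaded first.

```python
# Requires: nltk.download("wordnet"); nltk.download("sentiwordnet")
from nltk.corpus import sentiwordnet as swn

def sentiwordnet_features(tokens):
    """Aggregate positive/negative SentiWordNet scores over a review."""
    pos, neg = 0.0, 0.0
    for tok in tokens:
        synsets = list(swn.senti_synsets(tok))
        if synsets:  # score each word by its first (most frequent) sense
            pos += synsets[0].pos_score()
            neg += synsets[0].neg_score()
    return [pos, neg, pos - neg]  # simple statistical summary features

print(sentiwordnet_features("the food was great but service awful".split()))
```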
