Text Mining, Machine Learning, and Natural Language Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 May 2024 | Viewed by 20511

Special Issue Editors


Prof. Dr. Ahmed Rafea
Guest Editor
Computer Science and Engineering Department, American University in Cairo, AUC Avenue, New Cairo 11835, Egypt
Interests: data, text, and web mining; natural language processing and machine translation; knowledge engineering

Prof. Dr. Julian Szymanski
Guest Editor
Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233 Gdansk, Poland
Interests: natural language processing; intelligent signal analysis; artificial intelligence; data/text mining; machine learning; classification; pattern recognition; clustering

Special Issue Information

Dear Colleagues,

This Special Issue addresses text mining techniques for performing a variety of tasks on textual data. Text mining draws on machine learning and natural language processing to perform tasks such as knowledge extraction, information extraction, summarization, named entity extraction, relation extraction, text embeddings, sentiment classification, topic modeling, fake news identification, and others.

Topics of interest include but are not limited to the following:

  • Text classification and clustering;
  • Text representation using word, sentence, and document embeddings;
  • Text preprocessing using NLP techniques;
  • Text summarization;
  • Web and social content mining;
  • Information and knowledge extraction from textual corpora;
  • Text mining applications in different domains, such as legal, news, and biomedical;
  • Sentiment classification;
  • Opinion mining;
  • Topic modeling.

Prof. Dr. Ahmed Rafea
Prof. Dr. Julian Szymanski
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • classification
  • clustering
  • document embedding
  • information extraction
  • knowledge extraction
  • summarization
  • sentiment analysis
  • opinion mining
  • topic modeling

Published Papers (13 papers)


Research

17 pages, 3568 KiB  
Article
Innovative Use of Self-Attention-Based Ensemble Deep Learning for Suicide Risk Detection in Social Media Posts
by Hoan-Suk Choi and Jinhong Yang
Appl. Sci. 2024, 14(2), 893; https://doi.org/10.3390/app14020893 - 20 Jan 2024
Viewed by 755
Abstract
Suicidal ideation constitutes a critical concern in mental health, adversely affecting individuals and society at large. The early detection of such ideation is vital for providing timely support to individuals and mitigating its societal impact. Social media, serving as a platform for self-expression, offers a rich source of data that can reveal early symptoms of mental health issues. This paper introduces an innovative ensemble learning method named LSTM-Attention-BiTCN, which fuses LSTM and BiTCN models with a self-attention mechanism to detect signs of suicidality in social media posts. Our LSTM-Attention-BiTCN model demonstrated superior performance in comparison to baseline models in classification and suicidal ideation detection, achieving an accuracy of 0.9405, a precision of 0.9385, a recall of 0.9424, and an F1-score of 0.9405. The proposed model can aid healthcare professionals in accurately recognizing suicidal tendencies among social media users, thereby contributing to efforts to reduce suicide rates.
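
As a rough illustration of the fused architecture described above, the following PyTorch sketch runs a BiLSTM branch and a dilated-convolution branch in parallel and pools them with self-attention. The stand-in TCN block, layer sizes, and fusion scheme are assumptions for illustration, not the authors' exact model.

    import torch
    import torch.nn as nn

    class LstmAttnBiTcnSketch(nn.Module):
        """Parallel BiLSTM and dilated-convolution branches fused by
        self-attention pooling; all sizes are illustrative assumptions."""
        def __init__(self, vocab_size, emb_dim=128, hidden=64, num_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            # two dilated convolutions stand in for a BiTCN-style block
            self.tcn = nn.Sequential(
                nn.Conv1d(emb_dim, 2 * hidden, 3, padding=1, dilation=1), nn.ReLU(),
                nn.Conv1d(2 * hidden, 2 * hidden, 3, padding=2, dilation=2), nn.ReLU())
            self.attn = nn.Linear(4 * hidden, 1)   # scores each time step
            self.out = nn.Linear(4 * hidden, num_classes)

        def forward(self, ids):                             # ids: (batch, seq)
            x = self.emb(ids)                               # (batch, seq, emb)
            h_lstm, _ = self.lstm(x)                        # (batch, seq, 2*hidden)
            h_tcn = self.tcn(x.transpose(1, 2)).transpose(1, 2)
            h = torch.cat([h_lstm, h_tcn], dim=-1)          # (batch, seq, 4*hidden)
            w = torch.softmax(self.attn(h), dim=1)          # attention over time
            return self.out((w * h).sum(dim=1))             # pooled -> logits

    logits = LstmAttnBiTcnSketch(vocab_size=30000)(torch.randint(0, 30000, (4, 50)))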

14 pages, 434 KiB  
Article
Abstractive Summarizers Become Emotional on News Summarization
by Vicent Ahuir, José-Ángel González, Lluís-F. Hurtado and Encarna Segarra
Appl. Sci. 2024, 14(2), 713; https://doi.org/10.3390/app14020713 - 15 Jan 2024
Viewed by 707
Abstract
Emotions are central to understanding contemporary journalism; however, they are overlooked in automatic news summarization. Indeed, summaries serve as an entry point to the source article and may favor certain emotions to captivate the reader. Nevertheless, the emotional content of summarization corpora and the emotional behavior of summarization models remain unexplored. In this work, we explore the use of established methodologies to study the emotional content of summarization corpora and the emotional behavior of summarization models. Using these methodologies, we study the emotional content of two widely used summarization corpora, CNN/DailyMail and XSum, and the capabilities of three state-of-the-art transformer-based abstractive systems, BART, PEGASUS, and T5, for eliciting emotions in the generated summaries. The main findings are as follows: (i) emotions are persistent in the two summarization corpora, (ii) summarizers approximate the emotions of the reference summaries moderately well, and (iii) more than 75% of the emotions introduced by novel words in generated summaries are present in the reference ones. The combined use of these methodologies has allowed us to conduct a satisfactory study of the emotional content in news summarization.
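
A minimal sketch of the kind of corpus analysis the abstract describes: run an emotion classifier over reference and generated summaries and compare the label distributions. The checkpoint name and toy texts below are assumptions, not the paper's methodology or data.

    from collections import Counter
    from transformers import pipeline

    # Any off-the-shelf emotion classifier can play this role; the checkpoint
    # name below is an assumption, not the one used in the paper.
    clf = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base")

    def emotion_counts(texts):
        # Count the dominant emotion label assigned to each text.
        return Counter(r["label"] for r in clf(texts, truncation=True))

    refs = ["The rescue effort brought the whole town together."]
    gens = ["Volunteers celebrated after the dramatic rescue."]
    print(emotion_counts(refs), emotion_counts(gens))   # compare distributions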

13 pages, 1120 KiB  
Article
Leveraging Prompt and Top-K Predictions with ChatGPT Data Augmentation for Improved Relation Extraction
by Ping Feng, Hang Wu, Ziqian Yang, Yunyi Wang and Dantong Ouyang
Appl. Sci. 2023, 13(23), 12746; https://doi.org/10.3390/app132312746 - 28 Nov 2023
Viewed by 768
Abstract
Relation extraction tasks aim to predict the type of relationship between two entities from a given text. However, many existing methods fail to fully utilize the semantic information and the output probability distribution of pre-trained language models, and existing data augmentation approaches for natural language processing (NLP) may introduce errors. To address these issues, we propose a method that introduces prompt information and Top-K prediction sets and utilizes ChatGPT for data augmentation to improve relation classification performance. First, we add prompt information before each sample and encode the modified samples with the pre-trained language model RoBERTa, using the resulting feature vectors to obtain the Top-K prediction set. We add a multi-attention mechanism to link the Top-K prediction set with the prompt information. We then reduce the possibility of introducing noise by bootstrapping ChatGPT so that it can better perform the data augmentation task, reducing unnecessary subsequent operations. Finally, we investigate the predefined relationship categories in the SemEval 2010 Task 8 dataset and the prediction results of the model and propose an entity location prediction task designed to assist the model in accurately determining the relative locations between entities. Experimental results indicate that our model achieves strong results on the SemEval 2010 Task 8 dataset.
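
A hedged sketch of the Top-K idea: prepend a prompt to the sample, encode it with RoBERTa, and keep the K most probable relation labels. The prompt format and entity markers are illustrative, and the classification head here is untrained; SemEval 2010 Task 8 defines 19 relation labels.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=19)   # head is randomly initialized here

    prompt = "What is the relation between <e1> and <e2>?"
    sentence = "The <e1>people</e1> have been moving back into <e2>downtown</e2>."
    inputs = tok(prompt + " " + sentence, return_tensors="pt")

    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)     # (1, 19)
    topk = torch.topk(probs, k=3, dim=-1)                  # Top-K prediction set
    print(topk.indices.tolist(), topk.values.tolist())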

14 pages, 793 KiB  
Article
Unlocking Everyday Wisdom: Enhancing Machine Comprehension with Script Knowledge Integration
by Zhihao Zhou, Tianwei Yue, Chen Liang, Xiaoyu Bai, Dachi Chen, Congrui Hetang and Wenping Wang
Appl. Sci. 2023, 13(16), 9461; https://doi.org/10.3390/app13169461 - 21 Aug 2023
Cited by 2 | Viewed by 847
Abstract
Harnessing commonsense knowledge poses a significant challenge for machine comprehension systems. This paper focuses on incorporating a specific subset of commonsense knowledge: script knowledge, i.e., knowledge about the sequences of actions that individuals typically perform in everyday life. Our experiments were centered around the MCScript dataset, which was the basis of SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge. As a baseline, we utilized our Three-Way Attentive Network (TriAN) framework to model the interactions among passages, questions, and answers. Building upon TriAN, we proposed to: (1) integrate a pre-trained language model to capture script knowledge; (2) introduce multi-layer attention to facilitate multi-hop reasoning; and (3) incorporate positional embeddings to enhance the model's capacity for event-ordering reasoning. In this paper, we present our proposed methods and demonstrate their efficacy in improving script knowledge integration and reasoning.
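
A small PyTorch sketch of proposal (2), multi-layer attention for multi-hop reasoning: the question representation repeatedly attends over the passage, one hop per layer. Dimensions, head count, and hop count are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class MultiHopAttention(nn.Module):
        """Stacked attention: each hop re-attends the passage conditioned on
        the question summary so far, enabling multi-hop reasoning."""
        def __init__(self, dim=64, hops=2):
            super().__init__()
            self.hops = nn.ModuleList(
                nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                for _ in range(hops))

        def forward(self, question, passage):
            q = question
            for attn in self.hops:
                q, _ = attn(q, passage, passage)   # refine question per hop
            return q

    q = torch.randn(2, 8, 64)     # (batch, question_len, dim)
    p = torch.randn(2, 40, 64)    # (batch, passage_len, dim)
    out = MultiHopAttention()(q, p)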

24 pages, 1852 KiB  
Article
High-Quality Data from Crowdsourcing towards the Creation of a Mexican Anti-Immigrant Speech Corpus
by Alejandro Molina-Villegas, Thomas Cattin, Karina Gazca-Hernandez and Edwin Aldana-Bobadilla
Appl. Sci. 2023, 13(14), 8417; https://doi.org/10.3390/app13148417 - 21 Jul 2023
Viewed by 830
Abstract
Currently, a significant portion of published research on online hate speech relies on existing textual corpora. However, when examining a specific context, preexisting datasets that capture the particularities of various conditions (e.g., geographic and cultural) are often lacking. This issue is evident in the case of online anti-immigrant speech in Mexico, where data available to study this emergent and often overlooked phenomenon are scarce. In light of this situation, we propose a novel methodology wherein three domain experts annotate a set of texts related to the subject. Based on these annotations, we establish a precise control mechanism to evaluate non-expert annotators. The evaluation of contributors is implemented in a custom annotation platform, enabling us to conduct a controlled crowdsourcing campaign and assess the reliability of the obtained data. Our results demonstrate that a combination of crowdsourced and expert data leads to iterative improvements, not only in the accuracy achieved by various machine learning classification models (reaching 0.8828) but also in the models' adaptation to the specific characteristics of hate speech in the Mexican Twittersphere. In addition to these methodological innovations, the most significant contribution of our work is the creation of the first online Mexican anti-immigrant training corpus for machine-learning-based detection tasks.
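
A minimal sketch of the control mechanism described above: score each crowd contributor against expert gold labels on control items and retain only reliable workers. The labels and identifiers are illustrative.

    def annotator_accuracy(gold, annotations):
        # Score each crowd contributor against expert gold labels on the
        # control items they annotated.
        scores = {}
        for annotator, labels in annotations.items():
            seen = [i for i in gold if i in labels]
            hits = sum(labels[i] == gold[i] for i in seen)
            scores[annotator] = hits / len(seen) if seen else 0.0
        return scores

    gold = {0: "anti-immigrant", 1: "neutral", 2: "anti-immigrant"}  # experts
    annotations = {
        "worker_a": {0: "anti-immigrant", 1: "neutral", 2: "neutral"},
        "worker_b": {0: "anti-immigrant", 1: "neutral", 2: "anti-immigrant"},
    }
    print(annotator_accuracy(gold, annotations))  # keep workers above a threshold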

18 pages, 2072 KiB  
Article
GenCo: A Generative Learning Model for Heterogeneous Text Classification Based on Collaborative Partial Classifications
by Zie Eya Ekolle and Ryuji Kohno
Appl. Sci. 2023, 13(14), 8211; https://doi.org/10.3390/app13148211 - 14 Jul 2023
Viewed by 1075
Abstract
The use of generative learning models in natural language processing (NLP) has significantly contributed to the advancement of natural language applications, such as sentiment analysis, topic modeling, text classification, chatbots, and spam filtering. With a large amount of text generated each day from different sources, such as web pages, blogs, emails, social media, and articles, one of the most common tasks in NLP is the classification of a text corpus. This is important in many institutions for planning, decision-making, and creating archives of their projects. Many algorithms exist to automate text classification tasks, but the most intriguing are those that also learn these tasks automatically. In this study, we present a new model that infers and learns from data using probabilistic logic and apply it to text classification. This model, called GenCo, is a multi-input single-output (MISO) learning model that uses a collaboration of partial classifications to generate the desired output. It provides a heterogeneity measure to explain its classification results and enables a reduction in the curse of dimensionality in text classification. Experiments with the model were carried out on the Twitter US Airline dataset, the Conference Paper dataset, and the SMS Spam dataset, outperforming baseline models with 98.40%, 89.90%, and 99.26% accuracy, respectively.
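
One way to picture the collaboration of partial classifications in a MISO setup: train a partial classifier per input field and combine their class-probability estimates with a product rule. This scikit-learn sketch is a guess at the general idea, not GenCo's actual probabilistic-logic inference; the texts and labels are toy placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    titles = ["deep nets for vision", "cheap pills online now"]
    bodies = ["we train a convnet on images", "click here to buy today"]
    y = [0, 1]   # 0 = conference paper, 1 = spam

    # One partial classifier per input field.
    partials = [make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(f, y)
                for f in (titles, bodies)]

    def predict(title, body):
        p = np.ones(2)
        for clf, text in zip(partials, (title, body)):
            p *= clf.predict_proba([text])[0]   # combine partial classifications
        return p / p.sum()

    print(predict("fast vision models", "a convnet trained on images"))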

24 pages, 3188 KiB  
Article
One-Class Learning for AI-Generated Essay Detection
by Roberto Corizzo and Sebastian Leal-Arenas
Appl. Sci. 2023, 13(13), 7901; https://doi.org/10.3390/app13137901 - 05 Jul 2023
Cited by 1 | Viewed by 2548
Abstract
Detection of AI-generated content is a crucially important task given the increasing adoption of AI tools, such as ChatGPT, and the concerns raised with regard to academic integrity. Existing text classification approaches, including neural-network-based and feature-based methods, are mostly tailored for English data, and they are typically limited to a supervised learning setting. Although one-class learning methods are well suited to such detection tasks, their effectiveness in essay detection is still unknown. In this paper, this gap is explored by adopting linguistic features and one-class learning models for AI-generated essay detection. The detection performance of different models is assessed in settings where positively labeled data, i.e., AI-generated essays, are unavailable for model training. Results on two datasets containing essays in L2 English and L2 Spanish show that it is feasible to accurately detect AI-generated essays. The analysis reveals which models and which sets of linguistic features are most powerful in the detection task.
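
A minimal sketch of the one-class setting, assuming toy linguistic features and a OneClassSVM: the detector is fit on human-written essays only, so AI-generated text is flagged as an outlier at test time. The feature set, model choice, and texts are illustrative, not the paper's.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def linguistic_features(text):
        # Toy features; the paper uses a much richer linguistic feature set.
        words = text.split()
        sentences = max(text.count("."), 1)
        return [len(words) / sentences,                    # mean sentence length
                len(set(words)) / max(len(words), 1)]      # type-token ratio

    human = ["I walked to class. The lecture ran long. We argued about it.",
             "My summer job taught me patience. Customers were rarely simple."]
    X = np.array([linguistic_features(t) for t in human])

    # Train on human-written essays only; AI-generated text is the anomaly.
    detector = OneClassSVM(nu=0.1, kernel="rbf").fit(X)

    candidate = "In conclusion, education is important. It shapes societies."
    print(detector.predict([linguistic_features(candidate)]))  # +1 in, -1 out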

14 pages, 2080 KiB  
Article
Enhancing Abstractive Summarization with Extracted Knowledge Graphs and Multi-Source Transformers
by Tong Chen, Xuewei Wang, Tianwei Yue, Xiaoyu Bai, Cindy X. Le and Wenping Wang
Appl. Sci. 2023, 13(13), 7753; https://doi.org/10.3390/app13137753 - 30 Jun 2023
Cited by 6 | Viewed by 3155
Abstract
As the popularity of large language models (LLMs) has risen over the past year, led by GPT-3/4 and especially its productization as ChatGPT, we have witnessed the extensive application of LLMs to text summarization. However, LLMs cannot intrinsically verify the correctness of the information they supply and generate. This research introduces a novel approach to abstractive summarization that aims to address this limitation by leveraging extracted knowledge graph information and structured semantics as a guide for summarization. Building upon BART, one of the state-of-the-art sequence-to-sequence pre-trained LLMs, multi-source transformer modules are developed as an encoder capable of processing both textual and graphical inputs. Decoding is performed on this enriched encoding to enhance the summary quality. The Wiki-Sum dataset, derived from Wikipedia text dumps, is introduced for evaluation purposes. Comparative experiments with baseline models demonstrate the strengths of the proposed approach in generating informative and relevant summaries. We conclude by presenting our insights into utilizing LLMs with external graph information, a powerful aid towards the goal of factually correct and verified LLMs.
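
The sketch below approximates the idea by linearizing extracted triples into the BART encoder input; the paper instead develops dedicated multi-source transformer encoder modules, so treat this only as a simple stand-in. The triples and article text are invented placeholders.

    from transformers import BartForConditionalGeneration, BartTokenizer

    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    # Linearize extracted triples and prepend them to the article text.
    triples = [("Marie Curie", "won", "Nobel Prize"),
               ("Marie Curie", "field", "physics")]
    graph = " ; ".join(" | ".join(t) for t in triples)
    article = "Marie Curie, a pioneering physicist, received the Nobel Prize ..."

    inputs = tok(graph + " </s> " + article, return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_length=40, num_beams=4)
    print(tok.decode(ids[0], skip_special_tokens=True))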

20 pages, 742 KiB  
Article
EvoText: Enhancing Natural Language Generation Models via Self-Escalation Learning for Up-to-Date Knowledge and Improved Performance
by Zhengqing Yuan, Huiwen Xue, Chao Zhang and Yongming Liu
Appl. Sci. 2023, 13(8), 4758; https://doi.org/10.3390/app13084758 - 10 Apr 2023
Viewed by 1452
Abstract
In recent years, pretrained models have been widely used in various fields, including natural language understanding, computer vision, and natural language generation. However, the performance of these language generation models is highly dependent on model size and dataset size. While larger models excel in some aspects, they cannot learn up-to-date knowledge and are relatively difficult to relearn. In this paper, we introduce EvoText, a novel training method that enhances the performance of any natural language generation model without requiring additional datasets during the entire training process (although a prior dataset is necessary for pretraining). EvoText employs two models: G, a text generation model, and D, a model that determines whether the data generated by G are legitimate. Initially, the fine-tuned D model serves as the knowledge base. The text generated by G is then input to D to determine whether it is legitimate, and finally G is fine-tuned based on D's output. EvoText enables the model to learn up-to-date knowledge through a self-escalation process that builds on a priori knowledge; when EvoText needs to learn something new, it simply fine-tunes the D model. Our approach applies to autoregressive language modeling for all Transformer classes. With EvoText, eight models achieved stable improvements on seven natural language processing tasks without any changes to the model structure.
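
A schematic of one EvoText-style self-escalation round as described above, with toy stand-ins for G, D, and the fine-tuning step; the real method uses pretrained neural generator and discriminator models.

    def evotext_round(G, D, prompts, fine_tune):
        # One self-escalation round: G generates, D filters, G learns from
        # whatever D accepts as legitimate.
        generated = [G(p) for p in prompts]
        accepted = [t for t in generated if D(t)]   # D acts as knowledge base
        if accepted:
            fine_tune(G, accepted)
        return accepted

    # Toy stand-ins for illustration only.
    G = lambda p: p + " ... a generated continuation"
    D = lambda t: "generated" in t
    fine_tune = lambda model, texts: None
    print(evotext_round(G, D, ["The capital of France is"], fine_tune))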

19 pages, 605 KiB  
Article
An Abstractive Summarization Model Based on Joint-Attention Mechanism and a Priori Knowledge
by Yuanyuan Li, Yuan Huang, Weijian Huang, Junhao Yu and Zheng Huang
Appl. Sci. 2023, 13(7), 4610; https://doi.org/10.3390/app13074610 - 05 Apr 2023
Cited by 3 | Viewed by 1510
Abstract
An abstractive summarization model based on a joint-attention mechanism and a priori knowledge is proposed to address the inadequate semantic understanding of text and the generation of summaries that do not conform to human language habits in abstractive summarization models. First, the word vectors most relevant to the original text are selected. Second, the original text is represented at two levels, word-level and sentence-level, as word vectors and sentence vectors, respectively. After this processing, relationships exist not only between word-level vectors but also between sentence-level vectors, and the decoder discriminates between word-level and sentence-level vectors based on their relationship with its hidden state. Then, the pointer-generator network is improved using a priori knowledge. Finally, reinforcement learning is used to improve the quality of the generated summaries. Experiments on two classical datasets, CNN/DailyMail and DUC 2004, show that the model performs well and effectively improves the quality of generated summaries.
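
For reference, a sketch of the standard pointer-generator output distribution that the model improves upon: the final word probability mixes the decoder's vocabulary distribution with attention mass copied from the source. The a priori knowledge improvement itself is not reproduced here; tensor sizes are toy values.

    import torch

    def pointer_generator_dist(p_vocab, attn, src_ids, vocab_size, p_gen):
        # Final distribution: p_gen * P_vocab + (1 - p_gen) * copied attention.
        copy = torch.zeros(vocab_size)
        copy.scatter_add_(0, src_ids, attn)   # route attention mass to tokens
        return p_gen * p_vocab + (1 - p_gen) * copy

    vocab_size = 10
    p_vocab = torch.softmax(torch.randn(vocab_size), dim=0)  # decoder vocab dist
    attn = torch.softmax(torch.randn(4), dim=0)              # attention, 4 tokens
    src_ids = torch.tensor([2, 5, 5, 7])                     # source token ids
    print(pointer_generator_dist(p_vocab, attn, src_ids, vocab_size, p_gen=0.8))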

14 pages, 514 KiB  
Article
Readability Metrics for Machine Translation in Dutch: Google vs. Azure & IBM
by Chaïm van Toledo, Marijn Schraagen, Friso van Dijk, Matthieu Brinkhuis and Marco Spruit
Appl. Sci. 2023, 13(7), 4444; https://doi.org/10.3390/app13074444 - 31 Mar 2023
Cited by 1 | Viewed by 1242
Abstract
This paper introduces a novel method to predict when a Google translation is better than other machine translations (MT) into Dutch. Instead of considering fidelity, this approach considers fluency and readability indicators of when Google ranked best, exploring an alternative approach in the field of quality estimation. The paper contributes a published dataset of sentences translated from English to Dutch, with human-made classifications on a best-worst scale. Logistic regression shows a correlation between T-Scan output, such as readability measurements like lemma frequencies, and the cases where the Google translation was better than those of Azure and IBM. The last part of the results section examines prediction possibilities, first with logistic regression and second with an automatically generated machine learning model, achieving accuracies of 0.59 and 0.61, respectively.
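
A toy sketch of the prediction step: logistic regression over T-Scan-style readability features to classify whether Google ranked best. The feature values below are fabricated placeholders, not the published data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Rows = translated sentences, columns = readability features (e.g., mean
    # lemma frequency, sentence length); values here are placeholders.
    X = np.array([[4.2, 11], [3.1, 25], [4.8, 9], [2.9, 30], [4.5, 14], [3.3, 22]])
    y = np.array([1, 0, 1, 0, 1, 0])   # 1 = Google ranked best

    print(cross_val_score(LogisticRegression(), X, y, cv=3).mean())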

21 pages, 801 KiB  
Article
Does Context Matter? Effective Deep Learning Approaches to Curb Fake News Dissemination on Social Media
by Jawaher Alghamdi, Yuqing Lin and Suhuai Luo
Appl. Sci. 2023, 13(5), 3345; https://doi.org/10.3390/app13053345 - 06 Mar 2023
Cited by 6 | Viewed by 1805
Abstract
The prevalence of fake news on social media has led to major sociopolitical issues, making the need for automated fake news detection more important than ever. In this work, we investigated the interplay between news content and users' posting behavior clues in detecting fake news using state-of-the-art deep learning approaches: a convolutional neural network (CNN), which applies a series of filters of different sizes and shapes to the original sentence matrix to create further low-dimensional matrices, and a bidirectional gated recurrent unit (BiGRU), a bidirectional recurrent neural network with update and reset gates, coupled with a self-attention mechanism. The proposed architectures introduce a novel approach to learning rich, semantic, and contextual representations of a given news text using the natural language understanding of transfer learning coupled with context-based features. Experiments were conducted on the FakeNewsNet dataset. The experimental results show that incorporating information about users' posting behaviors (when available) improves performance compared to models that rely solely on textual news data.
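
A compact PyTorch sketch of the two text branches named above, multi-width CNN filters and a BiGRU with self-attention pooling; sizes are illustrative, and the transfer-learning encoder and user-behavior features of the full architecture are omitted.

    import torch
    import torch.nn as nn

    class CnnBiGruSketch(nn.Module):
        """Multi-width CNN filters plus a BiGRU with self-attention pooling;
        all sizes are illustrative assumptions."""
        def __init__(self, vocab=20000, emb=100, hid=64, n_classes=2):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.convs = nn.ModuleList(
                nn.Conv1d(emb, hid, k, padding=k // 2) for k in (3, 4, 5))
            self.gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hid, 1)
            self.out = nn.Linear(3 * hid + 2 * hid, n_classes)

        def forward(self, ids):
            x = self.emb(ids)                                # (batch, seq, emb)
            c = torch.cat([conv(x.transpose(1, 2)).relu().max(dim=2).values
                           for conv in self.convs], dim=1)   # CNN branch
            h, _ = self.gru(x)                               # BiGRU branch
            w = torch.softmax(self.attn(h), dim=1)           # self-attention
            return self.out(torch.cat([c, (w * h).sum(dim=1)], dim=1))

    print(CnnBiGruSketch()(torch.randint(0, 20000, (2, 30))).shape)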

14 pages, 1434 KiB  
Article
Predicting Location of Tweets Using Machine Learning Approaches
by Mohammed Alsaqer, Salem Alelyani, Mohamed Mohana, Khalid Alreemy and Ali Alqahtani
Appl. Sci. 2023, 13(5), 3025; https://doi.org/10.3390/app13053025 - 26 Feb 2023
Cited by 3 | Viewed by 2186
Abstract
Twitter, one of the most popular microblogging platforms, has tens of millions of active users worldwide, generating hundreds of millions of posts every day. Twitter posts, referred to as "tweets", are short and noisy texts that bring many challenges, for example in the case of an emergency or disaster. Predicting the location of these tweets is important for social, security, human rights, and business reasons, and has attracted noteworthy attention lately. However, most Twitter users disable the geo-tagging feature, and their home locations are neither standardized nor accurate. In this study, we applied four machine learning techniques, namely Logistic Regression, Random Forest, Multinomial Naïve Bayes, and Support Vector Machine, with and without a geo-distance matrix, to predict the location of a tweet from its textual content. Our extensive experiments on a large collection of Arabic tweets from Saudi Arabia with different feature sets yielded promising results with 67% accuracy.
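
A minimal text-only baseline in the spirit of the study: TF-IDF features with one of the four classifiers (Logistic Regression). The tweets and city labels are invented placeholders, and the geo-distance matrix variant is not shown.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented placeholder tweets and labels; the study also evaluates Random
    # Forest, Multinomial Naive Bayes, and Support Vector Machine classifiers.
    tweets = ["traffic on king fahd road again", "beautiful corniche sunset",
              "exam week at the university", "sandstorm covered the city"]
    cities = ["Riyadh", "Jeddah", "Riyadh", "Riyadh"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(tweets, cities)
    print(model.predict(["stuck in king fahd road traffic"]))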
