Advances in Natural Language Processing and Text Mining

A special issue of Big Data and Cognitive Computing (ISSN 2504-2289).

Deadline for manuscript submissions: 31 July 2024

Special Issue Editors

Dr. Zuchao Li
Guest Editor
School of Computer Science, Wuhan University, Wuhan 430072, China
Interests: parsing; information extraction; machine translation; large language models; multi-modal processing; natural language understanding; text mining

Prof. Dr. Min Peng
Guest Editor
School of Computer Science, Wuhan University, Wuhan 430072, China
Interests: text mining; entity linking; knowledge graph; natural language processing

Special Issue Information

Dear Colleagues,

Natural language processing (NLP) and text mining are two rapidly evolving fields of increasing importance in both academic and industrial research. NLP focuses on the interaction between human language and computers, while text mining aims to extract useful insights and knowledge from unstructured textual data. Both fields are essential for handling the vast amounts of text data generated in today's world and underpin applications such as information retrieval, sentiment analysis, machine translation, and many others.

With the growing volume and complexity of textual data, new challenges and opportunities arise in NLP and text mining. Recent advancements in machine learning, deep learning, and artificial intelligence have led to significant improvements in these fields. However, there is still much room for innovation and research to tackle the existing challenges.

The aim of this Special Issue is to present the latest research and developments in NLP and text mining, including new methodologies, techniques, and applications. It intends to bring together researchers, practitioners, and academics to showcase their work and share their knowledge and expertise in these fields. The subject matter aligns with the broader scope of Big Data and Cognitive Computing, which explores the intersection of big data, cognitive computing, and artificial intelligence; NLP and text mining contribute directly to advances in both areas.

In this Special Issue, original research articles and reviews are welcome. Research areas may include (but are not limited to) the following:

  • Natural language understanding: techniques and algorithms for understanding and analyzing natural language, including sentiment analysis, topic modeling, named entity recognition, and entity linking.
  • Text mining and information retrieval: approaches for mining knowledge and insights from unstructured text data, including information retrieval, text classification, and clustering.
  • Deep learning for NLP and text mining: deep learning-based techniques for natural language processing and text mining, including neural language models, sequence-to-sequence models, and attention-based models.
  • Large language model pre-training: techniques for pre-training large language models, including BERT, GPT, and RoBERTa, and their applications in NLP and text mining tasks.
  • Multimodal NLP: techniques for analyzing and understanding multimodal data, including text, images, and videos.
  • Text generation: techniques for generating natural language text, including text summarization, question-answering systems, and text-to-speech systems.
  • Applications of NLP and text mining: practical applications of NLP and text mining in various domains, including healthcare, finance, social media, and e-commerce.
  • Explainable NLP and text mining: approaches for making NLP models more transparent and interpretable, including model visualization, attention mechanisms, and explainable AI.
  • Low-resource NLP and text mining: techniques for NLP tasks in low-resource languages or domains, where training data are scarce, including transfer learning, domain adaptation, and few-shot learning.
  • Multilingual NLP and text mining: techniques for processing and analyzing text data in multiple languages, including multilingual embeddings, cross-lingual transfer learning, and multilingual topic modeling.
  • NLP and text mining for social good: applications of NLP and text mining for social good, including hate speech detection, cyberbullying prevention, and disaster response.

We look forward to receiving your contributions.

Dr. Zuchao Li
Prof. Dr. Min Peng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Big Data and Cognitive Computing is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • text mining
  • deep learning
  • large language models
  • information retrieval
  • entity linking
  • relation extraction
  • multimodal NLP
  • low-resource NLP
  • NLP and text mining for social good

Published Papers (8 papers)


Research

18 pages, 470 KiB  
Article
Comparing Hierarchical Approaches to Enhance Supervised Emotive Text Classification
by Lowri Williams, Eirini Anthi and Pete Burnap
Big Data Cogn. Comput. 2024, 8(4), 38; https://doi.org/10.3390/bdcc8040038 - 29 Mar 2024
Abstract
The performance of emotive text classification using affective hierarchical schemes (e.g., WordNet-Affect) is often evaluated using the same traditional measures applied when a finite set of isolated classes is used. However, applying such measures means the full characteristics and structure of the emotive hierarchical scheme are not considered. Thus, the overall performance of emotive text classification using emotion hierarchical schemes is often inaccurately reported and may lead to ineffective information retrieval and decision making. This paper provides a comparative investigation into how methods used in hierarchical classification problems in other domains, which extend traditional evaluation metrics to consider the characteristics of the hierarchical classification scheme, can be applied to, and subsequently improve, the classification of emotive texts. This study investigates the classification performance of three widely used classifiers, Naive Bayes, J48 Decision Tree, and SVM, following the application of the aforementioned methods. The results demonstrated that all the methods improved the emotion classification. However, the most notable improvement was recorded when a depth-based method was applied to both the testing and validation data, where the precision, recall, and F1-score were significantly improved by around 70 percentage points for each classifier.
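
As a concrete illustration of the hierarchy-aware evaluation discussed in this abstract, the short Python sketch below (a toy example with an invented emotion hierarchy, not the authors' code or the WordNet-Affect scheme) expands each gold and predicted label to its set of ancestors before computing micro-averaged precision, recall, and F1, so that predictions landing in the correct branch of the hierarchy receive partial credit.

```python
# Toy hierarchy-aware precision/recall/F1: labels are expanded to ancestor sets
# before the usual set-based metrics are computed (illustrative only).
emotion_parents = {  # hypothetical miniature hierarchy
    "joy": "positive-emotion", "love": "positive-emotion",
    "anger": "negative-emotion", "fear": "negative-emotion",
    "positive-emotion": "emotion", "negative-emotion": "emotion",
}

def ancestors(label):
    """Return the label together with all of its ancestors up to the root."""
    chain = {label}
    while label in emotion_parents:
        label = emotion_parents[label]
        chain.add(label)
    return chain

def hierarchical_prf(y_true, y_pred):
    """Micro-averaged hierarchical precision, recall, and F1."""
    inter = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        t_set, p_set = ancestors(t), ancestors(p)
        inter += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    precision, recall = inter / pred_total, inter / true_total
    return precision, recall, 2 * precision * recall / (precision + recall)

# "love" is wrong at the leaf level but shares the "positive-emotion" branch with "joy".
print(hierarchical_prf(["joy", "anger"], ["love", "anger"]))
```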

26 pages, 6098 KiB  
Article
Unveiling Sentiments: A Comprehensive Analysis of Arabic Hajj-Related Tweets from 2017–2022 Utilizing Advanced AI Models
by Hanan M. Alghamdi
Big Data Cogn. Comput. 2024, 8(1), 5; https://doi.org/10.3390/bdcc8010005 - 02 Jan 2024
Abstract
Sentiment analysis plays a crucial role in understanding public opinion and social media trends. It involves analyzing the emotional tone and polarity of a given text. When applied to Arabic text, this task becomes particularly challenging due to the language's complex morphology, right-to-left script, and intricate nuances in expressing emotions. Social media has emerged as a powerful platform for individuals to express their sentiments, especially regarding religious and cultural events. Consequently, studying sentiment analysis in the context of Hajj has become a captivating subject. This research paper presents a comprehensive sentiment analysis of tweets discussing the annual Hajj pilgrimage over a six-year period. By employing a combination of machine learning and deep learning models, this study successfully conducted sentiment analysis on a sizable dataset of Arabic tweets. The process involved pre-processing, feature extraction, and sentiment classification. The objective was to uncover the prevailing sentiments associated with Hajj over different years, before, during, and after each Hajj event. Importantly, the results presented in this study highlight that BERT, an advanced transformer-based model, outperformed the other models in accurately classifying sentiment. This underscores its effectiveness in capturing the complexities inherent in Arabic text.
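
For readers unfamiliar with the tooling, the following is a minimal, hedged sketch of scoring Arabic tweet sentiment with a pretrained BERT-family model via the Hugging Face pipeline API; the checkpoint name is an illustrative assumption, not the model trained in this study.

```python
# Illustrative only: score Arabic tweets with a publicly available pretrained
# sentiment model (checkpoint choice is an assumption, not the paper's model).
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment",  # assumed checkpoint
)

tweets = ["الحج تجربة روحانية رائعة", "الازدحام هذا العام كان صعبا جدا"]
for tweet, result in zip(tweets, classifier(tweets)):
    print(tweet, "->", result["label"], round(result["score"], 3))
```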

12 pages, 2495 KiB  
Article
Semantic Similarity of Common Verbal Expressions in Older Adults through a Pre-Trained Model
by Marcos Orellana, Patricio Santiago García, Guillermo Daniel Ramon, Jorge Luis Zambrano-Martinez, Andrés Patiño-León, María Verónica Serrano and Priscila Cedillo
Big Data Cogn. Comput. 2024, 8(1), 3; https://doi.org/10.3390/bdcc8010003 - 29 Dec 2023
Abstract
Health problems in older adults can make communication with peers, family, and caregivers challenging; therefore, alternative methods are needed to facilitate communication. In this context, Augmentative and Alternative Communication (AAC) methods are widely used to support this population segment. Moreover, with Artificial Intelligence (AI), and specifically machine learning algorithms, AAC can be improved. Although there have been several studies in this field, it is interesting to analyze the common phrases used by seniors, depending on their context (i.e., slang and everyday expressions typical of their age). This paper proposes a semantic analysis of the common phrases of older adults and their corresponding meanings through Natural Language Processing (NLP) techniques and a pre-trained language model, using semantic textual similarity to match the older adults' phrases with their corresponding graphic images (pictograms). The results show good semantic similarity scores between the older adults' phrases and the definitions, indicating that each phrase can be linked to its corresponding pictogram with high probability.
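
A minimal sketch of the semantic-textual-similarity matching described above, under assumptions (the multilingual sentence-encoder checkpoint and the toy Spanish phrases are illustrative, not the study's model or data): each colloquial phrase is embedded and paired with the pictogram definition that maximizes cosine similarity.

```python
# Illustrative phrase-to-definition matching with a pretrained sentence encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint

phrase = "estoy hecho polvo"                       # colloquial: "I am exhausted"
definitions = ["estar muy cansado", "estar muy contento", "tener hambre"]

phrase_emb = model.encode(phrase, convert_to_tensor=True)
def_embs = model.encode(definitions, convert_to_tensor=True)

scores = util.cos_sim(phrase_emb, def_embs)[0]     # cosine similarity per definition
best = int(scores.argmax())
print(definitions[best], float(scores[best]))      # definition linked to the pictogram
```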

15 pages, 1210 KiB  
Article
Text Classification Based on the Heterogeneous Graph Considering the Relationships between Documents
by Hiromu Nakajima and Minoru Sasaki
Big Data Cogn. Comput. 2023, 7(4), 181; https://doi.org/10.3390/bdcc7040181 - 13 Dec 2023
Cited by 1
Abstract
Text classification is the task of estimating the genre of a document based on information such as word co-occurrence and frequency of occurrence. Text classification has been studied by various approaches. In this study, we focused on text classification using graph-structured data. Conventional graph-based methods express relationships between words, and between words and documents, as weights between nodes, and a graph neural network is then used for learning. However, conventional methods cannot represent relationships between documents on the graph. In this paper, we propose a graph structure that considers the relationships between documents. In the proposed method, the cosine similarity of document vectors is set as the weight between document nodes, yielding a graph that captures the relationships between documents. The graph is then input into a graph convolutional neural network for training. The aim of this study is therefore to improve the text classification performance of conventional methods by using this graph that considers the relationships between document nodes. In this study, we conducted evaluation experiments using five different corpora of English documents. The results showed that the proposed method outperformed the conventional method by up to 1.19%, indicating that the use of relationships between documents is effective. In addition, the proposed method was shown to be particularly effective in classifying long documents.
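
The document-to-document edges described above can be sketched in a few lines; the snippet below is an assumption-laden toy (TF-IDF vectors, a hand-picked similarity threshold, and three placeholder documents), not the authors' implementation, but it shows how cosine similarity between document vectors yields weighted edges between document nodes.

```python
# Illustrative construction of weighted document-document edges for a text graph.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the team won the football match",
    "the league announced the football schedule",
    "the central bank raised interest rates",
]

doc_vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(doc_vectors)        # (n_docs x n_docs) cosine similarity matrix
np.fill_diagonal(sim, 0.0)                  # drop self-loops

threshold = 0.1                             # assumed cut-off for keeping an edge
edges = [
    (i, j, float(sim[i, j]))
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
    if sim[i, j] > threshold
]
print(edges)  # weighted edges to add between document nodes in the heterogeneous graph
```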

20 pages, 7293 KiB  
Article
Empowering Propaganda Detection in Resource-Restraint Languages: A Transformer-Based Framework for Classifying Hindi News Articles
by Deptii Chaudhari and Ambika Vishal Pawar
Big Data Cogn. Comput. 2023, 7(4), 175; https://doi.org/10.3390/bdcc7040175 - 15 Nov 2023
Cited by 3
Abstract
Misinformation, fake news, and various propaganda techniques are increasingly used in digital media. Uncovering propaganda is challenging because it works with the systematic goal of influencing other individuals towards predetermined ends. While significant research has been reported on propaganda identification and classification in resource-rich languages such as English, much less effort has been made in resource-deprived languages like Hindi. The spread of propaganda in the Hindi news media motivated us to devise an approach for the propaganda categorization of Hindi news articles. The unavailability of the necessary language tools makes propaganda classification in Hindi more challenging. This study proposes the effective use of deep learning and transformer-based approaches for Hindi computational propaganda classification. To address the lack of pretrained word embeddings in Hindi, Hindi Word2vec embeddings were created using the H-Prop-News corpus for feature extraction. Subsequently, three deep learning models, i.e., a CNN (convolutional neural network), an LSTM (long short-term memory), and a Bi-LSTM (bidirectional long short-term memory), and four transformer-based models, i.e., multilingual BERT, Distil-BERT, Hindi-BERT, and Hindi-TPU-Electra, were evaluated. The experimental outcomes indicate that the multilingual BERT and Hindi-BERT models provide the best performance, with the highest F1 score of 84% on the test data. These results strongly support the efficacy of the proposed solution and indicate its appropriateness for propaganda classification.
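
As a hedged sketch of the embedding step mentioned above (training word vectors from a domain corpus when no pretrained embeddings exist), the snippet below trains a toy Word2vec model with gensim; the two tokenized sentences stand in for the H-Prop-News corpus, and the hyperparameters are illustrative assumptions.

```python
# Illustrative training of domain-specific Word2vec embeddings with gensim.
from gensim.models import Word2Vec

# Placeholder tokenized Hindi sentences; a real setup would tokenize the news corpus.
sentences = [
    ["यह", "एक", "समाचार", "लेख", "है"],
    ["प्रचार", "का", "पता", "लगाना", "कठिन", "है"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # embedding dimensionality (assumed)
    window=5,          # context window size
    min_count=1,       # keep rare tokens in this toy corpus
    sg=1,              # skip-gram variant
    workers=4,
)
model.save("hindi_word2vec.model")
print(model.wv["प्रचार"][:5])  # first few components of a learned embedding
```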

18 pages, 1858 KiB  
Article
Arabic Toxic Tweet Classification: Leveraging the AraBERT Model
by Amr Mohamed El Koshiry, Entesar Hamed I. Eliwa, Tarek Abd El-Hafeez and Ahmed Omar
Big Data Cogn. Comput. 2023, 7(4), 170; https://doi.org/10.3390/bdcc7040170 - 26 Oct 2023
Cited by 3
Abstract
Social media platforms have become the primary means of communication and information sharing, facilitating interactive exchanges among users. Unfortunately, these platforms also witness the dissemination of inappropriate and toxic content, including hate speech and insults. While significant efforts have been made to classify toxic content in the English language, the same level of attention has not been given to Arabic texts. This study addresses this gap by constructing a standardized Arabic dataset specifically designed for toxic tweet classification. The dataset is annotated automatically using Google’s Perspective API and the expertise of three native Arabic speakers and linguists. To evaluate the performance of different models, we conduct a series of experiments using seven models: long short-term memory (LSTM), bidirectional LSTM, a convolutional neural network, a gated recurrent unit (GRU), bidirectional GRU, multilingual bidirectional encoder representations from transformers, and AraBERT. Additionally, we employ word embedding techniques. Our experimental findings demonstrate that the fine-tuned AraBERT model surpasses the performance of other models, achieving an impressive accuracy of 0.9960. Notably, this accuracy value outperforms similar approaches reported in recent literature. This study represents a significant advancement in Arabic toxic tweet classification, shedding light on the importance of addressing toxicity in social media platforms while considering diverse languages and cultures.
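
For orientation, here is a minimal fine-tuning skeleton, under assumptions (a public AraBERT checkpoint, two placeholder tweets, and default Trainer settings), rather than the authors' training setup: it adapts a pretrained AraBERT encoder for binary toxic/non-toxic classification with the Hugging Face Trainer.

```python
# Illustrative AraBERT fine-tuning skeleton for toxic-tweet classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "aubmindlab/bert-base-arabertv02"   # assumed public AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Two placeholder tweets; the study used a large annotated Arabic dataset.
data = Dataset.from_dict({"text": ["تغريدة مسيئة", "تغريدة لطيفة"], "label": [1, 0]})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arabert-toxic", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```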

16 pages, 3082 KiB  
Article
The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters
by Nurgali Kadyrbek, Madina Mansurova, Adai Shomanov and Gaukhar Makharova
Big Data Cogn. Comput. 2023, 7(3), 132; https://doi.org/10.3390/bdcc7030132 - 20 Jul 2023
Abstract
This study is devoted to the transcription of human speech in the Kazakh language under dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed audio corpus, and the use of deep neural networks for speech modeling. A high-quality transcribed audio corpus containing 554 h of data was collected, providing statistics on the frequencies of letters and syllables as well as demographic parameters such as the gender, age, and region of residence of the native speakers. The corpus contains a universal vocabulary and serves as a valuable resource for the development of speech-related modules. Machine learning experiments were conducted using the DeepSpeech2 model, which includes a sequence-to-sequence architecture with an encoder, decoder, and attention mechanism. To increase the reliability of the model, filters initialized with character-level embeddings were introduced to reduce the dependence on accurate positioning on feature maps. The training process included the simultaneous preparation of convolutional filters for spectrograms and character-level features. The proposed approach, using a combination of supervised and unsupervised learning methods, resulted in a 66.7% reduction in model size while maintaining relative accuracy. The evaluation on the test sample showed a 7.6% lower character error rate (CER) compared to existing models, demonstrating state-of-the-art performance. The proposed architecture can be deployed on platforms with limited resources. Overall, this study presents a high-quality audio corpus, an improved speech recognition model, and promising results applicable to speech-related applications and languages beyond Kazakh.
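
Since the study reports its gains as character error rate (CER), a short sketch of that metric may help; the implementation below is a generic illustration (Levenshtein distance over characters divided by reference length), not the authors' evaluation code.

```python
# Illustrative character error rate (CER): edit distance / reference length.
def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance over characters."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("сәлем әлем", "сәлем алем"))  # one substituted character -> 0.1
```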

20 pages, 1788 KiB  
Article
DSpamOnto: An Ontology Modelling for Domain-Specific Social Spammers in Microblogging
by Malak Al-Hassan, Bilal Abu-Salih and Ahmad Al Hwaitat
Big Data Cogn. Comput. 2023, 7(2), 109; https://doi.org/10.3390/bdcc7020109 - 02 Jun 2023
Cited by 2
Abstract
The lack of regulations and oversight on Online Social Networks (OSNs) has resulted in the rise of social spam, which is the dissemination of unsolicited and low-quality content that aims to deceive and manipulate users. Social spam can cause a range of negative consequences for individuals and businesses, such as the spread of malware, phishing scams, and reputational damage. While machine learning techniques can be used to detect social spammers by analysing patterns in data, they have limitations such as the potential for false positives and false negatives. In contrast, ontologies allow for the explicit modelling and representation of domain knowledge, which can be used to create a set of rules for identifying social spammers. However, the literature exposes a deficiency of ontologies that conceptualize domain-based social spam. This paper aims to address this gap by designing a domain-specific ontology called DSpamOnto to detect social spammers in microblogging who target a specific domain. DSpamOnto can identify social spammers based on their domain-specific behaviour, such as posting repetitive or irrelevant content and using misleading information. The proposed model is compared and benchmarked against well-proven ML models using various evaluation metrics to verify and validate its utility in capturing social spammers.
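
To make the rule-based idea concrete, the sketch below encodes two domain-specific behavioural indicators as plain Python predicates and flags an account when enough of them fire; the feature names and thresholds are illustrative assumptions, and the snippet is not the DSpamOnto ontology itself, which is expressed as ontology axioms rather than code.

```python
# Illustrative rule-based flagging of domain-specific spam-like behaviour.
from dataclasses import dataclass

@dataclass
class Account:
    posts: list
    followers: int
    following: int

def repetitive_content(acc):
    """Many near-duplicate posts (assumed indicator)."""
    return len(set(acc.posts)) < 0.5 * len(acc.posts)

def follow_imbalance(acc):
    """Follows far more accounts than follow it back (assumed indicator)."""
    return acc.following > 10 * max(acc.followers, 1)

RULES = [repetitive_content, follow_imbalance]

def is_domain_spammer(acc, min_hits=2):
    return sum(rule(acc) for rule in RULES) >= min_hits

account = Account(posts=["buy now!!"] * 8 + ["great deal"], followers=3, following=500)
print(is_domain_spammer(account))  # True
```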
