Natural Language Processing (NLP) and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 October 2023) | Viewed by 134915

Special Issue Editors


Prof. Dr. Gui-Lin Qi, Guest Editor
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
Interests: natural language processing; knowledge graph; multimodal learning

Dr. Tong Xu, Guest Editor
Lab of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China
Interests: natural language processing; social media analysis; multimodal intelligence

Dr. Meng Wang, Guest Editor
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
Interests: natural language processing; knowledge graph; multimodal learning

Special Issue Information

Dear Colleagues,

Natural Language Processing (NLP) is a key technology of artificial intelligence. In recent years, many highly influential NLP models have emerged, such as BERT and GPT-3. They already power a wide range of applications that we experience daily, such as question answering, machine translation, and smart assistants. They are also crucial to a wide range of other research topics, such as biomedical information processing, knowledge graphs, and multimodal intelligence. However, numerous theoretical and technological problems remain unsolved and await further research. This Special Issue aims to address these challenges by inviting scholarly contributions covering recent advances in NLP and its applications. We welcome original research articles reporting the development of novel NLP models and algorithms, as well as papers presenting novel NLP applications.

Prof. Dr. Gui-Lin Qi
Dr. Tong Xu
Dr. Meng Wang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language understanding
  • natural language generation
  • question answering
  • machine translation
  • knowledge graph
  • NLP for knowledge extraction
  • NLP for multimodal intelligence
  • NLP applications in specific domains, like life sciences, health, and medicine
  • eGovernment and public administration
  • news and social media

Published Papers (79 papers)

Research

15 pages, 867 KiB  
Article
An Efficient Document Retrieval for Korean Open-Domain Question Answering Based on ColBERT
by Byungha Kang, Yeonghwa Kim and Youhyun Shin
Appl. Sci. 2023, 13(24), 13177; https://doi.org/10.3390/app132413177 - 12 Dec 2023
Cited by 1 | Viewed by 1057
Abstract
Open-domain question answering requires retrieving documents highly relevant to the query from a large-scale corpus. Deep learning-based dense retrieval methods have become the primary approach for finding related documents. Although deep learning-based methods have improved search accuracy compared to traditional techniques, they simultaneously impose a considerable increase in computational burden. Consequently, research on efficient models and methods that optimize the trade-off between search accuracy and time to alleviate computational demands is required. In this paper, we propose a Korean document retrieval method utilizing ColBERT’s late interaction paradigm to efficiently calculate the relevance between questions and documents. For open-domain Korean question answering document retrieval, we construct a Korean dataset using various corpora from AI-Hub. We conduct experiments comparing the search accuracy and inference time among the traditional IR (information retrieval) model BM25, the dense retrieval approach utilizing BERT-based models for Korean, and our proposed method. The experimental results demonstrate that our approach achieves higher accuracy than BM25 and requires less search time than the dense retrieval method employing KoBERT. Moreover, the best performance is observed when using KoSBERT, a pre-trained Korean language model that learned to position semantically similar sentences closely in vector space. Full article
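
The late-interaction scoring this paper builds on is compact enough to sketch directly. Below is a minimal NumPy illustration of ColBERT-style MaxSim relevance; the random matrices stand in for per-token encoder outputs, and `maxsim_score` is our name for the function, not the paper's code.

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    take its maximum cosine similarity over all document token
    embeddings, then sum over query tokens."""
    # L2-normalize rows so dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                         # (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())   # MaxSim per query token, summed

# Toy example: 4 query tokens, two documents of 6 tokens each.
rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))
docs = [rng.normal(size=(6, 128)) for _ in range(2)]
ranked = sorted(range(len(docs)), key=lambda i: -maxsim_score(query, docs[i]))
print("ranking:", ranked)
```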

14 pages, 6671 KiB  
Article
Neural Machine Translation Research on Syntactic Information Fusion Based on the Field of Electrical Engineering
by Yanna Sang, Yuan Chen and Juwei Zhang
Appl. Sci. 2023, 13(23), 12905; https://doi.org/10.3390/app132312905 - 01 Dec 2023
Viewed by 732
Abstract
Neural machine translation has achieved good translation results, but needs further improvement in low-resource and domain-specific translation. To this end, the paper proposed to incorporate source language syntactic information into neural machine translation models. Two novel approaches, namely Contrastive Language–Image Pre-training (CLIP) and Cross-attention Fusion (CAF), were compared to a base transformer model on EN–ZH and ZH–EN pair machine translation focusing on the electrical engineering domain. In addition, an ablation study on the effect of both proposed methods was presented. Among them, the CLIP pre-training method improved significantly compared with the baseline system, and the BLEU values in the EN–ZH and ZH–EN tasks increased by 3.37 and 3.18 percentage points, respectively. Full article

17 pages, 402 KiB  
Article
Cross-Corpus Multilingual Speech Emotion Recognition: Amharic vs. Other Languages
by Ephrem Afele Retta, Richard Sutcliffe, Jabar Mahmood, Michael Abebe Berwo, Eiad Almekhlafi, Sajjad Ahmad Khan, Shehzad Ashraf Chaudhry, Mustafa Mhamed and Jun Feng
Appl. Sci. 2023, 13(23), 12587; https://doi.org/10.3390/app132312587 - 22 Nov 2023
Cited by 1 | Viewed by 690
Abstract
In a conventional speech emotion recognition (SER) task, a classifier for a given language is trained on a pre-existing dataset for that same language. However, where training data for a language do not exist, data from other languages can be used instead. We experiment with cross-lingual and multilingual SER, working with Amharic, English, German, and Urdu. For Amharic, we use our own publicly available Amharic Speech Emotion Dataset (ASED). For English, German and Urdu, we use the existing RAVDESS, EMO-DB, and URDU datasets. We followed previous research in mapping labels for all of the datasets to just two classes: positive and negative. Thus, we can compare performance on different languages directly and combine languages for training and testing. In Experiment 1, monolingual SER trials were carried out using three classifiers, AlexNet, VGGE (a proposed variant of VGG), and ResNet50. The results, averaged for the three models, were very similar for ASED and RAVDESS, suggesting that Amharic and English SER are equally difficult. By the same comparison, German SER is more difficult, and Urdu SER is easier. In Experiment 2, we trained on one language and tested on another, in both directions for each of the following pairs: Amharic↔German, Amharic↔English, and Amharic↔Urdu. The results with Amharic as the target suggested that using English or German as the source gives the best result. In Experiment 3, we trained on several non-Amharic languages and then tested on Amharic. The best accuracy obtained was several percentage points greater than the best accuracy in Experiment 2, suggesting that a better result can be obtained when using two or three non-Amharic languages for training than when using just one non-Amharic language. Overall, the results suggest that cross-lingual and multilingual training can be an effective strategy for training an SER classifier when resources for a language are scarce. Full article

15 pages, 517 KiB  
Article
Prompt Language Learner with Trigger Generation for Dialogue Relation Extraction
by Jinsung Kim, Gyeongmin Kim, Junyoung Son and Heuiseok Lim
Appl. Sci. 2023, 13(22), 12414; https://doi.org/10.3390/app132212414 - 16 Nov 2023
Cited by 1 | Viewed by 685
Abstract
Dialogue relation extraction identifies semantic relations between entity pairs in dialogues. This research explores a methodology harnessing the potential of prompt-based fine-tuning paired with a trigger-generation approach. Capitalizing on the intrinsic knowledge of pre-trained language models, this strategy employs triggers that underline the relation between entities decisively. In particular, diverging from the conventional extractive methods seen in earlier research, our study leans towards a generative manner for trigger generation. The dialogue-based relation extraction (DialogRE) benchmark dataset features multi-utterance environments of colloquial speech by multiple speakers, making it critical to capture meaningful clues for inferring relational facts. In the benchmark, empirical results reveal significant performance boosts in few-shot scenarios, where the availability of examples is notably limited. Nevertheless, the scarcity of ground-truth triggers for training hints at potential further refinements in the trigger-generation module, especially when ample examples are present. When evaluating the challenges of dialogue relation extraction, combining prompt-based learning with trigger generation offers pronounced improvements in both full-shot and few-shot scenarios. Specifically, integrating a meticulously crafted manual initialization method with the prompt-based model (considering prior distributional insights and relation class semantics) substantially surpasses the baseline. However, further advancements in trigger generation are warranted, especially in data-abundant contexts, to maximize performance enhancements. Full article

13 pages, 446 KiB  
Article
An Improved Nested Named-Entity Recognition Model for Subject Recognition Task under Knowledge Base Question Answering
by Ziming Wang, Xirong Xu, Xinzi Li, Haochen Li, Xiaopeng Wei and Degen Huang
Appl. Sci. 2023, 13(20), 11249; https://doi.org/10.3390/app132011249 - 13 Oct 2023
Cited by 2 | Viewed by 796
Abstract
In the subject recognition (SR) task under Knowledge Base Question Answering (KBQA), a common method is to train and employ a general flat Named-Entity Recognition (NER) model. However, it is not effective and robust enough in cases where the recognized entity cannot be strictly matched to any subject in the Knowledge Base (KB). Compared to flat NER models, nested NER models show more flexibility and robustness in general NER tasks, but it is difficult to employ a nested NER model directly in an SR task. In this paper, we take advantage of the features of a nested NER model and propose an Improved Nested NER Model (INNM) for the SR task under KBQA. In our model, each question token is labeled as either an entity token, a start token, or an end token by a modified nested NER model based on semantics. Then, entity candidates are generated based on such labels, and an approximate matching strategy is employed to score all subjects in the KB based on string similarity to find the best-matched subject. Experimental results show that our model is effective and robust to both single-relation questions and complex questions, outperforming the baseline flat NER model by a margin of 3.3% accuracy on the SimpleQuestions dataset and a margin of 11.0% accuracy on the WebQuestionsSP dataset. Full article
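
The approximate matching step can be sketched roughly as follows. The paper's exact similarity measure is not given here, so the standard-library `difflib` ratio is used as a stand-in, and the KB subjects are invented examples.

```python
from difflib import SequenceMatcher

def best_subject(candidates, kb_subjects):
    """Score every KB subject against each entity candidate by string
    similarity and return the best-matched subject, so near-miss
    recognitions still resolve to a real KB entry."""
    best, best_score = None, -1.0
    for cand in candidates:
        for subj in kb_subjects:
            score = SequenceMatcher(None, cand.lower(), subj.lower()).ratio()
            if score > best_score:
                best, best_score = subj, score
    return best, best_score

kb = ["New York City", "New York Times", "York"]
print(best_subject(["new york cty"], kb))  # -> ('New York City', ...)
```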

15 pages, 967 KiB  
Article
A Scientific Document Retrieval and Reordering Method by Incorporating HFS and LSD
by Ziyang Feng and Xuedong Tian
Appl. Sci. 2023, 13(20), 11207; https://doi.org/10.3390/app132011207 - 12 Oct 2023
Viewed by 637
Abstract
Achieving scientific document retrieval by considering the wealth of mathematical expressions and the semantic text they contain has become an inescapable trend. Current scientific document matching models focus solely on the textual features of expressions and frequently suffer from excessive parameter counts and slow inference in the pursuit of improved performance. To solve this problem, this paper proposes a scientific document retrieval method founded upon hesitant fuzzy sets (HFS) and local semantic distillation (LSD). Concretely, in order to extract both spatial and semantic features for each symbol within a mathematical expression, this paper introduces an expression analysis module that leverages HFS to establish feature indices. Secondly, to enhance contextual semantic alignment, knowledge distillation is employed to refine the pretrained language model and establish a twin network for semantic matching. Lastly, by amalgamating mathematical expressions with contextual semantic features, the retrieval results can be made more efficient and reasonable. Experiments were implemented on the NTCIR dataset and the expanded Chinese dataset. The average MAP for mathematical expression retrieval results was 83.0%, and the average nDCG for sorting scientific documents was 85.8%. Full article

15 pages, 412 KiB  
Article
Assessment of Parent–Child Interaction Quality from Dyadic Dialogue
by Chaohao Lin, Ou Bai, Jennifer Piscitello, Emily L. Robertson, Brittany Merrill, Kellina Lupas and William E. Pelham, Jr.
Appl. Sci. 2023, 13(20), 11129; https://doi.org/10.3390/app132011129 - 10 Oct 2023
Viewed by 955
Abstract
The quality of parent–child interaction is critical for child cognitive development. The Dyadic Parent–Child Interaction Coding System (DPICS) is commonly used to assess parent and child behaviors. However, manual annotation of DPICS codes by parent–child interaction therapists is a time-consuming task. To assist therapists in the coding task, researchers have begun to explore the use of artificial intelligence in natural language processing to classify DPICS codes automatically. In this study, we utilized datasets from the DPICS book manual, five families, and an open-source PCIT dataset. To train DPICS code classifiers, we employed the pre-trained RoBERTa model, fine-tuned for this task, as our learning algorithm. Our study shows that fine-tuning the pre-trained RoBERTa model achieves the highest results compared to other methods in sentence-based DPICS code classification assignments. For the DPICS manual dataset, the overall accuracy was 72.3% (72.2% macro-precision, 70.5% macro-recall, and 69.6% macro-F-score). Meanwhile, for the PCIT dataset, the overall accuracy was 79.8% (80.4% macro-precision, 79.7% macro-recall, and 79.8% macro-F-score), surpassing the previous highest results of 78.3% accuracy (79% precision, 77% recall) averaged over the eight DPICS classes. These results show that fine-tuning the pre-trained RoBERTa model could provide valuable assistance to experts in the labeling process. Full article
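
A minimal sketch of the sentence-level fine-tuning setup with the Hugging Face `transformers` API. The example utterances and integer label codes are illustrative; the eight-class head follows the eight DPICS classes mentioned in the abstract.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=8)  # one output per DPICS code

sentences = ["You did a great job building that tower!",
             "Put the blocks away now."]
labels = torch.tensor([0, 1])      # hypothetical integer-encoded DPICS codes

batch = tokenizer(sentences, padding=True, truncation=True,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)  # cross-entropy loss is included
outputs.loss.backward()                  # one fine-tuning step (optimizer omitted)
print(outputs.logits.argmax(dim=-1))
```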

14 pages, 278 KiB  
Article
Authorship Attribution on Short Texts in the Slovenian Language
by Gregor Gabrovšek, Peter Peer, Žiga Emeršič and Borut Batagelj
Appl. Sci. 2023, 13(19), 10965; https://doi.org/10.3390/app131910965 - 04 Oct 2023
Viewed by 913
Abstract
The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language. Full article

15 pages, 909 KiB  
Article
A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization
by Canglan Liu, Wushouer Silamu and Yanbing Li
Appl. Sci. 2023, 13(19), 10589; https://doi.org/10.3390/app131910589 - 22 Sep 2023
Viewed by 810
Abstract
Low-resource languages often face the problem of insufficient data, which leads to poor quality in machine translation. One approach to address this issue is data augmentation. Data augmentation involves creating new data by transforming existing data through methods such as flipping, cropping, rotating, and adding noise. Traditionally, pseudo-parallel corpora are generated by randomly replacing words in low-resource language machine translation. However, this method can introduce ambiguity, as the same word may have different meanings in different contexts. This study proposes a new approach for low-resource language machine translation, which involves generating pseudo-parallel corpora by replacing phrases. The performance of this approach is compared with other data augmentation methods, and it is observed that combining it with other data augmentation methods further improves performance. To enhance the robustness of the model, R-Drop regularization is also used. R-Drop is an effective method for improving the quality of machine translation. The proposed method was tested on Chinese–Kazakh (Arabic script) translation tasks, resulting in performance improvements of 4.99 and 7.7 for Chinese-to-Kazakh and Kazakh-to-Chinese translations, respectively. By combining the generation of pseudo-parallel corpora through phrase replacement with the application of R-Drop regularization, there is a significant advancement in machine translation performance for low-resource languages. Full article
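
R-Drop itself is easy to sketch: run the same batch through the model twice with dropout active and penalize disagreement between the two predictive distributions. A minimal PyTorch version follows; the weight α and the toy logits are illustrative.

```python
import torch
import torch.nn.functional as F

def r_drop_loss(logits1, logits2, labels, alpha=5.0):
    """R-Drop: two stochastic forward passes of the same model (dropout
    active) should agree; penalize divergence with symmetric KL."""
    ce = 0.5 * (F.cross_entropy(logits1, labels) +
                F.cross_entropy(logits2, labels))
    p, q = F.log_softmax(logits1, -1), F.log_softmax(logits2, -1)
    kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return ce + alpha * kl

# Toy usage: in practice logits1/logits2 come from calling model(x) twice.
logits1 = torch.randn(4, 10, requires_grad=True)
logits2 = logits1 + 0.1 * torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(r_drop_loss(logits1, logits2, labels))
```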

13 pages, 2003 KiB  
Article
Neural Machine Translation of Electrical Engineering with Fusion of Memory Information
by Yuan Chen, Zikang Liu and Juwei Zhang
Appl. Sci. 2023, 13(18), 10279; https://doi.org/10.3390/app131810279 - 13 Sep 2023
Viewed by 809
Abstract
This paper proposes a new neural machine translation model of electrical engineering that combines a transformer with gated recurrent unit (GRU) networks. By fusing global information and memory information, the model effectively improves the performance of low-resource neural machine translation. Unlike traditional transformers, our proposed model includes two different encoders: one is the global information encoder, which focuses on contextual information, and the other is the memory encoder, which is responsible for capturing recurrent memory information. The model with these two types of attention can encode both global and memory information and learn richer semantic knowledge. Because transformers require global attention calculation for each word position, the time and space complexity both grow quadratically with the length of the source language sequence. When the length of the source language sequence becomes too long, the performance of the transformer will sharply decline. Therefore, we propose a memory information encoder based on the GRU to mitigate this drawback. The model proposed in this paper has a maximum improvement of 2.04 BLEU points over the baseline model in the field of electrical engineering with low resources. Full article
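
A minimal PyTorch sketch of the two-encoder idea: a self-attention branch for global context and a GRU branch for recurrent memory. The linear fusion and all dimensions are assumptions for illustration; the paper's exact wiring may differ.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Two-branch source encoder: global attention plus GRU memory,
    fused by a linear projection (assumed fusion)."""
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.global_enc = nn.TransformerEncoderLayer(d, nhead=8,
                                                     batch_first=True)
        self.memory_enc = nn.GRU(d, d, batch_first=True)
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, tokens):
        x = self.emb(tokens)
        g = self.global_enc(x)         # global attention features
        m, _ = self.memory_enc(x)      # recurrent memory features
        return self.fuse(torch.cat([g, m], dim=-1))

enc = DualEncoder()
print(enc(torch.randint(0, 1000, (2, 17))).shape)  # torch.Size([2, 17, 256])
```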

17 pages, 3431 KiB  
Article
Named Entity Recognition in Power Marketing Domain Based on Whole Word Masking and Dual Feature Extraction
by Yan Chen, Zengfu Liang, Zhixiang Tan and Dezhao Lin
Appl. Sci. 2023, 13(16), 9338; https://doi.org/10.3390/app13169338 - 17 Aug 2023
Viewed by 762
Abstract
To address the problems of low utilization of entity features, word-sense ambiguity, and poor recognition of specialized terms in Chinese power marketing domain named entity recognition (PMDNER), this study proposes a Chinese power marketing named entity recognition method based on whole word masking and joint extraction of dual features. Firstly, word vectorization of the electricity text data is performed using the RoBERTa pre-training model; then, it is fed into the constructed dual feature extraction neural network (DFENN) to acquire the local and global features of the text in a parallel manner and fuse them. The output of the RoBERTa layer is used as the auxiliary classification layer, the output of the DFENN layer is used as the master classification layer, and the outputs of the two layers are dynamically combined through an attention mechanism that weights them to fuse new features, which are input into the conditional random field (CRF) layer to obtain the most reasonable label sequence. A focal loss function is used in the training process to alleviate the problem of uneven sample distribution. The experimental results show that the method achieved an F1 value of 88.58% on the constructed named entity recognition dataset in the power marketing domain, which is a significant improvement in performance compared with existing methods. Full article
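
The dynamic combination of the auxiliary and master layers can be sketched as a learned gate. This is one plausible reading of the attention-based weighting, not the paper's exact code; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Dynamically weight a master representation (e.g., the DFENN
    output) against an auxiliary one (e.g., the RoBERTa output) before
    the CRF layer, via a sigmoid gate."""
    def __init__(self, d=768):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)

    def forward(self, h_master, h_aux):
        alpha = torch.sigmoid(self.gate(torch.cat([h_master, h_aux], -1)))
        return alpha * h_master + (1 - alpha) * h_aux

fuse = GatedFusion()
h1, h2 = torch.randn(2, 30, 768), torch.randn(2, 30, 768)
print(fuse(h1, h2).shape)  # torch.Size([2, 30, 768])
```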

11 pages, 979 KiB  
Article
Medical Named Entity Recognition Fusing Part-of-Speech and Stroke Features
by Fen Yi, Hong Liu, You Wang, Sheng Wu, Cheng Sun, Peng Feng and Jin Zhang
Appl. Sci. 2023, 13(15), 8913; https://doi.org/10.3390/app13158913 - 02 Aug 2023
Viewed by 1100
Abstract
Identifying diseases, symptoms, drugs, examinations, and other medical entities in medical text data is highly significant from a research standpoint and a valuable practice, supporting knowledge graphs, question answering systems, and other downstream tasks that can provide the public with knowledgeable answers. However, in contrast with languages like English, Chinese has no distinct dividing line between words, and medical entities are often long and contain nested entity types. Therefore, to address these issues, this study suggests a medical named entity recognition (NER) approach that combines part-of-speech and stroke features. First, the text is fed into the BERT pre-training model to get the semantic representation of the text, while the part-of-speech feature vector is obtained using the part-of-speech dictionary, and the stroke feature of the text is extracted through a convolution neural network (CNN). The word vector is then joined with the part-of-speech and stroke feature vectors, respectively, and input into the BiLSTM and CRF layer for training. Additionally, to balance the disparity in data volume across several types of entities, the class-weighted loss function is included in the loss function. According to the experimental findings, our model’s F1 score on the CCKS2019 dataset reaches 78.65%, and the recognition performance exceeds many existing algorithms. Full article

23 pages, 2773 KiB  
Article
Domain Knowledge Graph Question Answering Based on Semantic Analysis and Data Augmentation
by Shulin Hu, Huajun Zhang and Wanying Zhang
Appl. Sci. 2023, 13(15), 8838; https://doi.org/10.3390/app13158838 - 31 Jul 2023
Cited by 1 | Viewed by 1386
Abstract
Information retrieval-based question answering (IRQA) and knowledge-based question answering (KBQA) are the main forms of question answering (QA) systems. The answer generated by the IRQA system is extracted from the relevant text but has a certain degree of randomness, while the KBQA system retrieves the answer from structured data, and its accuracy is relatively high. In the field of policy and regulations such as household registration, the QA system requires precise and rigorous answers. Therefore, we design a QA system based on the household registration knowledge graph, aiming to provide rigorous and accurate answers for relevant household registration inquiries. The QA system uses a semantic analysis-based approach to simplify one question into a simple problem consisting of a single event entity and a single intention relationship, and quickly generates accurate answers by searching in the household registration knowledge graph. Due to the scarcity and imbalance of QA corpus data in the field of household registration, we use GPT-3.5 to augment the collected questions dataset and explore the impact of data augmentation on the QA system. The experimental results show that the accuracy rate of the QA system using the augmented dataset reaches 93%, which is 6% higher than before. Full article
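
Once semantic analysis has reduced a question to an (entity, relation) pair, the core of such a system is a graph lookup. A toy sketch follows; the household-registration facts are invented placeholders, not the paper's knowledge graph.

```python
# A household-registration KG reduced to (entity, relation) -> answer.
KG = {
    ("newborn registration", "required documents"):
        "parents' household booklet and the medical birth certificate",
    ("newborn registration", "time limit"): "within one month of birth",
}

def answer(entity: str, relation: str) -> str:
    """Semantic analysis (not shown) reduces a question to one event
    entity and one intention relation; answering is a direct lookup."""
    return KG.get((entity, relation), "no answer found in the graph")

# e.g., "What documents do I need to register a newborn?"
print(answer("newborn registration", "required documents"))
```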

18 pages, 2928 KiB  
Article
Knowledge Interpolated Conditional Variational Auto-Encoder for Knowledge Grounded Dialogues
by Xingwei Liang, Jiachen Du, Taiyu Niu, Lanjun Zhou and Ruifeng Xu
Appl. Sci. 2023, 13(15), 8707; https://doi.org/10.3390/app13158707 - 28 Jul 2023
Viewed by 849
Abstract
In Knowledge Grounded Dialogue (KGD) generation, the explicit modeling of instance-variety of knowledge specificity and its seamless fusion with the dialogue context remains challenging. This paper presents an innovative approach, the Knowledge Interpolated conditional Variational auto-encoder (KIV), to address these issues. In particular, KIV introduces a novel interpolation mechanism to fuse two latent variables: independently encoding dialogue context and grounded knowledge. This distinct fusion of context and knowledge in the semantic space enables the interpolated latent variable to guide the decoder toward generating more contextually rich and engaging responses. We further explore deterministic and probabilistic methodologies to ascertain the interpolation weight, capturing the level of knowledge specificity. Comprehensive empirical analysis conducted on the Wizard-of-Wikipedia and Holl-E datasets verifies that the responses generated by our model perform better than those of strong baselines, with notable performance improvements observed in both automatic metrics and manual evaluation. Full article
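
The interpolation mechanism can be sketched in a few lines of PyTorch. The latent dimension and the sigmoid-sampled weight below are illustrative assumptions; the paper explores both deterministic and probabilistic ways to set the weight.

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) with the reparameterization trick."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# Two latent variables: one from the dialogue context, one from knowledge.
mu_c, logvar_c = torch.randn(1, 64), torch.zeros(1, 64)
mu_k, logvar_k = torch.randn(1, 64), torch.zeros(1, 64)
z_context = reparameterize(mu_c, logvar_c)
z_knowledge = reparameterize(mu_k, logvar_k)

# The interpolation weight w reflects knowledge specificity; here it is
# sampled (a hypothetical probabilistic variant).
w = torch.sigmoid(torch.randn(1))
z = w * z_knowledge + (1 - w) * z_context  # interpolated latent fed to decoder
print(z.shape)
```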

14 pages, 1911 KiB  
Article
Semantic Similarity Analysis for Examination Questions Classification Using WordNet
by Thing Thing Goh, Nor Azliana Akmal Jamaludin, Hassan Mohamed, Mohd Nazri Ismail and Huangshen Chua
Appl. Sci. 2023, 13(14), 8323; https://doi.org/10.3390/app13148323 - 19 Jul 2023
Viewed by 956
Abstract
Question classification based on Bloom’s Taxonomy (BT) has been widely accepted and used as a guideline in designing examination questions in many institutions of higher learning. The misclassification of questions may happen when the classification task is conducted manually due to a discrepancy in the understanding of BT by academics. Hence, several automated examination question classification systems have been proposed by researchers to perform question classification accurately. Most of this research has focused on specific subject areas only or single-sentence type questions. There has been a lack of research on question classification for multi-sentence type and multi-subject questions. This paper proposes a question classification system (QCS) to perform examination question classification using semantic and syntactic approaches. The questions were taken from various subjects of an engineering diploma course, and the questions were either single- or multi-sentence types. The QCS was developed using the natural language toolkit (NLTK), the Stanford POS tagger (SPOS), the Stanford parser’s universal dependencies (UD), and WordNet similarity approaches. The QCS used the NLTK to process the questions into sentences and then into word tokens, SPOS to tag the word tokens, and UD to identify the important word tokens, which were the verbs of the examination questions. The identified verbs were then compared with BT’s verb list in terms of word sense using the WordNet similarity approach before finally classifying the questions according to BT. The developed QCS achieved an overall 83% accuracy in the classification of a set of 200 examination questions, according to BT. Full article
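
A minimal sketch of the verb-matching step with NLTK's WordNet interface. The Bloom's-level verb lists here are tiny invented samples, and Wu-Palmer similarity stands in for whichever WordNet measure the paper uses.

```python
# pip install nltk; then run nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

BT_VERBS = {"remember": ["define", "list"], "analyze": ["compare", "examine"]}

def classify_verb(question_verb: str) -> str:
    """Map an extracted question verb to the Bloom's level whose verb
    list contains the most similar WordNet sense."""
    best_level, best_score = "unknown", 0.0
    for level, verbs in BT_VERBS.items():
        for verb in verbs:
            for s1 in wn.synsets(question_verb, pos=wn.VERB):
                for s2 in wn.synsets(verb, pos=wn.VERB):
                    score = s1.wup_similarity(s2) or 0.0
                    if score > best_score:
                        best_level, best_score = level, score
    return best_level

print(classify_verb("contrast"))  # likely maps to "analyze"
```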

16 pages, 1937 KiB  
Article
FREDA: Few-Shot Relation Extraction Based on Data Augmentation
by Junbao Liu, Xizhong Qin, Xiaoqin Ma and Wensheng Ran
Appl. Sci. 2023, 13(14), 8312; https://doi.org/10.3390/app13148312 - 18 Jul 2023
Cited by 1 | Viewed by 956
Abstract
The primary task of few-shot relation extraction is to quickly learn the features of relation classes from a few labelled instances and predict the semantic relations between entity pairs in new instances. Most existing few-shot relation extraction methods do not fully utilize the relation information features in sentences, resulting in difficulties in improving the performance of relation classification. Some researchers have attempted to incorporate external information, but the results have been unsatisfactory when applied to different domains. In this paper, we propose a method that utilizes triple information for data augmentation, which can alleviate the issue of insufficient instances and possesses strong domain adaptation capabilities. Firstly, we extract relation and entity pairs from the instances in the support set, forming relation triple information. Next, the sentence information and relation triple information are encoded using the same sentence encoder. Then, we construct an interactive attention module to enable the query set instances to interact separately with the support set instances and relation triple instances. The module pays greater attention to highly interactive parts between instances and assigns them higher weights. Finally, we merge the interacted support set representation and relation triple representation. To our knowledge, we are the first to propose a method that utilizes triple information for data augmentation in relation extraction. In our experiments on the standard datasets FewRel1.0 and FewRel2.0 (domain adaptation), we observed substantial improvements without including external information. Full article

13 pages, 578 KiB  
Article
An Open-Domain Event Extraction Method Incorporating Semantic and Dependent Syntactic Information
by Li He, Qian Zhang, Jianyong Duan and Hao Wang
Appl. Sci. 2023, 13(13), 7942; https://doi.org/10.3390/app13137942 - 06 Jul 2023
Viewed by 908
Abstract
Open-domain event extraction is a fundamental task that aims to extract non-predefined types of events from news clusters. Some researchers have noticed that its performance can be enhanced by improving dependency relationships. Recently, graph convolutional networks (GCNs) have been widely used to integrate dependency syntactic information into neural networks. However, they usually introduce noise and degrade generalization. To tackle this issue, we propose using Bi-LSTM to obtain semantic representations of BERT intermediate layer features and infuse the dependent syntactic information. Compared to current methods, Bi-LSTM is more robust and has less dependency on word vectors and artificial features. Experiments on public datasets show that our approach is effective for open-domain event extraction tasks. Full article
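
A minimal sketch of re-encoding a BERT intermediate layer with a Bi-LSTM, using the Hugging Face `transformers` API. The layer index 8 and the example sentence are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased",
                                 output_hidden_states=True)
bilstm = nn.LSTM(768, 384, batch_first=True, bidirectional=True)

batch = tokenizer(["An earthquake struck the region on Monday."],
                  return_tensors="pt")
with torch.no_grad():
    hidden = bert(**batch).hidden_states  # tuple: embeddings + 12 layers
middle = hidden[8]                        # one intermediate layer
features, _ = bilstm(middle)              # semantic re-encoding of that layer
print(features.shape)                     # (1, seq_len, 768)
```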

24 pages, 1321 KiB  
Article
Impact of Negation and AnA-Words on Overall Sentiment Value of the Text Written in the Bosnian Language
by Sead Jahić and Jernej Vičič
Appl. Sci. 2023, 13(13), 7760; https://doi.org/10.3390/app13137760 - 30 Jun 2023
Cited by 2 | Viewed by 977
Abstract
In this manuscript, we present our efforts to develop an accurate sentiment analysis model for Bosnian-language tweets which incorporated three elements: negation cues, AnA-words (referring to maximizers, boosters, approximators, relative intensifiers, diminishers, and minimizers), and sentiment-labeled words from a lexicon. We used several machine-learning techniques, including SVM, Naive Bayes, RF, and CNN, with different input parameters, such as batch size, number of convolution layers, and type of convolution layers. In addition to these techniques, BOSentiment is used to provide an initial sentiment value for each tweet, which is then used as input for CNN. Our best-performing model, which combined BOSentiment and CNN with 256 filters and a size of 4×4, with a batch size of 10, achieved an accuracy of over 92%. Our results demonstrate the effectiveness of our approach in accurately classifying the sentiment of Bosnian tweets using machine-learning techniques, lexicons, and pre-trained models. This study makes a significant contribution to the field of sentiment analysis for under-researched languages such as Bosnian, and our approach could be extended to other languages and social media platforms to gain insight into public opinion. Full article
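
A toy illustration of how negation cues and AnA-words can modify lexicon scores before the result is handed to a classifier. The word lists below are tiny invented samples, not the paper's lexicon.

```python
NEGATION = {"ne", "nije", "nisam"}        # sample Bosnian negation cues
BOOSTER = {"veoma": 1.5, "malo": 0.5}     # sample AnA-words (intensity)
LEXICON = {"dobar": 1.0, "loš": -1.0}     # sentiment-labeled words

def tweet_sentiment(tokens):
    """Lexicon score with negation flipping and AnA-word scaling."""
    score, flip, scale = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in NEGATION:
            flip = -1.0
        elif tok in BOOSTER:
            scale = BOOSTER[tok]
        elif tok in LEXICON:
            score += flip * scale * LEXICON[tok]
            flip, scale = 1.0, 1.0        # effects apply to next sentiment word
    return score

print(tweet_sentiment("film nije veoma dobar".split()))  # negated positive -> -1.5
```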

26 pages, 2848 KiB  
Article
Detecting Fine-Grained Emotions in Literature
by Luis Rei and Dunja Mladenić
Appl. Sci. 2023, 13(13), 7502; https://doi.org/10.3390/app13137502 - 25 Jun 2023
Cited by 1 | Viewed by 1758
Abstract
Emotion detection in text is a fundamental aspect of affective computing and is closely linked to natural language processing. Its applications span various domains, from interactive chatbots to marketing and customer service. This research specifically focuses on its significance in literature analysis and understanding. To facilitate this, we present a novel approach that involves creating a multi-label fine-grained emotion detection dataset, derived from literary sources. Our methodology employs a simple yet effective semi-supervised technique. We leverage textual entailment classification to perform emotion-specific weak-labeling, selecting examples with the highest and lowest scores from a large corpus. Utilizing these emotion-specific datasets, we train binary pseudo-labeling classifiers for each individual emotion. By applying this process to the selected examples, we construct a multi-label dataset. Using this dataset, we train models and evaluate their performance within a traditional supervised setting. Our model achieves an F1 score of 0.59 on our labeled gold set, showcasing its ability to effectively detect fine-grained emotions. Furthermore, we conduct evaluations of the model’s performance in zero- and few-shot transfer scenarios using benchmark datasets. Notably, our results indicate that the knowledge learned from our dataset exhibits transferability across diverse data domains, demonstrating its potential for broader applications beyond emotion detection in literature. Our contribution thus includes a multi-label fine-grained emotion detection dataset built from literature, the semi-supervised approach used to create it, as well as the models trained on it. This work provides a solid foundation for advancing emotion detection techniques and their utilization in various scenarios, especially within the cultural heritage analysis. Full article
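
Entailment-based weak labeling can be sketched with the off-the-shelf zero-shot pipeline from `transformers`, which is built on textual entailment. The checkpoint, passage, and three coarse labels are illustrative; the paper's emotion set is finer-grained.

```python
from transformers import pipeline

# Textual entailment reused for emotion-specific weak labeling.
nli = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli")

passage = ("She slammed the door and stared out of the window, "
           "her hands trembling.")
scores = nli(passage, candidate_labels=["anger", "fear", "joy"],
             multi_label=True)
# Keeping only the highest- and lowest-scoring passages per emotion
# yields the positive/negative pools for the pseudo-labeling classifiers.
print(dict(zip(scores["labels"], scores["scores"])))
```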

18 pages, 3057 KiB  
Article
Automatic Essay Scoring Method Based on Multi-Scale Features
by Feng Li, Xuefeng Xi, Zhiming Cui, Dongyang Li and Wanting Zeng
Appl. Sci. 2023, 13(11), 6775; https://doi.org/10.3390/app13116775 - 02 Jun 2023
Cited by 3 | Viewed by 2216
Abstract
Essays are a pivotal component of conventional exams; accurately, efficiently, and effectively grading them is a significant challenge for educators. Automated essay scoring (AES) is a complex task that utilizes computer technology to assist teachers in scoring. Traditional AES techniques only focus on shallow linguistic features based on the grading criteria, ignoring the influence of deep semantic features. The AES model based on deep neural networks (DNN) can eliminate the need for feature engineering and achieve better accuracy. In addition, the DNN-AES model combining different scales of essays has recently achieved excellent results. However, it has the following problems: (1) It mainly extracts sentence-scale features manually and cannot be fine-tuned for specific tasks. (2) It does not consider the shallow linguistic features that the DNN-AES cannot extract. (3) It does not contain the relevance between the essay and the corresponding prompt. To solve these problems, we propose an AES method based on multi-scale features. Specifically, we utilize Sentence-BERT (SBERT) to vectorize sentences and connect them to the DNN-AES model. Furthermore, the typical shallow linguistic features and prompt-related features are integrated into the distributed features of the essay. The experimental results show that the Quadratic Weighted Kappa of our proposed method on the Kaggle ASAP competition dataset reaches 79.3%, verifying the efficacy of the extended method in the AES task. Full article
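
A minimal sketch of the sentence-scale vectorization step with the `sentence-transformers` package; the checkpoint and essay text are illustrative.

```python
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint works

essay = ["Recycling reduces landfill waste.",
         "It also lowers the demand for raw materials.",
         "Therefore schools should promote recycling programs."]
sentence_vecs = sbert.encode(essay)   # one vector per sentence
print(sentence_vecs.shape)            # (3, 384) for this checkpoint
# These sentence-scale vectors are then combined with the shallow
# linguistic and prompt-related features before the scoring head.
```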

16 pages, 1123 KiB  
Article
An Event Extraction Approach Based on a Multi-Round Q&A Framework
by Li He, Xiya Zhao, Liang Zhao and Qing Zhang
Appl. Sci. 2023, 13(10), 6308; https://doi.org/10.3390/app13106308 - 22 May 2023
Viewed by 1033
Abstract
Event extraction aims to present unstructured text containing event information in a structured form to help people quickly mine the target information. Most of the traditional event extraction methods focus on the design of complex neural network models, which rely on a large amount of annotated data to train the models. In recent years, some researchers have proposed the use of machine reading comprehension models for event extraction; however, the existing methods are limited to the single-round question-and-answer model, ignoring the dependency relation between the elements of event arguments. In addition, the existing methods do not fully utilize knowledge such as a priori information. To address these shortcomings, a multi-round Q&A framework is proposed for event extraction, which extends the existing methods in two aspects: first, by constructing a multi-round extraction problem framework, the model can effectively exploit the hierarchical dependencies among the argument elements; second, the question-and-answer framework is populated with historical answer information encoding slots, which are integrated into the multi-round Q&A process to assist in inference. Finally, experimental results on a publicly available dataset show that the proposed model achieves superior results compared to existing methods. Full article
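
The multi-round idea (answers from earlier rounds fill slots in later questions) can be sketched abstractly. Here `qa_model` is a hypothetical span extractor and the question templates are invented; the paper's slot encoding is richer than plain string substitution.

```python
def extract_event(context, qa_model, questions):
    """Multi-round extraction: each round's answer is written into a
    history slot and injected into the next question."""
    history = {}
    for slot, template in questions:
        question = template.format(**history)
        history[slot] = qa_model(context, question)
    return history

questions = [
    ("trigger", "What is the event trigger?"),
    ("subject", "Who initiated the {trigger} event?"),
    ("time",    "When did {subject} trigger the {trigger} event?"),
]

def fake_qa(context, question):
    # Stand-in for a real machine reading comprehension model.
    if question.startswith("What"):
        return "acquisition"
    if question.startswith("Who"):
        return "ACME"
    return "Friday"

print(extract_event("ACME announced an acquisition on Friday.",
                    fake_qa, questions))
```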

14 pages, 1156 KiB  
Article
Multi-Stage Prompt Tuning for Political Perspective Detection in Low-Resource Settings
by Kang-Min Kim, Mingyu Lee, Hyun-Sik Won, Min-Ji Kim, Yeachan Kim and SangKeun Lee
Appl. Sci. 2023, 13(10), 6252; https://doi.org/10.3390/app13106252 - 19 May 2023
Viewed by 1618
Abstract
Political perspective detection in news media (identifying political bias in news articles) is an essential but challenging low-resource task. Prompt-based learning (i.e., discrete prompting and prompt tuning) achieves promising results in low-resource scenarios by adapting a pre-trained model to handle new tasks. However, these approaches suffer performance degradation when the target task involves a textual domain (e.g., a political domain) different from the pre-training task (e.g., masked language modeling on a general corpus). In this paper, we develop a novel multi-stage prompt tuning framework for political perspective detection. Our method involves two sequential stages: a domain-specific prompt tuning stage and a task-specific prompt tuning stage. In the first stage, we tune the domain-specific prompts based on a masked political phrase prediction (MP3) task to adjust the language model to the political domain. In the second, task-specific prompt tuning stage, we only tune task-specific prompts with a frozen language model and domain-specific prompts for downstream tasks. The experimental results demonstrate that our method significantly outperforms fine-tuning (i.e., model tuning) methods and state-of-the-art prompt tuning methods on the SemEval-2019 Task 4: Hyperpartisan News Detection and AllSides datasets. Full article
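
The task-specific stage, tuning only soft prompts against a frozen language model, can be sketched as follows. The BERT checkpoint, prompt length, and example text are illustrative assumptions, and the domain-specific MP3 stage is not shown.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")
for p in lm.parameters():        # the language model stays frozen;
    p.requires_grad = False      # only the prompt vectors are tuned

n_prompts, d = 20, lm.config.hidden_size
task_prompt = nn.Parameter(torch.randn(1, n_prompts, d) * 0.02)

batch = tokenizer(["The senator's plan is a disaster for workers."],
                  return_tensors="pt")
tok_embs = lm.get_input_embeddings()(batch["input_ids"])
inputs = torch.cat([task_prompt, tok_embs], dim=1)
mask = torch.cat([torch.ones(1, n_prompts, dtype=torch.long),
                  batch["attention_mask"]], dim=1)
out = lm(inputs_embeds=inputs, attention_mask=mask).last_hidden_state
print(out.shape)  # prompt positions + token positions, each d-dimensional
```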

18 pages, 1352 KiB  
Article
Multi-Feature Fusion Method for Chinese Shipping Companies Credit Named Entity Recognition
by Lin He, Shengnan Wang and Xinran Cao
Appl. Sci. 2023, 13(9), 5787; https://doi.org/10.3390/app13095787 - 08 May 2023
Cited by 2 | Viewed by 1282
Abstract
Shipping Enterprise Credit Named Entity Recognition (NER) aims to recognize shipping enterprise credit entities from unstructured shipping enterprise credit texts. To address the low entity recognition rate caused by complex, diverse, and nested entities in the field of shipping enterprise credit, a deep learning method based on multi-feature fusion is proposed to improve the recognition of shipping enterprise credit entities. In this study, the shipping enterprise credit dataset is manually labeled using the BIO labeling model, combining the pre-trained model Bidirectional Encoder Representations from Transformers (BERT) and a bidirectional gated recurrent unit (BiGRU) with a conditional random field (CRF) to form the BERT-BiGRU-CRF model, and changing the input of the model from a single feature vector to a multi-feature vector (MF) by concatenating character vector features, word vector features, word length features, and part-of-speech (POS) features; BiGRU is introduced to extract the contextual features of shipping enterprise credit texts. Finally, CRF completes the sequence annotation task. According to the experimental results, using the BERT-MF-BiGRU-CRF model for NER of shipping enterprise credit text data, the F1 Score (F1) reaches 91.7%, which is 8.37% higher than the traditional BERT-BiGRU-CRF model. The experimental results show that the BERT-MF-BiGRU-CRF model can effectively perform NER for shipping enterprise credit text data, which is helpful for constructing a credit knowledge graph for shipping enterprises, while the research results can provide references for recognizing complex and nested entities in other fields. Full article

22 pages, 692 KiB  
Article
Quality Control for Distantly-Supervised Data-to-Text Generation via Meta Learning
by Heng Gong, Xiaocheng Feng and Bing Qin
Appl. Sci. 2023, 13(9), 5573; https://doi.org/10.3390/app13095573 - 30 Apr 2023
Viewed by 951
Abstract
Data-to-text generation plays an important role in natural language processing by processing structured data and helping people understand those data by generating user-friendly descriptive text. It can be applied to news generation, financial report generation, customer service, etc. However, in practice, it needs to adapt to different domains that may lack an annotated training corpus. To alleviate this dataset scarcity problem, distantly-supervised data-to-text generation has emerged, which constructs a training corpus automatically and is more practical to apply to new domains when well-aligned data is expensive to obtain. However, this distant supervision method of training induces an over-generation problem since the automatically aligned text includes hallucination. These expressions cannot be inferred from the data, misguiding the model to produce unfaithful text. To exploit the noisy dataset while maintaining faithfulness, we empower the neural data-to-text model by dynamically increasing the weights of those well-aligned training instances and reducing the weights of the low-quality ones via meta learning. To the best of our knowledge, we are the first to alleviate the noise in distantly-supervised data-to-text generation via meta learning. In addition, we rewrite those low-quality texts to provide better training instances. Finally, we construct a new distantly-supervised dataset, DIST-ToTTo (abbreviation for Distantly-supervised Table-To-Text), and conduct experiments on both the benchmark WITA (abbreviation for the data source Wikipedia and Wikidata) and DIST-ToTTo datasets. The evaluation results show that our model can improve the state-of-the-art DSG (abbreviation for Distant Supervision Generation) model across all automatic evaluation metrics, with an improvement of 3.72% on the WITA dataset and 3.82% on the DIST-ToTTo dataset in terms of the widely used metric BLEU (abbreviation for BiLingual Evaluation Understudy). Furthermore, based on human evaluation, our model can generate more grammatically correct and more faithful text compared to the state-of-the-art DSG model. Full article
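
The instance-weighting idea reduces to a weighted training loss. A PyTorch sketch follows, with the weights given explicitly rather than produced by the meta-learning inner loop the paper describes; the toy vocabulary and values are illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_nll(logits, targets, weights):
    """Per-instance weighted loss: well-aligned (data, text) pairs get
    larger weights, noisy or hallucinated ones smaller. In the paper
    the weights come from a meta-learning step; here they are fixed."""
    per_example = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_example).sum() / weights.sum()

logits = torch.randn(4, 50, requires_grad=True)  # toy vocabulary of 50
targets = torch.randint(0, 50, (4,))
weights = torch.tensor([1.0, 0.9, 0.2, 0.1])     # low weight = suspected noise
weighted_nll(logits, targets, weights).backward()
print(logits.grad.abs().mean())
```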

11 pages, 628 KiB  
Article
Improving Named Entity Recognition for Social Media with Data Augmentation
by Wenzhong Liu and Xiaohui Cui
Appl. Sci. 2023, 13(9), 5360; https://doi.org/10.3390/app13095360 - 25 Apr 2023
Cited by 3 | Viewed by 1714
Abstract
Social media is an important source of text information; however, due to its informal and unstructured nature, traditional named entity recognition (NER) methods face the challenge of achieving high accuracy when dealing with social media data. This paper proposes a new method for social media named entity recognition with data augmentation. First, we pre-train the language model using Bidirectional Encoder Representations from Transformers (BERT) to obtain a semantic vector for each word based on its contextual information. Then, we obtain similar entities via data augmentation methods and perform substitution or semantic transformation on these entities. After that, the augmented input is fed into the Bi-LSTM model for training, and the outputs are fused and fine-tuned to obtain the best labels. In addition, a self-attention layer captures the essential information of the features and reduces the reliance on external information. Experimental results on the WNUT16, WNUT17, and OntoNotes 5.0 datasets confirm the effectiveness of our proposed model. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
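As one concrete reading of the entity-substitution augmentation described above, the sketch below replaces BIO-tagged entity spans with same-type entities drawn from a pool. The pool, tag set, and sentence are invented for illustration; a real system would mine similar entities from embeddings or gazetteers.

```python
import random

# Toy same-type entity inventory (hypothetical).
ENTITY_POOL = {
    "PER": ["alice", "bob", "carol"],
    "LOC": ["paris", "tokyo", "cairo"],
}

def substitute_entities(tokens, bio_tags, p=0.5):
    """Replace each entity span with a random same-type entity (BIO scheme)."""
    out, i = [], 0
    while i < len(tokens):
        tag = bio_tags[i]
        if tag.startswith("B-") and random.random() < p:
            etype = tag[2:]
            j = i + 1
            while j < len(tokens) and bio_tags[j] == f"I-{etype}":
                j += 1                      # skip the rest of the span
            out.append((random.choice(ENTITY_POOL[etype]), f"B-{etype}"))
            i = j
        else:
            out.append((tokens[i], tag))
            i += 1
    return zip(*out)                        # (new_tokens, new_tags)

tokens = ["bob", "visited", "new", "york", "today"]
tags = ["B-PER", "O", "B-LOC", "I-LOC", "O"]
print(*substitute_entities(tokens, tags, p=1.0))
```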
13 pages, 415 KiB  
Article
Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach
by Saima Shaukat, Muhammad Asad and Asmara Akram
Appl. Sci. 2023, 13(8), 5103; https://doi.org/10.3390/app13085103 - 19 Apr 2023
Viewed by 1520
Abstract
Lemmatization aims to return the root form of a word. A lemmatizer is envisioned as a vital instrument that can assist many Natural Language Processing (NLP) tasks, including Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly resourced languages. However, there have been no thorough efforts to develop a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms, which makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for Urdu based on a novel dictionary lookup approach. The contributions of this research are the following: (1) the development of a large benchmark corpus for Urdu, (2) the exploration of the relationship between part-of-speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part-of-Speech (PoS) tags on our proposed dictionary lookup approach. The empirical results show that we achieved a best accuracy score of 76.44% with the proposed dictionary lookup approach. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
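The core dictionary lookup is straightforward to picture. A minimal sketch follows, with invented romanized entries standing in for the paper's Urdu dictionary and PoS tags:

```python
# A minimal dictionary-lookup lemmatizer: (surface form, PoS) pairs map to
# lemmas, with a PoS-free fallback, then identity. Entries are illustrative
# romanized examples, not the paper's actual dictionary.
LEMMA_DICT = {
    ("larkiyan", "NN"): "larki",   # "girls" -> "girl" (hypothetical entry)
    ("gaya", "VB"): "jana",        # "went" -> "to go" (hypothetical entry)
}
FALLBACK_DICT = {"larkiyan": "larki", "gaya": "jana"}

def lemmatize(token, pos=None):
    """PoS-aware lookup first, then PoS-free lookup, then identity."""
    if pos is not None and (token, pos) in LEMMA_DICT:
        return LEMMA_DICT[(token, pos)]
    return FALLBACK_DICT.get(token, token)

print(lemmatize("larkiyan", "NN"))  # -> larki
print(lemmatize("kitab"))           # unknown word -> returned unchanged
```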
18 pages, 1133 KiB  
Article
Conditional Knowledge Extraction Using Contextual Information Enhancement
by Zhangbiao Xu, Botao Zhang, Jinguang Gu and Feng Gao
Appl. Sci. 2023, 13(8), 4954; https://doi.org/10.3390/app13084954 - 14 Apr 2023
Viewed by 1367
Abstract
Conditional phrases provide fine-grained domain knowledge in various industries, including medicine and manufacturing. Most existing knowledge extraction research focuses on mining triplets with entities and relations and treats such triplet knowledge as plain facts, without considering the conditional modality of those facts. We argue that such approaches are insufficient for building knowledge-based decision support systems in vertical domains, where specific and professional instructions on what facts apply under given circumstances are indispensable. To address this issue, this paper proposes a condition-aware knowledge extraction method using contextual information. In particular, this paper first fine-tunes the pre-trained model with a local context enhancement to capture the positional context of conditional phrases; then, a sentence-level context enhancement is used to integrate sentence semantics; finally, the correspondences between conditional phrases and relation triplets are extracted using syntactic attention. Experimental results on public and proprietary datasets show that our model can successfully retrieve conditional phrases with relevant triplets while improving the accuracy of the matching task by 2.68% compared to the baseline. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
17 pages, 1955 KiB  
Article
A Small-Sample Text Classification Model Based on Pseudo-Label Fusion Clustering Algorithm
by Linda Yang, Baohua Huang, Shiqian Guo, Yunjie Lin and Tong Zhao
Appl. Sci. 2023, 13(8), 4716; https://doi.org/10.3390/app13084716 - 08 Apr 2023
Cited by 2 | Viewed by 1766
Abstract
Text classification has been a mainstream research branch in natural language processing, and how to improve classification when labeled samples are scarce is one of the hot issues in this direction. Current models supporting small-sample classification can learn knowledge and train with a small number of labels, but the classification results are not satisfactory enough. To improve the classification accuracy, we propose a Small-sample Text Classification model based on the Pseudo-label fusion Clustering algorithm (STCPC). The algorithm has two cores: (1) mining the potential features of unlabeled data with a training strategy that assigns pseudo-labels by clustering, and then reducing the noise in the pseudo-labeled dataset through consistency training with its augmented samples, so as to improve the quality of the pseudo-labels; and (2) augmenting the labeled data and then using the Easy Plug-in Data Augmentation (EPiDA) framework to balance the diversity and quality of the augmented samples, so as to reasonably improve the richness of the labeled data. The results of comparison tests with other classical algorithms show that the STCPC model can effectively improve classification accuracy. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
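The pseudo-labeling-by-clustering step can be sketched as follows, assuming sentence embeddings are already available. scikit-learn's KMeans and a distance-based confidence filter stand in for the paper's clustering and consistency-training noise reduction:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-ins for sentence embeddings of unlabeled texts (e.g., from BERT).
embeddings = rng.normal(size=(100, 32))

# Step 1: cluster the unlabeled data and treat cluster ids as pseudo-labels.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(embeddings)
pseudo = km.labels_

# Step 2: keep only points close to their centroid, a simple stand-in for
# the paper's consistency filtering with augmented samples.
dists = np.linalg.norm(embeddings - km.cluster_centers_[pseudo], axis=1)
keep = dists < np.percentile(dists, 50)   # retain the most confident half
print(f"kept {keep.sum()} of {len(pseudo)} pseudo-labeled samples")
```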
15 pages, 1259 KiB  
Article
Candidate Term Boundary Conflict Reduction Method for Chinese Geological Text Segmentation
by Yu Tang, Jiqiu Deng and Zhiyong Guo
Appl. Sci. 2023, 13(7), 4516; https://doi.org/10.3390/app13074516 - 02 Apr 2023
Cited by 2 | Viewed by 1110
Abstract
Although Chinese word segmentation (CWS) relies heavily on computing power to train huge models and on human labor to label corpora, models and algorithms are still not accurate enough, especially for segmentation in a specific domain. In this study, a high-degree-of-freedom-priority candidate term boundary conflict reduction method (HFCR) is proposed to avoid the need to manually set thresholds for entropy-based segmentation. We quantify the uncertainty of the left and right character connections of candidate terms and then arrange the terms in descending order for local comparisons to determine term boundaries. Dynamic numerical comparisons are adopted instead of setting a threshold manually and arbitrarily. Experiments show that the average F1-value of CWS for Chinese geological text is higher than 95% and the F1-value for general Chinese datasets is higher than 87%. Compared with representative tokenizers and the SOTA model, our method performs better: it solves the term boundary conflict problem well and performs excellently on single geological texts without any samples or labels. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
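The left/right connection uncertainty that HFCR ranks is essentially branching entropy. A minimal sketch of computing it from raw corpus counts (the toy corpus and function name are illustrative):

```python
import math
from collections import Counter

def branching_entropies(corpus, term):
    """Entropy of the characters adjacent to `term`: high entropy means the
    context varies freely, suggesting `term` is a self-contained unit."""
    left, right = Counter(), Counter()
    for sent in corpus:
        start = 0
        while (i := sent.find(term, start)) != -1:
            if i > 0:
                left[sent[i - 1]] += 1
            j = i + len(term)
            if j < len(sent):
                right[sent[j]] += 1
            start = i + 1

    def entropy(counter):
        total = sum(counter.values())
        if not total:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)

corpus = ["abxcd", "abycd", "abzcd", "qabr"]
print(branching_entropies(corpus, "ab"))  # -> (0.0, 2.0): right side varies
```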
15 pages, 578 KiB  
Article
Affective-Knowledge-Enhanced Graph Convolutional Networks for Aspect-Based Sentiment Analysis with Multi-Head Attention
by Xiaodong Cui, Wenbiao Tao and Xiaohui Cui
Appl. Sci. 2023, 13(7), 4458; https://doi.org/10.3390/app13074458 - 31 Mar 2023
Cited by 4 | Viewed by 1809
Abstract
Aspect-based sentiment analysis (ABSA) is a natural language processing (NLP) task that involves predicting the sentiment polarity towards a specific aspect in a text. Graph neural networks (GNNs) have been shown to be effective tools for sentiment analysis tasks, but current research often overlooks affective information in the text, leading to irrelevant information being learned for specific aspects. To address this issue, we propose a novel GNN model, MHAKE-GCN, based on the graph convolutional network (GCN) and multi-head attention (MHA). Our model incorporates external sentiment knowledge into the GCN and fully extracts semantic and syntactic information from a sentence using MHA. By adding weights to sentiment words associated with aspect words, our model can better learn sentiment expressions related to specific aspects. Our model was evaluated on four public benchmark datasets and compared against twelve other methods. The experimental results demonstrate the effectiveness of the proposed model for aspect-based sentiment analysis. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
18 pages, 4291 KiB  
Article
CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement
by Shiqian Guo, Yansun Huang, Baohua Huang, Linda Yang and Cong Zhou
Appl. Sci. 2023, 13(6), 4056; https://doi.org/10.3390/app13064056 - 22 Mar 2023
Cited by 4 | Viewed by 1283
Abstract
This paper proposes a method for improving the XLNet model to address the shortcomings of its segmentation algorithm when processing Chinese, such as long sub-word lengths, long word lists, and incomplete word list coverage. To address these issues, we propose the CWSXLNet (Chinese Word Segmentation XLNet) model, based on Chinese word segmentation information enhancement. The model first pre-processes the Chinese pre-training text with a Chinese word segmentation tool and introduces a Chinese word segmentation attention mask mechanism that combines the PLM (Permuted Language Model) with XLNet's two-stream self-attention mechanism. While performing natural language processing at word granularity, it can reduce the degree of masking between masked and non-masked tokens that belong to the same word. For the Chinese sentiment analysis task, we propose the CWSXLNet-BiGRU-Attention model, which introduces a bi-directional GRU as well as a self-attention mechanism in the downstream task. Experiments show that CWSXLNet achieves 89.91% precision, 91.53% recall, and a 90.71% F1-score, and CWSXLNet-BiGRU-Attention achieves 92.61% precision, 93.19% recall, and a 92.90% F1-score on the ChnSentiCorp dataset, indicating that CWSXLNet performs better than other models in Chinese sentiment analysis. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
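One way to picture the segmentation-aware masking: a character-level matrix marking which positions belong to the same word, which can then be used to soften attention masking inside words. This is a simplified sketch; the paper's mechanism operates inside XLNet's two-stream attention.

```python
import numpy as np

def same_word_mask(segmented):
    """Given a word-segmented sentence (list of words), return a
    character-level matrix marking pairs of characters in the same word."""
    spans, pos = [], 0
    for w in segmented:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    mask = np.zeros((pos, pos), dtype=np.float32)
    for a, b in spans:
        mask[a:b, a:b] = 1.0        # characters of one word see each other
    return mask

# "自然语言" segmented as ["自然", "语言"] (illustrative segmentation)
print(same_word_mask(["自然", "语言"]))
```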
13 pages, 372 KiB  
Article
Regret and Hope on Transformers: An Analysis of Transformers on Regret and Hope Speech Detection Datasets
by Grigori Sidorov, Fazlourrahman Balouchzahi, Sabur Butt and Alexander Gelbukh
Appl. Sci. 2023, 13(6), 3983; https://doi.org/10.3390/app13063983 - 21 Mar 2023
Cited by 1 | Viewed by 1294
Abstract
In this paper, we analyzed the performance of different transformer models for regret and hope speech detection on two novel datasets. For the regret detection task, we compared the averaged macro-scores of the transformer models to the previous state-of-the-art results. We found that the transformer models outperformed the previous approaches: the RoBERTa-based model achieved the highest averaged macro F1-score of 0.83, beating the previous state-of-the-art score of 0.76. For the hope speech detection task, the uncased BERT-based model achieved the highest averaged macro F1-score of 0.72 among the transformer models. However, the performance of each model varied slightly depending on the task and dataset. Our findings highlight the effectiveness of transformer models for hope speech and regret detection tasks, and the importance of considering the effects of context, specific transformer architectures, and pre-training on their performance. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
17 pages, 966 KiB  
Article
Improving Many-to-Many Neural Machine Translation via Selective and Aligned Online Data Augmentation
by Weitai Zhang, Lirong Dai, Junhua Liu and Shijin Wang
Appl. Sci. 2023, 13(6), 3946; https://doi.org/10.3390/app13063946 - 20 Mar 2023
Cited by 1 | Viewed by 1383
Abstract
Multilingual neural machine translation (MNMT) models are theoretically attractive for low- and zero-resource language pairs because of cross-lingual knowledge transfer. Existing approaches mainly focus on English-centric directions and usually underperform their pivot-based counterparts for non-English directions. In this work, we aim to build a many-to-many MNMT system with an emphasis on the quality of non-English directions by exploring selective and aligned online data augmentation algorithms. Based on our finding that augmented synthetic samples are not a case of "the more, the better", we propose selective online back-translation (SOBT) and thoroughly study different selection criteria to pick suitable samples for training. Furthermore, we boost SOBT with cross-lingual online substitution (CLOS) to align token representations and encourage transfer learning. Our intuition is based on the hypothesis that a universal cross-lingual representation leads to better multilingual translation performance, especially for non-English directions. Compared to previous state-of-the-art many-to-many MNMT models and conventional pivot-based methods, experiments on the IWSLT2014 and OPUS-100 translation benchmarks show that our approach achieves competitive or even better performance on English-centric directions and achieves up to ∼12 BLEU for non-English directions. All of our models and code are publicly available. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
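The selection step of SOBT can be pictured as a generic rank-and-filter over synthetic pairs. A toy sketch with placeholder translation and scoring functions follows; the paper studies several real selection criteria, which `score` merely stands in for:

```python
def selective_back_translation(monolingual, translate, score, keep_ratio=0.5):
    """Back-translate target-side monolingual sentences, then keep only the
    synthetic pairs judged most useful by a quality criterion."""
    synthetic = [(translate(t), t) for t in monolingual]       # (src', tgt)
    ranked = sorted(synthetic, key=lambda pair: score(*pair), reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]

# Toy usage; the "model" and the criterion below are both placeholders.
mono = ["guten morgen", "wie geht es dir", "danke schoen", "gute nacht"]
fake_translate = lambda t: t[::-1]
fake_score = lambda src, tgt: -len(tgt)   # e.g., prefer shorter sentences
print(selective_back_translation(mono, fake_translate, fake_score))
```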
10 pages, 1049 KiB  
Article
Event Detection Using a Self-Constructed Dependency and Graph Convolution Network
by Li He, Qingxin Meng, Qing Zhang, Jianyong Duan and Hao Wang
Appl. Sci. 2023, 13(6), 3919; https://doi.org/10.3390/app13063919 - 19 Mar 2023
Cited by 3 | Viewed by 1099
Abstract
Extant event detection models that rely on dependency parsing have exhibited commendable efficacy. However, for long sentences with many words, the results of dependency parsing are more complex, because each word corresponds to a directed edge with a dependency label. Not all of these edges provide guidance for the event detection model, and the accuracy of dependency parsing tools decreases as sentence length increases, resulting in error propagation. To solve these problems, we developed an event detection model that uses a self-constructed dependency graph and a graph convolution network. First, we statistically analyzed the ACE2005 corpus to prune the dependency parse tree and combined the named entity features in the sentence to generate an undirected graph. Second, we implemented an enhanced graph convolution network with a multi-head attention mechanism to learn the representations of nodes in the graph. Finally, a gating mechanism combines the semantic and structural dependency information of the sentence, enabling us to accomplish the event detection task. A series of experiments conducted on the ACE2005 corpus demonstrates that the proposed method enhances the performance of the event detection model. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
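The graph-construction step (pruned dependency edges plus entity links, made undirected) might look roughly like the sketch below. The relation whitelist, example sentence, and edges are illustrative, not the paper's actual pruning statistics:

```python
import numpy as np

def build_graph(n_tokens, dep_edges, keep_rels, entity_spans):
    """Undirected adjacency matrix for a GCN: keep only whitelisted
    dependency relations and connect tokens inside the same named entity."""
    A = np.eye(n_tokens, dtype=np.float32)            # self-loops
    for head, dep, rel in dep_edges:
        if rel in keep_rels:
            A[head, dep] = A[dep, head] = 1.0
    for start, end in entity_spans:
        A[start:end, start:end] = 1.0                 # link entity tokens
    return A

# "UN forces arrested the suspect" (edges and whitelist are illustrative)
edges = [(2, 1, "nsubj"), (2, 4, "obj"), (4, 3, "det"), (1, 0, "compound")]
print(build_graph(5, edges, keep_rels={"nsubj", "obj"}, entity_spans=[(0, 2)]))
```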
18 pages, 1319 KiB  
Article
PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies
by Atreya Shankar, Andreas Waldis, Christof Bless, Maria Andueza Rodriguez and Luca Mazzola
Appl. Sci. 2023, 13(6), 3701; https://doi.org/10.3390/app13063701 - 14 Mar 2023
Cited by 1 | Viewed by 2166
Abstract
Benchmarks for general language understanding have been developing rapidly in recent years of NLP research, particularly because of their utility in choosing strong-performing models for practical downstream applications. While benchmarks have been proposed in the legal language domain, virtually no such benchmarks exist for privacy policies, despite their increasing importance in modern digital life. This could be explained by privacy policies falling under the legal language domain, but we find evidence to the contrary that motivates a separate benchmark for privacy policies. Consequently, we propose PrivacyGLUE as the first comprehensive benchmark of relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. Furthermore, we report the performance of multiple transformer language models and perform model–pair agreement analysis to detect tasks where models benefited from domain specialization. Our findings show the importance of in-domain pre-training for privacy policies. We believe PrivacyGLUE can accelerate NLP research and improve general language understanding for humans and AI algorithms in the privacy language domain, thereby supporting the adoption and acceptance of solutions based on it. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
18 pages, 368 KiB  
Article
Joint Syntax-Enhanced and Topic-Driven Graph Networks for Emotion Recognition in Multi-Speaker Conversations
by Hui Yu, Tinghuai Ma, Li Jia, Najla Al-Nabhan and M. M. Abdel Wahab
Appl. Sci. 2023, 13(6), 3548; https://doi.org/10.3390/app13063548 - 10 Mar 2023
Viewed by 1218
Abstract
Daily conversations contain rich emotional information, and identifying this emotional information has become a hot task in the field of natural language processing. Traditional dialogue sentiment analysis methods study one-to-one dialogues and cannot be effectively applied to multi-speaker dialogues. This paper focuses on the relationships between participants in a multi-speaker conversation and analyzes the influence of each speaker on the emotion of the whole conversation. We summarize the challenges of emotion recognition in multi-speaker dialogue, focusing on the context-topic switching problem caused by its free flow of topics. For this challenge, this paper proposes a graph network that combines syntactic structure and topic information. A syntax module is designed to convert sentences into graphs, using edges to represent dependencies between words and thus addressing the colloquial nature of daily conversations. We use graph convolutional networks to extract the implicit meaning of discourse. In addition, we focus on the impact of topic information on sentiment, so we design a topic module that optimizes the topic extraction and classification of sentences with a VAE. Then, we use a combination of the attention mechanism and syntactic structure to strengthen the model's ability to analyze sentences. Furthermore, topic segmentation is adopted to solve the long-range dependency problem, and a heterogeneous graph is used to model the dialogue. The nodes of the graph combine speaker information and utterance information. To capture the interaction between the subject and the object of the dialogue, different edge types represent different interaction relationships and are assigned different weights. The experimental results of our work on multiple public datasets show that the new model outperforms several alternative methods in sentiment label classification. On the multi-person dialogue dataset, the classification accuracy increases by more than 4%, which verifies the effectiveness of constructing heterogeneous dialogue graphs. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
19 pages, 794 KiB  
Article
Temporal Extraction of Complex Medicine by Combining Probabilistic Soft Logic and Textual Feature Feedback
by Jinguang Gu, Daiwen Wang, Danyang Hu, Feng Gao and Fangfang Xu
Appl. Sci. 2023, 13(5), 3348; https://doi.org/10.3390/app13053348 - 06 Mar 2023
Cited by 1 | Viewed by 1135
Abstract
In medical texts, temporal information describes events and changes in status, such as medical visits and discharges. According to its semantic features, it can be classified into simple time and complex time. Current research on time recognition usually focuses on coarse-grained simple time while ignoring fine-grained complex time. To address this problem, based on the semantic concept of complex time in the Clinical Time Ontology, we define seven basic features and eleven extraction rules and propose a complex medical time-extraction method that combines probabilistic soft logic and textual feature feedback. The framework consists of two parts: (a) text feature recognition based on probabilistic soft logic, which uses probabilistic soft logic for negative feedback adjustment; and (b) complex medical time entity recognition based on textual feature feedback, which builds on the text feature recognition model in (a) for positive feedback adjustment. Finally, the effectiveness of our approach is verified experimentally on text feature recognition and complex temporal entity recognition. In the text feature recognition task, our method shows the largest F1 improvement, 18.09%, on the Irregular Instant Collection type corresponding to utterance l17. In the complex medical temporal entity recognition task, the F1 metric improves most significantly, by 10.42%, on the Irregular Instant Collection type. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
17 pages, 717 KiB  
Article
A Personalized Multi-Turn Generation-Based Chatbot with Various-Persona-Distribution Data
by Shihao Zhu, Tinghuai Ma, Huan Rong and Najla Al-Nabhan
Appl. Sci. 2023, 13(5), 3122; https://doi.org/10.3390/app13053122 - 28 Feb 2023
Viewed by 2397
Abstract
Existing persona-based dialogue generation models focus on the semantic consistency between personas and responses. However, various influential factors can cause persona inconsistency, such as the speaking style in the context. Existing models handle speaking styles inflexibly on datasets with varied persona distributions, resulting in persona style inconsistency. In this work, we propose a dialogue generation model with a persona selection classifier to solve this complex inconsistency problem. The model generates responses in two steps: original response generation and response rewriting. For training, we employ two auxiliary tasks: (1) a persona selection task to fuse the adapted persona into the original responses; and (2) consistency inference to remove inconsistent persona information from the final responses. In our model, the adapted personas are predicted by an NLI-based classifier. We evaluate our model on persona dialogue datasets with different persona distributions, i.e., the persona-dense PersonaChat dataset and the persona-sparse PersonalDialog dataset. The experimental results show that our model outperforms strong models in response quality, persona consistency, and persona distribution consistency. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
16 pages, 3063 KiB  
Article
Refined Answer Selection Method with Attentive Bidirectional Long Short-Term Memory Network and Self-Attention Mechanism for Intelligent Medical Service Robot
by Deguang Wang, Ye Liang, Hengrui Ma and Fengqiang Xu
Appl. Sci. 2023, 13(5), 3016; https://doi.org/10.3390/app13053016 - 26 Feb 2023
Cited by 5 | Viewed by 1198
Abstract
Answer selection, a crucial component of intelligent medical service robots, has become more and more important in natural language processing (NLP). However, there are still some critical issues in answer selection models. On the one hand, such models lack semantic understanding of long questions because of noisy information in a question–answer (QA) pair. On the other hand, some researchers combine two or more neural network models to improve the quality of answer selection, but these models focus on the similarity between questions and answers without considering background information. To this end, this paper proposes a novel refined answer selection method that uses an attentive bidirectional long short-term memory (Bi-LSTM) network and a self-attention mechanism to solve these issues. First, this paper constructs the required knowledge-based text as background information and converts the questions and answers from words to vectors. Furthermore, the self-attention mechanism is adopted to extract global features from the vectors. Finally, an attentive Bi-LSTM network is designed to address long-distance dependency learning problems and to calculate the similarity between the question and answer while taking the background knowledge information into account. To verify the effectiveness of the proposed method, this paper constructs a knowledge-based QA dataset comprising multiple medical QA pairs and conducts a series of experiments on it. The experimental results reveal that the proposed approach achieves impressive performance on the answer selection task, reaching an accuracy of 71.4% and a MAP of 68.8%, and decreasing the BLEU indicator to 3.10. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
15 pages, 566 KiB  
Article
A Multi-Granularity Word Fusion Method for Chinese NER
by Tong Liu, Jian Gao, Weijian Ni and Qingtian Zeng
Appl. Sci. 2023, 13(5), 2789; https://doi.org/10.3390/app13052789 - 21 Feb 2023
Cited by 1 | Viewed by 1484
Abstract
Named entity recognition (NER) plays a crucial role in many downstream natural language processing (NLP) tasks, and it is challenging for Chinese because of certain features of the language. Recently, large-scale pre-trained language models have been used in Chinese NER. However, since some of these models do not use word information, or employ word information of only a single granularity, the semantic information in sentences cannot be fully captured, which affects the models' performance. To take full advantage of word information and obtain richer semantic information, we propose a multi-granularity word fusion method for Chinese NER. We introduce multi-granularity word information into our model and classify this information into three kinds: strong, moderate, and weak. These kinds of information are encoded by encoders and then integrated with each other through a strong-weak feedback attention mechanism. Specifically, we apply two separate attention networks to word embeddings and N-gram embeddings, and their outputs are fused by another attention. In all three attentions, character embeddings serve as the query. We call the result multi-granularity word information. To combine character information and multi-granularity word information, we introduce two fusion strategies for better performance. This process provides our model with rich semantic information and explicitly reduces word segmentation errors and noise. We design experiments to obtain our model's best performance by comparing components, and an ablation study verifies the effectiveness of each module. The final experiments are conducted on four Chinese NER benchmark datasets, yielding F1 scores of 81.51% on OntoNotes 4.0, 95.47% on MSRA, 95.87% on Resume, and 69.41% on Weibo. The best improvement achieved by the proposed method is 1.37%. The experimental results show that our method outperforms most baselines and achieves state-of-the-art performance. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
11 pages, 1134 KiB  
Article
Document-Level Event Role Filler Extraction Using Key-Value Memory Network
by Hao Wang, Miao Li, Jianyong Duan, Li He and Qing Zhang
Appl. Sci. 2023, 13(4), 2724; https://doi.org/10.3390/app13042724 - 20 Feb 2023
Viewed by 1170
Abstract
Previous work has demonstrated that end-to-end neural sequence models work well for document-level event role filler extraction. However, such models cannot utilize global information, resulting in incomplete extraction of document-level event arguments, because the inputs to the BiLSTM are all single-word vectors with no contextual information. This phenomenon is particularly pronounced at the document level. To address this problem, we propose key-value memory networks to enhance document-level contextual information, and the overall model operates at two levels: the sentence level and the document level. At the sentence level, we use a BiLSTM to obtain key sentence information. At the document level, we use a key-value memory network to enhance document-level representations by recording information about the words in articles that are sensitive to contextual similarity. We fuse the two levels of contextual information by means of a fusion formula. We perform various experimental validations on the MUC-4 dataset, and the results show that the model using key-value memory networks works better than the other models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
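A single key-value memory read, the building block behind the document-level enhancement described above, reduces to attention over keys followed by a weighted sum of values. A minimal PyTorch sketch with random stand-in vectors:

```python
import torch
import torch.nn.functional as F

def kv_memory_read(query, keys, values):
    """One key-value memory lookup: attend over keys, return a blend of
    values. In the paper's setting, the slots would encode document-level
    context rather than random vectors."""
    scores = keys @ query                  # (num_slots,)
    attn = F.softmax(scores, dim=0)        # attention over memory slots
    return attn @ values                   # weighted sum of value vectors

d, slots = 8, 5
query = torch.randn(d)
keys, values = torch.randn(slots, d), torch.randn(slots, d)
print(kv_memory_read(query, keys, values).shape)  # torch.Size([8])
```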
17 pages, 1317 KiB  
Article
Type Hierarchy Enhanced Event Detection without Triggers
by Youcheng Yan, Zhao Liu, Feng Gao and Jinguang Gu
Appl. Sci. 2023, 13(4), 2296; https://doi.org/10.3390/app13042296 - 10 Feb 2023
Viewed by 1104
Abstract
Event detection (ED) aims to detect events in a given text and categorize them into event types. Most current approaches to ED rely heavily on human annotations of triggers, which are often costly and limit the application of ED in other fields. However, triggers are not necessary for the event detection task. To avoid this problem, we propose a novel framework called Type Hierarchy Enhanced Event Detection Without Triggers (THEED). More specifically, we construct a type hierarchy concept module using the external knowledge graph Probase to enhance the semantic representation of event types. In addition, we divide input instances into word-level and context-level representations, which allows the model to use features at different levels. The experimental results indicate that our proposed approach achieves a notable improvement and is highly competitive with mainstream trigger-based models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
23 pages, 2695 KiB  
Article
A Multi-Attention Approach Using BERT and Stacked Bidirectional LSTM for Improved Dialogue State Tracking
by Muhammad Asif Khan, Yi Huang, Junlan Feng, Bhuyan Kaibalya Prasad, Zafar Ali, Irfan Ullah and Pavlos Kefalas
Appl. Sci. 2023, 13(3), 1775; https://doi.org/10.3390/app13031775 - 30 Jan 2023
Cited by 1 | Viewed by 2149
Abstract
The modern digital world, and the innovative state-of-the-art applications that characterize it, render the current digital age a captivating era for many worldwide. These innovations include dialogue systems, such as Apple's Siri, Google Now, and Microsoft's Cortana, which reside on users' personal devices and assist them in their daily activities. These systems track users' intentions by analyzing their speech, the context of their previous turns, and several other external details, and they respond or act in the form of speech output. For such systems to work efficiently, a dialogue state tracking (DST) module is required to infer the current state of the dialogue in a conversation by processing previous states up to the current one. However, developing a DST module that tracks and exploits dialogue states effectively and accurately is challenging. Notable challenges that warrant immediate attention include scalability, handling unseen slot-value pairs during training, and retraining the model when the domain ontology changes. In this article, we present a new end-to-end framework combining BERT, a stacked Bidirectional LSTM (BiLSTM), and a multiple attention mechanism to formalize DST as a classification problem and address the aforementioned issues. The BERT-based module encodes the user's and system's utterances. The stacked BiLSTM extracts contextual features, and multiple attention mechanisms calculate the attention between its hidden states and the utterance embeddings. We experimentally evaluated our method against current approaches over a variety of datasets, and the results indicate a significant overall improvement. The proposed model is scalable in terms of parameter sharing and considers unseen instances during training. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
19 pages, 3655 KiB  
Article
Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data
by Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh and Grigori Sidorov
Appl. Sci. 2023, 13(2), 1201; https://doi.org/10.3390/app13021201 - 16 Jan 2023
Cited by 11 | Viewed by 3630
Abstract
Despite the many proposals to solve the neural machine translation (NMT) problem for low-resource languages, it continues to be difficult. The issue becomes even more complicated when the few available resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language pair. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
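The self-learning loop with source-side monolingual data can be summarized in a few lines; `train` and `translate` below are placeholder stand-ins for a full NMT pipeline, so this is a sketch of the loop's shape rather than the authors' code:

```python
def self_learning(authentic_pairs, mono_src, train, translate, rounds=2):
    """Self-learning with source-side monolingual data: the current model
    labels monolingual source sentences, and the synthetic pairs are mixed
    with the authentic corpus for the next round."""
    model = train(authentic_pairs)
    for _ in range(rounds):
        synthetic = [(s, translate(model, s)) for s in mono_src]
        model = train(authentic_pairs + synthetic)
    return model

# Toy usage with placeholder components:
train = lambda pairs: {"size": len(pairs)}       # "model" = dict
translate = lambda model, s: s.upper()           # placeholder decoder
model = self_learning([("a", "A"), ("b", "B")], ["c", "d"], train, translate)
print(model)  # {'size': 4}
```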
15 pages, 1267 KiB  
Article
Span-Based Fine-Grained Entity-Relation Extraction via Sub-Prompts Combination
by Ning Yu, Jianyi Liu and Yu Shi
Appl. Sci. 2023, 13(2), 1159; https://doi.org/10.3390/app13021159 - 15 Jan 2023
Viewed by 1390
Abstract
With the development of information extraction technology, a variety of entity-relation extraction paradigms have been formed. However, approaches guided by these existing paradigms suffer from insufficient information fusion and overly coarse extraction granularity, making it difficult to extract all the triples in a sentence. Moreover, joint entity-relation extraction models cannot easily adapt to the relation extraction task. Therefore, more fine-grained and flexible extraction methods are needed. In this paper, we propose a new extraction paradigm based on existing paradigms. Building on it, we propose SSPC, a method for Span-based Fine-Grained Entity-Relation Extraction via Sub-Prompts Combination. SSPC first decomposes the task into three sub-tasks, namely S,R Extraction, R,O Extraction, and S,R,O Classification, and then uses prompt tuning to fully integrate entity and relation information in each part. This fine-grained extraction framework makes the model easier to adapt to other similar tasks. We conduct experiments on joint entity-relation extraction and relation extraction, respectively. The experimental results show that our model outperforms previous methods and achieves state-of-the-art results on ADE, TACRED, and TACREV. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
12 pages, 543 KiB  
Article
Ensemble-NQG-T5: Ensemble Neural Question Generation Model Based on Text-to-Text Transfer Transformer
by Myeong-Ha Hwang, Jikang Shin, Hojin Seo, Jeong-Seon Im, Hee Cho and Chun-Kwon Lee
Appl. Sci. 2023, 13(2), 903; https://doi.org/10.3390/app13020903 - 09 Jan 2023
Cited by 6 | Viewed by 2847
Abstract
Deep learning chatbot research and development has exploded recently, in order to offer customers in numerous industries personalized services. However, creating a learning dataset for a deep learning chatbot requires substantial human resources. To augment such datasets, the idea of neural question generation (NQG) has evolved, although it is limited in how variously questions can be expressed and has a finite capacity for question generation. In this paper, we propose an ensemble-type NQG model based on the text-to-text transfer transformer (T5). With the proposed model, the number of questions generated by each single NQG model can be greatly increased by considering their mutual similarity and quality using the soft-voting method. For training the soft-voting algorithm, the evaluation score and mutual similarity score weights, based on the context and the question–answer (QA) dataset, are used as the threshold weight. Performance comparisons with existing T5-based NQG models on the SQuAD 2.0 dataset demonstrate the effectiveness of the proposed method for QG. The proposed ensemble model is anticipated to span diverse industrial fields, including interactive chatbots, robotic process automation (RPA), and Internet of Things (IoT) services in the future. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
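A rough reading of the soft-voting idea: blend a quality estimate with average mutual similarity and keep candidates above a threshold. The sketch below uses difflib as a cheap stand-in for the paper's similarity scoring, and the weighting scheme is an assumption:

```python
from difflib import SequenceMatcher

def soft_vote(candidates, quality, sim_w=0.5, threshold=0.6):
    """Keep generated questions whose blended score (quality estimate plus
    average mutual similarity) clears a threshold."""
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    kept = []
    for q in candidates:
        others = [c for c in candidates if c != q]
        mutual = sum(sim(q, o) for o in others) / max(len(others), 1)
        score = sim_w * mutual + (1 - sim_w) * quality(q)
        if score >= threshold:
            kept.append((q, round(score, 3)))
    return kept

cands = ["what is NQG?", "what does NQG mean?", "who likes pizza?"]
toy_quality = lambda q: 0.9 if "NQG" in q else 0.4   # placeholder estimator
print(soft_vote(cands, toy_quality))  # the off-topic question is filtered out
```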
28 pages, 15345 KiB  
Article
GeoBERT: Pre-Training Geospatial Representation Learning on Point-of-Interest
by Yunfan Gao, Yun Xiong, Siqi Wang and Haofen Wang
Appl. Sci. 2022, 12(24), 12942; https://doi.org/10.3390/app122412942 - 16 Dec 2022
Cited by 3 | Viewed by 3031
Abstract
Thanks to the development of geographic information technology, geospatial representation learning based on POIs (Points-of-Interest) has gained widespread attention in the past few years. POIs are an important indicator of urban socioeconomic activities and are widely used to extract geospatial information. However, previous studies often focus on a specific area, such as a city or a district, and are designed only for particular tasks, such as land-use classification. On the other hand, large-scale pre-trained models (PTMs) have recently achieved impressive success and become a milestone in artificial intelligence (AI). Against this background, this study proposes the first large-scale pre-trained geospatial representation learning model, called GeoBERT. First, we collect about 17 million POIs in 30 cities across China to construct pre-training corpora, with 313 POI types as the tokens and level-7 Geohash grids as the basic units. Second, we pre-train GeoBERT to learn grid embeddings in a self-supervised manner by masking POI types and then predicting them. Third, under the "pre-training + fine-tuning" paradigm, we design five practical downstream tasks. Experiments show that, with just one additional fine-tuned output layer, GeoBERT outperforms the NLP methods previously used in geospatial representation learning (Word2vec, GloVe) by 9.21% on average in F1-score on classification tasks, such as store site recommendation and working/living area prediction. For regression tasks, such as POI number prediction, house price prediction, and passenger flow prediction, GeoBERT demonstrates even greater performance improvements. The experimental results prove that pre-training on large-scale POI data can significantly improve the ability to extract geospatial information. In the discussion section, we provide a detailed analysis of what GeoBERT has learned from the perspective of attention mechanisms. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
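The pre-training objective resembles masked language modeling over POI-type tokens within a geohash cell. A toy corruption step follows; the cells and POI types are invented for illustration:

```python
import random

# Each level-7 geohash cell becomes a "sentence" of POI-type tokens
# (cells and types here are hypothetical).
cells = {
    "wtw3sjq": ["cafe", "bank", "metro", "cafe", "pharmacy"],
    "wtw3sjr": ["school", "park", "cafe"],
}

def mask_cell(tokens, mask_token="[MASK]", p=0.15, rng=random.Random(0)):
    """MLM-style corruption for one grid cell: hide some POI types; the
    model would be trained to recover them from the remaining context."""
    inputs, targets = [], []
    for t in tokens:
        if rng.random() < p:
            inputs.append(mask_token)
            targets.append(t)       # to be predicted
        else:
            inputs.append(t)
            targets.append(None)    # not scored
    return inputs, targets

for cell, toks in cells.items():
    print(cell, *mask_cell(toks, p=0.4))  # higher p for a visible effect
```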
20 pages, 1788 KiB  
Article
Towards Domain-Specific Knowledge Graph Construction for Flight Control Aided Maintenance
by Chuanyou Li, Xinhang Yang, Shance Luo, Mingzhe Song and Wei Li
Appl. Sci. 2022, 12(24), 12736; https://doi.org/10.3390/app122412736 - 12 Dec 2022
Cited by 1 | Viewed by 1689
Abstract
Flight control is a key system of modern aircraft. During each flight, pilots use flight control to manage the forces of flight as well as the aircraft's direction and attitude. Whether flight control works properly is closely tied to safety, so daily maintenance is an essential task for airlines. Flight control maintenance heavily relies on expert knowledge. To facilitate knowledge acquisition, aircraft manufacturers and airlines normally provide structured manuals for consulting, and computer-aided maintenance systems are adopted to improve daily maintenance efficiency. However, we find that grass-roots engineers at airlines still inevitably consult unstructured technical manuals from time to time, for example, when they meet an unusual problem or an unfamiliar type of aircraft. Acquiring effective knowledge from unstructured data is inefficient and inconvenient. Aiming at this problem, we propose a knowledge-graph-based maintenance prototype system as a complementary solution. The knowledge graph we built is dedicated to unstructured manuals on flight control. We first build an ontology to represent key concepts and relation types, and then perform entity-relation extraction in a pipeline paradigm with natural language processing techniques. To fully utilize domain-specific features, we present a hybrid method consisting of dedicated rules and a machine learning model for entity recognition. For relation extraction, we leverage a two-stage Bi-LSTM (bi-directional long short-term memory network) based method that improves extraction precision by solving a sample imbalance problem. We conduct comprehensive experiments on real manuals from airlines to study technical feasibility. The average precision of entity recognition reaches 85%, and the average precision of relation extraction comes to 61%. Finally, we design a flight control maintenance prototype system based on the constructed knowledge graph and the graph database Neo4j. The prototype system takes alarm messages expressed in natural language as input and returns maintenance suggestions to serve grass-roots engineers. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
22 pages, 1864 KiB  
Article
An Embedding-Based Approach to Repairing OWL Ontologies
by Qiu Ji, Guilin Qi, Yinkai Yang, Weizhuo Li, Siying Huang and Yang Sheng
Appl. Sci. 2022, 12(24), 12655; https://doi.org/10.3390/app122412655 - 09 Dec 2022
Viewed by 1089
Abstract
High-quality ontologies are critical to ontology-based applications, such as natural language understanding and information extraction, but logical conflicts naturally occur in the lifecycle of ontology development. To deal with such conflicts, conflict detection and ontology repair become two critical tasks, and we focus on repairing ontologies. Most existing approaches to ontology repair rely on the syntax of axioms or on logical consequences but ignore the semantics of axioms. In this paper, we propose an embedding-based approach that considers sentence embeddings of axioms: it translates axioms into semantic vectors and provides facilities for computing semantic similarities among axioms. A threshold-based algorithm and a signature-based algorithm are designed to repair ontologies with the help of detected conflicts and axiom embeddings. In the experiments, our proposed algorithms are compared with existing ones over 20 real-life incoherent ontologies. The threshold-based algorithm is further evaluated with different distance metrics, 10 distinct thresholds, and 3 pre-trained models. The experimental results show that the embedding-based algorithms achieve promising performance. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
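One way axiom-embedding similarity could drive repair: inside a minimal conflict set, remove the axiom least similar on average to the others. This heuristic sketch is a deliberate simplification of the paper's threshold-based algorithm; the toy vectors stand in for real sentence embeddings of axioms:

```python
import numpy as np

def choose_axiom_to_remove(conflict, embed):
    """Pick the axiom whose embedding is least similar (on average) to the
    others in the conflict set, on the intuition that it is the semantic
    odd one out."""
    vecs = {a: embed(a) for a in conflict}
    def avg_sim(a):
        va = vecs[a]
        others = [v for b, v in vecs.items() if b != a]
        return float(np.mean([
            v @ va / (np.linalg.norm(v) * np.linalg.norm(va)) for v in others
        ]))
    return min(conflict, key=avg_sim)

# Toy embeddings standing in for sentence embeddings of axioms:
toy = {
    "Cat SubClassOf Animal": np.array([1.0, 0.1]),
    "Cat SubClassOf Pet":    np.array([0.9, 0.2]),
    "Cat SubClassOf Plant":  np.array([0.0, 1.0]),
}
print(choose_axiom_to_remove(list(toy), toy.get))  # -> Cat SubClassOf Plant
```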
22 pages, 2248 KiB  
Article
Improving User Intent Detection in Urdu Web Queries with Capsule Net Architectures
by Sana Shams and Muhammad Aslam
Appl. Sci. 2022, 12(22), 11861; https://doi.org/10.3390/app122211861 - 21 Nov 2022
Cited by 2 | Viewed by 2964
Abstract
Detecting the communicative intent behind user queries is critically required by search engines to understand a user's search goal and retrieve the desired results. Due to increased web searching in local languages, there is an emerging need to support language understanding for languages other than English. This article presents a distinctive capsule neural network architecture for intent detection from search queries in Urdu, a widely spoken South Asian language. The proposed two-tiered capsule network utilizes LSTM cells and an iterative routing mechanism between the capsules to effectively discriminate diversely expressed search intents. Since no Urdu query dataset was available, a benchmark intent-annotated dataset of 11,751 queries was developed, incorporating 11 query domains and annotated with Broder's intent taxonomy (i.e., navigational, transactional, and informational intents). Through rigorous experimentation, the proposed model attained a state-of-the-art accuracy of 91.12%, significantly improving upon several alternative classification techniques and strong baselines. An error analysis revealed systematic error patterns owing to class imbalance and the large lexical variability of Urdu web queries. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
21 pages, 4512 KiB  
Article
Modularization Method to Reuse Medical Knowledge Graphs
by Maricela Bravo, Darinel González-Villarreal, José A. Reyes-Ortiz and Leonardo D. Sánchez-Martínez
Appl. Sci. 2022, 12(22), 11816; https://doi.org/10.3390/app122211816 - 21 Nov 2022
Cited by 2 | Viewed by 1303
Abstract
During the creation and integration of a health care system based on medical knowledge graphs, it is necessary to review and select the vocabularies and definitions that best fit the information requirements of the system being developed. This implies reusing medical knowledge graphs; however, fully importing knowledge graphs is not a tractable solution in terms of memory requirements. In this paper, we present a modularization-based method for knowledge graph reuse. A case study of graph reuse is presented in which the original model is transformed into a lighter one. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
16 pages, 530 KiB  
Article
Fully Attentional Network for Low-Resource Academic Machine Translation and Post Editing
by İlhami Sel and Davut Hanbay
Appl. Sci. 2022, 12(22), 11456; https://doi.org/10.3390/app122211456 - 11 Nov 2022
Cited by 2 | Viewed by 1443
Abstract
English is accepted as the academic language of the world, which necessitates that speakers of other languages use English in their academic studies. Even when these researchers are competent in the use of the English language, some mistakes may occur while writing an academic article. To solve this problem, academics tend to use automatic translation programs or get assistance from people with an advanced level of English. This study offers an expert system to assist researchers throughout the academic article writing process. In this study, Turkish, which is considered a low-resource language, is used as the source language. The proposed model combines the transformer encoder-decoder architecture with the pre-trained Sci-BERT language model via the shallow fusion method. The model uses a Fully Attentional Network layer instead of a Feed-Forward Network layer in the known shallow fusion method; in this way, a higher success rate can be achieved by increasing attention at the word level. Different metrics were used to evaluate the resulting model, which reached 45.1 BLEU and 73.2 METEOR scores. In addition, the proposed model achieved scores of 20.12 and 20.56, respectively, with the zero-shot translation method on the Workshop on Machine Translation (2017–2018) test datasets. The proposed method could inspire the inclusion of language models in translation systems for other low-resource languages. This study also introduces a corpus composed entirely of academic sentences, consisting of 1.2 million parallel sentences, to be used in the translation system. The proposed model and corpus are made available to researchers on our GitHub page. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
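Shallow fusion itself, before the paper's Fully Attentional Network modification, is a simple idea: at each decoding step, the translation model's log-probabilities are interpolated with a language model's log-probabilities. A minimal sketch, assuming both models expose per-token log-probabilities and using an illustrative fusion weight lam:

```python
import torch

def shallow_fusion_step(tm_logprobs: torch.Tensor,
                        lm_logprobs: torch.Tensor,
                        lam: float = 0.3) -> torch.Tensor:
    """Combine translation-model and language-model scores for one decoding step.

    tm_logprobs, lm_logprobs: (batch, vocab) log-probabilities over the target
    vocabulary. lam is the fusion weight (a tunable assumption, not the paper's value).
    """
    return tm_logprobs + lam * lm_logprobs

# Greedy choice of the next token under the fused score.
tm = torch.log_softmax(torch.randn(1, 32000), dim=-1)  # stand-in decoder output
lm = torch.log_softmax(torch.randn(1, 32000), dim=-1)  # stand-in LM output
next_token = shallow_fusion_step(tm, lm).argmax(dim=-1)
```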
21 pages, 4699 KiB  
Article
Identification and Visualization of Key Topics in Scientific Publications with Transformer-Based Language Models and Document Clustering Methods
by Min-Hsien Weng, Shaoqun Wu and Mark Dyer
Appl. Sci. 2022, 12(21), 11220; https://doi.org/10.3390/app122111220 - 05 Nov 2022
Cited by 3 | Viewed by 4340
Abstract
With the rapidly growing number of scientific publications, researchers face an increasing challenge of discovering the current research topics and methodologies in a scientific domain. This paper describes an unsupervised topic detection approach that utilizes recent developments in transformer-based GPT-3 (Generative Pretrained Transformer 3) similarity embedding models and modern document clustering techniques. In total, 593 publication abstracts across the urban studies and machine learning domains were used as a case study to demonstrate the three phases of our approach. The iterative clustering phase uses the GPT-3 embeddings to represent the semantic meaning of abstracts and deploys the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) clustering algorithm along with silhouette scores to group similar abstracts. The keyword extraction phase identifies candidate words from each abstract and selects keywords using the Maximal Marginal Relevance ranking algorithm. The keyword grouping phase produces the keyword groups that represent topics in each abstract cluster, again using GPT-3 embeddings, the HDBSCAN algorithm, and silhouette scores. The results are visualized in a web-based interactive tool that allows users to explore abstract clusters and examine the topics in each cluster through keyword grouping. Our unsupervised topic detection approach does not require labeled datasets for training and has the potential to be used for bibliometric analysis of large collections of publications. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
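The iterative clustering phase can be approximated with off-the-shelf libraries: given precomputed abstract embeddings, HDBSCAN produces the clusters and the silhouette score guides the choice of min_cluster_size. A minimal sketch, assuming embeddings is a NumPy array of abstract vectors (the GPT-3 embedding call and the candidate sizes are stand-ins):

```python
import numpy as np
import hdbscan
from sklearn.metrics import silhouette_score

def cluster_abstracts(embeddings: np.ndarray, sizes=(5, 10, 15, 20)):
    """Try several min_cluster_size values and keep the clustering
    with the best silhouette score over non-noise points."""
    best = (None, -1.0)
    for size in sizes:
        labels = hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(embeddings)
        mask = labels != -1  # HDBSCAN marks noise points as -1
        if mask.sum() > 1 and len(set(labels[mask])) > 1:
            score = silhouette_score(embeddings[mask], labels[mask])
            if score > best[1]:
                best = (labels, score)
    return best

embeddings = np.random.rand(593, 1536)  # stand-in for GPT-3 embeddings
labels, score = cluster_abstracts(embeddings)
```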
21 pages, 554 KiB  
Article
Retweet Prediction Based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features
by Ana Meštrović, Milan Petrović and Slobodan Beliga
Appl. Sci. 2022, 12(21), 11216; https://doi.org/10.3390/app122111216 - 05 Nov 2022
Cited by 1 | Viewed by 1638
Abstract
Retweet prediction is an important task in the context of various problems, such as information spreading analysis, automatic fake news detection, social media monitoring, etc. In this study, we explore retweet prediction based on heterogeneous data sources. In order to classify a tweet according to its number of retweets, we combine features extracted from a multilayer network and from text. More specifically, we introduce a multilayer framework for the multilayer network representation of Twitter. This formalism captures different user actions and complex relationships, as well as other key properties of communication on Twitter. Next, we select a set of local network measures from each layer and construct a set of multilayer network features. We also adopt a BERT-based language model, namely Cro-CoV-cseBERT, to capture the high-level semantics and structure of tweets as a set of text features. We then trained six machine learning (ML) models for the retweet-prediction task: random forest, multilayer perceptron, light gradient boosting machine, a category-embedding model, neural oblivious decision ensembles, and an attentive interpretable tabular learning model. We compared the performance of all six models in three setups: with text features only, with multilayer network features only, and with both feature sets, and we evaluated each setup in terms of standard evaluation measures. For this task, we prepared an empirical dataset of 199,431 tweets in Croatian posted between 1 January 2020 and 31 May 2021. Our results indicate that the prediction model performs better when integrating multilayer network features with text features than when using only one set of features. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
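The "both feature sets" setup reduces to concatenating the multilayer network measures with the tweet embeddings before classification. A minimal sketch with scikit-learn, in which net_feats, text_feats, the label scheme, and the random forest choice are all stand-ins for the paper's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in features: 1000 tweets, 24 network measures, 768-dim text embeddings.
net_feats = np.random.rand(1000, 24)
text_feats = np.random.rand(1000, 768)
y = np.random.randint(0, 3, size=1000)  # stand-in retweet-count class labels

X = np.hstack([net_feats, text_feats])  # the combined-feature setup
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```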
15 pages, 4291 KiB  
Article
Mixup Based Cross-Consistency Training for Named Entity Recognition
by Geonsik Youn, Bohan Yoon, Seungbin Ji, Dahee Ko and Jongtae Rhee
Appl. Sci. 2022, 12(21), 11084; https://doi.org/10.3390/app122111084 - 01 Nov 2022
Cited by 1 | Viewed by 1303
Abstract
Named Entity Recognition (NER) is at the core of natural language understanding. The quality and amount of datasets determine the performance of deep-learning-based NER models. As datasets for NER require token-level or word-level labels, annotating them is expensive and time-consuming. To alleviate the effort of manual annotation, many prior studies have utilized weak supervision for NER tasks. However, using weak supervision directly is an obstacle to training deep networks because automatically annotated labels contain a lot of noise. In this study, we propose a framework to better train deep models for NER tasks using weakly labeled data. The proposed framework stems from the idea that mixup, which was recently proposed as a data augmentation strategy, would be an obstacle to deep model training for NER tasks. Inspired by this idea, we use mixup as a perturbation function for consistency regularization, one of the semi-supervised learning strategies. To support our idea, we conducted several experiments on NER benchmarks. The experimental results show that directly using mixup on NER tasks hinders deep model training, while the proposed framework achieves improved performance compared with employing only a small amount of human-annotated data. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
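The core trick, using mixup not as augmentation but as a perturbation for consistency regularization, can be sketched in PyTorch as below. Mixing on embeddings, the KL consistency term, and the Beta(0.5, 0.5) sampling are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def mixup_consistency_loss(model, emb_a, emb_b):
    """Consistency regularization: the prediction on mixed inputs should match
    the same mixture of the predictions on the original inputs.
    `model` is assumed to map token embeddings to per-token tag logits."""
    lam = torch.distributions.Beta(0.5, 0.5).sample().item()
    mixed = lam * emb_a + (1 - lam) * emb_b  # mixup on token embeddings

    with torch.no_grad():  # targets are not backpropagated
        p_a = F.softmax(model(emb_a), dim=-1)
        p_b = F.softmax(model(emb_b), dim=-1)
    target = lam * p_a + (1 - lam) * p_b

    log_q = F.log_softmax(model(mixed), dim=-1)
    return F.kl_div(log_q, target, reduction="batchmean")
```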
16 pages, 679 KiB  
Article
CRSAtt: By Capturing Relational Span and Using Attention for Relation Classification
by Cong Shao, Min Li, Gang Li, Mingle Zhou and Delong Han
Appl. Sci. 2022, 12(21), 11068; https://doi.org/10.3390/app122111068 - 01 Nov 2022
Cited by 3 | Viewed by 1331
Abstract
Relation classification is an important fundamental task in information extraction, and convolutional neural networks have commonly been applied to it with good results. In recent years, since the introduction of the pre-trained model BERT, whose use as a feature extraction architecture has become increasingly popular, convolutional neural networks have gradually withdrawn from the NLP stage, and relation classification/extraction models based on pre-trained BERT have achieved state-of-the-art results. However, none of these methods consider how to accurately capture the semantic features of the relationships between entities in order to reduce the number of noisy words in a sentence that are unhelpful for relation classification. Moreover, these methods lack a systematic prediction structure that fully utilizes the extracted features for the relation classification task. To address these problems, a SpanBERT-based relation classification model is proposed in this paper. Compared with existing BERT-based architectures, the model understands the semantic information of the relationships between entities more accurately, and it can fully utilize the extracted features to represent the degree of dependency of a pair of entities on each type of relationship. We design a feature fusion method called "SRS" (Strengthen Relational Semantics) and an attention-based prediction structure. Compared with existing methods, the proposed feature fusion method reduces the noise interference of irrelevant words when extracting relational semantics, and the proposed prediction structure makes full use of semantic features for relation classification. We achieved advanced results on the SemEval-2010 Task 8 and KBP37 relation datasets. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
21 pages, 7903 KiB  
Article
Improved Graph-Based Arabic Hotel Review Summarization Using Polarity Classification
by Ghada Amoudi, Amal Almansour and Hanan Saleh Alghamdi
Appl. Sci. 2022, 12(21), 10980; https://doi.org/10.3390/app122110980 - 29 Oct 2022
Viewed by 1394
Abstract
The increasing number of online product and service reviews has created a substantial information resource for individuals and businesses. Automatic review summarization helps overcome information overload. Research in automatic text summarization shows remarkable advancement; however, Arabic text summarization has not been studied as extensively. This study proposes an extractive Arabic review summarization approach that incorporates the polarity and sentiment aspects of reviews and employs a graph-based ranking algorithm, TextRank. We demonstrate the advantages of the proposed methods through a set of experiments using hotel reviews from Booking.com. Reviews were grouped based on their polarity, and TextRank was then applied to produce the summary. Results were evaluated using two primary measures, BLEU and ROUGE, and summaries by two native Arabic speakers were used for evaluation purposes. The results showed that this approach improved the summarization scores in most experiments, reaching an F1 score of 0.6294. The contributions of this work include applying a graph-based approach to a new domain, Arabic hotel reviews; adding a sentiment dimension to summarization; analyzing the algorithms of the two primary summarization metrics to show how these measures work and how they can be used to give accurate results; and providing four human summaries for two hotels that can be utilized in future research. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
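TextRank on one polarity group reduces to PageRank over a sentence-similarity graph. A minimal sketch with scikit-learn and networkx, using TF-IDF cosine similarity as the edge weight; the paper's exact similarity function and Arabic preprocessing are not reproduced here:

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences: list, top_k: int = 3) -> list:
    """Rank sentences by PageRank over their cosine-similarity graph."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)  # nodes are sentence indices
    scores = nx.pagerank(graph, weight="weight")
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]  # keep original order

# Toy positive-polarity group of Arabic review sentences.
reviews = ["الفندق نظيف", "الخدمة ممتازة", "الموقع بعيد عن المركز"]
print(textrank_summary(reviews, top_k=2))
```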
14 pages, 608 KiB  
Article
Robustness Analysis on Graph Neural Networks Model for Event Detection
by Hui Wei, Hanqing Zhu, Jibing Wu, Kaiming Xiao and Hongbin Huang
Appl. Sci. 2022, 12(21), 10825; https://doi.org/10.3390/app122110825 - 25 Oct 2022
Cited by 1 | Viewed by 1206
Abstract
Event Detection (ED), which aims to identify trigger words in a given text and classify them into the corresponding event types, is an important task in Natural Language Processing (NLP); it contributes to several downstream tasks and benefits many real-world applications. Most current SOTA (state-of-the-art) models for ED are based on Graph Neural Networks (GNN). However, few studies have examined the robustness of GNN-based ED models to textual adversarial attacks, a challenge in practical applications of ED that urgently needs to be solved. In this paper, we first propose a robustness analysis framework for ED models. Using this framework, we can evaluate the robustness of an ED model on various kinds of adversarial data. To improve the robustness of the GNN-based ED model, we propose a new multi-order distance representation method and an edge representation update method based on attention weights, and then design an innovative model named A-MDL-EEGCN. Extensive experiments illustrate that the proposed model achieves better performance than other models both on the original data and on various adversarial data. The comprehensive robustness analysis of the experimental results brings new insights into the evaluation and design of robust ED models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
11 pages, 3640 KiB  
Article
Prediction of Venous Thrombosis Chinese Electronic Medical Records Based on Deep Learning and Rule Reasoning
by Jiawei Chen, Jianhua Yang and Jianfeng He
Appl. Sci. 2022, 12(21), 10824; https://doi.org/10.3390/app122110824 - 25 Oct 2022
Cited by 2 | Viewed by 1242
Abstract
To address the heavy workload of medical staff in the process of venous thrombosis prevention and treatment, as well as erroneous, missed, and inconsistent assessments, we propose a joint extraction model for Chinese electronic medical records based on deep learning. The approach first constructs a handshake annotation scheme, uses bidirectional encoder representations from transformers (BERT) for word vector embedding, extracts contextual features with a bidirectional long short-term memory network (BiLSTM), and integrates the contextual information into the process of normalizing the word vectors. Experiments show that our proposed method achieves entity and relation F1 scores of 93.3% and 94.3% on the constructed electronic medical record dataset, which effectively improves medical information extraction. At the same time, the venous thromboembolism (VTE) risk factors extracted from the electronic medical records were used to assess the risk of venous thrombosis by means of rule reasoning. Compared with clinicians' assessments on the Wells and Geneva scales, accuracy rates of 84.7% and 86.1% were obtained. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
19 pages, 5387 KiB  
Article
Cosine-Based Embedding for Completing Lightweight Schematic Knowledge in DL-Lite_core
by Weizhuo Li, Xianda Zheng, Huan Gao, Qiu Ji and Guilin Qi
Appl. Sci. 2022, 12(20), 10690; https://doi.org/10.3390/app122010690 - 21 Oct 2022
Viewed by 1509
Abstract
Schematic knowledge, an important component of knowledge graphs (KGs), defines a rich set of logical axioms based on concepts and relations to support knowledge integration, reasoning, and heterogeneity elimination over KGs. Although several KGs contain large amounts of factual knowledge, their schematic knowledge (e.g., subclassOf axioms, disjointWith axioms) is far from complete. Existing KG embedding methods for completing schematic knowledge suffer from two limitations. Firstly, embedding methods designed to encode factual knowledge pay little attention to the completion of schematic knowledge (e.g., axioms). Secondly, several methods try to preserve the logical properties of relations for completing schematic knowledge, but they cannot simultaneously preserve the transitivity (e.g., subclassOf) and symmetry (e.g., disjointWith) of axioms well. To solve these issues, we propose a cosine-based embedding method named CosE tailored for completing lightweight schematic knowledge in DL-Lite_core. Specifically, the concepts in axioms are encoded into two semantic spaces defined in CosE. One is an angle-based semantic space, employed to preserve the transitivity or symmetry of relations in axioms. The other is a translation-based semantic space used to measure the confidence of each axiom. We design two types of score functions for these two semantic spaces so as to sufficiently learn the vector representations of concepts. Moreover, we propose a novel negative sampling strategy based on the mutual exclusion between subclassOf and disjointWith. In this way, concepts can obtain better vector representations for schematic knowledge completion. We implement our method and verify it on four standard datasets generated from real ontologies. Experiments show that CosE obtains better results than existing models and preserves the logical properties of transitivity and symmetry simultaneously. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
9 pages, 676 KiB  
Article
The Multi-Hot Representation-Based Language Model to Maintain Morpheme Units
by Ju-Sang Lee, Joon-Choul Shin and Cheol-Young Ock
Appl. Sci. 2022, 12(20), 10612; https://doi.org/10.3390/app122010612 - 20 Oct 2022
Viewed by 1168
Abstract
Natural language models have brought rapid developments in Natural Language Processing (NLP) performance following the emergence of large-scale deep learning models. Language models have previously used token units to represent natural language while reducing the proportion of unknown tokens. However, tokenization in language models raises language-specific issues. One key issue is that separating words by morphemes may distort the original meaning; it can also prove challenging to apply information surrounding a word, such as its semantic network. We propose a multi-hot representation language model that maintains Korean morpheme units. This method represents a single morpheme as a group of syllable-based tokens in cases where no matching token exists. The model demonstrates performance similar to that of existing models in various natural language processing applications. The proposed model retains the minimum unit of meaning by maintaining morpheme units and can easily accommodate the extension of semantic information. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
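The fallback from a morpheme token to a group of syllable-based tokens can be sketched as a simple vocabulary lookup. The vocabulary contents and the grouping format below are illustrative assumptions:

```python
def encode_morpheme(morpheme: str, vocab: dict) -> list:
    """Return the morpheme's own id if it is in the vocabulary; otherwise
    represent it as the group of ids of its syllables (multi-hot group)."""
    if morpheme in vocab:
        return [vocab[morpheme]]
    # Korean syllables are single characters, so fall back character by character.
    return [vocab.get(syllable, vocab["[UNK]"]) for syllable in morpheme]

vocab = {"[UNK]": 0, "학교": 1, "학": 2, "생": 3}  # toy vocabulary
print(encode_morpheme("학교", vocab))   # [1]    -> morpheme token exists
print(encode_morpheme("학생", vocab))   # [2, 3] -> syllable-token group
```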
14 pages, 1203 KiB  
Article
AT-CRF: A Chinese Reading Comprehension Algorithm Based on Attention Mechanism and Conditional Random Fields
by Nawei Shi, Huazhang Wang and Yongqiang Cheng
Appl. Sci. 2022, 12(20), 10459; https://doi.org/10.3390/app122010459 - 17 Oct 2022
Viewed by 1440
Abstract
Machine reading comprehension (MRC) is an important research topic in the field of Natural Language Processing (NLP). However, traditional MRC models often face challenges of information loss, an inability to retain long-distance dependencies, and an inability to deal with unanswerable questions whose answers are not available in the given texts. In this paper, a Chinese reading comprehension algorithm, called the Attention and Conditional Random Field (AT-CRF) Reader, is proposed to address these challenges. Firstly, RoBERTa, a pre-trained language model, is introduced to obtain the embedding representations of the input. Then, a depthwise separable convolutional neural network and attention mechanisms replace the recurrent neural network for encoding. Next, the attention flow and self-attention mechanisms are used to capture the internal context–query relations. Finally, a conditional random field is used to handle unanswerable questions and predict the correct answer. Experiments were conducted on two Chinese machine reading comprehension datasets, CMRC2018 and DuReader-checklist; compared with the baseline model, the F1 scores achieved by our AT-CRF Reader improved by 2.65% and 2.68%, and the EM values increased by 4.45% and 3.88%. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
19 pages, 4519 KiB  
Article
Enhancing Food Ingredient Named-Entity Recognition with Recurrent Network-Based Ensemble (RNE) Model
by Kokoy Siti Komariah and Bong-Kee Sin
Appl. Sci. 2022, 12(20), 10310; https://doi.org/10.3390/app122010310 - 13 Oct 2022
Viewed by 2215
Abstract
Food recipe sharing sites are becoming increasingly popular among people who want to learn how to cook or plan their menu. Through online food recipes, individuals can select ingredients that suit their lifestyle and health condition. Information from online food recipes is useful for developing food-related systems such as recommendation and health care systems. However, the information in online recipes is often unstructured. One way of extracting such information into a well-structured format is named-entity recognition (NER), the process of identifying keywords and phrases in text and classifying them into a set of predetermined categories, such as location, person, time, and others. We present a food ingredient named-entity recognition model called RNE (recurrent network-based ensemble methods) to extract entities from online recipes. RNE is an ensemble-learning framework using recurrent network models such as RNN, GRU, and LSTM. These models are trained independently on the same dataset and combined to produce better predictions in extracting food entities such as ingredient names, products, units, quantities, and states for each ingredient in a recipe. The experimental findings demonstrate that the proposed model achieves an F1 score of 96.09% and outperforms all individual models by 0.2 to 0.5 percentage points. This result indicates that RNE can extract information from food recipes better than a single model, and the extracted information can be used to support various food-related information systems. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
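The ensemble step of RNE can be sketched as combining the per-token class probabilities of the independently trained recurrent taggers. Averaging followed by argmax is an assumed combination rule; the abstract does not specify how the predictions are merged:

```python
import torch

def ensemble_tag(models: list, token_ids: torch.Tensor) -> torch.Tensor:
    """Average per-token probabilities from several taggers (e.g., RNN/GRU/LSTM
    models trained on the same data) and pick the best entity tag per token.
    Each model is assumed to map (batch, seq) ids to (batch, seq, num_tags) logits."""
    probs = [torch.softmax(m(token_ids), dim=-1) for m in models]
    mean_probs = torch.stack(probs).mean(dim=0)  # (batch, seq, num_tags)
    return mean_probs.argmax(dim=-1)             # predicted tag ids
```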
23 pages, 3212 KiB  
Article
Survey of Text Mining Techniques Applied to Judicial Decisions Prediction
by Olga Alejandra Alcántara Francia, Miguel Nunez-del-Prado and Hugo Alatrista-Salas
Appl. Sci. 2022, 12(20), 10200; https://doi.org/10.3390/app122010200 - 11 Oct 2022
Cited by 6 | Viewed by 2978
Abstract
This paper reviews the most recent literature on experiments with different Machine Learning, Deep Learning and Natural Language Processing techniques applied to predicting judicial and administrative decisions. Among the most notable findings, the most-used data mining techniques are Support Vector Machine (SVM), K-Nearest Neighbours (K-NN) and Random Forest (RF), and the most-used deep learning techniques are Long Short-Term Memory (LSTM) and transformers such as BERT. An important finding in the papers reviewed was that the use of machine learning techniques has prevailed over deep learning. Regarding the place of origin of the research, 64% of the works were carried out in English-speaking countries, 8% in Portuguese and 28% in other languages (such as German, Chinese, Turkish, Spanish, etc.); very few works of this type have been carried out in Spanish-speaking countries. The classification criteria of the works are based, on the one hand, on the identification of the classifiers used to predict situations (or events with legal interference) or judicial decisions and, on the other hand, on the application of classifiers to the phenomena regulated by the different branches of law: criminal, constitutional, human rights, administrative, intellectual property, family law, tax law and others. The corpus sizes analyzed in the reviewed works reached 100,000 documents in 2020. Finally, another important finding lies in the accuracy of these predictive techniques, which reaches over 60% in different branches of law. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
14 pages, 1321 KiB  
Article
Leveraging Multi-Modal Information for Cross-Lingual Entity Matching across Knowledge Graphs
by Tianxing Wu, Chaoyu Gao, Lin Li and Yuxiang Wang
Appl. Sci. 2022, 12(19), 10107; https://doi.org/10.3390/app121910107 - 08 Oct 2022
Cited by 6 | Viewed by 1615
Abstract
In recent years, the scale of knowledge graphs and the number of entities have grown rapidly. Entity matching across different knowledge graphs has become an urgent problem to be solved for knowledge fusion. As the importance of entity matching becomes increasingly evident, the use of representation learning technologies to find matched entities has attracted extensive attention due to the computability of vector representations. However, existing studies on representation learning cannot make full use of the multi-modal information relevant to knowledge graphs. In this paper, we propose a new cross-lingual entity matching method (called CLEM) with knowledge graph representation learning on rich multi-modal information. Its core is a multi-view intact space learning method that integrates the embeddings of multi-modal information for matching entities. Experimental results on cross-lingual datasets show the superiority and competitiveness of our proposed method. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
16 pages, 973 KiB  
Article
Automatic Classification of Eyewitness Messages for Disaster Events Using Linguistic Rules and ML/AI Approaches
by Sajjad Haider, Azhar Mahmood, Shaheen Khatoon, Majed Alshamari and Muhammad Tanvir Afzal
Appl. Sci. 2022, 12(19), 9953; https://doi.org/10.3390/app12199953 - 03 Oct 2022
Cited by 1 | Viewed by 1396
Abstract
Emergency response systems require precise and accurate information about an incident to respond accordingly. An eyewitness report is one source of such information. The research community has proposed diverse techniques to identify eyewitness messages on social media platforms. In our previous work, we created grammar rules, by exploiting language structure, linguistics, and word relations, to automatically extract feature words and classify eyewitness messages for different disaster types. That work adopted a manual classification technique and secured a maximum F-score of 0.81, far less than the static dictionary-based approach with an F-score of 0.92. In this work, we enhance our approach by adding more features and fine-tuning the linguistic rules that identify feature words related to Twitter eyewitness messages for disaster events, named the LR-TED approach. We used linguistic characteristics and labeled datasets to train several machine learning and deep learning classifiers for classifying eyewitness messages and secured a maximum F-score of 0.93. The proposed LR-TED can process millions of tweets in real time and is scalable to diverse events and unseen content. In contrast, static dictionary-based approaches require domain experts to create dictionaries of related words for all identified features and disaster types. Additionally, LR-TED can be evaluated on different social media platforms to identify eyewitness reports for various disaster types in the future. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
17 pages, 786 KiB  
Article
Long Text Truncation Algorithm Based on Label Embedding in Text Classification
by Jingang Chen and Shu Lv
Appl. Sci. 2022, 12(19), 9874; https://doi.org/10.3390/app12199874 - 30 Sep 2022
Cited by 1 | Viewed by 1997
Abstract
Long text classification has become a hot research topic in the field of text classification because of the length and redundant information of long documents. At present, common processing methods for long text data, such as the truncation method and the pooling method, are prone either to keeping too many sentences or to losing contextual semantic information. To deal with these issues, we present LTTA-LE (Long Text Truncation Algorithm Based on Label Embedding in Text Classification), which consists of three key steps. Firstly, we build a pretraining prefix template and a label word mapping prefix template to obtain the label word embedding, realizing joint training of the long text and the label words. Secondly, we calculate the cosine similarity between the label word embedding and the long text embedding and filter out redundant information from the long text to reduce its length. Finally, a three-stage model training architecture is introduced to effectively improve the classification performance and generalization ability of the model. We conduct comparative experiments on three public long text datasets, and the results show that LTTA-LE achieves an average F1 improvement of 1.0518% over other algorithms, which proves that our method achieves satisfactory performance. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
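The second step, shortening a long text by keeping only the parts most similar to the label word embedding, can be sketched with plain NumPy. Sentence-level granularity and the fixed top-k policy are assumptions for illustration:

```python
import numpy as np

def truncate_by_label(sent_embs: np.ndarray, label_emb: np.ndarray, keep: int):
    """Keep the `keep` sentences whose embeddings are most cosine-similar
    to the label word embedding, preserving document order."""
    sims = sent_embs @ label_emb / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(label_emb) + 1e-9)
    top = np.sort(np.argsort(-sims)[:keep])  # best indices, original order
    return top, sims[top]

sent_embs = np.random.rand(40, 768)   # stand-in sentence embeddings
label_emb = np.random.rand(768)       # stand-in label word embedding
indices, scores = truncate_by_label(sent_embs, label_emb, keep=10)
```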
18 pages, 457 KiB  
Article
Research on Chinese Medical Entity Relation Extraction Based on Syntactic Dependency Structure Information
by Qinghui Zhang, Meng Wu, Pengtao Lv, Mengya Zhang and Lei Lv
Appl. Sci. 2022, 12(19), 9781; https://doi.org/10.3390/app12199781 - 28 Sep 2022
Cited by 2 | Viewed by 1477
Abstract
Extracting entity relations from unstructured medical texts is a fundamental task in the field of medical information extraction. In relation extraction, dependency trees contain rich structural information that helps capture long-range relations between entities. However, many models cannot effectively use dependency information or learn sentence information adequately. In this paper, we propose a relation extraction model based on syntactic dependency structure information. First, the model learns sentence sequence information via Bi-LSTM. Then, it learns syntactic dependency structure information through graph convolutional networks. Meanwhile, to remove irrelevant information from the dependencies, the model adopts a new pruning strategy. Finally, the model adds a multi-head attention mechanism to focus on the entity information in the sentence from multiple aspects. We evaluate the proposed model on a Chinese medical entity relation extraction dataset. Experimental results show that our model learns dependency relation information better and achieves higher performance than other baseline models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
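A single graph-convolution step over the dependency structure, the core of the model's second stage, can be sketched as follows; the adjacency matrix comes from the (pruned) dependency tree, and the normalization and single-layer form are simplifying assumptions:

```python
import torch

def gcn_layer(h: torch.Tensor, adj: torch.Tensor, weight: torch.Tensor):
    """One GCN step: each token aggregates the features of its dependency
    neighbors. h: (seq, dim), adj: (seq, seq) 0/1 dependency adjacency."""
    adj = adj + torch.eye(adj.size(0))   # add self-loops
    deg = adj.sum(dim=1, keepdim=True)   # degree normalization
    return torch.relu((adj / deg) @ h @ weight)

h = torch.randn(12, 128)                              # Bi-LSTM outputs for 12 tokens
adj = torch.zeros(12, 12); adj[0, 3] = adj[3, 0] = 1  # toy pruned dependency edges
out = gcn_layer(h, adj, torch.randn(128, 128))
```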
20 pages, 471 KiB  
Article
SupMPN: Supervised Multiple Positives and Negatives Contrastive Learning Model for Semantic Textual Similarity
by Somaiyeh Dehghan and Mehmet Fatih Amasyali
Appl. Sci. 2022, 12(19), 9659; https://doi.org/10.3390/app12199659 - 26 Sep 2022
Cited by 4 | Viewed by 3049
Abstract
Semantic Textual Similarity (STS) is an important task in the area of Natural Language Processing (NLP) that measures the similarity of the underlying semantics of two texts. Although pre-trained contextual embedding models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance on several NLP tasks, BERT-derived sentence embeddings have been shown to collapse in a sense: sentence embeddings generated by BERT depend on the frequency of words, so almost all of them are mapped into a small region and have high cosine similarity. Hence, sentence embeddings generated by BERT are not robust for the STS task, as they cannot capture the full semantic meaning of the sentences. In this paper, we propose SupMPN, a Supervised Multiple Positives and Negatives Contrastive Learning Model, which accepts multiple hard-positive and multiple hard-negative sentences simultaneously and then tries to bring the hard-positive sentences closer while pushing the hard-negative sentences away. In other words, SupMPN brings similar sentences closer together in the representation space through discrimination among multiple similar and dissimilar sentences. In this way, SupMPN can learn the semantic meanings of sentences by contrasting multiple similar and dissimilar sentences, and it can generate sentence embeddings based on semantic meaning instead of word frequency. We evaluate our model on standard STS and transfer-learning tasks. The results reveal that SupMPN outperforms state-of-the-art SimCSE and all other previous supervised and unsupervised models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
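The abstract does not give the exact form of the SupMPN objective, but a contrastive loss with multiple hard positives and negatives in the spirit it describes can be sketched with a generic supervised InfoNCE formulation; the temperature tau is an assumption:

```python
import torch
import torch.nn.functional as F

def multi_pos_neg_loss(anchor, positives, negatives, tau: float = 0.05):
    """anchor: (dim,), positives: (P, dim), negatives: (N, dim).
    Pull all positives toward the anchor, push negatives away."""
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives) / tau  # (P,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives) / tau  # (N,)
    denom = torch.logsumexp(torch.cat([pos_sim, neg_sim]), dim=0)
    # Average the InfoNCE-style term over the P positives.
    return -(pos_sim - denom).mean()

loss = multi_pos_neg_loss(torch.randn(768),
                          torch.randn(4, 768), torch.randn(8, 768))
```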
19 pages, 766 KiB  
Article
Multigranularity Syntax Guidance with Graph Structure for Machine Reading Comprehension
by Chuanyun Xu, Zixu Liu, Gang Li, Changpeng Zhu and Yang Zhang
Appl. Sci. 2022, 12(19), 9525; https://doi.org/10.3390/app12199525 - 22 Sep 2022
Viewed by 1819
Abstract
In recent years, pre-trained language models, represented by the bidirectional encoder representations from transformers (BERT) model, have achieved remarkable success in machine reading comprehension (MRC). However, limited by the structure of BERT-based MRC models (for example, restrictions on word count), such models cannot effectively integrate significant features such as syntactic relations, semantic connections, and long-distance semantics between sentences, so they cannot fully understand the intrinsic connections between a text and the questions to be answered about it. In this paper, a multi-granularity syntax guidance (MgSG) module consisting of a "graph with dependence" module and a "graph with entity" module is proposed. MgSG uses both sentence and word granularities to guide the text model in deciphering the text. In particular, syntactic constraints guide the text model while the global nature of graph neural networks is exploited to enhance the model's ability to construct long-range semantics. At the same time, named entities play an important role in texts and answers, and focusing on entities can improve the model's understanding of a text's main idea. Ultimately, fusing multiple embedding representations yields the semantics of the context and the questions. Experiments demonstrate that the performance of the proposed method on the Stanford Question Answering Dataset is better than that of the traditional BERT baseline model. The results illustrate that the proposed MgSG module effectively utilizes the graph structure to learn the internal features of sentences and solve the problem of long-distance semantics, effectively improving the performance of pre-trained language models in machine reading comprehension. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
16 pages, 3476 KiB  
Article
Knowledge Graph Alignment Network with Node-Level Strong Fusion
by Shuang Liu, Man Xu, Yufeng Qin and Niko Lukač
Appl. Sci. 2022, 12(19), 9434; https://doi.org/10.3390/app12199434 - 20 Sep 2022
Cited by 4 | Viewed by 1534
Abstract
Entity alignment refers to the process of discovering entities that represent the same object in different knowledge graphs (KGs). Recently, some studies have incorporated additional information about entities, but these are simple aspect-level associations; only rough entity representations can be obtained, and the advantage of multi-faceted information is lost. In this paper, a novel node-level information strong-fusion framework (SFEA) is proposed, based on four aspects: structure, attributes, relations, and names. The attribute and name information is learned first; structure information is then learned on top of these two aspects through a graph convolutional network (GCN), so the alignment signals from attributes and names are already carried at the start of structure learning. Through the continuous propagation of multi-hop neighborhoods, strong fusion of structure, attribute, and name information is achieved and more fine-grained entity representations are obtained. Additionally, continuous interaction between the sub-alignment tasks enhances entity alignment. An iterative framework is designed to improve performance while reducing the impact on pre-aligned seed pairs. Extensive experiments demonstrate that the model improves the accuracy of entity alignment and significantly outperforms 13 previous state-of-the-art methods. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
14 pages, 1040 KiB  
Article
An End-to-End Mutually Interactive Emotion–Cause Pair Extractor via Soft Sharing
by Beilun Wang, Tianyi Ma, Zhengxuan Lu and Haoqing Xu
Appl. Sci. 2022, 12(18), 8998; https://doi.org/10.3390/app12188998 - 07 Sep 2022
Cited by 1 | Viewed by 1292
Abstract
Emotion–cause pair extraction (ECPE), i.e., extracting pairs of emotions and corresponding causes from text, has recently attracted considerable research interest. However, current ECPE models face two problems: (1) the common two-stage pipeline causes errors to accumulate, and (2) ignoring the mutual connection between the extraction and pairing of emotions and causes limits performance. In this paper, we propose a novel end-to-end mutually interactive emotion–cause pair extractor (Emiece) that effectively extracts emotion–cause pairs from all potential clause pairs. Specifically, we design two soft-shared clause-level encoders in an end-to-end deep model to measure the weighted probability of a clause pair being a potential emotion–cause pair. Experiments on standard ECPE datasets show that Emiece achieves drastic improvements over the original two-step ECPE model and other end-to-end models in the extraction of major emotion–cause pairs. The effectiveness of soft sharing and the applicability of the Emiece framework are further demonstrated by ablation experiments. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
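Soft sharing between the two clause-level encoders can be sketched as an L2 penalty that keeps corresponding parameters close without forcing them to be identical. The pairing of parameters and the penalty weight are illustrative assumptions:

```python
import torch

def soft_sharing_penalty(encoder_a: torch.nn.Module,
                         encoder_b: torch.nn.Module) -> torch.Tensor:
    """Sum of squared distances between corresponding parameters of two
    same-architecture encoders; added to the task loss with a small weight."""
    return sum(((p_a - p_b) ** 2).sum()
               for p_a, p_b in zip(encoder_a.parameters(), encoder_b.parameters()))

# Illustrative use: total_loss = task_loss + 1e-4 * soft_sharing_penalty(enc_e, enc_c)
```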
19 pages, 3118 KiB  
Article
Zero-Shot Emotion Detection for Semi-Supervised Sentiment Analysis Using Sentence Transformers and Ensemble Learning
by Senait Gebremichael Tesfagergish, Jurgita Kapočiūtė-Dzikienė and Robertas Damaševičius
Appl. Sci. 2022, 12(17), 8662; https://doi.org/10.3390/app12178662 - 29 Aug 2022
Cited by 26 | Viewed by 4339
Abstract
We live in a digitized era where our daily life depends on using online resources. Businesses consider the opinions of their customers, while people rely on the reviews/comments of other users before buying specific products or services. These reviews/comments are usually provided in the non-normative natural language within different contexts and domains (in social media, forums, news, blogs, etc.). Sentiment classification plays an important role in analyzing such texts collected from users by assigning positive, negative, and sometimes neutral sentiment values to each of them. Moreover, these texts typically contain many expressed or hidden emotions (such as happiness, sadness, etc.) that could contribute significantly to identifying sentiments. We address the emotion detection problem as part of the sentiment analysis task and propose a two-stage emotion detection methodology. The first stage is the unsupervised zero-shot learning model based on a sentence transformer returning the probabilities for subsets of 34 emotions (anger, sadness, disgust, fear, joy, happiness, admiration, affection, anguish, caution, confusion, desire, disappointment, attraction, envy, excitement, grief, hope, horror, joy, love, loneliness, pleasure, fear, generosity, rage, relief, satisfaction, sorrow, wonder, sympathy, shame, terror, and panic). The output of the zero-shot model is used as an input for the second stage, which trains the machine learning classifier on the sentiment labels in a supervised manner using ensemble learning. The proposed hybrid semi-supervised method achieves the highest accuracy of 87.3% on the English SemEval 2017 dataset. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
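The first, zero-shot stage can be approximated by embedding the text and each emotion name with a sentence transformer and converting cosine similarities into a probability vector that feeds the supervised second stage. The model name and the softmax conversion are assumptions, not the paper's exact setup:

```python
from sentence_transformers import SentenceTransformer, util
import torch

EMOTIONS = ["anger", "sadness", "disgust", "fear", "joy", "happiness"]  # subset

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emotion_embs = model.encode(EMOTIONS, convert_to_tensor=True)

def emotion_probs(text: str) -> torch.Tensor:
    """Zero-shot emotion feature vector: softmax over label similarities."""
    text_emb = model.encode(text, convert_to_tensor=True)
    sims = util.cos_sim(text_emb, emotion_embs).squeeze(0)  # (len(EMOTIONS),)
    return torch.softmax(sims, dim=0)

features = emotion_probs("The service was wonderful, I loved it!")
# `features` would be stacked for all texts and fed to the stage-2 classifier.
```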
15 pages, 270 KiB  
Article
Identifying Irregular Financial Operations Using Accountant Comments and Natural Language Processing Techniques
by Vytautas Rudžionis, Audrius Lopata, Saulius Gudas, Rimantas Butleris, Ilona Veitaitė, Darius Dilijonas, Evaldas Grišius, Maarten Zwitserloot and Kristina Rudzioniene
Appl. Sci. 2022, 12(17), 8558; https://doi.org/10.3390/app12178558 - 26 Aug 2022
Cited by 2 | Viewed by 1856
Abstract
Finding atypical financial operations is a complicated task. The difficulties arise not only from the sophisticated actions of fraudsters but also from the large number of financial operations performed by business companies, especially large ones. It is highly desirable to have a tool that significantly reduces the number of potentially irregular operations. This paper presents an implementation of NLP-based algorithms to identify irregular financial operations using comments left by accountants. The comments are freely written and usually very short remarks that accountants use for their own reference. Content analysis using cosine similarity showed that identifying the type of operation from accountants' comments is feasible. Further analysis of comment content and financial data showed that the number of potentially suspicious operations can be reduced significantly: analysis of more than half a million financial records of Dutch companies enabled the identification of 0.3% of operations as potentially suspicious. This could make human financial auditing an easier and more robust task. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
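The content analysis the abstract mentions, checking via cosine similarity whether an accountant's comment matches the recorded operation type, can be sketched with scikit-learn. The per-type centroid comparison and the flagging threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

comments = ["monthly office rent", "rent payment march", "consulting invoice"]
op_types = ["rent", "rent", "services"]  # toy labeled history

vec = TfidfVectorizer()
X = vec.fit_transform(comments).toarray()

# Centroid of comment vectors per operation type.
centroids = {t: X[[i for i, ot in enumerate(op_types) if ot == t]].mean(axis=0)
             for t in set(op_types)}

def is_suspicious(comment: str, op_type: str, threshold: float = 0.2) -> bool:
    """Flag an operation whose comment is unlike others of its recorded type."""
    v = vec.transform([comment]).toarray()
    sim = cosine_similarity(v, centroids[op_type].reshape(1, -1))[0, 0]
    return sim < threshold

print(is_suspicious("team building dinner", "rent"))
```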
14 pages, 582 KiB  
Article
Research on Chinese Medical Entity Recognition Based on Multi-Neural Network Fusion and Improved Tri-Training Algorithm
by Renlong Qi, Pengtao Lv, Qinghui Zhang and Meng Wu
Appl. Sci. 2022, 12(17), 8539; https://doi.org/10.3390/app12178539 - 26 Aug 2022
Cited by 2 | Viewed by 1452
Abstract
Chinese medical texts contain a large number of medical named entities. Automatic recognition of these entities from medical texts is key to developing medical informatics. In the field of Chinese medical information extraction, annotated Chinese medical text data are scarce, and the resulting shortage of labeled data for named entity recognition leads to low model recognition performance. Therefore, this paper proposes a Chinese medical entity recognition model based on multi-neural-network fusion and an improved Tri-Training algorithm. The model performs semi-supervised learning with the improved Tri-Training algorithm. According to the characteristics of the medical entity recognition task and medical data, the method is improved in terms of the division of the initial sub-training set, the construction of the base classifiers, and the voting method used to integrate the learners. In addition, this paper proposes a multi-neural-network fusion entity recognition model for base classifier construction, which learns feature information jointly by combining an Iterated Dilated Convolutional Neural Network (IDCNN) and BiLSTM. Experimental verification shows that the proposed model outperforms other models and improves the performance of Chinese medical entity recognition by incorporating and improving the semi-supervised learning algorithm. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
16 pages, 913 KiB  
Article
KGNER: Improving Chinese Named Entity Recognition by BERT Infused with the Knowledge Graph
by Weiwei Hu, Liang He, Hanhan Ma, Kai Wang and Jingfeng Xiao
Appl. Sci. 2022, 12(15), 7702; https://doi.org/10.3390/app12157702 - 30 Jul 2022
Cited by 4 | Viewed by 2329
Abstract
Recently, lexicon-based methods have been proven effective for named entity recognition (NER). However, most existing lexicon-based methods cannot fully utilize the common-sense knowledge in a knowledge graph; for example, word embeddings pretrained by Word2vec or GloVe make little use of contextual semantic information. Hence, how to make the best use of knowledge for the NER task has become a challenging and hot research topic. We propose knowledge graph-inspired named-entity recognition (KGNER), featuring a masking and encoding method to incorporate common sense into bidirectional encoder representations from transformers (BERT). The proposed method not only preserves the original sentence semantics but also takes advantage of the knowledge information in a more reasonable way. Subsequently, we model temporal dependencies by taking a conditional random field (CRF) as the backend, which improves overall performance. Experiments on four dominant datasets demonstrate that KGNER outperforms other lexicon-based models. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
15 pages, 891 KiB  
Article
Boosting the Transformer with the BERT Supervision in Low-Resource Machine Translation
by Rong Yan, Jiang Li, Xiangdong Su, Xiaoming Wang and Guanglai Gao
Appl. Sci. 2022, 12(14), 7195; https://doi.org/10.3390/app12147195 - 17 Jul 2022
Cited by 7 | Viewed by 2258
Abstract
Previous works trained the Transformer and its variants end-to-end and achieved remarkable translation performance when huge parallel corpora are available. However, these models suffer from data scarcity in low-resource machine translation tasks. To deal with the mismatch between the large model capacity of the Transformer and the small parallel training set, this paper adds BERT supervision on the latent representation between the encoder and the decoder of the Transformer and designs a multi-step training algorithm on this basis. The algorithm includes three stages: (1) encoder training, (2) decoder training, and (3) joint optimization. We introduce a BERT model of the target language into encoder and decoder training to alleviate the Transformer's data starvation; after training, BERT is no longer involved in inference. Another merit of our training algorithm is that it can further enhance the Transformer in tasks where limited parallel sentence pairs but large amounts of monolingual target-language corpus are available. The evaluation results on six low-resource translation tasks suggest that the Transformer trained by our algorithm significantly outperforms the baselines trained end-to-end in previous works. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
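The added supervision can be sketched as an auxiliary loss that pulls the Transformer encoder's latent states toward the frozen target-language BERT's hidden states during the encoder-training stage. Mean squared error and a frozen BERT are illustrative assumptions, consistent with the abstract's note that BERT is not involved at inference:

```python
import torch
import torch.nn.functional as F

def bert_supervision_loss(encoder_states: torch.Tensor,
                          bert_states: torch.Tensor) -> torch.Tensor:
    """encoder_states: (batch, seq, dim) latent representation between the
    Transformer encoder and decoder; bert_states: frozen BERT hidden states
    for the target sentence, assumed projected to the same shape."""
    return F.mse_loss(encoder_states, bert_states.detach())

# Stage (1) of the training schedule: the encoder learns to mimic BERT's space.
enc = torch.randn(8, 20, 512, requires_grad=True)
bert = torch.randn(8, 20, 512)  # stand-in for frozen BERT output
loss = bert_supervision_loss(enc, bert)
```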
32 pages, 2220 KiB  
Article
Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC
by Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon and Heuiseok Lim
Appl. Sci. 2022, 12(11), 5545; https://doi.org/10.3390/app12115545 - 30 May 2022
Cited by 4 | Viewed by 2508
Abstract
Machine translation (MT) systems aim to translate a source language into a target language. Recent studies on MT mainly focus on neural machine translation (NMT). One factor that significantly affects NMT performance is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared with those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features from a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest directions for further research toward obtaining higher-quality parallel corpora, based on our analysis of the correlation between LIWC features and NMT performance. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
Other

22 pages, 4864 KiB  
Systematic Review
Natural Language Processing Adoption in Governments and Future Research Directions: A Systematic Review
by Yunqing Jiang, Patrick Cheong-Iao Pang, Dennis Wong and Ho Yin Kan
Appl. Sci. 2023, 13(22), 12346; https://doi.org/10.3390/app132212346 - 15 Nov 2023
Viewed by 1588
Abstract
Natural language processing (NLP), which is known as an emerging technology creating considerable value in multiple areas, has recently shown its great potential in government operations and public administration applications. However, while the number of publications on NLP is increasing steadily, there is no comprehensive review for a holistic understanding of how NLP is being adopted by governments. In this regard, we present a systematic literature review on NLP applications in governments by following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol. The review shows that the current literature comprises three levels of contribution: automation, extension, and transformation. The most-used NLP techniques reported in government-related research are sentiment analysis, machine learning, deep learning, classification, data extraction, data mining, topic modelling, opinion mining, chatbots, and question answering. Data classification, management, and decision-making are the most frequently reported reasons for using NLP. The salient research topics being discussed in the literature can be grouped into four categories: (1) governance and policy, (2) citizens and public opinion, (3) medical and healthcare, and (4) economy and environment. Future research directions should focus on (1) the potential of chatbots, (2) NLP applications in the post-pandemic era, and (3) empirical research for government work. Full article
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)