Natural Language Processing and Information Retrieval

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (6 November 2023) | Viewed by 30339

Special Issue Editors


Dr. Shangsong Liang
Guest Editor
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 511400, China
Interests: information retrieval; data mining; machine learning; natural language processing

Dr. Zaiqiao Meng
Guest Editor
School of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK
Interests: NLP; knowledge graphs; graph neural networks

Special Issue Information

Dear Colleagues,

In recent decades, techniques that integrate natural language processing (NLP) and information retrieval (IR) have achieved significant improvements in a wide spectrum of real-life applications, such as question answering, summarization, neural text retrieval and understanding, and representation learning for information extraction. A key to the success of these applications is integrating NLP and IR in seamless and appropriate ways.

This Special Issue is intended to provide an overview of state-of-the-art research in the fields of NLP and IR and, in particular, how they integrate with and improve each other in terms of either theory or applications. Specifically, this Special Issue aims to gather work from researchers with broad expertise in NLP and IR, presenting their cutting-edge theories, models, methods, algorithms, applications, or perspectives on future directions.

The topics of interest for this Special Issue include but are not limited to:

  • Question answering
  • Information retrieval and text mining
  • NLP and IR theories and applications
  • Summarization
  • Graph neural networks for NLP and IR
  • Machine/deep learning for NLP and IR
  • Machine translation and multilingualism
  • Syntax: tagging, chunking, and parsing
  • Semantics: lexical, sentence-level semantics, textual inference, and other areas
  • Generation
  • Dialogue and interactive systems
  • Search and ranking
  • NLP for search, recommendation, and representation

Dr. Shangsong Liang
Dr. Zaiqiao Meng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • information retrieval
  • machine learning
  • deep learning

Published Papers (20 papers)


Research

13 pages, 1498 KiB  
Article
RoBERTa-Based Keyword Extraction from Small Number of Korean Documents
by So-Eon Kim, Jun-Beom Lee, Gyu-Min Park, Seok-Man Sohn and Seong-Bae Park
Electronics 2023, 12(22), 4560; https://doi.org/10.3390/electronics12224560 - 07 Nov 2023
Viewed by 942
Abstract
Keyword extraction is the task of identifying essential words in a lengthy document. This process is primarily executed through supervised keyword extraction. In instances where the dataset is limited in size, a classification-based approach is typically employed. Therefore, this paper introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability of each token in the document becoming a keyword, and the decision rule ultimately determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges: a document may contain no keyword tokens at all, and a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. These heuristics have been extensively tested through experiments, demonstrating that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data. The efficacy of the heuristics is further validated through an ablation study. In summary, the proposed heuristics have proven to be effective in developing a supervised keyword extractor with a small dataset.
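
As a rough illustration of this three-part design, the sketch below wires a generic encoder to a token-level keyword estimator and a threshold decision rule. The class names, hidden size, and 0.5 threshold are illustrative assumptions, not the authors' released code; `encoder` stands in for a pre-trained RoBERTa returning per-token hidden states.

```python
import torch
import torch.nn as nn

class KeywordEstimator(nn.Module):
    """Scores each token's probability of being a keyword (hypothetical sketch)."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        # Any callable mapping (input_ids, attention_mask) -> (batch, seq, hidden);
        # with a HuggingFace RoBERTa one would take `.last_hidden_state` instead.
        self.encoder = encoder
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)        # (batch, seq, hidden)
        return torch.sigmoid(self.scorer(hidden)).squeeze(-1)   # (batch, seq)

def decision_rule(probs, threshold=0.5):
    """Final keyword decision: keep tokens whose probability exceeds the threshold."""
    return probs > threshold
```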

14 pages, 2966 KiB  
Article
A Multi-Modal Retrieval Model for Mathematical Expressions Based on ConvNeXt and Hesitant Fuzzy Set
by Ruxuan Li, Jingyi Wang and Xuedong Tian
Electronics 2023, 12(20), 4363; https://doi.org/10.3390/electronics12204363 - 20 Oct 2023
Viewed by 1001
Abstract
Mathematical expression retrieval is an essential component of mathematical information retrieval. Current mathematical expression retrieval research primarily targets single modalities, particularly text, which can lead to the loss of structural information. On the other hand, multimodal research has demonstrated promising outcomes across different domains, and mathematical expressions in image format are adept at preserving their structural characteristics. We therefore propose a multi-modal retrieval model for mathematical expressions based on ConvNeXt and HFS to address the limitations of single-modal retrieval. For the image modality, mathematical expression retrieval is based on the similarity of image features and symbol-level features of the expression, where image features of the expression image are extracted by ConvNeXt, while symbol-level features are obtained by the Symbol Level Features Extraction (SLFE) module. For the text modality, the Formula Description Structure (FDS) is employed to analyze expressions and extract their attributes. Additionally, the application of Hesitant Fuzzy Set (HFS) theory facilitates the computation of hesitant fuzzy similarity between mathematical queries and candidate expressions. Finally, Reciprocal Rank Fusion (RRF) is employed to integrate the rankings from image-modality and text-modality retrieval, yielding the final retrieval list. The experiments were conducted on the publicly accessible ArXiv dataset (containing 592,345 mathematical expressions) and the NTCIR-mair-wikipedia-corpus (NTCIR) dataset. The MAP@10 value for the multimodal RRF fusion approach is 0.774. These results substantiate the efficacy of the multi-modal mathematical expression retrieval approach based on ConvNeXt and HFS.
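
The RRF step at the end of this pipeline is a standard technique and is easy to reproduce. Below is a minimal, self-contained sketch; the smoothing constant k = 60 follows the common RRF convention and is not taken from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of ids into one list.

    Each item's fused score is the sum over rankers of 1 / (k + rank).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fusing an image-modality ranking with a text-modality ranking:
fused = reciprocal_rank_fusion([["e3", "e1", "e2"], ["e1", "e3", "e4"]])
```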

20 pages, 1026 KiB  
Article
Addressing Long-Distance Dependencies in AMR Parsing with Hierarchical Clause Annotation
by Yunlong Fan, Bin Li, Yikemaiti Sataer, Miao Gao, Chuanqi Shi and Zhiqiang Gao
Electronics 2023, 12(18), 3908; https://doi.org/10.3390/electronics12183908 - 16 Sep 2023
Cited by 1 | Viewed by 704
Abstract
Most natural language processing (NLP) tasks operationalize an input sentence as a sequence with token-level embeddings and features, despite its clausal structure. Taking abstract meaning representation (AMR) parsing as an example, recent parsers are empowered by transformers and pre-trained language models, but long-distance dependencies (LDDs) introduced by long sequences remain an open problem. We argue that LDDs are not caused by sequence length per se but are essentially related to the internal clause hierarchy. Typically, non-verb words in a clause cannot depend on words outside of it, and verbs from different but related clauses have much longer dependencies than those in the same clause. With this intuition, we introduce a type of clausal feature, hierarchical clause annotation (HCA), into AMR parsing and propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate HCA trees of complex sentences for addressing LDDs. We conduct extensive experiments on two in-distribution (ID) AMR datasets (AMR 2.0 and AMR 3.0) and three out-of-distribution (OOD) ones (TLP, New3, and Bio). Experimental results show that our HCA-based approaches achieve significant and explainable improvements (0.7 Smatch score on both ID datasets; 2.3, 0.7, and 2.6 on the three OOD datasets, respectively) over the baseline model and outperform the state-of-the-art (SOTA) model (0.7 Smatch score on the OOD dataset Bio) when encountering sentences with complex clausal structures, which introduce most LDD cases.
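
To make the clause-restricted attention idea concrete, one way to realize it is a boolean self-attention mask in which non-verb tokens see only their own clause while clause verbs retain global attention. This is a simplified sketch of the general mechanism under our own assumptions, not the authors' HCA-SA implementation.

```python
import torch

def clause_attention_mask(clause_ids, verb_positions=None):
    """Build a (seq, seq) boolean mask; True means attention is allowed.

    clause_ids[i] is the clause index of token i. Tokens attend within
    their own clause; tokens in verb_positions keep full attention so that
    inter-clause dependencies still flow through clause verbs.
    """
    ids = torch.tensor(clause_ids)
    mask = ids.unsqueeze(0) == ids.unsqueeze(1)   # same-clause attention
    if verb_positions:
        for v in verb_positions:
            mask[v, :] = True                     # verbs attend everywhere
            mask[:, v] = True                     # and are visible to all
    return mask
```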

15 pages, 2746 KiB  
Article
Neural Machine Translation of Electrical Engineering Based on Integrated Convolutional Neural Networks
by Zikang Liu, Yuan Chen and Juwei Zhang
Electronics 2023, 12(17), 3604; https://doi.org/10.3390/electronics12173604 - 25 Aug 2023
Viewed by 1067
Abstract
Research has shown that neural machine translation performs poorly on low-resource and domain-specific parallel corpora. In this paper, we focus on neural machine translation in the field of electrical engineering. To address the mistranslation caused by the Transformer model’s limited ability to extract feature information from certain sentences, we propose two new models that integrate a convolutional neural network as a feature extraction layer into the Transformer model. The feature information extracted by the CNN is fused separately in the source-side and target-side models, which enhances the Transformer model’s ability to extract feature information, optimizes model performance, and improves translation quality. On an electrical engineering dataset, the proposed source-side and target-side models improved BLEU scores by 1.63 and 1.12 percentage points, respectively, compared to the baseline model. In addition, the two models proposed in this paper can learn rich semantic knowledge without relying on auxiliary knowledge such as part-of-speech tagging and named entity recognition, which saves a certain amount of human resources and time costs.
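
The core architectural move, inserting a convolutional feature-extraction layer and fusing its output back into the Transformer's token representations, can be sketched as a generic residual fusion. The paper's exact source-side and target-side variants differ in where the fusion happens, so the following is an assumption-laden illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class ConvFeatureFusion(nn.Module):
    """Extract local n-gram features with a 1-D convolution and fuse them
    residually into the token embeddings fed to the Transformer."""
    def __init__(self, d_model=512, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, embeddings):                       # (batch, seq, d_model)
        local = self.conv(embeddings.transpose(1, 2)).transpose(1, 2)
        return embeddings + torch.relu(local)            # residual fusion
```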

15 pages, 4080 KiB  
Article
Named Entity Recognition for Few-Shot Power Dispatch Based on Multi-Task
by Zhixiang Tan, Yan Chen, Zengfu Liang, Qi Meng and Dezhao Lin
Electronics 2023, 12(16), 3476; https://doi.org/10.3390/electronics12163476 - 17 Aug 2023
Cited by 1 | Viewed by 803
Abstract
Because nested entities and professional terms are difficult to identify in the field of power dispatch, a multi-task-based few-shot named entity recognition model (FSPD-NER) for power dispatch is proposed. The model consists of four modules: feature enhancement, seed, expansion, and implication. Firstly, the masking strategy of the encoder is improved by adopting whole-word masking, using a RoBERTa (Robustly Optimized BERT Pretraining Approach) encoder as the embedding layer to obtain the text feature representation and an IDCNN (Iterated Dilated CNN) module to enhance the features. Then the text is segmented into one- and two-character Chinese substrings to form a seed set, the score of each seed is calculated, and seeds scoring above the threshold ω are passed to the expansion module as candidate seeds. Next, the candidate seeds are expanded left and right according to an offset γ to obtain the candidate entities. Finally, to construct text implication pairs, the input text is used as a premise sentence, each candidate entity is connected with predefined label templates to form hypothesis sentences, and the implication pairs are passed to the RoBERTa encoder for the classification task. The focal loss function is used to alleviate label imbalance during training. The experimental results on the power dispatch dataset show that the precision, recall, and F1 scores in the 20-shot setting are 63.39%, 61.97%, and 62.67%, respectively, a significant performance improvement over existing methods.
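
The seed-and-expansion stages can be sketched in a few lines of Python. `seed_score`, `omega`, and `gamma` mirror the quantities named in the abstract, while the scoring model itself is left abstract; this is an illustrative reading, not the released code.

```python
def candidate_entities(text, seed_score, omega=0.5, gamma=2):
    """Propose entity spans by scoring 1- and 2-character seeds and
    expanding high-scoring seeds by up to `gamma` characters per side."""
    seeds = [(i, i + n) for n in (1, 2) for i in range(len(text) - n + 1)]
    candidates = set()
    for start, end in seeds:
        if seed_score(text[start:end]) > omega:          # keep promising seeds
            for left in range(gamma + 1):
                for right in range(gamma + 1):
                    s = max(0, start - left)
                    e = min(len(text), end + right)
                    candidates.add(text[s:e])            # expanded span
    return candidates
```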

17 pages, 892 KiB  
Article
Part-of-Speech Tags Guide Low-Resource Machine Translation
by Zaokere Kadeer, Nian Yi and Aishan Wumaier
Electronics 2023, 12(16), 3401; https://doi.org/10.3390/electronics12163401 - 10 Aug 2023
Viewed by 968
Abstract
Neural machine translation models are guided by a loss function to select source sentence features and generate results close to human annotation. When data resources are abundant, neural machine translation models can focus on the features used to produce high-quality translations, such as POS and other grammatical features. However, models cannot focus precisely on these features when data resources are limited, because the lack of samples makes the model overfit before considering these features. Previous works have enriched the features by integrating source POS or by multitask methods; however, these methods only utilize the source POS or produce translations by introducing the generated target POS. We propose introducing POS information based on multitask methods and reconstructors. We obtain the POS tags with an additional encoder and decoder and compute the corresponding loss functions. These loss functions are used together with the machine translation loss to optimize the parameters of the entire model, which makes the model pay attention to POS features. The POS features the model focuses on then guide the translation process and alleviate the problem that models cannot attend to POS features in low-resource settings. Experiments on multiple translation tasks show that the method improves BLEU by 0.4∼1 points compared with the baseline model.
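
The joint objective described here, the translation loss plus POS-tagging losses from the additional encoder and decoder, reduces to a weighted sum. A minimal sketch follows; the weights `alpha` and `beta` are illustrative assumptions rather than values from the paper.

```python
def joint_loss(mt_loss, encoder_pos_loss, decoder_pos_loss,
               alpha=0.3, beta=0.3):
    """Weighted sum of the translation loss and the two POS-tagging losses,
    so gradients from the POS tasks also steer the shared parameters."""
    return mt_loss + alpha * encoder_pos_loss + beta * decoder_pos_loss
```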

14 pages, 747 KiB  
Article
A Curriculum Learning Approach for Multi-Domain Text Classification Using Keyword Weight Ranking
by Zilin Yuan, Yinghui Li, Yangning Li, Hai-Tao Zheng, Yaobin He, Wenqiang Liu, Dongxiao Huang and Bei Wu
Electronics 2023, 12(14), 3040; https://doi.org/10.3390/electronics12143040 - 11 Jul 2023
Cited by 1 | Viewed by 929
Abstract
Text classification is a well-established task in NLP, but it has two major limitations. Firstly, text classification is heavily reliant on domain-specific knowledge, meaning that a classifier trained on a given corpus may not perform well when presented with text from another domain. Secondly, text classification models require substantial amounts of annotated data for training, and in certain domains there may be an insufficient quantity of labeled data available. Consequently, it is essential to explore methods for efficiently utilizing text data from various domains to improve the performance of models across a range of domains. One approach is to use multi-domain text classification models that leverage adversarial training to extract features shared among all domains as well as the specific features of each domain. After observing the varying distinctness of domain-specific features, our paper introduces a curriculum learning approach using a ranking system based on keyword weight to enhance the effectiveness of multi-domain text classification models. Experimental results on the Amazon reviews and FDU-MTL datasets show that our method significantly improves the efficacy of multi-domain text classification models that adopt adversarial learning, reaching state-of-the-art results on these two datasets.
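
A curriculum of this kind boils down to ordering training samples by a difficulty proxy. The sketch below assumes that samples with more distinct domain keywords (higher keyword weight, e.g. mean TF-IDF of top keywords) are presented first; both the scoring function and the easy-to-hard direction are our assumptions, not the paper's exact ranking system.

```python
def curriculum_order(samples, keyword_weight):
    """Order training samples for a curriculum using a keyword-weight score.

    `keyword_weight` maps a sample to a scalar; higher scores are assumed
    to indicate more distinct (easier) domain-specific features here.
    """
    return sorted(samples, key=keyword_weight, reverse=True)
```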

18 pages, 3222 KiB  
Article
Interaction Information Guided Prototype Representation Rectification for Few-Shot Relation Extraction
by Xiaoqin Ma, Xizhong Qin, Junbao Liu and Wensheng Ran
Electronics 2023, 12(13), 2912; https://doi.org/10.3390/electronics12132912 - 03 Jul 2023
Viewed by 1059
Abstract
Few-shot relation extraction aims to identify and extract semantic relations between entity pairs using only a small number of annotated instances. Many recently proposed prototype-based methods have shown excellent performance. However, existing prototype-based methods ignore the hidden inter-instance interaction information between the support and query sets, leading to unreliable prototypes. In addition, the current optimization of the prototypical network relies only on cross-entropy loss, which is concerned solely with the accuracy of the predicted probability for the correct label and ignores the differences among the non-correct labels, and therefore cannot account for relation discretization in semantic space. This paper proposes an attentional network of interaction information to obtain a more reliable relation prototype. Firstly, an inter-instance interaction information attention module is designed to mitigate prototype unreliability through interaction information between the support set and query set instances, utilizing category information hidden in the query set. Secondly, a similarity scalar, defined by the mixed features of the prototype and the relation, is added to the focal loss to increase the attention paid to hard examples. We conducted extensive experiments on two standard datasets and demonstrated the validity of our proposed model.
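
For reference, focal loss down-weights well-classified examples so that training concentrates on hard ones. The sketch below is a standard focal loss with an optional per-example weight standing in for the similarity scalar; where exactly the paper mixes the scalar in is our assumption.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, scalar=None):
    """Focal loss over relation logits.

    logits: (batch, num_classes); target: (batch,) class indices;
    scalar: optional (batch,) similarity weight per example.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1 - pt) ** gamma) * log_pt                      # focus on hard examples
    if scalar is not None:
        loss = loss * scalar
    return loss.mean()
```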

17 pages, 685 KiB  
Article
RDVI: A Retrieval–Detection Framework for Verbal Irony Detection
by Zhiyuan Wen, Rui Wang, Shiwei Chen, Qianlong Wang, Keyang Ding, Bin Liang and Ruifeng Xu
Electronics 2023, 12(12), 2673; https://doi.org/10.3390/electronics12122673 - 14 Jun 2023
Viewed by 1187
Abstract
Verbal irony is a common form of expression used in daily communication, where the intended meaning is often opposite to the literal meaning. Accurately recognizing verbal irony is essential for any NLP application for which the understanding of the true user intentions is key to performing the underlying tasks. While existing research has made progress in this area, verbal irony often involves connotative knowledge that cannot be directly inferred from the text or its context, which limits the detection model’s ability to recognize and comprehend verbal irony. To address this issue, we propose a Retrieval–Detection method for Verbal Irony (RDVI). This approach improves the detection model’s ability to recognize and comprehend verbal irony by retrieving the connotative knowledge from the open domain and incorporating it into the model using prompt learning. The experimental results demonstrate that our proposed method outperforms state-of-the-art models.

15 pages, 2108 KiB  
Article
A Scenario-Generic Neural Machine Translation Data Augmentation Method
by Xiner Liu, Jianshu He, Mingzhe Liu, Zhengtong Yin, Lirong Yin and Wenfeng Zheng
Electronics 2023, 12(10), 2320; https://doi.org/10.3390/electronics12102320 - 21 May 2023
Cited by 45 | Viewed by 1899
Abstract
Amid the rapid advancement of neural machine translation, data sparsity has been a major obstacle. To address this issue, this study proposes a general data augmentation technique for various scenarios. It examines the diversity and quality of parallel corpora in both rich- and low-resource settings and integrates the low-frequency word substitution method and the reverse translation approach for complementary benefits. Additionally, the method improves the pseudo-parallel corpus generated by reverse translation by substituting low-frequency words, and it includes a grammar error correction module to reduce grammatical errors in low-resource scenarios. The experimental data are partitioned into rich- and low-resource scenarios at a 10:1 ratio, and the experiments verify the necessity of grammatical error correction for pseudo-corpora in low-resource scenarios. Models and methods are chosen from the backbone network and related literature for comparative experiments. The experimental findings demonstrate that the proposed data augmentation approach is suitable for both rich- and low-resource scenarios and is effective in enhancing the training corpus to improve the performance of translation tasks.
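
The two building blocks, reverse (back-) translation and low-frequency word substitution, compose naturally. The sketch below is a pipeline illustration under our own assumptions: `reverse_model` stands in for a trained target-to-source translator, and `synonyms` for any synonym table; neither name comes from the paper.

```python
def substitute_low_frequency(tokens, freq, synonyms, min_count=5):
    """Replace rare words with higher-frequency synonyms when available
    (one half of the augmentation; the threshold is illustrative)."""
    return [synonyms.get(tok, tok) if freq.get(tok, 0) < min_count else tok
            for tok in tokens]

def back_translate_corpus(target_sentences, reverse_model):
    """Generate pseudo-parallel (source, target) pairs by translating
    target-side sentences back into the source language."""
    return [(reverse_model(t), t) for t in target_sentences]
```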

12 pages, 2521 KiB  
Article
A Multi-Granularity Heterogeneous Graph for Extractive Text Summarization
by Henghui Zhao, Wensheng Zhang, Mengxing Huang, Siling Feng and Yuanyuan Wu
Electronics 2023, 12(10), 2184; https://doi.org/10.3390/electronics12102184 - 10 May 2023
Cited by 3 | Viewed by 1211
Abstract
Extractive text summarization selects the most important sentences from a document, preserves their original meaning, and produces an objective and fact-based summary. It is faster and less computationally intensive than abstractive summarization techniques. Learning cross-sentence relationships is crucial for extractive text summarization. However, most of the language models currently in use process text sequentially, which makes it difficult to capture such inter-sentence relations, especially in long documents. This paper proposes an extractive summarization model based on a graph neural network (GNN) to address this problem. The model effectively represents cross-sentence relationships using a graph-structured document representation. In addition to sentence nodes, we introduce two node types of different granularity, words and topics, which bring different levels of semantic information. The node representations are updated by a graph attention network (GAT), and the final summary is obtained by binary classification of the sentence nodes. Our text summarization method was demonstrated to be highly effective, as supported by the results of our experiments on the CNN/DM and NYT datasets. Specifically, our approach outperformed baseline models of the same type in terms of ROUGE scores on both datasets, indicating the potential of our proposed model for enhancing text summarization tasks.
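
A minimal sketch of the multi-granularity graph construction is shown below: sentence nodes are linked to the words they contain and the topics they discuss, producing the edge list a GAT layer would operate on. `topics_of` is a stand-in for any topic assignment (e.g., from a topic model); the representation is our assumption, not the paper's data structure.

```python
def build_summarization_graph(sentences, topics_of):
    """Build word/sentence/topic node ids and sentence-word /
    sentence-topic edges for a heterogeneous summarization graph."""
    word_id = {w: i for i, w in enumerate(
        sorted({w for s in sentences for w in s.split()}))}
    topic_id = {t: i for i, t in enumerate(
        sorted({t for s in sentences for t in topics_of(s)}))}
    edges = []
    for sent_idx, s in enumerate(sentences):
        edges += [("sent", sent_idx, "word", word_id[w]) for w in set(s.split())]
        edges += [("sent", sent_idx, "topic", topic_id[t]) for t in topics_of(s)]
    return word_id, topic_id, edges
```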

24 pages, 717 KiB  
Article
DiffuD2T: Empowering Data-to-Text Generation with Diffusion
by Heng Gong, Xiaocheng Feng and Bing Qin
Electronics 2023, 12(9), 2136; https://doi.org/10.3390/electronics12092136 - 07 May 2023
Viewed by 2199
Abstract
Surrounded by structured data, such as medical data, financial data, and knowledge bases, data-to-text generation has become an important natural language processing task that can help people better understand the meaning of those data by providing user-friendly text. Existing methods for data-to-text generation show promising results in tackling two major challenges: content planning and surface realization, which transform structured data into fluent text. However, they lack an iterative refinement process for generating text, which can enable the model to perfect the text step-by-step while accepting control over the process. In this paper, we explore enhancing data-to-text generation with an iterative refinement process via diffusion. We make four main contributions: (1) we use the diffusion model to improve prefix tuning for data-to-text generation; (2) we propose a look-ahead guiding loss to supervise the iterative refinement process for better text generation; (3) we extract content plans from reference text and propose a planning-then-writing pipeline to give the model content planning ability; and (4) we conducted experiments on three data-to-text generation datasets, where both automatic evaluation criteria (BLEU, NIST, METEOR, ROUGE-L, CIDEr, TER, MoverScore, BLEURT, and BERTScore) and human evaluation criteria (Quality and Naturalness) show the effectiveness of our model. Our model improves the competitive prefix tuning method by 2.19% in terms of the widely used automatic evaluation criterion BLEU (BiLingual Evaluation Understudy) on the WebNLG dataset with GPT-2 Large as the pre-trained language model backbone. Human evaluation criteria also show that our model can improve the quality and naturalness of the generated text across all three datasets.

18 pages, 1715 KiB  
Article
Integrating Relational Structure to Heterogeneous Graph for Chinese NL2SQL Parsers
by Changzhe Ma, Wensheng Zhang, Mengxing Huang, Siling Feng and Yuanyuan Wu
Electronics 2023, 12(9), 2093; https://doi.org/10.3390/electronics12092093 - 04 May 2023
Viewed by 1145
Abstract
The existing models for NL2SQL tasks are mainly oriented toward English text and cannot solve the problems of column name reuse in Chinese text data, descriptions in natural language queries, and inconsistent representation of data stored in the database. To address these problems, this paper proposes a Chinese cross-domain NL2SQL model based on a heterogeneous graph and a relative position attention mechanism. The model introduces expert-defined relational structure information to construct initial heterogeneous graphs for database schemas and natural language questions. The heterogeneous graph is pruned based on the natural language question, and a multi-head relative position attention mechanism is used to encode the database schema and natural language question. The target SQL statement is generated using a tree-structured decoder with predefined SQL syntax. Experimental results on the CSpider dataset demonstrate that our model better aligns database schemas with natural language questions and understands the semantic information in natural language queries, effectively improving the matching accuracy of Chinese multi-table SQL statement generation.

15 pages, 685 KiB  
Article
Knowledge-Guided Prompt Learning for Few-Shot Text Classification
by Liangguo Wang, Ruoyu Chen and Li Li
Electronics 2023, 12(6), 1486; https://doi.org/10.3390/electronics12061486 - 21 Mar 2023
Cited by 1 | Viewed by 2936
Abstract
Recently, prompt-based learning has shown impressive performance on various natural language processing tasks in few-shot scenarios. Previous studies of knowledge probing showed that the success of prompt learning is attributable to the implicit knowledge stored in pre-trained language models. However, how this implicit knowledge helps solve downstream tasks remains unclear. In this work, we propose a knowledge-guided prompt learning method that can reveal relevant knowledge for text classification. Specifically, a knowledge prompting template and two multi-task frameworks were designed. The experiments demonstrated the superiority of combining knowledge and prompt learning in few-shot text classification.
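
For intuition, a knowledge prompting template wraps the input with retrieved knowledge and a cloze slot for the label verbalizer. The template wording below is purely illustrative, not the paper's template.

```python
def knowledge_prompt(text, knowledge, mask_token="<mask>"):
    """Prepend retrieved knowledge and append a cloze slot that a masked
    language model fills with a label word (hypothetical template)."""
    return f"{knowledge} {text} This text is about {mask_token}."
```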

17 pages, 794 KiB  
Article
Keyword-Aware Transformers Network for Chinese Open-Domain Conversation Generation
by Yang Zhou, Chenjiao Zhi, Feng Xu, Weiwei Cui, Huaqiong Wang, Aihong Qin, Xiaodiao Chen, Yaqi Wang and Xingru Huang
Electronics 2023, 12(5), 1228; https://doi.org/10.3390/electronics12051228 - 04 Mar 2023
Viewed by 1359
Abstract
The open-domain conversation generation task aims to generate contextually relevant and informative responses based on a given conversation history. A critical challenge in open-domain dialogs is the tendency of models to generate safe responses. Existing work has often incorporated keyword information from the conversation history into response generation to relieve this problem. However, these approaches interact weakly between responses and keywords or ignore the association between keyword extraction and conversation generation. In this paper, we propose a method based on a Keyword-Aware Transformers Network (KAT) that can fuse contextual keywords. Specifically, the model enables keywords and contexts to fully interact with responses for keyword semantic enhancement. We jointly model the keyword extraction task and the dialog generation task in a multi-task learning fashion. Experimental results on two Chinese open-domain dialogue datasets showed that our proposed model outperformed comparison methods on both semantic and non-semantic evaluation metrics, improving Coherence, Fluency, and Informativeness in manual evaluation.

15 pages, 557 KiB  
Article
WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation
by Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao and Tadahiro Matsumoto
Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140 - 26 Feb 2023
Cited by 4 | Viewed by 1797
Abstract
Movie and TV subtitles are frequently employed in natural language processing (NLP) applications, but there are limited Japanese-Chinese bilingual corpora accessible as a dataset to train neural machine translation (NMT) models. In our previous study, we effectively constructed a corpus of a considerable size containing bilingual text data in both Japanese and Chinese by collecting subtitle text data from websites that host movies and television series. The unsatisfactory translation performance of the initial corpus, Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by the limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus that includes about 1.4 million sentence pairs, an 87% increase compared with WCC-JC 1.0. As a result, WCC-JC 2.0 is now among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess the performance of WCC-JC 2.0, we calculated the BLEU scores relative to other comparative corpora and performed manual evaluations of the translation results generated by translation models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.

13 pages, 1298 KiB  
Article
Research on Named Entity Recognition for Spoken Language Understanding Using Adversarial Transfer Learning
by Yao Guo, Meng Li, Yanling Li, Fengpei Ge, Yaohui Qi and Min Lin
Electronics 2023, 12(4), 884; https://doi.org/10.3390/electronics12040884 - 09 Feb 2023
Viewed by 1537
Abstract
In this paper, we propose an adversarial transfer learning method to address the lack of data resources for named entity recognition (NER) tasks in spoken language understanding. In this framework, we use a bi-directional long short-term memory model with self-attention and a conditional random field (BiLSTM-Attention-CRF), which combines character and word information, as the baseline model to train the source-domain and target-domain corpora jointly. Features shared between domains are extracted by a shared feature extractor. We use two different sharing patterns simultaneously: a full sharing mode and a private sharing mode. On this basis, an adversarial discriminator is added to the shared feature extractor, in the manner of generative adversarial networks (GAN), to eliminate domain-dependent features. We compare an ordinary adversarial discriminator (OAD) and a generalized resource-adversarial discriminator (GRAD) through experiments. The experimental results show that the transfer effect of GRAD is better than that of the other methods; the F1 score reaches 92.99% at the highest, a relative increase of 12.89%. The method can effectively improve the performance of NER tasks in resource-scarce fields and alleviate negative transfer.
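
One standard way to set a shared feature extractor against a domain discriminator is a gradient reversal layer: the forward pass is the identity, while the backward pass negates gradients so the extractor learns domain-invariant features. Whether the paper uses gradient reversal or alternating GAN-style updates is not stated, so the sketch below is a generic illustration of the technique.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient in backward."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(features, lambd=1.0):
    # Insert between the shared extractor and the domain discriminator:
    #   domain_logits = discriminator(grad_reverse(shared_features))
    return GradReverse.apply(features, lambd)
```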

15 pages, 970 KiB  
Article
ENEX-FP: A BERT-Based Address Recognition Model
by Min Li, Zeyu Liu, Gang Li, Mingle Zhou and Delong Han
Electronics 2023, 12(1), 209; https://doi.org/10.3390/electronics12010209 - 01 Jan 2023
Viewed by 2150
Abstract
In e-commerce logistics, government registration, financial transportation, and other fields, communication addresses are required, and analyzing them is crucial. Address recognition faces various challenges because address text is freely written, has numerous aliases, and exhibits significant text similarity. To address these issues, this study presents ENEX-FP, an address recognition model consisting of an entity extractor (ENEX) and a feature processor (FP). The study uses adversarial training to enhance the model’s robustness, along with a hierarchical learning-rate setup and a learning-rate decay technique to enhance recognition accuracy. Compared with traditional named entity recognition models, our model achieves F1-scores of 93.47% and 94.59% on the evaluation datasets, demonstrating the ENEX-FP model’s effectiveness in recognizing addresses.
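
A hierarchical (per-layer) learning-rate setup of the kind mentioned above can be expressed as optimizer parameter groups. The sketch below assumes a BERT-style `encoder.layer.{i}` parameter naming scheme and illustrative rates; it is not the paper's configuration.

```python
import torch

def layered_optimizer(model, base_lr=2e-5, decay=0.95, num_layers=12):
    """Give deeper (later) layers the base rate and earlier layers
    geometrically smaller rates via per-parameter-group learning rates."""
    groups = []
    for name, param in model.named_parameters():
        layer = next((i for i in range(num_layers)
                      if f"encoder.layer.{i}." in name), num_layers)
        groups.append({"params": [param],
                       "lr": base_lr * (decay ** (num_layers - layer))})
    return torch.optim.AdamW(groups)
```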

11 pages, 719 KiB  
Article
Entity Factor: A Balanced Method for Table Filling in Joint Entity and Relation Extraction
by Zhifeng Liu, Mingcheng Tao and Conghua Zhou
Electronics 2023, 12(1), 121; https://doi.org/10.3390/electronics12010121 - 27 Dec 2022
Viewed by 1446
Abstract
The knowledge graph is an effective tool for improving natural language processing, but manually annotating enormous amounts of knowledge is expensive. Academics have therefore studied entity and relation extraction techniques, among which the end-to-end table-filling approach is a popular direction for joint entity and relation extraction. However, once the table has been populated in a uniform label space, a large number of null labels are generated within the array, causing label-imbalance problems that bias the model’s encoder toward predicting null labels and thus degrade generalization performance. In this paper, we propose a method to mitigate non-essential null labels in the matrices. This method uses a score matrix to calculate the count of non-entities and the percentage of non-essential null labels in the matrix, which is then projected by a power of the natural constant e to generate an entity-factor matrix that is incorporated into the scoring matrix. During back-propagation, the gradients of non-essential null-labeled cells in the entity factor layer shrink, with the amplitude of the shrinkage related to the size of the entity factor, thereby reducing the model’s feature learning on the large number of non-essential null labels. Experiments with two publicly available benchmark datasets show that incorporating entity factors significantly improved model performance, especially in the relation extraction task, by 1.5% on both datasets.
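
One plausible reading of the entity-factor construction is sketched below: the fraction of non-essential null cells is exponentiated (via e) into a per-cell damping factor applied to the score matrix, which shrinks gradients for those cells during back-propagation. The exact projection is our assumption, not the paper's formula.

```python
import numpy as np

def entity_factor_matrix(score_matrix, entity_mask):
    """Damp non-essential null-label cells by exp(-r), where r is the
    fraction of such cells; entity_mask is True for real-label cells."""
    null_cells = ~entity_mask
    ratio = null_cells.mean()                       # share of null-label cells
    factors = np.where(null_cells, np.exp(-ratio), 1.0)
    return score_matrix * factors                   # shrinks null-cell gradients
```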

25 pages, 950 KiB  
Article
Learning to Co-Embed Queries and Documents
by Yuehong Wu, Bowen Lu, Lin Tian and Shangsong Liang
Electronics 2022, 11(22), 3694; https://doi.org/10.3390/electronics11223694 - 11 Nov 2022
Cited by 1 | Viewed by 1033
Abstract
Learning to Rank (L2R) methods that utilize machine learning techniques to solve ranking problems have been widely studied in the field of information retrieval. Existing methods usually concatenate query and document features as training input, without an explicit understanding of the relevance between queries and documents, especially in the pairwise ranking approach. Thus, it is an interesting question whether we can devise an algorithm that effectively describes the relation between queries and documents to learn a better ranking model without incurring huge parameter costs. In this paper, we present a Gaussian Embedding model for Ranking (GERank), an architecture for co-embedding queries and documents, such that each query or document is represented by a Gaussian distribution with a mean and a variance. GERank optimizes an energy-based loss based on the pairwise ranking framework, and the KL-divergence is utilized to measure the relevance between queries and documents. Experimental results on two LETOR datasets and one TREC dataset demonstrate that our model obtains a remarkable improvement in ranking performance compared with state-of-the-art retrieval models.
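
The KL-divergence between two Gaussians has a closed form, which is what makes it a convenient (and asymmetric) relevance score between a query distribution and a document distribution. For diagonal covariances it reduces to the expression below; the function is a generic sketch, not GERank's code.

```python
import numpy as np

def kl_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for 1-D numpy arrays
    of means and (positive) variances of equal dimension d."""
    d = mu0.size
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1) - np.log(var0)))
```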
