Information Extraction and Language Discourse Processing

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: 30 April 2024 | Viewed by 9434

Special Issue Editors


Dr. Jennifer D'Souza
Guest Editor
TIB–Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany
Interests: information extraction; text mining; natural language processing; knowledge graphs

Prof. Dr. Chengzhi Zhang
Guest Editor
Professor, School of Economics and Management, Nanjing University of Science and Technology (NJUST), No. 200, Xiaolingwei, 210094 Nanjing, China
Interests: scientific text mining; knowledge entity extraction and evaluation; social media mining

Special Issue Information

Dear Colleagues,

Information extraction (IE) plays an increasingly important and pervasive role in today’s era of digitalized communication media based on the Semantic Web. For example, plain search engine result snippets are gradually being replaced by “rich snippets”, and there is growing interest in converting scholarly publications into structured records that can feed downstream applications such as leaderboards. IE is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases, this involves processing human language texts by means of natural language processing (NLP). The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon both the clean semantics of structured databases and the abundance of unstructured data.

Apart from extrinsic models of IE, research in linguistics and computational linguistics has long pointed out that a text is not just a simple sequence of clauses and sentences but rather follows a highly elaborate structure formalized within discourse. A long-established framework for discourse analysis is rhetorical structure theory (RST). Within a well-written text, no unit of the text is completely isolated; interpreting a unit requires understanding its relation to the surrounding context. Research in discourse analysis aims to uncover such relations in the text, which is helpful for many downstream applications such as summarization, information retrieval, and question answering.

This Special Issue seeks novel research reports on the spectrum that blends information extraction and language discourse processing research in diverse communities. The editors welcome submissions along various dimensions derived from the nature of the extraction task, the advanced neural techniques used for extraction, the variety of input resources exploited, and the type of output produced. Quantitative, qualitative, and mixed methods studies are welcome, as are case studies and experience reports if they describe an impactful application at a scale that delivers useful lessons to the journal readership.

Topics of interest include (but are not limited to):

  • Knowledge base population with discourse-centric information extraction (IE)
  • Coreference resolution and its impact on discourse-centric IE
  • Relationship extraction leveraging linguistic discourse
  • Template filling
  • Impact of pragmatics or rhetoric on information extraction
  • Discourse-centric IE at scale
  • Intelligent and novel assessment models of discourse-centric IE
  • Survey of discourse-centric IE in natural language processing (NLP)
  • Challenges implementing discourse-centric IE in real-world scenarios
  • Modeling domains using discourse-centric IE
  • Human–AI hybrid systems for learning discourse and IE
  • Application of discourse-centric IE

Dr. Jennifer D'Souza
Prof. Dr. Chengzhi Zhang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • coherence
  • topic focus
  • information structure
  • conversation structure
  • discourse processing
  • scholarly discourse processing
  • anaphora resolution

Published Papers (4 papers)

Research

22 pages, 461 KiB  
Article
The Power of Context: A Novel Hybrid Context-Aware Fake News Detection Approach
by Jawaher Alghamdi, Yuqing Lin and Suhuai Luo
Information 2024, 15(3), 122; https://doi.org/10.3390/info15030122 - 21 Feb 2024
Viewed by 1199
Abstract
The detection of fake news has emerged as a crucial area of research due to its potential impact on society. In this study, we propose a robust methodology for identifying fake news by leveraging diverse aspects of language representation and incorporating auxiliary information. Our approach is based on the utilisation of Bidirectional Encoder Representations from Transformers (BERT) to capture contextualised semantic knowledge. Additionally, we employ a multichannel Convolutional Neural Network (mCNN) integrated with stacked Bidirectional Gated Recurrent Units (sBiGRU) to jointly learn multi-aspect language representations. This enables our model to effectively identify valuable clues from news content while simultaneously incorporating content- and context-based cues, such as user posting behaviour, to enhance the detection of fake news. Through extensive experimentation on four widely used real-world datasets, our proposed framework demonstrates superior performance (↑3.59% (PolitiFact), ↑6.8% (GossipCop), ↑2.96% (FA-KES), and ↑12.51% (LIAR), considering both content-based features and additional auxiliary information) compared to existing state-of-the-art approaches, establishing its effectiveness in the challenging task of fake news detection.
(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

26 pages, 1592 KiB  
Article
FinChain-BERT: A High-Accuracy Automatic Fraud Detection Model Based on NLP Methods for Financial Scenarios
by Xinze Yang, Chunkai Zhang, Yizhi Sun, Kairui Pang, Luru Jing, Shiyun Wa and Chunli Lv
Information 2023, 14(9), 499; https://doi.org/10.3390/info14090499 - 12 Sep 2023
Cited by 1 | Viewed by 2767
Abstract
This research primarily explores the application of Natural Language Processing (NLP) technology in precision financial fraud detection, with a particular focus on the implementation and optimization of the FinChain-BERT model. Firstly, the FinChain-BERT model has been successfully employed for financial fraud detection tasks, improving the capability of handling complex financial text information through deep learning techniques. Secondly, novel attempts have been made in the selection of loss functions, with a comparison conducted between negative log-likelihood function and Keywords Loss Function. The results indicated that the Keywords Loss Function outperforms the negative log-likelihood function when applied to the FinChain-BERT model. Experimental results validated the efficacy of the FinChain-BERT model and its optimization measures. Whether in the selection of loss functions or the application of lightweight technology, the FinChain-BERT model demonstrated superior performance. The utilization of Keywords Loss Function resulted in a model achieving 0.97 in terms of accuracy, recall, and precision. Simultaneously, the model size was successfully reduced to 43 MB through the application of integer distillation technology, which holds significant importance for environments with limited computational resources. In conclusion, this research makes a crucial contribution to the application of NLP in financial fraud detection and provides a useful reference for future studies.

18 pages, 1431 KiB  
Article
Exploring a Multi-Layered Cross-Genre Corpus of Document-Level Semantic Relations
by Gregor Williamson, Angela Cao, Yingying Chen, Yuxin Ji, Liyan Xu and Jinho D. Choi
Information 2023, 14(8), 431; https://doi.org/10.3390/info14080431 - 01 Aug 2023
Viewed by 897
Abstract
This paper introduces a multi-layered cross-genre corpus, annotated for coreference resolution, causal relations, and temporal relations, comprising a variety of genres, from news articles and children’s stories to Reddit posts. Our results reveal distinctive genre-specific characteristics at each layer of annotation, highlighting unique challenges for both annotators and machine learning models. Children’s stories feature linear temporal structures and clear causal relations. In contrast, news articles employ non-linear temporal sequences with minimal use of explicit causal or conditional language and few first-person pronouns. Lastly, Reddit posts are author-centered explanations of ongoing situations, with occasional meta-textual reference. Our annotation schemes are adapted from existing work to better suit a broader range of text types. We argue that our multi-layered cross-genre corpus not only reveals genre-specific semantic characteristics but also indicates a rich contextual interplay between the various layers of semantic information. Our MLCG corpus is shared under the open-source Apache 2.0 license.

21 pages, 339 KiB  
Article
Extracting Narrative Patterns in Different Textual Genres: A Multilevel Feature Discourse Analysis
by María Miró Maestre, Marta Vicente, Elena Lloret and Armando Suárez Cueto
Information 2023, 14(1), 28; https://doi.org/10.3390/info14010028 - 31 Dec 2022
Viewed by 2373
Abstract
We present a data-driven approach to discover and extract patterns in textual genres with the aim of identifying whether there is an interesting variation of linguistic features among different narrative genres depending on their respective communicative purposes. We want to achieve this goal by performing a multilevel discourse analysis according to (1) the type of feature studied (shallow, syntactic, semantic, and discourse-related); (2) the texts at a document level; and (3) the textual genres of news, reviews, and children’s tales. To accomplish this, several corpora from the three textual genres were gathered from different sources to ensure a heterogeneous representation, paying attention to the presence and frequency of a series of features extracted with computational tools. This deep analysis aims at obtaining more detailed knowledge of the different linguistic phenomena that directly shape each of the genres included in the study, therefore showing the particularities that make them be considered as individual genres but also comprise them inside the narrative typology. The findings suggest that this type of multilevel linguistic analysis could be of great help for areas of research within natural language processing such as computational narratology, as they allow a better understanding of the fundamental features that define each genre and its communicative purpose. Likewise, this approach could also boost the creation of more consistent automatic story generation tools in areas of language generation.

Planned Papers

The list below represents only planned manuscripts. Some of these manuscripts have not been received by the Editorial Office yet. Papers submitted to MDPI journals are subject to peer review.

Title: Comparative Analysis of ORKG Properties and LLM-Generated Research Dimensions
Authors: Jennifer D'Souza; Vlad Nechakhin; Steffen Eger
Affiliation: TIB and University of Mannheim
Abstract: Structuring papers (or research ideas) in terms of various dimensions/properties is the basis for effectively searching scientific articles beyond mere keyword-based search. Existing endeavours, e.g., the Open Research Knowledge Graph (ORKG), use manually codified attributes to describe papers in terms of such properties, for example, "time period of study" and "location of study population" for the research problem "reproductive number estimates of a population." However, manual specification is extremely time-consuming and suffers from inconsistencies among the human coders involved. In this study, we conduct a thorough comparative analysis between manually extracted properties of papers from the ORKG and research dimensions generated by Large Language Models (LLMs) such as GPT, Mistral, and Llama. Our objective is to assess the similarity and divergence across various criteria, including semantic alignment and deviation, mapping accuracy between properties, and cosine similarity across embeddings generated by state-of-the-art models like SciNCL. By quantifying the relatedness of LLM-generated dimensions to manually created ORKG properties, we explore their performance across diverse research fields. Our findings provide insights into the correspondence between ORKG properties and LLM dimensions, with significant implications for the advancement of automated research metadata generation and effective related work search beyond using keywords.
