Digital Analysis in Digital Humanities

A special issue of Future Internet (ISSN 1999-5903). This special issue belongs to the section "Big Data and Augmented Intelligence".

Deadline for manuscript submissions: 30 September 2024 | Viewed by 8436

Special Issue Editors


Guest Editor
1. Department of Innovation and Information Engineering, Università degli Studi "Guglielmo Marconi", 00193 Roma, Italy
2. Leibniz Institute for Educational Media | Georg Eckert Institute, 38118 Braunschweig, Germany
Interests: digital analysis and digital humanities; machine learning; natural language processing; knowledge discovery and data linking of structured and unstructured open linked data; semantic technology and knowledge management of big data

Guest Editor
Leibniz Institute for Educational Media | Georg Eckert Institute, 38118 Braunschweig, Germany
Interests: data and web mining; human-machine interaction; information processing; user and data modeling; semantic web and information retrieval

Special Issue Information

Dear Colleagues,

This Special Issue on “Digital Analysis in Digital Humanities” aims to gather results from the interdisciplinary area of innovative data analytics methods and customized technologies for digital humanities research.

This Special Issue calls for submissions of novel research results on methods and technologies with a clear reference to data collection, data processing, data analysis and data visualization that can be exploited in the fields of digital humanities and cultural heritage.

This Special Issue also invites submissions that build on the well-founded idea that digital analysis enables the digital humanities to transform their research analysis and results digitally, becoming more innovative and forward-looking in their decision making.

In addition, this Special Issue aims to emphasize the role of humanists in data analysis by demonstrating successful and challenging applications of human interaction to interpret quantitative analyses and define new advanced and customized solutions.

The Special Issue thus welcomes work on any aspect of “Digital Analysis in Digital Humanities” that relates to computer science and humanities research interests, as well as to recent research trends.

Dr. Francesca Fallucchi
Prof. Dr. Ernesto William De Luca
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Future Internet is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • digital humanities
  • digital analysis
  • digital libraries
  • information technology
  • data mining
  • machine learning
  • natural language processing

Published Papers (4 papers)


Research

31 pages, 5283 KiB  
Article
Beyond Lexical Boundaries: LLM-Generated Text Detection for Romanian Digital Libraries
by Melania Nitu and Mihai Dascalu
Future Internet 2024, 16(2), 41; https://doi.org/10.3390/fi16020041 - 25 Jan 2024
Viewed by 1577
Abstract
Machine-generated content reshapes the landscape of digital information; hence, ensuring the authenticity of texts within digital libraries has become a paramount concern. This work introduces a corpus of approximately 60 k Romanian documents, including human-written samples as well as generated texts using six distinct Large Language Models (LLMs) and three different generation methods. Our robust experimental dataset covers five domains, namely books, news, legal, medical, and scientific publications. The exploratory text analysis revealed differences between human-authored and artificially generated texts, exposing the intricacies of lexical diversity and textual complexity. Since Romanian is a less-resourced language requiring dedicated detectors on which out-of-the-box solutions do not work, this paper introduces two techniques for discerning machine-generated texts. The first method leverages a Transformer-based model to categorize texts as human or machine-generated, while the second method extracts and examines linguistic features, such as identifying the top textual complexity indices via Kruskal–Wallis mean rank and computes burstiness, which are further fed into a machine-learning model leveraging an extreme gradient-boosting decision tree. The methods show competitive performance, with the first technique’s results outperforming the second one in two out of five domains, reaching an F1 score of 0.96. Our study also includes a text similarity analysis between human-authored and artificially generated texts, coupled with a SHAP analysis to understand which linguistic features contribute more to the classifier’s decision.
(This article belongs to the Special Issue Digital Analysis in Digital Humanities)

12 pages, 1610 KiB  
Article
Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification
by Panagiotis Skondras, Panagiotis Zervas and Giannis Tzimas
Future Internet 2023, 15(11), 363; https://doi.org/10.3390/fi15110363 - 09 Nov 2023
Cited by 1 | Viewed by 2495
Abstract
In this article, we investigate the potential of synthetic resumes as a means for the rapid generation of training data and their effectiveness in data augmentation, especially in categories marked by sparse samples. The widespread implementation of machine learning algorithms in natural language processing (NLP) has notably streamlined the resume classification process, delivering time and cost efficiencies for hiring organizations. However, the performance of these algorithms depends on the abundance of training data. While selecting the right model architecture is essential, it is also crucial to ensure the availability of a robust, well-curated dataset. For many categories in the job market, data sparsity remains a challenge. To deal with this challenge, we employed the OpenAI API to generate both structured and unstructured resumes tailored to specific criteria. These synthetically generated resumes were cleaned, preprocessed and then utilized to train two distinct models: a transformer model (BERT) and a feedforward neural network (FFNN) that incorporated Universal Sentence Encoder 4 (USE4) embeddings. While both models were evaluated on the multiclass classification task of resumes, when trained on an augmented dataset containing 60 percent real data (from Indeed website) and 40 percent synthetic data from ChatGPT, the transformer model presented exceptional accuracy. The FFNN, albeit predictably, achieved lower accuracy. These findings highlight the value of augmented real-world data with ChatGPT-generated synthetic resumes, especially in the context of limited training data. The suitability of the BERT model for such classification tasks further reinforces this narrative.
(This article belongs to the Special Issue Digital Analysis in Digital Humanities)

16 pages, 1368 KiB  
Article
New RFI Model for Behavioral Audience Segmentation in Wi-Fi Advertising System
by Shueh-Ting Lim, Lee-Yeng Ong and Meng-Chew Leow
Future Internet 2023, 15(11), 351; https://doi.org/10.3390/fi15110351 - 26 Oct 2023
Viewed by 1361
Abstract
In this technological era, businesses tend to place advertisements via the medium of Wi-Fi advertising to expose their brands and products to the public. Wi-Fi advertising offers a platform for businesses to leverage their marketing strategies to achieve desired goals, provided they have a thorough understanding of their audience’s behaviors. This paper aims to formulate a new RFI (recency, frequency, and interest) model that is able to analyze the behavior of the audience towards the advertisement. The audience’s interest is measured based on the relationship between their total view duration on an advertisement and its corresponding overall click received. With the help of a clustering algorithm to perform the dynamic segmentation, the patterns of the audience behaviors are then being interpreted by segmenting the audience based on their engagement behaviors. In the experiments, two different Wi-Fi advertising attributes are tested to prove the new RFI model is applicable to effectively interpret the audience engagement behaviors with the proposed dynamic characteristics range table. The weak and strongly engaged behavioral characteristics of the segmented behavioral patterns of the audience, such as in a one-time audience, are interpreted successfully with the dynamic-characteristics range table.
(This article belongs to the Special Issue Digital Analysis in Digital Humanities)

14 pages, 2745 KiB  
Article
Digital Qualitative and Quantitative Analysis of Arabic Textbooks
by Francesca Fallucchi, Bouchra Ghattas, Riem Spielhaus and Ernesto William De Luca
Future Internet 2022, 14(8), 237; https://doi.org/10.3390/fi14080237 - 29 Jul 2022
Cited by 2 | Viewed by 2032
Abstract
Digital Humanities (DH) provide a broad spectrum of functionalities and tools that enable the enrichment of both quantitative and qualitative research methods in the humanities. It has been widely recognized that DH can help in curating and analysing large amounts of data. However, digital tools can also support research processes in the humanities that are interested in detailed analyses of how empirical sources are patterned. Following a methodological differentiation between close and distant reading with regard to textual analysis, this article describes the Edumeres Toolbox, a digital tool for textbook analysis. The Edumeres Toolbox is an outcome of the continuous interdisciplinary exchange between computer scientists and humanist researchers, whose expertise is crucial to convert information into knowledge by means of (critical) interpretation and contextualization. This paper presents a use case in order to describe the various functionalities of the Edumeres Toolbox and their use for the analysis of a collection of Arabic textbooks. Hereby, it shows how the interaction between humanist researchers and computer scientists in this digital process produces innovative research solutions and how the tool enables users to discover structural and linguistic patterns and develop innovative research questions. Finally, the paper describes challenges recognized by humanist researchers in using digital tools in their work, which still require in-depth research and practical efforts from both parties to improve the tool performance.
(This article belongs to the Special Issue Digital Analysis in Digital Humanities)

Planned Papers

The list below contains only planned manuscripts. Some of these manuscripts have not yet been received by the Editorial Office. Papers submitted to MDPI journals are subject to peer review.

Title: Generating Synthetic Resume Data with Large Language Models for Enhanced Job Description Classification
Authors: Panagiotis Skondras; Panagiotis Zervas; Giannis Tzimas
Affiliation: University of Peloponnese
Abstract: The rapid growth of online resume platforms and the escalating influence of digital recruitment (e-recruitment) have resulted in a surge of data, intensifying the challenge of extracting valuable insights and metadata from resumes. This article addresses the extraction of metadata from digital resumes marked by big data attributes, which demand frequent updates and continuous monitoring. We propose a multi-faceted approach. We first deploy web crawlers designed to proficiently aggregate resumes from online sources (i.e., indeed.com). Subsequently, we leverage natural language processing (NLP) techniques for data sanitization and preprocessing to ensure the consistency and quality of the gathered resumes. A major obstacle is the absence of annotated data, which is critical for the efficiency of machine learning algorithms. To overcome this, we turn to ChatGPT and other advanced text-generative models. Through prompt engineering, these Large Language Models (LLMs) are utilized to generate annotated data for varied resume-based tasks, such as candidate categorization, skills identification, and experience assessment. This approach amplifies the advantages of LLMs, ensuring the generation of reliable and accurate annotated datasets for efficient metadata extraction. Additionally, to enhance the extraction process, we exploit the potential of embeddings and deep learning. By training deep learning models on LLM-generated annotated datasets, we aim to capture the intricate contextual subtleties within representations. This strategy enables precise and swift metadata extraction, benefiting from the inherent power of deep learning algorithms and the rich semantics of embeddings. In conclusion, this article presents a novel approach for digital resume metadata extraction, navigating the challenges posed by big data and optimizing deep learning techniques. Experiments reveal that merging synthetic data with real data yields superior outcomes, validating our approach's effectiveness in producing accurate metadata and setting the stage for improved candidate matching and recruitment decisions.
Keywords: Metadata extraction, Resumes, CV, Big data, Web crawling, Data preprocessing, ChatGPT, Large Language Models, Deep learning, Embeddings, Text Classification, Labor market analysis
