Natural Language Processing and Information Retrieval

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (6 November 2023) | Viewed by 30339

Special Issue Editors


Dr. Shangsong Liang
Guest Editor
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 511400, China
Interests: information retrieval; data mining; machine learning; natural language processing

Dr. Zaiqiao Meng
Guest Editor
School of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK
Interests: NLP; knowledge graphs; graph neural networks

Special Issue Information

Dear Colleagues,

In recent decades, techniques that integrate natural language processing (NLP) and information retrieval (IR) have achieved significant improvements in a wide spectrum of real-life applications, such as question answering, summarization, neural text retrieval and understanding, and representation learning for information extraction. A key to the success of these applications is integrating NLP and IR in seamless and appropriate ways.

This Special Issue is intended to provide an overview of state-of-the-art research in the fields of NLP and IR and, in particular, how they integrate with and improve each other in terms of either theory or applications. Specifically, this Special Issue aims to gather work from researchers with broad expertise in NLP and IR, presenting their cutting-edge theories, models, methods, algorithms, applications, or perspectives on future directions.

The topics of interest for this Special Issue include but are not limited to:

  • Question answering
  • Information retrieval and text mining
  • NLP and IR theories and applications
  • Summarization
  • Graph neural networks for NLP and IR
  • Machine/deep learning for NLP and IR
  • Machine translation and multilingualism
  • Syntax: tagging, chunking, and parsing
  • Semantics: lexical, sentence-level semantics, textual inference, and other areas
  • Generation
  • Dialogue and interactive systems
  • Search and ranking
  • NLP for search, recommendation, and representation

Dr. Shangsong Liang
Dr. Zaiqiao Meng
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • natural language processing
  • information retrieval
  • machine learning
  • deep learning

Published Papers (20 papers)


Research

13 pages, 1498 KiB  
Article
RoBERTa-Based Keyword Extraction from Small Number of Korean Documents
by So-Eon Kim, Jun-Beom Lee, Gyu-Min Park, Seok-Man Sohn and Seong-Bae Park
Electronics 2023, 12(22), 4560; https://doi.org/10.3390/electronics12224560 - 07 Nov 2023
Viewed by 942
Abstract
Keyword extraction is the task of identifying essential words in a lengthy document. This process is primarily executed through supervised keyword extraction. In instances where the dataset is limited in size, a classification-based approach is typically employed. Therefore, this paper introduces a novel keyword extractor based on a classification approach. The proposed keyword extractor comprises three key components: RoBERTa, a keyword estimator, and a decision rule. RoBERTa encodes an input document, the keyword estimator calculates the probability of each token in the document becoming a keyword, and the decision rule ultimately determines whether each token is a keyword based on these probabilities. However, training the proposed model with a small dataset presents two challenges: a document may contain no keyword tokens at all, and a single word can be composed of both keyword tokens and non-keyword tokens. Two novel heuristics are proposed to address these problems. These heuristics have been extensively tested through experiments, demonstrating that the proposed keyword extractor surpasses both the generation-based approach and the vanilla RoBERTa in environments with limited data. The efficacy of the heuristics is further validated through an ablation study. In summary, the proposed heuristics have proven to be effective in developing a supervised keyword extractor with a small dataset.
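
As a rough illustration of this three-part design, the sketch below wires a generic encoder to a token-level keyword estimator and a threshold decision rule. The class names, hidden size, and 0.5 threshold are illustrative assumptions, not the authors' released code; `encoder` stands in for a pre-trained RoBERTa returning per-token hidden states.

```python
import torch
import torch.nn as nn

class KeywordEstimator(nn.Module):
    """Scores each token's probability of being a keyword (hypothetical sketch)."""
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        # Any callable mapping (input_ids, attention_mask) -> (batch, seq, hidden);
        # with a HuggingFace RoBERTa one would take `.last_hidden_state` instead.
        self.encoder = encoder
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)        # (batch, seq, hidden)
        return torch.sigmoid(self.scorer(hidden)).squeeze(-1)   # (batch, seq)

def decision_rule(probs, threshold=0.5):
    """Final keyword decision: keep tokens whose probability exceeds the threshold."""
    return probs > threshold
```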

14 pages, 2966 KiB  
Article
A Multi-Modal Retrieval Model for Mathematical Expressions Based on ConvNeXt and Hesitant Fuzzy Set
by Ruxuan Li, Jingyi Wang and Xuedong Tian
Electronics 2023, 12(20), 4363; https://doi.org/10.3390/electronics12204363 - 20 Oct 2023
Viewed by 1001
Abstract
Mathematical expression retrieval is an essential component of mathematical information retrieval. Current mathematical expression retrieval research primarily targets single modalities, particularly text, which can lead to the loss of structural information. On the other hand, multimodal research has demonstrated promising outcomes across different domains, and mathematical expressions in image format are adept at preserving their structural characteristics. We therefore propose a multi-modal retrieval model for mathematical expressions based on ConvNeXt and HFS to address the limitations of single-modal retrieval. For the image modality, mathematical expression retrieval is based on the similarity of image features and symbol-level features of the expression, where image features of the expression image are extracted by ConvNeXt, while symbol-level features are obtained by the Symbol Level Features Extraction (SLFE) module. For the text modality, the Formula Description Structure (FDS) is employed to analyze expressions and extract their attributes. Additionally, the application of Hesitant Fuzzy Set (HFS) theory facilitates the computation of hesitant fuzzy similarity between mathematical queries and candidate expressions. Finally, Reciprocal Rank Fusion (RRF) is employed to integrate the rankings from image-modality and text-modality retrieval, yielding the final retrieval list. The experiments were conducted on the publicly accessible ArXiv dataset (containing 592,345 mathematical expressions) and the NTCIR-mair-wikipedia-corpus (NTCIR) dataset. The MAP@10 value for the multimodal RRF fusion approach is 0.774. These results substantiate the efficacy of the multi-modal mathematical expression retrieval approach based on ConvNeXt and HFS.
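
The RRF step at the end of this pipeline is a standard technique and is easy to reproduce. Below is a minimal, self-contained sketch; the smoothing constant k = 60 follows the common RRF convention and is not taken from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of ids into one list.

    Each item's fused score is the sum over rankers of 1 / (k + rank).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fusing an image-modality ranking with a text-modality ranking:
fused = reciprocal_rank_fusion([["e3", "e1", "e2"], ["e1", "e3", "e4"]])
```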

20 pages, 1026 KiB  
Article
Addressing Long-Distance Dependencies in AMR Parsing with Hierarchical Clause Annotation
by Yunlong Fan, Bin Li, Yikemaiti Sataer, Miao Gao, Chuanqi Shi and Zhiqiang Gao
Electronics 2023, 12(18), 3908; https://doi.org/10.3390/electronics12183908 - 16 Sep 2023
Cited by 1 | Viewed by 704
Abstract
Most natural language processing (NLP) tasks operationalize an input sentence as a sequence with token-level embeddings and features, despite its clausal structure. Taking abstract meaning representation (AMR) parsing as an example, recent parsers are empowered by transformers and pre-trained language models, but long-distance dependencies (LDDs) introduced by long sequences remain an open problem. We argue that LDDs are not caused by sequence length per se but are essentially related to the internal clause hierarchy. Typically, non-verb words in a clause cannot depend on words outside of it, and verbs from different but related clauses have much longer dependencies than those in the same clause. With this intuition, we introduce a type of clausal feature, hierarchical clause annotation (HCA), into AMR parsing and propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate HCA trees of complex sentences for addressing LDDs. We conduct extensive experiments on two in-distribution (ID) AMR datasets (AMR 2.0 and AMR 3.0) and three out-of-distribution (OOD) ones (TLP, New3, and Bio). Experimental results show that our HCA-based approaches achieve significant and explainable improvements (0.7 Smatch score on both ID datasets; 2.3, 0.7, and 2.6 on the three OOD datasets, respectively) over the baseline model and outperform the state-of-the-art (SOTA) model (0.7 Smatch score on the OOD dataset Bio) when encountering sentences with complex clausal structures, which introduce most LDD cases.
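
To make the clause-restricted attention idea concrete, one way to realize it is a boolean self-attention mask in which non-verb tokens see only their own clause while clause verbs retain global attention. This is a simplified sketch of the general mechanism under our own assumptions, not the authors' HCA-SA implementation.

```python
import torch

def clause_attention_mask(clause_ids, verb_positions=None):
    """Build a (seq, seq) boolean mask; True means attention is allowed.

    clause_ids[i] is the clause index of token i. Tokens attend within
    their own clause; tokens in verb_positions keep full attention so that
    inter-clause dependencies still flow through clause verbs.
    """
    ids = torch.tensor(clause_ids)
    mask = ids.unsqueeze(0) == ids.unsqueeze(1)   # same-clause attention
    if verb_positions:
        for v in verb_positions:
            mask[v, :] = True                     # verbs attend everywhere
            mask[:, v] = True                     # and are visible to all
    return mask
```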

15 pages, 2746 KiB  
Article
Neural Machine Translation of Electrical Engineering Based on Integrated Convolutional Neural Networks
by Zikang Liu, Yuan Chen and Juwei Zhang
Electronics 2023, 12(17), 3604; https://doi.org/10.3390/electronics12173604 - 25 Aug 2023
Viewed by 1067
Abstract
Research has shown that neural machine translation performs poorly on low-resource and domain-specific parallel corpora. In this paper, we focus on neural machine translation in the field of electrical engineering. To address the mistranslation caused by the Transformer model’s limited ability to extract feature information from certain sentences, we propose two new models that integrate a convolutional neural network as a feature extraction layer into the Transformer model. The feature information extracted by the CNN is fused separately in the source-side and target-side models, which enhances the Transformer model’s ability to extract feature information, optimizes model performance, and improves translation quality. On an electrical engineering dataset, the proposed source-side and target-side models improved BLEU scores by 1.63 and 1.12 percentage points, respectively, compared to the baseline model. In addition, the two models proposed in this paper can learn rich semantic knowledge without relying on auxiliary knowledge such as part-of-speech tagging and named entity recognition, which saves a certain amount of human resources and time costs.
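
The core architectural move, inserting a convolutional feature-extraction layer and fusing its output back into the Transformer's token representations, can be sketched as a generic residual fusion. The paper's exact source-side and target-side variants differ in where the fusion happens, so the following is an assumption-laden illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class ConvFeatureFusion(nn.Module):
    """Extract local n-gram features with a 1-D convolution and fuse them
    residually into the token embeddings fed to the Transformer."""
    def __init__(self, d_model=512, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, embeddings):                       # (batch, seq, d_model)
        local = self.conv(embeddings.transpose(1, 2)).transpose(1, 2)
        return embeddings + torch.relu(local)            # residual fusion
```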

15 pages, 4080 KiB  
Article
Named Entity Recognition for Few-Shot Power Dispatch Based on Multi-Task
by Zhixiang Tan, Yan Chen, Zengfu Liang, Qi Meng and Dezhao Lin
Electronics 2023, 12(16), 3476; https://doi.org/10.3390/electronics12163476 - 17 Aug 2023
Cited by 1 | Viewed by 803
Abstract
Because nested entities and professional terms are difficult to identify in the field of power dispatch, a multi-task-based few-shot named entity recognition model (FSPD-NER) for power dispatch is proposed. The model consists of four modules: feature enhancement, seed, expansion, and implication. Firstly, the masking strategy of the encoder is improved by adopting whole-word masking, using a RoBERTa (Robustly Optimized BERT Pretraining Approach) encoder as the embedding layer to obtain the text feature representation and an IDCNN (Iterated Dilated CNN) module to enhance the features. Then the text is segmented into one- and two-character Chinese substrings to form a seed set, the score of each seed is calculated, and seeds scoring above the threshold ω are passed to the expansion module as candidate seeds. Next, the candidate seeds are expanded left and right according to an offset γ to obtain the candidate entities. Finally, to construct text implication pairs, the input text is used as a premise sentence, each candidate entity is connected with predefined label templates to form hypothesis sentences, and the implication pairs are passed to the RoBERTa encoder for the classification task. The focal loss function is used to alleviate label imbalance during training. The experimental results on the power dispatch dataset show that the precision, recall, and F1 scores in the 20-shot setting are 63.39%, 61.97%, and 62.67%, respectively, a significant performance improvement over existing methods.
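
The seed-and-expansion stages can be sketched in a few lines of Python. `seed_score`, `omega`, and `gamma` mirror the quantities named in the abstract, while the scoring model itself is left abstract; this is an illustrative reading, not the released code.

```python
def candidate_entities(text, seed_score, omega=0.5, gamma=2):
    """Propose entity spans by scoring 1- and 2-character seeds and
    expanding high-scoring seeds by up to `gamma` characters per side."""
    seeds = [(i, i + n) for n in (1, 2) for i in range(len(text) - n + 1)]
    candidates = set()
    for start, end in seeds:
        if seed_score(text[start:end]) > omega:          # keep promising seeds
            for left in range(gamma + 1):
                for right in range(gamma + 1):
                    s = max(0, start - left)
                    e = min(len(text), end + right)
                    candidates.add(text[s:e])            # expanded span
    return candidates
```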

17 pages, 892 KiB  
Article
Part-of-Speech Tags Guide Low-Resource Machine Translation
by Zaokere Kadeer, Nian Yi and Aishan Wumaier
Electronics 2023, 12(16), 3401; https://doi.org/10.3390/electronics12163401 - 10 Aug 2023
Viewed by 968
Abstract
Neural machine translation models are guided by a loss function to select source sentence features and generate results close to human annotation. When data resources are abundant, neural machine translation models can focus on the features used to produce high-quality translations, such as POS and other grammatical features. However, models cannot focus precisely on these features when data resources are limited, because the lack of samples makes the model overfit before considering these features. Previous works have enriched the features by integrating source POS or by multitask methods; however, these methods only utilize the source POS or produce translations by introducing the generated target POS. We propose introducing POS information based on multitask methods and reconstructors. We obtain the POS tags with an additional encoder and decoder and compute the corresponding loss functions. These loss functions are used together with the machine translation loss to optimize the parameters of the entire model, which makes the model pay attention to POS features. The POS features the model focuses on then guide the translation process and alleviate the problem that models cannot attend to POS features in low-resource settings. Experiments on multiple translation tasks show that the method improves BLEU by 0.4∼1 points compared with the baseline model.
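
The joint objective described here, the translation loss plus POS-tagging losses from the additional encoder and decoder, reduces to a weighted sum. A minimal sketch follows; the weights `alpha` and `beta` are illustrative assumptions rather than values from the paper.

```python
def joint_loss(mt_loss, encoder_pos_loss, decoder_pos_loss,
               alpha=0.3, beta=0.3):
    """Weighted sum of the translation loss and the two POS-tagging losses,
    so gradients from the POS tasks also steer the shared parameters."""
    return mt_loss + alpha * encoder_pos_loss + beta * decoder_pos_loss
```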

14 pages, 747 KiB  
Article
A Curriculum Learning Approach for Multi-Domain Text Classification Using Keyword Weight Ranking
by Zilin Yuan, Yinghui Li, Yangning Li, Hai-Tao Zheng, Yaobin He, Wenqiang Liu, Dongxiao Huang and Bei Wu
Electronics 2023, 12(14), 3040; https://doi.org/10.3390/electronics12143040 - 11 Jul 2023
Cited by 1 | Viewed by 929
Abstract
Text classification is a well-established task in NLP, but it has two major limitations. Firstly, text classification is heavily reliant on domain-specific knowledge, meaning that a classifier trained on a given corpus may not perform well when presented with text from another domain. Secondly, text classification models require substantial amounts of annotated data for training, and in certain domains there may be an insufficient quantity of labeled data available. Consequently, it is essential to explore methods for efficiently utilizing text data from various domains to improve the performance of models across a range of domains. One approach is to use multi-domain text classification models that leverage adversarial training to extract features shared among all domains as well as the specific features of each domain. After observing the varying distinctness of domain-specific features, our paper introduces a curriculum learning approach using a ranking system based on keyword weight to enhance the effectiveness of multi-domain text classification models. Experimental results on the Amazon reviews and FDU-MTL datasets show that our method significantly improves the efficacy of multi-domain text classification models that adopt adversarial learning, reaching state-of-the-art results on these two datasets.
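
A curriculum of this kind boils down to ordering training samples by a difficulty proxy. The sketch below assumes that samples with more distinct domain keywords (higher keyword weight, e.g. mean TF-IDF of top keywords) are presented first; both the scoring function and the easy-to-hard direction are our assumptions, not the paper's exact ranking system.

```python
def curriculum_order(samples, keyword_weight):
    """Order training samples for a curriculum using a keyword-weight score.

    `keyword_weight` maps a sample to a scalar; higher scores are assumed
    to indicate more distinct (easier) domain-specific features here.
    """
    return sorted(samples, key=keyword_weight, reverse=True)
```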

18 pages, 3222 KiB  
Article
Interaction Information Guided Prototype Representation Rectification for Few-Shot Relation Extraction
by Xiaoqin Ma, Xizhong Qin, Junbao Liu and Wensheng Ran
Electronics 2023, 12(13), 2912; https://doi.org/10.3390/electronics12132912 - 03 Jul 2023
Viewed by 1059
Abstract
Few-shot relation extraction aims to identify and extract semantic relations between entity pairs using only a small number of annotated instances. Many recently proposed prototype-based methods have shown excellent performance. However, existing prototype-based methods ignore the hidden inter-instance interaction information between the support and query sets, leading to unreliable prototypes. In addition, the current optimization of the prototypical network relies only on cross-entropy loss, which is concerned solely with the accuracy of the predicted probability for the correct label and ignores the differences among the non-correct labels, and therefore cannot account for relation discretization in semantic space. This paper proposes an attentional network of interaction information to obtain a more reliable relation prototype. Firstly, an inter-instance interaction information attention module is designed to mitigate prototype unreliability through interaction information between the support set and query set instances, utilizing category information hidden in the query set. Secondly, a similarity scalar, defined by the mixed features of the prototype and the relation, is added to the focal loss to increase the attention paid to hard examples. We conducted extensive experiments on two standard datasets and demonstrated the validity of our proposed model.
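
For reference, focal loss down-weights well-classified examples so that training concentrates on hard ones. The sketch below is a standard focal loss with an optional per-example weight standing in for the similarity scalar; where exactly the paper mixes the scalar in is our assumption.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0, scalar=None):
    """Focal loss over relation logits.

    logits: (batch, num_classes); target: (batch,) class indices;
    scalar: optional (batch,) similarity weight per example.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1 - pt) ** gamma) * log_pt                      # focus on hard examples
    if scalar is not None:
        loss = loss * scalar
    return loss.mean()
```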

17 pages, 685 KiB  
Article
RDVI: A Retrieval–Detection Framework for Verbal Irony Detection
by Zhiyuan Wen, Rui Wang, Shiwei Chen, Qianlong Wang, Keyang Ding, Bin Liang and Ruifeng Xu
Electronics 2023, 12(12), 2673; https://doi.org/10.3390/electronics12122673 - 14 Jun 2023
Viewed by 1187
Abstract
Verbal irony is a common form of expression used in daily communication, where the intended meaning is often opposite to the literal meaning. Accurately recognizing verbal irony is essential for any NLP application for which the understanding of the true user intentions is key to performing the underlying tasks. While existing research has made progress in this area, verbal irony often involves connotative knowledge that cannot be directly inferred from the text or its context, which limits the detection model’s ability to recognize and comprehend verbal irony. To address this issue, we propose a Retrieval–Detection method for Verbal Irony (RDVI). This approach improves the detection model’s ability to recognize and comprehend verbal irony by retrieving the connotative knowledge from the open domain and incorporating it into the model using prompt learning. The experimental results demonstrate that our proposed method outperforms state-of-the-art models.

15 pages, 2108 KiB  
Article
A Scenario-Generic Neural Machine Translation Data Augmentation Method
by Xiner Liu, Jianshu He, Mingzhe Liu, Zhengtong Yin, Lirong Yin and Wenfeng Zheng
Electronics 2023, 12(10), 2320; https://doi.org/10.3390/electronics12102320 - 21 May 2023
Cited by 45 | Viewed by 1899
Abstract
Amid the rapid advancement of neural machine translation, data sparsity has been a major obstacle. To address this issue, this study proposes a general data augmentation technique for various scenarios. It examines the diversity and quality of parallel corpora in both rich- and low-resource settings and integrates the low-frequency word substitution method and the reverse translation approach for complementary benefits. Additionally, the method improves the pseudo-parallel corpus generated by reverse translation by substituting low-frequency words, and it includes a grammar error correction module to reduce grammatical errors in low-resource scenarios. The experimental data are partitioned into rich- and low-resource scenarios at a 10:1 ratio, and the experiments verify the necessity of grammatical error correction for pseudo-corpora in low-resource scenarios. Models and methods are chosen from the backbone network and related literature for comparative experiments. The experimental findings demonstrate that the proposed data augmentation approach is suitable for both rich- and low-resource scenarios and is effective in enhancing the training corpus to improve the performance of translation tasks.
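
The two building blocks, reverse (back-) translation and low-frequency word substitution, compose naturally. The sketch below is a pipeline illustration under our own assumptions: `reverse_model` stands in for a trained target-to-source translator, and `synonyms` for any synonym table; neither name comes from the paper.

```python
def substitute_low_frequency(tokens, freq, synonyms, min_count=5):
    """Replace rare words with higher-frequency synonyms when available
    (one half of the augmentation; the threshold is illustrative)."""
    return [synonyms.get(tok, tok) if freq.get(tok, 0) < min_count else tok
            for tok in tokens]

def back_translate_corpus(target_sentences, reverse_model):
    """Generate pseudo-parallel (source, target) pairs by translating
    target-side sentences back into the source language."""
    return [(reverse_model(t), t) for t in target_sentences]
```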

12 pages, 2521 KiB  
Article
A Multi-Granularity Heterogeneous Graph for Extractive Text Summarization
by Henghui Zhao, Wensheng Zhang, Mengxing Huang, Siling Feng and Yuanyuan Wu
Electronics 2023, 12(10), 2184; https://doi.org/10.3390/electronics12102184 - 10 May 2023
Cited by 3 | Viewed by 1211
Abstract
Extractive text summarization selects the most important sentences from a document, preserves their original meaning, and produces an objective and fact-based summary. It is faster and less computationally intensive than abstractive summarization techniques. Learning cross-sentence relationships is crucial for extractive text summarization. However, most of the language models currently in use process text sequentially, which makes it difficult to capture such inter-sentence relations, especially in long documents. This paper proposes an extractive summarization model based on a graph neural network (GNN) to address this problem. The model effectively represents cross-sentence relationships using a graph-structured document representation. In addition to sentence nodes, we introduce two node types of different granularity, words and topics, which bring different levels of semantic information. The node representations are updated by a graph attention network (GAT), and the final summary is obtained by binary classification of the sentence nodes. Our text summarization method was demonstrated to be highly effective, as supported by the results of our experiments on the CNN/DM and NYT datasets. Specifically, our approach outperformed baseline models of the same type in terms of ROUGE scores on both datasets, indicating the potential of our proposed model for enhancing text summarization tasks.
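
A minimal sketch of the multi-granularity graph construction is shown below: sentence nodes are linked to the words they contain and the topics they discuss, producing the edge list a GAT layer would operate on. `topics_of` is a stand-in for any topic assignment (e.g., from a topic model); the representation is our assumption, not the paper's data structure.

```python
def build_summarization_graph(sentences, topics_of):
    """Build word/sentence/topic node ids and sentence-word /
    sentence-topic edges for a heterogeneous summarization graph."""
    word_id = {w: i for i, w in enumerate(
        sorted({w for s in sentences for w in s.split()}))}
    topic_id = {t: i for i, t in enumerate(
        sorted({t for s in sentences for t in topics_of(s)}))}
    edges = []
    for sent_idx, s in enumerate(sentences):
        edges += [("sent", sent_idx, "word", word_id[w]) for w in set(s.split())]
        edges += [("sent", sent_idx, "topic", topic_id[t]) for t in topics_of(s)]
    return word_id, topic_id, edges
```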

24 pages, 717 KiB  
Article
DiffuD2T: Empowering Data-to-Text Generation with Diffusion
by Heng Gong, Xiaocheng Feng and Bing Qin
Electronics 2023, 12(9), 2136; https://doi.org/10.3390/electronics12092136 - 07 May 2023
Viewed by 2199
Abstract
Surrounded by structured data, such as medical data, financial data, and knowledge bases, data-to-text generation has become an important natural language processing task that can help people better understand the meaning of those data by providing user-friendly text. Existing methods for data-to-text generation show promising results in tackling two major challenges: content planning and surface realization, which transform structured data into fluent text. However, they lack an iterative refinement process for generating text, which can enable the model to perfect the text step-by-step while accepting control over the process. In this paper, we explore enhancing data-to-text generation with an iterative refinement process via diffusion. We make four main contributions: (1) we use the diffusion model to improve prefix tuning for data-to-text generation; (2) we propose a look-ahead guiding loss to supervise the iterative refinement process for better text generation; (3) we extract content plans from reference text and propose a planning-then-writing pipeline to give the model content planning ability; and (4) we conducted experiments on three data-to-text generation datasets, where both automatic evaluation criteria (BLEU, NIST, METEOR, ROUGE-L, CIDEr, TER, MoverScore, BLEURT, and BERTScore) and human evaluation criteria (Quality and Naturalness) show the effectiveness of our model. Our model improves the competitive prefix tuning method by 2.19% in terms of the widely used automatic evaluation criterion BLEU (BiLingual Evaluation Understudy) on the WebNLG dataset with GPT-2 Large as the pre-trained language model backbone. Human evaluation criteria also show that our model can improve the quality and naturalness of the generated text across all three datasets.

18 pages, 1715 KiB  
Article
Integrating Relational Structure to Heterogeneous Graph for Chinese NL2SQL Parsers
by Changzhe Ma, Wensheng Zhang, Mengxing Huang, Siling Feng and Yuanyuan Wu
Electronics 2023, 12(9), 2093; https://doi.org/10.3390/electronics12092093 - 04 May 2023
Viewed by 1145
Abstract
The existing models for NL2SQL tasks are mainly oriented toward English text and cannot solve the problems of column name reuse in Chinese text data, descriptions in natural language queries, and inconsistent representation of data stored in the database. To address these problems, this paper proposes a Chinese cross-domain NL2SQL model based on a heterogeneous graph and a relative position attention mechanism. The model introduces expert-defined relational structure information to construct initial heterogeneous graphs for database schemas and natural language questions. The heterogeneous graph is pruned based on the natural language question, and a multi-head relative position attention mechanism is used to encode the database schema and natural language question. The target SQL statement is generated using a tree-structured decoder with predefined SQL syntax. Experimental results on the CSpider dataset demonstrate that our model better aligns database schemas with natural language questions and understands the semantic information in natural language queries, effectively improving the matching accuracy of Chinese multi-table SQL statement generation.

15 pages, 685 KiB  
Article
Knowledge-Guided Prompt Learning for Few-Shot Text Classification
by Liangguo Wang, Ruoyu Chen and Li Li
Electronics 2023, 12(6), 1486; https://doi.org/10.3390/electronics12061486 - 21 Mar 2023
Cited by 1 | Viewed by 2936
Abstract
Recently, prompt-based learning has shown impressive performance on various natural language processing tasks in few-shot scenarios. Previous studies of knowledge probing showed that the success of prompt learning is attributable to the implicit knowledge stored in pre-trained language models. However, how this implicit knowledge helps solve downstream tasks remains unclear. In this work, we propose a knowledge-guided prompt learning method that can reveal relevant knowledge for text classification. Specifically, a knowledge prompting template and two multi-task frameworks were designed. The experiments demonstrated the superiority of combining knowledge and prompt learning in few-shot text classification.
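
For intuition, a knowledge prompting template wraps the input with retrieved knowledge and a cloze slot for the label verbalizer. The template wording below is purely illustrative, not the paper's template.

```python
def knowledge_prompt(text, knowledge, mask_token="<mask>"):
    """Prepend retrieved knowledge and append a cloze slot that a masked
    language model fills with a label word (hypothetical template)."""
    return f"{knowledge} {text} This text is about {mask_token}."
```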

17 pages, 794 KiB  
Article
Keyword-Aware Transformers Network for Chinese Open-Domain Conversation Generation
by Yang Zhou, Chenjiao Zhi, Feng Xu, Weiwei Cui, Huaqiong Wang, Aihong Qin, Xiaodiao Chen, Yaqi Wang and Xingru Huang
Electronics 2023, 12(5), 1228; https://doi.org/10.3390/electronics12051228 - 04 Mar 2023
Viewed by 1359
Abstract
The open-domain conversation generation task aims to generate contextually relevant and informative responses based on a given conversation history. A critical challenge in open-domain dialogs is the tendency of models to generate safe responses. Existing work has often incorporated keyword information from the conversation history into response generation to relieve this problem. However, these approaches interact weakly between responses and keywords or ignore the association between keyword extraction and conversation generation. In this paper, we propose a method based on a Keyword-Aware Transformers Network (KAT) that can fuse contextual keywords. Specifically, the model enables keywords and contexts to fully interact with responses for keyword semantic enhancement. We jointly model the keyword extraction task and the dialog generation task in a multi-task learning fashion. Experimental results on two Chinese open-domain dialogue datasets showed that our proposed model outperformed comparison methods on both semantic and non-semantic evaluation metrics, improving Coherence, Fluency, and Informativeness in manual evaluation.

15 pages, 557 KiB  
Article
WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation
by Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao and Tadahiro Matsumoto
Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140 - 26 Feb 2023
Cited by 4 | Viewed by 1797
Abstract
Movie and TV subtitles are frequently employed in natural language processing (NLP) applications, but there are limited Japanese-Chinese bilingual corpora accessible as a dataset to train neural machine translation (NMT) models. In our previous study, we effectively constructed a corpus of a considerable size containing bilingual text data in both Japanese and Chinese by collecting subtitle text data from websites that host movies and television series. The unsatisfactory translation performance of the initial corpus, Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by the limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues associated with the construction of WCC-JC 1.0 and constructed the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites. Then, we manually aligned a large number of high-quality sentence pairs. Our efforts resulted in a new corpus that includes about 1.4 million sentence pairs, an 87% increase compared with WCC-JC 1.0. As a result, WCC-JC 2.0 is now among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess the performance of WCC-JC 2.0, we calculated the BLEU scores relative to other comparative corpora and performed manual evaluations of the translation results generated by translation models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.

13 pages, 1298 KiB  
Article
Research on Named Entity Recognition for Spoken Language Understanding Using Adversarial Transfer Learning
by Yao Guo, Meng Li, Yanling Li, Fengpei Ge, Yaohui Qi and Min Lin
Electronics 2023, 12(4), 884; https://doi.org/10.3390/electronics12040884 - 09 Feb 2023
Viewed by 1537
Abstract
In this paper, we propose an adversarial transfer learning method to address the lack of data resources for named entity recognition (NER) tasks in spoken language understanding. In this framework, we use a bi-directional long short-term memory model with self-attention and a conditional random field (BiLSTM-Attention-CRF), which combines character and word information, as the baseline model to train the source-domain and target-domain corpora jointly. Features shared between domains are extracted by a shared feature extractor. We use two different sharing patterns simultaneously: a full sharing mode and a private sharing mode. On this basis, an adversarial discriminator is added to the shared feature extractor, in the manner of generative adversarial networks (GAN), to eliminate domain-dependent features. We compare an ordinary adversarial discriminator (OAD) and a generalized resource-adversarial discriminator (GRAD) through experiments. The experimental results show that the transfer effect of GRAD is better than that of the other methods; the F1 score reaches 92.99% at the highest, a relative increase of 12.89%. The method can effectively improve the performance of NER tasks in resource-scarce fields and alleviate negative transfer.
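
One standard way to set a shared feature extractor against a domain discriminator is a gradient reversal layer: the forward pass is the identity, while the backward pass negates gradients so the extractor learns domain-invariant features. Whether the paper uses gradient reversal or alternating GAN-style updates is not stated, so the sketch below is a generic illustration of the technique.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scaled, negated gradient in backward."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(features, lambd=1.0):
    # Insert between the shared extractor and the domain discriminator:
    #   domain_logits = discriminator(grad_reverse(shared_features))
    return GradReverse.apply(features, lambd)
```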

15 pages, 970 KiB  
Article
ENEX-FP: A BERT-Based Address Recognition Model
by Min Li, Zeyu Liu, Gang Li, Mingle Zhou and Delong Han
Electronics 2023, 12(1), 209; https://doi.org/10.3390/electronics12010209 - 01 Jan 2023
Viewed by 2150
Abstract
In e-commerce logistics, government registration, financial transportation, and other fields, communication addresses are required, and analyzing them is crucial. Address recognition faces various challenges because address text is freely written, has numerous aliases, and exhibits significant text similarity. To address these issues, this study presents ENEX-FP, an address recognition model consisting of an entity extractor (ENEX) and a feature processor (FP). The study uses adversarial training to enhance the model’s robustness, along with a hierarchical learning-rate setup and a learning-rate decay technique to enhance recognition accuracy. Compared with traditional named entity recognition models, our model achieves F1-scores of 93.47% and 94.59% on the evaluation datasets, demonstrating the ENEX-FP model’s effectiveness in recognizing addresses.
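
A hierarchical (per-layer) learning-rate setup of the kind mentioned above can be expressed as optimizer parameter groups. The sketch below assumes a BERT-style `encoder.layer.{i}` parameter naming scheme and illustrative rates; it is not the paper's configuration.

```python
import torch

def layered_optimizer(model, base_lr=2e-5, decay=0.95, num_layers=12):
    """Give deeper (later) layers the base rate and earlier layers
    geometrically smaller rates via per-parameter-group learning rates."""
    groups = []
    for name, param in model.named_parameters():
        layer = next((i for i in range(num_layers)
                      if f"encoder.layer.{i}." in name), num_layers)
        groups.append({"params": [param],
                       "lr": base_lr * (decay ** (num_layers - layer))})
    return torch.optim.AdamW(groups)
```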

11 pages, 719 KiB  
Article
Entity Factor: A Balanced Method for Table Filling in Joint Entity and Relation Extraction
by Zhifeng Liu, Mingcheng Tao and Conghua Zhou
Electronics 2023, 12(1), 121; https://doi.org/10.3390/electronics12010121 - 27 Dec 2022
Viewed by 1446
Abstract
The knowledge graph is an effective tool for improving natural language processing, but manually annotating enormous amounts of knowledge is expensive. Academics have therefore studied entity and relation extraction techniques, among which the end-to-end table-filling approach is a popular direction for joint entity and relation extraction. However, once the table has been populated in a uniform label space, a large number of null labels are generated within the array, causing label-imbalance problems that bias the model’s encoder toward predicting null labels and thus degrade generalization performance. In this paper, we propose a method to mitigate non-essential null labels in the matrices. This method uses a score matrix to calculate the count of non-entities and the percentage of non-essential null labels in the matrix, which is then projected by a power of the natural constant e to generate an entity-factor matrix that is incorporated into the scoring matrix. During back-propagation, the gradients of non-essential null-labeled cells in the entity factor layer shrink, with the amplitude of the shrinkage related to the size of the entity factor, thereby reducing the model’s feature learning on the large number of non-essential null labels. Experiments with two publicly available benchmark datasets show that incorporating entity factors significantly improved model performance, especially in the relation extraction task, by 1.5% on both datasets.
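
One plausible reading of the entity-factor construction is sketched below: the fraction of non-essential null cells is exponentiated (via e) into a per-cell damping factor applied to the score matrix, which shrinks gradients for those cells during back-propagation. The exact projection is our assumption, not the paper's formula.

```python
import numpy as np

def entity_factor_matrix(score_matrix, entity_mask):
    """Damp non-essential null-label cells by exp(-r), where r is the
    fraction of such cells; entity_mask is True for real-label cells."""
    null_cells = ~entity_mask
    ratio = null_cells.mean()                       # share of null-label cells
    factors = np.where(null_cells, np.exp(-ratio), 1.0)
    return score_matrix * factors                   # shrinks null-cell gradients
```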

25 pages, 950 KiB  
Article
Learning to Co-Embed Queries and Documents
by Yuehong Wu, Bowen Lu, Lin Tian and Shangsong Liang
Electronics 2022, 11(22), 3694; https://doi.org/10.3390/electronics11223694 - 11 Nov 2022
Cited by 1 | Viewed by 1033
Abstract
Learning to Rank (L2R) methods that utilize machine learning techniques to solve ranking problems have been widely studied in the field of information retrieval. Existing methods usually concatenate query and document features as training input, without an explicit understanding of the relevance between queries and documents, especially in the pairwise ranking approach. Thus, it is an interesting question whether we can devise an algorithm that effectively describes the relation between queries and documents to learn a better ranking model without incurring huge parameter costs. In this paper, we present a Gaussian Embedding model for Ranking (GERank), an architecture for co-embedding queries and documents, such that each query or document is represented by a Gaussian distribution with a mean and a variance. GERank optimizes an energy-based loss based on the pairwise ranking framework, and the KL-divergence is utilized to measure the relevance between queries and documents. Experimental results on two LETOR datasets and one TREC dataset demonstrate that our model obtains a remarkable improvement in ranking performance compared with state-of-the-art retrieval models.
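
The KL-divergence between two Gaussians has a closed form, which is what makes it a convenient (and asymmetric) relevance score between a query distribution and a document distribution. For diagonal covariances it reduces to the expression below; the function is a generic sketch, not GERank's code.

```python
import numpy as np

def kl_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for 1-D numpy arrays
    of means and (positive) variances of equal dimension d."""
    d = mu0.size
    return 0.5 * (np.sum(var0 / var1)
                  + np.sum((mu1 - mu0) ** 2 / var1)
                  - d
                  + np.sum(np.log(var1) - np.log(var0)))
```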
