Article

Short Text Classification Based on Hierarchical Heterogeneous Graph and LDA Fusion

Xinlan Xu, Bo Li, Yuhao Shen, Bing Luo, Chao Zhang and Fei Hao
1 School of Computer and Software Engineering, Xihua University, Chengdu 610039, China
2 Intelligent Policing Key Laboratory of Sichuan Province, Sichuan Police College, Luzhou 646000, China
3 School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(12), 2560; https://doi.org/10.3390/electronics12122560
Submission received: 26 April 2023 / Revised: 21 May 2023 / Accepted: 3 June 2023 / Published: 6 June 2023
(This article belongs to the Special Issue Industrial Artificial Intelligence: Innovations and Challenges)

Abstract

The proliferation of short texts resulting from the rapid advancement of social networks, online communication, and e-commerce has created a pressing need for short text classification in various applications. This paper presents a novel approach for short text classification that combines a hierarchical heterogeneous graph with latent Dirichlet allocation (LDA) fusion. Our method first models the short text dataset as a hierarchical heterogeneous graph, which incorporates additional syntactic and semantic information through a word graph, a parts-of-speech (POS) tag graph, and an entity graph. We then concatenated the representations of these three feature graphs to derive a comprehensive feature vector for each text. Finally, we used the LDA topic model to adjust the feature weights, enhancing the effectiveness of short text expansion. Our experiments demonstrated that the proposed approach performs promisingly on English short text classification, and although it is slightly inferior to the LDA + TF-IDF method on Chinese short text classification, it still achieved promising results.

1. Introduction

With the growth of Internet users and the rapid development of social networks, massive amounts of short text data are generated every day, such as product descriptions, user comments, and microblog posts, and an increasing number of short texts need to be classified. Maron [1] published the first paper on automatic text classification. However, traditional text classification methods are unsuitable for short text classification [2]. Therefore, how to extract the hidden information in short texts and apply it to practical tasks such as topic tracking [3], sentiment analysis [4], industrial equipment inspection [5], and personalized user recommendation [6] is a hot topic in the field of natural language processing.
It is more challenging to classify short texts than long texts [7]. One reason is that short texts contain fewer terms, so they provide little contextual information about their content. The other is that much of the data lack labels because of the high cost of manual labeling. Therefore, understanding short texts requires auxiliary knowledge, for example concepts found in common sense knowledge graphs [8,9], latent topics extracted from short text datasets [10], and entities in knowledge graphs [10]. Even so, merely enriching auxiliary knowledge is not enough to deal with the lack of labeled data [7,11]. In addition, training conventional deep models requires large-scale labeled data [12].
So far, one of the most effective approaches to short text classification is graph neural networks. Among them, the model with the best classification performance is SHINE [13]. It introduces more semantic and syntactic information by modeling short text datasets as hierarchical heterogeneous graphs composed of word-level component graphs. In addition, its dynamically learned short text graph easily transfers labels between similar short texts. However, it still does not fully solve problems such as the sparsity of short text features.
To address the above problems, we propose the SHINE + LDA algorithm, which improves the accuracy of short text classification by compensating for the semantic and syntactic information missing in short texts. First, we constructed three feature graphs: a word graph, a POS tag graph, and an entity graph. Then, we concatenated the representations of these feature graphs to obtain the feature vector of the text. Last but not least, we used the LDA topic model to adjust the feature weights of this vector. More specifically, the semantic computation over word vectors gives higher weight to significant features in short texts, and adjusting the feature weights of short texts alleviates the problem of sparse text features. Extensive experimental results showed that our proposed approach has a promising performance in English short text classification, while in Chinese short text classification, although slightly inferior to the LDA + TF-IDF method, it still achieved promising results.
The potential applications of short text classification are as follows:
1.
Short text classification can be used to analyze data from industrial equipment, helping to better understand the running state of the equipment and diagnose faults.
2.
Short text classification can be used to classify and cluster large-scale news reports and social media topics, and to analyze and predict the development and trends of events.
3.
Based on users' historical behaviors and interests, a short text corpus can be classified and matched to recommend personalized goods or services to users.
The organizational structure of the remainder of the paper is as follows. The related work is given in Section 2. Our approach is presented in Section 3. Section 4 presents the experimental results of our proposed method, including a comparison with existing methods and its application to the classification of short texts in English and Chinese. Finally, Section 5 summarizes the content of the paper and points out the deficiencies of the current research and the direction for future research.

2. Related Work

2.1. Short Text Classification

Short text classification is complicated [14]. Because of their restricted length, short texts lack context information and a strict syntactic structure, both of which are essential for text comprehension [8]. Therefore, it is crucial to extend short texts. Up to now, there have been two kinds of short text expansion methods: expansion based on external resources and expansion based on internal resources. Expansion based on external resources refers to expanding short texts using an external corpus; for example, Chen et al. [9] used external resources to obtain rich information for short text expansion. Expansion based on internal resources refers to expanding short texts using the context of the text itself to construct a word set based on the text content; for example, Bicalho et al. [15] proposed using word co-occurrences and word vectors to create a large pseudo-document from the original text for feature expansion. Nevertheless, enriching semantic information alone cannot compensate for the deficiency in labeled data, a common problem faced by short texts [11]. Consequently, GNN-based methods were born, which cast semi-supervised short text classification as node classification. HGAT [10] utilizes a GNN with dual-level attention to jointly pass messages among the topics, entities, and documents of a corpus-level graph, in which entities are words linked to knowledge graphs. FABG [16] first uses a bidirectional GRU (Bi-GRU) to learn the semantic information of the text; it then uses a full attention mechanism to learn the weights of the current and previous outputs of the Bi-GRU at each step, so that each step can obtain the necessary information and ignore irrelevant information, which improves the classification effect. BERT [17] is a bidirectional deep language model based on the Transformer, which can capture bidirectional context semantics. Zhang et al. [18] proposed ERNIE, a pre-training model based on BERT and knowledge graph fusion, which uses multi-information entities in the knowledge graph as external knowledge to improve language representation. Sitaula and Shahi [19] proposed a multi-channel convolutional neural network using hybrid features: each tweet is first represented by combining grammatical and semantic information, and then a new multi-channel convolutional neural network, which integrates multiple CNNs and can capture multi-scale information, is used to classify it. Sitaula et al. [20] used three different feature extraction methods, fastText-based, domain-specific, and domain-agnostic, to represent tweets; three different convolutional neural networks were then applied to the proposed features, and the three CNN models were finally integrated in an end-to-end way to obtain the final result. SHINE [13] proposes a GNN-based hierarchical heterogeneous graph consisting of word-level subgraphs. At present, the most advanced method for short text classification is SHINE, which is described in Section 2.2.

2.2. SHINE Model

SHINE [13] is a new hierarchical heterogeneous graph representation learning method for short text classification, which can fully capture the sparse semantic relationships between short texts. It constructs multiple heterogeneous graphs for text using two different composition methods: word-level component graphs and a short document graph. The former describe the interactions among words, POS tags, and entities; these component graphs are easy to extract and carry extra semantic and syntactic information that compensates for the absence of context information. The latter is dynamically learned and optimized to encode the similarity between short texts, thereby enabling more efficient label propagation between similar short texts.
However, SHINE does not perform particularly well in solving problems such as sparse short text features. Therefore, we propose the SHINE + LDA model, which introduces the LDA topic model on top of the SHINE model. It can make better use of semantic and syntactic information to extend short texts and adjusts the feature weights with the LDA topic model to improve the classification effect. The basic principle of LDA is described in Section 2.3.

2.3. LDA Topic Model

Blei [21] proposed the LDA topic model algorithm. The basic idea of this model is that each text is composed of multiple topics and each topic can be represented by a probability distribution over words. The LDA model can effectively map a text to a low-dimensional topic vector; the text features are then represented by the text's topic distribution vector. The LDA model is shown in Figure 1.
The symbols in Figure 1 are defined as follows: $M$ represents the total number of texts in the training corpus; $N$ represents the total number of words in a text; $K$ is the number of topics; $\theta$ is the text–topic distribution matrix; $\varphi$ is the topic–word distribution matrix; $\alpha$ and $\beta$ are the parameters of the Dirichlet distributions, with $\theta_i \sim \mathrm{Dirichlet}(\alpha)$ and $\varphi_k \sim \mathrm{Dirichlet}(\beta)$; and $W$ denotes the observable words in the text.
Suppose that $D$ is the text set of the dataset and $Doc_m$ is the $m$th text in $D$. $Doc_m$ is composed of the word sequence $(W_{m1}, W_{m2}, W_{m3}, \ldots, W_{mn})$, where $W_{mn}$ represents the $n$th word of the $m$th document. The modeling process of the LDA model is as follows:
1.
For each text $Doc_m \in D$, generate the topic distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$.
2.
For the $n$th word $W_{mn}$ of the text, generate its topic $Z_{mn}$ from the multinomial distribution $Z_{mn} \sim \mathrm{Mult}(\theta_m)$.
3.
Determine the topic–word distribution matrix $\varphi_k \sim \mathrm{Dirichlet}(\beta)$ with parameter $\beta$; according to topic $Z_{mn}$, select the corresponding word distribution $\varphi_{Z_{mn}}$.
4.
According to the word distribution $\varphi_{Z_{mn}}$, generate the word $W_{mn} \sim \mathrm{Mult}(\varphi_{Z_{mn}})$.
5.
Repeat Steps 1 to 4 for all words in the text to generate $Doc_m$, and repeat the above for all texts to generate the entire text set $D$.
The text generation process simulated by the LDA model yields the joint probability distribution of the variables shown in Formula (1):
$p(W, Z, \theta, \varphi \mid \alpha, \beta) = \prod_{m=1}^{|D|} p(W_{mn} \mid \varphi_{Z_{mn}}) \cdot p(Z_{mn} \mid \theta) \cdot p(\theta \mid \alpha) \cdot p(\varphi \mid \beta) \quad (1)$
where $W$ represents the observable words in the text, $Z$ is the topic, $\theta$ is the text–topic distribution matrix, $\varphi$ is the topic–word distribution matrix, $\alpha$ and $\beta$ are the parameters of the Dirichlet distributions (obtained by grid search), $D$ represents the document collection of the text dataset, $W_{mn}$ represents the $n$th word of the $m$th document, $\varphi_{Z_{mn}}$ represents the word distribution of the topic assigned to the $n$th word of the $m$th document, $Z_{mn}$ represents the topic of the $n$th word of the $m$th document, and $p(x \mid y)$ is the probability of event $x$ occurring given $y$.
The LDA model’s probability distribution of hidden variables is complicated, so estimation methods are often used to calculate it. The estimation methods include variational Bayesian inference and the Gibbs sampling algorithm. Among them, the advantage of the Gibbs sampling algorithm [22] is its fast computation speed. Additionally, it is widely used because it is easy to understand and implement. This paper also used this method to process the text data.
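To make this concrete, the following minimal sketch (in Python, using the gensim library) fits an LDA model on a toy corpus and reads out the topic distribution vector of a text. Note that gensim's LdaModel uses online variational Bayes rather than the collapsed Gibbs sampler discussed above, and the toy corpus, priors, and number of passes are illustrative assumptions rather than the exact configuration used in this paper.

```python
# Minimal sketch (not the paper's exact pipeline): fit LDA on a toy corpus and
# read out each document's topic distribution vector. gensim's LdaModel uses
# online variational Bayes rather than collapsed Gibbs sampling; it is used
# here only to illustrate the idea.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["graph", "neural", "network", "classification"],
    ["movie", "review", "sentiment", "classification"],
    ["stock", "market", "news", "headline"],
]  # toy, already-tokenized short texts (assumption)

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=15,        # K, as used for most datasets in Section 4.3
    alpha="auto",         # Dirichlet prior on the document-topic distribution
    eta="auto",           # Dirichlet prior on the topic-word distribution
    passes=10,
    random_state=0,
)

# Topic distribution vector of the first short text: [(topic_id, prob), ...]
topic_vector = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(topic_vector)
```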

3. Proposed Method

Currently, SHINE [13] performs well in short text classification. However, it still needs improvement in short text expansion: it cannot avoid invalid expansion, which reduces the effectiveness of extending short texts. Therefore, we propose a method that better compensates for the lack of context information in short text classification. To begin with, we modeled the short text dataset as a hierarchical heterogeneous graph consisting of word-level graphs. In addition, a GCN was used to obtain the node embeddings of the three feature graphs, and the feature vectors of the short texts were obtained by hierarchical pooling. Moreover, the LDA topic model was used to obtain the topic distribution vector of each short text. Furthermore, the feature vector of the short text and the topic distribution vector were weighted and fused to obtain a new feature vector. Finally, a GCN was used to learn the label probability distribution of the short text nodes. By combining words, topics, entities, and POS tags, we enriched the semantics of short texts, which significantly facilitates the classification task.

3.1. Word-Level Graph [13]

A word-level graph can use both semantic and syntactic information to better understand short texts. Short texts often do not contain enough contextual information to facilitate proper classification. Therefore, a word-level graph can be used to combine various kinds of information, such as words, entities, and POS tags, so as to enrich the semantics of short texts and further improve the classification task. The word-level graph is composed of the word graph, POS tag graph, and entity graph. The following sections describe how to build a word-level graph.

3.1.1. Node Embedding

Let $G_T = \{N_T, M_T\}$ denote a word-level graph of type $T$, where $N_T$ is the set of nodes, $M_T \in \mathbb{R}^{|N_T| \times |N_T|}$ is the adjacency matrix, and the node feature matrix is $X_T \in \mathbb{R}^{|N_T| \times a_T}$. The node embeddings $E_T$ are obtained using a two-layer graph convolutional network (GCN) [23]:
$E_T = \tilde{M}_T \cdot \mathrm{ReLU}(\tilde{M}_T X_T V_T^1) V_T^2 \quad (2)$
where $[\mathrm{ReLU}(x)]_i = \max([x]_i, 0)$, $\tilde{M}_T = B_T^{-1/2}(I + M_T)B_T^{-1/2}$ with $[B_T]_{ii} = \sum_j [M_T]_{ij}$, and $V_T^1, V_T^2$ are trainable parameters.
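As an illustration of Formula (2), the following sketch implements a two-layer GCN over a dense adjacency matrix in PyTorch; the layer sizes, the toy graph, and the class name are illustrative assumptions and not the exact SHINE implementation.

```python
# Minimal sketch of the two-layer GCN in Formula (2), assuming M_T is a dense
# adjacency matrix and X_T a node-feature matrix; sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.V1 = nn.Linear(in_dim, hidden_dim, bias=False)    # V_T^1
        self.V2 = nn.Linear(hidden_dim, out_dim, bias=False)   # V_T^2

    @staticmethod
    def normalize(M: torch.Tensor) -> torch.Tensor:
        # M~ = B^{-1/2} (I + M) B^{-1/2}, with B the degree matrix of M
        A = M + torch.eye(M.size(0))
        deg = M.sum(dim=1).clamp(min=1e-12)
        d_inv_sqrt = deg.pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

    def forward(self, M: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        M_tilde = self.normalize(M)
        H = F.relu(M_tilde @ self.V1(X))   # first GCN layer
        return M_tilde @ self.V2(H)        # node embeddings E_T


# Usage on a toy 4-node graph with one-hot node features
M_T = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
X_T = torch.eye(4)
E_T = TwoLayerGCN(in_dim=4, hidden_dim=8, out_dim=8)(M_T, X_T)
print(E_T.shape)  # torch.Size([4, 8])
```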

3.1.2. Graph Construction

According to the definition of a graph, $G_T = \{N_T, M_T\}$ represents a graph of type $T$, $N_T$ represents the set of nodes, and $M_T \in \mathbb{R}^{|N_T| \times |N_T|}$ represents the adjacency matrix. The construction of the graph needs to solve two problems:
1.
How to define node N:
The nodes are defined according to the features to be integrated. When merging POS information, N is the set of POS tags; when merging entity information, N is the set of entity labels.
2.
How to build adjacency matrix:
  • Pointwise mutual information (PMI) [24]:
    PMI is used to measure the correlation between the two variables. The calculation formula of PMI is as follows.
    $\mathrm{PMI}(x, y) = \log \dfrac{p(x, y)}{p(x)\,p(y)} = \log \dfrac{p(x \mid y)}{p(x)} = \log \dfrac{p(y \mid x)}{p(y)} \quad (3)$
    where x and y are two variables. The larger the PMI, the higher the correlation between the two variables.
  • Cosine similarity:
    The cosine similarity is the cosine value of the angle between two vectors in the vector space, which is used to measure the similarity between two vectors. The calculation formula of the cosine similarity is as follows.
    $\mathrm{Similarity}(x, y) = \cos\theta = \dfrac{x \cdot y}{\|x\|\,\|y\|} \quad (4)$
    where $x$ and $y$ are two vectors. The cosine function takes values in $[-1, 1]$, so the cosine similarity between two vectors also lies in $[-1, 1]$: when the angle between the two vectors is $0^\circ$, the similarity is 1, and when the angle is $180^\circ$, the similarity is $-1$. The larger the cosine value, the more similar the two samples are.
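For concreteness, the sketch below computes the two edge weights of Formulas (3) and (4); the co-occurrence counting scheme behind the probability estimates is an illustrative assumption.

```python
# Minimal sketch of the two edge-weight measures in Formulas (3) and (4);
# the co-occurrence counting scheme is an illustrative assumption.
import math
import numpy as np


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information from co-occurrence counts within sliding windows."""
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


print(pmi(count_xy=8, count_x=20, count_y=15, total=100))             # positive => correlated
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```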
Next, we describe how the word graph, POS tag graph, and entity graph are constructed:
1.
Word graph:
Build a word graph [13] $G_W = \{N_W, M_W\}$, where $G_W$ represents a graph of type $W$, $N_W$ represents the set of nodes, and $M_W \in \mathbb{R}^{|N_W| \times |N_W|}$ represents the adjacency matrix. Firstly, the short text is segmented into words using NLTK. Then, the word set is taken as $N_W$, and the PMI between any two words in the entire dataset is calculated to construct $M_W$:
$[M_W]_{ij} = \max(\mathrm{PMI}(n_{w_i}, n_{w_j}), 0) \quad (5)$
where $n_{w_i}, n_{w_j} \in N_W$. Then, Formula (2) is used to learn the node representations $E_W$ ($W$ is the graph type) on this graph, which yields the embedding of each word. The initial representations use one-hot encoding (a code sketch of this construction is given at the end of this subsection).
2.
POS tag graph:
Build a POS tag graph [13] $G_P = \{N_P, M_P\}$, where $G_P$ represents a graph of type $P$, $N_P$ represents the set of nodes, and $M_P \in \mathbb{R}^{|N_P| \times |N_P|}$ represents the adjacency matrix. The construction and node representation of this graph are similar to those of the word graph. The only difference is that the words in the original dataset are replaced with POS tags; the POS tag set provided by NLTK is used to obtain the POS tag of each word in the short text.
3.
Entity graph:
Build an entity graph [13] $G_e = \{N_e, M_e\}$, where $G_e$ represents a graph of type $e$, $N_e$ represents the set of nodes, and $M_e \in \mathbb{R}^{|N_e| \times |N_e|}$ represents the adjacency matrix. The node set consists of the entity types defined in the NELL knowledge base [25]. The adjacency matrix $M_e$ is calculated as follows.
$[M_e]_{ij} = \max(\mathrm{Similarity}(x_{e_i}, x_{e_j}), 0) \quad (6)$
Here, $x_e$ is the embedding of an entity type, learned by TransE [26], a classic knowledge graph embedding method. In general, the NELL knowledge base [25] is first used to identify the entities contained in each text. Then, TransE [26] is used to learn the embedding of each entity. Finally, the cosine similarity between two entities is computed to fill $M_e$.
Figure 2 shows that the word-level graph consists of three graphs. They are the word graph, entity graph, and POS tag graph.
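As referenced above, the following sketch illustrates the word-graph construction of Formula (5): it tokenizes a toy corpus with NLTK, counts co-occurrences in a sliding window (size 5, as in Section 4.3), and keeps only positive PMI values as edge weights. The corpus and counting details are illustrative assumptions.

```python
# Minimal sketch of word-graph construction (Formula (5)): tokenize with NLTK,
# count co-occurrences in a sliding window, keep only positive PMI values.
# The toy corpus and window size are illustrative; NLTK's 'punkt' data is assumed installed.
import math
from collections import Counter
from itertools import combinations

import nltk

corpus = [
    "stock markets rally after strong earnings",
    "new movie earns strong reviews",
    "markets fall as earnings disappoint",
]

tokenized = [nltk.word_tokenize(t.lower()) for t in corpus]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

window = 5
word_count, pair_count, n_windows = Counter(), Counter(), 0
for doc in tokenized:
    for start in range(max(1, len(doc) - window + 1)):
        win = doc[start:start + window]
        n_windows += 1
        for w in set(win):
            word_count[w] += 1
        for a, b in combinations(sorted(set(win)), 2):
            pair_count[(a, b)] += 1

# M_W[i][j] = max(PMI(w_i, w_j), 0)
M_W = [[0.0] * len(vocab) for _ in vocab]
for (a, b), c_ab in pair_count.items():
    pmi = math.log((c_ab / n_windows) / ((word_count[a] / n_windows) * (word_count[b] / n_windows)))
    M_W[index[a]][index[b]] = M_W[index[b]][index[a]] = max(pmi, 0.0)

print(len(vocab), "nodes;", sum(v > 0 for row in M_W for v in row) // 2, "positive-PMI edges")
```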

3.2. Short Text Graph

We dynamically learned the short text graph $G_S = \{N_S, M_S\}$, where $N_S$ represents the set of text nodes and $M_S$ represents the relation matrix between texts. To propagate labels more efficiently, we used hierarchical pooling to learn the similarity matrix $M_S$ between short texts from the word-level graph.
We used the above three feature graphs to represent each text node, and the similarity between two texts is calculated by the inner product of their representations.
$\hat{x}_T^i = u(E_T^\top S_T^i) \quad (7)$
Here, the subscript $T$ denotes the type of feature graph, $E_T$ is the node embedding matrix of the corresponding graph, and $u$ is the normalization function $u(x) = x / \|x\|_2$.
For the word graph and POS tag graph, $S_T^i$ contains the TF-IDF [27] value of each word or POS tag in text $i$. For the entity graph, the entry of $S_T^i$ equals 1 if the corresponding entity appears in text $i$ and 0 otherwise. $S_T^i$ thus characterizes the relationship between all nodes in the subgraph and the current text and is used to weight the node embeddings computed on the subgraph; the weighted embedding matrix represents text $i$. Finally, text $i$ is represented by concatenating the representations from the three feature graphs:
$\hat{x}_S^i = \hat{x}_W^i \oplus \hat{x}_P^i \oplus \hat{x}_e^i \quad (8)$
where $\oplus$ denotes vector concatenation.
According to the generation principle of the LDA topic model described in Section 2.3, the LDA topic model is used to perform topic modeling on the short text, and the short text is represented as a topic distribution vector. Then, the feature vector of the short text and the topic distribution vector are weighted and fused to obtain a new feature vector $x_S$.
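As a concrete illustration of this fusion step, the sketch below concatenates the three pooled graph representations (Formula (8)) and fuses the result with the LDA topic distribution vector. Since the exact fusion operator is not spelled out here, the projection of the topic vector and the scalar mixing weight are illustrative assumptions.

```python
# Minimal sketch of the fusion step, assuming the three pooled graph features
# and the LDA topic vector are already available. The projection of the topic
# vector and the scalar mixing weight `lam` are illustrative assumptions; the
# paper only states that the two vectors are weighted and fused.
import torch
import torch.nn as nn

d_graph, n_topics, lam = 8, 15, 0.5

x_w = torch.randn(d_graph)      # pooled word-graph representation (Formula (7))
x_p = torch.randn(d_graph)      # pooled POS-tag-graph representation
x_e = torch.randn(d_graph)      # pooled entity-graph representation
theta = torch.rand(n_topics)
theta = theta / theta.sum()     # LDA topic distribution vector of the text

x_hat = torch.cat([x_w, x_p, x_e])                # Formula (8): concatenation
project = nn.Linear(n_topics, x_hat.numel())      # map topic vector to same size (assumption)
x_s = lam * x_hat + (1.0 - lam) * project(theta)  # weighted fusion -> new feature vector

print(x_s.shape)  # torch.Size([24])
```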
The inner product is used to build the relation matrix $M_S$:
$[M_S]_{ij} = \begin{cases} (x_S^i)^\top x_S^j, & \text{if } (x_S^i)^\top x_S^j \geq \delta_S \\ 0, & \text{otherwise} \end{cases} \quad (9)$
where $\delta_S$ is the threshold for measuring text similarity and plays a sparsifying role, and $x_S^i$ is the short text feature learned from Formula (8).
Then, use the GCN to learn the label probability distribution of the document nodes.
$\hat{H}_S = \mathrm{softmax}\big(M_S \cdot \mathrm{ReLU}(M_S X_S W_S^1) \cdot W_S^2\big) \quad (10)$
where $W_S^1$ and $W_S^2$ are trainable parameters optimized by backpropagation, $X_S$ is the matrix of short text embeddings, $M_S$ is the adjacency matrix, $[\mathrm{softmax}(x)]_i = \exp([x]_i) / \sum_j \exp([x]_j)$, and $\mathrm{ReLU}(x) = \max(0, x)$. The classification loss is as follows.
$\Gamma = -\sum_{i \in \Gamma_l} (f_S^i)^\top \log(\hat{f}_S^i) \quad (11)$
where $\Gamma_l$ records the indices of the labeled short texts, $f_S^i$ is the one-hot ground-truth label vector of short text $i$, and $\hat{f}_S^i$ is its predicted label distribution taken from $\hat{H}_S$.
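The sketch below ties Formulas (9)–(11) together: it thresholds the inner products to build $M_S$, predicts label distributions with a softmax GCN, and computes the cross-entropy loss over the labeled texts only. The dimensions, threshold value, and tensor names are illustrative assumptions.

```python
# Minimal sketch of Formulas (9)-(11): build the thresholded similarity graph,
# predict label distributions with a softmax GCN, and compute the loss over the
# labeled texts only. Sizes, the threshold, and names are illustrative.
import torch
import torch.nn.functional as F

n_texts, d_feat, n_classes, delta_s = 6, 24, 3, 2.7

X_S = torch.randn(n_texts, d_feat)                    # fused short text features x_S
W1 = torch.randn(d_feat, 16, requires_grad=True)      # W_S^1
W2 = torch.randn(16, n_classes, requires_grad=True)   # W_S^2

# Formula (9): thresholded inner-product similarity matrix M_S
sim = X_S @ X_S.t()
M_S = torch.where(sim >= delta_s, sim, torch.zeros_like(sim))

# Formula (10): two-layer GCN with softmax output
H_hat = torch.softmax(M_S @ F.relu(M_S @ X_S @ W1) @ W2, dim=1)

# Formula (11): cross-entropy over the labeled texts only
labeled_idx = torch.tensor([0, 1])           # indices of labeled short texts
labels = torch.tensor([2, 0])                # their class labels
f = F.one_hot(labels, n_classes).float()     # one-hot ground-truth vectors
loss = -(f * torch.log(H_hat[labeled_idx] + 1e-12)).sum()
loss.backward()
print(float(loss))
```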
Figure 3 shows the natural language processing techniques used to build a heterogeneous corpus-level graph from the short text dataset. Figure 4 shows our proposed SHINE + LDA architecture, which aggregates hierarchically over the word-level component graphs to obtain the short text graph. First, it constructs three feature graphs: the word graph, POS tag graph, and entity graph. Then, it concatenates the representations of these three feature graphs to obtain the feature vectors of the texts. In the end, the LDA topic model is used to adjust the feature weights and update the weights of the short texts.
The algorithm for SHINE + LDA is shown in Algorithm 1.
Algorithm 1 Pseudocode for our algorithm.
Input: short text dataset $S$; word graph $G_W = \{N_W, M_W\}$, POS tag graph $G_P = \{N_P, M_P\}$, entity graph $G_e = \{N_e, M_e\}$ (see Section 3.1); sample-specific aggregation vectors $S_T^i$, where $T \in \{W, P, e\}$.
for $i = 1, 2, \ldots, I$ do
    for $T \in \{W, P, e\}$ do
        obtain the node embedding $E_T$ of $G_T$ through Formula (2);
    end for
    obtain the short text features $X_S$ by hierarchical pooling on the $G_T$ through Formula (8);
    update the feature weights of the short texts through the LDA topic model (see Section 2.3);
    obtain the short document embeddings from $G_S$ and perform class prediction through Formula (10);
    optimize the model parameters with respect to Formula (11) by backpropagation;
end for

4. Experiments

Each reported result was averaged over five runs.

4.1. Experimental Environment

The server was a Dell EMC PowerEdge R740; the CPU comprised two Intel Xeon Gold 5220R processors (2.2 GHz, 24 cores/48 threads each); the GPU was an NVIDIA RTX 3090; the memory was 384 GB; the software environment was Ubuntu 18.04.6, CUDA 11.4.0, Python 3.7, and PyTorch 1.2.

4.2. Datasets

We selected five short text datasets for the experiments. They were Twitter, Snippets, TagMyNews, MR, and Headlines Today.
Twitter: The dataset was provided by NLTK and is a vast corpus of tweets and replies [10].
Snippets: The dataset was published by Phan et al. [11] and is a web search fragment returned by a Google search.
TagMyNews: The dataset consists of English news headlines from Really Simple Syndication (RSS) feeds and was published by Hu et al. [10].
MR: a dataset of movie reviews [28].
Headlines Today: a dataset of daily news (The data come from today’s headline client. To download the dataset, please visit the website: https://github.com/BenDerPan/toutiao-text-classfication-dataset, accessed on 14 May 2018). We randomly selected 15,425 pieces of data.
Table 1 shows the feature information of the datasets. It includes the number of datasets, the average length in words, the number of categories, the number of training sets, and the proportion of training sets in parentheses.
The following preprocessing was performed on all datasets. We tokenized each sentence and removed non-English characters, stop words, and low-frequency words appearing fewer than five times in the corpus [29]. For each dataset, we randomly selected 40 labeled short texts per class; half were used as the training set and the other half as the validation set [10]. Following Kipf and Welling [23], all the remaining short texts were used as the test set and as unlabeled texts during training.
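A minimal sketch of this preprocessing is shown below; the toy corpus is illustrative, and NLTK's tokenizer and stop-word data are assumed to be installed.

```python
# Minimal sketch of the preprocessing described above: tokenize, drop stop words,
# non-English tokens, and words occurring fewer than five times. The toy corpus
# is illustrative; NLTK's 'punkt' and 'stopwords' data are assumed installed.
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

corpus = ["Markets rally after strong earnings!"] * 5 + ["Unrelated noisy tweet :)"]

stop = set(stopwords.words("english"))
tokenized = []
for text in corpus:
    tokens = [w.lower() for w in nltk.word_tokenize(text)]
    tokens = [w for w in tokens if re.fullmatch(r"[a-z]+", w) and w not in stop]
    tokenized.append(tokens)

freq = Counter(w for doc in tokenized for w in doc)
tokenized = [[w for w in doc if freq[w] >= 5] for doc in tokenized]  # drop rare words
print(tokenized[0], tokenized[-1])
```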

4.3. Hyperparameter Setting

We set our parameters according to those used in the comparison experiments. We set the entity embedding dimension $d_e$ to 100. For all datasets, we set the sliding window size for the PMI of $G_W$ and $G_P$ to 5, the embedding size of all GCN layers to 200, and the threshold $\delta_S$ of $G_S$ to 2.7. In LDA, we set the number of topics to $k = 15$ for the MR, TagMyNews, Headlines Today, and Twitter datasets and to $k = 20$ for Snippets. For all datasets, each text was assigned to the top $P = 2$ topics with the highest probability. We implemented the method in PyTorch and used Adam [30] to train the model for at most 1000 epochs at a learning rate of $5 \times 10^{-3}$. We stopped training early if the validation loss did not decrease for ten consecutive epochs. The dropout rate was set to 0.5.
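To make these optimization settings concrete, the sketch below wires them (Adam, learning rate $5 \times 10^{-3}$, at most 1000 epochs, dropout 0.5, early stopping with patience 10) into a generic PyTorch training loop; the tiny model and random data are placeholders, not the actual SHINE + LDA network.

```python
# Training-configuration sketch with the hyperparameters listed above (Adam,
# lr = 5e-3, up to 1000 epochs, dropout 0.5, early stopping after 10 epochs
# without validation improvement). The tiny model and random data are
# placeholders, not the actual SHINE + LDA network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(200, 200), nn.ReLU(), nn.Dropout(0.5), nn.Linear(200, 8))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3)
criterion = nn.CrossEntropyLoss()

X_train, y_train = torch.randn(64, 200), torch.randint(0, 8, (64,))
X_val, y_val = torch.randn(64, 200), torch.randint(0, 8, (64,))

best_val, patience, bad = float("inf"), 10, 0
for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()
    if val_loss < best_val - 1e-6:
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:   # stop early if validation loss stops improving
            break
```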

4.4. Evaluation Metrics

We evaluated the classification results with two commonly used evaluation metrics, accuracy (ACC) and the F1 value [31]:
1.
Accuracy: Accuracy (ACC) is the proportion of samples whose labels are correctly predicted out of all samples [32]:
$ACC = \dfrac{TP + TN}{TP + FP + FN + TN}$
Here, $TP$ is the number of texts predicted as the positive class that actually belong to the positive class, $FP$ is the number predicted as positive that actually belong to the negative class, $TN$ is the number predicted as negative that actually belong to the negative class, and $FN$ is the number predicted as negative that actually belong to the positive class.
2.
F1 value: The F1 value is the harmonic mean of precision and recall. Because precision and recall alone have difficulty fully reflecting the classification results, the F1 value was introduced as a comprehensive evaluation metric [33]:
$F1 = \dfrac{2PR}{P + R}$
where $P = \dfrac{TP}{TP + FP}$ and $R = \dfrac{TP}{TP + FN}$.
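In practice, both metrics can be computed directly, for example with scikit-learn, as in the toy sketch below; the labels and predictions are made up for illustration, and for the multi-class datasets in this paper a macro or weighted average of the per-class F1 would typically be reported.

```python
# Toy illustration of the ACC and F1 computations defined above, using
# scikit-learn; the labels and predictions are made-up values, and a macro
# average is used for the multi-class case (assumption).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 2, 2, 1]
y_pred = [1, 0, 0, 1, 0, 2, 1, 1]

print("ACC:", accuracy_score(y_true, y_pred))
print("F1 :", f1_score(y_true, y_pred, average="macro"))
```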

4.5. Compared Methods

LDA + SVM is a short text classification method based on the topic model and support vector machine. First, LDA is used to model the topic of the text, and then, the topic features are input into SVM for classification. It can capture the underlying topic information in the text, but it requires much preprocessing and hyperparameter adjustment and may not handle lexical diversity and context information well. HGAT is a short text classification method based on a graph attention network. This method also represents text as nodes and forms a graph and then uses the graph attention network for classification. It can better capture the relationships between texts and assign different weights to each node, but it is also computationally complex for large datasets. SHINE is a novel hierarchical heterogeneous graph representation learning method, which can effectively learn from hierarchical graphs modeling different short text data. SHINE can better utilize the interaction between nodes of the same type and capture the similarities between short texts. However, it still needs improvement in short text extension. SHINE + LDA adds the LDA topic model to solve the problem of sparse features in short text extension.
The proposed SHINE + LDA method was compared with the above methods.
According to the results in Table 2, although the differences between the SHINE model and the SHINE + LDA model on the Twitter and Snippets datasets are small, the SHINE + LDA model achieved better ACC and F1 scores than the other compared models on all four datasets. This indicated that the SHINE + LDA model can make better use of semantic and syntactic information to extend short texts and adjust the feature weights through the LDA topic model to improve the classification effect.

4.6. Model Sensitivity on Twitter

Figure 5 depicts the effect of the threshold $\delta_S$ on short text classification performance. The performance first improved as $\delta_S$ increased, decreased after a certain point, rose again over a small range, and finally dropped sharply when $\delta_S$ became large. When $\delta_S$ is too small, $G_S$ may become noisy, resulting in lower performance; when $\delta_S$ is too large, $G_S$ may become too sparse and lose its proper functionality, reducing the performance of short text classification. Figure 6 depicts the impact of the GCN embedding size on short text classification performance. The performance first improved as the embedding size increased, then fluctuated over intermediate values, and eventually declined for the largest sizes. If the GCN embedding size is too small, it may not capture enough neighbor node information, resulting in performance degradation; conversely, if it is too large, it may lead to overfitting or increased computational complexity, which also affects performance.

4.7. Application to Chinese

The difference between English and Chinese is that Chinese text has no spaces between words to mark word boundaries. Therefore, Chinese word segmentation is a basic module and the first step in various natural language processing systems. Numerous Chinese word segmentation methods exist; this paper uses jieba segmentation.

4.7.1. Jieba Segmentation

Jieba word segmentation is based primarily on a statistical dictionary, from which a prefix dictionary is constructed. Jieba uses the prefix dictionary to scan the input sentence and obtain all possible segmentations, constructs a directed acyclic graph from the segmentation positions, and then calculates the maximum-probability path with a dynamic programming algorithm to determine the final segmentation. For out-of-vocabulary words, jieba uses an HMM model based on the composition states of Chinese characters together with the Viterbi algorithm.
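A minimal usage sketch of jieba is shown below; the example sentence is illustrative.

```python
# Minimal jieba usage sketch: segment a Chinese headline into words before
# feeding it to the same graph-construction pipeline used for English.
# The example sentence is illustrative.
import jieba

headline = "人工智能技术推动短文本分类研究"  # "AI technology advances short text classification research"
tokens = jieba.lcut(headline)              # precise mode; returns a list of words
print(tokens)
```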

4.7.2. Experimental Result

We applied our method to the Headlines Today dataset to test its performance on Chinese short text classification. The proposed SHINE + LDA method was compared with the methods listed in Table 3.
According to the results in Table 3, the SHINE + LDA model was better than the LDA + SVM and LDA + KNN models on the ACC and F1 indexes, but slightly worse than the LDA + TF-IDF model. This showed that the SHINE + LDA model can make better use of semantic and syntactic information to extend short texts and adjust the feature weights through the LDA topic model to improve classification; however, the model still has some shortcomings, so its ability to capture text content information needs to be strengthened. All in all, the results showed that the SHINE + LDA model still had room for improvement in Chinese short text classification, but it is feasible and practical.

5. Conclusions

In this paper, we proposed a new hierarchical heterogeneous graph representation learning method for short text classification, which is particularly useful for making up for the lack of context information. In particular, it can effectively learn from hierarchical graphs that model short text datasets from different perspectives. The word-level component graphs are used to understand short texts from the perspectives of semantics and syntax, and the dynamically learned short text graph allows efficient and effective label propagation among similar short texts. The method also assigns higher weights to essential features in short texts by using LDA for word vector semantic computation. Extensive experimental results showed that this method was consistently superior to state-of-the-art methods on four benchmark datasets. Moreover, its application to Chinese short text classification also achieved good results.
However, this method also had some limitations. On the one hand, because this method required constructing multiple heterogeneous graphs and processing them with algorithms such as the GCN and LDA topic model, its computational complexity was high. This increased the time and space overhead of training and testing the model, which may limit its use on large-scale datasets. On the other hand, this method used a hierarchical heterogeneous graph and LDA topic model to fuse various information. However, this fusion method may ignore some important information of the original text and may have problems such as model overfitting because the fusion method is not flexible enough. Therefore, this fusion approach may have limitations in more complex scenarios.
The future work of this paper can be summarized as follows:
1.
Improved algorithm efficiency and accuracy:
Future studies could continue to optimize the algorithm and improve its classification efficiency and accuracy.
2.
Explore other text feature fusion methods:
The algorithm fuses the hierarchical heterogeneous graph and LDA topic model to process text features, and future studies can explore the fusion methods of other text features.

Author Contributions

Methodology, B.L. (Bo Li); Formal analysis, B.L. (Bing Luo); Resources, C.Z.; Data curation, Y.S.; Writing—original draft, X.X.; Writing—review & editing, F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Science and Technology Program of Sichuan Province, China (Grant No. 2023YFG0264) and the Opening Project of Intelligent Policing Key Laboratory of Sichuan Province (Grant Nos. ZNJW2022KFMS004, ZNJW2022KFQN002, ZNJW2023KFQN007).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this paper:
LDA    Latent Dirichlet allocation
POS    Parts-of-speech
PMI    Pointwise mutual information
ACC    Accuracy

References

  1. Maron, M.E. Automatic Indexing: An Experimental Inquiry. J. ACM 1961, 8, 404–417. [Google Scholar] [CrossRef]
  2. Vo, D.T.; Ock, C.Y. Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Syst. Appl. 2015, 42, 1684–1698. [Google Scholar] [CrossRef]
  3. Du, Y.; Yi, Y.; Li, X.; Chen, X.; Fan, Y.; Su, F. Extracting and tracking hot topics of micro-blogs based on improved Latent Dirichlet Allocation. Eng. Appl. Artif. Intell. 2020, 87, 103279. [Google Scholar] [CrossRef]
  4. Kilimci, Z.H.; Omurca, S.I. Extended Feature Spaces Based Classifier Ensembles for Sentiment Analysis of Short Texts. Inf. Technol. Control 2018, 47, 457–470. [Google Scholar] [CrossRef] [Green Version]
  5. Zhu, L.; Tian, N.; Li, W.; Yang, J. A Text Classification Algorithm for Power Equipment Defects Based on Random Forest. Int. J. Reliab. Qual. Saf. Eng. 2022, 29, 2240001. [Google Scholar] [CrossRef]
  6. Chen, H.Y. Personalized recommendation system of e-commerce based on big data analysis. J. Interdiscip. Math. 2018, 21, 1243–1247. [Google Scholar] [CrossRef]
  7. Peng, H.; Li, J.; He, Y.; Liu, Y.; Bao, M.; Wang, L.; Song, Y.; Yang, Q. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018. [Google Scholar]
  8. Wang, J.; Wang, Z.; Zhang, D.; Yan, J. Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017. [Google Scholar]
  9. Chen, J.; Hu, Y.; Liu, J.; Xiao, Y.; Jiang, H. Deep Short Text Classification with Knowledge Powered Attention. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’19/IAAI’19/EAAI’19), Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Washington, DC, USA, 2019. [Google Scholar] [CrossRef] [Green Version]
  10. Yang, T.; Hu, L.; Shi, C.; Ji, H.; Li, X.; Nie, L. HGAT: Heterogeneous Graph Attention Networks for Semi-Supervised Short Text Classification. ACM Trans. Inf. Syst. 2021, 39, 1–29. [Google Scholar] [CrossRef]
  11. Phan, X.H.; Nguyen, L.M.; Horiguchi, S. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections. In Proceedings of the 17th International Conference on World Wide Web (WWW’08), Beijing, China, 21–25 April 2008; pp. 91–100. [Google Scholar] [CrossRef]
  12. Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16), New York, NY, USA, 9–15 July 2016; AAAI Press: Washington, DC, USA, 2016; pp. 2873–2879. [Google Scholar]
  13. Wang, Y.; Wang, S.; Yao, Q.; Dou, D. Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 3091–3101. [Google Scholar] [CrossRef]
  14. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Shallow to Deep Learning. arXiv 2020, arXiv:2008.00364. [Google Scholar]
  15. Bicalho, P.V.; Pita, M.; Pedrosa, G.; Lacerda, A.M.; Pappa, G.L. A general framework to expand short text for topic modeling. Inf. Sci. 2017, 393, 66–81. [Google Scholar] [CrossRef]
  16. Tang, Q.; Li, J.; Chen, J.; Lu, H.; Du, Y.; Yang, K. Full Attention-Based Bi-GRU Neural Network for News Text Classification. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; pp. 1970–1974. [Google Scholar]
  17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1441–1451. [Google Scholar] [CrossRef] [Green Version]
  19. Sitaula, C.; Shahi, T.B. Multi-channel CNN to classify nepali covid-19 related tweets using hybrid features. arXiv 2022, arXiv:2203.10286. [Google Scholar]
  20. Sitaula, C.; Basnet, A.; Mainali, A.; Shahi, T.B. Deep Learning-Based Methods for Sentiment Analysis on Nepali COVID-19-Related Tweets. Comput. Intell. Neurosci. 2021, 2021, 2158184. [Google Scholar] [CrossRef] [PubMed]
  21. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  22. Porteous, I.R.; Newman, D.; Ihler, A.T.; Asuncion, A.U.; Smyth, P.; Welling, M. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the KDD, Las Vegas, NV, USA, 24–27 August 2008. [Google Scholar]
  23. Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2017, arXiv:1609.02907. [Google Scholar]
  24. Church, K.W.; Hanks, P. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 26–29 June 1989; pp. 76–83. [Google Scholar] [CrossRef]
  25. Carlson, A.; Betteridge, J.; Kisiel, B.; Settles, B.; Hruschka, E.R.; Mitchell, T.M. Toward an Architecture for Never-Ending Language Learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI’10), Atlanta, Georgia, 11–15 July 2010; AAAI Press: Washington, DC, USA, 2010; pp. 1306–1313. [Google Scholar]
  26. Bordes, A.; Usunier, N.; Garcia-Durán, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-Relational Data. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS’13), Red Hook, NY, USA, 5–10 December 2013; Volume 2, pp. 2787–2795. [Google Scholar]
  27. Aggarwal, C.C.; Zhai, C. A Survey of Text Classification Algorithms. In Mining Text Data; Springer: Boston, MA, USA, 2012. [Google Scholar]
  28. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the ACL, Ann Arbor, MI, USA, 25–30 June 2005. [Google Scholar]
  29. Yao, L.; Mao, C.; Luo, Y. Graph Convolutional Networks for Text Classification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (AAAI’19/IAAI’19/EAAI’19), Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Washington, DC, USA, 2019. [Google Scholar] [CrossRef] [Green Version]
  30. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980. [Google Scholar]
  31. Linmei, H.; Yang, T.; Shi, C.; Ji, H.; Li, X. Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4821–4830. [Google Scholar] [CrossRef] [Green Version]
  32. Ikonomakis, E.; Kotsiantis, S.; Tampakas, V. Text classification: A recent overview. In Proceedings of the 9th WSEAS International Conference on Data Networks, Communications, Computers (DNCOCO’10), Faro, Portugal, 3–5 November 2005; p. 125. [Google Scholar]
  33. Yang, Y.; Liu, X. A Re-Examination of Text Categorization Methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), Berkeley, CA, USA, 15–19 August 1999; pp. 42–49. [Google Scholar] [CrossRef]
Figure 1. The LDA model.
Figure 2. Word-level graph.
Figure 3. Graph construction.
Figure 4. Framework of SHINE + LDA.
Figure 5. Varying $\delta_S$.
Figure 6. Varying the embedding size of the GCN.
Table 1. Characteristics of datasets.
Name | Quantity | Average Length in Words | Classes | Train (Ratio)
Twitter | 10,000 | 3.5 | 2 | 40 (0.40%)
Snippets | 12,340 | 14.5 | 8 | 160 (1.30%)
TagMyNews | 32,549 | 5.1 | 7 | 140 (0.43%)
MR | 10,662 | 7.6 | 2 | 40 (0.38%)
Headlines Today | 15,425 | 19.2 | 15 | 300 (1.94%)
Table 2. Training results compared with some English short text classification methods.
Model | Twitter ACC | Twitter F1 | Snippets ACC | Snippets F1 | TagMyNews ACC | TagMyNews F1 | MR ACC | MR F1
SHINE | 72.54% | 72.19% | 82.39% | 81.62% | 62.50% | 56.21% | 64.58% | 63.89%
HGAT | 63.21% | 62.48% | 82.36% | 74.44% | 61.72% | 53.81% | 62.75% | 62.36%
LDA + SVM | 54.34% | 53.97% | 62.54% | 56.40% | 40.40% | 30.40% | 54.40% | 48.39%
SHINE + LDA | 73.17% | 73.17% | 84.45% | 84.45% | 72.75% | 72.71% | 73.00% | 69.24%
Table 3. Training results compared with some Chinese short text classification methods.
Model | ACC | F1
LDA + SVM | 61.2% | 59.4%
LDA + KNN | 60.1% | 57.7%
LDA + TF-IDF | 86.8% | 87.1%
SHINE + LDA | 84.2% | 83.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

