Article

BiTTM: A Core Biterms-Based Topic Model for Targeted Analysis

1
School of Information and Engineering, Sichuan Tourism University, Chengdu 610100, China
2
Centre for Artificial Intelligence, University of Technology Sydney, P.O. Box 123, Broadway, NSW 2007, Australia
3
School of Computer Science and Information Engineering, Hefei University of Technology, Tunxi Road, Hefei 230009, China
4
Mininglamp Academy of Sciences, Mininglamp Technology, Beijing 100084, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(21), 10162; https://doi.org/10.3390/app112110162
Submission received: 9 October 2021 / Revised: 25 October 2021 / Accepted: 26 October 2021 / Published: 29 October 2021
(This article belongs to the Special Issue Soft Computing Application to Engineering Design)

Abstract

While most existing topic models perform a full analysis on a set of documents to discover all topics, it has recently been noticed that in many situations users are interested only in fine-grained topics related to some specific aspects. As a result, targeted analysis (or focused analysis) has been proposed to address this problem. Given a corpus of documents from a broad area, targeted analysis discovers only topics related to user-interested aspects that are expressed by a set of user-provided query keywords. Existing approaches for targeted analysis suffer from problems such as topic loss and topic suppression because of their inherent assumptions and strategies. Moreover, existing approaches are not designed to address computational efficiency, although targeted analysis is supposed to respond to user queries as soon as possible. In this paper, we propose a core BiTerms-based Topic Model (BiTTM). By modelling topics from core biterms that are potentially relevant to the target query, BiTTM on the one hand captures the context information across documents to alleviate the problem of topic loss or suppression; on the other hand, it enables the efficient modelling of topics related to specific aspects. Our experiments on nine real-world datasets demonstrate that BiTTM outperforms existing approaches in terms of both effectiveness and efficiency.

1. Introduction

Topic modelling as unsupervised learning has become a prevalent text mining tool for discovery of hidden semantic structures in a text body. Given a collection of documents, most of the existing topic models perform a full analysis to discover all topics occurring in the corpus. However, it was recently noticed [1] that in many situations users are interested in focused topics related to some specific aspects only. For example, given a set of Amazon product reviews, a user might be interested only in bedding products. A conventional topic model performing full analysis will identify all topics from the entire corpus such as “furniture”, “food” and “clothing”. Although the topic of “furniture” is related to the user interested aspect of “bedding products”, it is too coarse as the user might be more interested in fine-grained topics like “bed frames” and “mattress”. As a result, targeted (or focused) analysis is proposed by Wang et al. [1] to discover topics relevant to targeted aspects only. Particularly, given a corpus of documents from a broad domain and a set of user-provided keywords representing user-interested aspects, targeted analysis aims to discover topics related with the queried aspects only.
Methods for targeted analysis can be generally categorised into two groups: (1) conventional topic models incorporating filtering strategies and (2) specialised topic models. However, methods of both categories suffer from problems such as topic loss and topic suppression, because of the limitations of their respective assumptions and strategies.
For algorithms in the first group, both pre-filtering and post-filtering strategies can be adopted to enable full-analysis topic models to find topics related to queried aspects. The pre-filtering strategy retains only documents containing the query keywords and extracts topics from the retained “partial data”. The quality of the discovered topics thus depends heavily on the user-supplied query keywords. If the keywords are not appropriate or comprehensive enough, many relevant documents will be filtered out, which incurs a significant topic loss. For example, if a user provides “bath” as a query keyword, documents that lack the keyword but contain synonyms like “shower” or similar words like “bathtub” will be filtered out although they are actually relevant. Consequently, topics are very likely to be lost when modelling from the retained partial data. The post-filtering strategy applies a conventional topic model to first identify all topics in the corpus and then filters out the topics that do not contain the query keywords. However, as analysed in [1], such a strategy may result in topic suppression when the query keywords are infrequent in the corpus. Topic suppression means that topics related to the user-interested aspect are suppressed by general topics.
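The two filtering strategies can be illustrated with a minimal sketch. It assumes that documents and topics are already available as lists of tokens; the function names `pre_filter` and `post_filter` are hypothetical and are not taken from any of the cited models.

```python
def pre_filter(docs, keywords):
    """Pre-filtering: keep only documents containing at least one query keyword."""
    kws = set(keywords)
    return [d for d in docs if kws & set(d)]


def post_filter(topics, keywords):
    """Post-filtering: keep only topics whose top words contain a query keyword."""
    kws = set(keywords)
    return [t for t in topics if kws & set(t)]


docs = [
    ["warm", "bath", "towel"],
    ["shower", "curtain", "bathtub"],  # relevant, but lacks the exact keyword
    ["crib", "mattress"],
]
kept_docs = pre_filter(docs, ["bath"])

# Top-word lists that a full-analysis model might return (illustrative only).
topics = [["bath", "soap"], ["food", "snack"]]
kept_topics = post_filter(topics, ["bath"])
```

In this toy example the second document is dropped by pre-filtering even though it is about the same aspect, which is the topic loss described above.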
For algorithms in the second group, TTM [1] is the first and the state-of-the-art. TTM is a sparse topic model designed to directly mine focused topics based on user-provided query keywords. TTM simulates two topic-word distributions: $\phi^{r}$ for relevant topics and $\phi^{ir}$ for irrelevant topics. It considers documents at the sentence level and introduces a variable $r$ to indicate the status of a sentence (e.g., relevant or irrelevant). Words are then sampled from $\phi^{r}$ or $\phi^{ir}$ according to the sentence status. Although TTM can accomplish the targeted analysis to a certain extent, the effectiveness of TTM is handicapped by its scheme of processing at the sentence level and its assumption that each sentence focuses on only one topic. By considering sentences individually and separately, topic information between consecutive sentences may be lost, which results in inferior topic qualities and possible topic loss. By assuming that each sentence is related with only one aspect, it is very likely for TTM to mistakenly assign relevance status for sentences related with multiple topics, which is often the case for long sentences. The wrong assignment of sentence statuses will in turn lead to possible missing of meaningful topics.
A common challenge faced by algorithms of both categories is computational efficiency. While full analysis of topics is largely performed offline, targeted analysis is more likely to be an online module that is supposed to respond to user queries as soon as possible. However, existing algorithms for targeted analysis, especially the post-filtering strategy and the specialised topic models, are not devised to address this issue. The pre-filtering strategy may gain efficiency by modelling topics from a reduced set of “partial data”, but it achieves this at the cost of losing important topics.
To address the aforementioned issues, we propose a novel Core BiTerm-based Topic Model (BiTTM) for targeted analysis, which directly models fine-grained topics related to the queried aspect from a set of core biterms. A biterm, as proposed in BTM [2], is a word pair consisting of two different words that appear together in a fixed-size window, and it represents co-occurrence information. Building on biterms, we introduce core biterms as a set of selected biterms that have strong connections with the query keywords. By modelling topics from the set of core biterms, BiTTM is expected to achieve better performance than existing specialised topic models in the following respects:
1. The existing specialised topic models for targeted analysis (i.e., TTM and APSUM [3]) process at either the sentence level or the word level so that the semantic information between consecutive sentences will be lost. In contrast, since a biterm may consist of two words coming from two successive sentences, information across the whole document can be captured by BiTTM to alleviate the issue of losing topics.
2. The TTM model samples relevance status at the sentence level which may be too coarse. When a sentence is related to multiple topics, it would be difficult to infer the relevance status of the sentence as a binary value. In contrast, the APSUM model [3] samples relevance status for individual words which may be too specific, because it cannot handle phrases that make sense when multiple words are considered together. Biterms, as a scheme in-between sentences and words, are expected to achieve more accurate inference of relevance status.
3. Existing specialised topic models have no mechanism to accelerate the computation without a significant loss of semantic information. In contrast, BiTTM introduces a heuristic preprocessing step based on core biterms that speeds up topic modelling while alleviating information loss, which makes it a more pragmatic solution for targeted analysis according to user queries.
To comprehensively evaluate the performance of BiTTM, extensive experiments have been conducted on real-world datasets including short texts, medium texts and long texts. Moreover, we select a large number of targets with different word and document frequencies to explore the adaptability of BiTTM to various types of queries. The experimental results show that (1) BiTTM improves the quality of topics, alleviates topic loss, and outperforms the baselines, especially for query keywords of low frequency; (2) the time cost of BiTTM is the lowest and the most stable among the compared methods, which demonstrates the high applicability of BiTTM on datasets with different characteristics.
The remainder of this paper is organised as follows. Prior research and related works are reviewed in Section 2. We provide technical details of BiTTM in Section 3, and discuss the experimental results in Section 4. Finally, Section 5 closes this paper with some conclusive remarks.

2. Related Work

In this section, we introduce works related to our research in three parts. Firstly, we review existing specialised topic models for targeted topic analysis. Secondly, we describe the model of BTM that introduces the concept of biterms for topic modelling. Thirdly, we discuss other topic models relevant to our proposed BiTTM.

2.1. Targeted Topic Models

Specialised topic models for targeted analysis are still rare and are mainly used for information retrieval [4], abstract extraction [3,5] and opinion mining [1,6]. TTM [1] and APSUM [3] are the two most representative models.
Wang et al. [1] were the first to study the problem of detecting relevant and user-concerned topics from a given dataset, and they propose the model TTM as illustrated in Figure 1. The main idea of TTM is to introduce a relevance variable $r$ to indicate whether a sentence is related to a specified aspect. The variable $r$ determines whether each word in a sentence is generated by a relevant topic or an irrelevant topic. Moreover, the relevant topic-word distribution $\varphi^{r}$ is sparse because the number of words related to the target is usually smaller than the number of irrelevant words.
The steps of generative process are illustrated as follows:
  • Draw $\varphi^{ir} \sim \mathrm{Dirichlet}(\beta^{ir})$.
  • For each relevant topic $t \in \{1, 2, \ldots, T\}$:
    (a) Draw $\omega_{t} \sim \mathrm{Beta}(p, q)$.
    (b) For each word $v \in \{1, 2, \ldots, V\}$:
        i. Draw $\beta^{r}_{t,v} \sim \mathrm{Bernoulli}(\omega_{t})$.
    (c) Draw $\varphi^{r}_{t} \sim \mathrm{Dirichlet}(\beta^{r}_{t}\delta + \epsilon)$.
  • For each document $m \in \{1, 2, \ldots, M\}$:
    (a) Draw $\pi_{m} \sim \mathrm{Beta}(\gamma)$.
    (b) Draw relevance status $r$ based on keyword indicator $x$ and $\mathrm{Bernoulli}(\pi_{m})$.
    (c) If the document is relevant:
        i. Draw $z \sim \mathrm{Multinomial}(\theta^{r})$.
        ii. Draw $w_{i} \sim \mathrm{Multinomial}(\varphi^{r}_{z})$.
    (d) If the document is irrelevant:
        i. Draw $w_{i} \sim \mathrm{Multinomial}(\varphi^{ir})$.
Therefore, TTM considers the status r at the sentence-level. It is difficult to determine whether a sentence is related to the target when a sentence contains multiple topics. The wrong assignment of sentence status will negatively affect the quality of topics.
APSUM [3] is a generative aspect summarisation model designed for fine-grained summaries of online reviews. Compared with TTM, APSUM is different in terms of the following two aspects. Firstly, while TTM models the relevance at the sentence level, APSUM considers at the word level. As discussed in Section 1, the former might be too coarse to determine the relevance status for sentences appropriately; the latter is not able to handle phrases where it makes sense only when multiple words are considered together. Secondly, APSUM introduces an additional component called document aggregator to mitigate the issue of aspect sparsity, which refers to the circumstances where there are not enough text data related with specific aspects. The basic idea is to cluster similar documents through document aggregator and sample topics for documents at the document aggregator level.
Essentially, both TTM and APSUM try to identify potentially related words that can serve as bridges to link relevant documents, especially those without containing query keywords. TTM searches such related words from sentences containing query keywords; APSUM attempts from semantically similar documents through document aggregator. However, both models fail to capture the semantic information between neighbouring sentences, which does exist in natural language [7].

2.2. BTM

BTM [2] is a topic model for short texts. To alleviate the problem of insufficient information with short texts, this model extracts topics by modelling from biterms representing word co-occurrences. A biterm is an unordered word pair consisting of two different words in a fixed-length text window. BTM replaces documents with a biterm set that can reveal the correlation between words in depth.
The graphical model of BTM is depicted in Figure 2 with a generative process as follows:
  • Draw $\theta \sim \mathrm{Dirichlet}(\alpha)$.
  • For each topic $k \in \{1, 2, \ldots, K\}$:
    (a) Draw $\phi_{k} \sim \mathrm{Dirichlet}(\beta)$.
  • For each biterm $b_{i} \in B$:
    (a) Draw $z_{i} \sim \mathrm{Multinomial}(\theta)$.
    (b) Draw $w_{i,1}, w_{i,2} \sim \mathrm{Multinomial}(\phi_{z_{i}})$.
BTM is designed for full analysis on short texts. While BTM uses the concept of biterms to capture word co-occurrences and thereby alleviate data sparsity, we borrow the idea in our targeted analysis model to capture words closely related to the query keywords provided by users.
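Biterm extraction itself is simple to sketch. The snippet below assumes a tokenised document and a window size counted in tokens; the helper name `extract_biterms` is hypothetical, and the exact windowing used by BTM or BiTTM may differ.

```python
from collections import Counter


def extract_biterms(tokens, window=3):
    """Return unordered pairs of distinct words that co-occur within
    `window` consecutive tokens (one pair per pair of positions)."""
    biterms = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                # Sort so that (a, b) and (b, a) are the same biterm.
                biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


tokens = ["baby", "bath", "tub", "bath"]
biterms = extract_biterms(tokens, window=3)
counts = Counter(biterms)  # biterm frequencies
```

Because the window slides over the whole token sequence, a biterm can pair words from two neighbouring sentences when the document is tokenised without sentence breaks.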

2.3. Other Topic Models

Topic models have been widely studied and used in different applications, such as analysing public opinions and trends [8,9,10,11,12], providing personalised user services [13,14,15] and news tracking [16,17,18]. Among a wide range of existing topic models, in this subsection, we discuss two types of topic models that are related with our BiTTM.
Firstly, since we consider user queried aspects, our model is linked with topic models considering user information (e.g., user profiles and behaviours). In the literature, there are many topic models that take into account user information to obtain in-depth analysis [19,20,21,22,23,24]. For example, Viet et al. exploit users’ browsing histories to propose a keyword-topic model [19] for contextual advertising. Kalyanam et al. [20] simultaneously consider textual data and user behaviours, such as forwarding and commenting, to explore the evolution of topics. Sordo et al. [21] consider the topological changes of users’ co-authorship network to identify groups of researchers. Although these models can incorporate user information into topic analysis, they cannot extract fine-grained topics related with user-interested specific aspects.
Secondly, BiTTM is also connected with sparse topic models [25,26,27,28,29,30,31,32]. The notable feature of these models is the consideration of distribution skewness that can be divided into two categories. Firstly, a document is related with only a few topics among all topics available in the data set. Secondly, a topic involves only a small part of the dictionary. A lot of sparse topic models have been devised based on the two types of distribution skewness. For example, Williamson et al. [30] and Chen et al. [28] address the document skewness, while Wang et al. [33] take into account the topic skewness. Moreover, the method called “dual-sparse topic model” [25] implements both types of skewness simultaneously. Generally speaking, the sparsity is addressed by incorporating the “Spike and Slab” priors: the “Spike” is used to control the selections of words; and the “Slab” is used to smooth distributions to avoid ill-defined distributions where some words never appear. Our BiTTM also considers both document skewness and topic skewness through the spike and slab. Differently, we address the sparsity by taking into account user-interested aspects at the same time.

3. BiTTM

In this section, we describe BiTTM for efficient topic analysis of targeted aspects. In Section 3.1, we introduce the concept of core biterms and the process to generate core biterms. Section 3.2 and Section 3.3 discuss the generative process and inference of BiTTM, respectively.

3.1. Core Biterms

Considering the user-specified aspect usually involves only part of the data, we believe data preprocessing is an indispensable step for efficient targeted analysis. However, existing specialised topic models perform directly on the entire dataset ignoring the efficiency issue. Existing methods incorporating pre-filtering strategies, as discussed before, achieve certain efficiency by modelling topics from a reduced data set; nevertheless, the reduced data set may lose relevant documents if the targets are not expressed appropriately or comprehensively. For example, Table 1 enumerates three situations where query keywords may easily be incomplete, resulting in possible loss of relevant documents and topics.
  • Synonyms. For example, if the supplied query keyword is “bath”, relevant documents containing words representing similar semantics, such as “shower”, may be missed.
  • Words referring to the same targeted aspect in a particular domain. For example, when the domain is confined to Amazon reviews of baby products, the keywords “crib” and “bed” represent the same aspect, although they are not exactly synonyms.
  • Words describing the same event. Users often use diverse words to refer to the same event, especially in social networks. For example, considering the Twitter dataset of Oscars, both “mistake” and “oscarsfail” are used to describe the event of a wrong envelope for the Best Picture Award.
To address the aforementioned issues, we propose an efficient data preprocessing method based on core biterms.
As introduced in BTM [2], a biterm consists of any two distinct words in a fixed-length window so that it captures the co-occurrence information in the document. As the window may span two or more sentences, the semantic information between consecutive sentences can be captured. Compared with TTM and APSUM, processing at the level of biterms addresses potential loss of information between successive sentences. Therefore, we consider biterms as the base unit of our preprocessing.
To handle the situations exemplified in Table 1, we use “core words” to complement query keywords so that relevant documents that do not explicitly contain query keywords can be considered. Intuitively, if core words represent the same aspect indicated by query keywords, they should appear together with query keywords very often. Hence, we first extract “core words” that frequently co-occur with query keywords from biterms, and then extract frequent biterms containing core words as “core biterms”. The procedure is given in Algorithm 1, which can be summarised in three steps as follows:
Step 1: Calculate the desired size of the set of core words, $s_{cw}$, and rank all biterms $B_{all}$ in descending order of frequency (Lines 1–2).
Step 2: Acquire core words from the most frequent biterms containing the target, and then calculate the average frequency of biterms containing core words as $threshold$ (Lines 3–15).
Step 3: Select core biterms according to two conditions. Firstly, the biterm has at least one core word. Secondly, the frequency of the biterm has to be greater than $threshold$ (Lines 16–20).
We will then model targeted topics from the generated core biterms, which yields a threefold benefit as follows: (1) the context information between neighbouring sentences is preserved; (2) sampling relevance status based on biterms is more accurate; and (3) modelling topics from core biterms is more efficient.
Algorithm 1: Preprocessing based on biterms (the pseudocode is provided as an image in the original article).
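The three steps can be sketched as follows. This is only an illustration under stated assumptions, not the authors' Algorithm 1: the size of the core-word set is passed in as the hypothetical parameter `n_core_words` (how $s_{cw}$ is computed is not reproduced here), core words are taken to be the partners of the keywords in the top-ranked biterms, and ties in frequency are broken alphabetically.

```python
def select_core_biterms(biterm_counts, keywords, n_core_words):
    """Return (core_words, threshold, core_biterms) for a dict
    {(word_a, word_b): frequency} of biterm counts."""
    kws = set(keywords)
    # Step 1: rank all biterms by descending frequency.
    ranked = sorted(biterm_counts.items(), key=lambda kv: (-kv[1], kv[0]))

    # Step 2a: collect partner words of the keywords from the top biterms.
    core_words = []
    for (w1, w2), _ in ranked:
        if len(core_words) >= n_core_words:
            break
        for kw, partner in ((w1, w2), (w2, w1)):
            if kw in kws and partner not in kws and partner not in core_words:
                core_words.append(partner)

    # Step 2b: threshold = average frequency of biterms containing a core word.
    core_set = set(core_words)
    with_core = [(b, c) for b, c in ranked if core_set & set(b)]
    if not with_core:
        return core_words, 0.0, []
    threshold = sum(c for _, c in with_core) / len(with_core)

    # Step 3: keep biterms with a core word and frequency above the threshold.
    core_biterms = [b for b, c in with_core if c > threshold]
    return core_words, threshold, core_biterms


counts = {
    ("bath", "shower"): 6,
    ("bath", "tub"): 4,
    ("shower", "towel"): 5,
    ("toy", "tub"): 1,
    ("crib", "sheet"): 7,
}
core_words, threshold, core_biterms = select_core_biterms(counts, ["bath"], 2)
```

In this toy input the biterm ("shower", "towel") is retained although it does not contain the keyword “bath”, which is the intended effect of complementing keywords with core words.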

3.2. Model Description & Generative Process

In this subsection, we describe the model and the generative process of BiTTM. Table 2 lists the notations used in this paper.
The generative process is as follows:
  • Draw $\theta \sim \mathrm{Dirichlet}(\alpha)$.
  • Draw $\phi^{ir} \sim \mathrm{Dirichlet}(\beta^{ir})$.
  • For each target-relevant topic $k \in \{1, 2, \ldots, K\}$:
    (a) Draw $\omega_{k} \sim \mathrm{Beta}(p, q)$.
    (b) For each word $w \in \{1, 2, \ldots, W\}$:
        • Draw $\beta^{r}_{k,w} \sim \mathrm{Bernoulli}(\omega_{k})$.
    (c) Draw $\phi^{r}_{k} \sim \mathrm{Dirichlet}(\beta^{r}_{k}\delta + \epsilon)$.
  • For each biterm $b_{i}(w_{i,1}, w_{i,2}) \in B$:
    (a) Draw $\pi_{b} \sim \mathrm{Beta}(\gamma)$.
    (b) Compute $r$ based on $x$ and $\mathrm{Bernoulli}(\pi_{b})$.
    (c) If $b_{i}$ is relevant to the target:
        • Draw $z_{1}, z_{2} \sim \mathrm{Multinomial}(\theta)$.
        • Draw $w_{i,1} \sim \mathrm{Multinomial}(\phi^{r}_{z_{1}})$, and
        • Draw $w_{i,2} \sim \mathrm{Multinomial}(\phi^{r}_{z_{2}})$.
    (d) If $b_{i}$ is irrelevant:
        • Draw $w_{i,1}, w_{i,2} \sim \mathrm{Multinomial}(\phi^{ir})$.
The graphical representation of BiTTM is shown in Figure 3. Following the above procedure, the generative process can be summarised in three parts. Firstly, we draw two global parameters $\theta$ and $\phi^{ir}$. The former is a topic distribution modelled on the entire corpus instead of a single document, and the latter is the topic-word distribution of the irrelevant topic. In other words, the two words in an irrelevant biterm are drawn from a single irrelevant topic. Secondly, $\phi^{r}_{k}$ is drawn for each target-relevant topic $k \in \{1, 2, \ldots, K\}$. Please note that two smoothing parameters, the smoothing prior $\delta$ and the weak smoothing prior $\epsilon$, are used for dual-sparsity [25]. Thirdly, the status $r$ of $b_{i}$ is determined by both the target indicator $x$ and $\mathrm{Bernoulli}(\pi_{b})$. According to the status, relevant or irrelevant, we draw each word from $\phi^{r}$ or $\phi^{ir}$.
Different from the generative process of BTM, BiTTM draws two topics for a relevant biterm and each word in the biterm may be assigned a different topic. The reason why we choose this strategy is that it is inappropriate to assume that the two words in a biterm share the same topic for targeted analysis, while for BTM, it is probably sufficient to draw one topic for a biterm as it is a full-analysis model which aims to mine coarse-grained topics.
Here is an example to elaborate the difference between full analysis and targeted analysis. When dealing with the biterms $b_{1}(\text{battery}, \text{larger})$ and $b_{2}(\text{lens}, \text{larger})$, BTM is prone to assign the same topic to the two biterms because of the shared word “larger”. This allocation might be fine for full analysis since it does not pursue fine-grained topics so that it is not necessary to distinguish between “battery” and “lens”. However, for targeted analysis, “battery” and “lens” represent two different aspects and should be recognised distinctively. Note that, although the sampling processes of the two words in a biterm are independent of each other, their combined effects determine the status (i.e., relevant or irrelevant) of the biterm.
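To make the two-topics-per-biterm choice concrete, the following sketch forward-samples a single biterm from fixed, illustrative distributions. It is not the inference procedure: `theta`, `phi_rel` and `phi_irr` are hand-written toy values, the function name `generate_biterm` is hypothetical, and we assume (as in Section 3.3) that a biterm containing a query keyword is always relevant. The sketch does not enforce that the two sampled words differ.

```python
import random


def generate_biterm(rng, theta, phi_rel, phi_irr, pi_b, has_keyword):
    """Forward-sample one biterm; returns (relevant, topics, words)."""
    vocab = sorted(phi_irr)
    # Relevance status r: the keyword indicator x forces relevance,
    # otherwise r is a Bernoulli(pi_b) draw.
    relevant = has_keyword or rng.random() < pi_b
    if not relevant:
        # Both words come from the single irrelevant topic.
        weights = [phi_irr[w] for w in vocab]
        w1, w2 = rng.choices(vocab, weights=weights, k=2)
        return False, None, (w1, w2)
    # Relevant biterm: one topic per word, both drawn from the global theta.
    z1, z2 = rng.choices(range(len(theta)), weights=theta, k=2)
    w1 = rng.choices(vocab, weights=[phi_rel[z1][w] for w in vocab])[0]
    w2 = rng.choices(vocab, weights=[phi_rel[z2][w] for w in vocab])[0]
    return True, (z1, z2), (w1, w2)


theta = [0.5, 0.5]
phi_rel = [
    {"battery": 0.7, "lens": 0.0, "larger": 0.3},  # toy "battery" topic
    {"battery": 0.0, "lens": 0.7, "larger": 0.3},  # toy "lens" topic
]
phi_irr = {"battery": 0.2, "lens": 0.2, "larger": 0.6}

rng = random.Random(0)
rel, topics, words = generate_biterm(rng, theta, phi_rel, phi_irr, 0.5, True)
irr, no_topics, irr_words = generate_biterm(rng, theta, phi_rel, phi_irr, 0.0, False)
```

Since the two topic draws are independent, a relevant biterm such as (battery, larger) can place “battery” in one topic while “larger” falls in another, which a single-topic biterm model cannot express.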

3.3. Inference

Following BTM [2] and TTM [1], we choose Gibbs Sampling [34] to infer the model parameters. All notations used in this section are shown in Table 2.
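We first sample the status of every biterm. Intuitively, if a biterm contains a query keyword, then it is relevant to the target aspect. Let $d$ be a binary variable, where $d = 1$ indicates that a biterm contains a keyword provided by the user. We then define the probability that a biterm is relevant as $P(r \mid x = d, \beta^{r}, \beta^{ir}, \gamma) = 1$ if $d = 1$. Otherwise, we define the probability as shown below:
$$
P(r \mid x = d, \beta^{r}, \beta^{ir}, \gamma) =
\begin{cases}
1, & d = 1 \\[4pt]
\dfrac{n^{r}_{\neg i} + \gamma}{|B| + 2\gamma - 1} \displaystyle\prod_{w \in b_{i}} \dfrac{\Gamma\left(\beta^{r}_{w} n^{r}_{\neg i, w} + \beta^{r}_{w}\delta + \epsilon + 1\right)}{\Gamma\left(\sum_{w \in W} \left(\beta^{r}_{w} n^{r}_{\neg i, w} + 1\right) + |\beta^{r}|\delta + |W|\epsilon\right)}, & d = 0,\ r = \text{relevant} \\[4pt]
\dfrac{n^{ir}_{\neg i} + \gamma}{|B| + 2\gamma - 1} \displaystyle\prod_{w \in b_{i}} \dfrac{\Gamma\left(n^{ir}_{\neg i, w} + \beta^{ir}_{w} + 1\right)}{\Gamma\left(\sum_{w \in W} n^{ir}_{\neg i, w} + 1 + |\beta^{ir}|\right)}, & d = 0,\ r = \text{irrelevant}
\end{cases}
\tag{1}
$$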
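Next, we sample the word selector $\beta^{r}_{w}$ for all words $w \in W$. Applying Gibbs Sampling similar to TTM [1], we can obtain the equation $P(\beta^{r}_{w} \mid \beta^{r}_{\neg w}, \mathbf{w}, \delta, \epsilon, p, q) \propto P(\beta^{r}, \mathbf{w} \mid \delta, \epsilon, p, q)$. Then,
$$
\begin{aligned}
P(\beta^{r}_{w} \mid \beta^{r}_{\neg w}, \delta, \epsilon, p, q)
&\propto \iint P(\beta^{r}, \mathbf{w}, \omega, \phi \mid \delta, \epsilon, p, q)\, d\omega\, d\phi \\
&\propto
\begin{cases}
\Gamma\left(n^{r}_{w} + \delta + \epsilon\right)\, \Gamma\left(|\beta^{r}_{\neg w,\cdot}|\delta + |W|\epsilon + n^{r}_{\neg w,\cdot}\right)\, \Gamma\left(|\beta^{r}_{\neg w,\cdot}|\delta + \delta + |W|\epsilon\right) \left(p + |\beta^{r}_{\neg w,\cdot}|\right), & \beta^{r}_{w} = 1 \\[4pt]
\Gamma\left(\delta + \epsilon\right)\, \Gamma\left(|\beta^{r}_{\neg w,\cdot}|\delta + \delta + |W|\epsilon + n^{r}_{\neg w,\cdot}\right)\, \Gamma\left(|\beta^{r}_{\neg w,\cdot}|\delta + |W|\epsilon\right) \left(q + |W| - |\beta^{r}_{\neg w,\cdot}| - 1\right), & \beta^{r}_{w} = 0
\end{cases}
\end{aligned}
\tag{2}
$$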
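Because the two cases of the word-selector update are products of Gamma functions, they overflow quickly if evaluated directly. A minimal sketch of how such an update could be evaluated in log space is given below. It assumes our reading of the two cases as plain products of the listed Gamma terms and prior factors; the function name and argument names are hypothetical, and the counts are toy values.

```python
import math


def selector_on_probability(n_w, n_rest, n_selected_rest, vocab_size,
                            delta, eps, p, q):
    """Normalised probability that the selector of word w is 1.

    n_w:             count of word w under the relevant topics
    n_rest:          total count of all other words
    n_selected_rest: number of other words whose selector is 1
    """
    a = n_selected_rest * delta
    # Log of the unnormalised weight for selector = 1.
    log_on = (math.lgamma(n_w + delta + eps)
              + math.lgamma(a + vocab_size * eps + n_rest)
              + math.lgamma(a + delta + vocab_size * eps)
              + math.log(p + n_selected_rest))
    # Log of the unnormalised weight for selector = 0.
    log_off = (math.lgamma(delta + eps)
               + math.lgamma(a + delta + vocab_size * eps + n_rest)
               + math.lgamma(a + vocab_size * eps)
               + math.log(q + vocab_size - n_selected_rest - 1))
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(log_on, log_off)
    on, off = math.exp(log_on - m), math.exp(log_off - m)
    return on / (on + off)


settings = dict(n_rest=200, n_selected_rest=10, vocab_size=1000,
                delta=0.001, eps=1e-7, p=1, q=1)
prob_rare = selector_on_probability(n_w=5, **settings)
prob_frequent = selector_on_probability(n_w=50, **settings)
```

With these toy counts, a word that occurs more often under the relevant topics receives a higher probability of being selected, which matches the intended role of the “Spike” prior.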
For a biterm $b_{i}(w_{i,1}, w_{i,2})$, the probability of sampling $k$ as the topic for $w_{i,m}$ can be computed as Equation (3).
$$
P(z_{i,m} = k \mid \alpha, \beta^{r}, \beta^{ir}, \delta, \epsilon) \propto
\begin{cases}
P(z_{i,m} = k \mid \alpha, \beta^{r}, \delta, \epsilon), & r_{b_{i}} = \text{relevant} \\
P(z_{i,m} = k \mid \alpha, \beta^{ir}), & r_{b_{i}} = \text{irrelevant}
\end{cases}
\tag{3}
$$
As mentioned before, two words in a biterm may be assigned different topics if the biterm is relevant. Therefore, we can have
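$$
\begin{aligned}
P(z_{i,m} = k \mid \alpha, \beta^{r}, \delta, \epsilon)
&\propto \int P(z_{i,m} = k \mid \theta)\, P(\theta \mid \alpha)\, d\theta \int P(w_{i,m} \mid \phi_{k})\, P(\phi_{k} \mid \beta^{r}_{k}\delta + \epsilon)\, d\phi \\
&\propto \frac{n^{r,\neg i_{m}}_{\cdot \mid k} + \alpha}{\sum_{K} \left(n^{r,\neg i_{m}}_{\cdot \mid k} + \alpha\right)} \cdot \frac{\beta^{r}_{w_{i,m} \mid k}\, n^{r,\neg i_{m}}_{w_{i,m} \mid k} + \beta^{r}_{w_{i,m} \mid k}\delta + \epsilon}{\sum_{W} \left(\beta^{r}_{w_{i,m} \mid k}\, n^{r,\neg i_{m}}_{w_{i,m} \mid k} + \beta^{r}_{w_{i,m} \mid k}\delta + \epsilon\right)} \\
&\propto \frac{n^{r,\neg i_{m}}_{\cdot \mid k} + \alpha}{\sum_{K} n^{r,\neg i_{m}}_{\cdot \mid k} + K\alpha} \cdot \frac{\beta^{r}_{w_{i,m} \mid k}\, n^{r,\neg i_{m}}_{w_{i,m} \mid k} + \beta^{r}_{w_{i,m} \mid k}\delta + \epsilon}{n^{r,\neg i_{m}}_{\cdot \mid k} + \beta^{r}_{\cdot \mid k}\delta + |W|\epsilon}
\end{aligned}
\tag{4}
$$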
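If a biterm $b_{i}$ is irrelevant (i.e., $r_{b_{i}} = 0$), then we directly sample a topic from $\phi^{ir}$ for the two words $w_{i,m}$. Therefore, we can obtain the conditional probability:
$$
P(z_{i,m} = k \mid \alpha, \beta^{r}, \beta^{ir}, \delta, \epsilon) \propto
\begin{cases}
\dfrac{n^{r,\neg i_{m}}_{\cdot \mid k} + \alpha}{\sum_{K} n^{r,\neg i_{m}}_{\cdot \mid k} + K\alpha} \cdot \dfrac{\beta^{r}_{w_{i,m} \mid k}\, n^{r,\neg i_{m}}_{w_{i,m} \mid k} + \beta^{r}_{w_{i,m} \mid k}\delta + \epsilon}{n^{r,\neg i_{m}}_{\cdot \mid k} + \beta^{r}_{\cdot \mid k}\delta + |W|\epsilon}, & r_{b_{i}} = 1 \\[6pt]
\dfrac{n^{ir,\neg i}_{\cdot \mid k} + \alpha}{\sum_{K} n^{ir,\neg i}_{\cdot \mid k} + K\alpha} \cdot \dfrac{\left(n^{ir,\neg i}_{w_{i,1} \mid k} + \beta^{ir}\right)\left(n^{ir,\neg i}_{w_{i,2} \mid k} + \beta^{ir}\right)}{\left(n^{ir,\neg i}_{\cdot \mid k} + |W|\beta^{ir}\right)\left(n^{ir,\neg i}_{\cdot \mid k} + |W|\beta^{ir} + 1\right)}, & r_{b_{i}} = 0
\end{cases}
\tag{5}
$$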
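For the relevant case, the topic weight of a word is the product of a topic-popularity term and a sparse word-given-topic term. The sketch below evaluates that product for every topic, assuming our reading of the conditional above (in particular, that the word-term denominator uses the number of selected words in the topic times $\delta$). All names are hypothetical and the counts are toy values with the current word already excluded.

```python
def relevant_topic_weights(topic_counts, word_counts, selectors,
                           alpha, delta, eps, vocab_size):
    """Unnormalised P(z = k) for one word of a relevant biterm.

    topic_counts[k]: number of words assigned to topic k
    word_counts[k]:  count of this word under topic k
    selectors[k]:    (this word's 0/1 selector in topic k,
                      number of selected words in topic k)
    """
    n_topics = len(topic_counts)
    total = sum(topic_counts) + n_topics * alpha
    weights = []
    for k in range(n_topics):
        b, n_selected = selectors[k]
        topic_part = (topic_counts[k] + alpha) / total
        word_part = ((b * word_counts[k] + b * delta + eps)
                     / (topic_counts[k] + n_selected * delta + vocab_size * eps))
        weights.append(topic_part * word_part)
    return weights


weights = relevant_topic_weights(
    topic_counts=[10, 10],
    word_counts=[5, 0],
    selectors=[(1, 3), (0, 3)],  # word selected in topic 0 only
    alpha=1.0, delta=0.001, eps=1e-7, vocab_size=100,
)
probs = [w / sum(weights) for w in weights]  # normalise before sampling
```

A topic whose selector excludes the word keeps only the weak smoothing mass $\epsilon$, so it is almost never sampled for that word; this is how the sparsity of the relevant topics enters the sampler.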
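Finally, we sample a word selector $\beta^{r}_{w \mid k}$ for a topic $k$, where $w \in W$ and $k \in K$.
$$
P(\beta^{r}_{k,w} = s \mid \beta^{r}_{\neg k}, k, \delta, \epsilon, p, q) \propto
\begin{cases}
\Gamma\left(|\beta^{r}_{\neg w,\cdot \mid k}|\delta + |W|\epsilon + n^{r}_{\neg w,\cdot \mid k}\right) \left(p + |\beta^{r}_{\neg w,\cdot \mid k}|\right) \Gamma\left(|\beta^{r}_{\neg w,\cdot \mid k}|\delta + \delta + |W|\epsilon\right) \Gamma\left(n^{r}_{w \mid k} + \delta + \epsilon\right), & s = 1 \\[4pt]
\Gamma\left(|\beta^{r}_{\neg w,\cdot \mid k}|\delta + |W|\epsilon\right) \left(q + |W| - |\beta^{r}_{\neg w,\cdot \mid k}| - 1\right) \Gamma\left(|\beta^{r}_{\neg w,\cdot \mid k}|\delta + \delta + |W|\epsilon + n^{r}_{\neg w,\cdot \mid k}\right) \Gamma\left(\delta + \epsilon\right), & s = 0
\end{cases}
\tag{6}
$$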

4. Experimental Results

4.1. Baselines and Metrics

Baselines. Three methods are chosen to be compared with BiTTM, including Targeted Topic Model (TTM), Biterm Topic Model-Partial Data (BTM-PD), and Biterm Topic Model with a post-filtering strategy (BTM ).
  • TTM. Targeted Topic Model is the first method for focused analysis that extracts related topics according to a target keyword provided by users. We select TTM rather than APSUM as the baseline of specialised topic models for targeted analysis because TTM outperforms APSUM in terms of topic coherence when the number of topics is less than 50 [3]. For targeted analysis of fine-grained topics, we believe the number of topics in a given corpus is usually less than 50. Moreover, TTM serves as the most valuable comparison because APSUM is not exactly designed for targeted analysis.
  • BTM . As our model is developed based on biterms, we also compare with two variations of BTM that are adapted for targeted analysis. BTM is a state-of-the-art topic model for short texts, which also applies to long texts [2]. As a typical full-analysis model, BTM aims to find all topics (or all aspects) from the entire corpus. We then use a filtering strategy to eliminate topics that do not contain the target keywords. This approach is named as BTM for simplicity.
  • BTM-PD. This is another variation of BTM which applies the pre-filtering strategy to perform focused analysis. We use only the subset of documents containing the target keywords to model topics. As discussed before, the pre-filtering strategy is handicapped by the variability of target keywords: relevant documents may be filtered out, so topics may be missed.
Metrics. We adopt two techniques to evaluate the quality of topics: topic coherence [35] and precision@n [1] ($P@n$ for short). The former is a popular method to evaluate the quality of discovered topics [36,37,38,39,40]. As an automated evaluation metric, topic coherence mainly measures the interpretability of topics instead of target-relevance. More specifically, topic coherence measures the document-level mutual information of keywords in topics; however, it does not reflect the relationship between topics and targets. In order to evaluate whether topics are target-relevant, we employ the metric $P@n$, also used by TTM [1], which is an evaluation based on human judgment to assess the relevance between the target and topics.
Considering the $M$ most probable words in topic $k$, the topic coherence of $k$ is defined as Equation (7).
$$
TC(k) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{|\{doc(w_{k,m}, w_{k,l})\}| + 1}{|\{doc(w_{k,l})\}|}
\tag{7}
$$
where $|\{doc(w_{k,m}, w_{k,l})\}|$ is the number of documents containing both $w_{k,m}$ and $w_{k,l}$; $|\{doc(w_{k,l})\}|$ is the number of documents containing $w_{k,l}$, and $w_{k,l}$ is the $l$th most probable word in topic $k$. For the $m$th most probable word, the measure considers its co-occurrence with the $m - 1$ more probable words. A smoothing count of 1 is added to avoid taking the logarithm of zero. The closer the measure is to zero, the more coherent the discovered topics are.
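The coherence measure is straightforward to compute from document frequencies. The sketch below assumes tokenised documents and a list of top words ordered from most to least probable; the function name `topic_coherence` is hypothetical, and every top word is assumed to occur in at least one document so that the denominator is non-zero.

```python
import math


def topic_coherence(top_words, docs):
    """Sum of log((co-document frequency + 1) / document frequency) over
    all pairs (m, l) with word l ranked above word m."""
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


docs = [["bath", "tub"], ["bath", "soap"], ["bath"], ["tub", "soap"]]
score = topic_coherence(["bath", "tub"], docs)
```

Here “bath” appears in three documents and co-occurs with “tub” in one, so the single pair contributes $\log(2/3)$, a value slightly below zero.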
Given the set of topics discovered by all models, suppose there are $K_{u}$ topics that have been verified by users to be related with the target aspect. Moreover, from all topics discovered by a particular model $m$, suppose there are $K_{m}$ topics related with the target. Then, the precision of model $m$ at rank position $n$ is defined as follows:
$$
P_{m}@n = \frac{\sum_{z=1}^{K_{m}} |\{CorrectWords(z)\}|}{\sum_{z=1}^{K_{u}} |\{CorrectWords(z)\}|}
\tag{8}
$$
where $|\{CorrectWords(z)\}|$ is the number of words, among the top $n$ words of topic $z$, which are relevant to the target (Note that, if a discovered topic is potentially related with multiple semantic topics, the best semantic topic based on the top 20 words will be adopted).
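As a small worked example of this ratio, the sketch below counts correct words among the top $n$ words of each topic. In the paper the relevance of each word is decided by human judges; here that judgment is stood in for by a hand-written set `relevant_words`, and all names and topic lists are hypothetical.

```python
def correct_words(topic, relevant_words, n):
    """Number of the topic's top-n words judged relevant to the target."""
    return sum(1 for w in topic[:n] if w in relevant_words)


def precision_at_n(model_topics, verified_topics, relevant_words, n):
    """Correct words in the model's target-related topics, divided by the
    correct words in all user-verified topics from every model."""
    denom = sum(correct_words(t, relevant_words, n) for t in verified_topics)
    if denom == 0:
        return 0.0
    numer = sum(correct_words(t, relevant_words, n) for t in model_topics)
    return numer / denom


relevant_words = {"tub", "soap", "shower", "towel"}  # stand-in for human labels
model_topics = [["tub", "soap", "car"]]
verified_topics = [["tub", "soap", "car"], ["shower", "towel", "food"]]
p_at_3 = precision_at_n(model_topics, verified_topics, relevant_words, n=3)
```

The model contributes two correct words out of four across all verified topics, giving a score of 0.5 for $n = 3$.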
Therefore, the two evaluation methods have different merits and objectives. For example, topic coherence is an automated evaluation metric reflecting the interpretability of topics. P @ n demands human judgement and assesses the relevance between the discovered topics and the queried target. For the sake of fairness, we use P @ n to evaluate all comparing models (i.e., BiTTM, TTM, BTM-PD and BTM ) to find out the effectiveness of the models in performing focused analysis. However, we only compare BiTTM and TTM in terms of topic coherence to evaluate the topic quality since the other two models are variations of BTM, which is essentially designed for full-analysis of topics.

4.2. Data sets & Experimental Settings

Data sets. To evaluate the performance of our proposed model comprehensively, we conduct experiments on different types of text. In particular, three types of document are considered: short, medium and long texts. For each type, we select three data sets. The nine datasets used in our experiments are described in Table 3. They are all publicly available at the URLs listed at the bottom of Table 3.
Experimental Settings. In our experiments, we use various words as target queries to analyse the influence of diverse targets on performance. For parameter settings, we follow the hyper-parameter settings of TTM: $\alpha = \gamma = 1$, $\beta_{ir} = 0.001$, $p = q = 1$, and the two smoothing priors are set to $\delta = 0.001$ and $\epsilon = 1 \times 10^{-7}$. Other baselines follow the parameter settings in their respective papers.

4.3. Quantitative Evaluation

In this subsection, we analyse the quality of discovered topics from two aspects: topic coherence (representing topic interpretability or semantic coherence) and P@n (indicating topic relevance).
Analysing the results of topic coherence: The average topic coherence achieved by BiTTM and TTM is shown in Table 4; the closer the score is to zero, the more coherent the discovered topics are. As the table shows, BiTTM does not match TTM on short texts in terms of topic coherence. However, as document length increases, BiTTM starts to outperform TTM. BiTTM generally works better than TTM on medium and long texts because TTM is a sentence-based model, in which the information between consecutive sentences is lost. In contrast, by considering core biterms that may come from neighbouring sentences, BiTTM captures semantics that cross sentence boundaries, so that more interpretable topics can be generated. However, since a short text document often contains only one sentence, this limitation of sentence-based TTM does not show on short texts. Overall, by outperforming TTM on medium and long text documents, BiTTM has broader applicability in text data analysis.
To evaluate model performance with respect to different queries, we randomly sample query keywords from the documents according to the word frequency distribution. We plot the comparative results of BiTTM and TTM in Figure 4, where the horizontal axis represents the word frequency of the target keyword and the vertical axis indicates the percentage of documents containing the target. There are three types of symbol in the figure: red dots, green squares and blue triangles. Each symbol corresponds to a comparison between the topics discovered by BiTTM and TTM with respect to one query. In particular, a green square means that BiTTM obtains a better topic coherence than TTM for this query, while a blue triangle implies the opposite. A red dot indicates that TTM fails to discover the specified number of topics, or the specified number of words under some topics, for this particular query. For example, we set the number of topics to 5 for the experiments in Figure 4 and consider the top 10 words for each topic. TTM discovers fewer than 5 topics, or fewer than 10 words for a topic, when handling the queries corresponding to red dots. Note that this situation does not occur for BiTTM.
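The query sampling and the two coordinates plotted in Figure 4 can be illustrated with a short Python sketch. It assumes tokenised documents held as lists of words, and the function names `sample_queries` and `target_statistics` are hypothetical; the paper does not specify how the sampling was implemented.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def sample_queries(documents: Sequence[Sequence[str]], k: int, seed: int = 0) -> List[str]:
    """Draw k query keywords with probability proportional to corpus word frequency."""
    freq = Counter(word for doc in documents for word in doc)
    vocabulary = sorted(freq)          # fixed order so a given seed is reproducible
    rng = random.Random(seed)
    # random.choices samples with replacement, so a keyword may be drawn twice.
    return rng.choices(vocabulary, weights=[freq[w] for w in vocabulary], k=k)


def target_statistics(documents: Sequence[Sequence[str]], target: str) -> Tuple[int, float]:
    """Return (word frequency, percentage of documents containing the target)."""
    word_frequency = sum(doc.count(target) for doc in documents)
    containing = sum(1 for doc in documents if target in doc)
    return word_frequency, 100.0 * containing / len(documents)
```

Under these assumptions, each sampled keyword yields one point whose horizontal coordinate is its word frequency and whose vertical coordinate is its document percentage.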
The most obvious trend in Figure 4 is that the red dots usually appear in the lower left corner, the blue triangles gather in the upper right corner, and the green squares fall in between. The red dots in the lower left corner imply that TTM is prone to missing topics when dealing with infrequent targets. The blue triangles in the upper right corner suggest that TTM performs better when the targets appear very frequently in many documents. However, the number of such target keywords may be limited. In contrast, BiTTM achieves satisfactory performance for a diverse range of targets, even if they are infrequent in the corpus. This also shows that using core words to enrich the semantic information in the context of the target keyword (i.e., the BiTTM strategy) is more effective than taking words in the same sentences as bridges to connect potentially target-related words (i.e., the TTM strategy).
Analysing the results of P@n: To calculate P@n, as in TTM, three human labellers familiar with the data sets were engaged to label the results. The P@n values at rank positions 5, 10 and 20 are reported in Table 5, from which several interesting outcomes can be observed. Firstly, the performance of the two variations of BTM (i.e., BTM-PD and BTM) is generally worse than that of the two specialised topic models (i.e., BiTTM and TTM), which demonstrates that full-analysis topic models with filtering strategies are not suitable for targeted analysis because they are prone to detecting general topics instead of fine-grained target-related topics. In addition, comparing the two BTM variations, BTM-PD is better than BTM in most cases, which indicates that the pre-filtering strategy is more effective in removing irrelevant words than the post-filtering strategy. Secondly, the average P@n of BiTTM achieves a gain of more than 10% compared with TTM, and more than 26% compared with BTM-PD, over all queries in the table and all settings of n. Moreover, the performance difference among the three types of document is not significant, whereas the target query does influence the P@n results, which will be explained later using concrete examples. Thirdly, TTM is the second-best model for P@5. However, for P@10, TTM outperforms all other models for some queries. This suggests that TTM tends to place target-related words in lower-ranked positions.
To explore the influence of different queries, let us take a closer look at two specific targets: “ashtray” (in the short-text data set “cigar”) and “rinses” (in the medium-text data set “baby”). As shown in Figure 4, both targets are infrequent words (appearing in the lower left corner) in their respective datasets. However, the P@n scores of BiTTM and TTM for the two queries, as shown in Table 5, are remarkably different. Both models perform well with respect to “ashtray” but not with respect to “rinses”, and this is especially true for TTM. The P@n score of TTM for “rinses” is unsatisfactory, and several inexplicable words, such as “attention” and “entertain”, appear in the discovered topics, which makes the topics hard to interpret. By examining the datasets, we find that documents containing “ashtray” consistently describe the appearance of ashtrays, such as colours and materials. That is, the documents are clean and relevant, which explains why both BiTTM and TTM process the query well. In contrast, the documents containing “rinses” are mostly composed of short sentences, such as “It rinses out well and dries quickly.” and “Rinses/Washes easy.”, where the meaningful descriptions are hidden in the context surrounding the sentences that contain “rinses”. TTM cannot handle this situation since it is a sentence-based model. The two examples explain why performance varies with the query keyword.
Comparing the performance in terms of topic coherence and P@n, we notice that BiTTM is better at acquiring topics related to the target (i.e., high P@n scores) than at generating semantically coherent topics (i.e., better topic coherence values), especially for short text documents. This is because words related to the target do not necessarily have high co-occurrence, which is what topic coherence is calculated from. For instance, “Oktoberfest” is an appropriate word related to the target “place” in the dataset cigar, because a type of cigar named Quesada Oktoberfest is released in October to celebrate the famous German beer festival. However, “Oktoberfest”, as a low-frequency word, cannot provide enough mutual information, which directly causes the poor performance in topic coherence. Conversely, the high-frequency word “rolled” contributes to a high topic coherence score, but it is not selected by BiTTM since it is too general to describe the target “place”.

4.4. Time Efficiency Analysis

As mentioned before, it is ideal for targeted analysis to provide responses to user queries as soon as possible. Therefore, in this experiment, we analyse the time efficiency of the comparative models.
The average time cost of the four methods on each dataset over 40 random queries is shown in Table 6. It can be observed that, generally, BiTTM has the best time efficiency, followed by BTM-PD. TTM, which has no preprocessing strategy, is significantly slower, and BTM is the least efficient model since it performs full analysis on the complete dataset.
To demonstrate the impact of data size on time efficiency, we plot the results in Figure 5, where the grey bars denote the size of the datasets and the polylines in different colours indicate the time cost of the different methods. Note that, since the time consumption of BTM is not comparable to the others, only three models (i.e., BiTTM, BTM-PD and TTM) are displayed in the figure. It can be observed that, generally, the time cost of all methods increases as data size grows. However, the size of the dataset has a greater impact on TTM than on the other two methods, which shows that TTM is not suitable for processing large data sets. In contrast, BiTTM and BTM-PD adapt better to large data sets. For these two methods, differences in data size do not change time consumption dramatically, since both have preprocessing strategies that focus only on the portion of data related to the query targets. The difference between the two is that BiTTM is faster than BTM-PD, especially as document length increases. The reason is that the pre-filtering strategy adopted by BTM-PD is simple and coarse: it selects documents as long as they contain the query keywords. Consequently, irrelevant information contained in such documents is included and processed as well, which reduces the time efficiency of BTM-PD.
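The keyword-based pre-filtering attributed to BTM-PD above can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the baseline's actual code: documents are token lists, the function name `prefilter_documents` is hypothetical, and a document is kept if it contains at least one query keyword.

```python
from typing import Iterable, List, Sequence


def prefilter_documents(documents: Iterable[Sequence[str]],
                        query_keywords: Iterable[str]) -> List[Sequence[str]]:
    """Keep every document that contains at least one query keyword."""
    keywords = set(query_keywords)
    # The whole document is kept, including words unrelated to the query.
    return [doc for doc in documents if keywords.intersection(doc)]
```

Because each selected document is retained in full, every unrelated word it contains is passed on to the topic model, which is the source of the extra cost discussed above.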
To illustrate the impact of document length on time efficiency, the percentage histogram of the time cost of BiTTM, TTM and BTM-PD is plotted in Figure 6, where the average document length increases from left to right. It can be observed that the time efficiency of TTM is worst on short texts. Recall that the topic coherence of TTM on short texts is better than that of BiTTM. This experiment shows that TTM achieves this by significantly sacrificing time efficiency, while its topic quality in terms of P@n on short texts is also worse than that of BiTTM. Moreover, the efficiency of BTM-PD is worse on long texts than on short texts. This is because BTM-PD is a biterm-based topic model and long texts generally have more biterms than short texts. Although BiTTM is also a biterm-based model, the strategy of selecting “core biterms” removes many irrelevant biterms, so that the performance of BiTTM on long texts remains promising.
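To see why longer texts yield many more biterms, consider the simplest case in which every unordered word pair in a document counts as a biterm. The sketch below makes that assumption explicitly (biterm models often restrict pairs to a context window instead), and the function name `extract_biterms` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def extract_biterms(doc: Sequence[str]) -> List[Tuple[str, str]]:
    """All unordered word pairs of a document, treating the whole document as one context."""
    # Each pair is stored in sorted order so that (a, b) and (b, a) coincide.
    return [(min(a, b), max(a, b)) for a, b in combinations(doc, 2)]
```

A document of $n$ words produces $n(n-1)/2$ pairs under this assumption, so a tenfold increase in length gives roughly a hundredfold increase in biterms, which is why discarding irrelevant biterms before modelling matters most on long texts.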
Therefore, Figure 5 and Figure 6 demonstrate that BiTTM can be widely applied to various types of text data, because both data size and document length have no great impact on its time efficiency, thanks to the core biterm-based preprocessing strategy.

4.5. Qualitative Evaluation

In this subsection, we present a qualitative analysis of the topics generated by the compared models. We focus on two aspects: the ability to discover as many fine-grained relevant topics as possible, and the ability to deal with semantically approximate targets. The word frequency and document frequency of the example queries discussed below are shown in Figure 4.

4.5.1. Discovering Relevant Topics

We take the query “disease” in the dataset food as an example. Table 7 shows the topics discovered by the four compared models, together with the top 10 words of each topic. The third row of Table 7 contains the labels we assign manually to summarise the semantics of each topic, where SFA is the abbreviation for Saturated Fatty Acid. Words that do not semantically align with the topics are displayed in red.
Compared with BiTTM, all the other three methods fail to identify the topic prevention, which is clearly a relevant topic of “disease”. Moreover, the two BTM variations (i.e., BTM-PD and BTM) miss the topic risk. On closer inspection, we find that this is because the two BTM models cannot distinguish between the topics risk and research, which differ only subtly. In other words, the two BTM models discover a single topic combining research and risk. This is understandable because BTM, as a full-analysis topic model, discovers general topics. TTM succeeds in discovering both research and risk, but the topic quality is poorer than that of BiTTM (e.g., there are more highlighted words in the two topics discovered by TTM, which means more inconsistent words in its results). Therefore, BiTTM discovers more relevant and fine-grained topics than the other models for this example query.
Consider the topic SFA, which is discovered by all four models. The results of BiTTM clearly indicate that saturated fatty acids affect blood sugar and carcinogenesis, but the results of the other methods are not satisfactory. For example, TTM tends to find out which foods (e.g., tart, chip and sweetener) have unsaturated fatty acids. BTM-PD and BTM focus on food ingredients (e.g., palm oil and protein). These results are not related to the target “disease” queried by users. Hence, the topic quality of BiTTM is also better than that of the other models in this example.

4.5.2. Handling Semantically Approximate Targets

When the targets supplied by users are semantically approximate, a set of similar relevant topics is expected to be discovered. We further examine the performance of the compared models in handling semantically approximate targets. In particular, we analyse the two types of semantically approximate queries mentioned in Section 3.1: synonyms and diverse descriptions of the same event.
An example of the first type is shown in Table 8. We query the dataset “baby” with two targets, “bath” and “shower”, which share similar semantics in the data set of Amazon reviews of baby products. A successful model should return similar topics. As shown in the table, BiTTM is the only model that can obtain the set of four meaningful topics for both queries, while other methods either miss topics or generate vague content for topics.
For instance, apart from BiTTM, the other three methods fail to identify the topic blanket with respect to the query “bath”, while TTM and BTM can retrieve the topic with respect to the target “shower”. According to the results of BiTTM, blanket is an important aspect of bath/shower, because most people cover their babies with a blanket after a bath/shower. Hence, the topic blanket is an aspect in which users are interested. Ignoring an important topic hinders downstream analysis and applications, such as high-quality personalised services and commodity recommendation systems. As another example, two topics are discovered by BiTTM only: sentiment and protection. Checking the content of the topic sentiment, we learn that users tend to include emotional expressions (e.g., “have a nice time with daughter/son”) when commenting on shower/bath products. This topic thus reflects the emotional polarity of users towards products, which is important for applications such as user profiling, recommendation and public opinion monitoring. The topic protection describes safety products that can be installed in tubs or on faucets. Bath safety is an important concern, especially for baby products, and it is a shortcoming of the other three methods that they ignore this topic.
Moreover, we find that BTM-PD extracts only two topics for both queries, and the content of the topics is too vague to understand (e.g., we are not able to assign semantic labels to the topics). There are six identical words between the two sets of top 10 words, which makes it very hard to distinguish between the semantics of the topics. The same situation occurs for BTM, which produces two similar topics about “spout”. For example, given the query “bath”, the two topics share eight identical words among their top 10 words. The content of these two topics may be correct, but the information expressed is redundant. Generating near-identical topics is not useful and only increases the difficulty of further analysis.
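The redundancy check described above, counting identical words between the top 10 words of two topics, is straightforward to automate. The following Python sketch is illustrative only; the function name `shared_top_words` is hypothetical and topics are assumed to be ranked word lists.

```python
from typing import Sequence


def shared_top_words(topic_a: Sequence[str], topic_b: Sequence[str], n: int = 10) -> int:
    """Number of words that appear among the top n words of both ranked topics."""
    return len(set(topic_a[:n]) & set(topic_b[:n]))
```

A count of six or eight out of ten, as reported for the BTM variations, signals that two topics are largely redundant, whereas well-separated topics would share few or no top words.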
Table 9 shows an example of the second type. In the dataset Oscars, both “mistake” and “oscarsfail” refer to the same event: the Best Picture Award, which should have gone to Moonlight, was wrongly presented to La La Land because of a wrong envelope. As the table shows, BiTTM acquires three fine-grained relevant topics, which describe how the event developed. At the beginning of the event, two guests present the Best Picture award to La La Land, and no one is aware of the mistake. Many tweets emerge to talk about La La Land and congratulate the actors and the producer, which can be seen from the content of the topic beginning. Next, the error is corrected and the real winner turns out to be another movie, Moonlight. The topic correction matches this stage well. Note that the top 10 words of this topic with respect to the query “mistake” contain the word “oscarsfail”, which demonstrates the usefulness of the core biterms strategy used by BiTTM. The third topic, discussion, covers the discussion of the actors’ reactions after the mistake happened.
In contrast, TTM retrieves only the topic discussion, and its quality is not satisfactory: some irrelevant words, such as Moana (another movie), appear in the topic. BTM-PD and BTM also discover only the topic discussion with respect to the target “oscarsfail”, and the quality is low. For example, the word “documentary”, which is not related to the two movies, appears in the results. Although the quality improves with respect to the target “mistake”, the two topics discovered by BTM are too similar, with 6 identical words in the top 10.

5. Conclusions

Targeted topic modelling is an increasingly vital task due to the prevalence of texts on the web and the limited scope of users’ interests. Compared with full-analysis topic models, such as LDA [41] and BTM [2], which are designed to discover all topics in a dataset, targeted analysis models aim to perform an in-depth semantic analysis to extract the fine-grained topics with which users are concerned. In this paper, we propose a core biterm-based topic model for targeted analysis named BiTTM. Because only part of the entire dataset is related to the target aspects, and because responses to user queries must be provided efficiently, a pre-processing mechanism is indispensable. We therefore propose extracting core biterms related to target queries (from neighbouring sentences) to preserve relevant information and to capture semantics across documents. Fine-grained topics are then modelled from core biterms, where different topics are allowed to be sampled for each word in a biterm. Extensive experiments have been conducted to compare BiTTM with state-of-the-art models in terms of topic coherence, topic relevance and time efficiency on nine real-world data sets, including short, medium and long texts, with respect to various query keywords randomly sampled from the corpus. The experimental results demonstrate that BiTTM remarkably outperforms existing models in terms of both retrieving high-quality topics relevant to targets and computational efficiency.
Future research should consider the potential effects of relevance in semantic space more carefully. For example, using multi-source semantic information to enhance the computational accuracy of relevance may significantly improve model performance. Recent studies [42,43,44,45,46] have shown that using word embeddings for topic modelling is promising for text analysis, and this may be the subject of future studies.

Author Contributions

Conceptualization, J.W. and L.L.; methodology, J.W.; validation, J.W. and L.L.; formal analysis, J.W. and L.C.; writing – original draft preparation, J.W.; writing – review and editing, L.C., L.L. and X.W.; supervision, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly funded by the Sichuan Science and Technology Program, Grant/Award Numbers: 2019ZYZF0169; the A Ba Achievements Transformation Program, Grant/Award Number: 19CGZH0006, R21CGZH0001; the Chengdu Science and technology planning project, Grant/Award Number: 2021-YF05-00933-SN.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

We would like to thank Wu Deng for his encouragement and guidance throughout this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, S.; Chen, Z.; Fei, G.; Liu, B.; Emery, S. Targeted Topic Modeling for Focused Analysis. In Proceedings of the ACM SIGKDD International Conference, San Francisco, CA, USA, 13–17 August 2016; pp. 1235–1244.
  2. Cheng, X.; Yan, X.; Lan, Y.; Guo, J. BTM: Topic Modeling over Short Texts. IEEE Trans. Knowl. Data Eng. 2014, 26, 2928–2941.
  3. Rakesh, V.; Ding, W.; Ahuja, A.; Rao, N.; Sun, Y.; Reddy, C.K. A Sparse Topic Model for Extracting Aspect-Specific Summaries from Online Reviews. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, 23–27 April 2018; pp. 1573–1582.
  4. Kim, H.; Choi, D.; Drake, B.L.; Endert, A.; Park, H. TopicSifter: Interactive Search Space Reduction through Targeted Topic Modeling. In Proceedings of the 14th IEEE Conference on Visual Analytics Science and Technology, IEEE VAST 2019, Vancouver, BC, Canada, 20–25 October 2019; pp. 35–45.
  5. He, J.; Li, L.; Wang, Y.; Wu, X. Hierarchical features-based targeted aspect extraction from online reviews. Intell. Data Anal. 2021, 25, 205–223.
  6. Nguyen, T.; Pham, T.; Le, H.; Nguyen, T.; Bui, H.; Ha, Q. A Targeted Topic Model based Multi-Label Deep Learning Classification Framework for Aspect-based Opinion Mining. In Proceedings of the 12th International Conference on Knowledge and Systems Engineering, KSE 2020, Can Tho City, Vietnam, 12–14 November 2020; pp. 165–170.
  7. Li, S.; Zhang, Y.; Pan, R.; Mao, M.; Yang, Y. Recurrent Attentional Topic Model. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3223–3229.
  8. Cai, G.; Peng, L.; Wang, Y. Topic Detection and Evolution Analysis on Microblog; Springer: Berlin/Heidelberg, Germany, 2014; pp. 67–77.
  9. Ye, C.; Liu, D.; Chen, N.; Lin, L. Mapping the topic evolution using citation-topic model and social network analysis. In Proceedings of the International Conference on Fuzzy Systems and Knowledge Discovery, Zhangjiajie, China, 15–17 August 2016; pp. 2648–2653.
  10. Xia, Y.; Tang, N.; Hussain, A.; Cambria, E. Discriminative Bi-Term Topic Model for Headline-Based Social News Clustering. In Proceedings of the Twenty-Eighth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2015, Hollywood, FL, USA, 18–20 May 2015; pp. 311–316.
  11. Amara, A.; Taieb, M.A.H.; Aouicha, M.B. Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis. Appl. Intell. 2021, 51, 3052–3073.
  12. Hu, Y.; Tai, C.; Liu, K.E.; Cai, C. Identification of highly-cited papers using topic-model-based and bibliometric features: The consideration of keyword popularity. J. Inf. 2020, 14, 101004.
  13. Zhang, W.; Wang, J. Integrating Topic and Latent Factors for Scalable Personalized Review-based Rating Prediction. IEEE Trans. Knowl. Data Eng. 2016, 28, 3013–3027.
  14. Wang, H.; Li, W. Relational Collaborative Topic Regression for Recommender Systems. IEEE Trans. Knowl. Data Eng. 2015, 27, 1343–1355.
  15. Zhang, Y.; Chen, W.; Zha, H.; Gu, X. A Time-Topic Coupled LDA Model for IPTV User Behaviors. IEEE Trans. Broadcast. 2015, 61, 56–65.
  16. Hu, C.; Hu, Y.; Xu, W.; Shi, P.; Fu, S. Understanding Popularity Evolution Patterns of Hot Topics Based on Time Series Features. In Proceedings of the Web Technologies and Applications - APWeb 2014 Workshops, SNA, NIS, and IoTS, Changsha, China, 5 September 2014; pp. 58–68.
  17. Feuerriegel, S.; Ratku, A.; Neumann, D. Analysis of How Underlying Topics in Financial News Affect Stock Prices Using Latent Dirichlet Allocation. In Proceedings of the Hawaii International Conference on System Sciences, HICSS 2016, Koloa, HI, USA, 5–8 January 2016; pp. 1072–1081.
  18. Viermetz, M.; Skubacz, M.; Ziegler, C.N.; Seipel, D. Tracking Topic Evolution in News Environments. In Proceedings of the IEEE Conference on E-Commerce Technology and the Fifth IEEE Conference on Enterprise Computing, E-Commerce and E-Services, Washington, DC, USA, 21–14 July 2008; pp. 215–220.
  19. Phuong, D.V.; Phuong, T.M. A keyword-topic model for contextual advertising. In Proceedings of the Symposium on Information and Communication Technology 2012, SoICT ’12, Halong City, Vietnam, 23–24 August 2012; pp. 63–70.
  20. Kalyanam, J.; Mantrach, A.; Saez-Trumper, D.; Vahabi, H.; Lanckriet, G. Leveraging Social Context for Modeling Topic Evolution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 517–526.
  21. Sordo, M.; Ogihara, M.; Wuchty, S. Analysis of the Evolution of Research Groups and Topics in the ISMIR Conference. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, 26–30 October 2015; pp. 204–210.
  22. Zhao, B.; Xu, W.; Ji, G.; Tan, C. Discovering Topic Evolution Topology in a Microblog Corpus. In Proceedings of the Third International Conference on Advanced Cloud and Big Data, Yangzhou, Jiangsu, China, 30 October–1 November 2015; pp. 7–14.
  23. Gou, Z.; Li, Y. A method of query expansion based on topic models and user profile for search in folksonomy. J. Intell. Fuzzy Syst. 2021, 41, 1701–1711.
  24. Sperrle, F.; Schäfer, H.; Keim, D.A.; El-Assady, M. Learning Contextualized User Preferences for Co-Adaptive Guidance in Mixed-Initiative Topic Model Refinement. Comput. Graph. Forum 2021, 40, 215–226.
  25. Lin, T.; Tian, W.; Mei, Q.; Cheng, H. The dual-sparse topic model: Mining focused topics and focused terms in short text. In Proceedings of the 23rd International World Wide Web Conference, WWW ’14, Seoul, Korea, 7–11 April 2014; pp. 539–550.
  26. Chien, J.T.; Chang, Y.L. Bayesian Sparse Topic Model. J. Signal Process. Syst. 2014, 74, 375–389.
  27. Slutsky, A.; Hu, X.; An, Y. Learning Focused Hierarchical Topic Models with Semi-Supervision in Microblogs. In Proceedings of the Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, 19–22 May 2015; Part II. pp. 598–609.
  28. Chen, X.; Zhou, M.; Carin, L. The contextual focused topic model. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 96–104.
  29. Pu, X.; Jin, R.; Wu, G.; Han, D.; Xue, G.R. Topic Modeling in Semantic Space with Keywords. In Proceedings of the ACM International Conference on Information and Knowledge Management, Melbourne, VIC, Australia, 19–23 October 2015; pp. 1141–1150.
  30. Williamson, S.; Wang, C.; Heller, K.A.; Blei, D.M. The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. In Proceedings of the International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 1151–1158.
  31. Zhu, B.; Cai, Y.; Zhang, H. Sparse Biterm Topic Model for Short Texts. In Proceedings of the Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Guangzhou, China, 23–25 August 2021; Part I; Lecture Notes in Computer Science. Hou, U.L., Spaniol, M., Sakurai, Y., Chen, J., Eds.; Springer: Cham, Switzerland, 2021; Volume 12858, pp. 227–241.
  32. Shi, L.; Du, J.; Kou, F. A sparse topic model for bursty topic discovery in social networks. Int. Arab J. Inf. Technol. 2020, 17, 816–824.
  33. Wang, C.; Blei, D.M. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In Proceedings of the International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 1982–1989.
  34. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101 (Suppl. S1), 5228.
  35. Mimno, D.; Wallach, H.M.; Talley, E.; Leenders, M.; McCallum, A. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, UK, 27–31 July 2011; pp. 262–272.
  36. Yao, L.; Zhang, Y.; Wei, B.; Qian, H.; Wang, Y. Incorporating Probabilistic Knowledge into Topic Models. In Proceedings of the Advances in Knowledge Discovery and Data Mining - 19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, 19–22 May 2015; Part II. pp. 586–597.
  37. Arora, S.; Ge, R.; Halpern, Y.; Mimno, D.; Moitra, A.; Sontag, D.; Wu, Y.; Zhu, M. A Practical Algorithm for Topic Modeling with Provable Guarantees. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 280–288.
  38. Li, C.; Wang, H.; Zhang, Z.; Sun, A.; Ma, Z. Topic Modeling for Short Texts with Auxiliary Word Embeddings. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 17–21 July 2016; pp. 165–174.
  39. Allahyari, M.; Kochut, K. Automatic Topic Labeling Using Ontology-Based Topic Models. In Proceedings of the IEEE International Conference on Machine Learning and Applications, ICMLA 2015, Miami, FL, USA, 9–11 December 2015; pp. 259–264.
  40. Huang, J.; Peng, M.; Wang, H.; Cao, J.; Gao, W.; Zhang, X. A probabilistic method for emerging topic tracking in Microblog stream. World Wide Web 2017, 20, 325–350.
  41. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  42. Bollegala, D.; Hayashi, K.; Kawarabayashi, K. Think Globally, Embed Locally - Locally Linear Meta-embedding of Words. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 3970–3976.
  43. Zhang, P.; Wang, S.; Li, D.; Li, X.; Xu, Z. Combine Topic Modeling with Semantic Embedding: Embedding Enhanced Topic Model. IEEE Trans. Knowl. Data Eng. 2020, 32, 2322–2335.
  44. Li, S.; Pan, R.; Luo, H.; Liu, X.; Zhao, G. Adaptive cross-contextual word embedding for word polysemy with unsupervised topic modeling. Knowl. Based Syst. 2021, 218, 106827.
  45. Inoue, S.; Aida, T.; Komachi, M.; Asai, M. Modeling Text using the Continuous Space Topic Model with Pre-Trained Word Embeddings. In Proceedings of the ACL-IJCNLP 2021 Student Research Workshop, ACL 2021, Online, 5–10 July 2021; Kabbara, J., Lin, H., Paullada, A., Vamvas, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 138–147.
  46. Gupta, P.; Chaudhary, Y.; Schütze, H. Multi-source Neural Topic Modeling in Multi-view Embedding Spaces. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tür, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4205–4217.
Figure 1. The graphical model of TTM.
Figure 2. The graphical model of BTM.
Figure 3. The graphical model of BiTTM.
Figure 4. Effect of various targets on topic coherence. The x-axis shows the word frequency of targets, and the y-axis represents the percentage of documents containing a target.
Figure 5. Time cost in datasets of different sizes.
Figure 6. Time cost in datasets with different document lengths.
Table 1. Three examples where a targeted aspect can be expressed by multiple different keywords.
Example 1 (Synonyms): Two candidate query keywords, “bath” and “shower”.
Example 2 (Domain restriction): Two candidate query keywords, “crib” and “bed”, in the Amazon review dataset Baby.
Example 3 (Event description): Two candidate query keywords, “mistake” and “oscarsfail”, in the Twitter dataset Oscars.
Table 2. Symbol description.

| Notation | Meaning |
|---|---|
| $B$ | the set of (core) biterms |
| $W$ | the set of words |
| $D$ | the set of documents |
| $\pi_b$ | the Bernoulli distribution over biterm $b$ |
| $\gamma$, $\beta^{ir}$, $\alpha$ | Beta prior of $\pi_b$; Dirichlet priors of $\phi^{ir}$ and $\theta$ |
| $\phi^{ir}$ | topic-word distribution over the irrelevant topic |
| $\phi^{r}_{k}$ | topic-word distribution over the $k$th relevant topic |
| $p$, $q$ | Beta prior of $\omega$ |
| $\delta$, $\epsilon$ | word smoothing prior, weak word smoothing prior |
| $\omega_k$ | Bernoulli distribution of word selector $\beta^{r}_{k}$ |
| $x$, $w$, $z$, $r$ | target indicator, word, topic, status |
| $\beta^{r}_{w \mid k}$, $\beta^{r}_{\cdot \mid k}$ | word selector of word $w$ under topic $k$; the sum of word selectors $\beta^{r}_{w \mid k}$ |
| $\beta^{r}_{w}$, $\beta^{r}$ | word selector of word $w$; the sum of word selectors $\beta^{r}_{w}$ |
| $\beta^{r}_{\neg w, \cdot}$ | the sum of word selectors except the word $w$ |
| $\beta^{r}_{\neg w, \cdot \mid k}$ | the sum of word selectors under topic $k$ except the word $w$ |
| $n^{r}_{\neg i, w}$, $n^{ir}_{\neg i, w}$ | the number of times word $w$ is relevant (irrelevant), excluding biterm $b_i$ |
| $n^{r}_{\neg i}$, $n^{ir}_{\neg i}$ | the number of relevant (irrelevant) biterms, excluding biterm $b_i$ |
| $w_{i,m}$ | the $m$th word in biterm $b_i$, where $m = 1, 2$ |
| $n^{r, \neg i_m}_{w_{i,m} \mid k}$ | the number of times that word $w_{i,m}$ is assigned to topic $k$ under relevant status, excluding $w_{i,m}$ |
| $n^{r, \neg i_m}_{\cdot \mid k}$ | the total number of words assigned to topic $k$ under relevant status, excluding $w_{i,m}$ |
| $n^{ir, \neg i}_{w_{i,m} \mid k}$ | the number of times that word $w_{i,m}$ is assigned to topic $k$ under irrelevant status, excluding biterm $b_i$ |
| $n^{ir, \neg i}_{\cdot \mid k}$ | the total number of words assigned to topic $k$ under irrelevant status, excluding biterm $b_i$ |
| $n^{r}_{w}$ | the number of times that word $w$ is relevant |
| $n^{r}_{\neg w, \cdot}$ | the total number of words that are relevant, excluding word $w$ |
Table 3. Datasets.

| Type | Source | Domain | Length | Size (KB) |
|---|---|---|---|---|
| short | cigar (a) | Twitter | 2.947836 | 641 |
| short | ecig (a) | Twitter | 3.499578 | 708 |
| short | Oscars (b) | Twitter | 3.906165 | 565 |
| medium | baby (c) | Amazon | 28.07813 | 141 |
| medium | camera (a) | Amazon | 79.08307 | 1285 |
| medium | computer (a) | Amazon | 80.9001 | 1295 |
| long | home (c) | Amazon | 179.4867 | 619 |
| long | food (c) | Amazon | 258.4938 | 195 |
| long | care (c) | Amazon | 675.1493 | 1523 |

(a) http://jmcauley.ucsd.edu/data/amazon/, accessed on 28 October 2021; (b) https://github.com/shuaiwanghk/TTM, accessed on 28 October 2021; (c) https://www.kaggle.com/madhurinani/oscars-2017-tweets, accessed on 28 October 2021.
Table 4. Topic coherence.

| Model | Dataset | TopM = 5 | TopM = 10 | TopM = 15 | TopM = 20 | TopM = 25 | TopM = 30 |
|---|---|---|---|---|---|---|---|
| BiTTM | cigar | −43.85160224 | −213.938738 | −515.6799558 | −942.5318026 | −1493.639111 | −2170.963155 |
| BiTTM | ecig | −47.18005764 | −225.7818028 | −537.0456906 | −975.5867746 | −1540.959263 | −2233.885475 |
| BiTTM | Oscars | −41.56838081 | −203.696938 | −495.7966598 | −913.9376539 | −1451.063314 | −2102.622643 |
| BiTTM | baby | −18.77161638 | −97.92745779 | −258.704099 | −512.2443085 | −862.3028153 | −1314.514524 |
| BiTTM | camera | −11.60928852 | −57.83209167 | −151.6040027 | −295.3511581 | −493.2409214 | −748.8210303 |
| BiTTM | computer | −14.82764795 | −66.09833632 | −163.5166346 | −310.8723517 | −517.6532656 | −789.6438756 |
| BiTTM | home | −9.146932399 | −45.84159746 | −117.7837843 | −239.7455224 | −418.9237371 | −656.914665 |
| BiTTM | food | −7.47342957 | −42.52194503 | −109.7998671 | −220.2148208 | −370.0466113 | −563.1454128 |
| BiTTM | care | −11.52868834 | −49.17233659 | −116.4456707 | −224.1156048 | −375.6415389 | −585.8790988 |
| TTM | cigar | −34.68247162 | −197.0478114 | −493.1432304 | −921.1591797 | −1481.235217 | −2165.174583 |
| TTM | ecig | −34.67967533 | −200.4202621 | −502.3021289 | −935.8301943 | −1495.777983 | −2182.933002 |
| TTM | Oscars | −31.17239749 | −178.8829525 | −456.1151923 | −858.0192159 | −1383.122495 | −2032.967068 |
| TTM | baby | −16.2223621 | −101.2752879 | −281.5019366 | −562.4756204 | −937.8116571 | −1440.487996 |
| TTM | camera | −11.28057698 | −62.01834755 | −164.0824617 | −318.5467601 | −529.0200582 | −801.4709905 |
| TTM | computer | −13.16215736 | −68.04826101 | −174.2534566 | −332.9993329 | −549.2408003 | −824.3810116 |
| TTM | home | −9.934943893 | −51.89310174 | −131.4846734 | −254.4960003 | −428.6019462 | −643.0140473 |
| TTM | food | −11.33222981 | −54.36687511 | −128.2609949 | −242.0789656 | −399.9506445 | −600.8773534 |
| TTM | care | −9.739145131 | −53.9780039 | −133.7955495 | −254.3544155 | −432.013784 | −666.1259446 |
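Table 4 reports coherence values without restating how they are computed. Assuming the measure is the document co-occurrence coherence of Mimno et al. [35] (an assumption, since this part of the article does not say so), a minimal sketch looks as follows. The function name `umass_coherence` and the toy corpus are illustrative and do not come from the paper.

```python
import math
from typing import Iterable, Sequence


def umass_coherence(top_words: Sequence[str],
                    documents: Iterable[Iterable[str]]) -> float:
    """Co-occurrence coherence in the style of Mimno et al. (assumed measure).

    Sums log((D(v_m, v_l) + 1) / D(v_l)) over all pairs l < m of the ranked
    top words, where D(.) counts the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words: str) -> int:
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                raise ValueError(f"word {top_words[l]!r} never occurs in the corpus")
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


# Toy corpus (hypothetical, for illustration only).
toy_docs = [["tea", "green", "health"], ["tea", "health"], ["tea", "price"]]
print(umass_coherence(["tea", "health", "green"], toy_docs))
```

Under this assumption the score is a sum over M(M − 1)/2 word pairs, so its magnitude grows with TopM, which would be consistent with the trend across the columns of Table 4.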
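Table 5. P@n scores of all models over a set of 18 targets on 9 data sets. n is set to 5, 10 and 20.

| Type | Dataset | Target | BiTTM P@5 | BiTTM P@10 | BiTTM P@20 | TTM P@5 | TTM P@10 | TTM P@20 | BTM-PD P@5 | BTM-PD P@10 | BTM-PD P@20 | BTM P@5 | BTM P@10 | BTM P@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| short | cigar | ashtray | 0.92 | 0.84 | 0.6 | 0.92 | 0.66 | 0.46 | 0.6 | 0.6 | 0.39 | 0.2 | 0.2 | 0.135 |
| short | cigar | place | 0.64 | 0.56 | 0.43 | 0.52 | 0.44 | 0.29 | 0.48 | 0.4 | 0.26 | 0.22 | 0.17 | 0.135 |
| short | ecig | smokeless | 0.72 | 0.64 | 0.45 | 0.56 | 0.5 | 0.45 | 0.44 | 0.4 | 0.31 | 0.3 | 0.24 | 0.19 |
| short | ecig | warning | 0.6 | 0.54 | 0.54 | 0.48 | 0.46 | 0.37 | 0.36 | 0.3 | 0.25 | 0.4 | 0.24 | 0.215 |
| short | Oscars | mistake | 0.6 | 0.56 | 0.49 | 0.4 | 0.48 | 0.45 | 0.2 | 0.32 | 0.28 | 0.28 | 0.18 | 0.14 |
| short | Oscars | oscarsfail | 0.64 | 0.52 | 0.44 | 0.36 | 0.4 | 0.31 | 0.24 | 0.18 | 0.15 | 0.26 | 0.18 | 0.125 |
| medium | baby | rinses | 0.52 | 0.38 | 0.3 | 0.2 | 0.26 | 0.24 | 0.08 | 0.1 | 0.11 | 0.12 | 0.12 | 0.085 |
| medium | baby | shower | 0.52 | 0.48 | 0.38 | 0.44 | 0.4 | 0.32 | 0.2 | 0.36 | 0.31 | 0.24 | 0.19 | 0.145 |
| medium | camera | portable | 0.52 | 0.48 | 0.4 | 0.4 | 0.44 | 0.29 | 0.32 | 0.28 | 0.23 | 0.1 | 0.09 | 0.08 |
| medium | camera | price | 0.52 | 0.48 | 0.51 | 0.48 | 0.44 | 0.42 | 0.44 | 0.34 | 0.26 | 0.08 | 0.11 | 0.09 |
| medium | computer | display | 0.48 | 0.74 | 0.72 | 0.4 | 0.62 | 0.62 | 0.28 | 0.32 | 0.32 | 0.1 | 0.15 | 0.165 |
| medium | computer | keyboard | 0.52 | 0.66 | 0.59 | 0.44 | 0.6 | 0.52 | 0.32 | 0.34 | 0.33 | 0.18 | 0.15 | 0.14 |
| long | home | clean | 0.76 | 0.8 | 0.69 | 0.68 | 0.72 | 0.64 | 0.32 | 0.34 | 0.32 | 0.25 | 0.19 | 0.18 |
| long | home | kitchen | 0.72 | 0.72 | 0.66 | 0.68 | 0.8 | 0.66 | 0.64 | 0.52 | 0.44 | 0.22 | 0.18 | 0.16 |
| long | food | disease | 0.6 | 0.6 | 0.6 | 0.48 | 0.46 | 0.42 | 0.44 | 0.4 | 0.38 | 0.22 | 0.23 | 0.19 |
| long | food | microwave | 0.84 | 0.58 | 0.47 | 0.68 | 0.64 | 0.44 | 0.32 | 0.32 | 0.26 | 0.18 | 0.17 | 0.125 |
| long | care | diabetic | 0.56 | 0.62 | 0.53 | 0.48 | 0.44 | 0.25 | 0.28 | 0.28 | 0.23 | 0.12 | 0.13 | 0.12 |
| long | care | infant | 0.48 | 0.54 | 0.51 | 0.36 | 0.3 | 0.29 | 0.2 | 0.14 | 0.11 | 0.18 | 0.15 | 0.105 |
| average score | | | 0.62 | 0.60 | 0.52 | 0.50 | 0.50 | 0.41 | 0.34 | 0.33 | 0.27 | 0.20 | 0.17 | 0.14 |
| improvement by BiTTM | | | | | | +0.12 | +0.09 | +0.10 | +0.28 | +0.27 | +0.24 | +0.42 | +0.43 | +0.38 |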
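The two summary rows of Table 5 can be re-derived from the per-target scores. The sketch below does this for the P@5 columns of BiTTM and TTM only. The values are copied from the table in row order (ashtray to infant), and the variable and function names are illustrative.

```python
# P@5 per target, copied from Table 5 in row order (ashtray ... infant).
bittm_p5 = [0.92, 0.64, 0.72, 0.60, 0.60, 0.64, 0.52, 0.52, 0.52,
            0.52, 0.48, 0.52, 0.76, 0.72, 0.60, 0.84, 0.56, 0.48]
ttm_p5 = [0.92, 0.52, 0.56, 0.48, 0.40, 0.36, 0.20, 0.44, 0.40,
          0.48, 0.40, 0.44, 0.68, 0.68, 0.48, 0.68, 0.48, 0.36]


def mean(values):
    """Arithmetic mean over the 18 targets."""
    return sum(values) / len(values)


avg_bittm = mean(bittm_p5)
avg_ttm = mean(ttm_p5)
improvement = avg_bittm - avg_ttm  # difference taken before rounding

print(f"BiTTM {avg_bittm:.2f}, TTM {avg_ttm:.2f}, improvement {improvement:+.2f}")
# prints: BiTTM 0.62, TTM 0.50, improvement +0.12
```

The improvement row appears to be computed from the unrounded averages: at P@10 the averages are about 0.597 and 0.503, which gives +0.09 rather than the +0.10 suggested by the rounded entries.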
Table 6. The time cost (seconds) of all datasets. The best performance on each dataset is in bold.

| Domain | Size (KB) | BiTTM | BTM-PD | TTM | BTM |
|---|---|---|---|---|---|
| cigar | 641 | 6.012 | **0.378** | 60.994 | 314.940 |
| ecig | 708 | 2.960 | **2.900** | 107.008 | 489.180 |
| Oscars | 565 | **3.620** | 3.934 | 58.042 | 418.560 |
| baby | 141 | 16.011 | **9.753** | 24.246 | 1153.740 |
| camera | 1285 | **81.086** | 164.834 | 811.684 | 15,997.380 |
| computer | 1295 | **96.978** | 147.023 | 713.068 | 17,559.660 |
| home | 619 | **37.083** | 90.283 | 129.481 | 7116.300 |
| food | 195 | **9.166** | 31.833 | 18.534 | 2496.600 |
| care | 1523 | **135.040** | 218.065 | 1516.927 | 18,767.580 |
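Table 7. The contents of topics on target “disease” in long texts food.
Datasets: Food. Target: Disease

| BiTTM: Tea | BiTTM: SFA | BiTTM: Research | BiTTM: Risk | BiTTM: Prevention | TTM: Tea | TTM: SFA | TTM: Research | TTM: Risk | BTM-PD: Tea | BTM-PD: SFA | BTM-PD: Research | BTM: Tea | BTM: SFA | BTM: Research |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tea | fat | study | disease | cherry | disease | fat | heart | risk | tea | protein | cherry | tea | fat | study |
| effect | cancer | increase | risk | reduce | tea | saturated | study | disease | green | oil | chocolate | antioxidant | saturated | disease |
| benefit | people | research | heart | tart | flavour | tart | cancer | cherry | weight | fat | tart | disease | disease | risk |
| work | health | small | high | prevent | bad | chip | reduce | star | fat | palm | study | heart | risk | heart |
| lower | fruit | antioxidant | green | include | price | sweetener | order | bag | loss | saturated | disease | fruit | oil | reduce |
| taste | blood | good | cell | body | higher | follow | brand | measure | study | quality | dark | rich | coconut | show |
| long | sugar | eat | level | find | company | thing | vegetable | calorie | increase | eat | cocoa | provide | find | cherry |
| kind | animal | amount | show | product | back | production | cook | state | body | diet | sweet | health | study | health |
| pure | saturated | diet | day | result | nutrient | add | large | price | drink | disease | cancer | substitute | increase | food |
| rich | drink | water | sweet | add | simply | expensive | drink | shipping | calorie | gram | product | vegetable | health | cancer |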
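Table 8. The contents of topics on two approximate semantic targets in dataset “baby”.

Target: Bath

| BiTTM: Blanket | BiTTM: Spout | BiTTM: Protection | BiTTM: Sentiment | TTM: Blanket | TTM: Spout | TTM: Sentiment | BTM-PD: - | BTM-PD: - | BTM: Blanket | BTM: Spout-1 | BTM: Spout-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cover | spout | cute | fit | spout | fit | spout | - | - | cover | spout | faucet |
| pull | head | play | time | cover | cute | bath | - | - | bath | cute | easy |
| product | stay | tub | daughter | snap | son | shower | - | - | spout | head | spout |
| shower | bath | put | faucet | bright | whale | tub | - | - | faucet | fit | cute |
| time | whale | easy | give | hate | quickly | cover | - | - | head | cover | tub |
| child | month | buy | nice | read | shower | whale | - | - | protect | faucet | cover |
| recommend | crib | thing | problem | realize | bend | fit | - | - | tub | tub | product |
| blanket | easily | kid | bump | mobile | bear | faucet | - | - | whale | son | bath |
| lot | side | protect | diaper | break | front | knob | - | - | fit | whale | head |
| start | year | bumper | find | parent | touch | perfect | - | - | time | bath | whale |

Target: shower

| BiTTM | BiTTM | BiTTM | BiTTM | TTM | TTM | TTM | BTM-PD | BTM-PD | BTM | BTM | BTM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| blanket | spout | cute | son | shower | spout | cover | spout | spout | blanket | spout | spout |
| buy | shower | thing | pull | gift | tub | fit | shower | shower | buy | cover | fit |
| cover | head | easy | bath | bath | kid | cute | cover | cover | gift | fit | shower |
| put | stay | faucet | time | buy | top | whale | cute | whale | swaddle | faucet | cute |
| big | kid | product | daughter | hole | easy | face | faucet | bath | receive | head | pull |
| pretty | gift | side | play | blanket | head | easily | head | tub | friend | shower | stay |
| safe | perfect | problem | nice | time | remove | couple | pull | pull | child | product | cover |
| stroller | soft | install | protect | thing | worry | snap | tub | mold | day | tub | son |
| change | worry | child | find | trip | picture | high | whale | kid | hold | time | tub |
| quality | bag | car | fall | totally | face | expect | bath | faucet | shower | pull | whale |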
Table 9. The contents of topics on two approximate semantic targets in dataset “Oscars”.
Columns are grouped by model from left to right: BiTTM, TTM, BTM-PD, BTM.

Target: Oscarsfail

| Beginning | Correction | Discussion | Discussion | Discussion | Discussion | Discussion |
|---|---|---|---|---|---|---|
| lalaland | moonlight | oscar | oscarsfail | win | win | moonlight |
| hollywood | oscarsfail | envelopegate | oscar | oscarsfail | vote | bestpicture |
| winner | award | picture | envelopegate | lalaland | lalaland | lalaland |
| cast | mistake | actor | majorgroup | vote | moonlight | russians |
| actress | vote | reaction | pick | short | popular | hack |
| emma | variety | electoral | scene | white | electoral | election |
| people | moment | time | hollywood | rating | oscarsfail | envelope |
| white | mahershala | movie | hack | helmets | lrihendry | votetrumppics |
| barryjenkins | violadavis | word | moana | moonlight | word | neontaster |
| bestpicture | black | affleck | auliicravalho | documentary | oscar | oscarsfail |

Target: mistake

| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| lalaland | moonlight | win | win | lalaland | lalaland | moonlight |
| realize | winner | mistake | moonlight | mistake | moonlight | winner |
| announce | real | academy | picture | moonlight | picture | picture |
| award | oscar | black | moment | crew | win | announce |
| producer | night | film | realize | realize | realize | lalaland |
| moment | people | reaction | lalaland | cast | mistake | real |
| congrat | oscarsfail | give | watch | moment | moment | abc |
| abc | movie | congratulation | mistake | watch | crew | watch |
| thr | theellenshow | time | cast | win | watch | mistake |
| happen | russians | support | crew | picture | cast | realize |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Wang, J.; Chen, L.; Li, L.; Wu, X. BiTTM: A Core Biterms-Based Topic Model for Targeted Analysis. Appl. Sci. 2021, 11, 10162. https://doi.org/10.3390/app112110162

AMA Style

Wang J, Chen L, Li L, Wu X. BiTTM: A Core Biterms-Based Topic Model for Targeted Analysis. Applied Sciences. 2021; 11(21):10162. https://doi.org/10.3390/app112110162

Chicago/Turabian Style

Wang, Jiamiao, Ling Chen, Lei Li, and Xindong Wu. 2021. "BiTTM: A Core Biterms-Based Topic Model for Targeted Analysis" Applied Sciences 11, no. 21: 10162. https://doi.org/10.3390/app112110162
