Article

A Social Media Knowledge Retrieval Method Based on Knowledge Demands and Knowledge Supplies

1 College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai 200093, China
2 School of Automation, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(14), 3154; https://doi.org/10.3390/math11143154
Submission received: 30 May 2023 / Revised: 10 July 2023 / Accepted: 12 July 2023 / Published: 18 July 2023

Abstract: In large social media knowledge retrieval systems, employing a keyword-based fuzzy matching method to obtain knowledge presents several challenges, such as irrelevant, inaccurate, disorganized, or non-systematic knowledge results. Therefore, this paper proposes a knowledge retrieval method capable of returning hierarchical, systematized knowledge results. The method can match the knowledge demands according to the keywords input by users and then present the knowledge supplies corresponding to the knowledge demands as results to the users. Firstly, a knowledge structure named Knowledge Demand is designed to represent the genuine needs of social media users. This knowledge structure measures the popularity of topic combinations in the Topic Map, so the topic combinations with high popularity are regarded as the main content of the Knowledge Demands. Secondly, the proposed method designs a hierarchical and systematic knowledge structure, named Knowledge Supply, which provides Knowledge Solutions matched with the Knowledge Demands. The Knowledge Supply is generated based on the Knowledge Element Repository, using the BLEU similarity matrix to retrieve Knowledge Elements with high similarity, and then clustering these Knowledge Elements into several knowledge schemes to extract the Knowledge Solutions. The organized Knowledge Elements and Knowledge Solutions are the presentation of each Knowledge Supply. Finally, this research crawls posts in the “Autohome Forum” and conducts an experiment by simulating the user’s actual knowledge search process. The experiment shows that the proposed method is an effective knowledge retrieval method, which can provide users with hierarchical and systematized knowledge.

1. Introduction

As social media booms, more and more users are expressing their opinions and thoughts on various social media platforms. In this paper, social media specifically refers to professional social media, which is defined as an online content production platform used by internet users to share and exchange opinions, insights, experiences, and creative ideas related to a specific professional subject, containing a large text corpus with valuable information in the form of comments and articles published by users [1]. Users and enterprises have also realized that the information needed to support their decision making is not isolated or fragmented information but rather specified and interrelated knowledge [2,3]. However, professional social media text is characterized by a large volume, varying lengths, strong professional relevance, a high level of creativity, and the use of colloquial language. Users searching on social media typically focus on a specific topic [4]. Using full-text searching methods to acquire specific knowledge on a particular topic, users often need to engage in extensive search and reading activities [5,6]. Concurrently, knowledge is abundant and complex, making it challenging to organize within a unified structure and difficult to comprehend [7,8]. Users must manually perform classification, deduplication, evaluation, and summarization of the acquired information, processes that are both time-consuming and labor-intensive. Consequently, the cost of directly obtaining specialized knowledge from social media platforms is considerable for users and enterprises [9]. Therefore, this paper proposes a new knowledge retrieval method for users and enterprises to reduce search time and improve the accuracy and efficiency of knowledge acquisition from social media platforms.
Knowledge retrieval refers to the process of identifying, extracting, and organizing relevant information or knowledge from various sources based on specific queries or requirements [10,11]. The main objective of knowledge retrieval is to improve the efficiency and accuracy of obtaining desired knowledge, ultimately facilitating decision-making or problem-solving processes [12]. Knowledge retrieval is often seen as a subfield of knowledge discovery [13]. Several scholars have proposed classic methods in this field. Mavrogiorgou et al. (2021) [14] proposed a KDD approach that concentrates on the selection, preprocessing, and transformation of healthcare data and verified its applicability in various healthcare scenarios with a prototype. Belcastro et al. (2022) [15] presented a knowledge discovery method comprising three modules: discovering user mobility patterns, estimating public opinion, and discovering social media discussion topics. With the development of big data, knowledge discovery methods have begun to transition from traditional data mining approaches to incorporating machine learning methods [16]. Xu et al. (2019) [17] proposed a novel mechanism that enables KV-MemNN models to perform interpretable reasoning for complex questions on a variety of knowledge graphs and question-answering retrieval datasets. Manias et al. (2023) [18] improved sentiment knowledge acquisition by introducing multilingual BERT-based classifiers, enhancing the accuracy of sentiment analysis. Tan et al. (2023) [19] introduced a federated graph learning (FGL) framework for training strong GNNs that extracts the common underlying structural knowledge, demonstrating its superiority over existing methods in cross-dataset and cross-domain non-IID settings. However, the aforementioned classical methods do not consider the creation of a hierarchical and systematic knowledge organization, which can enhance the efficiency of knowledge retrieval and improve the user’s search experience.
The proposed knowledge retrieval method is built on the newly designed knowledge structures of Knowledge Demand and Knowledge Supply. In this paper, Knowledge Demand is defined as a knowledge structure extracted from social media to represent the relevant topics of users’ real needs. The construction of Knowledge Demand relies on topics extracted from social media. Since users’ real knowledge needs usually revolve around specific topics in social media, the representation of Knowledge Demand is designed as a knowledge main topic and its several combinations of subtopic words [20]. Knowledge Supply denotes the related knowledge clusters corresponding to a particular Knowledge Demand, which is a manifestation of Knowledge Sharing [21]. Its content encompasses multiple knowledge clusters that can satisfy the specific Knowledge Demand. The knowledge clusters in Knowledge Supply are called Knowledge Solutions in this paper. Moreover, Knowledge Supply has a well-organized, easy-to-read structure for users. The detailed content structure and construction method of Knowledge Demand and Knowledge Supply are presented later.
Previous studies have been conducted based on different knowledge characteristics. Knowledge can be classified according to different characteristics, such as ambiguity, explicitness, immediacy, and heterogeneity [22]. For ambiguous knowledge, scholars have proposed retrieval methods based on fuzzy mathematics (Zadeh, L.A., 1975) [23], using the distance, proximity, and Zadeh truth value of two fuzzy sets (Zhu, L., 2008) [24]. From the perspective of explicit and implicit knowledge, Zhao et al. (2006) [25] studied how to use the latent semantic indexing model to achieve mutual retrieval between user profiles and enterprise knowledge sources. Yu et al. (2011) [26] applied Herbart’s formal stage theory to investigate the dynamic matching process of implicit knowledge. Yang et al. (2017) [27] established a multi-dimensional matching model for information retrieval based on implicit knowledge to achieve the reasonable matching of implicit knowledge in information retrieval. In public emergencies, knowledge from social media needs to be timely. Liu et al. (2011) [28] proposed a high-precision public information retrieval method based on a cloud model and cloud computing. Yue (2022) [29] proposed a category-theoretic framework for emergency knowledge retrieval, which provides a reliable basis for decision making. As for heterogeneity, knowledge takes various forms in different organizations. Rubiolo et al. (2012) [30] proposed an ontology-matching model based on an Artificial Neural Network for discovering different knowledge sources on the Semantic Web. Guo et al. (2017) [31] combined evidence reasoning theory and knowledge fusion methods to effectively match network public opinion knowledge.
The expression of knowledge needs to rely on specific knowledge organization tools, such as information systems and structured knowledge databases, which differ from general tangible products. (Liu, Y. and Li, K., 2017) [32]. Outdated knowledge organization tools can result in a high false positive rate in knowledge retrieval and reduce the service satisfaction of the knowledge demand end. Therefore, many scholars have optimized knowledge retrieval methods with the help of the latest information technology. Mohamed et al. (2019) [33] proposed an automatic knowledge retrieval algorithm on the internet based on semantic role labeling and lexical-syntactic matching techniques, which extended the architecture of semantic networks. Sun et al. (2019) [34] established a technology term information matching model based on a domain knowledge base to improve the translation quality of technology terms. Ma et al. (2019) [35] proposed a method of building a specialized knowledge base that relies on knowledge from the literature to meet the personalized needs of researchers in traditional Chinese medicine, specifically in promoting blood circulation and removing blood stasis.
There are several issues in the existing social media knowledge retrieval methods:
(1) Existing methods are limited to a surface-level understanding of the topics behind Knowledge Demands, which makes it difficult to identify users’ actual knowledge needs. Consequently, the retrieved knowledge is often inappropriate for the user’s needs.
(2) The efficiency of existing knowledge retrieval methods is low, and few studies focus on trading space for time; indexing methods based on different matching criteria remain under-explored.
(3) The knowledge results returned to users in existing research are poorly organized, lacking a systematic classification or clustering of knowledge. Repetitive knowledge content is scattered among many knowledge results, which reduces users’ knowledge acquisition efficiency.
Based on the gaps identified in existing research, the main objective of this article is to enhance the efficiency and effectiveness of knowledge acquisition from social media by designing novel knowledge organizations and retrieval methods.
The originality of this paper lies in three key contributions: Firstly, it defines a new structure for knowledge requirements considering the semantic meaning of the knowledge demands, rather than just the surface features such as keywords or phrases. Secondly, it defines a new structure for knowledge supply, reducing the frequency of repetitive results and extracting valuable knowledge content in a structured and hierarchical manner. Thirdly, it improves the retrieval efficiency through the redesign of the retrieval process and the utilization of algorithms such as FP-Tree and Affinity Propagation, which allows for a faster and more accurate matching of user needs with relevant knowledge.

2. Materials and Methods

2.1. Modeling Framework

Figure 1 presents the overall modeling framework of this paper and annotates the structure of the proposed method.
The modeling steps for knowledge retrieval in this paper are as follows:
(1) Knowledge Retrieval Process Design: designing the application process of knowledge retrieval based on the user’s perspective. First, the user needs to enter keywords. Second, related entities are retrieved based on the keywords, and the user’s genuine Knowledge Demand is recognized by combining the Topic Map. Then, the Knowledge Element Repository is used to retrieve Knowledge Elements that match the Knowledge Demand. These Knowledge Elements are clustered, aggregated, and refined to obtain a set of Knowledge Elements that serve as Knowledge Supply. Finally, a structured, comprehensive, and explicit Knowledge Supply is provided to the user.
(2) Modeling of Knowledge Demand: building a knowledge structure for Knowledge Demand and extracting Knowledge Demands. The Topic Map is used to create candidate combinations of topic words. The raw corpus from social media is then screened to identify questions relevant to the candidate combinations. The screened question corpus is used to construct an FP-Tree, a data structure that can be used to measure the Support Degree of each candidate combination. The candidate combinations with the highest Support Degrees are then used to represent the user’s Knowledge Demands.
(3) Modeling of Knowledge Supply: building a knowledge structure for Knowledge Supply and generating Knowledge Supplies: Firstly, Word2vec technology is used to obtain a similarity matrix between words. This matrix is then used to build an inverted index of words in the knowledge repository. Relevant Knowledge Elements are searched and retrieved through the inverted index. Secondly, these Knowledge Elements are clustered based on similarity, generating an aggregated Knowledge Solution. Finally, the comprehensive sentiment index, keyword phrases, and Knowledge Conclusions in the Knowledge Solution are extracted as concise information, forming a complete Knowledge Supply.
The three modeling processes mentioned above are elaborated on in Section 2.2, Section 2.3 and Section 2.4, respectively.
In this article, the proposed method is grounded in two fundamental knowledge structures: Topic Maps and Knowledge Elements. Topic Maps are an emerging intelligent knowledge organization tool, developed in response to the interconnectedness of information resources. Topics are the core ideas that users want to express when posting comments on social media platforms. The role of Topic Maps is to present an overview of topics within a specific domain and to reveal the relationships among topics. Topic Maps can serve as a fundamental basis for Knowledge Element modeling, as well as for knowledge retrieval services. The concepts mentioned in Figure 1 are explained in Table 1, and the modeling process of these concepts is described in detail in Section 2.2, Section 2.3 and Section 2.4.

2.2. Application Process for Knowledge Retrieval

Users play a key role in social media knowledge usage, and the goal of knowledge retrieval should also be to serve users. The knowledge retrieval process is a significant, high-frequency user application scenario. Reasonable and efficient processes and well-organized search results are critical for users. Figure 2 shows the steps of the knowledge retrieval method with the involvement of users’ genuine knowledge demands.
The specific steps of the method are as follows:
(1)
User inputs search content. The user enters keywords or key sentences related to the desired knowledge in the search box.
(2)
Identify entity vocabulary in the search content. Identify whether the user wants to learn about a specific entity such as a specific category, model, or product function.
(3)
Identify Knowledge Demands. Identify the user’s genuine knowledge demands based on the user’s input keywords.
(4)
Match Knowledge Supplies. Use similarity calculation methods to obtain the corresponding Knowledge Supplies for the Knowledge Demands.
(5)
User obtains specific knowledge. The Knowledge Demand and Knowledge Supply obtained in steps (3) and (4), respectively, are presented as search results according to an integrated knowledge structure, making it easy for users to read and analyze the search results.
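As an illustration only, the following minimal sketch strings these five steps together in Python; the entity dictionary, Topic Map, and Knowledge Element Repository are toy stand-ins, and all function and variable names are hypothetical rather than the authors’ implementation.

```python
from typing import Dict, List, Tuple

# Toy stand-ins for the real resources. In the paper these are built from an
# entity dictionary, the Topic Map, and the Knowledge Element Repository.
ENTITY_DICTIONARY = {"magotan"}
TOPIC_MAP: Dict[str, List[str]] = {"abnormal noise": ["front wheel", "brake", "sunroof"]}
KNOWLEDGE_ELEMENT_REPOSITORY: Dict[Tuple[str, str], List[str]] = {
    ("abnormal noise", "front wheel"): ["KE-1", "KE-5", "KE-9"],
    ("abnormal noise", "brake"): ["KE-2", "KE-7"],
}

def retrieve_knowledge(query: str):
    keywords = [w.strip().lower() for w in query.split(",")]                  # step (1)
    entities = [w for w in keywords if w in ENTITY_DICTIONARY]                # step (2)
    topics = [w for w in keywords if w not in ENTITY_DICTIONARY]
    demands = [(t, sub) for t in topics for sub in TOPIC_MAP.get(t, [])]      # step (3)
    supplies = {d: KNOWLEDGE_ELEMENT_REPOSITORY.get(d, []) for d in demands}  # step (4)
    return [(entities, d, supplies[d]) for d in demands if supplies[d]]       # step (5)

print(retrieve_knowledge("Magotan, abnormal noise"))
```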

2.3. Modeling of Knowledge Demand

Knowledge Demand refers to the need for obtaining relevant knowledge on a specific topic by enterprises or users. In this study, the Topic Map is constructed from the corpus of all users in social media, encompassing subjective comments, usage experience, user interaction, and various text contents of user questions. The construction of Knowledge Demand aims to be realistic and accurate. Mining Knowledge Demand from the user corpus can ensure that the topic content in the Topic Map covers all aspects of user discussion and includes Knowledge Demand. Therefore, Knowledge Demand can be regarded as a subset of the Topic Map, which includes the combination of topics that meet the Knowledge Demand of users or enterprises. In this study, machine learning and text clustering methods will be used to obtain Knowledge Demand based on the construction of the Topic Map.
In this paper, the knowledge organization of Knowledge Demands is defined as
$\langle \text{Knowledge Topic } t,\ \text{Subtopic Combination } t_{set},\ \text{Support Degree } s,\ \text{Demand Question Set } q_{set} \rangle$
In this context, a Knowledge Topic $t$ is derived from the Topic Map, while a Subtopic Combination $t_{set}$ is a combination of subtopics in the Topic Map that can reflect the Knowledge Demand. The Support Degree $s$ is a measure of the importance of a particular Knowledge Demand, and the Demand Question Set $q_{set}$ is the set of user questions that correspond to the Subtopic Combination.
The process of extracting Knowledge Demand can be summarized as follows: the Knowledge Demand modeling method extracts Knowledge Demands from the user-questioning corpus $D_q$ of the social media platform and the Topic Map $T$. The above process is represented as
$f_d: (D_q, T) \rightarrow (t, t_{set}, s, q_{set})$
The steps for mining Knowledge Demand include
(1) On the basis of building the Topic Map Repository of the social media, combine the main topic words in the Topic Map to form a candidate topic combination, which is the potential Knowledge Demand (Knowledge Topic, Subtopic Combination).
(2) Train the LSTM model using the social media user corpus, and use the model to screen user questions from the original database.
(3) Construct a Knowledge Demand Support Degree measurement method based on the FP-Tree model using the screened user question text to calculate the Support Degree of each potential Knowledge Demand. Finally, the user question posts used in this process form the question set of the Knowledge Demand, which can provide support for extracting the Knowledge Demand.

2.3.1. Generating Topic Map and Topic Combinations

The main topic of social media includes the topic contents discussed by users and the hierarchical relationships between various topics. Knowledge Demand is often based on multiple similar or adjacent topics, and the keywords of a Knowledge Demand primarily lie within a single topic cluster. Since the Topic Map is extracted from the global corpus, and topic clusters are closely related by subject terms, the topic set $t_{set}$ corresponding to a Knowledge Demand is usually contained within a topic cluster of the Topic Map.
To select the topic set corresponding to the Knowledge Demand from the Topic Map, it is necessary to first generate candidate topic combinations from the Topic Map, which can be represented as
$t_{total\_set} = \{ C_{N_1}^{M}, C_{N_2}^{M}, \ldots, C_{N_K}^{M} \}$
where $K$ represents the number of topic clusters in the Topic Map, each of which contains $N$ topic words. From the $N$ topic words of a cluster, any $M$ topic words are selected to form a candidate topic combination. The total number of topic combinations in the candidate topic set $t_{total\_set}$ is represented by $T$:
$T = \frac{N(N-1)\cdots(N-M+1)}{M!} \cdot K$
The above formula illustrates that the number of topic combinations in the candidate topic set grows combinatorially with the number of topic clusters and the number of topic words. Therefore, an efficient and fast method is needed to determine whether a topic combination can support a Knowledge Demand.
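For illustration, the sketch below enumerates candidate topic combinations for two hypothetical topic clusters with M = 2 and checks the count against Formula (3); the cluster contents are invented examples.

```python
from itertools import combinations

# Two hypothetical topic clusters (K = 2), each containing N = 4 topic words.
topic_clusters = [
    ["abnormal noise", "brake", "front wheel", "sunroof"],
    ["fuel consumption", "engine", "highway", "city"],
]
M = 2  # number of topic words per candidate combination

candidate_combinations = [
    combo for cluster in topic_clusters for combo in combinations(cluster, M)
]

# Formula (3): T = N(N-1)...(N-M+1)/M! * K; here 4*3/2 * 2 = 12 combinations.
N, K = len(topic_clusters[0]), len(topic_clusters)
assert len(candidate_combinations) == (N * (N - 1) // 2) * K == 12
```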

2.3.2. Screening of User Question Posts Based on LSTM Model

In social media, the user corpus contains potential Knowledge Demand. However, directly mining from these corpora will lead to irrelevant information and pseudo-demands, such as advertising and second-hand trading information, which decrease the value and relevance of Knowledge Demand.
To obtain high-quality Knowledge Demand, it is necessary to obtain “raw materials” suitable for mining Knowledge Demand from social media. This ensures the generation of a high-quality user corpus that aligns with the characteristics of Knowledge Demand. The high-quality corpus should contain Knowledge Demand and Knowledge Demand Questions raised by users. Q&A (question and answer) posts are essential in social media posts, and most of them meet the above characteristics. Therefore, whether a post is a Q&A post is used as a feature for screening potential Knowledge Demand corpora. Q&A posts are identified, those that do not contain domain knowledge are removed, and these Q&A posts are used as the text “raw materials” for mining Knowledge Demand.
In this study, we train an LSTM (Long Short-Term Memory) model to determine whether a text post is a Q&A post and whether it contains specific knowledge in the domain.
(1) Obtaining annotations. The training of the LSTM model relies on high-quality training data, making the acquisition of a reasonable annotated dataset crucial. However, manual annotation is time-consuming and labor-intensive. Thus, this study uses existing tags in social media for annotation. Most social media platforms have Q&A sections containing a certain number of Q&A posts, which can be regarded as user-generated annotations and serve as positive samples in this study. Negative samples are selected from the remaining posts through manual annotation.
(2) Training the LSTM model. The advantage of the LSTM model is its use of the recurrently connected memory block structure to replace the hidden units in standard RNNs. This architecture allows for the preservation of the relationships between words that are far apart in a sentence and is suitable for identifying Q&A posts. The LSTM model is trained using the annotated training set.
(3) Using LSTM to identify Q&A posts. Input the unannotated posts from social media into the trained LSTM model one by one. Store the identified Q&A posts in a database to form a text corpus for mining Knowledge Demand.
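A minimal sketch of such a classifier is shown below using the Keras API; the vocabulary size, title length, embedding dimension, layer sizes, and the random placeholder data are illustrative assumptions, not the configuration reported in this paper.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, MAX_LEN = 20000, 50  # assumed vocabulary size and padded title length

# Binary classifier: does a post title belong to a Q&A post?
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64),                        # memory blocks keep long-range word relations
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(title is a Q&A post)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data: padded word-id sequences of titles and 0/1 labels
# (1 = Q&A-section post used as a positive sample, 0 = manually annotated negative sample).
x_train = np.random.randint(1, VOCAB_SIZE, size=(64, MAX_LEN))
y_train = np.random.randint(0, 2, size=(64, 1))
model.fit(x_train, y_train, epochs=1, batch_size=16, verbose=0)

# Unlabeled titles predicted above 0.5 would be stored as Q&A posts for demand mining.
is_qa_post = model.predict(x_train, verbose=0) > 0.5
```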

2.3.3. A Knowledge Demand Support Degree Measurement Model Based on the FP-Tree Model

The large number of topic combinations in the candidate topic set $t_{total\_set}$ requires a method for measuring the degree of user demand for a specific topic combination. This method should be able to quickly calculate the user’s need for a specific topic combination based on the raw corpus and thus select suitable topic combinations as Knowledge Demands. The idea for this study comes from a method commonly used in e-commerce mining called association analysis (Apriori). The basic idea is that if the topic words in a candidate topic combination $t$ often appear in a Q&A post simultaneously, then the probability of this topic combination being a Knowledge Demand is relatively high. According to this idea, it is only necessary to calculate the Support Degree $s$ of each candidate topic combination in the Knowledge-Demand-mining raw material library and then set a Support Degree threshold to filter out high-support topic combinations as Knowledge Demands. Here, the Support Degree of a Knowledge Demand is defined as
$s = \frac{\sigma(t_1 \cup t_2 \cup \cdots \cup t_n)}{N}$
where $\sigma(t_1 \cup t_2 \cup \cdots \cup t_n)$ represents the number of times the combination of topic keywords in the topic knowledge $t$ appears simultaneously in the Q&A posts of the Knowledge Demand mining material $D_q$, and $N$ represents the total number of question posts in $D_q$. The Support Degree $s$ thus indicates the frequency with which the topic knowledge combination $t$ appears in the Knowledge Demand mining material.
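A direct way to compute this Support Degree is to scan every question title for each candidate combination, as in the small sketch below (the titles are illustrative); the next paragraph explains why this brute-force traversal does not scale.

```python
# Each Q&A post title reduced to its set of topic words (illustrative data).
qa_titles = [
    {"abnormal noise", "front wheel", "brake"},
    {"abnormal noise", "sunroof"},
    {"fuel consumption", "highway"},
    {"abnormal noise", "front wheel"},
]

def support_degree(topic_combination, titles):
    """Formula (4): the share of titles containing every word of the combination."""
    hits = sum(1 for title in titles if set(topic_combination) <= title)
    return hits / len(titles)

print(support_degree(("abnormal noise", "front wheel"), qa_titles))  # 0.5
```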
However, according to Formula (3), the number of topic combinations in the candidate topic set grows combinatorially with the number of topic clusters and topic words. Moreover, when a new topic combination appears and its support needs to be calculated, the Knowledge Demand mining material must be re-traversed. Therefore, directly calculating the Knowledge Demand Support Degree $s$ by traversal is very inefficient. This study uses the FP-Tree (Frequent Pattern Tree) data structure to improve computing efficiency. The FP-Tree uses a divide-and-conquer approach to compress the original transaction data into a frequent itemset tree, reducing the amount of original data while retaining the association information among items [36].
The process of constructing the FP-Tree in this study is as follows:
(1)
Firstly, the title of each Q&A post in the Knowledge Demand mining material is segmented, and any words that are not topic words in the Topic Map are filtered out, so that each title is represented only by its topic words.
(2)
Then, the titles in the Knowledge Demand mining material are traversed to count the frequency of each individual word. A minimum support for each word (the minimum number of times a word must appear) is defined, and words that appear less often than the minimum support are deleted. The words within each title are then re-ordered in descending order of word frequency.
(3)
The titles of each Q&A post in the original Knowledge Demand mining material are traversed again, and an item header table is created (the topic words are sorted in descending order according to word frequency). Then, for each topic word in the item header table, its conditional pattern base (CPB) is found, and low-support word combinations are deleted by recursively calling the tree structure.
After the establishment of the FP-Tree, the Support Degree of each combination of topic words in the candidate Knowledge Demand set will be measured according to Formula (4).
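As a sketch of this step, the FP-Growth implementation in the mlxtend library (an assumption; the paper does not name a specific library) builds the tree and returns the Support Degree of every frequent topic combination in a single pass over the illustrative titles below.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Tokenized, topic-filtered Q&A post titles (illustrative data).
titles = [
    ["abnormal noise", "front wheel", "brake"],
    ["abnormal noise", "sunroof"],
    ["abnormal noise", "front wheel"],
    ["fuel consumption", "highway"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(titles), columns=encoder.columns_)

# Frequent topic combinations with their Support Degrees (Formula (4)), obtained
# from the compressed FP-Tree instead of re-scanning the corpus per candidate.
frequent = fpgrowth(onehot, min_support=0.25, use_colnames=True, max_len=2)
print(frequent.sort_values("support", ascending=False))
```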

2.4. Modeling of Knowledge Supply

Knowledge Supply refers to the relevant knowledge points corresponding to a specific Knowledge Demand, including multiple solutions that can meet the Knowledge Demand. These Knowledge Solutions are sourced from the Knowledge Element Repository. The main service target of Knowledge Supply is the Knowledge Demand, which dynamically changes with the Knowledge Demand. Therefore, this study defines the organizational form of Knowledge Supply as
$\langle \text{Knowledge Demand},\ \text{Set}\langle \text{Knowledge Solution}\langle \text{Set}\langle \text{Knowledge Element} \rangle,\ \text{Knowledge Conclusion} \rangle \rangle \rangle$
Each Knowledge Demand corresponds to multiple Knowledge Solutions. A Knowledge Solution is a collection of multiple Knowledge Elements with similar content or meaning, and each Knowledge Element supports the corresponding Knowledge Solution. In addition, each Knowledge Solution corresponds to a Knowledge Conclusion, which summarizes and describes the Knowledge Solution and consists of several concise statements. The structure of the Knowledge Supply is shown in Figure 3.
The Knowledge Element in Figure 3, also known as a Knowledge Unit or Knowledge Tuple, is a basic unit of knowledge used for operation and management [37]. It is an independent Knowledge Element that can be freely divided, expressed, stored, organized, retrieved, and utilized. In this paper, the structure of Knowledge Elements is defined as <Topic, Sentiment Orientation, Keywords, Key Sentences>. The Knowledge Element Repository should be pre-established with LDA (Latent Dirichlet Allocation), LSTM sentiment analysis, and TextRank.
According to the structure, the construction process of Knowledge Supply can be divided into three steps:
(1)
Match relevant Knowledge Elements based on Knowledge Demands: in the Knowledge Element Repository, select the Knowledge Elements that contain the topic words or their synonyms in the Knowledge Demand, and ensure the comprehensiveness of the screening results while maintaining screening efficiency.
(2)
Cluster Knowledge Elements to generate Knowledge Solutions: Define the similarity between Knowledge Elements based on dimensions such as topic, keywords, key sentences, and knowledge meanings. Then, use this similarity to calculate the similarity matrix of Knowledge Elements. Finally, use the Knowledge Element similarity as edge weights to construct a graph model and apply graph clustering (Affinity Propagation; see Section 2.4.2) to obtain a topic Knowledge Solution for each cluster.
(3)
Generate Knowledge Conclusions: merge the contents of the Knowledge Elements, calculate the importance of each keyword and key sentences in the Knowledge Elements, and generate Knowledge Conclusions.

2.4.1. Matching Relevant Knowledge Elements According to Knowledge Demand

Matching Knowledge Elements in the Knowledge Element Repository that contain vocabulary from the Knowledge Demand, or vocabulary similar to it, is the foundation for refining Knowledge Supply. As Knowledge Demands change with the corpus, efficient and comprehensive retrieval from the Knowledge Element Repository is essential for satisfying the matching requirements. Therefore, to improve the matching speed while ensuring the completeness of the matching results, this study introduces semantic similarity calculation technology and full-text retrieval technology to match Knowledge Elements. The specific steps include
(1)
Construction of Similarity Matrix between Words
Based on a similarity matrix, this study filters synonyms and near-synonyms for the topic keywords in the Knowledge Demand. The diverse synonyms and near-synonyms found in social media corpora are often informal and colloquial. The Word2vec model is used to obtain the similarity between words, and a similarity matrix based on the ontological vocabulary is constructed to speed up the search process. Based on the similarity matrix, a similarity threshold $threshold \in [0, 1]$ is set to quickly identify vocabulary with a semantic similarity above the threshold and to establish mapping relationships between these words. When searching the Knowledge Element Repository in subsequent steps, these words and their mapping relationships can be used to replace synonyms or near-synonyms.
(2)
Construction of the Index for Expanded Ontological Vocabulary in the Knowledge Element Repository
The Knowledge Element Repository contains thousands of Knowledge Element data samples. Traversing the entire database for each Knowledge Element according to the search word will inevitably slow the retrieval speed. In addition, the searched words obtained in step (1) are the topic words and their synonyms from the expanded ontological vocabulary in the Knowledge Demand. Therefore, this study uses the Lucene full-text search technology [38] to establish an index in advance for the data in the Knowledge Element Repository and then uses this index for high-speed matching.
(3)
Knowledge Element Matching based on Inverted Index
This step involves using the inverted index to match topic words, obtaining the set of Knowledge Elements that include any topic word in the Knowledge Demand, and filtering this set using the Matching Degree between the Knowledge Demand and the Knowledge Elements. When matching each topic word, the set of Knowledge Elements containing that topic word, together with the frequency of the topic word in each of them, can be obtained directly through the inverted index. Then, the Knowledge Elements with a higher Matching Degree to the Knowledge Demand are selected by calculating the Matching Degree between the Knowledge Demand and each Knowledge Element. The process can be described by the following formulas:
$T = \{ T_1, T_2, \ldots, T_n \}$
where the set $T$ represents all the topic words in the Knowledge Demand, and $T_n$ represents the $n$th topic word in the Knowledge Demand;
$S = \{ U_1, U_2, \ldots, U_m \}, \quad T \cap U_m \neq \emptyset$
where $S$ represents the set of all Knowledge Elements $U_m$ that contain at least one topic word $T_i$;
$MD_m = \sum_{T_i \in U_m} T_{i,F}, \quad U_m \in S$
where $MD_m$ represents the Matching Degree between the Knowledge Demand and the $m$th Knowledge Element, $T_{i,F}$ represents the occurrence frequency of the topic word $T_i$ in the corresponding Knowledge Element $U_m$, and $U_m$ is an element of the set $S$. This Matching Degree can be understood as the sum of the frequencies of each topic word appearing in the Knowledge Element.
Once the Matching Degrees between each Knowledge Element in set $S$ and the Knowledge Demand are obtained, the overall distribution of Matching Degrees over the Knowledge Elements in set $S$ can be examined, and the Knowledge Elements with the TopN highest Matching Degrees can be selected as the matching results, denoted as $U_S$.
Introducing an inverted index when matching Knowledge Elements significantly reduces the algorithm’s time complexity. While ensuring the matching efficiency, selecting Knowledge Elements by calculating the Matching Degree guarantees the accuracy and comprehensiveness of the matching results $U_S$.
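A minimal sketch of this matching step follows: a plain Python dictionary plays the role of the Lucene index, a small synonym map stands in for the Word2vec similarity matrix, and the toy Knowledge Elements are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy Knowledge Element Repository (id -> text of the element).
knowledge_elements = {
    "KE-1": "front wheel noise when braking front wheel hub inspected",
    "KE-2": "sunroof noise on bumpy roads",
    "KE-3": "fuel consumption on the highway",
}
synonyms = {"sound": "noise"}  # maps near-synonyms onto the ontology vocabulary

# Inverted index: word -> {element id: term frequency}.
inverted_index = defaultdict(Counter)
for ke_id, text in knowledge_elements.items():
    for word in text.split():
        inverted_index[word][ke_id] += 1

def matching_degree(demand_topic_words):
    """Formula (7): MD_m is the summed frequency of the demand's topic words in element U_m."""
    scores = Counter()
    for word in demand_topic_words:
        word = synonyms.get(word, word)            # replace synonyms before the lookup
        scores.update(inverted_index.get(word, {}))
    return scores

# A Knowledge Demand expressed here as single topic words.
print(matching_degree(["noise", "front", "wheel"]).most_common(2))
# [('KE-1', 5), ('KE-2', 1)]  -> the TopN elements form the matching result U_S
```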

2.4.2. Generating Knowledge Solutions Based on the Knowledge Elements Clustering

After obtaining the set of matching Knowledge Elements $U_S$ for a given Knowledge Demand, multiple topic Knowledge Solutions must be generated from this set. A topic Knowledge Solution comprises multiple Knowledge Elements with similar content or topics, describing similar objective things or knowledge, and it can support the Knowledge Demand from similar perspectives. Therefore, this research clusters the Knowledge Elements in the set according to Knowledge Element similarity and takes each cluster of the clustering result as a topic Knowledge Solution.
However, the Knowledge Elements in this study are semi-structured data with a structure of <text topic, topic sentiment orientation, keywords, key sentences>, where both the keywords and key sentences are unstructured data. In addition, the Knowledge Elements have multi-dimensional properties, and existing Knowledge Element similarity models are mainly based on the concept of sets to calculate the similarity of corresponding attribute sets, and then to obtain the similarity through a weighted sum [39]. However, the set-based similarity models have difficulty in considering the information contained in the word order of key sentences. Therefore, this study proposes a Knowledge Element similarity model based on the BLEU (Bilingual Evaluation Understudy) model for Knowledge Element clustering.
(1)
Similarity Model for Knowledge Elements
The concept behind the Knowledge Element similarity model in this research is to employ different similarity calculation methods for each sub-element of the Knowledge Element based on its characteristics. Therefore, the corresponding similarity of each element is calculated, and then they are weighted and summed to obtain a comprehensive similarity measure that minimizes information loss. The similarity between Knowledge Elements is defined as
$Sim(u_i, u_j) = w_h Sim_B(h_{u_i}, h_{u_j}) + w_t Sim_J(t_{u_i}, t_{u_j}) + w_p Sim_P(p_{u_i}, p_{u_j}) + w_{kw} Sim_J(kw_{u_i}, kw_{u_j}) + w_{ks} Sim_B(ks_{u_i}, ks_{u_j})$
where $w_h, w_t, w_p, w_{kw}, w_{ks}$, respectively, represent the similarity weights of the title, topic, sentiment orientation, keywords, and key sentences of the Knowledge Element, with $w_h + w_t + w_p + w_{kw} + w_{ks} = 1$, that is, the weights sum to 1; $h_{u_i}, h_{u_j}$ represent the titles of the Knowledge Elements; $t_{u_i}, t_{u_j}$ represent the sets of topic keywords in the Knowledge Elements; $p_{u_i}, p_{u_j}$ represent the sentiment orientations of the Knowledge Elements; $kw_{u_i}, kw_{u_j}$ represent the sets of keywords in the Knowledge Elements; $ks_{u_i}, ks_{u_j}$ represent the key sentences of the Knowledge Elements; $Sim_J$ represents the Jaccard similarity based on set concepts; $Sim_B$ represents the sentence similarity based on the BLEU metric; and $Sim_P$ represents the similarity of sentiment orientation.
① Jaccard similarity based on set concepts $Sim_J$
$Sim_J$ represents the Jaccard similarity based on set concepts, which divides the number of elements in the intersection of two sets by the number of elements in their union to obtain the similarity between the two sets. The formula for the Jaccard similarity is
$Sim_J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
where $A$ and $B$ represent two sets. According to the formula, when sets $A$ and $B$ are identical, the value of $Sim_J(A, B)$ is 1, and when they are completely disjoint, the value of $Sim_J(A, B)$ is 0.
② Sentence similarity based on an improved BLEU model
The calculation of the $Sim_B$ similarity is based on the BLEU model. BLEU was originally used as a metric to measure the similarity between machine-translated text and reference text in machine translation. It is a directional similarity metric with a value range between 0 and 1, with a value closer to 1 indicating a better machine translation result. With the continuous development of machine learning and natural language processing technology, the scope and calculation method of the BLEU model have also evolved, and it can now be used for calculating sentence similarity. The original BLEU model only considered word frequency in the text, constructed a dictionary based on word frequency, and calculated the similarity using the set-based Jaccard method. However, when two texts share many common words, or some words appear frequently, the calculated similarity may be high because the frequently occurring words account for a high proportion, even though the actual semantics of the two texts may differ. Set-based similarity is also difficult to compute accurately for sentences because it ignores the relationships between words. To address these drawbacks of the BLEU model, the improved model introduces the concept of the n-gram. The original BLEU model treats individual words as the smallest unit, while the improved BLEU model combines multiple words into the smallest unit. Since the number of combined words can be an integer between 1 and $n$, selecting different combination lengths produces different Jaccard similarities. Therefore, the similarities corresponding to the various lengths of combined words need to be summarized and combined.
This study further improves the BLEU model. The original n-gram-based model calculates a directed similarity score between the reference and candidate sentences, i.e., $Sim(s_1, s_2) \neq Sim(s_2, s_1)$. However, to measure sentence similarity using BLEU, the model is modified into a bidirectional sentence similarity metric, denoted $Sim_B$, with $Sim_B(s_1, s_2) = Sim_B(s_2, s_1)$.
First, compute the sentence similarity $Sim_n$ for combinations of words of length $n$:
$Sim_n = \frac{\sum_{k=0}^{K-1} \min\left(Count_{w_n^k}(s_1),\ Count_{w_n^k}(s_2)\right)}{\sum_{k=0}^{K-1} \max\left(Count_{w_n^k}(s_1),\ Count_{w_n^k}(s_2)\right)}$
where $K$ represents the total number of word combinations of length $n$ in sentences $s_1$ and $s_2$; $k$ indexes the $k$th word combination; $w_n$ represents a combination of words of length $n$; and $Count_{w_n^k}(s_1)$ represents the number of times the $k$th word combination $w_n^k$ appears in the sentence $s_1$.
Then, the similarities of word combinations of each length are summarized. This study uses the weighted geometric mean to calculate the average similarity $Sim_{avg}$ over the $N$ length-specific similarities:
$Sim_{avg} = \left( \prod_{n=0}^{N-1} Sim_n^{W_n} \right)^{\frac{1}{\sum_{n=0}^{N-1} W_n}} = e^{\frac{\sum_{n=0}^{N-1} W_n \ln Sim_n}{\sum_{n=0}^{N-1} W_n}}$
The formula is simplified by using the properties of the natural logarithm, where $W_n$ represents the weight of the similarity of length-$n$ word combinations in the geometric mean. If every $W_n$ is set to 1, the above formula simplifies to
$Sim_{avg} = e^{\frac{\sum_{n=0}^{N-1} \ln Sim_n}{N}}$
Finally, when one of the two sentences is much longer than the other, the long sentence may contain all the word combinations of the short sentence. According to the above formula, this yields a high similarity that does not reflect the actual semantics. Therefore, a sentence length penalty factor is added:
$\phi = \begin{cases} e^{1 - \frac{len(s_1)}{len(s_2)}}, & len(s_1) > len(s_2) \\ 1, & len(s_1) = len(s_2) \\ e^{1 - \frac{len(s_2)}{len(s_1)}}, & len(s_1) < len(s_2) \end{cases}$
where $len(s_1)$ represents the length of the sentence $s_1$. The effect of this factor is that the greater the difference in length between two sentences, the lower their similarity. Therefore, the calculation formula for the $Sim_B$ similarity is
$Sim_B = \phi \cdot e^{\frac{\sum_{n=0}^{N-1} \ln Sim_n}{N}}$
Using the above formula to calculate the similarity between names and key sentences in the Knowledge Element can reduce the loss of order information between words. The simplified formula can improve the calculation speed.
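The sketch below implements this symmetric BLEU-style similarity under simplifying assumptions: whitespace tokenization, word combinations of length 1 and 2 only, and unit weights in the geometric mean.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the length-n word combinations in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sim_n(tokens1, tokens2, n):
    """Min/max overlap ratio of length-n word combinations (Formula (9))."""
    c1, c2 = ngram_counts(tokens1, n), ngram_counts(tokens2, n)
    grams = set(c1) | set(c2)
    numerator = sum(min(c1[g], c2[g]) for g in grams)
    denominator = sum(max(c1[g], c2[g]) for g in grams)
    return numerator / denominator if denominator else 0.0

def sim_b(sentence1, sentence2, max_n=2):
    """Bidirectional BLEU-style sentence similarity (Formula (14))."""
    s1, s2 = sentence1.split(), sentence2.split()
    sims = [sim_n(s1, s2, n) for n in range(1, max_n + 1)]
    if any(s == 0 for s in sims):
        return 0.0
    geo_mean = math.exp(sum(math.log(s) for s in sims) / len(sims))         # Formula (12)
    longer, shorter = max(len(s1), len(s2)), min(len(s1), len(s2))
    penalty = 1.0 if longer == shorter else math.exp(1 - longer / shorter)  # Formula (13)
    return penalty * geo_mean

print(sim_b("front wheel abnormal noise when braking",
            "abnormal noise from the front wheel"))   # ~0.35
```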
③ Similarity Based on Transformation of Ordinal Variables
Ordinal variables refer to categorical variables with ordinal meaning, which can usually be sorted according to a specific order meaning. In the Knowledge Element, the sentiment orientation of the topic can be classified as an ordinal variable. However, ordinal variables need to be sorted in order of meaning beforehand. This study stipulates that the more positive the sentiment orientation of the Knowledge Element, the greater the variable value. Therefore, it is necessary to transform the original sentiment orientation variable value. A value of 0 expresses negative emotion, 1 expresses neutral emotion, and 2 expresses positive emotion. After the transformation, it will be calculated according to the following formula:
$Sim_P = 1 - \frac{|p_i - p_j|}{n - 1}$
where $p_i$ and $p_j$ represent the sentiment orientation variables of Knowledge Elements $i$ and $j$, respectively, and $n$ represents the number of categories of the sentiment orientation variable.
(2)
Knowledge-Solution-Generating Model based on Knowledge Element Similarity Clustering
According to Formula (8), the similarity between any two Knowledge Elements can be calculated. Therefore, each Knowledge Element $u_i$ in the set $U_S$ matched by the Knowledge Demand can be treated as a node, and the similarity $Sim(u_i, u_j)$ between Knowledge Elements can be treated as an edge weight to construct an undirected graph model $G_S$. The purpose of constructing this graph model is to cluster the Knowledge Elements with a graph clustering algorithm and to use each cluster in the clustering result as a Knowledge Solution.
Due to the large number of Knowledge Demands and the varying number and similarity of Knowledge Elements in different Knowledge Element sets $U_S$, it is difficult to ensure the accuracy and efficiency of clustering by manually setting a clustering target number for each set $U_S$ based on experience. In addition, generating Knowledge Solutions by clustering Knowledge Elements requires highly valid and accurate clustering results, which places high demands on the clustering algorithm.
This study uses the Affinity Propagation (AP) algorithm for clustering [40]. Compared to other commonly used clustering algorithms, such as K-means, the AP algorithm has the characteristics of high robustness and accuracy. In addition, the AP algorithm treats each sample as a potential cluster center. Therefore, the AP algorithm is a suitable clustering method for generating solutions based on the clustering of Knowledge Elements’ similarity.
The AP algorithm is used to cluster the Knowledge Elements in the Knowledge Element set $U_S$, and the clustering result contains several clusters, each of which corresponds to a Knowledge Solution $S$ for a specific Knowledge Demand.
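The following sketch combines the per-field similarities with the weights later used in Section 3.4.2 and clusters the resulting similarity matrix with scikit-learn’s Affinity Propagation (an assumed implementation choice); the per-field similarity values are randomly generated placeholders rather than real Knowledge Element data.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Weights for title, topic, sentiment, keywords, and key sentences (Section 3.4.2).
weights = np.array([0.3, 0.1, 0.1, 0.2, 0.3])

# field_sims[i, j] holds the five sub-similarities between Knowledge Elements i and j,
# normally computed with Sim_B, Sim_J, and Sim_P; random values are used here.
rng = np.random.default_rng(0)
n_elements = 6
field_sims = rng.uniform(0.2, 1.0, size=(n_elements, n_elements, 5))
field_sims = (field_sims + field_sims.transpose(1, 0, 2)) / 2   # enforce symmetry

similarity_matrix = field_sims @ weights                        # Formula (8)
np.fill_diagonal(similarity_matrix, 1.0)

# Affinity Propagation on the precomputed similarity matrix; every Knowledge Element
# is a potential exemplar, and each resulting cluster is one Knowledge Solution.
ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0)
labels = ap.fit_predict(similarity_matrix)
print(labels)   # cluster (Knowledge Solution) index of each Knowledge Element
```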

2.4.3. Generating Knowledge Conclusions

Knowledge Conclusions refer to the summary information and statements extracted from a Knowledge Solution $S$, which clearly express the main idea of the entire Knowledge Solution. A Knowledge Conclusion summarizes all the knowledge contained in the Knowledge Elements of the solution; by reading it, users can understand the essence of the Knowledge Solution. Therefore, based on the structure of Knowledge Elements, this study defines the organizational structure of a Knowledge Conclusion $C$ as
$\langle \text{Sentiment Index } p,\ \text{Keyword } kw,\ \text{Conclusion Sentence } s \rangle$.
(1)
Method for Calculating Sentiment Index
The sentiment orientation of each Knowledge Element in the Knowledge Element set is transformed into an ordinal variable during the similarity calculation, where 0 represents negative sentiment, 1 represents neutral sentiment, and 2 represents positive sentiment. The larger the variable value, the higher the positive sentiment. Therefore, in this scenario, the sentiment score p of the Knowledge Conclusion can be obtained by simply calculating the arithmetic mean of the sentiment orientation variables of all Knowledge Elements in the Knowledge Solution using the following formula:
$p = \frac{1}{n} \sum_{i=1}^{n} p_i$
where $n$ represents the number of Knowledge Elements in the Knowledge Solution.
(2)
Keyword Extraction Method based on Weighted Term Frequency Importance
The keywords k w in the Knowledge Conclusion should reflect the accuracy and comprehensiveness of the Knowledge Solution, and the source of the keywords is not necessarily limited to the keywords in the original Knowledge Elements. Any word in any Knowledge Element representing the meaning of the Knowledge Solution can become a keyword k w in the Knowledge Conclusion. Therefore, it is necessary to merge and deduplicate all the vocabulary in the Knowledge Solution and remove stop words and meaningless words to form a candidate keyword dictionary D . In addition, it is necessary to consider the frequency of occurrence of words in each element of the Knowledge Element. The word frequency weight in different elements is also different. For example, words appearing in the topic and keywords already represent the meaning of the post to a certain extent, so their importance is higher than when the same word appears in other elements. Finally, the breadth of the occurrence of words in the Knowledge Element should also be considered. For example, when a word appears simultaneously in the title, keywords, and key sentence, its importance is significantly increased. Therefore, this study defines a universal formula for the weighted word frequency importance of individual words.
$importance_m = \sum_{i=1}^{n} \left( w_h f_h + w_t f_t + w_w f_w + w_s f_s \right) \cdot \log_2(K + 1)$
In this formula, there are a total of $n$ Knowledge Elements in the Knowledge Solution, where $i$ indexes the $i$th Knowledge Element; $f_h, f_t, f_w, f_s$ represent the frequency of the word in the title $h$, topic words $t$, keywords $w$, and key sentences $s$ of the Knowledge Element, respectively; $w_h, w_t, w_w, w_s$ represent the weights of the word frequency in each field; and $K$ represents the number of fields of the Knowledge Element in which the word appears.
The main topic words of the Knowledge Elements in this study are obtained through the LDA algorithm, and the keywords are calculated through the improved TextRank algorithm, so the word frequency in these two fields is always 1. In addition, the weight of the word frequency in the title $w_h$ and the weight $w_s$ are both set to 1, so Formula (17) can be simplified to
$importance_m = \sum_{i=1}^{n} \left( f_h + w_t + w_w + f_s \right) \cdot \log_2(K + 1)$
The values of $w_t$ and $w_w$ should be greater than 1.
According to Formula (18), the importance of each word in the candidate vocabulary $D$ is calculated, and the TopN words with the highest importance are selected as the keywords $kw$ of the Knowledge Conclusion.
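A sketch of this weighted term-frequency importance is given below; each Knowledge Element is reduced to the word lists of its title, topic words, keywords, and key sentences, and the weights w_t and w_w (both greater than 1) are illustrative choices rather than the values used in the paper.

```python
import math
from collections import Counter

def keyword_importance(knowledge_elements, w_t=2.0, w_w=1.5):
    """Formula (18): weighted term frequency times log2(breadth + 1), summed over elements."""
    importance = Counter()
    for ke in knowledge_elements:
        fields = {
            "title": Counter(ke["title"]),
            "topics": Counter(ke["topics"]),
            "keywords": Counter(ke["keywords"]),
            "sentences": Counter(ke["sentences"]),
        }
        for word in set().union(*fields.values()):
            # f_h and f_s enter with weight 1; topic words and keywords always have
            # frequency 1, so they contribute w_t and w_w at most once each.
            weighted = (fields["title"][word]
                        + w_t * (word in fields["topics"])
                        + w_w * (word in fields["keywords"])
                        + fields["sentences"][word])
            breadth = sum(word in f for f in fields.values())   # K in Formula (18)
            importance[word] += weighted * math.log2(breadth + 1)
    return importance

ke = {"title": ["front", "wheel", "noise"],
      "topics": ["noise"],
      "keywords": ["front", "wheel"],
      "sentences": ["noise", "when", "braking", "front", "wheel"]}
print(keyword_importance([ke]).most_common(3))   # TopN candidate keywords kw
```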
(3)
Conclusion-Generating Method based on Key Sentence Extraction Method
The primary method used to generate the conclusion s from the Knowledge Solution S can be classified as “key sentence extraction technology” or “automatic summarization technology”.
TextRank is a typical automatic summarization algorithm that uses a voting mechanism to rank text units. Therefore, this study also adopts the idea of voting, combined with the keywords $kw$ obtained for the Knowledge Conclusion, to calculate the importance $importance_{st}$ of each sentence of each Knowledge Element in the Knowledge Solution (the titles and key sentences of all Knowledge Elements are included in this computation), and then selects high-importance sentences to form the conclusion sentences. The importance of each sentence consists of two parts. The first part is the average similarity between the sentence and all other sentences in the Knowledge Solution; this study calculates the similarity $Sim_B$ between each pair of sentences using Formula (14). The second part is the proportion of the sentence’s words that belong to the keywords $kw$ of the Knowledge Conclusion. The formula is
$importance_{st} = \frac{1}{c - 1} \sum_{j' \in S,\ j' \neq j} Sim_B(j, j') + \frac{1}{w} \sum_{k_i \in kw} f_{k_i}$
where $S$ represents the set of sentences from all Knowledge Elements in the Knowledge Solution, $c$ represents the number of sentences in $S$, $j'$ represents any sentence in $S$ other than sentence $j$, $k_i$ represents a keyword in the keyword set $kw$ of the Knowledge Conclusion, $f_{k_i}$ represents the frequency of the keyword $k_i$ in sentence $j$, and $w$ represents the length of sentence $j$ (the number of words it contains). It should be noted that since sentences in social media are often short and punctuation usage is often non-standard, commas, semicolons, consecutive spaces, and line breaks in both Chinese and English are all used as sentence delimiters when segmenting sentences.
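The sketch below scores sentences according to Formula (19); the simple token-overlap function stands in for the Sim_B similarity detailed in Section 2.4.2, and the sentences and keywords are illustrative.

```python
def sentence_similarity(a, b):
    # Simplified stand-in for Sim_B (the full BLEU-style version appears in Section 2.4.2).
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def sentence_importance(sentences, keywords):
    """Average similarity to the other sentences plus the keyword share of the sentence."""
    scores = {}
    for j, sentence in enumerate(sentences):
        others = [s for i, s in enumerate(sentences) if i != j]
        avg_sim = sum(sentence_similarity(sentence, o) for o in others) / max(len(others), 1)
        words = sentence.split()
        keyword_ratio = sum(words.count(k) for k in keywords) / len(words)
        scores[sentence] = avg_sim + keyword_ratio
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sentences = ["front wheel abnormal noise when braking",
             "abnormal noise from the front wheel hub",
             "fuel consumption is normal on the highway"]
keywords = ["front", "wheel", "noise"]
print(sentence_importance(sentences, keywords)[0])   # highest-scoring conclusion sentence
```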

3. Results

The data for this study were from the “Autohome Forum”, https://club.autohome.com.cn (accessed on 1 September 2021). More than 20 popular car model forums were selected for crawling, including “Passat Forum”, “Accord Forum”, and “Camry Forum”, from September 2019 to September 2021. Approximately 200,000 post contents were crawled in total. Figure 4 shows a snapshot of the post contents.
In this section, we conduct knowledge retrieval experiments in the Social Media User Knowledge System using the methods described in Section 2, taking the automobile forum as an example. The experiment simulates user searches, follows the steps of the topic knowledge matching flow in Section 2.2, and displays and explains all intermediate results in the experiment to validate the topic knowledge organization and matching methods. At the same time, we provide result examples to demonstrate the effectiveness of the methods.
In the experiment, we utilized the following experimental platform and development environment:
(1)
Experimental Platform:
CPU: Intel Core i7-7700K, 4.0 GHz
Memory: DDR3, 8 GB × 4, totaling 32 GB
GPU: Nvidia GTX 1080 Ti (11 GB VRAM)
Operating System: Ubuntu 16.04.3
(2)
Development Environment:
Python 3.6.8
TensorFlow 1.0.0
MySQL 5.8.1

3.1. User Inputs Search Content

The keywords input for this experiment simulating a user search were “Magotan” and “abnormal noise”. The user expects to obtain comprehensive, clear, and structured knowledge about a specific entity and topic by entering keywords describing the required expertise. Therefore, through topic knowledge matching, the user should obtain topic knowledge related to the “Magotan” car model and the “abnormal noise” topic. Additionally, the topic knowledge should be displayed layer by layer according to the knowledge organization structures of both Knowledge Demand and Knowledge Supply.

3.2. Identifying Automobile Entities in Search Content

To identify the entity terms discussed by users in Autohome Forums, this study first established an entity dictionary containing the mainstream car brands, manufacturers, and models in the forum. The dictionary included 150 car brands, 331 car manufacturers corresponding to these brands, and 1620 car models corresponding to these manufacturers.
Using the constructed automobile entity dictionary, the user’s input keywords can be processed to identify the specific automobile entity that the user is concerned with. This experiment simulated the keywords “Magotan” and “abnormal noise” input by the user. With the help of the car entity dictionary, the “Magotan” model was identified as the specific entity. Therefore, the “Magotan Forum” content was selected for analysis. The selected corpus contained 28,738 topic posts and 293,633 replies or comments, totaling 322,371 items.

3.3. Identifying Knowledge Demands

3.3.1. Generate Topic Combinations

In this experiment, the user input keywords “Magotan” and “abnormal noise” were simulated. “Magotan” was identified as a car entity, and the remaining keyword, “abnormal noise”, was used to match the Knowledge Demand. First, the primary topics in the generated Topic Map were traversed to obtain the highest-similarity primary topic, “abnormal noise, brake”, which included 30 topic words. Following the method in Section 2.3.1, the number of topic words in each topic combination was set to M = 2, and candidate topic combinations were generated. There were a total of $C_{30}^{2} = 435$ candidate topic combinations, and the top 10 generated candidate topic combinations are shown in Table 2.

3.3.2. Filtering User Problem Posts

To screen user problem posts, following the method in Section 2.3.2, it is necessary to select Q&A posts from the global corpus of the automobile forum for LSTM model training. This experiment selected 5000 Q&A posts and 5000 non-Q&A post titles from the “Autohome” Q&A section as the training dataset. One thousand Q&A posts and one thousand non-Q&A post titles were used as the validation dataset. Finally, 28,738 posts from the “Magotan” forum were used as the test dataset for testing. Table 3 shows sample examples from the training dataset, where the post title is the input content of the model and whether it is a Q&A post is the output result of the model.
The above model achieved an accuracy of 96.1% on the validation dataset after training, indicating good classification performance. The titles of 28,738 posts from the “Magotan” forum were input into the model, and eventually, 8165 Q&A posts were identified. These Q&A posts were the “raw materials” for measuring the Support Degree of Knowledge Demand in subsequent steps.

3.3.3. Knowledge Demand Support Degree Measurement

From the 8165 Q&A posts identified by the LSTM model, 702 posts containing the keyword “abnormal noise” were selected by filtering. Then, following the method in Section 2.3.3, the 702 Q&A posts were used to construct an FP-Tree. To simplify and clarify the process of Knowledge Demand generation, the total number of posts was reduced by a factor of 100, and the Knowledge Demand Support Degree measurement process is explained using seven posts. The titles of these Q&A posts were tokenized, and the specialized vocabulary was filtered using the Topic Map of the car domain. Meaningless and unrelated words were removed, and the frequency of all remaining specialized words in the Q&A posts was calculated and sorted in descending order of frequency. The seven selected posts, along with the tokenization, filtering, and sorting results, are shown in Table 4.
Table 5 shows the frequency statistics and sorting of domain vocabulary in the Q&A posts.
Using the contents in Table 4 and Table 5, the FP-tree was constructed for the selected vocabulary in the Q&A post according to the steps in Section 2.3.3.
Figure 5 shows the list of words sorted by frequency in Table 5 on the left. The connecting lines from top to bottom represent the connections between different words in the topic phrase that appear in the same post. In contrast, the connecting lines from left to right represent the connections between the same topic phrase in different posts.
For each candidate topic combination in Table 2, each word in the combination was searched first from top to bottom and then from left to right, iteratively, to quickly obtain the Support Degree $s$ of the candidate topic combination across all 702 Q&A post titles. The candidate topic combinations with a high Support Degree were then selected as sub-topic combinations of the Knowledge Demand, and the result was organized according to the Knowledge Demand structure.
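To make the support counting concrete, the sketch below mines frequent word pairs over the seven tokenized titles of Table 4 using the FP-Growth implementation in the mlxtend library; this is a stand-in for the FP-Tree traversal described above, with “Steering”/“Steer” normalized as in Table 5.

```python
# A sketch of support counting for candidate topic combinations using
# mlxtend's FP-Growth in place of the custom FP-Tree traversal; the
# transactions are the tokenized titles from Table 4.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

posts = [
    ["Abnormal Noise", "Front Wheels"],
    ["Abnormal Noise", "Bump", "Chassis"],
    ["Abnormal Noise", "Brake", "Startup", "Releasing"],
    ["Abnormal Noise", "Front Wheels", "Bump", "Help", "Roads", "New Car"],
    ["Abnormal Noise", "Front Wheels", "Steer", "Problem"],
    ["Abnormal Noise", "Help", "Steer", "Low Speed", "Driver's Compartment", "On the Spot"],
    ["Abnormal Noise", "Brake", "Stop", "Process", "Sound"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(posts).transform(posts), columns=te.columns_)

# Frequent pairs with support >= 2/7; {Abnormal Noise, Front Wheels} obtains
# support 3/7 ≈ 0.429, matching the first Knowledge Demand row in Table 6.
pairs = fpgrowth(onehot, min_support=2 / 7, use_colnames=True, max_len=2)
print(pairs.sort_values("support", ascending=False))
```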
The Knowledge Demand obtained through the keywords “Magotan” and “abnormal noise” is shown in Table 6.

3.4. Matching Knowledge Supplies

3.4.1. Matching Knowledge Elements

After obtaining the Knowledge Demands that match the keywords “Magotan” and “abnormal noise”, the corresponding Knowledge Supply needed to be matched for each Knowledge Demand. Firstly, related Knowledge Elements were matched for each Knowledge Demand using the method described in Section 2.4.1.
This experiment used the results of a pre-trained Word2vec model. A similarity threshold $T = 0.6$ was set, and word pairs with similarity scores above $T$ were included in the synonym mapping table. A partial result of the mapping table is shown in Table 7.
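The synonym mapping table can be derived from the pre-trained Word2vec model roughly as sketched below with gensim; the model file name and the seed terms are hypothetical.

```python
# A sketch of building the synonym mapping table from a pre-trained
# Word2vec model with gensim; the model path and seed terms are assumptions.
from gensim.models import Word2Vec

model = Word2Vec.load("autohome_word2vec.model")   # hypothetical model file
THRESHOLD = 0.6

synonym_table = []
for term in ["Engine", "Headlights", "Warm Air", "Gearbox"]:   # seed ontology terms
    for candidate, score in model.wv.most_similar(term, topn=10):
        if score >= THRESHOLD:
            synonym_table.append((term, candidate, round(score, 3)))
# Rows resemble Table 7, e.g. ("Gearbox", "Transmission", 0.821).
```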
Using the seed ontology vocabulary, we built an inverted index of the vocabulary in the Knowledge Element Repository. Then, based on the generated inverted index, we selected the Knowledge Elements with high matching degrees according to the method of matching Knowledge Elements to Knowledge Demands. The Knowledge Element Repository, which contains the collection of generated Knowledge Elements, is the database from the previous study [4]. The matching degree between Knowledge Demands and Knowledge Elements, calculated by Formula (7), is shown in Figure 6. Figure 6 presents the Knowledge Elements with a matching degree higher than 2, and these Knowledge Elements were used as examples in the subsequent explanations of Knowledge Solutions and Conclusions. Each subfigure in Figure 6 has a serial number in its caption, and each serial number denotes the corresponding Knowledge Element in the following paragraphs.
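A minimal sketch of the inverted index and the matching step is shown below; the Knowledge Element entries are hypothetical, and the simple overlap count stands in for the matching degree of Formula (7).

```python
# A sketch of an inverted index over Knowledge Element keywords and of scoring
# elements against a Knowledge Demand; the overlap count below is only a
# stand-in for the matching degree defined by Formula (7).
from collections import defaultdict

knowledge_elements = {   # hypothetical entries: id -> keyword set
    1: {"abnormal noise", "front wheels", "brake"},
    2: {"abnormal noise", "bump", "chassis"},
    3: {"engine", "oil"},
}
inverted_index = defaultdict(set)
for ke_id, words in knowledge_elements.items():
    for w in words:
        inverted_index[w].add(ke_id)

demand = {"abnormal noise", "front wheels"}
scores = defaultdict(int)
for w in demand:
    for ke_id in inverted_index.get(w, set()):
        scores[ke_id] += 1          # overlap count as a stand-in matching degree

matched = [ke for ke, s in sorted(scores.items(), key=lambda x: -x[1]) if s >= 2]
print(matched)                      # Knowledge Elements retained for the demand
```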

3.4.2. Display of Knowledge Solutions

After matching the Knowledge Demand “abnormal noise, front wheels” to the Knowledge Elements in Table 8, the Knowledge Solutions were generated from these Knowledge Elements according to the method described in Section 2.4.2.
Firstly, the similarities among these Knowledge Elements were calculated. The similarity weights of the Knowledge Element title, topic, sentiment orientation, keywords, and key sentences in the Knowledge Element similarity model were set to $w_h = 0.3$, $w_t = 0.1$, $w_p = 0.1$, $w_{kw} = 0.2$, and $w_{ks} = 0.3$. Because the topic and sentiment fields are short, their similarity values tend to be relatively high and would otherwise have an outsized impact on the overall similarity; therefore, lower weights were assigned to these two items. Table 8 shows the similarity calculation results between Knowledge Element 1 and each matched Knowledge Element, including the similarity of each sub-item and the total similarity.
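The combination of the sub-similarities into an overall similarity can be sketched as follows, using the weights reported above and, as a check, the sub-similarity values of the pair (1, 5) from Table 8.

```python
# A sketch of combining the five sub-similarities into the overall Knowledge
# Element similarity as a weighted sum; the example values are taken from
# the row for Knowledge Element 5 in Table 8.
WEIGHTS = {"title": 0.3, "topic": 0.1, "sentiment": 0.1, "keywords": 0.2, "key_sentences": 0.3}

def overall_similarity(sub_sims: dict) -> float:
    """Weighted sum of the five sub-similarities."""
    return sum(WEIGHTS[k] * sub_sims[k] for k in WEIGHTS)

pair_1_5 = {"title": 0.083, "topic": 1.0, "sentiment": 1.0,
            "keywords": 0.6, "key_sentences": 0.221}
print(round(overall_similarity(pair_1_5), 3))   # ≈ 0.411, as reported in Table 8
```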
According to Table 8, Knowledge Element 1 was highly similar to Knowledge Elements 5 and 9. By clustering similar Knowledge Elements together, a Knowledge Solution can be formed. Table 9 shows the similarity matrix between the Knowledge Elements; it provides the similarity lookups used to construct the graph model for the AP clustering algorithm.
According to the Knowledge Solution generation model based on clustering Knowledge Element similarities in Section 2.4.2 (1), AP clustering was performed as described in Section 2.4.2 (2) using the Knowledge Element similarity matrix in Table 9. The clustering result is shown in Figure 7.
Each red dot in Figure 7 represents a Knowledge Element, and the number around each red dot is the Knowledge Element ID. The size of the red dot indicates the centrality of the Knowledge Element in the clustering result. The blue lines show the similarity between each Knowledge Element, and the thicker the blue line, the higher the similarity. In Figure 7, the 10 Knowledge Elements can be divided into four clusters, as shown by the dashed circles in Figure 7: 1, 5, and 9 form the first cluster; 2, 3, and 6 form the second cluster; 4, 7, and 8 form the third cluster; and 10 is a separate fourth cluster. The set of Knowledge Elements in each cluster is a Knowledge Solution.
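For reference, the clustering step can be sketched with scikit-learn's AffinityPropagation on the (symmetrized) similarity matrix of Table 9; depending on the preference parameter, this sketch may not exactly reproduce the four clusters of Figure 7.

```python
# A sketch of AP clustering over the precomputed Knowledge Element similarity
# matrix from Table 9 (lower triangle, symmetrized here). The default
# preference may yield a different number of clusters than Figure 7.
import numpy as np
from sklearn.cluster import AffinityPropagation

lower = [
    [1.0],
    [0.286, 1.0],
    [0.346, 0.398, 1.0],
    [0.311, 0.272, 0.267, 1.0],
    [0.411, 0.234, 0.210, 0.302, 1.0],
    [0.329, 0.407, 0.431, 0.268, 0.293, 1.0],
    [0.310, 0.304, 0.328, 0.389, 0.264, 0.322, 1.0],
    [0.289, 0.291, 0.265, 0.437, 0.312, 0.284, 0.358, 1.0],
    [0.352, 0.331, 0.225, 0.241, 0.437, 0.241, 0.271, 0.255, 1.0],
    [0.210, 0.181, 0.207, 0.231, 0.214, 0.196, 0.165, 0.203, 0.173, 1.0],
]
S = np.zeros((10, 10))
for i, row in enumerate(lower):
    for j, v in enumerate(row):
        S[i, j] = S[j, i] = v

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)
for cluster_id in np.unique(labels):
    members = np.where(labels == cluster_id)[0] + 1   # Knowledge Element IDs
    print(f"Knowledge Solution {cluster_id + 1}: elements {members.tolist()}")
```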

3.4.3. Display of Knowledge Conclusions

According to the Knowledge Conclusion generation method in Section 2.4.3, Knowledge Conclusions were extracted from the obtained Knowledge Solutions. The sentiment index $p$ may take a decimal value, where 0 represents negative sentiment, 1 represents neutral sentiment, and 2 represents positive sentiment; larger values indicate a more positive sentiment orientation. The top 2 keywords ($TopN = 2$) and the top 3 conclusion sentences ($TopN = 3$) of each Knowledge Conclusion were selected. Figure 8 shows the Knowledge Conclusion extraction results for each Knowledge Solution, including the sentiment index $p$, keywords $kw$, and conclusion sentences $s$.
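As an illustration of this step, the sketch below assembles a Knowledge Conclusion from a cluster of Knowledge Elements; the averaging of sentiment values and the frequency-based ranking of keywords and sentences are simplifying assumptions rather than the exact procedure of Section 2.4.3.

```python
# A sketch of extracting a Knowledge Conclusion <p, kw, s> from a Knowledge
# Solution; sentiment averaging and frequency-based ranking are simplifying
# assumptions, not the paper's exact extraction method.
from collections import Counter

def extract_conclusion(elements, top_kw=2, top_s=3):
    """elements: list of dicts with 'sentiment', 'keywords', 'key_sentences'."""
    p = sum(e["sentiment"] for e in elements) / len(elements)        # sentiment index
    kw_counts = Counter(kw for e in elements for kw in e["keywords"])
    keywords = [w for w, _ in kw_counts.most_common(top_kw)]         # TopN = 2 keywords
    # Rank key sentences by how many of the selected keywords they contain (TopN = 3).
    sentences = sorted(
        (s for e in elements for s in e["key_sentences"]),
        key=lambda s: sum(kw in s for kw in keywords),
        reverse=True,
    )[:top_s]
    return {"p": round(p, 2), "kw": keywords, "s": sentences}
```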

3.5. Display of Knowledge Retrieval Application Process

From the user’s perspective, all the results generated from the above processes were linked together. Figure 9 shows the output results of the entity recognition, knowledge matching, and acquisition processes in response to the user input “MAGOTAN” and “abnormal noise” search keywords. The top half of Figure 9 displays the name of each process and its corresponding knowledge organization structure, and the bottom half displays the intermediate results that the user can obtain. Figure 9 shows the Knowledge Supply corresponding to the Knowledge Demand of “abnormal noise, front wheel”, which displays three sets of Knowledge Solutions and their corresponding Knowledge Conclusions.
The traditional search function in social media can only retrieve the original posts whose titles include the keywords, without any refinement, categorization, or summarization of the search results. Users can obtain the desired knowledge or information only by reading the specific content of each post one by one. In contrast, the Knowledge Solutions and Knowledge Conclusions obtained through Knowledge Demands and Knowledge Supplies form centralized and hierarchical knowledge content in the corresponding search results in Figure 9. Users can obtain an overview of the returned knowledge content and can explore specific details of the content they are interested in by following the hierarchical chain of Knowledge Demands, Knowledge Supplies, Knowledge Solutions, and Knowledge Elements. For example, in Figure 9, the user can obtain Knowledge Demands such as “vibration”, “bumps”, “brakes”, and “steering”, and then view the specific content in Knowledge Element 3 of Knowledge Solution 2 corresponding to “bumps”, obtaining keywords such as “abnormal noise”, “cold weather”, “starting”, and “front wheels” in the Knowledge Element. Therefore, the knowledge retrieval method of this paper is more efficient in obtaining knowledge than the search function of the Autohome forum.
The experiment in this study randomly simulated user input keywords for searching. The experiment was repeated 1000 times, and each returned result was manually evaluated. Among the results, 819 had clear knowledge demand themes, matched knowledge supply content, and provided clear and meaningful knowledge conclusions, yielding a retrieval accuracy of 81.9% and demonstrating the accuracy of the knowledge retrieval model. In addition, a statistical analysis was conducted on the 1000 retrieval results: the average number of Knowledge Supplies returned per search was 11.3, and the average number of Knowledge Solutions was 43.2. In comparison, conducting the same 1000 searches on the Autohome Forum yielded an average of 9325 results per search. Through manual evaluation, the proposed knowledge retrieval method was found to save 90% of the time spent reading results. Therefore, these experiments provide evidence for the accuracy and efficiency of the knowledge retrieval method proposed in this paper.

4. Discussion

In order to overcome the challenges of existing knowledge retrieval methods, including retrieval results that do not match real knowledge needs, low retrieval efficiency, and a lack of structured and systematic outputs, this paper proposes a new knowledge retrieval approach. Firstly, a knowledge retrieval process is developed to meet users’ requirements for accuracy and efficiency. Secondly, knowledge structures for Knowledge Demand and Knowledge Supply are designed to reflect authenticity, accuracy, and diversity. Thirdly, leveraging the structural characteristics of Knowledge Demands and Knowledge Supplies, methods such as the FP-Tree based on an inverted index, the similarity matrix based on the BLEU model, AP clustering, and key information extraction are employed to identify and extract Knowledge Demands and Knowledge Supplies. The experimental results demonstrate that the method achieves an accuracy rate of 81.9% in knowledge retrieval and reduces the time required to obtain professional knowledge from social media by 90% by providing users with hierarchical and logically organized topic knowledge. Therefore, the accuracy and efficiency of this method have been verified.
The proposed knowledge retrieval method has both theoretical and practical implications. The method overcomes the challenges of keyword-based fuzzy matching methods. Moreover, by flexibly designing knowledge structures and incorporating cutting-edge machine learning methods, this study provides new perspectives for future knowledge retrieval research. Because the proposed method organizes knowledge in a more structured and systematic way, it makes it easier for users to find and reuse knowledge from social media, which can improve the overall efficiency of knowledge management processes in organizations and businesses across various sectors and industries. Because the method provides users with access to a wide range of accurate and relevant information, it can also be used to improve the accuracy of recommendation engines on social media platforms and to create knowledge hubs on these platforms.
In the future, we would like to work with social media platforms to implement the proposed method, which would allow us to observe how the method performs in practice and to collect feedback from real social media users.
The key problem of knowledge retrieval methods lies in understanding the content semantics of users’ Knowledge Demands and Knowledge Supplies. Therefore, future research will focus on two aspects. The first is to design more effective dimensions for extracting Knowledge Demands and Knowledge Supplies and to integrate deep learning and natural language processing techniques to further understand the semantics of knowledge. The second, building on this semantic understanding, is to design a more accurate similarity calculation method and to use generative text summarization to obtain Knowledge Conclusions, ultimately improving the practical value of the knowledge provided to social media users.

Author Contributions

Conceptualization, R.M. and Y.H.; methodology, R.M.; software, R.M.; validation, R.M., Y.H. and Z.Z.; formal analysis, Z.Z.; investigation, Y.H.; resources, R.M.; data curation, Y.H.; writing—original draft preparation, R.M. and Y.H.; writing—review and editing, R.M., Y.H. and Z.Z.; visualization, R.M.; supervision, R.M.; project administration, R.M.; funding acquisition, R.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Municipal Foundation for Philosophy and Social Science (grant number 2022ETQ004).

Data Availability Statement

The data presented in this paper are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Mameli, M.; Paolanti, M.; Pietrini, R.; Pazzaglia, G.; Frontoni, E.; Zingaretti, P. Deep Learning Approaches for Fashion Knowledge Extraction From Social Media: A Review. IEEE Access 2022, 10, 1545–1576.
2. Barrera-Diaz, C.A.; Nourmohammadi, A.; Smedberg, H.; Aslam, T.; Ng, A.H.C. An Enhanced Simulation-Based Multi-Objective Optimization Approach with Knowledge Discovery for Reconfigurable Manufacturing Systems. Mathematics 2023, 11, 1527.
3. Iriondo Pascual, A.; Smedberg, H.; Högberg, D.; Syberfeldt, A.; Lämkull, D. Enabling Knowledge Discovery in Multi-Objective Optimizations of Worker Well-Being and Productivity. Sustainability 2022, 14, 4894.
4. Lin, J.; Miao, R.; Zhang, Z. Research on Extraction Methods of Topic Knowledge Tuples in Professional Social Media. Libr. Inf. Serv. 2019, 63, 101–110.
5. Kauffmann, E.; Peral, J.; Gil, D.; Ferrández, A.; Sellers, R.; Mora, H. A Framework for Big Data Analytics in Commercial Social Networks: A Case Study on Sentiment Analysis and Fake Review Detection for Marketing Decision-Making. Ind. Mark. Manag. 2020, 90, 523–537.
6. Ibtihel, B.L.; Lobna, H.; Lotfi, B.R. A Deep Learning-Based Ranking Approach for Microblog Retrieval. Procedia Comput. Sci. 2019, 159, 352–362.
7. Jia, J.; Ma, G.; Wu, Z.; Wu, M.; Jiang, S. Unveiling the Impact of Task Conflict on Construction Project Performance: Mediating Role of Knowledge Integration. J. Manag. Eng. 2021, 37, 04021060.
8. Hartono, B.; Sulistyo, S.R.; Chai, K.H. Knowledge Management Maturity and Performance in a Project Environment: Moderating Roles of Firm Size and Project Complexity. J. Manag. Eng. 2019, 35, 04019023.
9. Zhang, H.; Zang, Z.; Zhu, H.; Uddin, M.I.; Amin, M.A. Big Data-Assisted Social Media Analytics for Business Model for Business Decision Making System Competitive Analysis. Inf. Process. Manag. 2022, 59, 102762.
10. Kiryakov, A.; Popov, B.; Terziev, I.; Manov, D.; Ognyanoff, D. Semantic Annotation, Indexing, and Retrieval. J. Web Semant. 2004, 2, 49–79.
11. Guo, J.; Fan, Y.; Ai, Q.; Croft, W.B. A Deep Relevance Matching Model for Ad-Hoc Retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Birmingham, UK, 24 October 2016; ACM: Indianapolis, IN, USA, 2016; pp. 55–64.
12. Amendola, M.; Passarella, A.; Perego, R. Social Search: Retrieving Information in Online Social Platforms–A Survey. Online Soc. Netw. Media 2023, 36, 100254.
13. Ristoski, P.; Paulheim, H. Semantic Web in Data Mining and Knowledge Discovery: A Comprehensive Survey. J. Web Semant. 2016, 36, 1–22.
14. Mavrogiorgou, A.; Kiourtis, A.; Manias, G.; Kyriazis, D. An Optimized KDD Process for Collecting and Processing Ingested and Streaming Healthcare Data. In Proceedings of the 2021 12th International Conference on Information and Communication Systems (ICICS), Valencia, Spain, 24–26 May 2021; pp. 49–56.
15. Belcastro, L.; Cantini, R.; Marozzo, F. Knowledge Discovery from Large Amounts of Social Media Data. Appl. Sci. 2022, 12, 1209.
16. Shu, X.; Ye, Y. Knowledge Discovery: Methods from Data Mining and Machine Learning. Soc. Sci. Res. 2023, 110, 102817.
17. Xu, K.; Lai, Y.; Feng, Y.; Wang, Z. Enhancing Key-Value Memory Neural Networks for Knowledge Based Question Answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 2937–2947.
18. Manias, G.; Mavrogiorgou, A.; Kiourtis, A.; Symvoulidis, C.; Kyriazis, D. Multilingual Text Categorization and Sentiment Analysis: A Comparative Analysis of the Utilization of Multilingual Approaches for Classifying Twitter Data. Neural Comput. Appl. 2023.
19. Tan, Y.; Liu, Y.; Long, G.; Jiang, J.; Lu, Q.; Zhang, C. Federated Learning on Non-IID Graphs via Structural Knowledge Sharing. Proc. AAAI Conf. Artif. Intell. 2023, 37, 9953–9961.
20. Jiang, Y.; Liang, R.; Zhang, J.; Sun, J.; Liu, Y.; Qian, Y. Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model. ACM Trans. Knowl. Discov. Data 2021, 16, 52:1–52:27.
21. Castaneda, D.I.; Cuellar, S. Knowledge Sharing and Innovation: A Systematic Review. Knowl. Process Manag. 2020, 27, 159–173.
22. Dienes, Z.; Perner, J. A Theory of Implicit and Explicit Knowledge. Behav. Brain Sci. 1999, 22, 735–808.
23. Zadeh, L.A. The Concept of a Linguistic Variable and Its Application to Approximate Reasoning—I. Inf. Sci. 1975, 8, 199–249.
24. Zhu, L.; Xia, Y.; Li, J.; Zhou, G. An Improved Method of Fuzzy Knowledge Matching—IDM Method. Comput. Technol. Dev. 2008, 18, 140–143+253.
25. Zhao, T.; Yuan, L.; Zeng, J. Knowledge Matching Model Based on Latent Semantic Indexing and Arithmetic Analysis. J. China Univ. Geosci. (Soc. Sci. Ed.) 2006, 6, 54–56.
26. Yu, C.; Chen, H.; Guo, D. The Analysis and Countermeasures Research of Virtual Enterprise Asymmetric Knowledge Sharing. In Proceedings of the 2011 International Conference on Network Computing and Information Security, Guilin, China, 14–15 May 2011; pp. 254–259.
27. Yang, H.; Li, J. Application of Multivariate Statistics and 3D Visualization Analysis in Tacit Knowledge Diffusion Map. Displays 2021, 69, 102062.
28. Liu, L.; Yao, X.; Qin, L.; Zhang, M. Ontology-Based Service Matching in Cloud Computing. In Proceedings of the 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Beijing, China, 6–11 July 2014; pp. 2544–2550.
29. Yue, Q. Bilateral Matching Decision-Making for Knowledge Innovation Management Considering Matching Willingness in an Interval Intuitionistic Fuzzy Set Environment. J. Innov. Knowl. 2022, 7, 100209.
30. Rubiolo, M.; Caliusco, M.L.; Stegmayer, G.; Coronel, M.; Gareli Fabrizi, M. Knowledge Discovery through Ontology Matching: An Approach Based on an Artificial Neural Network Model. Inf. Sci. 2012, 194, 107–119.
31. Guo, R.; Chen, F.; Cheng, X. Research on Dynamic IPO Knowledge Matching Based on Evidence Reasoning. J. China Soc. Sci. Tech. Inf. 2017, 36, 1290–1301.
32. Liu, Y.; Li, K.W. A Two-Sided Matching Decision Method for Supply and Demand of Technological Knowledge. J. Knowl. Manag. 2017, 21, 592–606.
33. Mohamed, M.; Oussalah, M. SRL-ESA-TextSum: A Text Summarization Approach Based on Semantic Role Labeling and Explicit Semantic Analysis. Inf. Process. Manag. 2019, 56, 1356–1372.
34. Sun, Y.; He, Y.; Wu, G. Information Matching Model of Terms in Scientific and Technological Literature Based on Domain Knowledge Base. Inf. Sci. 2019, 37, 16–21.
35. Ma, Y.; Wang, F.; Huang, J.; Jiang, E.; Zhang, X. Research on Construction of a Subject Knowledge Base based on Literature Knowledge Extraction: Using the Knowledge Base of Activating Blood Circulation and Removing Stasis as the Object. J. China Soc. Sci. Tech. Inf. 2019, 38, 482–491.
36. Deng, L.; Lou, Y. Improvement and Research of FP-Growth Algorithm Based on Distributed Spark. In Proceedings of the 2015 International Conference on Cloud Computing and Big Data (CCBD), Shanghai, China, 4–6 November 2015; pp. 105–108.
37. De Smedt, K.; Koureas, D.; Wittenburg, P. FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units. Publications 2020, 8, 21.
38. Wang, D. Digital Archive Management Based on Lucene Full-Text Search Engine. J. Phys. Conf. Ser. 2021, 2074, 012001.
39. Peták, M.; Görner, T.; Brožová, H.; Houška, M. Compensating for the Loss of Future Tree Values in the Model of Fuzzy Knowledge Units. Urban For. Urban Green. 2022, 74, 127627.
40. Bandi, A.; Joshi, K.; Mulwad, V. Affinity Propagation Initialisation Based Proximity Clustering For Labeling in Natural Language Based Big Data Systems. In Proceedings of the 2020 IEEE 6th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing, (HPSC) and IEEE International Conference on Intelligent Data and Security (IDS), Baltimore, MD, USA, 25–27 May 2020; pp. 1–7.
Figure 1. Framework of the social media knowledge retrieval method.
Figure 2. Knowledge retrieval process with user’s involvement.
Figure 3. Knowledge structure of Knowledge Supply.
Figure 4. Snapshot of post contents.
Figure 5. FP-tree of the vocabulary in the selected answer posts.
Figure 6. Matching results of Knowledge Elements for Knowledge Demands: (a) the result of posts 1–5; (b) the result of posts 6–10.
Figure 7. Clustering result of Knowledge Elements based on the AP clustering algorithm.
Figure 8. Knowledge Conclusion display of each Knowledge Solution: (a) the Knowledge Conclusion of Knowledge Solution 1; (b) the Knowledge Conclusion of Knowledge Solution 2; (c) the Knowledge Conclusion of Knowledge Solution 3; (d) the Knowledge Conclusion of Knowledge Solution 4.
Figure 9. The process of Knowledge Retrieval with the user’s topics.
Table 1. Concept explanations.

Concept | Explanation | Contents
Social Media | In this context, social media specifically refers to professional social media, defined as an online content production platform used by internet users to share and exchange opinions, insights, experiences, and creative ideas related to a specific professional subject. | It encompasses a vast amount of user-generated content, within which lies valuable knowledge that can be utilized.
Topic Map | Categorizes the topics and establishes a graph based on their relationships. | Incorporates the associations between knowledge topics.
Knowledge Element | Knowledge primitives, used for the operation and management of knowledge; independent units that can be freely segmented, expressed, accessed, organized, retrieved, and utilized. | Extracted from social media corpora; each Knowledge Element consists of its knowledge topic, sentiment tendency, keywords, and key sentences: <Topic, Sentiment Orientation, Keywords, Key Sentences>.
Knowledge Element Repository | In knowledge retrieval and matching, a repository that contains a wealth of Knowledge Elements, usually obtained through knowledge extraction. | A substantial number of Knowledge Elements extracted from social media corpora.
Knowledge Demand | Knowledge acquisition needs arising to address the encountered issues. | <Knowledge Topic $t$, Subtopic Combination $t_{set}$, Support Degree $s$, Demand Question Set $q_{set}$>. Please refer to Section 2.3 for more details.
Knowledge Supply | The relevant knowledge points corresponding to a specific Knowledge Demand. | Each Knowledge Supply consists of multiple Knowledge Solutions and Conclusions that can fulfill Knowledge Demands; these are sourced from the Knowledge Element Repository. Please refer to Section 2.4 for more details.
Knowledge Solution | A collection of multiple Knowledge Elements with similar content or meaning, where each Knowledge Element serves as a supporting point for the respective Knowledge Solution. | Each Knowledge Demand corresponds to multiple Knowledge Solutions, i.e., several clusters of Knowledge Elements. Please refer to Section 2.4.2 for more details.
Knowledge Conclusion | A summary and overview of the Knowledge Solution, composed of several brief statements. | <Sentiment Index $p$, Keyword $kw$, Conclusion Sentence $s$>. Each Knowledge Solution corresponds to one Knowledge Conclusion. Please refer to Section 2.4.3 for more details.
Knowledge Retrieval Method | The process of matching “Knowledge Demand” with “Knowledge Supply” through knowledge search and knowledge computation. |
Table 2. Candidate topic combinations corresponding to the topics “abnormal noise” and “brakes”.

Serial Number | Candidate Topic Combination | Serial Number | Candidate Topic Combination
1 | Abnormal Noise, Sound | 2 | Abnormal Noise, Traffic Light
3 | Abnormal Noise, Vibration | 4 | Abnormal Noise, Neutral Gear
5 | Abnormal Noise, Noise | 6 | Abnormal Noise, Accelerator
7 | Abnormal Noise, Tremble | 8 | Abnormal Noise, Startup
9 | Abnormal Noise, Brake | 10 | Abnormal Noise, Problem
… | … | … | … (a total of 435)
Table 3. Samples in the training dataset.

Post Title | Is Q&A Post
What’s the reason for the car trembling when starting up | 1
Is it normal for the brake discs of a new car to rust? The car was bought in November | 1
May I ask if there will be any impact on the car when it runs 1500 km on the highway shortly after it is put into service? | 1
Finally picked up my car today, the 330 Leading Edition. Let me share my thoughts | 0
The car was bought for my wife. Just for the sake of credibility, I need to authenticate myself by posting a reply! | 0
Just for fun, added some atmosphere lights and 17-inch wheels to the car during my free time | 0
… | …
Table 4. Selected Q&A posts and results of segmentation processing.

ID | Q&A Post Title | After Tokenization
1 | Abnormal noise from the front wheels of my Magotan… urgent!!! | Abnormal Noise, Front Wheels
2 | Abnormal noise from the chassis when driving on bumpy roads?? | Abnormal Noise, Bump, Chassis
3 | Why is there often a noise when releasing the brake during starting? | Abnormal Noise, Brake, Startup, Releasing
4 | Help! Help! My new car just picked up, but there is a front wheel noise when driving on bumpy roads. | Abnormal Noise, Front Wheels, Bump, Help, Roads, New Car
5 | There is a problem with the abnormal sound of the front wheel when steering | Abnormal Noise, Front Wheels, Steering, Problem
6 | Help, when the car steers at low speed or on the spot, there is a “boom boom boom” abnormal sound I can hear in the driver’s compartment. | Abnormal Noise, Help, Steer, Low Speed, Driver’s Compartment, On the Spot
7 | Regarding the abnormal noise issue, there is always a knocking sound during the process of braking until the car comes to a near stop. | Abnormal Noise, Brake, Stop, Process, Sound
… | … | …
Table 5. Frequency and sorting of domain terms in Q&A posts.

Topic Word | Number of Occurrences | Topic Word | Number of Occurrences
Abnormal Noise | 7 | Driver’s Compartment | 1
Front Wheels | 3 | Roads | 1
Brake | 2 | Startup | 1
Steer | 2 | Problem | 1
Bump | 2 | New Car | 1
Help | 2 | On the Spot | 1
Chassis | 1 | Stop | 1
Low Speed | 1 | Process | 1
Driving | 1 | Sound | 1
Table 6. Knowledge Demand obtained through the keywords “Magotan” and “abnormal noise”.

ID | Knowledge Topic $t$ | Subtopic Combination $t_{set}$ | Support Degree $s$ | Set of Demand Issues $q_{set}$ (Post ID)
1 | Abnormal Noise, Brake | Abnormal Noise, Front Wheels | 0.429 | 1, 4, 5
2 | Abnormal Noise, Brake | Abnormal Noise, Bump | 0.286 | 2, 4
3 | Abnormal Noise, Brake | Abnormal Noise, Brake | 0.286 | 3, 7
4 | Abnormal Noise, Brake | Abnormal Noise, Steer | 0.286 | 5, 6
… | … | … | … | …
Table 7. Examples of mapping table for synonyms.

No. | Synonyms | Similarity | No. | Synonyms | Similarity
1 | Engine, Motor | 0.724 | 2 | Engine, | 0.630
3 | Headlights, Portable Lighter | 0.812 | 4 | Headlights, Headlamp | 0.779
5 | Warm Air, Heating | 0.877 | 6 | Warm Air, Hot Air | 0.852
7 | Gearbox, Gear Case | 0.873 | 8 | Gearbox, Transmission | 0.821
9 | Car Body, Bodywork | 0.698 | 10 | Mirror, Reflector | 0.935
… | … | … | … | … | …
Table 8. Similarity calculation results between Knowledge Element 1 and each matched Knowledge Element, including the similarity values of each sub-item and the overall similarity.

Post ID | Title Similarity $Sim_B(h_{u_i}, h_{u_j})$, $w_h = 0.3$ | Topic Similarity $Sim_J(t_{u_i}, t_{u_j})$, $w_t = 0.1$ | Sentiment Similarity $Sim_P(p_{u_i}, p_{u_j})$, $w_p = 0.1$ | Keyword Similarity $Sim_J(kw_{u_i}, kw_{u_j})$, $w_{kw} = 0.2$ | Key Sentence Similarity $Sim_B(ks_{u_i}, ks_{u_j})$, $w_{ks} = 0.3$ | Overall Similarity $Sim(u_i, u_j)$
1 | 1 | 1 | 1 | 1 | 1 | 1
2 | 0.181 | 1 | 1 | 0.142 | 0.012 | 0.286
3 | 0.143 | 1 | 1 | 0.333 | 0.121 | 0.346
4 | 0.111 | 1 | 1 | 0.333 | 0.036 | 0.311
5 | 0.083 | 1 | 1 | 0.6 | 0.221 | 0.411
6 | 0.125 | 1 | 1 | 0.333 | 0.084 | 0.329
7 | 0.071 | 1 | 1 | 0.333 | 0.074 | 0.31
8 | 0 | 1 | 1 | 0.333 | 0.076 | 0.289
9 | 0.083 | 1 | 0.5 | 0.6 | 0.189 | 0.352
10 | 0.2 | 1 | 0.5 | 0 | 0 | 0.21
… | … | … | … | … | … | …
Table 9. Similarity matrix among Knowledge Elements (lower triangular part shown).

 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
1 | 1 | | | | | | | | |
2 | 0.286 | 1 | | | | | | | |
3 | 0.346 | 0.398 | 1 | | | | | | |
4 | 0.311 | 0.272 | 0.267 | 1 | | | | | |
5 | 0.411 | 0.234 | 0.210 | 0.302 | 1 | | | | |
6 | 0.329 | 0.407 | 0.431 | 0.268 | 0.293 | 1 | | | |
7 | 0.31 | 0.304 | 0.328 | 0.389 | 0.264 | 0.322 | 1 | | |
8 | 0.289 | 0.291 | 0.265 | 0.437 | 0.312 | 0.284 | 0.358 | 1 | |
9 | 0.352 | 0.331 | 0.225 | 0.241 | 0.437 | 0.241 | 0.271 | 0.255 | 1 |
10 | 0.21 | 0.181 | 0.207 | 0.231 | 0.214 | 0.196 | 0.165 | 0.203 | 0.173 | 1