
Finding Patient Zero and Tracking Narrative Changes in the Context of Online Disinformation Using Semantic Similarity Analysis

by Codruț-Georgian Artene 1,*, Ciprian Oprișa 2,*, Cristian Nicolae Buțincu 1,* and Florin Leon 1,*
1 Department of Computer Science and Engineering, “Gheorghe Asachi” Technical University of Iasi, 700050 Iasi, Romania
2 Computer Science Department, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(9), 2053; https://doi.org/10.3390/math11092053
Submission received: 3 April 2023 / Revised: 20 April 2023 / Accepted: 23 April 2023 / Published: 26 April 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract
Disinformation in the form of news articles, also called fake news, is used by multiple actors for nefarious purposes, such as gaining political advantages. A key component of fake news detection is the ability to find similar articles in a large document corpus, in order to track narrative changes and identify the root source (patient zero) of a particular piece of information. This paper presents new techniques based on textual and semantic similarity that were adapted to achieve this goal on large datasets of news articles. The aim is to determine which of the implemented text similarity techniques is more suitable for this task. For text similarity, Locality-Sensitive Hashing is applied to n-grams extracted from text to produce representations that are further indexed to facilitate the quick discovery of similar articles. The semantic textual similarity technique is based on sentence embeddings from pre-trained language models, such as BERT, and on Named Entity Recognition. The proposed techniques are evaluated on a collection of Romanian articles to determine their performance in terms of quality of results and scalability. The presented techniques produce competitive results. The experimental results show that the proposed semantic textual similarity technique is better at identifying similar text documents, while the Locality-Sensitive Hashing text similarity technique outperforms it in terms of execution time and scalability. Although they were evaluated only on Romanian texts and some of them are based on pre-trained models for the Romanian language, the methods underlying these techniques can be extended to other languages with few to no changes, provided that pre-trained models exist for those languages as well. A cross-lingual setup would require more changes, along with tests to demonstrate this capability. Based on the obtained results, one may conclude that the presented techniques are suitable for integration into a decentralized anti-disinformation platform for fact-checking and trust assessment.

1. Introduction

The complex information environment that is aggressively developing online is threatening the way information is perceived. With high amounts of disinformation being actively spread on the internet [1], AI-generated content [2] that has the capacity to make disinformation skyrocket, and a diverse range of threat actors [3], humanity’s access to trusted information is threatened as never before. Not long from now, every piece of information will be questioned, because people will soon realize it is no longer possible to separate truth from fiction. In this seemingly dystopian near future, an information crisis will unfold unless new tools and techniques are developed to counter this phenomenon.
In this paper, we propose new techniques based on textual similarity and semantic textual similarity that will enable the analysis of information propagation, narrative changes and identification of patient zero, that is, the root source of a particular piece of information. Our goal is to implement and evaluate different text similarity and semantic textual similarity techniques in order to determine which one is more suitable for the task at hand.
For text similarity, we apply Locality-Sensitive Hashing on the set of n-grams representing each news article. The resulting representation can be indexed and allows for the quick discovery of similar articles. For semantic textual similarity, we use Named Entity Recognition and sentence embeddings from pre-trained language models based on Transformers.
The techniques explored in this paper take as input a collection of documents. In our case, these documents are represented by online news articles that are assembled into a large corpus database by a series of web crawlers. However, these techniques can be applied to a wide range of scenarios and document types where semantic similarity analysis is needed (e.g., plagiarism detection).
Although the proposed methods were evaluated only on texts extracted from news articles in the Romanian language, and some of the pre-trained models we use are specially designed for the Romanian language, these methods can be extended to other languages with few or no changes. The text similarity method that we use can be applied to other languages as is. The semantic textual similarity method uses pre-trained language models and can be applied to other languages as long as the model behind it has been pre-trained on those languages. The same applies to the Named Entity Recognition technique.
Regarding a cross-lingual approach, we cannot directly apply the proposed text similarity technique, as it is based on n-grams, whose translations differ from one language to another, resulting in different representations. The same problem occurs with the Named Entity Recognition technique that we use. An additional module for translating texts into different languages could solve this problem. The semantic textual similarity module could be used in a cross-lingual environment as long as the underlying pre-trained model is a multilingual one, but new tests are needed to demonstrate this capacity.
Our main contributions are the following:
  • We designed a solution to analyze information propagation and narrative changes and to identify patient zero in the context of online disinformation, based on text similarity techniques.
  • We collected and annotated a set of news articles from the websites of several Romanian publications, to be used for the evaluation of the proposed solution.
  • We implemented and evaluated several text similarity and semantic textual similarity techniques in order to assess their performance in terms of the quality of the results, execution times and scalability.
The next section of the paper describes the main concepts and the state of the art for fake news detection and news article similarity, along with methods used for semantic textual similarity. A detailed description of the proposed solution is given in Section 3, while Section 4 presents the experimental results. The results section is divided into three subsections: the quality evaluation, where the proposed solutions are evaluated using metrics such as Precision, Recall and the $F_1$ score; the performance evaluation; and the solution integration for finding patient zero and tracking narrative changes. The paper ends with the conclusions section, where the experimental results are discussed along with future research ideas.

2. Related Work

The problem of identifying similar news articles as part of the fake news detection process has been approached in other publications. This section compiles a list of academic papers that describe solutions for this problem and for variations of it that involve similarity computations. It is difficult to perform a quantitative comparison between our approach and the approaches in the presented papers, as our work focuses on Romanian-language articles. However, for each article we highlight the common techniques used, along with the main differences. This section also lists the literature describing the semantic textual similarity techniques that were used in this paper.
The authors of [4] acknowledge the importance of similarity for fake news detection. The article proposes a multi-modal approach, where relevant features are extracted from both the text and the images, which can sometimes be used to mislead users. The mismatch between the textual and visual information in a news article can be a strong indicator of fake news.
The approach in [5] focused on English and Hindi articles and employed three methodologies: cosine similarity using TF-IDF vectors, Jaccard similarity with TF-IDF vectors and Euclidean distance on bags of words. The cosine similarity obtained the best results. Here, TF-IDF is the abbreviation for the “Term Frequency–Inverse Document Frequency” method. The experimental results were obtained on 1000 news stories extracted from Google News.
The cosine distance, along with n-gram extraction, was also used in [6], where the authors proposed a clustering algorithm based on the n-grams extracted from the text of the articles. The authors also conducted experiments on two public datasets to determine the best n-gram size.
The approach in [7] is based on keyphrase extraction algorithms, which is somewhat similar to our approach of using Named Entity Recognition for filtering potentially similar pairs. The best approach was found to be the combination of the word2vec representation [8] with the cosine similarity.
Similar content can be presented in different languages. The authors of [9] explored such news articles and proposed a cross-lingual system for linking different language events. The best results were obtained by Cross-Lingual Latent Semantic Indexing [10] and Canonical Correlation Analysis [11], which performed better on smaller clusters.
An application of news article similarity in the form of detecting media bias was described in [12]. The authors leveraged machine learning techniques such as Support Vector Machines, Logistic Regression and Siamese networks [13] to compute a similarity score for pairs of articles, and also approached the problem of finding all items similar to a given article in a given document corpus.
For articles published on the Internet, not only the articles’ text is relevant, but also their user comments. The research conducted in [14] used deep learning and similarity comparison to evaluate the political stance of the top news commenters and obtained an impressive accuracy.
All these methods are based on measuring the similarity between the text of the articles. Text similarity is the task of measuring how similar two pieces of text are. It is used in various Natural Language Processing (NLP) tasks and researchers have proposed multiple methods for text representation and text distance measurement [15]. While the earliest methods for text similarity are based on the identification of common terms, such as words or groups of words, capturing only the lexical and syntactic similarities, newer methods, based on deep learning techniques, are able to capture and represent the semantic similarities between texts [16].
Semantic textual similarity is the task of measuring how similar two pieces of text are from the point of view of semantic content. State-of-the-art language models are able to learn representations of words and phrases, capturing their meaning and context. Large pre-trained language models, such as BERT (Bidirectional Encoder Representations from Transformers) and its variants, are already the basis of some techniques for semantic textual similarity, including both supervised and unsupervised methods [17,18,19,20,21,22,23].
First introduced in [24], BERT is a deep neural network architecture that uses a multi-layer bidirectional transformer encoder to learn contextual representations of words and phrases. Pre-trained BERT models can be fine-tuned for semantic similarity tasks by adding a task-specific output layer [17,18]. Another approach is to use BERT-based Siamese networks for semantic textual similarity [19,20]. However, the main drawback of these methods is that they require annotated data for training.
In unsupervised methods, pre-trained BERT is used to generate fixed-length vector representations of input sentences that capture their semantic meanings [21,22,23]. These representations are called sentence embeddings and can be used to compute the semantic similarity between pairs of sentences using different distance metrics, such as cosine similarity. For large texts, composed of multiple sentences, the main challenge remains to establish the matching pairs of sentences or find a method to aggregate the similarities between pairs of sentences.
To the best of our knowledge, we are the first to apply a semantic textual similarity technique based on pre-trained Transformer language models to relatively large texts extracted from news articles, in order to determine how these articles relate to each other in the context of online disinformation.

3. Solution Description

We propose a solution to analyze information propagation, narrative changes and the identification of patient zero, based on text similarity. Figure 1 illustrates the proposed solution, which is composed of the following three main steps:
(A)
First, a collection of text documents is processed by a Similarity Analysis module. We apply Named Entity Recognition and text similarity techniques to compute similarity scores between pairs of text documents. Based on the computed similarity scores, we form clusters of documents.
(B)
Each cluster of text documents is then further processed and a similarity graph is constructed based on the similarity scores.
(C)
The similarity graph is further processed by including additional information, such as the date and time when the text documents were fetched from the Internet, allowing us to find patient zero and to track narrative changes.
Figure 1. Overview of the proposed solution.
The detailed description of the solution is presented in the following, including implementation details. The key component of our solution is represented by the Text Similarity module. We experimented with multiple techniques for text similarity and semantic textual similarity, to determine which of these is more suitable for the presented solution. Considering that the proposed solution must provide qualitative results in the shortest possible time, we evaluated these techniques both from the point of view of the quality of the results and from the point of view of execution times, assessing the scalability at the same time. We also describe how the similarity scores between the documents in one cluster can be used to find the patient zero and to track narrative changes by building and analyzing a similarity graph.

3.1. Textual Similarity

The approach presented in this subsection describes a method to find similar articles from a large document corpus. The current assumption is that an article was reproduced without significant changes. The proposed technique is robust to some changes, such as removing or adding some phrases or paragraphs, reordering the phrases or even small word-level modifications inside the phrases. We will show how this method can be scaled to deal with massive collections of articles. The next section (Section 3.2) will address this problem from a semantic point of view, showing robustness to even stronger modifications of the original article, as long as the text semantics are preserved.

3.1.1. Article Representation and Similarity Score

For computing the similarity score in a scalable way, we have chosen to represent each article as a set of word n-grams [25]. Figure 2 presents the pipeline to process an article text in order to retrieve the set of n-grams.
The first of the three steps, denoted by split, takes the unstructured text and splits it into phrases by identifying the punctuation marks. Each phrase is then represented as a list of words, from which we eliminate very short words (one or two characters) and numbers. We do not further consider any phrase that is shorter than the n-gram size, since no n-grams can be extracted from it.
The normalize step reduces every word to lowercase and converts all Unicode characters to their ASCII equivalents. This is an important step for the Romanian language, which contains a number of diacritics (ă, â, î, ș and ț) in addition to the English alphabet. While the standard use of the Romanian language involves these diacritics, most informal texts such as instant messages or social media posts replace them with their English equivalents (e.g., ă is replaced by a) [26]. Such texts are still readable by native Romanians; sometimes the missing diacritics even go unnoticed. Yet, for a computer algorithm, paturi and pături are two distinct words, so an article reproduction that replaces the diacritics would evade basic similarity matches.
The extract step iterates over the words of each phrase using a sliding window of size n and computes a CRC32 hash (Cyclic Redundancy Check [27]) on each sequence of n consecutive words. For instance, for the phrase A B C D E F G and n = 5, the extracted n-grams are <A B C D E>, <B C D E F> and <C D E F G>. Although CRC32 is not a cryptographically secure hashing algorithm [28], it would be very difficult for an attacker to modify a text so as to produce false matches on multiple n-grams. We have chosen this algorithm for the advantages of being fast to compute and having a compact representation (only 4 bytes). All the n-grams extracted from all the phrases are combined into a single set that represents the entire article. This means that duplicate phrases or repeating word sequences will only appear once in the set representation.
Having two articles represented as sets X and Y, we can define their similarity score using the Jaccard formula [29] as in Equation (1). This formula is commonly used for set similarity and computes the ratio between the size of the intersection of the sets and the size of the set union. The similarity score will be 1 (or 100%) only if the sets are identical, while two sets with no common n-grams will have a similarity score of 0.
$$\mathrm{sim}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \tag{1}$$
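To make the pipeline concrete, the following Python sketch implements the split, normalize and extract steps and the Jaccard score from Equation (1). The regular expressions and helper names are illustrative choices, not the exact implementation used in our experiments.

```python
import re
import unicodedata
import zlib

def article_ngrams(text: str, n: int = 5) -> set:
    """Split into phrases, normalize words, extract CRC32 hashes of word n-grams."""
    ngrams = set()
    for phrase in re.split(r"[.!?;]+", text):                    # split step
        words = [w for w in re.findall(r"\w+", phrase)
                 if len(w) > 2 and not w.isdigit()]              # drop short words, numbers
        # normalize step: lowercase, then strip diacritics (ă -> a, ș -> s, ...)
        words = [unicodedata.normalize("NFKD", w.lower())
                 .encode("ascii", "ignore").decode("ascii") for w in words]
        # extract step: sliding window of n consecutive words, hashed with CRC32
        for i in range(len(words) - n + 1):
            ngrams.add(zlib.crc32(" ".join(words[i:i + n]).encode("utf-8")))
    return ngrams

def jaccard(x: set, y: set) -> float:
    """Equation (1): |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y) if (x or y) else 0.0
```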

3.1.2. Enabling Scalability with Locality-Sensitive Hashing

The Jaccard similarity is a powerful tool for computing article similarity and it can be computed in linear time, assuming we store the articles’ representations as sorted arrays of n-gram hashes. However, this representation grows with the article size and can become quite large for long articles. A naive approach to find all pairs of similar articles would be to compute the similarity score for each pair of articles, an effort that is quadratic in the collection size. This section describes how we can drastically reduce the computation time by approximating the similarity score using the Locality-Sensitive Hashing (LSH) technique.
According to [30], an LSH function is a hash function for which the collision probability of two items increases with their similarity score. Each similarity score has its own family of LSH functions. Since we focus on the Jaccard similarity, we will briefly describe the MinHash function family [31] as an LSH function.
The MinHash of a set X given a permutation $\sigma$ is defined as the minimum value of $\sigma(x)$, $x \in X$. It is proven in [30] that the probability that two sets X and Y have the same MinHash value is equal to their Jaccard similarity $\mathrm{sim}(X, Y)$. Sticking to a single permutation is not very useful, since the probability of matching an article that is 70% similar to a given one is only 70%. However, if we define a larger number of permutations, for instance, $N = 150$, and compute MinHashes on these permutations, statistically, about 70% of the MinHashes will match.
Previous work [31,32] proved that it is easy to generate a family of MinHash functions by selecting a prime number m and generating two series of random integers $a_i, b_i \in \mathbb{Z}_m$, $1 \le i \le N$, with $a_i \neq 0$. The permutation $\sigma_i : \mathbb{Z}_m \to \mathbb{Z}_m$ is defined as $\sigma_i(x) = (a_i \cdot x + b_i) \bmod m$. Thus, the i-th MinHash of the set X will have the value $\min_{x \in X} (a_i \cdot x + b_i) \bmod m$.
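A minimal sketch of the MinHash signature computation, assuming the n-gram hash sets from the previous step; the prime m and the seeded random coefficients are illustrative choices.

```python
import random

M = 4_294_967_311                 # a prime larger than 2^32 (illustrative choice)
N = 150                           # number of MinHash permutations

_rng = random.Random(42)
A = [_rng.randrange(1, M) for _ in range(N)]     # a_i != 0
B = [_rng.randrange(0, M) for _ in range(N)]

def minhash_signature(ngrams: set) -> list:
    """The i-th MinHash is the minimum of (a_i * x + b_i) mod m over x in X."""
    return [min((a * x + b) % M for x in ngrams) for a, b in zip(A, B)]
```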
The N MinHashes defined above can be further split into b bands, with r rows on each band ($N = b \times r$). For each band j, $1 \le j \le b$, a regular hash (such as CRC32) can be computed on the r MinHashes situated on that band. If two articles represented by the sets X and Y have the Jaccard similarity $s \in [0, 1]$, the probability that at least one of the b Band Hashes will match is given by Equation (2).
$$P(s) = 1 - \left(1 - s^r\right)^b \tag{2}$$
To prove this claim, we assume that hash collisions on each band are negligible, so the probability that X and Y have the same Band Hash is the same as the probability of having the same MinHash on each of the r rows. Since the probability of having the same MinHash on a specific row is s and the rows are independent, the probability of having the same MinHash on the r rows of a given band is $s^r$. The probability of having different Band Hashes on all b bands is $(1 - s^r)^b$, which means that the probability of at least one matching Band Hash is its complement, as in Equation (2).
Figure 3 plots the probability of having at least one common Band Hash, for $r = 5$ and $b = 30$, against the Jaccard similarity $s \in [0, 1]$ of two articles X and Y. We can notice that the plot has a sigmoid shape and that the probability is very close to 1 for similarities $s \ge 70\%$ and very close to 0 for similarities $s \le 20\%$. By tuning the parameters b and r, we can decrease or increase the similarity threshold for which we obtain match probabilities very close to 1.
The MinHashes and the Band Hashes can be computed individually for each article, enabling parallelization. By maintaining database indices on each of the b bands, we can quickly select a list of candidate articles that have at least one common Band Hash with a searched one. Instead of computing the Jaccard similarity with every other article in the database, we can compute the similarity only with the list of candidates and the probability of missing a similar article is very low (although not 0).
Figure 4 presents the second pipeline for the textual similarity approach, which starts from the previous article representation as a set of word n-grams and computes the LSH representation, consisting of MinHashes and Bands.
Having the articles represented as arrays of MinHashes enables a faster similarity computation, since the number of MinHashes N is constant, regardless of the article size. The similarity approximation is given in Equation (3), which can be computed in $O(N)$.
$$\mathrm{sim}_{mh}(X, Y) = \frac{\left|\{\, j \mid 1 \le j \le N,\ X.MH[j] = Y.MH[j] \,\}\right|}{N} \tag{3}$$
Having computed the Band Hashes representation of an article as an array B with b values, the list of candidates from a document corpus D can be computed as in Equation (4).
$$\mathrm{candidates}(X) = \bigcup_{j=1}^{b} \{\, Y \in D \mid X.B[j] = Y.B[j] \,\} \tag{4}$$
Storing the article Band Hashes as indexed columns in a database table enables the fast retrieval of the candidate set, as long as we do not have too many matches on any of the bands. For practical considerations, Band Hashes that appear in a significant number of documents can be ignored, as they usually correspond to uninformative frequent phrases and significantly slow down the similarity computation.
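The following sketch combines the banding scheme, the candidate retrieval from Equation (4) and the similarity approximation from Equation (3). An in-memory dictionary stands in for the indexed database columns described above.

```python
import zlib
from collections import defaultdict

R, BANDS = 5, 30                          # N = R * BANDS = 150 MinHashes

def band_hashes(signature: list) -> list:
    """One CRC32 hash per band of r consecutive MinHashes."""
    return [zlib.crc32(repr(signature[j * R:(j + 1) * R]).encode())
            for j in range(BANDS)]

index = defaultdict(set)                  # (band number, band hash) -> document ids

def add_document(doc_id, bands):
    for j, h in enumerate(bands):
        index[(j, h)].add(doc_id)

def candidates(bands) -> set:
    """Equation (4): documents sharing at least one Band Hash with the query."""
    found = set()
    for j, h in enumerate(bands):
        found |= index[(j, h)]
    return found

def sim_mh(sig_x, sig_y) -> float:
    """Equation (3): fraction of matching MinHash positions."""
    return sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)
```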
The similarity metric from Equation (3) is reflexive and symmetric. Reflexivity states that a similarity metric should be equal to the maximum similarity value when comparing an object to itself. In the case of MinHash similarity, the formula is
$$\mathrm{sim}_{mh}(X, X) = \frac{\left|\{\, j \mid 1 \le j \le N,\ X.MH[j] = X.MH[j] \,\}\right|}{N} \tag{5}$$
Since we are comparing an object to itself, the number of matching MinHash values is equal to the number of MinHash values, resulting in
$$\mathrm{sim}_{mh}(X, X) = \frac{N}{N} = 1, \tag{6}$$
which is the maximum similarity value. Therefore, the MinHash similarity metric from Equation (3) satisfies reflexivity.
Symmetry states that the similarity between two objects should be the same regardless of the order in which they are compared. In the case of the MinHash similarity from Equation (3), by swapping the sets X and Y, we obtain
$$\mathrm{sim}_{mh}(Y, X) = \frac{\left|\{\, j \mid 1 \le j \le N,\ Y.MH[j] = X.MH[j] \,\}\right|}{N} \tag{7}$$
Since the number of matching MinHash values between X and Y is the same as the number of matching MinHash values between Y and X (i.e., the same hash values are being compared),
$$\left|\{\, j \mid 1 \le j \le N,\ X.MH[j] = Y.MH[j] \,\}\right| = \left|\{\, j \mid 1 \le j \le N,\ Y.MH[j] = X.MH[j] \,\}\right|, \tag{8}$$
the MinHash similarity metric from Equation (3) satisfies symmetry:
$$\mathrm{sim}_{mh}(X, Y) = \mathrm{sim}_{mh}(Y, X) \tag{9}$$

3.2. Semantic Similarity

To compute the semantic similarity between two text sequences, we propose a semantic similarity pipeline based on sentence embeddings and Named Entity Recognition (NER) [33]. The pipeline is illustrated in Figure 5 and, in this section, we describe its components.

3.2.1. Sentence Tokenizer

First, each text is divided into sentences, resulting in sequences of variable length:
$$X = \{x_1, \ldots, x_n\}, \quad n \ge 1. \tag{10}$$
For this, we apply the sentence tokenizer available in the NLTK (Natural Language Toolkit) Python tokenizer package (The NLTK tokenizer package is available at https://www.nltk.org/api/nltk.tokenize.html, accessed on 7 March 2023).
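A minimal usage sketch; note that NLTK does not ship a dedicated Romanian Punkt model, so this relies on the default model, which is an assumption of this example rather than a statement about our exact setup.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")            # one-time download of the Punkt models

text = "Prima propoziție. A doua propoziție!"
sentences = sent_tokenize(text)   # default model; NLTK has no Romanian-specific one
print(sentences)                  # ['Prima propoziție.', 'A doua propoziție!']
```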

3.2.2. Named Entity Recognition

Next, we extract named entities from the text, applying the available Romanian NLP pipeline from spaCy (ro_core_news_lg) (The SpaCy trained pipelines for Romanian are available at https://spacy.io/models/ro, accessed on 8 March 2023). We use the entity recognizer to extract the entities and their corresponding labels from each sentence, without filtering them based on the labels. An example of the extracted named entities together with their corresponding labels is presented in Figure 6.
To build the identifier of an entity, we lemmatize its text and then apply a stemmer to the lemmatized form.
“UE” is short for “Uniunea Europeană”, both forms representing the same entity. However, the lemmatizer that we use produces different outputs for the two forms, which is why we also apply the stemmer. To illustrate this process, the identifier for the “UE” and “Uniunea Europeană” forms is constructed as follows:
$$\text{UE} \xrightarrow{(1)} \text{Uniunea\_Europeană} \xrightarrow{(2)} \text{uniun european},$$
$$\text{Uniunea Europeană} \xrightarrow{(1)} \text{uniune european} \xrightarrow{(2)} \text{uniun european},$$
where step (1) is lemmatization and step (2) is stemming.
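A sketch of the identifier construction, assuming NLTK’s Snowball stemmer for Romanian as the stemming step; the exact stemmer used in our implementation may differ.

```python
import spacy
from nltk.stem.snowball import SnowballStemmer

nlp = spacy.load("ro_core_news_lg")       # Romanian pipeline with NER and lemmatizer
stemmer = SnowballStemmer("romanian")

def entity_identifiers(sentence: str) -> set:
    """Lemmatize each entity, then stem the lemma, as in the UE example above."""
    ids = set()
    for ent in nlp(sentence).ents:        # all entities, no filtering by label
        lemma = " ".join(tok.lemma_ for tok in ent)               # step (1)
        words = lemma.lower().replace("_", " ").split()
        ids.add(" ".join(stemmer.stem(w) for w in words))         # step (2)
    return ids
```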

3.2.3. Entities Match

In practice, for a large number of text documents, the number of document pairs for which we would have to calculate the similarity score is too high. To reduce computational costs, we set a threshold on the minimum number of common entities between two text documents, starting from the assumption that two texts that have no common entities are not similar. Thus, the remaining stages of the pipeline continue only if this condition is met.
For two given text documents, X and X′, and their corresponding sets of extracted entity identifiers, $\epsilon$ and $\epsilon'$, we define the entities match score as the ratio between the number of common entities and the total number of entities from the two texts:
$$\sigma(X, X') = \frac{|\epsilon \cap \epsilon'|}{|\epsilon \cup \epsilon'|}. \tag{11}$$
In our experiments, we use a threshold of 10% for the entities match score between two input texts. For lower values, we consider that the two input documents are not similar, assigning them a similarity score of 0.
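A direct translation of Equation (11) and the 10% filter into Python, reusing the entity identifier sets built in the previous step:

```python
def entities_match(eps: set, eps_prime: set) -> float:
    """Equation (11): common entities over total entities."""
    union = eps | eps_prime
    return len(eps & eps_prime) / len(union) if union else 0.0

THRESHOLD = 0.10                      # 10% entities match score

def passes_entity_filter(eps: set, eps_prime: set) -> bool:
    return entities_match(eps, eps_prime) >= THRESHOLD
```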

3.2.4. Sentence Embeddings

The method we propose to extract sentence embeddings is based on the pre-trained BERT [24] for the Romanian language (The BERT base cased model for Romanian that we used is available at https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1, accessed on 7 March 2023) [34]. BERT uses a technique called MLM (Masked Language Modelling) to pre-train the model on a massive dataset of text. During pre-training, BERT randomly masks some of the words in the input text, and the model must then predict the original word based on the context provided by the non-masked words. This allows the model to learn the relationships between words in a sentence and their context.
The BERT model also includes a special token called the [CLS] token, which is added at the beginning of the input sentence and is used for classification tasks such as text classification or Named Entity Recognition. The final hidden state corresponding to this token is used as the aggregate representation of the input sentence for the classification task. However, researchers have shown that better representations can be obtained by applying different pooling strategies to the sequence of hidden states of the internal layers of BERT [21,22,35].
Each sentence in a given input sequence X is tokenized using the specific BERT tokenizer corresponding to the pre-trained model that we use, and the resulting output is passed to BERT to produce sentence embeddings. After this step, the input sequence of sentences is encoded as a sequence of dense vectors of length 768, as follows:
$$E = \{e_1, \ldots, e_n\}, \quad n \ge 1. \tag{12}$$
To extract sentence embeddings from BERT, we experiment with two methods:
  • We take the dense vector representation of the [CLS] token as the semantic representation of the input sentence.
  • We take the average of the hidden state of the last layer on the time axis.
We also include in our experiments sentence embeddings from the Sentence Transformers Library (The Sentence Transformers library is available at https://www.sbert.net/index.html, accessed on 18 March 2023). First introduced in [21], Sentence Transformers can be used to compute sentence embeddings that can be compared using cosine similarity. The pre-trained Sentence Transformer that we use is distiluse-base-multilingual-cased-v2 (The pre-trained multilingual Sentence Transformer that we used is available at https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2, accessed on 18 March 2023) [36], an extension of the Multilingual Universal Sentence Encoder from [37]. Different from the BERT model we use, the selected Sentence Transformer maps sentences to dense vectors with a length of 512.
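The three embedding variants can be sketched as follows with the Hugging Face transformers and sentence-transformers libraries; batching and GPU placement are omitted for brevity.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

sentences = ["Prima propoziție.", "A doua propoziție."]

tok = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
bert = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state         # (batch, tokens, 768)

cls_emb = hidden[:, 0, :]                          # bert_cls: [CLS] representation
mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding when averaging
lhs_emb = (hidden * mask).sum(1) / mask.sum(1)     # bert_lhs: mean over the time axis

# st variant: multilingual Sentence Transformer, 512-dimensional vectors
st_emb = SentenceTransformer("distiluse-base-multilingual-cased-v2").encode(sentences)
```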

3.2.5. Sentence Similarity

For two given text sequences, X and X′, with an arbitrary number of sentences, $n_1$ and $n_2$, respectively, and their corresponding sentence embedding sequences, E and E′, respectively, we determine the values of the similarity matrix $M := (m_{ij})$, $1 \le i \le n_1$, $1 \le j \le n_2$, as
$$m_{ij} = \mathrm{sim}(e_i, e'_j), \tag{13}$$
where $\mathrm{sim}(e_i, e'_j)$ represents the cosine similarity, defined as
$$\mathrm{sim}(e_i, e'_j) = \cos(\theta) = \frac{e_i \cdot e'_j}{\|e_i\|\,\|e'_j\|}. \tag{14}$$
During our preliminary experiments, we observed that the values of the similarity matrix M are greater than zero even in cases where the two sentences underlying such a value had nothing in common from a semantic point of view. As researchers have shown, this may be due to the anisotropic word embeddings space produced by BERT, in which word embeddings occupy a narrow cone in the vector space [23,38].
To solve this problem, we propose a method to modify the values of the matrix M according to the entities in each sentence. For two given sentences, $x_i$ and $x'_j$, and their corresponding sets of extracted entity identifiers, $\epsilon_i$ and $\epsilon'_j$, we apply Equation (11) from Section 3.2.3 to compute the entities match score between the two sentences, $\sigma(x_i, x'_j)$. Then, the final similarity matrix $M' := (m'_{ij})$, $1 \le i \le n_1$, $1 \le j \le n_2$, is defined as
$$m'_{ij} = \mathrm{sim}(e_i, e'_j) \cdot \sigma(x_i, x'_j). \tag{15}$$
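A sketch of the matrix construction, reusing entities_match from Section 3.2.3; util.cos_sim comes from the sentence-transformers utility library mentioned in Section 4.2.

```python
from sentence_transformers import util

def similarity_matrix(E, E_prime, sent_entities, sent_entities_prime):
    """Cosine matrix from Equation (13), scaled per sentence pair by the
    entities match score, as in Equation (15)."""
    M = util.cos_sim(E, E_prime).numpy()           # n1 x n2 cosine similarities
    for i, eps_i in enumerate(sent_entities):
        for j, eps_j in enumerate(sent_entities_prime):
            M[i, j] *= entities_match(eps_i, eps_j)
    return M
```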

3.2.6. Text Similarity

We define the final similarity between the input text sequences, $\mathrm{similarity}(X, X')$, as the average of the similarity of X to X′, denoted by $\mathrm{similarity}(X|X')$, and the similarity of X′ to X, denoted by $\mathrm{similarity}(X'|X)$:
$$\mathrm{similarity}(X, X') = \frac{\mathrm{similarity}(X|X') + \mathrm{similarity}(X'|X)}{2}. \tag{16}$$
To compute the similarity of X to X′, for each sentence $x_i$ in X we could take the corresponding sentence $x'_j$ in X′ for which the sentence similarity value, $m_{ij}$, is maximum. However, different sentences from X may have the maximum similarity with the same sentence from X′. To cope with this problem, we select the pairs of corresponding sentences, $(x_i, x'_j)$, that solve the linear sum assignment problem:
$$\max \sum_i \sum_j m_{ij} c_{ij}, \tag{17}$$
where $c_{ij} = 1$ if $x_i$ is assigned to $x'_j$, and $c_{ij} = 0$ otherwise. In this way, each sentence $x_i$ in X is assigned to at most one sentence $x'_j$ in X′ and vice versa.
If X has more sentences than X′, not every sentence in X will have a correspondent in X′, and vice versa. This means that the similarity of X to X′ must also take into account the sentences without a correspondent. Thus, we define the similarity of X to X′ and the similarity of X′ to X as
$$\mathrm{similarity}(X|X') = \frac{1}{n_1} \max \sum_i \sum_j m_{ij} c_{ij}, \tag{18}$$
$$\mathrm{similarity}(X'|X) = \frac{1}{n_2} \max \sum_i \sum_j m_{ij} c_{ij}, \tag{19}$$
and the final similarity between the input text sequences becomes
$$\mathrm{similarity}(X, X') = \frac{n_1 + n_2}{2 n_1 n_2} \max \sum_i \sum_j m_{ij} c_{ij}. \tag{20}$$
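The assignment in Equation (17) can be solved with SciPy’s linear_sum_assignment, which handles rectangular matrices and maximization directly; Equation (20) then reduces to a few lines.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def text_similarity(M: np.ndarray) -> float:
    """Equation (20) for an n1 x n2 sentence similarity matrix M."""
    n1, n2 = M.shape
    rows, cols = linear_sum_assignment(M, maximize=True)   # Equation (17)
    matched = M[rows, cols].sum()
    return (n1 + n2) / (2 * n1 * n2) * matched
```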
The similarity metric from Equation (20) is reflexive and symmetric. When $X = X'$, we have $n_1 = n_2$ and the two text sequences are identical, so each sentence is assigned to itself with similarity 1, and the similarity equation becomes
$$\mathrm{similarity}(X, X) = \frac{n_1 + n_1}{2 n_1 n_1} \sum_i 1 = \frac{2 n_1}{2 n_1 n_1} \cdot n_1 = 1. \tag{21}$$
Hence, the similarity metric from Equation (20) is reflexive.
Since Equation (20) is derived from Equation (16), by swapping the sequences X and X′, we obtain
$$\mathrm{similarity}(X', X) = \frac{\mathrm{similarity}(X'|X) + \mathrm{similarity}(X|X')}{2}. \tag{22}$$
Summation is commutative, meaning that
$$\mathrm{similarity}(X|X') + \mathrm{similarity}(X'|X) = \mathrm{similarity}(X'|X) + \mathrm{similarity}(X|X'). \tag{23}$$
Therefore, the similarity metric from Equation (16) and, by extension, the similarity metric from Equation (20) are symmetric:
$$\mathrm{similarity}(X, X') = \mathrm{similarity}(X', X). \tag{24}$$

3.3. Algorithm Parallelization

The design of the algorithm interface makes it easy to parallelize. The core algorithm is exposed through a public interface:
  computeSemanticSimilarity(document x1, document x2),
where x1 and x2 are the input documents for which the semantic similarity is to be calculated, and the output is a floating-point number between 0 and 1 representing the computed semantic similarity.
Separating the actual algorithm implementation from its interface allows us to implement different techniques and variations of the semantic similarity algorithm while using the same code base for parallelization.
The pseudocode for the semantic similarity algorithm is given in Listing 1.
Listing 1. Semantic Similarity Algorithm.
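Listing 1 appears as an image in the published version. The following Python sketch reconstructs the flow it describes, reusing the helpers sketched in the previous subsections; embed() stands for any of the three embedding variants from Section 3.2.4 and is an assumed name, not part of the published listing.

```python
def computeSemanticSimilarity(x1: str, x2: str) -> float:
    """Sketch of the pipeline in Figure 5 (Sections 3.2.1 to 3.2.6)."""
    s1, s2 = sent_tokenize(x1), sent_tokenize(x2)
    ents1 = [entity_identifiers(s) for s in s1]
    ents2 = [entity_identifiers(s) for s in s2]
    # document-level entities match filter (Section 3.2.3)
    if entities_match(set().union(*ents1), set().union(*ents2)) < THRESHOLD:
        return 0.0
    M = similarity_matrix(embed(s1), embed(s2), ents1, ents2)  # Equations (13)-(15)
    return text_similarity(M)                                  # Equation (20)
```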
This algorithm will be applied to the document database, which is continuously populated by web crawlers. Considering the database expressed as a set of documents,
$$D = \{d_1, \ldots, d_n\}, \quad n \ge 1, \tag{25}$$
the final result will be expressed as a semantic similarity square matrix SM of size n, where $SM_{ij}$ represents the semantic similarity between documents $d_i$ and $d_j$. The SM matrix is symmetric, thus only the elements above its main diagonal need to be computed. The elements on the main diagonal have the value 1, since each document is a perfect match to itself.
The pseudocode for the semantic similarity algorithm applied to the document database D is given in Listing 2.
Listing 2. Semantic Similarity Algorithm on document database.
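Listing 2 is likewise an image in the published version; a sketch of the same loop, with the diagonal set to 1 and only the upper triangle computed:

```python
import numpy as np

def computeSemanticSimilarityMatrix(D: list) -> np.ndarray:
    """Sketch of Listing 2 over the document database D."""
    n = len(D)
    SM = np.eye(n)                     # diagonal is 1: each document matches itself
    for i in range(n):
        for j in range(i + 1, n):      # SM is symmetric: compute the upper triangle
            SM[i, j] = SM[j, i] = computeSemanticSimilarity(D[i], D[j])
    return SM
```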
The algorithm takes as input the set of documents D and outputs the SM matrix.
In terms of complexity, this algorithm belongs to the time complexity class $O(n^2)$, where n is the number of documents in the document database D.
On a single computer, applying this algorithm to a continuously growing document database clearly does not scale, as the time complexity exhibits quadratic growth relative to the number of documents. Therefore, we parallelized this algorithm so that it can run on multiple computing nodes and multiple threads.
The pseudocode for the multi-threaded parallelized semantic similarity algorithm applied to the document database D is the same as in Listing 2, with the instructions replaced with their parallel versions (Listing 3):
Listing 3. Semantic Similarity Algorithm on document database—multi-threaded.
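As an illustration of Listing 3, the pair loop can be dispatched to a pool of workers; this is one possible parallel version, not the exact published implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def computeSemanticSimilarityMatrix_multiThreaded(D: list, workers: int = 4) -> np.ndarray:
    """Sketch of Listing 3: upper-triangle pairs scored by a thread pool."""
    n = len(D)
    SM = np.eye(n)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = pool.map(lambda p: computeSemanticSimilarity(D[p[0]], D[p[1]]), pairs)
        for (i, j), s in zip(pairs, scores):
            SM[i, j] = SM[j, i] = s
    return SM
```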
While this takes care of the algorithm parallelization on one node, a separate distributed approach enables the algorithm to run in parallel on multiple computing nodes. To achieve this behavior, the distributed algorithm runs in three separate phases as follows:
  • The sentence tokenization and NER extraction phase. This phase takes all the documents from the document database D and extracts the sentences, sentence embeddings and NER entities. Each node in the cluster analyzes a portion of the document database according to a hash function. A document $d \in D$ is processed by a node k if
    $$\mathrm{hash}(d) \bmod n = k, \tag{26}$$
    where n is the number of computing nodes.
    The document d is processed by the function extractNER (document d) from Listing 1.
    The processing results of document d are then stored in a distributed cache.
    The pseudocode for this phase is given in Listing 4.
    Listing 4. Distributed Semantic Similarity Algorithm—document processing.
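A sketch of this phase (the pseudocode in Listing 4 is an image in the published version); the cache object and its store method are assumed interfaces to the distributed cache.

```python
def process_documents(D: list, node_id: int, num_nodes: int, cache) -> None:
    """Sketch of Listing 4: each node processes its share of D (Equation 26)."""
    for d in D:
        if hash(d) % num_nodes == node_id:     # Equation (26)
            cache.store(d, extractNER(d))      # sentences, embeddings, entities
```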
  • The NER-based clustering phase. This phase distributes the documents from the document database D into clusters according to their NER similarity. The reasoning behind the clustering is that semantically similar documents also exhibit NER similarity. Clustering the documents by NER saves large amounts of computation, since only the documents belonging to the same cluster have to be analyzed for semantic similarity. Moreover, this phase is not computation-intensive, since it only analyzes the NER representation of the documents. At the end of this phase, the document database D is split into different document clusters:
    $$D = \bigcup_{i=1}^{n_c} dc_i, \quad n_c \ge 1, \tag{27}$$
    where $dc_i$, $i = 1, \ldots, n_c$, are the clusters and $n_c$ is the total number of clusters.
  • The semantic similarity analysis phase. Each computing node is assigned a document cluster from the previous phase, according to a hash function. A document cluster dc is processed by a node k if
    $$\mathrm{hash}(dc) \bmod n = k, \tag{28}$$
    where n is the number of computing nodes.
    We define hash(dc) as follows:
    $$\mathrm{hash}(dc) = \sum_{i=1}^{n_{dc}} \mathrm{hash}(d_i), \quad d_i \in dc, \tag{29}$$
    where $n_{dc}$ is the number of documents in the cluster dc.
    The document cluster d c is processed by the function computeSemanticSimilarityMatrix_multiThreaded (document set D) from Listing 3.
    The processing results of document cluster d c are then stored in a distributed cache for further analysis.
    The pseudocode for this phase is given in Listing 5.
    Listing 5. Distributed Semantic Similarity Algorithm—document cluster processing.
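A sketch of this phase (Listing 5 is an image in the published version), with the same assumed cache interface:

```python
def process_clusters(clusters: list, node_id: int, num_nodes: int, cache) -> None:
    """Sketch of Listing 5: each node scores its share of clusters (Equation 28)."""
    for dc in clusters:
        if sum(hash(d) for d in dc) % num_nodes == node_id:   # Equations (28), (29)
            cache.store(dc, computeSemanticSimilarityMatrix_multiThreaded(dc))
```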
    The result of processing a document cluster is a semantic similarity matrix. In the context of online disinformation, this matrix can be used to find the patient zero news article and to track narrative changes, as described in Section 3.4.

3.4. Finding Patient Zero and Tracking Narrative Changes

Semantic similarity analysis can be applied in a wide variety of scenarios. One such scenario is related to fighting online disinformation. In this context, finding patient zero and tracking narrative changes is paramount in exposing fake news and its propagation on the Internet. Patient zero is considered to be the root source from where the disinformation is propagated online. In fact, the techniques discussed in this paper were integrated into a specialized AI module of the FiDisD project that aims to fight online disinformation (The FiDisD project: https://www.trublo.eu/fidisd/, accessed on 15 March 2023).
In the previous sections, we discussed how we can compute semantic similarity among documents in a document database. In the context of the FiDisD project, these documents represent online news articles that are fetched from the internet by a series of specialized web crawlers.
In Section 3.3 we showed how the document database is organized into disjoint clusters based on semantic similarity. After this step, each cluster contains similar documents and can be further processed in parallel and independently of the other clusters.
Figure 7 illustrates the main steps of processing a document cluster containing 10 documents (news articles). The documents also contain timestamp metadata that represents the date and time when these documents were fetched from the Internet. For the sake of simplicity, we labeled these documents from 1 to 10, with lower label values corresponding to lower timestamps.
  • Figure 7a illustrates the semantic similarity graph of the document cluster. Each node in the graph represents a document. This graph is based on the semantic similarity adjacency matrix computed by the algorithms presented in the previous sections. The minimum document similarity threshold used for this cluster is 35%. This graph is an undirected graph.
  • Figure 7b further filters this undirected graph by applying a stronger document similarity threshold of 75%. At the same time, we remove all the edges between the nodes whose timestamps do not follow a timeline. This time relation transforms the initial undirected graph into a directed graph, where a node points to another node only if it meets the similarity threshold and its timestamp is higher.
  • Figure 7c illustrates the similarity paths, marked with red, of each document in the graph. These paths are obtained for each node, by selecting the parent with the highest similarity.
  • Figure 7d illustrates the results of removing all edges from the directed graph, except the ones marked with red. The result is a forest of semantic similarity trees.
  • Figure 7e illustrates how we can track the common source of propagation given a series of documents. Here, we wanted to find the common source for documents 8 and 10, which is 4. The problem of finding the common source is known as the Lowest Common Ancestor (LCA) problem, and there are algorithms that solve it in linear time [39]. A code sketch of these graph-processing steps is given after this list.
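A compact sketch of steps (b) to (e) above, using plain dictionaries; SM is the cluster’s similarity matrix and ts[i] is the fetch timestamp of document i, both assumed given.

```python
def similarity_forest(SM, ts, threshold: float = 0.75) -> dict:
    """Steps (b)-(d): keep edges above the threshold that respect the timeline,
    then keep only the most similar earlier parent of each node."""
    n = len(ts)
    parent = {}
    for j in range(n):
        earlier = [i for i in range(n) if ts[i] < ts[j] and SM[i][j] >= threshold]
        parent[j] = max(earlier, key=lambda i: SM[i][j]) if earlier else None
    return parent

def common_source(parent: dict, a, b):
    """Step (e): lowest common ancestor by walking the two root paths."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent[a]
    while b is not None and b not in ancestors:
        b = parent[b]
    return b
```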
In the context of online disinformation, the same or similar information is being presented in different news articles on different publisher sites. As the information spreads, some narrative changes might also occur. It is possible to start from a piece of true information just to gradually inject disinformation as the news propagates online on other sites. Once we mark a series of similar news items as fake news, by using the techniques described in this paper, we can track and identify the common source of disinformation and analyze how it is propagated on the Internet. Therefore, by providing tools that can show exactly where a particular piece of news started and how it changed its narrative as it propagated online, we can bring valuable insights and help in fighting online disinformation.
Figure 7. Document cluster for 10 documents. (a) Similarity graph (threshold 35%). (b) Similarity directed graph (threshold 75%). (c) Parent node selection. (d) Disjoint trees (forest) generation. (e) Finding common document source.

4. Experimental Results

The solution and algorithms proposed in Section 3 were implemented using the Python programming language. Python libraries such as PyTorch, NLTK and SpaCy were used, as mentioned in the solution description. All these libraries are free to use, so the solution can be reproduced based on our proposed algorithms and equations. Other programming languages can be used as well, especially for fine-tuning the algorithms’ performance. As stated earlier, we evaluated the proposed techniques both from the point of view of the quality of the results and from the point of view of execution times, assessing scalability at the same time.

4.1. Quality Analysis

The algorithms were tested on a collection of 7845 unique news articles crawled from seven Romanian news sites. We are interested in evaluating the proposed system’s ability to identify similar articles that cover the same story and present the same news.
The first step of the evaluation was to manually label the pairs of similar articles. Since for N articles there are $\frac{N(N-1)}{2}$ pairs, the number of pairs is more than 30 million, which is too much for human labeling. Instead, we ran all the proposed metrics with reduced thresholds and considered all the similar pairs discovered by at least one method for manual labeling. To speed things up, we ran a single linkage clustering algorithm [40,41] on them. Any two clusters $C_1$ and $C_2$ were joined if any similarity metric found a similar pair of articles $(x, y)$ with $x \in C_1$ and $y \in C_2$. The result was 242 clusters, comprising 1300 unique articles, ranging in size from 2 to 58. Most of the pairs from the large clusters were dissimilar due to the chaining effect specific to the single linkage clustering approach (if A is similar to B and B is similar to C, then A, B and C will end up in the same cluster, regardless of the fact that A and C are dissimilar).
The manual labeling produced 822 pairs of articles that were considered similar. Since the total number of pairs is 30,768,090, we have 822 positive examples and 30,767,268 negative ones. Due to the imbalanced nature of the data (explainable, since most pairs of news articles are not similar), we cannot use the accuracy metric [42] to evaluate the similarity systems, since a trivial system that labels any pair as dissimilar would obtain 99.997% accuracy. We will use the following metrics instead:
$$P = \frac{TP}{TP + FP} \tag{30}$$
$$R = \frac{TP}{TP + FN} \tag{31}$$
$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \tag{32}$$
The Precision (Equation (30)) is the ratio between the True Positives and the number of pairs labeled by the system as similar (both True Positives and False Positives). The Recall (Equation (31)) is the ratio between the True Positives and the number of similar pairs resulting from the manual labeling (True Positives and False Negatives). The $F_1$ score (Equation (32)) [43] is the harmonic mean between the Precision and the Recall and gives both scores equal weight.
To capture the influence of the similarity thresholds on the results, we also performed a Precision–Recall (PR) analysis [44] on the data. This type of analysis is an alternative to the Receiver Operator Characteristic (ROC) analysis [45]. In the classical definition of the ROC analysis, the True Positive Rate (defined as $TPR = \frac{TP}{TP + FN}$) is plotted against the False Positive Rate (defined as $FPR = \frac{FP}{FP + TN}$). Since the True Negatives in the FPR are a large imbalance factor, we found the PR analysis more appropriate here.
The PR analysis is a plot where the threshold is varied in the interval $[0, 1]$ and, for each threshold, we compute the Precision and the Recall. Each threshold is represented by a point in the PR plot, with the Recall on the x axis and the Precision on the y axis. An ideal point would be $(1, 1)$, representing 100% Recall with 100% Precision. The Area Under the Curve (AUC) of the PR plot can be used as a quality metric, besides Precision and Recall, to assess the similarity systems.
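One way to compute these metrics and the PR curve with scikit-learn, assuming scores holds the similarity value of every labeled pair and labels holds the manual annotation (1 for similar, 0 otherwise):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(labels, scores)
pr_auc = auc(recall, precision)

f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))        # the last PR point has no associated threshold
print(f"best threshold={thresholds[best]:.2f}, F1={f1[best]:.3f}, AUC={pr_auc:.3f}")
```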
Table 1 summarizes the results obtained for each of the similarity systems presented in this paper. For each method, we report the Precision, the Recall, the $F_1$ score and the AUC score. We also computed the threshold that maximizes the $F_1$ score, since this balances both the Precision and the Recall. The best value obtained for each score is bolded in the table.
As stated in Section 3.2.4, to extract sentence embeddings we use the dense vector representation of the [CLS] token from BERT (bert_cls), we average the hidden state of the last layer of BERT (bert_lhs), or we use the Sentence Transformers library (st). The text similarity score for this method is computed based on the sentence similarity matrix from Equation (13). We also include the corresponding variants, namely bert_cls_ner, bert_lhs_ner and st_ner, that use the adjusted sentence similarity matrix based on the entities match scores from Equation (15).
The detailed Precision–Recall plots are presented in Figure 8 for the Locality-Sensitive Hashing technique, and in Figure 9 for the BERT and Sentence Transformers techniques.

4.2. Time Efficiency Analysis

Experimental setup: To test the efficiency of the proposed semantic textual similarity algorithm, we used a system with 4 virtual CPUs (Intel(R) Xeon(R) E5-2640 v4 @ 2.40 GHz), 48 GB RAM and 4 NVIDIA GeForce GTX TITAN X GPUs. We implemented the pseudocode from Section 3.3 as a Python script using PyTorch for BERT and Sentence Transformer inference on the GPU, with batches of at most 128 sentences. The cosine similarity was computed on the GPU, using the Sentence Transformers utility library. To test the parallelization capability of the algorithm, we used the PyTorch multiprocessing library to spawn 4 processes, each process using a separate GPU. For the single-process variant, only one GPU was used.
We measured the time required to run the algorithm on batches of 500, 1000, 2000 and 4000 text documents. The comparison of the times needed to run the variants of the algorithm is illustrated in Figure 10. For the multi-process variant, the times are presented in Figure 11.
The performance of the LSH approach is plotted in Figure 12, using the same batch sizes that were used for the semantic similarity experiments. For each batch size, all the news articles were processed using the flows in Figure 2 and Figure 4. This “preprocessing time” includes the time required to index all the documents on the Band Hashes. Next, each document was searched in the indexed collection for similar items, using the LSH approach that extracted a list of candidates that have at least one common Band Hash with the searched item. We decided to separate the preprocessing time and the search time, as the articles are usually preprocessed only once, while the search for similar documents can be performed multiple times, when searching for the source of fake news.
Figure 13 shows that the LSH approach is linear in the number of documents, obtaining an average preprocessing time of 54 ms and an average search time of 41.5 ms. These results were confirmed by generating 100,000 synthetic articles having the same characteristics as the initial document corpus. Since the implementation was performed in pure Python, we can expect a significant speedup by re-implementing the system in C++.

5. Conclusions

This paper presents multiple approaches for assessing the similarity between news articles. As discussed in Section 3.4, identifying similar articles is a key component in fake news detection and tracking, as it enables the determination of the original source of disinformation.
The first presented approach is a textual approach that represents each news article as a set of n-grams, then uses the Locality-Sensitive Hashing technique to compute a further representation that can be indexed in order to quickly find similar articles.
The second approach, which takes into account the semantics of the news articles, is based on sentence embeddings and Named Entity Recognition. We experimented with a pre-trained BERT for the Romanian language and a multi-lingual Sentence Transformer to extract sentence embeddings. Named Entity Recognition is used for two separate tasks: filtering out pairs of articles that have a low probability of being similar, since they have less than 10% entities in common, and adjusting the cosine similarities calculated from the sentence embeddings in the similarity matrix.
The experimental results show that the semantic approach is better at identifying similar documents, with the bert_lhs_ner method obtaining an $F_1$ score of over 70%, almost 10% better than the Locality-Sensitive Hashing textual approach. This result was expected, as the textual approach is only able to identify similar articles where the copy still contains phrases and expressions from the original document. It is a more conservative approach, obtaining the highest precision (about 80%), but its recall was just below 50%.
The best results obtained by the semantic approach also use the Named Entity Recognition module, filtering for similarity computation only the documents with at least 10% common entities and taking into account the entities match for the overall similarity. This solution provided better results and also helped to improve the performance.
The scalability issues were also taken into account. While on small batches of samples (500 documents) the processing times are similar, on larger batches (4000 documents), the textual approach can be even 100 times faster than the serial version of the semantic approach. Since the execution time for the textual approach is linear in the number of documents, we consider it more suitable for large collections, even if the quality of the results is below the semantic approach.
Both approaches have their qualities and flaws. The textual Locality-Sensitive Hashing approach offers better performance, being suitable for large collections, where poorer results are better than late results, while the semantic approach obtains better quality results when the running time is not critical. In future work, we will try to use the Locality-Sensitive Hashing technique for the semantic representation of the article, achieving the best of both worlds: quality results and scalability.
The techniques presented in this paper were integrated into a specialized AI module of the FiDisD platform, a decentralized anti-disinformation platform for fact-checking and trust assessment built upon blockchain technology, crowd wisdom, and federated artificial intelligence modules. This platform provides the end user with a simple and objective way to assess which media institutions are most reliable and to check the trust level of online news articles.

Author Contributions

Conceptualization, C.N.B., C.O. and C.-G.A.; methodology, C.-G.A., C.O. and C.N.B.; software, C.-G.A., C.O. and C.N.B.; validation, C.O. and C.-G.A.; formal analysis, C.O. and C.-G.A.; investigation, C.-G.A. and C.O.; data curation, C.O. and C.-G.A.; writing—original draft preparation, C.-G.A., C.N.B., C.O. and F.L.; writing—review and editing, C.O., C.N.B., C.-G.A. and F.L.; visualization, C.O., C.-G.A. and C.N.B.; supervision, C.N.B. All authors have read and agreed to the published version of the manuscript.

Funding

The research presented in this paper is part of the FiDisD project. FiDisD is the acronym for “Fighting disinformation using decentralized actors featuring AI and blockchain technologies”. The FiDisD project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 957228. FiDisD is developed in the context of TruBlo (“Trusted and reliable content on future blockchains”) which is part of the European Commission’s Next Generation Internet (NGI) initiative.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI      Artificial Intelligence
ASCII   American Standard Code for Information Interchange
AUC     Area Under the Curve
BERT    Bidirectional Encoder Representations from Transformers
CRC32   32-bit Cyclic Redundancy Checksum
FiDisD  “Fighting Disinformation using Decentralized Actors” project
LCA     Lowest Common Ancestor
LSH     Locality-Sensitive Hashing
MLM     Masked Language Modelling
NER     Named Entity Recognition
NLP     Natural Language Processing
NLTK    Natural Language Toolkit
PR      Precision–Recall
ROC     Receiver Operator Characteristic
TF-IDF  Term Frequency–Inverse Document Frequency

References

1. Kanoh, H. Why do people believe in fake news over the Internet? An understanding from the perspective of existence of the habit of eating and drinking. Procedia Comput. Sci. 2018, 126, 1704–1709.
2. Kreps, S.; McCain, R.M.; Brundage, M. All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. J. Exp. Political Sci. 2022, 9, 104–117.
3. Susukailo, V.; Opirskyy, I.; Vasylyshyn, S. Analysis of the attack vectors used by threat actors during the pandemic. In Proceedings of the 2020 IEEE 15th International Conference on Computer Sciences and Information Technologies (CSIT), Zbarazh, Ukraine, 23–26 September 2020; Volume 2, pp. 261–264.
4. Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, 11–14 May 2020; pp. 354–367.
5. Singh, R.; Singh, S. Text similarity measures in news articles by vector space model using NLP. J. Inst. Eng. Ser. 2021, 102, 329–338.
6. Bisandu, D.B.; Prasad, R.; Liman, M.M. Clustering news articles using efficient similarity measure and N-grams. Int. J. Knowl. Eng. Data Min. 2018, 5, 333–348.
7. Sarwar, T.B.; Noor, N.M.; Miah, M.S.U. Evaluating keyphrase extraction algorithms for finding similar news articles using lexical similarity calculation and semantic relatedness measurement by word embedding. PeerJ Comput. Sci. 2022, 8, e1024.
8. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26.
9. Rupnik, J.; Muhic, A.; Leban, G.; Skraba, P.; Fortuna, B.; Grobelnik, M. News across languages: Cross-lingual document similarity and event tracking. J. Artif. Intell. Res. 2016, 55, 283–316.
10. Dumais, S.T.; Letsche, T.A.; Littman, M.L.; Landauer, T.K. Automatic cross-language retrieval using latent semantic indexing. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval; Stanford University: Stanford, CA, USA, 1997; Volume 15, p. 21.
11. Hotelling, H. The most predictable criterion. J. Educ. Psychol. 1935, 26, 139.
12. Baraniak, K.; Sydow, M. News articles similarity for automatic media bias detection in Polish news portals. In Proceedings of the 2018 Federated Conference on Computer Science and Information Systems (FedCSIS), Poznan, Poland, 9–12 September 2018; pp. 21–24.
13. Neculoiu, P.; Versteegh, M.; Rotaru, M. Learning text similarity with Siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, 11 August 2016; pp. 148–157.
14. Choi, S. Internet News User Analysis Using Deep Learning and Similarity Comparison. Electronics 2022, 11, 569.
15. Wang, J.; Dong, Y. Measurement of text similarity: A survey. Information 2020, 11, 421.
16. Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—A survey. ACM Comput. Surv. 2021, 54, 1–37.
17. Peinelt, N.; Nguyen, D.; Liakata, M. tBERT: Topic models and BERT joining forces for semantic similarity detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7047–7055.
18. Li, Z.; Lin, H.; Shen, C.; Zheng, W.; Yang, Z.; Wang, J. Cross2Self-attentive bidirectional recurrent neural network with BERT for biomedical semantic text similarity. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; pp. 1051–1054.
19. Feifei, X.; Shuting, Z.; Yu, T. BERT-based Siamese Network for Semantic Similarity. J. Phys. Conf. Ser. 2020, 1684, 012074.
20. Viji, D.; Revathy, S. A hybrid approach of Weighted Fine-Tuned BERT extraction with deep Siamese Bi-LSTM model for semantic text similarity identification. Multimed. Tools Appl. 2022, 81, 6131–6157.
21. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv 2019, arXiv:1908.10084.
22. Ma, X.; Wang, Z.; Ng, P.; Nallapati, R.; Xiang, B. Universal text representation from BERT: An empirical study. arXiv 2019, arXiv:1910.07973.
23. Li, B.; Zhou, H.; He, J.; Wang, M.; Yang, Y.; Li, L. On the sentence embeddings from pre-trained language models. arXiv 2020, arXiv:2011.05864.
24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
25. Manning, C.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
26. Tufiş, D.; Chiţu, A. Automatic diacritics insertion in Romanian texts. In Proceedings of the International Conference on Computational Lexicography (COMPLEX), Pecs, Hungary, 16–19 June 1999; Volume 99, pp. 185–194.
27. Peterson, W.W.; Brown, D.T. Cyclic codes for error detection. Proc. IRE 1961, 49, 228–235.
28. Sobti, R.; Geetha, G. Cryptographic hash functions: A review. Int. J. Comput. Sci. Issues 2012, 9, 461.
29. Jaccard, P. The distribution of the flora in the alpine zone. New Phytol. 1912, 11, 37–50.
30. Leskovec, J.; Rajaraman, A.; Ullman, J.D. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2020.
31. Oprisa, C. A MinHash approach for clustering large collections of binary programs. In Proceedings of the 2015 20th International Conference on Control Systems and Computer Science, Bucharest, Romania, 27–29 May 2015; pp. 157–163.
32. Oprişa, C.; Checicheş, M.; Năndrean, A. Locality-sensitive hashing optimizations for fast malware clustering. In Proceedings of the 2014 IEEE 10th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj, Romania, 4–6 September 2014; pp. 97–104.
33. Marrero, M.; Urbano, J.; Sánchez-Cuadrado, S.; Morato, J.; Gómez-Berbís, J.M. Named entity recognition: Fallacies, challenges and opportunities. Comput. Stand. Interfaces 2013, 35, 482–489.
34. Dumitrescu, S.; Avram, A.M.; Pyysalo, S. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Online, 2020; pp. 4324–4328.
35. Artene, C.G.; Tibeică, M.N.; Leon, F. Using BERT for Multi-Label Multi-Language Web Page Classification. In Proceedings of the 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 28–30 October 2021; pp. 307–312.
36. Reimers, N.; Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv 2020, arXiv:2004.09813.
37. Yang, Y.; Cer, D.; Ahmad, A.; Guo, M.; Law, J.; Constant, N.; Abrego, G.H.; Yuan, S.; Tar, C.; Sung, Y.H.; et al. Multilingual universal sentence encoder for semantic retrieval. arXiv 2019, arXiv:1907.04307.
38. Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv 2019, arXiv:1909.00512.
39. Bender, M.A.; Farach-Colton, M.; Pemmasani, G.; Skiena, S.; Sumazin, P. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 2005, 57, 75–94.
40. Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
41. Sibson, R. SLINK: An optimally efficient algorithm for the single-link cluster method. Comput. J. 1973, 16, 30–34.
42. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2021, 17, 168–192.
43. Van Rijsbergen, C. Information retrieval: Theory and practice. In Proceedings of the Joint IBM/University of Newcastle upon Tyne Seminar on Data Base Systems, Newcastle upon Tyne, UK, 4–7 September 1979; Volume 79.
44. Cook, J.; Ramadas, V. When to consult precision-recall curves. Stata J. 2020, 20, 131–148.
45. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
Figure 2. Pipeline for extracting word n-grams from unstructured text.
Figure 3. The probability for two articles with Jaccard similarity s ∈ [0, 1] to have at least one common Band Hash, for r = 5 and b = 30.
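For reference, under the standard banding construction for MinHash signatures [30], this probability is

P(s) = 1 - (1 - s^r)^b

For r = 5 and b = 30, P(0.5) = 1 - (1 - 0.5^5)^30 ≈ 0.61, while P(0.8) = 1 - (1 - 0.8^5)^30 ≈ 0.99999, which produces the steep S-shaped transition shown in the figure.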
Figure 4. Pipeline for computing the Locality-Sensitive Hashing representation from word n-grams.
Figure 5. Semantic similarity pipeline: we split the input texts into sentences, extract the named entities and the embeddings for each sentence, compute the similarities between the sentences in the two texts and aggregate them into the final text similarity.
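For orientation, a minimal sketch of the embedding-and-aggregation steps of this pipeline follows, assuming a Sentence-BERT-style encoder from the sentence-transformers library. The model checkpoint and the symmetric max-match aggregation rule are illustrative assumptions, not the paper's exact configuration, and the named-entity step is omitted for brevity.

# Hedged sketch of the sentence-level semantic similarity pipeline
# in Figure 5 (model choice and aggregation rule are assumptions).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def text_similarity(sentences_a, sentences_b):
    """Semantic similarity between two texts, given as sentence lists."""
    emb_a = model.encode(sentences_a, convert_to_tensor=True)
    emb_b = model.encode(sentences_b, convert_to_tensor=True)
    sim = util.cos_sim(emb_a, emb_b)   # pairwise cosine similarity matrix
    # Aggregate: average each sentence's best match in the other text,
    # symmetrically in both directions.
    return 0.5 * (sim.max(dim=1).values.mean()
                  + sim.max(dim=0).values.mean()).item()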
Figure 6. Example of extracted named entities and their corresponding labels.
Figure 8. Precision–Recall analysis for Locality-Sensitive Hashing.
Figure 9. Precision–Recall analysis for Semantic Similarity. (a) bert_cls. (b) bert_cls_ner. (c) bert_lhs. (d) bert_lhs_ner. (e) st. (f) st_ner.
Figure 10. Comparison of the times needed to run the variants of the semantic similarity algorithm on different numbers of text documents.
Figure 11. Comparison of the times needed to run the variants of the semantic similarity algorithm on different numbers of text documents using 4 processes.
Figure 12. Locality-Sensitive Hashing performance.
Figure 13. Average Locality-Sensitive Hashing preprocessing and search time.
Table 1. Evaluation metrics.

Method         Best Threshold    Precision    Recall    F1 Score    AUC Score
lsh            40%               0.806        0.490     0.609       0.675
bert_cls       90%               0.574        0.317     0.409       0.305
bert_cls_ner   38%               0.702        0.699     0.700       0.725
bert_lhs       90%               0.688        0.401     0.507       0.386
bert_lhs_ner   40%               0.721        0.684     0.702       0.729
st             90%               0.656        0.701     0.678       0.716
st_ner         49%               0.590        0.585     0.588       0.631
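For clarity, the sketch below shows how metrics of this kind can be computed with scikit-learn once pairwise similarity scores and ground-truth labels are available at a fixed decision threshold. The labels and scores are toy values, not the paper's evaluation data.

# Illustrative computation of Table 1-style metrics with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 1, 0, 0, 1, 0]                     # 1 = the article pair is truly similar
scores = [0.82, 0.45, 0.30, 0.55, 0.70, 0.10]   # similarity scores in [0, 1]

threshold = 0.40                                # e.g., the best threshold found for lsh
y_pred = [int(s >= threshold) for s in scores]

print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred),
      roc_auc_score(y_true, scores))            # AUC uses the raw scores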
