Article

A New Algorithm Framework for the Influence Maximization Problem Using Graph Clustering

by Agostinho Agra 1,*,† and Jose Maria Samuco 2,†
1 Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal
2 Center for Research & Development in Mathematics and Applications, University of Aveiro, 3810-193 Aveiro, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2024, 15(2), 112; https://doi.org/10.3390/info15020112
Submission received: 5 January 2024 / Revised: 6 February 2024 / Accepted: 7 February 2024 / Published: 14 February 2024
(This article belongs to the Special Issue Optimization Algorithms and Their Applications)

Abstract: Given a social network modelled by a graph, the goal of the influence maximization problem is to find k vertices that maximize the number of active vertices through a process of diffusion. For this diffusion, the linear threshold model is considered. A new algorithm, called ClusterGreedy, is proposed to solve the influence maximization problem. The ClusterGreedy algorithm creates a partition of the original set of nodes into small subsets (the clusters), applies the SimpleGreedy algorithm to the subgraphs induced by each subset of nodes, and obtains the seed set from a combination of the seed set of each cluster by solving an integer linear program. This algorithm is further improved by exploring the submodularity property of the diffusion function. Experimental results show that the ClusterGreedy algorithm provides, on average, higher influence spread and lower running times than the SimpleGreedy algorithm on Watts–Strogatz random graphs.

1. Introduction

The influence maximization (IM) problem has attracted many researchers in the last few decades, especially due to its applications in various situations, such as viral marketing [1,2,3], terrorist attack prevention [4], and network control [5]. Viral marketing has proved especially interesting to researchers. Chen et al. [6] justify the strong motivation for studying information and influence diffusion models with viral marketing: if a company wants to introduce a new product in the market through word-of-mouth effects in a social network, it must select influencers in the network and convince them to adopt the new product. The goal is for the selected influencers to generate a large cascade effect in the network, driving other people to use the new product. In this sense, diffusion is the transmission of information among people, analogous to the transmission of viruses.
The literature on IM is very extensive. Recent studies range from using reinforcement learning to solve IM problems [7] to investigating the equilibrium between information influence and the cocoon effect [8]. Several surveys on IM have been published; see [9,10,11,12,13,14].
The IM problem is a discrete optimization problem that was formulated by Kempe et al. [15]. Given a social network modelled by a directed graph $G = (V, E)$ with influence weights on the edges, the influence maximization problem consists of finding the k (a pre-defined parameter) nodes of the network that, once activated, maximize the spread of influence under a propagation model. The influence diffusion process occurs in discrete time steps. Here, we consider the linear threshold model for the propagation, which is classified as a progressive diffusion model, since activated nodes cannot be deactivated [11].
The IM problem is NP-hard under the linear threshold model (and under other diffusion models such as the Independent Cascade model); see [15] for details. Therefore, heuristics play an important role in the solution techniques. As stated in [11], “existing works focus on approximate solutions, and a keystone of these algorithmic IM studies is the greedy framework”. Among these heuristics, probably the most well-known is the SimpleGreedy algorithm proposed by Kempe et al. [15]. This algorithm employs a greedy approach by selecting the node with the highest marginal gain over the network and including it in the seed set at each iteration.
In this paper, we propose a new algorithm framework based on greedy approaches for the IM problem. This framework consists of creating a partition of the graph nodes into clusters, solving the IM problem within each cluster to obtain a set of candidate nodes from each cluster, and choosing the seed set from those candidate nodes. To solve the IM problem in each cluster, we use a greedy approach. Following the taxonomy proposed in [11], the proposed algorithms are classified into simulation-based methods, since Monte-Carlo simulations are employed for evaluating the influence spread of the selected seed set.
First, we introduce the ClusterGreedy algorithm for the IM problem under the LT diffusion model. ClusterGreedy creates a partition of the given network and runs the SimpleGreedy on the smaller networks resulting from that partition. Then, it solves a subproblem to determine the seed set of the original network from the seed sets obtained for each of the smaller networks. With this approach, we solve more problems with a greedy approach and, consequently, run more simulations (one for each iteration of the SimpleGreedy). However, each problem is solved on a smaller network, which reduces the running times of the simulations, since this step is computationally very expensive and depends on the size of the network. By exploring the submodularity property of the diffusion function, we improve the ClusterGreedy by reducing the number of iterations that need to be performed. The resulting algorithm, called Improved ClusterGreedy, performs a few more iterations than the SimpleGreedy; however, all simulations run on the smaller networks resulting from the clustering operation. The computational experiments show that this approach reduces the runtime to 25% of that of the algorithm proposed by Kempe et al. [15] on Watts–Strogatz random graphs and to 54% on the three selected real datasets. One important aspect of our approach is that by splitting the network into smaller networks and analysing each network separately, we lose the interactions between nodes in different clusters. This may be seen as a disadvantage. However, as reported in the computational section, this operation may also be advantageous: analysing each network separately to find the candidate solutions for the seed set in each cluster, and choosing the seed set later by combining the different sets of candidates, helps to avoid the myopic behaviour of the greedy approach.
This paper is structured as follows: In Section 2, we review the literature most related to our work. The IM problem, the diffusion model, and the algorithm proposed by Kempe et al. [15] are described in Section 3. In Section 4, the ClusterGreedy and the Improved ClusterGreedy algorithms are introduced, and the main components, namely the Markov cluster algorithm and the subproblem to determine the final seed set from the seed of each smaller network, are explained. Experimental results are presented in Section 5. Finally, Section 6 summarizes the main conclusions of this paper.

2. Related Literature

The literature for IM is quite vast. In this section, we review the literature that is most related to our work. In particular, we focus on works related to greedy approaches.
Kempe et al. [15] introduced the SimpleGreedy algorithm to solve the influence maximization problem. This algorithm uses the concept of marginal gain, which is the increase in influence spread obtained by adding a node to the seed set. The SimpleGreedy algorithm employs a greedy approach by selecting, at each iteration, the node with the highest marginal gain over the network and including it in the seed set. This process is repeated k times [16]. Given the spread function σ, the marginal gain of a given node u with respect to a seed set S can be computed as $\sigma(S \cup \{u\}) - \sigma(S)$, which Kempe et al. [15] estimated with Monte Carlo simulations. This algorithm is known to have poor running time performance; that inefficiency has motivated several alternative approaches, as well as our study to reduce the computation time.
The CELF (Cost-Effective Lazy Forward) algorithm was introduced by Leskovec et al. [17]. In the first round, the algorithm estimates the influence spread of each node using Monte Carlo simulations. In later rounds, since the marginal gain of a node can never exceed its marginal gain in previous iterations, the number of Monte Carlo simulations can be greatly reduced [18]. CELF was presented as being 700 times faster than the SimpleGreedy. Goyal et al. [19] introduced CELF++ and described it as being 35–55% faster than CELF.
The NewGreedy algorithm was introduced by Chen et al. [20] and improves the one proposed by Kempe et al. [15]. NewGreedy reuses the outcomes of the Monte Carlo simulations to calculate the influence spread of every node within the same iteration [18].
Cheng et al. [21] introduced the StaticGreedy algorithm, which guarantees submodularity and monotonicity in an effective way. Instead of generating a large number of Monte Carlo simulations in each iteration, a moderate number of Monte Carlo snapshots are generated at the start and then reused in all subsequent iterations. This approach guarantees that the estimated influence spread of any seed set remains consistent over successive iterations, and it ensures that the properties of submodularity and monotonicity are upheld.
The SIMPATH algorithm was introduced by Goyal et al. [22]. The algorithm uses the simpath-spread procedure, based on the assumption that the influence propagated from a node can be calculated as the sum of the weights of all arcs leaving that node. When this idea is combined with the CELF algorithm, the resulting algorithm can be applied to solve the IM problem.
Clustering algorithms for graphs are also known as community detection algorithms for networks. Community-based algorithms detect communities in social networks and solve the IM problem through the structure of the given network data; see [23,24,25,26]. Rahimkhani et al. [23] consider an approach that first identifies the communities of the social network and then creates a new network based on those communities, in which each node represents a community and includes the nodes of the corresponding community. A number of nodes proportional to the size of each community is then selected, based on degree centrality, to form a candidate set, and the k most influential nodes are selected from this set based on the betweenness centrality measure. Community-based algorithms should not be confused with our clustering approach. Although such algorithms use clusters (corresponding to communities), their approach is quite different from ours: we split the graph into clusters and consider them separately, whereas community-based algorithms simplify the clusters by choosing a subset of nodes that represents each cluster.
Ghayour-Baghbani et al. [27] propose an efficient method using linear programming (LP) for the IM problem under the linear threshold propagation model. The method works in three steps: first, sparse matrix multiplication is used to estimate the influence graph from the social network graph; second, the influence graph is used to model the IM problem in the LT model as a linear program; finally, the candidate seed set is created from the linear program's output using a randomized rounding approach.
Other methods solve the IM problem by LP in the LT model, but either their runtime is higher than that of the greedy algorithm or they tend to produce worse solutions (the spread of the produced seed set is smaller than that of the greedy algorithm) [28,29,30,31].

3. Preliminaries

In this section, we formulate the IM problem and provide an overview of the linear threshold model and the SimpleGreedy algorithm. Table 1 provides the notation used in this paper.
Given a graph $G = (V, E)$, a stochastic diffusion model on G, and a budget k, the IM problem is the stochastic optimization problem of finding a set $S \subseteq V$ with $|S| \le k$, such that the influence spread of S, $\sigma(S)$, under the given stochastic diffusion model is maximized. That is, determine $S^* \subseteq V$ such that
\[ S^* = \underset{S \subseteq V,\ |S| = k}{\operatorname{argmax}} \ \sigma(S). \]
Regarding the problem formulation above, note that finding a group of k nodes with the greatest combined influence is not the same as finding the top k nodes with the greatest individual influences. Two individuals who each have a significant impact on the same group of people may both be identified as top influencers, but it may suffice to choose only one of them as a seed to maximize the influence.
One of the two primary diffusion models, the independent cascade (IC) model, can be used to explain simple contagions, where activations might result from a single source, like the spread of viruses or information. However, in several circumstances, people need to be exposed to various independent sources for their behavior to change. The linear threshold (LT) model was introduced by Kempe et al. [15] as a stochastic diffusion model to represent this kind of behavior.
In this work, the LT diffusion model is considered. In the LT model, each arc $(u, v) \in E$ has an influence weight $w_{uv} \in [0, 1]$, which indicates how likely it is that u influences v. The weights are normalized such that $\sum_{u \in N_{in}(v)} w_{uv} \le 1$ for all $v \in V$ [6]. Formally, we have the following definition [6]:
Definition 1 (Linear Threshold Model). Given a graph $G = (V, E)$, with influence weights $w_a$, $a \in E$, assigned to the arcs, and a seed set $S_0$, the LT model creates the active sets $S_t$, for all $t \ge 1$, repeatedly using the following randomized procedure: first, a threshold value $\theta_v$ is drawn for each $v \in V$ from the uniform distribution on $[0, 1]$. For each time step $t \ge 1$, the set $S_t$ is initialized to the same value as $S_{t-1}$. Each node $v \in V \setminus S_{t-1}$ that is not active is included in $S_t$ (i.e., node v becomes active at time t) if the combined weight of the arcs from its active in-neighbors is equal to or greater than $\theta_v$, that is, if
\[ \sum_{u \in S_{t-1} \cap N_{in}(v)} w_{uv} \ge \theta_v. \]
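As an illustration, one randomized run of this diffusion process can be sketched in Julia (the language used for the experiments in Section 5). The graph and weight representations below are assumptions made for the sketch, not the authors' code:
```julia
using Graphs

# Illustrative sketch of one randomized run of the LT process in Definition 1.
# `g` is a directed graph, `w` maps an arc (u, v) to its weight, `seed` is S_0.
function lt_simulate(g, w, seed::Set{Int})
    θ = rand(nv(g))          # thresholds θ_v drawn uniformly from [0, 1]
    active = copy(seed)
    changed = true
    while changed            # repeat until no inactive node crosses its threshold
        changed = false
        for v in vertices(g)
            v in active && continue
            s = sum(get(w, (u, v), 0.0) for u in inneighbors(g, v) if u in active; init = 0.0)
            if s >= θ[v]     # activation condition of Definition 1
                push!(active, v)
                changed = true
            end
        end
    end
    # Activations are progressive, so sweeping until a fixed point yields the
    # same final active set as the step-by-step sets S_1, S_2, … of the definition.
    return length(active)    # number of active nodes when the diffusion stops
end
```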
An important property in IM is submodularity:
Definition 2 (Submodularity [6]). Let V be a finite set, and denote by $2^V$ the power set of V. A function $f : 2^V \to \mathbb{R}$ is called submodular if, for each $S \subseteq T \subseteq V$ and any $v \in V \setminus T$, we have
\[ f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T). \]
In the context of IM, this property says that the marginal gain obtained by adding a node to the seed set decreases with the increase in the seed set.
Kempe et al. [15] proved the following theorem:
Theorem 1. 
The influence spread function  σ ( · )  in the linear threshold model is monotone and submodular.
Based on the monotonicity, non-negativity, and submodularity of the spread function σ, Kempe et al. [15] devised an algorithm to address the influence maximization problem: the SimpleGreedy algorithm (Algorithm 1). The algorithm iteratively selects the node with the highest marginal gain and includes it in the seed set, stopping when the seed set has k elements. Under the LT model (and the IC model), computing exact marginal gains, i.e., the exact expected spread, is #P-hard [6].
Since the SimpleGreedy algorithm needs the exact marginal gains, i.e., the exact expected spread (Line 3 of Algorithm 1), Kempe et al. proposed estimating them by Monte Carlo simulations with r runs. With this change, we obtain Algorithm 2.
Algorithm 1 SimpleGreedy (k, σ)
Input: k: cardinality of the returned set; σ: monotone and submodular set function
Output: seed set S
   1: initialize S ← ∅
   2: for i = 1 to k do
   3:     u ← argmax_{w ∈ V∖S} (σ(S ∪ {w}) − σ(S))
   4:     S ← S ∪ {u}
   5: end for
   6: return S
Algorithm 2 SimpleGreedy (G, k)
Input: G = (V, E): social graph; arc weights w_e, e ∈ E; k: cardinality of the returned set; r: number of Monte Carlo runs
Output: seed set S
   1: function mc-estimate(S, G)
   2:     counter ← 0
   3:     for j = 1 to r do
   4:         perform a simulation of the diffusion process on the graph G using the seed set S
   5:         n_a ← the number of active nodes once the diffusion concludes
   6:         counter ← counter + n_a
   7:     end for
   8:     return counter/r
   9: end function
 10: initialize S ← ∅
 11: for i = 1 to k do
 12:     u ← argmax_{w ∈ V∖S} mc-estimate(S ∪ {w}, G)
 13:     S ← S ∪ {u}
 14: end for
 15: return S
Notice that both Algorithms 1 and 2 generate a sequence of sets $S_0, S_1, \ldots, S_k$ and the corresponding estimates of their spread. This observation will be used later. Algorithm 2 has time complexity $O(krne)$, where n is the number of vertices and e is the number of edges.
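Building on the lt_simulate sketch above, Algorithm 2 could be rendered in Julia roughly as follows (an illustrative sketch, not the authors' implementation):
```julia
# Estimate σ(S) by averaging r Monte Carlo runs of the LT process (Lines 1–9).
mc_estimate(g, w, S, r) = sum(lt_simulate(g, w, S) for _ in 1:r) / r

# Greedy loop of Algorithm 2 (Lines 10–15): repeatedly add the node with the
# highest estimated marginal gain to the seed set.
function simple_greedy(g, w, k, r)
    S = Set{Int}()
    for _ in 1:k
        best_u, best_val = 0, -Inf
        for u in setdiff(vertices(g), S)
            val = mc_estimate(g, w, union(S, [u]), r)
            if val > best_val
                best_u, best_val = u, val
            end
        end
        push!(S, best_u)
    end
    return S
end
```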

4. Cluster Greedy Algorithm

This paper presents a novel algorithm named ClusterGreedy (Algorithm 3) that combines the SimpleGreedy (Algorithm 2) with graph clustering. The main idea is to create a partition of the set of nodes using clustering techniques. Each subset of nodes obtained from the partition induces a smaller subgraph. We apply the SimpleGreedy to each subgraph to generate a set of candidate nodes for the seed set. Then, a subproblem called the linking set problem is solved to obtain the best seed set from the combination of the sets of candidate nodes generated for each subgraph. We present an overview of the ClusterGreedy algorithm in Figure 1. The purpose of this approach is twofold. On the one hand, we aim to reduce the running times, since the evaluation of the marginal gains (the most expensive task from the computational point of view) is conducted on smaller subgraphs. On the other hand, as the SimpleGreedy approach is heuristic and known to be myopic, leaving the choice of the final seed set to a second step may help to obtain better solutions.
Next, the three main components of the proposed algorithm are presented: first, the clustering algorithm is described; then, the subproblem used to combine the candidate sets of nodes is modelled; finally, the full algorithm is detailed.
Algorithm 3 ClusterGreedy (G, k)
Input: G = (V, E): social graph; arc weights w_e, e ∈ E; k: cardinality of the returned set
Output: seed set S
   1: generate the clusters of graph G using the MCL algorithm
   2: set m to the number of clusters of G
   3: for j = 1 to m do
   4:     G_j ← the subgraph induced by the set of nodes in cluster j
   5:     S_j ← SimpleGreedy(G_j, k) (Algorithm 2)
   6: end for
   7: compute c_ij ≥ 0, the estimated spread within cluster j with seed set S_j^i
   8: solve the integer linear program defined by Equations (3)–(6)
   9: let x̄ be the optimal solution obtained
 10: S ← ∅
 11: for i = 1 to k do
 12:     for j = 1 to m do
 13:         if x̄_ij = 1 then
 14:             S ← S ∪ S_j^i
 15:         end if
 16:     end for
 17: end for
 18: return S

4.1. Markov Cluster Algorithm

Clustering aims to partition a given data set into clusters where the elements inside each cluster exhibit similarity or connectivity according to a preset criterion [32]. However, there exist graphs that lack a naturally clustered structure. Nevertheless, a clustering algorithm produces a clustering for every input graph. If the graph's structure is completely uniform, with edges evenly distributed across the nodes, the clustering produced by any technique will be somewhat arbitrary. This is no major issue for our approach, since the main goal is to create several smaller graphs.
In our case, the clustering is performed using the Markov clustering (MCL) algorithm [33]. The MCL algorithm emulates random walks on the underlying interaction network, employing two alternating operations: expansion and inflation [33]. First, loops are added to the given graph; the weight of each loop is the highest weight among the edges incident to the vertex. Next, the graph is converted into a stochastic "Markov" matrix, denoted A, which captures the probabilities of transitioning between each pair of vertices. The probability of a random walk of length λ between any two vertices is then determined through the expansion operation, which consists of computing $A^\lambda$. The resulting values are used to form the clusters in such a way that frequent random walks on the graph occur between nodes belonging to the same cluster.
In contrast to other clustering algorithms, such as K-means, the MCL algorithm does not require the number of clusters to be predetermined. Nevertheless, this number can be indirectly regulated through the inflation parameter, which directly impacts the granularity, or resolution, of the clustering result. Lower inflation values (1.3–1.4) result in fewer and larger clusters, whereas higher inflation values (5–6) lead to more and smaller clusters [34,35].
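For illustration, and assuming the mcl implementation from the Clustering.jl package used in Section 5, clustering a graph could be sketched as follows (the parameter names follow that package; treat the call as indicative rather than as the authors' exact setup):
```julia
using Graphs, Clustering

# Illustrative sketch: cluster a Watts–Strogatz graph with MCL, with the
# inflation parameter I = 5.5 used in Section 5.
g = watts_strogatz(3000, 6, 0.3)        # hypothetical test graph
A = Matrix(adjacency_matrix(g))         # MCL takes the adjacency matrix
res = mcl(A; inflation = 5.5)
labels = assignments(res)               # cluster label of each vertex
clusters = [findall(==(c), labels) for c in 1:maximum(labels)]
```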

4.2. Linking Set

In order to obtain the seed set for the original network from the candidate seed sets obtained for each cluster, it is necessary to solve a subproblem. Given m clusters, the aim is to find the best seed set of k nodes selected among the m candidate seed sets obtained for each cluster.
To model this subproblem as an integer linear program, consider non-negative weights $c_{ij}$, for all $i \in K = \{1, \ldots, k\}$ and all $j \in J = \{1, \ldots, m\}$, which represent the expected spread in cluster j when the seed set of cardinality i, denoted by $S_j^i$, is used. Also, consider the binary variables $x_{ij}$, for all $i \in K$ and $j \in J$, indicating whether we select i nodes from cluster j. The subproblem can be formulated as follows:
\[ \max \sum_{j=1}^{m} \sum_{i=1}^{k} c_{ij} x_{ij}, \tag{3} \]
\[ \sum_{j=1}^{m} \sum_{i=1}^{k} i \, x_{ij} = k, \tag{4} \]
\[ \sum_{i=1}^{k} x_{ij} \le 1, \quad j \in J, \tag{5} \]
\[ x_{ij} \in \{0, 1\}, \quad i \in K,\ j \in J. \tag{6} \]
The objective function (3) maximizes the estimated spread. Equation (4) ensures that exactly k nodes are chosen: the left-hand side adds the number of nodes selected from each cluster, and the total must be k. Constraints (5) impose that, for each cluster, at most one variable is positive; notice that if all variables are null for a given j, then no node is selected from cluster j. Constraints (6) are the integrality conditions on the variables.
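Since the experiments (Section 5) solve this subproblem with JuMP.jl and HiGHS.jl, a minimal sketch of model (3)–(6) might read as follows (illustrative, not the authors' code; c is the k × m matrix of estimated spreads):
```julia
using JuMP, HiGHS

# Illustrative sketch of model (3)–(6); c[i, j] is the estimated spread in
# cluster j when i seed nodes are taken from it.
function solve_lsp(c::Matrix{Float64}, k::Int)
    m = size(c, 2)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, x[1:k, 1:m], Bin)                                     # (6)
    @objective(model, Max, sum(c[i, j] * x[i, j] for i in 1:k, j in 1:m))  # (3)
    @constraint(model, sum(i * x[i, j] for i in 1:k, j in 1:m) == k)       # (4)
    @constraint(model, [j in 1:m], sum(x[i, j] for i in 1:k) <= 1)         # (5)
    optimize!(model)
    return value.(x)                    # optimal 0–1 selection matrix
end
```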
This problem is a particular case of the multiple-choice knapsack problem, and it is known as the linking set problem (LSP) [36], which can be solved by the following dynamic program (which is an adaptation of the one given in [36] to our case).
Let $\alpha(j, i)$ denote the optimal value of the subproblem when considering the first j clusters and i seed nodes. For $j = 1$, we obtain
\[ \alpha(1, i) = \begin{cases} c_{i1}, & i \in K \cup \{0\}, \\ -\infty, & \text{otherwise.} \end{cases} \]
For $j = 2, \ldots, m$, the recursive relations are given by
\[ \alpha(j, i) = \begin{cases} \max_{w \in \{0, \ldots, i\}} \{ \alpha(j-1, i-w) + c_{wj} \}, & i \in K \cup \{0\}, \\ -\infty, & \text{otherwise}, \end{cases} \]
where $c_{0j} = 0$ for all $j \in J$.
The optimal value of the LSP is given by $\alpha(m, k)$. This dynamic program can be solved in $O(mk^2)$ operations.
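A compact Julia sketch of this dynamic program, following the reconstruction above (array indices are shifted by one so that the state i = 0 can be stored), is:
```julia
# Dynamic-programming sketch for the LSP: α[j, i+1] stores α(j, i) for
# i = 0, …, k, with c(0, j) = 0 by convention.
function lsp_dp(c::Matrix{Float64}, k::Int)
    m = size(c, 2)
    cost(w, j) = w == 0 ? 0.0 : c[w, j]
    α = fill(-Inf, m, k + 1)
    for i in 0:k                          # base case j = 1
        α[1, i + 1] = cost(i, 1)
    end
    for j in 2:m, i in 0:k                # recursion over clusters
        α[j, i + 1] = maximum(cost(w, j) + α[j - 1, i - w + 1] for w in 0:i)
    end
    return α[m, k + 1]                    # optimal LSP value α(m, k)
end
```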
It should be noted that the optimal value of the LSP is an underestimate of the value of σ(S), since the spread of each set $S_j^i$ is estimated within each subgraph, not in the entire graph G, and the interactions between activated nodes in different subgraphs are not taken into account.
An important property of the IM models is submodularity; see Section 3. This property also holds when σ(·) is applied to each cluster. Consequently, for each cluster j, the gains from including a new node in the seed set are decreasing, that is,
\[ c_{i,j} - c_{i-1,j} \ge c_{i+1,j} - c_{i,j}, \tag{7} \]
where we assume again $c_{0,j} = 0$ for all $j \in J$.
When this property holds, the linking set problem can be solved more efficiently using the following greedy algorithm (Algorithm 4): in each iteration, compute the marginal gain of selecting one more node from each cluster, and select the node with the largest gain. This algorithm has time complexity O(mk).
Theorem 2. 
If condition (7) holds, then Algorithm 4 solves the LSP.
The proof of this theorem is given in Appendix B. If the values $c_{ij}$ are obtained from the true function σ(·) or from a greedy algorithm, then Property (7) holds and Theorem 2 can be applied.
Algorithm 4 Greedy algorithm for the linking set problem
   1: set value := 0; t := 0; and, for j ∈ J, i_j := 0
   2: while t < k do
   3:     compute j* = argmax_{j ∈ J} (c_{i_j+1, j} − c_{i_j, j})
   4:     value := value + c_{i_{j*}+1, j*} − c_{i_{j*}, j*}
   5:     i_{j*} := i_{j*} + 1
   6:     t := t + 1
   7: end while
   8: for j ∈ J do
   9:     x_{ij} := 1 if i = i_j, and x_{ij} := 0 otherwise, for i ∈ K
 10: end for
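A Julia sketch of Algorithm 4 follows (illustrative; the matrix c of per-cluster spread estimates is assumed precomputed, with c[i, j] the estimated spread of the i-node greedy seed set in cluster j):
```julia
# Greedy sketch of Algorithm 4: valid when the marginal gains satisfy
# condition (7); selects k nodes in O(mk) time.
function lsp_greedy(c::Matrix{Float64}, k::Int)
    m = size(c, 2)
    # marginal gain of taking the (i+1)-th node from cluster j, with c(0, j) = 0
    gain(i, j) = i >= size(c, 1) ? -Inf : c[i + 1, j] - (i == 0 ? 0.0 : c[i, j])
    taken = zeros(Int, m)            # i_j: number of nodes taken from cluster j
    value = 0.0
    for _ in 1:k
        jstar = argmax([gain(taken[j], j) for j in 1:m])
        value += gain(taken[jstar], jstar)
        taken[jstar] += 1
    end
    return value, taken
end
```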

4.3. The ClusterGreedy Algorithm

The full ClusterGreedy is described in Algorithm 3.
First, the algorithm applies the MCL algorithm to define the clusters of graph G. Then, it applies the SimpleGreedy to each subgraph of G induced by a cluster. SimpleGreedy obtains a candidate seed set $S_j$ of cardinality k for each cluster j. Recall that SimpleGreedy (see Algorithm 2) forms the set $S_j$ in a constructive manner, including one node at each iteration. Hence, it generates a sequence of sets $S_j^i$, for $i = 1, \ldots, k$, containing the seed sets with i nodes, and it estimates the spread of each set, which is given by $c_{ij}$. These values are the input parameters of the LSP presented in Section 4.2. Then, the seed set S is obtained from the optimal solution to the LSP by combining the sets generated for each cluster: S is the union of the sets $S_j^i$ such that $x_{ij} = 1$.
The time complexity of the ClusterGreedy is the maximum of the time complexity of the dynamic program for the linking set, O(km), and the time complexity of the SimpleGreedy calls, $O(kr \sum_{j=1}^{m} n_j e_j)$, where $n_j$ and $e_j$ are the numbers of nodes and edges, respectively, of the subgraph induced by cluster j. From this complexity, we observe that the size of the largest cluster is an important factor in the running time of the algorithm.
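Putting the pieces together, an end-to-end sketch of Algorithm 3, reusing mc_estimate and solve_lsp from the earlier sketches, might look as follows (again an illustration under the same assumptions, not the authors' code):
```julia
using Graphs

# SimpleGreedy that also records the seed sequence and the estimated spread
# after each insertion (the values c[i, j] needed by the LSP).
function greedy_sequence(g, w, k, r)
    S, seq, spreads = Set{Int}(), Int[], Float64[]
    for _ in 1:min(k, nv(g))
        cand = setdiff(vertices(g), S)
        vals = [mc_estimate(g, w, union(S, [u]), r) for u in cand]
        best = argmax(vals)
        push!(S, cand[best]); push!(seq, cand[best]); push!(spreads, vals[best])
    end
    return seq, spreads
end

# Outer loop of Algorithm 3: greedy on each cluster-induced subgraph, then
# combine the candidate seed sets by solving the LSP.
function cluster_greedy(g, w, clusters, k, r)
    m = length(clusters)
    c = zeros(k, m)
    seqs = Vector{Vector{Int}}(undef, m)
    for (j, nodes) in enumerate(clusters)
        sub, vmap = induced_subgraph(g, nodes)   # vmap: subgraph id -> original id
        inv = Dict(v => i for (i, v) in enumerate(vmap))
        wsub = Dict((inv[u], inv[v]) => wt for ((u, v), wt) in w
                    if haskey(inv, u) && haskey(inv, v))
        seq, spreads = greedy_sequence(sub, wsub, k, r)
        seqs[j] = vmap[seq]                      # back to original node ids
        c[1:length(spreads), j] .= spreads
    end
    x = solve_lsp(c, k)                          # model (3)–(6)
    S = Set{Int}()
    for j in 1:m, i in 1:k
        x[i, j] > 0.5 && union!(S, seqs[j][1:min(i, length(seqs[j]))])
    end
    return S
end
```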
When the values $c_{ij}$ are obtained from the greedy algorithm, Property (7) holds, and we can improve the ClusterGreedy: instead of solving, for each cluster, the SimpleGreedy for a seed set of k nodes, we merge the SimpleGreedy with the greedy algorithm for the LSP. The full algorithm, denoted Improved ClusterGreedy, is described in Algorithm 5.
Algorithm 5 starts by computing the gain from selecting a node from each cluster. Then, in each iteration, we choose the node from the cluster that provides a larger marginal gain and apply a new iteration of the greedy algorithm to that cluster.
This algorithm allows us to reduce the number of iterations performed in the calls to the SimpleGreedy. In particular, instead of performing mk calls to the function mc-estimate(S, G), it performs only k + m calls; for example, with m = 60 clusters and k = 30, this means 90 calls instead of 1800. Since mc-estimate(S, G) is the most time-consuming operation, with O(re) operations (where e is the number of edges of graph G), this reduction allows for substantial time savings.
Algorithm 5 Improved ClusterGreedy (G, k)
Input: G = (V, E): social graph; arc weights w_e, e ∈ E; k: cardinality of the returned set
Output: seed set S
   1: generate the clusters of graph G using the MCL algorithm
   2: set m to the number of clusters of G
   3: for j = 1 to m do
   4:     G_j ← the subgraph induced by the set of nodes in cluster j
   5:     S_j ← SimpleGreedy(G_j, 1) (Algorithm 2)
   6:     compute c_{1j}, the estimated spread within cluster j with seed set S_j^1
   7: end for
   8: set value := 0; t := 0; and, for j ∈ J, i_j := 0
   9: while t < k do
 10:     compute j* = argmax_{j ∈ J} (c_{i_j+1, j} − c_{i_j, j})
 11:     value := value + c_{i_{j*}+1, j*} − c_{i_{j*}, j*}
 12:     i_{j*} := i_{j*} + 1
 13:     t := t + 1
 14:     S_{j*} ← SimpleGreedy(G_{j*}, i_{j*} + 1) (Algorithm 2)
 15:     set c_{i_{j*}+1, j*} to the estimated spread obtained by SimpleGreedy(G_{j*}, i_{j*} + 1)
 16: end while
 17: S ← ∅
 18: for j ∈ J do
 19:     S ← S ∪ S_j^{i_j}
 20: end for
 21: return S

5. Computational Experiments

In this section, the computational experiments conducted to evaluate the proposed approach are reported. These experiments compare the performance of the ClusterGreedy algorithm (Algorithm 3) with the SimpleGreedy (Algorithm 2) and evaluate the gains of the Improved ClusterGreedy algorithm over the ClusterGreedy.
These experiments were conducted on a Dell desktop with an Intel Core i7-11700 processor (2.50 GHz), 16 GB of RAM, and 500 GB of storage, running Windows 11 Pro. All implementations were written in the Julia language.
We consider two sets of instances: a set of instances based on Watts–Strogatz random graphs, generated using the package Graphs.jl (version v1.8.0) [37], and a set of real datasets.
The LSP was solved using the packages JuMP.jl (version v1.11.1) and HiGHS.jl (version v1.5.2). The clusters were generated by MCL [34], available in the Julia package Clustering.jl (version v0.15.2). Based on some preliminary tests to control the number of clusters, we set the inflation parameter to I = 5.5.
We report the results for Watts–Strogatz random graphs in Section 5.1 and the results for the real datasets in Section 5.2.

5.1. Computational Results with Watts–Strogatz Random Graphs

For n equal to 3000, 4000, 5000, and 6000, we obtained, on average, 51, 62, 58, and 60 clusters (Figure 2), respectively. The runtime of MCL was 28.65 s, on average, which is very small when compared to the total running time of the algorithms. Figure 3 shows the average runtime of cluster generation for n equal to 3000, 4000, 5000, and 6000.
The experimental results show that the ClusterGreedy is faster than the SimpleGreedy. Both algorithms were tested with r = 50 simulations (10 instances) and r = 100 simulations (10 instances) for n equal to 3000, 4000, 5000, and 6000. ClusterGreedy used, on average, 35%, 36%, 27%, and 22% of the SimpleGreedy runtime, respectively, to obtain the seed set S (Figure 4 and Figure 5); globally, this corresponds to 25%. That is, on average, ClusterGreedy required only 25% of the runtime of SimpleGreedy to achieve the solution to the problem. More importantly, there is a clear trend showing that the gains in running time increase as n increases. The Wilcoxon signed-rank test used to compare the running times of both algorithms provides a p-value close to zero for all sets of instances, showing that the gains are statistically significant.
To evaluate the spread of the seed set S in the network, we simulate the propagation of S with 1000 runs and compute the average value. The results show that the seed sets obtained with ClusterGreedy provide a gain of about 3%, on average, in the estimated spread when compared with the seed sets obtained with SimpleGreedy (Figure 6). The Wilcoxon signed-rank test provides a p-value lower than 5% for all sets of instances except n = 3000 (for which it is 0.09), meaning that for the larger instances the improvement in the objective function value is statistically significant.
In the experimental results reported in Appendix A, for instances 4, 11, and 13 (Table A1), 4, 15, and 16 (Table A2), 1, 6, and 15 (Table A3), and 2, 4, 11, 13, 16, and 17 (Table A4), the runtime of the ClusterGreedy algorithm is at most 15% of the runtime of the SimpleGreedy algorithm. These were the best results obtained. In contrast, for instances 3 and 18 (Table A1), 20 (Table A2), and 5 (Table A4), the execution time of the ClusterGreedy algorithm was longer than SimpleGreedy's.
To compare the Improved ClusterGreedy with the ClusterGreedy, we tested the running times for n = 3000, 4000, 5000, 6000 and r = 50. The running times of the Improved ClusterGreedy were 4.1%, 2.8%, 2.5%, and 1.6% of those of the ClusterGreedy, respectively, and 2.8% on average. In relation to the SimpleGreedy, the running times of the Improved ClusterGreedy were, on average, 0.5%.

5.2. Computational Results with Real Datasets

  • Email-eu-core [38]: The network was created from email records of a large European research institution. There is an edge (u, v) in the network if individual u has sent at least one email to individual v. The dataset excludes any messages to or from the rest of the world; the emails exclusively document communication among members of the institution (the core). "Ground-truth" community memberships of the nodes are also included in the dataset: at the research institution, each individual belongs to exactly one of 42 departments.
  • Adolescent health [39]: This network was built from a survey conducted between 1994 and 1995. Each student was asked to list their top five female friends and their top five male friends. A node represents a student, and an edge (u, v) linking two students signifies that student u chose student v as a friend.
  • Gnutella peer-to-peer (8 August 2002) [38]: A snapshot of the Gnutella peer-to-peer file-sharing network captured in August 2002. The network topology is composed of nodes, which represent hosts, and edges, which represent connections among the hosts.
The statistics of the real-world graph datasets that we employ are outlined in Table 2.
The experimental results demonstrate that the ClusterGreedy algorithm is faster than the SimpleGreedy algorithm on the three chosen real datasets. The ClusterGreedy algorithm used 4% of the SimpleGreedy runtime on the Email-eu-core network, 70% on Adolescent health, and 60% on Gnutella peer-to-peer to obtain the seed set S (see Table 3). Globally, this represents 54%; that is, on average, ClusterGreedy achieved the solution to the problem in just 54% of the time required by SimpleGreedy on the selected real datasets. The best performance was attained on the dataset with the highest average clustering coefficient (the Email-eu-core network).

6. Conclusions

We presented a novel algorithmic strategy to solve the influence maximization problem, based on the idea of partitioning the set of nodes of the original network into smaller subsets, solving the IM problem on the subgraph induced by each subset of nodes, and, finally, combining the solutions obtained for the several subgraphs to generate a final solution (seed set). This general idea was applied using greedy algorithms, and the results showed that:
  • The proposed algorithm (called ClusterGreedy) needed only, on average, 25% of the classical greedy (SimpleGreedy) runtime to achieve the solution of the influence maximization problem;
  • Under the general condition that the marginal gains decrease with the increase in the size of the seed set (which is satisfied by the greedy algorithm and by the diffusion function under the linear threshold model), combining the solutions from the different clusters can be conducted in a much more efficient way. Moreover, the two-phase approach can be improved: a new algorithm that searches for the highest marginal gain across the different clusters runs, on average, in 4% of the running time of the ClusterGreedy algorithm.
Overall, the proposed approach is more efficient than the classical approach to solving the IM problem. It is important to note that this approach can be applied using other algorithms for the IM problem, not necessarily greedy ones; this future research direction includes the use of both proxy-based and sketch-based approaches, see [11]. However, for those cases, the condition required by the improved algorithm may not hold. Another research direction is to explore the relation of the IM problem to the p-median problem, where a given set of p nodes must also be selected. In Ref. [40], a decomposition approach for the p-median problem on disconnected graphs is introduced; this decomposition allows for solving large-scale problems. In our approach, we obtain disconnected graphs after performing the clustering. Hence, a similar decomposition approach can be employed for the IM problem.

Author Contributions

Conceptualization, J.M.S. and A.A.; methodology, J.M.S. and A.A.; software, J.M.S.; validation, J.M.S. and A.A.; formal analysis, J.M.S. and A.A.; investigation, J.M.S. and A.A.; data curation, J.M.S.; writing—original draft preparation, J.M.S.; writing—review and editing, J.M.S. and A.A.; visualization, J.M.S.; supervision, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by The Center for Research and Development in Mathematics and Applications (CIDMA) through the Portuguese Foundation for Science and Technology (FCT—Fundação para a Ciência e a Tecnologia), references https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020 (accessed on 6 February 2024).

Data Availability Statement

The Julia code used to obtain reported results can be shared by sending an email to jsamuco@gmail.com.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CELF  Cost-effective lazy forward
IM    Influence maximization
LT    Linear threshold
LSP   Linking set problem
LP    Linear programming
MCL   Markov clustering

Appendix A

Here, we present the tables with the full detailed results.
Table A1, Table A2, Table A3 and Table A4 provide detailed information regarding the running times and objective function values for n = 3000, n = 4000, n = 5000, and n = 6000, respectively, for the instances based on Watts–Strogatz random graphs. Column m gives the number of clusters, column k the number of elements in the seed set, and column r the number of Monte Carlo simulations. We can observe that the objective function value obtained from the LSP is an underestimate of the estimated spread, as expected.
Table A1. Computational results for n = 3000.

| # | m | k | r | MCL runtime (s) | SimpleGreedy runtime (s) | ClusterGreedy runtime (s) | Linking Set obj. | SimpleGreedy obj. | ClusterGreedy obj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 46 | 30 | 50 | 9.63 | 5686.06 | 1272.13 | 52 | 63 | 66 |
| 2 | 74 | 30 | 50 | 8.49 | 6102.56 | 1205.10 | 52 | 63 | 64 |
| 3 | 35 | 30 | 50 | 10.47 | 1842.13 | 2991.36 | 55 | 63 | 66 |
| 4 | 54 | 30 | 50 | 7.18 | 6448.28 | 646.86 | 54 | 64 | 55 |
| 5 | 36 | 30 | 50 | 9.54 | 2507.60 | 1075.74 | 57 | 63 | 67 |
| 6 | 36 | 30 | 50 | 19.01 | 6646.61 | 2276.28 | 55 | 64 | 66 |
| 7 | 57 | 30 | 50 | 11.23 | 3806.67 | 1530.79 | 51 | 62 | 55 |
| 8 | 35 | 30 | 50 | 18.50 | 5989.16 | 2078.36 | 54 | 64 | 67 |
| 9 | 45 | 30 | 50 | 11.30 | 4338.80 | 2415.55 | 51 | 63 | 65 |
| 10 | 35 | 30 | 50 | 11.01 | 4388.51 | 1671.92 | 53 | 63 | 53 |
| 11 | 72 | 30 | 100 | 10.74 | 17,638.40 | 2549.70 | 54 | 64 | 63 |
| 12 | 36 | 30 | 100 | 10.32 | 13,198.84 | 7041.78 | 52 | 64 | 66 |
| 13 | 70 | 30 | 100 | 9.66 | 23,175.31 | 2879.32 | 56 | 64 | 68 |
| 14 | 35 | 30 | 100 | 42.80 | 18,236.07 | 5763.19 | 56 | 64 | 67 |
| 15 | 35 | 30 | 100 | 10.60 | 22,917.41 | 9680.69 | 55 | 65 | 64 |
| 16 | 71 | 30 | 100 | 17.54 | 6037.13 | 3449.89 | 55 | 64 | 65 |
| 17 | 72 | 30 | 100 | 11.30 | 5725.89 | 2720.86 | 54 | 63 | 66 |
| 18 | 35 | 30 | 100 | 10.68 | 5859.43 | 6868.13 | 55 | 64 | 66 |
| 19 | 63 | 30 | 100 | 12.22 | 6199.05 | 1556.98 | 56 | 64 | 68 |
| 20 | 71 | 30 | 100 | 11.39 | 5965.02 | 1091.29 | 54 | 64 | 68 |
| Average | | | | 13 | 8635 | 3038 | 54 | 64 | 64 |
Table A2. Computational results for n = 4000.

| # | m | k | r | MCL runtime (s) | SimpleGreedy runtime (s) | ClusterGreedy runtime (s) | Linking Set obj. | SimpleGreedy obj. | ClusterGreedy obj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 37 | 40 | 50 | 20.61 | 19,183.01 | 6084.58 | 70 | 84 | 84 |
| 2 | 94 | 40 | 50 | 17.05 | 15,541.19 | 3535.49 | 71 | 83 | 88 |
| 3 | 47 | 40 | 50 | 24.58 | 17,288.80 | 3378.13 | 72 | 85 | 89 |
| 4 | 78 | 40 | 50 | 25.57 | 19,902.08 | 2953.08 | 70 | 84 | 87 |
| 5 | 70 | 40 | 50 | 20.28 | 19,153.86 | 4529.98 | 71 | 83 | 87 |
| 6 | 48 | 40 | 50 | 24.60 | 15,556.00 | 2729.04 | 73 | 84 | 88 |
| 7 | 95 | 40 | 50 | 20.66 | 9470.89 | 1879.85 | 72 | 84 | 89 |
| 8 | 96 | 40 | 50 | 23.28 | 10,521.98 | 1758.60 | 73 | 84 | 88 |
| 9 | 37 | 40 | 50 | 24.25 | 14,953.14 | 4982.72 | 72 | 84 | 88 |
| 10 | 83 | 40 | 50 | 26.16 | 8261.44 | 2475.15 | 73 | 84 | 88 |
| 11 | 47 | 40 | 100 | 15.73 | 37,350.58 | 16,860.93 | 75 | 85 | 89 |
| 12 | 64 | 40 | 100 | 17.51 | 13,187.33 | 6510.77 | 71 | 84 | 88 |
| 13 | 48 | 40 | 100 | 18.71 | 13,766.84 | 2511.51 | 72 | 85 | 88 |
| 14 | 49 | 40 | 100 | 19.37 | 12,624.12 | 2842.77 | 71 | 84 | 88 |
| 15 | 98 | 40 | 100 | 18.29 | 29,727.57 | 2001.41 | 69 | 85 | 89 |
| 16 | 95 | 40 | 100 | 16.80 | 31,476.40 | 3398.91 | 72 | 85 | 85 |
| 17 | 38 | 40 | 100 | 15.49 | 30,891.72 | 10,749.67 | 70 | 85 | 89 |
| 18 | 42 | 40 | 100 | 16.95 | 31,523.74 | 18,893.21 | 70 | 86 | 88 |
| 19 | 48 | 40 | 100 | 21.75 | 56,815.60 | 18,476.38 | 72 | 86 | 88 |
| 20 | 18 | 40 | 100 | 19.64 | 37,483.61 | 44,788.42 | 64 | 86 | 68 |
| Average | | | | 20 | 22,234 | 8067 | 71 | 85 | 87 |
Table A3. Computational results for n = 5000.

| # | m | k | r | MCL runtime (s) | SimpleGreedy runtime (s) | ClusterGreedy runtime (s) | Linking Set obj. | SimpleGreedy obj. | ClusterGreedy obj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 72 | 50 | 50 | 35.50 | 39,198.99 | 4090.11 | 87 | 104 | 109 |
| 2 | 27 | 50 | 50 | 33.66 | 37,746.83 | 12,551.05 | 83 | 104 | 110 |
| 3 | 35 | 50 | 50 | 38.13 | 43,241.71 | 21,132.67 | 88 | 103 | 97 |
| 4 | 45 | 50 | 50 | 30.11 | 41,008.61 | 15,922.91 | 90 | 105 | 112 |
| 5 | 59 | 50 | 50 | 28.34 | 23,410.06 | 8635.88 | 94 | 103 | 107 |
| 6 | 88 | 50 | 50 | 26.32 | 18,710.15 | 2134.49 | 92 | 104 | 106 |
| 7 | 76 | 50 | 50 | 28.28 | 17,171.25 | 2656.90 | 87 | 105 | 109 |
| 8 | 50 | 50 | 50 | 39.54 | 24,120.51 | 15,279.60 | 86 | 103 | 106 |
| 9 | 60 | 50 | 50 | 20.85 | 12,712.06 | 2265.87 | 91 | 103 | 111 |
| 10 | 60 | 50 | 50 | 23.88 | 10,538.98 | 2520.88 | 88 | 104 | 110 |
| 11 | 61 | 50 | 100 | 27.45 | 48,493.67 | 14,665.65 | 93 | 107 | 113 |
| 12 | 61 | 50 | 100 | 27.15 | 37,923.94 | 16,476.38 | 92 | 106 | 104 |
| 13 | 46 | 50 | 100 | 29.05 | 49,783.73 | 15,885.76 | 84 | 107 | 109 |
| 14 | 59 | 50 | 100 | 28.31 | 78,956.37 | 19,807.02 | 89 | 105 | 110 |
| 15 | 76 | 50 | 100 | 27.38 | 159,351.68 | 16,017.61 | 91 | 106 | 109 |
| 16 | 58 | 50 | 100 | 29.39 | 79,943.63 | 22,085.49 | 90 | 108 | 110 |
| 17 | 60 | 50 | 100 | 29.24 | 37,985.19 | 14,601.94 | 92 | 106 | 112 |
| 18 | 59 | 50 | 100 | 31.62 | 82,649.48 | 25,357.64 | 91 | 105 | 112 |
| 19 | 51 | 50 | 100 | 33.72 | 89,812.26 | 16,172.17 | 86 | 107 | 111 |
| 20 | 59 | 50 | 100 | 35.05 | 78,412.95 | 28,552.08 | 92 | 105 | 112 |
| Average | | | | 30 | 50,559 | 13,841 | 89 | 105 | 109 |
Table A4. Computational results for n = 6000.

| # | m | k | r | MCL runtime (s) | SimpleGreedy runtime (s) | ClusterGreedy runtime (s) | Linking Set obj. | SimpleGreedy obj. | ClusterGreedy obj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 71 | 60 | 50 | 64.63 | 47,839.63 | 13,682.93 | 107 | 124 | 132 |
| 2 | 142 | 60 | 50 | 34.95 | 47,010.98 | 2842.19 | 106 | 124 | 128 |
| 3 | 47 | 60 | 50 | 36.24 | 83,701.97 | 33,636.33 | 101 | 126 | 133 |
| 4 | 72 | 60 | 50 | 66.51 | 172,009.16 | 22,980.20 | 108 | 126 | 129 |
| 5 | 17 | 60 | 50 | 66.52 | 74,947.11 | 88,144.33 | 98 | 126 | 92 |
| 6 | 20 | 60 | 50 | 34.84 | 29,437.34 | 16,364.70 | 101 | 125 | 129 |
| 7 | 71 | 60 | 50 | 34.83 | 57,536.98 | 10,625.88 | 109 | 124 | 132 |
| 8 | 39 | 60 | 50 | 76.57 | 48,285.85 | 33,088.12 | 103 | 123 | 130 |
| 9 | 63 | 60 | 50 | 77.48 | 48,952.40 | 19,807.15 | 105 | 125 | 134 |
| 10 | 61 | 60 | 50 | 35.40 | 52,906.38 | 15,269.56 | 102 | 125 | 132 |
| 11 | 71 | 60 | 100 | 44.85 | 267,189.27 | 14,896.03 | 108 | 129 | 134 |
| 12 | 58 | 60 | 100 | 37.08 | 243,731.77 | 50,383.76 | 110 | 126 | 135 |
| 13 | 72 | 60 | 100 | 38.53 | 159,052.73 | 12,433.80 | 108 | 127 | 135 |
| 14 | 42 | 60 | 100 | 38.51 | 180,923.76 | 82,078.75 | 99 | 126 | 131 |
| 15 | 9 | 60 | 100 | 54.29 | 296,129.56 | 59,096.51 | 102 | 125 | 132 |
| 16 | 70 | 60 | 100 | 34.30 | 262,348.87 | 12,476.31 | 112 | 128 | 136 |
| 17 | 54 | 60 | 100 | 59.83 | 313,372.35 | 36,160.77 | 105 | 126 | 133 |
| 18 | 72 | 60 | 100 | 72.20 | 101,866.35 | 37,024.97 | 111 | 126 | 134 |
| 19 | 73 | 60 | 100 | 72.34 | 106,270.40 | 25,122.75 | 112 | 126 | 128 |
| 20 | 71 | 60 | 100 | 38.18 | 289,850.34 | 51,670.68 | 109 | 126 | 134 |
| Average | | | | 51 | 144,168 | 31,889 | 106 | 126 | 130 |

Appendix B

Here, we prove Theorem 2.
Proof. 
First, consider the following reformulation of the LSP using variables $y_{ij}$, $j \in J$, $i \in K$, which indicate whether at least i nodes from cluster j are selected. The model is written using the marginal gains, as follows:
\[ \max \sum_{j=1}^{m} \sum_{i=1}^{k} (c_{ij} - c_{i-1,j}) \, y_{ij} \]
\[ \sum_{j=1}^{m} \sum_{i=1}^{k} y_{ij} = k, \tag{A1} \]
\[ y_{ij} \le y_{i-1,j}, \quad j \in J,\ i \in K, \tag{A2} \]
\[ y_{ij} \in \{0, 1\}, \quad i \in K,\ j \in J, \]
where $y_{0j} = 1$ for all $j \in J$. Constraints (A1) ensure that k nodes are selected, and constraints (A2) ensure that if at least i nodes are selected from cluster j, then at least $i - 1$ nodes are also selected from cluster j.
Notice that the $y_{ij}$ variables can be related to the $x_{ij}$ variables as follows:
\[ x_{ij} = 1 \iff \big( y_{tj} = 1 \ \forall t \le i, \ \text{and} \ y_{tj} = 0 \ \forall t > i \big), \quad j \in J,\ i \in K. \]
Now, observe that the relaxation obtained by removing constraints (A2) is a problem of maximizing a linear function over a uniform matroid, which can be solved by the greedy algorithm. As the coefficients are decreasing for each j (since they satisfy Equation (7)), the greedy algorithm follows exactly the steps of Algorithm 4. Finally, since the solution obtained by this algorithm satisfies constraints (A2), we conclude that it is optimal for the LSP, which concludes the proof. □

References

  1. Domingos, P.; Richardson, M. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; pp. 57–66. [Google Scholar]
  2. Richardson, M.; Domingos, P. Mining knowledge-sharing sites for viral marketing. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 23–26 July 2002; pp. 61–70. [Google Scholar]
  3. Rui, X.; Meng, F.; Wang, Z.; Yuan, G. A reversed node ranking approach for influence maximization in social networks. Appl. Intell. 2019, 49, 2684–2698. [Google Scholar] [CrossRef]
  4. Huang, H.; Shen, H.; Meng, Z. Community-based influence maximization in attributed networks. Appl. Intell. 2020, 50, 354–364. [Google Scholar] [CrossRef]
  5. Leskovec, J.; Adamic, L.A.; Huberman, B.A. The dynamics of viral marketing. ACM Trans. Web TWEB 2007, 1, 5. [Google Scholar] [CrossRef]
  6. Chen, W.; Castillo, C.; Lakshmanan, L.V. Information and Influence Propagation in Social Networks; Morgan & Claypool Publishers: San Rafael, CA, USA, 2013. [Google Scholar]
  7. Haleh, S.; Dizaji, S.; Patil, K.; Avrachenkov, K. Influence Maximization in Dynamic Networks Using Reinforcement Learning. SN Comput. Sci. 2024, 5, 169. [Google Scholar]
  8. Ni, P.; Guidi, B.; Michienzi, A.; Zhu, J. Equilibrium of individual concern-critical influence maximization in virtual and real blending network. Inf. Sci. 2023, 648, 119646. [Google Scholar] [CrossRef]
  9. Arora, A.; Galhotra, S.; Ranu, S. Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 651–666. [Google Scholar]
  10. Guille, A.; Hacid, H.; Favre, C.; Zighed, D.A. Information diffusion in online social networks: A survey. SIGMOD Rec. 2013, 42, 17–28. [Google Scholar] [CrossRef]
  11. Li, Y.; Fan, J.; Wang, Y.; Tan, K.-L. Influence Maximization on Social Graphs: A Survey. IEEE Trans. Knowl. Data Eng. 2018, 30, 1852–1872. [Google Scholar] [CrossRef]
  12. Sun, J.; Tang, J. A survey of models and algorithms for social influence analysis. In Social Network Data Analytics; Springer: New York, NY, USA, 2011; pp. 177–214. [Google Scholar]
  13. Tejaswi, V.; Bindu, P.V.; Thilagam, P.S. Diffusion models and approaches for influence maximization in social networks. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 1345–1351. [Google Scholar]
  14. Zheng, Y. A Survey: Models, Techniques, and Applications of Influence Maximization Problem; Southern University of Science and Technology: Shenzhen, China, 2018. [Google Scholar]
  15. Kempe, D.; Kleinberg, J.; Tardos, E. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 137–146. [Google Scholar]
  16. Ko, Y.-Y.; Cho, K.-J.; Kim, S.-W. Efficient and effective influence maximization in social networks: A hybrid-approach. Inf. Sci. 2018, 465, 144–161. [Google Scholar] [CrossRef]
  17. Leskovec, J.; Krause, A.; Guestrin, C.; Faloutsos, C.; VanBriesen, J.; Glance, N. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, 12–15 August 2007; pp. 420–429. [Google Scholar]
  18. Taherinia, M.; Esmaeili, M.; Minaei Bidgoli, B. Optimizing CELF Algorithm for Influence Maximization Problem in Social Networks. J. Data Min. 2022, 10, 25–41. [Google Scholar]
  19. Goyal, A.; Lu, W.; Lakshmanan, L.V.S. Celf++ optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th International Conference Companion on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 47–48. [Google Scholar]
  20. Chen, W.; Wang, Y.; Yang, S. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 199–208. [Google Scholar]
  21. Cheng, S.; Shen, H.; Huang, J.; Zhang, G.; Cheng, X. Staticgreedy: Solving the scalability-accuracy dilemma in influence maximization. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013. [Google Scholar]
  22. Goyal, A.; Lu, W.; Lakshmanan, L.V.S. Simpath: An efficient algorithm for influence maximization under the linear threshold model. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, Vancouver, BC, Canada, 11–14 December 2011; IEEE: New York, NY, USA, 2011; pp. 211–220. [Google Scholar]
  23. Rahimkhani, K.; Aleahmad, A.; Rahgozar, M.; Moeini, A. A fast algorithm for finding most influential people based on the linear threshold model. Expert Syst. Appl. 2015, 42, 1353–1361. [Google Scholar] [CrossRef]
  24. Shang, J.; Zhou, S.; Li, X.; Liu, L.; Wu, H. COFIM: A community-based framework for influence maximization on large-scale networks. Knowl. Based Syst. 2017, 117, 88–100. [Google Scholar] [CrossRef]
  25. Bozorgi, A.; Haghighi, H.; Zahedi, M.S.; Rezvani, M. Incim: A community-based algorithm for influence maximization problem under the linear threshold model. Inf. Process. Manag. 2016, 52, 1188–1199. [Google Scholar] [CrossRef]
  26. Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [CrossRef]
  27. Ghayour-Baghbani, F.; Asadpour, M.; Faili, H. Mlpr: Efficient influence maximization in linear threshold propagation model using linear programming. Soc. Netw. Anal. Min. 2021, 11, 3. [Google Scholar] [CrossRef]
  28. Baghbani, F.G.; Asadpour, M.; Faili, H. Integer linear programming for influence maximization. Iran. J. Sci. Technol. Trans. Electr. Eng. 2019, 43, 627–634. [Google Scholar] [CrossRef]
  29. Keskin, M.E.; Güler, M.G. Influence maximization in social networks: An integer programming approach. Turk. J. Electr. Eng. Comput. 2018, 26, 3383–3396. [Google Scholar] [CrossRef]
  30. Wu, H.-H.; Küçükyavuz, S. A Two-stage Stochastic Programming Approach for Influence Maximization in Social Networks. Comput. Optim. Appl. 2018, 69, 563–595. [Google Scholar] [CrossRef]
  31. Ackerman, E.; Ben-Zwi, O.; Wolfovitz, G. Combinatorial model and bounds for target set selection. Theor. Comput. Sci. 2010, 411, 4017–4022. [Google Scholar] [CrossRef]
  32. Schaeffer, S.E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
  33. Vlasblom, J.; Wodak, S.J. Markov Clustering versus Affinity Propagation for the Partitioning of Protein Interaction Graphs. BMC Bioinform. 2009, 10, 99. [Google Scholar] [CrossRef]
  34. Van Dongen, S. Graph Clustering Via a Discrete Uncoupling Process. SIAM J. Matrix Anal. Appl. 2008, 30, 121–141. [Google Scholar] [CrossRef]
  35. Van Dongen, S. Graph Clustering by Flow Simulation. Ph.D. Thesis, Utrecht University, Utrecht, The Netherlands, 2000. [Google Scholar]
  36. Agra, A.; Requejo, C. The linking set problem: A polynomial special case of the multiple-choice knapsack problem. J. Math. Sci. 2009, 161, 919–929. [Google Scholar] [CrossRef]
  37. Fairbanks, J.; Besançon, M.; Simon, S.; Hoffiman, J.; Eubank, N.; Karpinski, S. Juliagraphs/Graphs.jl: An Optimized Graph Package for the Julia Programming Language. 2021. Available online: https://github.com/JuliaGraphs/Graphs.Jl (accessed on 6 February 2024).
  38. Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 6 February 2024).
  39. Moody, J. Peer Influence Groups: Identifying Dense Clusters in Large Networks. Soc. Netw. 2001, 23, 261–283. [Google Scholar] [CrossRef]
  40. Agra, A.; Cerdeira, J.O.; Requejo, C. A decomposition approach for the p-median problem on disconnected graphs. Comput. Oper. Res. 2017, 86, 79–85. [Google Scholar] [CrossRef]
Figure 1. Scheme of Algorithm 3.
Figure 2. Average number of clusters generated by the MCL algorithm.
Figure 3. Average runtime of cluster generation.
Figure 4. Comparison of the average execution time of the SimpleGreedy and ClusterGreedy algorithms.
Figure 5. Percentage of average runtime of ClusterGreedy compared with SimpleGreedy.
Figure 6. Comparison of the average objective function value given by the SimpleGreedy and ClusterGreedy algorithms.
Table 1. Notation used.

| Notation | Description |
| --- | --- |
| $G = (V, E)$ | Directed graph G with vertex set V and edge set E |
| n | Number of vertices in the graph G |
| m | Number of clusters |
| S | Seed set |
| $S_t$ | Set of activated nodes at step t of the SimpleGreedy algorithm |
| $S_j$ | Set of activated nodes in cluster j |
| $S_j^t$ | Set of activated nodes in cluster j at step t of the SimpleGreedy algorithm |
| $\lvert A \rvert$ | Cardinality of set A |
| k | Number of seeds to be selected ($k = \lvert S \rvert$) |
| r | Number of Monte Carlo simulations |
| $\sigma(S)$ | Total number of nodes activated by the set of nodes S |
| $N_{in}(v)$ | Set of vertices u of the directed graph $G = (V, E)$ such that the arc $(u, v) \in E$ |
Table 2. Dataset statistics.

| Statistic | Email-Eu-Core Network | Adolescent Health Network | Gnutella Peer-to-Peer Network |
| --- | --- | --- | --- |
| Nodes | 1005 | 2539 | 8114 |
| Edges | 25,571 | 12,969 | 26,013 |
| Average clustering coefficient | 0.3994 | 0.1419 | 0.0095 |
| Type | Directed | Directed | Directed |
Table 3. Computational results of the evaluation of ClusterGreedy on real datasets.

| Network | n | m | k | r | MCL runtime (s) | SimpleGreedy runtime (s) | ClusterGreedy runtime (s) | Linking Set obj. | SimpleGreedy obj. | ClusterGreedy obj. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Email-eu-core | 1005 | 181 | 10 | 100 | 0.30 | 60,630.30 | 2327.60 | 131.268 | 530.814 | 411.78 |
| Adolescent health | 2539 | 406 | 25 | 50 | 25.87 | 2511.18 | 1768.04 | 82.181 | 176.257 | 157.273 |
| Gnutella peer-to-peer | 8114 | 1381 | 50 | – | 59.87 | 482,067.82 | 288,061.54 | 244.29 | 763.558 | 771.724 |
| Average | | | | | 29 | 181,736 | 97,386 | 153 | 490 | 447 |