Abstract

Given a target area and a location-aware social network, the location-aware influence maximization problem aims to find a set of seed users such that the information spread from these users reaches the most users within the target area. We show that the problem is NP-hard and present an approximate algorithm framework, TarIM-SF, which leverages a popular sampling method together with a spatial filtering model that works on arbitrary polygons. For large-scale networks, we further present a coarsening strategy to improve efficiency. We theoretically show that our approximate algorithm provides a guarantee on seed quality. An experimental study over three real-world social networks verifies the seed quality of our framework and shows that the coarsening-based algorithm provides superior efficiency.

1. Introduction

In recent years, social networks have become prevalent platforms for spreading product adoption, ideas, and news. Under this trend, the influence maximization (IM) problem has become popular; it aims to select users (referred to as seeds) that maximize the number of influenced users (referred to as the influence spread) in the network. Kempe et al. [1] proved that this problem is NP-hard and presented a (1 − 1/e)-approximate algorithm that greedily selects the seed user with the maximum marginal gain of influence spread. Motivated by their work, a vast number of studies then focused on improving the influence spread and the efficiency, such as the heuristic-based algorithm PMIA [2] and the sketch-based algorithm IMM [3].

However, many real-world applications, such as location-based word-of-mouth marketing, have recently introduced location-aware requirements into IM. In [4], Li et al. studied a location-aware IM (LAIM) query, which seeks users that maximize the expected influence within a query region. They assumed that the network and the locations of users are given beforehand, so that an index can be constructed offline, while the target region is submitted online as a query. However, such an assumption is not always satisfied, as the network and the IM request may arrive at the same time, which is exactly the scenario discussed in traditional IM works [2, 5, 6]. Besides, existing LAIM works can only answer queries over simple regions such as rectangles or circles. In practice, user locations are visualized and managed via maps, whose atomic regions are not necessarily rectangles or circles; instead, they usually appear as varied and complex polygons. Therefore, it is meaningful to find an efficient method that addresses the LAIM problem, targeting an arbitrary polygonal region, from scratch. Below we provide a running example to elaborate this point.

Example. A company wants to sell a new product in a city, whose boundary on a map is typically an irregular polygon. People in this city are the potential buyers. In order to propagate the product to the public, we need to find several individuals who are the most influential with respect to this city in the network and hope that, through their propagation, as many people in the city as possible learn about the product and then purchase it.

In this paper, we present a novel algorithm framework, TarIM-SF (Targeted Influence Maximization with Spatial Filtering), to deal with this problem. Given a location-aware social network and a target region, we first adopt a spatial filtering model (SFM) to identify the targeted users. Then we utilize the elegant sampling approach of the latest IM solutions [3] to find the seed nodes. To further improve efficiency, we coarsen the social network. In all, our contributions in this work are as follows: (i) we relax the target region in LAIM to arbitrary polygons, which is more practical in real applications; (ii) our model addresses LAIM from scratch, without any assumption of offline processing; (iii) to the best of our knowledge, we are the first to prove the hardness of LAIM theoretically in detail, and for large-scale and complex networks we propose a coarsening-based model that further improves efficiency with guaranteed seed quality.

Experiments on real-world datasets Gowalla, Tweets, and Weibo demonstrate that our framework could generate a seed set with theoretically guaranteed quality, which outperforms a series of baseline methods in terms of influence spread quality. Besides, the coarsening-based algorithm can provide superior efficiency.

The rest of the paper is organized as follows. Section 2 reviews related studies. Section 3 defines LAIM, proves its hardness, and presents some fundamental background. Afterwards, we discuss the proposed algorithm framework in Section 4. Section 5 shows the theoretical guarantee on seed quality. Section 6 reports the experimental results and some discussion. Section 7 concludes the paper.

2. Related Work

Kempe et al. [1] first formulated the influence maximization problem and proved that it is NP-hard in general but can be approximated within a factor of 1 − 1/e. They presented a greedy algorithm with a provable approximation guarantee. However, the greedy algorithm needs to perform Monte Carlo simulations [7] to obtain the approximation ratio, which incurs a large time overhead. To improve efficiency and effectiveness, a large body of subsequent work emerged, which can be divided into three types. Simulation-based methods accurately estimate influence by repeatedly simulating the diffusion process, with a theoretical guarantee. Leskovec et al. [8] proposed the CELF method with the lazy-forward heuristic, originally designed to optimize submodular functions in [9]; see also [10–14]. Heuristic-based methods avoid Monte Carlo simulation at the expense of solution quality. For example, Chen et al. [2] proposed using local directed acyclic graphs to approximate the influence regions of nodes, while [15] restricts the spread of influence to communities and [6] approximates the influence spread using linear systems. Sketch-based methods resolve the inefficiency of Monte Carlo simulations without losing accuracy guarantees. Borgs et al. [16] presented a nearly optimal-time algorithm for IM under the IC model. This method relies on reverse simulations of the diffusion process and builds sketches to estimate the influence function efficiently. In subsequent works, techniques for bounding the sketches' size were developed [3, 17–21], among which [3, 18, 19] are the representative ones exhibiting the highest efficiency among sketch-based methods. Moreover, Liu et al. [22] construct a community-level influence analysis model instead of focusing on individual influence, while [23] defines the outer influence of a community and aims to find the most influential communities, and [24] constructs an influence propagation model considering the temporal interactions between users in the social network.

Recently, additional demands on the IM problem have emerged, such as considering the interests of users [25, 26], geographical factors, or temporal factors. In particular, to meet location-aware requirements in IM, Li et al. [4] proposed a method for location-aware IM, which seeks users that maximize the expected influence spread within the query region. Wang et al. [27] considered the distance between two users and defined the distance-aware IM (DAIM) problem, proposing a priority-based algorithm with a provable approximation ratio. The authors in [28] also studied the DAIM problem, considering the distance between locations and users. Zhu et al. [29] proposed Gaussian-based and distance-based mobility models to derive the location-aware propagation probability in location-based social networks. Zhou et al. [30] take users' historical mobility behaviour into account and study the IM problem under an O2O model. Li et al. [31] aim to find seed users that maximize geographic spanning regions (MGSR) in the query region, while Li et al. [32] assume that users have location preferences and solve the IM problem for the targeted users. Furthermore, some works focus on the spatial-temporal IM problem [33, 34], which aims to find the best trajectories to attach an advertisement to, maximizing the number of influenced users. Besides location, the interests/topics of users have also been considered in IM: [35] proposed an algorithm that returns the top-k topics related to a user's query, and Su et al. [36] take both users' interests and their location preferences into account to find the targeted users and then seek seeds that maximize the influence over them.

3. Problem Statements and Preliminaries

3.1. Problem Definition

Definition 1 (LAIM). Given a location-aware social network G = (V, E) where each node u ∈ V is associated with a location (denoted as l_u), a budget k, and a target region Q, the location-aware influence maximization (LAIM) problem aims to find k seed nodes (denoted as S) from V, such that the influence spread from S reaches the largest number of nodes in Q.

We show the hardness of the LAIM problem under the Independent Cascade (IC) model, one of the most popular diffusion models [1]. Before that, we first define a problem called Subset Cover, which will be utilized in what follows.

Definition 2 (subset cover). Given an element set U, a subset U′ of U, and a collection V of subsets of U, we wish to know whether there exist k subsets in V whose union is equal to U′.

Theorem 3. The location-aware influence maximization problem is NP-hard under the IC model.

Proof. From [1], we know that influence maximization is NP-hard by reduction from Set Cover. We first show that the Subset Cover problem above is also NP-hard, by reduction from Set Cover, as follows.
Given a Set Cover instance with universe U and collection S of subsets, we take U′ = U, obtain a new element set U* by adding t fresh elements to U, and take V = S as the collection of subsets of U*. This construction can be completed in polynomial time. If A is a solution of the Set Cover instance, the corresponding subsets in V cover all elements of U′; conversely, a solution of the Subset Cover instance yields a solution of the Set Cover instance. Based on this, we construct a corresponding directed graph as in the proof of [1]: there is a node i for each subset S_i ∈ V, a node j for each element u_j ∈ U*, and a directed edge (i, j) with activation probability p_ij = 1 if u_j ∈ S_i, and p_ij = 0 otherwise. Let the target region Q contain exactly the nodes corresponding to U′; the t extra element-nodes lie outside Q, so the target region is a proper subregion of the network. The Subset Cover problem is then equivalent to deciding whether there is a size-k node set A in this graph with σ_Q(A) = |U′|, where σ_Q(A) denotes the influence spread of A within Q: if we find a set A with σ_Q(A) = |U′|, the Subset Cover problem is solved, and if all nodes corresponding to the sets in a Subset Cover solution are seeded, all nodes corresponding to U′ are activated. Hence LAIM is NP-hard.

3.2. Sampling Technique

Borgs et al. [16] introduced a sampling method called RIS (Reverse Influence Sampling), which first constructs a suitable number of sketches via reverse traversals from randomly chosen target nodes (referred to as RR sets, collected in R) and then selects the k users with the maximum coverage of R as the seed nodes. The process of constructing RR sets is as follows.

First, given an edge-weighted graph G = (V, E, p), where p(e) is the propagation probability of edge e between two user nodes, a random influence graph g of G is obtained by deleting every edge e in G with probability 1 − p(e). We then randomly choose one node v in V and construct the RR set for v over g.

Definition 4 (RR set). Let v be a node in V. An RR set for v is generated by first sampling a graph g from G and then taking the set of nodes in g that can reach v.

For instance, from the sampled graphs in Figure 1, we can read off one RR set R_i per sampled target node, obtaining the collection R = {R_1, R_2, R_3, ...}. Through the construction of R, we then seek the k nodes with the maximum coverage of R as seeds.
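To make the construction concrete, the following is a minimal Python sketch of RR-set generation under the IC model via reverse traversal. The toy graph, edge probabilities, and function names are illustrative only and are not part of the paper's (C++) implementation.

```python
import random
from collections import defaultdict

def build_reverse_adj(edges):
    """Map each node v to its in-neighbors u together with p(u, v)."""
    radj = defaultdict(list)
    for u, v, p in edges:
        radj[v].append((u, p))
    return radj

def random_rr_set(radj, target, rng):
    """Reverse traversal from `target`: each incoming edge survives
    independently with its propagation probability, as in one IC realization."""
    rr = {target}
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for u, p in radj[node]:
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

# Toy directed graph: edges (u, v, p).  Edges into node 2 are certain,
# the edge (4, 3) never fires, so 4 can never appear in an RR set for 2.
edges = [(1, 2, 1.0), (3, 2, 1.0), (4, 3, 0.0)]
radj = build_reverse_adj(edges)
rng = random.Random(7)
rr = random_rr_set(radj, 2, rng)
print(sorted(rr))  # prints [1, 2, 3]
```

By Lemma 7 below, a node appearing in many RR sets is likely to influence many of the sampled targets, which is why maximizing coverage of R yields good seeds.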

3.3. Coarsening Method

Definition 5 ([37] coarsened influence graph). Given a social network G = (V, E), let g be an influence graph of G and P = {C_1, C_2, ..., C_m} be a partition of V, where each induced subgraph G[C_i] is strongly connected (SC). Then a coarsened influence graph obtained from g is a vertex-weighted influence graph H = (W, F, q, w), where each block C_i ∈ P is contracted to a single vertex of W with weight w(C_i) = |C_i|, and F contains an edge between two blocks, with probability q derived from the original edge probabilities, whenever g contains an edge between them.

The mapping π : V → W is defined such that π(u) = C_i for every u ∈ C_i. For a vertex set S ⊆ V, we let π(S) = {π(u) : u ∈ S} and π⁻¹ denote the preimage mapping. We follow the same coarsening process as [37]. Given an influence graph G, we first construct r subgraphs g_1, ..., g_r, which are random graphs sampled from the realization distribution of G. In these subgraphs, we identify the vertex sets whose members lie in a common strongly connected component in all r subgraphs, so that we obtain a partition P of V and the corresponding H (see Figure 2).

3.4. Spatial Filtering Model

To figure out which nodes fall into region Q, the most intuitive way is to compare the location of each node with the boundary of Q, which is costly when Q is complex or |V| is very large. Here we adopt a more efficient method. It works by comparing the convex hull of the remaining nodes with Q and iteratively removing nodes that fall outside Q, finally ending with a group of nodes whose convex hull lies inside Q. In this manner, we avoid testing every node in V against the full boundary. Notably, for a point set T, we use the Graham scan [38] to compute its convex hull. Let T be the set of nodes' locations and let the boundary of Q be given as a point sequence. The process of finding the target nodes in Q is shown in Algorithm 1.

Input:
A point set T, a polygon Q;
Output:
The points of T within Q.
1: Initialize S ← ∅, T0 ← T;
2: repeat
3: Compute the convex hull H of T0 and compare H with Q;
4: if (the vertices of H are all outside Q) and (H does not intersect Q) then
5: return  S.
6: else if (the vertices of H are all inside Q) then
7: return  S ∪ T0.
8: end if
9: Move the points of T0 that are inside Q into S;
10: Remove from T0 the hull vertices that are outside Q;
11: until  T0 = ∅.
12: return  S.
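Algorithm 1 ultimately rests on testing whether a single point lies inside an arbitrary polygon. Below is a minimal sketch of this primitive (the classic ray-casting, even-odd rule) together with the final filtering step; the hull-based pruning of Algorithm 1 is omitted for brevity, and the triangle and user coordinates are illustrative.

```python
def point_in_polygon(pt, poly):
    """Even-odd rule: cast a rightward ray from pt and count edge crossings."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_targets(points, poly):
    """Keep only the user locations that fall inside the query polygon."""
    return [p for p in points if point_in_polygon(p, poly)]

# Illustrative query region: the triangle with vertices (0,0), (4,0), (0,4).
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
users = [(1.0, 1.0), (3.5, 3.5), (0.5, 2.0)]
print(filter_targets(users, triangle))  # prints [(1.0, 1.0), (0.5, 2.0)]
```

Each test is linear in the number of polygon edges, which is why fewer edges mean cheaper comparisons, as observed in Section 6.3.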

4. TarIM-SF Framework

Here we describe our TarIM-SF framework in detail. In the LAIM problem, the target users change from the whole network, as in the classic IM problem, to the users in a target region. As the construction of RR sets starts from the target users, it is reasonable to construct enough RR sets over the whole network for the users in the target region and then choose the node set that maximizes the coverage of the RR collection R. Based on this idea, we first identify the users in the target region before constructing RR sets, using the method proposed in Section 3.4. Moreover, to improve the efficiency of constructing RR sets, we coarsen the network using the method proposed in Section 3.3. More details are given in Algorithm 2.

Input:
A location-aware social network G, a query polygon Q, and a budget k;
Output:
A seed set S.
1: Initialize S ← ∅, a set of goal users V_Q ← ∅;
2: T ← the location points of V;
3: V_Q ← SFM(T, Q);
4: Coarsening: H ← the coarsened graph of G, according to Algorithm 1 in [37];
5: Sampling: R ← the RR sets built on H for V_Q, according to Algorithm 2 in [3];
6: for  i = 1 to k  do
7: seek a block node C* in H with the maximum marginal coverage of R;
8: choose a vertex u in C* randomly;
9: S ← S ∪ {u};
10: end for
11: return  S.

For instance, in Figure 3, given a location-aware social network G, a budget k = 1, and a query region Q (say, a triangle), we first use SFM to identify the goal users V_Q (line 3); then we coarsen the whole network and obtain the partition P = {C_1, C_2, ...} (line 4). Next, we sample the coarsened influence graph and construct enough RR sets for the goal users (line 5); in the coarsened influence graph, the probability of a block node being chosen as the root of an RR set is proportional to the number of goal users it contains. As a result, we obtain the RR sets R_1, R_2, R_3, and the block C_1 has the maximum coverage of R, appearing in R_1, R_2, and R_3. Back in the original network graph, we choose one user node randomly in C_1 as the seed, such as node 2 (lines 6-10).
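The seed-selection loop (lines 6-10) is a greedy maximum-coverage procedure over the RR sets. The sketch below illustrates just that step, with an illustrative tie-break added for determinism; in the coarsened setting, each selected "node" would be a block from which one original user is then drawn at random.

```python
def greedy_max_coverage(rr_sets, k):
    """Greedily pick k nodes that together cover the most RR sets."""
    uncovered = list(range(len(rr_sets)))
    seeds = []
    for _ in range(k):
        # Count, for each candidate node, how many uncovered RR sets contain it.
        counts = {}
        for i in uncovered:
            for v in rr_sets[i]:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break  # every RR set is already covered
        # Maximum marginal coverage; ties broken by smallest node id.
        best = max(counts, key=lambda v: (counts[v], -v))
        seeds.append(best)
        uncovered = [i for i in uncovered if best not in rr_sets[i]]
    return seeds

# Toy RR collection: node 2 covers three of the four sets, then node 3
# covers the remaining one.
rr_sets = [{1, 2}, {2, 3}, {3}, {2, 4}]
print(greedy_max_coverage(rr_sets, 2))  # prints [2, 3]
```

Because coverage of RR sets is a monotone submodular function, this greedy step inherits the (1 − 1/e) guarantee of Theorem 6.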

In our framework, the first step identifies the users in the target region Q; its complexity depends on the spatial distribution of the nodes. When all users' locations are uniformly distributed at random, the expected cost of identifying the users within the target region under our spatial filtering model is substantially lower than the O(n·b) cost of the naive test, where n is the number of nodes and b is the number of boundary points of Q. The second step coarsens the network, which requires time linear in r(n + m), where r denotes the number of random subgraphs sampled from G and m the number of edges. Afterwards, we use the algorithm in [3] to seek the solution on the coarsened influence network; the time of this step is spent on constructing RR sets for the target region, and its complexity is proportional to the total number of edges examined over all RR sets, where the number of RR sets is decided by the parameters ε and ℓ in [3] as well as the numbers of nodes and edges in the coarsened influence graph.

5. Effectiveness Study

We now conduct a theoretical study of the seed quality of our algorithm framework, both with and without coarsening.

5.1. Seed Quality without Coarsening

In [1], it has been proved that, under the Independent Cascade model, the influence function is submodular. Here we define σ_Q(S) as the target influence spread, i.e., the expected number of influenced nodes within the target region Q. It is easy to see that, for any sets S ⊆ T ⊆ V, σ_Q(S) ≤ σ_Q(T), and for any element v ∈ V \ T, σ_Q(S ∪ {v}) − σ_Q(S) ≥ σ_Q(T ∪ {v}) − σ_Q(T) also holds. Hence, the function σ_Q is nonnegative, monotone, and submodular.

Theorem 6 (see [1]). For a nonnegative, monotone submodular function f, let S_g be a size-k set obtained by selecting one element at a time, each time choosing the element with the maximum marginal gain of f. Let S* be the set that maximizes the value of f over all k-element sets; then f(S_g) ≥ (1 − 1/e)·f(S*); in other words, S_g guarantees a (1 − 1/e)-approximation.

So if we obtain a seed set S by such a greedy algorithm, S is a (1 − 1/e)-approximate solution for the location-aware influence maximization problem. We next show the performance guarantee when adopting the IMM method [3] as the greedy approach.

Lemma 7 (see [16]). For any seed set S and any vertex v, the probability that a diffusion process from S activates v equals the probability that S overlaps an RR set for v.

We generate a sizeable collection R of random RR sets for the nodes in the target region Q; for any seed set S, let F_R(S) be the fraction of RR sets in R covered by S. Then n_Q · F_R(S) is an unbiased estimator of σ_Q(S), where n_Q is the number of vertices in the target region Q. In TIM+ [17], it has been proved that the solution covering the maximum number of RR sets provides a (1 − 1/e − ε)-approximation with probability at least 1 − n^(−ℓ), provided the number of RR sets is at least λ/OPT, where OPT is the maximum expected influence of any size-k node set in G and λ is a function of n, k, ℓ, and ε. IMM instead seeks a tighter lower bound LB of OPT than TIM+. Next, based on the performance analysis of IMM, we describe the parameter settings and the performance guarantee of our framework.

Let R = {R_1, R_2, ..., R_θ} be the sequence of RR sets generated for the nodes in the target region Q. Let S be any size-k seed set in G and let x_i be a random variable that equals 0 if S ∩ R_i = ∅ and 1 otherwise; then, based on Lemma 7, we have

σ_Q(S) = n_Q · E[x_i].  (2)

Let S_k* be the size-k node set with the maximum expected influence, and let OPT = σ_Q(S_k*). From (2), we get that n_Q · F_R(S_k*) is an unbiased estimator of OPT. By Corollary 2 of [3] (with n replaced by n_Q) and Lemma 3 in [3], we have the following.

Lemma 8. Let δ1 ∈ (0, 1), ε1 > 0, and

θ1 = (2 + (2/3)ε1) · ln(1/δ1) · n_Q / (ε1² · OPT);  (3)

if θ ≥ θ1, then n_Q · F_R(S_k*) ≥ (1 − ε1) · OPT holds with at least 1 − δ1 probability.

Assume that our framework without coarsening returns a solution S_k°. If n_Q · F_R(S_k*) ≥ (1 − ε1) · OPT holds, then, according to the properties of the greedy approach,

n_Q · F_R(S_k°) ≥ (1 − 1/e) · n_Q · F_R(S_k*) ≥ (1 − 1/e)(1 − ε1) · OPT.  (4)

Lemma 9. Let δ2 ∈ (0, 1), ε2 = ε − (1 − 1/e)ε1, and

θ2 = (2 − 2/e) · n_Q · (ln C(n, k) + ln(1/δ2)) / (ε2² · OPT);  (5)

if (4) holds and θ ≥ θ2, then σ_Q(S_k°) ≥ (1 − 1/e − ε) · OPT holds with at least 1 − δ2 probability.

Based on Lemmas 8 and 9, we have the following.

Theorem 10. Given any ε > 0 and any δ ∈ (0, 1) with δ1 + δ2 ≤ δ, if θ ≥ max{θ1, θ2}, the result S_k° which IMM returns is a (1 − 1/e − ε)-approximate solution with at least 1 − δ probability.

For the parameters in Theorem 10, we set δ1 = δ2 = δ/2 with δ = n^(−ℓ). Under this setting, max{θ1, θ2} is minimized when ε1 is chosen to balance (3) and (5), and the minimum is achieved at θ = λ*/OPT, where

λ* = 2 n_Q · ((1 − 1/e) · α + β)² · ε^(−2),  (7)

with α = sqrt(ℓ ln n + ln 2) and β = sqrt((1 − 1/e)(ln C(n, k) + ℓ ln n + ln 2)). Hence, if we set θ ≥ λ*/OPT, the seed set that covers the maximum number of RR sets is a (1 − 1/e − ε)-approximation. However, as OPT is unknown in advance, we find a tight lower bound LB of OPT as in the IMM method and use θ = λ*/LB instead. In the sampling phase of IMM, Lemmas 6, 7, and 8 in [3] prove that θ ≥ λ*/OPT holds and that LB is close to OPT; based on the same arguments, with n changed to n_Q where appropriate, we can also prove that θ ≥ λ*/OPT still holds and that LB is a tight lower bound of OPT with high probability.

Theorem 11. With at least 1 − n^(−ℓ) probability, the sampling algorithm in our framework returns a collection R of RR sets with |R| ≥ λ*/OPT, where λ* is as defined in (7).

Combining Theorem 10 and Theorem 11, our algorithm framework without coarsening obtains a solution that is a (1 − 1/e − ε)-approximation with high probability.
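For intuition on the sample sizes involved, the sketch below evaluates a λ*-style bound of the form 2·n_Q·((1 − 1/e)α + β)²/ε², with α = sqrt(ℓ ln n + ln 2) and β = sqrt((1 − 1/e)(ln C(n, k) + ℓ ln n + ln 2)), following IMM [3] with n_Q in place of n in the leading factor. Both the reconstructed form and the parameter values here are illustrative, not the paper's exact constants.

```python
import math

def lambda_star(n, n_q, k, eps, ell):
    """IMM-style sample-size constant (reconstructed, illustrative form):
    lambda* = 2 * n_q * ((1 - 1/e) * alpha + beta)^2 / eps^2."""
    one_minus_inv_e = 1.0 - 1.0 / math.e
    # ln C(n, k) via log-gamma to avoid overflow for large n.
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    alpha = math.sqrt(ell * math.log(n) + math.log(2))
    beta = math.sqrt(one_minus_inv_e * (log_choose + ell * math.log(n) + math.log(2)))
    return 2.0 * n_q * (one_minus_inv_e * alpha + beta) ** 2 / eps ** 2

# Required number of RR sets is lambda* / OPT: halving eps from 0.5 to 0.1
# multiplies the bound by (0.5 / 0.1)^2 = 25, since eps only enters as 1/eps^2.
small = lambda_star(n=10_000, n_q=1_000, k=50, eps=0.5, ell=1)
large = lambda_star(n=10_000, n_q=1_000, k=50, eps=0.1, ell=1)
```

The 1/ε² dependence is the practical cost of a tighter approximation guarantee, which is the trade-off the coarsening in Section 5.2 is designed to offset.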

5.2. Seed Quality for Coarsening Method

In this part, we show that the resulting seed set achieves a ρ(1 − 1/e − ε)-approximation when the coarsening technique is adopted to improve efficiency, where ρ is the probability that every partition block is internally strongly connected in a random realization. Based on the study in [37], for the target region Q, we have

Inf_Q(S) = E_g[σ_{g,Q}(S)],  (8)

where σ_{g,Q}(S) is the number of vertices in Q that are reachable from S in the realization g, and Inf_Q(S) is the expected number of vertices in Q that S can activate in G, which equals σ_Q(S) in Section 5.1 and is submodular.

In the coarsened influence graph H, we similarly have

Inf_Q^H(π(S)) = E_h[σ_{h,Q}(π(S))],  (9)

which denotes the expected sum of weights of the nodes in Q that π(S) can activate in H. After coarsening the network, we define an intermediate influence graph I = (V, E, p_I), where p_I(e) = 1 if both endpoints of e lie in the same block C_i for some i, and p_I(e) = p(e) otherwise; Inf_Q^I(S) is then the expected number of users in the target region Q that S can activate in I.

Here, we relate H and G in terms of the influence function through I, which has the same structure as G.

Lemma 12. For any S ⊆ V and any Q, Inf_Q^H(π(S)) = Inf_Q^I(S).

Proof. For any u and v in V, "u can reach v through the edges in E with probabilities p_I" if and only if "π(u) can reach π(v) through the edges in F with probabilities q", since every subgraph G[C_i], whose internal edges have probability 1 in I, is strongly connected. Therefore, the two diffusion processes activate corresponding vertices with equal probability, and each activated block contributes weight |C_i|, i.e., one unit per original vertex. Thus Inf_Q^H(π(S)) = Inf_Q^I(S).

For and I, we also can find the relationship between them as follows.

Lemma 13. For any S ⊆ V and any Q, Inf_Q^I(S) ≥ Inf_Q(S).

Proof. G and I are influence graphs with the same structure, but p_I(e) ≥ p(e) for every edge e. So Inf_Q^I(S) ≥ Inf_Q(S) for any S.
For any subgraph G[C_i] of G, its strongly connected reliability, denoted as R_SC(C_i), is defined as in Equation (14) of [37] and indicates the probability that G[C_i] is strongly connected in a random realization.

Lemma 14. Inf_Q(S) ≥ ρ · Inf_Q^I(S), for any S ⊆ V, where ρ = Π_i R_SC(C_i).

Proof. For a random realization g of G, σ_{g,Q}(S) coincides with that of the corresponding realization of I whenever every subgraph G[C_i] is strongly connected in g, an event whose probability is Π_i R_SC(C_i) = ρ, since the blocks' internal edges are sampled independently. Thus Inf_Q(S) ≥ ρ · Inf_Q^I(S).

Theorem 15. For any S ⊆ V and any Q, ρ · Inf_Q^H(π(S)) ≤ Inf_Q(S) ≤ Inf_Q^H(π(S)).

Let S* and W* be the optimal size-k solutions for Inf_Q and Inf_Q^H, respectively. Based on Lemma 12 and Lemma 13, we can be sure that Inf_Q^H(W*) ≥ Inf_Q^H(π(S*)) ≥ Inf_Q(S*), and hence, for the solution W° returned by the greedy procedure on the coarsened graph,

Inf_Q^H(W°) ≥ (1 − 1/e − ε) · Inf_Q^H(W*) ≥ (1 − 1/e − ε) · Inf_Q(S*).

Then, applying Lemma 14 together with Lemma 12 to any seed set S with π(S) = W°, we have

Inf_Q(S) ≥ ρ · Inf_Q^I(S) = ρ · Inf_Q^H(W°) ≥ ρ(1 − 1/e − ε) · Inf_Q(S*).

Therefore, our coarsening-based algorithm achieves a ρ(1 − 1/e − ε)-approximate solution for Inf_Q, where ρ refers to Π_i R_SC(C_i).

6. Results and Discussion

In this section, we conduct experiments on several real-world datasets to test the performance of the proposed algorithm framework. All algorithms are implemented in C++ and run on an Ubuntu 16.10 machine with an Intel Core i5-6500 quad-core CPU (3.20 GHz) and 16 GB RAM.

In the following experiments, we use three location-aware social networks: Gowalla, Tweets, and Weibo. The statistics of the datasets are listed in Table 1 (n represents the number of vertices and m the number of edges). By default, we use a randomly selected region Q for all datasets, and the number of user nodes falling in Q is denoted as n_Q. We conduct our experiments under the WC model, which is widely used for information diffusion: the probability of an edge (u, v) is set to 1/d_in(v), where d_in(v) denotes the in-degree of user node v. In TarIM-SF, we set the sampling parameters following [3].
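The WC edge weights used above follow directly from the in-degrees; a minimal sketch, with an illustrative toy edge list and helper name:

```python
from collections import defaultdict

def wc_probabilities(edges):
    """Weighted-cascade model: p(u, v) = 1 / in-degree(v)."""
    indeg = defaultdict(int)
    for u, v in edges:
        indeg[v] += 1
    return {(u, v): 1.0 / indeg[v] for u, v in edges}

# Node 3 has three incoming edges, node 2 has one.
edges = [(1, 3), (2, 3), (4, 3), (1, 2)]
probs = wc_probabilities(edges)
print(probs[(1, 3)], probs[(1, 2)])
```

Under WC, the expected number of activated in-neighbors of any node is at most 1, which keeps the diffusion (and hence RR-set sizes) well behaved on skewed-degree networks.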

6.1. Comparison with Baseline

We evaluate the performance of our algorithm framework TarIM-SF against the method Assembly in [4] under the WC model. To estimate the performance in general, we selected three regions with a fixed n_Q for each dataset and reported the average performance, while varying k from 10 to 50 for Gowalla and from 100 to 500 for Tweets and Weibo. In Figure 4, we can see that the target influence spread of the seeds found by our framework is clearly superior to that of Assembly, especially on Tweets and Gowalla.

6.2. Effect of Coarsening

As mentioned above, we adopt the coarsening technique to improve efficiency on large-scale social networks. In this part, we ran a series of experiments on Weibo, a large-scale network with about a million users. To examine how the parameter r of the coarsening method affects the target influence spread of the seeds, we report the relative error, which measures the gap between the real influence spread of the obtained seed set and its estimated influence spread. We fix the budget k and set the target region Q to the whole network. Figure 5(a) shows that the relative error decreases as r becomes larger; when r is sufficiently large, the estimated influence spread of the seeds obtained with coarsening is nearly equal to the eventual influence spread of the seeds obtained without coarsening. Figures 5(b) and 5(c) indicate that the running time of the coarsening approach is significantly less than that of the noncoarsening one, without loss of seed quality.

6.3. Varying the Size and Shape of Q

We also conducted another group of experiments varying Q in terms of both size and shape. Specifically, we vary the shape of Q among triangle, tetragon, and pentagon. For each shape, we vary the size at several different levels and report the performance of our algorithm (shown in Figure 6). The results justify that our algorithm works on target regions of arbitrary polygonal shape. Besides, the time spent on the first phase of our framework decreases slightly as the shape of Q varies from pentagon to triangle at the same scale, because the nodes on the convex hull must be compared with the query polygon, and the fewer the polygon edges, the fewer the comparisons and the less the time.

7. Conclusions

In this study, we presented a novel model that addresses LAIM with a target region that can be an arbitrary polygon. Our framework uses a spatial filtering model to first identify the nodes falling into the target polygon. Afterwards, a coarsening process is conducted over the network. Then, the state-of-the-art sampling algorithm adopted in traditional IM is used to find the solution. We theoretically prove the influence spread guarantee in both the noncoarsened and coarsened cases. An empirical study over three real-world datasets demonstrates that our framework outperforms the baseline algorithm in terms of influence spread and is efficient on large-scale networks.

Data Availability

This study is based on the datasets Gowalla, Tweets, and Weibo provided in [4], and they are available at http://dbgroup.cs.tsinghua.edu.cn/ligl/laim/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (Grant No. 61672408), Fundamental Research Funds for the Central Universities (No. JB181505), Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JM6073), and China 111 Project (No. B16037).