Abstract

The Internet has become an important carrier of information. Its data contain abundant information about hot events, user relations, attitudes, and so on. Many enterprises use high-impact Internet users to promote products, so it is important to understand the mechanism of information transmission. Mining social network data can help analyze the complex and changing relationships between users. The traditional approach analyzes information such as common interests and common friends, but such data cannot truly describe the degree of intimacy between users. What really connects different users on the Internet is the delivery of information. The algorithm proposed in this paper considers the dynamic characteristics of information transmission, finds maximum transmission paths from information transmission results, and finally calculates the intimacy degrees between users according to all the maximum information transmission paths within a certain period.

1. Introduction

Social network data contain a wealth of information about events, relationships, and attitudes. Once the data are understood and analyzed, a series of technologies, such as text mining, statistical theory, association analysis, and visualization, are adopted to realize emotional orientation analysis, information extraction, user influence analysis, and so on. Many current methods of computing user intimacy are designed for static networks. However, users might unfollow certain friends, and their interests might shift to new and different topics. In other words, the tie strengths between different users change over time. The algorithm proposed in this paper takes the dynamic nature of the data into account to improve information transmission analysis in social networks. Maximum transmission paths are then identified in the information transmission results, and the intimacy degrees between nodes are computed according to multiple groups of maximum transmission paths.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 proposes the concept of the information transmission matrix. Section 4 introduces the computational process of tie strength. Section 5 reports the experimental results. Section 6 concludes the paper.

2. Related Work

Many enterprises use influential users to promote new products, but the mechanism by which information spreads through the network still needs further study. Understanding this mechanism is important because it can be applied in many fields, such as viral marketing, social behavior prediction, social recommendation, and community detection. These problems attract researchers from different fields, such as epidemiology, computer science, and sociology, who have proposed different information diffusion models to describe and simulate the process of information transmission, such as the independent cascade model, the linear threshold model, and the epidemic model. These models are mainly applied to influence evaluation, influence maximization, and information source detection. Most models assume that information is transmitted from a source node set and that other nodes can only obtain information from the nodes that neighbor the source node set.

Social networking service providers, such as Twitter and Facebook, have grown rapidly in recent years, with an increasing number of users sharing information with their friends. Facebook has more than 2 billion monthly active users from all over the world, and about 500 million new tweets are posted on Twitter every day. Social network analysis can be divided into the following aspects [1, 2]: (a) studying the network structure and trends [3], (b) online learning of complex networks [4], (c) comparing different models, and (d) predicting node status [1, 5]. The focus of social influence study is to investigate neighbors and associations to predict the impact and influence of the occurrence of an action [2, 6].

Researchers have examined information transfers, including the analysis of relationships [7], social action tracking [1], and other types of relationship transfer [8]. The algorithm proposed in this paper constructs a matrix based on the information transmission between users to describe their complex correlations. By applying certain transformations to the matrix, information transmission paths can be identified and the tie strengths between nodes can be calculated. Because constructing the matrix is computationally inexpensive, the algorithm proposed in this paper performs more efficiently than other algorithms.

3. Information Transmission Matrix

A piece of information may be very valuable at one moment and worthless afterward. From the perspective of information transmission, the degree of interaction between users can be calculated. By analyzing information transmission paths, an information transmission tree can be generated to describe the information transmission rules and to analyze the dynamic changes of the correlations between users.

Definition 1. Let G be a graph with n nodes and e edges. If
$$a_{ij} = \begin{cases} 1, & \text{if node } N_i \text{ is associated with edge } e_j, \\ 0, & \text{otherwise,} \end{cases}$$
then the n × e matrix composed of the elements $a_{ij}$ (1 ≤ i ≤ n, 1 ≤ j ≤ e) is constructed. $M = (a_{ij})_{n \times e}$ is the complete incidence matrix of graph G, namely, the information transmission matrix.
In this paper, information transmission data is used to construct and update M. The construction process of M is given below.
Figure 1 depicts the information transmission relationships between nodes. An edge between two nodes means that information has been successfully passed between them; the absence of an edge means that no information has been passed. We construct matrix M according to Figure 1, which describes the mapping relationship between nodes and edges. If there is an association between $N_x$ and $e_y$, then $a_{xy} = 1$. Otherwise, $a_{xy} = 0$.
Because there is a large number of inactive nodes, most actions of the nodes on the Internet consist of browsing information, while actions such as commenting and forwarding are rare. Therefore, matrix M is a sparse matrix. To reduce the negative impact of the many meaningless zeros on subsequent calculations, M must be analyzed further to delete redundant nodes. In Section 3.1, we describe a quick and effective way to remove redundant nodes.
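The construction of M described above can be sketched as follows. This is a minimal illustration, assuming that the transmission records are available as a list of node pairs; the function names `build_incidence_matrix` and `sparsity` and the sample edge list are hypothetical and are not taken from Figure 1.

```python
def build_incidence_matrix(num_nodes, edges):
    """Return the n x e 0/1 incidence matrix M as a list of rows.

    edges is a list of (x, y) node-index pairs (0-based); column j of M
    has a 1 in rows x and y of edge j and 0 elsewhere.
    """
    matrix = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (x, y) in enumerate(edges):
        matrix[x][j] = 1
        matrix[y][j] = 1
    return matrix


def sparsity(matrix):
    """Fraction of zero entries in the matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / total if total else 0.0


# Hypothetical example: 8 nodes, 6 edges (not the graph of Figure 1).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (5, 6)]
M = build_incidence_matrix(8, EDGES)
```

Every column holds exactly two 1s, so the share of zeros grows with the number of nodes, which is the sparsity that motivates removing redundant nodes.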

3.1. Isolated Nodes

Definition 2. If the determinant of nth order matrix M is not zero, that is, |M| ≠ 0, then M is called a nonsingular matrix or full rank matrix. Otherwise, M is called a singular matrix or reduced-rank matrix.

Definition 3. Nodes in graph G are connected if and only if the rank of the complete incidence matrix is n − 1. A square submatrix whose order is min {p, q} is called a large submatrix of a p × q matrix.
By checking whether |M| is 0, we can judge whether the nodes in G are connected. A reduced matrix D can be obtained by deleting redundant nodes from M. D is a full rank matrix, that is, |D| ≠ 0, and is then the maximum complete incidence matrix. In other words, all the nodes in the new graph formed by D are reachable, and there are no isolated nodes for information transmission.
Take matrix M in Figure 2 as an example to illustrate the process of removing isolated nodes. The rank of M is obtained by calculating the maximum number of linearly independent rows (that is, the maximum order of a submatrix with a nonzero determinant). According to this calculation, R(M) = 6, which indicates the existence of isolated nodes in M. Rows N7 and N8 are all 0 in the reduced matrix, so N7 and N8 are redundant, isolated nodes. Because the original data in rows N6 and N7 are the same and N7 was determined to be an isolated node to be deleted, N6 is also an isolated node. In conclusion, N6, N7, and N8 are isolated nodes. After removing redundant nodes, it is necessary to determine whether there are redundant edges in the matrix. Because column e6 is all 0 after the redundant nodes are deleted, e6 is a redundant edge that needs to be deleted.
Matrix D is obtained after deleting the redundant nodes in M. Next, we must check whether the nodes in D are connected. The result is R(D) = 5, that is, |D| ≠ 0, so D is a full rank matrix. The conclusion is that all nodes in D are connected. In other words, there are no isolated nodes of information transmission.
To discover all information transmission paths in M, it is necessary to further examine the nodes that were tentatively considered redundant. The deleted redundant nodes are reconstituted into a new matrix M, and the abovementioned operations are repeated to obtain another matrix D. In the end, multiple matrices D are obtained.
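The repeated extraction of matrices D can be sketched as follows. Note that this sketch substitutes a union-find pass over the columns of M for the rank calculation; it is not the procedure described above, but it produces the same grouping of nodes under the assumption that every column of M contains exactly two 1s. The function names and the 8 × 6 sample matrix are hypothetical (the matrix has the same shape as the example but is not copied from Figure 2).

```python
def split_components(matrix):
    """Group the rows (nodes) of a 0/1 incidence matrix into connected parts.

    Returns a list of (node_indices, edge_indices) pairs, largest part first.
    Nodes with an all-zero row come back as parts with no edges.
    """
    n = len(matrix)
    num_edges = len(matrix[0]) if n else 0
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each column joins the nodes that carry a 1 in it.
    for j in range(num_edges):
        ends = [i for i in range(n) if matrix[i][j] == 1]
        for other in ends[1:]:
            parent[find(other)] = find(ends[0])

    parts = {}
    for i in range(n):
        parts.setdefault(find(i), ([], []))[0].append(i)
    for j in range(num_edges):
        ends = [i for i in range(n) if matrix[i][j] == 1]
        if ends:
            parts[find(ends[0])][1].append(j)
    return sorted(parts.values(), key=lambda p: len(p[0]), reverse=True)


def submatrix(matrix, rows, cols):
    """Extract the rows and columns that form one reduced matrix D."""
    return [[matrix[i][j] for j in cols] for i in rows]


# Hypothetical 8-node, 6-edge matrix (not the matrix of Figure 2).
M = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
components = split_components(M)
D = submatrix(M, *components[0])
```

In this sample, the first component plays the role of D (five nodes, five edges), while the remaining components correspond to the nodes that would be set aside and processed again.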

3.2. Information Transmission Path

To study the information transmission mechanism, it is necessary to identify all the information transmission paths from the information matrix. Therefore, the set of matrices D must be processed further.

Definition 4. Submatrix A is obtained by removing one row from the complete incidence matrix D. If a large submatrix of A is nonsingular, then the edges that correspond to its columns form a spanning tree of G.
Definition 4 provides a method for calculating all spanning trees in the connected graph G: remove one row from matrix D to obtain A, and then find all the large nonsingular submatrices of A. The edges that correspond to the columns of each nonsingular submatrix form a spanning tree of G.
The matrix D obtained in the previous section is taken as an example to illustrate the process of identifying information transmission paths according to Definition 4. Remove one row from D (row 5 here) to get a matrix A. Calculating the rank of A gives R(A) = 4, which indicates that the nodes in A are connected. Although all nodes are connected to other nodes, there may be redundant edges. For example, the nodes N1, N2, and N3 in Figure 1 have three edges, but two of these edges suffice to connect the three nodes. To remove redundant edges, we apply the following rules to the matrix:
Rule 1: the ith row of the matrix can be added to or subtracted from the jth row
Rule 2: repeat the operation of Rule 1 until there is no operable item
Rule 3: the row vectors of the matrix are not interchangeable
Rule 4: the column vectors of the matrix are not interchangeable
Through this transformation, the number of 1s in the matrix is reduced and the most concise matrix is finally obtained. At this point, the nodes in the matrix are connected by the minimum number of edges.
Take matrix M in Figure 2 as an example to illustrate the process of removing redundant edges. Matrix D is obtained by deleting isolated nodes in M, and matrix A with 4 rows and 5 columns is obtained after deleting one row from D. Matrix A is a full rank matrix. To delete the redundant information transmission path, it is necessary to delete one column from A, which yields several different column combinations: (e1, e2, e3, e4), (e1, e2, e3, e5), (e1, e2, e4, e5), (e1, e3, e4, e5), and (e2, e3, e4, e5). Then, row operations are performed on each of these matrices according to the rules.
The first two rows in Table 1 describe the cases in which the constructed matrix does not meet the judgment condition for generating a maximum information transmission path. In the first combination, edges (e1, e2, e3, e4) are selected. The entries associated with N5 in the matrix are all 0, so this path does not contain N5. That is, it is not a maximum information transmission path, so the combination (e1, e2, e3, e4) is deleted and the calculation is stopped. Similarly, in the second combination, the entries associated with N4 are all 0, so the result obtained from this structure does not include N4 and is not a maximum information transmission path. The ranks of the third, fourth, and fifth matrices are all 4, so they are full rank matrices that satisfy the condition for generating maximum information transmission paths. The fourth column in rows 3, 4, and 5 of Table 1 shows the row transformation process. The number of 1s in the transformed matrix is minimal, so the matrix does not have redundant edges. Column 5 shows the graph structure of the matrix obtained after eliminating the redundant edges. It can be seen from the graphs that the method proposed in this paper can be used to identify all maximum information transmission paths.
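The enumeration of column combinations can be sketched as follows. This is an illustration under stated assumptions, not the exact row-transformation procedure of Table 1: the rank is computed modulo 2 (with XOR as row addition), because for an unoriented 0/1 incidence matrix the spanning-tree criterion holds in that arithmetic, whereas over the real numbers an odd cycle such as a triangle would also give a nonsingular submatrix. The function names `rank_gf2` and `spanning_trees` and the sample matrix D (a triangle N1, N2, N3 with two pendant nodes attached to N3) are hypothetical.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank of a 0/1 matrix over GF(2), using XOR as row addition."""
    rows = [list(r) for r in rows]  # work on a copy
    rank = 0
    num_cols = len(rows[0]) if rows else 0
    for col in range(num_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def spanning_trees(d_matrix, drop_row=-1):
    """Yield tuples of column indices whose edges form a spanning tree.

    One row of D is dropped to obtain A; every choice of n - 1 columns of A
    that is nonsingular over GF(2) corresponds to a spanning tree.
    """
    rows = list(d_matrix)
    del rows[drop_row]
    size = len(rows)
    for cols in combinations(range(len(d_matrix[0])), size):
        square = [[row[j] for j in cols] for row in rows]
        if rank_gf2(square) == size:
            yield cols


# Hypothetical D: triangle N1-N2-N3 (e1, e2, e3) plus pendant edges e4, e5.
D = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
trees = list(spanning_trees(D))
```

For this sample, the combinations that keep all three triangle edges are rejected and the three combinations that drop one triangle edge are kept, which mirrors the pattern reported for the five column combinations above.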

4. Tie Strength between Nodes

According to the characteristics of information transmission, it is reasonable to assume that there is some association between the nodes in the same transmission path. Here, it is assumed that if information is transmitted frequently between two nodes, then the degree of intimacy between these two nodes is high. After a period of data accumulation, data about maximum information transmission paths is added to the correlation strength matrix (denoted as T). Because T is constructed according to information transmission flows, it keeps changing as the information transmission state changes. In matrix T, Ti,i represents the number of occurrences of node i in the process of information transmission and Ti,j represents the number of information transfers between nodes i and j.

The following is the formula for calculating the weight of node i:

The following is the formula for calculating the tie strength between nodes a and b:

According to formulas (5) and (6), the degree of intimacy between different users is calculated. The specific algorithm is shown in Algorithm 1.

INPUT: matrix M
OUTPUT: intimacy correlation graph
(1) generate matrix D according to matrix M
(2) construct matrix A from D
(3) FOREACH full rank matrix X IN matrix A
(4)   update matrix T according to X
(5) ENDFOREACH
(6) calculate weights and ties in T
(7) construct a graph describing the degree of intimacy between nodes
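Steps (3) to (5) of the algorithm can be sketched as follows. This is a minimal illustration that covers only the accumulation of T; the weight and tie formulas are not reproduced here. It assumes that each maximum transmission path is given as a list of node pairs, that transfers are counted symmetrically, and that a node's occurrence is counted once per path. These choices, like the function names and the sample paths, are assumptions of the sketch.

```python
def update_tie_matrix(t_matrix, tree_edges):
    """Add one maximum transmission path (a list of node pairs) to T.

    T[i][i] counts how often node i occurs in a path;
    T[i][j] (i != j) counts transfers between nodes i and j.
    """
    seen = set()
    for i, j in tree_edges:
        t_matrix[i][j] += 1
        t_matrix[j][i] += 1
        seen.update((i, j))
    for node in seen:
        t_matrix[node][node] += 1
    return t_matrix


def accumulate(num_nodes, paths):
    """Build T from scratch for a collection of paths."""
    t_matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for tree_edges in paths:
        update_tie_matrix(t_matrix, tree_edges)
    return t_matrix


# Hypothetical paths: three spanning trees of a triangle with two pendants.
PATHS = [
    [(0, 1), (0, 2), (2, 3), (2, 4)],
    [(0, 1), (1, 2), (2, 3), (2, 4)],
    [(0, 2), (1, 2), (2, 3), (2, 4)],
]
T = accumulate(5, PATHS)
```

Because T is only incremented, new paths observed in a later period can be added with further calls to `update_tie_matrix`, which is one way to keep T in step with the changing transmission state.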

5. Experiments

Five datasets are used in this paper. For detailed information about the datasets, please refer to our earlier paper [2].
(1) Coauthor (https://www.aminer.cn/data): a dynamic coauthor network from ArnetMiner (http://www.aminer.cn/). We collected publications published from 2010 to 2016 by 100,000 authors.
(2) DBLP (http://www.vldb.org/dblp/): the dataset is derived from a snapshot of the bibliography for 10 years, where each vertex represents a scientist and two vertices are connected if they work together on an article.
(3) Twitter (https://twitter.com): we crawled the following links between 19,000,00 users from Twitter at 10 different timestamps from October to December 2017.
(4) Weibo (http://code.google.com/p/weibo4j/): the most popular Chinese microblogging site. The data were crawled from March 8, 2014, when the crash of MH370 happened, to April 8, 2014.
(5) Dolphin's Associations (http://www-personal.umich.edu/~mejn/netdata/): this dataset is an undirected social network of frequent associations between 62 dolphins; it has 62 nodes and 159 edges.

Three baseline approaches are chosen for the experiments:
(1) PTPMF [9]: this method uses neighborhood overlap to approximate tie strength and extends the popular Bayesian Personalized Ranking (BPR) model to incorporate the distinction between strong and weak ties.
(2) TrustMF [10]: this is a model-based method that adopts a matrix factorization technique to map users into low-dimensional latent feature spaces in terms of their trust relationship. It aims to reflect more accurately the users' reciprocal influence on the formation of their own opinions and to learn better preferential patterns of users for high-quality recommendations.
(3) SBPR: this method presents a generic optimization criterion, BPR-Opt, for personalized ranking, that is, the maximum posterior estimator derived from a Bayesian analysis of the problem.

Figure 3 shows the information transmission graph without data processing. It contains 38,501 nodes and 20,354 edges. If all nodes in the Coauthor dataset were displayed in Figure 3, the picture would be black and the structure would not be visible. Therefore, only some of the nodes in the Coauthor dataset are shown in this figure. As can be seen in Figure 3, the raw network data are very difficult to process directly.

In the Coauthor dataset, the lengths of most information transmission paths are 2 or 3. Figure 4 shows the path with the maximum length in the Coauthor dataset.

By constructing a matrix according to the structure in Figure 4 and executing the algorithm proposed in this paper on this matrix, several groups of the largest, nonsegmented information transmission paths can be found, as shown in Figure 5. As can be seen from Figure 5, all the paths are loop free and achieve the maximum coverage of all nodes. Therefore, Figure 5 verifies the accuracy of the algorithm from the perspective of visualization.

Figure 6 depicts the degree of all nodes in the maximum information propagation path. It is found that the degree of most nodes is 1, the degree of a few nodes is greater than or equal to 2, and the highest degree value is 13. Figure 6 illustrates that the algorithm achieves the maximum removal of redundant edges.

The tie coefficients between different nodes are calculated according to information transmission paths. Figure 7 shows the tie coefficients of the nodes. In this figure, the darkness of an edge represents the correlation strength between the node and the ego node: the darker the color, the stronger the correlation, and vice versa. The number on an edge represents the tie strength between the two connected nodes, which is the final result obtained by fusing multiple sets of maximum information transmission paths.

To analyze the experimental results, we use the following measures [10]: precision, P = tp/(tp + fp); recall, R = tp/(tp + fn); and F1-score, F = 2 × P × R/(P + R). Here, tp is the number of correctly identified related examples, tn is the number of correctly identified nonrelated examples, fn is the number of related examples that were not identified, and fp is the number of nonrelated examples that were incorrectly identified as related. Table 2 compares the performance of SBPR, TrustMF, PTPMF, and TieCP on the different datasets. According to Table 2, TieCP has the most stable performance and the best F1-score.
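The three measures can be computed as in the following sketch. The function name `precision_recall_f1` is hypothetical, returning 0.0 for a zero denominator is a convention chosen for this sketch, and the counts in the usage line are made up for illustration; they are not results from Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F) with P = tp/(tp+fp), R = tp/(tp+fn), F = 2PR/(P+R).

    Zero denominators yield 0.0 (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts for illustration only; they are not results from Table 2.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
```

Note that tn does not enter any of the three formulas, so the comparison is unaffected by the number of correctly rejected nonrelated examples.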

6. Conclusion

The algorithm proposed in this paper calculates the intimacy degrees between users according to the information transmission matrix. Compared with some mainstream methods, our method is simple and able to identify all the maximum information transmission paths. In addition, our algorithm is relatively more stable when dealing with different kinds of data. Because constructing the matrix is computationally inexpensive, the algorithm proposed in this paper performs more efficiently than other algorithms.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Youth Program of the National Social Science Fund of China (Project name: Research on Online Behavior Pattern of Customers and Multidimensional Customer Insight Method under Big Data; Grant no. 19CGL024).