Abstract

In the big data environment, the visualization technique has been increasingly adopted to mine the data on library and information (L&I), with the diversification of data sources and the growth of data volume. The previous research into the information association of L&I visualization network rarely tries to construct such a network or explore the information association of the network. To overcome these defects, this paper explores the visualization of L&I from the perspective of big data analysis and fusion. Firstly, the authors analyzed the topology of the L&I visualization network and calculated the metrics for the construction of L&I visualization topology map. Next, the importance of meta-paths of the L&I visualization network was calculated. Finally, a complex big data L&I visualization network was established, and the associations between information nodes were analyzed in detail. Experimental results verify the effectiveness of the proposed algorithm.

1. Introduction

In the big data environment, digital L&I resources grow and age at an increasing pace as data sources diversify and data volume expands. Meanwhile, user demand for discovery services is changing rapidly [1–6]. Visualization techniques have been increasingly adopted to mine L&I data [7–11]. Therefore, it is of practical significance to study L&I visualization from the perspective of big data analysis and fusion.

Currently, research on L&I visualization mainly focuses on visualization methods and visual analysis tools for information resources and information retrieval [11–15]. Mo et al. [16] analyzed the status quo of L&I visualization in terms of the annual number of published papers, authors, journals, and keywords and explained the use of the visualization software CiteSpace with an actual case. To mine potential knowledge and disclose the deep correlations in L&I, Rowley et al. [17] introduced co-occurrence analysis to visualize L&I and quantified the co-occurrence information in L&I information carriers. Borrego [18] summarized the current research on L&I data-driven knowledge discovery, clarified the core ideas of library knowledge discovery and L&I visualization service, and detailed the innovative data environment, driving mechanism, and pattern application that are necessary for digital library knowledge discovery and L&I visualization service. Johnson [19] reviewed the results of cross-disciplinary research on L&I in China, visualized the relevant research results with automated valuation models (AVMs), and discussed the differences between various methods, namely, cluster analysis, the survey method, and strength, weakness, opportunity, and threat (SWOT) analysis.

Information silos and the overloading of book data hinder users from acquiring and sharing L&I [20, 21]. L&I information is commonly organized in two modes: information fusion and information aggregation [22, 23]. Paramonova [24] enhanced the semantics of L&I information through information classification, ontology analysis, and data association methods and relied on the semantics for knowledge organization and visual metering. In this way, knowledge organization was integrated with econometric analysis to mine the resource associations through the analysis of co-occurrence relationships and coupling relationships, as well as social network analysis.

The above analysis of domestic and foreign research reveals several defects in the research on the information association of L&I visualization networks: the lack of optimization of network layout algorithms, and the absence of L&I information fusion and comparison across multiple disciplines in the big data environment [25–27]. To overcome these defects, this paper explores the visualization of L&I from the perspective of big data analysis and fusion. Section 2 analyzes the topology of the L&I visualization network and calculates the metrics for the construction of the L&I visualization topology map, including node degree distribution, clustering coefficient, and mean path length. Section 3 calculates the importance of meta-paths of the L&I visualization network. Section 4 establishes a complex big data L&I visualization network and analyzes the associations between information nodes. Section 5 reports the visualization results of the proposed algorithm obtained through experiments, which confirm the effectiveness of the algorithm.

2. Topology Analysis

L&I information is a dataset composed of books, literature information, and archival information on the library management platform. Let be the information of a book or document in the L&I group information and let ASm be the association between multiple books or documents. Then, the set QX of L&I group information can be expressed as

Figure 1 visualizes the L&I network. On the left side is the visualized information of different disciplines and books, as well as L&I nodes. On the right side are the visualized codes of the visualized information on the left. In Figure 1, the L&I nodes are coded in multiple dimensions by their respective size and color, according to the type of book resources. From the structural features of the visualized network, the following will calculate the metrics of the topology of L&I visualization network, including node degree distribution, clustering coefficient, and mean path length.

Let l be a node with the degree of l in the L&I visualization network; let M be the total number of nodes. Then, the node degree distribution ND (l) can be described by

If the constructed network is a scale-free network, the degree distribution of nodes can be obtained by the power law distribution features:

The mean clustering coefficient that measures the closeness between multiple book or document nodes can be expressed as

Let Nmn (l) be the mean number of connections between a node with the degree of l and its adjacent nodes. Then, the local clustering coefficient in formula (4) can be calculated by

Network closeness NC, which represents the associations between book or document nodes in the L&I visualization network, can be calculated by

Let be the number of the shortest paths between nodes r and o; let be the number of the shortest paths between nodes r and o, which pass through node q. Then, the betweenness centrality BE, which reflects the influence of a node in the L&I visualization network, can be calculated by

Formula (7) shows that the greater the BE is, the more likely the corresponding node will be visualized in the information network.
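As a hedged illustration of the betweenness centrality described above, the following minimal sketch counts, for every node q, the share of shortest paths between each pair (r, o) that pass through q. It uses the well-known Brandes accumulation scheme on an undirected, unweighted graph and returns unnormalized scores. The adjacency-dictionary input format and the function name are assumptions made for this example, and any normalization used in formula (7) is not reproduced here.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness for an undirected, unweighted graph.

    adj maps each node to an iterable of neighbours (hypothetical input format).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: b / 2.0 for v, b in score.items()}


# Toy chain of three documents: every shortest path between the ends passes through "b".
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
chain_scores = betweenness(chain)
```

In the toy chain, only the middle node lies on a shortest path between other nodes, so it is the only node with a nonzero score.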

In essence, the information query in the L&I visualization network is to find the L&I nodes with similar semantic relationships as the input node in the query. This paper abstracts the multiple paths between two nodes into a meta-path and describes the semantic association between nodes in an advanced level. Figure 2 provides the examples of the paths and meta-path between nodes.

For the L&I visualization network, the minimum mean path length can be denoted as KAV, and the maximum clustering coefficient can be denoted as COmax. Suppose all the nodes in the network are connected by meta-paths; then the network is fully coupled; i.e., the network state satisfies KAV = 1 and COmax = 1. The length of meta-paths changes regularly with the distances between nodes. In the fully coupled state, the L&I visualization network has a fixed number of nodes and a fixed number of meta-paths. M nodes need to be connected by meta-paths.
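The fully coupled case can be checked numerically. The sketch below computes the three topology metrics named in this section (node degree distribution, mean clustering coefficient, and mean path length) using their standard graph-theoretic definitions, which are assumed here to match the paper's formulas. The function names and the adjacency-dictionary format are hypothetical, and a node with fewer than two neighbours is given a clustering coefficient of 0 by convention.

```python
from collections import Counter, deque
from itertools import combinations


def degree_distribution(adj):
    """Fraction of nodes having each degree."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    total = len(adj)
    return {deg: n / total for deg, n in sorted(counts.items())}


def local_clustering(adj, node):
    """Share of a node's neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    if len(nbrs) < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbours
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))


def mean_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def mean_path_length(adj):
    """Average BFS distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def complete_graph(m):
    """Fully coupled network on m nodes: every node is linked to every other."""
    return {i: {j for j in range(m) if j != i} for i in range(m)}


full = complete_graph(5)
k_av = mean_path_length(full)   # mean path length of the fully coupled network
co_max = mean_clustering(full)  # mean clustering coefficient of the same network
```

For the five-node fully coupled network, both `k_av` and `co_max` evaluate to 1, in line with the state KAV = 1 and COmax = 1 described above.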

In the network, a node r is connected to its adjacent nodes and the nodes within the distance of l/2 on the left and right. Let l be the number of nearest neighbors of the node. Then, the clustering coefficient of node r can be calculated by

When the number of nodes is infinitely large, the mean path length of the L&I visualization network can be calculated by

Suppose the L&I visualization network is a star network, in which any node is only connected to one node. Then, the clustering coefficient of the M-node network can be calculated by

The mean path length of the network can be calculated by

If the number and distribution of meta-paths are random, then the L&I visualization network is a stochastic network. As long as the mean node degree l remains unchanged, the presence or absence of each meta-path in the random network is uncertain. Then, the node degree distribution of the meta-paths in the network can be described by the Poisson distribution as follows:
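For reference, the standard Poisson probability mass function can be evaluated as in the sketch below. The parameter name `mean_degree` stands for the mean node degree of the random network; the function names are assumptions made for this example, and the paper's own notation for the formula is not reproduced.

```python
import math


def poisson_degree_pmf(k, mean_degree):
    """Probability that a node has degree k under a Poisson degree distribution."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


def poisson_table(mean_degree, k_max):
    """Probabilities for degrees 0..k_max."""
    return [poisson_degree_pmf(k, mean_degree) for k in range(k_max + 1)]


# Degree probabilities for a random network with mean degree 3.
table = poisson_table(3.0, 60)
```

The probabilities peak near the mean degree and decay quickly on both sides, which is the qualitative contrast with the heavy tail of a scale-free network.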

3. Calculation of Meta-Path Importance

Without considering the limitations on L&I contents, the first step to find the optimal meta-path that depicts node relationships is to determine the meta-path length, path number, and path rarity. The optimal meta-path refers to the meta-path with the greatest importance. Let be the set of meta-paths between nodes r and o, the importance support function of meta-path LU, the rarity of LU, and the length attenuation function of LU. Then, the importance of meta-path LU connecting the given node pair <r, o> can be calculated by

Traditionally, the word frequency of L&I contents is calculated by counting the term frequency-inverse document frequency (TF-IDF) value. However, the TF-IDF-based method performs poorly on relatively short L&I texts, which contain few words and lack context information after word segmentation. In this paper, short L&I texts are first segmented into words, and high-quality Chinese word vectors are generated by the directional skip-gram (DSG) model. Let ST be the input short L&I text, the word sequence corresponding to the short text, L the length of the word sequence, the i-th word, and Uwi the corresponding word vector. From the word vectors, the sentence vector UST can be derived as

Compared with short L&I texts, the abstracts of books and documents offer clear contextual relationships and many word vectors. To acquire the weighted mean sentence vector, the weight coefficients of words can be determined through statistical method. Let Φ be the abstract of a book or a document, the word sequence segmented from the abstract, and be the weight sequence of the segmented words. Then, a word vector can be constructed for each word segmented from the abstract. Then, the weighted mean sentence vector can be calculated by

The similarity between the sentence vectors of two texts, namely, the input short text and the abstract of a book or document, can be measured by cosine similarity:
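The sentence-vector steps described above can be sketched as follows, under stated assumptions: the short-text vector is taken as the plain mean of its word vectors, the abstract vector as a weighted mean normalized by the sum of the weights, and the two are compared by cosine similarity. The word vectors would come from a trained model such as DSG; here they are tiny made-up lists, and all function names are hypothetical.

```python
import math


def mean_vector(word_vectors):
    """Unweighted mean of equally sized word vectors."""
    n = len(word_vectors)
    return [sum(col) / n for col in zip(*word_vectors)]


def weighted_mean_vector(word_vectors, weights):
    """Weighted mean of word vectors; weights are normalised by their sum (assumption)."""
    total = sum(weights)
    dim = len(word_vectors[0])
    return [
        sum(w * vec[d] for vec, w in zip(word_vectors, weights)) / total
        for d in range(dim)
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up two-dimensional word vectors, for illustration only.
short_text_vec = mean_vector([[1.0, 0.0], [0.0, 1.0]])
abstract_vec = weighted_mean_vector([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0])
similarity = cosine_similarity(short_text_vec, abstract_vec)
```

Normalizing by the sum of the weights keeps the abstract vector on the same scale as the unweighted short-text vector; cosine similarity is insensitive to that scale in any case.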

Based on the similarity between the short text and the abstract, the importance of meta-paths , , and can be measured. Let α be the attenuation coefficient. Then, a length penalty function can be configured to effectively lower the importance of meta-path LU when it is relatively long. The strength of the length attenuation can be adjusted by changing the value of α.
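The exact form of the length penalty function is not recoverable from the text here. As one possible instance, the sketch below assumes an exponential decay, α raised to the path length with 0 < α < 1, which is strictly decreasing in the length and becomes steeper as α gets smaller. Both the functional form and the name `length_penalty` are assumptions, not the paper's definition.

```python
def length_penalty(path_length, alpha=0.5):
    """Assumed exponential length attenuation: alpha ** path_length, with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if path_length < 0:
        raise ValueError("path_length must be non-negative")
    return alpha ** path_length


# Penalties for meta-paths of length 1 to 4 under two attenuation coefficients.
mild = [length_penalty(n, alpha=0.9) for n in range(1, 5)]
steep = [length_penalty(n, alpha=0.5) for n in range(1, 5)]
```

With this choice, a longer meta-path always receives a smaller factor, and lowering α penalizes long paths more heavily.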

The rarity function evaluates how rare the meta-path LU is in the set of node pairs similar to node pair <r, o>. It can be defined as

where set can be expressed as

Set can be described as

Since o does not belong to set , then r does not belong to set . Drawing the idea of IDF calculation, the rarity function can be described as

To rank the meta-paths by importance, the meta-path importance function must be monotonically decreasing and have a maximum. Since the length penalty function is a strict monotonically decreasing function, formula (20) must satisfy

Formula (21) shows that is monotonically increasing, with a minimum of 0. According to formula (13), for to be monotonically decreasing and have a maximum, must have the same properties. To ensure that have these properties, this paper adopts the minsize of path that can represent the minsize of each type of nodes in LU. The meta-path of a given node pair <r, o> can be described by

Let ICi be the number of instances for the i-th node on the meta-path. Then, the minsize of LU can be calculated by
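A minimal sketch of the minsize, assuming it is simply the smallest instance count ICi among the node types on the meta-path (the name suggests this, but the formula itself is not available here). The helper `running_minsize` is hypothetical and shows how the value evolves as the meta-path is extended one node type at a time.

```python
from itertools import accumulate


def minsize(instance_counts):
    """Smallest instance count among the node types on a meta-path (assumed definition)."""
    return min(instance_counts)


def running_minsize(instance_counts):
    """Minsize of every prefix of the meta-path, in order of extension."""
    return list(accumulate(instance_counts, min))


# Made-up instance counts for four node types along one meta-path.
counts = [120, 45, 300, 12]
prefix_minsizes = running_minsize(counts)
```

Because extending a meta-path can only add another candidate for the minimum, the prefix values never increase, which is the monotonic behaviour the importance ranking relies on.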

This is due to the existence of the following inequality:

The minsize MU of LU is monotonically decreasing. In the L&I visualization network, the relationships do not take the same form. Instead, imbalanced relationships might exist, such as one-to-many or one-to-one relationships. To solve the problem, this paper resorts to the calculation of enhanced minsize, which is obtained by applying an intensity factor to the minsize. The intensity factor can be calculated by

Let D be the nodes with minsize obtained by formula (23); let UZ (D) and RZ (D) be the out-degree and in-degree of D, respectively. If node D is a book or document, UZ (D) can be calculated by

Formula (26) shows that UZ (D) is the sum of similarities between the short L&I text and each node vector in instance set D. Through the above analysis, and STR (LU) can be combined to get the support function of meta-path importance:

Meta-paths are extensible. The importance of an extended meta-path equals the product between the current and the intensity coefficient STR (Ei) of the extended edge Ei.

In most of the existing studies, it is assumed that the meta-paths are provided by experts in the relevant fields. If the network is sufficiently large, it is impossible to know the type of all nodes or edges. The only solution to meta-path generation is to fix the length of the meta-path for each dataset.

According to the example node pair provided by the users, this paper automatically generates the meta-path that best explains the node pair. That is, all the possible meta-paths and the possible subsets of meta-paths were enumerated for the given node pair, and the meta-path that gives the example node pair the highest similarity was selected. Then, the forward hierarchical path generation algorithm was improved to adapt to the query task of this paper, thereby generating reliable meta-paths. The meta-paths were then sorted by importance.
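The enumeration step can be illustrated on a toy typed graph. The sketch below lists all simple paths between an example node pair up to a fixed number of edges and groups them by their sequence of node types, so that each group corresponds to one meta-path with its path instances. Ranking by instance count is only a stand-in for the importance function of this section, and the graph, type labels, and function names are all made up for the example.

```python
from collections import defaultdict


def enumerate_meta_paths(adj, node_type, source, target, max_edges):
    """Group simple paths from source to target by their node-type sequence."""
    grouped = defaultdict(list)

    def walk(path):
        node = path[-1]
        if node == target:
            meta = tuple(node_type[n] for n in path)
            grouped[meta].append(tuple(path))
            return
        if len(path) - 1 == max_edges:
            return  # edge budget used up without reaching the target
        for nxt in adj[node]:
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                walk(path)
                path.pop()

    walk([source])
    return dict(grouped)


def rank_by_instances(grouped):
    """Order meta-paths by number of instances, then by shorter length (stand-in ranking)."""
    return sorted(grouped.items(), key=lambda kv: (-len(kv[1]), len(kv[0])))


# Toy network: two books sharing one author and two subjects.
adj = {
    "b1": ["a1", "s1", "s2"],
    "b2": ["a1", "s1", "s2"],
    "a1": ["b1", "b2"],
    "s1": ["b1", "b2"],
    "s2": ["b1", "b2"],
}
node_type = {"b1": "book", "b2": "book", "a1": "author", "s1": "subject", "s2": "subject"}

grouped = enumerate_meta_paths(adj, node_type, "b1", "b2", max_edges=2)
ranked = rank_by_instances(grouped)
```

In this toy network the two books are linked by one book-author-book path and two book-subject-book paths, so the subject meta-path ranks first under the stand-in criterion.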

4. Network Construction and Association Analysis

Existing studies have raised different opinions and drawn various conclusions about whether the visualization network nodes are associated with each other, but they did not measure the intensity of the associations. To fill this gap, this paper introduces edge weights into the visualization network model and extracts the intensity of the interaction between associated nodes in the network.

4.1. Network Construction

Let be the probability for an edge to link up two nodes at the same time. The probability indicates how many of the edges to node o also link up node r. The conditional probability for the connection from node r to node o can be defined as

where is the number of other nodes connected to both nodes r and o, and LIo is the number of other nodes connected to node o. Then, the following inequality holds:

Some edges could link up more than 2 nodes. Hence, the correlations vary from node to node:

Let be the number of edges from node r to node o; let LIo be the change in the number of edges to node o. If many values with the association of 1 are obtained, and if the sample size is small, it is improper to judge that the two nodes have a close linear relationship, solely based on the correlation coefficient. Thus, it is necessary to remove the repetitive values during network construction and parameter screening. The similarity between two nodes in the L&I visualization network can be calculated by

The calculation of network instances shows that the correlation between most information nodes equals 1. When there are only a few nodes, the linear relationship between variables LIr and LIo cannot be judged by correlation coefficient alone. Table 1 shows the calculation results of scale-free network.
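The conditional connection probability defined in Section 4.1 can be sketched as the share of node o's other neighbours that are also neighbours of node r. This reading of the definition (common neighbours divided by the neighbours of o, excluding r and o themselves) is an assumption, as are the function name and the toy graph.

```python
def conditional_link_probability(adj, r, o):
    """Share of o's other neighbours that are also neighbours of r (assumed reading)."""
    nbrs_o = set(adj[o]) - {r}
    if not nbrs_o:
        return 0.0
    common = (set(adj[r]) - {o}) & nbrs_o
    return len(common) / len(nbrs_o)


# Toy graph: r and o share neighbours a and b; o also has c and d.
graph = {
    "r": {"a", "b", "o"},
    "o": {"a", "b", "c", "d", "r"},
    "a": {"r", "o"},
    "b": {"r", "o"},
    "c": {"o"},
    "d": {"o"},
}
p_r_given_o = conditional_link_probability(graph, "r", "o")
p_o_given_r = conditional_link_probability(graph, "o", "r")
```

The measure is asymmetric: here half of o's other neighbours are shared with r, while all of r's other neighbours are shared with o, and both values stay between 0 and 1.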

Figure 3 presents the calculated node degree distribution of the established scale-free L&I visualization network. Only 3.6% of all network nodes had a correlation greater than 0.8, which satisfies the construction principle of complex network and the theoretical power law distribution of node degree.

4.2. Association Analysis

To further measure the association between L&I nodes, this paper analyzes the similarity of individual books and documents and that of a group of books and documents in L&I visualization network. The most important aspect of the analysis is the modeling of network nodes. The L&I nodes were analyzed and modeled from three aspects: the basic information BA, the text information TE, and the relationship information RE.

Let IM (r, o) be the similarity between two nodes r and o. For node r, a model can be established as . For node o, another model can be established as . Based on the data of BA, TE, and RE, the similarity between nodes could be calculated. Firstly, the similarity of nodes was calculated in terms of the three attributes. The similarity between the basic information BA (r) and BA (o) of nodes r and o can be described by

The similarity between the text information TE (r) and TE (o) of nodes r and o can be described by

The similarity between the relationship information RE (r) and RE (o) of nodes r and o can be described by

The basic information of network nodes mainly includes the publisher, publication date, and number of pages. The basic information of nodes r and o can be expressed as , and , respectively. Let be the weight of the similarity for each attribute of the basic information. Then, the similarity between the basic information can be calculated by

where . This paper uses the text eigenvector to characterize the text information of books and documents. The similarity between the text information can be calculated by

The relationship information between nodes can be divided into primary and secondary levels. The relationship information of nodes r and o can be expressed as and , respectively. The similarity analysis of relationship information is equivalent to the similarity analysis of the primary and secondary relationship information. Let and be the weights of the similarity between primary relationship information and secondary relationship information, respectively. The similarity between the relationship information can be calculated by

where . The similarity between the primary relationship information can be calculated by cosine similarity:

The similarity between the secondary relationship information can be calculated by cosine similarity:

Then, the weights were assigned to the three kinds of information BA, TE, and RE. The relationship between the three weights can be expressed as . The similarity between two nodes r and o can be calculated by

Based on the topology of the L&I visualization network, the similarity between network nodes was analyzed according to the primary and secondary relationships between groups of books and documents. The L&I visualization network can be described as a directed graph QX (H, K), where H and K are the set of nodes and the set of edges in the network, respectively. Let B be the path connection matrix of the directed graph, and let be the similarity matrix of nodes r and o, with SS (r, o) being the structural similarity between r and o:

where and are the weights of primary relationship node l and secondary relationship node l, respectively; RZl and UZl are the in-degree and out-degree of node l, respectively.
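The attribute-based node similarity of this section can be sketched as a weighted sum of three partial similarities: basic information BA, text information TE, and relationship information RE. The sketch assumes that each group of weights sums to 1, that basic attributes are compared by exact match, and that text and relationship information are vectors compared by cosine similarity. The dictionary keys, function names, and all numeric values are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def basic_similarity(ba_r, ba_o, attr_weights):
    """Weighted share of matching basic attributes (exact match is an assumption)."""
    return sum(w for key, w in attr_weights.items() if ba_r[key] == ba_o[key])


def node_similarity(r, o, attr_weights, w_ba, w_te, w_re, w_primary, w_secondary):
    """Weighted combination of BA, TE, and RE similarities; each weight group sums to 1."""
    if not math.isclose(w_ba + w_te + w_re, 1.0):
        raise ValueError("BA, TE, and RE weights must sum to 1")
    if not math.isclose(w_primary + w_secondary, 1.0):
        raise ValueError("primary and secondary weights must sum to 1")
    if not math.isclose(sum(attr_weights.values()), 1.0):
        raise ValueError("attribute weights must sum to 1")
    sim_ba = basic_similarity(r["BA"], o["BA"], attr_weights)
    sim_te = cosine(r["TE"], o["TE"])
    sim_re = (w_primary * cosine(r["RE1"], o["RE1"])
              + w_secondary * cosine(r["RE2"], o["RE2"]))
    return w_ba * sim_ba + w_te * sim_te + w_re * sim_re


# Two made-up nodes: same publisher and page count, different year and text vector.
node_r = {"BA": {"publisher": "A", "year": 2020, "pages": 300},
          "TE": [1.0, 0.0], "RE1": [1.0, 1.0], "RE2": [0.0, 1.0]}
node_o = {"BA": {"publisher": "A", "year": 2021, "pages": 300},
          "TE": [0.0, 1.0], "RE1": [1.0, 1.0], "RE2": [0.0, 1.0]}
attr_weights = {"publisher": 0.5, "year": 0.3, "pages": 0.2}

sim = node_similarity(node_r, node_o, attr_weights,
                      w_ba=0.2, w_te=0.5, w_re=0.3, w_primary=0.6, w_secondary=0.4)
```

Keeping each weight group normalized bounds the combined similarity between 0 and 1 when all partial similarities are non-negative, which makes scores comparable across node pairs.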

5. Experiments and Results Analysis

The experiments were carried out in the following steps: (1) capture all the associations between the L&I nodes of each discipline; (2) statistically describe different disciplines at different node degrees; (3) compare the structural evolution of the L&I visualization network through discipline crossover; (4) compare the query accuracies of L&I with different encoding schemes.

All the associations between L&I nodes in each type of discipline were captured in the JavaScript programming environment. However, it is impossible to visualize so many types of disciplines one by one. Figures 4 and 5 present the L&I visualization networks for science and engineering disciplines, respectively. The light blue and light pink clusters are located in the center of the respective networks.

Table 2 presents the descriptive statistics (number of nodes, number of edges, node degree, and mean clustering coefficient) on the primary disciplines of the top 8 disciplines, whose L&I visualization network density is greater than 0.5. It can be seen that the L&I networks of disciplines with node degree greater than 0.5 have similar features: the networks of science, engineering, medicine, agronomy, and management have a few nodes, but the nodes boast a high node degree, possess a high mean clustering coefficient, and belong to the fully connected state. To better describe the distance between nodes from the planar perspective, Figure 6 visualizes the L&I networks of different disciplines with node degree smaller than 1. The figure provides a visual display of the cross influence of node distribution density and mean clustering coefficients on the L&I of these disciplines, as well as the status of every secondary discipline network under each primary discipline L&I network. The information in the figure fully demonstrates the superiority of the visualization technology that cannot be reflected by columns of numbers.

Table 3 presents the descriptive statistics on 24 disciplines with network node degree smaller than 0.5. From the statistics on the number of nodes, it can be inferred that this number is negatively correlated with network node degree. There is also a negative correlation between the number of nodes and the sum of the initial cross influence between disciplines. This law can be verified by the number of edges. If a few nodes are connected by many edges, the clustering coefficient and node degree of the visualization network will increase. If many nodes are connected by a few edges, the clustering coefficient and node degree of the visualization network will decrease. This further confirms the cross influence of L&I in terms of discipline.

To measure the influence of discipline integration on the structure of the L&I visualization network, this paper compares the structural evolution of the network before and after discipline integration and analyzes the dynamic evolution law of the interactive operation on the form of L&I visualization. Figure 7 presents the structure of the visualization network for each primary discipline of liberal arts.

Table 4 displays the nodes of visualized paths on different levels of importance, aiming to verify the effectiveness of path importance on L&I visualization tasks. Considering the calculation results on importance levels and the meta-paths for the typical paths in the table, it is possible to generate the node pairs corresponding to the paths and then measure the similarity between nodes by the methods specified in Section 4.2.

Further, our algorithm is expected to correctly guide the similarity measurement of nodes in information flow query tasks. To verify whether our algorithm meets this expectation, this paper compares the experimental data under different weight distributions, similarity metrics, and edge properties. Figure 8 compares the L&I query accuracies under different visualization encoding schemes. There are eight curves in the figure: curved edges with no weight and no similarity metric (CUR1), curved edges with no weight but with a similarity metric (CUR2), curved edges with weight but no similarity metric (CUR3), curved edges with weight and a similarity metric (CUR4), straight edges with no weight and no similarity metric (SL1), straight edges with no weight but with a similarity metric (SL2), straight edges with weight but no similarity metric (SL3), and straight edges with weight and a similarity metric (SL4). It can be observed that, under any encoding scheme, the nodes with a similarity metric had higher accuracies than those without one. Regardless of the presence or absence of weight, the similarity metric and modeling difficulty both greatly affected the L&I query accuracy.

6. Conclusions

This paper probes into the visualization of L&I from the perspective of big data analysis and fusion. The first step was to analyze the topology of the L&I visualization network and calculate the metrics for the construction of the L&I visualization topology map, as well as the importance of meta-paths of the L&I visualization network. After that, a complex big data L&I visualization network was established to analyze the associations between information nodes. Through experiments, the disciplines with network node degree greater or smaller than 0.5 were described statistically, and the structural evolution of the L&I visualization network before and after discipline integration was demonstrated to confirm the effectiveness of our visualization strategy. Further, the nodes of the visualized paths were obtained under different levels of importance, which verifies that the path importance is effective in L&I visualization tasks. The results also indicate that, during the information flow query, the node similarity metric derived by our algorithm can correctly guide the query process.

This paper needs to be further improved in many aspects. For one thing, the meta-path features can be learned by attention-based neural network models, in order to generate more explanatory meta-path importance. For another, big data can be visualized in various forms. It is meaningful to explore the data visualization of multiple variables. The future work will analogize and analyze big data visualization more systematically.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.