Abstract

With the rapid development of the Internet, the amount of information on it has grown explosively, and cloud computing and big data analysis technologies based on Internet information have risen accordingly. However, web pages contain not only important information but also noise that is irrelevant to the subject, which seriously affects the accuracy of information extraction, so research on web page information extraction technology has emerged as the times require and has become a research hotspot. The quality of web page text information directly affects the accuracy of later information processing and decision-making. If we can accurately evaluate the information of web pages captured from the Internet and classify the extracted pages according to their characteristics, we can not only improve the efficiency of information processing but also improve the practical value of the information decision-making system. Starting from practical application requirements and user-friendly operation, this paper studies the information visualization of web design based on big data. Specifically, the system designed in this paper improves the traditional template-based web information extraction method, establishes a web information extraction rule scheme combined with templates, and achieves the goal of web information extraction rule selection and template generation in a visual environment. Finally, a visualization algorithm based on t-SNE verifies the effectiveness of the web page information visualization algorithm designed in this paper.

1. Introduction

With the rapid development of the Internet, it has become the largest source from which all parties obtain the data and information they need. Searching for information or data on the Internet to meet user needs has become increasingly common, and the Internet itself keeps exploding in size and data volume. However, the result data extracted from existing web page information lacks an effective form of data organization, which leads to disordered data without a standard format or organizational norms and to poor usability when the extracted data is reused or analyzed [1].

On the other hand, the main data information of web pages on the Internet is usually accompanied by considerable noise or interference information, such as advertising, navigation, and copyright information. Although such noise plays a certain role in web pages, it is useless for extracting the important information and seriously interferes with the accuracy of web page information extraction. How to distinguish useful information from irrelevant information in web pages more accurately is therefore very important and has become a key guarantee of the accuracy of data information extraction. In today's era of big data, effective target information plays an increasingly important role in economic development and social progress. Natural language, as the most direct, most commonly used, and most convenient tool of human communication, is applied to all kinds of life scenes anytime and anywhere. As Internet technology matures and deepens, increasingly huge amounts of information appear before people in the form of electronic documents, and this excessive growth of information also brings many negative effects.

Information retrieval technology is closely related to information extraction technology. Information retrieval only returns pointers to relevant documents through statistics and keyword matching, without really understanding the content of the documents, and the retrieved documents are merely a collection of related words. Information extraction technology makes use of massive data and supplements and improves information retrieval and other means of information acquisition from the perspective of satisfying users' information needs. From an engineering point of view, information extraction technology realizes the automatic search, understanding, and extraction of massive information. This technology will play an important role in promoting information collection, scientific and technological literature monitoring, medical care services, business information acquisition, and other fields [3].

The term information visualization was first coined in 1989 by Stuart Card, Jock Mackinlay, and George Robertson. Initially, information visualization helped the audience understand text- and data-based content by applying visual elements with a beautifying effect. In this process, it has become not only a means of recording information but also an effective tool to enhance the depth and interest of knowledge and information. Although information visualization has been studied for more than ten years as an interdisciplinary field with a long history, it has led a new research boom in modern user interface design in recent years and has been widely used in mathematics, statistics, computer science, and other related fields. With the popularization of computers and mobile devices, the network, as a fast and extensive platform, enables the dissemination of information to break through the limitations of time and space and take more forms of expression. Large amounts of information and data confront people at an astonishing volume and speed. Different from traditional static, flat information visualization, interactive information visualization on web pages appears more intuitive through a clear visual interface, and interaction between people and information has become a new development trend for conveying abstract information. The collection of dynamic visualizations, graphics, and interactive technologies that make complex information data less mysterious and accessible to most people continues to be explored. According to a 1999 report by Stuart Card, information visualization, which started in the 1990s, actually grew out of several intersecting fields: information graphics, statistics, computer science, user interaction, and so on [4].

At the same time, it is necessary to understand the market situation in a timely manner, monitor competitors, and forecast the development trend of the industry, so as to make correct analyses and decisions and improve core competitiveness. One effective means to achieve this goal is to establish a complete, accurate, and efficient information visualization system. Both mechanism modeling and data modeling have advantages and disadvantages. In the context of the big data era, sufficient prior knowledge of the target system can be obtained through data modeling, and subsequent mechanism modeling of the target system can then avoid, to a certain extent, discrepancies between the mechanism model and the actual situation. Data modeling and mechanism modeling are compared in Figure 1, from which we can see that both modeling methods can construct the information system well; the modules in Figure 1 are correlated and provide feedback that makes the whole system work and run automatically.

Therefore, if the web pages captured from the Internet cannot be extracted and classified as target information in a timely and effective manner, three main disadvantages follow. First, the massive amount of web information will greatly increase the storage burden of the system that captures the intelligence information. Second, the massive amount of web information will bring great inconvenience to follow-up information processing and related work, which may even become impossible to complete. Third, the massive amount of web information will leave enterprise users lost in a vast ocean of information, where it is difficult to find information relevant to them, even with luck [8-10]. The research in this paper explores the characteristics and design principles of interactive information visualization against the background of the information age, tries to find more accurate and vivid visual languages and communication methods, and adds some content to existing research theories. It provides a reference for information visualization designers to study and develop interactive information visual interface design from a more comprehensive perspective. Although some well-known websites and social media have begun to apply information visualization to redesign information data, most of them stay at a simple and monotonous level, which also reflects the shortage of information visualization talent. This paper summarizes the principles and methods of information visualization design as well as its future development trend, and plays a certain guiding role in the practical application of interactive information visualization in web pages and the expansion of diversified ways of transmitting information [11].

Due to the diversity of information types and data structures, how to describe data information in an all-round way is a difficult problem. Therefore, a data organization concept, called metadata, has gradually come into being [12, 13]. A popular definition of metadata is structured data that describes the characteristics and attributes of a resource such as data or information. However, the application of metadata has mainly been in the field of library science, especially in domestic research. As an important way of organizing information, metadata began to play a huge role in various fields after the emergence of various standards. In the organization of educational resources, a metadata schema for learning objects was designed to encapsulate the correlations between the educational resources in a learning object. A framework of educational resources is established based on metadata so that all educational resources can be organized and reused. In the field of e-government, researchers integrate government information and services with a classification structure named faceted metadata in order to help government agencies realize interconnection between different departments and unified server architecture management [14].

Information extraction originates from text understanding. The earliest research on obtaining structured information from unstructured natural texts can be traced back to the mid-1960s, represented by the Linguistic String Project at New York University and the FRUMP project at Yale University. That research focused on extracting formatted information from medical reports and extracting information from news reports and continued into the 1980s. Research on information extraction for Chinese started relatively late [15, 16]. Due to the huge differences between Chinese and western alphabetic languages, research on Chinese information extraction has made slow progress. In the early stage, the work mainly focused on Chinese named entity recognition and, on this basis, has moved to higher-level tasks such as coreference resolution and event extraction. It can be seen that, although a fully mature information extraction system has not yet appeared, in recent years the field of information extraction has shown a more active trend, and new developments have been made in both theory and application [17].

In the 21st century, with the rapid development of the Internet, the field of information extraction has begun to move toward web information extraction. HTML tags have no real semantic meaning [18]. The method based on the DOM tree structure performs similarity calculations and other operations on the DOM tag trees of a large number of sample web pages, so as to summarize the structural features of the parts to be extracted, form rules, and carry out extraction. The template-based extraction method, through comparison and analysis of web pages generated from the same template, produces a set of general extraction templates that can directly extract information in practice [19-22].
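To make the template-based idea concrete, the following is a minimal, hypothetical sketch in Python: a selector template for an assumed news-article layout is applied to a page to pull out the subject fields while ignoring noise blocks. The field names, CSS selectors, and sample HTML are illustrative assumptions, not the rules actually generated by the system described later in this paper.

```python
# Illustrative template-based extraction; the template below is a hypothetical
# example for a single assumed page layout, not the system's actual rules.
from bs4 import BeautifulSoup

# Hypothetical extraction template: field name -> CSS selector
ARTICLE_TEMPLATE = {
    "title": "h1.article-title",
    "date": "span.publish-date",
    "body": "div.article-body p",
}

def extract_with_template(html: str, template: dict) -> dict:
    """Apply a selector template to one page and return the extracted fields."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in template.items():
        nodes = soup.select(selector)
        # Join multi-node fields (e.g. body paragraphs) into a single string
        record[field] = " ".join(n.get_text(strip=True) for n in nodes)
    return record

if __name__ == "__main__":
    sample = """
    <html><body>
      <h1 class="article-title">Example headline</h1>
      <span class="publish-date">2022-01-01</span>
      <div class="article-body"><p>First paragraph.</p><p>Second one.</p></div>
      <div class="ad">Advertisement (noise, not matched by the template)</div>
    </body></html>
    """
    print(extract_with_template(sample, ARTICLE_TEMPLATE))
```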

Information visualization is an interdisciplinary design category that entered the field of view in the early 1990s; through in-depth exploration and display of information data, it helps draw effective scientific conclusions. It can be applied to news dissemination, knowledge visualization, product design, environmental protection, political education, business exchange, entertainment gossip, and other fields. In recent years, with the deepening of information visualization research, research on map information visualization has also been rising. With the rapid development of computer technology, major map-making software has emerged, such as MapInfo, ArcInfo, and ArcView abroad and SuperMap and GeoStar domestically, all providing mapping templates for thematic maps. These modules are very powerful and distinctive [23]. Even on Alibaba Cloud's relatively mature DataV platform, the graphics generated from data can achieve brilliant visual effects, but the visual forms are still too uniform; they achieve the goal of graphical data expression but can hardly move users who seek a deeper experience, owing to the lack of communication and interaction with the audience. Therefore, it cannot meet the current trend of humanized and emotional design [24].

With the development of globalization, the barriers of geography, culture, and language are becoming smaller and smaller. People are flooded with a large amount of information and various digital media, which reduces the audience's ability to receive, select, and process information [25]. At the same time, when the audience expects to share concepts and information, a clear way of visual communication becomes very important. How to make information easily conveyed and understood by users has become the focus of information design. Therefore, information visualization, a multidisciplinary design field, arises at the right moment. At a visual level, visualization is the establishment of a mental model or mental image of something [26]. Therefore, visualization is based on people's cognitive activities and their reunderstanding of information, and it emphasizes the transformation of complex and invisible things into visible ones. Visualization in the broad sense can be divided into scientific computing visualization and information visualization according to different application fields and functions. Scientific computing visualization is widely used in geography, physics, and medicine, focusing on the analysis of computational research results [27]. Information visualization is mainly applied to Internet business, finance, political relations, entertainment information, and other fields. With the advance and growing use of Internet technology, people can share and exchange more information. At the same time, people are not only the recipients of information but also its creators. Therefore, there must be a way of transmitting information that can cross the boundaries of language and culture. The emergence of information visualization provides people with a more interesting way to obtain information. When people use it as a means of communication, it not only changes the meaning of information dissemination and design language, but also broadens the possibilities of traditional charts and annotations such as pie charts, line charts, and bar charts [28].

The generation of information visualization is closely related to the increasingly serious phenomenon of information fatigue. Information fatigue refers to the fact that, in the process of information transmission, the audience's attention is distracted and diminished under the stimulation of a large amount of information, and the selective reception mechanism cannot play its normal role [29, 30]. As a result, the affinity between the audience and information is gradually reduced until rejection occurs, which eventually lowers the audience's ability to receive information, their ability to process it, and their motivation to process it, resulting in psychological and physical fatigue. The emergence and development of the Internet and the spread of information have broken the limitations of traditional media in time and space, and information capacity has expanded unprecedentedly compared with text; information visualization is a comprehensive form of information. Different from a simple list or plain language, it integrates various forms of information and establishes an information image in the audience's mind. Because the amount of information is large, the length of the interface is also large; at the same time, limited by the size of the web interface, a single and fixed form of information transmission will cause visual fatigue and psychological aversion in the audience [31]. Visual graphics that summarize and show richer information, combined with fun and vitality, help the audience quickly discover and explore the substantive issues and internal connections implied by the data, and interactive information visualization based on dynamically updated network interfaces further strengthens users' ability and initiative to select information content [32]. Therefore, information visualization with graphics, color, and images as its main form has become a necessity of modern information transmission.

The contributions of the paper are given as follows:
(1) The system designed in this paper improves the traditional template-based web information extraction method and establishes a web information extraction rule scheme combined with templates.
(2) The system achieves the goal of web information extraction rule selection and template generation in the visual environment.
(3) The visualization algorithm based on t-SNE verifies the effectiveness of the web page information visualization algorithm designed in this paper.

3. Information Visualization Design for Web

3.1. The Flow Chart of the Design Method

Information visualization is an interdisciplinary design field that mainly uses graphics and image technologies and methods to summarize and visually present large-scale, complex information data more intuitively, helping people understand and analyze the data. Information visualization technology is in line with the development trend of the information age and greatly improves the readability and visual appeal of information data.

The system structure block diagram of the proposed method is shown in Figure 2. As can be seen from the figure, the visual channels of position, hue, transparency, and shape are used to encode categorical data, while the visual channels of brightness, saturation, and size are used to encode ordered data.
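As an illustration of this channel assignment, the following is a small hypothetical sketch: a categorical attribute is mapped to hue and marker shape, and an ordered attribute to marker size. The sample records and the specific color and marker choices are assumptions for demonstration only and do not come from the system described here.

```python
# Illustrative encoding: categorical data via hue/shape, ordered data via size.
import matplotlib.pyplot as plt

points = [  # (x, y, category, ordered_value) -- made-up sample records
    (1.0, 2.0, "news", 10),
    (2.0, 1.5, "advert", 40),
    (3.0, 3.0, "greeting", 25),
    (1.5, 2.8, "news", 60),
]
palette = {"news": "tab:blue", "advert": "tab:red", "greeting": "tab:green"}
markers = {"news": "o", "advert": "s", "greeting": "^"}

for x, y, cat, val in points:
    # Hue and shape encode the category; size encodes the ordered value.
    plt.scatter(x, y, c=palette[cat], marker=markers[cat], s=val * 5, alpha=0.7)

plt.xlabel("x")
plt.ylabel("y")
plt.title("Categorical data -> hue/shape, ordered data -> size")
plt.show()
```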

3.2. t-SNE Visualization Algorithm

Generally speaking, the t-SNE algorithm is an improvement of SNE. It is well suited to the visual processing of data, which is why it is embedded in the framework proposed in this paper. SNE uses a conditional probability to describe the similarity between two data points, as follows:
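$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},$$

which is the standard SNE definition, where $x_i$ and $x_j$ are high-dimensional data points and $\sigma_i$ is the variance of the Gaussian centered on $x_i$.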

Since we only care about the similarity between pairs of points, we can also use a conditional probability to define similarity in the lower-dimensional space:
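$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)},$$

where $y_i$ and $y_j$ are the low-dimensional counterparts of $x_i$ and $x_j$, following the standard SNE formulation.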

The variance of the Gaussian in the lower-dimensional space is directly set to $1/\sqrt{2}$, as in standard SNE, so that it no longer appears in the expression above.

Thus, each point induces a conditional probability distribution in the high-dimensional space and a corresponding conditional probability distribution in the low-dimensional map space. If the data distribution after dimensionality reduction is the same as the data distribution in the original high-dimensional space, then in theory these two conditional probability distributions are identical. How, then, do we measure the difference between these two conditional probability distributions? The answer is the Kullback-Leibler (KL) divergence (also known as relative entropy), so the objective function is as follows:
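$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},$$

the standard SNE cost, where $P_i$ and $Q_i$ denote the conditional distributions over all other points given $x_i$ and $y_i$, respectively.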

From the above introduction, the main shortcoming of SNE is the complexity of its gradient calculation. The gradient of the objective function is as follows; since the conditional probabilities are not symmetric, a large amount of computation is required:
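$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j),$$

which is the standard SNE gradient.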

In the original SNE, the conditional probabilities are asymmetric, both in the high-dimensional space and in the low-dimensional space. Therefore, symmetric SNE was proposed, adopting a more general joint probability distribution to replace the original conditional probabilities, so that
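$$p_{ij} = p_{ji}, \qquad q_{ij} = q_{ji} \quad \text{for all } i, j.$$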

In short, the joint probability is defined in the lower-dimensional space as:
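$$q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert^2\right)}.$$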

Of course, we can define it in higher dimensions:
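$$p_{ij} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\lVert x_k - x_l \rVert^2 / 2\sigma^2\right)}.$$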

However, under this definition an outlier contributes almost nothing to the cost, so we need to impose a greater penalty on outliers, and the joint probability in the higher-dimensional space is therefore modified as
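$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},$$

where $n$ is the number of data points; this guarantees $\sum_j p_{ij} > 1/(2n)$ for every $x_i$, so every point contributes significantly to the cost.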

This avoids the outlier problem, and the gradient then becomes
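$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j),$$

the standard symmetric SNE gradient.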

Then, replacing the Gaussian in the low-dimensional space with a Student t-distribution with one degree of freedom, which is the key modification of t-SNE, we have
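$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.$$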

Accordingly, the gradient of the t-SNE objective is
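$$\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}.$$

For reference, the following is a minimal sketch of how the visualization step can be reproduced with an off-the-shelf t-SNE implementation; the random feature matrix and category labels are placeholders standing in for the system's actual web page features, not the data used in this paper.

```python
# Minimal sketch: project web-page feature vectors to 2-D with t-SNE and plot them.
# The feature matrix and labels are random placeholders; only the t-SNE step
# itself corresponds to the formulas above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 50))   # placeholder page feature vectors
labels = rng.integers(0, 4, size=200)   # placeholder page categories

embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=15)
plt.title("t-SNE projection of web page features (illustrative)")
plt.show()
```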

4. Experimental Results and Analysis

4.1. Experimental Results Analysis

Take Figure 3 as an example; this is a thematic map made by The New York Times for finding the quietest places in New York City. These places are drawn based on suggestions provided by readers, and the areas with dense blue dots are the quietest corners of the city. The purpose is to give users a more intuitive experience of quietness. When the mouse hovers over a location, the relevant text description and introduction of that place appear, along with video shot there. In addition to learning the relevant information, users can also mark their own quiet corners of New York on the map. Presenting the space in such an interactive way makes it more intuitive and easy to read and expands the relevant data at low cost. At the same time, through operations with a sense of participation, users can have more fun and a stronger sense of immersion.

Unlike probability, reaction time and decision time follow a distribution rather than a single value when grouped by geography and are closely related to people's daily habits. It can be seen from Figures 4(a) and 4(b) that the peaks of watching behavior and forwarding behavior are around 8:00 in the morning and 9:00 in the evening, which indicates that these are the times when people engage in such activities, matching human behavior patterns.

For enterprises, therefore, advertising should be concentrated on nodes at the center of the network that have a huge number of connections.

The daily viewing frequency distribution of four different types of web pages is shown in Figure 5. From the frequencies of the four kinds of information in Figure 5, it can be seen that activities in the daytime are more frequent than those at night and that the contents watched in the daytime are roughly the same. For emotional articles, checking was more likely to take place between 5 pm and 10 pm, suggesting that people prefer to express their emotions at night. Holiday greetings are usually sent between 8 am and 2 pm, which means that people do not like to give their blessings between 2 pm and 4 pm. As can be seen from the above description, human behavior shows obvious burstiness and periodicity, resulting in complex temporal characteristics of information transmission.

As can be seen from Figure 6(a), there is no significant difference in topological popularity among the four categories. Advertising (AD) is more likely to produce large cascades (involving more than 10,000 users) than the other categories. As can be seen from Figure 6(b), holiday greetings (HG) have the smallest life-cycle value, followed by advertising, emotional prose (EE), and news bulletins (NB), suggesting that holiday greetings usually last for a short period of time, because they are usually spread for only a few days before and after the holiday. Figure 6(c) shows that emotional prose has the largest median spread area, which indicates that emotional prose is more easily spread to more provinces. The reason is that emotional themes are more likely to cross geographical boundaries and resonate with users in different regions.

Figure 7 shows the maximum cascade scale of nodes participating in information forwarding in the underlying social network, which further supports the above explanation: when α = 0, that is, when higher-level nodes participate in information forwarding, the cascade scale jumps suddenly. Therefore, the tail peak in the cascade size distribution when α = 0 corresponds to the large cascades caused by forwarding from central nodes. In contrast, when α = 0.8, the increase of cascade size is relatively stable and continuous.

A curved layout refers to arranging the pictures and text on the page along curves, or guiding the flow of sight across the overall interface in a curved manner; this layout is full of rhythmic beauty. The map shown in Figure 8 presents not only each family's castle but also the locations of some important events in the first quarter, with numbers marking where the events take place. As the background map and route information are all slightly curved, a curved visual segmentation is formed, which makes the whole work more dynamic and precise.

5. Conclusions

The rapid development of the Internet and of the intelligent global economy, together with the accelerating pace of life, means that people access information under conditions of constant accumulation and have entered an age of reading by images, because graphical presentation of information allows people to understand it more quickly and clearly, and maps are among the earliest image-based forms of information. In the field of information visualization, which has a wide range of applications, more and more designers tend to choose maps with geographical characteristics to express information.

Based on research into information visualization design in web pages, this paper takes visual presentation as the basis and interactive experience in web design as the point of innovation, comprehensively analyzes many practical cases at home and abroad, explores in depth the differences between information visualization based on web design and that in traditional communication media, summarizes the design methods, and puts them into practice. It is hoped that the combination of big data and interactive experience can improve the efficiency and quality of information visualization and provide certain design guidance for future information visualization based on web design.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.