Big Data Modelling of Engineering and Management 2022
Algorithm for Filling High Rank Matrix of Network Big Data Based on Density Peak Clustering
Traditional filling methods for network big data matrices have poor filling performance and are sensitive to noise. Therefore, a filling algorithm for the high rank matrix of network big data based on density peak clustering is proposed. The missing data are replaced by small-interval data, the information entropy of the high rank matrix of network big data is calculated, the density peak clustering algorithm is optimized through a cluster center selection strategy, the block data set is obtained through the unknown block method, and block filling is realized by the host filling algorithm. Experimental results show that the filling accuracy of the proposed algorithm reaches 0.895 at data missing rates between 2% and 12%.
Matrix filling is a research hotspot in matrix analysis, optimization, and image processing. It means accurately recovering the missing elements of a sampling matrix from its known elements and finally completing the sampling matrix. In practice, the sampling matrix sometimes has a special structure, such as a symmetric matrix or a Toeplitz matrix, which plays an important role in communication engineering and power systems, especially in the field of signal and image processing [2–4].
There are two kinds of methods for the matrix rank minimization problem: the first relaxes the nonconvex rank function into the convex matrix nuclear norm and establishes an optimization model of the nuclear norm; the second fixes the rank of the matrix in advance and establishes a low-rank decomposition model. Many domestic experts and scholars have also carried out extensive research on matrix filling and applied the results to image processing, text analysis, and recommendation systems.
In image restoration, the large singular values in the data matrix retain more characteristics of the original data, while the small singular values contain more noise. Matrix filling technology has been widely used in data analysis, recommendation systems, image filling, video denoising, and machine learning [7, 8].
Research on big data involves a huge amount of information, which is usually collected and stored in daily life, but the process is carried out without supervision. Once external interference occurs, some data will inevitably be lost. The collected data usually contain important information; if they are not processed promptly or are processed improperly, the timeliness and validity of the data will be seriously affected, and wrong data may even appear, leading users to wrong decisions. In view of missing data, the data need to be filled in time [10–15].
Relevant scholars have studied this problem and achieved some results. Sun et al. proposed a missing data filling algorithm based on an improved neural process: the observed time series are expressed in a uniform way, their characterization vectors are obtained by a neural network, the distribution function of the data is obtained through the neural process model, a correction coefficient is introduced in the training stage, the sampling rate of the training data is determined according to the data missing rate, and the missing values are estimated through the trained model. The results show that this algorithm has a good filling effect on small data sets, but its filling accuracy for missing data is poor. Lin et al. proposed a missing data filling algorithm based on K-means clustering optimized by the cuckoo search algorithm: taking the training error of the neural network as the fitness function, the weights and thresholds of the neural network are optimized, the weights are computed through cuckoo search, and K-means clustering is adopted to optimize the missing data threshold and fill the missing data. This method has high filling efficiency, but its data loss rate is unacceptable.
This paper presents a high rank matrix filling algorithm for network big data based on density peak clustering. The specific steps are as follows.
In the first step, density peak clustering is introduced, the local density is calculated based on cutoff kernel function, and the clustering center is obtained by comprehensively considering the local density value and minimum distance value of data points [18–22]; the second step is to classify the nonclustering center points, identify the abnormal points, replace the lost data with small-interval data, and calculate the information entropy of the filling data; in the third step, the distance measurement is obtained by the density peak clustering algorithm to realize data segmentation, and the complete sub-data set is obtained by the host filling method to complete the high rank matrix filling algorithm of network big data [23–28].
The fourth step is to verify the effectiveness of the proposed method and draw a conclusion.
2. High Rank Matrix Filling Algorithm for Network Big Data
2.1. Big Data Information Entropy Calculation
Set the data set with missing correlation to , where represents the object set and represents the attribute set. If , , for attribute , the interval data will be the missing data , which can be represented by “.”
On the big data set , the attribute condition is met, and , . Assuming that the data is missing, the data set that has not been filled is called the original data, and the information entropy is calculated as follows:
where represents the minimum multi-interval similarity between and , and represents the correlation coefficient between the index and the index, both of which can be expressed as follows:
where represents the interval of missing data.
Let be a small-interval data item, replace the lost data in the big data set with , and at the same time fill the data set in the original big data:
where represents the lower limit of the interval and represents the upper limit of the interval.
The newly obtained large data set can be expressed according to
The information entropy of the newly obtained data set can be calculated by . Because the missing data in the large data set are small-interval data, is almost equal to zero, and . That is, is
The data in the original data set are missing data, and their length can be regarded as zero, while the original big data are . It can be confirmed that the information entropy of the filled big data and the information entropy of the original data satisfy the following relationship:
By replacing missing data with relatively small-interval data and continuously expanding the interval range, the information entropy will keep decreasing within a certain range [37–40]. The above interval similarity relationship expands the range of the interval: the larger the newly filled data interval, the greater the possibility that the object will be classified into other object data sets [41, 42]. When the filling range of the data interval is too large, data classification becomes abnormal, resulting in data confusion. Therefore, when the newly filled interval range increases from the minimum to the maximum, there is at least one interval whose entropy is the minimum [43–47].
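The interval-replacement idea above can be sketched in a few lines. This is a minimal illustration rather than the paper's exact procedure: the interval-similarity entropy defined by the formulas is not fully recoverable, so a histogram-based Shannon entropy is used as a stand-in, and the helper names `interval_fill_entropy` and `best_interval` are our own.

```python
import numpy as np

def interval_fill_entropy(column, lower, upper, bins=10):
    """Fill NaNs with the midpoint of the interval [lower, upper] and return
    the Shannon entropy of the filled column (histogram-based stand-in)."""
    filled = np.where(np.isnan(column), (lower + upper) / 2.0, column)
    counts, _ = np.histogram(filled, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_interval(column, widths):
    """Scan candidate interval half-widths around the observed mean and keep
    the one whose filled column has minimum entropy, mirroring the claim
    that at least one interval attains the entropy minimum."""
    mu = np.nanmean(column)
    scored = [(interval_fill_entropy(column, mu - w, mu + w), w) for w in widths]
    return min(scored)[1]
```

Scanning widths from small to large and keeping the entropy minimizer reflects the observation that the entropy first decreases and then rises as the interval grows.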
2.2. Determination of Density Peak Clustering Center
According to the calculated information entropies of the filled big data and , the relationship between them is analyzed as the selection basis. The density peak clustering algorithm is optimized by a cluster center selection strategy [48, 49]. According to the principle of cluster center selection, the difference degree of each point is measured by the normalized product of the adjacent distance and the density . According to the statistical characteristics and change trend of the difference degree, the group of points with the largest difference degree is selected as the cluster centers. After the cluster centers are obtained, the network big data are divided into different clusters according to the adjacent distance labels so as to realize clustering [50–52].
In order to quantify the degree to which a data point of network big data is offset from the origin and after normalization, the cluster center weight is introduced according to the positive proportional relationship:
In order to obtain the data point set with the largest deviation, the cluster center weights are arranged in descending order. The first points are taken, and is usually set to 30. The point with the greatest deviation from the origin is taken as the inflection point where the overall downward trend of the cluster center weight turns from steep to gentle.
The downward trend of the cluster center weight is described by the slope of the two-point line segment:
In equation (9), denotes the average change rate of the cluster center weight within the range , which reflects the overall change trend of that range. Then, the inflection point can be described as
In equation (10), represents the slope from the first point to the th point, i.e., the average change rate of the point set ; describes the slope from the th point to the st point.
Based on the above analysis, the cluster center selection process is given:
(1) Calculate the difference degree of each network data point .
(2) Arrange the cluster center weights in descending order.
(3) Calculate and as well as the maximum value of , and determine the inflection point .
(4) Take the network data points before the inflection point as the cluster center points.
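The four-step selection process can be sketched as follows. Because the exact slope test is not fully recoverable from the text, this sketch assumes the inflection point is where the average slope from the first point most exceeds the local slope just after it; the function name `select_centers` and that criterion are our own.

```python
import numpy as np

def select_centers(rho, delta, m=30):
    """Steps (1)-(4): compute the difference degree as the normalised
    product of density rho and adjacent distance delta, sort it in
    descending order, and keep the points before the inflection where the
    sorted curve turns from steep to gentle."""
    gamma = (rho / rho.max()) * (delta / delta.max())
    order = np.argsort(gamma)[::-1]          # descending difference degree
    m = min(m, len(gamma))
    g = gamma[order][:m]
    best_i, best_gap = 1, -np.inf
    for i in range(1, m - 1):
        k_head = (g[0] - g[i]) / i           # average change rate over points 1..i
        k_tail = g[i] - g[i + 1]             # local slope just after point i
        if k_head - k_tail > best_gap:       # assumed inflection criterion
            best_gap, best_i = k_head - k_tail, i
    return order[:best_i]                    # points before the inflection
```

On data with a few points of clearly large density and distance, the returned indices are exactly those points.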
2.3. Unknown Block Calculation Method
Based on the understanding of the data set, missing correlation data can be divided into two types: block-known and block-unknown. Block-known missing data can be directly divided into blocks by the known information. In this case, there are fewer variables, and the meaning of the variables is clear. In this paper, several data sets are used for experimental verification.
In actual operation, most missing data sets are block-unknown, especially in cloud computing with missing big data, where block information is difficult to distinguish. In the block-unknown case, this paper adopts the density peak clustering algorithm and uses an improved distance measurement to realize the blocking of missing data.
The density peak clustering algorithm does not need the number of categories to be specified in advance. The calculation method on the data set is as follows:
Input: the number of clusters and the data set .
Output: the clusters of all objects.
Step 1: randomly select cluster centers, denoted by .
Step 2: iterate until convergence.
Calculate the cluster to which each object in belongs:
Update the center of each class :
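The two-step iteration can be sketched as a plain k-means loop. Farthest-point initialisation is used here instead of the purely random pick in the text, only for stability of the sketch; everything else follows the assign-then-update scheme above.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign-then-update iteration: each object joins its nearest center,
    then every center becomes the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # farthest-point initialisation (stability tweak; the text picks randomly)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        # Step: the cluster to which each object belongs
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Step: update each center as the mean of its class
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```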
Before describing the block calculation, the following problems should be solved:
(1) Clustering methods can be divided into Q-type and R-type clustering. The K-means calculation method is actually a Q-type clustering method, while R-type refers to clustering over the variables. To cluster the variables with the density peak clustering algorithm, the data set should first be transposed to and then clustered.
(2) When the sample size of is relatively small, after transposition is an uncertain data set, and the density peak clustering algorithm has no limitations in clustering. However, when is relatively large, is still missing data after replacement, and the density peak clustering algorithm has certain limitations in clustering. To solve this problem, sparse expression is used to select variables. The central idea is to constrain the weights of the variables in the objective function, forcing variables with relatively small weights out of the clustering and thereby retaining variables with large weights. In this way, variable selection is realized, and the resulting objective function is defined as follows:
where represents the variable weight vector, represents the coefficient of the variable weight, and represents the adjusted parameter. This variable selection resolves the limitation of the density peak clustering algorithm when the number of variables is large.
(3) Classical clustering algorithms usually compute distances with the Euclidean distance. When a large amount of data is missing, the Euclidean distance is difficult to calculate. Therefore, it is necessary to define the distance between missing objects and :
where represents the value of the object on the variable.
(4) Classical clustering methods usually update the cluster centers by arithmetic averaging, but this is not applicable when data are missing.
Therefore, this paper defines the cluster centers in the case of missing data. An object in the cluster may include missing entries; that is, the cluster center is taken over the observed values of each variable. The formula is
where , which is .
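Since the redefined distance and center formulas are not recoverable from the text, the sketch below assumes the distance averages squared differences over the variables observed in both objects, and the center is the per-variable mean over observed values only; both are common choices for missing-data clustering, not necessarily the paper's exact definitions.

```python
import numpy as np

def missing_distance(a, b):
    """Distance between two objects with missing entries (NaN): root mean
    of squared differences over the variables observed in both, a stand-in
    for the Euclidean distance the text redefines."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.inf                     # no shared observed variables
    return float(np.sqrt(np.mean((a[mask] - b[mask]) ** 2)))

def cluster_center(cluster):
    """Cluster center under missing data: per-variable mean over the
    observed values only (arithmetic averaging restricted to non-NaN)."""
    return np.nanmean(cluster, axis=0)
```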
After solving the above problems, the calculation method of KMB is given below:
Input: the data set and the number of clusters .
Output: the block data set.
Step 1: transpose to obtain .
Step 2: arbitrarily select cluster center points .
Step 3: iterate until convergence. Calculate the cluster to which each object in belongs; for all classes , update the center of the cluster according to the definition in the text.
Step 4: convert the clusters to the complete data set and obtain .
Step 5: segment the data set according to the clustering result to obtain the block data set.
Block filling is a missing data filling method that divides the data set into blocks according to its characteristics. It is suitable for a wide range of data filling tasks, and its big advantage is that it relies on the other variables; it is also called the host method and is widely used. Most traditional filling methods can serve as the host algorithm proposed in this article, except for mode filling and mean filling, since these do not rely on other variables. The method is as follows:
Input: the missing data set .
Output: the complete data set .
Step 1: determine the data blocks. If the blocks are known, the data set can be directly divided; if they are unknown, it is divided by the KMB algorithm.
Step 2: segment the missing data set by the block information to obtain the sub-data sets .
Step 3: fill each missing sub-data set by host filling to obtain the complete sub-data sets .
Step 4: combine all the complete sub-data sets to obtain the complete data set .
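The four steps can be sketched as follows. `block_fill` only wires column blocks to a host routine; the nearest-complete-row filler `knn_host` is a hypothetical stand-in for any host algorithm that, as the text requires, relies on the other variables in a block (unlike mode or mean filling).

```python
import numpy as np

def block_fill(X, blocks, host_fill):
    """Steps 2-4: split the columns of the missing data set into blocks,
    fill every sub-data set with the host routine, and recombine."""
    out = X.copy()
    for cols in blocks:                 # each block is a list of column indices
        out[:, cols] = host_fill(X[:, cols])
    return out

def knn_host(sub):
    """Hypothetical host filler using the other variables in the block:
    every missing entry is copied from the nearest complete row."""
    filled = sub.copy()
    complete = sub[~np.isnan(sub).any(axis=1)]
    for i, row in enumerate(filled):
        miss = np.isnan(row)
        if miss.any() and len(complete):
            d = np.nansum((complete - row) ** 2, axis=1)  # NaN terms ignored
            filled[i, miss] = complete[np.argmin(d)][miss]
    return filled
```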
If the initial data set is divided into blocks and all blocks are filled in parallel, the filling time is reduced to , where represents the number of blocks and represents the filling time of a block. When the dimension and volume of the data set are large, the benefit of block filling is obvious.
3. Experimental Study
This experiment uses the movie rating data obtained from the MovieLens data set, randomly selects the rating data of 1500 users for 3000 movies, and converts it into a 3000 × 1500 partially observable user-movie rating data matrix . Given three cases of rank , the proportion of observable items is set to in the three cases. Comparative analysis is conducted in terms of the iteration convergence time , the number of iterations n, and the relative error between the repaired matrix and the original matrix. The results are shown in Figure 1.
It can be seen from Figure 1 that when the rank of the matrix to be processed is 10, with the increase of observable items, the relative convergence errors of the four filling algorithms all have a downward trend in a certain period of time. That is to say, when there are more observable items, the accuracy of the algorithm convergence is higher, and the singular value threshold truncation algorithm and the algorithm proposed in this paper have higher accuracy of convergence than the other two algorithms. It can be seen from Figure 1(b) that when the rank is 500, with the increase of observable items, the relative error accuracy of the proposed matrix filling algorithm is better than other algorithms in a certain period of time. As an improved method of accelerating the nearest neighbor gradient algorithm, the augmented Lagrangian algorithm has better convergence rate and accuracy. It can be seen from Figure 1(c) that when the rank is 1000, under the condition of less observable items, the accuracy of augmented Lagrangian algorithm is better than other algorithms. In the case of many observable items, the error convergence accuracy of the proposed matrix filling algorithm is significantly better than other algorithms.
3.1. Comparison of Filling Accuracy
The filling accuracies of different methods are measured by two standards. One standard is D2, which measures the degree of matching between the real values and the filled values; the other is the RMSE, analyzed in Section 3.2.
According to Table 1, for any missing combination, the D2 of the proposed algorithm is obviously higher than that of the other two algorithms. In addition, the more the correlated missing data, the lower the D2 obtained by the other two methods: their filling accuracy decreases with the missing rate of the data. However, the filling accuracy of the proposed algorithm always stays at a very high level. In terms of , the proposed filling method is also obviously higher than the other two algorithms.
3.2. RMSE Mean Analysis
The average error between the filled value and the true value is measured according to
where is the number of missing values, is the true value of a missing item, is the filled value, is the average of the true values, and is the average of the filled values; these define the two standards. The larger the D2 value, the higher the filling accuracy; conversely, the smaller the RMSE value, the higher the filling accuracy.
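A sketch of the two standards, assuming the usual RMSE and taking D2 as the squared Pearson correlation between true and filled values; the exact D2 formula is garbled in the source, so that definition is an assumption consistent only with "larger is better."

```python
import numpy as np

def rmse(true, filled):
    """Root mean square error between true and filled values."""
    true, filled = np.asarray(true, float), np.asarray(filled, float)
    return float(np.sqrt(np.mean((true - filled) ** 2)))

def d2(true, filled):
    """D2 agreement score, sketched as the squared Pearson correlation
    between true and filled values (assumed definition)."""
    true, filled = np.asarray(true, float), np.asarray(filled, float)
    t, f = true - true.mean(), filled - filled.mean()
    return float((t @ f) ** 2 / ((t @ t) * (f @ f)))
```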
As can be seen from Figure 2, the filling accuracy of the proposed method is relatively stable: when the data missing rate is between 2% and 12%, the D2 value stays above 0.8 and the RMSE value stays between 0.15 and 0.2. Compared with single correlation missing, the filling accuracy in the single missing mode is significantly higher than in the multi-missing mode, since the multi-missing mode loses relatively more data and interferes more with feature extraction and restoration. This proves that the proposed method has stronger stability and higher filling accuracy than the other two methods.
3.3. Analysis of Filling Effect of High Rank Matrix under Multiple Data Sets
In order to further verify the filling effect on the high rank matrix of network big data, the filling accuracies of the density peak clustering method, the K-means clustering method, and the improved neural process method are comparatively analyzed on the Google Dataset Search data set (https://toolbox.google.com/datasetsearch), the Google Trends data set (https://trends.google.com/trends/explore), and the EU Open Data Portal data set (https://data.europa.eu/euodp/en/data/), as shown in Tables 2–4.
According to Tables 2–4, the highest filling accuracy of the K-means clustering method on the Google Dataset Search, Google Trends, and EU Open Data Portal data sets is 60.02%, 60.12%, and 62.10%, respectively. The highest filling accuracy of the improved neural process method on the three data sets is 60.98%, 56.29%, and 66.10%, respectively, while that of the density peak clustering method is 98.93%, 99.63%, and 99.85%, respectively. These data show that the density peak clustering method has higher filling accuracy because it uses small-interval data to replace the lost data. By optimizing the density peak clustering algorithm through the cluster center selection strategy, obtaining the block data set through the unknown block method, and realizing block filling with the host filling algorithm, the filling noise of the network big data matrix can be effectively avoided and the filling effect improved.
This paper presents a high rank matrix filling algorithm for network big data based on density peak clustering. Density peak clustering is introduced, the missing data are replaced by small-interval data, and the information entropy of the filled data is calculated. Combined with the density peak clustering algorithm and the improved distance measurement under missing data, the missing data are partitioned, and host filling is used to obtain complete sub-data sets. The following conclusions are drawn through experiments:
(1) When the rank is 500, with the increase of observable items, the relative error accuracy of the proposed matrix filling algorithm is better than that of the other algorithms. When there are many observable items, the error convergence accuracy of the proposed algorithm is also obviously better.
(2) For any missing combination, the D2 of the proposed algorithm is obviously higher than that of the other two algorithms. In addition, with more correlated missing data, the filling accuracy of the proposed method remains stable.
(3) The filling accuracy of the proposed method is relatively stable: with the missing rate between 2% and 12%, the D2 value is above 0.8 and the RMSE value is between 0.15 and 0.2. The filling accuracy of the single missing mode is significantly higher than that of the multi-missing mode, because the multi-missing mode loses relatively more data and interferes more with feature extraction and restoration.
The data sets used and/or analyzed during the current study are available from the author on reasonable request.
Conflicts of Interest
The author declares no conflicts of interest.
Z. Sabir, C. M. Khalique, M. A. Z. Raja, and D. Baleanu, “Evolutionary computing for nonlinear singular boundary value problems using neural network, genetic algorithm and active-set algorithm,” The European Physical Journal Plus, vol. 136, no. 2, p. 195, 2021.
X. L. Sun, Y. Guo, N. Li, and X. X. Song, “Missing data filling algorithm based on improved neural process,” Journal of University of Chinese Academy of Sciences, vol. 38, no. 2, pp. 280–287, 2021.
R. Y. Zhao and W. J. Li, “Simulation of the influence of mobile communication delay on information parallel transmission efficiency,” Computer Simulation, vol. 37, no. 4, pp. 192–195, 2020.
Y. Y. Fu, “Face recognition using scalable constraints data fusion,” Computer Informatization and Mechanical System, vol. 3, no. 1, pp. 120–122, 2020.
C. Lane, R. Boger, C. You, M. Tsakiris, and R. Vidal, “Classifying and comparing approaches to subspace clustering with missing data,” in Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 669–677, IEEE, Seoul, Korea (South), October 2019.