Abstract

To improve load balance and storage efficiency in big data storage, this paper studies an intelligent classification method for low occupancy big data based on grid index. A low occupancy big data classification platform is built. Its infrastructure layer is designed using grid technology: grid basic services are provided by grid system management nodes and grid public service nodes, and grid application services are provided by local resource servers and enterprise grid application services. On top of the server nodes in the infrastructure layer, the basic management layer provides load forecasting, image backup, and other functional services. The application interface layer contains the interfaces required to connect the platform with each server node, and the advanced access layer provides the human-computer interaction interface for operating the platform. Finally, based on this structure, a deep confidence network is constructed by stacking several restricted Boltzmann machine (RBM) layers, the training samples are expanded by averaging adjacent values, and the deep confidence network is used to classify them. The experimental results show that the load of each virtual machine during low occupancy big data storage stays below 40% and is nearly the same across virtual machines, indicating that the method improves load balance and storage efficiency in the data storage process.

1. Introduction

In the course of operation and development, enterprise networks accumulate a large amount of low occupancy big data [1, 2]. Effective analysis of these data can reveal implicit information and knowledge, add value to the data, and support a variety of services [3, 4], which makes low occupancy big data highly valuable. These data are stored as message files or in databases and grow exponentially [5], so high-quality storage methods are required. Intelligent classification is an important big data management technology that has been applied successfully in various fields, but the high occupancy rate and low classification efficiency of traditional big data classification remain difficult to solve. In response, scholars at home and abroad have proposed the following solutions.

Reference [6] notes that, with the rapid development of sensing and digital technology, the cyber-physical system is regarded as the most feasible platform for improving building design and management. It investigates the possibility of integrating an energy management system with a cyber-physical system to form an energy cyber-physical system that promotes building energy management. However, because building occupancy is dynamic, minimizing energy consumption while maintaining the building functions of the energy cyber-physical system is a challenge. Occupant behavior is the main source of uncertainty in energy management, and ignoring it usually leads to energy waste caused by overheating and undercooling, as well as discomfort caused by insufficient heating and ventilation services. To reduce this uncertainty, an occupancy-linked energy cyber-physical system is proposed that incorporates occupancy detection based on WiFi probes. The framework uses an ensemble classification algorithm to extract three forms of occupancy information. It creates a data interface connecting the energy management system and the cyber-physical system and realizes automatic occupancy detection and interpretation by combining multiple weak classifiers for WiFi signals. A validation experiment was conducted in a large office to check the performance of the proposed system. Experimental and simulation results show that the proposed model can save about 26.4% of cooling and ventilation energy consumption with appropriate classifiers and occupancy data types. However, this method does not study data classification methods in depth.

Reference [7] addresses text detection and localization in natural images with embedded text, which remains a challenging problem in complex image environments. Foreground object segmentation and classification are common approaches to this task, so component-level object classification in a cluttered environment is an important subproblem. Proper extraction of foreground objects enables effective classification and thereby improves text detection performance. A feature vector based on equidistant pixel area distribution is proposed for text/nontext classification. The resulting feature descriptor is script invariant and is effective in real scenes. Five different pattern classifiers are used to evaluate the proposed feature set on the authors' data set. The experimental results show that the classification accuracy of the feature set exceeds 86% regardless of the script. However, this method only classifies image data and is therefore limited in scope.

Reference [8] uses three kinds of random forest classifiers to solve the problem of anomaly classification of sensor data. It specifically considers sensor deployment scenarios in which the sensor fields of view may overlap. Signal features in the time, frequency, and space-time domains are designed according to the sensor type. The results show that the proposed random forest classifiers achieve a higher true positive rate and a lower false positive rate than the unsupervised k-means method and a random forest classifier with a single signal energy feature. However, the classification effect of this method on low occupancy data is still unknown.

Grid technology can transform computer resources distributed over a wide area into computing and data processing power characterized by universality, standardization, accuracy, and economy, so as to realize wide-area resource sharing [9]. In essence, grid technology maximizes the use of existing software and hardware resources in the network, supports the storage, sharing, and computation of data and resources, alleviates the resource island problem [10], and adds value to network resources. Grid technology is widely used in data storage fields such as e-commerce and academic research. On this basis, this paper studies a low occupancy big data classification method based on grid technology, which uses grid technology to integrate low occupancy big data and realize its high-quality storage.

2. Low Occupancy Big Data Processing and Classification

2.1. Big Data Preprocessing Stage

Multisource network data are massive and high-dimensional, so feature extraction for cross-source classification requires dimensionality reduction. For multisource heterogeneous data, traditional dimensionality reduction methods cannot determine the target dimensionality for every kind of data. In addition, the data volume is large and the resulting workload is heavy, so a computationally efficient feature extraction method is needed.

Incremental orthogonal component analysis (IOCA) is selected to reduce the dimension of multisource network data and extract its features, so as to reduce the time complexity of classifying massive multisource network data. IOCA does not require a fixed target dimension; the target dimension is adjusted according to changes in the input data during learning [11]. Preprocessing massive data with this method yields better orthogonal components, avoids data redundancy, and achieves good dimensional compression.

The IOCA method uses the prefetched multisource network data to learn an orthogonal component space whose dimension is determined automatically, so as to realize rapid dimensionality reduction of multisource network data. The dimensionality reduction is realized through the following two steps:
(1) For each new data sample, calculate the potential new orthogonal component it may generate with respect to the learned orthogonal component space, together with the linear independence between the sample and that space.
(2) Set an adaptive threshold, and use it to judge whether the potential component can be added to the learned space as a new orthogonal component.

The specific calculation process of extracting data features by IOCA method is as follows:(1)Initialize the orthogonal component space with the initial dimension .(2)Use to represent the new input data and meet .(3)Use to represent the eigenvector and calculate .(4)Calculate .(5)Calculate .(6)Calculate .(7) represents the original data dimension. When , it means that belongs to the new orthogonal component. At this time, is added to .(8)Let repeat the above process until all data preprocessing is completed.
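
The incremental procedure above can be illustrated with a minimal sketch. It assumes a Gram-Schmidt-style update: each new sample is projected onto the current orthogonal components, and the normalized residual is added as a new component when its relative norm exceeds a threshold. The function names, the fixed `threshold` value, and the residual-norm test are assumptions made for illustration; the adaptive threshold of IOCA is not reproduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def incremental_orthogonal_basis(samples, threshold=0.1):
    """Grow an orthonormal basis one sample at a time.

    threshold is a hypothetical fixed cut-off on the relative residual norm;
    IOCA itself adapts this value during learning.
    """
    basis = []  # learned orthogonal component space (list of unit vectors)
    for x in samples:
        norm_x = math.sqrt(dot(x, x))
        if norm_x == 0.0:
            continue  # a zero vector carries no new direction
        # Remove the part of x already explained by the current basis.
        residual = list(x)
        for q in basis:
            coeff = dot(residual, q)
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        norm_r = math.sqrt(dot(residual, residual))
        # Accept the residual as a new component only if x is sufficiently
        # linearly independent of the existing components.
        if norm_r / norm_x > threshold and len(basis) < len(x):
            basis.append([r / norm_r for r in residual])
    return basis


def project(x, basis):
    """Low-dimensional feature vector: coordinates of x in the basis."""
    return [dot(x, q) for q in basis]


if __name__ == "__main__":
    data = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [3.0, 3.0, 0.0]]
    basis = incremental_orthogonal_basis(data)
    print(len(basis))
    print(project([2.0, 5.0, 7.0], basis))
```

In this sketch the number of retained components is not fixed in advance: it grows only when a sample brings a direction that the existing components do not already cover, which mirrors the role of the threshold test in the steps above.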

The features of multisource network data are extracted by using the corresponding orthogonal space to obtain multiple groups of feature vectors with lower dimensions [12]. The feature vectors with lower dimensions are sent to the multicore learning support vector machine classifier to realize the cross-source classification of multisource network data.

2.2. Intelligent Classification Algorithm Selection

The support vector machine is a machine learning method based on the structural risk minimization principle and statistical learning theory [13], and it has good generalization performance. It classifies multisource network data by searching for the optimal hyperplane [14]. For training samples $(x_i, y_i)$, $i = 1, \dots, n$, with labels $y_i \in \{-1, +1\}$, the optimal hyperplane is obtained by solving the following quadratic optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i} \quad \text{s.t.}\quad y_{i}\,f(x_{i}) \ge 1-\xi_{i},\ \ \xi_{i}\ge 0,\ \ i=1,\dots,n, \qquad f(x)=w^{\top}x+b \tag{1}$$

In formula (1), $\frac{1}{2}\|w\|^{2}$ and $\xi_i$ represent the regularization term and the slack variable measuring the training error of sample $i$, respectively. The smaller the value of $\xi_i$, the lower the training error and the higher the classification accuracy of the support vector machine. $C$ represents the penalty coefficient that balances the training error against the regularization term; the smaller its value, the weaker the penalty on misclassification. $f(x)$ represents the hyperplane to be solved, and $w$ and $b$ represent the normal vector and offset of the hyperplane, respectively.
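
To make the trade-off between the regularization term and the training error concrete, the sketch below evaluates the standard soft-margin objective for a candidate hyperplane. The names `w`, `b`, and `C` (normal vector, offset, penalty coefficient) and the toy data are assumptions chosen for illustration; the slack of each sample is computed as the hinge loss max(0, 1 - y(w·x + b)).

```python
def soft_margin_objective(w, b, C, X, y):
    """Return (objective, slacks) for the hyperplane f(x) = w.x + b.

    objective = 0.5 * ||w||^2 + C * sum(slacks), where each slack is the
    hinge loss of one sample and labels are +1 or -1.
    """
    slacks = []
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    regularizer = 0.5 * sum(wj * wj for wj in w)
    return regularizer + C * sum(slacks), slacks


if __name__ == "__main__":
    X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
    y = [1, 1, -1]
    for C in (0.1, 10.0):
        objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, C, X, y)
        print(C, objective, slacks)
```

With the same hyperplane, a larger `C` makes the single margin violation weigh more heavily in the objective, which is the sense in which the penalty coefficient controls how strongly misclassification is punished.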

2.2.1. Multicore Learning Method

For multisource heterogeneous data, replacing a single kernel with a combination of multiple kernels improves the interpretability of the decision function and the classification performance of the support vector machine classifier [15, 16]. The convex combination of kernel functions is as follows:

$$K(x,x') = \sum_{m=1}^{M} d_{m}K_{m}(x,x') \tag{2}$$

In formula (2), $d_m \ge 0$ and $\sum_{m=1}^{M} d_m = 1$. $M$ and $K_m$ represent the total number of kernels and the positive definite kernels defined on the same input space $\mathcal{X}$, respectively. Each basic kernel $K_m$ can be a classical kernel with different parameters.

Through the above process, choosing the data representation through the kernel is replaced by choosing the weights $d_m$.
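
As an illustration of the convex kernel combination, the following sketch mixes a linear kernel and a Gaussian kernel with non-negative weights that sum to one. The choice of basic kernels, the weight values, and all function names are assumptions made for the example rather than settings taken from the experiments.

```python
import math


def linear_kernel(x, z):
    """Plain inner product kernel."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(gamma):
    """Return a Gaussian (RBF) kernel with the given width parameter."""
    def k(x, z):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
        return math.exp(-gamma * sq_dist)
    return k


def combined_kernel(kernels, weights):
    """Convex combination of basic kernels: weights >= 0 and sum to 1."""
    if any(d < 0 for d in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")

    def k(x, z):
        return sum(d * km(x, z) for d, km in zip(weights, kernels))
    return k


def gram_matrix(kernel, X):
    """Kernel matrix of all sample pairs, as needed by an SVM solver."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


if __name__ == "__main__":
    k = combined_kernel([linear_kernel, rbf_kernel(0.5)], [0.25, 0.75])
    print(gram_matrix(k, [[1.0, 0.0], [0.0, 1.0]]))
```

Because each basic kernel is positive definite and the weights are non-negative, the combined function is again a valid kernel, so it can be passed to a standard support vector machine solver unchanged.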

Based on the gradient descent of the target value of support vector machine, the gradient descent of support vector machine solver is used to determine the combination of different problem kernels, that is, multicore learning method. The multicore learning is realized by clarifying the coefficient of the learning process of the decision function. The multicore learning formula is as follows:

In formula (3), and represent the penalty coefficient and relaxation variable, respectively, and represents the number of samples. When , the square norm of in each control objective function needs to be 0 so that the objective value is limited.

Let the Lagrange function exist as follows:

In formula (4), both and represent Lagrange multipliers related to support vector machine problems. Both and represent Lagrange multipliers related to constraints on .

Relative to the original variable, set the Lagrange function gradient to 0, substitute the set optimization conditions into the Lagrange function, and obtain the dual problem formula as follows:

The formula (5) is transformed into the dual formula of standard support vector machine by using kernel combination . is used to represent the optimal objective function of cross-source classification of multisource network data. Multicore learning has strong duality, so can be used as the objective function of the dual problem of formula (5) at the same time. Select the descending direction of the gradient descent method and update through the obtained gradient. The updating process is , where represents the step size. It is necessary to search the maximum allowable step according to the descent direction to judge whether there is a decrease in the target value. When the target value decreases, needs to be updated. Repeat the above process until the target value does not decrease.
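
For reference, a common way to write this weight-update procedure is the SimpleMKL-style formulation sketched below. The notation ($d_m$ for the kernel weights, $\alpha_i$ for the support vector machine dual variables, $\gamma$ for the step size, $D$ for the descent direction) is assumed here for illustration and may differ from the symbols used in formulas (3)–(5).

```latex
% Assumed notation: d_m kernel weights, alpha_i SVM dual variables,
% y_i labels in {-1,+1}, K_m basic kernels, C penalty coefficient.
\begin{align}
J(d) &= \max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
      - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j
        \sum_{m=1}^{M} d_m K_m(x_i,x_j)
      \quad \text{s.t.}\ \sum_{i=1}^{n}\alpha_i y_i = 0,\ 0\le\alpha_i\le C, \\
\frac{\partial J}{\partial d_m} &= -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
      \alpha_i^{\star}\alpha_j^{\star} y_i y_j K_m(x_i,x_j), \\
d &\leftarrow d + \gamma D, \qquad d_m \ge 0,\ \sum_{m=1}^{M} d_m = 1 .
\end{align}
```

Here $\alpha^{\star}$ denotes the maximizer of the inner problem for the current weights, so each evaluation of $J(d)$ and of its gradient costs one standard support vector machine training run with the combined kernel.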

2.2.2. Support Vector Machine Classification of Multicore Learning

A support vector machine solves a multiclass classification problem by combining multiple binary classifiers. Suppose the massive multisource network data contain $P$ categories; pairing the categories yields $P(P-1)/2$ binary classification tasks. The multisource network data are sent to the trained support vector machine classifiers, and $P(P-1)/2$ binary classification results are finally obtained.
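
The pairing of categories can be sketched as follows, assuming a one-vs-one scheme in which every unordered pair of classes gets its own binary classifier and the final label is chosen by majority vote. The voting rule and the function names are assumptions for illustration; with k classes the pairing produces k(k - 1)/2 binary tasks.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """All unordered class pairs; k classes give k * (k - 1) / 2 pairs."""
    return list(combinations(classes, 2))


def vote(pair_predictions):
    """Majority vote over the winners of the pairwise classifiers.

    pair_predictions maps each class pair to the class its binary
    classifier predicted; ties go to the class counted first.
    """
    counts = Counter(pair_predictions.values())
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    pairs = one_vs_one_pairs(["a", "b", "c", "d"])
    print(len(pairs), pairs)
    print(vote({("a", "b"): "a", ("a", "c"): "a", ("b", "c"): "c"}))
```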

Use to represent the target value of cross-source classification of multisource network data, search for the kernel combination that can jointly optimize all decision functions as an even number [17], and obtain the objective function of optimizing cross-source classification of multisource network data according to kernel weight as follows:

In formula (6), and , respectively, represent all binary classifier sets to be considered and the target value of support vector machine for binary classification problems related to binary classifier. The Lagrange multipliers of each binary classification problem are obtained according to the gradient descent method, and the kernel combination of all binary classification problems is obtained, that is, the sum of maximized intervals, so as to realize the cross-source classification of multisource network data.

3. Build a Big Data Classification Platform with Low Occupancy

The low occupancy big data classification platform based on grid technology is composed of four parts: advanced access layer, application interface layer, basic management layer, and infrastructure layer, as shown in Figure 1.

The infrastructure layer of the low occupancy big data private cloud platform is built with the Open Grid Services Architecture (OGSA) and the GT3 toolkit while preserving the structure of the low occupancy big data system. The basic management layer, application interface layer, and advanced access layer are developed according to the actual application requirements of low occupancy big data storage.

The infrastructure layer is the foundation of the physical and storage devices in the low occupancy big data classification platform. It mainly includes four parts: grid system management node, grid public service node, local resource server, and enterprise grid application service [7, 18]. The equipment in each part is dispersed, and some types of servers are numerous and widely distributed, so the server devices are connected over a WAN [19].

The main function of the basic management layer is to use integrated or distributed control to coordinate all storage devices in the low occupancy big data private cloud, optimize resource utilization, and provide a conflict-resolution mechanism for problems such as workflow conflicts. This layer is also responsible for data backup, data encryption, and data disaster recovery during low occupancy big data classification.

The application interface layer includes all interfaces used by the low occupancy big data classification platform to connect with other equipment and servers and provides interfaces and corresponding services to power management institutions at all levels according to the actual application requirements and corresponding levels of different power enterprises. The user can successfully log into the platform by inputting the corresponding account password through the public interface of the low occupancy big data classification platform and collect the corresponding data resources according to the account level authority.

The main function of the advanced access layer is to provide the interface required to operate the low occupancy big data classification platform. On the basis of the primary applications of low occupancy big data classification, advanced applications are developed according to the actual application requirements of power enterprises to realize human-computer interaction.

3.1. Hardware Design
3.1.1. Infrastructure Layer

Grid technology has a wide range of applications and can be deployed effectively on a LAN, a WAN, and the Internet [20]. To support internal sharing of low occupancy big data within power enterprises, a low occupancy big data network including shared areas is established using grid technology. Grid technology ensures safe access to and sharing of the different low occupancy big data resources in the grid system.

The high-speed network inside the power enterprise connects the application system servers, database servers, storage backup servers, and other servers. A grid operating system is set up together with dedicated grid scheduling servers, registration servers, server pools, and other devices, and these servers and devices are integrated into a grid system to build the low occupancy big data grid. Using the open grid service architecture and the GT3 toolkit, the infrastructure layer of the low occupancy big data private cloud platform is built while preserving the structure of the low occupancy big data system. The infrastructure layer grid is shown in Figure 2.

The infrastructure layer designed by using grid technology includes grid basic services provided by grid system management node and grid public service node, grid application services provided by local resource server, and enterprise grid application services.

Dispatching coordination server, registration server, and other servers and equipment that realize basic grid technology form a grid system management node, and different servers are equipped with hot standby machines to ensure their uninterrupted operation [21, 22]. The main function of each server in the grid system management node is to provide basic grid services and ensure the effective work of the grid system. Dispatching coordination server is the main component of grid technology and is responsible for controlling and coordinating all grid operations and services.

The storage server, computing server, and instance pool form a grid public service node [23]. The main function of the storage server is to provide data storage services in the grid, with high capacity and reliability. The computing server is usually a high-performance computer with high-precision computing performance. The instance pool contains different services in the grid, such as web services (the basis for building grid services) and public services.

As the infrastructure of grid technology, the local resource server is mainly composed of heterogeneous systems such as application systems and office automation systems [24].

Enterprise grid application service is an advanced service that defines and implements different grid applications according to the needs of power enterprises based on the foundation and public services provided by public service nodes. Based on the concept of grid, different computing resources and storage resources in low occupancy big data grid can be realized through grid services.

3.1.2. Basic Management Layer

The basic management layer includes various functional modules applied in the process of low occupancy big data classification, such as load prediction module, image backup module, data security module, physical host module, etc. The structure hierarchy of each functional module is shown in Figure 3.

3.1.3. Load Forecasting Module

Based on the application requirements of the low occupancy big data classification platform, the platform must provide power enterprises with both the historical load trend of each virtual machine (storage server) and its load prediction results, so as to support diversified use of low occupancy big data resources [25, 26]. With load forecasting, an appropriate number of additional virtual machines can be requested to balance the load before a virtual machine reaches its load upper limit [27], which alleviates access pressure. When the predicted load of a virtual machine stays below a certain value for a certain period, the virtual machine can be reclaimed to improve virtual machine utilization and meet environmental protection requirements.
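
The scaling rule described above can be sketched as a simple decision function. The threshold values `upper_limit` and `lower_limit`, the normalized load range, and the returned action names are hypothetical and would need to be set from the platform's actual load limits.

```python
def scaling_decision(forecast, upper_limit=0.8, lower_limit=0.2):
    """Decide how to adjust the virtual machine pool from a load forecast.

    forecast: predicted load values in [0, 1] for the coming time window.
    upper_limit and lower_limit are hypothetical thresholds.
    """
    if not forecast:
        raise ValueError("forecast must contain at least one value")
    if max(forecast) >= upper_limit:
        return "request_vm"   # add capacity before the limit is reached
    if max(forecast) < lower_limit:
        return "reclaim_vm"   # load stays low for the whole window
    return "keep"


if __name__ == "__main__":
    print(scaling_decision([0.30, 0.55, 0.85]))
    print(scaling_decision([0.10, 0.15, 0.12]))
    print(scaling_decision([0.30, 0.50, 0.40]))
```

Reclaiming only when the whole forecast window stays below the lower threshold avoids releasing a virtual machine shortly before a predicted load peak.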

Load forecasting is based on the load monitoring results. After preliminary processing, the load monitoring results are calculated by BP neural network [28], and the prediction results are obtained. The prediction results are stored in the database and displayed through the interface in the advanced access layer. The load forecasting process is shown in Figure 4.

3.1.4. Image Backup Module

The main function of the image backup module is to back up each virtual machine in the low occupancy big data classification platform at fixed intervals under the control of a timer, so that the virtual machine can be restored if the platform stops running because of an accident. The image backup process is shown in Figure 5.

In the process of image backup, it is necessary to judge whether the virtual machine has image backup. If so, delete the old image backup and build a new image backup to ensure the uniqueness of image backup.
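
The replace-on-backup rule can be sketched as follows. The in-memory dictionary `image_store`, the `snapshot` callable, and the record fields are hypothetical stand-ins for the platform's real image storage and snapshot mechanism.

```python
import time


def backup_virtual_machine(vm_id, image_store, snapshot, now=None):
    """Replace any existing image backup of vm_id with a fresh one.

    image_store: dict mapping vm_id -> backup record (hypothetical storage).
    snapshot: callable that returns the current image data for a vm_id.
    """
    if vm_id in image_store:
        del image_store[vm_id]          # delete the old image backup
    image_store[vm_id] = {
        "image": snapshot(vm_id),
        "created_at": time.time() if now is None else now,
    }
    return image_store[vm_id]


def run_backup_cycle(vm_ids, image_store, snapshot, now=None):
    """One timer tick: back up every virtual machine exactly once."""
    for vm_id in vm_ids:
        backup_virtual_machine(vm_id, image_store, snapshot, now)
    return image_store


if __name__ == "__main__":
    store = {"vm1": {"image": "old", "created_at": 0.0}}
    run_backup_cycle(["vm1", "vm2"], store, lambda vm: "img-" + vm)
    print(sorted(store))
```

Keeping a single record per virtual machine is what guarantees the uniqueness of the image backup: after every cycle, each virtual machine has exactly one, most recent, image.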

3.2. Software Optimization

Based on the above hardware platform, the low occupancy data obtained after the training and classification described in Section 2 are input into the deep confidence network to optimize the software part of the low occupancy big data classification platform.

3.2.1. RBM Training

When a deep confidence network (also known as a deep belief network) is used to classify low occupancy big data in complex scenes, the network is obtained by stacking several restricted Boltzmann machine (RBM) layers [29], and each RBM layer is trained separately with the contrastive divergence method.

An RBM consists of a visible layer and a hidden layer, represented by $v$ and $h$, respectively. Connection weights exist only between nodes of $v$ and nodes of $h$; there are no connections between nodes within the same layer. Given $v$, all nodes of $h$ are conditionally independent. When $v$ is input, $h$ can be obtained from the conditional probability $P(h \mid v)$, and $v$ can likewise be obtained from $h$. After the internal parameters of the RBM have been optimized, if the $v'$ reconstructed from $h$ is the same as the initial $v$, the obtained $h$ is another description of $v$.

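As a standard energy model, the energy function of an RBM can be described by

$$E(v,h) = -\sum_{i} a_{i}v_{i} - \sum_{j} b_{j}h_{j} - \sum_{i}\sum_{j} v_{i}w_{ij}h_{j} \tag{7}$$

In formula (7), $v$ and $h$ denote the states of the visible and hidden layers, and $w_{ij}$, $a_i$, and $b_j$ represent the weight between the visible and hidden layers and their respective offsets. Based on the joint configuration energy function in formula (7), the joint probability [30, 31] of $v$ and $h$ can be obtained:

$$P(v,h) = \frac{1}{Z}e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}$$

Since there is no connection between nodes within a layer of the RBM, the conditional probability can be obtained from the joint probability:

$$P(h \mid v) = \prod_{j} P(h_{j} \mid v), \qquad P(v \mid h) = \prod_{i} P(v_{i} \mid h)$$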

After the conditional probabilities of the RBM are obtained, the weights and other parameters must be learned so that the Gibbs distribution described by the network approximates the input data as closely as possible [32]. Generally, the parameters are obtained by maximizing the likelihood of the input samples.

Because this training is time-consuming, the contrastive divergence method can be used to improve training efficiency. It approximates the gradient of the log-likelihood function in two ways: (1) the expectation in the gradient is replaced by samples drawn from the conditional distributions; (2) Gibbs sampling is performed only once. The contrastive divergence algorithm obtains optimized values of the weights and other parameters quickly, thereby realizing RBM training.
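
A minimal sketch of one contrastive divergence step with a single Gibbs pass (often called CD-1) is given below for a binary RBM. The variable names (`W` for weights, `a` and `b` for the visible and hidden offsets), the learning rate, and the use of reconstruction probabilities rather than sampled values in the negative phase are assumptions for illustration, not details taken from the experiments.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden node j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """P(v_i = 1 | h) for every visible node i."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw a binary state for each node from its activation probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, W, a, b, lr, rng):
    """One contrastive divergence step (a single Gibbs sampling pass)."""
    ph0 = hidden_probs(v0, W, b)          # positive phase
    h0 = sample(ph0, rng)
    v1 = visible_probs(h0, W, a)          # reconstruction (probabilities)
    ph1 = hidden_probs(v1, W, b)          # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return sum((x - y) ** 2 for x, y in zip(v0, v1))  # reconstruction error


def train_rbm(data, n_hidden, epochs=200, lr=0.1, seed=0):
    """Train one RBM layer; returns parameters and per-epoch errors."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    a = [0.0] * n_visible
    b = [0.0] * n_hidden
    errors = []
    for _ in range(epochs):
        errors.append(sum(cd1_update(v, W, a, b, lr, rng) for v in data))
    return W, a, b, errors
```

The per-epoch reconstruction error returned by `train_rbm` is only a rough progress indicator; it is not the likelihood that the training procedure approximates.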

3.2.2. Classification Process Based on Deep Confidence Network

The main feature of the deep confidence network is that it requires a large number of training samples. Therefore, when using the deep confidence network to divide the categories of low occupancy big data, it is necessary to reconstruct the data, expand the samples, and reduce the dimension of the samples [33].

When expanding samples, new samples can be obtained by averaging adjacent samples. This method not only increases the number of training samples but also takes the spatial correlation of the samples into account. The specific process is described as follows:

Calculate the adjacent sum of each and its surrounding four directions, and divide the sum by 3 to obtain 4 times the number of new samples in the original low occupancy big data.

Big data of the same category with low occupancy are relatively uniform. Therefore, optimizing the input data through spatial combination can greatly improve the classification accuracy of the deep confidence network. However, if the input is simply combined with its surrounding neighbors for training, the dimension and redundancy will increase.

To reduce the dimension after spatial combination, principal component analysis or an autoencoder is used to reduce the dimension of the object metadata. The former mainly handles linear data, and the latter mainly handles unstructured data. Comparing the two methods, the autoencoder best preserves the characteristics of low occupancy big data metadata, while principal component analysis can describe the data with fewer dimensions [34, 35].

The image after expanding the sample data and dimensionality reduction is used as the input of the first RBM visible layer of the depth confidence network, and several training depth confidence networks are randomly determined.

RBM layer training obtains the parameter values through iteration, so that low occupancy big data can be described in another form. Let $w_{ij}$ be the weight between visible node $i$ and hidden node $j$, and let $a_i$ and $b_j$ be the visible and hidden offsets. Given the visible layer $v$, the activation probability of the $j$-th node of the hidden layer can be obtained through formulas (8)–(10) and is described as follows:

$$P(h_{j}=1 \mid v) = \sigma\Big(b_{j} + \sum_{i} v_{i}w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1+e^{-x}}$$

Meanwhile, under the condition that the hidden layer $h$ is known, the activation probability of the $i$-th visible node can be described by

$$P(v_{i}=1 \mid h) = \sigma\Big(a_{i} + \sum_{j} w_{ij}h_{j}\Big)$$

The training is carried out with the contrastive divergence method, and the values of $w_{ij}$ and the other parameters of the RBM layer are obtained after several iterations. Under this condition, $h$ is another way to describe big data with low occupancy.

Taking the hidden-layer output as the input of the next RBM layer and training iteratively in the same way yields the next hidden representation and the corresponding parameter values of each layer. This layer-by-layer training realizes the pretraining of the deep confidence network model.

In order to obtain higher classification accuracy, a BP layer optimization parameter is introduced after the last hidden layer. If and are used to represent the expected classification result and the final hidden layer output result, respectively, the residual of the two is determined by is transmitted from back to front. In each layer of , is determined as

Determine the partial derivatives of and according to of each layer:

After obtaining the partial derivative, the weights of and can be updated. After several iterations, the depth confidence network model optimization after pretraining can be realized, and the low occupancy big data classification method based on grid technology can be realized.

4. Experimental Analysis

In order to verify the effectiveness of the low occupancy big data classification method based on grid technology, taking the low occupancy data of an enterprise network in China as the research object, the low occupancy big data classification test was carried out using this method. The test results are as follows.

4.1. Experimental Environment and Parameter Setting

The experiments were run on hardware with an Intel Celeron Turing 1 GHz CPU and 384 MB of SD memory, with MATLAB 6.1 as the software environment. The simulation system includes a data interference module, a resource scheduling module, and a task generation module. The MS-COCO data set is used as the data source, and 1000 task data items are drawn arbitrarily from it. The task data are stored in the simulation system, and the data size is controlled between 256 and 568 KB. The experiment terminates after 300 iterations. The simulation algorithm parameters are shown in Table 1.

4.2. Virtual Machine Load Prediction Results

Using this method, according to the load of the research object in the past 30 days, the load trend in the next 5 days is predicted and compared with the current load. The results are shown in Figure 6.

According to Figure 6, the load predicted by this method for the storage platform during low occupancy big data classification is basically consistent with the actual load. This shows that the method can accurately predict the load in the low occupancy big data storage process, support load balancing control, and meet the actual application requirements of the research object.

4.3. Load Balancing Test

The experiment verifies the load balancing performance of this method in the process of low occupancy big data classification with two indicators of load balancing and response time. The results are shown in Figure 7.

By analyzing Figure 7(a), it can be seen that the load of different virtual machines is less than 40% and the load of each virtual machine is basically the same in the process of low occupancy big data classification of the research object by using the method in this paper. By analyzing Figure 7(b), it can be seen that the response time of different virtual machines in the process of big data classification with low occupancy is controlled between 0.5 s and 0.7 s. This shows that the load balancing degree is high in the process of low occupancy big data classification, and the classification efficiency can be significantly improved through load balancing.

4.4. Storage Synchronization Test

Storage synchronization is one of the key indexes to evaluate the performance of storage methods. The storage performance of this method is verified by taking the storage synchronization as the index, and the results are shown in Table 2.

The calculation formula of bit error rate is as follows:

$$\mathrm{BER} = \frac{N_{e}}{N} \times 100\%$$

In the above formula, $N_e$ and $N$, respectively, represent the number of error codes and the total number of codes input in the process of power application data input.
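
The bit error rate can be computed directly by comparing the input codes with the stored codes, as in the sketch below. The function name and the representation of codes as equal-length sequences are assumptions for illustration; the result is expressed as a percentage to match the values reported in Table 2.

```python
def bit_error_rate(sent, received):
    """Percentage of positions at which the received codes differ."""
    if len(sent) != len(received) or not sent:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return 100.0 * errors / len(sent)


if __name__ == "__main__":
    print(bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```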

Analysis of Table 2 shows that when the low occupancy big data input frequency gradually increases, the synchronization bit error rate of this method under the condition of different number of virtual machines shows an upward trend with the increase of data input frequency. When the number of virtual machines is 10, the synchronization error rate of this method basically maintains a linear upward trend in the low occupancy big data input frequency. When the input data frequency reaches 100 Hz, the synchronization error rate of this method is 0.013%. When the number of virtual machines rises to 30, the synchronization error rate of the method in this paper is basically the same as that under the condition of 10 virtual machines. When the number of virtual machines increases to 50, the synchronization error rate of this method increases obviously, and when the data input frequency is less than 70 Hz, the synchronization error rate fluctuation of this method is small. When the data input frequency reaches more than 80 Hz, the synchronization error rate of this method increases rapidly. When the input data frequency reaches 100 Hz, the synchronization error rate of this method is 0.089%. This shows that the number of virtual machines and low occupancy big data input frequency have a significant impact on the synchronization bit error rate of this method. When the number of virtual machines is 30, it can ensure that the synchronization bit error rate is low and meets the application needs of power enterprises.

4.5. Storage Comparison

Storage capacity is one of the main indicators to verify the data storage method. Under the same network environment and hardware facility environment, compare the storage capacity of the research object in the low occupancy big data storage process before and after using this method. The results are shown in Table 3.

According to Table 3, the average storage capacity of the research object data is 365.50 W/s before using this method and increases by 88.25 W/s after using it, to 453.75 W/s (an increase of about 24%). In terms of fluctuation, the data storage volume in the low occupancy big data storage process fluctuates noticeably before using this method and is relatively stable afterwards. The experimental results show that this method can effectively improve the storage capacity for low occupancy big data and optimize the storage process.

5. Conclusion

High-precision classification of low occupancy data helps to achieve efficient network transmission and improve data utilization. Aiming at the low efficiency and large space occupation of intelligent classification of low occupancy big data, this paper proposes an intelligent classification method for low occupancy big data based on grid index. The main contents and process of this paper are as follows:
(1) Feature selection of big data is performed by constructing an intelligent classification model.
(2) A low occupancy big data intelligent classification platform is built based on grid index technology.
(3) The constructed deep confidence network model is trained, and the selected features are intelligently classified.
(4) Simulation results show that this method offers high classification efficiency and low occupancy in the intelligent classification of low occupancy big data based on grid index.

Data Availability

The dataset used in this paper can be accessed upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by Tianjin Major Scientific and Technological Research Plan under Grant 16ZXHLSF00160; Industry University Cooperative Collaborative Education Project of Higher Education Department of the Ministry of Education under Grant 202101012003; Tianjin Education Commission Scientific Research Project of Mental Health Education Special Task under Grant 2020ZXXL-GX45.