Abstract

The massive amount of sensing and communication data that must be processed during the production of complex heavy equipment places heavy storage pressure on the cloud server side, thus limiting the convergence of sensing, communication, and computing in intelligent factories. To address this problem, this paper proposes a machine-learning-based storage optimization model for reducing the storage pressure on the cloud server and enhancing the coupling between communication and sensing data. First, based on the operation rules of the distributed file system on the cloud server, the proposed model screens and organizes the system logs. With the filtered logs, the model sets feature labels, constructs feature vectors, and builds sample sets. Then, based on the ID3 decision tree, a file elimination model is trained to analyze the files stored on the cloud server and predict their reusability. In practice, the proposed model is applied to the Hadoop Distributed File System and helps the system delete underutilized, low-value files and save storage space. Experiments show that the proposed model can effectively reduce the storage load on the cloud server and improve the integration efficiency of multisource heterogeneous data during complex heavy equipment production.

1. Introduction

With the development of information and communication technologies such as 5G-A, the Internet of Things, edge computing, and artificial intelligence, intelligent manufacturing based on these new technologies has been developing rapidly [1]. Among these technologies, the Industrial Internet, built on top of traditional production processes, enables the interconnection of equipment and people. It achieves the goal of sensing, identifying, calculating, and managing multisource production data, thus improving production efficiency and reducing production costs [2].

However, the modern Industrial Internet for complex heavy equipment production cannot be easily established, since collecting industrial control and sensing data from multiple sources in intelligent factories is challenging. Besides, the wide variety of transmission protocols and network architectures used in complex heavy equipment production lines also significantly hinders the informatization of the equipment manufacturing process. With the continuous development of technologies such as extended reality and industrial sensing in recent years, more functional and performance requirements have been put forward for the new generation of wireless networks [3]. In contrast, constrained by its established architecture and working paradigm, the traditional Industrial Internet cannot meet the needs of high bandwidth and low latency while processing multidimensional sensing data, and it cannot effectively manage the production process of complex heavy equipment.

To provide intelligent, efficient, agile, safe, and reliable end-to-end integrated services for complex heavy equipment production lines, the new generation of intelligent networks needs to evolve toward the cooperative integration of communication, sensing, and computing (ICSC) [4]. At present, along with the development of sixth-generation (6G) communication networks, ICSC-related technologies are gradually being embedded into the new 6G system in an endogenous way and are serving the field of intelligent manufacturing [5]. However, since the new ICSC system needs to synthesize the massive data transmitted by the edge network in order to integrate the perception information of a large area, the cloud servers responsible for computing face high data storage and management pressure. The traditional way to relieve this pressure is to increase the hardware investment and expand the storage scale of the cloud server [6]. However, the cost of such methods is high, since most sensing data has low value and is rarely reused. The conventional approach therefore wastes hardware resources and labor costs, and lightening the storage burden on cloud servers is becoming a hot issue in the field of the future Industrial Internet.

To save storage space on cloud servers, this paper proposes a machine-learning-based data storage optimization strategy for cloud servers in the integrated communication, sensing, and computation system. Through the proposed model, the operational overhead of cloud servers can be reduced and the space of cloud storage services optimized at the same time. The technical contributions of this paper are summarized as follows:
(1) Summarizing the access records of stored data in the distributed file system and classifying all records according to the different types of operations on the data
(2) Extracting features of data access records, constructing vector spaces, and training the deletion model for the stored data based on the ID3 decision tree
(3) Introducing the trained decision tree to screen the data in the storage server and delete low-utilization data

2. Motivation

The traditional complex heavy equipment production process is sophisticated and involves many links, each of which has different requirements for network latency and communication reliability. Therefore, wireless access technologies such as massive multiple-input multiple-output (M-MIMO), cognitive radio (CR), beamforming, and nonorthogonal multiple access (NOMA) are no longer sufficient to meet the demand for intelligent connectivity and flexible communication between multiple sensors [7–9]. To meet the needs of multidimensional perception, collaborative communication, and intelligent computing of sensor data in complex heavy equipment production scenarios, it is necessary to improve the sensing accuracy of the industrial wireless network by making full use of the joint design of the air interface protocol, the multiplexing of airtime and frequency resources, and the sharing of software and hardware. In addition, the network should improve bandwidth while reducing communication latency and combine intelligent cloud-edge-device collaborative computing to enhance network service capabilities and utilization efficiency [10–13]. For these purposes, the integrated system of communication, sensing, and computing is proposed, which enables wireless communication, ubiquitous perception, and intelligent computing to cooperate in the smart factory and jointly promote the informatization of complex heavy equipment production.

Figure 1 shows the architecture of the integrated communication, sensing, and computing system for the production of complex heavy equipment. In Figure 1, the sensing layer includes various sensing units and intelligent equipment with different functions. The sensing units are responsible for collecting multisource data, such as the production equipment’s location, vibration conditions, and temperature. The intelligent equipment achieves the integration of communication and sensing through shared RF transceivers and shared frequency bands [14]. The network layer includes base stations and edge computing devices: the base stations utilize communication-sensing fusion technology to sense the production environment on the one hand and interact with intelligent machines promptly on the other hand, while the edge computing devices take advantage of intelligent computing methods for local area network control by fusing the information received from the base stations [15]. The computation layer consists of powerful cloud servers that synthesize the wide-area sensing information sent from base stations and edge computing devices. Specifically, the computation layer is responsible for analyzing the regional sensing information to build an intelligent scheduling model that realizes global sensing and then flexibly schedules computing power, so that the communication, sensing, and computing functions can deeply support each other and improve the efficiency of complex equipment manufacturing processes [16].

In Figure 1, as the core of the integrated system for complex heavy equipment production, the computing layer is responsible for aggregating the sensing data of the integrated network elements, extracting key sensing parameters, eliminating globally irrelevant parameters, and sensing the global state of the network. It also uses its computing capability to intelligently tune the network’s energy efficiency, data rate, spectrum efficiency, regional traffic capacity, latency, number of connections, resource allocation, and other parameters [17]. All of this work relies on the powerful computing capability offered by the cloud servers. Current commercial cloud servers mainly adopt a distributed computing architecture, with a shared file system to store massive volumes of data, and combine multimachine collaborative computing technology to process data rapidly. However, as working hours increase, enormous amounts of data are constantly transferred from base stations and edge computing devices, which puts heavy storage pressure on the cloud servers [18]. In this case, a set of proven storage optimization methods will significantly improve the storage efficiency of the servers, which is essential for both reducing computing costs and optimizing computing efficiency.

Traditional storage optimization approaches include optimizing the hardware design of cloud servers and changing the architecture of the distributed computing framework. Current optimization ideas mainly fall into two categories: distributed file system load-balancing methods that use data value as a measure, and strategies that use clustering algorithms to optimize the distance between storage nodes and compute nodes to promote storage-compute integration [19, 20]. Although these methods can optimize the distributed file system and improve storage space utilization to some extent, they require changes to the underlying architecture and core allocation rules of the distributed file system and are thus difficult to implement. At present, some researchers are exploring lightweight, deduplication-based storage optimization techniques [21]. Such techniques are mainly aimed at eliminating duplicate data in the file system, but the proportion of duplicate data is small during complex heavy equipment production. In contrast, low-value data that is used only once or a few times takes up most of the storage space. In summary, the current research results are not yet able to effectively screen and delete data with low reuse in the integrated communication, sensing, and computing system.

To address the problem of excessive storage pressure on the computing layer of the ICSC system, this study analyzes the logs of cloud servers, classifies the access records of the sensing and communication data contained in them, and trains a file elimination model based on machine learning methods. The model helps the system automatically screen out low-reuse, low-value files and delete them, thereby improving the management of the distributed file system in the cloud service environment, increasing storage efficiency, and reducing storage costs.

3. Model Design

In the ICSC system, sensing data is stored as files in a distributed shared file system. Commonly used file systems include GFS (Google File System), MooseFS, SeaweedFS, GlusterFS, and HDFS (Hadoop Distributed File System). Among them, HDFS is often deployed on large-scale machine clusters because of its fault tolerance and high scalability, which makes it well suited to big data workloads, particularly offline batch processing [22]. Therefore, we take HDFS as the research object and explain the general approach of the storage optimization model for cloud servers.

3.1. Log Filtering

The design principle of HDFS is that “Moving Computation is Cheaper than Moving Data,” so HDFS keeps multiple copies of datasets on different nodes in a cluster, replacing data transfer with the transfer of computational tasks in parallel computing. Under this design, there are three ways to obtain file access logs on HDFS: through the HDFS API, through stub-based instrumentation, and by analyzing system logs. Of these, it is difficult to obtain comprehensive file access logs using the HDFS API or stubbing. Therefore, in this study, we analyze the system logs to obtain information related to file operations. On this basis, the file access records can be sorted out by combining the working principle of HDFS.

At first, it is necessary to set the filtering conditions for logs, i.e., to extract only the logs related to file access. However, in a real environment, frequent file operations in HDFS generate a large amount of log information. Therefore, the method adopted in this study is to combine the working principle of HDFS, extract only the logs that are meaningful for judging data value, and thus reduce the computation required for log analysis. In summary, the following requirements are imposed on the logs selected for analysis:
(1) The log message should contain a timestamp
(2) The log message should contain the filename, since most operations on HDFS are performed on the underlying blocks or streams and their logs do not contain filename information
(3) The log message should be representative and exclusive; it should indicate the specific operation performed on a particular file and not be confused with other operations

According to the above requirements, the smallest possible set of log content that is most relevant to analyzing file access operations can be selected from the system logs. On this basis, we can combine the working principle of HDFS to classify these logs and extract the key data used to train the file elimination model.

3.2. Key Information Extraction
3.2.1. Information Extraction for File Writing

In HDFS, files follow a write-once model and cannot be modified after they are written, so a file write operation is also a file creation operation. The principle of the write operation is shown in Figure 2. First, the client initiates a file creation request to the NameNode. If the NameNode confirms that the file to be created does not exist, it creates a new file in the namespace and allocates space for it. After the creation operation, the NameNode returns an FSDataOutputStream to the client, initializes the data flow pipeline of the DataNodes, and then opens the DataNode’s data reception service. At this point, the client writes data to the FSDataOutputStream through the data stream. Specifically, the FSDataOutputStream divides the data into chunks, stores them in the data queue, and organizes them into packets to send to the DataNode. The DataNode receives the packets and passes them to other DataNodes through the data stream pipeline. Accordingly, each DataNode that receives the data returns an acknowledgment message. Finally, the client closes the data stream and sends a “completeFile” message to the NameNode, and the write operation ends.

To summarize the flow in Figure 2, in all write operations, the operations that interact with the distributed file system include the following: (1) the file creation request, (2) creating the new file and allocating space, (4) sending packets, (5) acknowledgment, and (7) completing the file. However, no log information related to operation (1) is stored in the log system. In addition, operations (4) and (5) both represent data writing to distributed files, so only operation (4) is kept in the proposed model. In summary, as Table 1 shows, the critical operations of file writing are operations (2), (4), and (7), and the logs corresponding to these operations can be collected and used for subsequent analysis.

3.2.2. Information Extraction for File Reading

Reading a file is more straightforward than writing one; it involves fewer operations and, correspondingly, fewer logs. The basic workflow is shown in Figure 3. First, the client initiates an open-file request to the NameNode, which confirms the client’s operation permission and then returns the list of corresponding data blocks. Afterward, the client obtains an FSDataInputStream for reading the file.

As Figure 3 illustrates, the client reads the file through the FSDataInputStream by reading the data blocks that make up the file sequentially. Specifically, the FSDataInputStream selects the DataNode nearest to the client from all the nodes holding the first data block, connects it to the client, and starts reading. It then selects the nearest DataNode containing the next data block and continues reading. When all the data blocks of the file have been read, the client closes the data stream and sends a “completeFile” message to the NameNode. During this process, if a communication error occurs on a DataNode, the client automatically connects to the next DataNode containing the same data block; the faulty node is removed and is not connected again.

To summarize the flow in Figure 3, the operations that interact with the distributed file system include the following: (1) the open-file request, (2) returning the data block list, (4) reading data, and (6) completing the file. However, the logs related to operation (1) are not stored in the system log. In addition, the logs for operations (2) and (4) do not contain the filename, so it is impossible to decide from such records which files should be eliminated. Therefore, only the logs for operation (6) can be collected and used for analysis, and the details are illustrated in Table 2.

Combining Tables 1 and 2, it is easy to see that either writing to or reading from a file will generate logs with the keyword “completeFile,” indicating the completion of the write or read file operation. Such logs record the timestamp and filename during file access and are representative and exclusive. Therefore, we can use filters, combined with lambda expressions, to obtain all information containing the completeFile keyword from all log records and extract the timestamps and filenames for file elimination model training.
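For illustration, the following PySpark sketch shows how such a keyword filter could look. The exact layout of the NameNode log lines, and therefore the regular expression and the log path, is an assumption here and would need to be adapted to the actual HDFS version in use.

```python
# Hedged sketch of the completeFile filtering step; the log format, the regex,
# and the log path are assumptions, not the paper's exact implementation.
import re
from pyspark import SparkContext

sc = SparkContext(appName="CompleteFileLogFilter")

# Assumed pattern: a timestamp at the start of the line, the keyword
# "completeFile", and a file path somewhere after it.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*completeFile.*?(?P<path>/\S+)")

def parse_complete_file(line):
    """Return (filename, timestamp) if the line is a completeFile record, else None."""
    match = LOG_PATTERN.search(line)
    return (match.group("path"), match.group("ts")) if match else None

logs = sc.textFile("hdfs:///logs/namenode/")                     # assumed log location
access_records = (logs
                  .filter(lambda line: "completeFile" in line)   # keep completeFile logs only
                  .map(parse_complete_file)
                  .filter(lambda rec: rec is not None))          # drop unparsable lines
```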

3.2.3. Information Extraction for File Deletion

Writing to and reading from a file can be collectively called accessing the file. Unlike file access operations, file deletion operations are more complicated, and accordingly, the log analysis for file deletion is also more tedious. This is because the system logs only record deletion operations on a per-block basis, so it is not straightforward to identify file deletion records in HDFS from the system log alone. To overcome this difficulty, it is still necessary to first sort out which logs are helpful for model training and extract the critical information from them in conjunction with the workflow of file deletion in HDFS. The basic process of file deletion is shown in Figure 4. First, the client initiates a file deletion request to the NameNode. The NameNode then looks up the data blocks of the file in the namespace and adds the corresponding blocks to the invalid block list. Finally, the DataNode obtains the list of invalid blocks from the NameNode through the heartbeat mechanism and then removes the data of the corresponding blocks.

Summarizing the process in Figure 4, the logs related to file deletion are illustrated in Table 3. However, as Table 3 shows, both the NameNode and the DataNode record the name of the block to be deleted, which is redundant. Therefore, to reduce the log volume, only the file deletion information on the NameNode is collected as the basis for the subsequent training of the file elimination model.

Besides, it can also be seen from Table 3 that these logs do not contain the filename, so it is necessary to recover the details of the deleted files with a block-matching approach. Specifically, by summarizing the collected logs, it can be concluded that the block name is included in both the file creation logs and the deletion logs. Therefore, the file to which a reclaimed block belongs can be located by finding the file creation log that contains the same <block name> as the deletion log. In this way, the file deletion time can also be determined, and all the records related to file deletion can be obtained.
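As a sketch of this block-matching idea, the snippet below joins deletion records, which only carry the block name, with block allocation records, which carry both the block name and the filename; the record layouts used here are illustrative assumptions rather than the paper's exact data structures.

```python
# Hedged sketch: recover the filename of a deleted block by joining deletion
# records with the latest allocation record of the same block. The tuple
# layouts below are assumptions made for illustration.
def match_deletions_to_files(allocation_records, deletion_records):
    """
    allocation_records: RDD of (block_name, (filename, alloc_time))
    deletion_records:   RDD of (block_name, delete_time)
    returns:            RDD of (filename, delete_time)
    """
    # Keep only the most recent allocation per block in case a block name recurs.
    latest_alloc = allocation_records.reduceByKey(
        lambda a, b: a if a[1] >= b[1] else b)

    # Join on the block name, then keep the filename and the deletion time.
    return (deletion_records
            .join(latest_alloc)                       # (block, (del_time, (file, alloc_time)))
            .map(lambda kv: (kv[1][1][0], kv[1][0]))  # (filename, delete_time)
            .distinct())
```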

Eventually, by analyzing the logs corresponding to file writing, file reading, and file deletion operations, respectively, structured file operation records can be extracted, and critical information such as filenames and timestamps can be obtained. The specific process is to extract and organize all the information containing the keywords from the selected system logs and sort the records according to their timestamps. Then, a type flag is set for each sorted record: type = 1 represents a deletion operation on a file, and type = 0 represents an access operation on a file. The filename is denoted by $f$, and the time at which the operation occurred is denoted by $t$, as shown in Table 4. All the sorted logs are used to construct the sample set, which is then used to train the file elimination model.

3.3. Model Training

After log filtering and analysis, the file elimination model is built from these labeled logs. In this study, the ID3 decision tree is introduced for model training, and the details are as follows.

The first step of model training is to determine the feature labels. We specify the label of the sample set as “can be deleted or not.” A tuple labeled “yes” is a positive sample, meaning that the possibility of file reuse is low and the file can be deleted. A tuple labeled “no” is a negative sample, indicating that the file may be reused and should be kept. Each file access record or deletion record can then be transformed into a labeled feature-vector tuple. The type flag in Table 4 is employed to derive the label of the tuple. The primary correspondence rules for labels are as follows, for any file operation record $(f, t, \mathrm{type})$:
(1) If type = 0, the record is a file access log. It indicates that the corresponding file was reused at the time of this operation; the file still has a possibility of reuse at time $t$ and cannot be deleted. The tuple is labeled “no” and is a negative sample
(2) If type = 1, the record is a file deletion log. It means that the corresponding file no longer has any possibility of reuse at or after time $t$ and can be deleted. The tuple is labeled “yes” and is a positive sample

The mapping relationship between specific file operation logs and samples is in Table 5.

Feature extraction is performed after labeling the sample set in order to provide the basis for training the file elimination model. During the extraction, the possibility of file reuse is judged from the file access and deletion records. For instance, assume that the timestamp of an operation on file $f$ is $t$. Correspondingly, the creation date of file $f$ is denoted by $t_c$, the date of the last access by $t_a$, and the number of times the file has been accessed by time $t$ by $n$. Three features can then be extracted as follows.

The length of file existence $F_1$ can be calculated by
$$F_1 = t - t_c. \tag{1}$$

The length of time since the file was last accessed, $F_2$, can be calculated by
$$F_2 = t - t_a. \tag{2}$$

The average daily access frequency since the file was created, $F_3$, can be derived by
$$F_3 = \frac{n}{t - t_c}, \tag{3}$$
where the time difference is measured in days.

After the calculation, $F_1$, $F_2$, and $F_3$ are used as the feature values to construct the feature vectors, and an illustration of a calibrated sample set is presented in Table 6.
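To make Equations (1)–(3) concrete, the following plain-Python sketch computes the three features and the label for one operation record of a file, assuming the file's operation history is available as sorted (timestamp, type) pairs; the helper name build_sample is illustrative and not part of the paper's implementation.

```python
# Sketch of the feature extraction in Equations (1)-(3); the history layout and
# the helper name are assumptions for illustration.
from datetime import datetime

def build_sample(history, op_index):
    """Return ((F1, F2, F3), label) for the op_index-th operation of one file."""
    t = history[op_index][0]                       # timestamp of the current operation
    t_c = history[0][0]                            # creation time = first record
    accesses = [ts for ts, op in history[:op_index + 1] if op == 0]
    t_a = accesses[-1] if accesses else t_c        # last access time so far
    n = len(accesses)                              # number of accesses by time t

    days_alive = max((t - t_c).days, 1)            # avoid division by zero on day 0
    f1 = (t - t_c).days                            # Eq. (1): length of existence
    f2 = (t - t_a).days                            # Eq. (2): time since last access
    f3 = n / days_alive                            # Eq. (3): average daily access frequency

    label = 1 if history[op_index][1] == 1 else 0  # deletion -> "yes", access -> "no"
    return (f1, f2, f3), label

# Example: a file created on Jan 1, accessed on Jan 3, deleted on Jan 20.
hist = [(datetime(2019, 1, 1), 0), (datetime(2019, 1, 3), 0), (datetime(2019, 1, 20), 1)]
print(build_sample(hist, 2))   # ((19, 17, 0.1052...), 1)
```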

Once the calibrated dataset is obtained, it can be divided into training and test sets, and the file elimination model can then be trained and evaluated to decide whether files should be deleted. The ID3 decision tree is employed to build the file elimination model in this study. ID3 uses information entropy, and the information gain derived from it, as the criteria for selecting the division attributes and division values of the decision tree. Information entropy is introduced to quantify the amount of meaningful information and is calculated as
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i. \tag{4}$$

In Equation (4), $D$ represents the dataset, $m$ denotes the number of classes, and $p_i$ refers to the probability that the label of a tuple in $D$ is $C_i$, which can be estimated by $|C_{i,D}|/|D|$.

Besides, when a feature $A$ is regarded as a division node of the decision tree, its information entropy is calculated according to Equation (5):
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j), \tag{5}$$
where $v$ is the number of subsets $D_1, \ldots, D_v$ into which $A$ partitions $D$.

When the information entropy of a feature has been calculated, the information gain can be obtained according to
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D). \tag{6}$$

Once the information gains are calculated, a greedy strategy is employed: in each round of nonleaf node generation, the attribute with the largest information gain is chosen as the splitting attribute. The process continues until the decision tree is built from top to bottom, as shown in Algorithm 1.

Input: The calibrated training set: D
    List of attributes: attribute_list
    Attribute selection method: Attribute_selection_method
Output: Decision tree
Method:
1. N = new Node();
2. if tuples in D are all of the same class C:
3.  N.isLeaf = true; N.label = C
4.  return N
5. if attribute_list is empty:
6.  N.isLeaf = true; N.label = majority class in D
7.  return N
8. splitting_criterion = Attribute_selection_method(D, attribute_list)
9. label N with splitting_criterion
10.   foreach outcome j of splitting_criterion
11.    Dj = {data tuples in D satisfying outcome j}
12.    if Dj is empty
13.     attach a leaf labeled with the majority class in D to N
14.    else
15.     attach the node returned by Generate_decision_tree(Dj, attribute_list) to N
16.   return N

In Algorithm 1, the file elimination model is trained with a dataset in which each element contains three attributes: the length of existence $F_1$, the length of time since the last access $F_2$, and the average daily access frequency since file creation $F_3$. The depth of the decision tree is 3.
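For reference, the sketch below shows one way the attribute selection step of Algorithm 1 (Equations (4)–(6)) could be implemented. Classic ID3 handles discrete attributes; since $F_1$, $F_2$, and $F_3$ are continuous, the sketch evaluates binary threshold splits, which is a common adaptation and an assumption on our part rather than the paper's exact procedure.

```python
# Compact sketch of entropy and information gain (Eqs. (4)-(6)) with a binary
# threshold split; samples are ((F1, F2, F3), label) tuples.
from collections import Counter
from math import log2

def entropy(labels):
    """Eq. (4): Info(D) = -sum(p_i * log2(p_i)) over the class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(samples, attr_index, threshold):
    """Eq. (6): Gain(A) = Info(D) - Info_A(D) for a binary threshold split."""
    labels = [label for _, label in samples]
    left = [label for features, label in samples if features[attr_index] <= threshold]
    right = [label for features, label in samples if features[attr_index] > threshold]
    # Eq. (5): weighted entropy of the partition induced by the split.
    info_a = sum(len(part) / len(samples) * entropy(part)
                 for part in (left, right) if part)
    return entropy(labels) - info_a

def best_split(samples):
    """Greedy choice: try every (attribute, threshold) pair and keep the best gain."""
    candidates = [(a, f[a]) for f, _ in samples for a in range(len(f))]
    return max(candidates, key=lambda c: info_gain(samples, c[0], c[1]))
```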

Once the decision tree is well trained, the model can be used to analyze the files stored in HDFS and predict their reusability. The specific prediction process is as follows: for each file, all of its operation records are extracted from the system log by filename, and a feature vector is obtained by calculating $F_1$, $F_2$, and $F_3$ for the file at the current moment. The vector is then fed to the trained decision tree, which returns the label “can be deleted” or “recommended to keep.”

4. Model Implementation

On the basis of the theoretical model, we also describe the model implementation and how the model works during the production of complex heavy equipment. In a practical production environment, a tremendous amount of multisource heterogeneous data is generated in the custom manufacturing process of complex heavy equipment, and accordingly, the cloud service side generates tens of millions of system log entries. In order to process these massive logs offline in a timely manner, we use Apache Spark as the implementation platform of the proposed model [23]. By transforming system logs into resilient distributed datasets and using the machine learning library MLlib in Apache Spark, we can quickly analyze the stored files and provide timely deletion feedback to clients.

4.1. System Architecture

According to the theoretical model, the file elimination system consists of four modules: file operation record extraction, feature extraction, decision tree training, and prediction of file reusability. The first two modules extract the operation records of files in a specified date range through log analysis. The decision tree training module is responsible for building and training the decision tree, which only needs to be done once; the trained tree can then be used for a long time. The last module shows users the files with low reusability on HDFS and suggests their deletion. The overall architecture of the system is shown in Figure 5.

As shown in Figure 5, we first read the logs from HDFS for a specified recent period, then filter and organize them into structured file operation records. Second, the structured records are imported for feature extraction, the labeled records are passed to decision tree training, and the decision tree is evaluated in terms of accuracy, precision, and recall. Finally, if the trained decision tree meets the expected functional and performance requirements, the list of files is imported from HDFS, and the trained decision tree is used to predict the reusability of individual files and organize them into a list of files recommended for elimination, which is returned to the user. Such a procedure not only facilitates the transfer of data between different modules but also reuses the results of the previous stage, reduces response time, and allows the proposed system to be saved and reused over the long term.

According to the system architecture, the entity classes of the system and their interrelationships are shown in Figure 6. In Figure 6, the proposed system is constructed with five entity classes and one enumeration type. The DecisionTree class is the decision tree for file elimination, which predicts the reusability of a file and holds a reference to the root node of the decision tree in its attributes. The Node class represents a node of the decision tree and forms a many-to-one relationship with the DecisionTree class. The Node class also contains the split attribute splitAttr, the split value splitValue, the left and right subtrees, and other information. The LabeledPoint class is the basis for training the decision tree; it can be thought of as a labeled record, which is the result of feature extraction. The FileOperationRecord class is the structured file operation record class, while the File class is a helper class that contains detailed file information and maintains a one-to-many relationship with the FileOperationRecord class. Finally, Attribute is an enumerated type that lists the attribute types corresponding to the internal attributes of the LabeledPoint class. The Attribute type is introduced to identify split attributes in the Node class, and the dashed arrows in Figure 6 indicate the dependencies of the Node class and the LabeledPoint class.
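A simplified sketch of these entity classes is given below; the field names beyond those mentioned in the text are illustrative assumptions intended only to show how the classes relate.

```python
# Simplified sketch of the Figure 6 entity classes; extra field names are
# assumptions for illustration only.
from enum import Enum

class Attribute(Enum):            # enumeration of the three split attributes
    F1_EXISTENCE = 1
    F2_LAST_ACCESS = 2
    F3_ACCESS_FREQUENCY = 3

class Node:
    def __init__(self, split_attr=None, split_value=None, label=None):
        self.splitAttr = split_attr     # Attribute used for the split (None for a leaf)
        self.splitValue = split_value   # threshold value of the split
        self.label = label              # "yes"/"no" when the node is a leaf
        self.left = None                # left subtree
        self.right = None               # right subtree

class DecisionTree:
    def __init__(self, root):
        self.root = root                # reference to the root Node

class FileOperationRecord:
    def __init__(self, filename, timestamp, op_type):
        self.filename = filename
        self.timestamp = timestamp
        self.op_type = op_type          # 0 = access, 1 = deletion

class File:
    def __init__(self, filename):
        self.filename = filename
        self.records = []               # one-to-many with FileOperationRecord

# MLlib's own LabeledPoint(label, features) plays the role of the LabeledPoint
# entity class, so it is not redefined here.
```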

4.2. Log Analysis and Sample Set Construction

The following describes the implementation details of each module of the system. The first two modules are log analysis and sample set construction, whose role is to obtain and analyze the records related to file operations in order to collect key information such as filenames and the times at which file operations occur. The collected information is used to construct the sample sets for training the decision tree. For complex heavy equipment manufacturing processes, the scale of log records to be processed is large and can generally reach several gigabytes or even terabytes. Therefore, we take advantage of Apache Spark to analyze and process the logs and achieve fast processing of massive data through multistage transformation and action operations.

Figure 7 depicts the detailed data flow of the log analysis based on Apache Spark. The figure shows that file operation record extraction during log analysis proceeds along two paths: file access record extraction and file deletion record extraction. First, using the PySpark SDK provided by Apache Spark, the system logs are read from HDFS and transformed into a resilient distributed dataset. Second, critical information such as filename and access time is extracted from the file access logs in the dataset. Similarly, key information is extracted from the block recovery logs in the dataset to obtain the records of data blocks deleted from HDFS. Then, the latest allocation record of each block is found in the block allocation logs to determine which file the deleted block belongs to, and the deletion record of the file is obtained by matching the block and the file. Finally, the file access and deletion records are combined into the structured file operation records.

With lambda expressions (anonymous functions) and Apache Spark’s parallel filter and action operations, all these procedures can be easily accomplished on the computing cluster [24]. First, we construct conditional expressions with lambda expressions to identify keywords in the logs. These conditional expressions are then passed to the filter function as filtering conditions, retaining only the elements of the resilient distributed dataset that satisfy the requirements. After that, the operations that extract filenames and times are expressed as anonymous functions. Finally, these anonymous functions are mapped over the dataset and executed concurrently when an action is triggered, thereby generating the structured file operation records. The execution flow of the filter and action operations in Apache Spark is demonstrated in Figure 8.

After the log extraction, the amount of data to be processed is significantly reduced. Therefore, the PySpark SDK can be applied directly to extract features from the newly generated structured file operation records, and a sample set consisting of multiple LabeledPoint objects can be obtained by Equations (1), (2), and (3).
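As an illustration, the following sketch maps a structured file operation record to an MLlib LabeledPoint; the helper compute_features stands in for the Equation (1)–(3) computation, and the record layout is an assumption.

```python
# Hedged sketch of turning structured records into MLlib LabeledPoint objects;
# compute_features and the record layout are assumptions for illustration.
from pyspark.mllib.regression import LabeledPoint

def to_labeled_point(record, compute_features):
    f1, f2, f3 = compute_features(record)          # features at the operation time
    label = 1.0 if record.op_type == 1 else 0.0    # deletion -> positive sample
    return LabeledPoint(label, [f1, f2, f3])

# samples = structured_records_rdd.map(lambda r: to_labeled_point(r, compute_features))
```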

4.3. Decision Tree Training and Application

In this study, 30% of the sample set is selected for building and training the decision tree, and the remaining 70% is used as a test set for evaluating the decision tree’s performance. If the decision tree is judged to be well trained after evaluation, it is saved as a JSON file on HDFS and used for file deletion over the long term.

The decision tree is trained with the machine learning library MLlib offered by Apache Spark; the maximum depth of the decision tree is set to 3, and the impurity measure is information entropy. Specifically, we split the procedure into two parts: the training module, which obtains a parametric model from the training samples, and the prediction module, which uses the parametric model to make predictions on the test samples and output the predicted values. The details of the training process are demonstrated in Figure 9.
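In code, the corresponding MLlib call could look as follows; the variable samples is an assumption carried over from the earlier sketches, while the impurity and depth settings match the text.

```python
# Sketch of the MLlib training call: 30% of the labeled samples for training,
# the rest for evaluation, entropy impurity, maximum depth 3.
from pyspark.mllib.tree import DecisionTree

train_rdd, test_rdd = samples.randomSplit([0.3, 0.7], seed=42)

model = DecisionTree.trainClassifier(
    train_rdd,
    numClasses=2,                 # "can be deleted" vs. "recommended to keep"
    categoricalFeaturesInfo={},   # all three features are continuous
    impurity="entropy",           # information entropy, as in ID3
    maxDepth=3)

# Evaluate on the held-out 70%.
predictions = model.predict(test_rdd.map(lambda p: p.features))
labels_and_preds = test_rdd.map(lambda p: p.label).zip(predictions)
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test_rdd.count())
```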

As shown in Figure 9, the machine learning task is first delivered to the parser for preprocessing. Then, the logical learning plan (LLP) participates in selecting the training strategies, including the type of decision tree to be adopted and how to choose the parameters. Afterward, the task is transferred to the optimizer for tuning. The optimizer is the core of MLlib; in it, the processed dataset is divided into segments, each of which is assigned a set of candidate decision trees and parameters to test which combination performs best. Then, the processed data is submitted to the physical learning plan (PLP) for physical execution. Since MLlib is based on a master-slave distributed model, the master node distributes the tasks, with appropriate strategies and measures, to the slave nodes to compute the results. Finally, the master node collects the results, forms the final output, and returns it to the client. The end of decision tree training means that the file elimination model has been built. At this point, the current list of files can be read from HDFS, and the trained file elimination model can be used to predict whether each file can be deleted.
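A minimal sketch of this prediction stage is shown below; the current_features helper and the file-list RDD are assumptions standing in for the outputs of the log analysis modules.

```python
# Hedged sketch of the prediction stage: build the current feature vector for
# each file in HDFS and ask the trained model whether it can be deleted.
from pyspark.mllib.linalg import Vectors

def recommend_deletions(model, file_list_rdd, current_features):
    """Return an RDD of filenames the model predicts as safe to delete."""
    feature_vectors = file_list_rdd.map(
        lambda name: (name, Vectors.dense(current_features(name))))
    predictions = model.predict(feature_vectors.map(lambda nv: nv[1]))
    return (feature_vectors.map(lambda nv: nv[0])
            .zip(predictions)
            .filter(lambda np: np[1] == 1.0)      # 1.0 -> "can be deleted"
            .map(lambda np: np[0]))
```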

5. Experiment and Analysis

Finally, we design specific experiments to verify the system’s functional and performance metrics from the perspective of empirical evaluation. For functional evaluation, we measure precision, recall, F1-score, and related indicators. For performance evaluation, we set up a scalability test to verify the system’s performance.

5.1. Experiment Scheme

The experimental environment was constructed with the help of the Shaanxi Provincial Key Laboratory of Network Computing and Security Technology, and the proposed system is implemented on a cluster of seven homogeneous computers. In terms of hardware configuration, each machine is equipped with an Intel® Core i7-12700K CPU, 32 GB of RAM, and a 1 TB NVMe solid-state drive; the machines are on the same LAN and interconnected by a Cisco SG220-52-K9-CN 10-gigabit switch. As for the software configuration, Spark 1.5.1 and HDFS from Hadoop 2.6.0 are adopted to build the cluster. Ubuntu 15.10 is chosen as the operating system, and the development environment is Python 3.5.0.

For the experimental data, in cooperation with China National Heavy Machinery Research Institute Co., Ltd., we acquired the sensing and communication data generated during the production of the 20,000-ton horizontal extruder for hard-to-deform alloys. These experimental data were divided into five groups according to the acquisition time to verify the function and performance of the system under different amounts of data and load conditions.

5.2. Functional Experiment of the Model

In the functional experiments, we introduced the $k$-fold cross-validation method, a standard method for classifier evaluation [25]. This method divides the original dataset $D$ into $k$ disjoint subsets of similar size, $D_1, D_2, \ldots, D_k$. The model is then trained and tested $k$ times: in the $i$-th round, $D_i$ is kept as the prediction set, and the remaining data forms the training set. Finally, the outcome of the functional evaluation is obtained by averaging all the test results.
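A minimal plain-Python sketch of this protocol is shown below; evaluate is a placeholder for training the decision tree on the training folds and scoring it on the held-out fold.

```python
# Sketch of k-fold cross-validation over an in-memory sample list; evaluate()
# is a placeholder for the actual train-and-score step.
import random

def k_fold_scores(samples, k, evaluate, seed=0):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]          # k disjoint, similar-sized subsets
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [s for j, f in enumerate(folds) if j != i for s in f]
        scores.append(evaluate(train_folds, test_fold)) # train on k-1 folds, test on the i-th
    return sum(scores) / k                              # average of the k test results
```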

Besides, the labels of the classifier can be divided into positive and negative labels. The positive label corresponds to the positive samples, which are the part the classifier is interested in and wants to retrieve; the negative label corresponds to the negative samples, which are the remaining part. In the experiment, the positive label “yes” indicates that the file can be deleted, and the negative label “no” means that the file is recommended to be kept. The correspondence between the possible label combinations and the prediction results is as follows:
(1) True positive (TP): the predicted label is positive, and the prediction is correct
(2) True negative (TN): the predicted label is negative, and the prediction is correct
(3) False positive (FP): the predicted label is positive, and the prediction is incorrect
(4) False negative (FN): the predicted label is negative, and the prediction is incorrect

As for the evaluation indicators, precision, recall, and F1-score are selected to evaluate the model’s functionality. Precision describes the accuracy of the returned positively labeled results; recall describes how many of all positively labeled tuples can be retrieved correctly; and F1-score is the harmonic mean of precision and recall. The three indicators are calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

The experimental data were provided by China National Heavy Machinery Research Institute Co., Ltd., and consist of historical log data collected from one of its production lines from 2018 to 2019. After the experiments and calculations, the results are demonstrated in Table 7.

It can be seen from Table 7 that the average recall of our model is as high as 97.7%, indicating that it can find most of the useless files in HDFS, so the model’s functionality meets the general requirements for classifiers. However, the average precision of the model, 73.2%, is lower than the recall. The main reasons for this result are, first, that files on HDFS are usually not managed in a disciplined way and are not deleted immediately after they lose their usefulness, and second, that during the training process the negative samples (file access logs) far outnumber the positive samples (file deletion logs), which leads to an imbalanced training set and affects the training on positive samples. Therefore, in future work, we will consider iteratively training the decision tree with a sample extraction method to improve both its recall and precision. Finally, the mean F1-score, which combines precision and recall, is 83.7%, indicating that the model performs well in terms of overall functionality and can recommend files for elimination to clients with reasonable accuracy.

5.3. System Performance Experiments

In addition to the function experiment, we also evaluate the system’s performance through the scalability test. Scalability is an essential measure of parallelism and reveals the ability of an algorithm to take advantage of the massive computational resources a distributed system provides. However, the increase of this capability is usually limited because as the computational resources increase, the corresponding scheduling and parallelism overheads also gradually rise, affecting the system’s overall performance.

The experimental scheme is designed as follows: first, record the system’s running time for different cluster sizes on a fixed dataset; then, fix the cluster size and measure the time taken to process datasets of different sizes. In our experiments, the cluster size is set to 1, 2, 4, or 6 computing nodes, and the experimental datasets are arranged according to their sizes. After the performance experiment, the system’s running time is calculated and shown in Figure 10.

It can be concluded from Figure 10 that, for a fixed dataset, the time consumption of our system gradually decreases as computational nodes are added, indicating that the system can effectively utilize computational resources and reduce running time. In addition, when the data size is small, investing more computing resources does not significantly improve the system’s efficiency, but when the data volume is large, the system’s utilization of computing resources keeps improving, and the running time changes markedly. Finally, although the system’s time consumption is positively correlated with the data size under any cluster size, the slope of the time consumption curve is steep for the small-scale cluster, whereas the curve changes gently and has a slight slope under the large cluster size. This indicates that, despite the additional parallel overhead, the system still has good scalability and can fully utilize the cluster computing resources, especially when processing large-scale datasets.

6. Conclusions

In order to solve the problem of excessive storage pressure caused by the massive and complicated data in the integrated communication, sensing, and computation system, this paper proposes a storage optimization model for the cloud server based on machine learning techniques. The proposed model first collects the file operation history of the distributed file system as the basis for analysis. Then, three features, namely the length of existence, the length of time since the last access, and the average daily access frequency since creation, are extracted from the collated history records and used to form the corresponding feature vectors. With these feature vectors, a file elimination model based on the ID3 decision tree is established to determine the reusability of files stored in the file system. Finally, by periodically suggesting that users delete low-value or rarely accessed data, the spatial optimization of the distributed file system can be achieved. In practice, the file elimination system based on the proposed model is implemented on Apache Spark to reduce the storage consumption of HDFS, a popular distributed file system in the ICSC system, and experiments with practical production data show that the system functions as expected. The recall of the system is high, and it can accurately find the files that should be eliminated in HDFS. We also evaluate the system’s scalability; the results show that the system meets the expected performance goals and has high availability. In summary, the proposed model can be employed to optimize the storage space on the cloud service side of the integrated communication, sensing, and computation system during the production of complex heavy equipment. However, there is still room for optimization of the current model; for example, in future work, we plan to improve the decision tree through iterative training with a sample extraction method to further enhance its accuracy.

Data Availability

The experiment data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

All authors declare no conflict of interest in this paper.

Acknowledgments

This research work is supported by the National Key Research and Development Plan (No. 2018YFB1703003), Natural Science Foundation of Shaanxi Province (No. 2020JM-537), Basic Research in Natural Science and Enterprise Joint Fund of Shaanxi (No. 2021JLM-58), National Natural Science Foundation of China (No. 61861018), and Jiangxi Provincial Natural Science Foundation (No. 20212BAB212001).