Abstract
Because sensor nodes have limited energy, storage capacity, and computing ability, the increasing amount of sensing data has become a challenge in wireless sensor networks (WSNs). To decrease the additional power consumption and extend the lifetime of a WSN, a multistage hierarchical clustering deredundancy algorithm is proposed. In the first stage, a dual-metric distance is employed, and redundant nodes are preliminarily identified by an improved k-means algorithm to obtain clusters of similar nodes. In the second stage, a Gaussian hybrid clustering classification algorithm is presented to implement data similarity clustering for edge sensing data. In the third stage, the clustered sensing data are randomly weighted to deduplicate the spatially correlated data. Detailed experimental results show that, compared with existing schemes, the proposed deredundancy algorithm achieves better performance in terms of redundant data ratio, energy consumption, and network lifetime.
1. Introduction
Wireless sensor networks (WSNs) are common in people's lives and are widely used in various fields [1, 2]. WSNs are deployed in different areas to monitor environmental conditions and events, such as temperature, humidity, and seismic events [3, 4]. To obtain accurate sensing data for events, a large number of sensors are utilized to collect the edge sensing data and transmit the data to an aggregation/sink node at a high frequency. In general, edge sensing data have a high spatial-temporal correlation and contain considerable redundant information [5, 6]. The transmission of redundant data leads to unnecessary energy consumption and bandwidth costs, which increase the overhead and decrease WSN lifetimes. Therefore, reducing redundant data and the transmission energy consumption to extend network lifetimes is a key issue in WSNs.
To reduce redundant data effectively, the existing work concentrates on two aspects: optimizing sensing data and predicting sensing data. On the one hand, the former aims to reduce redundant sensing data with optimized schemes. Considering the constrained resources in WSNs, a spatial-temporal correlation data reduction scheme was proposed to determine the optimal sampling strategy for the deployed sensor nodes (SNs) [7]; the strategy reduces the overall sampling/transmission rates while preserving the quality of the data. Considering that the data volume in WSNs increases at unexpected rates, an integrated divide-and-conquer method with an enhanced k-means scheme was proposed [8], which removes redundant data from the collected measures. To save the limited energy of WSNs, a data transmission (Dat) protocol was presented [9]; it reduces the data transmission cost inside each sensor node by removing redundant data while maintaining a suitable level of accuracy in the readings received at the sink. To conserve energy and enhance the lifetime of a WSN, reducing the amount of data communicated by exploiting the temporal and spatial correlations of the sensed data is a suitable approach. An energy-efficient semantic clustering model was proposed to mitigate the high-energy consumption problem in a clustered WSN [10]. To reduce the energy consumed during data transmission, an adaptive data reduction method, which is based on a convex combination of two decoupled least-mean-square windowed filters, was proposed [11].
On the other hand, prediction-based schemes try to reduce sensing data with forecasting. To improve data processing efficiency, a distributed data prediction model based on least squares, which uses a prediction-based filtering scheme, was proposed to decrease the transmitted data [12]. Alduais et al. [13] presented an updating frequency metric, which is defined as the frequency of updating the model reference parameters during data collection, to evaluate the performance of different multivariate data reduction models for WSNs. To schedule data communications between SNs and a sink so as to reduce power usage and maximize the network lifetime, a prediction-based data communication scheme, which utilizes a hierarchical least-mean-square adaptive filter to predict the measured values both at the source and at the sink, was presented [14].
Although the schemes mentioned above provide efficient solutions to reduce redundant sensing data in WSNs, the following defects still need to be addressed comprehensively. First, a large range of edge sensing data and errors in local best values can lead to local characteristics being lost. Second, errors in the sensing data can result in a similarity threshold failure problem. Third, the existing distance-based correlation schemes for reducing redundant sensing data, which only consider the spatial correlations of sensing events and omit the temporal correlations of sensing data, tend to degrade the accuracy of sensing data. Furthermore, prediction-based schemes require relatively long-term data sensing and processing abilities, which increase the burden on resource-limited sensors and decrease the lifetime of WSNs. Hence, it is necessary to consider both spatial and temporal correlations, together with location and data similarity clustering, to decrease the sensing data transmission and processing. Focusing on the issues mentioned above, this paper explores the sensing data deredundancy problem to decrease energy consumption and extend the lifetime of a WSN.
The main contributions of this paper are summarized as follows:
(1) A multistage hierarchical clustering similarity deredundancy (MHCSD) algorithm is proposed to reduce the power consumption and extend the lifetime of a WSN. MHCSD considers both spatial and temporal correlations, together with location and data similarity clustering, to overcome the accuracy degradation of sensing data.
(2) A dual-metric distance is employed in the first stage, and an improved k-means algorithm is proposed to judge the similarity of nodes in the sink based on the dual-metric distance. A Gaussian hybrid clustering algorithm is presented to judge the similarity of edge sensing data within the same cluster, which improves the similarity accuracy and the deredundancy ratio. In the third stage, the clustered sensing data are randomly weighted to further deduplicate the spatially correlated data.
The remainder of this paper is organized as follows. Related work is explored in Section 2. Section 3 presents the proposed multistage hierarchical clustering deredundancy algorithm in WSNs. Section 4 shows the experimental results, which verify the proposed scheme. Finally, Section 5 concludes the paper.
2. Related Work
Although a large amount of application-specific data is generated in WSNs, most of the sensing data detected by sensors are redundant. Processing and transmitting massive superfluous data can lead to additional power consumption and greatly decrease network lifetime [15, 16]. To improve data processing performance, a path merging protocol was proposed in [17]; it supports partial discrete wavelet transform-based compression schemes and significantly reduces redundant data transmission through the appropriate aggregation of data packets from merging paths. To manage energy-efficient data collection in WSNs, a data-aware energy conservation scheme and a prediction-based data collection framework were proposed to reduce data transmission [18], where the inherent correlation between the consecutive observations of SNs and the data similarity measures between neighboring SNs are utilized.
Considering that the data volume in WSNs is quickly increasing, a hybrid-stream big data analytics model, which utilizes a multidimensional convolutional neural network (CNN), a minimal correlation model, and a minimal redundancy model to optimize data processing, was proposed to perform big data analysis [19]. To provide a complete description of an environment and make a robust decision, a redundancy removal strategy, which mines the spatial and temporal information in the collected data to select the appropriate information before forwarding it to a base station or a cluster head (CH) in a WSN, was proposed [20]. To avoid generating, transmitting, and storing unwanted data from redundant messages, an immunization-based redundancy elimination scheme, which independently selects the correct number of acknowledgment frames to respond dynamically to variations in the amount of redundant data, was proposed [21]. An image fusion method was proposed based on histogram similarity and multiview weighted sparse representations [22].
By introducing histogram similarity, different weights are given to low-resolution high-frequency components and source image high-frequency components, and complementary information is effectively used. Diwakaran et al. [18] used the inherent correlations between the continuous observations of SNs and the data similarity measures of adjacent SNs to reduce data transmission. A new model inspired by the tree search behavior of monkeys was explored in [23], and a fuzzy reasoning mechanism was used to complete data collection and dissemination. Rida et al. [24] utilized data aggregation techniques based on the Euclidean distance to reduce similar data. Lin et al. [25] proposed a semantics-based data annotation method. A histogram-based data clustering method, which groups homogeneous data into clusters and then performs data reduction by selecting the average value of each cluster, was proposed in [26].
Additionally, to address the problem of redundant data collected by sensors, data aggregation and semaphore processing based on similar functions are applied in WSNs, and SNs are aggregated with a palm tree method [27]. Wan et al. [28] proposed a similar sensory data aggregation scheme based on fuzzy c-means. A spatial-temporal correlation search mechanism between SNs based on the Euclidean distance was proposed in [29]. An energy-saving redundant traffic handling scheme, which utilizes short beacon information to process redundant packets generated in area-based routing, was presented in [30]. The parameter estimation problem was considered in [31], and two censoring algorithms were proposed to enable SNs to transmit sampled data based on local decision-making. A dual prediction scheme is used to reduce the transmission between cluster nodes and CHs, while a data compression scheme is used to reduce the traffic between CHs and sink nodes [32]. A low-redundancy data acquisition scheme, which selects some nodes for data detection and transmits less data to CHs, was proposed based on matrix completion [33]. To reduce transmitted data, a differential data processing (DDP) method was proposed in [34].
Although there are many effective deredundancy processing schemes in WSNs, the following limitations still need to be addressed. Data correlation analysis does not consider homologous data, which can result in a lower deredundancy ratio and a loss of local characteristics. Furthermore, excessive deredundancy can degrade the accuracy of sensing data. To fill this gap, this paper proposes a multistage hierarchical clustering similarity deredundancy algorithm that overcomes the limitations mentioned above.
3. Proposed Scheme
3.1. System Model
The network consists of a set of SNs, and the edge sensor nodes collect data. The system model is shown in Figure 1, and Table 1 lists the notation that we use in the paper.
The sink calculates the similarity distances between nodes from the coordinates of the nodes and divides the nodes into clusters according to these distances. The CH of each cluster collects the sensing data generated by the nodes in its cluster at each sampling time. Gaussian mixture clustering is then applied to the collected data, which further classifies the nodes within each cluster into subclusters of nodes with similar data.
3.2. Deredundancy Algorithm
The proposed MHCSD algorithm includes three stages: the local clustering stage, the similar data clustering stage, and the data deredundancy processing stage. The framework of MHCSD is shown in Figure 2. In the first stage, the sink performs the improved k-means clustering algorithm, which clusters similar nodes according to the spatial position coordinates of the nodes. In the second stage, the CHs adopt the Gaussian hybrid clustering algorithm to further seek similar clusters. In the third stage, based on the maximum time threshold, the SNs run a data deredundancy scheme with an adaptive step length (TCDA) to eliminate duplicate sensing data with spatial-temporal correlations.
3.2.1. Similarity of SNs
In the first stage, the node similarity analysis is performed in the sink according to the node position coordinates. To delete duplicate edge sensing data effectively, local clustering needs a precise similarity measure among nodes and sensing data. Among the various distance metrics, the Euclidean distance is probably the most commonly used in data processing. However, the Euclidean distance only describes the amplitude difference between two feature vectors, and the Euclidean distance between two feature vectors with different shapes may be smaller than that between feature vectors with similar shapes. To overcome this defect, a dual-metric similarity distance is employed. It combines the Euclidean distance with the Pearson correlation distance through a scale factor that sets the relative weight of the two terms.
The dual-metric similarity distance satisfies three distance properties: positivity, symmetry, and reflexivity. With this distance, any pair of feature vectors can be compared in terms of both amplitude, through the Euclidean distance, and shape, through the correlation distance.
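As an illustration of how an amplitude term and a shape term can be blended into one distance, the following minimal sketch may help. The convex combination and the name `alpha` are assumptions made for this example; they stand in for the scale factor described above and are not necessarily the exact form used in MHCSD.

```python
import math

def euclidean(u, v):
    """Amplitude difference between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson_distance(u, v):
    """Shape difference: 1 minus the Pearson correlation coefficient (range 0..2)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if std_u == 0 or std_v == 0:
        # Correlation is undefined for a constant vector (assumption: identical
        # vectors get distance 0, anything else is treated as uncorrelated).
        return 0.0 if list(u) == list(v) else 1.0
    return 1.0 - cov / (std_u * std_v)

def dual_metric(u, v, alpha=0.5):
    """Hypothetical blend: alpha weights amplitude, (1 - alpha) weights shape."""
    return alpha * euclidean(u, v) + (1.0 - alpha) * pearson_distance(u, v)

rising = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]    # same shape, larger amplitude
falling = [3.0, 2.0, 1.0]   # opposite shape
print(dual_metric(rising, scaled), dual_metric(rising, falling))
```

In this sketch the pair with the same shape differs only through the Euclidean term, while the pair with opposite shapes is also penalized by the correlation term, which is the behavior the dual-metric idea is meant to capture.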
For each node, the sink takes the node's spatial position coordinates as input and performs the improved k-means algorithm, as shown in Algorithm 1.

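According to the coordinate position set $P = \{p_1, p_2, \ldots, p_m\}$, the nodes are classified into $k$ disjoint subsets $C = \{C_1, C_2, \ldots, C_k\}$, with $C_i \cap C_j = \varnothing$ and $\bigcup_{i=1}^{k} C_i = P$, where $i \neq j$ and $1 \le i, j \le k$. In addition, the minimum squared error is defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left\| p - \mu_i \right\|_2^2 \tag{3}$$
where $\mu_i = \frac{1}{|C_i|} \sum_{p \in C_i} p$ is the mean vector of cluster $C_i$.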
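The position clustering and the squared-error criterion can be sketched as follows. This is plain Lloyd-style k-means on 2-D coordinates with the ordinary squared Euclidean distance; the modifications that make the paper's version "improved" are not reproduced here, and the names `kmeans`, `assign`, and `seed` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # Randomly pick k samples as the initial mean vectors.
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's old center
        if new_centers == centers:
            break  # converged: the mean vectors no longer move
        centers = new_centers
    labels = assign(points, centers)
    # Squared error: sum of squared distances to the assigned mean vector.
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse

positions = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = kmeans(positions, k=2)
print(labels, centers, sse)
```

Swapping `sq_dist` for a dual-metric distance would change the assignment step, but the mean update would then no longer be guaranteed to minimize the criterion, so that substitution is a design decision rather than a drop-in change.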
3.2.2. Similarity of Sensing Data
After clustering the similar nodes by the spatial positions in the first stage, to refine redundant judgments of nodes, the Gaussian hybrid clustering algorithm is adopted in the second stage and is shown in Algorithm 2.

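To further analyze the similarity of the data collected simultaneously within a cluster, a probabilistic model is used to describe the prototype data in Gaussian hybrid clustering. The cluster division is mainly determined by the posterior probability corresponding to the prototype. For a random variable $x$ in the $n$-dimensional sample space $\mathcal{X}$ that follows the Gaussian distribution, the probability density function is defined as
$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right) \tag{4}$$
where $\mu$ represents the $n$-dimensional mean vector and $\Sigma$ denotes the $n \times n$ covariance matrix. Since the Gaussian distribution is determined by the mean vector $\mu$ and the covariance matrix $\Sigma$, for the convenience of description, the probability density function is written as $p(x \mid \mu, \Sigma)$ to make its dependence on these parameters explicit. The Gaussian mixture distribution is
$$p_{M}(x) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \mu_i, \Sigma_i) \tag{5}$$
where $\mu_i$ and $\Sigma_i$ are the parameters of the $i$th Gaussian mixture component, $\alpha_i > 0$ is the corresponding mixing coefficient, and $\sum_{i=1}^{k} \alpha_i = 1$. The distribution consists of $k$ mixture components, and each mixture component corresponds to a Gaussian distribution.
For a network composed of several clusters, the data generated by one cluster can be expressed as a set $D = \{x_1, x_2, \ldots, x_m\}$, where $x_j$ is the time series generated by sensor node $j$ at a fixed sampling interval. In the WSN, each CH continues to classify the correlated data of the nodes in its cluster, and through the Gaussian hybrid clustering algorithm, the data collected at the same time from the nodes of one position cluster are divided into $k$ subclusters.
It is assumed that the random variable $z_j \in \{1, 2, \ldots, k\}$ represents the Gaussian mixture component that generates the sensing data $x_j$ of node $j$. The prior probability $P(z_j = i)$ corresponds to $\alpha_i$. According to Bayes' theorem, the posterior distribution of $z_j$ corresponds to
$$p_{M}(z_j = i \mid x_j) = \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} \tag{6}$$
That is, $p_{M}(z_j = i \mid x_j)$ is the posterior probability that sample $x_j$ is generated by the $i$th Gaussian mixture component, and it is denoted by $\gamma_{ji}$.
Once the Gaussian mixture distribution is known, the sample set $D$ is divided into $k$ subclusters, expressed as the set $C = \{C_1, C_2, \ldots, C_k\}$; the cluster marker of each sample is defined as follows:
$$\lambda_j = \arg\max_{i \in \{1, 2, \ldots, k\}} \gamma_{ji} \tag{7}$$
The model parameters $\{(\alpha_i, \mu_i, \Sigma_i) \mid 1 \le i \le k\}$ are solved by maximizing the log-likelihood
$$LL(D) = \sum_{j=1}^{m} \ln\left( \sum_{i=1}^{k} \alpha_i \, p(x_j \mid \mu_i, \Sigma_i) \right) \tag{8}$$
The expectation maximization algorithm is used for the iterative optimization solution. To maximize equation (8) with respect to $\mu_i$, we set $\partial LL(D) / \partial \mu_i = 0$, which gives
$$\sum_{j=1}^{m} \frac{\alpha_i \, p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} (x_j - \mu_i) = 0 \tag{9}$$
and with $\gamma_{ji}$ from equation (6), we can obtain
$$\mu_i = \frac{\sum_{j=1}^{m} \gamma_{ji} \, x_j}{\sum_{j=1}^{m} \gamma_{ji}} \tag{10}$$
where $\mu_i$ is the mean of each mixture component and can be estimated by the weighted average of the samples. The sample weight is the posterior probability of each sample belonging to the component. Similarly, from $\partial LL(D) / \partial \Sigma_i = 0$, we can obtain
$$\Sigma_i = \frac{\sum_{j=1}^{m} \gamma_{ji} (x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{m} \gamma_{ji}} \tag{11}$$
For the mixing coefficient $\alpha_i$, in addition to maximizing $LL(D)$, it needs to satisfy $\alpha_i \ge 0$ and $\sum_{i=1}^{k} \alpha_i = 1$.
The Lagrange form of $LL(D)$ is
$$LL(D) + \lambda \left( \sum_{i=1}^{k} \alpha_i - 1 \right) \tag{12}$$
where $\lambda$ is the Lagrange multiplier. Setting the derivative of equation (12) with respect to $\alpha_i$ to 0 gives
$$\sum_{j=1}^{m} \frac{p(x_j \mid \mu_i, \Sigma_i)}{\sum_{l=1}^{k} \alpha_l \, p(x_j \mid \mu_l, \Sigma_l)} + \lambda = 0 \tag{13}$$
Multiplying both sides by $\alpha_i$ and summing over all of the mixture components gives $\lambda = -m$, and
$$\alpha_i = \frac{1}{m} \sum_{j=1}^{m} \gamma_{ji} \tag{14}$$
namely, the mixing coefficient of each Gaussian component is determined by the average posterior probability of the samples.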
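The expectation maximization iteration described above can be sketched for the one-dimensional case, where each node contributes a single scalar reading and the covariance matrix reduces to a variance. The function names, the initial means, and the variance floor are assumptions introduced for this illustration; Algorithm 2 itself is not reproduced.

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, init_mus, iters=50, var_floor=1e-6):
    k, m = len(init_mus), len(xs)
    mus = list(init_mus)
    variances = [1.0] * k          # assumed initial variances
    alphas = [1.0 / k] * k         # uniform initial mixing coefficients
    gamma = [[1.0 / k] * k for _ in xs]
    for _ in range(iters):
        # E-step: posterior probability of each component for each sample.
        for j, x in enumerate(xs):
            w = [alphas[i] * gauss_pdf(x, mus[i], variances[i]) for i in range(k)]
            total = sum(w)
            gamma[j] = [wi / total for wi in w] if total > 0 else [1.0 / k] * k
        # M-step: posterior-weighted mean, variance, and mixing coefficient.
        for i in range(k):
            n_i = sum(g[i] for g in gamma)
            if n_i == 0:
                continue  # component received no responsibility; leave it unchanged
            mus[i] = sum(g[i] * x for g, x in zip(gamma, xs)) / n_i
            var = sum(g[i] * (x - mus[i]) ** 2 for g, x in zip(gamma, xs)) / n_i
            variances[i] = max(var, var_floor)  # floor avoids a collapsed component
            alphas[i] = n_i / m
    # Cluster marker: the component with the largest posterior probability.
    labels = [max(range(k), key=lambda i: g[i]) for g in gamma]
    return mus, variances, alphas, labels

temps = [20.0, 20.1, 19.9, 25.0, 25.2, 24.8]   # made-up readings from six nodes
mus, variances, alphas, labels = em_gmm_1d(temps, init_mus=[19.0, 26.0])
print(mus, alphas, labels)
```

The variance floor is a practical guard that is not part of the derivation: without it, a component that captures a single reading can shrink to zero variance and make the density undefined.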
(1) Elimination of Similar Data. According to the subclusters obtained in the second stage, the CH randomly weights the data generated at the same time by the nodes of a subcluster with similar data, and the TCDA algorithm is then applied to perform time-dependent deredundancy. The deredundant value is a weighted combination of the sensing data generated by these nodes at the same time, with randomly chosen weighting factors. Finally, the CHs transmit the deredundant data to the sink.
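A minimal sketch of the random weighting step is given below. It assumes that the weighting factors are non-negative and normalized to sum to 1, which keeps the fused value inside the range of the individual readings; this normalization and the function name are assumptions of the example.

```python
import random

def random_weighted_fusion(readings, rng=None):
    """Fuse simultaneous readings of one subcluster into a single value."""
    rng = rng or random.Random()
    # Draw positive random weights, then normalize them to sum to 1 (assumption).
    raw = [rng.random() + 1e-12 for _ in readings]
    total = sum(raw)
    weights = [r / total for r in raw]
    fused = sum(w * x for w, x in zip(weights, readings))
    return fused, weights

subcluster_readings = [19.8, 20.0, 20.3, 20.1]   # made-up values at one time step
fused, weights = random_weighted_fusion(subcluster_readings, random.Random(42))
print(fused, weights)
```

Only the fused value would need to be forwarded for the subcluster at that time step, instead of one reading per node.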
3.2.3. HMDA Algorithm
To reduce redundant data in WSNs comprehensively, a hybrid multistage deredundancy algorithm (HMDA), which combines MHCSD and TCDA, is proposed based on spatial-temporal correlations, as shown in Algorithm 3.

The MHCSD algorithm reduces redundancy in terms of spatial correlations, and the TCDA algorithm further reduces redundant data in terms of temporal correlations. TCDA fully considers the following factors in the process of deduplication: when the range of data variation is large, there is a large error in the local maximum or minimum value and a missing local eigenvalue, and when the data fluctuation is stable, the data similarity threshold cannot work effectively. Considering the ratio of deduplication, TCDA guarantees the timeliness of the sensing data with a maximum time threshold to prevent a failure in the data similarity threshold. Furthermore, an adaptive step size mechanism is proposed to reduce the complexity of calculation and energy consumption. Hence, HMDA reduces network energy consumption and extends the lifetime of a WSN simultaneously. In addition, the flow chart of HMDA is shown in Figure 3.
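The interplay between a data similarity threshold and a maximum time threshold can be illustrated with the sketch below. It is a simplified stand-in for TCDA: the adaptive step size mechanism is omitted, and the parameter names `value_threshold` and `max_interval` are hypothetical.

```python
def temporal_dedup(samples, value_threshold, max_interval):
    """Return the (timestamp, value) samples that would still be transmitted.

    A sample is sent when it differs enough from the last sent value, or when
    the last sent value has become too old (maximum time threshold).
    """
    sent = []
    for t, v in samples:
        if not sent:
            sent.append((t, v))        # always send the first sample
            continue
        last_t, last_v = sent[-1]
        changed = abs(v - last_v) > value_threshold
        stale = (t - last_t) >= max_interval
        if changed or stale:
            sent.append((t, v))
    return sent

# Made-up temperature trace sampled every 30 s.
trace = [(0, 20.0), (30, 20.1), (60, 20.05), (90, 21.0),
         (120, 21.05), (150, 21.0), (180, 21.02), (210, 21.01)]
kept = temporal_dedup(trace, value_threshold=0.5, max_interval=90)
print(kept)  # [(0, 20.0), (90, 21.0), (180, 21.02)]
```

In this trace the sample at 90 s is sent because the value changed, and the sample at 180 s is sent only because of the time threshold, which is the case where a pure similarity threshold would stay silent while the data are stable.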
3.3. Performance Analysis
3.3.1. Algorithm Complexity
In the first stage, the sink aggregates through all node positions, classifies all nodes, and assumes that model training requires cycles. In the first step, the position set and classification number of nodes are input, and the time complexity is ; in the second step, samples are randomly selected as the initial mean vector, and the time complexity is ; in the third step, the distance between each sample and means is calculated, and the time complexity is ; in the fourth step, the mean vector is updated, and the time complexity is ; in the fifth step, the cluster division results are output, and the time complexity is . In the second stage, the CH performs a data similarity analysis of the cluster data generated at each moment, assuming that model training requires cycles. The first step is to input the sensing data of nodes and similarity number , and the time complexity is ; the second step is to calculate the posterior probability generated by each mixed component, and the time complexity is ; the third step is to calculate each model parameter, and the time complexity is ; the fourth step is to calculate the cluster tag’s classification, and the calculation complexity is ; the fifth step is to output classification clusters, and the time complexity is . In the third stage, the CHs perform random weighted transmission to reduce redundant data in similar nodes, and the time complexity is .
Hence, we can obtain that the complexity of the model scheme is .
3.3.2. Energy Consumption
Most of the energy of a sensor node is consumed by its transceiver module. The transmitter channel follows either the free-space model or the multipath fading model, so the energy consumption is related not only to the amount of data but also to the transmission distance $d$. Therefore, the energy consumed by a node to send $l$ bits of data is
$$E_{Tx}(l, d) = \begin{cases} l E_{elec} + l \varepsilon_{fs} d^{2}, & d < d_0 \\ l E_{elec} + l \varepsilon_{mp} d^{4}, & d \ge d_0 \end{cases}$$
where $E_{elec}$ represents the energy consumed by the circuit to send or receive one bit of data, and $\varepsilon_{fs}$ and $\varepsilon_{mp}$ represent the energy consumption of the signal amplifier in the free-space and multipath fading models, respectively. The threshold distance is
$$d_0 = \sqrt{\frac{\varepsilon_{fs}}{\varepsilon_{mp}}}$$
The energy consumed by a node to receive $l$ bits of data is
$$E_{Rx}(l) = l E_{elec}$$
The energy consumed by a node to process $l$ bits of data is
$$E_{P}(l) = l E_{p}$$
where $E_{p}$ represents the energy consumed to process one unit of data. The node's remaining energy is
$$E_{res} = E_0 - E_{Tx}(l_t, d) - E_{Rx}(l_r) - E_{P}(l_p)$$
where $E_{res}$ represents the remaining energy of the node, $E_0$ represents the initial energy, $l_t$ represents the total amount of data transmitted, $l_r$ represents the total amount of data received, and $l_p$ represents the total amount of data processed.
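The energy model can be exercised numerically with the sketch below, which uses the usual first-order radio model with a distance threshold between the free-space and multipath regimes. The constants are assumed placeholder values chosen for illustration; they are not the values of Table 2.

```python
import math

# Assumed placeholder parameters (not taken from the paper's Table 2).
E_ELEC = 50e-9        # J/bit, transceiver circuit energy
EPS_FS = 10e-12       # J/bit/m^2, free-space amplifier energy
EPS_MP = 0.0013e-12   # J/bit/m^4, multipath amplifier energy
E_PROC = 5e-9         # J/bit, data processing energy

# Threshold distance at which the two amplifier models give equal energy.
D0 = math.sqrt(EPS_FS / EPS_MP)

def tx_energy(bits, d):
    """Energy to send `bits` over distance d (free space below D0, multipath above)."""
    if d < D0:
        return bits * E_ELEC + bits * EPS_FS * d ** 2
    return bits * E_ELEC + bits * EPS_MP * d ** 4

def rx_energy(bits):
    """Energy to receive `bits`."""
    return bits * E_ELEC

def proc_energy(bits):
    """Energy to process `bits`."""
    return bits * E_PROC

def residual_energy(e0, tx_bits, d, rx_bits, proc_bits):
    """Initial energy minus transmission, reception, and processing costs."""
    return e0 - tx_energy(tx_bits, d) - rx_energy(rx_bits) - proc_energy(proc_bits)

# Sending 1000 bits instead of 4000 bits after deredundancy, at 10 m.
print(tx_energy(4000, 10.0), tx_energy(1000, 10.0))
print(residual_energy(0.5, tx_bits=1000, d=10.0, rx_bits=0, proc_bits=4000))
```

Because transmission costs scale with the number of bits while processing is much cheaper per bit under these assumed constants, removing redundant readings before transmission lowers the total energy even after the extra processing is counted.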
4. Experimental Results
4.1. Experimental Setup
To verify the effectiveness of the proposed method, the temperature sensing data from the Intel Berkeley Laboratory are used [35]; these data include 54 nodes, and each node collects sensing data every 0.5 minutes. The map is shown in Figure 4. To verify the deredundancy ratio of edge sensing data and the network lifetime, the data transmission model and node energy consumption model are adopted. The experiments consider the following metrics: the deredundancy ratio, the deredundancy error, the influence of the amount of similar data clusters on the deredundancy ratio, and the energy consumption. The proposed HMDA will be compared with the TCDA, TSDA, and Dat algorithms [9]. The parameters and their values are shown in Table 2.
4.2. Performance Evaluation
First, the performance results of the three stages are analyzed separately. In the first stage, the position-based clustering of similar nodes is obtained; in the second stage, the clustering of nodes with similar data is obtained; in the third stage, based on the results of the first and second stages, the generated sensing data are made deredundant by means of random weighting. Second, the influence of the number of similar data clusters on the deredundancy ratio in the second stage is analyzed. Finally, the energy consumption is analyzed with Dat [9], TCDA, TSDA, and HMDA.
In the first stage, the sink performs clustering according to the nodes' coordinate positions by running the improved k-means clustering algorithm. The number of clusters is assumed to be 4. The results of the four clusters also change significantly as varies. The diamonds in the figure represent the cluster centers of the four clusters. The nodes' cluster distribution probability is shown in Table 3, and the clustering results are shown in Figure 5.
Figure 5: Clustering results, panels (a)–(e).
According to the probability ratio, the sink classifies the nodes that are prone to change into corresponding clusters. As shown in Table 3, the classification results are , , , and .
In the second stage, the CHs perform Gaussian mixture clustering. By successively acquiring edge sensing data from the nodes within each cluster, the CHs can analyze data similarity to further improve the deredundancy ratio. Similar classification results in cluster are shown in Figure 6.
As shown in Figure 6, one of the four clusters is divided into 4 subclusters, and each of the other three clusters is divided into 3 subclusters.
In the third stage, the data in similar clusters will be randomly weighted to optimize the redundancy ratio. For subcluster in cluster , the deredundancy performance, redundancy error, and mean square error are shown in Figures 7 and 8 and Table 4, respectively.
As shown in Figure 7, the sensing data of subcluster tend to be the middle values with randomly weighted optimization. The sensing data of node 29 and node 33 are close to the deredundancy results. However, the values of node 27 and node 31 are relatively far away. From Figure 8, we can see that the mean square errors of nodes 29 and 33 are relatively lower than those of nodes 27 and 31. According to the results in Table 4, it can be seen that the mean square errors of nodes 27, 29, 31, and 33 are 0.035, 0.004, 0.034, and 0.006, respectively, which indicates that even if the data are similar, there are still differences between the sensing data. Therefore, for the methods of data similarity analysis with the coordinates of nodes, the lack of spatial correlation analysis can cause greater errors. Multistage clustering improves the accuracy of sensing data similarity.
Since the deredundancy ratio of MHCSD is related to the data similarity clustering in the second stage, and the number of similar data clusters affects the accuracy of the data correlation, the effect of this number on the deredundancy ratio is shown in Figure 9.
As shown in Figure 9, the deredundancy ratio gradually decreases as the number of similar data clusters increases. When this number is 1, the four position clusters are not divided into any similar subclusters, and MHCSD treats all nodes in each position cluster as redundant nodes, whose sensing data are randomly weighted. Therefore, the deredundancy ratio is maximized. However, this case is equivalent to clustering all nodes by position similarity without considering the data similarity clusters, which leads to a larger error. When the number is 10, MHCSD classifies the nodes into 10 similar subdata clusters in each position cluster and randomly weights them to deduplicate the sensing data. Hence, the redundancy ratio is the lowest, which guarantees the accuracy of the deredundant data. To ensure both the accuracy of the data and the deredundancy ratio of the data, we set in the following performance analysis.
Figure 10 shows that as the number of nodes increases, the deredundancy ratio also increases and varies between 65% and 75%. When the number of nodes is 23, the deredundancy ratio is the highest (75%). When the number of nodes is less than 3, the proposed scheme omits the spatial correlation deredundancy and transmits the sensing data to the corresponding CHs, which degrades the deredundancy ratio. When the number of nodes is larger than 3, MHCSD performs a spatial correlation deredundancy algorithm, deduplicates redundant nodes in clusters, and obviously improves the deredundancy ratio.
As seen from Figure 11, the deredundancy ratio of the HMDA algorithm varies between 97.5% and 98.0%, which is higher than those of TCDA and Dat. Compared with TCDA and Dat, the deredundancy ratio of HMDA increases by 1.7% and 4.7%, respectively. Therefore, HMDA combines MHCSD and TCDA to reduce redundant data comprehensively and can further remove 70% of the redundant data. Additionally, the mean square error of the deredundant nodes stays between 0.004 and 0.035, so the deredundancy ratio reaches its highest value within the error range allowable for a user. The results in Figure 11 also verify that HMDA is effective in improving the deredundancy ratio based on spatial-temporal correlations.
The energy consumed by different schemes is shown in Figure 12. The energy consumed by the four algorithms increases gradually as the number of nodes increases. Among the different schemes, the energy consumption of HMDA is much lower than those of the other three algorithms. For the proposed HMDA scheme, energy consumption increases very slowly. When the number of nodes is 50, the energy consumption of HMDA is only 0.12 J, which is obviously lower than those of the other three algorithms. The reason is that HMDA will adaptively perform both spatial correlation and temporal correlation analyses.
The network lifetimes of different schemes are shown in Figure 13. When the number of nodes is fewer than 15, the lifetimes of HMDA, TCDA, Dat, and MHCSD remain stable at 260 s, 250 s, 120 s, and 15 s, respectively. The reason is that the deredundancy ratios of the 4 schemes are 97.5%, 96.3%, 93%, and 70%, respectively, which ensures that all nodes perform the same data processing scheme with constant energy consumption. The lifetime of HMDA is longer than those of the other 3 schemes. In particular, when the number of nodes increases to 50, the lifetime of HMDA is 109 s, which is 12.6, 3.0, and 3.9 times higher than those of MHCSD, TCDA, and Dat, respectively. The results in Figures 11–13 demonstrate that the proposed HMDA scheme achieves better performance in terms of the deredundancy ratio, energy consumption, and network lifetime.
5. Conclusion
Focusing on the problem of data redundancy in WSNs, a multistage hierarchical clustering deredundancy algorithm is proposed to decrease the additional power consumption and extend the lifetime of a WSN. Based on the improved k-means clustering method, all nodes are classified according to the node position information and temporal similarity. The Gaussian hybrid clustering method is adopted to improve the accuracy of the redundancy similarity judgment for edge nodes. According to the secondary classification results, the sensing data generated by the redundant nodes are randomly weighted to remove the redundant data. Detailed analysis and experimental results show that, compared with existing schemes, the proposed scheme is superior in terms of the deredundancy ratio, power consumption, and lifetime of a WSN.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work is supported by the National Science Foundation of China (No. 61772562, 62062019), the Key Project of Hubei Provincial Science and Technology Innovation Foundation of China (No. 2018ABB1485), the Hubei Provincial Natural Science Foundation of China (No. 2019CFB815), the Fundamental Research Funds for the Central Universities (No. CZP19004), and the Youth Elite Project of State Ethnic Affairs Commission of China (No. 2016308).