Abstract

To address anomaly detection in sensor data, traditional algorithms usually focus only on the continuity of single-source data and ignore the spatiotemporal correlation between multisource data, which reduces detection accuracy. In addition, because of the rapid growth of sensor data, centralized cloud computing platforms cannot meet the real-time detection needs of large-scale abnormal data. To solve these problems, a real-time detection method for abnormal data of IoT sensors based on edge computing is proposed. First, sensor data is represented as time series, and the K-nearest neighbor (KNN) algorithm is used to detect outliers and isolated groups in the data stream. Second, an improved DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is proposed that considers the spatiotemporal correlation between multisource data. Its parameters are set according to the sample characteristics in the window, which overcomes the slow convergence caused by global parameters and large samples, and it makes full use of data correlation to complete anomaly detection. Moreover, this paper proposes a distributed anomaly detection model for sensor data based on edge computing. It processes data on computing resources as close to the data source as possible, which improves the overall efficiency of data processing. Finally, simulation results show that the proposed method is feasible and achieves higher computational efficiency and detection accuracy than traditional methods.

1. Introduction

In recent years, with the continuous development and integration of technologies such as the Internet, IoT, and cloud computing, large numbers of sensor devices have come into wide use in fields such as power systems and thermal systems [1, 2]. Sensors usually collect data at a certain frequency and send the data to the corresponding data receivers. A data receiver receives one or more sets of observations in a strict order, so these observation data are essentially time series data [3]. Time series data accurately record the real-time changes of a parameter and reflect its trends and patterns of change within a certain time range. Therefore, the time series data collected by sensor devices are not only an important data source for data visualization but also the basis of data mining (such as classification, prediction, clustering, and association) [4–6]. However, in actual data collection scenarios, abnormalities such as error codes and sensor failures always occur during the collection and transmission of sensor data [7, 8]. Thus, in order to provide high-quality source data for subsequent data mining research, it is necessary to identify outliers in sensor data effectively from the perspective of time series data analysis.

According to the characteristics of wireless sensor networks (WSNs), anomaly detection methods are divided into statistics-based, classification-based, clustering-based, and neighbor-based methods. Single-sensor data streams usually rely on the temporal correlation of data for anomaly detection, and many applications are based on statistical analysis and nearest neighbor distance. Multisensor data streams have both temporal and spatial correlation, and clustering-based methods are usually used for detection; however, the temporal correlation of sensors is often ignored in the clustering process. For example, reference [9] proposed a new anomaly detection algorithm for time series data by constructing a distributed recursive computing strategy and a KNN quick selection strategy. Reference [10] proposed a clustering algorithm that used local parameters for unbalanced data to detect abnormal data. Reference [11] applied the K-means algorithm to cluster analysis of iris data with 5 attributes; compared with traditional methods, outlier removal clustering (ORC) technology achieved better results. Reference [12] was based on spatiotemporal (ST) correlation and detected outliers by calculating the cross-correlation between sensor data streams. However, traditional detection algorithms mainly focus on the sequence continuity of single-source sensor data and ignore the correlation between multisource sensor data. In addition, it should be emphasized that current common methods for detecting and processing anomalies in sensing data use relatively mature cloud computing models and common big data processing products: the data obtained by the various data collection devices are transmitted directly to a cloud computing center for processing and storage, and the computing power of the cloud computing center is used to complete the corresponding anomaly detection and data cleaning work [13]. Although there may be no obvious correlation between time series data from different sensors, their inherent characteristics may be highly correlated. If these data are uploaded to the data center for feature extraction, they place heavy computational pressure on the data center. Since underlying devices, including sensors, have certain computing capabilities, such hidden features can instead be extracted at these devices.

In order to improve the detection accuracy of abnormal sensor data, a real-time detection method for abnormal data based on edge computing is proposed to better meet the real-time detection requirements of large-scale abnormal data. In the KNN algorithm, the parameters exploit the nonmutation characteristics of time series data, and the improved DBSCAN algorithm exploits the spatial correlation characteristics of multidimensional data. The proposed algorithm fully considers data relevance and can effectively mine the potential relationships in the data. Furthermore, a distributed anomaly detection model for sensor data based on edge computing is proposed to process data on computing resources as close to the data source as possible. Experiments show that the proposed algorithm improves both detection accuracy and overall data processing efficiency.

2. Sensor Data Modeling Based on Time Series

Definition 1. The multisource data collected and transmitted by sensors can be expressed as the following time series data:where , , and are the time series data sets of multisource data, and is multisource number. in equation (2) is the data set of the data source . in equation (3) is the length of , and is the sensed value at time .
Based on the single data source representing , a sliding window (SW) is introduced to store part of the data . Therefore, the length of is expressed as .
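As a minimal sketch of the sliding window idea in Definition 1, the following Python snippet keeps only the most recent readings of a single data source. The class name `SlidingWindow`, the parameter `length`, and the sample readings are illustrative assumptions and do not come from the paper.

```python
from collections import deque


class SlidingWindow:
    """Fixed-length window over one sensor's time series (illustrative)."""

    def __init__(self, length):
        if length <= 0:
            raise ValueError("window length must be positive")
        # A deque with maxlen discards the oldest reading automatically.
        self._buf = deque(maxlen=length)

    def push(self, value):
        """Append the newest sensed value to the window."""
        self._buf.append(value)

    def is_full(self):
        """True once the window holds `length` readings."""
        return len(self._buf) == self._buf.maxlen

    def values(self):
        """Return the readings currently held, oldest first."""
        return list(self._buf)


# Hypothetical temperature readings arriving one at a time.
window = SlidingWindow(length=3)
for reading in [20.1, 20.3, 20.2, 20.4]:
    window.push(reading)
```

After the fourth reading arrives, the window holds only the last three values, so downstream detection always works on a bounded amount of data.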

Definition 2. According to some known correlations in the multisource data set , perform the necessary combination and transformation of to obtain a new time series (denoted as ). And enter it into the so-called related parameter set . Then, abnormal data can be found by detecting the linear correlation of in the data correlation detection (DCD) process.
According to Definition 1, from the linear correlation of in same , the correlation of can be realized. Considering that there may not be linear correlation or nonlinear correlation in , it is necessary to convert into a multisource signal with linear correlation characteristics for subsequent DCD processing. Here are the three correlations that exist:
(1) Basic Correlation. It can also be called linear correlation. Taking the power system as an example, a binary time series composed of generator active power output and grid frequency is defined, where , . Due to the droop characteristic of the power system, the active output power and grid frequency satisfy a binary linear correlation, namely . Thus, let be the binary related parameter set. Similarly, if ternary time series data satisfies ternary linear correlation, the corresponding set of related parameters can be given by .
(2) Combination Correlation. This shows that there is no linear correlation of a single time series in a given time series , but there is a linear correlation after combining . Taking the thermal system as an example, a ternary time series composed of thermal power , instantaneous temperature observation value , and time is defined. However, according to the basic heat power theorem in thermodynamics, there is a positive linear relationship between heat and temperature change rate . In other words, , where , , and are coefficients. Through a certain combination, the time series in the original can be converted into a binary linear model. The corresponding binary time series is put into a parameter set , denoted as .
(3) Conversion Correlation. That is, in a given , there is no basic correlation or combination correlation in time series, but there is a nonlinear correlation, such as exponential model, hyperbolic model, polynomial model, etc. For example, flow optimization coefficients and of radiator satisfy the hyperbolic relationship: in a thermal system. Another example is that kinetic energy and the angular velocity of the generator rotor satisfy a polynomial relationship: in the power system. Through some data conversion methods, nonlinear correlation models can be transformed into linear correlation models.
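The conversion step can be illustrated with a small, hedged Python sketch. It assumes the standard rotational kinetic energy relation E = ½Jω² as the polynomial example and a generic hyperbolic relation y = c/x; the values of `J`, `c`, and the sample points are hypothetical. After the substitution (ω → ω², x → 1/x), the Pearson correlation of the converted series becomes 1, which is what makes linear correlation checks applicable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Polynomial model: kinetic energy versus rotor angular velocity.
J = 2.0  # hypothetical moment of inertia
omega = [1.0, 2.0, 3.0, 4.0, 5.0]
energy = [0.5 * J * w ** 2 for w in omega]

r_raw = pearson(omega, energy)                    # strong but not perfectly linear
r_lin = pearson([w ** 2 for w in omega], energy)  # linear after conversion

# Hyperbolic model y = c / x becomes linear in 1 / x.
c = 6.0  # hypothetical coefficient
x = [1.0, 2.0, 3.0, 4.0]
y = [c / v for v in x]
r_hyp = pearson([1.0 / v for v in x], y)
```

A reading that breaks the converted linear relationship then stands out as a correlation anomaly, even though the raw series are nonlinearly related.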

3. Proposed Real-Time Detection Algorithm for Abnormal Sensor Data

Because of the huge amount of sensor data in IoT, a traditional centralized cloud computing framework may run detection algorithms inefficiently when computing resources are limited. Therefore, a detection framework based on edge computing is proposed to detect abnormal sensor data.

3.1. Real-Time Detection Framework of Sensor Data Based on Edge Computing

The linear growth of centralized cloud computing power cannot keep pace with the rapidly growing data processing needs of edge devices [14]. Besides, from both a technical and an economic point of view, it is impractical to concentrate the ever-growing edge data in one or a few data computing centers to complete the corresponding computing tasks.

In the edge computing framework, computing tasks are assigned to many distributed devices with certain computing capabilities [15–18]. Therefore, computational efficiency can be improved while the performance requirements of the computing equipment are reduced. An edge computing architecture is therefore built to detect abnormal sensor data in real time. As shown in Figure 1, edge layer data nodes are established near the sensor data collection terminals to complete the detection tasks for the related data while receiving sensor data.

The edge computing function is the core function of this system. In this paper, the main content of edge computing is abnormal data detection, estimation, and correction; other edge computing tasks can also be added to this functional module according to actual needs. The edge computing module mainly implements sequence generation, anomaly detection, and correction. It first retrieves the configuration information in the database and records the parameters that need edge processing. The data to be processed are then divided into different sequences according to these parameters to facilitate subsequent processing. Finally, the anomaly detection and estimation correction algorithms are applied to the corresponding sensor data sequences, abnormal data are marked, and estimated correction values are added.

3.2. Outlier and Isolated Group Detection Based on KNN Algorithm

The basic rule of the KNN algorithm is to find the $K$ nearest neighbors of samples (where ). When $K = 1$, the KNN problem is equivalent to the nearest neighbor problem. represents the number of sensor data belonging to the category. In general, the rule for deciding which category a sensor datum belongs to is the voting principle. In addition, $K$ is set to an odd number to avoid ties caused by equal votes. The voting principle can be expressed mathematically as follows:where is the sampled data, and is the number of sensors belonging to the category.

The steps of the KNN algorithm are as follows:
(1) Distance Calculation. For a given sensing data set, calculate the distance between the test object and each object in the training set. In this paper, Euclidean distance is used. For two-dimensional vectors $a = (a_1, a_2)$ and $b = (b_1, b_2)$, the Euclidean distance is
$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2},$$
where $d(a, b)$ is the Euclidean distance between sensor data $a$ and $b$.
(2) Neighborhood Discovery. The $K$ nearest training objects are identified as the nearest neighbors of the test object.
(3) Classification. The test object is classified according to the majority category of these neighbors.
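The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: the function names, the labels `"normal"` and `"abnormal"`, and the sample vectors are assumptions chosen for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training objects.

    `train` is a list of (vector, label) pairs.
    """
    if k <= 0 or k > len(train):
        raise ValueError("k must be between 1 and the training set size")
    # Step 1: distance calculation.
    dists = [(euclidean(vec, query), label) for vec, label in train]
    # Step 2: neighborhood discovery (the k smallest distances).
    dists.sort(key=lambda pair: pair[0])
    neighbors = dists[:k]
    # Step 3: classification by the voting principle.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional readings (e.g., temperature, flow).
train = [
    ((20.0, 1.0), "normal"),
    ((20.5, 1.1), "normal"),
    ((19.8, 0.9), "normal"),
    ((35.0, 4.0), "abnormal"),
    ((36.0, 4.2), "abnormal"),
]
label_a = knn_classify(train, (20.2, 1.0), k=3)
label_b = knn_classify(train, (35.5, 4.1), k=3)
```

Using an odd `k` with two categories guarantees that the vote cannot end in a tie.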

3.3. Abnormal Sensor Data Detection Based on DBSCAN Algorithm

The basic DBSCAN algorithm uses the globally unique parameters Eps (neighborhood radius) and MinPts (minimum number of points in the neighborhood) to achieve clustering. Inspired by the use of local parameters for clustering in reference [10], this paper proposes a method based on SW data partition that uses local parameters to achieve density clustering of small sample data. The algorithm flow is shown in Figure 2.

The clustering process consists of three parts, namely parameter update, clustering, and anomaly detection. During the parameter update process, set the size of the clustering window and calculate the average distance difference between attributes in the window, take as the number of points in the neighborhood , and Euclidean distance between attributes as radius to ensure that the data clustering is correct in a single case. The formula is as follows:where is the average distance difference of attributes; is the size of SW.

The DBSCAN algorithm is used in the clustering process. Because the attributes are inconsistent with one another, a weight is assigned to each attribute to reduce the impact on the clustering effect. The weight is calculated from the correlation coefficient:
$$\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}},$$
where $\operatorname{Cov}(X, Y)$ is the covariance of attribute $X$ and attribute $Y$; $D(X)$ is the variance of attribute $X$; and $D(Y)$ is the variance of attribute $Y$.
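A hedged sketch of the weight calculation: the correlation coefficient is computed as the covariance of two attributes divided by the product of their standard deviations. The function names and the sample attribute values (`flow`, `power`) are hypothetical, and returning 0.0 for a constant attribute is a design choice of this sketch, not something stated in the paper.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def variance(xs):
    """Population variance, i.e. the covariance of a sequence with itself."""
    return covariance(xs, xs)


def correlation_weight(xs, ys):
    """Correlation coefficient used as an attribute weight."""
    denom = math.sqrt(variance(xs)) * math.sqrt(variance(ys))
    if denom == 0:
        # A constant attribute carries no correlation information (assumption).
        return 0.0
    return covariance(xs, ys) / denom


# Hypothetical attribute values inside one window.
flow = [1.0, 2.0, 3.0, 4.0]
power = [2.0, 4.0, 6.0, 8.0]
w_pos = correlation_weight(flow, power)
w_neg = correlation_weight(flow, list(reversed(power)))
w_const = correlation_weight(flow, [5.0, 5.0, 5.0, 5.0])
```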

The anomaly detection process analyzes the clustering results. In the clustering process, an object marked as an abnormal point for the first time is recorded as a candidate abnormal point, and its abnormal score (initially 0) is incremented by 1. The candidate abnormal points enter the next cycle, where clustering continues and the abnormal score is updated. If the abnormal score equals the number of clustering rounds (which is the inverse of the SW overlap rate), the object is marked as an abnormal point; otherwise, it is a normal point.
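The scoring rule can be sketched as below under several stated assumptions. The helper `dbscan_noise` does not perform full DBSCAN clustering; it only applies DBSCAN's noise definition (a point that is neither a core point nor within Eps of a core point) to one-dimensional values. The parameter `rounds` stands for the number of clustering rounds, and the step `window // rounds` is an assumed way of making each interior point fall into `rounds` overlapping windows. All names and sample values are hypothetical.

```python
def dbscan_noise(values, eps, min_pts):
    """Indices of 1-D points that DBSCAN would label as noise."""
    n = len(values)
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    # Noise: not a core point and not reachable from any core point.
    return {
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbors[i])
    }


def score_anomalies(series, window, rounds, eps, min_pts):
    """Flag points marked as noise in every one of `rounds` windows."""
    if rounds < 1 or window < rounds:
        raise ValueError("need 1 <= rounds <= window")
    step = window // rounds  # assumed relation between overlap and step
    scores = [0] * len(series)
    for start in range(0, len(series) - window + 1, step):
        chunk = series[start:start + window]
        for local_idx in dbscan_noise(chunk, eps, min_pts):
            scores[start + local_idx] += 1  # candidate abnormal point
    return [i for i, s in enumerate(scores) if s == rounds]


# Hypothetical temperature stream with one spike at index 4.
series = [20.0, 20.1, 20.2, 20.1, 35.0, 20.0,
          20.2, 20.1, 20.3, 20.1, 20.0, 20.2]
abnormal = score_anomalies(series, window=4, rounds=2, eps=0.5, min_pts=3)
```

In this sketch, points near the start or end of the stream are covered by fewer than `rounds` windows and therefore can never be flagged; a production version would need an explicit rule for such boundary points.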

According to Definition 1, the multisource sensor data set is represented as a time series . When multisource sensor data enter , select the area data observation value of whose length is equal to the length of to form a new multisource sensor data set , where contains the area data observation value of length , namely , . According to Definition 2, part of the time series set with known correlation in is represented as , which can be combined or transformed into a new time series with linear correlation. Then, enter the parameters in into the set . Since there is usually a certain correlation between data collected by sensors, the correlation between sensor data can be used to determine whether the sensor data is abnormal in a certain time range.

3.4. Algorithm Performance Analysis

The time complexity of the DBSCAN algorithm is the time required to find a point in radius neighborhood, and its time complexity is in the worst case. Improved DBSCAN algorithm uses a SW to divide data, and its time frequency can be expressed as follows:where is the algorithm input scale, is SW size, that is, is a constant. is a constant; that is, the sliding step is . Thus, the time complexity of the improved DBSCAN algorithm is as follows:

The time complexity of the proposed algorithm is the sum of the time complexity of KNN and the improved DBSCAN clustering algorithm. It can be expressed as follows:

It can be seen that the time complexity of the anomaly detection algorithm grows linearly with the input scale. Therefore, as the number of processed data samples increases, its time efficiency is higher than that of the basic DBSCAN algorithm.
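To make the linear growth concrete, the following sketch counts distance checks under a simplifying assumption: neighborhood search is brute force, so clustering n points costs n² checks and clustering one window of size w costs w² checks. The function names and the parameters `window` and `step` are illustrative and are not taken from the paper's formulas.

```python
def pairwise_checks_global(n):
    """Distance checks for brute-force DBSCAN over all n points at once."""
    return n * n


def pairwise_checks_windowed(n, window, step):
    """Distance checks when clustering runs separately in each sliding window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if n < window:
        return n * n  # a single partial window
    num_windows = (n - window) // step + 1
    return num_windows * window * window


# With a fixed window and step, doubling n roughly doubles the windowed cost,
# whereas the global cost quadruples.
cost_1k = pairwise_checks_windowed(1000, window=100, step=50)
cost_2k = pairwise_checks_windowed(2000, window=100, step=50)
global_1k = pairwise_checks_global(1000)
global_2k = pairwise_checks_global(2000)
```

Because the window size and step are constants, the number of windows is proportional to n, which is the source of the linear behavior.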

4. Case Study and Discussion

4.1. Experimental Environment

In order to verify the effectiveness and feasibility of this method, several case studies are carried out. The hardware environment is a ThinkStation P910 with an Intel Xeon E5V4 CPU, 32 GB of memory, and a 1 TB SSD. The software environment is ZooKeeper V3.4.8, JDK v1.8.0, and Storm V1.0.0. All algorithms run on CentOS 6.5.

The experimental data comes from the urban heating system in Baohe District, Hefei, with 18031 users and 400 buildings. In the data collection process, the transmitters of each building will collect the data recording heating information of each room and send them to the corresponding receivers.

4.2. Execution Efficiency of Cloud Computing and Edge Computing Platforms

The algorithm is implemented on a centralized cloud computing platform [19] and on a distributed edge computing platform. Table 1 shows the average processing time for each step. As can be seen from Table 1, because cloud computing must transmit a large amount of data, its bandwidth pressure is greater, and its transmission delay is therefore longer than that of edge computing. Moreover, despite the relatively strong computing power of the cloud computing platform, the time required for each step is longer with cloud computing, by 211.3 ms and 101.1 ms, respectively. Therefore, the computational efficiency of edge computing is higher. This is because the paper proposes an anomaly detection model for sensor data based on edge computing, which processes the corresponding data on computing resources as close to the data source as possible. It improves the overall efficiency of data processing while reducing the pressure on network transmission bandwidth.

4.3. Detection Results Analysis

In this subsection, the performance of the proposed algorithm is evaluated from two aspects using experimental results.
(1) Using the anomaly detection algorithm proposed in this paper, local detection is performed on sensor data including accumulated heat, thermal power, accumulated temperature, flow, and temperature difference. The number of selected sensor data is 500000. The results of using the proposed algorithm to find abnormal sensor data are shown in Table 2. It can be seen from Table 2 that there are 2650 abnormal , or records, and 1997 abnormal , or records. Let $N_a$ represent the total number of abnormal sensor data and $N_d$ the number of abnormal sensor data successfully detected. Thus, the detection accuracy (expressed as $P$) can be calculated as follows:
$$P = \frac{N_d}{N_a} \times 100\%.$$
It can be seen from Table 2 that the basic DBSCAN algorithm finds 2320 of the 2650 abnormal records and 1802 of the 1997 abnormal records, for detection accuracy rates of 87.5% and 90.2%, respectively. The improved DBSCAN algorithm proposed in this paper detects 2530 of the 2650 abnormal records and 1909 of the 1997 abnormal records. The detection accuracy rates were 95.5% and 96.0%, respectively. It can be seen that the improved DBSCAN algorithm proposed in this paper can effectively improve the detection results of abnormal data.
(2) In order to verify the superiority of the method in this paper, the methods in references [10], [11], and [12] are selected as benchmarks. A total of 3,433,756 sensor records are selected from the data of the past two years, and the proposed algorithm and the three benchmark methods are used to detect anomalies. Figure 3 shows the average detection accuracy.
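As a small worked check of the accuracy calculation in item (1), the sketch below computes the ratio of successfully detected abnormal records to total abnormal records, using the first-group counts reported for Table 2. The function name `detection_accuracy` is an illustrative assumption.

```python
def detection_accuracy(detected, total):
    """Detection accuracy in percent: detected abnormal records / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= detected <= total:
        raise ValueError("detected must lie between 0 and total")
    return 100.0 * detected / total


# Counts for the first group of 2650 abnormal records.
basic = detection_accuracy(2320, 2650)      # basic DBSCAN
improved = detection_accuracy(2530, 2650)   # improved DBSCAN
gain = improved - basic                     # improvement in percentage points
```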

It can be seen from Figure 3 that, compared with the methods in references [10], [11], and [12], the detection accuracy of the proposed method is higher by 1.91%, 2.04%, and 2.7%, respectively, reaching 96.4%. This is because the benchmark methods do not effectively use the correlation between multisource time series to assess the change trend accurately.

The detection results for sensor data in the data set are extracted for statistics. Among them, there are 15 point anomalies, 49 cluster anomalies, and 13 correlation anomalies. The detection rate is 97.8% and the false alarm rate is 2.2%. In order to show more clearly how abnormal points are marked, the first 180 sample points of the air temperature data are extracted and plotted; the detection results are shown in Figure 4. The abnormalities in the air temperature data occur within a short period of time (samples are collected at intervals of 10 minutes), and the reasons for abnormality are all errors.

In order to further verify the time efficiency of the proposed algorithm, the running times of the methods are compared on the datasets; the result is shown in Figure 5.

It can be seen from the figure that the running time of the method in reference [12] increases the fastest, and the running time of the proposed method and the method in reference [10] increases slowly. When the dataset reaches 8440 hours, that is, when the number of data points reaches 337760, the running time of the proposed method and the method in reference [12] is far less than the running time of the methods in reference [10] and reference [11]. Combined with the previous analysis of time complexity, it can be seen that the improved DBSCAN algorithm takes advantage of the spatial correlation characteristics of multidimensional data, fully considers data relevance, and effectively mines the potential relationships in the data. Therefore, when the sample size increases to a certain extent, the time efficiency of the proposed method is higher than that of the comparison algorithms. In summary, the proposed method is feasible and can be used for anomaly detection in multisensor data streams.

5. Conclusion

This paper proposes a real-time detection method for abnormal data of IoT sensors based on edge computing, which combines the ST correlation of sensor data streams with the ideas of the nearest neighbor algorithm and the clustering algorithm. The method optimizes parameters according to the characteristics of environmental data and overcomes the problems of a fixed nearest neighbor distance threshold, global clustering parameters, and slow convergence, which improves anomaly detection efficiency. For correlated multisensor data streams, its performance meets the requirements of the current environment. Simulation results show that the proposed method is feasible and achieves higher computational efficiency and detection accuracy than traditional methods. However, owing to the limits of the authors' expertise, the algorithm in this paper still has room for improvement. In the future, we will focus on optimizing the edge computing model and extending the detection algorithm to other real-time data application scenarios.

Data Availability

The data included in this paper are available without any restriction.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.