Abstract

The Internet of Things (IoT) is emerging, and 5G enables far more data transport from mobile and wireless sources. The volume of data to be transmitted, however, exceeds link capacity. Labelling data and transmitting only the useful part of the collected data, or their features, is a promising solution to this challenge. Abnormal data are particularly valuable, both for training models and for detecting anomalies, compared to the already abundant normal data. Labelling can be done at data sources or edges to balance the load and computing between sources, edges, and centres. However, the lack of an unsupervised labelling method is still an obstacle to implementing the above solutions. Two main problems in unsupervised labelling are long-term dynamic multiseasonality and heteroscedasticity. This paper proposes a data-driven method to handle the modelling and heteroscedasticity problems. The method contains the following main steps. First, raw data are preprocessed and grouped. Second, main models are built for each group. Third, the models are adapted back to the original measured data to obtain raw residuals. Fourth, the raw residuals go through deheteroscedasticity and become normalized residuals. Finally, the normalized residuals are used to conduct anomaly detection. Experimental results with real-world data show that our method increases the area under the receiver-operating characteristic curve (AUC) by about 30%.

1. Introduction

Together with the rapid development of 5G, the connectivity requirements of wireless devices are also growing, driven by eased connectivity and much shorter (millisecond-level) delay. As a result, Internet of Things (IoT) technologies are now used by more than a quarter of mainstream businesses, compared to 13% six years ago. A great number of industrial companies have started to pay attention to their IoT time series data, in domains including but not limited to health care [1] and transportation [2]. As more mobile vehicles are connected to the IoT network as data sources [3], much more data is produced. On the one hand, this is an opportunity for machine learning-based data processing methods. On the other hand, data transmission becomes more challenging.

Moving and remote data sources create a challenge: it is hard to send data, especially wirelessly, as it is still expensive to use limited wireless resources to transfer data, even for 5G service providers. In some situations, if real-time moving-vehicle information is needed while radio coverage is limited, then wireless and wired connections may both be needed to provide support together [4, 5]. This situation is shown in Figure 1.

One way to address this situation is to label data near the sources. This not only reduces the amount of data to transfer but also balances the computing load between edges and centres [6]. A further benefit is that labelling different types of data helps later prediction [7]. However, most solutions require labelled data to train labelling models, or rich experience from human experts to configure parameters.

In this work, we address this problem by enhancing data preprocessing. Our previous initial feasibility experiments showed promising results [8], and we complete the design here. The main contributions of this work are the detailed steps of a data-driven method to handle the heteroscedasticity of Internet of Things (IoT) data, and a comparison of possible unsupervised labelling methods together with an analysis of the reasons for their behaviour.

The remaining content is organized as follows. Section 2 introduces the problem and related definitions, together with previous research that tried to tackle this problem. Section 3 documents the proposed method thoroughly, including the steps of data preprocessing, model building, model adaptation, residual matrix construction, and anomaly detection. Section 4 describes a series of experiments using real-world data, carried out to evaluate and compare the performance of the proposed method in terms of different metrics. Finally, experimental results are shown and analysed in Section 5, and conclusions are drawn in Section 6.

2. Problem Description and Related Work

Here, we consider a system with a centre node. IoT data processing happens across the entire system [9]. It starts as early as the source application data, as shown in the updated TCP/IP architecture in Figure 2. Example source application data include camera images, video streaming, temperature, and other environmentally sensed values [10]. The sensed data are then sent via possible network routing, which could be fully used for distributed processing [11], especially together with the application layer [12–14]. The physical layer choice matters, as the emergency and importance levels differ among transported data, which should be optimized carefully [15, 16]. When the data finally arrive at the centre, data mining algorithms can be applied [17] to analyse them and, in most cases, conduct prediction.

Regarding labelling and detection of anomalies in time series, much work has been done. Previous work can be categorized in different ways from different aspects [18]. A typical categorization includes the following categories. Probability-based methods calculate a density distribution and apply some kind of threshold around the distribution centre to label anomalies [19]. Distance-based methods set thresholds on how far an instance deviates from its neighbours; the measurement can be a defined distance, as in k-nearest neighbours [20], or some kind of separation cost, as in decision tree-based methods [21]. Reconstruction-based methods capture patterns and calculate the expected values of instances to get the differences, i.e., residuals, and then use the residuals to conduct labelling [22, 23]. Boundary-based methods, such as the support vector machine [24], provide a boundary or hyperplane to separate abnormal instances from normal ones. In addition, ensemble methods can be used to improve the accuracy and robustness of the above methods [25]. Among these, reconstruction-based methods yield not only residuals but also comprehensive patterns and models. Thus, this work focuses on providing a preprocessing procedure to calculate and standardize residuals as the first step of reconstruction-based methods.

For reconstructed residuals, as the original saved data are huge and long-term, one common problem is that the variance of the residuals is time-dependent, i.e., heteroscedasticity [26]. Using traffic flow as an example, the variance is high around noon, when the flow itself is high, as shown in Figure 3. Conversely, the flow and its variance are both low after midnight. This causes problems for labelling algorithms, as many of them cannot distinguish high variance from anomalies.

During the literature review, we found two methods that try to solve the above two problems at the same time. One is SARIMA-GARCH (Seasonal Auto-Regressive Integrated Moving Average with Generalized Auto-Regressive Conditional Heteroscedasticity) [27]. The other is TBATS (Trigonometric seasonality, Box-Cox transform, ARMA errors, Trend, and Seasonal components) [26]. Thus, these two methods are also tested in this work. For the final detection part, SHESD (Seasonal Hybrid Extreme Studentized Deviate test) [28] shows promising results in experiments [29–31] and is used here. It is worth mentioning that plenty of alternative methods exist, while this work focuses on preprocessing.

3. Methodology

The proposed method includes three main steps: preprocessing, building day-of-week (DOW) models, and solving the flow-level-heteroscedasticity (FLH) problem. This section describes the method in detail. The entire procedure is summarized in Figure 4.

3.1. Preprocess Data

In this part, data are loaded and then divided into seven groups according to day of week.

For consecutive zeros (runs of three or more zeros), which indicate controlled access or device malfunction, set flags and replace the instances with null:

$$x_i \leftarrow \text{null}, \quad \text{if } x_i \text{ belongs to a run of three or more consecutive zeros},$$

where $x_i$ is the $i$th measured flow rate value.
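As an illustration, the flagging step can be sketched as follows (a minimal sketch in Python assuming a NumPy array input; the helper name is ours, not the authors' implementation):

import numpy as np

def flag_consecutive_zeros(x, min_run=3):
    """Replace runs of >= min_run consecutive zeros with NaN (null)."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    run_start = None
    # A trailing NaN sentinel guarantees the final run is closed.
    for i, v in enumerate(np.append(x, np.nan)):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                out[run_start:i] = np.nan
            run_start = None
    return out

For example, flag_consecutive_zeros([5, 0, 0, 0, 7]) returns [5, nan, nan, nan, 7], while a lone zero is kept as a legitimate measurement.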

Instead of using the original natural daily periods, we use a new starting point. The purpose is to find a base where the starting flow rates of the seasons are low and similar, so that the robust fitting can work better in later steps. It is worth mentioning that (daily) seasons may start at a time other than midnight; in fact, the starting point is calculated to be around 3 am in the experiments. The number of complete seasons is

$$S = \left\lfloor \frac{N}{p} \right\rfloor,$$

where $S$ is the number of complete seasons, $N$ is the original number of instances (counted after shifting to the new starting point), and $p$ is the number of periods (i.e., instances) per day (e.g., $p = 288$ for 5-minute interval data).

All complete seasons are put together to construct a matrix

$$X = \left[\,\mathbf{s}_1\ \mathbf{s}_2\ \cdots\ \mathbf{s}_S\,\right] \in \mathbb{R}^{p \times S},$$

with each season constituting a column, e.g., $\mathbf{s}_j = \left(x_{1,j}, x_{2,j}, \ldots, x_{p,j}\right)^{\top}$.

Then, the seasons/columns are separated into groups; here, we use the day of week of each season's starting point as the criterion, so there are 7 groups ($G_1, G_2, \ldots, G_7$) with a similar number of instances in each group.
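A minimal sketch of this preprocessing, under the assumptions that the series is a NumPy array with $p = 288$ points per day and that the computed starting offset is 36 points (i.e., 3 am, as found in the experiments); the function names and the first_dow parameter are hypothetical:

import numpy as np

def to_season_matrix(x, p=288, start=36):
    """Reshape a flagged series into a p x S matrix, one season per column.
    `start` shifts the season boundary (36 * 5 min = 3 am)."""
    x = np.asarray(x, dtype=float)[start:]
    n_seasons = len(x) // p                    # keep complete seasons only
    return x[: n_seasons * p].reshape(n_seasons, p).T

def group_by_dow(X, first_dow=0):
    """Split season columns into 7 day-of-week groups G_0 .. G_6.
    Column j starts on weekday (first_dow + j) % 7."""
    groups = {k: [] for k in range(7)}
    for j in range(X.shape[1]):
        groups[(first_dow + j) % 7].append(X[:, j])
    return {k: np.column_stack(cols) for k, cols in groups.items() if cols}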

3.2. Build the Main Models

Now, seven day-of-week (DOW) models are built around the key concept of the median. The building algorithm is designed so that several workers can run in parallel to improve building performance.

To get a specific model $m_k$, a matrix $X_k$ is constructed by using all seasons (all columns) of group $G_k$:

$$X_k = \left[\,\mathbf{s}_{k,1}\ \mathbf{s}_{k,2}\ \cdots\ \mathbf{s}_{k,S_k}\,\right],$$

where $S_k$ is the number of complete seasons for a specific model $m_k$.

Seven DOW models ($m_k$, $k = 1, \ldots, K$, where $K = 7$ in this paper) are built by applying median filters to $X_k$:

$$m_k(t) = \operatorname*{median}_{j = 1, \ldots, S_k} X_k(t, j), \quad t = 1, \ldots, p.$$

We can present all models as columns of a matrix

$$M = \left[\,m_1\ m_2\ \cdots\ m_K\,\right], \quad M(t, k) = m_k(t),$$

where $k$ indicates the model index and $t$ indicates the time point (period) index of the day. Thus, $m_k = \left(m_k(1), \ldots, m_k(p)\right)^{\top}$, i.e., column $k$ contains all model values $m_k(t)$ over the time point indexes $t$ of the day belonging to model $m_k$.
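Given the grouping above, the model-building step can be sketched compactly (np.nanmedian skips the flagged nulls; this is an illustration, not the paper's code):

import numpy as np

def build_dow_models(groups):
    """One model per weekday: the element-wise median over that group's
    seasons. The seven medians are independent, so they can be computed
    by parallel workers as described above."""
    return {k: np.nanmedian(Xk, axis=1) for k, Xk in groups.items()}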

3.3. Adapt like Regressors

This part calculates fitted models using $M$-estimation, considering the above model matrix and each individual season.

An $M$-estimator is then computed iteratively with reweighted least squares (IRLS):

$$(\hat{a}_j, \hat{b}_j) = \operatorname*{arg\,min}_{a,\,b} \sum_{t=1}^{p} w_t \left( s_j(t) - a\, m_k(t) - b \right)^2,$$

where $a_j$ and $b_j$ are the scaling and addition parameters and the weights $w_t$ are computed from the residuals of the previous fit. Using a season $\mathbf{s}_j$ that belongs to model $m_k$ as an example, the residuals are

$$r_j(t) = s_j(t) - \hat{a}_j\, m_k(t) - \hat{b}_j.$$

Thus, the residual matrix is

$$R = \left[\,\mathbf{r}_1\ \mathbf{r}_2\ \cdots\ \mathbf{r}_S\,\right], \quad R(t, j) = r_j(t).$$

During the estimation, the weights are calculated as

$$w_t = \frac{\psi\left( r_j(t) / \hat{\sigma} \right)}{r_j(t) / \hat{\sigma}},$$

where $\hat{\sigma}$ is a scaling factor,

$$\hat{\sigma} = \frac{\operatorname{median}_t \left| r_j(t) \right|}{0.675},$$

and $\psi$ is in the Huber family:

$$\psi(u) = \begin{cases} u, & |u| \le c, \\ c \cdot \operatorname{sign}(u), & |u| > c, \end{cases}$$

where 0.675 is a constant and $c = 1.345$, which corresponds to 95% efficiency of the regression estimator. If the $M$-estimation fails (rarely), then constrained $M$-estimation (CM) [32] is used, which always works for our data. CM was proposed by Mendes and Tyler for regression and is more robust while keeping the same breakdown point (i.e., 1/2), though slower.
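The IRLS scheme above can be sketched as follows (a minimal illustration of $M$-estimation with Huber weights and the MAD/0.675 scale; it omits the CM fallback and is not the exact implementation used in the paper):

import numpy as np

def huber_irls(m, s, c=1.345, iters=20):
    """Robustly fit s(t) ~ a * m(t) + b by IRLS with Huber weights.
    Returns the scaling a, addition b, and the season's residuals."""
    mask = ~np.isnan(s) & ~np.isnan(m)
    mm, ss = m[mask], s[mask]
    w = np.ones_like(ss)
    for _ in range(iters):
        # weighted least squares under the current weights
        A = np.column_stack([mm, np.ones_like(mm)]) * np.sqrt(w)[:, None]
        (a, b), *_ = np.linalg.lstsq(A, ss * np.sqrt(w), rcond=None)
        r = ss - (a * mm + b)
        sigma = np.median(np.abs(r)) / 0.675 + 1e-12   # robust scale (MAD)
        u = r / sigma
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber psi(u)/u
    return a, b, s - (a * m + b)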

3.4. Construct the Residual Matrix

With the adapted models, the raw residuals can be calculated directly. However, the raw residuals exhibit different variation at different flow levels. Thus, this part also removes flow-level-related heteroscedasticity.

For the adapted models, i.e., $\hat{X}(t, j) = \hat{a}_j\, m_k(t) + \hat{b}_j$, let us take the values of the adapted models and round them to integers; we then get the flow levels, as integers, of each time point:

$$L(t, j) = \operatorname{round}\left( \hat{X}(t, j) \right).$$

Be aware that the flow levels are rounded from the adapted model values instead of the measured ones. For example, suppose the 9 am traffic is 85 in the DOW model, 90.3 in the adapted model, but only 10 in the measured traffic (due to an incident or the like); then, the traffic flow level is 90, i.e., the flow level is an adapted and generalized description representing what the traffic should be on a similar day.

Suppose the minimum and maximum integers (levels) in $L$ are

$$l_{\min} = \min_{t,\,j} L(t, j), \quad l_{\max} = \max_{t,\,j} L(t, j);$$

then we can generate a level vector $\mathbf{l} = \left( l_{\min}, l_{\min}+1, \ldots, l_{\max} \right)$, which contains all integers from $l_{\min}$ to $l_{\max}$, and $n_l = l_{\max} - l_{\min} + 1$ denotes the total number of flow levels.

For each level value $l$ in the level vector, the adapted models' levels undergo element-wise XNOR (equality) logic against $l$, and we get a mask matrix $B_l$ with ones indicating the time points/instances with flow level $l$:

$$B_l(t, j) = \mathbb{1}\left[ L(t, j) = l \right].$$

Let us apply this mask to $R$, take all the matched values, and then calculate the variance (standard deviation) for an arbitrary level $l$:

$$\sigma_l = \operatorname{std}\left\{ R(t, j) : B_l(t, j) = 1 \right\}.$$

Flagged (null) items and the related calculations are ignored during this process.

The variances for different levels vary; this is the heteroscedasticity. All variances for all levels are put together to form a variance/heteroscedasticity vector $\boldsymbol{\sigma} = \left( \sigma_{l_{\min}}, \ldots, \sigma_{l_{\max}} \right)$; note that residuals from neighbouring levels are used when the amount of residuals at a level is insufficient.

Later, all residuals are divided by the variance of their time point's level to get "normalized residuals." First, for the level of each time point, i.e., $L(t, j)$, find its corresponding variance $\sigma_{L(t, j)}$.

Generate a matrix of every residual's corresponding variance:

$$\Sigma(t, j) = \sigma_{L(t, j)}.$$

The normalized residuals are

$$\tilde{R}(t, j) = \frac{R(t, j)}{\Sigma(t, j)},$$

where the division is element-wise.
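A compact sketch of this FLH normalization for one season group (the pooling threshold min_count is an assumed value; the paper only states that neighbour levels are used when residuals are insufficient):

import numpy as np

def flh_normalize(fitted, resid, min_count=30, max_span=100):
    """Divide each residual by the standard deviation of its flow level.
    Neighbouring levels are pooled when a level has too few residuals."""
    levels = np.round(fitted)
    valid = ~np.isnan(levels) & ~np.isnan(resid)
    out = np.full(resid.shape, np.nan)
    for l in np.unique(levels[valid]):
        sel = valid & (levels == l)
        span, pool = 0, resid[sel]
        while pool.size < min_count and span < max_span:
            span += 1                                  # widen to neighbours
            pool = resid[valid & (np.abs(levels - l) <= span)]
        out[sel] = resid[sel] / pool.std()
    return out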

3.5. Detect Using Normalized Residuals

Finally, normalized residuals are sent to detection algorithms. The entire procedure is also presented in pseudocode (Algorithm 1).

1: procedure DOW-FLH (Original Time Series)
2:  set flags for consecutive zeros ▹ Handle Dirty Data
3:  for each day do
4:    find the time point index (TPI) of the lowest flow
5:  end for
6:  find TPIs’ median number as starts of daily seasons, e.g., 3 am
7:  for model in all DOW models do ▹ Build DOW Models
8:    take all seasons related to to a group
9:    remove flagged consecutive zeros
10:    calculate median of grouped seasons as the model
11:  end for
12:  for model in all DOW models do ▹Fit/Adapt to Get Scalings and Additions
13:    for each related season do
14:      remove flagged consecutive zeros
15:      estimate the scaling and addition by robustly fitting the model to the season
16:      round all values of the fitted model to integers as the season’s flow levels
17:      get residuals as the difference between the fitted and the season
18:    end for
19:  end for
20:  for each flow level do ▹ Standardize Residuals (FLH)
21:    take all residuals for this flow level (or with neighbours if not enough)
22:    calculate standard deviations (STD)
23:  end for
24:  consider all STDs with all flow levels as the flow level heteroscedasticity (FLH)
25:  divide each residual by its time point’s corresponding STD to standardize
26:  for each detection algorithm do ▹Detection
27:    feed the entire standardized residual time series to the algorithm
28:    get algorithm-specific anomalies or anomaly scores
29:  end for
30:  return the list of anomalies or anomaly scores
31: end procedure
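In the experiments, the standardized residual series is fed to SHESD, which is implemented in R; as a language-neutral stand-in showing the same consumption pattern, a trivial score-and-threshold detector could look like this (the threshold is purely illustrative, not the paper's method):

import numpy as np

def score_and_label(norm_resid, threshold=3.0):
    """Score each instance by its absolute standardized residual and
    threshold the scores to get binary anomaly labels."""
    scores = np.abs(norm_resid)
    return scores, scores > threshold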

4. Experiments

This section describes the data, the practical procedure, and the way we conduct the experiments.

4.1. Data Specification

The one-year-long real-world data are collected from a highway. Ground-truth anomaly (incident) labels are generated using the extended system mentioned in [33]. The data are imputed using the method from [34] before any processing. Each device sends a monitored flow record at five-minute intervals. Each record contains traffic statistics such as flow rate and average speed. This road carries undersaturated flow except around noon on holidays.

4.2. Experimental Setup

The experiments are run on a desktop computer with an AMD Ryzen 5 3600 (6 cores, 3600 MHz) and 16 GB of DDR4 memory. For fairness, we implement only our own method; the other algorithms are taken from publicly available sources such as GitHub.

Our implementation is done in the R programming environment, version 3.4.3, with RStudio 1.3.1056, AnomalyDetection 1.0, forecast 8.2, and feather 0.3.3, as well as in the Python programming environment, version 3.6.7/3.6.9, with the libraries arch 4.8.1, statsmodels 0.9.0, feather-format 0.4.0/0.4.1, numpy 1.16.0/1.19.4, pandas 0.23.4/1.1.4, scikit-learn 0.19.2/0.23.2, scipy 1.2.2/1.5.4, ipykernel 5.3.4, and ipython 7.16.1.

SHESD was originally implemented to give only binary results, so we modified it to also output anomaly/outlier scores. Also, as the maximum allowed anomaly (outlier) ratio is 50%, we assign all untested instances the same score as the lowest tested score.

4.3. Evaluation Measurement and Metrics

The receiver-operating characteristic (ROC) is used as the main evaluation tool, as it provides an accurate and visual way to present detection results. One important value derived from the ROC is the area under the curve (AUC), also known as A′ (“a-prime”) or the concordance statistic (c-statistic). It is a measure of goodness of fit that is often used to evaluate binary classification results; therefore, we use it here.
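For reference, the AUC can be computed directly with scikit-learn, which is part of the experimental environment listed above; the labels and scores here are hypothetical toy values:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example: ground-truth incident labels vs. anomaly scores.
y_true = np.array([0, 0, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.9, 0.2, 0.7, 0.3])
print(roc_auc_score(y_true, scores))   # AUC = area under the ROC curve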

5. Results and Analysis

As shown in Figure 5, our DOW and DOW-FLH methods are superior with regard to AUC. DOW with and without FLH perform similarly, both reaching an AUC of 0.693, which is 26.9% better than the other algorithms on average (AUC 0.546). Moreover, DOW-FLH is preferred for producing fewer false positives at the optimal cut-off point compared to DOW without FLH, given the data's sensitivity to false positives. For unbalanced datasets such as traffic flows, this behaviour has a positive influence: additional false-negative instances introduce only minor issues for true-negative ones, since negative instances are the majority, while the same number of false-positive instances impacts true anomalies (incidents) much more.

We analysed the detection ratios and AUCs for different situations and found some interesting results. For device-malfunction incidents, most algorithms fail to notice them, as shown in Figure 6. The possible reason is that the other algorithms track the no-flow situation without considering the normal situation. Note that good seasonal modelling (DOW) should be paired with suitable variance handling, as inappropriate variation handling (e.g., GARCH) may otherwise reduce the effectiveness.

Figure 7 shows the level-to-residual characteristics diagnostics. The mean of the residuals (blue line) is mostly under 2 but increases rapidly to about 5 when the flow level is greater than 150. This is due to the fact that extreme levels (greater than 150) occur only during a few major holidays, so this scenario is hard for the models to capture. The standard deviation (green line) is mainly increasing, which reflects one key problem, i.e., heteroscedasticity. The purple line represents the number of instances per level, which becomes very small for extreme scenarios at both ends of the x-axis. A span is used to include neighbouring levels when one level's corresponding instances are too few to calculate reasonable statistics. In summary, it can be seen that the mapping from levels to residual characteristics is nonlinear. This explains why the proposed data-driven algorithms perform better.

DOW successfully modelled the patterns, and FLH successfully suppressed heteroscedasticity for normal data compared to the others, as the residuals in Figure 8 show. The other algorithms, when compared to DOW-FLH, cannot distinguish data with abnormalities from data without them, as shown in Figure 9. This could be an advantage for GARCH-based methods when tracing rapid change in (nonseasonal) time series with heteroscedasticity, but here it becomes a disadvantage and hides possibly abnormal data instances. The problem with TBATS and SARIMA is that they could not successfully model the patterns and produce residuals with much noise, which leads to a low signal-to-noise ratio, as shown in Figure 10.

Previous work has shown that ARIMA and GARCH cannot adapt to seasonality with many periods, such as the 288 periods per season here. Instead, they adapt to local trends or rapid changes; therefore, they are not suitable for detecting anomalies that last beyond their detection abilities. This characteristic could be an advantage when quick short-term traffic prediction is needed.

6. Conclusion and Future Work

The experimental results show that the proposed DOW algorithm is good at matching multiseasonal time series patterns, and FLH can solve the heteroscedasticity problem. DOW-FLH-modelled residuals can be used for labelling anomalies; the chosen data can then be sent to either edges or centres for further processing.

As discussed above, the DOW-FLH method proposed in this work is good at modelling and labelling multiseasonal IoT time series for the edge-centre structure. However, the other compared algorithms, including SARIMA- and TBATS-based ones, are more mature and may be better at local trend prediction. Also, edge computing can engage crowdsourcing and related active learning [35] to make full use of the advantages provided by the edge-centre structure.

This point can be further tested in later research.

Labelling can be treated as a classification question, and many new algorithms can work on this task. In particular, recent developments in classification using belief theory show promising results [36], and this approach suits multisource scenarios in edge-centre computing well. Thus, it might be a good enhancement for our current work, and we look forward to investigating it further in future work.

In summary, the proposed DOW-FLH method performs well in experiments using multiseasonal IoT time series and should be considered when labelling is needed in an edge-centre computing structure.

Data Availability

Access to data is restricted in general; please contact the authors for access when necessary.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We would like to thank the reviewers for their detailed suggestions, which have indeed contributed greatly to this work. This work is supported by the Shandong Natural Science Foundation under grant ZR2020MF067 and the Shandong Key Research and Development Program under grant 2019JZZY021005.