Complexity

Volume 2018, Article ID 3683969, 21 pages

https://doi.org/10.1155/2018/3683969

## Simulation Study on Clustering Approaches for Short-Term Electricity Forecasting

Department of Informatics, Warsaw University of Life Sciences (SGGW), Warsaw, Poland

Correspondence should be addressed to Krzysztof Gajowniczek; krzysztof_gajowniczek@sggw.pl

Received 22 December 2017; Revised 21 February 2018; Accepted 11 March 2018; Published 26 April 2018

Academic Editor: Tiago Pinto

Copyright © 2018 Krzysztof Gajowniczek and Tomasz Ząbkowski. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Advanced metering infrastructures such as smart metering have begun to attract increasing attention; a considerable body of research is currently focusing on load profiling and forecasting at different scales on the grid. Electricity time series clustering is an effective tool for identifying useful information in various practical applications, including the forecasting of electricity usage, which is increasingly important as smart meters provide ever more data. This paper presents a comprehensive study of clustering methods for residential electricity demand profiles and further applications focused on the creation of more accurate electricity forecasts for residential customers. The contributions of this paper are threefold: using data from 46 homes in Austin, Texas, similarity measures for different time series are analyzed; the optimal number of clusters for representing residential electricity use profiles is determined; and an extensive load forecasting study using different segmentation-enhanced forecasting algorithms is undertaken. Finally, from the operator’s perspective, the implications of the results are discussed in terms of the use of clustering methods for grouping electrical load patterns.

#### 1. Introduction

Throughout the EU, there is considerable interest in smart electricity networks, where increased control over electricity supply and consumption is achieved by investments and improvements in new technologies such as advanced metering infrastructure. Smart metering is part of this movement and is perceived as a necessary step in achieving the EU’s energy policy goals by the year 2020 (i.e., cut greenhouse gas emissions by 20%, improve energy efficiency by 20%, and ensure that 20% of the EU’s energy demand is supplied by renewable sources) [1].

Clustering analysis is an unsupervised learning technique that has been widely used to identify different energy consumption patterns, particularly among commercial customers, individual industries, or large aggregations of residential customers [2]. Recently, the fast-growing stream of meter data has motivated further research on the application of clustering techniques to individual residential customers [3]. By clustering time series of hourly load data, each customer can be represented by a number of load patterns, thus allowing variability information to be derived. Clustering can therefore serve as a valuable preprocessing step, providing fine-grained information on customer attributes and sources of variation for subsequent modeling and customer segmentation.

Many dissimilarity measures between time series have been proposed in the literature. They can be grouped into four categories [4, 5]:

(i) Shape: Minkowski distance, short time series distance, and dynamic time warping distance.

(ii) Editing: edit distance for real sequences and longest common subsequence distance.

(iii) Features: cross-correlation-based distance, autocorrelation-based distances, Fourier coefficients-based distance, TQuest distance, and periodogram-based distances.

(iv) Structure: Maharaj distance and Piccolo distance.

This study examines the first three categories, as they can be used in the majority of cases. The fourth category requires some a priori assumptions that affect the obtained results. For instance, such measures assume that the observed time series is the result of a certain parametric base model, mainly the autoregressive integrated moving-average (ARIMA) model. This implies the need to specify the relevant parameters of the model in advance.

Dissimilarity between time series can also be computed using information theory. For instance, Kullback–Leibler (KL) divergence measures how one time series probability distribution diverges from a second (expected) probability distribution [6]. This measure is defined on distributions and is asymmetric, so it does not qualify as a statistical metric. In the simple case, a KL divergence of 0 indicates that we can expect similar, if not identical, behavior from the two time series distributions, while large KL divergence values indicate that the two distributions behave very differently. The second example is the mutual information (MI) of two time series, which measures the mutual dependence between the two investigated time series [7]. Intuitively, mutual information quantifies the information that two time series share: it indicates how much knowing one of the series reduces uncertainty about the other. For example, if two time series are independent, then knowing the first time series gives no information about the second and vice versa, so their mutual information is zero. At the other extreme, if the first time series is a deterministic function of the second, the mutual information is maximal.
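As an illustration (not part of the original study), both quantities can be estimated for discretized series with a short NumPy sketch; the histogram-based discretization and the bin count of 8 are our own assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions.

    A small eps avoids log(0) for empty bins; inputs are renormalized.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def mutual_information(x, y, bins=8):
    """Mutual information (in nats) between two series via a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1)   # marginal of x
    py = joint.sum(axis=0)   # marginal of y
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (px[i] * py[j]))
    return mi
```

Note that the histogram estimator is biased upward for short series; more refined estimators exist, but this sketch conveys the idea of both measures.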

As their input, most clustering algorithms take parameters such as the number of clusters, the density of clusters, or, at least, the number of points in a cluster. Nonhierarchical procedures usually require the user to specify the number of clusters before any clustering is accomplished, whereas hierarchical methods routinely produce a series of solutions ranging from n clusters (one per data object) down to a single cluster [8]. As such, the problems of determining a suitable number of clusters for a dataset and evaluating the clustering results have been considered by several researchers. The procedure of evaluating the results of a clustering algorithm is known as cluster validity analysis [9].

In general, there are three approaches for investigating the cluster validity. The first, based on external criteria, compares the cluster analysis results to some known results, such as externally provided class labels. The second approach, based on internal criteria, uses information obtained from within the clustering process to evaluate how well the results fit the data without reference to external information. The third approach is based on relative criteria and consists of evaluating a clustering structure by comparing it with other structures given by the same algorithm using different parameter values (e.g., the number of clusters). In this paper, we consider this third class of measures [10].

Based on the selected time series similarity measures and hierarchical clustering algorithms, this study developed forecasting models for the aggregate electricity demand of individual groups of households. The predictions given by these models were compared with the results of a base model built for all households (aggregated over 46 consumers from the *WikiEnergy* data [11]). In particular, using smart metering data, we aim to answer the following research questions:

(1) To what extent is it possible to provide accurate 24-hour load forecasting for a group of households?

(2) Which of the proposed time series similarity measures, and which of the measures determining the relevant number of clusters, gives the greatest increase in forecast accuracy?

(3) What kinds of forecasting methods and algorithms are appropriate for highly volatile data?

The remainder of this paper is organized as follows. In Section 2, various time series similarity measures, grouped into three categories, are introduced. As different clustering algorithms usually lead to different numbers of clusters, Section 3 discusses several measures for determining the relevant number of clusters. In Section 4, based on *WikiEnergy* data gathered from 46 households, various numerical experiments regarding the clustering of these households are presented. Section 5 describes the methods used in our forecasting experiments, and Section 6 presents the results from a number of numerical experiments that provide 24-hour forecasts at different data aggregation levels. Finally, Section 7 concludes the paper.

#### 2. Time Series Similarity Measures

The following subsections briefly describe three categories of time series similarity measures. In the remainder of this section, unless otherwise specified, X = (x_1, …, x_T) and Y = (y_1, …, y_T) denote partial realizations from two real-valued stochastic processes, respectively. Note that serial realizations of the same length T are initially assumed, although this limitation can be relaxed in some cases.

##### 2.1. Measures Based on the Shape of Time Series

A simple approach for measuring the similarity between X and Y is to consider conventional metrics based on the closeness of their values at specific points in time. Some commonly used raw-values-based dissimilarity measures are introduced below.

###### 2.1.1. Minkowski Distance

The Minkowski distance of order q, where q is a positive integer, is also known as the L_q-norm distance [12]. This measure is typically used with q = 2, giving the Euclidean distance, and is very sensitive to signal transformations such as shifting or time scaling (stretching or shrinking of the time axis). The proximity notion relies on the closeness of the values observed at corresponding points in time, so the observations are treated as if they were independent. In particular, the distance is invariant to permutations applied identically over time to both series [4, 5].
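For illustration, a minimal NumPy sketch of the L_q distance between two equal-length series (the function name is ours):

```python
import numpy as np

def minkowski_distance(x, y, q=2):
    """Minkowski (L_q) distance between two equal-length series.

    q=2 gives the Euclidean distance; q=1 the Manhattan distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

# Applying the same reordering of time points to both series leaves the
# distance unchanged, reflecting that observations are treated as independent.
```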

###### 2.1.2. Short Time Series Distance

The short time series (STS) distance was introduced by Möller-Levet et al. [13] as a metric that adapts to the characteristics of irregularly sampled series [4, 5].

###### 2.1.3. Dynamic Time Warping Distance

The goal of dynamic time warping (DTW) is to find patterns in time series [14]. The DTW distance determines a mapping between the series which minimizes a specific distance measure between the coupled observations. This allows similar shapes to be recognized, even in the presence of signal transformations such as shifting and/or scaling, and ignores the temporal structure of the values because the proximity is based on the differences, regardless of the behavior around these values [4, 5].
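A minimal dynamic-programming sketch of the DTW distance, assuming an absolute-difference local cost (other local costs are possible):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance with absolute-difference local cost.

    D[i][j] holds the minimal cumulative cost of aligning x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Because the warping path may repeat points, a peak shifted by one time step can still be matched at zero cost, illustrating DTW's robustness to shifting.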

##### 2.2. Measures Based on Editing the Time Series

The edit distance, which was initially developed to calculate the similarity between two sequences of strings, is based on the idea of counting the minimum number of edit operations (delete, insert, and replace) that are necessary to transform one sequence into the other. The problem of working with real numbers is that it is difficult to find exact matching points in two different sequences and, therefore, the edit distance is not directly applicable.

###### 2.2.1. Edit Distance for Real Sequences

The distance between points in the time series is reduced to 0 or 1 [15]. If two points x_i and y_j are closer to each other in absolute value than some user-specified threshold ε, they are considered to be equal; if they are further apart, they are considered to be distinct and the distance between them is set to 1. As an additional property, the edit distance for real sequences (EDR) permits gaps or unmatched regions in the time series but penalizes them with a value equal to their length [4, 5].
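The thresholded edit recursion can be sketched as follows (a minimal version; the default ε is an arbitrary choice):

```python
def edr_distance(x, y, eps=0.5):
    """Edit distance for real sequences (EDR).

    Two points match (cost 0) if their absolute difference is within eps;
    otherwise substitution costs 1, and every gap element costs 1.
    """
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # deleting i elements costs i
    for j in range(m + 1):
        D[0][j] = j          # inserting j elements costs j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subcost = 0 if abs(x[i - 1] - y[j - 1]) <= eps else 1
            D[i][j] = min(D[i - 1][j - 1] + subcost,  # match / replace
                          D[i - 1][j] + 1,            # delete (gap in y)
                          D[i][j - 1] + 1)            # insert (gap in x)
    return D[n][m]
```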

###### 2.2.2. Longest Common Subsequence Distance

In this metric, the similarity between two time series is quantified in terms of the longest common subsequence (LCSS), with gaps or unmatched regions permitted [16]. As with EDR, the initial mapping between the series uses the Euclidean distance between two points, which is reduced to 0 or 1 depending on a threshold value ε [4, 5].
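A minimal sketch of the thresholded LCSS recursion, together with one common way of turning the subsequence length into a distance (the normalization is our assumption):

```python
def lcss_similarity(x, y, eps=0.5):
    """Length of the longest common subsequence under an eps matching rule."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) <= eps:
                L[i][j] = L[i - 1][j - 1] + 1   # points match: extend subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # skip one point (gap)
    return L[n][m]

def lcss_distance(x, y, eps=0.5):
    """Distance derived from LCSS, normalized to [0, 1]."""
    return 1.0 - lcss_similarity(x, y, eps) / min(len(x), len(y))
```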

##### 2.3. Measures Based on the Time Series Features

Instead of using the raw data values in the series, this category of distance measures aims to extract a set of features from the time series and calculate the similarity between these features.

###### 2.3.1. Distance Based on the Cross-Correlation

This distance is based on the cross-correlation between two time series. The maximum lag considered in the calculation should not exceed the length T of the series [4, 5].

###### 2.3.2. Autocorrelation-Based Distances

Several researchers have considered measures based on the estimated autocorrelation functions [17]. These rely on the autocorrelations of the two time series estimated up to some maximum lag [4, 5].
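A minimal sketch of one such measure, assuming a simple sample-ACF estimator and a plain Euclidean comparison of the two autocorrelation vectors (the lag cutoff of 10 is an arbitrary choice; weighted variants also exist):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(1, max_lag + 1)])

def acf_distance(x, y, max_lag=10):
    """Euclidean distance between the estimated autocorrelation vectors."""
    return float(np.linalg.norm(acf(x, max_lag) - acf(y, max_lag)))
```

Because the ACF is invariant to phase shifts, two load profiles with the same periodic structure but offset in time appear close under this measure.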

###### 2.3.3. Periodogram-Based Distances

Most of the measures discussed so far operate in the same domain, that is, the time domain. However, the signal representation in the frequency domain provides a good alternative for measuring the similarity between time series. The key idea is to assess the similarity between the corresponding spectral representations of the time series values [4, 5, 18].

###### 2.3.4. Fourier Coefficients-Based Distances

This measure is based on comparing the discrete Fourier transform coefficients of the series [19]. The value of each coefficient measures the contribution of its associated frequency to the series. Based on this, the inverse Fourier transform provides a means of representing the sequences as a combination of sinusoidal forms. Note that the Fourier coefficients are complex numbers with real and imaginary parts. In the case of real sequences such as time series, the discrete Fourier transform is symmetric, and therefore it is sufficient to study the first half of the coefficients. Furthermore, it is commonly considered that most of the information is found within the first n Fourier coefficients, where n is much smaller than the series length T. Based on this, the distance between two time series is given by the Euclidean distance between their first n coefficients [4, 5].
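For illustration, a minimal NumPy sketch (the truncation to 5 coefficients is an arbitrary choice; `numpy.fft.rfft` already returns only the non-redundant half of the spectrum for real input):

```python
import numpy as np

def fourier_distance(x, y, n_coeff=5):
    """Euclidean distance between the first n_coeff Fourier coefficients.

    rfft exploits the symmetry of the DFT of a real sequence, so only the
    non-redundant half-spectrum is computed; we then keep the leading
    coefficients, which usually carry most of the series' energy.
    """
    fx = np.fft.rfft(np.asarray(x, dtype=float))[:n_coeff]
    fy = np.fft.rfft(np.asarray(y, dtype=float))[:n_coeff]
    return float(np.sqrt(np.sum(np.abs(fx - fy) ** 2)))
```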

###### 2.3.5. TQuest Distance

The fundamental idea of the TQuest distance [20] is to define a set of intervals in a time series in which the stochastic process exceeds a predetermined threshold τ. The final distance between two time series is defined in terms of the similarity between the sets of intervals based on this threshold value. Intuitively, two time intervals are said to be similar if they have similar start and end points. TQuest is independent of the size of the individual time series; this is important because time series exhibiting similar properties can be converted into intervals of different lengths. Another advantage is that this measure only takes into account the local similarity, with the remaining (continued) intervals not affecting the final result [4, 5].

#### 3. Measures for Determining the Relevant Number of Clusters

Different clustering algorithms usually lead to different clusters of data; even for the same algorithm, the selection of different parameters or the order in which data objects are presented can greatly affect the final clustering partitions. Thus, effective evaluation standards and criteria are critically important to ensure confidence in the clustering results. At the same time, these assessments provide meaningful insights into how many clusters are hidden in the data. In most real-life clustering situations, users face the dilemma of selecting the number of clusters or partitions in the underlying data. As such, numerous indices for determining the number of clusters in a dataset have been proposed.

##### 3.1. CH Index

The value of k that maximizes CH(k) specifies the optimal number of clusters [21].

##### 3.2. C Index

The minimum value of the C index [21] indicates the relevant number of clusters in a given dataset.

##### 3.3. Duda Index

The Duda index [21] uses the ratio Je(2)/Je(1) as a criterion, where Je(2) is the sum of squared errors within clusters when the data are partitioned into two clusters and Je(1) is the squared error when only one cluster is present. The optimal number of clusters is that which gives the smallest value of this ratio.

##### 3.4. Ptbiserial Index

This index [21] is simply a point-biserial correlation between the raw input dissimilarity matrix and a corresponding matrix consisting of 0s or 1s. A value of 0 is assigned if the two corresponding points are clustered together by the algorithm; a value of 1 is assigned otherwise. Given that larger positive values reflect a better fit between the data and the obtained partition, the maximum value of the index is used to select the optimal number of clusters in the dataset.

##### 3.5. DB Index

This index [21] is a function of the ratio of within-cluster scatter to between-cluster separation. The value that minimizes the index is considered to be the optimal number of clusters in a given dataset.

##### 3.6. Frey Index

The Frey index [21] can only be applied to hierarchical methods; it is the ratio of difference scores from two successive levels in the hierarchy. The numerator is the difference between the mean between-cluster distances at levels j and j + 1, whereas the denominator is the difference between the mean within-cluster distances at levels j and j + 1.

##### 3.7. Hartigan Index

The maximum difference between hierarchy levels is taken to indicate the correct number of clusters in the data [21].

##### 3.8. Ratkowsky Index

Charrad et al. [21] describe a criterion for determining the optimal number of clusters based on the ratio BGSS/TSS averaged over variables and adjusted for the number of clusters k, where BGSS is the sum of squares between the clusters (groups) for each variable and TSS is the total sum of squares for each variable. The optimal number of clusters is the value of k that maximizes this criterion.

##### 3.9. Ball Index

Ball and Hall [21] proposed an index based on the average distance between data items and their respective cluster centroids. The largest difference between levels is used to indicate the optimal solution.

##### 3.10. McClain Index

The McClain and Rao index [21] is the ratio of the average within-cluster distance (the total within-cluster distance divided by the number of within-cluster distances) to the average between-cluster distance (the total between-cluster distance divided by the number of between-cluster distances). The minimum value of this index indicates the optimal number of clusters.

##### 3.11. KL Index

The value of k that maximizes KL(k) [21] specifies the optimal number of clusters.

##### 3.12. Silhouette Index

The maximum value of this index is used to determine the optimal number of clusters in the data [21].

##### 3.13. Dunn Index

The Dunn index [21] is the ratio between the minimal intercluster distance and the maximal intracluster distance. If the dataset contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, the Dunn index should be maximized.
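As an illustration of how such an internal index can be computed (a minimal sketch; real validity packages such as NbClust offer optimized implementations):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: minimal inter-cluster distance / maximal cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    # maximal intracluster distance (largest cluster diameter)
    max_diam = max(
        np.max(np.linalg.norm(c[:, None] - c[None, :], axis=-1)) for c in clusters
    )
    # minimal distance between points belonging to different clusters
    min_inter = min(
        np.min(np.linalg.norm(a[:, None] - b[None, :], axis=-1))
        for i, a in enumerate(clusters) for b in clusters[i + 1:]
    )
    return float(min_inter / max_diam)
```

Larger values mean compact, well-separated clusters, so one would evaluate the index for each candidate k and keep the maximum.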

##### 3.14. SD Index

The SD validity index is based on the concepts of average scattering for clusters and total separation between clusters [21]. The number of clusters that minimizes this index gives the optimal value for the number of clusters in the dataset.

#### 4. Splitting Households into Clusters

##### 4.1. Data Characteristics

Numerical analyses were performed using data from 46 households taken from *WikiEnergy*. The *WikiEnergy* dataset, constructed by Pecan Street Inc., is a large database of consumer energy information. This database is highly granular, including usage measurements collected from up to 24 circuits within the home. The households considered in the analysis are located in Austin, Texas, USA. We extracted 14 months of data (March 2013–April 2014) from 46 households in nearby neighborhoods at a granularity level of 1 hour. Thus, it was possible to aggregate the hourly demand values and divide the consumers into homogeneous groups [1]. From the aggregated data (Figure 1), we could see that the highest electricity consumption takes place in summer, between June and August, most likely due to air conditioning.