Abstract

A functional data analysis (FDA) based methodology for detecting anomalous flows in urban water networks is introduced. Primary hydraulic variables are recorded in real-time by telecontrol systems, so they are functional data (FD). In the first stage, the data are validated (false data are detected) and reconstructed, since there could be not only false data, but also missing and noisy data. FDA tools are used, such as tolerance bands for FD and smoothing for dense and sparse FD. In the second stage, functional outlier detection tools are used in two phases. In Phase I, the data are cleared of anomalies to ensure that they are representative of the in-control system. The objective of Phase II is system monitoring. A new functional outlier detection method based on archetypal analysis is also proposed. The methodology is applied and illustrated with real data. A simulation study is also carried out to assess the performance of the outlier detection techniques, including our proposal. The results are very promising.

1. Introduction

The objective of a water distribution system is to supply water to meet user demand. Therefore, the water distribution system should be monitored in order to identify any fault in the system (Rodriguez et al. [1]). Detecting burst pipes, exceptional water use, or any other anomalous behavior in the water supply network is a very important problem. Early detection of any anomaly in the water network is very valuable. Nowadays, thanks to technology, we are able to record primary hydraulic variables in real-time. Nevertheless, all these data are useless without adequate statistical processing that enables information to be extracted for decision-making (Palau et al. [2]).

The recording of data by telecontrol systems allows data to be saved every few minutes. However, this type of system usually presents two kinds of problems (Quevedo et al. [3]). On the one hand, poor maintenance of flow meters generates false data, i.e., somewhat noisy data, sometimes disproportionately so. On the other hand, communication problems with the sensor result in missing data. Both problems have to be solved during the first stage, which is referred to as validation/reconstruction in the hydraulic literature. In the second stage, data can be used to identify atypical behaviors in terms of water use, burst pipes, or illegal connections.

For the first stage, several alternatives have been proposed. The validation methods (finding erroneous data) include elementary signal-based (“low-level”) methods (Burnell [4]), as well as other model-based (“higher level”) methods. These also provide a methodology for reconstructing the signal in case of missing data. For instance, time series analysis was used by Quevedo et al. [3] and Prescott and Ulanicki [5], Kalman filters by Piatyszek et al. [6], neural networks by Valentin and Denœux [7], etc. As regards burst pipe detection, different techniques have been considered, such as multivariate principal component analysis (Palau et al. [2]), artificial neural networks (Mounce et al. [8]), a Bayesian system identification methodology (Poulakis et al. [9]), or a modified cumulative sum (CUSUM) test (Misiunas et al. [10]). In industrial quality control, the statistical monitoring of complex signals goes under the name of profile monitoring. According to Colosimo and Pacella [11], the approaches for profile monitoring share the same structure: control charts are designed for the estimated parameters of a certain parametric model.

A different point of view is taken in this paper. Note that our data are recorded every 5 minutes, so they are discretized functions. However, a continuous curve or function lies behind these data. Therefore, we propose a functional data analysis (FDA) approach for both stages. To the best of our knowledge, this is the first time this kind of data has been viewed as functional data (FD), which they are in fact.

FDA comprises statistical techniques for functional observations; i.e., a whole function is a datum. The goals of FDA are basically the same as those of any other branch of statistics. Although FDA is a relatively new field, some references have already become classics, such as Ramsay and Silverman [12], who offer an outstanding overview, Ferraty and Vieu [13], with new methodologies for studying FD nonparametrically, Ramsay and Silverman [14], who provide interesting applications in different disciplines, and Ramsay et al. [15] regarding software in this field. In the water domain, FDA has been used for analyzing water quality (Henderson [16]; Díaz-Muñiz et al. [17]; Yan et al. [18]), classifying stream-flow hydrographs (Ternynck et al. [19]), or identifying major water usage patterns (Cheifetz et al. [20]), for example. FDA has the advantage of exploiting all the available information on shape, peaks, and timing; i.e., all the information contained in the time series is taken into account.

In this paper, we propose an FDA-based method for detecting anomalous flows in urban water networks. The main novelties of this work consist of treating real-time primary hydraulic variables as FD for the first time, proposing a method entirely based on FDA to validate and reconstruct those data and identify anomalies, and proposing a new procedure for functional outlier detection based on functional archetypal analysis (Epifanio [21]), whose performance is compared with that of other procedures. Section 2 presents the available data and introduces the procedures for Stages 1 and 2. A review of FDA techniques and a new proposal for functional outlier detection are also presented. In Section 3, the proposal is applied to real data and simulated data. The experiments were conducted in R (R Core Team [22]). Finally, conclusions and future work are discussed in Section 4.

2. Material and Methods

2.1. Data

Our data consist of the water inflows recorded every 5 minutes by flow meters in three district metering areas (DMA) belonging to three municipal water utilities on the eastern coast of Spain. For the sake of confidentiality, we refer to these locations as A, B, and C. There are 288 observations per day and the observations from three years (2014, 2015, and 2016) are available. Although pressure levels were also available, according to an expert in hydraulic engineering, the inflows would be more informative, so pressure levels are not considered.

It is supposed that the water demand pattern is repeated every day, but it varies at weekends, and there are also changes in demand between winter and summer (Rodriguez et al. [1]). Therefore, we decided to divide the data into 8 groups according to the four seasons and weekdays or weekends. For each group, the method is outlined in the following section. As regards holidays, we have kept them in their corresponding group, although their patterns may not correspond to weekdays or weekends. For example, Christmas Days are detected as anomalous in the set of both weekdays (years 2014 and 2015) and weekends (year 2016). The data could also be divided according to the particularities of the location.

2.2. Procedure for Detecting Anomalies

The outline of the procedure (details are given in the following sections) is as follows.

Stage 1. Validation/reconstruction:
  Step 1: length of the interval (definition of the period)
  Step 2: visualization (rainbow plots)
  Step 3: cleaning of false data
  Step 4: smoothing
    Situation 1: smoothing dense FD
    Situation 2: smoothing sparse FD

Stage 2. Identifying anomalies:
  Phase I: iterative procedure to detect functional outliers
  Phase II: system monitoring

2.3. Stage 1

Data are collected as a long time series, but as mentioned above the water demand pattern is repeated every day. The first step is to decide the length of the interval over which functions are defined, i.e., the functional dimensionality (Ramsay and Silverman [12]). It can range from a short interval, such as 30 minutes, to a long interval (24 hours or more). Remember that data are recorded every 5 minutes (data resolution), so intervals should contain at least several observations for each function. In Section 3, the results for intervals of different lengths will be shown, although we will concentrate on a medium-length interval: 6 hours. Specifically, we will focus on results for night hours, from 00:00 to 6:00 a.m., since, according to an expert in hydraulic engineering, the water demand is fairly stable during this period (Misiunas et al. [10]) and it is easier to identify anomalous behaviors.
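As an illustration, a minimal R sketch of how the 5-minute series can be cut into one curve per night, assuming a hypothetical vector flow that stacks the 288 daily readings day after day:

    # 'flow' is a hypothetical vector with 288 five-minute readings per day
    M <- matrix(flow, nrow = 288)   # one column per day
    night <- M[1:72, ]              # 00:00 to 6:00 a.m.: the first 72 readings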

In the next step it is advisable to examine the data by carrying out an exploratory data analysis. Visualization methods help users reason about the data and discover features that might not have been apparent from mathematical models and summary statistics, such as a change of pattern over time. A rainbow plot (Hyndman and Shang [23]), a simple but effective display in which curves are colored according to their order in time, can be used. The colors of the functions follow the order of a rainbow, with the oldest data in red and the most recent data in violet. It is available in the R package rainbow (Shang and Hyndman [24]).
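For instance, a rainbow plot of the night curves can be produced as follows (a sketch assuming the hypothetical matrix night built above, with one column per day):

    library(rainbow)
    # functional data set: x = time of day (hours), one curve per column of y
    fd <- fds(x = seq(0, 6, length.out = 72), y = night,
              xname = "Hour", yname = "Flow")
    plot(fd)  # curves colored by temporal order: oldest in red, newest in violet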

The advantages of an FD approach are evident in the following steps. In the fourth step, the discrete measures recorded at time points $t_j$, with $j = 1, \ldots, m_i$, from a functional datum $x_i$ ($i = 1, \ldots, n$, where $n$ is the sample size) must be converted into a function $x_i(t)$. As explained in Section 1, these values are not free from errors and smoothing should be carried out to remove those errors. Smoothing is performed by representing functions with basis functions. This has the advantage that it can be applied to data where the functions have not all been recorded at the same instants, observations do not have to be equally spaced, and the number of sampling points can vary across cases. This is the case in the applications in question because communication problems with the sensor result in missing data.

The key point is to choose a proper basis and number of basis functions (less computation is required for a smaller number of basis functions). This is a prevalent issue in all FDA problems. Ideal basis functions should have characteristics that match those known to belong to the functions being approximated (see Ramsay and Silverman [12] for a precise and detailed explanation about smoothing FD). In addition, the individual functions can be smoothed in a different way depending on whether the number of missing data is high or low. There are therefore two possible situations.

In both situations, once the function has been expressed as a linear combination of basis functions, we can evaluate it at any desired time point, obtain estimates for the periods with missing data, and correct noisy data. However, FDA works with basis coefficients, not discretized functions, since all the information is better stored as coefficients. They provide the flexibility that we need, together with the computational power to summarize hundreds of sampling points in a short vector of coefficients. Furthermore, coefficients allow us to express the calculations within the well-known field of matrix algebra.

Nevertheless, before moving on to the smoothing step, disproportionately noisy data (false data) should be located and removed. Smoothness assumes that a pair of adjacent data values, $x(t_j)$ and $x(t_{j+1})$, are similar. Otherwise, data should be treated as multivariate rather than functional. In our application, the smoothness property makes sense.

2.3.1. Cleaning False Data

Tolerance intervals for univariate data provide limits that contain a given proportion ($P$) of individual observations in a population with a given level of confidence ($1 - \alpha$). They should not be confused with confidence intervals or prediction intervals. We can use tolerance intervals to identify unusual values.
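For intuition in the univariate case, a normal tolerance interval can be computed, for example, with the R package tolerance (a sketch with simulated data; the functional case requires the method described next):

    library(tolerance)
    set.seed(1)
    x <- rnorm(200)
    # two-sided interval expected to contain P = 0.97 of the population
    # with confidence 1 - alpha = 0.95
    normtol.int(x, alpha = 0.05, P = 0.97, side = 2)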

Tolerance intervals were extended to the functional framework by Rathnayake and Choudhary [25], who defined tolerance bands for both dense and sparse FD. They used Functional Principal Components Analysis (FPCA) in a mixed model framework to represent the measurements, and the tolerance factors needed for the bands are approximated by bootstrapping. The R code is available from Rathnayake and Choudhary [25]. We have used it with $\alpha = 0.05$ and $P = 0.97$. Data points outside the tolerance bands are considered suspected false data; they are removed and treated as missing data.

2.3.2. Situation 1: Smoothing Dense Functional Data

When functions are densely observed, i.e., the number of missing data is low, the basis coefficients can be estimated separately for each function. This is the most common situation in our application. In the basis approach, each function $x_i$ is expressed as a linear combination of $K$ known basis functions $\phi_k$: $x_i(t) = \sum_{k=1}^{K} c_{ik}\phi_k(t) = \mathbf{c}_i'\boldsymbol{\phi}(t)$, where $'$ stands for transpose, $\mathbf{c}_i$ indicates the vector of length $K$ of the coefficients, and $\boldsymbol{\phi}$ the functional vector whose elements are the basis functions. Coefficients are estimated by ordinary least squares, although other kinds of smoothing such as the weighted least squares or regularization approaches can be considered. All of these are available in the R package fda (Ramsay et al. [26]). Specifically, the following expression should be minimized: $\mathrm{SSE}_i = \sum_{j=1}^{m}\left(x_{ij} - \mathbf{c}_i'\boldsymbol{\phi}(t_j)\right)^2 = (\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{c}_i)'(\mathbf{x}_i - \boldsymbol{\Phi}\mathbf{c}_i)$, where $\boldsymbol{\Phi}$ is an $m \times K$ matrix that contains the values $\phi_k(t_j)$. The solution is $\hat{\mathbf{c}}_i = (\boldsymbol{\Phi}'\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}'\mathbf{x}_i$, and the fitted values are $\hat{\mathbf{x}}_i = \boldsymbol{\Phi}\hat{\mathbf{c}}_i$.
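In R, this least squares smoothing takes a few lines with the fda package; a sketch assuming tt is the vector of observation times and y a matrix with one column of discrete flow values per curve:

    library(fda)
    basis <- create.bspline.basis(rangeval = range(tt), nbasis = 15, norder = 4)
    fdobj <- smooth.basis(argvals = tt, y = y, fdParobj = basis)$fd
    # in basis form, curves can be evaluated anywhere, including gaps
    xhat <- eval.fd(seq(min(tt), max(tt), length.out = 200), fdobj)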

In the application, the B-spline basis system of order 4 (cubic splines) with equally spaced knots has been selected due to its suitability for nonperiodic data (Ramsay and Silverman [12]), its great flexibility, and very fast computation. As suggested by Ramsay and Silverman [12], the number of bases can be determined by selecting the $K$ that makes the error variance estimate $s^2 = \mathrm{SSE}/(m - K)$ decrease substantially. For the whole sample formed by $n$ functions, we computed the pooled variance using 4 to 22 bases and selected the number of bases accordingly.
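A sketch of this selection under the same assumptions as above (tt, y), computing the pooled error variance for 4 to 22 bases and looking for the elbow:

    K <- 4:22
    pv <- sapply(K, function(k) {
      b <- create.bspline.basis(range(tt), nbasis = k, norder = 4)
      sm <- smooth.basis(tt, y, b)$fd
      # pooled variance: SSE over all n curves divided by n(m - K)
      sum((y - eval.fd(tt, sm))^2) / (ncol(y) * (length(tt) - k))
    })
    plot(K, pv, type = "b")  # choose the K where the decrease levels off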

2.3.3. Situation 2: Smoothing Sparse Functional Data

When functions are sparsely observed, i.e., the number of missing data is high with long periods without data, the information from all functions should be employed to estimate the coefficients for each function (James [27]). This situation is not very common in our application, but we have used the well-established methodology developed by Yao et al. [28], Principal Components Analysis through Conditional Expectation (PACE), for illustration purposes in Section 3. Yao et al. [28] proposed a version of FPCA in which the FPC scores are framed as conditional expectations. The function $x_i$ using the first $K$ eigenfunctions is approximated by

$\hat{x}_i(t) = \hat{\mu}(t) + \sum_{k=1}^{K} \hat{\xi}_{ik}\,\hat{\phi}_k(t),$

where $\hat{\mu}$ is the estimate of the mean function, $\hat{\xi}_{ik}$ are the FPC scores, and $\hat{\phi}_k$ are the estimated eigenfunctions. PACE uses local smoothing techniques to estimate the mean and covariance functions of the trajectories, and it is implemented by the function FPCA in the R package fdapace (Dai et al. [29]) (default parameters are considered in the computation). With the scores and the estimated eigenfunctions, we obtain an approximation of the functions, which can be used to predict unobserved portions of them.
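A sketch of the PACE computation with fdapace, assuming the sparse observations are stored in two hypothetical lists, Ly (observed values per curve) and Lt (the corresponding time points):

    library(fdapace)
    res <- FPCA(Ly, Lt)   # default parameters, as in the text
    xhat <- fitted(res)   # reconstructed trajectories on a regular grid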

2.4. Stage 2

Once the data have been cleaned, reconstructed, and smoothed, we are ready to find anomalous behaviors. In statistical terms, this is equivalent to finding outliers. We distinguish two phases, as in control chart applications (Montgomery [30]). The first phase is a retrospective analysis, where data are analyzed to find anomalies and a set of data that is free from anomalies is obtained for use in Phase II. In Phase II we monitor the process, so each successive datum is analyzed to identify whether or not it is an anomaly.

Before explaining these phases, let us review the types of functional outliers and several functional methods for detecting functional outliers. A new method for detecting functional outliers is also introduced.

2.4.1. Types of Functional Outliers

According to Febrero et al. [31], functional outliers can result from errors in measurement and recording or, despite being correctly observed functions, from other causes that make them not follow the same pattern as the rest of the curves. In Section 2.3.1, false data due to measurement errors are detected. In this paper, we will detect anomalous behaviors. Another question is to identify their potential root causes, which is key in statistical quality control.

Hyndman and Shang [23] distinguish two types of functional outliers. The first are the so-called magnitude outliers, which correspond to curves that lie outside the range of the vast majority of the data. This can either happen for a short period only (Hubert et al. [32] call these isolated outliers) or be persistent. Persistent outliers can have the same shape as other curves, but their scale differs; Hubert et al. [32] call these amplitude outliers. The shape can even be identical but translated in time; Hubert et al. [32] call this type shift outliers. The second type are the so-called shape outliers, whose shape differs from the rest without necessarily standing out at any particular time point. Of course, outliers may exhibit a combination of these characteristics.

Unlike magnitude outliers, which are easily detected even by the naked eye, identifying shape outliers is not so easy. In fact, shape outliers may be camouflaged in the middle of the sample.

2.4.2. Methods for Detecting Functional Outliers

Several methods for functional outlier detection have been developed recently. Most of them rely on different notions of functional depth, such as Febrero et al. [33], Febrero et al. [31], Sun and Genton [34], Gervini [35], and Arribas-Gil and Romo [36], together with Hubert et al. [32] for the multivariate functional case. Others rely on robust principal components (PCs) (Hyndman and Ullah [37], Hyndman and Shang [23], and Sawant et al. [38]) or random projections (Fraiman and Svarc [39]). R code is available for most of these methods. We review them below and introduce the acronyms used to refer to them.

On the one hand, the R package rainbow (Shang and Hyndman [24]) implements many of those methods and others. Specifically: first, the method by Febrero et al. [33] uses a likelihood ratio test (LRT). Second, in Febrero et al. [31] outliers are determined as functions whose depth levels are below a certain threshold, which is determined by a bootstrap procedure based on either trimming (TRIM) or weighting of the sample (POND). These methods are also implemented in the R package fda.usc (Febrero-Bande and de la Fuente [40]). Third, the method by Hyndman and Ullah [37] (ISFE) uses integrated squared forecast errors. Fourth, the method by Rousseeuw and Leroy [41] (RMAH) uses the robust Mahalanobis distance but considers the functions as multivariate observations. Fifth, Hyndman and Shang [23] propose methods such as the functional highest density region (HDR) boxplot. On the other hand, the R package fda (Ramsay et al. [26]) implements the method (FB) by Sun and Genton [34], who extend the classical boxplot to FD. The outliergram (OUG) proposed by Arribas-Gil and Romo [36] is available in the R package roahd (Tarabelloni et al. [42]). In the R package mrfDepth (Segaert et al. [43]), we find the implementation of the Functional Outlier Map (FOM) by Hubert et al. [32], improved by Rousseeuw et al. [44], which is based on functional outlyingness measures.
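Several of these detectors can be tried directly from R; a sketch assuming fd is an fds object as built in Step 2 of Stage 1, and fdobj and tt the smoothed curves and grid from Stage 1 (the method strings follow the rainbow documentation):

    library(rainbow)
    foutliers(fd, method = "lrt")         # LRT
    foutliers(fd, method = "depth.trim")  # TRIM
    foutliers(fd, method = "robMah")      # RMAH
    library(fda)
    fbplot(eval.fd(tt, fdobj))            # FB: functional boxplot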

2.4.3. Detection of Functional Outliers by Archetype Analysis

Our proposal is based on archetype analysis (AA), which was introduced by Cutler and Breiman [45]. The objective of AA is to approximate data through a convex combination of pure or extremal types called archetypes. The premise of this is that extremes are better than central points for human interpretation (Thurau et al. [46]). Archetypes are built as a convex combination of observations. A variation of AA is archetypoid analysis (ADA), which was introduced by Vinué et al. [47]. The pure types in ADA are not a mixture of cases, but archetypoids are real cases from the sample. Archetypal analysis has been applied in different fields, but the only known references in engineering are Epifanio et al. [48], Vinué et al. [47], and Epifanio et al. [49]. AA and ADA have recently been extended to dense (Epifanio [21]) and sparse FD (Vinué and Epifanio [50]). AA can be computed with the R package archetypes (Eugster and Leisch [51]) and ADA with Anthropometry (Vinué [52]).

Besides archetypal patterns, both return the mixture coefficients contained in an $n \times k$ matrix $\boldsymbol{\alpha}$, where $k$ is the number of archetypes (or archetypoids). The coefficients indicate the amount of the contribution of each archetype (or archetypoid) to the approximation of each case. Each $\alpha_{ij}$ is the weight of the archetype (or archetypoid) $j$ for the case $i$, with $\alpha_{ij} \geq 0$ and $\sum_{j=1}^{k} \alpha_{ij} = 1$ for $i = 1, \ldots, n$.

Archetypes are located on the boundary of the convex hull of data (Cutler and Breiman [45]). Therefore, AA and ADA are sensitive to outliers (Eugster and Leisch [53]). This is precisely what can be used to locate outliers. The basic idea is described below.

Let us assume that we have a homogeneous sample of FD, i.e., generated from a unique (unimodal) distribution. Then, it could be reasonable to think that the observed FD are between two extremal functions (two functional archetypes or archetypoids); i.e., the sample of FD belongs approximately to the band (López-Pintado and Romo [54]) given by the two extremal functions (see Figure 1 for an illustration of this idea). Therefore, the sample can be approximated by a mixture of these two extreme functions. In fact, this idea was used by D’Esposito and Ragozini [55] for ranking multivariate data. As a consequence, the coefficients will be distributed from 0 to 1 continuously with a strictly positive density function for each archetype (or archetypoid).

Let us assume now that there is at least one outlier generated from another distribution. Then, if we consider = 3 archetypes (or archetypoids), in an ideal setting one of the archetypes (or archetypoids) should be similar to the outlier (in the case of ADA, one of the archetypoids should be one of the outliers), which is an extremal function, but the other two archetypes (or archetypoids) should be the two extremal functions covering the rest of the sample. In that situation, the distribution of the coefficients corresponding to the archetype (or archetypoid) similar to the outlier will have a special profile: only a few cases would have high values (above a certain threshold that depends on the configuration of the data). Those few cases correspond to the functions that are similar to the outlier, i.e., generated from the same distribution as the outlier, and therefore also outliers. As the other functions do not share that pattern, their coefficients for that archetype (or archetypoid) will be small and we can observe a “hole” (an area without data) in those values. The “hole” will separate the outliers from the rest of the sample.

If there are outliers of different types, we can compute 3 archetypes (or archetypoids) again, once the previously detected outliers have been removed from the sample, and so on. Note that if no outliers are present in the sample and we compute 3 archetypes (or archetypoids), the corresponding coefficients will also be distributed from 0 to 1 without “holes” for each archetype (or archetypoid).

Based on these premises, our proposal is as follows. Although AA could also be used, we will focus on ADA, as the archetypoids are real cases and, therefore, in the case of outliers, one of the archetypoids should be an outlier and not a mixture of outliers, as would be the case in AA. We refer to this method as FOADA; a code sketch of one iteration is given after the list.

(1) Compute ADA with $k = 3$ for the sample.
(2) If no “hole” is detected in the coefficient distribution for each archetypoid, then no outlier is detected and the procedure ends.
(3) Otherwise, consider the archetypoid that has fewest cases with high (above the threshold that determines the “hole”) coefficients. Identify those cases as outliers and add them to a list of outliers. Return to Step 1, after removing the identified outliers from the sample.
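A minimal sketch of one FOADA iteration, using AA from the R package archetypes as a stand-in for ADA (ADA itself is available in Anthropometry) and the 0.5 threshold used later in Section 3; here X is a hypothetical n x m matrix of smoothed curves evaluated on a common grid, so this is the discretized approximation rather than the functional version of Epifanio [21]:

    library(archetypes)
    set.seed(1)
    a <- archetypes(X, k = 3)      # AA as a stand-in for ADA
    alphas <- coef(a, "alphas")    # n x 3 mixture coefficients; rows sum to 1
    # the archetype with the fewest heavily loaded cases is the suspect one
    suspect <- which.min(colSums(alphas > 0.5))
    outliers <- which(alphas[, suspect] > 0.5)
    # remove 'outliers' from X and repeat until no "hole" is detected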

This procedure depends on the detection of “holes” in the distribution of coefficients for each archetypoid. As only 3 archetypoids are computed each time, we can visualize the distribution of coefficients in 2D using a ternary plot without loss of information. A ternary plot makes it possible to visualize compositional 3-dimensional data in an equilateral triangle. Therefore, outliers could be detected empirically with the naked eye or with a classic multivariate detection method, such as the robust Mahalanobis distance (Rousseeuw and Leroy [41]).
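The detection in the ternary plot can be automated with a robust Mahalanobis distance, in the spirit of Rousseeuw and Leroy [41]; a sketch, assuming alphas is the n x 3 coefficient matrix and using the MCD estimator from robustbase as the robust location and scatter:

    library(robustbase)
    z <- alphas[, 1:2]                  # the third coordinate is redundant
    mcd <- covMcd(z)                    # robust (MCD) location and scatter
    d2 <- mahalanobis(z, mcd$center, mcd$cov)
    which(d2 > qchisq(0.975, df = 2))   # points flagged in the ternary plot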

Illustrative Example: Shape Outliers (Changes in the Average Shape). The following example aims to clarify the procedure. A set of $n = 100$ functions has been generated from the following model, which was used previously by Febrero et al. [31], Fraiman and Svarc [39], and Arribas-Gil and Romo [36]. $100(1-c)\%$ of the functions are generated from $X_i(t) = 4t + e_i(t)$, while the remaining $100c\%$ are generated according to this contamination model: $X_i(t) = 4t + 2\sin(4\pi t) + e_i(t)$, where $0 \le c \le 1$ and $e_i(t)$ is a Gaussian process with zero mean and covariance function $\gamma(s,t) = \exp(-|t - s|/0.3)$. The functions are observed at 50 equidistant points between 0 and 1. Note that these are shape outliers; their average shape is different.
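A sketch of how such a sample can be generated, assuming the model as reconstructed above:

    library(MASS)
    n <- 100; m <- 50; cc <- 0.02
    tt <- seq(0, 1, length.out = m)
    Sigma <- exp(-abs(outer(tt, tt, "-")) / 0.3)   # covariance of e(t)
    e <- mvrnorm(n, mu = rep(0, m), Sigma = Sigma)
    X <- matrix(4 * tt, n, m, byrow = TRUE) + e    # main model
    idx <- sample(n, n * cc)                       # contaminated curves
    X[idx, ] <- X[idx, ] + matrix(2 * sin(4 * pi * tt), length(idx), m,
                                  byrow = TRUE)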

Figure 2 shows an example of functions generated with $c = 0.02$ in the left-hand panel, the 3 archetypoids in the central panel, and the ternary plot in the right-hand panel. The first archetypoid is one of the outliers, while the other two archetypoids are extreme functions between which almost the entire sample is included. In the ternary plot, it is evident that almost all the members of the sample, except the two outliers, are expressed as a mixture of archetypoids 2 and 3. Only two points, the outliers, have high values for archetypoid 1.

Now, these two outliers are removed from the sample and ADA is computed again. Figure 3 shows the 3 archetypoids obtained and the corresponding ternary plot. As before, archetypoids 2 and 3 are extreme functions resembling band functions between which almost all of the sample is included. Archetypoid 1 is a function similar to archetypoid 3 in the first third of the interval and similar to archetypoid 2 in the last third of the interval. The coefficient values are now spread out inside the triangle.

Illustrative Example: Outliers due to an Increase in Variability. Let us see how the method proceeds when, besides shape outliers, there are also outliers whose variability is greater. Figure 4 (left-hand panel) shows the previous outliers plus two new outliers (in red). They are built by adding a white noise process to the mean of the previous main model. In the first iteration of the procedure, the shape outliers are detected and removed from the data set for the following iteration. However, no more outliers are detected in the second iteration with the raw data. Nevertheless, it should be remembered that these are functional data, so changes in variability are equivalent to changes in the derivatives. Therefore, we apply the procedure to the first derivatives of the functions in the data set (see the right-hand panel of Figure 4), as sketched below. In the following iteration one of the outliers is detected, and the other outlier is detected in the next iteration. No more outliers are detected, so all the outliers are finally found. Note that outliers in variability are shape outliers in the derivatives.
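In basis form the derivative is immediate; a sketch with the fda package, assuming fdobj holds the smoothed sample and tt the evaluation grid:

    library(fda)
    dfd <- deriv.fd(fdobj)      # first derivative of the curves
    Xd <- t(eval.fd(tt, dfd))   # re-discretize and rerun the procedure on Xd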

Illustrative Example: Amplitude Outliers. Figure 5 (left-hand panel) displays the previous shape outliers (in black) plus two amplitude outliers (in red), i.e., with $c = 0.02$. Therefore, the percentage of outliers in the data set is 4%. The amplitude outliers are generated by adding 2 to the mean of the main model. The amplitude outliers are detected in the first iteration of our procedure, while the shape outliers are detected in the second iteration.

Illustrative Example: Isolated Outliers. Figure 5 (central panel) displays the previous shape outliers (in black) plus five isolated outliers (in red), i.e., with $c = 0.05$. Therefore, the percentage of outliers in the data set is 7%. The isolated outliers are generated by adding the values of a standard normal density to the first 14 observed points of the mean of the main model. The shape outliers are located in the first iteration of the procedure, while the isolated outliers are located in the following iteration.

Illustrative Example: Shift Outliers. Figure 5 (right-hand panel) shows the previous shape outliers (in black) plus three shift outliers (in red), i.e., with $c = 0.03$. Therefore, the percentage of outliers in the data set is 5%. The shift outliers are generated by translating the mean of the main model by -0.1 time units. All the shift outliers except one are detected in the first iteration of the procedure. The nondetected outlier is displayed with a thicker solid red line in the right-hand panel of Figure 5.

2.4.4. Phase I

The objective of this phase is to clear the data of anomalies to ensure representative data from the in-control system. Possible functional outliers are identified using a particular method from Section 2.4.2. They should be investigated in order to determine the causes. The identified outliers are removed from the sample and the functional outlier detection method is used again. If new outliers are detected, we repeat the procedure (investigation and exclusion of outliers) until no new outliers are detected. At the end of this iterative procedure, it can be assumed that a clean sample that represents the performance of the in-control system has been obtained for use in Phase II.

In any statistical problem, having a large, representative sample is preferable to a small one. Small samples may not cover all the information in the process and may therefore give rise to false alarms. In a somewhat similar problem, the determination of limits for multivariate control charts, sample sizes of at least 20 to 50 are recommended (Lowry and Montgomery [56]) and even 200 would be desirable (Jensen et al. [57]).

2.4.5. Phase II

The objective of this phase is system monitoring. Each new functional datum is cleaned for false data and smoothed as explained in Stage 1. It is then added to the sample obtained in Phase I and a functional outlier detection method is used to test whether this new datum is atypical. If it is detected as anomalous, it should be investigated. Otherwise, the sample obtained in Phase I can be updated by adding the new datum. In this way, the sample size can be increased. In any case, data should be visualized periodically as in Step 2 of Stage 1.
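A schematic of the Phase II loop; a minimal sketch, where detect stands for any hypothetical detector from Section 2.4.2 that returns the indices of the outlying curves:

    # one monitoring step: test the new (cleaned and smoothed) day
    monitor_day <- function(clean, newday, detect) {
      aug <- rbind(clean, newday)      # Phase I sample plus the new datum
      alarm <- nrow(aug) %in% detect(aug)
      if (!alarm) clean <- aug         # update the in-control sample
      list(alarm = alarm, clean = clean)
    }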

3. Results

3.1. Real Case Study

The procedure introduced in Section 2.2 has been applied to our data. For the sake of conciseness, we have decided to illustrate each of the components of the procedure. As previously discussed, unless stated otherwise, we focus on the results for night hours, from 00:00 to 6:00 a.m. Results for other periods are shown at some points for illustration purposes. We begin with some examples of the results of Step 2 of Stage 1 and then illustrate each element of the procedure in order.

Figure 6 shows rainbow plots for two different sectors. In the first one, we can see how the colors show a different distribution according to year: in 2014 the functional distribution (red-yellow curves) is different from that of 2015 and 2016 (green-violet curves) due to external causes. These could be faults in the pipelines, a modification of the limits of the sector (opening/closing of border valves), a variation in the supply pressure of the sector, etc. This means that only data from 2015 and 2016 could be used for that sector. On the other hand, in the right-hand plot, no temporal pattern appears, so we use data from all the years in that sector.

An example of cleaning false data can be seen in Figure 7 (left-hand panel). Detected false data are plotted in red. The figure clearly shows how those points deviate from the trajectory. In this example, the length of the interval is 24 hours, to show the different possibilities available in Step 1. The right-hand panel of Figure 7 shows the same data smoothed using 15 cubic B-spline basis functions, i.e., using Step 4 in Situation 1. Note that the small amount of noise has been erased and the curve has been reconstructed at the points where false data were found. Figure 8 reveals an elbow at 15 bases, so this number was chosen.

In the left-hand panel of Figure 9, raw data with somewhat long periods of missing data are displayed. In the right-hand panel, those data have been smoothed and reconstructed using Step 4 in situation 2.

We now show a sample of the results from Phase I of Stage 2. Figure 10 shows two iterations for identifying outliers. For Phase II of Stage 2, the data in the right-hand panel of Figure 10 would be a sample of the in-control process, and an outlier detection method should be used with that sample together with the new datum gathered for monitoring the system. A lengthier presentation of the results is available in Millán-Roures [58] for readers who are interested in the particular results, i.e., the specific outliers detected for each sector, season, and weekdays or weekends.

Note that a leak would probably appear as an amplitude outlier that persists until it is corrected. However, with the proposed methodology, we can also detect, for example, isolated outliers, i.e., curves that differ from the rest of the data only in some part of the interval; these could correspond to some kind of fraud, for example. In short, we can detect anomalous behaviors; the following step is to check the cause: leaks, damage to measuring equipment, changes in the consumption pattern, etc.

3.2. Simulation Study

A simulation study is carried out to compare the different outlier detection methods in a controlled setting. As mentioned above, outliers that are completely outlying in shape but not so in magnitude are not easy to detect. Therefore, we begin our simulation study with this more difficult problem, considering the same models used in Section 2.4.3 with different percentages of shape outliers. Then, we concentrate on examining the results when only amplitude, isolated, and shift outliers are present, respectively. To appreciate the difficulty of each problem, remember that the left-hand panel of Figure 2 illustrates the problem with 2% shape outliers (2 of the 100 functions). The first part of the simulation study only changes this percentage, i.e., from zero to a higher percentage of outliers. In Figure 5, we see examples for the problems with amplitude, isolated, and shift outliers, the black shape outliers being discarded; i.e., shape outliers are not part of the data sets in the second part of the simulation study. In the third part of the simulation study, a new model for the data set is considered.

The contamination rates used with shape outliers are 0 (no outliers), 0.05, and 0.15 (although $c = 0.15$ is quite high and could suggest that a subsample of the population is being generated, we have decided to use it because it was also considered in previous simulation studies such as Arribas-Gil and Romo [36]). A total of 100 simulations have been run. Default parameters have been used, except for the methods that require the coverage probability of the outlying region or the trimming proportion to be specified. For those methods (LRT, TRIM, POND, and HDR), the true values have been used. Obviously, this could give them some advantage, especially when $c = 0$. For our proposal, we have considered the method by Rousseeuw and Leroy [41] for detecting outliers in the ternary plot.

Table 1 displays the results for each method and contamination rate. We consider that the best performance corresponds to the methods that identify many true positives (outliers as outliers) and few false positives (nonoutliers as outliers). The method that locates the highest number of outliers is FOADA, with a success rate of almost 100% in the case of $c = 0.05$, and also the highest rate in the case of $c = 0.15$. RMAH and OUG also provide very good results, although the success rate decreases slightly for OUG in the case of $c = 0.15$. The percentage of falsely identified outliers is low for these three methods, the lowest corresponding to RMAH, especially when no outliers are present. TRIM and POND provide an excellent percentage of successfully identified outliers when $c = 0.05$, but the results are poor for $c = 0.15$. Success rates for ISFE and HDR are medium-high. The results for FB are not very good for this problem, although the worst method is LRT, since it does not detect any outliers in any case.

Table 2 shows the results for each method for amplitude, isolated, and shift outliers with $c = 0.05$. The settings of each method are as above. For amplitude outliers, the best method is TRIM, with a success rate of 74%, followed by POND, RMAH, and FOADA. The success rate increases to 72.4% (38.9) for FOADA if, instead of using the method by Rousseeuw and Leroy [41] for detecting outliers in the ternary plot, we examine it with the naked eye and consider as outliers those points whose coefficient for the archetypoid with the fewest points nearest to it exceeds a threshold of 0.5. In any case, note that it is a hard problem; the amplitude outliers generated are not too far from the rest of the data (see the left-hand panel of Figure 5). As its authors explained, OUG was not conceived to detect magnitude outliers, which explains its poor results with this kind of outlier. The best method for isolated outliers is ISFE, followed by FOADA, with 100% and 96.2% success rates, respectively. TRIM and POND also give very good results. As regards shift outliers, the best results are provided by FOADA (99.2% success rate), followed by RMAH, OUG, and TRIM.

In the third part of the simulation study, a model similar to example 3 devised by Hyndman and Shang [23] and model 7 by Sun and Genton [34] is considered. We simulate 99 functions of the form $x_i(t) = a_i \sin(t) + b_i \cos(t)$, where $0 \le t \le 2\pi$, observed in steps of 0.1, and $a_i$ and $b_i$ follow normal distributions with mean 1 and standard deviation 0.1. One function, the atypical one, is simulated with the same functional form, but now the mean is 1.5. A data set is displayed in Figure 11. With FOADA, the method by Rousseeuw and Leroy [41] for detecting outliers in the ternary plot gives computational errors, so we consider a threshold of 0.5 and the same settings as above for the rest of the methods. Over 100 runs, the outlier is detected 100% of the time by FB, FOADA, POND, RMAH, and TRIM, 99% by HDR, 89% by FOM, and only 2% by OUG. LRT and ISFE do not detect it in any of the 100 runs. The percentage of falsely detected outliers is 0.020% for FB, 0.052% for FOADA, 0.053% for POND, 1.1% for RMAH, 1.4% for TRIM, 0.01% for HDR and FOM, and 0 for the rest of the methods. For this example, some methods achieve excellent results, but other methods are not able to detect the outlier.
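A sketch of this data set, assuming the functional form as reconstructed above:

    tt <- seq(0, 2 * pi, by = 0.1)
    ab <- matrix(rnorm(2 * 99, mean = 1, sd = 0.1), ncol = 2)
    X <- ab %*% rbind(sin(tt), cos(tt))                   # 99 typical curves
    out <- rnorm(2, mean = 1.5, sd = 0.1) %*% rbind(sin(tt), cos(tt))
    X <- rbind(X, out)                                    # row 100: the outlier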

4. Conclusions

We have proposed a method for monitoring and detecting anomalous flows in urban water networks based on FDA. Flow data are FD, and the whole procedure is based on FDA techniques. Each phase and step has been illustrated with real data, and the results are very promising. The proposed approach could be incorporated into integrated software to help decision-making for an overall water network management strategy based on computer-aided tools. We have also proposed a new method for identifying outlier functions based on archetypal analysis. This new procedure has been compared with other alternatives in a simulated setting, with very positive results. FOADA has consistently provided very good results for all the types of outliers. These results could be improved even further if a finer procedure for detecting “holes” in the distribution of coefficients for each archetypoid could be devised. This could perhaps involve estimating the distribution of compositional or closed data and finding the watersheds and catchment basins on the representation ([59], Ch. 9).

In future work, we could consider other functional variables, such as pressure, in addition to flow. Multivariate functional techniques should be used when more than one functional variable is available. As regards the functional outlier detection part, other tools could be developed. As shape outliers are the type of outliers that are most difficult to detect, techniques based on the shape of the functions could be taken into account. These ideas worked very well in classification problems, not only for univariate functions (Epifanio [60]), but also for multivariate functions, with one (Epifanio and Ventura-Campos [61]) or more arguments (Epifanio and Ventura-Campos [62]). These ideas could therefore be applied to the functional outlier detection problem.

Data Availability

The code and data for the simulation study and a script for the real case study are available at http://www3.uji.es/~epifanio/RESEARCH/fouada.rar.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to acknowledge the company FACSA for providing them with the data set. This work was carried out while Laura Millán-Roures was enjoying a grant supported by the Cátedra FACSA at Universitat Jaume I. This work has been partially supported by the Spanish Ministerio de Economía y Competitividad (AEI/FEDER, UE) Grant DPI2017-87333-R and Universitat Jaume I, UJI-B2017-13.