Abstract

Anomaly detection has emerged as a popular technique for detecting malicious activities in local area networks (LANs). Various aspects of LAN anomaly detection have been widely studied. Nonetheless, the privacy of individual users and of their relationships in a LAN has not been thoroughly explored in prior work. In some realistic cases, the anomaly detection analysis needs to be carried out by an external party located outside the LAN. Thus, it is important for the LAN admin to release LAN data to this party in a private way in order to protect the privacy of LAN users; at the same time, the released data must preserve the utility of being able to detect anomalies. This paper investigates the possibility of privately releasing ARP data that can later be used to identify anomalies in a LAN. We present four approaches, namely, naïve, histogram-based, naïve-$\delta$, and histogram-based-$\delta$, and show that they satisfy different levels of differential privacy, a rigorous and provable notion for quantifying privacy loss in a system. Our real-world experimental results confirm the practical feasibility of our approaches. With a proper privacy budget, all of our approaches preserve more than 75% utility of detecting anomalies in the released data.

1. Introduction

Security of local area networks (LANs) has been getting more attention in the last few decades. Traditional LAN defense mechanisms based on a firewall are no longer effective in preventing malware infection since malware can simply circumvent the firewall or infect the network through other means [2, 3]. A prominent example is the recent emergence of ransomware that can infect LAN devices via phishing attacks; these attacks remain effective even if the LAN’s firewall is active and configured correctly [4, 5]. In addition, with the rise of the Internet-of-things (IoT), the so-called “smart” devices have become widely popular and, at the same time, are also extremely vulnerable to malware attacks [6]. These devices may be infected from the outside world and introduce malware to the LAN.

To overcome this challenge, several anomaly detection techniques have been proposed to detect malicious activities in LANs. Among those, techniques based on the Address Resolution Protocol (ARP) have been shown to be promising in detecting anomalous LAN activities without requiring a change to existing devices [7, 8], making them suitable for current IoT networks.

Despite this success, there remains a severe privacy concern for LAN users that has not been thoroughly explored in previous work. Oftentimes, the anomaly detection must be performed by an entity outside the LAN [9–11] or by third-party software [12, 13]. Thus, it is equally important to ensure the privacy of the data exposed to this external and potentially malicious entity. For instance, a LAN admin in an enterprise may choose to outsource an anomaly detection analysis to a widely popular external service, e.g., Microsoft’s Anomaly Detector [12], or the admin may simply want to release some features of network data for transparency or academic purposes. In either case, the LAN admin has to output network data (which is an input to the anomaly detection algorithm) to an untrusted party. Doing so may allow that party to learn privacy-sensitive information about the LAN users. For example, it may directly disclose personally identifiable information (PII), e.g., IP/MAC addresses, which can be used to uncover the identity of LAN users. It may also cause indirect information leakage by revealing access patterns (e.g., the time of day at which a specific user is online) or relationships between users [14].

While it is possible to simply erase all users’ sensitive information from the output data, this kind of technique does not provide strong and provable privacy guarantees. A motivated adversary may still be able to deanonymize users through other means, e.g., performing a side-channel analysis [15] or correlating the remaining network traces with the physical world data [16]. Therefore, there is a need for a technique with rigorous privacy guarantees, while preserving the utility of detecting anomalies in the LAN environment.

Contributions: to this end, the goal of this paper is to investigate the possibility of privately publishing ARP data that can later be used to identify anomalies in a LAN. Our work presents the following contributions:

(i) Privacy Notions for ARP Publication. We identify four concrete privacy notions in the context of ARP-data publication. Each notion is defined over a different type of information that needs to be privacy-protected as well as the probability that this protection holds. Specifically, they are derived from the widely known differential privacy [17] notion, which allows us to mathematically prove whether a specific algorithm adheres to any of these notions. We argue that this is a necessary and essential step towards designing, implementing, and deploying any privacy-preserving approach in the real world. Without it, it is doubtful whether any meaningful guarantee can be obtained from our approaches.

(ii) Releasing ARP for Anomaly Detection with Various Degrees of Privacy. We present four approaches capable of privately releasing ARP data that still preserves the utility of detecting LAN anomalies. Our approaches provide a wide range of privacy-preserving degrees, making them suitable for different scenarios:
(a) The first approach requires small additive perturbations to the input ARP data in exchange for privacy protection of user relationships.
(b) The second approach perturbs the input data by a relatively higher amount, but it attains a stronger privacy protection guarantee for each individual LAN device/user.
(c) The third and fourth are variants of the first two approaches that require even smaller data perturbations; however, they accept a small probability that the privacy guarantee will not hold, making them an appropriate option for scenarios where data utility needs to be maximized.

(iii) Practicality via Real-World Deployment. We demonstrate the practicality of our approaches by implementing and deploying them as part of a large-scale real-world project, called ASEAN-Wide Cyber-Security Research Testbed Project (https://www.nict.go.jp/en/asean_ivo/ASEAN_IVO_2020_Project03.html). Overall, the aim of this project is three-fold: (1) to capture network data from multiple LANs across the ASEAN region, (2) to determine malware behaviors based on the captured data, and (3) to make the captured data sharable in the public domain. Our work fits this project well as it fulfills the third goal by providing a privacy-preserving mechanism for releasing captured ARP data.

(iv) Evaluation on Real-World Dataset. We evaluate our approaches on a real-world ARP dataset captured from 3 LANs over 30 weeks. The experimental results show the feasibility of our approaches, as they introduce only low error values (measured by the root-mean-square error) to the original data. In addition, we assess the utility of the released data by testing it on the existing LAN anomaly detector [7]. The result is promising: with a proper privacy budget, our approaches preserve more than 75% of the utility of detecting anomalies.

Organization: the rest of the paper is organized as follows. Section 2 overviews existing work related to LAN anomaly detection and differential privacy. Background on the Address Resolution Protocol and differential privacy is given in Section 3. Section 4 describes the system and adversarial models targeted in this work. Section 5 presents privacy notions in the context of releasing ARP data. Sections 6 and 7 present the four approaches and prove that they satisfy the privacy notions defined in Section 5. Experiments are carried out and reported in Section 8. Several issues are discussed in Section 9. Finally, the paper concludes in Section 10.

2. Related Work

2.1. Differential Privacy in Anomaly Detection

To the best of our knowledge, there has been no prior work that proposes a release mechanism for ARP data with differential privacy guarantees while retaining the utility of anomaly detection in the LAN setting. The closest related work can be found in [18], where the authors employ the PINQ differential privacy framework [19] to detect network-wide traffic anomalies. The main difference between our work and the work in [18] lies in the type and magnitude of the released data as well as the privacy guarantee. The work in [18] aims to privately release link-level traffic volumes of an ISP, whose overall value tends to be much larger than the noise introduced by any differentially private release mechanism. On the other hand, our work operates on a more restricted input (ARP degree), which generally takes much smaller values, making it more noise-sensitive than an ISP’s traffic volume. Reducing this sensitivity is a main challenge addressed in this work. Further, the work in [18] provides no privacy protection guarantee for individual network users. Achieving this guarantee is nontrivial, as discussed in Section 6.2.

Besides the work in [18], several existing works focus on providing anomaly detection with differential privacy guarantees in non-networked settings, e.g., web browsing [20], social networks [21], health care [22], or syndrome surveillance [23]. Due to the difference in the target setting, the aforementioned techniques are not directly applicable to our work.

2.2. LAN Anomaly Detection

A number of existing studies aim to detect anomalies in LANs without providing privacy protection. Zhang et al. [24] present a honeypot-based approach to detect malicious LAN activities. Yeo et al. [25] propose a framework to monitor network traffic and detect anomalies in the Wireless LAN (WLAN) environment via the IEEE 802.11 MAC protocol. Nonetheless, this approach is specific to WLAN and thus cannot be directly applied to the wired LAN setting. Our approaches are based on ARP requests, making them suitable for both wired and wireless LAN environments.

Several prior works focus on detecting LAN anomalies based on ARP-related data. Whyte et al. [26] propose an anomaly detection approach that distinguishes anomalous activities through statistical analyses of ARP traffic. Yasami et al. [8] propose to model normal ARP traffic behaviors using Hidden Markov Model. Farahmand et al. [27] detect LAN anomalies based on four features: traffic rate, burstiness, dark space, and sequential scan. Matsufuji et al. [7] present an anomaly detection algorithm based on the degree of destination of ARP requests.

3. Background

3.1. Address Resolution Protocol (ARP)

In a nutshell, ARP is a request-response protocol that provides a mapping between dynamic IP addresses and permanent link-layer addresses (also known as MAC addresses), allowing one computer to discover a MAC address of another from its IP address. This protocol is essential in a LAN environment since it enables communication between any two computers within the same subnetwork as follows:

In LAN, when one computer needs to connect with another, it uses ARP to broadcast a request asking for the MAC address associated with the IP address of the destination computer. Therefore, an ARP request contains the requester’s IP and MAC addresses as well as the destination’s IP address. Upon receiving the ARP request, every computer checks whether the received IP address matches with one of its network interfaces. If it does, it unicasts an ARP response back to the requester along with its IP and MAC addresses. At the end of this process, the requester successfully retrieves the destination’s MAC address and can use this information to construct Ethernet frames for transmitting subsequent data to the target computer.

Similar to other network protocols, ARP involves using sensitive data that has previously been shown to be directly (e.g., IP address) or indirectly (e.g., traffic volume [16]) linkable to the identity of network users. Hence, this privacy concern must be taken into account when designing an approach for releasing ARP data.

3.2. Differential Privacy (DP)

Consider a setting in which there are $n$ users who send individual data to a trusted curator. The curator then applies an algorithm $\mathcal{A}$ to the collected data and outputs the result to an untrusted party. In a strong notion of privacy, the data of an individual must be kept private from strong adversaries, even ones who get hold of the data of the other $n - 1$ users.

Differential privacy (DP) is a formalization of this notion given in a seminal paper by Dwork, McSherry, Nissim, and Smith [17]. First, we say that two databases $D$ and $D'$ are neighboring if they differ by exactly one database entry. Differential privacy is then satisfied if changing $D$ to $D'$ does not change the probability of observing any output of $\mathcal{A}$ by very much. With differential privacy, the presence of a single entry does not affect the published output by much. Therefore, outputs from a differentially private algorithm cannot be used to infer much about any single entry of the input dataset.

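Definition 1. (differential privacy). An algorithm $\mathcal{A}$ satisfies $(\varepsilon, \delta)$-differential privacy ($(\varepsilon, \delta)$-DP) if, for every pair of neighboring datasets $D$ and $D'$ and every subset $S \subseteq \mathrm{Range}(\mathcal{A})$,
$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta,$$
where $\varepsilon$ is referred to as a privacy budget. We will refer to $(\varepsilon, 0)$-DP as $\varepsilon$-DP. Intuitively, smaller values of $\varepsilon$ and $\delta$ lead to a stronger privacy guarantee. Conversely, higher values of $\varepsilon$ and $\delta$ imply a weaker guarantee with possibly better utility/accuracy of the released data.
A related notion of differential privacy is concentrated differential privacy, which aims to control the moments of the privacy loss variable $Z = \log \frac{\Pr[\mathcal{A}(D) = Y]}{\Pr[\mathcal{A}(D') = Y]}$, where $Y$ is distributed as $\mathcal{A}(D)$.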

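Definition 2. (Rényi divergence). Let $P$ and $Q$ be probability densities. The Rényi divergence of order $\alpha \in (1, \infty)$ between $P$ and $Q$ is defined as
$$D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \int P(x)^{\alpha} Q(x)^{1 - \alpha} \, dx.$$

Definition 3. (concentrated differential privacy [28]). An algorithm $\mathcal{A}$ satisfies $\rho$-zero-concentrated differential privacy ($\rho$-zCDP) if, for every pair of neighboring datasets $D$ and $D'$ and every $\alpha \in (1, \infty)$,
$$D_{\alpha}(\mathcal{A}(D) \,\|\, \mathcal{A}(D')) \le \rho \alpha.$$
One useful property of differential privacy is that it is preserved under post-processing.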

Proposition 1. (postprocessing [29]). For any $\varepsilon$-DP ($\rho$-zCDP) algorithm $\mathcal{A}$ and arbitrary random function $g$, the algorithm $g \circ \mathcal{A}$ is also $\varepsilon$-DP ($\rho$-zCDP).
There may be situations in which we want to apply multiple DP algorithms, e.g., when releasing continual or time-series data. In this case, the resulting algorithm is also differentially private. However, every additional DP algorithm comes at a cost of privacy loss, as stated in the following proposition.

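Proposition 2. (composition [29]). For any $\varepsilon_i$-DP ($\rho_i$-zCDP) algorithms $\mathcal{A}_i$ for $i = 1, \ldots, k$, the algorithm defined by $\mathcal{A}(D) = (\mathcal{A}_1(D), \ldots, \mathcal{A}_k(D))$ is $\left(\sum_{i=1}^{k} \varepsilon_i\right)$-DP ($\left(\sum_{i=1}^{k} \rho_i\right)$-zCDP).
To introduce one of the most ubiquitous $\varepsilon$-DP algorithms, we start with the $\ell_1$-sensitivity of an algorithm $f$, which is the maximum change in its output as a result of modifying a single datum. We denote this sensitivity as $\Delta f$ and formally define it as
$$\Delta f = \max_{\text{neighboring } D, D'} \| f(D) - f(D') \|_1.$$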

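Theorem 1. (Laplace mechanism [29]). Let $f$ be an algorithm with sensitivity $\Delta f$ and $Y$ be a noise generated by sampling from a Laplace distribution at scale $\Delta f / \varepsilon$, i.e., $Y \sim \mathrm{Lap}(\Delta f / \varepsilon)$; then the randomized algorithm $\mathcal{A}$ defined by
$$\mathcal{A}(D) = f(D) + Y$$
is $\varepsilon$-DP.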
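As an illustration of Theorem 1, the following minimal Python sketch adds Laplace noise at scale `sensitivity / epsilon` to a numeric query result. The function names (`laplace_noise`, `laplace_mechanism`) and the example values are hypothetical and not taken from this paper. The sampler relies on the fact that the difference of two independent exponential variables with mean `scale` follows a Laplace distribution with that scale. Plain floating-point sampling of this kind is exposed to the floating point attack [31], so this is a sketch for intuition rather than a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Return value + Lap(sensitivity / epsilon), following Theorem 1."""
    return value + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: privatize a count of 120 with sensitivity 1 and epsilon 0.5.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
noisy_count = laplace_mechanism(value=120, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller `epsilon` yields a larger noise scale and hence a stronger guarantee at the cost of accuracy.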
In addition to the Laplace mechanism, the Gaussian mechanism is also commonly used to provide $\rho$-zCDP:

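Theorem 2. (Gaussian mechanism [28]). Let $f$ be an algorithm with sensitivity $\Delta f$ and $Y$ be a noise generated by sampling from a Gaussian distribution at scale $\sigma = \Delta f / \sqrt{2\rho}$, i.e., $Y \sim \mathcal{N}(0, \sigma^2)$; then the randomized algorithm $\mathcal{A}$ defined by
$$\mathcal{A}(D) = f(D) + Y$$
is $\rho$-zCDP.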
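The Gaussian mechanism of Theorem 2 can be sketched in the same way. This is a minimal illustration under the assumption that the standard deviation is $\Delta f / \sqrt{2\rho}$; the function name `gaussian_mechanism` and the example values are hypothetical.

```python
import math
import random


def gaussian_mechanism(value, sensitivity, rho, rng):
    """Return value + N(0, sigma^2) with sigma = sensitivity / sqrt(2 * rho)."""
    sigma = sensitivity / math.sqrt(2.0 * rho)
    return value + rng.gauss(0.0, sigma)


# Hypothetical example: sensitivity 1 and rho = 0.5 give sigma = 1.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
noisy_count = gaussian_mechanism(value=120, sensitivity=1.0, rho=0.5, rng=rng)
```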
In view of Proposition 2, a composition of $k$ Laplace mechanisms at scale $k \Delta f / \varepsilon$ is $\varepsilon$-DP, while that of $k$ Gaussian mechanisms at scale $\sqrt{k} \, \Delta f / \sqrt{2\rho}$ is $\rho$-zCDP. We see that, for successive uses of a DP mechanism, the Gaussian mechanism requires comparatively smaller noise than the Laplace mechanism, since its scale grows with $\sqrt{k}$ rather than $k$. The following lemma shows how the two definitions of differential privacy are related.

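Lemma 1. (see [28]). Any $\rho$-zCDP algorithm is also an $(\varepsilon, \delta)$-DP algorithm for any given $\delta > 0$ and
$$\varepsilon = \rho + 2 \sqrt{\rho \log(1 / \delta)}.$$

Conversely, for any given $\varepsilon$ and $\delta$, any $\rho$-zCDP algorithm where
$$\rho = \left( \sqrt{\varepsilon + \log(1 / \delta)} - \sqrt{\log(1 / \delta)} \right)^2$$
is also an $(\varepsilon, \delta)$-DP algorithm.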
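To make the relation between the two notions concrete, the sketch below implements the standard zCDP-to-DP conversion from [28] in both directions and uses it to compare the per-release noise scales when a total budget is split evenly over $k$ releases. The function names and the example values ($k = 30$, $\varepsilon = 1$, $\delta = 10^{-5}$) are illustrative assumptions. Note that the two scales are not guarantees of the same kind: the Laplace scale yields pure $\varepsilon$-DP, whereas the Gaussian scale yields $(\varepsilon, \delta)$-DP.

```python
import math


def eps_from_rho(rho, delta):
    """epsilon implied by rho-zCDP for a given delta (first part of Lemma 1)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def rho_from_eps(epsilon, delta):
    """rho that suffices for (epsilon, delta)-DP (second part of Lemma 1)."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(epsilon + log_term) - math.sqrt(log_term)) ** 2


def per_release_scales(k, sensitivity, epsilon, delta):
    """Noise scales when the total budget is split evenly over k releases."""
    laplace_scale = k * sensitivity / epsilon
    rho = rho_from_eps(epsilon, delta)
    gaussian_sigma = math.sqrt(k) * sensitivity / math.sqrt(2.0 * rho)
    return laplace_scale, gaussian_sigma


# Hypothetical setting: 30 releases, sensitivity 1, epsilon = 1, delta = 1e-5.
laplace_scale, gaussian_sigma = per_release_scales(
    k=30, sensitivity=1.0, epsilon=1.0, delta=1e-5
)
```

Applying `eps_from_rho` to the output of `rho_from_eps` recovers the original $\varepsilon$, which is a quick sanity check that the two formulas are inverses of each other for fixed $\delta$.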

4. System and Adversarial Models

Figure 1 illustrates the system model considered in this work. We consider a system in which an entity, called Admin, possesses a LAN consisting of $n$ Users (i.e., computing devices). In addition, Admin introduces a monitoring device to this LAN in order to observe ARP requests of all Users. We denote by $\mathrm{ARP}_t^{i}$ the aggregate ARP requests originating from User $i$, measured and accumulated over the $t$-th time interval.

In this work, we assume the time interval to be in units of “a week,” since this time scale allows us to use data collected over a long period of time without losing too much privacy budget from the composition (Proposition 2). We denote by $\mathrm{ARP}_t$ the result of appending the ARP requests of all Users generated in week $t$, i.e., $\mathrm{ARP}_t = (\mathrm{ARP}_t^{1}, \ldots, \mathrm{ARP}_t^{n})$.

As shown in Figure 1, our system starts by having the monitoring node (periodically) send the aggregate ARP requests $\mathrm{ARP}_1, \ldots, \mathrm{ARP}_T$ to Admin, corresponding to step ➊ in Figure 1. Admin is interested in learning, in a private way, whether the LAN as a whole has had any anomalous activities over the last $T$ weeks. Thus, in step ❷, he applies an algorithm $\mathcal{A}$ with the goal of hiding sensitive information in the input and then releases the output of $\mathcal{A}$ to an external entity, Analyst, in step ❸. In step ❹, Analyst in turn performs an anomaly detection analysis on the released output and returns the result to Admin. This result contains, for each week $t$, an indication that allows Admin to identify whether the LAN contains an anomaly at week $t$. We summarize the notation used throughout the paper in Table 1.

4.1. Adversarial Model

Analyst is assumed to be honest-but-curious, i.e., he always honestly applies an anomaly detection algorithm on any given input data and returns the correct output to Admin. However, during the process, he may attempt to learn sensitive information about Users or their relationship, and use it for his own benefits.

4.2. Goal and Scope

In this work, we focus on addressing privacy concerns in the aforementioned system, where data from a LAN is exposed to an external party. Hence, we do not consider other LAN settings capable of handling and processing this data locally, e.g., LANs in a large corporation with its own internal anomaly detection tool.

The goal of this work is to design approaches that can be appropriately used as the algorithm in step ❷ of Figure 1. In other words, our approaches must allow the process of releasing ARP data with some levels of provable privacy guarantees. Besides privacy, utility of the privatized/released data for anomaly detection is also important. We must ensure that the privatized value does not change by a significant amount, compared to the non-privatized counterpart; otherwise, it will not be useful in detecting anomalies.

5. DP Notions for ARP-Request Data

In this section, we describe 4 variants of differential privacy notions related to our system model. The summary of DP notions discussed throughout this Section is shown in Table 2.

To understand privacy (i.e., what concrete information needs to be private and hidden from Analyst) in our target scenario, we first describe the characteristics of ARP-request data. Figure 2 illustrates an example of a LAN that consists of 3 Users producing 4 ARP requests over a specific time interval. We define the (ARP-request) “degree” of User $i$ as the number of Users that receive ARP requests from User $i$. In this example, the degrees of Users 1, 2, and 3 are 2, 2, and 0, respectively.

Using this model, we can view $\mathrm{ARP}_t$, the aggregate ARP-request data at week $t$, as a directed graph $G_t$, where each User is represented by a node, and an arrow (or directed edge) from node $i$ to node $j$ indicates ARP request(s) generated by User $i$ and sent to User $j$ in the same time interval. The degree of User $i$ is then equivalent to the number of directed edges originating from node $i$.
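The degree computation described above can be sketched as follows. The edge list is a hypothetical one chosen only to be consistent with the description of Figure 2 (3 Users, 4 ARP requests, degrees 2, 2, and 0); the actual requests in the figure may differ. The function name `arp_degrees` and the choice to ignore requests whose source equals their destination are assumptions of this sketch.

```python
from collections import defaultdict


def arp_degrees(requests, users):
    """Map each user to its degree: the number of distinct users it sent ARP requests to."""
    targets = defaultdict(set)
    for src, dst in requests:
        if src != dst:  # assumption: self-directed requests do not count
            targets[src].add(dst)  # a set collapses repeated requests into one edge
    return {user: len(targets[user]) for user in users}


# Hypothetical (requester, destination) pairs consistent with the Figure 2 description.
requests = [(1, 2), (1, 3), (2, 1), (2, 3)]
degrees = arp_degrees(requests, users=[1, 2, 3])
total_degree = sum(degrees.values())  # equals the number of directed edges
```

Because destinations are stored in a set, several ARP requests between the same ordered pair of Users contribute a single directed edge, matching the graph view of the weekly data.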

As a directed graph, $G_t$ cannot directly represent a database entry, as required by Definition 1. Thus, the aforementioned notion of differential privacy does not accurately capture the privacy guarantee in our scenario. Fortunately, prior work has focused on expressing differential privacy for graph databases. Specifically, the work in [30] presents notions of differential privacy between graphs by first defining two types of neighboring graphs: two graphs are edge-neighboring if they differ by a single edge. Likewise, they are node-neighboring if they differ by a single node.

We now proceed to present two notions of privacy in edge-neighboring graphs:

Definition 4. ($(\varepsilon, \delta)$-edge-DP). Let $\mathcal{G}$ be the set of graphs between $n$ Users. An algorithm $\mathcal{A}$ satisfies $(\varepsilon, \delta)$-edge-differential privacy or $(\varepsilon, \delta)$-edge-DP if, for every pair of edge-neighboring graphs $G, G' \in \mathcal{G}$ and every subset $S \subseteq \mathrm{Range}(\mathcal{A})$,
$$\Pr[\mathcal{A}(G) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(G') \in S] + \delta.$$

Definition 5. ($\varepsilon$-edge-DP). An algorithm satisfies $\varepsilon$-edge-differential privacy ($\varepsilon$-edge-DP) if and only if it satisfies $(\varepsilon, 0)$-edge-DP.
Since an edge in our system refers to ARP requests between a pair of Users, Definitions 4 and 5 provide privacy protection for these ARP requests. This means that an algorithm satisfying $\varepsilon$-edge-DP/$(\varepsilon, \delta)$-edge-DP provably limits what can be learned about the ARP requests exchanged between any pair of Users, thereby hiding the ARP relationships among Users. This, for example, could hide the source of infection in a LAN, as it is common for malware to utilize ARP as the first step to discover and infect other LAN Users.
Nonetheless, the guarantee provided by these definitions is not strong enough to protect the privacy of individual Users. To achieve this stronger guarantee, we adopt the following notions:

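Definition 6. ($(\varepsilon, \delta)$-node-DP). Let $\mathcal{G}$ be the set of graphs between $n$ Users. An algorithm $\mathcal{A}$ satisfies $(\varepsilon, \delta)$-node-differential privacy or $(\varepsilon, \delta)$-node-DP if, for every pair of node-neighboring graphs $G, G' \in \mathcal{G}$ and every subset $S \subseteq \mathrm{Range}(\mathcal{A})$,
$$\Pr[\mathcal{A}(G) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(G') \in S] + \delta.$$

Definition 7. ($\varepsilon$-node-DP). An algorithm satisfies $\varepsilon$-node-differential privacy ($\varepsilon$-node-DP) if and only if it satisfies $(\varepsilon, 0)$-node-DP.
Indeed, by removing a node we also have to remove all of its edges. One then has that $\varepsilon$-node-DP is stronger than $\varepsilon$-edge-DP. In our scenario, an algorithm satisfying $\varepsilon$-node-DP/$(\varepsilon, \delta)$-node-DP prevents information leakage about the presence or absence of any individual User.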

Remark 1. Recall that $\delta$ represents an upper bound on the probability that an algorithm fails to satisfy the $\varepsilon$-DP notion. As an example, an algorithm satisfying $(\varepsilon, \delta)$-node-DP has probability at most $\delta$ of leaking some information about an individual node in a graph. To make the $(\varepsilon, \delta)$-edge/node-DP notions meaningful in practice, one must minimize this failure probability by ensuring that $\delta$ is negligible in terms of the number of data points $N$ considered in the DP notion [29]. One way to achieve this is to set $\delta$ well below $1/N$.
In the $(\varepsilon, \delta)$-node-DP notion, $N$ is the number of nodes $n$; whereas, in $(\varepsilon, \delta)$-edge-DP, $N$ corresponds to the number of possible directed edges $n(n-1)$. Thus, it is easy to see that $\delta$ in $(\varepsilon, \delta)$-edge-DP must be set smaller than that in $(\varepsilon, \delta)$-node-DP in order to attain the negligible probability.

6. Releasing ARP-Request Data with $\varepsilon$-Edge/Node-DP

In this section, we present two approaches, called naïve and histogram-based; the former guarantees $\varepsilon$-edge-DP while the latter is proven to satisfy the $\varepsilon$-node-DP notion. Later in Section 7, we describe variants of these approaches that satisfy the more relaxed $(\varepsilon, \delta)$-edge/node-DP notions.

6.1. Naïve Approach
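Algorithm 1: Naïve approach.
Input: $\mathrm{ARP}_1, \ldots, \mathrm{ARP}_T$, $T$, $\varepsilon$
Output: $\tilde{d}_1, \ldots, \tilde{d}_T$
(1) for $t = 1$ to $T$ do
(2)   $d_t \leftarrow$ total degree of $\mathrm{ARP}_t$
(3)   $\tilde{d}_t \leftarrow d_t + \mathrm{Lap}(T / \varepsilon)$
(4)   if $\tilde{d}_t < 0$ then $\tilde{d}_t \leftarrow 0$
(5)   else $\tilde{d}_t \leftarrow \mathrm{round}(\tilde{d}_t)$
(6) end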

The naïve approach is described in Algorithm 1.

In the rest of this section, we discuss nontrivial details of this approach and show that it indeed satisfies $\varepsilon$-edge-DP.

Theorem 3. The naïve approach as described in Algorithm 1 is $\varepsilon$-edge-DP.

Proof. Let $G_t$ be the directed graph of ARP requests in week $t$. Let $f$ be the algorithm that computes the weekly total degree, i.e., $d_t = f(G_t)$ (line 2 of Algorithm 1), which also corresponds to the total number of edges in $G_t$. To preserve edge-DP of each User’s ARP requests, one can simply use the Laplace mechanism. To do so, we need an upper bound on the sensitivity $\Delta f$. Let $G'_t$ be an edge-neighboring graph of $G_t$ in week $t$ and $d'_t = f(G'_t)$. Then, $|d_t - d'_t| \le 1$, so $\Delta f \le 1$, and by Theorem 1 the following Laplace mechanism (lines 2-3) guarantees $(\varepsilon / T)$-edge-DP:
$$\mathcal{A}_t(G_t) = f(G_t) + Y_t,$$
where $Y_t \sim \mathrm{Lap}(T / \varepsilon)$ (line 3).
Algorithm 1 can then be represented as
$$\mathcal{A}(G_1, \ldots, G_T) = \left( g(\mathcal{A}_1(G_1)), \ldots, g(\mathcal{A}_T(G_T)) \right),$$
where $g$ is a postprocessing function (lines 4-5) that (i) precludes a negative output by thresholding it to 0 and (ii) rounds a nonnegative privatized value to the closest integer in order to prevent the floating point attack [31].
By Propositions 1 and 2, we conclude that this algorithm is $(T \cdot \varepsilon / T)$-edge-DP, i.e., $\varepsilon$-edge-DP.
To prevent excessive information loss, one needs the scale of the Laplace noise to be smaller than $d_t$, i.e., $T / \varepsilon < d_t$ or $\varepsilon > T / d_t$. This can be achieved in realistic settings, e.g., in our experiment (Section 8), where $T = 30$ and the lower quartile of $d_t$ is 20.
On the other hand, a similar analysis for $\varepsilon$-node-DP results in much larger Laplace noise. Consider two node-neighboring directed graphs of $n$ Users. Removing a node also removes every edge attached to it, so the total degrees of the two graphs can differ by an amount that grows linearly with $n$, and this bound cannot be improved further. Thus, in order to employ the Laplace mechanism, the scale of the noise has to grow by the same factor. In contrast to the edge-DP regime, the scale of the noise therefore comes with a factor on the order of $n$. As a result, for a large number of Users, it is no longer feasible to preserve both privacy and utility at the same time.
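The steps of the naïve approach can be sketched in Python as follows. This is a minimal illustration, assuming that the weekly total degrees have already been computed and that each week receives Laplace noise at scale $T / \varepsilon$, as suggested by the proof of Theorem 3; the names `naive_release` and `laplace_noise` and the example values are hypothetical. Python's built-in `round` is used for the rounding step.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def naive_release(total_degrees, epsilon, rng):
    """Release one noisy total degree per week; the budget epsilon is split over T weeks."""
    T = len(total_degrees)
    released = []
    for d_t in total_degrees:
        noisy = d_t + laplace_noise(T / epsilon, rng)
        # Post-processing: clamp negatives to 0, otherwise round to an integer.
        released.append(0 if noisy < 0 else round(noisy))
    return released


# Hypothetical weekly total degrees for a 4-week window.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
released = naive_release([20, 35, 18, 42], epsilon=4.0, rng=rng)
```

With 4 weeks and $\varepsilon = 4$, each week is perturbed at scale 1, so the released values stay close to the inputs; a longer window or smaller budget increases the scale proportionally.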

6.2. Histogram-Based Approach

As seen in the previous subsection, the naïve approach cannot be used to satisfy $\varepsilon$-node-DP in practice due to its high sensitivity, which requires additive noise strong enough to significantly lower the utility of the released data. Instead, we propose a second approach utilizing a histogram that helps reduce the $\varepsilon$-node-DP sensitivity to a reasonable amount.

Our histogram-based approach is shown in Algorithm 2. The rationale behind this approach is to transform the degree data in such a way that its sensitivity is minimized when any User is removed from $G_t$. Naturally, a histogram is a good fit for this approach since it provides a way to partition data into disjoint groups/bins, where each bin in this case represents a range of degrees. Thus, this approach first computes the degree of each User in a specific week and uses this degree data to construct a histogram, as shown in line 2 of Algorithm 2. This histogram data minimizes the $\varepsilon$-node-DP sensitivity because removing a User from the histogram data affects only one bin, i.e., the one to which this User belongs, and it only decreases that bin count by one; other histogram bins are unaffected by this change. We can then apply the Laplace mechanism to each bin (lines 3-4), threshold and round the resulting value to the closest integer (lines 5-6), and finally return this noisy histogram as an output.

We now formally show that the histogram-based approach satisfies $\varepsilon$-node-DP.

Theorem 4. The histogram-based approach as described in Algorithm 2 is $\varepsilon$-node-DP.

Proof. Let $G_t$ and $G'_t$ be node-neighboring directed graphs at time $t$, i.e., $G'_t$ can be obtained from $G_t$ by adding or removing a single node. Let $h$ be the algorithm that computes the histogram of the degrees, i.e., the entries of $h(G_t)$ and $h(G'_t)$ are the counts of nodes by their degrees. Then $h(G_t)$ and $h(G'_t)$ differ by one in the entry corresponding to the degree of User $i$, who exists in only one of $G_t$ and $G'_t$. Therefore, $\Delta h = 1$.
Observe that lines 2-7 of Algorithm 2 can be written as a randomized algorithm $\mathcal{A}_t$ defined by
$$\mathcal{A}_t(G_t) = g(h(G_t) + Y_t),$$
where each entry of $Y_t$ is sampled independently from $\mathrm{Lap}(T / \varepsilon)$ and $g$ corresponds to the threshold-then-round function computed on all bin counts (lines 5-6). It follows from Theorem 1 and Proposition 1 that $\mathcal{A}_t$ is $(\varepsilon / T)$-node-DP.
Then, we can define Algorithm 2 as a randomized algorithm $\mathcal{A}$ as follows:
$$\mathcal{A}(G_1, \ldots, G_T) = \left( \mathcal{A}_1(G_1), \ldots, \mathcal{A}_T(G_T) \right).$$
By Proposition 2, we have that the histogram-based approach (described in Algorithm 2) is $(T \cdot \varepsilon / T)$-node-DP, i.e., $\varepsilon$-node-DP.

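Algorithm 2: Histogram-based approach.
Input: $\mathrm{ARP}_1, \ldots, \mathrm{ARP}_T$, $T$, $\varepsilon$
Output: $\tilde{h}_1, \ldots, \tilde{h}_T$
(1) for $t = 1$ to $T$ do
(2)   $h_t \leftarrow$ histogram of the degrees in $\mathrm{ARP}_t$
(3)   foreach bin $b$ of $h_t$ do
(4)     $\tilde{h}_t[b] \leftarrow h_t[b] + \mathrm{Lap}(T / \varepsilon)$
(5)     if $\tilde{h}_t[b] < 0$ then $\tilde{h}_t[b] \leftarrow 0$
(6)     else $\tilde{h}_t[b] \leftarrow \mathrm{round}(\tilde{h}_t[b])$
(7)   end
(8) end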
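A minimal Python sketch of the histogram-based approach is given below. It assumes the three bins chosen later in Section 8.1.3 (degree 1, degree 2, and all higher degrees, read here as degree 3 or more), assumes that Users with degree 0 fall into no bin, and assumes per-bin Laplace noise at scale $T / \varepsilon$ as in the proof of Theorem 4. All function names and example values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def degree_histogram(degrees):
    """Count users with degree 1, degree 2, and degree >= 3 (degree 0 is not binned)."""
    hist = [0, 0, 0]
    for d in degrees:
        if d == 1:
            hist[0] += 1
        elif d == 2:
            hist[1] += 1
        elif d >= 3:
            hist[2] += 1
    return hist


def histogram_release(weekly_degrees, epsilon, rng):
    """Release one noisy 3-bin histogram per week, splitting epsilon over T weeks."""
    T = len(weekly_degrees)
    released = []
    for degrees in weekly_degrees:
        noisy_hist = []
        for count in degree_histogram(degrees):
            noisy = count + laplace_noise(T / epsilon, rng)
            # Post-processing: clamp negatives to 0, otherwise round to an integer.
            noisy_hist.append(0 if noisy < 0 else round(noisy))
        released.append(noisy_hist)
    return released


# Hypothetical per-user degrees for two weeks.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
released = histogram_release([[1, 2, 2, 5, 0, 3], [1, 1, 4]], epsilon=2.0, rng=rng)
```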

7. Releasing ARP-Request Data with $(\varepsilon, \delta)$-Edge/Node-DP

The approaches in the previous section require adding noise at a scale proportional to $T$, which may not scale well in practice when $T$ is large. We explore an alternative that instead adopts the Gaussian mechanism in order to reduce the additive noise from $O(T)$ to $O(\sqrt{T})$. We call these variants naïve-$\delta$ and histogram-based-$\delta$; they guarantee $(\varepsilon, \delta)$-edge-DP and $(\varepsilon, \delta)$-node-DP, respectively.

7.1. Naïve-$\delta$ Approach

To complement the naïve approach (Algorithm 1), which gives a strong privacy guarantee at the cost of a considerably large amount of noise, we develop another approach that adds less noise but provides a weaker $(\varepsilon, \delta)$-edge-DP guarantee. The approach is described in Algorithm 3. Similar to Algorithm 1, we round the noisy outputs to the nearest integers to protect the data from floating point attacks. In the rest of this section, we discuss nontrivial details of this approach and show that it indeed satisfies $(\varepsilon, \delta)$-edge-DP.

Theorem 5. The naïve-$\delta$ approach as described in Algorithm 3 is $(\varepsilon, \delta)$-edge-DP.

Proof. Let $G_t$ be the directed graph of ARP requests in week $t$. Let $f$ be the algorithm that computes the weekly total degree, i.e., $d_t = f(G_t)$ (line 3 of Algorithm 3). As in the proof of Theorem 3, the edge-sensitivity satisfies $\Delta f \le 1$. Observe that lines 3-6 of Algorithm 3 can be written as a randomized algorithm $\mathcal{A}_t$ defined by
$$\mathcal{A}_t(G_t) = g(f(G_t) + Y_t),$$
where $Y_t \sim \mathcal{N}(0, T / (2\rho))$ and $g$ corresponds to the threshold-then-round function applied to the noisy total degree (lines 5-6). It follows from Theorem 2 and Proposition 1 that $\mathcal{A}_t$ is $(\rho / T)$-zCDP.
Then, we can define Algorithm 3 as a randomized algorithm $\mathcal{A}$ as follows:
$$\mathcal{A}(G_1, \ldots, G_T) = \left( \mathcal{A}_1(G_1), \ldots, \mathcal{A}_T(G_T) \right).$$
By Proposition 2, we have that Algorithm 3 is $(T \cdot \rho / T)$-zCDP, i.e., $\rho$-zCDP. Using Lemma 1 and recalling the definition of $\rho$ in line 1 of Algorithm 3, we conclude that this algorithm is also $(\varepsilon, \delta)$-edge-DP.

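Algorithm 3: Naïve-$\delta$ approach.
Input: $\mathrm{ARP}_1, \ldots, \mathrm{ARP}_T$, $T$, $\varepsilon$, $\delta$
Output: $\tilde{d}_1, \ldots, \tilde{d}_T$
(1) $\rho \leftarrow \left( \sqrt{\varepsilon + \log(1 / \delta)} - \sqrt{\log(1 / \delta)} \right)^2$
(2) for $t = 1$ to $T$ do
(3)   $d_t \leftarrow$ total degree of $\mathrm{ARP}_t$
(4)   $\tilde{d}_t \leftarrow d_t + \mathcal{N}(0, T / (2\rho))$
(5)   if $\tilde{d}_t < 0$ then $\tilde{d}_t \leftarrow 0$
(6)   else $\tilde{d}_t \leftarrow \mathrm{round}(\tilde{d}_t)$
(7) end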
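
The naïve-$\delta$ approach can be sketched along the same lines. This minimal illustration assumes that $\rho$ is derived from $(\varepsilon, \delta)$ through the standard zCDP conversion of Lemma 1 and that each week receives Gaussian noise with variance $T / (2\rho)$, following the proof of Theorem 5; the function names and the example values are hypothetical.

```python
import math
import random


def rho_from_eps(epsilon, delta):
    """rho that suffices for (epsilon, delta)-DP under the zCDP conversion."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(epsilon + log_term) - math.sqrt(log_term)) ** 2


def naive_delta_release(total_degrees, epsilon, delta, rng):
    """Release one noisy total degree per week using Gaussian noise."""
    T = len(total_degrees)
    rho = rho_from_eps(epsilon, delta)
    sigma = math.sqrt(T / (2.0 * rho))  # per-week budget is rho / T
    released = []
    for d_t in total_degrees:
        noisy = d_t + rng.gauss(0.0, sigma)
        # Post-processing: clamp negatives to 0, otherwise round to an integer.
        released.append(0 if noisy < 0 else round(noisy))
    return released


# Hypothetical weekly total degrees and privacy parameters.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
released = naive_delta_release([20, 35, 18, 42], epsilon=4.0, delta=1e-5, rng=rng)
```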
7.2. Histogram-Based-$\delta$ Approach

We aim to construct an $(\varepsilon, \delta)$-node-DP algorithm with less noise compared to the $\varepsilon$-node-DP algorithm in Section 6.2. We still rely on a histogram-based approach as it has small sensitivity upon adding/removing a node. Our histogram-based-$\delta$ approach is described in Algorithm 4.

Theorem 6. The histogram-based-$\delta$ approach as described in Algorithm 4 is $(\varepsilon, \delta)$-node-DP.

Proof. Let $G_t$ and $G'_t$ be node-neighboring directed graphs at time $t$, i.e., $G'_t$ can be obtained from $G_t$ by adding or removing a single node. Let $h$ be the algorithm that computes the histogram of the degrees, i.e., the entries of $h(G_t)$ and $h(G'_t)$ are the counts of nodes by their degrees. As in the proof of Theorem 4, the node-sensitivity satisfies $\Delta h = 1$.
Looking at Algorithm 4, we observe that lines 3-7 can be written as a randomized algorithm $\mathcal{A}_t$ defined by
$$\mathcal{A}_t(G_t) = g(h(G_t) + Y_t),$$
where each entry of $Y_t$ is sampled independently from $\mathcal{N}(0, T / (2\rho))$ and $g$ corresponds to the threshold-then-round function computed on all bin counts (lines 6-7). It follows from Theorem 2 and Proposition 1 that $\mathcal{A}_t$ is $(\rho / T)$-zCDP with respect to node-neighboring graphs.
Then, we can define Algorithm 4 as a randomized algorithm $\mathcal{A}$ as follows:
$$\mathcal{A}(G_1, \ldots, G_T) = \left( \mathcal{A}_1(G_1), \ldots, \mathcal{A}_T(G_T) \right).$$
By Proposition 2, we have that the histogram-based-$\delta$ approach (described in Algorithm 4) is $(T \cdot \rho / T)$-zCDP, i.e., $\rho$-zCDP. From the definition of $\rho$ in line 1 of Algorithm 4, we conclude using Lemma 1 that this algorithm is also $(\varepsilon, \delta)$-node-DP.

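Algorithm 4: Histogram-based-$\delta$ approach.
Input: $\mathrm{ARP}_1, \ldots, \mathrm{ARP}_T$, $T$, $\varepsilon$, $\delta$
Output: $\tilde{h}_1, \ldots, \tilde{h}_T$
(1) $\rho \leftarrow \left( \sqrt{\varepsilon + \log(1 / \delta)} - \sqrt{\log(1 / \delta)} \right)^2$
(2) for $t = 1$ to $T$ do
(3)   $h_t \leftarrow$ histogram of the degrees in $\mathrm{ARP}_t$
(4)   foreach bin $b$ of $h_t$ do
(5)     $\tilde{h}_t[b] \leftarrow h_t[b] + \mathcal{N}(0, T / (2\rho))$
(6)     if $\tilde{h}_t[b] < 0$ then $\tilde{h}_t[b] \leftarrow 0$
(7)     else $\tilde{h}_t[b] \leftarrow \mathrm{round}(\tilde{h}_t[b])$
(8)   end
(9) end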
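For completeness, a sketch of the histogram-based-$\delta$ approach follows. It takes already-computed weekly histograms as input and differs from the Laplace-based variant only in the noise distribution: each bin count receives Gaussian noise with variance $T / (2\rho)$, where $\rho$ is derived from $(\varepsilon, \delta)$ via the zCDP conversion of Lemma 1. As before, the function names and example values are assumptions made for illustration.

```python
import math
import random


def rho_from_eps(epsilon, delta):
    """rho that suffices for (epsilon, delta)-DP under the zCDP conversion."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(epsilon + log_term) - math.sqrt(log_term)) ** 2


def histogram_delta_release(weekly_histograms, epsilon, delta, rng):
    """Add Gaussian noise to every bin of every weekly histogram, then clamp and round."""
    T = len(weekly_histograms)
    sigma = math.sqrt(T / (2.0 * rho_from_eps(epsilon, delta)))
    released = []
    for hist in weekly_histograms:
        noisy_hist = []
        for count in hist:
            noisy = count + rng.gauss(0.0, sigma)
            noisy_hist.append(0 if noisy < 0 else round(noisy))
        released.append(noisy_hist)
    return released


# Hypothetical 3-bin histograms (degree 1, degree 2, higher degrees) for two weeks.
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
released = histogram_delta_release([[12, 7, 30], [10, 9, 28]], epsilon=2.0, delta=1e-5, rng=rng)
```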

8. Evaluation

In this section, we evaluate our approaches by deploying them as part of a large-scale research project and reporting their utility on a real-world dataset extracted from that project.

8.1. Real-World Deployment
8.1.1. Background

ASEAN-Wide Cyber-Security Research Testbed Project is a large-scale research project with collaboration between multiple universities primarily located in Southeast Asia including Prince of Songkla University, Thailand (PSU), Universitas Brawijaya, Indonesia (UB), University of Computer Studies Yangon, Myanmar (UCSY), Institute of Technology of Cambodia, Cambodia (ITC), University of Information Technology, Myanmar (UIT), and The University of Tokyo, Japan (UT). The ultimate goal of this project is to create a real-world public testbed of malware behaviors captured in ASEAN countries.

Independent of our work, the first phase of this project involves capturing, collecting, and analyzing LAN data in Southeast Asian countries. To achieve this task, a small monitoring device, implemented on top of a Raspberry Pi 3B (Figure 3), is placed into several LANs across the ASEAN region. This monitoring device observes and captures the network traffic flowing within a LAN and periodically outputs the captured data to our server, where the data is analyzed and a model of ASEAN malware is eventually created.

8.1.2. Deployment

Our work plays an important role in the second phase of this research project. It allows us to privately share aggregate ARP data collected from the previous phase with other project members as well as to the public domain.

Our approaches enable a release mechanism of ARP-request data that still retains the utility of LAN anomaly detection. To assess utility, we evaluated our approaches on a subset of data captured and extracted from this research project.

The extracted dataset contains all ARP-request data observed and collected from 3 real-world LANs over a 30-week period. These LANs are located in: (1) The University of Tokyo, Japan (thus, its dataset is labeled as JPN), (2) Prince of Songkla University–Phuket Campus, Thailand (HKT) and (3) Prince of Songkla University–Hatyai Campus, Thailand (HDY). Details about these monitored LANs can be found in Table 3.

8.1.3. Parameter Selection

As we collected ARP requests over a 30-week period and aggregate them weekly, the number of time intervals is 30. The naïve approach involves no other parameters. Meanwhile, the histogram-based approach has an additional set of parameters: the number of bins and the width of each bin.

Intuitively, a larger number of bins leads to smaller bin counts. In such a case, the noise injected by our approach would become too large relative to the counts, severely decreasing the utility of the released data. To avoid this problem, we select a relatively small number of histogram bins: 3. Specifically, the first two bins hold the number of Users whose degrees are 1 and 2, respectively; the third bin holds the number of Users with degree ≥ 3.
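The three-bin layout described above can be sketched as follows. This is only an illustration of the binning step (no noise is added), and it assumes that the third bin collects every degree of 3 or more; the function name `degree_histogram` and the sample degrees are hypothetical.

```python
def degree_histogram(degrees):
    """Bin per-user ARP degrees into three bins: degree 1, degree 2, degree >= 3."""
    bins = [0, 0, 0]
    for d in degrees:
        if d <= 0:
            # Assumption: users with no ARP requests in the interval are not counted.
            continue
        # Degrees 1 and 2 map to bins 0 and 1; anything larger falls in bin 2.
        bins[min(d, 3) - 1] += 1
    return bins


# Hypothetical per-user degrees observed in one week.
weekly_degrees = [1, 1, 2, 5, 3, 1, 2, 9]
print(degree_histogram(weekly_degrees))  # [3, 2, 3]
```

Keeping the number of bins small means each count aggregates many users, so a fixed amount of additive noise distorts it less in relative terms.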

Finally, the approaches in Section 7 involve another parameter, δ. Recall from Remark 1 in Section 5 that δ must be negligible with respect to the number of data points . In other words:

In our target system, corresponds to and for the node-DP and edge-DP notions, respectively; see Table 3 for the number of Users in each monitored LAN. Unless stated otherwise, we use for all experiments. Nonetheless, the impact of different δ values on the utility is also assessed in the next subsection.

8.2. Utility Assessment: RMSE
8.2.1. RMSE

In the context of differential privacy, one common utility metric is an error between the released privatized values and the nonprivatized aggregates. We adopt a similar approach and select the root-mean-square error (RMSE) as our first evaluation metric, computed over pairs of corresponding data points in the nonprivatized and privatized series. For the naïve approach and its variant, a nonprivatized data point corresponds to the sum of all Users' ARP degrees observed in a given week, while its privatized counterpart refers to the privatized output on the same ARP data. For the histogram-based and histogram-based-δ approaches, on the other hand, a data point represents a histogram bin.
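As a concrete reading of this metric, the following minimal sketch computes the standard RMSE between two equally long series. The function name `rmse` and the sample values are illustrative assumptions, not taken from the paper's implementation.

```python
import math


def rmse(original, privatized):
    """Root-mean-square error between nonprivatized and privatized data points."""
    if len(original) != len(privatized) or not original:
        raise ValueError("series must be non-empty and of equal length")
    squared_errors = ((x - y) ** 2 for x, y in zip(original, privatized))
    return math.sqrt(sum(squared_errors) / len(original))


# Hypothetical weekly totals of ARP degrees and their noisy releases.
true_totals = [120, 135, 128, 400]
noisy_totals = [121.5, 133.2, 129.1, 398.4]
print(rmse(true_totals, noisy_totals))
```

For the histogram-based variants, the same function could be applied after flattening the per-week bins into one sequence.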

8.2.2. Impact of ε

Recall that ε refers to the privacy budget in the DP notion; a lower value of ε implies stronger privacy, possibly at the cost of utility.

Figure 4 shows the impact of ε on the utility of the proposed approaches. Unsurprisingly, a higher ε yields lower errors and thus better utility. For all 3 monitored LANs, seems to be a pragmatic choice in order to maintain a low error for all approaches.

Next, we show how much utility can be improved by using the approaches in Section 7 instead of their counterparts in Section 6. The result, illustrated in Figure 5, suggests that both the naïve-δ and histogram-based-δ approaches enjoy higher utility (i.e., a utility gain) when . However, as ε gets larger, this utility gain becomes smaller; in fact, the naïve-δ approach incurs a utility loss when for all monitored LANs. This result suggests using the approaches in Section 7 only when one needs stronger privacy, i.e., small ε.

Figure 5 also indicates that the histogram-based-δ approach significantly outperforms the naïve-δ approach in terms of the utility gain. For , the histogram-based-δ approach provides utility gain, while a smaller utility gain can be realized with the naïve-δ approach. This is expected because the histogram-based-δ approach introduces a smaller value of (see Remark 1 in Section 5), making the additive noise smaller and thus resulting in a higher utility gain.

In addition, also has a direct impact on and hence on the overall utility. As seen in Figure 5, among all monitored LANs, HDY has the highest number of Users and therefore has the lowest utility gain.

8.2.3. Impact of δ

We now assess the impact of δ on the utility of our approaches. Figure 6 shows the RMSE of the naïve-δ and histogram-based-δ approaches for different values of δ. As expected, increasing δ results in a decrease in RMSE and thus improves the utility of our approaches. This decrease is logarithmic as a function of δ.
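To see why the noise shrinks only slowly as this parameter grows, the sketch below calibrates Gaussian noise through zCDP. It assumes the standard results that Gaussian noise with variance Δ²/(2ρ) gives ρ-zCDP and that ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP; the exact calibration used in Section 7 may differ, and the function names are hypothetical.

```python
import math


def zcdp_rho(epsilon, delta):
    """Largest rho with rho + 2*sqrt(rho*ln(1/delta)) <= epsilon (standard zCDP bound)."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    log_term = math.log(1.0 / delta)
    # Solve s^2 + 2*s*sqrt(log_term) - epsilon = 0 for s = sqrt(rho).
    return (math.sqrt(log_term + epsilon) - math.sqrt(log_term)) ** 2


def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Standard deviation of Gaussian noise that satisfies rho-zCDP for the rho above."""
    return sensitivity / math.sqrt(2.0 * zcdp_rho(epsilon, delta))


# The noise scale drops slowly as delta grows by orders of magnitude.
for delta in (1e-8, 1e-6, 1e-4, 1e-2):
    print(delta, round(gaussian_sigma(1.0, delta), 3))
```

Because δ enters only through ln(1/δ), relaxing it by several orders of magnitude reduces the noise scale by a modest factor.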

The utility gain of the naïve-δ and histogram-based-δ approaches with respect to their original counterparts is illustrated in Figure 7. Our approaches benefit from a higher utility gain when δ is larger. For most δ values, the histogram-based-δ approach provides a positive utility gain over the histogram-based approach. Meanwhile, a utility gain can be achieved with the naïve-δ approach when .

This experimental result suggests that both the naïve-δ and histogram-based-δ approaches still provide a utility advantage over their original counterparts even for δ smaller than (up to for the naïve-δ approach and for the histogram-based-δ approach). In practice, one may opt for a smaller δ if a stronger privacy guarantee is needed.

8.3. Utility Assessment: Anomaly Detection Accuracy
8.3.1. Anomaly Detection Algorithm

In addition to low errors, it is also essential that outputs produced by our approaches can still be useful in identifying anomalous activities in LAN. Hence, we further evaluate utility of our approaches by assessing them via a LAN anomaly detector. In this experiment, we consider our approaches to preserve the utility of anomaly detection if the anomaly detector classifies the privatized data the same way as the original (nonprivatized) data.

For the anomaly detector, we choose an approach based on exponentially weighted moving average and variance [32] proposed by Matsufuji et al. [7] since it is tailored specifically for detecting LAN anomalies based on ARP data, which is also the focus in this work. All parameter values are selected based on the recommendation from [7].
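For intuition, a detector of this general family can be sketched as below. It is not the detector of [7]: the smoothing factor `alpha`, the `threshold` multiplier, and the `warmup` length are hypothetical placeholders, and only the incremental update of the exponentially weighted mean and variance follows the standard formulation.

```python
import math


def ewma_anomalies(series, alpha=0.3, threshold=3.0, warmup=3):
    """Flag points that deviate from the EWMA by more than threshold * EW std."""
    flags = []
    mean = None
    var = 0.0
    for i, x in enumerate(series):
        if mean is None:
            # Initialise the moving average with the first observation.
            mean = float(x)
            flags.append(False)
            continue
        deviation = x - mean
        # Compare against statistics computed from earlier points only.
        is_anomaly = i >= warmup and abs(deviation) > threshold * math.sqrt(var)
        flags.append(is_anomaly)
        # Incremental update of the exponentially weighted mean and variance.
        mean += alpha * deviation
        var = (1.0 - alpha) * (var + alpha * deviation ** 2)
    return flags


# Hypothetical weekly ARP-degree totals with one spike at index 7.
weekly_totals = [10, 11, 9, 10, 11, 10, 9, 50, 10]
print(ewma_anomalies(weekly_totals))
```

In the utility experiment, such a detector would be run once on the original series and once on the privatized series, and the two sets of flags would then be compared.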

It is worth noting that the anomaly detector in [7] only supports a univariate time series as input. However, the histogram-based approach and its variant produce a multivariate time series (i.e., a time series of histograms), which cannot be used directly as input to the anomaly detector. To address this issue, we perform a simple transformation that converts two consecutive histograms into a single variable using the distance function; the result of this transformation is then given as input to the anomaly detector. More formally, the transformation maps each pair of consecutive histograms to the distance between them.
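A minimal sketch of such a transformation is given below. The Euclidean distance is an assumption made for illustration, since the specific distance function is not recoverable here, and `histogram_distances` is a hypothetical name.

```python
import math


def histogram_distances(histograms):
    """Map a time series of histograms to distances between consecutive histograms."""
    # Assumption: Euclidean distance; another distance function could be swapped in.
    return [math.dist(prev, curr) for prev, curr in zip(histograms, histograms[1:])]


# Hypothetical three-bin histograms for three consecutive weeks.
weekly_histograms = [[3, 2, 3], [3, 2, 3], [6, 6, 3]]
print(histogram_distances(weekly_histograms))  # [0.0, 5.0]
```

The output has one fewer element than the input, and a sudden change in the degree distribution shows up as a spike in the resulting univariate series.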

8.3.2. Metrics

In this experiment, we evaluate the utility of our approaches using two metrics: true positive rate (TPR) and F1 score. In particular, we consider a noisy data point produced by our approach to be a true positive (TP) if the anomaly detector classifies both it and its original nonprivatized counterpart as anomalies. It is a false positive (FP) if the anomaly detector finds an anomaly in the noisy data point but not in its original counterpart. A true negative (TN) and a false negative (FN) are defined similarly.

Based on these definitions, the TPR and F1 metrics can be formulated as

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}.$$

A high value of TPR implies that a high percentage of the anomalies detected in the original data are also captured as anomalies in the privatized data. On the other hand, a high value of F1 implies relatively small values of FP and FN compared to TP.
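The two metrics can be computed from the detector's decisions on the original and privatized series as sketched below. The sketch uses the standard definitions of the true positive rate and the F1 score, with the detector's output on the original data treated as ground truth; the function name and sample flags are illustrative.

```python
def detection_metrics(original_flags, privatized_flags):
    """Return (TPR, F1), treating detector output on the original data as ground truth."""
    tp = fp = fn = 0
    for orig, priv in zip(original_flags, privatized_flags):
        if priv and orig:
            tp += 1  # anomaly in both series
        elif priv and not orig:
            fp += 1  # anomaly only in the privatized series
        elif orig and not priv:
            fn += 1  # anomaly missed after privatization
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else float("nan")
    return tpr, f1


# Hypothetical detector decisions on five weekly data points.
original = [True, True, False, False, True]
privatized = [True, False, True, False, True]
print(detection_metrics(original, privatized))
```

True negatives do not enter either formula, which is convenient here because anomalous weeks are rare compared to normal ones.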

8.3.3. Results

Figures 8 and 9 show the utility of our approaches evaluated using the TPR and F1 metrics, respectively. First, we can see that ε does not affect the utility of the naïve and naïve-δ approaches, as both approaches still provide almost perfect utility scores in all monitored LANs.

On the other hand, the histogram-based and histogram-based-δ approaches yield low utility for small ε. The utility scores then become higher as ε increases. For HKT, both approaches achieve a reasonable score of with . Meanwhile, ε must be set to 6 in order to achieve the same utility score in HDY. JPN requires the highest ε in order for the histogram-based-δ approach to perform .

Lastly, the results also confirm that the histogram-based-δ approach significantly outperforms the naïve-δ approach in terms of utility. Thus, we recommend deploying the histogram-based-δ approach over the histogram-based approach when one needs to publish ARP-request data with user privacy protection (i.e., corresponding to the node-DP notion). If edge-DP is sufficient, the naïve approach is a more reasonable choice than the naïve-δ approach, as the former provides a stronger privacy guarantee while both approaches achieve similar utility.

8.3.4. Comparison with RMSE

In most cases, the utility results from the TPR and F1 metrics are consistent with the previous results measured using RMSE in Section 8.2. That is, a higher ε leads to higher utility, with lower RMSE and higher TPR and F1. On the other hand, an extremely low value of ε (e.g., ) renders the output data useless, as it can no longer be used to reveal anomalies due to its low TPR/F1. There is, however, one exception: the naïve and naïve-δ approaches can, surprisingly, still attain high TPR and F1 utility despite a low ε. This indicates that these approaches are more robust to additive noise than the other approaches.

9. Discussion

9.1. ARP Fields

Our approaches take as input ARP-degree data, which in turn makes use of only 5 fields in ARP packets: SHA, SPA, THA, TPA, and OPER. In this work, we choose to discard the rest of the ARP fields (i.e., Hardware Type/Length (HTYPE/HLEN) and Protocol Type/Length (PTYPE/PLEN)) from our analysis. This is because, in practice, these discarded fields usually have fixed values that contain neither sensitive information nor anything meaningful to our approaches. For instance, since ARP is only applicable to IPv4, the PLEN field is always set to 4, the size of an IPv4 address; likewise, HTYPE usually contains the value 1, representing the ubiquitous Ethernet hardware type. As these fields are generally constant across all ARP packets, their absence does not affect the privacy or utility of our approaches.
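To make the field selection concrete, the following sketch extracts the five retained fields from a raw ARP payload. It assumes the common 28-byte Ethernet/IPv4 layout (HTYPE, PTYPE, HLEN, PLEN, OPER, SHA, SPA, THA, TPA); `parse_arp` is a hypothetical helper and not part of the monitoring device's software.

```python
import socket
import struct

# Network byte order: HTYPE, PTYPE, HLEN, PLEN, OPER, SHA, SPA, THA, TPA (28 bytes).
ARP_FORMAT = "!HHBBH6s4s6s4s"


def parse_arp(payload):
    """Return only the five ARP fields used to build ARP-degree data."""
    size = struct.calcsize(ARP_FORMAT)
    _htype, _ptype, _hlen, _plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:size]
    )
    return {
        "SHA": sha.hex(":"),
        "SPA": socket.inet_ntoa(spa),
        "THA": tha.hex(":"),
        "TPA": socket.inet_ntoa(tpa),
        "OPER": oper,  # 1 = request, 2 = reply
    }


# Hypothetical ARP request: 192.168.1.10 asks for the MAC address of 192.168.1.1.
sample = struct.pack(
    ARP_FORMAT, 1, 0x0800, 6, 4, 1,
    bytes.fromhex("aabbccddeeff"), socket.inet_aton("192.168.1.10"),
    bytes(6), socket.inet_aton("192.168.1.1"),
)
print(parse_arp(sample))
```

The discarded fields are unpacked but ignored, which mirrors the observation that they carry fixed values in practice.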

9.2. DP Mechanisms

In this work, we focus on releasing ARP-degrees in a differentially private manner. Publishing degrees has a sensitivity of 1 (removing a user’s ARP request alters the total ARP-degrees by 1), which is small compared to the number of ARP requests sent by all users. Thus, we choose noise perturbation methods, namely the Laplace and Gaussian mechanisms, to privatize the ARP-degrees. Another well-known differential privacy mechanism is randomized response, whose standard deviation is [33], which is worse than the standard deviation of the Laplace and Gaussian mechanisms, which is . There are also differential privacy mechanisms based on data synthesis [34]. However, as anomaly detection algorithms look for “spiking” behaviors at a particular time interval, these data synthesis approaches, which try to replicate the distribution of the data as a whole, will not be able to retain the spikes as well as the perturbation mechanisms.
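As an illustration of noise perturbation with sensitivity 1, the sketch below applies the standard Laplace mechanism (noise scale equal to sensitivity divided by ε) to a total ARP-degree. The function names and the sample total are hypothetical, and the sketch does not reproduce the exact algorithms of Sections 6 and 7.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_total_degree(total_degree, epsilon, sensitivity=1.0, rng=random):
    """Release a total ARP-degree with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return total_degree + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical weekly total; a smaller epsilon gives a noisier release.
rng = random.Random(42)
for eps in (0.1, 1.0, 5.0):
    print(eps, privatize_total_degree(1200, eps, rng=rng))
```

Because the noise scale depends only on the sensitivity and ε, a spike of several hundred requests in one week remains visible after perturbation for moderate ε.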

9.3. Time Interval

In our evaluation, we set the time interval for ARP-data collection to one week. Albeit a bit long, this design choice is necessary as it allows us to incorporate all data (which spans 30 weeks) into our analysis with a higher utility rate and without spending too much privacy budget.

To illustrate this point, we conducted an additional experiment on the JPN network in which we aggregate and process ARP data over a shorter period, i.e., every day instead of every week. Compared to the original experiment, we observed a drastic decrease in the utility rate for all our approaches. As an example, for the naïve approach with , the RMSE increased by a factor of 6 (from 10 to 60), while the TPR and F1 scores dropped substantially from 1.0 to .

9.4. Utility Metrics

We evaluate our approaches using two utility metrics: RMSE and Anomaly Detection Accuracy. We select the former because it is one of the most common metrics for measuring utility from a DP mechanism [35]. Intuitively, it tells us “how far apart the privatized data is from the original data.” Since an anomalous activity appears as an unusual value in the data, a privacy-preserving mechanism with small RMSE would not perturb that value by much, allowing such activity to be detected from the privatized data. Besides RMSE, there are other similar metrics with the same purpose, e.g., Mean Absolute Error. Even though we do not include them in this work, we expect the results from such metrics to be in line with our current results.

Nonetheless, the RMSE does not directly indicate the “true” utility in this work since our end goal is to detect LAN anomalies, not to minimize error rates. To this end, we include Anomaly Detection Accuracy as our second metric. This metric gives a realistic picture of how effective our approaches are when used with a real-world LAN anomaly detector [7].

Finally, we do not consider other utility metrics that target different types of data publication. For example, -Error [36] and Hausdorff Distance [37] are geared towards measuring utility in location privacy protection. Also, information-theoretic metrics [38] require the input to be generated from a probability distribution, which is not the case in this work.

10. Conclusion

This paper presents four approaches to privately releasing ARP-request data that can later be used for identifying anomalies in a LAN. We prove that the naïve approach satisfies edge-differential privacy and thus provides privacy protection at the user-relationship level. On the other hand, the histogram-based approach can provide node-differential privacy, thus leaking no information about the presence of any individual user. We also propose two alternatives, named naïve-δ and histogram-based-δ, which require even smaller additive noise than their original counterparts in exchange for a small probability that the privacy guarantee will not hold. The feasibility of our approaches is demonstrated via real-world experiments in which we show that, with a reasonable privacy budget value, our approaches yield low errors ( in RMSE) and also preserve more than 75% utility of detecting LAN anomalies.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

A preliminary (and much shorter) version of this manuscript was published in IEEE International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON) 2021 [1].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The ASEAN IVO (https://www.nict.go.jp/en/asean_ivo/index.html) project, ASEAN-Wide Cyber-Security Research Testbed Project, was involved in the production of the contents of this work and financially supported by NICT (https://www.nict.go.jp/en/index.html). This work was also financially supported by Chiang Mai University, Thailand.