Abstract
Supernode detection has many applications, such as detecting network attacks and assisting resource allocation. As 5G/IoT networks keep growing, the sheer volume of network traffic makes it challenging to collect traffic data compactly and in real time. Previous works focus on detecting supernodes at a single measurement point, and only a few consider a distributed monitoring system. Moreover, they cannot measure two types of node cardinalities simultaneously or reconstruct the labels of supernodes efficiently because of their large computation and memory costs. To address these problems, we propose a novel reversible and distributed traffic summarization called RDS, which simultaneously measures source and destination cardinalities for detecting supernodes in a distributed monitoring system. The basic idea is that each monitor generates a summary data structure from the arriving packets and sends it to the controller; the controller then aggregates the received summary data structures, estimates node cardinalities, and reconstructs the labels of supernodes from the aggregated summary data structure. Experimental results on real network traffic demonstrate that the proposed approach can detect up to 96% of supernodes with a low memory requirement in comparison with state-of-the-art approaches.
1. Introduction
Traffic measurement provides valuable information for network security and network management, such as traffic accounting, load balancing, and anomaly detection [1–5], and measuring cardinality is one of its important tasks. Cardinality measurement is a prerequisite for detecting supernodes, a problem that continues to receive extensive attention from both academia and industry despite many efforts over recent decades.
The cardinality of a node is the number of distinct other nodes it communicates with. We consider two types of node cardinalities: source cardinality, the number of distinct destinations that a node connects to, and destination cardinality, the number of distinct sources that connect to a node [6]. Supernodes are nodes whose cardinalities exceed a predefined threshold: nodes with large source cardinality are supersources, and nodes with large destination cardinality are superdestinations. A packet stream in traffic measurement can be modeled as a set of two-tuples (s, d), where s is a source consisting of some source fields from the packet header, such as the source address, the source port, or the source address and port pair, and d is a destination consisting of some destination fields from the packet header, such as the destination address, the destination port, or the destination address and port pair. The problem we solve in this paper, called supernode detection, is to report the nodes whose cardinalities exceed the predefined threshold in a measurement period. For each source s, the cardinality SC (s) is the number of distinct destinations d that s connects to. If SC (s) exceeds the predefined threshold of source cardinalities, s is a supersource. Similarly, for each destination d, the cardinality DC (d) is the number of distinct sources s that connect to d, and d is a superdestination if DC (d) exceeds the predefined threshold of destination cardinalities.
There are three basic tasks in traffic measurement: measuring flow size, flow persistence, and flow cardinality. Flow size is the number of elements contained in packets with the same flow label, where elements may be entire packets, payload bytes, or specific content in packets. Flow persistence is the number of timeslots in which packets of the flow occur. Flow cardinality is the number of distinct flows with the same source or destination. Measuring flow cardinality differs from measuring flow size and flow persistence: a flow that appears several times is counted only once for cardinality, whereas flow size and flow persistence require counting the occurrences of the same flow. Let the measurement period be 10 minutes, divided into 100 timeslots. Consider a case where 1000 distinct hosts send 100000 packets to a server in one measurement period, and these packets constitute a flow whose label is the server address. The flow size is 100000. If these packets occur in 20 of the 100 timeslots, the persistence is 20. Meanwhile, the cardinality is 1000, which may indicate the beginning of an attack.
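The difference between the three quantities can be illustrated with a minimal Python sketch. The packet records and host names below are hypothetical toy data, not taken from the traces used in this paper; the flow label is the destination, as in the example above.

```python
from collections import defaultdict

# Hypothetical toy records: (timeslot, source, destination).
packets = [
    (0, "h1", "srv"), (0, "h1", "srv"), (0, "h2", "srv"),
    (1, "h2", "srv"), (3, "h3", "srv"), (3, "h1", "srv"),
]

size = defaultdict(int)      # packets per flow label
slots = defaultdict(set)     # timeslots in which the flow appears
sources = defaultdict(set)   # distinct sources contacting the destination

for slot, src, dst in packets:
    size[dst] += 1           # every packet counts
    slots[dst].add(slot)     # each timeslot counts once
    sources[dst].add(src)    # each source counts once

flow_size = size["srv"]
persistence = len(slots["srv"])
cardinality = len(sources["srv"])
print(flow_size, persistence, cardinality)
```

Repeated packets from the same host raise the flow size but leave the cardinality unchanged, which is why cardinality cannot be obtained from a plain counter.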
Of particular importance are heavy hitters with large flow size, persistent items with large flow persistence, and supernodes with large flow cardinality. Heavy hitter detection has been extensively studied [7–14] and has many applications, such as traffic accounting and load balancing. Persistent item detection is widely used in anomaly detection, stealthy network attack detection, etc. [15–18]. In this paper, we focus on supernode detection, which is more challenging because it must count only the distinct flows appearing in one measurement period. It is important for many applications [19–24], such as DDoS attack detection and network scanning detection. In a DDoS attack, the attacker uses infected hosts to launch a large number of requests to the targeted server in a short period, consuming its resources on a large scale and making it unable to provide normal services; the targeted server, which has a large cardinality, is a victim called a superdestination. In network scanning, a malicious host attempts to connect to a large number of distinct destination addresses or ports to discover vulnerabilities in the network system; the malicious host, which has a large cardinality, is an attacker called a supersource.
There have been many efforts on cardinality estimation. A simple method is to maintain the distinct connections between any two hosts in the network [25, 26]. This method measures the cardinality of each host accurately, but it is not practical for massive network traffic because of its large memory overhead [27, 28]. Sampling-based methods [29, 30] and data stream-based methods [31–33] have been proposed to tackle the challenges caused by big network traffic. However, the estimation accuracy of sampling-based methods depends on the sampling rate: at a low sampling rate, they maintain a small number of distinct connections but the accuracy of cardinality estimation decreases; at a high sampling rate, the accuracy increases but so does the memory overhead. Data stream-based methods show great superiority in memory utilization and measurement accuracy, and various types of summary data structures have been widely used in traffic measurement owing to their excellent compression efficiency and acceptable accuracy.
However, the existing data stream-based methods still encounter challenges in supernode detection. First, when massive network traffic is generated constantly at a high rate, it is difficult for them to store a large number of flows given the limited system resources at measurement points. Therefore, a compact data structure is needed to compress network traffic efficiently. Second, they cannot well support supernode detection in distributed monitoring systems, which involves distributed traffic collection and aggregation, centralized cardinality estimation, and supernode detection. Distributed traffic collection and aggregation merges network traffic from multiple measurement points, while centralized cardinality estimation and supernode detection identify supernodes based on the cardinalities estimated from the aggregated data structure. Since the connections established by a supernode may span the entire network, only a small number of its connections may be observed at one measurement point, while the aggregation over multiple measurement points yields a large cardinality. Hence, a traffic collection approach that handles data in a distributed manner is necessary to measure node cardinalities and detect supernodes. Third, it is difficult to measure two types of node cardinalities simultaneously. Usually, existing methods can measure only one type of node cardinality, since their summary data structures are two-dimensional bit arrays with a fixed row size or column size. Although various types of cardinalities can be measured with existing summary data structures, multiple instances of the summary data structure must be established with distinct keys, which is likely to cause excessive memory consumption. Besides, big network traffic makes it very challenging to measure two kinds of supernode cardinalities simultaneously because of the one-pass requirement. Thus, it is essential to design a cardinality estimation approach that detects supersources and superdestinations simultaneously. Fourth, they are unable to reconstruct the labels of supernodes efficiently because of the large calculation cost and unavoidable false positives and false negatives. Consequently, it is vital to reconstruct these labels efficiently and accurately for detecting supernodes.
Likewise, we confront several challenges in detecting supernodes. First, it is not easy to count the cardinality of each node from multiple measurement points, since connections may cross the entire network and the per-point counts cannot simply be added. Second, to measure the two types of cardinalities for supersources and superdestinations simultaneously, source and destination addresses need to be compressed into the summary data structure in a single pass. Finally, it is very difficult to recover the labels of supernodes efficiently from the summary data structure alone, without storing source and destination addresses. These challenges motivate our work.
To address these challenges, we propose a novel solution for supernode detection that reports the flows whose cardinalities are larger than the predefined threshold at the end of each measurement period. The proposed method maps each flow to one bit of two-dimensional bit arrays, aggregates the generated two-dimensional bit arrays, estimates node cardinalities using the probabilistic counting algorithm, reconstructs the flow labels of supernodes by inverse calculation, and detects supernodes. It mainly includes four steps: (i) The update operation extracts flow labels (e.g., source and destination pairs) from the packet stream and compresses them into two-dimensional bit arrays using a group of hash functions. The generated two-dimensional bit arrays are then aggregated into one two-dimensional bit array of the same size at the end of each measurement period, which supports distributed traffic collection and centralized analysis. (ii) The estimation operation measures source and destination cardinalities simultaneously by using the summary data structure only once, and the minimal estimated cardinalities are taken as the estimations to mitigate the overestimation problem. (iii) The reconstruction operation efficiently recovers supersources and superdestinations by inverse calculation without storing the information associated with sources and destinations. (iv) The detection operation identifies supersources and superdestinations under the guidance of abnormal rows and columns, without searching all possible abnormal rows and columns through the reconstruction operation. Our main contributions are summarized as follows.
In this paper, we design a novel reversible and distributed summary data structure suitable for supernode detection with accuracy and memory size guarantees in the distributed monitoring system. It effectively handles a large amount of arriving network traffic with a group of hash functions, simultaneously estimates source and destination cardinalities by the probabilistic counting method, and efficiently detects supersources and superdestinations by reconstructing the flow labels of supernodes from the aggregated summary data structure. We theoretically analyze the computation complexity, space complexity, and estimation accuracy of our method. We conduct extensive experiments on real traffic traces from WIDE to evaluate the performance of our method. The experimental results demonstrate that our method achieves superior performance compared to state-of-the-art methods in accurately and efficiently detecting supersources and superdestinations.
The rest of this paper is organized as follows: Section 2 summarizes related works; Section 3 formulates the problem; Section 4 presents our method and its theoretical analysis; Section 5 evaluates performance and conducts experiments; Section 6 concludes this work.
2. Related Work
Network traffic measurement has been extensively applied in many fields, where per-flow [34–39], heavy hitter [9–12], persistent item [16, 17], and supernode measurement [19–21] are still hot topics.
Much prior work focuses on per-flow size and cardinality measurement. The task of per-flow size measurement is to count the number of elements in each flow, where flows can be TCP flows, UDP flows, or any other type defined according to the specific application requirements, and elements may be packets, bytes, or occurrences of certain events. The simple method allocates a counter to each flow to measure its size. When a packet arrives, the corresponding counter is increased by an integer, for instance, one at the packet level, the number of bytes at the byte level, or the number of accesses to websites. However, the large number of flows in high-speed networks leads to enormous memory consumption. Therefore, many strategies to improve memory utilization have been proposed, for instance, CAESAR [40] and virtual sketch [41], which let flows share counters or sketches to reduce memory consumption. Compared to per-flow size, per-flow cardinality is more difficult to measure, since it counts the number of distinct elements in each flow, and per-flow size measurement methods cannot be applied to it directly. Most existing mechanisms let the distinct elements of all flows share the same bit vector, such as [42–44], which create a virtual bit vector for each flow, each bit of which is selected from the shared bit vector using hash functions, so as to reduce memory cost.
Heavy hitter measurement is to find flows whose sizes are more than the predefined threshold in the measurement period. The universal approach is to keep track of a small set of flows, trying to retain large flows in the set while replacing small ones with new flows, such as Lossy Counting [45] and Space Saving [46]. Persistent item measurement is to find flows that occur in many timeslots. Supernode measurement is to find flows whose cardinalities are more than the predefined threshold in the measurement period, and it has been used to find attackers or victims. Supernode detection can be treated as a special case of heavy hitter detection through identifying each source or destination with a large number of connections. However, the existing solutions [47] of heavy hitter detection cannot be directly used to solve the problems of supernode identification, since they are not able to filter repetitive connections in the data streams using counters allocated to each flow. We discuss the existing solutions to detect supernodes below.
The traditional approaches maintain all distinct connections for each source or destination to detect DDoS attackers or targets in the measurement period. Although they can detect supernodes accurately, they consume a great deal of memory because of the large number of flows in high-speed networks.
Flow sampling-based approaches monitor the set of flows whose hash values are smaller than the predefined sampling rate [48]. Therefore, the flows with many connections are very likely to be sampled. Flow sampling-based approaches improve memory efficiency, but the accuracy of supernode identification depends on the sampling rate. Moreover, they maintain the sampled flows at a high calculation and memory access cost.
Data streaming-based approaches are also used to detect supernodes. Most existing solutions design summary data structures that fit in fast memory and encode the flow labels extracted from arriving packets into these structures. However, they cannot recover supernodes from the summary data structures alone because the structures are irreversible. Sketches are one type of summary data structure designed to detect supernodes and overcome irreversibility [49]. Wang et al. [20] proposed a double connection degree sketch (DCDS) that reconstructs host addresses with large cardinalities based on the Chinese Remainder Theorem. Liu et al. [21] designed a vector bloom filter for supernode detection, which extracts several bits directly from flow labels. However, the calculation cost is considerable for a large address space.
Some network-wide measurement systems solve the problem of supernode detection [50, 51]. The proposed summary data structures can serve as a component of network-wide measurement systems.
Besides, some variants of supernode detection have been proposed in the literature [52, 53]. Zhou et al. [52] proposed a solution to the persistent spread problem, which counts the number of distinct elements in each flow that occur persistently in the predefined measurement periods. Huang et al. [53] further solved the k-persistent spread problem, which measures the number of distinct elements in each flow appearing in at least k out of t measurement periods.
3. Problem Formulation
In this paper, we consider a distributed monitoring system composed of a controller and a set of monitors, as shown in Figure 1. Let P = p_{1}, p_{2}, …, p_{t}, … be a sequentially arriving packet stream, where p_{t} = (s_{t}, d_{t}) consists of some fields of the t-th packet, in which s_{t} and d_{t} are the corresponding source and destination, respectively. The source space S and the destination space D consist of the distinct sources and destinations observed over one measurement period. A source can be any combination of source fields in the packet header, such as source IP, source port, or their combination. Similarly, a destination can be any combination of destination fields in the packet header, such as destination IP, destination port, or their combination. The source and destination are determined according to the specific application requirement. In this work, we use the source and destination pair of a packet as its flow label. The whole measurement time is partitioned into measurement periods T of equal length. During each measurement period, the monitor extracts some fields from the arriving packets and compresses them into summary data structures. At the end of the measurement period, the monitor sends the generated data structures to the controller and resets them. Then, the controller aggregates the received data structures and detects supernodes.
For any source s ∈ S, its cardinality is defined as the number of distinct destinations that s connects to in one measurement period. Similarly, for any destination d ∈ D, its cardinality is defined as the number of distinct sources that connect to d in one measurement period. Supernodes are divided into two types: supersources and superdestinations. Supersources (superdestinations) are the hosts whose cardinality exceeds the predefined threshold θ_{1}D_{1} (θ_{2}D_{2}) in one measurement period, where D_{1} (D_{2}) denotes the sum of source (destination) cardinalities in one measurement period and θ_{1} and θ_{2} are constants with 0 < θ_{1}, θ_{2} < 1. Supersources and superdestinations are expressed as
$$\text{Supersources} = \{\, s \mid SC(s) > \theta_{1} D_{1},\ s \in S \,\},$$
$$\text{Superdestinations} = \{\, d \mid DC(d) > \theta_{2} D_{2},\ d \in D \,\},$$
where SC (s) indicates the cardinality of source s, DC (d) indicates the cardinality of destination d, D_{1} is computed as $D_{1} = \sum_{s \in S} SC(s)$, and D_{2} is computed as $D_{2} = \sum_{d \in D} DC(d)$.
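As a reference point for this definition, the following Python sketch computes supersources and superdestinations exactly by storing every distinct connection. It is only an illustrative baseline, not the method proposed in this paper; the function name `exact_supernodes` and the toy traffic are assumptions made for the example.

```python
from collections import defaultdict

def exact_supernodes(flows, theta1, theta2):
    """Exact baseline: keep every distinct (s, d) connection in memory."""
    dsts_of = defaultdict(set)   # s -> distinct destinations
    srcs_of = defaultdict(set)   # d -> distinct sources
    for s, d in flows:
        dsts_of[s].add(d)
        srcs_of[d].add(s)
    sc = {s: len(v) for s, v in dsts_of.items()}   # SC(s)
    dc = {d: len(v) for d, v in srcs_of.items()}   # DC(d)
    d1 = sum(sc.values())                          # D_1
    d2 = sum(dc.values())                          # D_2
    supersources = {s for s, c in sc.items() if c > theta1 * d1}
    superdestinations = {d for d, c in dc.items() if c > theta2 * d2}
    return supersources, superdestinations

# Hypothetical traffic: source 1 scans 20 destinations, ten sources hit
# destination 500, and one connection of source 1 is repeated.
flows = [(1, 100 + k) for k in range(20)] + [(s, 500) for s in range(2, 12)] + [(1, 100)]
ss, sd = exact_supernodes(flows, theta1=0.5, theta2=0.3)
print(ss, sd)
```

The memory used by this baseline grows with the number of distinct connections, which is the cost that summary data structures are meant to avoid.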
Supernode detection is widely used in many areas. For example, port scanning attacks are performed by trying to connect to numerous distinct destination addresses or ports to probe for vulnerable services. In this case, the attacker with large cardinality is a supersource. Besides, Distributed Denial-of-Service (DDoS) attacks are launched by using a large number of connected devices to send many requests to a victim such as a server, so that legitimate users are not able to utilize its resources. In this case, the victim with high cardinality is a superdestination.
The goal of this work is to design a distributed monitoring framework which consists of updating module and detecting module. The former stores information associated with cardinalities of flows by summary data structures on each monitor. The latter aggregates the generated data structures from each monitor, estimates source and destination cardinality, reconstructs supersource and destination candidates, and identifies supersources and destinations on the controller. We perform theoretical analysis and performance evaluation.
4. Our Algorithm
In this section, we first introduce a novel summary data structure. Then, we define the main operations of our algorithm: updating and aggregating summary data structures, estimating cardinality, reconstructing sources and destinations, and detecting supersources and superdestinations. Finally, we analyze the updating complexity and estimation accuracy theoretically. The framework of our method is shown in Figure 2.
4.1. Data Structure
The summary data structure is denoted as follows:
$$B = (B_{1}, B_{2}, \ldots, B_{H}).$$
Each B_{i} (1 ≤ i ≤ H) is a two-dimensional bit array with the size of n_{i} × m_{i}, each bit of which is denoted as B_{i}[j][k] (0 ≤ j ≤ n_{i} − 1 and 0 ≤ k ≤ m_{i} − 1), as shown in Figure 3. The arrays B_{i} (1 ≤ i ≤ H) have different sizes so that they store sufficient flow information and can effectively trace attackers or victims. We use a row hash function f_{i} and a column hash function h_{i} to locate the index in each B_{i} (1 ≤ i ≤ H). The row and column hash functions are expressed as
$$f_{i}: \{0, 1, \ldots, N-1\} \rightarrow \{0, 1, \ldots, n_{i}-1\},$$
$$h_{i}: \{0, 1, \ldots, M-1\} \rightarrow \{0, 1, \ldots, m_{i}-1\},$$
where N and M are the sizes of the source space and destination space and n_{i} and m_{i} indicate the numbers of rows and columns in each B_{i} (1 ≤ i ≤ H).
To recover abnormal sources by simple computation, the row hash function f_{i} and column hash function h_{i} are defined as
$$f_{i}(s) = s \bmod n_{i} = c_{i},$$
$$h_{i}(d) = d \bmod m_{i} = c'_{i},$$
where c_{i} and c'_{i} are the values of the modulus operation, n_{1}, n_{2}, …, n_{H} are selected as pairwise coprime integers to make the summary data structure reversible, and m_{1}, m_{2}, …, m_{H} are also pairwise coprime integers.
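A minimal Python sketch of this indexing, assuming modular row and column hash functions and H = 3 arrays. The sizes in `ROWS` and `COLS` and the helper names are hypothetical choices for illustration, not the parameters used in the experiments.

```python
from itertools import combinations
from math import gcd

ROWS = [251, 253, 255]   # hypothetical row counts n_i
COLS = [127, 128, 129]   # hypothetical column counts m_i

def pairwise_coprime(values):
    """True if every pair of values has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(values, 2))

def locate(i, s, d):
    """Row and column of flow (s, d) in the i-th bit array (0-based i)."""
    return s % ROWS[i], d % COLS[i]

# IPv4 addresses written as integers: 192.168.0.1 and 10.0.0.1.
source, destination = 3232235521, 167772161
positions = [locate(i, source, destination) for i in range(len(ROWS))]
print(pairwise_coprime(ROWS), pairwise_coprime(COLS), positions)
```

Pairwise coprimality is what later allows a label to be recovered from its row (or column) indices, provided the product of the moduli is at least as large as the label space.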
The monitors use the same summary data structure B as the controller. Let the number of monitors be R. The summary data structure on the r-th monitor is indicated as follows:
$$B^{r} = (B_{1}^{r}, B_{2}^{r}, \ldots, B_{H}^{r}), \quad 1 \le r \le R.$$
4.2. Updating Summary Data Structures
The updating operation collects flow information from massive network traffic. At the beginning of one measurement period, all bits in B and each B^{r} (1 ≤ r ≤ R) are initialized to zero. Then, when a packet arrives, each monitor updates the corresponding bit in each B_{i}^{r} using the row hash function f_{i} (1 ≤ i ≤ H) and column hash function h_{i} (1 ≤ i ≤ H). At the end of one measurement period, the controller aggregates the summary data structures generated by the monitors. The updating process is described in Algorithm 1.

When the packet stream arrives sequentially, each monitor extracts the flow label (s_{t}, d_{t}) from each packet and sets the corresponding bit in each B_{i}^{r} (1 ≤ i ≤ H, 1 ≤ r ≤ R), located by the row hash function f_{i} (s_{t}) (1 ≤ i ≤ H) and column hash function h_{i} (d_{t}) (1 ≤ i ≤ H), to one, which is denoted as follows:
$$B_{i}^{r}[f_{i}(s_{t})][h_{i}(d_{t})] = 1.$$
At the end of one measurement period, each monitor sends the generated summary data structure to the controller. Then, the controller performs the bitwise-OR operation on the same bits in each B_{i}^{r} (1 ≤ i ≤ H and 1 ≤ r ≤ R). If a bit in any B_{i}^{r} (1 ≤ r ≤ R) is one, the corresponding bit in B_{i} (1 ≤ i ≤ H) of B is set to one. The bitwise-OR operation is denoted as follows:
$$B_{i}[j][k] = B_{i}^{1}[j][k] \oplus B_{i}^{2}[j][k] \oplus \cdots \oplus B_{i}^{R}[j][k],$$
where B_{i}^{r}[j][k] (1 ≤ r ≤ R) indicates the corresponding bit in the i-th bit array in B^{r} and ⊕ is the bitwise-OR operator.
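The update and aggregation steps can be sketched in Python as follows, assuming modular hash functions and small hypothetical array sizes (`N_ROWS`, `N_COLS`). The function names are illustrative and do not reproduce Algorithm 1 verbatim.

```python
N_ROWS = [7, 11, 13]   # hypothetical pairwise-coprime row counts n_i
N_COLS = [8, 9, 5]     # hypothetical pairwise-coprime column counts m_i

def new_summary():
    """H two-dimensional bit arrays, all bits initialized to zero."""
    return [[[0] * m for _ in range(n)] for n, m in zip(N_ROWS, N_COLS)]

def update(summary, s, d):
    """Set one bit per array for the flow label (s, d)."""
    for i, (n, m) in enumerate(zip(N_ROWS, N_COLS)):
        summary[i][s % n][d % m] = 1

def aggregate(summaries):
    """Controller side: bitwise OR of the monitors' summaries."""
    merged = new_summary()
    for summary in summaries:
        for i, array in enumerate(summary):
            for j, row in enumerate(array):
                for k, bit in enumerate(row):
                    merged[i][j][k] |= bit
    return merged

# Two monitors observe overlapping connections of source 10.
monitor_a, monitor_b, single = new_summary(), new_summary(), new_summary()
for s, d in [(10, 3), (10, 4)]:
    update(monitor_a, s, d)
    update(single, s, d)
for s, d in [(10, 4), (10, 6)]:
    update(monitor_b, s, d)
    update(single, s, d)
merged = aggregate([monitor_a, monitor_b])
print(sum(merged[0][10 % 7]))
```

Because OR is idempotent, the connection (10, 4) seen by both monitors is counted once, and the merged summary equals the one a single monitor would have built from all packets.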
4.3. Estimating Source and Destination Cardinality
The estimating operation obtains an approximate estimation of cardinality based on the aggregated data structure B = (B_{1}, B_{2}, …, B_{H}). It is shown in Algorithm 2. For each source s ∈ S, we compute the hash value f_{i}(s) of source s to locate the row in each B_{i} (1 ≤ i ≤ H) of B. The flows associated with source s are mapped to H rows B_{i}(s) = B_{i} [f_{i} (s)] [·] (1 ≤ i ≤ H). As a result, we obtain H bit vectors B_{i} (s) (1 ≤ i ≤ H), each of m_{i} bits, that store the cardinality information of source s. The source cardinality for each bit vector B_{i} (s) (1 ≤ i ≤ H) is estimated as (11) by the probabilistic counting algorithm:
$$SC_{i}(s) = -m_{i} \ln \frac{U_{i}(s)}{m_{i}}, \tag{11}$$
where U_{i}(s) is the number of zero bits in each bit vector B_{i}(s) (1 ≤ i ≤ H).

If no other sources are mapped to the f_{i} (s)-th row, the estimated source cardinality is very close to its real value. In practice, each B_{i} (s) (1 ≤ i ≤ H) may contain noise caused by other sources, so the source cardinality may be overestimated. Therefore, we use the minimum of the estimated source cardinalities SC_{i} (s) (1 ≤ i ≤ H) as its estimation [40], which is denoted as follows:
$$SC(s) = \min_{1 \le i \le H} SC_{i}(s). \tag{12}$$
Similarly, for each destination d ∈ D, we calculate the hash value h_{i} (d) of destination d to locate the column in each B_{i} (1 ≤ i ≤ H) of B. The flows associated with destination d are hashed to H bit vectors B_{i} (d) = B_{i} [·] [h_{i} (d)] (1 ≤ i ≤ H), each of n_{i} bits. Therefore, the destination cardinality for each bit vector B_{i} (d) (1 ≤ i ≤ H) is estimated as (13) by the probabilistic counting algorithm:
$$DC_{i}(d) = -n_{i} \ln \frac{U_{i}(d)}{n_{i}}, \tag{13}$$
where U_{i}(d) is the number of zero bits in each bit vector B_{i} (d) (1 ≤ i ≤ H).
We use the minimum of the estimated destination cardinalities DC_{i} (d) (1 ≤ i ≤ H) as its estimation, which is denoted as follows:
$$DC(d) = \min_{1 \le i \le H} DC_{i}(d). \tag{14}$$
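The estimation step can be sketched in Python, assuming the linear-time probabilistic counting estimator (minus the vector length times the logarithm of the fraction of zero bits) and modular hash functions. The sizes `ROWS` and `COLS`, the helper names, and the handling of a saturated vector are assumptions made for this example.

```python
import math

ROWS = [7, 11]    # hypothetical row counts n_i
COLS = [64, 81]   # hypothetical column counts m_i

def linear_count(bits):
    """Probabilistic (linear) counting estimate from one bit vector."""
    zeros = bits.count(0)
    if zeros == 0:
        return float("inf")   # saturated vector; an assumption of this sketch
    return -len(bits) * math.log(zeros / len(bits))

def build(flows):
    """Build the summary from (s, d) pairs with modular hashing."""
    summary = [[[0] * m for _ in range(n)] for n, m in zip(ROWS, COLS)]
    for s, d in flows:
        for i, (n, m) in enumerate(zip(ROWS, COLS)):
            summary[i][s % n][d % m] = 1
    return summary

def estimate_source(summary, s):
    """Minimum over the H row estimates of source s."""
    return min(linear_count(summary[i][s % n]) for i, n in enumerate(ROWS))

def estimate_destination(summary, d):
    """Minimum over the H column estimates of destination d."""
    return min(linear_count([row[d % m] for row in summary[i]])
               for i, m in enumerate(COLS))

summary = build([(5, d) for d in range(20)])   # source 5 contacts 20 destinations
sc = estimate_source(summary, 5)
dc = estimate_destination(summary, 3)
print(round(sc, 2), round(dc, 2))
```

Both cardinalities are read from the same bit arrays: rows give the source view and columns give the destination view, so no second data structure is required.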
4.4. Reconstructing Sources and Destinations
The reconstructing operation recovers abnormal sources and destinations. For ease of understanding, we first consider the simple situation. Suppose that there is only one abnormal row in each B_{i} (1 ≤ i ≤ H), whose index is denoted as c_{i} (1 ≤ i ≤ H). According to the predefined row hash function f_{i}, a source s is mapped to the f_{i} (s)-th row in each B_{i} (1 ≤ i ≤ H), that is, s ≡ c_{i} (mod n_{i}) (1 ≤ i ≤ H). The problem of finding the abnormal source s is thus converted to solving the system of congruences s ≡ c_{i} (mod n_{i}) (1 ≤ i ≤ H). Based on the Chinese Remainder Theorem (CRT) [44], the solution is denoted as follows:
$$s \equiv \sum_{i=1}^{H} c_{i} P_{i} P_{i}^{-1} \pmod{P}, \tag{15}$$
where $P = \prod_{i=1}^{H} n_{i}$, $P_{i} = P / n_{i}$, and $P_{i} P_{i}^{-1} \equiv 1 \pmod{n_{i}}$.
Similarly, we assume only one abnormal column in each B_{i} (1 ≤ i ≤ H), whose index is denoted as c'_{i} (1 ≤ i ≤ H). The problem of finding the abnormal destination d is converted to solving the system of congruences d ≡ c'_{i} (mod m_{i}) (1 ≤ i ≤ H). Therefore, the solution is expressed as follows using the CRT:
$$d \equiv \sum_{i=1}^{H} c'_{i} Q_{i} Q_{i}^{-1} \pmod{Q}, \tag{16}$$
where $Q = \prod_{i=1}^{H} m_{i}$, $Q_{i} = Q / m_{i}$, and $Q_{i} Q_{i}^{-1} \equiv 1 \pmod{m_{i}}$.
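The single-row case can be checked with a short Python sketch of the standard CRT construction. The moduli and the source value are hypothetical, and the function name `crt` is illustrative; `pow(x, -1, n)` requires Python 3.8 or later.

```python
from math import prod

def crt(residues, moduli):
    """Solve x = residues[i] (mod moduli[i]) for pairwise-coprime moduli."""
    total = prod(moduli)
    x = 0
    for c, n in zip(residues, moduli):
        partial = total // n             # product of the other moduli
        inverse = pow(partial, -1, n)    # modular inverse of partial mod n
        x += c * partial * inverse
    return x % total

moduli = [251, 253, 255]                 # hypothetical row counts n_i
source = 1234567                         # hypothetical source label
residues = [source % n for n in moduli]  # abnormal row index in each array
recovered = crt(residues, moduli)
print(recovered)
```

Recovery is unique only when the label is smaller than the product of the moduli (16193265 here), so the array sizes have to be chosen with the size of the label space in mind.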
For the general situation, we assume w abnormal rows or columns in each B_{i} (1 ≤ i ≤ H). There are w^{H} combinations, each consisting of one abnormal row or column from each B_{i} (1 ≤ i ≤ H). We use the CRT to solve the reversible problem for each combination. The complete set of abnormal sources or destinations is the union of the solutions over all combinations. However, the reverse calculations cause large computational overhead and increase the false positive rate and false negative rate because of false combinations and hash collisions.
We design a strategy that establishes relations between the consecutive rows to which sources are mapped by the row hash functions. Taking one source s as an example, we obtain the row index f_{i} (s) (1 ≤ i ≤ H) and the next row index f_{i + 1} (s) (1 ≤ i ≤ H − 1) using the row hash functions f_{i} (1 ≤ i ≤ H). The row index f_{i} (s) is then associated with the next row index f_{i + 1} (s) (1 ≤ i ≤ H − 1) by a hash table, that is, T_{i} [f_{i} (s)] = f_{i + 1} (s) (1 ≤ i ≤ H − 1). However, different sources may be mapped to the same rows, leading to hash collisions. If another source s´ is mapped to the same row as s, then T_{i} [f_{i} (s)] = {f_{i + 1} (s), f_{i + 1} (s´)} (1 ≤ i ≤ H − 1). We form row combinations based on the hash table to reduce the computational overhead and to lower the false positive rate and false negative rate. After that, we can accurately recover sources from the row combinations using inverse calculation. Similarly, supposing that two destinations d and d´ are mapped to the same columns, T_{i} [h_{i} (d)] = {h_{i + 1} (d), h_{i + 1} (d´)} (1 ≤ i ≤ H − 1). Likewise, we form column combinations based on the hash table and recover destinations using inverse calculation.
As a result, the strategy reduces the false positives generated by false row combinations. Supersources and superdestinations produce abnormal rows and columns with high probability.
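The effect of the linking tables can be sketched in Python as follows. This is an illustrative reading of the strategy, assuming modular row hashes and tiny hypothetical moduli; the names `build_links` and `linked_chains`, and the way the tables are filled, are assumptions of the example rather than the exact procedure of this paper.

```python
from collections import defaultdict
from math import prod

MODULI = [5, 7, 9]   # hypothetical pairwise-coprime row counts n_1..n_H

def build_links(sources):
    """T_i maps a row index in B_i to the row indices seen with it in B_(i+1)."""
    links = [defaultdict(set) for _ in range(len(MODULI) - 1)]
    for s in sources:
        for i in range(len(MODULI) - 1):
            links[i][s % MODULI[i]].add(s % MODULI[i + 1])
    return links

def linked_chains(abnormal, links):
    """Follow the links through abnormal rows only; abnormal[i] is a set of rows."""
    chains = [(row,) for row in abnormal[0]]
    for i in range(len(MODULI) - 1):
        chains = [chain + (nxt,)
                  for chain in chains
                  for nxt in links[i].get(chain[-1], ())
                  if nxt in abnormal[i + 1]]
    return chains

def crt(residues):
    """Standard CRT reconstruction for the moduli above."""
    total = prod(MODULI)
    x = 0
    for c, n in zip(residues, MODULI):
        partial = total // n
        x += c * partial * pow(partial, -1, n)
    return x % total

observed = [23, 104, 200]                      # hypothetical sources seen
supers = [23, 104]                             # hypothetical supersources
abnormal = [{s % n for s in supers} for n in MODULI]
chains = linked_chains(abnormal, build_links(observed))
recovered = sorted(crt(chain) for chain in chains)
naive = prod(len(rows) for rows in abnormal)   # combinations without links
print(naive, len(chains), recovered)
```

In this toy case the links cut the candidate combinations from four to two, and both remaining chains invert to true supersources.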
4.5. Detecting Supersources and Superdestinations
To detect supersources, we identify the abnormal rows that supersources generate with high probability in each B_{i} (1 ≤ i ≤ H). We view each row B_{i} [j] [·] (1 ≤ i ≤ H, 0 ≤ j ≤ n_{i} − 1) as one bit vector of m_{i} bits. The cardinality of each row is estimated as (17) by the probabilistic counting algorithm:
$$SC_{i}[j] = -m_{i} \ln \frac{U_{i}^{\mathrm{row}}[j]}{m_{i}}, \tag{17}$$
where $U_{i}^{\mathrm{row}}[j]$ is the number of zero bits in row B_{i} [j] [·] (1 ≤ i ≤ H, 0 ≤ j ≤ n_{i} − 1). If the estimated cardinality of a row B_{i} [j] [·] exceeds the predefined threshold α_{i}, which is set as a ratio of the sum of source cardinalities in a measurement period, the row is defined as an abnormal row. Similarly, to detect superdestinations, we identify the abnormal columns that superdestinations generate with high probability in each B_{i} (1 ≤ i ≤ H). We view each column B_{i} [·] [j] (1 ≤ i ≤ H, 0 ≤ j ≤ m_{i} − 1) as one bit vector of n_{i} bits, and its cardinality is estimated as (18) by the probabilistic counting algorithm:
$$DC_{i}[j] = -n_{i} \ln \frac{U_{i}^{\mathrm{col}}[j]}{n_{i}}, \tag{18}$$
where $U_{i}^{\mathrm{col}}[j]$ is the number of zero bits in column B_{i} [·] [j] (1 ≤ i ≤ H, 0 ≤ j ≤ m_{i} − 1). If the estimated cardinality of a column B_{i} [·] [j] exceeds the predefined threshold β_{i}, which is set as a ratio of the sum of destination cardinalities, the column is defined as an abnormal column.
After that, we obtain abnormal rows and columns in each B_{i} (1 ≤ i ≤ H). Therefore, supersources and superdestinations can be identified by reconstructing operation. The detection operation is shown in Algorithm 3.
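A minimal Python sketch of the abnormal row and column scan on one bit array, assuming the linear counting estimator and absolute thresholds. The array contents, the threshold values, and the function names are hypothetical and do not reproduce Algorithm 3 verbatim.

```python
import math

def estimate(bits):
    """Probabilistic (linear) counting estimate from one bit vector."""
    zeros = bits.count(0)
    if zeros == 0:
        return float("inf")   # saturated vector; an assumption of this sketch
    return -len(bits) * math.log(zeros / len(bits))

def abnormal_rows(bit_array, threshold):
    """Indices of rows whose estimated cardinality exceeds the threshold."""
    return [j for j, row in enumerate(bit_array) if estimate(row) > threshold]

def abnormal_columns(bit_array, threshold):
    """Indices of columns whose estimated cardinality exceeds the threshold."""
    columns = [list(col) for col in zip(*bit_array)]
    return [k for k, col in enumerate(columns) if estimate(col) > threshold]

# Hypothetical 4 x 16 bit array: row 2 is dense, row 0 has a single bit.
array = [[0] * 16 for _ in range(4)]
array[0][3] = 1
for k in range(10):
    array[2][k] = 1

rows = abnormal_rows(array, threshold=5.0)
cols = abnormal_columns(array, threshold=2.0)
print(rows, cols)
```

Only the indices returned here need to be passed to the reconstruction step, which keeps the number of CRT inversions small.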

4.6. Theoretical Analysis
In the updating process, each packet requires 2H hash calculations to determine the row and column in the H two-dimensional bit arrays, together with 2H memory accesses. The aggregation operation executes H + 2 memory accesses. Therefore, the time complexity to update a packet is O (H). For simplicity, we only consider the size of the summary data structures. Each monitor needs m_{i}n_{i}H memory to store the related cardinality information, and the controller needs the same memory size to detect supernodes. Since the distributed monitoring system consists of the controller and multiple monitors, the required memory space is m_{i}n_{i} (H + 1). Therefore, the space complexity is O (m_{i}n_{i}H).
We only derive the deviation and standard error of source cardinality estimation in each twodimensional bit array B_{i} (1 ≤ i ≤ H) according to the lineartime probabilistic counting algorithm as follows.
Let ; the estimation of source cardinality in each twodimensional bit array B_{i} (1 ≤ i ≤ H) is denoted as follows:
The Taylor series of at is expressed as follows:
We take the first three items of (20), and the estimation of source cardinality in each twodimensional bit array B_{i} (1 ≤ i ≤ H) is approximately represented as follows:
The mathematical expectation of source cardinality estimation is denoted as follows:
Since , we obtain
The mathematical expectation of relative error is denoted as follows:
Let . Equation (24) is transformed as follows:
Figure 4 illustrates the relationship among the source cardinality SC_{i}, the ratio t_{i}, and the relative error. When the source cardinality SC_{i} is fixed, the relative error decreases as the ratio t_{i} decreases. Therefore, we can select the column number n_{i} of the summary data structure to obtain the required relative error for each SC_{i}.
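For reference, the bias result can be restated using the standard analysis of linear-time probabilistic counting. The notation is assumed here for illustration and may differ from the paper's own: a bit vector of n_{i} bits contains U_{i} zero bits, V_{i} = U_{i}/n_{i} is the fraction of zero bits, t_{i} = SC_{i}/n_{i} is the load ratio, p_{i} = e^{−t_{i}}, and a hat marks an estimate.

```latex
\begin{align*}
\widehat{SC}_i &= -n_i \ln V_i, \qquad V_i = \frac{U_i}{n_i},\\
\widehat{SC}_i &\approx n_i\left[t_i - \frac{V_i - p_i}{p_i}
                 + \frac{(V_i - p_i)^2}{2p_i^2}\right]
                 \quad\text{(Taylor expansion of } \ln V_i \text{ about } p_i\text{)},\\
E[V_i] &\approx p_i, \qquad
\operatorname{Var}[V_i] \approx \frac{1}{n_i}\,e^{-t_i}\left(1 - (1 + t_i)\,e^{-t_i}\right),\\
E\big[\widehat{SC}_i\big] &\approx SC_i + \frac{e^{t_i} - t_i - 1}{2},\\
E\left[\frac{\widehat{SC}_i - SC_i}{SC_i}\right]
  &\approx \frac{e^{t_i} - t_i - 1}{2\,SC_i}
   = \frac{e^{t_i} - t_i - 1}{2\,t_i\,n_i}.
\end{align*}
```

For fixed SC_{i}, the term e^{t_{i}} − t_{i} − 1 grows with t_{i}, so a smaller ratio (a longer bit vector) gives a smaller relative error, which matches the trend described for Figure 4.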
To derive the variance of the ratio of the estimated to the true source cardinality, we take the first two terms of (20) as the approximate estimation of source cardinality in each two-dimensional bit array B_{i} (1 ≤ i ≤ H), which is represented as follows:
The variance of this ratio is denoted as follows:
Since , we obtain (28) according to the previous formula:
Therefore, the standard error of the ratio of the estimated to the true source cardinality is denoted as follows:
Let ; equation (11) is transformed as follows:
Figure 5 shows the relationship among the source cardinality SC_{i}, the standard error of the ratio of the estimated to the true source cardinality, and the ratio t_{i}. When the source cardinality SC_{i} is fixed, the standard error also decreases as the ratio t_{i} decreases. Therefore, we can select the column number n_{i} of the summary data structure to obtain the required standard error for each SC_{i}.
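Choosing the vector length for a target error can be sketched numerically. The sketch assumes the standard linear-counting results for a vector of `size` bits holding `sc` distinct items, with t = sc/size: relative bias (e^t − t − 1)/(2·sc) and standard error of the estimated-to-true ratio sqrt(size·(e^t − t − 1))/sc. The function names and the search strategy are illustrative assumptions.

```python
import math


def _excess(t: float) -> float:
    """e^t - t - 1, returning infinity instead of overflowing for very large t."""
    return math.expm1(t) - t if t < 700 else math.inf


def relative_bias(sc: float, size: int) -> float:
    """Expected relative error of the linear-counting estimate."""
    return _excess(sc / size) / (2 * sc)


def standard_error(sc: float, size: int) -> float:
    """Standard error of the ratio of estimated to true cardinality."""
    return math.sqrt(size * _excess(sc / size)) / sc


def smallest_size(sc: float, target: float) -> int:
    """Smallest vector length whose standard error is at most `target`.

    Relies on the standard error decreasing as the vector grows.
    """
    hi = 1
    while standard_error(sc, hi) > target:
        hi *= 2
    lo = hi // 2  # either 0 or a size known to miss the target
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if standard_error(sc, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `smallest_size(1000, 0.02)` returns the shortest vector that keeps the standard error at or below 2 percent for a cardinality of 1000.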
5. Experiment and Evaluation
In this section, we evaluate the performance of our algorithm in comparison with other algorithms. The experiments are conducted on real traffic data, and all algorithms are implemented in C++ on a server with an Intel E2224 CPU and 32 GB of memory. We also discuss the influence of the parameters on the algorithm performance.
5.1. Datasets
To evaluate the performance of our algorithm, we take the first three minutes of the 15-minute traffic traces without packet payload published by MAWI [54] and divide them into three one-minute traffic traces called data1, data2, and data3. Table 1 shows the statistical information of the three traffic traces used in our experiments, where #packet denotes the number of packets in the traffic trace, SIP denotes the number of distinct source IP addresses (SIP for short), DIP denotes the number of distinct destination IP addresses (DIP for short), and (SIP, DIP) denotes the number of distinct source IP address and destination IP address pairs. As shown in Table 1, the average numbers of packets, source IP addresses, destination IP addresses, and source and destination IP address pairs in the three traffic traces are 2.4M, 39K, 212K, and 676K, respectively.
In this paper, we use SIP and DIP as sources and destinations, respectively. Figure 6 shows the cardinality distribution in the three traffic traces, where the x-coordinates indicate the logarithm of the source or destination cardinality and the y-coordinates indicate the logarithm of the number of sources or destinations. The number of sources or destinations decreases as the source or destination cardinality increases. As shown in Figures 6(a), 6(c), and 6(e), sources with low cardinality clearly outnumber those with high cardinality: sources whose cardinality is less than 10 account for about 93 percent of all sources. Figures 6(b), 6(d), and 6(f) show similar results for destinations: destinations whose cardinality is less than 10 account for about 98 percent of all destinations. The cardinality approximately follows a heavy-tailed distribution.
In the experiments, we assume that there are three monitors in the distributed monitoring system, each of which processes a 20-second segment of each one-minute traffic trace. For detecting supersources and superdestinations, we compare our algorithm with the compact spread estimator (CSE) [42], the double connection degree sketch (DCDS) [20], and SpreadSketch (SS) [49] in terms of estimation accuracy, detection precision, and memory cost. CSE constructs a virtual bit vector for each host from a shared one-dimensional bit array by a group of hash functions, and it provides good accuracy in a small memory. DCDS constructs multiple two-dimensional bit arrays, sets only several bits selected by a group of hash functions in each bit array for each incoming packet, and then reconstructs abnormal hosts by simple inverse calculation. SS constructs an invertible sketch formed by combining a count-min sketch with a multiresolution bitmap; multiple sketches can be merged to provide a network-wide measurement view, from which superspreaders and their estimated fanouts are recovered with simple computations and small memory.
We also need to know in advance the true supersources and superdestinations in the three traffic traces. Table 2 shows the number of true supersources and superdestinations when the predefined thresholds are 0.1 and 0.01 percent of the overall source and destination cardinality for the three traffic traces, where SS and SD denote the number of true supersources and superdestinations, respectively.
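The ground truth can be computed exactly from a trace by tracking distinct peers per node. The sketch below is a minimal illustration. It assumes that "overall cardinality" means the sum of all node cardinalities and that thresholds are passed as fractions (for example, 0.001 for 0.1 percent). The function name and input format are hypothetical.

```python
from collections import defaultdict


def true_supernodes(pairs, src_fraction, dst_fraction):
    """Return (supersources, superdestinations) from an iterable of (source, destination) pairs."""
    dsts_of = defaultdict(set)  # source -> distinct destinations
    srcs_of = defaultdict(set)  # destination -> distinct sources
    for s, d in pairs:
        dsts_of[s].add(d)
        srcs_of[d].add(s)

    sc = {s: len(v) for s, v in dsts_of.items()}  # source cardinalities
    dc = {d: len(v) for d, v in srcs_of.items()}  # destination cardinalities

    # Both sums equal the number of distinct (source, destination) pairs.
    src_threshold = src_fraction * sum(sc.values())
    dst_threshold = dst_fraction * sum(dc.values())

    supersources = {s for s, c in sc.items() if c > src_threshold}
    superdestinations = {d for d, c in dc.items() if c > dst_threshold}
    return supersources, superdestinations
```

Exact counting of this kind needs memory proportional to the number of distinct pairs, which is why it serves only as an offline reference for evaluating the compact summaries.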
5.2. Influence of Parameters on Estimation Accuracy
Our algorithm has three parameters, H, m_{i}, and n_{i}, where H denotes the number of hash functions and n_{i} and m_{i} denote the number of rows and columns in the two-dimensional bit arrays B_{i} (1 ≤ i ≤ H). The updating, estimating, and reconstructing processes all use the hash functions, so the parameter H affects the processing time. H varies from 2 to 6. Both n_{i} and m_{i} determine the size of the memory consumed by our algorithm. According to the coupon collector problem, the source and destination cardinalities can be accurately estimated when they are less than n_{i} ln n_{i} and m_{i} ln m_{i}. In the experiments, source IP addresses and destination IP addresses are used as sources and destinations. Therefore, n_{i} is a prime from 400 to 800 and m_{i} is a prime from 3000 to 7000.
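The coupon-collector bound is easy to evaluate for candidate sizes. The sketch below computes the leading term b ln b for a vector of b bits, which is roughly the number of distinct items after which every bit is expected to be set. The function name is hypothetical, and the two sizes are illustrative mid-range values from the intervals above.

```python
import math


def fill_capacity(bits: int) -> float:
    """Leading term of the coupon-collector expectation: bits * ln(bits)."""
    return bits * math.log(bits)


for size in (600, 5000):
    print(size, round(fill_capacity(size)))
```

Cardinalities well below this value leave enough zero bits for the probabilistic counting estimate to remain informative.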
We evaluate the accuracy of cardinality estimation by the average relative error (ARE for short), which is the mean, over all hosts, of the absolute difference between the true and the estimated cardinality of a host divided by its true cardinality. ARE is expressed as follows:
ARE = \frac{1}{n} \sum_{i=1}^{n} \frac{|c_{i} − \hat{c}_{i}|}{c_{i}},
where c_{i} denotes the true cardinality of host i, \hat{c}_{i} denotes the estimated cardinality of host i, and n indicates the number of distinct hosts. The smaller the ARE is, the more accurate the estimated cardinality is.
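As a small worked illustration of the metric, the function below computes the ARE from two dictionaries keyed by host. The function name and the choice to treat a host with no estimate as having an estimated cardinality of 0 are assumptions of this sketch.

```python
def average_relative_error(true_card, est_card):
    """Mean of |true - estimated| / true over all hosts in `true_card`."""
    if not true_card:
        raise ValueError("at least one host is required")
    total = 0.0
    for host, c in true_card.items():
        # Assumption: a host missing from the estimates counts as estimate 0.
        total += abs(c - est_card.get(host, 0)) / c
    return total / len(true_card)
```

For two hosts with true cardinalities 10 and 20 and estimates 12 and 15, the relative errors are 0.2 and 0.25, giving an ARE of 0.225.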
Figure 7 shows the influence of parameter H on the estimation accuracy under different traffic traces. As shown in Figure 7(a), the ARE of our algorithm decreases as H increases. Figures 7(b) and 7(c) show similar results. Since distinct sources and destinations are mapped to the two-dimensional bit arrays by several hash functions, the effect of hash collisions is significantly reduced. In addition, the estimation error caused by hash collisions is reduced because the minimum of the estimates from the two-dimensional bit arrays is used as the cardinality estimation.
Figure 8 shows the influence of parameter n_{i} on the estimation accuracy under different traffic traces. As shown in Figure 8(a), the ARE decreases markedly as n_{i} varies from 400 to 600 and decreases slowly as n_{i} varies from 600 to 800. Figures 8(b) and 8(c) show similar results. When the other parameters are fixed, the memory consumed increases as n_{i} increases. Therefore, the probability that distinct sources are mapped to the same bits in the two-dimensional bit arrays is reduced, which improves the estimation accuracy.
Figure 9 shows the influence of parameter m_{i} on the estimation accuracy under different traffic traces. As shown in Figure 9(a), the ARE decreases markedly as m_{i} varies from 3000 to 5000 and decreases slowly as m_{i} varies from 5000 to 7000 using data1. Figures 9(b) and 9(c) show similar results. Since the memory consumed increases with m_{i}, the probability of hash collisions is reduced, and the cardinality estimation accuracy improves as m_{i} increases.
Based on the above analysis, we select H = 4, n_{i} = 600, and m_{i} = 5000 in the experiments.
5.3. Estimation Accuracy
Figure 10 shows the cardinality estimation accuracy under the three traffic traces for CSE, DCDS, SS, and our method, called RSD. Figure 10(a) shows that RSD has the minimum ARE of source cardinality estimation for each traffic trace. Similarly, RSD has the minimum ARE of destination cardinality estimation for each traffic trace (Figure 10(b)). CSE, DCDS, SS, and RSD all estimate the cardinality of hosts approximately using probabilistic methods, and their estimation accuracy depends on the memory utilization. Unlike CSE, DCDS, and SS, RSD estimates the cardinalities of sources and destinations simultaneously from the same summary data structure generated by the controller. However, a deviation between the estimated cardinality and the true value still exists due to hash collisions.
5.4. Detection Precision
We evaluate the performance of our algorithm, called RSD, compared with CSE, DCDS, and SS under different traffic traces. Because of false combinations of abnormal rows or columns, the reconstructed sources or destinations may not be true supersources or superdestinations. Therefore, we use the false positive rate and the false negative rate to evaluate the detection precision of the four algorithms. The false positive rate (FPR for short) is the number of incorrectly identified supernodes divided by the number of identified supernodes. The false negative rate (FNR for short) is the number of true supernodes that are not identified divided by the number of true supernodes. The FPR and FNR are expressed as
FPR = \frac{|B \setminus A|}{|B|}, FNR = \frac{|A \setminus B|}{|A|},
where A is the set of true supernodes and B is the set of identified supernodes.
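Both rates follow directly from set differences. The sketch below is a minimal illustration with A as the true supernodes and B as the identified supernodes. The function name and the convention of returning 0.0 when a denominator set is empty are assumptions of this sketch.

```python
def detection_rates(true_supernodes, identified_supernodes):
    """Return (FPR, FNR) for a set of true supernodes and a set of identified supernodes."""
    a, b = set(true_supernodes), set(identified_supernodes)
    # FPR: identified nodes that are not true supernodes, over all identified nodes.
    fpr = len(b - a) / len(b) if b else 0.0
    # FNR: true supernodes that were missed, over all true supernodes.
    fnr = len(a - b) / len(a) if a else 0.0
    return fpr, fnr
```

With A = {1, 2, 3, 4} and B = {3, 4, 5}, one of the three identified nodes is wrong and two of the four true supernodes are missed, so the FPR is 1/3 and the FNR is 0.5.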
Figure 11 shows the detection precision of the four algorithms under data1. The FPR and FNR of detecting supersources are shown in Figures 11(a) and 11(b). RSD has the lowest FPR and FNR for supersources in comparison with DCDS, CSE, and SS. The FPR changes from 0.04 to 0.15 and the FNR varies from 0.06 to 0.25 as the threshold θ_{1} increases. The FPR and FNR of detecting superdestinations are shown in Figures 11(c) and 11(d). RSD also has the lowest FPR and FNR for superdestinations compared to DCDS, CSE, and SS. The FPR changes from 0.05 to 0.18 and the FNR varies from 0.08 to 0.24 as the threshold θ_{2} increases. Some false positives and false negatives arise because supersources and superdestinations are reconstructed from false combinations of abnormal rows and columns in the two-dimensional bit arrays; RSD mitigates such wrong reconstruction with an extra hash table that guides the combination of abnormal rows and columns. In general, RSD can simultaneously identify supersources and superdestinations based on their accurate cardinality estimation, and it outperforms DCDS, CSE, and SS in terms of FPR and FNR. Figure 12 shows the detection precision of the four algorithms under data2. As shown in Figures 12(a)–12(d), we obtain similar results.
5.5. Memory Cost
Figure 13 shows the memory cost of the four algorithms under different traffic traces. The memory cost of our algorithm, called RSD, includes the summary data structure, arrays, and hash table. The memory cost increases slightly when the extra bit arrays and hash table are created in the updating and reconstruction processes. For simplicity, we treat the size of the summary data structure as the memory cost. CSE, DCDS, and SS need to build their data structure twice to measure supersources and superdestinations simultaneously, which leads to a high memory cost. In addition, DCDS uses additional two-dimensional bit arrays to reduce the FPR caused by wrong combinations of abnormal rows and columns. CSE can only measure the cardinalities of supersources and superdestinations by constructing multiple virtual one-dimensional bit arrays and cannot reconstruct supersources and superdestinations. SS measures the cardinalities of candidate supersources and superdestinations by building two-dimensional sketches, each bucket of which consists of a multiresolution bitmap, a label field, and a register. In contrast, RSD reduces the memory cost of supersource and superdestination detection by constructing row and column hash functions. In brief, the memory cost of RSD is lower than that of CSE, DCDS, and SS.
6. Conclusion
In this paper, we propose a novel method for detecting supernodes in distributed monitoring systems. It constructs two-dimensional reversible summary data structures to collect information associated with cardinalities according to the specific application requirements. At the end of each measurement period, the generated summary data structures are aggregated to produce a summary data structure of the same size. On the basis of the aggregated summary data structure, it estimates node cardinalities and reconstructs supersources and superdestinations. Compared to other algorithms, the proposed method can simultaneously measure two types of node cardinalities using the same summary data structures, detect supersources and superdestinations efficiently, and reconstruct the labels of supersources and superdestinations with small computational complexity. We perform theoretical analysis and conduct extensive experiments on real traffic traces. The experimental results show that our method achieves good estimation accuracy and detection precision at a low memory cost. In the future, we will deploy our method in practical distributed monitoring systems, study the superchanger detection problem, and study variants of supernode detection.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China, under Grant no. 61802274, Open Project Foundation of Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China, under Grant no. K939201701, and Scientific Research Foundation for Advanced Talents of Taizhou University, China, under Grant no. QD2016027.