The Scientific World Journal

Volume 2013, Article ID 960348, 10 pages

http://dx.doi.org/10.1155/2013/960348

## Knee Point Search Using Cascading Top-*k* Sorting with Minimized Time Complexity

^{1}Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China

^{2}China Organizational Name Administration Center, Beijing 100028, China

^{3}Department of Information Science and Applications, Asia University, Taichung 41354, Taiwan

Received 20 May 2013; Accepted 20 July 2013

Academic Editors: Z. Cai and Y. Deng

Copyright © 2013 Zheng Wang and Shian-Shyong Tseng. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Anomaly detection systems and many other applications frequently face the problem of finding the largest knee point in the sorted curve of a set of unsorted points. This paper proposes an efficient knee point search algorithm with minimized time complexity, using cascading top-*k* sorting, for the case in which the a priori probability distribution of the knee point is known. First, a top-*k* sort algorithm is proposed based on a quicksort variation. We then divide the knee point search problem into multiple steps. In each step, an optimization problem for the selection number *k* is solved, where the objective function is defined as the expected time cost. Because the expected time cost in one step depends on that of the subsequent steps, we simplify the optimization problem by minimizing the maximum expected time cost. The posterior probability distribution of the largest knee point and the other parameters are updated before solving the optimization problem in each step. An example of source detection of DNS DoS flooding attacks is provided to illustrate an application of the proposed algorithm.

#### 1. Introduction

Anomaly detection systems and many other applications often rely on finding the largest knee point in the sorted curve to perform clustering, classification, anomaly identification, and so forth [1–6]. The largest knee point is targeted because the interest lies in finding the cluster of the largest points, whose values differ significantly from those of their lower neighbors in the sorted curve.

A knee point is defined as a point whose value is close to that of its upper neighbor but far from that of its lower neighbor in the sorted curve; it is therefore taken as the boundary of the cluster of upper points. An unsorted list must be sorted before the knee point can be searched for. For reasons of time and space efficiency, completely sorting the list and then searching it is often not the optimal method. An alternative is to search a partially sorted list, namely, the top-*k* list, in order to save sorting cost. The top-*k* sort algorithm is therefore introduced in this paper to help minimize the time complexity of the knee point search. There have been many efforts to bound and evaluate the time and space complexity of sort algorithms [7–12]. These works provide component algorithms for our work, but none of them addresses the problem of knee point search via top-*k* sorting. We present in this paper a knee point search algorithm using top-*k* sorting with minimized time complexity.

This paper is organized as follows: some basic concepts and definitions on knee point search and top-*k* sorting are presented in Section 2; Section 3 will design a knee point search algorithm, including basic idea, top-*k* sort algorithm, time complexity, parameter updating, cascading top-*k* sorting with minimized time complexity, the knee point search algorithm, and the solution of the optimization problem; Section 4 will introduce source detection of DNS DoS flooding attacks as an application example of the proposed algorithm; Section 5 will conclude this paper.

#### 2. Knee Point Search and Selection Sort

Assume there are $n$ points whose values are $x_1, x_2, \ldots, x_n$. Let the points be sorted by their values in descending order, $x_{(1)} \ge x_{(2)} \ge \cdots \ge x_{(n)}$, and let their differential values be $d_1, d_2, \ldots, d_{n-1}$, where

$$d_i = x_{(i)} - x_{(i+1)}, \quad 1 \le i \le n-1. \tag{1}$$

As illustrated in Figure 1, there is usually a notable gap in value between the points on the upper left side and those on the lower right side of the sorted curve. We define a knee point as one whose neighboring differential values differ significantly.

*Definition 1.* Point $x_{(i)}$, $2 \le i \le n-1$, is a knee point if it satisfies

$$d_i > \theta\, d_{i-1}, \tag{2}$$

where $\theta$ is the threshold, whose value ranges from 10 to 50 in the practice of anomaly detection.
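As a minimal sketch of Definition 1, assume the knee condition takes a ratio form: the gap to the lower neighbor exceeds the threshold times the gap to the upper neighbor. For a list that is already sorted in descending order, the largest knee point can then be located by one scan from the top. The function name `largest_knee_index` and the 0-based indexing are illustrative choices, not part of the original algorithm.

```python
def largest_knee_index(sorted_desc, threshold):
    """Return the 0-based index of the largest knee point, or None.

    sorted_desc: values sorted in descending order.
    threshold:   ratio threshold (assumed form of the knee condition).
    """
    # The first and last points lack one neighbor, so they are skipped.
    for i in range(1, len(sorted_desc) - 1):
        upper_gap = sorted_desc[i - 1] - sorted_desc[i]
        lower_gap = sorted_desc[i] - sorted_desc[i + 1]
        # Written as a product so that upper_gap == 0 needs no special case.
        if lower_gap > threshold * upper_gap:
            return i  # scanning from the top, the first hit is the largest knee
    return None


rates = [1000, 990, 985, 20, 18, 15]
print(largest_knee_index(rates, threshold=10))
```

In this example the three largest values form the upper cluster, so the scan stops at the value 985.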

Note that there may be more than one knee point for a list, and the goal of the algorithm is to find the largest one in the sorted curve. For an unsorted list, we should first partially sort the list to find the sorted top-*k* list and then search the sorted top-*k* list for the largest knee point.

*Definition 2.* A top-*k* sort problem of selection number *k* for an unsorted list *L* is the problem of finding the *k* largest elements of *L*, sorted in descending order.

Total sorting is often not optimal for this problem, because the knee point search can succeed on a partially sorted top-*k* list as long as that list contains the largest knee point. It is therefore preferable, in each step, to sort selectively with the top-*k* sort algorithm first and then search. Because the search may fail in earlier steps, the procedure may go through many recursive steps before the largest knee point is found. In each step there is a tradeoff between the time cost and the expected hit probability of the knee point search; both are determined by the selection number, and both contribute to the expected overall time cost. In this paper, the optimization problem of the selection number is solved by minimizing the maximum expected time cost.

#### 3. The Knee Point Search Algorithm

##### 3.1. Basic Idea

The knee point search algorithm is based on cascading top-*k* sorting. In each step, top-*k* sorting segments the list that remains to be searched. The optimal selection number is determined by minimizing the expected time cost on the remaining list. If the search finds the knee point in the sorted top-*k* list, the algorithm ends there. Otherwise, the residual list excluding the top-*k* elements requires further checking. It becomes the objective list for the next step, and the expected time cost function for top-*k* sorting is updated using the knowledge that the search failed in the previous step. A new top-*k* sort problem thus arises for the next step. The algorithm runs recursively in this way until the knee point is found. There are two cases that bring the algorithm to an end.

(1) The knee point is found in the sorted top-*k* list in some step.

(2) The optimal selection number in some step equals the length of the objective list. This means the optimal option is total sorting, so the knee point is certain to be found in the completely sorted list.

##### 3.2. Top-*k* Sort Algorithm

We design a quicksort variation as the top-*k* sort algorithm. Quicksort is a very efficient sort algorithm invented by Hoare [7]. Quicksort has two phases: the partition phase and the sort phase, which makes it a good example of the divide and conquer strategy for solving problems.

Top-*k* sorting only needs to treat the largest *k* elements, so it can exploit the divide and conquer strategy. The intermediate results of quicksort, namely, the pivot positions, can be used to discard one of the two subproblems produced by a partition. Specifically, if the pivot falls after position *k*, the algorithm recurses only into the part holding the larger elements, because there is no need to sort the other part, which consists only of elements smaller than the top-*k*. This is the major distinction from the original quicksort algorithm and is the source of the gain in sorting efficiency.

At the same time, the pivots located after position $k_i$ (the optimal selection number) at step $i$ in Section 3.6 are potentially useful for the subsequent steps, although they do not help the top-$k_i$ sorting itself. Therefore we record those pivots in a stack, a last in, first out (LIFO) data structure. Because the recursive partitions whose pivots fall after position $k_i$ produce those pivots in descending order of position, pushing them as they are found yields a stack of pivots ordered by position. At the subsequent step $i+1$, if the optimal selection number $k_{i+1}$ is larger than the position of the pivot at the top of the stack, that pivot is popped from the stack and used as an inner pivot of the top-$k_{i+1}$ sorting. Since this pivot is no longer needed for later steps, it is not kept in the stack. When the stack is empty, or the pivot at the top of the stack (and hence every other pivot in it) is located after the selection number, the partition has to find a pivot by itself without the help of the pivot stack.

The top-*k* sort algorithm, namely, QuickSortTopK, can be expressed as in Algorithm 1.

The input of QuickSortTopK is the objective unsorted list $L$ indexed from FirstIndex to LastIndex. For all steps, LastIndex is fixed at $n$, whereas FirstIndex is progressively increased to exclude the part of $L$ already sorted in previous steps. The output includes the sorted top-*k* elements of $L$ indexed from FirstIndex to LastIndex and the stack containing all pivots falling after position *k* obtained in all previous steps.

The termination condition of the recursion is checked in Line 1 of the algorithm. If the stack is nonempty and the top element of the stack falls into the objective range (see Line 2), the top element is used as the pivot for the partition (see Line 3). Otherwise, the pivot is obtained by a partition (see Line 6). Once the pivot is available, different recursive steps are taken depending on its position. If the pivot falls after position *k*, it is pushed onto the stack for the benefit of subsequent steps, and top-*k* sorting continues on the part of the list before the pivot (see Lines 9, 11). If the pivot is located exactly at position *k*, the pivot itself is the last element of the output, and only top-(*k*−1) sorting on the part of the list before the pivot is needed (see Line 15). If the pivot is located before position *k*, both the upper and lower parts must be treated. The action on the upper part is equivalent to Quicksort, while the action on the lower part is a recursive run of QuickSortTopK with a diminished selection number *k* and a shrunken objective list (see Lines 18, 20).
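The three pivot cases described above can be sketched in Python as follows. This is a simplified illustration rather than a transcription of Algorithm 1: it omits the pivot stack, uses 0-based indices, and assumes a random pivot with a Lomuto-style partition; the names `partition_desc` and `quicksort_top_k` are hypothetical.

```python
import random


def partition_desc(a, lo, hi):
    """Partition a[lo..hi] around a random pivot, larger elements first.

    Returns the final index of the pivot.
    """
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot = a[hi]
    store = lo
    for j in range(lo, hi):
        if a[j] > pivot:
            a[j], a[store] = a[store], a[j]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store


def quicksort_top_k(a, lo, hi, k):
    """Rearrange a[lo..hi] in place so that its first k slots hold the
    k largest elements in descending order."""
    if k <= 0 or lo >= hi:
        return
    p = partition_desc(a, lo, hi)
    rank = p - lo + 1  # 1-based position of the pivot within a[lo..hi]
    # Upper part (larger elements): at most k of them are needed. When the
    # pivot lies at or before position k, this call sorts the part fully.
    quicksort_top_k(a, lo, p - 1, k)
    # Lower part is touched only if the pivot lies before position k.
    if rank < k:
        quicksort_top_k(a, p + 1, hi, k - rank)


data = [15, 1000, 18, 985, 20, 990, 17, 16]
quicksort_top_k(data, 0, len(data) - 1, 3)
print(data[:3])
```

Only the first three slots are guaranteed to be ordered after the call; the remaining elements stay unsorted, which is where the saving over total sorting comes from.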

##### 3.3. Time Complexity

Let total sorting of a list of length $n$ require $T_s(n)$ time. Measured by the number of comparisons, the average time complexity $T_s(n)$ is $O(n \log n)$ for efficient algorithms such as Quicksort [7].

Let top-*k* sorting of selection number *k* require $T(n, k)$ time, where $n$ is the length of the list. The QuickSortTopK algorithm requires an expected time of $O(n + k \log k)$. So $T(n, n)$ equals $T_s(n)$.

Let the time complexity of finding the knee point in the sorted list of length $k$ be $S(k)$. Recalling (2), $S(k)$ takes $O(k)$.

##### 3.4. Parameter Updating

For a list of length $n$, the algorithm divides the overall procedure into $m$ steps by a sequence of selection numbers, $k_i$, $1 \le i \le m$, where $k_i$ is the position in the sorted list up to which the list has been sorted after step $i$. Additionally, to facilitate the formulation, we let $k_0 = 0$ and $k_m = n$. Let the length of the objective list in step $i$, $1 \le i \le m$, be $n_i$; we have

$$n_i = n - k_{i-1}. \tag{3}$$

In the first step, $n_1 = n$. Let top-$s_i$ sorting for the objective list of length $n_i$ be performed in step $i$, $1 \le i \le m$, and we have

$$s_i = k_i - k_{i-1}. \tag{4}$$

Particularly, $s_m = n_m$ in step $m$, and thus top-$s_m$ sorting for the objective list of length $n_m$ is actually total sorting of the objective list. If the search for the knee point is successful in step $i$ on the sorted top-$s_i$ list, the algorithm ends at step $i$. Otherwise, the algorithm continues with the next step. The algorithm lasts until step $m$ if the search misses in all of the previous steps. Since step $m$ takes no further selection from the objective list, the algorithm finishes there.

Let $X$ be the position variable of the knee point and $1 \le X \le n$. Let $p(j)$ represent the probability that $X = j$, $1 \le j \le n$, and thus $\sum_{j=1}^{n} p(j) = 1$. The value of $p(j)$ is assumed to be known at the beginning of the algorithm. Let $X_i$ be the position variable of the knee point in step $i$ and $k_{i-1} < X_i \le n$. Let $p_i(j)$ represent the probability that $X_i = j$, $1 \le i \le m$, $k_{i-1} < j \le n$, and thus $\sum_{j=k_{i-1}+1}^{n} p_i(j) = 1$.

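Lemma 3. *The probability distribution of the knee point in step $i$ can be written as*

$$p_i(j) = \frac{p(j)}{\sum_{l=k_{i-1}+1}^{n} p(l)}, \quad k_{i-1} < j \le n. \tag{5}$$

*Proof.* At the first step, all knowledge about the probability distribution of the knee point is given by $p(j)$. The search in the subsequent steps, however, should use the posterior distribution of the knee point, because the knee point has been confirmed not to lie at or before the selection numbers of the previous steps; for example, when the algorithm comes to step $i$, the knee point has already been checked to be absent from the top-$k_{i-1}$ list, $i \ge 2$. Therefore $p(j)$ should be updated in step $i$ as

$$p_i(j) = P(X = j \mid X > k_{i-1}) = \frac{p(j)}{\sum_{l=k_{i-1}+1}^{n} p(l)}. \tag{6}$$

Let the hit probability of the search in step $i$ be $h_i$.

Lemma 4. *The hit probability of the search in step $i$ can be written as*

$$h_i = \frac{\sum_{j=k_{i-1}+1}^{k_i} p(j)}{\sum_{j=k_{i-1}+1}^{n} p(j)}. \tag{7}$$

*Proof.* For the selection number $k_i$ in step $i$, the search for the knee point is successful if and only if the knee point falls into the interval of positions $(k_{i-1}, k_i]$. According to Lemma 3, the probability distribution of the knee point at step $i$ is updated as (5). Therefore we have

$$h_i = \sum_{j=k_{i-1}+1}^{k_i} p_i(j) = \frac{\sum_{j=k_{i-1}+1}^{k_i} p(j)}{\sum_{j=k_{i-1}+1}^{n} p(j)}. \tag{8}$$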
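The updates in Lemmas 3 and 4 can be sketched in a few lines of Python. The sketch assumes the distribution is stored as a list in which `p[j - 1]` is the probability of position $j$ within the current (residual) list, and that `k` counts positions within that list; the function names are illustrative.

```python
def hit_probability(p, k):
    """Probability that the knee point lies in the first k positions.

    p is assumed to be normalized over the current list.
    """
    return sum(p[:k])


def update_posterior(p, k):
    """Distribution over the remaining positions after a failed search
    of the first k positions (conditioning on the knee lying beyond k)."""
    tail = p[k:]
    total = sum(tail)
    if total <= 0:
        raise ValueError("no probability mass left beyond position k")
    return [q / total for q in tail]


prior = [0.4, 0.3, 0.2, 0.1]
print(hit_probability(prior, 2))
print(update_posterior(prior, 2))
```

After a miss on the first two positions, the remaining mass 0.2 and 0.1 is rescaled so that it sums to one again, which is exactly the conditioning step in the proof of Lemma 3.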

##### 3.5. Cascading Top-*k* Sorting with Minimized Time Complexity

Lemma 5. *Let the expected overall computational time cost in step $i$ be $E_i$; $E_i$ yields*

$$E_i = T(n - k_{i-1},\, k_i - k_{i-1}) + (1 - h_i)\, E_{i+1}. \tag{9}$$

*Proof.* When the search succeeds in step $i$, $E_i$ comes only from the top-$s_i$ sorting, which requires $T(n_i, s_i)$ time, and the search in the sorted top-$s_i$ list, which requires $S(s_i)$. However, recalling Section 3.3, $S(s_i)$ is negligible compared to $T(n_i, s_i)$, so their sum can be approximated by $T(n_i, s_i)$. If the search fails in step $i$, the residual list of length $n_{i+1}$ has to be further checked. Thus the overall computation time cost consists of the top-$s_i$ sort in step $i$ and the sort and search in the remaining list. Note that the computation time cost of sort and search in the remaining list is no other than $E_{i+1}$. Hence $E_i$ yields

$$E_i = h_i\, T(n_i, s_i) + (1 - h_i)\,[\,T(n_i, s_i) + E_{i+1}\,] = T(n_i, s_i) + (1 - h_i)\, E_{i+1}. \tag{10}$$

Plugging (3) and (4) into (10), we get (9).

Lemma 5 tells us that the expected computational cost $E_i$ can be calculated iteratively following (9), until reaching $E_m$, where there is no selection for step $m$. Thus $E_m$ only consists of the time cost of total sorting of the list of length $n_m$ and the search in it. As total sorting takes $T_s(n_m)$ and search takes $S(n_m)$, which is negligible by comparison, we have

$$E_m = T_s(n_m) + S(n_m) \approx T_s(n - k_{m-1}). \tag{11}$$

Let $[a, b]$ ($a \le b$, and $a$, $b$ are integers) denote the set of integers $\{a, a+1, \ldots, b\}$. In every step $i$, the algorithm calculates the current probability distribution of the knee point, $p_i(j)$, which determines the hit probability $h_i$, and chooses $k_i$ to be the solution of the following optimization problem:

$$k_i = \arg\min_{k_i \in [k_{i-1}+1,\, n]} E_i. \tag{12}$$

We see in (12) that, for any fixed $k_i$, the minimum of $E_i$ is determined by $E_{i+1}$ under the optimal selection of $k_{i+1}$ in the next step, $i+1$. The optimal $k_{i+1}$ for any fixed $k_i$ is in turn determined by $E_{i+2}$ under the optimal selection of $k_{i+2}$, and so on. This iterative dependency extends to the last step, which has no further selection. So $E_i$ is a function of $k_j$, $i \le j \le m$. Because the selection can vary in every one of the subsequent steps, the search space of the optimization is huge, especially for the initial steps. It is therefore not practical to evaluate all possible selections in all subsequent steps when solving the optimization problem in step $i$, and it is necessary to restrict the variable of the objective function in (12) to $k_i$ alone.

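Theorem 6. *The upper bound of the minimum of $E_i$ for a fixed $k_i$ can be obtained by substituting $E_{i+1}$ in (9) by $T_s(n - k_i)$, such that*

$$\min E_i \le T(n - k_{i-1},\, k_i - k_{i-1}) + (1 - h_i)\, T_s(n - k_i). \tag{13}$$

*Proof.* We only need to prove that

$$\min E_{i+1} \le T_s(n - k_i). \tag{14}$$

As total sort of the list of length $n_{i+1}$ can be viewed as performing top-$n_{i+1}$ sorting, their time costs are both $T_s(n_{i+1})$. This is equivalent to the case when $k_{i+1} = n$. Because the algorithm ends with the total sorting and there are no subsequent steps, $E_j$ is not defined for $j > i+1$. Thus we have

$$E_{i+1}\big|_{k_{i+1} = n} = T_s(n_{i+1}). \tag{15}$$

Hence

$$\min E_{i+1} \le E_{i+1}\big|_{k_{i+1} = n}. \tag{16}$$

Plugging (3) into (15) and then (15) into (16), we have (14), and thereby (13) is proved.

Theorem 6 shows that, for a fixed $k_i$, $E_{i+1}$ is bounded by the time cost of total sort of the residual list of length $n - k_i$ plus that of the search in it. The optimization problem in step $i$ described by (13) can thus be isolated from all possible selections in the subsequent steps and becomes a function of $k_i$ alone. This minimax technique simplifies the analysis, such that the optimization problem can be written as

$$k_i = \arg\min_{k_i \in [k_{i-1}+1,\, n]} \Big\{ T(n - k_{i-1},\, k_i - k_{i-1}) + \Big(1 - \sum_{j=k_{i-1}+1}^{k_i} p_i(j)\Big)\, T_s(n - k_i) \Big\}. \tag{17}$$

Plugging (5) into (17), we have

$$k_i = \arg\min_{k_i \in [k_{i-1}+1,\, n]} \Big\{ T(n - k_{i-1},\, k_i - k_{i-1}) + \frac{\sum_{j=k_i+1}^{n} p(j)}{\sum_{j=k_{i-1}+1}^{n} p(j)}\, T_s(n - k_i) \Big\}. \tag{18}$$

Solving (18), we can obtain the optimal $k_i$ in step $i$.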
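A brute-force way to solve the simplified problem is to evaluate the bound for every candidate selection number. The sketch below does this for one step, working on the residual list with its current distribution `p`. The cost models are assumptions made only for illustration: total sorting is taken as `n * log2(n)` and top-*k* sorting as `n + k * log2(k)` with unit constants, and the search cost is neglected. The function names are hypothetical.

```python
import math


def total_sort_cost(n):
    """Assumed cost model for sorting n elements completely."""
    return n * math.log2(n) if n > 1 else 0.0


def top_k_cost(n, k):
    """Assumed cost model for top-k sorting a list of n elements."""
    return n + (k * math.log2(k) if k > 1 else 0.0)


def choose_selection_number(p):
    """Return (k, bound) minimizing the upper bound on the expected cost.

    p[j - 1] is the probability that the knee point is at position j of
    the current list. The bound charges a total sort of the residual
    list whenever the search of the top-k list misses.
    """
    n = len(p)
    best_k, best_cost = n, math.inf
    hit = 0.0
    for k in range(1, n + 1):
        hit += p[k - 1]                      # hit probability for this k
        miss = max(0.0, 1.0 - hit)           # guard against rounding
        cost = top_k_cost(n, k) + miss * total_sort_cost(n - k)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


uniform = [1.0 / 1000] * 1000
print(choose_selection_number(uniform))
```

The scan costs time linear in the list length per step, which is small next to the sorting cost it is trying to reduce. With different constants in the cost models the minimizer shifts, so the printed value should be read as a property of these assumed models only.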

##### 3.6. The Knee Point Search Algorithm

The knee point search algorithm runs iteratively using cascading top-*k* sorting. When the optimal selection number $k_i$ is determined at step $i$, top-$(k_i - k_{i-1})$ sorting can be done via running QuickSortTopK($L$, $k_{i-1}+1$, $n$, $k_i - k_{i-1}$). Specifically, the first step starts with QuickSortTopK($L$, 1, $n$, $k_1$).

The procedure can be described as follows.

*Step 1.* According to (5), the probability distribution of the knee point for the optimization problem is initialized as follows:

$$p_1(j) = p(j), \quad 1 \le j \le n. \tag{19}$$

The optimal selection number $k_1$ is obtained by solving the following optimization problem as in (18):

$$k_1 = \arg\min_{k_1 \in [1,\, n]} \Big\{ T(n, k_1) + \Big(\sum_{j=k_1+1}^{n} p(j)\Big)\, T_s(n - k_1) \Big\}. \tag{20}$$

Perform top-$k_1$ sorting on the list of length $n$. Search for the knee point on the sorted top-$k_1$ list. If successful or $k_1 = n$, the algorithm ends. Otherwise, go to Step 2.

*Step 2.* The probability distribution of the knee point for the optimization problem is updated as follows:

$$p_2(j) = \frac{p(j)}{\sum_{l=k_1+1}^{n} p(l)}, \quad k_1 < j \le n. \tag{21}$$

$k_2$ is derived as the solution of the following optimization problem:

$$k_2 = \arg\min_{k_2 \in [k_1+1,\, n]} \Big\{ T(n - k_1,\, k_2 - k_1) + \Big(\sum_{j=k_2+1}^{n} p_2(j)\Big)\, T_s(n - k_2) \Big\}. \tag{22}$$

Note that $k_1$ is inherited from Step 1. Perform top-$(k_2 - k_1)$ selection sort on the list of length $n - k_1$ that remains after removing the top-$k_1$ elements sorted in Step 1. Search for the knee point on the sorted top-$(k_2 - k_1)$ list. If successful or $k_2 = n$, the algorithm ends. Otherwise, go to the next step.

*Step i.*

(1) The probability distribution of the knee point for the optimization problem is updated according to (5).

(2) $k_i$ is obtained as the solution of the optimization problem in (18), where $k_{i-1}$ is known from step $i-1$.

(3) Perform top-$(k_i - k_{i-1})$ selection sort on the list of length $n - k_{i-1}$ that remains after removing the top-$k_{i-1}$ elements already sorted in the previous steps.

(4) Search for the knee point on the sorted top-$(k_i - k_{i-1})$ list. If successful or $k_i = n$, the algorithm ends. Otherwise, go to step $i+1$.

The knee point search algorithm can be summarized as in Algorithm 2.
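The iterative procedure can be sketched in Python as below. This is an illustration under several assumptions, not a transcription of Algorithm 2: `heapq.nlargest` stands in for QuickSortTopK, the selection rule is passed in as a callable `choose_k` that works on the residual list, the knee condition is taken in the ratio form (lower gap greater than threshold times upper gap), and the largest remaining value is used as the lower neighbor of the last sorted element so that a knee at the boundary is not missed.

```python
import heapq


def cascading_knee_search(values, prior, threshold, choose_k):
    """Return the 0-based rank of the largest knee point, or None.

    values:   unsorted list of numbers.
    prior:    prior[j] = probability that the knee is at rank j (0-based).
    choose_k: callable mapping the current distribution over the residual
              list to a selection number for that list.
    """
    if len(prior) != len(values):
        raise ValueError("prior must have one entry per value")
    rest = list(values)   # residual unsorted list
    p = list(prior)       # distribution over positions of the residual list
    prefix = []           # elements sorted so far, in descending order
    checked = 1           # rank 0 has no upper neighbor
    while rest:
        k = min(max(1, choose_k(p)), len(rest))
        # Indices of the k largest residual elements, largest first.
        order = heapq.nlargest(k, range(len(rest)), key=rest.__getitem__)
        prefix.extend(rest[i] for i in order)
        taken = set(order)
        rest = [v for i, v in enumerate(rest) if i not in taken]
        boundary = max(rest) if rest else None
        for i in range(checked, len(prefix)):
            lower = prefix[i + 1] if i + 1 < len(prefix) else boundary
            if lower is None:
                break  # the smallest element has no lower neighbor
            if prefix[i] - lower > threshold * (prefix[i - 1] - prefix[i]):
                return i
        checked = len(prefix)
        # Posterior update after a miss: drop the first k positions.
        tail = p[k:]
        total = sum(tail)
        if total > 0:
            p = [q / total for q in tail]
        elif tail:
            p = [1.0 / len(tail)] * len(tail)  # fallback when the prior ruled this out
        else:
            p = []
    return None


rates = [15, 1000, 18, 985, 20, 990]
half = lambda dist: max(1, len(dist) // 2)   # halving rule, for illustration
print(cascading_knee_search(rates, [1 / 6] * 6, 10, half))
```

The halving rule passed in the example is only a placeholder; any rule that returns a selection number for the residual list, such as a minimizer of the bound in Section 3.5, can be supplied instead.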

The knee point search algorithm can also be expressed recursively.

For each recursive step, we have an unsorted list $L$ of length $n$ and the probability distribution of the knee point in the sorted list of length $n$: $p(j)$, $1 \le j \le n$. Thus we can modify the optimization problem in (18) as follows:

$$k = \arg\min_{k \in [1,\, n]} \Big\{ T(n, k) + \Big(\sum_{j=k+1}^{n} p(j)\Big)\, T_s(n - k) \Big\}. \tag{23}$$

Solving (23), we can obtain the optimal $k$ in each recursive step.

When the knee point search fails after top-$k$ sorting in a recursive step, the algorithm has to go to the next recursive step. First, we need to update the probability distribution of the knee point as well as the residual unsorted list, which are the two parameters of the recursive function. According to Lemma 3, the update of the probability distribution of the knee point yields

$$p'(j) = \frac{p(j + k)}{\sum_{l=k+1}^{n} p(l)}, \quad 1 \le j \le n - k. \tag{24}$$

In each recursive step, top-$k$ sorting can be done via running QuickSortTopK($L$, 1, length($L$), $k$), where $L$ is the objective list in the current step and length($L$) denotes the length of $L$.

A recursive version of knee point search algorithm can be summarized as in Algorithm 3.

##### 3.7. The Solution of the Optimization Problem

In this section, we will assume two forms of the probability distribution of the knee point and discuss the solutions of (18) under these presumptions.

*(1) Uniform Distribution*. $p(j)$ is equal for all $j$, and thus $p(j) = 1/n$. Plugging $p(j) = 1/n$ into (23), we have

$$k = \arg\min_{k \in [1,\, n]} \Big\{ T(n, k) + \Big(1 - \frac{k}{n}\Big)\, T_s(n - k) \Big\}. \tag{25}$$

Let

$$f(k) = T(n, k) + \Big(1 - \frac{k}{n}\Big)\, T_s(n - k). \tag{26}$$

Although (26) is a discrete function, we still use differentiation to find the extremum, a method that applies only to continuous and differentiable functions. Here we treat the discrete variable $k$ as a continuous one, so that (26) becomes a continuous and differentiable function. This is a reasonable approximation of the problem, which facilitates the analysis. The final solution is obtained by rounding the value of $k$ found for the continuous function.

For simplicity, we let where ,, and are all constants.

By choosing such that , we have the optimal which satisfies

For large and , we have the approximation of (28) as

To obtain an explicit expression for the solution of the nonlinear equation (29), we use a heuristic approach to simplify the problem. We assume that $k$ is proportional to $n$, such that

$$k = \alpha n, \tag{30}$$

where $\alpha$ decides the optimum of $k$.

Thus (30) yields then

Leave out the low order item in both sides of (32), and we get

Plugging (33) into (32), the optimum of yields

We can see that the optimum of that takes the form of a proportion function of exists only when . Particularly, if , the optimum of is , which means that the optimum first selection number is half of the length of the list. As the search algorithm may run recursively if the knee point search misses, the optimal method is the so-called binary search or logarithmic search method.

Theorem 7. *If the probability distribution of the knee point follows the uniform distribution and the optimal selection number takes the form of (30), the optimal method for the knee point search algorithm is the binary search or logarithmic search method.*

*Proof.* We first prove by induction that, in each recursive step of the search algorithm, $p(j)$ is equal for all $j$.

For the first recursive step, $p(j)$ is equal for all $j$ by assumption, which is the base case of the induction. For the second recursive step, it follows from (24) that $p(j)$ is again equal for all $j$.

Suppose that for the $t$th recursive step $p(j)$ is equal for all $j$. Then, following (24), $p(j)$ is equal for all $j$ for the $(t+1)$th recursive step as well.

Therefore, by induction, $p(j)$ is equal for all $j$ in each recursive step of the knee point search algorithm. So the optimization problem in each recursive step is no other than (25), whose solution, as discussed above, is a selection number equal to half of the length of the list. Here the list is the one left over from the previous recursive step. Hence the optimal method for the search algorithm is the binary search or logarithmic search method.

*(2) Inverse Proportion Distribution*. $p(j)$ is inversely proportional to $j$, $1 \le j \le n$, and thus $p(j) = c/j$, where $c = 1 \big/ \sum_{j=1}^{n} (1/j)$.

As an approximate treatment of the summation , we consider

Particularly, for and a large , we have

And for a large $n$, we have

$$\sum_{j=1}^{n} \frac{1}{j} \approx \ln n + \gamma, \tag{37}$$

where $\gamma$ is the Euler constant, with an approximate value of 0.5772.
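The quality of the logarithmic approximation to the harmonic sum is easy to check numerically. The short sketch below compares the exact sum with $\ln n + \gamma$ and builds the normalized inverse proportion distribution; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler constant, truncated


def harmonic(n):
    """Exact harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / j for j in range(1, n + 1))


def harmonic_approx(n):
    """Large-n approximation ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA


def inverse_proportion_prior(n):
    """Distribution with p(j) proportional to 1/j for j = 1..n."""
    c = 1.0 / harmonic(n)
    return [c / j for j in range(1, n + 1)]


for n in (10, 100, 1000):
    print(n, harmonic(n), harmonic_approx(n))
```

The gap between the two columns shrinks roughly like $1/(2n)$, so the approximation is already tight for lists of a few hundred points.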

Plugging and then (36) and (37) into (23), we have

Let

Theorem 8. *If the probability distribution of the knee point is inversely proportional to $j$, $1 \le j \le n$, the optimal top-$k$ sorting for the knee point search algorithm is full sorting of the list when the list is long.*

*Proof. *We only need to prove that is a monotony increase function of , and the equivalent condition for it is
that is,

We assume that is an exponential function of , such that

Thus the left and right sides of (41) can be, respectively, written as
Comparing (43) and (44), we obtain (41) for a large . That means the optimal selection number is for the first step, or the optimal top-*k* sorting for the knee point search algorithm is full sorting.

#### 4. Source Detection of DNS DoS Flooding Attacks: An Application Example

##### 4.1. DNS DoS Flooding Attacks

The Domain Name System is a fundamental and indispensable component of the modern Internet [13, 14]. The availability of the DNS can affect the availability of a large number of Internet applications. Ensuring the DNS data availability is an essential part of providing a robust Internet.

In the past few years, some important DNS name servers at the top level of the DNS hierarchy were targeted by DoS or DDoS attackers, and some of these attacks succeeded in disabling the DNS servers, causing parts of the Internet to experience severe name resolution problems [15–18]. In a DNS DoS flooding attack, the attackers send an overwhelming volume of traffic to the DNS name servers in order to disrupt the DNS service for legitimate clients. It is usually not easy to detect and defend against DoS flooding attacks efficiently, because the attack traffic is blended with legitimate traffic, which makes the two hard to distinguish. Moreover, the detection mechanism should be practical to implement and should not add a heavy computational load. Here we focus on the source-based detection method and show that the problem of source detection of DNS DoS flooding attacks can be addressed by the knee point search in the sorted curve discussed in this paper.

##### 4.2. Detection Using the Knee Point Search

Generally, DNS name servers may receive queries from thousands of DNS clients (mostly DNS cache servers), whose traffic volumes are expected to remain far below those of DoS flooding attacks. The real-time query rates of all incoming sources can be counted by the traffic monitoring system residing at the border gateway in front of the DNS name server. The goal of DoS attack defense is to detect the attacking sources in real time and then filter out the traffic from these sources. The detection problem is therefore equivalent to a knee point search in the sorted curve of query rates, where all points above the largest knee point are identified as attacking sources. Time efficiency is also a key requirement, because timely attack detection enables timely defensive action. Applying the knee point search algorithm proposed in this paper minimizes the expected detection time.

##### 4.3. Learning the Knee Point Distribution

The assumption on probability distribution of the knee point is the prerequisite for the knee point search algorithm. However, in the initial rounds of detection we have hardly any a priori knowledge about the knee point. But the distribution estimation of the knee point can be learned based on the empirical data obtained in all previous rounds of detection.

First, suppose that the knee point largely follows stationary random distribution; hence its distribution exhibits almost the same probability model in all rounds of detection. We can fit a statistical model to data and provide estimates for the model's parameters. Here we apply the method of maximum likelihood for the estimation.

Let the count variable of detected knee points so far at position $j$ in the ordered list of length $n$ be $C_j$, $1 \le j \le n$. Let the value of $C_j$ be $c_j$, $1 \le j \le n$. If the number of rounds of detection is $N$, we have

$$\sum_{j=1}^{n} c_j = N. \tag{45}$$

Let the probability vector of the knee point at different positions be . The likelihood function of can be written as where is the density function. To calculate , we have where the last item in (47) is actually not an independent one given that all other than are known due to the constraint in (45), such that Plugging (48) into (47), we get Thus let We obtain the maximum likelihood estimation of , :

At the beginning of each round of detection, if the previous round finds the knee point at position $j^*$, $N$ and $c_j$, $1 \le j \le n$, are updated as follows:

$$N \leftarrow N + 1, \qquad c_{j^*} \leftarrow c_{j^*} + 1. \tag{52}$$

The knee point distribution may evolve over time; thus the positions of the knee point detected in recent rounds provide more reliable information for the estimation than those of earlier rounds. Taking the chronological order into consideration, we assign more weight to recent rounds than to earlier rounds. This can be done by progressively discounting the detection results of previous rounds. The discounting is performed when updating $N$ and $c_j$, $1 \le j \le n$: each update sums the current detection and the previous ones at a discount $\beta$, $0 < \beta < 1$. Formally, the updating of $N$ and $c_j$, $1 \le j \le n$, can be modified as follows:

$$N \leftarrow \beta N + 1, \qquad c_j \leftarrow \begin{cases} \beta c_j + 1, & j = j^*, \\ \beta c_j, & j \ne j^*. \end{cases} \tag{53}$$
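A small Python sketch of this learning scheme is given below. It assumes the maximum likelihood estimate takes the usual relative-frequency form (count at a position divided by the total count) and that discounting multiplies all previous counts by a factor before the new detection is added; the class name `KneeDistributionEstimator` and the uniform fallback before any observation are illustrative assumptions.

```python
class KneeDistributionEstimator:
    """Running estimate of the knee point distribution over n positions."""

    def __init__(self, n, discount=1.0):
        # discount = 1.0 weights all rounds equally; values below 1.0
        # progressively shrink the weight of earlier rounds.
        self.counts = [0.0] * n
        self.total = 0.0
        self.discount = discount

    def observe(self, position):
        """Record that a round found the knee at `position` (1-based)."""
        self.counts = [self.discount * c for c in self.counts]
        self.total = self.discount * self.total + 1.0
        self.counts[position - 1] += 1.0

    def estimate(self):
        """Relative-frequency estimate; uniform before any observation."""
        n = len(self.counts)
        if self.total == 0.0:
            return [1.0 / n] * n
        return [c / self.total for c in self.counts]


est = KneeDistributionEstimator(4, discount=0.5)
est.observe(1)
est.observe(2)
print(est.estimate())
```

With a discount of 0.5 the more recent detection at position 2 carries twice the weight of the earlier one at position 1, while the estimate still sums to one.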

#### 5. Conclusion

Knee point search in the sorted curve is often used in the practice of anomaly detection and many other applications. Due to the inefficiency of total sorting, top-*k* sorting should be adopted for the knee point search. In this paper, a knee point search algorithm using cascading top-*k* sorting is proposed. The expected time complexity is minimized via optimizing the selection number in each step.

#### Acknowledgments

This work was supported in part by the National Key Technology R&D Program of China under Grant no. 2012BAH16B00 and the National Science Foundation of China under Grant no. 61003239.

#### References

1. J. M. Kang, S. Shekhar, C. Wennen, and P. Novak, “Discovering flow anomalies: a SWEET approach,” in *Proceedings of the 8th IEEE International Conference on Data Mining (ICDM '08)*, pp. 851–856, December 2008.
2. H. Hajji, “Statistical analysis of network traffic for adaptive faults detection,” *IEEE Transactions on Neural Networks*, vol. 16, no. 5, pp. 1053–1063, 2005.
3. D. Gao, M. K. Reiter, and D. Song, “Beyond output voting: detecting compromised replicas using HMM-based behavioral distance,” *IEEE Transactions on Dependable and Secure Computing*, vol. 6, no. 2, pp. 96–110, 2009.
4. X. Li, F. Bian, M. Crovella et al., “Detection and identification of network anomalies using sketch subspaces,” in *Proceedings of the 6th ACM SIGCOMM on Internet Measurement Conference (IMC '06)*, pp. 147–152, October 2006.
5. S. Salvador and P. Chan, “Learning states and rules for detecting anomalies in time series,” *Applied Intelligence*, vol. 23, no. 3, pp. 241–255, 2005.
6. Q. Zhao, V. Hautamaki, and P. Fränti, “Knee point detection in BIC for detecting the number of clusters,” in *Advanced Concepts for Intelligent Vision Systems*, vol. 5259 of *Lecture Notes in Computer Science*, pp. 664–673, Springer, Berlin, Germany, 2008.
7. C. A. R. Hoare, “Quicksort,” *Computer Journal*, vol. 5, no. 1, pp. 10–15, 1962.
8. D. Knuth, *Sorting and Searching*, vol. 3 of *The Art of Computer Programming*, Addison-Wesley, New York, NY, USA, 3rd edition, 1997.
9. H. M. Okasha and U. Rösler, “Asymptotic distributions for random median quicksort,” *Journal of Discrete Algorithms*, vol. 5, no. 3, pp. 592–608, 2007.
10. J. Pagter and T. Rauhe, “Optimal time-space trade-offs for sorting,” in *Proceedings of the 39th Annual Symposium on Foundations of Computer Science*, pp. 264–268, November 1998.
11. D. Cantone and G. Cincotti, “QuickHeapsort, an efficient mix of classical sorting algorithms,” *Theoretical Computer Science*, vol. 285, no. 1, pp. 25–42, 2002.
12. J. Cardinal, S. Fiorini, G. Joret, R. M. Jungers, and J. I. Munro, “Sorting under partial information (without the ellipsoid algorithm),” in *Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC '10)*, pp. 359–368, June 2010.
13. P. Mockapetris, “Domain names—concepts and facilities,” Internet Request for Comments (RFC, 1034), November 1987.
14. P. Albitz and C. Liu, *DNS and BIND*, O'Reilly and Associates, 1998.
15. Name server DoS Attack October, 2002, http://www.caida.org/projects/dns-analysis/.
16. UltraDNS DOS Attack, 2002, http://www.theregister.co.uk/2002/12/14/ddos_attack_really_really_tested/.
17. “DoS Attack against Akamai,” 2004, http://news.cnet.com/2100-1038_3-5236403.html.
18. “ICANN Factsheet for the February 6, 2007 Root Server Attack,” 2007, http://www.icann.org/announcements/factsheet-dns-attack-08mar07_v1.1.pdf.