International Journal of Analysis

Volume 2014, Article ID 840432, 10 pages

http://dx.doi.org/10.1155/2014/840432

## A New Look at Worst Case Complexity: A Statistical Approach

^{1}Department of Computer Science & Engineering, B.I.T. Mesra, Ranchi 835215, India

^{2}Department of Applied Mathematics, B.I.T. Mesra, Ranchi 835215, India

Received 5 June 2014; Revised 17 September 2014; Accepted 18 September 2014; Published 29 December 2014

Academic Editor: Baruch Cahlon

Copyright © 2014 Niraj Kumar Singh et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We present a new and improved worst case complexity model for quick sort in which the worst case time complexity (the LHS of the regression model) is expressed in terms of the input size n, the frequency m of sample elements, and a function f(n, m) of both the input size and the parameter m. The rest of the terms, arising due to linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, which stems from the O(n²) worst case complexity of this algorithm.

#### 1. Introduction

Sometimes theoretical results on algorithms are not enough for predicting the algorithm's behavior in real time implementation [1]. From research in parameterized complexity, we already know that for certain algorithms, such as sorting, the parameters of the input distribution must also be taken into account, apart from the input size, for a more precise evaluation of the time complexity of the algorithm in question [2, 3]. Based on the results obtained, we present a new and improved worst case complexity model for quick sort in which the worst case time complexity (the LHS of the regression model) is expressed in terms of the input size n, the frequency m of sample elements, and a function f(n, m) of both the input size and the parameter m; the rest of the terms, arising due to linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, which stems from the O(n²) worst case complexity of this algorithm. It is important to note that our results change the order of theoretical complexity of this algorithm, as we get worse than O(n²) complexity in some situations.

This new model, in our opinion, can be a guiding factor in distinguishing this algorithm from other sorting algorithms of a similar order of theoretical average and/or worst case complexity. The dependence of the basic operation(s) on the response is more prominent for discrete distributions than for continuous ones, since the probability of a tie is zero in the continuous case. However, the presence of ties and their relative positions in the array are crucial in discrete cases. And this is precisely where the parameters of the input distribution, apart from those characterizing the size of the input, come into play.

We make a statistical case study on the robustness of theoretical worst case complexity measures for quick sort [4] over discrete uniform distribution inputs. The discrete uniform distribution input parameters are related as n = mK, where n is the input size, K is the range parameter, and m is the expected frequency of each key. The runtime complexity of quick sort varies from O(n log n) to O(n²) depending on the extent of tied elements present in a sample. For example, the complexity is O(n log n) when all keys are distinct and O(n²) when all keys are similar [5]. Apart from this result, an important observation with respect to average time complexity is made by Singh et al. [6], which claims quadratic average case complexity of the quick sort program under a universal data set. This is true especially for certain models where the tie-density m is a positive linear function of n. With these observations, it would be interesting to know the behavior of the quick sort program when the linear growth of m is replaced by some superlinear function. This interest is a major motivation for this research article.

Just as quick sort is here found to be worse than expected, the reverse is also possible for other algorithms. That is to say, an algorithm can perform better than what a worst case mathematical bound says, in which case the bound becomes conservative [7]. A certificate on the level of conservativeness can be provided only by a statistical analysis: *empirical-O*, the statistical bound estimate, will point to some bound lower than the mathematical bound obtained by theoretical analysis, and the difference provides the desired certificate. For a detailed discussion on *empirical-O* the reader is referred to [8].

It is well known that quick sort's performance depends on the underlying pivot selection algorithm, since a proper pivot selection greatly reduces the chances of getting worst case instances. As discussed above, worst case complexity measures can be conservative. That is, for an arbitrary algorithm with O(f(n)) worst case complexity, in a finite range setup we can expect an O(g(n)) complexity, where g(n) ≤ f(n). However, when other input parameter(s) (m in our case) are also taken into account, we come up with a worse than O(n²) worst case complexity, which is a novel finding.

*Justifying the Choice of Algorithm*. There are many versions of quick sort. Industrial implementations of quick sort typically include heuristics that protect it against O(n²) performance when keys are similar. With respect to quick sort, the question of choosing a proper pivot selection algorithm is more relevant to average case complexity measures, as its average case complexity itself is not robust [5]. The small (but nonzero) probability of getting worst case instances is often cited as the reason for the overall good performance of quick sort. This is true especially for random continuous distribution inputs. This research article is a study on the worst case behavior of both the naïve and randomized versions of the quick sort algorithm.

*The Organization of the Paper*. The paper is organized as follows. Section 2 gives the analysis of quick sort using the statistical bound estimate: Section 2.1 gives the analysis for sorted data sequences, Section 2.2 gives the analysis for random data sequences with three case studies, and Section 2.3 gives the justification for the worse than O(n²) complexity of quick sort. Section 3 gives the conclusion.

#### 2. Analysis of Quick Sort Using Statistical Bound Estimate

Our statistical adventure explores the worst case behavior of the well-known standard quick sort algorithm [4] as a case study. The worst case analysis was done by directly working on program run time to estimate the weight based statistical bound over a finite range by running computer experiments [9, 10]. This estimate is called *empirical-O*. Here the time of an operation is taken as its weight. Weighing allows collective consideration of all operations, trivial or nontrivial, in a conceptual bound. We call such a bound a statistical bound, as opposed to the traditional count based mathematical bounds, which are operation specific. Since the estimate is obtained by supplying numerical values to the weights obtained by running computer experiments, the credibility of this bound estimate depends on the design and analysis of computer experiments in which time is the response. The interested reader is referred to [11, 12] for more insight into statistical bounds and *empirical-O*.

This section includes the empirical results obtained for the worst case analysis of the quick sort algorithm. The samples are generated randomly, using a random number generating function, to characterize discrete uniform distribution models with K as the parameter. Our sample sizes for random data sequences lie within a fixed finite range. The discrete uniform distribution depends on the parameter K, which is the key to deciding the range of the sample obtained.

Most of the mean time entries (in seconds) are averaged over 500 trial readings. The trial count, however, should be varied depending on the extent of *noise* present at a particular sample size. As a rule of thumb, the greater the *noise* at a given value of n, the greater the number of observations should be.

The interpretations made for the various statistical data are guided by [13].

*System Specification.* All the computer experiments were carried out on a *PENTIUM* 1600 MHz processor with 512 MB RAM. Statistical models/results are obtained using the *Minitab-16* statistical package. The standard quick sort is implemented in the "C" language by the authors themselves. It should be understood that although program run time is system dependent, we are interested in identifying *patterns* in the run time rather than the run time itself. It may be emphasized here that statistics is the science of identifying and studying *patterns* in numerical data related to some problem under study.

##### 2.1. Analysis for Sorted Data Sequence

This section includes the empirical results for sorted data sequences. The samples thus generated consist of all distinct elements (at least theoretically). The program runtime data obtained for sorted sequences is fitted to a quadratic model. The regression analysis result is given in Box 1. With a highly significant *t-value* (194.92) for the quadratic term, the regression statistics strongly support a quadratic model. The goodness of the quadratic model is further tested through a cubic fit to the same runtime data set.