International Journal of Analysis
Volume 2014 (2014), Article ID 840432, 10 pages
http://dx.doi.org/10.1155/2014/840432
Research Article

A New Look at Worst Case Complexity: A Statistical Approach

1Department of Computer Science & Engineering, B.I.T. Mesra, Ranchi 835215, India
2Department of Applied Mathematics, B.I.T. Mesra, Ranchi 835215, India

Received 5 June 2014; Revised 17 September 2014; Accepted 18 September 2014; Published 29 December 2014

Academic Editor: Baruch Cahlon

Copyright © 2014 Niraj Kumar Singh et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We present a new and improved worst case complexity model for quick sort. The model expresses the worst case run time in terms of the input size $n$, the frequency of sample elements, and a function of both the input size and that frequency. The remaining terms, which arise from linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, which is quadratic in $n$ alone and stems from the $O(n^2)$ worst case complexity of this algorithm.

1. Introduction

Sometimes theoretical results on algorithms are not enough for predicting an algorithm’s behavior in real time implementation [1]. From research in parameterized complexity, we already know that for certain algorithms, such as sorting, the parameters of the input distribution must be taken into account in addition to the input size for a more precise evaluation of the time complexity of the algorithm in question [2, 3]. Based on the results obtained, we present a new and improved worst case complexity model for quick sort. The model expresses the worst case run time in terms of the input size $n$, the frequency of sample elements, and a function of both the input size and that frequency. The remaining terms, which arise from linear regression, have their usual meanings. We claim this to be an improvement over the conventional model, which is quadratic in $n$ alone and stems from the $O(n^2)$ worst case complexity of this algorithm. It is important to note that our results change the order of the theoretical complexity of this algorithm, as we obtain cubic empirical complexity in some situations.

This new model, in our opinion, can be a guiding factor in distinguishing this algorithm from other sorting algorithms whose theoretical average and/or worst case complexities are of similar order. The dependence of basic operation(s) on the response is more prominent for discrete distributions than for continuous ones, because the probability of a tie is zero in the continuous case. The presence of ties and their relative positions in the array are, however, crucial in discrete cases. This is precisely where the parameters of the input distribution, apart from those characterizing the size of the input, come into play.

We make a statistical case study on the robustness of theoretical worst case complexity measures for quick sort [4] over discrete uniform distribution inputs. The uniform distribution input parameters are related as $n = k \times t_d$, where $n$ is the input size, $k$ is the range parameter of the distribution, and $t_d$ is the tie-density. The runtime complexity of quick sort varies from $O(n \log n)$ to $O(n^2)$ depending on the extent of tied elements present in a sample. For example, the complexity is $O(n \log n)$ when all keys are distinct and $O(n^2)$ when all keys are similar [5]. Apart from this result, an important observation with respect to average time complexity is made by Singh et al. [6], who claim quadratic average case complexity of the quick sort program under a universal data set. This is true especially for certain models where the tie-density is a positive linear function of $n$. With these observations, it would be interesting to know the behavior of the quick sort program when the linear growth of the tie-density is replaced by some superlinear function. This interest is a major motivation for this research article.

Just as quick sort is found in this case to be worse than theory predicts, the reverse is also possible for other algorithms. That is to say, an algorithm can perform better than what a worst case mathematical bound says. In this case the bound becomes conservative [7]. A certificate on the level of conservativeness can be provided only by a statistical analysis. Empirical-O, the statistical bound estimate, will then point to some bound lower than the mathematical bound obtained by theoretical analysis. The difference provides the desired certificate. For a detailed discussion on empirical-O, the reader is referred to [8].

It is well known that quick sort’s performance depends on the underlying pivot selection algorithm, since a proper pivot selection greatly reduces the chances of getting the worst case instances. As discussed above, worst case complexity measures can be conservative. That is, for an arbitrary algorithm with a given theoretical worst case bound, in a finite range setup, we can expect an empirical complexity that does not exceed that bound. However, when other input parameters (the tie-density in our case) are also taken into account, we arrive at an empirical worst case complexity above the theoretical bound, which is a novel finding.

Justifying the Choice of Algorithm. There are many versions of quick sort. Industrial implementations of quick sort typically include heuristics that protect it against $O(n^2)$ performance when keys are similar. For quick sort, the question of choosing a proper pivot selection algorithm is more relevant to average case complexity measures, as its average case complexity itself is not robust [5]. The small (but nonzero) probability of getting the worst case instances is often cited as the reason for the overall good performance of quick sort. This is true especially for random continuous distribution inputs. This research article is a study of the worst case behavior of both the naïve and randomized versions of the quick sort algorithm.

The Organization of the Paper. The paper is organized as follows. Section 2 gives the analysis of quick sort using the statistical bound estimate. Within Section 2, Section 2.1 gives the analysis for sorted data sequences, Section 2.2 gives the analysis for random data sequences with three case studies, and Section 2.3 gives the justification for the worse than $O(n^2)$ complexity of quick sort. Section 3 gives the conclusion.

2. Analysis of Quick Sort Using Statistical Bound Estimate

Our statistical adventure explores the worst case behavior of the well-known standard quick sort algorithm [4] as a case study. The worst case analysis was done by working directly on program run time to estimate the weight based statistical bound over a finite range by running computer experiments [9, 10]. This estimate is called empirical-O. Here the time of an operation is taken as its weight. Weighing allows collective consideration of all operations, trivial or nontrivial, in a conceptual bound. We call such a bound a statistical bound, as opposed to the traditional count based mathematical bounds, which are operation specific. Since the estimate is obtained by supplying numerical values to the weights, obtained by running computer experiments, the credibility of this bound estimate depends on the design and analysis of the computer experiments in which time is the response. The interested reader is referred to [11, 12] for more insight into statistical bounds and empirical-O.

This section includes the empirical results obtained for the worst case analysis of the quick sort algorithm. The samples are generated randomly, using a random number generating function, to characterize discrete uniform distribution models. Our sample sizes, for random data sequences, lie between $5 \times 10^5$ and $100 \times 10^5$. The discrete uniform distribution depends on a single parameter, which determines the range of the sample obtained.

Most of the mean time entries (in seconds) are averaged over 500 trial readings. These trial counts, however, should be varied depending on the extent of noise present at a particular sample size. As a rule of thumb, the greater the noise at a given value of $n$, the larger the number of observations should be.
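
The averaging of run time over repeated trials can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' measurement code: the names `mean_sort_seconds` and `libc_sort` are hypothetical, `clock()` reports processor time rather than wall-clock time, and every trial sorts a fresh copy of the same input so that only the sort itself is timed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Comparison callback for qsort(): ascending order of int keys. */
static int cmp_int(const void *p, const void *q)
{
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

/* Hypothetical stand-in for the sort under test; any routine with the
 * same signature can be passed to mean_sort_seconds() instead. */
void libc_sort(int *a, size_t n)
{
    qsort(a, n, sizeof a[0], cmp_int);
}

/* Mean processor time, in seconds, over `trials` runs of `sort` on
 * copies of input[0..n-1]. Returns -1.0 if the scratch buffer cannot
 * be allocated. */
double mean_sort_seconds(void (*sort)(int *, size_t),
                         const int *input, size_t n, int trials)
{
    assert(trials > 0 && n > 0);
    int *work = malloc(n * sizeof *work);
    if (work == NULL)
        return -1.0;

    clock_t total = 0;
    for (int t = 0; t < trials; t++) {
        memcpy(work, input, n * sizeof *work);  /* fresh unsorted copy */
        clock_t start = clock();
        sort(work, n);
        total += clock() - start;               /* time the sort only */
    }
    free(work);
    return (double)total / CLOCKS_PER_SEC / trials;
}
```

Raising `trials` at sample sizes where the readings are noisy follows the rule of thumb given above; the input array itself is never modified.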

The interpretations made for the various statistical data are guided by [13].

System Specification. All the computer experiments were carried out on a Pentium 1600 MHz processor with 512 MB RAM. Statistical models and results were obtained using the Minitab-16 statistical package. The standard quick sort was implemented in the “C” language by the authors themselves. It should be understood that, although program run time is system dependent, we are interested in identifying patterns in the run time rather than the run time itself. It may be emphasized here that statistics is the science of identifying and studying patterns in numerical data related to some problem under study.

2.1. Analysis for Sorted Data Sequence

This section includes the empirical results for sorted data sequences. The samples thus generated consist of all distinct elements (at least theoretically). The program runtime data obtained for sorted sequences is fitted to a quadratic model. The regression analysis result is given in Box 1. With a very significant t-value (194.92) for the quadratic term, the regression analysis strongly supports a quadratic model. The goodness of the quadratic model is further tested through a cubic fit to the same runtime data set.

Box 1: Regression analysis: run time versus $n$ and $n^2$ (sorted data sequence).

Next the very same program runtime data is fitted to a cubic model. The regression analysis result is given in Box 2. With a value of 23.75, the t statistic for the quadratic term is significantly higher than those of the other terms in the regression model obtained. Remarkably, the statistical significance of the cubic term is very weak compared to the other terms, so the cubic term is liable to be discarded. The $R^2$ value is at its maximum, with a very small standard error. With an insignificant t value (0.13) for the cubic term, this statistical analysis suggests a strong quadratic model for the given data set.

Box 2: Regression analysis: run time versus $n$, $n^2$, and $n^3$ (sorted data sequence).

2.2. Analysis for Random Data Sequences

Next the quick sort program runtime data is analyzed for two carefully designed random data sequences whose elements correspond to points on the third degree polynomials shown in Figures 1 and 3, respectively. Because the sample elements are uniformly distributed, a linear growth in the input size goes together with a linear growth in either the range parameter or the tie-density, and in any such case the other of the two has to be constant. As a change, if the growth rate of the tie-density is made superlinear, we get random samples in which neither the range parameter nor the tie-density remains constant. The empirical results for these random sequences are given in Boxes 3 to 7.

Box 3: Regression analysis result for run time versus $n$ and $n^2$ on random data sequences ($5 \times 10^5 \le n \le 50 \times 10^5$).

Figure 1: Cubic plot of versus ().
2.2.1. Worst Case Analysis of Naïve Quick Sort (Case Study 1)

As our first case study with random data sequences, we have analyzed the worst case complexity measures for inputs in the range $5 \times 10^5 \le n \le 50 \times 10^5$.

The points on the horizontal axis in Figures 2(a) to 2(c) and 2(e) correspond to points on the third degree polynomial shown in Figure 1. It can be seen in Figure 2(a) that the runtime data, when fitted to a quadratic model, gives an underfit. The fit improves significantly when the fitted model is changed to a cubic model in $n$. We get a further improved fit when the cubic model is replaced by a fourth degree polynomial. We are not interested in higher order models, such as fifth or sixth degree polynomials, because of the risk of overfitting: we wish to capture the general trend of the population rather than force a polynomial through all the input points. Our graphical observation is next verified with the more rigorous statistical results given in Boxes 3 to 5.

Figure 2: (a) Second degree polynomial curve of versus (). (b) Third degree polynomial curve of versus (). (c) Fourth degree polynomial curve of versus ().
Figure 3: Cubic plot of versus ().

Regression Analysis and ANOVA (Analysis of Variance) Results. The program runtime data obtained for random data sequences corresponding to the curve in Figure 1 is fitted to a quadratic model. The corresponding regression analysis result is given in Box 3. The significant t-value of the quadratic term suggests a quadratic complexity. However, the $R^2$ value is relatively low and the standard error is high.

The Cubic Model as a Test of Quadratic Goodness of Fit. A test of quadratic goodness of fit is performed by fitting a cubic model to the same program runtime data. From the regression and ANOVA table (Box 4), this cubic model looks much better than the earlier quadratic model. The $R^2$ is much higher (98.0 against 87.7), the standard error is smaller (3.60446 against 8.19305), and, more importantly, the coefficient of the cubic term is highly significant. Interestingly, the cubic term is statistically more significant than the quadratic term (t values of 5.49 against −4.03).
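
The comparison of a quadratic fit against a cubic fit can be reproduced outside a statistical package with an ordinary least-squares polynomial fit. The sketch below is an illustration under stated assumptions, not the Minitab procedure used in the paper: the names `polyfit` and `r_squared` are hypothetical, the normal equations are solved directly (adequate only for low degrees and well-scaled predictors), the response is assumed not to be constant, and no t statistics or PRESS values are computed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TERMS 4   /* supports polynomial degrees 0 to 3 */

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Least-squares fit of y ~ c[0] + c[1]*x + ... + c[d]*x^d through the
 * normal equations, solved by Gaussian elimination with partial
 * pivoting. Returns 0 on success, -1 if the system is singular. */
int polyfit(const double *x, const double *y, size_t m, int d, double *c)
{
    double A[MAX_TERMS][MAX_TERMS + 1] = {{0.0}};
    int t = d + 1;
    assert(d >= 0 && t <= MAX_TERMS && m >= (size_t)t);

    /* Build the augmented matrix [X'X | X'y]. */
    for (size_t i = 0; i < m; i++) {
        double pw[MAX_TERMS];
        pw[0] = 1.0;
        for (int j = 1; j < t; j++)
            pw[j] = pw[j - 1] * x[i];
        for (int r = 0; r < t; r++) {
            for (int s = 0; s < t; s++)
                A[r][s] += pw[r] * pw[s];
            A[r][t] += pw[r] * y[i];
        }
    }
    /* Forward elimination with partial pivoting. */
    for (int col = 0; col < t; col++) {
        int best = col;
        for (int r = col + 1; r < t; r++)
            if (abs_d(A[r][col]) > abs_d(A[best][col]))
                best = r;
        if (abs_d(A[best][col]) < 1e-300)
            return -1;
        for (int s = 0; s <= t; s++) {
            double tmp = A[col][s];
            A[col][s] = A[best][s];
            A[best][s] = tmp;
        }
        for (int r = col + 1; r < t; r++) {
            double f = A[r][col] / A[col][col];
            for (int s = col; s <= t; s++)
                A[r][s] -= f * A[col][s];
        }
    }
    /* Back substitution. */
    for (int r = t - 1; r >= 0; r--) {
        double sum = A[r][t];
        for (int s = r + 1; s < t; s++)
            sum -= A[r][s] * c[s];
        c[r] = sum / A[r][r];
    }
    return 0;
}

/* Coefficient of determination, 1 - SSE/SST, of the fitted polynomial. */
double r_squared(const double *x, const double *y, size_t m, int d,
                 const double *c)
{
    double mean = 0.0, sse = 0.0, sst = 0.0;
    for (size_t i = 0; i < m; i++)
        mean += y[i];
    mean /= (double)m;
    for (size_t i = 0; i < m; i++) {
        double fit = 0.0;
        for (int j = d; j >= 0; j--)      /* Horner evaluation */
            fit = fit * x[i] + c[j];
        sse += (y[i] - fit) * (y[i] - fit);
        sst += (y[i] - mean) * (y[i] - mean);
    }
    return 1.0 - sse / sst;
}
```

Fitting the same run time data once with degree 2 and once with degree 3, then comparing the two values returned by `r_squared`, mirrors the comparison between Boxes 3 and 4. Rescaling the sample sizes before fitting (for example, dividing them by $10^5$) keeps the normal equations well conditioned.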

Box 4: Regression analysis result for run time versus $n$, $n^2$, and $n^3$ on random data sequences ($5 \times 10^5 \le n \le 50 \times 10^5$).

Box 5: Regression analysis result for the Box-Cox transformed run time on random data sequences ($5 \times 10^5 \le n \le 50 \times 10^5$).

Verifying the Cubic Complexity through “Box-Cox Transformation”. In order to get a better fit of the model, we prefer the response to be transformed. In general, transformations are used for three purposes: stabilizing the response variance, making the distribution of the response variable closer to the normal distribution, and improving the fit of the model to the data [14]. We perform the transformation to accomplish more than one of these objectives simultaneously. The power family of transformations $y^{*} = y^{\lambda}$ is very useful, where $\lambda$ is the parameter of the transformation to be determined.

Below we provide statistical results for the Box-Cox transformation of the response variable for $\lambda = 0.5$. The detailed result of the Box-Cox transformation is provided in Box 5.

The regression equation is reported in Box 5. Its statistically most significant term implies a cubic complexity model. This result once again refutes the theoretical $O(n^2)$ worst case complexity of the quick sort algorithm for certain data patterns.
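
To make the step from the transformed model back to the original response explicit, the following worked relation may help. It assumes the power transformation with parameter 0.5 stated above; the exponent $3/2$ for the dominant term is an illustrative assumption chosen to match the cubic conclusion, not a coefficient taken from Box 5.

```latex
y^{*} = y^{\lambda} = y^{1/2},
\qquad
y^{*} = O\!\left(n^{3/2}\right)
\;\Longrightarrow\;
y = \left(y^{*}\right)^{2} = O\!\left(n^{3}\right).
```

In other words, any order of growth read off the transformed response has to be squared before it is compared with the theoretical bound on the run time.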

2.2.2. Worst Case Analysis of Naïve Quick Sort (Case Study 2)

We investigate the behavior of the quick sort program for yet larger data sets whose horizontal axis elements correspond to points in Figure 3. In this experimental setup the sample sizes vary from 5 to 10 million elements. From the regression and ANOVA table (Box 6), this new cubic model looks much better than the quadratic model in Box 3. The $R^2$ is much higher (97.2 against 87.7), the standard error is smaller (2.96191 against 8.19305), and, more importantly, the coefficient of the cubic term is highly significant. In fact, the cubic term is statistically more significant than the quadratic term. These statistics indicate that a cubic model better describes the sample experimental data, as it does not overfit the complexity data. The smaller PRESS statistic for the cubic models in Boxes 4 and 6 is also favorable. See Figure 4 for an insight into the super quadratic nature of the run time versus $n$ curve. These observations lead to the conjecture that the worst case complexity of quick sort is cubic. It is important to note that our results do change the order of the theoretical complexity of this algorithm.

Box 6: Regression analysis result for run time versus $n$, $n^2$, and $n^3$ on random data sequences ($50 \times 10^5 \le n \le 100 \times 10^5$).

Box 7: Regression analysis for run time versus $n$, $n^2$, and $n^3$ (randomized quick sort) on random data sequences ($5 \times 10^5 \le n \le 50 \times 10^5$).

Figure 4: Super quadratic plot of versus ().

The statistical analysis in both case studies suggests a super quadratic worst case complexity for quick sort algorithm provided is a function of both and .

2.2.3. Worst Case Analysis of Randomized Quick Sort (Case Study 3)

As our third case study with random data sequences, we have analyzed the worst case complexity measures of randomized quick sort for inputs in the range $5 \times 10^5 \le n \le 50 \times 10^5$. Apart from the version of quick sort used in this case study, the other input requirements are similar to those in Section 2.2.1. The response variable in this case corresponds to points in Figure 5.

Figure 5: Cubic plot of versus () for randomized quick sort.

The runtime data obtained for randomized quick sort is fitted to a cubic model. The corresponding result is given in Box 7. It can be seen that, with a t statistic of 5.27, the cubic term is statistically more significant than the quadratic term. The standard error of the model thus obtained is relatively small. Also, the PRESS statistic (278.092) strongly supports a cubic complexity for the observed data. Compare these values against the corresponding values in Box 3. The worst case complexity can thus be expressed by the model presented in Section 1.

2.3. Justification for Worse than $O(n^2)$ Complexity

It is well known that the runtime of quick sort depends on the number of equal keys present in the sample [5]. With a fixed tie-density $t_d$, in a finite but reasonably wide range setup, the input size $n$ is a linear function of the range parameter $k$. Similarly, for fixed $k$, the value of $t_d$ grows linearly (see Figures 6(a) and 6(b) for an insight) with respect to $n$. For a fixed sample size, the runtime of quick sort is an increasing function of $t_d$. An important observation, made in [6], is that for a linear growth in $t_d$ (of course $k$ is to be kept constant to satisfy $n = k \times t_d$), the time complexity is quadratic. With this observation, it is interesting to know the behavior of the quick sort program when this linear growth is replaced by some steeper function. This interest is a major motivation for this research article. Under this requirement, due to the nonlinear nature of $t_d$, even $k$ is not a constant but rather a decreasing function of $n$.

Figure 6: (a) Linear growth of the tie-density value. (b) Quadratic plot for linear growth of the tie-density value.

3. Conclusions

We conclude this paper with the following remarks.

Worst case analysis is termed as a useful science, since the worst case bounds give a sense of guarantee against the nonfavorable cases. But is the science as useful as it is projected by the computer scientists? Far from it. Worst case bounds are often conservative in the sense that algorithms can and do perform better than what the bound says. No certificate is given on the level of conservativeness [8]. The present finding adds a new chapter to this criticism in that even the guarantee provided against the so-called worst situations can fail as we have discovered in the case of quick sort.

We hope that our findings will encourage other researchers to investigate and discover similar contradictions between theory and practice in algorithmic complexity so that ultimately the correct message is conveyed to the scientific community.

Note. Due to the limitations of the underlying software we have done mapping while analyzing the various runtime data using MINITAB software. Hence in order to obtain the correct coefficients use Table 1 multipliers for the various terms in the models obtained in Box 1 through Box 7.

Table 1

Appendix

The “C” Code Implementations

See Algorithms 1, 2, and 3.

Algorithm 1: The “C” code snippet for simulating discrete uniform distribution inputs.
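
The original listing is not reproduced here. As a stand-in, the following is a minimal sketch of how discrete uniform inputs might be simulated in C. It is not the authors' code: the names `uniform_1_to_k` and `fill_discrete_uniform` are hypothetical, the keys are assumed to be the integers 1 to `k`, and the standard `rand()` generator is assumed to be adequate for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Draw one integer uniformly from {1, ..., k} using rand().
 * Rejection sampling discards the top sliver of rand()'s range so that
 * every value is equally likely (no modulo bias). */
int uniform_1_to_k(int k)
{
    assert(k >= 1 && (unsigned long)k <= (unsigned long)RAND_MAX + 1UL);
    unsigned long range = (unsigned long)RAND_MAX + 1UL;
    unsigned long limit = range - range % (unsigned long)k;
    unsigned long r;
    do {
        r = (unsigned long)rand();
    } while (r >= limit);
    return (int)(r % (unsigned long)k) + 1;
}

/* Fill a[0..n-1] with independent draws from the discrete uniform
 * distribution on {1, ..., k}. Each key is then expected to occur
 * about n/k times. */
void fill_discrete_uniform(int *a, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        a[i] = uniform_1_to_k(k);
}
```

Calling `srand()` with a fixed seed before `fill_discrete_uniform` makes a sample reproducible; a small `k` relative to `n` produces heavily tied samples, while a large `k` produces few ties.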

Algorithm 2: The “C” code implementation of the naive (first element pivot) quick sort.
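
The original listing is not reproduced here. The following is a minimal sketch of a first element pivot quick sort in C, under stated assumptions rather than a copy of the authors' program: the names `partition_first` and `naive_quick_sort` are hypothetical, and the partition scheme (a single left to right scan with a strict `<` comparison) is an assumption. With this scheme, keys equal to the pivot are never split between the two sides, so sorted inputs and inputs with all keys equal both lead to quadratic work.

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Partition a[0..n-1] (n >= 1) around the first element.
 * Keys strictly smaller than the pivot end up to its left; keys equal
 * to it stay on the right. Returns the final index of the pivot. */
static size_t partition_first(int *a, size_t n)
{
    assert(n >= 1);
    int pivot = a[0];
    size_t last_small = 0;               /* a[1..last_small] < pivot */
    for (size_t j = 1; j < n; j++) {
        if (a[j] < pivot) {
            last_small++;
            swap_int(&a[last_small], &a[j]);
        }
    }
    swap_int(&a[0], &a[last_small]);
    return last_small;
}

/* Sort a[0..n-1] in ascending order. The smaller side is sorted
 * recursively and the larger side iteratively, which bounds the stack
 * depth without changing the number of comparisons. */
void naive_quick_sort(int *a, size_t n)
{
    while (n > 1) {
        size_t p = partition_first(a, n);
        size_t left = p;                 /* elements before the pivot */
        size_t right = n - p - 1;        /* elements after the pivot  */
        if (left < right) {
            naive_quick_sort(a, left);
            a += p + 1;
            n = right;
        } else {
            naive_quick_sort(a + p + 1, right);
            n = left;
        }
    }
}
```

Handling the larger side in the loop is a design choice added for this sketch: a plain doubly recursive version would need stack depth proportional to $n$ on the sorted sequences of Section 2.1.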

Algorithm 3: The “C” code implementation of the randomized quick sort.
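
The original listing is not reproduced here. The following is a minimal sketch of a randomized quick sort in C, again an assumption-based illustration rather than the authors' program: the names `random_index`, `partition_random`, and `randomized_quick_sort` are hypothetical, the pivot is assumed to be chosen with `rand()`, and the partition scan is the same strict `<` scan assumed for the naive version. Randomizing the pivot changes which key is chosen, not how keys equal to the pivot are treated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static void swap_keys(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Index in [0, n) chosen with rand(); scaling by RAND_MAX + 1 keeps the
 * result in range even when n exceeds RAND_MAX. */
static size_t random_index(size_t n)
{
    size_t i = (size_t)((double)rand() / ((double)RAND_MAX + 1.0) * (double)n);
    return i < n ? i : n - 1;            /* guard against rounding */
}

/* Move a randomly chosen key to the front, then partition around it.
 * Returns the final index of the pivot. */
static size_t partition_random(int *a, size_t n)
{
    assert(n >= 1);
    swap_keys(&a[0], &a[random_index(n)]);
    int pivot = a[0];
    size_t last_small = 0;               /* a[1..last_small] < pivot */
    for (size_t j = 1; j < n; j++) {
        if (a[j] < pivot) {
            last_small++;
            swap_keys(&a[last_small], &a[j]);
        }
    }
    swap_keys(&a[0], &a[last_small]);
    return last_small;
}

/* Sort a[0..n-1] in ascending order; the smaller side is handled
 * recursively and the larger side in the loop to bound stack depth. */
void randomized_quick_sort(int *a, size_t n)
{
    while (n > 1) {
        size_t p = partition_random(a, n);
        size_t left = p, right = n - p - 1;
        if (left < right) {
            randomized_quick_sort(a, left);
            a += p + 1;
            n = right;
        } else {
            randomized_quick_sort(a + p + 1, right);
            n = left;
        }
    }
}
```

Seeding with `srand()` before the call makes a run repeatable, which is convenient when the same sample has to be timed over many trials.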

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

  1. D. S. Johnson, A Theoretician’s Guide to the Experimental Analysis of Algorithms, 2001, http://www.researchatt.com/~dsj/.
  2. R. G. Downey and M. R. Fellows, Parameterized Complexity, Springer, New York, NY, USA, 1999.
  3. H. M. Mahmoud, Sorting: A Distribution Theory, Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley & Sons, New York, NY, USA, 2000.
  4. C. A. Hoare, “Quicksort,” The Computer Journal, vol. 5, pp. 10–15, 1962.
  5. R. Sedgewick, “Quicksort with equal keys,” SIAM Journal on Computing, vol. 6, no. 2, pp. 240–267, 1977.
  6. N. K. Singh, S. Chakraborty, and D. K. Mallick, “A statistical peek into average case complexity,” International Journal of Experimental Design and Process Optimisation. In press.
  7. S. Chakraborty and S. K. Sourabh, “On why an algorithmic time complexity measure can be system invariant rather than system independent,” Applied Mathematics and Computation, vol. 190, no. 1, pp. 195–204, 2007.
  8. S. Chakraborty, D. N. Modi, and S. K. Panigrahi, “Will the weight-based statistical bounds revolutionize the IT,” International Journal of Computational Cognition, vol. 7, no. 3, pp. 16–22, 2009.
  9. K.-T. Fang, R. Li, and A. Sudjianto, Design and Modeling for Computer Experiments, Chapman & Hall, New York, NY, USA, 2006.
  10. J. Sacks, W. Weltch, T. Mitchel, and H. Wynn, “Design and analysis of experiments,” Statistical Science, vol. 4, no. 4, pp. 409–423, 1989.
  11. S. Chakraborty and S. K. Sourabh, A Computer Experiment Oriented Approach to Algorithmic Complexity, Lambert Academic, 2010.
  12. N. K. Singh and S. Chakraborty, “Partition sort and its empirical analysis,” in Proceedings of the International Conference on Computational Intelligence and Information Technology (CIIT '11), vol. 250 of Communications in Computer and Information Science, pp. 340–346, 2011.
  13. P. Mathews, Design of Experiments with MINITAB, New Age International, New Delhi, India, 2010.
  14. D. C. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, New York, NY, USA, 8th edition, 2013.