Abstract

Following a suggestion of Cichoń and Macyna, binary search trees are generalized by keeping m (classical) binary search trees and distributing the incoming data at random among the individual trees. The costs of unsuccessful and successful search are analyzed, as well as the internal path length.

1. Introduction

Cichoń, together with his coauthor Macyna, had the seminal idea [1] to generalize approximate counting to approximate counting with m counters. While in the original version [2] a stream of letters (a word) is processed by a single counter in a certain way (that is of no interest here), the new version uses m counters and chooses for each letter one of these counters (with probability 1/m), where it is dealt with as usual. The result of the procedure is the sum of the individual results of the m counters.

This fundamental idea should, however, not be restricted to approximate counting! Indeed, it can be considered within a variety of different contexts. In this paper, the fundamental idea is applied to binary search trees. They are very well understood and described in classic books such as [3, 4], with plenty of backward pointers to the older literature due to Lynch, Hibbard, Louchard, Brown, Shubert, and many others. We assume that, instead of just one, m binary search trees are kept, and for each element, when inserting it, a decision is made as to which of the m trees it is sent. Of course, for algorithmic purposes, this choice must be deterministic, so that one knows in which tree to search. However, for the analysis, it is assumed that each tree is equally likely and will be selected with probability 1/m.

Almost all the information about binary search trees that is known to this day can be found in the encyclopedic books [3, 4]. We only mention that they originate from random permutations (typically of {1, …, n}); a new element is compared to the root and moves to the left/right subtree if it is smaller/larger than the root; the process then continues recursively until an empty position is found. A binary search tree is used as a data structure. It is, thus, essential that one can find existing elements in a reasonable number of steps and also get the information that a searched element is not present after a small number of comparisons.
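The insertion procedure just described can be sketched in a few lines; the names (`Node`, `insert`, `bst_from_permutation`) are ours, chosen only for illustration:

```python
class Node:
    """A node of a classical binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key by descending from the root, moving left/right
    as key is smaller/larger, until an empty position is found."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def bst_from_permutation(perm):
    """Build the binary search tree of a permutation by inserting
    its elements one by one, in order of arrival."""
    root = None
    for key in perm:
        root = insert(root, key)
    return root
```

Feeding a random permutation into `bst_from_permutation` produces exactly the random binary search trees studied in the classical model.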

This is the first paper about the m-version of binary search trees, and the hope is that many more will be written in the future, by various specialists. Thus, no completeness is aimed at. Three parameters are studied: the cost of inserting a new element into the m trees, which is related to the cost of unsuccessful search; the cost of successful search, which is the average level of all the elements in all m trees; and the internal path length, which is the sum of the internal path lengths of the m trees.

In the classical case, probability generating functions are available, so that one can extract moments from them, which can be written in terms of harmonic numbers and their generalizations. We describe here how these probability generating functions translate to the m-model.

We try to use consistent notation: for each classical probability generating function we also introduce its transformed m-version. Throughout, n denotes the total number of nodes in the m binary search trees. Furthermore, we use standard notation for probabilities and moments, and we use second factorial moments on our way to the variance.

A crucial expression is the multinomial probability n!/(n_1! ⋯ n_m!) · m^(−n), where n_1 + ⋯ + n_m = n; this is the probability that the n data items split into m sets of sizes n_1, …, n_m.
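Under the uniform 1/m assignment assumed in the introduction, this split probability is the multinomial coefficient divided by m^n; a small sketch (the function name is ours):

```python
from math import comb

def split_probability(sizes):
    """Probability that n = sum(sizes) items, each sent independently
    and uniformly to one of m = len(sizes) trees, split into exactly
    the given sizes: the multinomial coefficient n!/(n_1!...n_m!)
    divided by m**n."""
    n, m = sum(sizes), len(sizes)
    coeff, remaining = 1, n
    for s in sizes:
        coeff *= comb(remaining, s)  # choose which keys go to this tree
        remaining -= s
    return coeff / m ** n
```

Summing `split_probability` over all compositions of n into m parts gives 1, as it must for a probability distribution.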

It turns out that we have to use three auxiliary quantities, which are introduced in the next section. All our quantities of interest can be expressed in terms of them. This is done in full in the section on unsuccessful search but only sketched in the remaining sections, since the actual computations are quite long.

The intuition is of course that each of the m binary search trees should have roughly n/m nodes; the analysis that follows will make this precise.
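A quick Monte Carlo sanity check of this intuition, under the uniform 1/m assignment assumed throughout (the helper name is ours):

```python
import random

def tree_sizes(n, m, rng):
    """Send each of n keys to one of m trees, uniformly at random,
    and return the resulting occupancy counts."""
    sizes = [0] * m
    for _ in range(n):
        sizes[rng.randrange(m)] += 1
    return sizes
```

For large n the counts concentrate around n/m with fluctuations of order sqrt(n), which is what the exact analysis below quantifies.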

The classic book [5] is an excellent source on harmonic numbers and their manipulation; in fact, one of our auxiliary quantities already appears in it!

2. Unsuccessful Search

The first parameter that we study is the number of comparisons needed to insert node n + 1 into a binary search tree with n nodes. This is directly related to searching for a key which is not present, since it is equivalent to inserting this (nonexistent) item as the (n + 1)st node. The probability generating function, the probability that a given number of comparisons is needed, and the resulting moments are all classical. Now we translate this into the m-model. The largest node sits in one of the m binary search trees; the relevant quotient of binomial coefficients is the probability that the remaining nodes can be chosen accordingly. On the level of probability generating functions, the m-version is obtained by multiplying and summing over the possible splits, and moments can then be computed by differentiation. In order to do so, we need some auxiliary sums that will also be useful in later sections.
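To illustrate the parameter, here is a small simulation sketch of the m-model insertion cost, under the model assumptions of the introduction (all names are ours; comparisons are counted as nodes visited on the way down):

```python
import random

def build_tree(keys):
    """Build a binary search tree (as child dictionaries) by inserting
    the keys in the given order; return (root, left, right)."""
    left, right = {}, {}
    root = keys[0]
    for key in keys[1:]:
        cur = root
        while True:
            child = left if key < cur else right
            if cur not in child:
                child[cur] = key
                break
            cur = child[cur]
    return root, left, right

def insertion_cost(keys, new_key):
    """Number of comparisons (nodes visited) to insert new_key into
    the tree built from keys; an empty tree costs nothing."""
    if not keys:
        return 0
    root, left, right = build_tree(keys)
    cur, cost = root, 1
    while True:
        child = left if new_key < cur else right
        if cur not in child:
            return cost
        cur = child[cur]
        cost += 1

def m_model_cost(n, m, rng):
    """Send keys 1..n (arriving in random order) to m trees chosen
    uniformly at random, then insert key n + 1 into a uniformly
    chosen tree and return the number of comparisons used."""
    perm = list(range(1, n + 1))
    rng.shuffle(perm)
    buckets = [[] for _ in range(m)]
    for key in perm:
        buckets[rng.randrange(m)].append(key)
    return insertion_cost(buckets[rng.randrange(m)], n + 1)
```

For moderate n one observes that inserting into one of m trees is cheaper on average than inserting into a single tree holding all n keys, in line with the n/m intuition.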

Lemma 1.

Proof. All the proofs use the basic recursion for binomial coefficients to create a first-order recursion, which can then be solved by summation. The procedure for the first quantity is contained in [5].

In our applications a particular substitution occurs, and the formulae then simplify accordingly.

After these long but necessary computations have been done, we can now compute the moments. The first moment follows by one differentiation; the second factorial moment then follows, by two differentiations, from a similar (but much longer) computation.

From these results, we can get the variance explicitly, as the second factorial moment plus the mean minus the squared mean. We do not display it, since it is quite long. However, we will drop exponentially small terms; then the results are a bit more appealing. The sums can be extended to infinity; the extra terms are absorbed in our exponentially small remainder term.

The remaining sums can be asymptotically evaluated: nothing more is required than the generating function of the harmonic numbers. What we have done here is justified by singularity analysis, as described in [6]; note that z = 1 is the dominant singularity of that generating function.
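The generating function of the harmonic numbers is log(1/(1 − z)) · 1/(1 − z); a quick sketch checking its coefficients by power-series convolution (function name ours):

```python
from fractions import Fraction

def harmonic_gf_coeffs(N):
    """Coefficients of log(1/(1-z)) * 1/(1-z) up to z^N, i.e. the
    convolution of [0, 1, 1/2, 1/3, ...] with [1, 1, 1, ...];
    the n-th coefficient is the harmonic number H_n."""
    log_coeffs = [Fraction(0)] + [Fraction(1, k) for k in range(1, N + 1)]
    # convolving with 1/(1-z) = [1, 1, 1, ...] takes partial sums
    return [sum(log_coeffs[:n + 1]) for n in range(N + 1)]
```

The singularity of this function at z = 1 is what singularity analysis exploits to extract the asymptotics of the sums above.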

Theorem 2. The expectation and variance of the number of comparisons needed to insert the last element into m binary search trees with altogether n nodes are given by the asymptotic formulae derived above. More terms in the asymptotic expansions are easily available.

3. Successful Search

Now, we look at successful search in m binary search trees. The model is that the comparisons to find all possible nodes are added, and this count is then divided by the total number of nodes. This parameter has a classical probability generating function, which translates into the m-model as before: we add the comparisons in each of the m subtrees and then divide by the total number n. The classical results for the moments are well known. Consequently, we can evaluate moments in the same style as in the last section; we do not present all the long computations here. Once again, we drop exponentially small terms to get shorter formulæ. The asymptotic form is then computed as in the previous section.
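As a small illustration of the parameter, the following sketch computes all node depths of a binary search tree and the resulting average successful-search cost; averaging the total depth over all 24 permutations of {1, 2, 3, 4} reproduces the classical expected internal path length 29/6, i.e. a total of 116 (all names are ours):

```python
from fractions import Fraction
from itertools import permutations

def depths(perm):
    """Insert the keys of perm into a binary search tree and return
    a dictionary mapping each key to its depth (root has depth 0)."""
    left, right = {}, {}
    root = perm[0]
    depth = {root: 0}
    for key in perm[1:]:
        cur = root
        while True:
            child = left if key < cur else right
            if cur not in child:
                child[cur] = key
                depth[key] = depth[cur] + 1
                break
            cur = child[cur]
    return depth

def average_successful_depth(perm):
    """Total depth of all nodes divided by the number of nodes:
    the average level relevant for successful search."""
    d = depths(perm)
    return Fraction(sum(d.values()), len(d))
```

In the m-model one applies `depths` to each of the m trees, sums, and divides by the total number n, exactly as in the translation described above.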

Theorem 3. The expectation and variance of the number of comparisons in a successful search related to m binary search trees with altogether n nodes are given by the asymptotic formulae derived above. More terms in the asymptotic expansions are easily available.

4. Internal Path Length

The last parameter that we study is the (internal) path length, namely, the sum of the distances of all the nodes to the root (in the classical case). In the m-version, it is simply the sum of the path lengths in the m individual trees. The probability generating functions satisfy a classical recursion, whence the m-version follows. The classical moments are known. Therefore (and again, the extremely long computations are not displayed), the first and second moments of the m-version can be computed. One can then plug in the aforementioned explicit formulæ for the auxiliary quantities, which we do not display, because of length. Instead, we decided to produce an asymptotic formula including the terms of leading orders. Eventually, we arrive at the last result of this paper.
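In the classical case, the expected internal path length satisfies the well-known recursion C_n = n − 1 + (2/n) Σ_{k&lt;n} C_k with closed form 2(n + 1)H_n − 4n; a sketch verifying this in exact rational arithmetic (function names ours):

```python
from fractions import Fraction

def expected_ipl(N):
    """Expected internal path lengths C_1..C_N of random binary search
    trees, from the recursion C_n = n - 1 + (2/n) * sum_{k<n} C_k
    with C_0 = 0, computed with exact rationals."""
    C = [Fraction(0)]
    running = Fraction(0)          # running sum of C_0..C_{n-1}
    for n in range(1, N + 1):
        C.append(n - 1 + 2 * running / n)
        running += C[-1]
    return C[1:]

def closed_form(n):
    """The classical closed form 2(n+1)H_n - 4n."""
    H = sum(Fraction(1, k) for k in range(1, n + 1))
    return 2 * (n + 1) * H - 4 * n
```

The agreement of the recursion with the closed form is the classical starting point; the m-version then sums this quantity over the m individual trees.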

Theorem 4. The expectation and variance of the internal path length of m binary search trees with altogether n nodes are given by the asymptotic formulae derived above. More terms in the asymptotic expansions are easily available.

5. Conclusion

This was a first step towards the analysis of the m-model of binary search trees. Much more is known about binary search trees and could/should be lifted to that level. Just to mention something explicit, one could look at the depth of a given node in an m-binary search tree of n random nodes. The average of this (in the classical case) is due to Arora and Dent [7] and is related to the number of passes that the recursive algorithm Quickselect needs to find an element of given rank, see [8, 9].

If one wants to compute higher moments, then one needs to introduce further sums of a similar type.

One of the quantities that appear is related to the dilogarithm function.