Research Article | Open Access
On a Model for the Storage of Files on a Hardware: Statistics at a Fixed Time and Asymptotic Regimes
We consider a version in continuous time of the parking problem of Knuth. Files arrive following a Poisson point process and are stored on a hardware identified with the real line, in the closest free portions at the right of the arrival location. We specify the distribution of the space of unoccupied locations at a fixed time and give asymptotic regimes when the hardware is becoming full.
We consider a version in continuous time of the original parking problem of Knuth. Knuth was interested in the storage of data on a hardware represented by a circle with spots. Files arrive successively at locations chosen uniformly at random and independently among these spots. They are stored in the first free spot at the right of their arrival point (at their arrival point if it is free). Initially Knuth worked on the hashing of data (see, e.g., [1–3]): he studied the distance between the spots where the files arrive and the spots where they are stored. Later Chassaing and Louchard  have described the evolution of the largest block of data in such coverings when tends to infinity. They observed a phase transition at the stage where the hardware is almost full, which is related to the additive coalescent. Bertoin and Miermont  have extended these results to files of random sizes which arrive uniformly on the circle.
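The discrete model recalled above can be sketched in a few lines. This is only an illustration under our own conventions: the names `park_on_circle` and `largest_block` are hypothetical, spots are numbered from 0, and "to the right" is taken to mean increasing index modulo the number of spots.

```python
def park_on_circle(n_spots, arrivals):
    """Store unit files on a circle of n_spots spots (first free spot to the right).

    arrivals: iterable of arrival spots in 0..n_spots-1.
    Returns (occupied, displacements), where displacements[k] is the number
    of spots the k-th file had to move to the right before being stored.
    """
    occupied = [False] * n_spots
    displacements = []
    for a in arrivals:
        if all(occupied):
            raise ValueError("the hardware is full")
        d = 0
        while occupied[(a + d) % n_spots]:
            d += 1
        occupied[(a + d) % n_spots] = True
        displacements.append(d)
    return occupied, displacements


def largest_block(occupied):
    """Length of the longest run of occupied spots, wrapping around the circle."""
    n = len(occupied)
    if all(occupied):
        return n
    # Start scanning just after a free spot so that no block is cut in two.
    start = occupied.index(False) + 1
    best = run = 0
    for k in range(n):
        if occupied[(start + k) % n]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best
```

The displacements are the quantity studied in the hashing problem, and `largest_block` is the statistic whose behavior changes when the hardware is almost full.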
We consider here a continuous time version of this model where the hardware is large and now identified with the real line. A file labelled of length (or size) arrives at time at location . The storage of this file uses the free portion of size of the real line at the right of as close to as possible (see Figure 1). That is, it covers if this interval is free at time . Otherwise this file can be split into several parts which are then stored in the closest free portions at the right of the arrival location. We require uniformity of the location where the files arrive and identical distribution of their sizes. Thus we model the arrival of files by a Poisson point process (PPP): is a PPP with intensity on . We denote and assume . So is the mean of the sum of sizes of files which arrive during a unit interval time on some interval with unit length.
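The storage rule just described (use the closest free portions at the right of the arrival location, splitting the file if needed) can be sketched as follows. The representation of the occupied set as a sorted list of disjoint half-open intervals and the names `merge` and `store_file` are assumptions made for this illustration.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals [a, b)."""
    merged = []
    for a, b in sorted(intervals):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def store_file(occupied, x, size):
    """Store one file of positive size arriving at location x.

    occupied: sorted list of disjoint half-open intervals (a, b).
    Returns (new_occupied, pieces), where pieces lists the portions of the
    line used by this file, from left to right.
    """
    pieces = []
    pos, remaining = x, size
    for a, b in occupied:
        if b <= pos:
            continue                       # block entirely to the left of pos
        if a > pos:
            use = min(remaining, a - pos)  # fill the free gap [pos, a)
            pieces.append((pos, pos + use))
            remaining -= use
            if remaining <= 0:
                break
        pos = b                            # jump over the block [a, b)
    if remaining > 0:
        pieces.append((pos, pos + remaining))
    return merge(occupied + pieces), pieces
```

For instance, storing a file of size 2 arriving at location -1 when the interval from 0 to 2 is already occupied splits the file into the two pieces (-1, 0) and (2, 3).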
We begin by constructing this random covering (Section 2). The first questions which arise and are treated here concern statistics at a fixed time for the set of occupied locations . What is the distribution of the covering at a fixed time? At what time does the hardware become full? What are the asymptotics of the covering at this saturation time? What is the length of the largest block on a part of the hardware?
It is quite easy to see that the hardware becomes full at a deterministic time equal to . In Section 3.1, we give some geometric properties of the distribution of the covering at a fixed time and we characterize this distribution by giving the joint distribution of the block of data straddling and the free spaces on both sides of this block. The results given in this section will be useful for the problem of the dynamics of the covering considered in , where we investigate the evolution in time of a typical data block.
Then, using this characterization, we determine in Sections 3.2 and 3.3 the asymptotic regimes at the saturation time, which depend on the tail of , as in [4, 5, 7]. More precisely, we give the asymptotic of when tends to (Theorem 3.6) and the asymptotic of restricted to when tends to infinity and tends to (Theorem 3.10).
We then derive the asymptotic behavior of the largest block of the hardware restricted to when tends to infinity and tends to (Corollary 3.11). Like Chassaing and Louchard in , we observe a phase transition. Results are stated in Section 3 and proved in Section 4.
It is easy to check that for each fixed time , does not depend on the order of arrival of files before time . If is finite, we can view the files which arrive before time as customers: the size of the file becomes the service time of the customer and the location where the file arrives becomes the arrival time of the customer. We are then in the framework of the queue model in the stationary regime and the covering becomes the union of busy periods (see, e.g., [8, Chapter 3] or ). Thus, results of Sections 3.1 and 3.3 for finite follow easily from known results on . When is infinite, results are similar though the busy cycle is not defined. Thus the approach is different and proving asymptotics on random sets requires results about Lévy processes (see the appendix) and regenerative sets. One motivation for the case when is infinite comes from storage models which appear by renormalization when both the number of customers who store files independently and the size of the hardware go to infinity. Moreover, as far as we know, the longest busy period and more generally asymptotic regimes on when tends to infinity and tends to the saturation time (Section 3.4) have not been considered in the queuing model.
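For finitely many files, the queueing picture gives a direct way to compute the covering without following the order of arrival. The sketch below treats each file as a customer (location as arrival time, size as service time) and returns the busy periods of a single-server queue that starts empty. The function name `busy_periods` and the half-open convention for blocks are assumptions of this illustration.

```python
def busy_periods(files):
    """Union of busy periods of a single-server queue that starts empty.

    files: iterable of (location, size) pairs, in any order.
    Returns the data blocks as a sorted list of (start, end) pairs.
    """
    blocks = []
    for x, size in sorted(files):
        if blocks and x <= blocks[-1][1]:
            # The customer arrives during the current busy period,
            # which is therefore extended by the service time.
            start, end = blocks[-1]
            blocks[-1] = (start, end + size)
        else:
            # The server is idle: a new busy period starts at x.
            blocks.append((x, x + size))
    return blocks
```

Because the files are sorted by location before processing, the result visibly does not depend on the order in which they were listed, in line with the remark above.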
In this section, we introduce some notations and recall some definitions we need to state the results. We also provide an elementary construction of the model studied in this paper.
Throughout this paper, we use the classical notation for the Dirac mass at and .
If is a measurable subset of , we denote by its Lebesgue measure and by its closure. For every , we denote by the set and By convention, and .
Topology of Matheron
If is a closed interval of , we denote by the space of closed subsets of . For all and we define and we endow with the Hausdorff distance defined for all by The topology induced by this distance is the topology of Matheron : a sequence in converges to if and only if for each open set and each compact , It is also the topology induced by the Hausdorff metric on a compact set using arctan() or the Skorokhod metric using the class of “descending saw-tooth functions’’ (see [10, 11] for details).
Tail of and Lévy Processes Indexed by
We give here several definitions which will be useful for the study of the asymptotic regimes. Following the notation in , we say that if has a finite second moment . For , we say that whenever Then, for , we put We denote by a two-sided Brownian motion; that is, and are independent standard Brownian motions. For , we denote by a càdlàg process with independent and stationary increments such that is a standard spectrally positive stable Lévy process with index , that is, Finally, for all and , we introduce the following processes indexed by : and their infimum process defined for .
Construction of the Covering
We give here an elementary construction of and some basic identities we will use next. They are classical in queuing theory (see, e.g., ) and storage systems (see, e.g., ), and so we skip details and refer to the version  for complete proofs.
We provide a deterministic construction of for any fixed . As does not depend on the order of arrival of files before , this amounts to constructing the covering associated with a given sequence of files labelled by . The file labelled by has size and arrives after the files labelled by , at location on the real line. Files are stored following the process described in the Introduction and is the portion of line which is used for the storage.
The covering is the increasing union of the coverings () obtained by considering only the first files, that is, where can be defined in an elementary way by the following induction. Set , and introduce the complementary set of (i.e., the free space of the real line). Let , so is the right-most point which is used for storing the th file. Define then Now we introduce as the quantity of data which we have tried to store at the location (successfully or not) when files are stored. These data are the data fallen in which could not be stored in ; so is defined by Note that in queuing systems, is the workload. This quantity can be expressed using the function , which sums the sizes of the files arrived at the left of a point minus the drift term . It is defined by and Introducing also its infimum function defined for by , we get the following expression, for every :
As a consequence, the covered set when the first files are stored is given by
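As an illustration of this representation for finitely many files, the sketch below evaluates the workload at a point as the path function minus its running infimum, and a point is covered when the workload is positive. We assume here the classical reflection formula from queueing theory, with the path function equal to the sum of the sizes of the files located at or to the left of the point, minus the point itself. The name `workload` is hypothetical.

```python
def workload(files, x):
    """Workload at x for a finite list of (location, size) files.

    Assumed path function: Y(x) = (sum of sizes of files located at or
    to the left of x) - x.  The workload is Y(x) minus the infimum of Y
    over all points y <= x.
    """
    y_x = sum(size for loc, size in files if loc <= x) - x
    # Y decreases with slope -1 between arrivals and jumps upward at each
    # arrival, so its infimum over y <= x is attained either at x itself
    # or as a left limit just before some arrival location loc <= x.
    infimum = y_x
    for loc, _ in files:
        if loc <= x:
            left_limit = sum(s for other, s in files if other < loc) - loc
            infimum = min(infimum, left_limit)
    return y_x - infimum
```

With the three files (0, 1), (0.5, 1), and (-1, 2), the workload is positive exactly on the half-open interval from -1 to 3, which is the single data block produced by the storage rule.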
Finally, we introduce the function defined on by and and its infimum defined for by Assuming that the quantity of data arriving on a compact set is finite, we can let in (2.14). More precisely, the covering is given by the following proposition (see [12, Section 2.1] for the proof).
Proposition 2.1. Assuming (2.17), one has the following. If , then . If , then .
3. Properties at a Fixed Time and Asymptotic Regimes
3.1. Statistics at a Fixed Time
Our purpose in this section is to specify the distribution of the covering using Lévy processes. This characterization will be useful to prove asymptotic results (Theorems 3.6, 3.10 and Corollary 3.11) and for the dynamic results given in . To that end, following the previous section, we consider the process associated to the PPP defined by which has independent and stationary increments, no negative jumps, and bounded variation. Introducing also its infimum process defined for by we can now give a handy expression for the covering at a fixed time and obtain that the hardware becomes full at a deterministic time equal to , which is the random counterpart of Proposition 2.1 (see Section 4 for the proof).
Proposition 3.1. For every , one has a.s. For every , one has a.s.
One can note that in queuing systems, is the charge and is the standard claim of stability for , for finite .
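The role of the stability condition can be checked numerically. The sketch below is a Monte Carlo illustration under simplifying assumptions of ours: all files have the same size, locations form a Poisson process with `rate` points per unit length (playing the role of the elapsed time), and files arriving to the left of the window are ignored, which only matters near the left edge. The name `free_fraction` is hypothetical.

```python
import random


def free_fraction(rate, size, window, seed=0):
    """Estimate the fraction of [0, window) left free by the covering.

    rate:   mean number of files per unit length (assumed Poisson locations)
    size:   common size of all files (assumption: deterministic sizes)
    window: length of the observed portion of the line
    """
    rng = random.Random(seed)
    covered = 0.0
    block_start = block_end = None
    x = rng.expovariate(rate)
    while x < window:
        if block_end is not None and x <= block_end:
            block_end += size              # arrival inside the current block
        else:
            if block_end is not None:
                covered += min(block_end, window) - block_start
            block_start, block_end = x, x + size
        x += rng.expovariate(rate)
    if block_end is not None:
        covered += min(block_end, window) - block_start
    return 1.0 - covered / window
```

When the load `rate * size` is below one, the estimate should be close to one minus the load, which is the classical idle probability of a stable queue. When the load exceeds one, the free fraction of a long window is close to zero.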
To specify the distribution of , it is equivalent and more convenient to describe the distribution of its complementary set, denoted by , which corresponds to the free space of the hardware at time . By the previous proposition, there is the following identity: We begin by giving some geometric properties of this set, which are classical for finite for storage systems (see ) and queuing theory (see ).
Proposition 3.2. For every , is stationary, its closure is symmetric in distribution, and it enjoys the regeneration property: For every , is independent of and is distributed as .
Moreover for every , .
Stationarity is plain from the construction of the covering, and the regeneration property is a direct consequence of Lemma 4.1 given in the next section. Symmetry is then a consequence of [25, Lemma 6.5] or [26, Corollary 7.19]. Computation of can then be derived from [15, Theorem 1]. See [12, Section 3.1] for the complete proof.
Even though for each fixed the distribution of is symmetric, the processes and are quite different. For example, we shall observe in  that the left extremity of the data block straddling is a Markov process but the right extremity is not.
We now want to characterize the distribution of the free space . For this purpose, we need some notation. The drift of the Lévy process is equal to , its Lévy measure is equal to , and its Laplace exponent is then given by (see the appendix for background on Lévy processes) For the sake of simplicity, we write, recalling (2.1), which are, respectively, the left extremity, the right extremity, and the length of the data block straddling , . Note that if .
We work with a subset of of the form , and we denote by the symmetrical of with respect to closed at the left, open at the right. We consider the positive part (resp., negative part) of defined by
Example 3.3. For a given represented by the dotted lines, we give below and , which are also represented by dotted lines. Moreover the endpoints of the data blocks containing are denoted by and (see Figure 2).
Thus (resp., ) is the free space at the right of (resp., at the left of , turned over, closed at the left and open at the right). We have then the identity Introducing also the processes and defined by enables us to describe in the following way (see Section 4 for the proof).
Proposition 3.4. (i) The random sets and are independent, identically distributed, and independent of .
(ii) and are the range of the subordinators and , respectively, whose Laplace exponent is the inverse function of .
(iii) The distribution of is specified by where is a uniform random variable on independent of and is the Lévy measure of .
Remark 3.5. Such results are classical for regenerative sets (see, e.g., [13, 17, 18]). But we need this particular characterization and expressions given in the proof in the next section for forthcoming results.
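Part (ii) of Proposition 3.4 involves the inverse function of a Laplace exponent, which rarely has a closed form but is easy to evaluate numerically. The sketch below does so by bisection in the case of files of a common size. The sign convention (the exponent `psi` is increasing when the load `t * size` is below one) and the names `psi` and `kappa` are assumptions of this illustration and may differ from the normalization used in the text.

```python
import math


def psi(rho, t, size=1.0):
    """Assumed Laplace exponent for files of a common size.

    Convention (an assumption): E[exp(-rho * Y_x)] = exp(x * psi(rho)),
    where Y_x is the sum of the sizes of the files located in (0, x] minus x,
    and files have intensity t per unit length.
    """
    return rho - t * (1.0 - math.exp(-rho * size))


def kappa(q, t, size=1.0, tol=1e-12):
    """Inverse function of psi on [0, infinity), computed by bisection."""
    if t * size >= 1.0:
        raise ValueError("psi is only inverted here when t * size < 1")
    lo, hi = 0.0, 1.0
    while psi(hi, t, size) < q:        # psi(rho) >= (1 - t * size) * rho
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if psi(mid, t, size) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Near zero, `kappa(q, t)` behaves like `q / (1 - t * size)`, so the inverse exponent blows up as the load approaches one.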
3.2. Asymptotics at Saturation of the Hardware
We focus now on the asymptotic behavior of when tends to , that is, when the hardware is becoming full. First, note that if has a finite second moment, then Thus we may expect that if has a finite second moment, then should converge in distribution as tends to . Indeed, in the particular case or in the conditions of [4, Corollary 2.4], we have an expression of and we can prove that does converge in distribution to a gamma variable.
More generally, we shall prove that the rescaled free space converges in distribution as tends to . To that end, we need to prove that the process converges after suitable rescaling to a random process. Thanks to (3.3), should then converge to the set of points where this limiting process coincides with its infimum process. We shall also handle the case where has an infinite second moment and find the correct normalization, which depends on the tail of . Proofs are close to those of Section 3.3 and given simultaneously in Section 4.
In queuing systems, asymptotics at saturation are known as heavy traffic approximations, which depend similarly on the tail of . For finite, results given here could be directly derived from results in queuing theory (see [8, Section III.7.2] or  if has a finite second moment and  for heavy tail of ). The main difference is that can be infinite in this paper. Then the busy cycle is not defined and we consider here the whole random set of occupied locations.
To state the main result, we introduce now the following functions defined for every and by
Recalling the notation of Section 2, we then have the following weak convergence result for the Matheron topology.
Theorem 3.6. If , then converges weakly in as tends to to .
First we prove the convergence of the Laplace exponent after suitable rescaling as tends to , which ensures the convergence of the Lévy process after suitable rescaling (see Lemma 4.2). These convergences will not a priori entail the convergence of the random set since they do not entail the convergence of excursions. Nevertheless, they will entail the convergence of since (Lemma 4.4). Then we get the convergence of as tends to infinity and thus of its range .
Remark 3.7. More generally, as in queuing theory and , we can generalize these results for regularly varying functions . If is regularly varying at infinity with index , then we have the following weak convergence in : For instance, the case with leads to If is regularly varying at infinity with index , there are many cases to consider.
We get then the asymptotic of .
Corollary 3.8. If , then converges weakly as tends to to .
If (resp., ), converges weakly to a gamma variable with parameter (resp., ).
Remark 3.9. The density of data blocks of size in is equal to . By the previous theorem or corollary, this density converges weakly as tends to to the density of data blocks of size of the limit covering . This limit density, denoted by , can be computed explicitly in the cases , thanks to the last corollary: Note that this is also the Lévy measure of the limit covering .
3.3. Asymptotic Regime on a Large Part of the Hardware
Here we look at the set of occupied locations in a window of size . We consider the asymptotics of when tends to infinity and tends to the saturation time. As far as we know, results given here are new even when is finite. We introduce the following functions defined for all and by And we have the following asymptotic regime (see Section 4 for the proof).
Theorem 3.10. If , tends to infinity and to such that with , then converges weakly in to .
Thus as in , we observe a phase transition of the size of the largest block of data in as according to the rate of filling of the hardware. More precisely, denoting where is the sequence of component intervals of ranked by decreasing order of size, we have the following.
Corollary 3.11. Let , tends to infinity and to . (i) If with , then converges in distribution to the largest length of excursion of . (ii) If , then . (iii) If , then .
The phase transition occurs at time such that with . The more data arrive in small files (i.e., the faster tends to zero as tends to infinity), the later the phase transition occurs. In [4, 5], the hardware is a circle and processes required for asymptotics are the bridges of the processes used here. A consequence is that in our model, tends to zero or one with a positive probability at phase transition, which is not the case for the parking problem in [4, 5]. More precisely, denoting by the law of the largest length of excursion of , we have
We give here some complementary results about the distribution of the set of occupied locations at a fixed time and about the storage process.
We can give the distribution of the extremities of : Writing (see (4.9) and (4.10)) and using the identity of fluctuation (A.17) gives another expression for the Laplace transform of . For all and , we have As a consequence, we see that the law of is infinitely divisible. Moreover this expression will give the generating triplet of the additive process [6, Theorem 2, Section 4].
We can also estimate the number of data blocks on the hardware. If has a finite mass, we write as the number of data blocks of the hardware restricted to at time . This quantity has a deterministic asymptotic as tends to infinity which is maximum at time . And the number of blocks of the hardware reaches a.s. its maximum at time . More precisely, we have the following.
Proposition 3.12. If , then for every ,
Finally, we can describe here the hashing of data. We recall that a file labeled by is stored at location . In the hashing problem, one is interested in the location where the file is stored knowing . By stationarity, we can take and consider a file of size which we store at time at location on the hardware whose free space is equal to . The first point (resp., the last point) of the hardware occupied for the storage of this file is equal to (resp., to ). This gives the distribution of the extremities of the portion of the hardware used for the storage of a file.
Let us now consider three explicit examples.
(1) The basic example is when (all files have the same unit size as in the original parking problem in ). Then for all and , where the second identity follows from integrating (3.19). Then, and follows a size-biased Borel law:
(2) Another example where the computations can be made explicit is the gamma case when . Note that and . Then, for every , Further
(3) For the exponential distribution , we can get
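For the unit-size example (1), the Borel law and its size-biased version are easy to evaluate numerically. The sketch below uses the classical definition of the Borel law with parameter `mu`, which describes the number of customers served during a busy period of a queue with unit service times and load `mu`. The function names are hypothetical, and the normalization used in the text may differ (for instance by including the probability that the point is free).

```python
import math


def borel_pmf(n, mu):
    """P(N = n) = exp(-mu * n) * (mu * n)**(n - 1) / n!  for n = 1, 2, ...

    Classical Borel law with parameter 0 < mu <= 1, computed through
    logarithms to avoid overflow for large n.
    """
    log_p = -mu * n + (n - 1) * math.log(mu * n) - math.lgamma(n + 1)
    return math.exp(log_p)


def size_biased_borel_pmf(n, mu):
    """Size-biased version n * P(N = n) / E[N], using E[N] = 1 / (1 - mu)."""
    return (1.0 - mu) * n * borel_pmf(n, mu)
```

Size-biasing appears because a block straddling a fixed point is picked with probability proportional to its length, so long blocks are over-represented compared with a block chosen uniformly among all blocks.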
In this section, we provide rigorous arguments for the original results which have been stated in Section 3.
Proof of Proposition 3.1. First entails that for all , a.s. and condition (A.7) is satisfied a.s. Then, by Proposition 2.1,
(i) If , then and the càdlàg version of is a Lévy process. So we have (see [20, Corollary 2, page 190]) Then Proposition 2.1 ensures that for every , a.s.
(ii) If , then ensures (see [20, Corollary 2, page 190]) that Similarly, we get that for every , a.s.
For the forthcoming proofs, we fix , which is omitted from the notation of processes for the sake of simplicity.
To prove the regeneration property and characterize the Laplace exponent of , we need to establish first a regeneration property at the right extremities of the data blocks. In that view, we consider, for every , the files arrived at the left/at the right of before time :
Lemma 4.1. For all , is independent of and distributed as .
Proof. The simple Markov property for PPP states that, for every , is independent of and distributed as . Clearly this extends to simple stopping times in the filtration and further to any stopping time in this filtration using the classical argument of approximation of stopping times by a decreasing sequence of simple stopping times (see also ). As is a stopping time in this filtration, is independent of and distributed as .
Proof of Proposition 3.4. (i) By symmetry, , , and are identically distributed. The regeneration property ensures that is independent of . By symmetry, is independent of . So , , and are independent.
(ii) As is a.s. the union of intervals of the form , then increases at So, for every , So the range of is equal to . The fact that is a subordinator will be proved below but could be also derived directly from the regeneration property of (see ). Similarly the range of is equal to .
Moreover, on and if is an interval component of . By integrating on , we have a.s. for every such that , Then using again the definition of given in Section 3.1 and that is the range of , Moreover and Lemma 4.1 entails that is distributed as a PPP on with intensity . So is a Lévy process with bounded variation and drift which verifies condition (A.7) (use (A.5) and ). Then Theorem A.1 in the appendix entails that is a subordinator whose Laplace exponent is the inverse function of .
As is distributed as , is distributed as by definition.
(iii) We determine now the distribution of using fluctuation theory, which enables us to get identities useful for the rest of the work. We write for the càdlàg version of and Using (3.3) and the fact that has no negative jumps, we have Using again (3.3) and the fact that is regular for (see [20, Proposition 8, page 84]), we have also a.s. where is distributed as by (4.7) and is independent of since is independent of . Then for all with , which gives the distributions of , , and letting, respectively, , , and . Computing then the Laplace transform of where is a uniform random variable on independent of gives the right-hand side of (). So , where is a uniform random variable on independent of .
Proofs of Theorems 3.6 and 3.10 are close and made simultaneously. For that purpose, we introduce now as the Laplace exponent (see (A.1)) of given, for , , and by We denote by the space of càdlàg functions from to which we endow with the Skorokhod topology (see [22, page 292]). First, we prove the weak convergence of after suitable rescaling.
Lemma 4.2. If , then for all and , which entail the following weak convergences of processes in :
Remark 4.3. If is regularly varying at infinity with index , then converges to as tends to infinity.
Proof of Lemma 4.2. Using (A.4), we have
where . The first part of the lemma then follows by applying the Tauberian theorem in [20, page 10] which gives the asymptotic behavior of the last term. For a detailed proof, we refer to .
These convergences ensure the convergence of the finite-dimensional distributions of the processes. The weak convergence in , which is the second part of the lemma, follows from [16, Theorem 13.17].
In the spirit of Section 3.1, we introduce the expected limit set, that is, the free space of the covering associated with , and the extremities of the block containing : We have the following analog of Proposition 3.4. and are independent, identically distributed and independent of . Moreover and are, respectively, the range of the subordinators and , whose Laplace exponent is the inverse function of . Finally, using , the counterpart of (4.12) gives for and , The proofs of these results follow the proof of Proposition 3.4, except for two points.
(1) We cannot use the point process of files to prove the stationarity and regeneration property of and we must use the process instead. The stationarity is a direct consequence of the stationarity of . The regeneration property is a consequence of the counterpart of Lemma 4.1 which can be stated as follows. For all , and distributed as . As in Lemma 4.1, this property is an extension to the stopping time of the following obvious result: is independent of and distributed as .
(2) It is convenient to define directly by For , and so we can apply Theorem A.1 and is a subordinator whose Laplace exponent is the inverse function of . Moreover its range is a.s. equal to , since the Lévy process is regular for [20, Proposition 8, page 84].
Lemma 4.4. If , then for all and ,
Remark 4.5. If is regularly varying at infinity of index , we have similarly
Proof. First we prove that
Indeed the function decreases so for all and , we have
and proves (4.24) recalling (3.17).
Then the first part of Lemma 4.2 and the identity give the first part of Lemma 4.4. Indeed for every , . So (4.24) entails Put to get the first limit of the lemma and follow the same way to get the second one.