Two Bootstrap Strategies for a k-Problem up to Location-Scale with Dependent Samples
This paper extends the work of Quessy and Éthier (2012), who considered tests for the k-sample problem with dependent samples. Here, the marginal distributions are allowed, under the null hypothesis, to differ in their mean and variance; in other words, one focuses on the shape of the distributions. Although easily stated, this problem nevertheless requires careful treatment for the computation of valid p-values. To this end, two bootstrap strategies based on the multiplier central limit theorem are proposed, both exploiting a representation of the test statistics in terms of a Hadamard differentiable functional. This accounts for the fact that one works with empirically standardized data instead of the original observations. The simulations reported show the good sample properties of the methods based on Cramér-von Mises and characteristic function type statistics. The newly introduced tests are illustrated on the marginal distributions of the eight-dimensional Oil currency data set.
1. Introduction
Testing for the equality in distribution of two or more real-valued random variables is a classical statistical problem that has been extensively investigated in the literature. Formally, the goal is to test for $H_0 : F_1 = \cdots = F_k$, where $X_j \sim F_j$, $j \in \{1, \ldots, k\}$, and $X_1, \ldots, X_k$ are usually assumed to be independent random variables. In the bivariate case ($k = 2$), the most popular procedures for $H_0$ are those based on the sign, Wilcoxon rank-sum, Kolmogorov-Smirnov, and Cramér-von Mises statistics. The generalization to the $k$-sample situation has been considered by Kiefer and Bickel using Kolmogorov-Smirnov and Cramér-von Mises statistics and by Scholz and Stephens using the Anderson-Darling functional. More recent contributions include Zhang and Wu, Martínez-Camblor and de Uña-Álvarez, Martínez-Camblor, and Wyłupek, among others.
When $X_1, \ldots, X_k$ cannot be taken as independent, the testing of $H_0$ must be handled with care. In that case, one knows from a celebrated theorem of Sklar that there exists a copula $C$ such that the joint distribution $H$ of $(X_1, \ldots, X_k)$ can be written as $H(x_1, \ldots, x_k) = C\{F_1(x_1), \ldots, F_k(x_k)\}$, where $F_j(x) = P(X_j \le x)$. If the marginal distributions are continuous, then $C$ is unique; see Nelsen for details. As underlined by Quessy and Éthier, the possible asymmetry of $C$ invalidates the use of permutation methods. These authors proposed an alternative statistical procedure in which p-values are estimated from an adapted version of the multiplier bootstrap method.
The aim of this paper is to extend the test procedures of Quessy and Éthier to a version of the $k$-sample problem when dependent random variables are rescaled with respect to their mean and variance. More specifically, consider random variables $X_1, \ldots, X_k$ and let $\mu_j = E(X_j)$ and $\sigma_j^2 = \mathrm{var}(X_j)$ be unknown. The null hypothesis of interest in this work is $H_0 : \tilde F_1 = \cdots = \tilde F_k$, where $\tilde F_j(x) = F_j(\mu_j + \sigma_j x)$ and $F_j(x) = P(X_j \le x)$. In other words, one wants to test the shape hypothesis that the distributions are equal up to location and scale factors. The main interest of this hypothesis is that an apparent difference between distributions may be merely an artefact of measurements made on different scales.
Because the means and variances are unknown, this problem requires a careful investigation at the level of the computation of valid p-values. Indeed, the fact that these moments must be estimated from the data has an impact on the limiting distribution of the test statistics. Hence, a naive approach treating the means and variances as known values would not work here. As will be seen, the empirical process from which the test statistics are computed has a useful representation in terms of a Hadamard differentiable functional. This is at the heart of the two bootstrap strategies that are developed in this work. The test statistics that are proposed are based on Cramér-von Mises and characteristic function mappings. It will be seen that the latter entail powerful procedures under many kinds of alternatives. Moreover, the nature of these functionals yields simple expressions in terms of quadratic forms both for their sample version and for their bootstrap counterparts.
The paper is structured as follows. The Hadamard differentiable functional which is at the basis of the proposed methodologies is introduced in Section 2 together with the test statistics. In Section 3, two resampling methods based on the multiplier bootstrap method are proposed and their impact on the computation of the p-values of the test statistics is described. The sample properties of the newly introduced tests are studied in Section 4 with the help of Monte-Carlo simulations. An illustration on the eight-dimensional Oil currency data set is detailed in Section 5.
2. Test Statistics
2.1. Empirical Process of the Rescaled Observations
Let be independent copies of , where . For each , the random variable is assumed to have a finite moment of order two. The estimation of the rescaled distribution functions will be based on the empirically standardized observations where, for each , and are the usual empirical mean and variance based on . Specifically, the univariate distribution function will be estimated for by If denote the usual empirical distribution functions, one obtains easily . Letting , one then has
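The construction of the empirically standardized observations and of the associated empirical distribution function can be sketched as follows for a single margin. This is only an illustration under stated assumptions: the function names are hypothetical, and the divisor $n - 1$ in the empirical variance is an assumption, since the exact convention is not recoverable from the text above.

```python
import math


def standardize(sample):
    """Return (x - mean) / sd for each x; the divisor n - 1 is an assumption."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in sample]


def ecdf(sample, x):
    """Empirical distribution function of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def standardized_ecdf(sample, x):
    """Empirical distribution function of the standardized sample at x."""
    return ecdf(standardize(sample), x)


# Rescaling the raw data leaves the standardized estimate unchanged.
raw = [1.0, 2.0, 3.0, 4.0, 5.0]
rescaled = [10.0 * v for v in raw]
print(standardized_ecdf(raw, 0.0), standardized_ecdf(rescaled, 0.0))
```

The last two lines illustrate the location-scale invariance that motivates the approach: both calls return the same value.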
Remark 1. The estimation of from empirically standardized observations is somewhat similar to the estimation of a quantile function as considered by Parzen. Under the null hypothesis that a random sample comes from a distribution of the form for some fixed distribution function , empirically standardized observations are defined as Under the null hypothesis, are approximately uniform on , which motivated Parzen to compare their empirical quantile function to the uniform quantile function as a goodness-of-fit criterion.
The large-sample behavior of , where , will be a consequence of the Hadamard differentiability of the functional defined on the product space of -dimensional vectors of bounded functions on . Here, for a given univariate distribution function whose moment of order two exists, the functionals are, respectively, the mean and the variance of the probability distribution associated to . Note that, for each , The next result establishes that is Hadamard differentiable. Recall that a map , where and are normed spaces, is said to be Hadamard differentiable if there exists a continuous linear mapping , called the derivative of at , such that, for and , If exists only on a subset of , then is said to be differentiable tangentially to . See van der Vaart and Wellner for more details.
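For reference, the standard definition of Hadamard differentiability recalled above can be written out as follows. The symbols $\Phi$, $\mathbb{D}$, $\mathbb{E}$, $\theta$, $h_n$, and $t_n$ are generic choices made for this display and are not necessarily the notation used elsewhere in the paper.

```latex
\Phi : \mathbb{D}_\Phi \subset \mathbb{D} \to \mathbb{E}
\ \text{is Hadamard differentiable at } \theta \in \mathbb{D}_\Phi
\ \text{if there is a continuous linear map } \Phi'_\theta : \mathbb{D} \to \mathbb{E}
\ \text{such that}
\]
\[
\lim_{n \to \infty}
\left\| \frac{\Phi(\theta + t_n h_n) - \Phi(\theta)}{t_n} - \Phi'_\theta(h) \right\|_{\mathbb{E}} = 0
\quad \text{for all } t_n \to 0 \text{ and } h_n \to h \in \mathbb{D}
\text{ with } \theta + t_n h_n \in \mathbb{D}_\Phi .
```

Tangential differentiability is obtained by requiring the limit only for $h$ in a subset $\mathbb{D}_0 \subset \mathbb{D}$.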
Proposition 2. Suppose that have bounded densities given, respectively, by . Then the functional is Hadamard differentiable with derivative at evaluated at given by where with and is a diagonal matrix with , .
Suppose that has continuous marginal distributions and let be the unique copula associated to its joint distribution. From classical arguments, the vector of empirical processes converges weakly to , where are dependent Brownian bridges such that, for and any , where is the copula of . Hence, for any , the limit Gaussian process is characterized by the covariance structure , where, for any , the entries of are The result can be extended to the empirical process , thanks to the Hadamard differentiability of stated in Proposition 2.
Proposition 3. If have bounded densities and finite moments of order two, then converges weakly to a centered Gaussian process of the form where and are Brownian bridges with covariance structure given by . The covariance structure of is given by , where, for any ,
Proposition 3 can be specified to the case when the null hypothesis holds true. In that case, there is a distribution function such that The result is stated as a corollary to Proposition 3.
Corollary 4. Suppose that has a bounded density whose moment of order two exists. Under , the empirical process converges weakly to the centered Gaussian process whose covariance structure is given by where the entries of are for .
2.2. Cramér-von Mises and Characteristic Function Test Statistics
Consider the Cramér-von Mises and characteristic function type functionals with being the component-wise characteristic function of , and is its complex conjugate. Here, is a weight function that gives nonnull mass at each . Alternate representations for are given in the next lemma; the latter will prove useful in the sequel.
Lemma 5. Letting , one has Moreover, if and , then
The statistics that will be investigated in this work are Here, is a combination matrix such that if and only if for some , where and . The null hypothesis can then be reformulated as . The asymptotic behavior of and under is stated in the next proposition.
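To make the role of the combination matrix concrete, the sketch below builds one possible choice, the matrix of successive differences, and checks that it annihilates a vector exactly when all of its entries are equal. This particular matrix and the helper names are assumptions made for illustration; the paper allows other combination matrices with the same null space.

```python
def difference_matrix(k):
    """(k - 1) x k matrix whose r-th row is e_r - e_{r+1} (one possible choice)."""
    return [
        [1 if c == r else -1 if c == r + 1 else 0 for c in range(k)]
        for r in range(k - 1)
    ]


def apply(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


A = difference_matrix(3)
print(apply(A, [0.4, 0.4, 0.4]))  # all entries equal: every contrast is zero
print(apply(A, [0.4, 0.5, 0.5]))  # first entry differs: first contrast is nonzero
```

Applied pointwise to the vector of standardized distribution functions, such a matrix turns the null hypothesis into the statement that every contrast vanishes.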
Proposition 6. Suppose that has a bounded density whose moment of order two exists. Also assume that is a weight function such that , , and . Then, under , where is the limit of identified in Corollary 4.
An interesting feature of the test statistics and is that simple formulas for their computation are available in terms of product of matrices. Explicit expressions similar to those given by Quessy and Éthier  are described in Remarks 7 and 8.
Remark 7 (explicit formula for ). Because , one can show that , where the entries of are Hence, , where . Since by the assumption that vanishes, .
Remark 8 (explicit formula for ). From representation (19) for , where the entries of are
The challenging issue of computing valid p-values for test statistics based on and is addressed in this section, where two bootstrap methods are developed. One version is based directly on the functional , while the other one is built around its Hadamard derivative. The key element here is the so-called multiplier bootstrap method applied to the empirical process . This resampling method is a very powerful technique which is especially useful when testing for composite hypotheses. In the case of , one defines, for each , the vector of processes where , with , are independent vectors of independent random variables having unit mean and unit variance and such that . This version of the multiplier method is called the Bayesian bootstrap in Kosorok. Straightforward arguments given in Quessy and Éthier enable establishing the weak convergence of to , where are independent copies of . The method is extended to in the next subsection.
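A minimal sketch of one multiplier replicate of the vector of empirical processes is given below. It assumes exponential multipliers with mean one, normalized by their average as in the Bayesian bootstrap, and it assumes that the same multiplier is attached to all components of a given observation so that the dependence between margins is preserved. The function names are hypothetical.

```python
import math
import random


def multiplier_weights(n, rng):
    """Centered weights xi_i / mean(xi) - 1 with xi_i exponential of mean one."""
    xi = [rng.expovariate(1.0) for _ in range(n)]
    xi_bar = sum(xi) / n
    return [x / xi_bar - 1.0 for x in xi]


def multiplier_process(data, weights, x):
    """Evaluate the multiplier empirical process of every margin at the point x.

    `data` is a list of n observations, each a tuple of k dependent components.
    The same weight multiplies all k components of an observation.
    """
    n = len(data)
    k = len(data[0])
    return [
        sum(w for w, obs in zip(weights, data) if obs[j] <= x) / math.sqrt(n)
        for j in range(k)
    ]


rng = random.Random(2014)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(100)]
weights = multiplier_weights(len(data), rng)
print(multiplier_process(data, weights, 0.0))
```

Because the weights sum to zero, no explicit centering by the empirical distribution function is needed in this version.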
3.2. Bootstrapping : Two Approaches
Inspired by (10) in the special case when , let where, based on the available data, with and is a uniformly consistent estimator of . Then, define the multiplier bootstrap versions of by , where A second multiplier bootstrap approach for which, unlike , requires no density estimation consists in defining for each . The asymptotic validity of these two resampling methods is established in the next proposition.
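The second approach can be read as applying the standardization functional directly to a reweighted empirical distribution, which is why no density estimate is needed. The sketch below follows that reading for a single margin; it is an interpretation under stated assumptions, not a transcription of the authors' formula. The weighted moments use the probability masses directly (divisor $n$ in the unweighted case), and all names are hypothetical.

```python
import math


def weighted_standardized_ecdf(sample, probs, x):
    """Standardized empirical d.f. at x when observation i carries mass probs[i]."""
    mean = sum(p * v for p, v in zip(probs, sample))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, sample))
    threshold = mean + math.sqrt(var) * x
    return sum(p for p, v in zip(probs, sample) if v <= threshold)


def second_bootstrap_replicate(sample, xi, x):
    """sqrt(n) times (reweighted minus original) standardized e.d.f. at x."""
    n = len(sample)
    total = sum(xi)
    boot_probs = [w / total for w in xi]  # masses xi_i / (n * mean(xi))
    equal_probs = [1.0 / n] * n  # masses of the original sample
    return math.sqrt(n) * (
        weighted_standardized_ecdf(sample, boot_probs, x)
        - weighted_standardized_ecdf(sample, equal_probs, x)
    )


sample = [1.0, 2.0, 4.0, 8.0]
print(second_bootstrap_replicate(sample, [0.3, 1.7, 0.9, 1.1], 0.3))
```

With constant multipliers the reweighted and original distributions coincide, so the replicate is exactly zero; this is a convenient sanity check for an implementation.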
Proposition 9. Under the conditions of Proposition 3, one has in the space that where are independent copies of .
3.3. Multiplier Bootstrap Versions of the Test Statistics
For each , multiplier bootstrap versions of the test statistics and are given for by Explicit and easy-to-implement formulas for these multiplier statistics are given in Appendix B. The asymptotic validity of these bootstrapped statistics is a straightforward consequence of Proposition 9 and of the continuous mapping theorem. In other words, one has the weak convergence result for , and similarly for the characteristic function statistics. Hence, asymptotically valid p-values for the tests based on and are given, respectively, for , by One rejects whenever or for some predetermined significance level . These testing procedures are consistent. Indeed, on the one hand, if the null hypothesis fails to hold, then , for each , in a subset of with nonnull probability; as a consequence, and then and in probability. On the other hand, the weak convergence result (and also for the characteristic function tests) still holds even under a failure of the null hypothesis, so that the bootstrap replications are bounded in probability. This shows that the probability of rejecting the null hypothesis tends to 1 as $n \to \infty$.
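The p-value computation described above amounts to the proportion of bootstrap replicates that exceed the observed statistic. A minimal sketch, assuming a strict inequality in the count (the exact convention is not recoverable from the text) and using a hypothetical function name:

```python
def bootstrap_p_value(statistic, replicates):
    """Proportion of multiplier replicates strictly larger than the observed statistic."""
    return sum(1 for r in replicates if r > statistic) / len(replicates)


# Illustrative numbers only, not taken from the paper.
observed = 2.0
replicates = [1.0, 3.0, 0.5, 2.5]
print(bootstrap_p_value(observed, replicates))
```

The null hypothesis is then rejected when this proportion falls below the chosen significance level.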
4. Sample Properties of the Tests
The asymptotic behavior of and as well as the asymptotic validity of two multiplier bootstrap methods has been established in the preceding sections. These results tell little, however, about the sample properties of the tests in small and moderate sample sizes. It is thus important to investigate their ability to keep their nominal level under as well as their power under various alternatives in situations when only a limited number of observations are available. This will be done here via Monte-Carlo simulations.
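The Monte-Carlo estimation of a level or a power follows a simple pattern: simulate a data set, compute a p-value, and record the proportion of rejections. The sketch below shows that loop with a hypothetical callable `simulate_p_value` standing in for the full test; rejecting when the p-value is at most the level is an assumption.

```python
import random


def rejection_rate(simulate_p_value, n_rep, alpha, rng):
    """Estimate the probability of rejection from n_rep simulated p-values."""
    rejections = 0
    for _ in range(n_rep):
        if simulate_p_value(rng) <= alpha:
            rejections += 1
    return rejections / n_rep


# Stand-in for a test with exactly uniform p-values under the null hypothesis.
def uniform_p_value(rng):
    return rng.random()


print(rejection_rate(uniform_p_value, 4000, 0.05, random.Random(1)))
```

For a test that holds its level, the printed estimate should be close to the nominal 5%.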
The models considered for the marginal distributions are the Normal (), Student with degrees of freedom (), and double-exponential ( distributions. In each of the scenarios under study, is the Normal distribution and are either the , , , , or distribution. Because the test statistics are invariant under location/scale transformations, it is enough to consider only the standard version of each of these distributions, that is, when the mean and variance are set to 0 and 1, respectively.
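Standard versions of the Student and double-exponential distributions, that is, with mean 0 and variance 1, can be simulated as sketched below. The function names are hypothetical and the degrees of freedom used in the example are an arbitrary illustrative choice; the Student distribution must have more than two degrees of freedom for its variance to exist.

```python
import math
import random


def standard_student(nu, rng):
    """Student variate with nu > 2 degrees of freedom, rescaled to unit variance."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    t = z / math.sqrt(chi2 / nu)
    return t / math.sqrt(nu / (nu - 2.0))  # the variance of t is nu / (nu - 2)


def standard_double_exponential(rng):
    """Double-exponential variate with mean 0 and variance 1 (scale 1 / sqrt(2))."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return sign * rng.expovariate(math.sqrt(2.0))


rng = random.Random(7)
draws = [standard_student(10.0, rng) for _ in range(5)]
print(draws)
```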
Because one expects that the dependence structure underlying the multivariate distribution has an influence on the results, symmetric and asymmetric dependence structures have been considered. They are all based on the Normal copula that one can extract from the multivariate Normal distribution. If is the -variate Normal density with correlation matrix , then it is implicitly given for by Note that unlike the resampling methods based on permutation, our procedure is valid under asymmetric joint distributions. Asymmetric versions of obtained from Khoudraji’s device  are given by where . In the simulation study, only the special case when was considered, yielding the asymmetric copula model .
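An asymmetric copula of the Khoudraji type can be simulated by combining a draw from the base copula with independent uniforms. The bivariate sketch below assumes the usual form of the device, $C_\delta(u_1, u_2) = u_1^{1-\delta_1} u_2^{1-\delta_2} \, C(u_1^{\delta_1}, u_2^{\delta_2})$ with $C$ the Normal copula; the function names and the parameter values in the example are illustrative assumptions, not the settings of the simulation study.

```python
import math
import random
from statistics import NormalDist


def normal_copula_pair(rho, rng):
    """One draw (W1, W2) from the bivariate Normal copula with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    phi = NormalDist().cdf
    return phi(z1), phi(z2)


def khoudraji_pair(rho, delta, rng):
    """One draw from the asymmetric copula: U_j = max(V_j^(1/(1-d_j)), W_j^(1/d_j))."""
    w = normal_copula_pair(rho, rng)
    out = []
    for w_j, d_j in zip(w, delta):
        v_j = rng.random()  # independent uniform
        if d_j >= 1.0:
            out.append(w_j)  # the independent part vanishes
        elif d_j <= 0.0:
            out.append(v_j)  # the dependent part vanishes
        else:
            out.append(max(v_j ** (1.0 / (1.0 - d_j)), w_j ** (1.0 / d_j)))
    return tuple(out)


rng = random.Random(42)
print(khoudraji_pair(0.5, (0.5, 1.0), rng))
```

Each component remains uniform on $(0, 1)$, so the device changes only the dependence structure and not the margins.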
For all the simulation results reported, the multiplier random variables are exponential with mean one. For the case , the number of bootstrap samples has been set to ; this has been reduced to when in order to speed up the simulations. For the first bootstrap strategy, the grid for the approximation has one hundred points and is a kernel density estimator based on the whole sample of the empirically standardized observations , , ; that is, where . In the simulations presented, is the Epanechnikov kernel (see Epanechnikov ) and is the optimal bandwidth for the estimation of Normal densities. For the characteristic function statistics, two weight functions have been considered. The first is , for which and ; the second is , for which and . The corresponding test statistics based on the combination matrix are referred to as and .
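The kernel density estimator used in the first bootstrap strategy can be sketched as follows. The bandwidth formula is the usual normal-reference rule for the Epanechnikov kernel, $h = (40\sqrt{\pi})^{1/5} \sigma m^{-1/5}$ for a pooled sample of size $m$; taking this as the "optimal bandwidth for Normal densities" mentioned above is an assumption, as are the function names.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def normal_reference_bandwidth(m, sd=1.0):
    """Normal-reference bandwidth for the Epanechnikov kernel and m observations."""
    return (40.0 * math.sqrt(math.pi)) ** 0.2 * sd * m ** (-0.2)


def kernel_density(pooled, x, h):
    """Kernel density estimate at x based on the pooled standardized observations."""
    return sum(epanechnikov((x - z) / h) for z in pooled) / (len(pooled) * h)


pooled = [-1.2, -0.4, 0.0, 0.3, 1.3]  # illustrative values only
h = normal_reference_bandwidth(len(pooled))
print(kernel_density(pooled, 0.0, h))
```

Since the pooled observations are empirically standardized, the default `sd=1.0` is the natural scale in this setting.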
The results for the case are in Table 1 () and Table 2 (). First, note that the ability of the tests to keep their 5% nominal level is quite good, except for when using the second bootstrap strategy; simulations not reported here indicate that as high as observations are needed in order for to keep its size, as predicted from the asymptotic theory. Overall, considering the complexity of the bootstrap methods involved in the computation of p-values, the results are very satisfactory.
The tests are also very good at rejecting the null hypothesis in case of departures from the equality of standardized distributions. Of course, the probability of rejection is higher when than when , as expected from the consistency of the tests. Also, the Normal distribution is well distinguished from the Student distribution when is low; this is also the case for the double-exponential distribution. Note that the observed powers are larger for a high level of dependence, that is, , compared to a low level of dependence, that is, . However, for a given level of dependence, the asymmetric and symmetric dependence structures yielded similar results.
The bootstrap method that is used in conjunction with a test statistic has a significant influence on the power results. For , it is when the first bootstrap is used that the best powers are observed; the opposite comment applies to . For , the results are similar for both resampling methods. Generally speaking, the best test statistic when using the first bootstrap method is , which is slightly better than ; is markedly less powerful. For the second bootstrap, the results vary with respect to the underlying dependence structure. Indeed, under Student alternatives, and have similar power, and they are significantly more powerful than ; under a double-exponential alternative, it is which is significantly more powerful than its two competitors.
The results for and are presented, respectively, in Tables 3 and 4. One can see that the comments for the case are still valid here. Hence, globally, one can say that the newly introduced test statistics and the two bootstrap methods yield very reliable statistical procedures: they keep their nominal level well under moderate sample sizes and they are powerful under a large variety of alternatives. These properties are well maintained under various dependence structures, including symmetric and asymmetric copulas, as well as negative and positive dependence.
5. Illustration on the Oil Currency Data Set
This data set consists of daily log-returns of the oil price , Standard and Poor’s 500 , and six currency exchange rates, namely, those of Great Britain pound (), United States dollar (), Swiss franc (), Japanese yen (), Danish krone (), and Swedish krona () registered from May 1985 to June 2004. It has been analyzed by Kluppelberg and Kuhn  to illustrate their newly introduced copula structure analysis. With a goodness-of-fit test specifically designed for the metaelliptical families of distributions, Quessy and Bellerive  concluded that a Student copula with an estimated sixteen degrees of freedom was suitable for these data when considering the last observations.
The tests based on , , and for the equality of the standardized distributions will now be illustrated on the twenty-eight possible pairs; the results can be found in Table 5. For each of these three test statistics, the associated p-value has been estimated from multiplier bootstrap samples from each of the two bootstrap strategies that have been proposed; these resampling methods are referred to as PV I and PV II in Table 5. Overall, the tests are generally in agreement on the acceptance or rejection of the null hypothesis, whichever bootstrap method is employed.
As a specific example, consider the pair for which the null hypothesis is accepted for each test; this formal conclusion could also be expected from the standardized densities presented in Figure 1(d). It is clear from the nonstandardized densities (Figure 1(c)) that the conclusion would have been different if the data were not standardized by their respective means and variances. From the scatterplot in Figure 1(a), one can see that there is a significant positive relationship between the two random variables; this can also be seen from the scatterplot of the normalized ranks (Figure 1(b)). This illustrates the importance of having a statistical procedure that takes into account the possible dependence. For the pair , all the tests agree on a clear rejection of ; this conclusion is in accordance with the standardized densities that one can see in Figure 2.
In order to illustrate the case , consider the quadruplet . The four superimposed estimated densities are presented in Figure 3(a). The null hypothesis of equality of the four standardized distributions is accepted by each of the tests but by a very small amount, especially for the characteristic function statistics. Indeed, one obtains (PV I = 0.112, PV II = 0.172), (PV I = 0.108, PV II = 0.080), and (PV I = 0.076, PV II = 0.068). If one removes and then restricts to the triplet , one obtains (PV I = 0.404, PV II = 0.48), (PV I = 0.24, PV II = 0.18), and (PV I = 0.28, PV II = 0.216), so that the null hypothesis is clearly accepted. This conclusion concords with the standardized densities in Figure 3(b).
A. Proofs
Proof of Proposition 2. Consider , , and suppose that converges uniformly to with respect to the usual Euclidean norm on ; that is, as ,
For a sequence of real numbers that tends to zero as and for as defined in (10), note that
where and with
The continuity of the mappings and and the fact that entail as for each . Next, the mean-value theorem ensures that there exists between and such that
Since the Hadamard derivatives of and at are given, respectively, by
the first two terms on the right hand side of the last equation tend to 0 uniformly in (in absolute value). It is also the case for the third term, so that as . Finally,
The proof is complete since both tend to zero as .
Proof of Proposition 3. From Proposition 2 and the functional delta method, converges weakly to , where is the weak limit of . Because is linear, is Gaussian and . Moreover, by the Hadamard derivative of given in (10), one deduces easily that from which obtains from straightforward computations.
Proof of Corollary 4. Simply note that , where is the identity matrix, and that ; then, apply Proposition 3 to that special case.
Proof of Lemma 5. Invoking Fubini’s theorem and using the identity , one obtains From successive integration by parts using the boundary conditions on and , one obtains
Proof of Proposition 6. First observe that, under , , where is described in Corollary 4. While is a continuous functional, it is not as straightforward for . Since , one deduces so that as represented in (20) is a continuous functional. The proof concludes upon applying the continuous mapping theorem.
Proof of Proposition 9. First Bootstrap Method. Note that where The uniform convergence of to entails that of to , because of the consistency of the first two sample moments and because is continuous. Hence, and converge to zero in probability; the uniform convergence of to zero follows from the assumption on . As a consequence, in probability. Thus, , where are independent copies of , with .
Second Bootstrap Method. The functional delta method entails Thus, , where . Hence, are independent copies of .
B. Formulas for the Multiplier Bootstrap Versions of the Test Statistics
B.1. First Bootstrap Method
For each , define , where . Then, note that , where, for a given , the entries of are given by One then obtains where, for some (sufficiently fine) grid , with , and using representation (20) for ,
B.2. Second Bootstrap Method
First note that one can write where, for a given , the entries of are . Letting , one can show that Hence, from the definition of , one obtains where, for all and , Finally, One can then show that where the entries of are
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank a referee for his valuable comments. Partial funding in support of this work was provided by the Natural Sciences and Engineering Research Council of Canada and by the Fonds Québécois de la Recherche sur la Nature et les Technologies.
References
J. Kiefer, “K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests,” Annals of Mathematical Statistics, vol. 30, pp. 420–447, 1959.
P. J. Bickel, “A distribution free version of the Smirnov two sample test in the p-variate case,” Annals of Mathematical Statistics, vol. 40, pp. 1–23, 1968.
F.-W. Scholz and M. A. Stephens, “k-sample Anderson-Darling tests,” Journal of the American Statistical Association, vol. 82, pp. 918–924, 1987.
J. Zhang and Y. Wu, “k-Sample tests based on the likelihood ratio,” Computational Statistics and Data Analysis, vol. 51, no. 9, pp. 4682–4691, 2007.
P. Martínez-Camblor and J. de Uña-Álvarez, “Non-parametric k-sample tests: Density functions vs distribution functions,” Computational Statistics and Data Analysis, vol. 53, no. 9, pp. 3344–3357, 2009.
P. Martínez-Camblor, “Nonparametric k-sample test based on kernel density estimator for paired design,” Computational Statistics and Data Analysis, vol. 54, no. 8, pp. 2035–2045, 2010.
G. Wyłupek, “Data-driven k-sample tests,” Technometrics, vol. 52, no. 1, pp. 107–123, 2010.
M. Sklar, “Fonctions de répartition à n dimensions et leurs marges,” Publications de l'Institut de Statistique de l'Université de Paris, vol. 8, pp. 229–231, 1959.
R. B. Nelsen, An Introduction to Copulas, Springer, New York, NY, USA, 2nd edition, 2006.
J. Quessy and F. Éthier, “Cramér-von Mises and characteristic function tests for the two and k-sample problems with dependent data,” Computational Statistics and Data Analysis, vol. 56, no. 6, pp. 2097–2111, 2012.
E. Parzen, “Nonparametric statistical data modeling,” Journal of the American Statistical Association, vol. 74, no. 365, pp. 105–131, 1979.
A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes. With Applications to Statistics, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
M. R. Kosorok, Introduction to Empirical Processes and Semiparametric Inference, Springer Series in Statistics, Springer, New York, NY, USA, 2008.
A. Khoudraji, Contributions à l'étude des copules et à la modélisation de valeurs extrêmes bivariées [Ph.D. thesis], Université Laval, Québec, Canada, 1995.
V. A. Epanechnikov, “Non-parametric estimation of a multivariate probability density,” Theory of Probability & Its Applications, vol. 14, pp. 153–158, 1969.
C. Kluppelberg and G. Kuhn, “Copula structure analysis,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 71, pp. 737–753, 2009.
J.-F. Quessy and R. Bellerive, “Statistical procedures for the selection of a multidimensional meta-elliptical distribution,” The Journal of the French Statistical Society, vol. 154, pp. 78–101, 2013.