Abstract

High-dimensional data with a small sample size, such as microarray data and image data, are commonly encountered in practical problems in which many variables have to be measured but repeating the measurements many times is too costly or time-consuming. Analysis of this kind of data poses a great challenge for statisticians. In this paper, we develop a new graphical method for testing spherical symmetry that is especially suitable for high-dimensional data with a small sample size. The new graphical method, together with its local acceptance regions, provides a quick visual check of the assumption of spherical symmetry. The performance of the new graphical method is demonstrated by a Monte Carlo study and illustrated with a real data set.

1. Introduction

Studies of highly complicated random systems or structures require measuring a large number of variables simultaneously, and modern technology makes it possible to collect such high-dimensional data. Because of the high cost or the great difficulty of measuring a large number of variables at the same time, high-dimensional data are commonly associated with a small sample size. For example, microarray data are usually obtained by measuring thousands of variables, but the sample size may be less than 100; image data may be obtained by measuring more than 10,000 variables at the same time, but with a sample size of only several hundred. Analysis of high-dimensional data with a small sample size has become an important research topic in statistics. Many authors have developed dimension-reduction techniques, among which sliced inverse regression (SIR; see Li [1]) and its extensions (Li [2]; Cook [3]; Cook and Li [4]) are powerful. The critical assumption that SIR places on the population is spherical or elliptical symmetry. Tests of high-dimensional spherical or elliptical symmetry therefore play an important role in the practical implementation of these dimension-reduction techniques.

Data visualization techniques are welcomed by data analysts for explaining statistical ideas and conclusions to nonstatisticians. Although many graphical methods, such as Q-Q (quantile-quantile) plots, have been developed for testing high-dimensional normality (see, e.g., Healy [5]; Small [6]; Ahn [7]; Koziol [8]; Brown and Hettmansperger [9]; Liang and Bentler [10]), there is none for testing spherical symmetry except the Q-Q plots proposed by Li et al. [11]. These existing plotting methods require a sufficiently large sample size to be used effectively. The goal of this paper is to tackle the challenge of high dimension with a small sample size in testing spherical symmetry by a plotting method.

Let $\mathbf{x}$ be a $p$-dimensional random vector. $\mathbf{x}$ is said to have a spherical distribution if, for any $p \times p$ constant orthogonal matrix $\Gamma$, $\Gamma\mathbf{x}$ and $\mathbf{x}$ have the same distribution. It is well known that if a random vector $\mathbf{x}$ has a spherical distribution with a density, then its density must be of the form $f(\mathbf{x}'\mathbf{x})$ for some nonnegative scalar function $f(\cdot)$ (see, e.g., Fang et al. [12]). Many well-known multivariate distributions, such as the multivariate standard normal distribution, the multivariate $t$-distribution with zero mean and an identity matrix as its covariance matrix, and any scale mixture of the multivariate standard normal distribution, are spherical distributions. See Chapter 3 of Fang et al. [12] for more examples of spherical distributions.

A goodness-of-fit test of spherical symmetry tests whether an i.i.d. (independent and identically distributed) $p$-dimensional ($p \geq 2$) sample comes from a population with a spherical distribution. Fang and Liang [13] gave an overview of existing methods for tests of spherical symmetry and their extensions to testing elliptical symmetry. Several other statistics for testing spherical symmetry have been proposed in the last few years (e.g., Koltchinskii and Li [14]; Diks and Tong [15]). There is no empirical study on the performance of these existing methods when applied to testing spherical symmetry for high-dimensional data with a small sample size. It is for this purpose that we develop a new graphical method in this paper. The new graphical method combines a property of spherical distributions with the idea of the $T_3$-plot, which was proposed by Ghosh [16] for detecting univariate nonnormality. Hence, we call this version of the $T_3$-plot the $TS_3$-plot, where “$S$” stands for sphericity. The $TS_3$-plot has the special property that it remains useful when the sample size is smaller than the dimension of $\mathbf{x}$. This property gives the $TS_3$-plot an advantage in the case of high dimension with a small sample size, as will be demonstrated in Section 3.

The main idea in constructing the $TS_3$-plot is to project the high-dimensional data onto a one-dimensional space such that the $T_3$ statistic [16] computed from the projected data behaves like the one computed from an i.i.d. sample from the univariate standard normal distribution. Thus, a critical issue arising from the projection is how to select some “good” projection directions. We address this issue in Section 2.

This paper is organized as follows. In Section 2 we give a brief review of Ghosh’s [16] $T_3$-plot for detecting univariate nonnormality and discuss the theoretical principle for deriving the projection directions. In Section 3, we provide local acceptance regions for the $TS_3$-plot and simulation results on its performance in testing high-dimensional spherical symmetry based on a Monte Carlo study; an application of the $TS_3$-plot to a real high-dimensional data set with a small sample size is also illustrated. Some concluding remarks are given in Section 4.

2. The $TS_3$-Plot for Testing High-Dimensional Spherical Symmetry

2.1. Background of the $TS_3$-Plot

The $TS_3$-plot looks for evidence of departure from a spherical distribution in an i.i.d. $p$-dimensional sample. The marginal distributions of spherical distributions behave very much like those of the multivariate normal distribution. For example, all univariate (1-dimensional) marginals of a multivariate normal distribution are still normal. By comparison, all univariate marginals of a spherical distribution are scale-invariant. This scale invariance is an important characteristic of the family of spherical distributions; further discussion and results derived from it can be found in Fang et al. [12]. Because most existing plotting methods for testing goodness of fit apply only to univariate distributions, the key idea of the $TS_3$-plot is to extend the existing univariate $T_3$-plot [16] to testing multivariate spherical symmetry along some “special principal component” directions, employing the same idea as in Fang et al. [17]. Section 2.2 gives a brief review of Ghosh’s [16] $T_3$-plot and Section 2.3 summarizes the theoretical details of its extension to the $TS_3$-plot.

2.2. A Review of the $T_3$-Plot

Let be an i.i.d. univariate sample and . Testing goodness-of-fit for univariate normality is to test whether the underlying distribution of the sample is normal with unknown and . The EMGF (empirical moment generating function) for the studentized data is defined as where and are the sample mean and the sample standard deviation, respectively, stands for the set of all real numbers. Denote the th derivative of with respect to by . Ghosh [16] defined the (a function of for fixed ) as The graphical method for detecting nonnormality of the underlying distribution of the sample is based on .

It is noted that is a stochastic process with an index (for some ). Under the normal assumption, Ghosh [16] obtained the asymptotic distribution of , which is a zero-mean Gaussian process with a covariance function (). In particular, The behavior of at reflects evidence of departure from normality for the sample . For example, is proportional to the sample skewness and the slope of at is proportional to the sample kurtosis. Fang et al. [17] proposed the following Kolmogorov-Smirnov (K-S) type statistic : as an analytical test associated with the -plot. Large values of indicate nonnormality of the univariate sample . The exact finite-sample distribution of is not readily obtained under the normal assumption but its percentiles can be well fitted by a quadratic function of using the least squares method. Fang et al. [17] provided these quadratic functions of for significance levels , , and based on a Monte Carlo study.
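The displayed formulas for the EMGF and for Ghosh’s function did not survive in this text, so the following Python sketch rests on an assumption: the function is taken to be $\sqrt{n}$ times the third derivative of the logarithm of the EMGF of the studentized data, which is how Ghosh’s plot is usually described. The name `t3_function`, the $n-1$ divisor in the standard deviation, and the grid on $[-1, 1]$ are illustrative choices rather than details confirmed by the paper.

```python
import numpy as np

def t3_function(x, t_grid):
    """Assumed form: sqrt(n) * d^3/dt^3 log M_n(t), M_n = EMGF of studentized data."""
    x = np.asarray(x, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)          # studentized data (n-1 divisor assumed)
    t = np.asarray(t_grid, dtype=float)[:, None]
    w = np.exp(t * z[None, :])                  # exp(t * z_i), one row per value of t
    m0 = w.mean(axis=1)                         # M_n(t)
    m1 = (w * z).mean(axis=1)                   # first derivative of M_n
    m2 = (w * z**2).mean(axis=1)                # second derivative of M_n
    m3 = (w * z**3).mean(axis=1)                # third derivative of M_n
    # Third derivative of log M_n written in terms of M_n and its derivatives.
    third = m3 / m0 - 3 * m1 * m2 / m0**2 + 2 * m1**3 / m0**3
    return np.sqrt(n) * third

t_grid = np.linspace(-1.0, 1.0, 21)
sample = np.random.default_rng(0).standard_normal(30)
curve = t3_function(sample, t_grid)             # values to plot against t_grid
```

Under this assumed form the value at $t = 0$ reduces to $\sqrt{n}$ times the mean of the cubed studentized data, which agrees with the statement that the function at zero is proportional to the sample skewness.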

2.3. Extension of the $T_3$-Plot to the $TS_3$-Plot

Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be an i.i.d. $p$-dimensional sample from a population characterized by a $p$-dimensional random vector $\mathbf{x}$. We want to test spherical symmetry of the sample. The theoretical principle for extending the $T_3$-plot to the $TS_3$-plot is based on Lemma 2.1 and Theorem 2.2 below.

Lemma 2.1 (Theorem 2.22 of Fang et al., Chapter 2 [12]). Let be a statistic based on a (not necessarily i.i.d.) sample such that (scale invariance) for any constant . If has a spherical distribution, then where the sign “” means that the two sides of the equality have the same distribution.

Lemma 2.1 implies that any scale-invariant statistic of a spherically distributed random vector has the same distribution as the same statistic evaluated at a standard normal random vector of the same dimension.
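Lemma 2.1 can be illustrated numerically. The Python sketch below assumes a spherical vector built as a scale mixture of a standard normal vector (one of the examples named in Section 1) and uses the sample skewness as a hypothetical scale-invariant statistic; neither choice is taken from the paper. Because the random radial factor cancels, the statistic takes the same value as on the underlying normal vector, which is the content of the lemma for this family.

```python
import numpy as np

def skewness(v):
    """Sample skewness; unchanged when v is multiplied by any constant a > 0."""
    c = v - v.mean()
    return np.mean(c**3) / np.mean(c**2) ** 1.5

rng = np.random.default_rng(0)
p = 50
z = rng.standard_normal(p)            # standard normal vector
r = np.sqrt(5 / rng.chisquare(5))     # positive random scale, independent of z
x = r * z                             # multivariate t with 5 d.f., a spherical vector

# The statistic cannot tell the spherical vector from the normal one.
print(np.isclose(skewness(x), skewness(z)))
```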

Theorem 2.2. Let be i.i.d. with a -dimensional spherical distribution and , and matrix . Assume that is a vector function that is uniquely determined by . Define the random vector Then has a -dimensional spherical distribution.

Proof. First, we point out that the random matrix has a left spherical matrix distribution [18]. That is, it satisfies for any orthogonal matrix that is independent of . We can write the random vector in (2.7) as that is, we obtain for any orthogonal matrix that is independent of . This shows that the random vector given by (2.7) has a spherical distribution by definition. This completes the proof.

The random vector in (2.7) acts as a direction for projecting a left spherically distributed random matrix into a spherically distributed random vector by (2.7). This idea is due to Läuter [19] in constructing tests for the multivariate normal mean in with an unknown .
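The key step of the proof, that rotating the data rotates the projected vector in the same way, can be checked numerically. The Python sketch below assumes that the observation matrix is arranged as $p \times n$ (one column per observation) and uses the direction $\mathrm{Diag}(X'X)^{-1/2}\mathbf{1}_n$ as one admissible choice of the vector function. The function names and this reading of the matrix layout are assumptions, not the paper’s notation.

```python
import numpy as np

def ss_direction(X):
    """Direction that depends on X only through X'X (assumed p x n layout)."""
    gram = X.T @ X                        # n x n, unchanged if X is replaced by Gamma @ X
    return 1.0 / np.sqrt(np.diag(gram))   # Diag(X'X)^(-1/2) applied to a vector of ones

def project(X):
    """Project the p x n matrix X to a p-vector along ss_direction(X)."""
    return X @ ss_direction(X)

rng = np.random.default_rng(1)
p, n = 40, 8                              # dimension larger than sample size
X = rng.standard_normal((p, n))
gamma, _ = np.linalg.qr(rng.standard_normal((p, p)))  # a p x p orthogonal matrix

# Rotating the data first, or rotating the projection afterwards, gives the same vector.
print(np.allclose(project(gamma @ X), gamma @ project(X)))
```

The projected vector has length $p$, so any univariate plot applied to its components works with $p$ values even when $n$ is much smaller than $p$.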

Based on Lemma 2.1 and Theorem 2.2, we can extend the -plot to the -plot for testing high-dimensional spherical symmetry. At first, we point out that the function given by (2.2) is scale-invariant, that is, for any constant . Therefore, by Lemma 2.1, if is spherically distributed, then Similarly, for the K-S type statistic given by (2.4), it is also true that if is spherically distributed.

Now we propose a series of necessary tests for high-dimensional (the dimension of data is very large) spherical symmetry by Theorem 2.2. The meaning for necessary test is the same as in Fang et al. [20]. That is, when the null hypothesis is not rejected, it implies insufficient information to draw a statistical conclusion from the sampled data. Instead of testing the hypothesis of spherical symmetry for an i.i.d. sample directly, we turn to test a series of hypotheses defined by versus given by (2.7) is nonspherical. Hypothesis (2.14) is for any possible choices of in Theorem 2.2. So hypothesis (2.14) comprises a family of tests for spherical symmetry of . If any of the possible choices of leads to rejection of in (2.14), the hypothesis of spherical symmetry will be also rejected. So any test for (2.14) is a necessary test for the hypothesis of spherical symmetry for the original sample. The -function in (2.2) based on (computed from (2.7)) becomes where where and are the sample mean and the sample standard deviation calculated from by Theorem 2.2. The K-S type statistic in (2.4) becomes where is given by (2.15) and by (2.3).

By (2.12) and (2.13), the principle for using the -plot to detect high-dimensional nonspherical symmetry can be summarized as: plot the -function given by (2.15) versus . If the plot shows a significant departure from the horizontal axis in , hypothesis (2.14) is rejected, and as a result, the i.i.d. sample can be considered from a population of nonspherical distribution. The K-S type statistic in (2.17) can be employed to evaluate the significance of departure from spherical symmetry.

There are clearly numerous choices of the vector function in Theorem 2.2 for constructing the projection direction. In the next section we study the empirical performance of the choices recommended by Läuter [19] and Läuter et al. [21].

3. A Monte Carlo Study

3.1. An Overview

Without acceptance regions attached to the plots, graphical methods can serve only as descriptive statistical inference. One of the attractive features of Ghosh’s [16] $T_3$-plot for detecting departure from univariate normality is its associated acceptance regions, which make it possible to use the plotting method for analytical statistical inference. The purpose of the Monte Carlo study in this section is to provide simulated acceptance regions for the $TS_3$-plot developed in Section 2, using a Monte Carlo method similar to that in Ghosh [16] and in Fang et al. [17]. Based on the acceptance regions, the empirical performance of the $TS_3$-plot can be partially evaluated by counting the rejection rate (type I error rate) for a selected set of spherical distributions (which serve as the null hypothesis) and the rejection rate (empirical power) for a selected set of nonspherical alternative distributions (which serve as the alternative hypothesis).

3.2. The Local Acceptance Regions

A local acceptance region for the plot of given by (2.15) is a critical band for the plot on . If the plot goes outside the critical band, it is an indication that is not spherically distributed, and as a result, the hypothesis of spherical symmetry is rejected. A critical band can be constructed by simulating the percentiles of the finite-sample distribution of the K-S type statistic in (2.17). By Lemma 2.1, we have if the null hypothesis in (2.14) is true. Therefore, in simulating the percentiles of the finite-sample null distribution of in (2.17), we can simply generate the data for from the standard normal . By generating the normal data for with 2,000 replications, we record the -percentiles (e.g., ) of , where the in (3.1) is approximately calculated by taking the supremum on the discrete values of (i.e., ). A quadratic curve of , is fitted for the -percentiles of by the least squares method to find the approximate relation of to the sample dimension . The following quadratic curves based on values (2) 50 (i.e., ) were obtained from the least squares fitting: The quadratic curves given by (3.2) are suitable for estimating the percentiles of the K-S type statistic for in the range . Figure 1 shows the plots of the simulated percentiles of and the estimated percentiles given by (3.2) by the least squares method. The fit in Figure 1 seems to be acceptable.

By using the quadratic curves given by (3.2) for estimating the finite-sample percentiles of the statistic in (2.17) for dimension , the acceptance region for the -function in (2.15) for testing spherical symmetry is given by where is given by (3.2) and is given by (2.3). We will call the acceptance region determined by (3.3) the local acceptance region for the -plot in testing high-dimensional spherical symmetry. When the plot of () goes outside the acceptance region determined by (3.3), hypothesis (2.14) is rejected, and as a result, the underlying distribution of the sample shows evidence of nonspherical symmetry.
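The simulation behind a local acceptance region can be sketched in Python as follows. Several simplifications are assumed here and are not the paper’s procedure: the plotted function is taken to be $\sqrt{p}$ times the third derivative of the log-EMGF of the studentized projected data, the supremum is taken over an evenly spaced grid on $[-1, 1]$ without the variance weighting of (2.3), and the replication count is kept small so that the example runs quickly. The names `sup_statistic`, `simulate_percentile`, and `accept` are hypothetical.

```python
import numpy as np

T_GRID = np.linspace(-1.0, 1.0, 41)       # assumed grid for the supremum

def t3_function(z, t_grid=T_GRID):
    """Assumed form: sqrt(p) * third derivative of the log-EMGF of studentized z."""
    z = np.asarray(z, dtype=float)
    u = (z - z.mean()) / z.std(ddof=1)
    w = np.exp(t_grid[:, None] * u[None, :])
    m0, m1, m2, m3 = (np.mean(w * u**k, axis=1) for k in range(4))
    return np.sqrt(z.size) * (m3 / m0 - 3 * m1 * m2 / m0**2 + 2 * m1**3 / m0**3)

def sup_statistic(z):
    """Unweighted sup-type statistic on the grid (a simplification)."""
    return np.max(np.abs(t3_function(z)))

def simulate_percentile(p, alpha, n_rep=500, seed=0):
    """Upper alpha-percentile of the statistic when z is standard normal of length p."""
    rng = np.random.default_rng(seed)
    stats = [sup_statistic(rng.standard_normal(p)) for _ in range(n_rep)]
    return float(np.quantile(stats, 1.0 - alpha))

def accept(z, critical_value):
    """True if the whole curve stays inside the simulated band."""
    return sup_statistic(z) <= critical_value

c05 = simulate_percentile(p=50, alpha=0.05)
z = np.random.default_rng(42).standard_normal(50)
print(accept(z, c05))
```

Simulating from the standard normal is justified by Lemma 2.1, because the statistic is scale-invariant. Repeating `simulate_percentile` over several values of $p$ would give the points to which a quadratic curve could then be fitted by least squares.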

3.3. Type I Error Rates

We already pointed out that there are numerous choices for the projection direction in (2.7) to plot the -function (2.15). We will perform a Monte Carlo study on the following choices of that were suggested by Läuter [19], and Läuter et al. [21].(1)Solution to the eigenvalue problem: where , with . To ensure the unique solution, the random matrix is assumed to have positive diagonal elements. The following directions are chosen: where the sign stands for the integer part () of a real number (e.g., and ).(2)The direction based on the SS-test discussed by Läuter et al. [19]. Choose where denotes the diagonal matrix with the same diagonal elements as those of , and is an vector of ones.(3)The direction based on the PC-test discussed by Läuter et al. [21]. Let be the solution matrix to the eigenvalue problem where similar conditions to those on the matrices and in (3.4) are imposed on the matrices and in (3.7) to ensure the unique solution. Let . The three directions are chosen: where is given by (3.5).

The following seven directions are chosen for a Monte Carlo study on the type I error rates and power when using the -plot for testing spherical symmetry:

The Monte Carlo study on type I error rates of the local acceptance region (3.3) is carried out by generating spherical samples from the following six spherical distributions by MATLAB code. These null distributions are discussed in detail in Chapter 3 of Fang et al. [12]. Here we only point out the corresponding parameters for the chosen spherical distributions without explaining their meaning. The six chosen spherical (null) distributions are: (1) the standard normal distribution ; (2) the multivariate -distribution with degrees of freedom ; (3) the Kotz type distribution with , , and ; (4) the Pearson type VII distribution (PVII) with and ; (5) the Pearson type II distribution (PII) with ; (6) the Cauchy distribution. The TFWW algorithm [22] pages 166–170 [23], is employed to generate empirical samples from these spherical distributions except the normal distribution whose samples can be generated from the MATLAB internal function. Table 1 gives the empirical type I error rates () of the -plot in testing spherical symmetry for dimensions and , where the seven directions are given by (3.9) and the type I error rates were calculated by The local acceptance region given in (3.3) for is used to count the number of rejections for the selected null distributions. The simulation was done with 2,000 replications.

Based on Table 1, we can summarize the following empirical conclusions: (1)the -plot seems to have better control of the type I error rates by using the five directions than by using the two directions and , which tend to have lower type I error rates than the significance level . The -plot based on these two directions may over-accept the null hypothesis of spherical symmetry;(2)the performance of the -plot on controlling the type I error rates tends to be slightly affected by the sample size. This may be due to the arrangement of the observation matrix in Theorem 2.2. So we can expect the -plot to have good control on type I error rates in the case of high dimension with a small sample size. This is a good indication for high-dimensional data analysis.

For and , we obtained similar results to those in Table 1 on the type I error rates of the -plot by using the acceptance region (3.3) and the same directions in (3.9). These are not presented to save space.

3.4. Power Study

The power of the -plot in testing spherical symmetry is computed by using the acceptance regions (3.3) and the formula given by (3.10). The KS-type statistic (2.17) is computed in the same way as in Section 3.2. The following six nonspherical alternative distributions are selected: (1)the multivariate -distribution comprises i.i.d. -variables;(2)the multivariate exponential distribution comprises i.i.d. univariate exponential variables with a density function , ;(3)the multivariate gamma distribution that comprises of i.i.d. gamma variables, each has a gamma distribution with a density ();(4)the distribution nor comprises i.i.d. marginals, marginals have a standard normal distribution and marginals have a chi-squared distribution . Here is the sample dimension;(5)the distribution nor + Cauchy comprises two independent marginals, one is the -dimensional standard normal distribution and the other has a -dimensional Cauchy distribution;(6)the distribution Kotz has a similar meaning to that for the distribution given by (5), one marginal has a -dimensional -distribution with parameter and the other has a -dimensional Kotz type distribution with parameters , , and .

By a standard Monte Carlo technique, a sample matrix can be generated from the above six nonspherical alternative distributions according to their marginal distributions. Then is centerized by where the mean value is taken for each element of . By this way, we obtain nonspherical samples distributed in the space like those of spherical samples. Table 2 presents the power () of the -plot in testing spherical symmetry for the six nonspherical distributions by using the seven directions in (3.9).

Based on Table 2, we can summarize the following empirical conclusions: (1)the -plot based on the direction remarkably outperforms (has much higher power) the other six directions under the four choices of the sample sizes and almost all of the selected nonspherical distributions. So can be considered as a general choice for the -plot in testing high-dimensional spherical symmetry;(2)the -plot is slightly affected by an increase of the sample size. Because of the rearrangement of the observation matrix , instead of , the sample size is implicitly taken as the sample dimension , and the sample dimension is taken as the sample size . This is a rotation of the regular observation matrix . This rotation results in a power decrease of the -plot when the sample size increases, and it results in a power increase of the -plot when the sample dimension increases. This can be observed from Table 2. Therefore, the -plot is especially suitable for testing very high-dimensional spherical symmetry with a relatively small sample size.

3.5. Practical Illustration

To illustrate how to apply the -plot in practice, we employ a subset of a real data set. The data set was used in Walker and Wright [24] and was called the VDP (vertical density profile) data set. As described by Walker and Wright [24], manufacturers of engineered wood boards, which include particle board and medium density fiberboard, are very concerned about the density properties of the board produced. The density is measured using a profilometer which uses a laser device to take a series of measurements across the thickness of the board. A profilometer takes multiple measurements on a sample (usually a inch piece) to form the vertical density profile of the board. The VDP data subset that we illustrate here consists of 45 measurements taken 0.014 inch apart, and comprises 2 groups: (A)group A consists of 9 subjects ;(B)group B consists of 11 subjects .

We can consider each subject as an observation with 45 measurements, so that each observation has dimension 45. Based on the structure of the complete VDP data set, groups A and B have sample sizes of 9 and 11, respectively. Because a spherical distribution always has a zero mean vector, the two selected groups of VDP data should be shifted to have the origin as their central location. This is done by subtracting the group sample mean from each observation in groups A and B, respectively. That is,
CA. centerized subgroup A: the sample mean of the 9 subjects is subtracted from each observation in subgroup A;
CB. centerized subgroup B: the sample mean of the 11 subjects is subtracted from each observation in subgroup B.

After the centerization, each of the above subgroups is comparable to a sample from a spherical distribution, which has a zero mean.
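The centerization step can be sketched in Python as below, assuming each group is stored as an array with one row per subject and one column per measurement. The array is filled with random placeholder values of the same shape as group A (9 subjects, 45 measurements); it is not the VDP data.

```python
import numpy as np

def centerize(group):
    """Subtract the group sample mean (taken over subjects) from every observation."""
    group = np.asarray(group, dtype=float)
    return group - group.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
group_a = rng.normal(loc=40.0, scale=3.0, size=(9, 45))  # placeholder values only
ca = centerize(group_a)

# After centerization the sample mean of each of the 45 measurements is zero.
print(np.allclose(ca.mean(axis=0), 0.0))
```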

For illustration purpose, in choosing the projection directions for the -plots for the data of group A and group B as defined above, we consider four directions:(1) determined by (3.5);(2) determined by (3.5);(3) determined by (3.8);(4) determined by (3.8).

On each of these four directions, the -function (2.15) (i.e., the -plot) and the acceptance region given by (3.3) are plotted in Figure 2 for the centerized subsets CA and CB.

The following facts can be observed:(1)the -plot for the centerized data in subgroup A (the CA plots) at the projection direction goes beyond the -acceptance region, showing evidence of nonspherical symmetry of the data at the significance level . As a result, it can be concluded that the null hypothesis of spherical symmetry for the centerized data in subgroup A is rejected at ;(2)the -plots for the centerized data in subgroup B (the CB plots) at the projection directions , , and go beyond the -acceptance regions, showing evidence of nonspherical symmetry of the data at the significance level . As a result, it can be concluded that the null hypothesis of spherical symmetry for the centerized data in subgroup B is rejected at .

The evidence of nonspherical symmetry of the data in subgroups A and B above implies that it may be inappropriate to set up a random effects model with spherically distributed random effects for the centerized data ( for subgroup A and for subgroup B), where stands for a scale parameter. The illustration of detecting nonspherical symmetry in Figure 2 could provide a way to regression diagnostics with the assumption of spherically distributed error terms. One of the regression diagnostic techniques is to check if the residual random vectors like are approximately i.i.d. -dimensional normal deviates by using the probability-plot method, where stands for the identity matrix and denotes an unknown standard deviation. If it shows a lack of fit for the normal assumption, one could consider testing the spherical symmetry for the residual random vectors by providing the -plots as in Figure 2. If no evidence of nonspherical symmetry can be detected from the -plots, the regression model for the observed data could be extended to have spherically distributed error terms. More discussion on regression models with spherically distributed error terms and statistical inference under elliptical distributions (which contains spherical distributions as a special case) can be referred to Fraser and Ng [25] and Fang and Zhang [18]. The illustration of detecting nonspherical symmetry for high-dimensional data by the -plot provides a graphical tool for goodness-of-fit problems in generalized multivariate analysis.

4. Concluding Remarks

Ghosh’s [16] original $T_3$-plot is an effective graphical method for detecting nonnormality of univariate data. The $T_3$-plot was extended to detecting nonmultinormality of high-dimensional data by Fang et al. [17]. In this paper we found another application of the $T_3$-plot, in testing high-dimensional spherical symmetry, by providing approximate local acceptance regions. The simulation results in Section 3 show that the local acceptance regions given by (3.3) perform acceptably. Although we have not been able to find an optimal projection direction for applying the $TS_3$-plot to real high-dimensional data analysis, the directions used in the Monte Carlo study in Section 3 can provide potential users with a good reference. Theoretically, any projection direction satisfying the condition in Theorem 2.2 can be used with the $TS_3$-plot for testing high-dimensional spherical symmetry. Some directions may perform better than others, as demonstrated in Section 3. For general purposes, the idea behind principal component analysis can provide a guideline for choosing projection directions when applying the $TS_3$-plot to real high-dimensional data analysis. This is illustrated by (3.4), (3.7), and the VDP data set in Section 3.

In this paper we emphasize the $TS_3$-plot for testing spherical symmetry in the case of high dimension with a small sample size. For regular cases of testing spherical symmetry, the Q-Q plots proposed by Li et al. [11], the analytical methods summarized in Fang and Liang [13], or the methods in the more recent references cited in Section 1 should be used. In the past few years there has been a lack of new effective methods for the analysis of high-dimensional data with a small sample size, so the $TS_3$-plot in this paper sheds some additional light on the area of high-dimensional data analysis.