Abstract

This paper proposes computationally tractable methods for selecting the appropriate spatial weighting matrix in the context of a spatial quantile regression model. This selection is a notoriously difficult problem even in linear spatial models and is even more difficult in a quantile regression setup. The proposal is illustrated by an empirical example and manages to produce tractable models. One important feature of the proposed methodology is that, by allowing different degrees and forms of spatial dependence across quantiles, it further relaxes the usual quantile restriction attributable to the linear quantile regression. In this way we can obtain a model that is more robust to potential functional misspecification, while nevertheless preserving the parametric rate of convergence and the established inferential apparatus associated with the linear quantile regression approach.

1. The Spatial Quantile Regression Model

The spatial quantile regression model [1] is a straightforward quantile regression generalisation of the linear spatial lag model popular in spatial econometrics. More specifically, it can be written as

$$y = \lambda(\tau)\,Wy + X\beta(\tau) + u(\tau),$$

where $Wy$ is a spatially lagged dependent variable, specified via a predetermined spatial weighting matrix $W$, $X$ is the design matrix containing the independent variables (covariates), and $u(\tau)$ is a residuals vector whose $\tau$th conditional quantile is zero. Here we only have one spatially lagged dependent variable, but this is not an essential assumption, and more than one spatial weighting matrix can easily be incorporated. This representation is similar to the linear spatial lag regression model, but here the coefficients $\lambda(\tau)$ and $\beta(\tau)$ are allowed to vary with the quantile $\tau$, rather than being assumed fixed.

This model has some attractive properties. First, the original motivation for Kostov’s [1] proposal is to alleviate the potential bias arising from inappropriate functional form assumptions in a spatial model. In simple terms the underlying logic is as follows. In the presence of spatial lag dependence, estimation is typically biased both when the spatial dependence is omitted and when the wrong functional form specification is employed. Hence a natural way to circumvent the problem is to estimate the underlying function nonparametrically. The sample sizes used in many empirical studies are, however, often too small for efficient application of nonparametric methods. Semiparametric methods could then be used to alleviate the problem. The linear quantile regression is such a semiparametric method. Although it cannot be guaranteed to entirely eliminate the adverse effects of functional form assumptions, such methods can greatly reduce them. In particular, Kostov [1] argues that for a typical hedonic model the (linear) quantile restriction is appropriate.

A major advantage of the quantile regression approach is the opportunity to estimate a flexible semiparametric model which is nevertheless characterised by a parametric rate of convergence, thus making it suitable for empirical analysis in small sample cases. Furthermore, a well-developed set of tools for efficient inference is available (see [1] for details).

Spatial modelling has, however, been focused mostly on estimation issues. For example, Kostov [1] assumes that the exact form of the process generating the spatial dependence is given. This is a typical assumption of an “estimation focused” approach to spatial modelling, in that the spatial weighting matrix used to specify the model is taken as known. The spatial weighting matrix is, however, a part of the specification process: it needs to be prespecified. There could be cases where the underlying theoretical model provides some guidance, but more often than not this is not the case. Consequently, in empirical applications of spatial models the selection of spatial weighting matrices is characterised by a great deal of arbitrariness. This arbitrariness presents a serious problem for inference in such models, since estimation results have been shown to depend critically on the choice of spatial weighting matrix [2–4]. Even more importantly, there is an interplay between the spatial weighting matrix and the functional form choice. Using the wrong spatial weighting matrix has, broadly speaking, the same implications as ignoring existing spatial dependence. Therefore functional form and spatial weighting matrix specification have to be considered simultaneously. The problem is not as severe in nonparametric models, because most nonparametric estimation methods are typically consistent even in the presence of spatial dependence. The wrong spatial weighting matrix would, however, still introduce inefficiency in the nonparametric estimates, which with smaller samples can seriously impede inference. In a parametric setup, the wrong spatial weighting matrix introduces bias even when the right functional form is used.

2. Selection of Spatial Weighting Matrix

Owing to these considerations it would be advantageous to have methods to choose an appropriate spatial weighting matrix. Selecting the “right” spatial weighting matrix can serve a twofold purpose. First, it will increase the efficiency of the model estimates, as discussed previously. Second, when the nature of the process generating spatial dependence is of particular interest (e.g., in social interaction models), the form of the spatial weighting matrices consistent with that data generation process becomes a major inferential problem. In such cases we need to find the appropriate spatial weighting matrix, since this is the explicit subject of the research problem. In this paper we consider the issue in a spatial quantile regression framework.

In the following we briefly review some approaches designed to reduce the arbitrariness of spatial weighting matrix choice (mostly) in linear models. We then discuss possible extensions to the spatial quantile regression. The approach taken in this paper falls within the framework of selecting the spatial weighting matrix, either implicitly or explicitly, from a predefined set of candidates.

Holloway and Lapar [5] used a Bayesian marginal likelihood approach to select a neighbourhood definition (cut-off points for the neighbourhood), but one can consider their approach as a general model selection approach, which could be applied to any other set of competing models. A particularly active strand of research is concerned with Bayesian model averaging (BMA) approaches. LeSage and Parent [6] proposed a BMA procedure for spatial models which incorporates the uncertainty about the correct spatial weighting matrix. LeSage and Fischer [7] extended the latter approach into an MC3 (Markov Chain Monte Carlo Model Composition) method to select an inverse distance nearest neighbour type of spatial weighting matrix for the linear spatial model. Crespo-Cuaresma and Feldkircher [8] further extend this procedure to deal with different types of spatial weighting matrices by introducing Bayesian model averaging inference conditional on a given spatial weighting matrix. Crespo-Cuaresma and Feldkircher [8] use spatial filtering to resolve the endogeneity issue and in this way focus on the regression part of the model rather than on the spatial dependence itself. The approach above implicitly assumes that the spatial dependence can be characterised by a single spatial weighting matrix. This assumption can be relaxed, but at a considerable computational cost. Eicher et al. [9] proposed an instrumental variables Bayesian model averaging procedure, which is essentially a hierarchical Bayesian counterpart to the frequentist two-step estimation that accounts for model uncertainty in both steps. Although Eicher et al. [9] deal with the more general issue of endogeneity rather than with spatial dependence as such, spatial lag dependence is a particular type of endogeneity, so their approach can readily be applied to spatial lag models.

Finally, from a non-Bayesian point of view, Kostov [10] suggested a two-step procedure for selecting a spatial weighting matrix that is applicable to a wide range of prespecified candidates. This procedure is motivated by considerations specific to spatial models (and the proposed computational algorithms are tuned for this purpose), but otherwise it deals with the endogeneity problem in the same way as Eicher et al. [9].

3. Proposal Outline

This paper proposes extending the methodology adopted in Kostov [10] to a quantile regression setting. In what follows we first briefly explain the previously mentioned approach. We then highlight the particularities of extending this procedure to quantile regression models. Furthermore, we briefly comment on the alternative options and the reasons for the specific choices we adopt. Our contribution is twofold. First, we adapt the approach of Kostov [10] to a (linear) quantile regression model. Second, since, as we explain later, the original approach has a prediction focus, we further expand it to focus on structure discovery (i.e., identifying the “true sparsity pattern”).

Kostov’s [10] approach is based on Kelejian and Prucha’s [11] two-stage least squares method for estimating spatial models. In this method, spatially lagged independent variables are used as instruments for the spatially lagged dependent variable. The first step (instrumentation) is a least squares regression of the lagged dependent variable on the lagged independent variables. In the second step, the fitted values from the first stage regression replace the original endogenous variable in the estimation of the model’s coefficients. Kostov [10] retains the first step of this procedure (which projects the spatially lagged dependent variable into the vector space of the instruments). He, however, suggests implementing this first step for a number of different spatial weighting matrices, resulting in an augmented second stage model that includes a large number of variables transformed in the first step (one for each of the candidate spatial weighting matrices). In this way the problem of choosing a spatial weighting matrix becomes a variable selection problem (amongst the previously mentioned transformed variables). The other interesting feature of Kostov’s [10] paper is the application of a component-wise boosting algorithm as a variable selection method in the second step. Any other variable selection method could be used, but Kostov’s [10] choice is mainly motivated by computational considerations in dealing with a large number of potential alternatives.
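This first step is easy to express in code. The following base R sketch (the function name is ours, and Ws is assumed to be a named list of candidate spatial weighting matrices) computes, for each candidate matrix, the least squares projection of the spatially lagged dependent variable onto the spatially lagged covariates:

first_stage_ols <- function(y, X, Ws) {
  sapply(Ws, function(W) {
    Wy <- as.vector(W %*% y)   # spatially lagged dependent variable
    WX <- W %*% X              # spatially lagged covariates (the instruments)
    fitted(lm(Wy ~ WX))        # projection of Wy onto the instrument space
  })
}

Each column of the result is one transformed variable; appending these columns to X turns the choice of spatial weighting matrix into a variable selection problem.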

In a nutshell, the approach of Kostov [10] amounts to transforming the spatial weighting matrix selection problem into a high-dimensional (due to the potentially large number of alternatives) variable selection problem, to which “standard” methods can be applied. The crucial point in Kostov’s [10] approach is the equivalence established between the two-stage spatial least squares method and the proposed component-wise boosting alternative. Therefore, in order to extend the same logic to a spatial quantile regression model, we need an estimation method for such a model and a variable selection equivalent of that method. We will deal with these two issues in turn.

The first issue is the estimation method for spatial quantile regression. We are aware of two main approaches able to estimate such models consistently. The first is the application of Zietz et al. [12], who use the results of Kim and Muller [13] for quantile regression estimation under endogeneity. The other approach is presented in Kostov [1], who builds upon the methods developed by Chernozhukov and Hansen [14, 15]. In Kostov’s [1] application one minimises a matrix norm over a range of values for the spatial dependence parameter. This is convenient when there is a single spatial weighting matrix. With many candidates, however, this would involve minimisation over a multidimensional grid, which makes such an approach prohibitively expensive in terms of computational requirements, particularly when the number of potential spatial weighting matrices is large. Alternatively, the methods developed in Chernozhukov and Hong [16] could be used to estimate such a model, but this would still involve considerable computational costs, and we do not pursue this option here. Furthermore, the main appeal of this procedure over the two-stage quantile regression is the availability of robust inference tools, but it is computationally more demanding (see [1] for a detailed comparison). Here we are interested in selecting the model specification, rather than estimating a prespecified model. With a view to this, simpler methods are preferable. Once the final model specification is established and inference is the main focus, any estimation method could be applied, depending on the purpose of the analysis.

The Zietz et al. [12] approach, on the other hand, represents a simple two-stage quantile regression. As such it is very similar to the spatial two-stage least squares approach of Kelejian and Prucha [11], which is used in Kostov [10]. Therefore, using the theoretical results of Kim and Muller [13], we can extend their two-stage quantile regression estimator to include variable selection, using essentially the same arguments as Kostov [10]. Such an extension, however, comes at a cost. This approach uses two consecutive quantile regression estimators defined at the same quantile in both steps. In the context of selecting spatial weighting matrices, the first step carries a considerable computational burden, mainly because of the large number of alternatives to be considered, and this burden is multiplied because a separate first step estimation needs to be carried out for each quantile that is to be considered. It would therefore be very useful if one could replace the first step with, for example, least squares estimation, because this would then need to be carried out only once. There have been empirical applications of two-stage estimation where the estimators used in the first and the second stage differ. For example, Arias et al. [17] and Garcia et al. [18] used least squares in the first step followed by quantile regression in the second. Unfortunately, in general settings such an approach could induce asymptotic bias in the overall estimator (see [13] for details). In simple terms, the robustness of two-stage estimators can be lost when the first stage applies an estimator that is not robust. Owing to this, we consider here only estimators that employ the same type of estimator in both steps. This means that we have to use quantile regression in both steps, a sketch of which is given below. The use of quantile regression for each estimated quantile greatly increases the computational costs of the method compared to the linear model.
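For a single candidate spatial weighting matrix W and quantile tau, a minimal sketch of such a two-stage quantile regression, using the quantreg package (the function name is ours), could look as follows:

library(quantreg)  # provides rq(), Koenker's quantile regression fitter

two_stage_rq <- function(y, X, W, tau) {
  Wy <- as.vector(W %*% y)   # endogenous spatial lag
  WX <- W %*% X              # spatially lagged covariates as instruments
  ## Stage 1: quantile regression of the lag on the instruments at the SAME tau.
  Wy_hat <- fitted(rq(Wy ~ WX, tau = tau))
  ## Stage 2: replace the endogenous lag with its stage-1 fitted values.
  ## NB: stage-2 standard errors ignore the first stage and are not valid as-is.
  rq(y ~ Wy_hat + X, tau = tau)
}

In the selection procedure developed below, stage 1 is run once per candidate matrix and per quantile, which is precisely the computational burden discussed above.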

The proposal of Kostov [10] translates into using a variable selection algorithm in the second stage estimation. As discussed previously, this variable selection algorithm needs to be of the same type as the one in the original two-stage estimator. Therefore we need a quantile regression variable selection method. There are several possibilities for the latter. First, the component-wise boosting approach used in Kostov [10] can be adapted to perform variable selection in a quantile regression setting. To this end, Fenske et al. [19] demonstrated that using the check function that defines quantile regression as the empirical loss function leads to an alternative quantile regression estimator. Using this approach looks like a natural extension of the logic of Kostov [10], particularly since he does mention the potential use of alternative empirical risk functions.

Another option is to use regularised (i.e., penalised) quantile regression to select covariates. Two of the most popular regularisation approaches, namely the least absolute shrinkage and selection operator (lasso) of Tibshirani [20] and the smoothly clipped absolute deviation (SCAD) method of Fan and Li [21], have already been considered in a quantile regression setting (see [22–24]). In general these papers have established the consistency of such regularised estimators for quantile regression problems, subject to appropriately chosen “optimal” penalty parameter(s).

So, a straightforward generalisation of the approach of Kostov [10] to quantile regression involves a similar two-step procedure. In the first step a number of quantile regressions are implemented (one for each candidate spatial weighting matrix), regressing the spatially lagged dependent variable on the spatially lagged independent variables. The fitted values from the first step are then used as additional explanatory variables (thus augmenting the original set of covariates). The second step is estimated using variable selection methods to effectively select the appropriate spatial weighting matrix.

There are several important features of such an implementation. First, since it is based on a consistent two-stage estimator (the two-stage quantile regression estimator of Kim and Muller [13]), it should retain the consistency properties of the original estimator as long as the second step is also consistent. As already discussed, the price we have to pay for maintaining such consistency is the need to estimate a separate first step quantile regression for each quantile considered. Second, similarly to other two-step procedures, standard errors, or indeed any inference based solely on the second step estimation, would be invalid. One could consider asymptotic inference based on the results of Kim and Muller [13]. Alternatively, the overall (two-step) estimator could be bootstrapped. Note, however, that due to the computational costs of the first step (details of which we present later on) such an implementation would be prohibitively expensive. The best option is to follow the suggestion of Kostov [10] and only use the proposed estimator to select the structure of the model, which can then be estimated using standard methods.

4. Variable Selection Step

From now on we will take the first (instrumentation) step as given and will focus entirely on the variable selection step. We will argue that, in order to obtain efficient inference, it is desirable that the second step implements a variable selection procedure characterised by the so-called oracle property. In simple words, an estimator possesses the oracle property if the asymptotic distribution of the obtained estimates is the same as that of the “oracle estimator,” that is, an estimator constructed from a priori knowledge of which coefficients should be zero. Therefore estimators possessing the oracle property can be used for both variable selection and inference. Here we deviate considerably from Kostov [10], who claimed that since the proposed procedure is only to be used for selecting the model structure, the oracle property is not essential. Actually the brief discussion provided in Kostov [10] implies (without explicitly mentioning it) that instead of consistency the weaker condition of persistence [25] would be sufficient. While the oracle property aims at recovering the true model structure, persistence aims only at minimising prediction error and thus tries to avoid wrongly excluding significant variables.

Therefore using a persistent estimator implicitly includes a measure of uncertainty, very much in the spirit of Bayesian methods. The actual aim in many typical applications, however, is to discover the “true sparsity pattern.” For such purposes a combination of a consistent and an oracle estimator has been shown to be able to discover the underlying structure and retain the oracle property. This idea has been formalised and theoretically developed in Fan and Lv [26]. Their methodology consists of a screening step (using a consistent variable selection method) followed by an oracle method (estimation step) to produce the final model. Even if both methods used in such a combination lack the oracle property, the overall procedure will gain from improved convergence rates and can still be consistent subject to some additional conditions (see, e.g., [27] for a detailed discussion and simulation evidence). Here, however, we prefer to avoid imposing such additional conditions and instead apply a method possessing the oracle property in the estimation step.

An additional advantage of combining screening and estimation steps is the reduction in computational requirements and improved convergence rates. The convergence rates of estimators possessing the oracle property depend on the sample size relative to the complexity of the employed model. Owing to this, it is desirable to reduce the size of the initial model. Applying an estimator possessing the oracle property to such a reduced model will improve this estimator’s efficiency (compared to the case when it is applied directly to the larger, unrestricted model). In addition to the theoretical efficiency gain, this can bring considerable practical gains in greatly reducing the computational requirements of the selection algorithm(s) involved. Such a reduced model can be produced by using any consistent estimator, that is, an estimator that (asymptotically) retains the important variables (those with nonzero coefficients). In simple terms, the combination of screening and estimation steps reduces the false positive discovery rate (i.e., falsely retaining unimportant variables) and hence is tuned to structure discovery. Retaining such unimportant variables often improves prediction accuracy or uncertainty measures and hence can result in larger models (see [27] for a detailed discussion).

So we propose applying a combination of screening and estimation steps to the already transformed model. Such a proposal can be viewed as an unnecessary complication of an already involved procedure. Nevertheless it has significant advantages. First, as we will show, it nests within itself the straightforward implementation of Kostov’s [10] proposal. Second, since the combination of screening and estimation steps is equivalent to a single step estimation, but has better convergence rates, one can potentially further reduce the set of potential spatial weighting matrices while maintaining the consistency of the overall estimation procedure. The previously mentioned equivalence means that the overall proposed spatial model estimator, which comprises three distinct steps (instrumentation, screening, and estimation), is still equivalent to the two-step method used to motivate it (i.e., the two-stage quantile regression).

As discussed previously, using either a boosting or a regularisation approach can be viewed as different implementations of the same idea, namely implementing a variable selection step in a two-stage quantile regression estimator. In order to ascertain the relative merits of these two alternatives, let us first consider their relative computational requirements. The boosting approach is considerably less intensive in terms of computation. It also has another advantage that is important in the context of spatial weighting matrix selection. Since the component-wise boosting approach processes the candidate variables one by one (see the next section for a description of the component-wise boosting algorithm), high degrees of correlation amongst variables (and therefore singularity issues due to a highly nonorthogonal design) do not prevent it from effectively reducing the set of alternatives. The nature of the spatial weighting matrix selection problem can involve simultaneous consideration of numerically very similar alternatives, which could be infeasible in the regularisation approach. Furthermore, although boosting is extensively studied and shown to be consistent, it is unclear whether it possesses the oracle property. It is therefore desirable to implement component-wise boosting as a screening method. Then a regularisation approach possessing the oracle property can be implemented in the estimation step. Note that if we stop after the screening step, we obtain a straightforward quantile regression generalisation of the approach of Kostov [10].

Because component-wise boosting is much faster than direct implementation of any regularisation approach, the previous strategy achieves a considerable reduction in the computational requirements and makes the overall approach computationally feasible. Note that, in addition to the computational requirements, direct application of a regularisation estimator could be infeasible in many spatial problems simply because of the nature of the spatial weighting matrices to be considered. When a large number of such matrices is considered (as in [10]), the resulting transformed variables can be quite similar numerically. This can result in singularities that prevent direct application of a regularised quantile regression estimation to the transformed problem.

In addition to the approach outlined previously, we will also consider adapting the stability selection approach of Meinshausen and Bühlmann [28] to the boosting estimation. Strictly speaking, stability selection is not an estimator per se, but a combination of subsampling (although other forms of bootstrap could be used) with a variable selection algorithm. It provides a measure of how often a variable is selected, so that by applying a threshold only persistently selected variables are retained.
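The mechanics are simple enough to sketch in a few lines of base R. Here select_fun stands for any variable selection routine returning the names of the retained variables (in our case, hypothetically, the boosting screen at a given quantile); the function name is ours:

stability_frequencies <- function(data, select_fun, B = 100) {
  picks <- lapply(seq_len(B), function(b) {
    sub <- data[sample(nrow(data), floor(nrow(data) / 2)), ]  # random subsample
    select_fun(sub)                                           # run the selector
  })
  sort(table(unlist(picks)) / B, decreasing = TRUE)  # empirical inclusion probabilities
}

Variables whose inclusion probability exceeds a prespecified threshold (e.g., 0.6) are then retained.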

5. Technical Implementation Details

The screening step will use component-wise boosting estimation of quantile regression, following Fenske et al. [19]. Consider the general linear quantile regression model

$$Q_{y_i}(\tau \mid x_i) = \eta_\tau(x_i) = x_i'\beta(\tau), \quad i = 1, \dots, n,$$

where $y_i$ and $x_i$ are the dependent and independent variables (the latter collected in the matrix $X$) and $\tau \in (0,1)$ is the quantile of interest.

Boosting can be viewed as a functional gradient descent method that minimises the empirical risk function $R = \sum_{i=1}^{n} \rho\left(y_i, \eta(x_i)\right)$, where $\rho(\cdot)$ is some suitable loss function. The $\tau$th quantile regression is obtained when the so-called check function is used as empirical risk:

$$\rho_\tau\left(y_i, \eta(x_i)\right) = \begin{cases} \left(y_i - \eta(x_i)\right)\tau, & y_i - \eta(x_i) \geq 0, \\ \left(y_i - \eta(x_i)\right)(\tau - 1), & y_i - \eta(x_i) < 0. \end{cases}$$

In the notation above we intentionally use the general additive predictor $\eta$, since it allows for generalisation of the approach to nonlinear and indeed nonparametric versions of the quantile regression problem. Since the check function is used to define the conventional linear quantile regression estimator of Koenker and Bassett [29], using it as an empirical risk function solves an equivalent optimisation problem.

The boosting algorithm is initialised by an initial value $\hat{\eta}^{[0]}$ for the predictor. This implies an initial evaluation for each underlying function $\hat{f}_j^{[0]}$; in this case all underlying functions will be linear. Typically one starts with an offset set to the unconditional mean of the response variable, but in the quantile regression the unconditional median is used instead (see [19] for details and justification of this choice).

Let $\hat{g}_j^{[m]}$ and $\hat{f}_j^{[m]}$ denote the evaluations of the corresponding learners (in this case linear functions) for component $j$ at iteration $m$. Here $\hat{g}_j^{[m]}$ represents the learner (i.e., linear function) fitted to the current “residuals,” while $\hat{f}_j^{[m]}$ is the “global” evaluation of the same function (see the following algorithm).

Then the component-wise boosting algorithm iteratively goes through the following steps.

(1) Compute the negative gradient of the empirical risk function evaluated at the current function estimate $\hat{\eta}^{[m-1]}(x_i)$, for every observation $i = 1, \dots, n$:
$$u_i^{[m]} = -\left.\frac{\partial \rho_\tau(y_i, \eta)}{\partial \eta}\right|_{\eta = \hat{\eta}^{[m-1]}(x_i)} = \begin{cases} \tau, & y_i - \hat{\eta}^{[m-1]}(x_i) \geq 0, \\ \tau - 1, & \text{otherwise}. \end{cases}$$

(2) Use the calculated negative gradients to fit the underlying function for each independent variable (component); that is, $\hat{g}_j^{[m]}$ is fitted to the current “residuals” $u^{[m]}$. Find the best fitting base learner
$$j^{*} = \arg\min_{1 \leq j \leq p} \sum_{i=1}^{n} \left( u_i^{[m]} - \hat{g}_j^{[m]}(x_{ij}) \right)^2.$$

(3) Update the best fitting base learner for a given step size $\nu$:
$$\hat{f}_{j^{*}}^{[m]} = \hat{f}_{j^{*}}^{[m-1]} + \nu\, \hat{g}_{j^{*}}^{[m]}, \qquad \hat{f}_{j}^{[m]} = \hat{f}_{j}^{[m-1]} \quad \text{for } j \neq j^{*}.$$

The algorithm iterates through steps (1)–(3) until a maximum number of iterations $m_{\text{stop}}$ is reached. The algorithm described above needs an updating step size $\nu$; in this application we use a small fixed value. See Kostov [10] and references therein for a discussion of this choice and a demonstration that the final results are insensitive to a wide range of choices. The other element of interest is the criterion used to decide which is the “best fitting” component in step (2). Here we use the $L_2$ norm (as above), but other choices are also possible. The greatest advantage of the $L_2$ norm is that the base learners can be updated by simple least squares fitting, which is computationally fast and convenient (see [19]). In this particular case, since we use linear quantile regression, updating the base learners amounts to applying univariate least squares.
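The complete algorithm fits into a short base R function. The following sketch (the function name and the default step size nu = 0.1 are our illustrative choices, not prescriptions) makes the selection mechanism explicit, since components that are never picked keep exactly zero coefficients:

cw_quantile_boost <- function(y, X, tau, mstop = 500, nu = 0.1) {
  X <- scale(X, scale = FALSE)       # centre covariates for the univariate learners
  beta <- numeric(ncol(X))
  eta <- rep(median(y), length(y))   # offset: unconditional median, as in [19]
  for (m in seq_len(mstop)) {
    u <- ifelse(y - eta >= 0, tau, tau - 1)   # step (1): negative check-loss gradient
    b <- colSums(X * u) / colSums(X^2)        # univariate least squares fit per component
    rss <- colSums((u - t(t(X) * b))^2)       # step (2): L2 fit of every component
    j <- which.min(rss)                       # best fitting base learner
    beta[j] <- beta[j] + nu * b[j]            # step (3): update only component j
    eta <- eta + nu * X[, j] * b[j]
  }
  beta
}

In practice mstop is not fixed in advance but chosen by a stopping criterion (such as the gMDL criterion discussed below).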

A regularised linear quantile regression estimator can be formally defined as

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left( y_i - x_i'\beta \right) + \sum_{j=1}^{p} p_\lambda\left( \left|\beta_j\right| \right),$$

where $\beta$ is the vector of the linear coefficients pertaining to the covariates, that is, $\eta(x_i) = x_i'\beta$, and $p_\lambda(\cdot)$ is a given penalty function.

The shrinkage effect is determined by the positive penalty parameter $\lambda$, which needs to be chosen according to some criterion (typically an information criterion or cross-validation).

The SCAD penalty is symmetric around the origin (i.e., $p_{\lambda,a}(\beta) = p_{\lambda,a}(-\beta)$). It is defined as follows:

$$p_{\lambda,a}(\beta) = \begin{cases} \lambda\,|\beta|, & |\beta| \leq \lambda, \\ -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)}, & \lambda < |\beta| \leq a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases}$$

where $\lambda > 0$ and $a > 2$ are tuning parameters. In this paper we will set $a = 3.7$, following Zou and Yuan [30], which helps us avoid searching for optimal tuning parameters over a two-dimensional grid; for this reason we suppress $a$ in the notation below.

The SCAD estimator can then be formally defined as

$$\hat{\beta}_{\text{SCAD}}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left( y_i - x_i'\beta \right) + \sum_{j=1}^{p} p_\lambda\left( \left|\beta_j\right| \right).$$
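The penalty itself is a direct transcription into R, with the tuning parameter a fixed at 3.7 as discussed above (the function name is ours):

scad_penalty <- function(beta, lambda, a = 3.7) {
  b <- abs(beta)
  ifelse(b <= lambda,
         lambda * b,                                                     # lasso-like near zero
         ifelse(b <= a * lambda,
                -(b^2 - 2 * a * lambda * b + lambda^2) / (2 * (a - 1)),  # quadratic transition
                (a + 1) * lambda^2 / 2))                                 # constant: large effects unshrunk
}

Adding sum(scad_penalty(beta, lambda)) to the check-loss objective gives the criterion minimised by the SCAD estimator; the nonconvexity responsible for the computational difficulties discussed below is visible in the middle branch.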

Straightforward implementation of regularised estimators is, however, computationally demanding. The main issue is that expensive repeated optimisation calls are needed to select the regularisation parameter(s), typically via some form of cross-validation. Furthermore, the nonconvex nature of the SCAD optimisation problem can lead to a considerable increase in computation time at some quantiles, particularly when a larger number of spatial weighting matrices is retained by the screening step, which is consistent with the results of Wu and Liu [23]. In order to select the optimal amount of regularisation we need some criterion. Given the computational costs of SCAD estimation, information criteria are preferable. Here we employ the g-prior Minimum Description Length (gMDL) criterion used in Kostov’s [10] boosting application. This choice is, however, dictated mostly by computational reasons, and to the best of our knowledge there is no evidence (such as simulation studies) to ascertain the performance of this criterion in empirical studies of nonlinear models.

The adaptive lasso estimator for the linear quantile regression can be defined as a weighted lasso problem in the following way:

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left( y_i - x_i'\beta \right) + \lambda \sum_{j=1}^{p} w_j \left|\beta_j\right|,$$

where $\sum_{j} w_j |\beta_j|$ is a weighted $L_1$ norm, with weights given by $w_j = 1/|\tilde{\beta}_j|^{\gamma}$ for some $\gamma > 0$, where $\tilde{\beta}_j$ are initial estimates for the parameters. In this case $\tilde{\beta}$ will be obtained by an unpenalised quantile regression. The conventional lasso estimator is a particular case in which all weights are equal, rather than adaptively chosen.
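Computing the adaptive weights is a one-liner given the unpenalised fit; a sketch using quantreg, with gamma = 1 as our illustrative default (the function name is ours):

library(quantreg)

adaptive_weights <- function(y, X, tau, gamma = 1) {
  beta0 <- coef(rq(y ~ X, tau = tau))[-1]  # initial estimates; drop the intercept
  1 / abs(beta0)^gamma                     # large initial |beta| => light penalty
}

These weights then scale the per-coefficient L1 penalty in whichever weighted lasso solver is used for the second stage.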

The adaptive lasso, when implemented in a quantile regression setting, retains the oracle property [30], similarly to the mean regression case. Therefore the adaptive lasso estimator is a reasonable choice in this setting, particularly bearing in mind the computational cost associated with the transformation step. Furthermore, $L_1$ norm estimators are by far the most widely studied regularisation estimators for quantile regression (see, e.g., [23, 24, 30] for variable selection applications).

Li and Zhu [22] proposed an algorithm to estimate the whole regularisation path for lasso-type quantile regression problems. Their proposal is potentially valuable since it can be applied to non- (or semi-) parametric additive quantile regression models and therefore results in a much more general approach, intrinsically immune to functional form misspecification. The advantage of such algorithms is that, since they exploit the piecewise linear property of the regularisation path, the latter can be obtained at a fraction of the computational cost of the overall regularised estimator. This facilitates implementation of cross-validation and/or information criteria.

The elastic net [31] penalty is a combination of the $L_1$ and $L_2$ norms, and for the quantile regression the resulting estimator can be written as

$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\left( y_i - x_i'\beta \right) + \lambda_1 \left\| \beta \right\|_1 + \lambda_2 \left\| \beta \right\|_2^2.$$
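Written out as an R objective (to be minimised over beta by a dedicated solver; the function names are ours), the two penalties are simply added to the check loss:

check_loss <- function(u, tau) sum(u * (tau - (u < 0)))  # the check function rho_tau

enet_qr_objective <- function(beta, y, X, tau, lambda1, lambda2) {
  check_loss(y - X %*% beta, tau) +
    lambda1 * sum(abs(beta)) +   # L1 part: induces sparsity
    lambda2 * sum(beta^2)        # L2 part: induces the grouping effect
}

The L2 term is what produces the grouping effect discussed next.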

An important property of the elastic net penalty is that the inclusion of the $L_2$ norm induces a grouping effect, in that correlated variables are grouped together. This helps avoid spuriously selecting only one variable from a group of highly correlated variables. Given that in many empirical problems the spatial weighting matrices considered can lead to highly correlated designs, it would be desirable to avoid such a pitfall. One should note, however, that elastic net penalisation can be expected to retain more variables than the other approaches.

The least squares approximation (LSA) estimator [32] is given by

$$\hat{\beta}_{\text{LSA}}(\tau) = \arg\min_{\beta}\; \left( \beta - \tilde{\beta} \right)' \hat{\Sigma}^{-1} \left( \beta - \tilde{\beta} \right) + \lambda \sum_{j=1}^{p} \left|\beta_j\right|,$$

where $\hat{\Sigma}^{-1}$ estimates the second derivative of the unpenalised loss function, evaluated at the unregularised estimates $\tilde{\beta}$. It is technically obtained as an approximation based on a first order Taylor series expansion (see [32]).

In the case of quantile regression the respective loss function (i.e., the check function $\rho_\tau$) is not sufficiently smooth. Nevertheless, as long as $\hat{\Sigma}$, which is in principle any consistent covariance matrix estimate pertaining to the unpenalised problem, can be obtained, the corresponding LSA estimator defined above exists. Furthermore, when the regularisation parameters are chosen optimally, it possesses the oracle property (see [32] for a formal proof). Since the LSA is essentially a linear lasso-type problem, it can be estimated using standard methods. In particular, the computationally efficient least angle regression (LARS) algorithm of Efron et al. [33] can be used to compute the regularisation path. Here we will apply the BIC-type tuning parameter selector of Wang et al. [34] to select the optimal amount of shrinkage. Application of the LSA to a quantile regression requires a covariance matrix estimator for the latter. Any consistent estimator would be appropriate. In this paper we use the kernel-based covariance estimator proposed in Newey and Powell [35].
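The computational appeal is that the LSA criterion is an ordinary lasso on pseudo-data. A sketch of the reduction, assuming beta0 and Sigma hold the unpenalised estimates and their covariance estimate (the function name is ours):

lsa_pseudo_data <- function(beta0, Sigma) {
  R <- chol(solve(Sigma))                    # Sigma^{-1} = R'R (Cholesky factor)
  list(X = R, y = as.vector(R %*% beta0))    # then (beta - beta0)' Sigma^{-1} (beta - beta0)
}                                            #   equals || y - X beta ||^2

## pd <- lsa_pseudo_data(beta0, Sigma)
## path <- lars::lars(pd$X, pd$y, type = "lasso", intercept = FALSE)  # full path via LARS

A point on the resulting path is then picked by the BIC-type selector of Wang et al. [34], which requires no further optimisation calls.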

6. Study Design and Implementation Details

For comparative purposes we follow closely the design outlined in Kostov [10]. This involves using the same dataset and model specification, as well as the same set of competing alternative spatial weighting matrices. Since all these are discussed in some detail in Kostov [10], we only briefly sketch them here.

The corrected version of the popular Boston housing dataset [36] is used. It consists of 506 observations and incorporates some corrections and additional latitude and longitude information due to Gilley and Pace [37]. This dataset contains one observation for each census tract in the Boston Standard Metropolitan Statistical Area. The variables comprise proxies for pollution, crime, distance to employment centres, geographical features, accessibility, housing size, age, race, status, tax burden, educational quality, zoning, and industrial externalities. A detailed description of the variables used in this study is presented in Table 1.

The basic model as implemented in Kostov [10] is a linear hedonic specification regressing the (corrected) median house values on the variables described in Table 1.

The basic specification above is augmented with alternative candidate spatial weighting matrices, constructed using the longitude and latitude information. The set of alternative spatial weighting matrices is constructed using an inverse-distance-raised-to-a-power weighting specification and a nearest neighbours definition of the neighbourhood scheme. We adopt the naming convention used in Kostov [10], combining the codes for the neighbourhood definition and the weighting scheme to refer to the corresponding spatial weighting matrix and the resulting additional variables to be included in the boosting model. All these variables are named using the following convention: nxwy, where x is the number of neighbours and y is the weighting parameter (which is the inverse power of the weight decay). For example, the spatial weighting matrix with the nearest 50 observations as neighbours and inverse squared distance weights, as well as the resulting transformed variable, will be denoted as n50w2. We employ all values for the number of neighbours from 1 to 50 and evaluate w in the interval [0.4, 4] using increments of 0.1. In simple words, this means that we are combining 50 possible neighbourhood definitions with 37 alternatives for the weighting parameter, resulting in 1,850 alternative spatial weighting matrices to be considered simultaneously.
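A sketch of the construction of a single candidate matrix under this convention (we assume Euclidean distances on the coordinates and row-standardisation, the usual conventions in spatial econometrics, though [10] should be consulted for the exact choices; the function name is ours):

make_W <- function(coords, x, w) {
  D <- as.matrix(dist(coords))   # pairwise distances between observations
  diag(D) <- Inf                 # an observation is not its own neighbour
  t(apply(D, 1, function(d) {
    keep <- rank(d, ties.method = "first") <= x   # x nearest neighbours
    wt <- ifelse(keep, 1 / d^w, 0)                # inverse distance to the power w
    wt / sum(wt)                                  # row-standardise
  }))
}

## e.g. n50w2 corresponds to make_W(coords, x = 50, w = 2); the full candidate
## set spans x = 1, ..., 50 and w = seq(0.4, 4, by = 0.1), i.e., 1,850 matrices.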

Kostov [10] projects the spatially weighted dependent variable into the column space of the spatially weighted independent variables by taking the fitted values from a least squares regression, to obtain the transformed variables named according to the previous convention. As discussed before, here we need to replace this first step with a quantile regression defined at a predetermined quantile to obtain a model augmented with the alternative spatial weighting matrices. The second stage is then implemented in two consecutive steps. First we apply a component-wise boosting quantile regression, defined at the same quantile (as in the first stage), to the augmented model. This is the screening step that reduces the set of variables to be considered in the model. Then a regularised quantile regression (defined at the same quantile) is applied to the screened dataset. The previous three steps (instrumentation, screening, and estimation) can be run at any prespecified quantile, and their consecutive implementation defines our estimator.

In the present setting some caution should be exercised in applying the estimation step. Note that in conditionally parametric models there is a certain trade-off between variables and spatial dependence. The spatial dependence structure can approximate the effect of missing variables, provided these are spatially correlated. Simultaneously shrinking the coefficients of both variables and spatial lags is therefore exposed to this trade-off. Whenever the model contains related terms in both the spatial part (i.e., spatial weighting matrices) and the regression part (variables whose effect could be approximated by these spatial weighting matrices), simultaneous shrinkage is undesirable. The danger here is that one can spuriously exclude important variables and approximate their effect by additional spatial terms. Note, however, that if we take the regression part as given, this trade-off disappears. Ideally one would want to eliminate this trade-off. In order to avoid its impact we suggest a two-step implementation of the estimation step, sketched below. In the initial step only the spatial lag coefficients are penalised, while in the following final step all coefficients are penalised. In this way the initial step should select the appropriate spatial dependence structure, while the final step performs the final variable selection. Hence the initial step makes structural inference about the spatial part conditional on the regression part of the model. If the screening step has produced a model that is reasonably close to the true one, then the proposed approach should be able to discover the true underlying structure. Alternatively, one may wish to implement an iterative estimation in which the estimator alternates between steps in which only the spatial structure is penalised and steps in which only the regression part is penalised, until convergence (defined in terms of obtaining a stable structure in that no more terms are eliminated). Such steps can be viewed as conditioning one part (spatial or regression) of the model on the other, hence avoiding the trade-off. The latter approach would, however, be computationally more expensive.
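The two-step penalisation reduces to setting per-coefficient penalty levels. A sketch, assuming a generic weighted-L1 quantile regression solver wl1_rq(y, X, tau, penalty) (a hypothetical placeholder; any of the estimators above can play this role) that takes one non-negative penalty level per coefficient and returns a fit supporting coef():

two_step_estimate <- function(y, X_reg, X_spat, tau, lambda) {
  X <- cbind(X_reg, X_spat)
  p_reg <- ncol(X_reg); p_spat <- ncol(X_spat)
  ## Initial step: penalise only the spatial lag coefficients, so the spatial
  ## structure is selected conditional on an unpenalised regression part.
  fit1 <- wl1_rq(y, X, tau, penalty = c(rep(0, p_reg), rep(lambda, p_spat)))
  keep <- c(rep(TRUE, p_reg), coef(fit1)[p_reg + seq_len(p_spat)] != 0)
  ## Final step: penalise all coefficients of the reduced model.
  wl1_rq(y, X[, keep, drop = FALSE], tau, penalty = rep(lambda, sum(keep)))
}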

Another issue is the highly correlated design of the spatial quantile regression model when there is a large number of potential spatial weighting matrices. Since variable selection methods in principle rely upon marginal correlations, they can fail in such highly correlated designs. For the mean regression model, recent contributions by Wang [38] and Cho and Fryzlewicz [39] have suggested alternative methods that overcome this reliance on marginal correlations and hence are applicable to highly correlated designs. It is, however, unclear how such methods can be extended to the quantile regression case. The two-step approach adopted in this paper conditions the selection of the spatial and the hedonic variables each on the other part of the model and hence reduces this trade-off. Such an approach is justified if the regression part of the model is correctly specified, but could be suboptimal if this is not the case. This is of course an area that deserves further investigation.

7. Results

We implement the proposed estimator for the 0.1 to 0.9 quantiles with a step of 0.1 (i.e., 9 different quantile regressions). Table 2 presents comparative computational time details for the different procedures. All these are calculated for the first of the considered quantiles (i.e., the 0.1th one) and are given as guidance only, since the actual computational time can vary according to the nature of the optimisation problem, which can change across quantiles. All computations are undertaken using the statistical programming language R [40] on an Intel Core2 2.13 GHz processor with 2 GB RAM, without any parallel computation. Parallelising some of the more computationally demanding tasks and/or using compiled code could considerably reduce the computational time. Furthermore, it cannot be claimed that the actual implementation of these procedures is optimised in terms of computational time.

The instrumentation step is the most time-consuming task. In our implementation it takes over 30 minutes for 1,850 spatial weighting matrices. In many empirical problems one would probably consider a much smaller number of alternative spatial weighting matrices. Furthermore, most of the time in this step is spent on creating the spatially weighted dependent and independent variables, rather than on fitting the actual quantile regressions.

The actual boosting procedure requires running the boosting algorithm for a large number of iterations and then calculating a stopping criterion to decide upon the estimated structure. The boosting algorithm is very efficient computationally. The stopping criterion calculation, however, takes considerable time. Efficient parallel implementations for the latter exist, and these can considerably reduce the computation time.

The time needed to calculate the stopping criterion is directly proportional to the number of boosting steps (which is effectively the number of alternative “models” for which it is calculated). Since in this case, at all considered quantiles, we need at least three times fewer iterations than the 5,000 used here, a practical implementation would take 6-7 minutes rather than the 18 reported in Table 2.

We apply the stability selection to the dataset already reduced in the screening step. Yet again this is a relatively time-consuming procedure, but it can be parallelised for further computational gains.

One has to be careful in directly comparing these implementations of the estimation step. As the preceding discussion demonstrates, calculating the stopping criterion (i.e., selecting the optimal penalty parameters) is by far the most computationally demanding part of these procedures, and the reported implementations use different methods for this.

With regard to the estimation methods we report separately the computation times for step one (where only the spatial weighting matrix coefficients are penalised) and the consecutive second step where all the coefficients are penalised. As is to be expected, the LSA is the fastest method. This is due to two underlying facts: it uses the efficient least angle regression algorithm [33], and it relies on the BIC-type tuning parameter selector of Wang et al. [34], which is easy to compute.

The full path estimation for the adaptive lasso, accompanied by cross-validation to choose the optimal amount of regularisation, appears to be the most computationally demanding estimation method. Most of the computational costs, however, come from the use of cross-validation. Furthermore, this is the most universally applicable method, in the sense that many of the other methods can run into difficulties during the optimisation (at different quantiles), which can considerably inflate their computational costs.

We present computational details for implementing SCAD with gMDL over a predefined grid of 50 penalty values. Although the computational times appear acceptable, one has to take into account some caveats. The nonconvex nature of the SCAD optimisation problem means that in some cases the actual computation time can increase considerably (by a factor of over 100 in some cases). Furthermore, we have opted to fix one of the regularisation parameters, which artificially reduces the computational time. Another important point is that no set of penalisation parameters is ex ante guaranteed to span the whole regularisation path. In our implementation we run a preliminary SCAD estimation over a range of such values, designed to identify a feasible set that does span most of the regularisation path, and then manually select the grid of such values. In cases where the optimisation is difficult, this can lead to a considerable increase in computational time. Therefore a path estimation algorithm for the SCAD estimator for quantile regression is essential if a reliable implementation of this method is to be designed. The use of the gMDL as an optimality criterion is also somewhat ad hoc, in that there is no firm evidence on its performance for this type of problem, and it is mostly dictated by computational reasons (since cross-validation, for example, would be very costly).

The elastic net implementation is reasonably efficient. Both the BIC and the generalised approximate cross-validation yield the same models. The reported computational costs refer to routines that internally compute both of the above criteria, but this only marginally increases the costs. Most of the computational load comes from the double regularisation needed to solve for the two underlying penalties.

The component-wise boosting algorithm manages to achieve a considerable reduction in the model space. It retains between three and eleven spatial weighting matrices across the different quantiles. We do not present these intermediate results here for brevity reasons, but details are available upon request. This intermediate step yields a reduced model space that can be explored for the underlying structure, as discussed in the methodology section. Table 3 presents the results from the stability selection applied to the prescreened model (i.e., after the boosting application). Typically stability selection applies a prespecified probability threshold to select variables. Here, instead of proper stability selection, we present the corresponding inclusion probabilities for the spatial weighting matrices. We omit spatial weighting matrices with inclusion probability less than 10%. Full results are available upon request. Table 3 provides a background against which the actual estimation results can be evaluated. If one were to use a threshold of 0.6, most quantiles would have resulted in a single spatial weighting matrix being selected. Such a choice would, however, have been based solely on the component-wise boosting algorithm, which, as already discussed, may not possess the oracle property and may therefore be inappropriate for structure discovery purposes. The estimation results from the least squares approximation (LSA), full-path adaptive lasso (full path), and smoothly clipped absolute deviation (SCAD) by quantile are presented in Tables 4–12. The elastic net results are not presented. In simple terms, due to the implicit correlation penalty, the elastic net possesses a grouping property in that it groups correlated variables together. In this case, due to the highly correlated nature of the spatial weighting matrices, the elastic net groups them together and cannot exclude spatial weighting matrices whenever they are similar to ones already included in the model.

Bearing in mind the relative computational cost of the different estimation approaches, it is useful to determine whether the LSA results differ substantially from the other approaches. Table 4 presents the results at the 0.1th quantile. With regard to the main (i.e., hedonic) variables the results are comparable across estimation methods. Where a variable is omitted by one of these methods, it is either dropped by the other methods or found to be statistically insignificant. With regard to the spatial weighting matrices, the stability selection approach (see Table 3) determined that three such matrices have an inclusion probability of over 0.5, but one matrix, namely n5w0.6, has a very high inclusion probability of 0.96. The estimation results reflect this in that n5w0.6 is selected by all methods, while the full path and the SCAD estimators also include another spatial weighting matrix. Hence the LSA appears to choose a slightly more parsimonious model. One should note, however, that the additional spatial weighting matrices chosen by the other two methods are statistically insignificant. Therefore all methods point to the same underlying model structure.

Table 5 presents the results for the 0.2th quantile. Again the LSA closely approximates the full path solution. The additional spatial weighting matrix selected by the full path estimation is insignificant, and both methods lead to the same conclusions. The SCAD estimation, however, produces slightly different results. Most notably, it chooses a different spatial weighting matrix. In terms of implementation, the SCAD optimisation problem at this quantile did take longer to solve. Nevertheless, the spatial weighting matrix selected by SCAD is the one characterised by the highest inclusion probability according to the stability selection. Due to the somewhat ad hoc implementation of the SCAD estimation in this paper (it is evaluated over a grid instead of computing the whole regularisation path, and it uses the gMDL as stopping criterion), it is difficult to evaluate the latter result. As far as the LSA is concerned, however, it provides a reliable approximation of the adaptive lasso estimator at a fraction of its computational cost.

The results for the 0.3th quantile (Table 6) are broadly similar across the different methods. The LSA and the adaptive lasso select the same spatial weighting matrices, and although these show some small differences in terms of statistical significance for the latter, the structural inference is essentially the same. Once again the SCAD results show a small difference with regard to the preferred spatial weighting matrices.

At the 0.4th quantile (Table 7) all methods choose n3w0.4 together with a slightly different second spatial weighting matrix. With regard to the second retained spatial weighting matrix, the LSA is very similar to the SCAD (n6w0.5 versus n6w0.4), while the adaptive lasso selection is slightly different. Hence, although the LSA results are comparable to the other methods, the approximation to the adaptive lasso is not as close as at the previously considered quantiles. One can note that in this case, out of the 8 spatial weighting matrices originally retained by the boosting algorithm, 7 have a model inclusion probability exceeding 0.1 (see Table 3). Therefore one can tentatively conclude that the LSA provides a reasonable approximation to the adaptive lasso for quantile regression whenever there are relatively few competing spatial weighting matrices. This property is to be expected given the trade-off between the spatial and the regression parts in a quantile regression model that was discussed in the methodology section. It is possible that if one applies a different method to control for this trade-off, instead of the two-step implementation of the oracle estimator implemented here, the quality of the LSA approximation may not deteriorate with an increasing dimension of the spatial weighting matrix space.

The estimation results for the 0.5th quantile (see Table 8) are similar across the different methods. The adaptive lasso finds it difficult to discriminate between n5w0.4 and n6w0.4, but since these two matrices are numerically very hard to distinguish, one can conclude that the methods agree. Interestingly, although SCAD produces the same spatial weighting matrix as the LSA, the technical difficulty in discriminating between two very similar spatial weighting matrices results in a considerable increase in computational time.

Similarly, at the 0.6th quantile (Table 9) results are consistent across methods. The LSA and the adaptive lasso both select n6w0.6, with the LSA also selecting another spatial weighting matrix, which is, however, statistically insignificant. The SCAD selects n6w0.4, which incidentally is the spatial weighting matrix with the highest inclusion probability according to the stability selection results (see Table 3). This again raises the issue of the comparative performance of SCAD and the adaptive lasso, but the results are nevertheless qualitatively similar.

Table 10 shows the estimation results for the 0.7th quantile. Here there are considerable differences between the methods. There is disagreement about the sign of TAX between SCAD and the other two methods. Furthermore, all three methods choose different spatial weighting matrices. Once again the source of such differences is probably the size of the (screened) spatial weighting matrix space, which consists of 11 such matrices, 8 of which have a stability selection-derived inclusion probability of at least 0.2.

At the 0.8th quantile all methods select n6w1.1 (see Table 11). SCAD selects another spatial weighting matrix, which leads to some differences in the regression part. However since this additional spatial weighting matrix is not statistically significant, omitting it would produce results consistent with the other methods.

Finally, Table 12 presents the results for the 0.9th quantile, which differ considerably amongst methods. The main reason for these differences is that different main hedonic variables are selected by the different methods, which in turn results in differences in the spatial part. The latter again raises the issue of the trade-off between the regression part and the spatial dependence.

8. Conclusions

This paper proposes methods for selecting a spatial weighting matrix in a quantile regression context. We build upon previous work in this area. In discussing our proposal we outline the alternatives and the different potential implementations of the same set of ideas. Our proposals are designed to reduce the arbitrariness of the choice of spatial weighting matrix and the impact of the trade-off between functional form and spatial dependence specifications. The spatial quantile regression is already a quite flexible model, allowing for different impacts across different quantiles. Our procedure introduces additional flexibility by allowing not only different degrees of spatial dependence but also different spatial weighting matrices at different quantiles, resulting in potentially interesting inferences about the nature of the underlying process.

The methodology proposed in this paper consists of three steps: instrumentation, screening, and estimation. Several different alternative methods for the estimation step are explored. This allows for conclusions to be drawn with regard to the performance of these estimators in different circumstances. In general terms, the proposed methods are tractable with small and moderate size samples, where the advantages of the spatial quantile regression model are most pronounced.