Abstract

Spatial data analysis provides valuable information to governments as well as companies. Rapid improvements in modern technology, together with geographic information systems (GIS), enable the collection and storage of ever more spatial data. We developed algorithms to choose optimal locations from a set of permanent locations in a space for efficient spatial data analysis. Neighboring permanent locations need not be equispaced. Robust and sequential methods were used to develop the algorithms for design construction. The constructed designs are robust against misspecified regression responses and variance/covariance structures of responses. The proposed method can be extended in future work to image analysis, including three-dimensional image analysis.

1. Introduction

Companies can learn consumer behavior through spatial data analysis to increase their profits. Common consumer behavior patterns can be identified in neighborhoods. When these patterns are identified, companies can reduce expenditures and wastage. Recently, massive amounts of spatial data have been collected through remote sensing techniques, magnetic resonance imaging (MRI) scanners, X-ray machines, and cameras, and by governments and companies. These types of data are mostly nonexperimental observational data [1]. Jaworski et al. [2] noted that analyzing a large sample is a time-consuming and expensive procedure. Subsampling methods can overcome this obstacle and have been developed by many authors, including Rocke and Dai [3] and Salloum et al. [4]. Moreover, Wang et al. [5] and Yao and Wang [6], among others, discussed optimal subsampling methods for nonexperimental data.

Groundwater contamination started with the industrial revolution. Gas sectors, mining industries, and industrial waste are the main sources of groundwater contamination. Water pollution poses risks to human health. Therefore, groundwater monitoring is important for identifying potential water contamination. Selecting optimal wells from a large number of wells leads to an efficient understanding of groundwater pollution and reduces the cost of groundwater monitoring [7]. Naturally, the levels of contamination in water are highly correlated if two wells are close to each other. In this paper, we accommodate these kinds of correlations among responses in design construction.

Robustness, including robustness to the correlation structure among responses, is discussed in many studies on the construction of designs; for instance, see Shi et al. [8], Wiens [9], and Wiens [10]. The misspecified variance/covariance structure was considered by Wiens [10] in the development of a robust method to construct designs for spatial analysis. However, the algorithms developed there choose optimal locations from equispaced locations. Wiens [11] included the misspecified variance/covariance structure in the model by incorporating robust methods. The universal kriging estimate was used in his development of the loss function for design construction. In this paper, the theoretical results of Wiens [11] are applied to establish an algorithm that selects optimal locations from permanent locations that are not necessarily equispaced.

The rest of this paper is organized as follows. We describe the model formulation and methods in §2. In §3, an algorithm is described using the sequential method, and the proposed algorithm for design construction is validated by some test cases. In §4, we outline an algorithm to choose optimal locations from fixed permanent locations and give an example using the algorithm. The robust method is also applied to the ‘coal-ash’ data in the same section. We summarize our findings in §5.

2. Materials and Methods

The material in this section is based on the theory in Wiens [11]. We discuss how to find optimal locations from a design space where with and . contains information regarding th spatial location. We assume that the relationship between responses and locations can be expressed by a linear model. We include robustness by considering the model misspecification and correlations among responses in the construction of designs. We consider the following approximately linear model:for some small model error , and is a homoscedastic measurement error with , -dimensional vector regressors , and model parameters . However, the experimenter assumes the incorrect model . Based on this assumption, the true unknown parameters can be obtained by

Define the matrix having rows and vector with elements . We assume that has full column rank. Condition (2) leads to the following orthogonality requirement:

Responses are correlated having the following covariance matrix:where .

In general, the experimenter has an objective to measure responses at the location . We assume that covariances among responses have the following structure;

We impose the following conditions:where and are constants, is an induced matrix norm. The experimenter has a plan to collect data .

Let be a class of functions satisfying conditions (3) and (6) and be the class of positive semi-definite matrices satisfying condition (6). The model misspecification is accounted for by a function in and a covariance matrix in . We define the covariance matrix of by

Also, we define the incidence matrix to express in terms of and it is described as follows:

Thus, the covariance matrix can be expressed by

The optimal linear predictors of the random quantities can be obtained by universal kriging [12]. This is achieved by minimizing the prediction mean squared error (PMSE), which is defined by

By Theorem 1 of Wiens [11], the PMSE can be written as follows: for any function in and covariance matrix in , where

Theorem 1. Let for be a family of covariance structures, where is the true covariance matrix. Then under the assumptions of and satisfying condition (6), the maximum value of PMSE over and is times where , , with .

The proof of Theorem 1 follows directly from Theorem 2 and Remark 1 of Wiens [11]. In this study, the loss function in Theorem 1 is used for design constructions. We will discuss two types of correlation functions in the next section.

2.1. Correlation Matrix

We consider two correlation functions: (i) the isotropic Gaussian correlation function and (ii) the anisotropic Gaussian correlation function for , where is the Euclidean norm [9]. The true correlation matrix has the following form:

Wiens [9] suggests the value 0.9 for , which is the nearest-neighbor correlation. We will use the same value of in the construction of optimal locations in §3 and §4.1. The true covariance matrix can be evaluated by for a specified constant .
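
As an illustration of the two correlation structures, the following sketch builds Gaussian correlation matrices on an equispaced grid and calibrates the decay rate so that nearest neighbors have correlation 0.9. The parametrization exp(-λ‖t_i − t_j‖²), the names `gaussian_correlation` and `lam_for_neighbor_corr`, the 5 × 5 grid, and the factor 2 used for the anisotropic case are assumptions made for the example, not values taken from the text.

```python
import numpy as np

def gaussian_correlation(locs, lam):
    """Gaussian correlation exp(-sum_k lam_k * (t_ik - t_jk)^2).

    A scalar `lam` gives an isotropic structure; a length-d vector gives an
    anisotropic one. This parametrization is an assumption of the sketch.
    """
    locs = np.asarray(locs, dtype=float)
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (locs.shape[1],))
    diff = locs[:, None, :] - locs[None, :, :]      # pairwise coordinate differences
    return np.exp(-np.einsum("ijk,k->ij", diff**2, lam))

def lam_for_neighbor_corr(rho, spacing):
    """Decay rate that yields correlation `rho` at distance `spacing`."""
    return -np.log(rho) / spacing**2

# 5 x 5 equispaced grid on the unit square (grid size chosen for illustration).
side = np.linspace(0.0, 1.0, 5)
grid = np.array([(x, y) for x in side for y in side])

lam = lam_for_neighbor_corr(0.9, side[1] - side[0])
R_iso = gaussian_correlation(grid, lam)                # same decay in both directions
R_aniso = gaussian_correlation(grid, [lam, 2.0 * lam]) # faster decay along the second axis
```

With this calibration, two horizontally or vertically adjacent grid points have isotropic correlation 0.9, while in the anisotropic case the correlation depends on the direction of the separation.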

3. Design Construction

In this section, we discuss how to choose optimal locations from locations in a two-dimensional space. We consider the approximately linear model where , , is a response observed at , and is a small departure from the model assumed by the experimenter. We suppose that the region, , is a two-dimensional unit square with vertices (0, 0), (0, 1), (1, 0), and (1, 1). Let and be the horizontal and vertical distances (in units) from the origin (0, 0). In the next subsection, a brute-force procedure is applied to pick the optimal locations from all possible subsamples for a given set of the parameters required to compute the loss function.

3.1. Brute-Force Search

The loss function (13) depends on parameters , , , , and . There are restrictions among these parameters, namely , , and . We chose these parameters to include a wide range of possible scenarios. Also, small or moderate values were selected for from the interval [0.3, 2]. The selected sets of parameters are reported in Table 1.

Eight sets of parameter values were used to evaluate, through a brute-force search, the algorithm proposed in §3.2. These values are shown in Table 1. In this section, we construct test cases to evaluate the performance of Algorithm 1, which is discussed in §3.2. Let be the number of locations required by an investigator. The test cases are constructed by a brute-force search over all possible subsamples of locations. We display four test cases under the isotropic Gaussian correlation structure, and in Figure 1. In this case, 480,700 possible subsamples were checked to obtain the optimal locations. In Figure 2, we show four test cases under the anisotropic Gaussian correlation structure, and . In this scenario, 30,260,340 possible subsamples were checked to select the optimal locations. We used the MATLAB command “nchoosek” to enumerate all possible subsamples of locations from locations for the brute-force search. If the number of subsample locations is greater than , we cannot use “nchoosek” to carry out the brute-force search. Thus, further research is needed to apply the brute-force search when the number of subsample locations is greater than . However, Algorithm 1 in §3.2 and Algorithm 2 in §4.1 work for any and .
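
To make the cost of the brute-force search concrete, the sketch below enumerates every subset of a given size and keeps the one with the smallest loss. The function `brute_force_design` and the toy loss are hypothetical stand-ins; the toy loss is not the loss function (13). The two subset counts quoted above coincide with the binomial coefficients C(25, 7) and C(36, 8); reading them as 7 locations out of 25 and 8 out of 36 is an inference from the counts, not something the text states.

```python
from itertools import combinations
from math import comb

def brute_force_design(n_total, n_select, loss):
    """Return the index subset of size n_select with the smallest loss."""
    return min(combinations(range(n_total), n_select), key=loss)

# The subset counts quoted in the text match these binomial coefficients.
print(comb(25, 7))   # 480700
print(comb(36, 8))   # 30260340

# Toy loss (hypothetical): prefer indices whose smallest gap is as large as possible.
def toy_loss(subset):
    return -min(b - a for a, b in zip(subset, subset[1:]))

best = brute_force_design(10, 3, toy_loss)
print(best)          # (0, 4, 8)
```

Because the number of subsets grows combinatorially, a generator such as `itertools.combinations` avoids holding all subsets in memory, but the running time still limits the search to small problems.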

3.2. Sequential Method

The sequential method is widely applied in the construction of optimal designs; for instance, Wiens [10] developed algorithms using the sequential approach to choose optimal designs. In the sequential method, one design point at a time is added to the current design. Because we collect spatial locations, locations are chosen without replacement. Next, we discuss Algorithm 1, which is based on the sequential approach.

Step 1: Collect n =  locations randomly without replacement from the design space and let be the collected locations and be the corresponding index set, where .
Step 2: Sequentially select  =  location such that
where . Thus, we have the chosen locations and corresponding index set .
Step 3: Remove the initial locations from the set . So, the collected locations are and the corresponding index set is with . Also, we have .
Step 4: Again sequentially choose  =  location such that
, .
Finally, the set contains the selected optimal locations.
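
The four steps can be sketched as a greedy add, remove, and re-add loop. The selection criteria of Steps 2 and 4 are not reproduced here, so the sketch assumes that each step adds the candidate location that minimizes a user-supplied loss of the augmented design, and that Step 2 stops once the design holds the required number of locations. The names `sequential_design`, `n_final`, `n_init`, and `coverage_loss` are hypothetical, and the coverage loss is only a stand-in for the loss in Theorem 1.

```python
import random

def sequential_design(n_total, n_final, n_init, loss, seed=0):
    """Greedy sequential selection of n_final indices out of n_total."""
    rng = random.Random(seed)
    initial = rng.sample(range(n_total), n_init)       # Step 1: random start, no replacement
    design = list(initial)

    def add_best(current):
        # Candidate not yet in the design whose addition gives the smallest loss.
        candidates = [i for i in range(n_total) if i not in current]
        return min(candidates, key=lambda i: loss(current + [i]))

    while len(design) < n_final:                        # Step 2: add one location at a time
        design.append(add_best(design))
    design = [i for i in design if i not in initial]    # Step 3: drop the initial locations
    while len(design) < n_final:                        # Step 4: refill sequentially
        design.append(add_best(design))
    return sorted(design)

# Stand-in loss on 20 points of a line: total distance to the nearest design point.
def coverage_loss(design):
    return sum(min(abs(p - d) for d in design) for p in range(20))

design = sequential_design(n_total=20, n_final=5, n_init=2, loss=coverage_loss)
print(design)
```

Since the starting locations are random, different seeds can give different final designs, which is why repeated runs with the best loss retained are a natural way to use such a procedure.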

4. Applications

In this section, we discuss how to choose optimal locations from permanent locations. The procedure is described in Algorithm 2 and this algorithm is explained in §4.1. In §4.2, we apply Algorithm 1 to the “coal-ash” data.

4.1. Application 1

Algorithm 2 consists of the following steps.

Step 1: Let , where is the nearest integer function.
Step 2: Identify a rectangle that includes all permanent locations.
Step 3: Generate equispaced locations with size in the rectangle. We assume that the generated design space is , where contains information that is related to th generated location for .
Step 4: Choose optimal locations with size using Algorithm 1 from . Let be the selected optimal locations.
Step 5: Sequentially pick location such that

where the empty set and . Thus, the set contains the chosen optimal permanent locations.
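
A minimal sketch of Steps 2, 3, and 5 follows. It assumes that Step 5 assigns to each grid-based optimal location the nearest permanent location not already chosen, using Euclidean distance; the exact criterion is not reproduced above, so this matching rule is an assumption. The names `bounding_grid` and `match_to_permanent`, the grid size `m`, and the example coordinates are hypothetical.

```python
import numpy as np

def bounding_grid(permanent, m):
    """Steps 2-3: m x m equispaced grid on the bounding rectangle of the locations."""
    permanent = np.asarray(permanent, dtype=float)
    lo, hi = permanent.min(axis=0), permanent.max(axis=0)
    xs = np.linspace(lo[0], hi[0], m)
    ys = np.linspace(lo[1], hi[1], m)
    return np.array([(x, y) for x in xs for y in ys])

def match_to_permanent(grid_optimal, permanent):
    """Step 5 (assumed form): nearest unused permanent location for each grid point."""
    grid_optimal = np.asarray(grid_optimal, dtype=float)
    permanent = np.asarray(permanent, dtype=float)
    chosen = []                                    # the selected set starts empty
    for g in grid_optimal:
        dist = np.linalg.norm(permanent - g, axis=1)
        if chosen:
            dist[chosen] = np.inf                  # a location cannot be picked twice
        chosen.append(int(np.argmin(dist)))
    return chosen

permanent = [[0.1, 0.1], [0.9, 0.9], [0.15, 0.1], [0.5, 0.5]]
grid_optimal = [[0.0, 0.0], [0.1, 0.12], [1.0, 1.0]]
picked = match_to_permanent(grid_optimal, permanent)
print(picked)   # [0, 2, 1]
```

In the example, the second grid point is closest to permanent location 0, but that location is already taken, so the next nearest location (index 2) is selected instead.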

We simulated permanent locations in a square that has vertices (0, 0), (0, 1), (1, 0), and (1, 1). These permanent locations are displayed in Figure 3(a). Algorithm 2 was applied to choose optimal locations from these 90 permanent locations. Equispaced locations were generated with size . These locations are shown in Figure 3(b). The isotropic Gaussian correlation structure was assumed in the construction of 11 grid-based optimal locations. An initial design is required to run Algorithm 2. The number of initial locations was used to run Algorithm 2. Although the initial locations were removed and replaced by new locations at the end of Algorithm 2, the choice of the final locations depends slightly on the initial locations. Thus, we performed 100 runs of Algorithm 2 to obtain the grid-based optimal locations. Figure 4(b) shows the losses of the 100 runs; the minimum loss for the grid-based optimal locations was 133.316, and it occurred in the 32nd run. We therefore chose the 11 locations generated in the 32nd run. These selected grid-based locations are shown in Figure 4(a). The set of permanent locations nearest to the set of grid-based optimal locations was picked as the permanent optimal locations, which are also displayed in Figure 4(a). Algorithm 2 can be used to identify optimal locations in, for instance, an X-ray image, a large number of water wells in a region, or soil tests for a given area.

4.2. Application 2

In this section, we study the ‘coal-ash’ data to investigate the performance of the discussed method. The coal-ash core measurements were collected from 208 locations in the Pittsburgh coal seam. These locations are approximately equispaced, about 2500 feet apart [12]. Wiens [10] applied his method to choose optimal locations for the ‘coal-ash’ study. The values of the parameters , and are essential to constructing optimal locations for this study. Results from a previous study offer a way to obtain them [13]. We used the information on the final optimal locations with size 30 of Wiens [10] to obtain the values of these parameters. The coal-ash core measurement 17.61 at location (5, 6) was an outlier in this information. The generalized least squares estimate performs poorly if there is an outlier in a data set [10]. M-estimators, however, remain robust and efficient even when a data set contains outliers [14]. Thus, we preferred M-estimates in this application. These M-estimates are .

The performance of the constructed optimal locations was evaluated by the root mean squared error (RMSE), which is defined by where are the M-estimates of the unknown true parameters .
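
The effect of an outlier on least squares, and the protection offered by an M-estimate, can be illustrated on synthetic data. The sketch below assumes a Huber M-estimator fitted by iteratively reweighted least squares with the conventional tuning constant 1.345, and the usual residual-based RMSE; the text does not say which M-estimator was used, and its RMSE formula may differ. The names `huber_fit` and `rmse` and the simulated straight-line data are hypothetical and unrelated to the coal-ash measurements.

```python
import numpy as np

def huber_fit(X, y, c=1.345, n_iter=50):
    """Huber M-estimate of regression coefficients by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # start from ordinary least squares
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: normalized median absolute deviation of the residuals.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def rmse(y, fitted):
    """Root mean squared error between observations and fitted values."""
    return float(np.sqrt(np.mean((y - fitted) ** 2)))

# Synthetic straight line with small deterministic noise and one gross outlier.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * np.sin(np.arange(20))
y[10] += 10.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = huber_fit(X, y)
clean = np.arange(20) != 10
print(rmse(y[clean], X[clean] @ beta_ols), rmse(y[clean], X[clean] @ beta_m))
```

On the non-outlying points, the Huber fit stays close to the underlying line, whereas the least squares fit is pulled toward the outlier.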

Collecting data from a small number of locations saves expenditure, shortens the experiment, and speeds up statistical analysis. Thus, small numbers of locations, , were considered to verify the performance of our proposed method. Various sizes of were taken to compare the information obtained from optimal locations with that from the full set of locations of size . These results are reported in Table 2. The value of RMSE for the full locations is 1.1220. The maximum difference between the RMSE for optimal locations and that for full locations is 0.0847, and the minimum difference is 0.0082. Therefore, the RMSE for full locations is approximately equal to the RMSE for optimal locations. That is, the information obtained from optimal locations is approximately the same as the information obtained from full locations. Therefore, when we conduct an experiment at the optimal locations, expenditure can be reduced with little loss of information. Optimal locations having size and are displayed in Figure 5.

We selected 5 sets of parameters , and to observe patterns and test the effectiveness of optimal locations. These sets of parameters are reported in Table 3. is the variance of a homoscedastic measurement error and it depends on the parameters and . Therefore, we computed the values of and these values are in Table 3. Also, the value of (= 0.94) was taken from the paper of Wiens [10] for scenario 1 (S1) and that value was computed using the final 30 optimal locations. We used and for S1. The value of can be calculated by the formula and the calculated value of was 0.79. We used this value for all scenarios in Table 3.

For large values of , the optimal locations concentrate on the border of the target region (see Figures 6(b) and 6(d)). For small values of , the optimal locations are scattered across the target region (see Figures 6(a) and 6(c)). Also, the values of RMSE for the optimal locations are far from the value of RMSE for full locations when we assume a large value of . In contrast, the values of RMSE for the optimal locations are approximately the same as the value of RMSE for full locations when we use a small value of .

5. Summary and Conclusion

We have discussed a robust method to construct optimal locations for spatial data analysis. The design constructions are robust against model misspecifications regarding regression responses and variance/covariance structures of responses. The prediction mean squared error was used to form the loss function, which was obtained by maximizing over the misspecified regression function and the variance/covariance matrix of responses. Algorithm 1 was developed using the sequential method to choose optimal locations from equispaced locations, whereas Algorithm 2 works for nonequispaced locations. Therefore, Algorithm 2 can be used to choose optimal permanent locations from a two-dimensional space. The proposed approach can be used to answer a scientific question through an effective spatial analysis at minimum cost and time. Thus, the proposed sequential method can be applied to choose optimal locations on the Earth for water and soil monitoring, in X-rays for diagnosing a disease, and in a region for business analytics. The brute-force search only works if the number of subsample locations is less than or equal to . So, further research is needed to extend the brute-force search to any number of subsample locations. The proposed sequential method, however, can be applied to select optimal locations from a large number of locations. We can reduce measurement error in data collection when we focus on a small number of optimal locations. Also, an efficient spatial data analysis can be done with optimal locations with little loss of information. Optimal locations can be selected regardless of the shape of a region using the proposed method. The proposed method is also a way to conduct big data analytics quickly and efficiently; however, this should be verified through future research on image analysis.

Data Availability

The data that was used in Application 2 can be found in Cressie (2015).

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work has benefited from the comments of Douglas P. Wiens, University of Alberta. The processing fee for the manuscript was granted by the Governing Council of the University of Toronto.