#### Abstract

In situ observation is one of the most direct and efficient ways to understand the ocean, but it is usually limited in terms of spatial and temporal coverage. The determination of optimal sampling strategies that effectively utilize available resources to maximize the information content of the collected ocean data is becoming an open problem. The historical sea surface temperature (SST) dataset contains the spatial variability information of SST, and this prior knowledge can be used to optimize the configuration of sampling points. Here, a configuration method of sampling points based on the variability of SST is studied. Firstly, in order to get the spatial variability of SST in the ocean field to be sampled, the historical SST data of the field is analyzed. Then, -means algorithm is used to cluster the subsampled fields to make the configuration of sampling points more suitable. Finally, to evaluate the sampling performance of the new configuration method of sampling points, the SST field is reconstructed by the method based on compression sensing algorithm. Results show that the proposed optimal configuration method of sampling points significantly outperforms the traditional random sampling points distribution method in terms of reconstruction accuracy. These results provide a new method for configuring sampling points of ocean in situ observation with limited resources.

#### 1. Introduction

SST is a key input to atmospheric and oceanic forecasting and prediction systems [1], and it is always used as boundary conditions for numerical weather prediction (NWP) and ocean forecasting models [2, 3]. SST reveals the process of the exchanges of momentum, heat, moisture, and gas, and it is used to model the upper-ocean circulation and the thermal structure at different time scales [4, 5]. So it is not surprising that high-resolution SST data influences the formation and subsequent evolution of synoptic weather systems [6, 7]. Such models are increasingly used for operational ocean applications including offshore operations, maritime transport, maritime safety, marine pollution control, and wave and surf models. Based on these applications, to explore the ocean effectively, regular, timely, and accurate SST measurements are necessary. In situ observation is one of the most effective methods in exploring the ocean. The ocean observation network (such as the buoy network) and the mobile observation platform are two commonly used ways of in situ observation. The purpose of ocean observation is to obtain data that reflects the distribution characteristics of factors in a target ocean area. Generally, the more the sampling points placed in the target area are, the higher the reconstruction accuracy of the field could be achieved, the more likely it is to reflect the true state of the temperature of the sea area. However, both the mobile observation platforms and the stationary observation networks have energy constraints. Therefore, how to design effective sampling method with realistic constraint, to solve the optimization placement problem of the sampling points and to minimize the reconstruction error of the sea temperature field, and to reveal the ocean temperature field distribution characteristics has become a key problem in the field of ocean observation.

The common sampling strategy is the -optimal criteria [8]. These criteria minimize the mean uncertainty over the sampled field by selecting observing locations. Zhu et al. [9] proposed an optimal method of large-scale ocean sampling by minimizing the Kriging variance. The location of each sampling point is decided by the integrated Kriging variance. The method is effective but suffers from heavy computational complexity. Guestrin et al. [10] proposed a sampling algorithm based on a Gaussian model to optimize sensors. Zhang and Sukhatme [11] presented a sampling scheme for estimating the sampled field using a robotic boat and a sensor network. The above-mentioned sampling strategies attempted to search for the subset of sampling locations that most significantly minimize uncertainties in ocean model prediction. However, the selection of sampling locations inevitably requires large cost of computation.

Recent advances in the field of compressive sensing (CS) [12, 13] show that it is possible to reconstruct a field using random measurements, where is the dimension of the field. Thus, using CS to reconstruct fields may provide a new opportunity for designing a simple but effective strategy in ocean observation. In addition, as the sampling points are not needed to be placed at particular locations, this* lightweight* sampling scheme can be easily applied in mobile platforms. Poduri et al. [14] proposed a random walk sampling method for robots based on the compressive sensing theory. Hummel et al. [15] proposed a trajectory planning algorithm based on the standard CS paradigm to take samples. Chen et al. [16] proposed a distributed adaptive sampling solution using the CS characteristic. These sampling schemes based on CS usually conduct sampling randomly. Because they do not take the spatial variability of variables into account, the reconstruction effect cannot be satisfactory in the region with high dynamic variability.

Due to the large impact of wind and ocean current, the distribution of SST varies spatially. For example, the East China Sea is the main area of the Kuroshio in the midlatitudes and it has obvious hydrological characteristics. As the areas with high spatial variability always contain the maximum information of the sampled field, intensive observations or multiple-samplings are necessary in these areas. How to configure the sampling points is the key question in in situ observation with resource constrains. Here, we propose an optimal configuration method of sampling points based on variability of SST. This paper is organized as follows. Section 2 briefly describes the optimal configuration method of sampling points. Then Section 3 presents the numerical simulations and the corresponding results. Finally, summary and discussions are given in Section 4.

#### 2. Optimal Configuration Method of Sampling Points

Our goal is to design a sampling method to take more measurements in areas with high SST spatial variability while taking less measurements in areas with low SST spatial variability. How to choose the most suitable sampling points to gather maximum information is the fundamental task in SST sampling. The historical SST data of the sampled field contains rich prior knowledge which can then be used to optimize the configuration of sampling points. Sampling rate calculation criteria based on the results analyzed are designed. In order to make the configuration of sampling points more accurate, the -means clustering algorithm is used to do further processing on the data in subsampled fields. Finally, CS reconstruction algorithm is used to reconstruct the SST field to evaluate the sampling performance of the new sampling method. An illustration of the new sampling method is shown in Figure 1.

##### 2.1. Analysis of the Distribution Characteristics of SST

The ocean is an extremely complex dynamical system with strong spatial-temporal variability. Geostatistics is a great tool for analyzing the spatial-temporal variability of natural phenomenon. The semivariogram function is used to quantitatively assess the spatial variability of a physical variable in geostatistics. Here, the physical variable is SST value. Assuming that the SST value at a specific location is noted as , the process is said to be second-order stationary if the mean is the same at each location and its spatial covariance only depends on the spatial distance between location and location . Then the semivariogram function can be written asHere, is the constant step for the distance and is the number of all pairs of points , while and represent the SST value at location and location . Actually, the empirical semivariogram function is always fitted by a positive-definite parametric model (e.g., exponential, Gaussian, and spherical), and the Gaussian model used in the case is as follows:where is the nugget and is scaling factor. is the range of variation (the range of variation of Gaussian model is ). The sill represents the maximum spatial variance. The larger the sill is, the greater the changes in the gradient magnitude of the variables are. In general, the direction of the maximal changes in the magnitude of the gradient is the zoning direction [17].

Figure 2(a) shows the monthly mean spatial SST field in February 2009 in the East China Sea. As shown in Figure 2(a), to distinguish the high variability SST fields and gentle variability SST fields, the sampling area is divided into 16 subsampled fields by dashed lines. The subsampled field in the upper left corner of Figure 2(a) is taken as an example (as shown in Figure 2(b)) to describe the process of the optimal configuration method of sampling points.

**(a) Monthly mean SST field**

**(b) Sub-SST field**

The spatial variability of the SST is analyzed using geostatistics in each subsampled field. As shown in Figure 3, semivariogram functions are fitted in four different directions (i.e., NE-SW, E-W, SE-NW, and N-S). Comparing the sills in four directions, it is shown that the gradient magnitude of SST is consistent with the value of the sill. More sampling points should be placed in the region with high variability to mitigate the error in the estimation of the SST field. Additionally, for each subsampled field, considering the ratio between the maximal and the minimal value of sill in four directions, a great ratio indicates more obvious SST zoning phenomenon in this subsampled field. If the ratio is small, it means that there is no obvious zoning phenomenon in this subsampled field.

**(a) Semivariogram function (N-S)**

**(b) Semivariogram function (NE-WS)**

**(c) Semivariogram function (E-W)**

**(d) Semivariogram function (NW-SE)**

One assumes that the average sampling rate is and sampling rate of lower limit is set to be . Meanwhile, the impact factor of gradient magnitude is set to be and the impact factor of zoning phenomenon is set to be . The sampling rate in each subsampled field is as follows:where is the maximal sill of four directions in the th subsampled field. is the maximal sill of four directions among all subsampled fields, while is the minimal sill of four directions among all subsampled fields. is the ratio between the maximal sills and the minimal sills of four directions in the th subsampled field. is the maximal ratio between the maximal sills and the minimal sills of four directions among all subsampled fields. is the minimal ratio between the maximal sills and the minimal sills of four directions among all subsampled fields and the range of impact factors , and in (3) are 0.50.6, 0.30.5, and 0.50.7. A large change in the gradient magnitude means the difference between and is large. Then we need to increase the value of and reduce the value of to make the gradient magnitude as the main factor to impact the sampling rate. When the change in the gradient magnitude is small, we need to reduce the value of and increase the value of to make the zoning phenomenon as the main factor to impact the sampling rate.

As shown in Figure 2, the gradient magnitude of SST is 18°C. The temperature gradient magnitude in this field is large and the zoning phenomenon is obvious. So the sampling scheme will take the gradient magnitude as the main factor to impact the sampling rate. Then the number of sampling points in each subsampled field is calculated as follows:where is the number of sampling points in the th subsampled field. is the sampling rate in the th subsampled field and it can be calculated by (3), while is the number of data points of one subsampled field.

##### 2.2. Configuration Method of Sampling Points

In Section 2.1, only the spatial variability of the SST distribution over the whole sampled field was analyzed. In some subsampled fields with high variability, such as the subsampled field in Figure 2(b), the temperature gradient magnitude is 9°C which means that it still contains rich spatial information of the SST distribution. So it is necessary to do further analysis in these subsampled fields to distinguish the high and gentle variability SST fields. Here, the -means clustering algorithm is used to do further processing on SST data in each subsampled field.

Even though -means was first proposed over 50 years ago, it is still one of the most widely used algorithms for clustering (Jain, 2010). By using -means algorithm, data objects are clustered into clusters. These clusters should meet the following conditions: given data objects, find clusters based on a measure of similarity so that the similarities between objects in the same cluster are high while the similarities between objects in different clusters are low.

Taking the SST data points as the data objects, SST data objects in a subsampled field are divided into clusters which is shown in Figure 4(a). It is hoped to get the following clustering phenomenon: the data objects in the same cluster are geographically adjacent and their SST value is much close. Cluster which contains less data objects has larger spatial change in the magnitude of gradient in SST distribution, such as cluster III shown in Figure 4(a). Cluster which contains more data objects has smaller spatial change in the magnitude of gradient in SST distribution, such as cluster I shown in Figure 4(a).

The aim of optimal configuration of sampling points is to collect the richest information with limited resources (time or energy). In order to seek the sampling points with the richest information, all clusters in one subsampled field are placed with the same number of sampling points. Then the cluster with larger change in the magnitude of gradient is sampled with higher sampling rate while the cluster with smaller spatial change is sampled with lower sampling rate. So the sampling points number of each cluster in one subsampled field iswhere is the number of sampling points in the th subsampled field while is the sampling rate of the th subsampled field and it can be calculated by (3). Meanwhile, is the number of sampling points of the th subsampled field and is the number of clusters in the th subsampled field.

As the new sampling method does not require to sample at the specific locations, sampling points are randomly placed in each cluster. Figure 4(b) gives an example of the configuration of sampling points in which the subsampled field is divided into 3 cluster blocks and sampling points are indicated by orange dots.

##### 2.3. Evaluation Method

The SST field can be reconstructed based on the sampled data and the reconstruction error can be used to evaluate the sampling performance of the sampling method. CS is a new sampling theory which is often used to construct a data field. In CS, a dimensional signal which can be expressed as , where is an orthonormal basis and is the description of sparse . If has at most nonzero element, signal can be recovered from measurements.

The reconstruction of SST field is a blind sparse reconstruction question. There are many algorithms which can reconstruct the signal without the prior knowledge of sparse, such as BP, SP, SAMP, ASMP, and StOMP. ASMP not only inherits the backtracking refinement that attaches to CoSaMP/SP but also does not require the sparse as an input parameter. Therefore, ASMP algorithm [18] is used here as the reconstruction algorithm of the SST field.

Here, we take one subsampled field as a reconstructing unit. One assumes that the vector of SST data measured is , where the matrix is the coding matrix of sampling point positions while is the observation matrix and . Assuming that is the estimates of , the basic problem of reconstruction is to minimize the loss of the energy . The mathematical expression is as follows:where the orthonormal basis is the DCT basis and .

#### 3. Experiment

In order to test the performance of the new sampling strategy, the eastern part of the East China Sea (22°N30°N, 122°E130°E) is selected as the experimental field. The SST data (getting from the database in the National Oceanic Information Center) is from February 2008 to January 2009 with a horizontal resolution of . So there are totally 4096 SST data grid points. For ease of analysis, the experimental area is divided into 16 subsampled fields equally. The division of the experimental area and the subsampled fields are shown in Figure 5.

**(a) Experimental area**

**(b) Experimental area division**

Reconstruction error is used to evaluate the sampling performance of the new sampling method. For comparison, we also conducted a traditional random sampling method. The number of sampling points for both the new sampling method and random sampling method is 928. And the average sampling rate is about 22%. The reason to take this sampling rate is that reconstruction error in most of the months with sampling rate 22% has reduced significantly. The measured SST data of the two methods are reconstructed by ASMP algorithm with an input threshold of 1.3.

For the random sampling method (RS), all subsampled fields use the same sampling rate (22%). The sampling points are randomly placed in each subsampled field. For the optimal configuration method of sampling points (OS). First, the sampling rate of each subsampled field and the number of sampling points are obtained by the distribution characteristic analysis. Then each subsampled field is divided into 3 clusters. And the number of sampling points in each cluster is the same and sampling points are placed randomly in each cluster. The sampling points placement of the two methods is shown in Figure 6(b) in which sampling points are indicated by white dots.

**(a) RS sampling points**

**(b) OS sampling points**

Table 1 shows the sampling number distribution of OS and RS in each subsampled field in February 2008. The field number in Table 1 corresponds to the sampling subsampled field number of Figure 5(b).

It can be seen from Table 1 and Figure 6 that the number of sampling points is much more than the average number (58) in the subsampled fields 9, 10, 13, and 15 where the variability of the SST is more obvious. The number of the sampling points in subsampled field with lower variability is less than the average sampling number.

Figure 7 is the reconstruction error distribution of two methods from February 2008 to January 2009. It can be seen that OS has a lower reconstruction error compared to RS in all months, especially in the month with larger change in the magnitude of gradient, such as February, March, and April. The OS outperforms the RS with a 10–16% reduction in reconstruction error.

Figure 8 shows the difference of the reconstruction error corresponding to the two sampling methods. In most of the months, from February 2008 to June 2008 and October 2008 to January 2009, the reconstruction error of OS is much lower than that of RS in the northwest of the sampled field. And this region is just the areas with significant variability. These results show that the OS is more efficient when the same number of measurements is taken with RS. Moreover, in the southeastern part of the sea area where the change of gradient magnitude is small, the difference of the reconstruction error between OS and RS is very small.

**(a) February 2008**

**(b) March 2008**

**(c) April 2008**

**(d) May 2008**

**(e) June 2008**

**(f) July 2008**

**(g) August 2008**

**(h) September 2008**

**(i) October 2008**

**(j) November 2008**

**(k) December 2008**

**(l) January 2009**

Figure 9 shows the OS and RS reconstructed SST field in February, May, and August. The RC_HS represents the original SST field, RC_OS represents the reconstructed SST field of the optimal configuration method of sampling points, and RC_RS represents the reconstructed SST field of the random sampling method. It can be seen that the reconstructed SST field of OS has a significant improvement over RS in the northwest of the sampled field with high variability. For the fields where the variability is lower (such as the southeastern part of the sampled field), the performance of OS and RS is similar.

**(a) RC_HS (February 2008)**

**(b) RC_RS (February 2008)**

**(c) RC_OS (February 2008)**

**(d) RC_HS (May 2008)**

**(e) RC_RS (May 2008)**

**(f) RC_OS (May 2008)**

**(g) RC_HS (August 2008)**

**(h) RC_RS (August 2008)**

**(i) RC_OS (August 2008)**

**(j) RC_HS (November 2008)**

**(k) RC_RS (November 2008)**

**(l) RC_OS (November 2008)**

#### 4. Conclusion

How to choose the most suitable sampling points to gather maximum information is the main task in ocean in situ observation. A new optimal configuration method of sampling points based on the variability of SST is proposed in this paper. The variability of SST is examined by geostatistics and the -means algorithm is used to cluster the subsampling fields. Reconstructing method based on CS is used to construct the sampling field to verify the new sampling method. Results show that the new method is effective in taking measurements using limited sampling points and has a significant improvement in the sampled field with high variability. These results provide a guideline for the configuration of sampling points with a prior knowledge in the ocean in situ observation. Although the new configuration method of sampling points has shown great promise, there are still some constraints since it requires a large amount of prior data for analysis. So some challenges remain in applying it to the real-time configuration of sampling points in adaptive sampling strategy. As future work, we will study the online configuration method of sampling points with the constraint of real-time information.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was funded by the National Natural Science Foundation of China (no. 41676088), the National Key Research and Development Project of China (2017YFC1404100, 2017YFC1404102), the Fundamental Research Funds for the Central Universities (HEUCF041705), the Foundation of Key Laboratory of Marine Environmental Information Technology, and the Natural Science Foundation of Heilongjiang Province (no. QC2017067).