Computational Intelligence and Neuroscience

Volume 2019, Article ID 2859429, 12 pages

https://doi.org/10.1155/2019/2859429

## Gaussian Process Regression Tuned by Bayesian Optimization for Seawater Intrusion Prediction

^{1}National Technical University of Athens, 15773 Athens, Greece^{2}Department of Informatics and Computer Engineering, University of West Attica, 12243 Athens, Greece^{3}Institute of Communication and Computer Systems (ICCS), Zografou 15773, Athens, Greece

Correspondence should be addressed to Athanasios Voulodimos; rg.autn.liam@vsonaht

Received 20 April 2018; Revised 25 November 2018; Accepted 11 December 2018; Published 17 January 2019

Academic Editor: José Alfredo Hernández-Pérez

Copyright © 2019 George Kopsiaftis et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Accurate prediction of the seawater intrusion extent is necessary for many applications, such as groundwater management or protection of coastal aquifers from water quality deterioration. However, most applications require a large number of simulations usually at the expense of prediction accuracy. In this study, the Gaussian process regression method is investigated as a potential surrogate model for the computationally expensive variable density model. Gaussian process regression is a nonparametric kernel-based probabilistic model able to handle complex relations between input and output. In this study, the extent of seawater intrusion is represented by the location of the 0.5 kg/m^{3} iso-chlore at the bottom of the aquifer (seawater intrusion toe). The initial position of the toe, expressed as the distance of the specific line from a number of observation points across the coastline, along with the pumping rates are the surrogate model inputs, whereas the final position of the toe constitutes the output variable set. The training sample of the surrogate model consists of 4000 variable density simulations, which differ not only in the pumping rate pattern but also in the initial concentration distribution. The Latin hypercube sampling method is used to obtain the pumping rate patterns. For comparison purposes, a number of widely used regression methods are employed, specifically regression trees and Support Vector Machine regression (linear and nonlinear). A Bayesian optimization method is applied to all the regressors, to maximize their efficiency in the prediction of seawater intrusion. The final results indicate that the Gaussian process regression method, albeit more time consuming, proved to be more efficient in terms of the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination ().

#### 1. Introduction

Seawater intrusion (SI) in coastal aquifers is a complex physical phenomenon, consisting of several physical processes. A number of approaches have been proposed to simulate SI, considering different components. Dispersion mechanisms and water density changes are considered critical components in the accurate representation of SI [1]. Both mechanisms are incorporated in the mathematical description of what is known as variable density (VD) models. Although accurate, VD models are CPU intensive and entail long runtimes because the resulting model equations are solved using complex numerical methods (e.g., finite differences and finite element methods). The time-consuming simulations hinder the exploitation of the high accuracy VD models in applications which require a large number of iterations, such as coastal groundwater management, parameter estimation, sensitivity analysis, and uncertainty analysis. Because of the long runtimes, it is also rather impractical to incorporate VD models in real-time systems, e.g., decision support systems [2]. A common method to tackle the duration problem is the use of very fast approximation models, which could efficiently substitute the original VD models, without compromising the accuracy of the results. These models are usually called surrogate models, metamodels, model emulators, lower fidelity models, proxy models, and response surfaces [2–5].

Surrogate model practice is based on the notion that original model response(s) could be approximated by a computationally more efficient model, for a range of values of the selected model variables [5]. In the present study, the Gaussian process regression (GPR) as a surrogate model for SI is examined. Rajabi and Ketabchi [6] summarized the advantages of GPRs compared with other surrogate models in the following: (i) GPRs provide both an approximation of the original high-fidelity model results and a probabilistic estimate of the approximation uncertainties [7, 8], (ii) GPRs’ structure is relatively simple based on the mean and covariance functions [9], (iii) GPRs are flexible with regard to the probability distributions of the input data, (iv) GPRs can efficiently cope with models of different complexity [10, 11], (v) GPRs provide the ability to calculate the mean and standard deviation, and (vi) GPRs provide the ability to incorporate prior knowledge of the outputs in the metamodel construction process [12].

The GPR results are compared with other widely used methods, specifically, linear regression (LR), support vector machine regression (SVMR), binary regression decision tree (BRDT), and ensemble tree learners (ETL). It should be noted that the examined methods are all univariate. A Bayesian optimization is employed in all surrogate models, to improve their efficiency.

The remainder of this study is structured as follows: Section 2 presents a brief survey of the related work. In Section 3, the seawater intrusion model is described, whereas section 4 presents the proposed Gaussian process regression scheme and the Bayesian optimization process. In Section 5, the experimental evaluation is provided, and finally, section 6 concludes the paper.

#### 2. Related Work

Approximation models have been widely used during the last decade in water resources (e.g., [13, 14]) and especially in groundwater modelling. Razavi et al. [5] and Asher et al. [2] performed an extended review of surrogate model applications in water resources field. Regarding coastal aquifers, surrogate models have been widely used for the prediction of SI, substituting the complex fluid flow and transport processes. For example, Bhattacharjya et al. [15] employed artificial neural networks (ANN) to approximate density-dependent flow in coastal aquifers. In more recent studies, Roy and Datta [16] used the fuzzy C-mean clustering method to predict SI, while Lal and Datta [17] investigated the ability of Support Vector Machine regression (SVMr) to predict the location of SI toe and concluded that the method surpasses other widely used metamodelling methods, such as genetic programming (GP).

A significant number of relative studies are devoted to the use of surrogate models in coastal aquifer management problems to cope with the computational burden, which arises from simulation-optimization schemes [18]. A well-established metamodel which is very common in coastal aquifer management literature is the artificial neural networks (ANNs) [5, 18]. Bhattacharjya and Datta [19] employed an ANN model to approximate a density-dependent model in a genetic optimization framework. Rao et al. [20] and Kourakos and Mantoglou [21] incorporated ANNs in a simulation-optimization scheme to replace the SEAWAT numerical code. Kourakos and Mantoglou [22] proposed a pumping optimization method based on modular neural networks and an Evolutionary Annealing Simplex optimization algorithm. Ataie-Ashtiani et al. [23] combined a simulation-optimization procedure with ANNs to develop an efficient model for the multiobjective management of groundwater lenses in small islands. Christelis and Mantoglou [24] used cubic radial basis functions (RBFs) in two adaptive metamodeling frameworks: (1) the adaptive-recursive approach and (2) the metamodel-embedded evolutionary strategy. The latter proved to be computationally more efficient, providing solutions near the global optimum. Christelis et al. [25] employed two surrogate-based optimization (SBO) frameworks, under restricted computational budgets to improve the efficiency of optimization algorithms in problems of moderate and large dimensionalities. In a more recent study, Christelis and Mantoglou employed variable-fidelity surrogate models and evolutionary algorithms to calculate the maximum allowed pumping rates in coastal aquifers. Sreekanth and Datta in several studies [26–28] examined genetic programming (GP) as a potential surrogate model in multiobjective management of SI in coastal aquifers and compared the proposed method with modular neural network metamodels. Roy and Datta examined several surrogate models to predict SI in coastal aquifers and employed all these models in coastal aquifer management problems [29–31]. A review of surrogate models, focusing on SI and coastal aquifer management, is presented by Roy and Datta [32].

Gaussian process metamodels have been widely used in engineering optimization applications, but according to Razavi et al. [5] and Asher et al. [2], they have not yet attracted much attention in groundwater field, particularly SI. Stone [33] presented a Bayesian emulation methodology as an alternative to Monte Carlo in the analysis of stochastic groundwater models. Zhang et al. [34] employed an adaptive Gaussian process-based method to identify contaminant source in groundwater problems, whereas Crevillén-García et al. [35] used Gaussian process method to perform uncertainty analysis in a convectively enhanced dissolution process model. Raghavendra and Deka [36] used GPR and adaptive neuro fuzzy inference system (ANFIS) to forecast groundwater level time series. In a recent study, Rajabi and Ketabchi [6] used Gaussian process emulators in a simulation-optimization framework to address the computational challenges arising from the large number of the required simulations. Roy and Datta [37] incorporated three metamodels, particularly ANFIS, GPR, and multivariate adaptive regression spline (MARS), in a multiobjective optimization framework to quantify the influence of sea-level rise on coastal aquifer management. In this specific study, they concluded that the ANFIS-based metamodel proved to be more efficient and inexpensive compared with the other two metamodels. The authors also performed a comparative analysis between several surrogate models [38], including GPR, in a coupled simulation-optimization methodology under parameter uncertainty. In this specific study, they concluded that the GPR metamodels and their ensemble (EGPR) proved to be more efficient in terms of prediction compared with other similar methods, such as the MARS metamodel and the regression tree (RT) metamodel.

#### 3. Seawater Intrusion Model

##### 3.1. Variable Density and Salt Transport Model

As mentioned in section 1, VD models are based on the spatial variability of groundwater density, which ranges from saline water density to freshwater density. The driving force of the seawater/freshwater mixing is the dispersion mechanism, which results in the existence of a transition zone across the entire coastline. The width and exact position of the zone depends on the aquifer parameters and the pumping regime. In the current study, thermal and viscosity effects are neglected and the density changes are attributed only to concentration effect. The flow and solute transport equations are used to describe mathematically the VD model. The two equations form a coupled differential equation system, which could be expressed as follows [39]:where is the fluid density, is the specific discharge vector, is the density of water entering from a source of leaving through a sink, is the volumetric flow rate per unit volume of porous medium representing sources and sinks, is the specific storage, is the freshwater head, is the porosity, is the solute concentration, is the hydrodynamic dispersion tensor, is the fluid velocity vector, and is the solute concentration of water entering or leaving through sources and sinks, respectively. Because solute reaction is not considered, fluid density is only a function of the solute concentration , according to the following equation:in which is the freshwater density, is the density difference ratio (equation (4)), is the reference concentration, and is the maximum concentration. In this study, the following values are used for the parameters of equation (3): , , and .

The density difference ratio is expressed aswhere stands for the maximum seawater density. In this study, we consider .

The Darcy flux term of equation (1) for constant viscosity and freshwater properties could be expressed aswhere , , and are the components of the specific discharge in the principal directions, , , and are the components of the freshwater hydraulic conductivity in the same directions, and is the freshwater density.

Equations (1) to (7) are the mathematical representation of the VD approach of seawater intrusion. The well-established SEAWAT code is used to solve numerically the aforementioned equation set. SEAWAT is a modular finite difference computer code created by USGS, which couples MODFLOW and MT3DMS, to solve iteratively the fluid flow and solute transport equations [39].

##### 3.2. Coastal Aquifer Case Study

The VD model is applied on a rectangular-shaped unconfined aquifer. The dimensions of the aquifer model are , , and . The examined aquifer geometrically resembles a real coastal aquifer located at the central eastern part of the Greek island Kalymnos, specifically the elongate aquifer underlying the Vathi valley [40, 41]. It should be noted that the examined model is an abstraction of the real aquifer, which could be considered as a typical aquifer example for the Aegean Greek islands, in terms of size and shape.

Figure 1 outlines the conceptual model of the aquifer model. A hydrostatic boundary condition (BC) is assigned on the seaside boundary. The aquifer is bounded by impermeable geological formations, with the exception of the inland boundary, where a specified flux BC is applied to simulate the lateral inflow from the adjacent aquifer. The groundwater is replenished by a constant recharge, which is uniformly distributed along the entire surface of the aquifer. For simplification purposes, the aquifer is considered homogeneous and an anisotropic factor is assumed, which represents the differential permeability along the vertical direction. Table 1 presents the values of the basic fluid flow and solute transport parameters. An initial simulation for approximately 200 yr without pumping was performed, until steady flow/steady transport conditions are achieved. The final hydraulic head and concentration values of this simulation are used as the initial conditions for the SI simulations, which are related to the training of the surrogate models. All simulations in the current paper are considered steady state, regarding the fluid flow conditions. This assumption resulted in relatively brief VD simulations, which allowed for the creation of an adequate sample for the calibration of the surrogate models. The duration of each simulation was approximately 1–2 min. The simulations were performed in an i7-4770 quad-core processor 3.4 GHz, with 8 GB RAM.