Abstract

We explore the use of stochastic optimization methods for seismic waveform inversion. The basic principle of such methods, which dates back to the 1950s, is to work at each step with a small, randomly drawn batch of realizations of a given misfit function. The ultimate goal of such an approach is to dramatically reduce the computational cost involved in evaluating the misfit. Following earlier work, we introduce the stochasticity into the waveform inversion problem in a rigorous way via a technique called randomized trace estimation. We then review theoretical results that underlie recent developments in the use of stochastic methods for waveform inversion. We present numerical experiments to illustrate the behavior of different types of stochastic optimization methods and investigate the sensitivity to the batch size and the noise level in the data. We find that it is possible to reproduce results that are qualitatively similar to the solution of the full problem with modest batch sizes, even on noisy data. Each iteration of the corresponding stochastic methods requires an order of magnitude fewer PDE solves than a comparable deterministic method applied to the full problem, which may lead to an order of magnitude speedup for waveform inversion in practice.

1. Introduction

The use of simultaneous source data in seismic imaging has a long history. So far, simultaneous sources have been used to increase the efficiency of data acquisition [1, 2], migration [3, 4], and simulation [5–7]. Recently, the use of simultaneous source encoding has found its way into waveform inversion. Two key factors play a role in this development: (i) in 3D, one is forced to use modeling engines whose cost is proportional to the number of shots (as opposed to 2D frequency-domain methods, where one can reuse the LU factorization to cheaply model any number of shots) and (ii) the curse of dimensionality: the number of shots and the number of gridpoints grow by an order of magnitude.

The basic idea of replacing single-shot data by randomly combined “super shots” is intuitively appealing and has led to several algorithms [8–11]. All of these aim at reducing the computational cost of full waveform inversion by reducing the number of PDE solves (i.e., the number of simulations). This reduction comes at the cost of introducing random crosstalk between the shots into the problem. It was observed by Krebs et al. [8] that it is beneficial to recombine the shots at every iteration to suppress the random crosstalk and that the approach might be more sensitive to noise in the data. In this paper, we follow Haber et al. [12] and introduce randomized source encoding through a technique called randomized trace estimation [13, 14]. The goal of this technique is to estimate the trace of a matrix efficiently by sampling its action on a small number of randomly chosen vectors. The traditional least-squares optimization problem can then be recast as a stochastic optimization problem. Theoretical developments in this area go back to the 1950s, and we review them in this paper. In particular, we discuss two distinct approaches to stochastic optimization. The stochastic approximation (SA) approach consists of a family of algorithms that use a different randomization in each iteration. This idea justifies a key part of the approach described in Krebs et al. [8]. Notably, averaging the updates over past iterations is important in this context to suppress the random crosstalk; the lack of such averaging likely explains the noise sensitivity reported by Krebs et al. [8]. The theory we treat here concerns only first-order optimization methods, though there has been a recent effort to extend similar ideas to methods that exploit curvature information [15].

Another approach, called the sample average approximation (SAA), replaces the stochastic optimization problem by an ensemble average over a set of randomizations. The ensemble size should be big enough to suppress the crosstalk. The resulting problem may be treated as a deterministic optimization problem; in particular, one may use any optimization method to solve it.

Most theoretical results in SA and SAA assume that the objective function is convex, which is not the case for seismic waveform inversion. In practice, however, one starts from a “reasonable” initial model and hopes to converge to the nearest local minimum, and one would expect SA and SAA to be applicable in the same framework. Understanding the theory behind SA and SAA is then very useful in algorithm design, even though the theoretical guarantees derived under the convexity assumption need not apply.

As mentioned before, the gain in computational efficiency comes at the cost of introducing random crosstalk between the shots into the problem. Also, the influence of noise in the data may be amplified by randomly combining shots. We can reduce the influence of these two types of noise by increasing the batch size, recombining the shots at every iteration, and averaging over past iterations. We present a detailed numerical study to investigate how these different techniques affect the recovery.

The paper is organized as follows. First, we introduce randomized trace estimation in order to cast the canonical waveform inversion problem as a stochastic optimization problem. We describe briefly how SA and SAA can be applied to solve the waveform inversion problem. In Section 3, we review relevant theory for these approaches from the field of stochastic optimization. The corresponding algorithms are presented in Section 4. Numerical results on a subset of the Marmousi model are presented in Section 5 to illustrate the characteristics of the different stochastic optimization approaches. Finally, we discuss the results and present the conclusions.

2. Waveform Inversion and Trace Estimation

The canonical waveform inversion problem is to find the medium parameters for which the modeled data match the recorded data in a least-squares sense [16]. We consider the simplest case of constant-density acoustics and model the data in the frequency domain by solving
$$H(m)\,u = q, \qquad (1)$$
where $H(m)$ is the discretized Helmholtz operator for the squared slowness $m$ (with appropriate boundary conditions), $u$ is the discretized wavefield, and $q$ is the discretized source function; both $u$ and $q$ are column vectors. The data are then given by sampling the wavefield at the receiver locations: $d = Pu$, where $P$ is the sampling operator. Note that all the quantities are monochromatic. We hide the dependence on frequency for notational simplicity.

We denote the corresponding optimization problem as
$$\min_m \;\phi(m) = \tfrac{1}{2}\,\big\|P H^{-1} Q - D\big\|_F^2, \qquad (2)$$
where $D = [d_1, d_2, \ldots]$ is a frequency slice of the recorded data and $Q = [q_1, q_2, \ldots]$ contains the corresponding source functions. Note that the dependence of $H$ on $m$ has been suppressed. $\|\cdot\|_F$ denotes the Frobenius norm, which is defined as $\|A\|_F^2 = \mathrm{tr}(A^{\ast}A)$ (here $^{\ast}$ denotes the complex-conjugate transpose; we will use the same notation for the transpose in case the quantity is real). Note that we assume a fixed-spread acquisition where each receiver sees all the sources.

In practice, $H^{-1}$ is never computed explicitly but is applied via either an LU decomposition (cf., [17–19]) or an iterative solution strategy (cf., [20, 21]). In the worst case, the matrix has to be inverted separately for each frequency and source position. For 3D full waveform inversion, both the cost of inverting the matrix and the number of sources increase by an order of magnitude. Recently, several authors have proposed to reduce the computational cost by randomly combining sources [8–12].

We follow Haber et al. [12] and introduce this encoding in a rigorous manner by using a technique called randomized trace estimation. It was introduced by Hutchinson [13] to efficiently estimate the trace of an implicit matrix. Some recent developments and error estimates can be found in Avron and Toledo [14].

This technique is based on the identity
$$\mathrm{tr}(A) = \mathbb{E}\big[\,w^{\ast} A\, w\,\big], \qquad (3)$$
where $\mathbb{E}$ denotes the expectation over the random vector $w$. The random vectors are chosen such that $\mathbb{E}[w w^{\ast}] = I$ (the identity matrix). The identity can be derived easily by using the cyclic permutation rule for the trace (i.e., $\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$), the linearity of the expectation, and the aforementioned property of $w$. At the end of the section, we discuss different choices of the random vectors $w$. First, we discuss how randomized trace estimation affects the waveform inversion problem.
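
To make the identity concrete, the following minimal Python/NumPy sketch estimates the trace of an explicitly formed test matrix with Rademacher probe vectors; in the inversion itself the matrix is never formed, and only its action on the probes is needed (the function and variable names here are ours, not taken from any particular package).

import numpy as np

def estimate_trace(apply_A, n, K, rng):
    # tr(A) ~ (1/K) * sum_k w_k^* A w_k, with E[w w^*] = I (Rademacher probes)
    est = 0.0
    for _ in range(K):
        w = rng.choice([-1.0, 1.0], size=n)   # random +/-1 entries
        est += w @ apply_A(w)
    return est / K

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = B.T @ B                                   # positive semidefinite test matrix
print(np.trace(A), estimate_trace(lambda v: A @ v, 200, 50, rng))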

Using the trace identity (3) with $A = (P H^{-1} Q - D)^{\ast}(P H^{-1} Q - D)$ and the definition of the Frobenius norm, we have
$$\phi(m) = \tfrac{1}{2}\,\mathbb{E}\left[\,\big\|P H^{-1} Q\, w - D\, w\big\|_2^2\,\right]. \qquad (4)$$
This reformulation of (2) is a stochastic optimization problem. We now briefly outline approaches to solve such optimization problems.

2.1. Sample Average Approximation

A natural approach to take is to replace the expectation over $w$ by an ensemble average over a batch of $K$ realizations,
$$\widehat{\phi}_K(m) = \frac{1}{2K}\sum_{k=1}^{K}\big\|P H^{-1} Q\, w_k - D\, w_k\big\|_2^2.$$
This is often referred to in the literature as the sample average approximation (SAA). The random crosstalk can be controlled by picking a “large enough” batch size $K$. As long as the required batch size is smaller than the actual number of sources, we reduce the computational complexity.
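
As an illustration of how the ensemble average translates into simultaneous sources, the sketch below evaluates the SAA misfit for a fixed batch of random weights. Here solve_helmholtz(m, q) is a hypothetical user-supplied routine returning the monochromatic wavefield for squared slowness m and source q; P, Q, and D are the sampling operator, source matrix, and data matrix from the text.

import numpy as np

def saa_misfit(m, P, Q, D, W, solve_helmholtz):
    # 1/(2K) * sum_k || P H(m)^{-1} (Q w_k) - D w_k ||^2 for a fixed batch W
    K = W.shape[1]
    val = 0.0
    for k in range(K):
        q_sim = Q @ W[:, k]              # randomly combined ("super") shot
        d_sim = D @ W[:, k]              # identically encoded data
        u = solve_helmholtz(m, q_sim)    # one PDE solve per probe vector
        val += 0.5 * np.linalg.norm(P @ u - d_sim) ** 2
    return val / K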

For a fixed model $m$, it is known that the error $|\widehat{\phi}_K(m) - \phi(m)|$ is of order $1/\sqrt{K}$ (cf., [14]). However, it is not the value of the misfit that we are trying to approximate, but the minimizer. Unfortunately, the difference between the minimizers of $\widehat{\phi}_K$ and $\phi$ is not readily analyzed. Instead, we perform a small numerical experiment to get some idea of the performance of the SAA approach for waveform inversion.

We investigate the misfit along the direction of the negative gradient (defined below), that is, the behavior of $\widehat{\phi}_K(m_0 - \nu\, g)$ as a function of the step length $\nu$. The data are generated for the model depicted in Figure 1(a), for 61 colocated, equidistributed sources and receivers along a straight line at 10 m depth and 7 randomly chosen frequencies between 5 and 30 Hz. The source signature is a Ricker wavelet with a peak frequency of 10 Hz. We use a 9-point discretization of the Helmholtz operator with absorbing boundary conditions and solve the system via a (sparse) LU decomposition (cf., [22]). We note that this setup is quite efficient already, since the LU decomposition can be reused for each source. Reduction of the number of sources becomes of paramount importance in 3D, where one is forced to use iterative methods whose costs grow linearly with the number of sources. The search direction $g$ is the gradient of the misfit evaluated at the initial model $m_0$, depicted in Figure 1(b). The gradient is computed in the usual way via the adjoint-state method (cf., [23]). The full gradient as well as the approximated gradients for several batch sizes $K$ are depicted in Figure 2. The error between the full and approximated gradient, caused by the crosstalk, is depicted in Figure 3. As expected, the error decays as $1/\sqrt{K}$. The misfit as a function of $\nu$ for various $K$, as well as the full misfit (no randomization), is depicted in Figure 4. This shows that the minimizer of $\widehat{\phi}_K$ is reasonably close to the minimizer of the full misfit $\phi$, even for a relatively small batch size $K$.
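
The scan in this experiment can be reproduced schematically with a short routine; phi and grad below are placeholders for the (full or sample-average) misfit and its adjoint-state gradient, which are not spelled out here.

import numpy as np

def scan_misfit(m0, phi, grad, nu_max, n_steps=25):
    # evaluate phi(m0 - nu * g) along the negative gradient at m0
    g = grad(m0)
    nus = np.linspace(0.0, nu_max, n_steps)
    return nus, np.array([phi(m0 - nu * g) for nu in nus])

Running this for several batch sizes and for the full misfit gives curves like those compared in Figure 4.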

2.2. Stochastic Approximation

A second alternative is to apply specialized stochastic optimization methods to problem (4) directly. This is often referred to as the stochastic approximation (SA). The main idea of such algorithms is to pick a new random realization in each iteration and possibly average over past iterations to suppress the resulting stochasticity. In the context of the full waveform inversion problem, this gives an iterative algorithm of the form
$$m_{k+1} = m_k - \gamma_k\, \nabla_m\!\left[\frac{1}{2K}\sum_{j=1}^{K}\big\|P H^{-1} Q\, w_j^{(k)} - D\, w_j^{(k)}\big\|_2^2\right],$$
where the batch size $K$ can be as small as 1, the $\gamma_k$ represent step sizes taken by the algorithm, and the superscript $(k)$ emphasizes that a new randomization is used at every iteration (in contrast with the SAA approach).
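
In its most basic form, such an SA iteration can be sketched as follows, with a decaying step size of the form C/k (discussed further in Section 3) and a fresh Gaussian encoding at every iteration; stoch_grad(m, W) is a hypothetical routine returning the gradient of the randomized misfit for the weight matrix W.

import numpy as np

def sa_iterations(m0, stoch_grad, n_src, K, n_iter, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    m = m0.copy()
    for k in range(1, n_iter + 1):
        W = rng.standard_normal((n_src, K))   # new randomization every iteration
        g = stoch_grad(m, W)                  # noisy, unbiased gradient estimate
        m = m - (C / k) * g                   # decaying step size
    return m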

We discuss theoretical performance results and describe SAA and SA in more detail in the next section.

2.3. Accuracy and Efficiency of Randomized Trace Estimation

Efficient calculation of the trace of a positive semidefinite matrix lies at the heart of our approach. Factors that determine the performance of this estimation include the random process for the i.i.d. $w$'s, the size $K$ of the source ensemble, and the properties of the matrix. Hutchinson’s approximation [13], which is based on $w$'s drawn from a Rademacher distribution (i.e., random $\pm 1$ entries), attains the smallest variance for the estimate of the trace. The variance can be used to bound the error via confidence intervals. However, the variance is not the only measure of the error. In particular, Avron and Toledo [14] derive bounds on the batch size in terms of two parameters $\varepsilon$ and $\delta$, defined as follows. A randomized trace estimator $T$ is an $(\varepsilon, \delta)$-approximation of $\mathrm{tr}(A)$ if
$$\Pr\big(\,|T - \mathrm{tr}(A)| \le \varepsilon\,\mathrm{tr}(A)\,\big) \ge 1 - \delta.$$
The expressions for the minimum batch size $K$ for which the relative error is smaller than $\varepsilon$ with probability at least $1 - \delta$ are listed in Table 1 (adapted from Avron and Toledo [14]). Smaller $\varepsilon$'s and $\delta$'s lead to a larger $K$, which in turn leads to more accurate trace estimates with increased probability.

Of course, these bounds depend on the choice of the probability distribution of the i.i.d. $w$'s and the matrix $A$. Aside from obtaining the lowest value for $K$, simplicity of computational implementation is also a consideration. In Table 1, we summarize the performance of four different choices for the $w$'s (a short sketch of how such probes can be generated follows the list), namely,

(1) the Rademacher distribution, that is, $w_i = \pm 1$ with equal probability ($w_i$ denotes the $i$th element of the vector $w$), yielding $\mathbb{E}[w_i] = 0$ and $\mathbb{E}[w_i w_j] = \delta_{ij}$ for all $i, j$. Aside from the fact that this estimator (see Table 1) leads to minimum variance, the advantage of this choice is that it leads to a fast implementation with a small memory imprint. The disadvantage of this method is that the lower bound on the batch size depends on the rank of $A$ and requires a larger $K$ compared to $w$'s drawn from the Gaussian (see Table 1);

(2) the standard normal distribution, that is, $w_i \sim \mathcal{N}(0, 1)$ for all $i$. While the variance for this estimator (see Table 1) is larger than the variance for Hutchinson's estimator, the lower bound for $K$ does not depend on the size or the rank of $A$ and is the smallest of all four methods. This suggests that we can use a fixed value of $K$ for arbitrarily large matrices. However, this method is known to converge more slowly than Hutchinson’s for matrices that have significant energy in the off-diagonals. This choice also requires a more complex implementation with a larger memory imprint;

(3) the fast phase-encoded method, where the $w$'s are selected uniformly from the canonical basis, that is, from $\{e_1, \ldots, e_n\}$ (scaled so that $\mathbb{E}[w w^{\ast}] = I$). This estimator can be combined with a unitary (i.e., $U^{\ast}U = I$) random mixing matrix $U$, which replaces $A$ by $U A U^{\ast}$. The idea is to mix the matrix such that its diagonal entries are evenly distributed. This is important since the unit vectors only sample the diagonal of the matrix. The flatter the distribution of the diagonal elements, the faster the convergence (if all the diagonal elements were the same, a single sample would give the trace exactly).
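
A sketch of how these probe vectors can be generated is given below; the scaling of the unit vectors and the DFT-based mixing are our own choices, made so that E[w w*] = I holds in each case.

import numpy as np

def draw_probes(n, K, kind, rng):
    # n-by-K block of probe vectors w with E[w w^*] = I
    if kind == "rademacher":                       # Hutchinson's estimator
        return rng.choice([-1.0, 1.0], size=(n, K))
    if kind == "gaussian":                         # standard normal probes
        return rng.standard_normal((n, K))
    if kind == "unit":                             # scaled canonical basis vectors
        W = np.zeros((n, K))
        W[rng.integers(0, n, size=K), np.arange(K)] = np.sqrt(n)
        return W
    if kind == "phase":                            # unit vectors mixed by a random unitary (DFT-based)
        W = draw_probes(n, K, "unit", rng).astype(complex)
        phases = np.exp(2j * np.pi * rng.random(n))
        return np.fft.ifft(phases[:, None] * np.fft.fft(W, axis=0), axis=0)
    raise ValueError(kind)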

The lower bounds summarized in Table 1 tell us that Gaussian $w$'s theoretically require the smallest batch size $K$ and hence the fewest PDE solves. However, this result comes at the expense of more complex arithmetic, which can be a practical consideration [8]. Aside from the lowest bound, the estimator based on Gaussian $w$'s has the additional advantage that the bound on $K$ does not depend on the size or rank of the matrix $A$. Hutchinson’s method, on the other hand, depends logarithmically on the rank of $A$ but has the reported advantage that it performs well for nearly diagonal matrices [14]. This has important implications for our application, because our matrix is typically full rank and can be considered nearly diagonal only when our optimization procedure is close to convergence. At the beginning of the optimization, we can expect the residual to be large and the matrix not necessarily diagonally dominant.

We conduct the following stylized experiment to illustrate the quality of the different trace estimators. We solve the discretized Helmholtz equation at 5 Hz for a realistic acoustic model with 301 colocated sources and receivers located at a fixed depth. We compute the matrix $(P H^{-1} Q - D)^{\ast}(P H^{-1} Q - D)$ for a residual given by the difference between simulation results for the hard and smooth models shown in Figure 1. As expected, the resulting matrix, shown in Figure 5, contains significant off-diagonal energy. For the phase-encoded part of the experiment, we use a random mixing matrix based on the DFT, as suggested by Romberg [24]. Such mixing matrices are also commonly found in compressive sensing applications [7, 24–26].

We evaluated the different trace estimators 1000 times for a range of batch sizes $K$. The probability of attaining a given relative error level $\varepsilon$ is estimated by counting the number of times we achieve that error level for each $K$. The results for the different trace estimators and error levels are summarized in Figure 6. For this particular example, we see little difference in performance between the different estimators. The corresponding theoretical bounds on the batch size, as given by Table 1, are listed in Table 2. Clearly, these bounds are overly pessimistic in this case. In our experiments, we observed similar reconstruction behavior when using a finer source/receiver sampling. This suggests that the gain in efficiency will increase with the data size, since we can use larger batch sizes for a fixed downsampling ratio. We also noticed, in this particular example, little or no change in behavior when changing the frequency.
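
The empirical curves in Figure 6 can be generated with a simple Monte Carlo loop of the following form, reusing the draw_probes sketch given above; the trial count and error level are free parameters.

import numpy as np

def success_probability(A, kind, K, eps, n_trials, rng):
    # fraction of trials with relative trace error below eps for batch size K
    n = A.shape[0]
    tr_true = np.trace(A).real
    hits = 0
    for _ in range(n_trials):
        W = draw_probes(n, K, kind, rng)
        est = np.mean([np.vdot(W[:, k], A @ W[:, k]).real for k in range(K)])
        hits += abs(est - tr_true) <= eps * abs(tr_true)
    return hits / n_trials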

3. Optimization

3.1. Sample Average Approximation

The sample average approximation (SAA) is used to solve the following class of stochastic optimization problems:
$$\min_{m \in \mathcal{M}} \;\phi(m) := \mathbb{E}\big[f(m, w)\big], \qquad (10)$$
where $\mathcal{M}$ is the set of admissible models (assumed to be a compact convex set, for example, box constraints on $m$), $w$ is a random vector whose probability distribution is supported on a set $\Omega$, and the function $f(\cdot, w)$ is convex [27]. The last important assumption is the law of large numbers (LLN), that is, the sample average $\widehat{\phi}_K(m)$ defined below converges to $\phi(m)$ with probability 1 as $K \to \infty$. These assumptions are required for most of the known theoretical results about convergence of SAA methods. The convexity and LLN assumptions can be relaxed in the case when $f(\cdot, w)$ is continuous on $\mathcal{M}$ for almost every $w$ and is dominated by an integrable function $g$, so that $|f(m, w)| \le g(w)$ for every $m \in \mathcal{M}$ [28]. Given an optimization problem of type (10), the SAA approach [27] is to generate a random sample $w_1, \ldots, w_K$ and solve the approximate (or sample average) problem
$$\min_{m \in \mathcal{M}} \;\widehat{\phi}_K(m) := \frac{1}{K}\sum_{k=1}^{K} f(m, w_k). \qquad (11)$$
When these assumptions are satisfied, the optimal value of (11) converges to the optimal value of the full problem (10) with probability 1. Moreover, under more technical assumptions on the distribution of the random variable $f(m, w)$, conservative bounds have been derived on the batch size necessary to obtain a particular accuracy level [29, equation (22)]. These bounds do not require the convexity assumptions but instead require assumptions on the local behavior of $\phi$. It is worth underscoring that the “accuracy” of a solution $\widehat{m}$ with respect to the optimal solution $m^{\ast}$ is defined here in terms of the function value difference $\phi(\widehat{m}) - \phi(m^{\ast})$, rather than in terms of $\|\widehat{m} - m^{\ast}\|$ or some other measure in the space of model parameters. From a practical point of view, the SAA approach is appealing because it allows flexibility in the choice of algorithm for the solution of (11). This works on two levels. First, if a faster algorithm becomes available for the solution of (11), it can immediately impact (10). Second, having fixed a $K$ large enough to obtain reasonable accuracy in the solution of (10), one is free to approximately solve a sequence of smaller problems (with increasing batch sizes) with warm starts on the way to solving the full sample average problem [12]. In other words, SAA theory guarantees the existence of a large enough $K$ for which the approximate problem is close to the full problem; however, the algorithm for solving the approximate problem (11) is left completely to the practitioner and in particular may require the evaluation of very few samples at early iterations.
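
The warm-start strategy described above can be summarized in a few lines; solve_saa(m, W) stands for any (hypothetical) routine that approximately minimizes the sample average problem for the fixed batch W, starting from m.

import numpy as np

def saa_warm_start(m0, solve_saa, n_src, batch_sizes, seed=0):
    # approximately solve a sequence of growing SAA problems with warm starts
    rng = np.random.default_rng(seed)
    m = m0
    for K in batch_sizes:                     # e.g. [1, 2, 5, 10, 20]
        W = rng.standard_normal((n_src, K))   # fixed batch for this subproblem
        m = solve_saa(m, W)                   # warm start from the previous solution
    return m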

3.2. Stochastic Approximation

Stochastic approximation (SA) methods go back to Robbins and Monro [30], who considered the root-finding problem $g(m) = 0$ in the case where $g$ cannot be evaluated directly. Rather, one has access to a function $G(m, w)$ for which $\mathbb{E}[G(m, w)] = g(m)$. The approach can be translated to optimization problems of the form $\min_m \phi(m)$ by considering $g$ to be the gradient of $\phi$ and setting $G(m, w) = \nabla_m f(m, w)$. Again, we cannot evaluate $g$ directly, but we have access to $G$ for which $\mathbb{E}[G(m, w)] = \nabla\phi(m)$. More generally, for problems of type (10), Bertsekas and Tsitsiklis [31] and Bertsekas and Tsitsiklis [32] consider iterative algorithms of the form
$$m_{k+1} = m_k - \gamma_k\, g_k,$$
where the $\gamma_k$ are a sequence of step sizes determined a priori that satisfy certain properties, and the $g_k$ can be thought of as noisy unbiased estimates of the gradient (i.e., $\mathbb{E}[g_k] = \nabla\phi(m_k)$). Note that right away we are forced into an algorithmic framework, which never appears in the SAA discussion. The positive step sizes are chosen to satisfy
$$\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$
The main idea is that the step sizes go to zero, but not too fast. A commonly used example of such a sequence of step sizes is $\gamma_k = C/k$ for some constant $C > 0$. The main result of Bertsekas and Tsitsiklis [31] is that if $\nabla\phi$ satisfies the Lipschitz condition
$$\|\nabla\phi(m_1) - \nabla\phi(m_2)\| \le L\,\|m_1 - m_2\|,$$
that is, the changes in the gradient are bounded in norm by changes in the parameter space, and if the directions $g_k$ on average point “close to” the gradient and are not too noisy, then the sequence $\phi(m_k)$ converges, and every limit point of $\{m_k\}$ is a stationary point of $\phi$ (i.e., $\nabla\phi = 0$). Under the stronger assumptions that the level sets of $\phi$ are bounded and the minimum is unique, this guarantees that the algorithms described above will find it.

A similar family of algorithms was studied by Polyak and Juditsky [33], who considered larger step sizes but included averaging of the model estimates into their algorithm. In the context discussed above, the step-size rule above is replaced by the weaker conditions
$$\gamma_k \to 0, \qquad \frac{\gamma_k - \gamma_{k+1}}{\gamma_k} = o(\gamma_k).$$
A particular example of such a sequence cited by the paper is $\gamma_k = C\, k^{-\alpha}$ with $\alpha \in (0, 1)$. The iterative scheme is then given by
$$m_{k+1} = m_k - \gamma_k\, g_k, \qquad \bar{m}_k = \frac{1}{k}\sum_{j=1}^{k} m_j.$$
Under assumptions similar in spirit to the ones in Bertsekas and Tsitsiklis [31], there is a result for the convergence of the averaged iterates to the true solution $m^{\ast}$, namely, $\bar{m}_k \to m^{\ast}$ almost surely and
$$\sqrt{k}\,\big(\bar{m}_k - m^{\ast}\big) \to \mathcal{N}(0, V),$$
where the convergence is in distribution, and the matrix $V$ is in some sense optimal and is related to the Hessian of $\phi$ at the solution $m^{\ast}$.

A more recent report [34] also considers averaging of model iterates in the context of optimizing (not necessarily smooth) convex functions of the form $\phi(m) = \mathbb{E}[f(m, w)]$ over a convex set $\mathcal{M}$. When $\phi$ is smooth, this situation reduces to the previous discussion. Nesterov and Vial [34] choose a finite sequence of step sizes a priori and consider the error in the expected objective value after $N$ iterations. This is similar to the SAA analysis but is much easier to interpret, because now the desired accuracy in the objective value directly translates to the number of iterations of a particular algorithm,
$$m_{k+1} = \Pi_{\mathcal{M}}\big(m_k - \gamma_k\, g_k\big),$$
where $\Pi_{\mathcal{M}}$ is the projection onto the convex set of admissible models $\mathcal{M}$. Unfortunately, the error after $N$ iterations is
$$\mathbb{E}\big[\phi(\bar{m}_N)\big] - \phi(m^{\ast}) \;\le\; \frac{D^2 + M^2\sum_{k=1}^{N}\gamma_k^2}{2\sum_{k=1}^{N}\gamma_k},$$
where $\bar{m}_N$ is the corresponding (weighted) average of the iterates, $D$ is the diameter of the set $\mathcal{M}$ (related to the bounds on $m$ from our earlier example), and $M$ is a uniform bound on $\|g_k\|$, and so the estimate may be overly conservative. If all the $\gamma_k$ are chosen to be uniform, the optimal step size is $\gamma = D/(M\sqrt{N})$, and then the result is simply
$$\mathbb{E}\big[\phi(\bar{m}_N)\big] - \phi(m^{\ast}) \;\le\; \frac{D\,M}{\sqrt{N}}.$$
For a recent survey of stochastic optimization and new robust SA methods, please see Nemirovski et al. [27].

Note that the error rate in the objective values is $O(1/\sqrt{N})$, where the constant depends in a straightforward way on the size of the set $\mathcal{M}$ and the behavior of $f$. Compare this to the $O(1/\sqrt{K})$ error bound for the SAA approach. In contrast to the SAA, the SA approach translates directly into a particular algorithm. This makes it easier to implement for full waveform inversion, but also leaves less freedom for algorithm design than in SAA, where any algorithm can be used to solve the deterministic ensemble average problem.
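
A toy illustration of the averaging idea just described, on an entirely synthetic noisy quadratic objective (unrelated to the seismic problem), is the following.

import numpy as np

rng = np.random.default_rng(1)
m_true = np.array([1.0, -2.0])

def noisy_grad(m):
    # unbiased estimate of the gradient of phi(m) = 0.5 * ||m - m_true||^2
    return (m - m_true) + 0.1 * rng.standard_normal(2)

m, avg = np.zeros(2), np.zeros(2)
for k in range(1, 5001):
    m = m - (1.0 / k ** 0.7) * noisy_grad(m)  # slowly decaying step size
    avg += (m - avg) / k                      # running average of the iterates
print(np.linalg.norm(m - m_true), np.linalg.norm(avg - m_true))

The averaged iterate is typically much closer to the minimizer than the last raw iterate, which is exactly the effect exploited in the next section.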

4. Algorithms

To test the performance of the SAA approach, we chose to use a steepest descent method with an Armijo line search (cf., [35]). Although one could in principle use a second-order method (such as L-BFGS), we chose to use a first-order method to allow for better comparison to the SA results. The pseudocode is presented in Algorithm 1.

While not converged do
  compute $g = \nabla \widehat{\phi}_K(m)$
  find $\gamma$ s.t. $\widehat{\phi}_K(m - \gamma g) \le \widehat{\phi}_K(m) - c\,\gamma\,\|g\|^2$ (Armijo)
  $m \leftarrow m - \gamma g$
end while
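
A compact Python version of Algorithm 1, assuming misfit(m, W) and gradient(m, W) evaluate the sample average misfit and its gradient for the fixed weight matrix W (both hypothetical helpers built on the forward solver), could look as follows.

import numpy as np

def saa_steepest_descent(m0, misfit, gradient, W, n_iter=500, c=1e-4):
    m = m0.copy()
    for _ in range(n_iter):
        g = gradient(m, W)
        f0, gamma = misfit(m, W), 1.0
        # Armijo backtracking: shrink the step until sufficient decrease
        while misfit(m - gamma * g, W) > f0 - c * gamma * np.dot(g, g):
            gamma *= 0.5
        m = m - gamma * g
    return m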

The SA methods are closely related to the steepest descent method. The main difference is that at each iteration a new random realization $W = [w_1, \ldots, w_K]$ is drawn from a prescribed distribution and that the result is averaged over past iterations; we write $\widehat{\phi}_W$ for the misfit evaluated with the current realization. We chose to implement a few modifications to the standard SA algorithms. First, we use an Armijo line search to determine the step size instead of using a prescribed sequence such as that discussed in the previous section. This ensures some descent at each iteration with respect to the current realization of the misfit, and we found that this greatly improved the convergence. Second, we allow for averaging over a limited number of past iterations instead of the full history. This prevents the method from stalling. The pseudocode is presented in Algorithm 2.

While not converged do
  draw $W = [w_1, \ldots, w_K]$ from a prescribed distribution
  compute $g = \nabla \widehat{\phi}_W(m)$
  find $\gamma$ s.t. $\widehat{\phi}_W(m - \gamma g) \le \widehat{\phi}_W(m) - c\,\gamma\,\|g\|^2$ (Armijo)
  $m \leftarrow m - \gamma g$
  update $\bar{m}$, the average of $m$ over the most recent iterations
end while
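
The corresponding sketch of Algorithm 2, in the same style as before, redraws the encoding W every iteration and returns the average over the last M iterates; K, M, and the helper functions misfit and gradient are again our own notation.

import numpy as np

def sa_descent(m0, misfit, gradient, n_src, K=5, M=10, n_iter=500, c=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    m, history = m0.copy(), []
    for _ in range(n_iter):
        W = rng.standard_normal((n_src, K))   # fresh randomization each iteration
        g = gradient(m, W)
        f0, gamma = misfit(m, W), 1.0
        while misfit(m - gamma * g, W) > f0 - c * gamma * np.dot(g, g):
            gamma *= 0.5                      # Armijo on the current realization
        m = m - gamma * g
        history = (history + [m])[-M:]        # keep only the last M iterates
    return np.mean(history, axis=0)           # average over the limited history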

5. Results

For the numerical experiments, we use the true and initial squared-slowness models depicted in Figure 1. The data are generated for 61 equispaced, colocated sources and receivers at 10 m depth and 7 randomly chosen (but fixed) frequencies between 5 and 30 Hz. The latter strategy is inspired by results from compressive sensing (cf., [7, 36, 37]). The basic idea is to turn aliases that are introduced by sub-Nyquist sampling into random noise.

The Helmholtz operator is discretized on a grid with 10 m spacing, using a 9-point finite difference stencil and absorbing boundary conditions. The point sources are represented as narrow Gaussians. As a source signature, we use a Ricker wavelet with a peak frequency of 10 Hz. The noise is Gaussian with a prescribed SNR.
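
For completeness, one common frequency-domain expression for a Ricker wavelet with peak frequency f_peak is sketched below; the exact normalization used in our experiments is not stated in the text, so this particular form is an assumption.

import numpy as np

def ricker_spectrum(f, f_peak=10.0):
    # one common convention: (2/sqrt(pi)) * f^2 / f_peak^3 * exp(-(f/f_peak)^2)
    return 2.0 / np.sqrt(np.pi) * f ** 2 / f_peak ** 3 * np.exp(-(f / f_peak) ** 2)

freqs = np.sort(np.random.default_rng(2).uniform(5.0, 30.0, 7))  # 7 random frequencies in 5-30 Hz
weights = ricker_spectrum(freqs)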

We run each of the optimization methods for 500 iterations and compare the performance for various batch sizes and noise levels to the result of steepest descent on the full problem. Remember that by using small batch sizes, the iterations are very cheap, so we can afford to do more of them. The random vectors $w$ are drawn from a Gaussian distribution with zero mean and unit variance. We chose the Gaussian because the theoretical bounds on the batch size do not depend on properties of the residual matrix. Although this matrix changes constantly during the optimization, we can at least expect a uniform quality of the approximation.

In a realistic application, one might want to add a regularization term. In particular, this would prevent the overfitting that we observe in the noisy case. Note that limiting the number of iterations also serves as a form of regularization [38].

5.1. Sample Average Approximation

We choose a fixed set of Gaussian random vectors with zero mean and unit variance and run the steepest descent algorithm presented previously on the resulting deterministic optimization problem. The results after 500 iterations on data without noise are shown in the first column of Figure 7. The error between the recovered and true model is shown in Figure 8(a). As a reference, the error between the true and recovered model for the inversion with all the sequential sources is also shown. As expected, the recovery is better for larger batch sizes. The recovered models for data with noise are shown in the second (SNR = 20 dB) and third (SNR = 10 dB) columns of Figure 7. The corresponding recovery errors are shown in Figures 8(b) and 8(c), respectively. These show that the SAA approach starts overfitting at an earlier stage than the full inversion. Also, we are not able to reach the same model error as the full inversion.

5.2. Stochastic Approximation

We run the stochastic descent algorithm for varying batch sizes and history sizes.

The results obtained without averaging are shown in Figure 9. The columns represent different batch sizes, while the rows represent different noise levels. The recovery errors for the different batch sizes and noise levels are shown in Figure 10. In the noiseless case, we are able to achieve the same recovery error as the full inversion with only one simultaneous source. When noise is present in the data, one simultaneous source is not enough, however. Still, we can achieve the same recovery error as the full problem with only 10 simultaneous sources. This yields an order of magnitude improvement in our computation, since the total number of iterations needed by the stochastic method to achieve a given level of accuracy is roughly the same as required by a deterministic first-order method used on the full system, but each stochastic iteration requires ten times fewer PDE solves than a deterministic iteration on the full system.

Results obtained with averaging over the past 10 iterations are shown in Figure 11. The rows represent different batch sizes, while the columns represent different noise levels. The corresponding recovery errors are shown in Figure 12. It shows that averaging helps to overcome some of the noise sensitivity, and we are now able to achieve a good reconstruction with only 5 simultaneous sources. Also, the averaging damps the irregularity of the convergence somewhat.

Finally, we show the result obtained by averaging over the full history in Figure 13. The corresponding recovery error is shown in Figure 14. It shows that too much averaging slows down the convergence.

6. Conclusions and Discussion

Following Haber et al. [39], we reduce the dimensionality of full waveform inversion via randomized trace estimation. This reduction comes at the cost of introducing random crosstalk between the sources into the updates. The resulting optimization problem can be treated as a stochastic optimization problem. Theory for such methods goes back to the 1950s and justifies the approach presented by Krebs et al. [8]. In particular, we use theoretical results by Avron and Toledo [14] on randomized trace estimation to get bounds for the batch size needed to approximate the misfit to a given accuracy level with a given probability. Numerical tests show, however, that these bounds may be overly pessimistic and that we get reasonable approximations for modest batch sizes.

Theory from the field of stochastic optimization suggests several approaches to tackle the optimization problem and reduce the influence of the crosstalk introduced by the randomization. The first approach, the sample average approximation, dictates the use of a fixed set of random sources and relies solely on increasing the batch size to get rid of the crosstalk. The stochastic approximation, on the other hand, dictates that we redraw the randomization each iteration and average over the past in order to suppress the stochasticity of the gradients.

We note that, as opposed to randomized dimensionality reduction, several authors have proposed methods for deterministic dimensionality reduction [39, 40]. These techniques are related to optimal experimental design and try to determine the source combination that somehow optimally illuminates the target. It is not quite clear how such methods compare to the randomized approach discussed here. It is clear, however, that by using random superpositions we have access to powerful results from the field of compressive sensing to further improve the reconstruction. Li and Herrmann [11] use sparse recovery techniques instead of Monte Carlo sampling to get rid of the crosstalk.

In our experiments, we were able to obtain results that are comparable to the full optimization with a small fraction of the number of sources. In the noiseless case, we needed only one simultaneous source for the SA approach. Even with noisy data, five simultaneous sources proved sufficient. This is a very promising result, since using five simultaneous sources for the SA method means that every iteration requires 20 times fewer PDE solves, which directly translates to a 20x computational speedup compared to a first-order deterministic method. The key point is that both SA and the full deterministic approach require roughly the same number of iterations to achieve the same accuracy.

Averaging over a limited number of past iterations improved the results for a fixed batch size and allows for the use of fewer simultaneous sources. However, too much averaging slows down the convergence.

The results of the SA approach, where a new realization of the random vectors is drawn at every iteration, are superior to the SAA results, where the random vectors are fixed. However, one could use a more sophisticated (possibly black box) optimization method for the SAA approach to get a similar result with fewer iterations. The tradeoff between using a smaller batch size and first-order methods (i.e., more iterations) versus using a larger batch size and second-order methods (i.e., fewer iterations) needs to be investigated further. Random superposition of shots only makes sense if those shots are sampled by the same receivers. In particular, this hampers straightforward application to marine seismic data. One way to get around this is to partition the data into blocks that are fully sampled. However, this would not give the same amount of reduction in the number of shots, because only shots that are relatively close to each other can be combined without losing too much data.

The type of encoding used will most likely affect the behavior of both SA and SAA methods. It remains to be investigated which encoding is most suitable for waveform inversion.

Acknowledgments

The authors thank Eldad Haber and Mark Schmidt for insightful discussions on trace estimation and stochastic optimization. This work was in part financially supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (no. 22R81254) and the Collaborative Research and Development Grant DNOISE II (no. 375142-08). This research was carried out as part of the SINBAD II project with support from the following organizations: BG Group, BP, Chevron, ConocoPhillips, Petrobras, Total SA, and WesternGeco.