Abstract

Fixed-basis and variable-basis approximation schemes are compared for the problems of function approximation and functional optimization (also known as infinite programming). Classes of problems are investigated for which variable-basis schemes with sigmoidal computational units perform better than fixed-basis ones, in terms of the minimum number of computational units needed to achieve a desired error in function approximation or approximate optimization. Previously known bounds on the accuracy are extended, with better rates, to families of 𝑑-variable functions whose actual dependence is on a subset of 𝑑𝑑 variables, where the indices of these 𝑑 variables are not known a priori.

1. Introduction

In functional optimization problems, also known as infinite programming problems, functionals have to be minimized with respect to functions belonging to subsets of function spaces. Function-approximation problems, the classical problems of the calculus of variations [1] and, more generally, all optimization tasks in which one has to find a function that is optimal in a sense specified by a cost functional belong to this family of problems. Such functions may express, for example, the routing strategies in communication networks, the decision functions in optimal control problems and economic ones, and the input/output mappings of devices that learn from examples.

Experience has shown that optimization of functionals over admissible sets of functions made up of linear combinations of relatively few basis functions with a simple structure and depending nonlinearly on a set of “inner” parameters (e.g., feedforward neural networks with one hidden layer and linear output activation units) often provides surprisingly good suboptimal solutions. In such approximation schemes, each function depends on both external parameters (the coefficients of the linear combination) and inner parameters (the ones inside the basis functions). These are examples of variable-basis approximators since the basis functions are not fixed but their choice depends on the one of the inner parameters. In contrast, classical approximation schemes (such as the Ritz method in the calculus of variations [1]) do not use inner parameters but employ fixed basis functions, and the corresponding approximators exhibit only a linear dependence on the external parameters. Then, they are called fixed-basis or linear approximators. In [2], certain variable-basis approximators were applied to obtain approximate solutions to functional optimization problems. This technique was later formalized as the extended Ritz method (ERIM) [3] and was motivated by the innovative and successful application of feedforward neural networks in the late 80 s. For experimental results and theoretical investigations about the ERIM, see [27] and the references therein.

The basic motivation to search for suboptimal solutions of these forms is quite intuitive: when the number of basis functions becomes sufficiently large, the convergence of the sequence of suboptimal solutions to an optimal one may be ensured by suitable properties of the set of basis functions, the admissible set of functions, and the functional to be optimized [1, 5, 8]. Computational feasibility requirements (i.e., memory occupancy and time needed to find sufficiently good values for the parameters) make it crucial to estimate the minimum number of computational units needed by an approximator to guarantee that suboptimal solutions are “sufficiently close” to an optimal one. Such a number plays the role of “model complexity” of the approximator and can be studied with tools from linear and nonlinear approximation theory [9, 10].

As compared to fixed-basis approximators, in variable-basis ones the nonlinearity of the parametrization of the variable basis functions may cause the loss of useful properties of best approximation operators [11], such as uniqueness, homogeneity, and continuity, but may allow improved rates of approximation or approximate optimization [9, 1214]. Then, to justify the use of variable-basis schemes instead of fixed-basis ones, it is crucial to investigate families of function-approximation and functional optimization problems for which, for a given desired accuracy, variable-basis schemes require a smaller number of computational units than fixed-basis ones. This is the aim of this work.

In the paper, the approximate solution of certain function-approximation and functional optimization problems via fixed- and variable-basis schemes is investigated. In particular, families of problems are presented, for which variable-basis schemes of a certain kind perform better than any fixed-basis one, in terms of the minimum number of computational units needed to achieve a desired worst-case error. Propositions 2.4, 2.7, 2.8, and 3.2 are the main contributions, which are presented after the exposition of results available in the literature.

The paper is organized as follows. Section 2 compares variable- and fixed-basis approximation schemes for function-approximation problems, which are particular instances of functional optimization. Section 3 extends the estimates to some more general families of functional optimization problems through the concepts of modulus of continuity and modulus of convexity of a functional. Section 4 is a short discussion.

2. Comparison of Bounds for Fixed- and Variable-Basis Approximation

Here and in the following, the “big 𝑂,” “big Ω,” and “big Θ” notations [18] are used. For two functions 𝑓,𝑔(0,+), one writes 𝑓=𝑂(𝑔) if and only if there exist 𝑀>0 and 𝑥0>0 such that |𝑓(𝑥)|𝑀|𝑔(𝑥)| for all 𝑥>𝑥0, 𝑓=Ω(𝑔) if and only if 𝑔=𝑂(𝑓), and 𝑓=Θ(𝑔) if and only if both 𝑓=𝑂(𝑔) and 𝑓=Ω(𝑔) hold. In order to be able to use such notations also for multivariable functions, in the following it is assumed that all their arguments are fixed with the exception of one of them (more precisely, the argument 𝜀).

Two approaches have been adopted in the literature to compare the approximation capabilities of fixed- and variable-basis approximation schemes (see also [15] for a discussion on this topic). In the first one, one fixes the family of functions to be approximated (e.g., the unit ball in a Sobolev space [16]), then one finds bounds on the worst-case approximation error for functions belonging to such a family, for various approximation schemes (fixed- and variable-basis ones). The second approach, initiated by Barron [12, 17], fixes a variable-basis approximation scheme (e.g., the set of one-hidden-layer perceptrons with a given upper bound on the number of sigmoidal computational units) and searches for families of functions that are well approximated by such an approximation scheme. Then, for these families of functions, the approximation capability of the variable-basis approximation scheme is compared with the ones of fixed-basis approximation schemes. In this context, one is interested in finding cases for which, the number of computational units being the same, one has upper bounds on the worst-case approximation error for certain variable-basis approximation schemes that are smaller than corresponding lower bounds for any fixed-basis one, implying that such variable-basis schemes have better approximation capabilities than every fixed-basis one.

One problem of the first approach is that, for certain families of smooth functions to be approximated, the bounds on the worst-case approximation error obtained for fixed- and variable-basis approximation schemes are very similar. In particular, typically one obtains the so-called Jackson rate of approximation [4] 𝑛=Θ(𝜀𝑑/𝑚), where 𝑛 is the number of computational units, 𝜀>0 is the worst-case approximation error, 𝑚 is a measure of smoothness, and 𝑑 is the number of variables on which such functions depend. Following the second approach, it was shown in [12, 17] that, for certain function-approximation problems, variable-basis schemes exhibit some advantages over fixed-basis ones (see Sections 2.1 and 2.2, where extensions of some results from [12, 17] are also derived).

In Section 2.1, some bounds in the 2-norm are considered, whereas Section 2.2 investigates bounds in the supnorm. Estimates in the 2-norm can be applied, for example, to investigate the approximation of the optimal policies in static team optimization problems [19]. Estimates in the supnorm are required, for example, to investigate the approximation of the optimal policies in dynamic optimization problems with a finite number of stages [20]. Indeed, for such problems, the supnorm can be used to analyze the error propagation from one stage to the next one, while this is not the case for the 2-norm [20]. Moreover, it provides guarantees on the approximation errors in the design of the optimal decision laws.

2.1. Bounds in the 2-Norm

The following Theorem 2.1 from [12] describes a quite general set of functions of 𝑑 real variables (described in terms of their Fourier distributions) whose approximation from variable-basis approximation schemes with sigmoidal computational units requires 𝑂(𝜀2) computational units, where 𝜀>0 is the desired worst-case approximation error measured in the 2-norm. Recall that a sigmoidal function is defined in general as a bounded measurable function 𝜎 such that 𝜎(𝑦)1 as 𝑦+ and 𝜎(𝑦)0 as 𝑦 [21]. For 𝐶>0, 𝑑 a positive integer, and 𝐵 a bounded subset of 𝑑 containing 0, by Γ𝐵,𝐶 we denote the set of functions 𝑓𝑑 having a Fourier representation of the form𝑓(𝑥)=𝑑𝑒𝑖𝜔𝑥𝐹(𝑑𝜔)(2.1) for some complex-valued measure 𝐹(𝑑𝜔)=𝑒𝑖𝜃(𝜔)𝐹(𝑑𝜔) (where 𝐹(𝑑𝜔) and 𝜃(𝜔) are the magnitude distribution and the phase at the pulsation 𝜔, resp.) such that𝑑sup𝑥𝐵||||𝜔,𝑥𝐹(𝑑𝜔)𝐶,(2.2) where , is the standard inner product on 𝑑. Functions in Γ𝐵,𝐶 are continuously differentiable on 𝐵 [12]. When 𝐵 is the hypercube [1,1]𝑑, the inequality (2.2) reduces to𝑑𝜔1𝐹(𝑑𝜔)𝐶,(2.3) where 1 denotes the 𝑙1-norm.

For a probability measure 𝜇 on 𝐵, we denote by 2(𝐵,𝜇) the Hilbert space of functions 𝑔𝐵 with inner product 𝑔1,𝑔22(𝐵,𝜇)=𝐵𝑔1(𝑥)𝑔2(𝑥)𝜇(𝑑𝑥) and induced norm 𝑔2(𝐵,𝜇)=𝑔,𝑔2(𝐵,𝜇). When there is no risk of confusion, the simpler notation 𝑔2 is used instead of 𝑔2(𝐵,𝜇).

Theorem 2.1 (see [12, Theorem  1]). For every 𝑓Γ𝐵,𝐶, every sigmoidal function 𝜎, every probability measure 𝜇 on 𝐵, and every 𝑛1, there exist 𝑎𝑘𝑑, 𝑏𝑘,𝑐𝑘, and 𝑓𝑛𝐵 of the form 𝑓𝑛(𝑥)=𝑛𝑘=1𝑐𝑘𝜎𝑎𝑘,𝑥+𝑏𝑘+𝑐0,(2.4) such that 𝑓𝑓𝑛2=𝐵𝑓(𝑥)𝑓𝑛(𝑥)2𝜇(𝑑𝑥)2𝐶𝑛.(2.5)

Variable-basis approximators of the form (2.4) are called one-hidden-layer perceptrons with 𝑛 computational units. Formula (2.5) shows that at most 𝑛1=(2𝐶)2𝜀2(2.6) computational units are required to guarantee a desired worst-case approximation error 𝜀 in the 2-norm, when variable-basis approximation schemes of the form (2.4) are used to approximate functions belonging to the set Γ𝐵,𝐶.

In contrast to this, Theorem 2.2 from [12] shows that, when 𝐵 is the unit hypercube [0,1]𝑑 and 𝜇=𝜇𝑢 is the uniform probability measure on [0,1]𝑑, for the same set of functions Γ𝐵,𝐶 the best linear approximation scheme requires Ω(𝜀𝑑) computational units in order to achieve the same worst-case approximation error 𝜀. The set of all linear combinations of 𝑛 fixed basis functions 1,2,,𝑛 in a linear space is denoted by span(1,2,,𝑛).

Theorem 2.2 (see [12, Theorem  6]). For every 𝑛1 and every choice of fixed basis functions 1,2,,𝑛2([0,1]𝑑,𝜇𝑢), one has sup𝑓Γ𝑑[0,1],𝐶inf𝑓𝑛span1,2,,𝑛[]0,1𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢𝐶(𝑑𝑥)16𝜋𝑒𝜋1𝑑12𝑛1/𝑑.(2.7)

Remark 2.3. Inspection of the proof of [12, Theorem  6] shows that the factors 1/8 and 1/𝑛, which appear in the original statement of the theorem, have to be replaced by 1/16 and 1/2𝑛 in (2.7), respectively.
Inspection of the proof of Theorem 2.2 in [12] shows also that the lower bound (2.7) still holds if the set Γ[0,1]𝑑,𝐶 is replaced by either 𝒮1[]=𝑓0,1𝑑𝐶𝑓(𝑥)=2𝜋𝑙1cos(𝜔𝑥)𝜔=2𝜋𝑙for𝑙{0,1,}𝑑[],𝑙(0,,0)𝑓0,1𝑑𝐶𝑓(𝑥)=2𝜋(2.8) or 𝒮2[]=𝑓0,1𝑑||𝛽||𝐶𝑓(𝑥)=𝛽cos(𝜔𝑥)2𝜋𝑙1,𝜔=2𝜋𝑙for𝑙{0,1,}𝑑[],𝑙(0,,0)𝑓0,1𝑑||𝛽||𝐶𝑓(𝑥)=𝛽,,2𝜋(2.9) where 𝑙 denotes any multi-index and 𝑙1 its norm (i.e., the sum of the components of 𝑙, which are nonnegative). Obviously, when 𝐵 is the unit hypercube [0,1]𝑑, the upper bound (2.5) still holds under one of these two replacements, since 𝒮1𝒮2Γ[0,1]𝑑,𝐶.

The inequality (2.7) implies that for a uniform probability measure on [0,1]𝑑, at least 𝑛2=12𝐶16𝜋𝑒𝜋1𝑑𝑑𝜀𝑑(2.10) computational units are required to guarantee a desired worst-case approximation error 𝜀 in the 2-norm, when fixed-basis approximation schemes of the form span(1,2,,𝑛) are used to approximate functions in Γ[0,1]𝑑,𝐶. Then, at least for a sufficiently small value of 𝜀, Theorems 2.1 and 2.2 show that for 𝑑>2, variable-basis approximators of the form (2.4) provide a smaller approximation error than any fixed-basis one for functions in Γ[0,1]𝑑,𝐶, the number of computational units being the same.

It should be noted that, for fixed 𝐶 and 𝜀, the estimate (2.6) is constant with respect to 𝑑, whereas the one (2.10) goes to 0 as 𝑑 goes to +. So, a too small value of (1/2)(𝐶/16𝜋𝑒𝜋1𝑑)𝑑 in the bound (2.10) for fixed-basis approximation may make the theoretical advantage of variable-basis approximation of impractical use, since for large 𝑑 it would be guaranteed only for sufficiently small 𝜀 (depending on 𝐶, too). In the following, families of 𝑑-variable functions are considered, for which this drawback is mitigated. These are families of 𝑑-variable functions whose actual dependence is on a subset of 𝑑𝑑 variables, where the indices of these 𝑑 variables are not known a priori. These families are of interest, for example, in machine learning applications, for problems with redundant or correlated features. In this context, each of the 𝑑 real variables represents a feature (e.g., a measure of some physical property of an object), and one is interested in learning a function of these features on the basis of a set of supervised examples. As it often happens in applications, only a small subset of the features is useful for the specific task (typically, classification or regression), due to the presence of redundant or correlated features. Then, one may assume that the function to be learned depends only on subset of 𝑑𝑑 features but one may not know a priori which particular subset is. The problem of finding such a subset (or finding a subset of features of sufficiently small cardinality 𝑑 on which the function mostly depends, when the function depends on all the 𝑑 features) is called the feature-selection problem [22].

For 𝑑 a positive integer and 𝑑 its multiple, Γ[0,1]𝑑,𝑑,𝐶 denotes the subset of functions in Γ[0,1]𝑑,𝐶 that depend only on 𝑑 of their possible 𝑑 arguments.

Proposition 2.4. For every 𝑛1 and every choice of fixed basis functions 1,2,,𝑛2([0,1]𝑑,𝜇𝑢), for 𝑛(𝑑+1)/2 one has sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛[]0,1𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢𝐶(𝑑𝑥)8𝜋(2.11) and for 𝑛>(𝑑+1)/2sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛[]0,1𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢𝑑(𝑑𝑥)𝑑1/𝑑𝐶16𝜋𝑒𝜋1𝑑12𝑛1/𝑑.(2.12)

Proof. The proof is similar to the one of [12, Theorem  6]. The following is a list of the changes to that proof, needed to derive (2.11) and (2.12). We denote by 𝑙0 the number of nonzero components of the multi-index 𝑙. Proceeding likewise in the proof of [12, Theorem  6], we get sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛[]0,1𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢𝐶(𝑑𝑥)8𝜋𝑚,(2.13) where 𝑚 is the smallest positive integer 𝑚 such that the number 𝑁𝑚,𝑑,𝑑 of multi-indices 𝑙{0,1,}𝑑 with norm 𝑙1𝑚 and that satisfy the constraint 𝑙0𝑑 is larger than or equal to 2𝑛. More precisely, (2.13) is obtained by observing that for such an integer 𝑚 the set 𝒮2Γ[0,1]𝑑,𝑑,𝐶 contains at least 2𝑛 orthogonal cosinusoidal functions with 2([0,1]𝑑,𝜇𝑢)-norm equal to 𝐶/4𝜋𝑚 and applying [12, Lemma  6], which states that for any orthonormal basis of a 2𝑛-dimensional space, there does not exist a linear subspace of dimension 𝑛 having distance smaller than 1/2 from every basis function in such an orthonormal basis. The constraint 𝑙0𝑑 is not present in the proof of [12, Theorem  6] and is due to the specific form of the set Γ[0,1]𝑑,𝑑,𝐶. Because of such a constraint, the functions in 𝒮2 with 𝑙0>𝑑 do not belong to Γ[0,1]𝑑,𝑑,𝐶.
Then we get𝑁𝑚,𝑑,𝑑=𝑑𝑚+𝑑for𝑑=𝑑or1𝑚𝑑𝑑,(2.14)𝑚+𝑑𝑁𝑚,𝑑,𝑑𝑑𝑑𝑚+𝑑𝑑𝑑for𝑑apositiveinteger>1and𝑚>1.(2.15) Indeed, for 𝑑=𝑑 the equality (2.14) follows recalling that the number of different ways of placing 𝑁𝑜 identical objects in 𝑁𝑏 distinct boxes is 𝑁𝑜+𝑁𝑏𝑁1𝑏1 [23, Theorem  5.1], and for this case it is the same estimate as the one obtained in the proof of [12, Theorem  6]. Similarly, for 1𝑚𝑑 the constraint 𝑙0𝑑 is redundant and we get again (2.14). Finally, for 𝑑/𝑑 a positive integer larger than 1 and 𝑚>1, the upper bound in (2.15) is obtained ignoring the constraint 𝑙0𝑑, whereas the lower bound is obtained as follows. First, we partition the set of 𝑑 variables into 𝑑/𝑑 subsets of cardinality 𝑑, and then we apply to each subset the estimate 𝑁𝑚,𝑑,𝑑=𝑚+𝑑𝑑 obtained by replacing 𝑑 by 𝑑 in (2.14). In this way, the multi-index 𝑙=0 is counted 𝑑/𝑑 times (one for each subset), but the final estimate 𝑁𝑚,𝑑,𝑑𝑑/𝑑𝑚+𝑑𝑑 so obtained holds since for 𝑚>1 there are at least other 𝑑/𝑑1 multi-indices that have been not counted in this process.
In the following, we apply (2.14) and (2.15) for 𝑚=1 and 𝑚>1, respectively. For 𝑚=1, the condition 𝑁𝑚,𝑑,𝑑2𝑛 becomes𝑑1+𝑑=𝑑+12𝑛,(2.16) so 𝑚=1 for 𝑛(𝑑+1)/2. This, combined with (2.13), proves (2.11).
Now, likewise in the proof of [12, Theorem  6], for 𝑚>1 we exploit a bound from Stirling’s formula, according to which 𝑚+𝑑𝑑(𝑚/𝑒𝜋1𝑑)𝑑, so the condition 𝑁𝑚,𝑑,𝑑2𝑛 holds if we impose𝑑𝑑𝑚𝑒𝜋1𝑑𝑑2𝑛,(2.17) which is equivalent to 𝑒𝑚𝜋1𝑑(2𝑛)1/𝑑𝑑𝑑1/𝑑(2.18) (note that, for 𝑛>(𝑑+1)/2, the value of 𝑚 provided by (2.18) is indeed larger than 1, as required for the application of (2.15)). Since 2𝑒𝜋1𝑑(2𝑛)1/𝑑𝑑𝑑1/𝑑𝑒𝜋1𝑑(2𝑛)1/𝑑𝑑𝑑1/𝑑(2.19) we conclude that 𝑚2𝑒𝜋1𝑑(2𝑛)1/𝑑(𝑑/𝑑)1/𝑑 for 𝑛>(𝑑+1)/2. This, together with (2.13), proves the statement (2.12).

For the case considered by Proposition 2.4, an uniform probability measure on [0,1]𝑑, and 0<𝜀<𝐶/8𝜋, formulas (2.11) and (2.12) show that at least 𝑛3=max𝑑+12,12𝑑𝑑𝐶16𝜋𝑒𝜋1𝑑𝑑𝜀𝑑(2.20) computational units are required to guarantee a desired worst-case approximation error 𝜀 in the 2-norm, when fixed-basis approximation schemes of the form span(1,2,,𝑛) are used to approximate functions in Γ[0,1]𝑑,𝑑,𝐶.

Remark 2.5. The quantity 𝑑 in Proposition 2.4 has to be interpreted as an effective number of variables for the family of functions Γ[0,1]𝑑,𝑑,𝐶 to be approximated. Roughly speaking, the flexibility of the neural network architecture (2.4) allows one to identify, for each 𝑓Γ[0,1]𝑑,𝑑,𝐶, the 𝑑 variables on which it actually depends, whereas fixed-basis approximation schemes have not this flexibility. Indeed, differently from the lower bound (2.10), for fixed 𝐶, 𝜀, and 𝑑 the lower bound (2.20) goes to + as 𝑑 goes to +. Finally, similar remarks as in Remark 2.3 apply to Proposition 2.4.

2.2. Bounds in the Supnorm

The next result is from [17] and is analogous to Theorem 2.1, but it measures the worst-case approximation error in the supnorm.

Theorem 2.6 (see [17, Theorem  2]). For every 𝑓Γ𝐵,𝐶 and every 𝑛1, there exists 𝑓𝑛𝐵 of the form (2.4) such that sup𝑥𝐵||𝑓(𝑥)𝑓𝑛||(𝑥)120𝐶𝑛𝑑.(2.21)

Upper bounds in the supnorm similar to the one from Theorem 2.6 are given, for example, in [24, 25]. Moreover, for 𝑓Γ[0,1]𝑑,𝑑,𝐶, the following estimate holds.

Proposition 2.7. For every 𝑓Γ[0,1]𝑑,𝑑,𝐶 and every 𝑛1, there exists 𝑓𝑛[0,1]𝑑 of the form (2.4) such that sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛||(𝑥)120𝐶𝑛𝑑.(2.22)

Proof. Each function 𝑓Γ[0,1]𝑑,𝑑,𝐶 depends on 𝑑 arguments; let 𝑖1,,𝑖𝑑 be their indices. Let 𝑓[0,1]𝑑 be defined by 𝑓(𝑦)=𝑓(𝑥), where 𝑥𝑖1=𝑦1,,𝑥𝑖𝑑=𝑦𝑑, and all the other components of 𝑥 are arbitrary in [0,1]𝑑𝑑. Then 𝑓Γ[0,1]𝑑,𝐶, so by Theorem 2.6 there exists an approximation 𝑓𝑛[0,1]𝑑 made up of 𝑛 sigmoidal computational units and a constant term such that sup𝑥[0,1]𝑑|𝑓𝑓(𝑥)𝑛(𝑥)|(120𝐶/𝑛)𝑑. Finally, we observe that 𝑓𝑛 can be extended to a function 𝑓𝑛[0,1]𝑑 of the form (2.4) such that sup𝑥[0,1]𝑑|𝑓(𝑥)𝑓𝑛(𝑥)|=sup𝑥[0,1]𝑑|𝑓𝑓(𝑥)𝑛(𝑥)|, then one obtains (2.22).

The estimates (2.21) and (2.22) show that at most 𝑛4=(120𝐶)2𝑑2𝜀2,𝑛5=(120𝐶)2𝑑2𝜀2(2.23) computational units, respectively, are required to guarantee a desired worst-case approximation error 𝜀 in the supnorm, when variable-basis approximation schemes of the form (2.4) are used to approximate functions belonging to the sets Γ𝐵,𝐶 and Γ[0,1]𝑑,𝑑,𝐶, respectively.

The next proposition, combined with Theorem 2.6 and Proposition 2.7, allows one to compare the approximation capabilities of fixed- and variable-basis schemes in the supnorm, showing cases for which the upper bounds (2.21) and (2.22) are smaller than one of the corresponding lower bounds (2.24)–(2.26), at least for 𝑛 sufficiently large.

Proposition 2.8. For every 𝑛1 and every choice of fixed bounded and 𝜇𝑢-measurable basis functions 1,2,,𝑛[0,1]𝑑, the following hold.(i)For the approximation of functions in Γ[0,1]𝑑,𝐶, one has sup𝑓Γ𝑑[0,1],𝐶inf𝑓𝑛span1,2,,𝑛sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛||𝐶(𝑥)16𝜋𝑒𝜋1𝑑12𝑛1/𝑑.(2.24)(ii)For the approximation of functions in Γ[0,1]𝑑,𝑑,𝐶, for 𝑛(𝑑+1)/2, one has sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛(||𝐶𝑥)8𝜋(2.25) whereas for 𝑛>(𝑑+1)/2sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span(1,2,,𝑛)sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛||𝑑(𝑥)𝑑1/𝑑𝐶16𝜋𝑒𝜋1𝑑12𝑛1/𝑑.(2.26)

Proof. For each bounded and 𝜇𝑢-measurable function 𝑔[0,1]𝑑, we get [0,1]𝑑𝑔2(𝑥)𝜇𝑢(𝑑𝑥)sup[]𝑥0,1𝑑||||𝑔(𝑥)[0,1]𝑑𝜇𝑢(𝑑𝑥)=sup[]𝑥0,1𝑑||||,𝑔(𝑥)(2.27) so sup𝑓Γ𝑑[0,1],𝐶inf𝑓𝑛span1,2,,𝑛[0,1]𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢(𝑑𝑥)sup𝑓Γ𝑑[0,1],𝐶inf𝑓𝑛span1,2,,𝑛sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛||,(𝑥)sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛[0,1]𝑑𝑓(𝑥)𝑓𝑛(𝑥)2𝜇𝑢(𝑑𝑥)sup𝑓Γ𝑑[0,1],𝑑,𝐶inf𝑓𝑛span1,2,,𝑛sup[]𝑥0,1𝑑||𝑓(𝑥)𝑓𝑛||.(𝑥)(2.28) Then we get the lower bounds (2.24)–(2.26) by (2.7), (2.11), and (2.12), respectively.

For the case considered by Proposition 2.8, the estimate (2.24) implies that at least 𝑛2 computational units are required to guarantee a desired worst-case approximation error 𝜀 in the supnorm, when fixed-basis approximation schemes of the form span(1,2,,𝑛) are used to approximate functions in Γ[0,1]𝑑,𝐶. Similarly, for 0<𝜀<𝐶/8𝜋, the bounds (2.25) and (2.26) imply that at least 𝑛3 computational units are required when Γ[0,1]𝑑,𝐶 is replaced by Γ[0,1]𝑑,𝑑,𝐶. One can observe that, for each 𝑑, 𝑑 and 𝐶, each of the lower bounds (2.25) and (2.26) is larger than (2.24). Moreover, all the other parameters being fixed, the lower bound (2.24) goes to 0 as 𝑑 tends to +, whereas for 𝑑2𝑛1, the lower bound (2.25) holds, and it does not depend on the specific value of 𝑑. Finally, for 𝑑>2, the upper bound (2.21) is smaller than the lower bound (2.24) for 𝑛 sufficiently large, and similarly, for 𝑑>2, the upper bound (2.22) is smaller than the lower bounds (2.25) and (2.26) for 𝑛 sufficiently large. For instance, in the latter case and for 𝑑 sufficiently small with respect to 𝑑, this happens for 225𝑑2/𝜋2𝑛(𝑑+1)/2 and for𝑛min𝑑+12,𝐾1𝑑𝐾2,(2.29) where 𝐾1=(192021/𝑑𝜋𝑒𝜋1)2𝑑/(𝑑2)𝑑2(2𝑑+1)/(𝑑2) and 𝐾2=2/(𝑑2).

Similar remarks as in Remark 2.3 can be made about the bounds in the supnorm derived in this section.

3. Application to Functional Optimization Problems

The results of Section 2 can be extended, with the same rates of approximation or similar ones, to the approximate solution of certain functional optimization problems. This can be done by exploiting the concepts of modulus of continuity and modulus of convexity of a functional, provided that continuity and uniform convexity assumptions are satisfied. The basic ideas are the following (see also [5] for a similar analysis).

3.1. Rates of Approximate Optimization in Terms of the Modulus of Continuity

Let 𝒳 be a normed linear space, 𝑋𝒳, and Φ𝑋 a functional. Suppose that the functional optimization problemmin𝑓𝑋Φ(𝑓)(3.1) has a solution 𝑓, and let 𝑋1𝑋2𝑋𝑛𝑋 be a nested sequence of subsets of 𝑋 such thatinf𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳𝜀𝑛(3.2) for some 𝜀𝑛>0, where 𝜀𝑛0 as 𝑛+. Then, if the functional Φ is continuous, too, one hasinf𝑓𝑛𝑋𝑛||Φ(𝑓𝑓)Φ𝑛||𝛼𝑓𝜀𝑛0as𝑛+,(3.3) where 𝛼𝑓[0,+)[0,+) defined by 𝛼𝑓(𝑡)=sup{|Φ(𝑓)Φ(𝑔)|𝑔𝑋,𝑓𝑔𝒳𝑡} is the modulus of continuity of Φ at 𝑓. For instance, if Φ is Lipschitz continuous with Lipschitz constant 𝐾Φ, one has 𝛼𝑓(𝑡)𝐾Φ𝑡, and by(3.2)inf𝑓𝑛𝑋𝑛||Φ(𝑓𝑓)Φ𝑛||𝐾Φ𝜀𝑛.(3.4) Then, if an upper bound on 𝜀𝑛 in terms of 𝑛 is known (e.g., 𝜀𝑛=𝑂(𝑛1/2) under the assumptions of Theorem 2.1, where 𝑋=Γ𝐵,𝐶2(𝐵,𝜇)=𝒳 and 𝑋𝑛 is the set of functions of the form (2.4)), then the same upper bound (up to a multiplicative constant) holds on inf𝑓𝑛𝑋𝑛|Φ(𝑓)Φ(𝑓𝑛)|. So, investigating the approximating capabilities of the sets 𝑋𝑛 is useful for functional optimization purposes, too.

3.2. Rates of Approximate Optimization in Terms of the Modulus of Convexity

When dealing with suboptimal solutions from a set 𝑋𝑛𝑋, the following question arises: suppose that 𝑓𝑛𝑋𝑛 is such that|||Φ(𝑓𝑓)Φ𝑛|||𝛾𝑛(3.5) for some 𝛾𝑛>0, where 𝛾𝑛0 as 𝑛+. This can be guaranteed, for example, if the functional is continuous, the sets 𝑋𝑛 satisfy the property (3.2), and one chooses 𝑓𝑛argmin𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳 assuming, almost without loss of generality, that such a set is nonempty. If this is not the case, then one can proceed as follows. For 𝜖>0, let argmin𝜖,𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳={𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳inf𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳+𝜖}. Then one obtains estimates similar to the ones of this section (obtained assuming that argmin𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳 is nonempty) by choosing 𝑓𝑛argmin𝜂𝜀𝑛,𝑓𝑛𝑋𝑛𝑓𝑓𝑛𝒳, where 𝜂>1 is a constant. Does the estimate (3.5) imply an upper bound on the approximation error 𝑓𝑓𝑛𝒳? A positive answer can be given when the functional Φ is uniformly convex. Recall that a functional Φ𝑋 is called convex on a convex set 𝑋𝒳 if and only if for all ,𝑔𝑋 and all 𝜆[0,1], one has Φ(𝜆+(1𝜆)𝑔)𝜆Φ()+(1𝜆)Φ(𝑔) and it is called uniformly convex if and only if there exists a nonnegative function 𝛿[0,+)[0,+) such that 𝛿(0)=0, 𝛿(𝑡)>0 for all 𝑡>0, and for all ,𝑔𝑋 and all 𝜆[0,1], one hasΦ(𝜆+(1𝜆)𝑔)𝜆Φ()+(1𝜆)Φ(𝑔)𝜆(1𝜆)𝛿𝑔𝒳.(3.6) Any such function 𝛿 is called a modulus of convexity of Φ [26]. The terminology is not unified: some authors use the term “strictly uniformly convex” instead of “uniformly convex” and reserve the term “uniformly convex” for the case where 𝛿[0,+)[0,+) merely satisfies 𝛿(0)=0 and 𝛿(𝑡0)>0 for some 𝑡0>0 (see, e.g., [27, 28]). Note that when 𝒳 is a Hilbert space and 𝛿(𝑡) has the quadratic expression1𝛿(𝑡)=2𝑐𝑡2(3.7) for some constant 𝑐>0, the condition (3.6) is equivalent to the convexity of the functional Φ()𝛿(𝒳)=Φ()(1/2)𝑐2𝒳. Indeed, the latter property means that, for all ,𝑔𝑋 and all 𝜆[0,1], one has1Φ(𝜆+(1𝜆)𝑔)2𝑐𝜆+(1𝜆)𝑔2𝒳𝜆𝜆Φ()2𝑐2𝒳+(1𝜆)Φ(𝑔)1𝜆2𝑐𝑔2𝒳,(3.8) and this is equivalent toΦ(𝜆+(1𝜆)𝑔)𝜆Φ()+(1𝜆)Φ(𝑔)𝜆(1𝜆)2𝑐𝑔2𝒳,(3.9) since one can show through straightforward computations that, for 𝒳 a Hilbert space, one has12𝑐𝜆+(1𝜆)𝑔2𝒳𝜆2𝑐2𝒳1𝜆2𝑐𝑔2𝒳=𝜆(1𝜆)2𝑐𝑔2𝒳.(3.10)

One of the most useful properties of uniform convexity is that 𝑓argmin𝑓𝑋Φ(𝑓) implies the lower bound||Φ(𝑓||)Φ(𝑓)𝛿𝑓𝑓𝒳(3.11) for any 𝑓𝑋 (see, e.g., [5, Proposition  2.1(iii)]). When the modulus of convexity has the form (3.7), this implies (together with (3.5))𝑓𝑓𝑛𝒳2𝛾𝑛𝑐0as𝑛+.(3.12) When (3.2) holds, too, and Φ has modulus of continuity 𝛼𝑓 at 𝑓, one can take𝛾𝑛=𝛼𝑓𝜀𝑛(3.13) in (3.12), thus obtaining𝑓𝑓𝑛𝒳2𝛼𝑓𝜀𝑛𝑐0as𝑛+.(3.14) Again, this allows one to extend rates of function approximation to functional optimization, supposing, as in Section 3.1, that Φ is also Lipschitz continuous with Lipschitz constant 𝐾Φ and that 𝜀𝑛=𝑂(𝑛1/2). Then, one obtains (from the choice (3.13) for 𝛾𝑛 and formula (3.14))|||Φ(𝑓𝑓)Φ𝑛|||𝑛=𝑂1/2𝑓,(3.15)𝑓𝑛𝒳𝑛=𝑂1/4.(3.16)

Remark 3.1. In [29], a greedy algorithm is proposed to construct a sequence of sets 𝑋𝑛 corresponding to variable-basis schemes and functions 𝑓𝑛𝑋𝑛 that achieve the rate (3.15) for certain uniformly convex functional optimization problems. Such an algorithm can be interpreted as an extension to functional optimization of the greedy algorithm proposed in [12] for function approximation by sigmoidal neural networks.
Finally, it should be noted that the rate (3.15) is achieved in general by imposing some structure on the sets 𝑋 and 𝑋𝑛. For instance, the set 𝑋 in [29] is the convex hull of some set of functions 𝐺𝒳, that is,𝑋=co𝐺=𝑘𝑗=1𝛼𝑗𝑔𝑗𝛼𝑗0,𝑘𝑗=1𝛼𝑗=1,𝑔𝑗𝐺,𝑘+,(3.17) whereas, for each 𝑛+, the set 𝑋𝑛 in [29] is 𝑋𝑛=co𝑛𝐺=𝑛𝑗=1𝛼𝑗𝑔𝑗𝛼𝑗0,𝑛𝑗=1𝛼𝑗=1,𝑔𝑗.𝐺(3.18) Functional optimization problems have in general a natural domain 𝑋 larger than co𝐺 (or its closure co𝐺 in the norm of the ambient space 𝒳). Therefore, the choice of a set 𝑋 of the form (3.17) as the domain of the functional Φ might seem unmotivated. This is not the case, because there are several examples of functional optimization problems for which, for suitable sets 𝐺 and a natural domain 𝑋 larger than co𝐺 (resp., co𝐺), the set argmin𝑓𝑋Φ(𝑓)(3.19) has a nonempty intersection with co𝐺 (resp., co𝐺), or it is contained in it. This issue is studied in [20] for dynamic optimization problems and in [19] for static team optimization ones, where structural properties (e.g., smoothness) of the minimizers are studied.

3.3. Comparison between Fixed- and Variable-Basis Schemes for Functional Optimization

The proposition follows by combining the results derived in Sections 2.1, 3.1, and 3.2.

Proposition 3.2. Let the functional Φ be Lipschitz continuous with Lipschitz constant 𝐾Φ and uniformly convex with modulus of convexity of the form (3.7), 𝑋=Γ𝐵,𝐶, 𝜇 any probability measure on 𝐵, 𝒳=2(𝐵,𝜇), and suppose that there exists a minimizer 𝑓argmin𝑓Γ𝐵,𝐶Φ(𝑓). Then the following hold.(i)For every 𝑛1 there exists 𝑓𝑛 of the form (2.4) such that 𝑓𝑓𝑛22𝐶𝑛.(3.20) For each such 𝑓𝑛 one has ||Φ(𝑓𝑓)Φ𝑛||𝐾Φ2𝐶𝑛(3.21) and if 𝑓𝑛 of the form (2.4) is such that |||Φ(𝑓𝑓)Φ𝑛|||𝐾Φ2𝐶𝑛,(3.22) then 𝑓𝑓𝑛22𝐾Φ𝐶𝑐14𝑛.(3.23)(ii)For 𝐵=[0,1]𝑑, 𝜇𝑢 equal to the uniform probability measure on [0,1]𝑑, every 𝑛1, and every choice of fixed-basis functions 1,,𝑛2([0,1]𝑑,𝜇𝑢), there exists a uniformly convex functional Φ (such a functional Φ can be also chosen to be Lipschitz continuous with Lipschitz constant 𝐾Φ, but this is not needed in the inequalities (3.24)–(3.29), since they do not contain 𝐾Φ) with modulus of convexity of the form (3.7) and minimizer 𝑓argmin𝑓Γ𝑑[0,1],𝐶Φ(𝑓) such that for every 0<𝜒<1 one has inf𝑓𝑛span1,2,,𝑛𝑓𝑓𝑛2𝐶𝜒16𝜋𝑒𝜋1𝑑12𝑛1/𝑑,(3.24)inf𝑓𝑛span1,2,,𝑛|||Φ𝑓Φ𝑓𝑛|||12𝑐𝜒𝐶16𝜋𝑒𝜋1𝑑212𝑛2/𝑑.(3.25)(iii)The statements (i) and (ii) still hold by replacing the set Γ𝐵,𝐶 by Γ[0,1]𝑑,𝑑,𝐶, for 𝑑 a multiple of 𝑑. The only difference is that the estimates (3.24) and (3.25) are replaced, respectively, by inf𝑓𝑛span1,2,,𝑛𝑓𝑓𝑛2𝐶𝜒8𝜋,(3.26)inf𝑓𝑛span1,2,,𝑛|||Φ𝑓Φ𝑓𝑛|||12𝑐𝜒𝐶8𝜋2(3.27) for 𝑛(𝑑+1)/2 and by inf𝑓𝑛span1,2,,𝑛𝑓𝑓𝑛2𝑑𝜒𝑑1/𝑑𝐶16𝜋𝑒𝜋11𝑑2𝑛1/𝑑,(3.28)inf𝑓𝑛span{1,2,,𝑛}|||Φ𝑓Φ𝑓𝑛|||12𝑐𝑑𝑑2/𝑑𝜒𝐶16𝜋𝑒𝜋1𝑑212𝑛2/𝑑(3.29) for 𝑛>(𝑑+1)/2.

Proof. (i) The estimate (3.20) follows by Theorem 2.1. The bound (3.21) follows by (3.20), the definition of modulus of continuity, and the assumption of Lipschitz continuity of Φ. Finally, (3.23) is obtained by property (3.11) of the modulus of convexity and its expression (3.7).
(ii) (3.24) comes from Theorem 2.2: the constant 𝜒 is introduced in order to remove the supremum with respect to 𝑓Γ[0,1]𝑑,𝐶 in formula (2.7) and replace it with the choice 𝑓=𝑓, where 𝑓 is any function that achieves the bound (2.7) up to the constant factor 𝜒; (3.25) follows from (3.24), (3.11), and (3.7), choosing as Φ any functional that is uniformly convex with modulus of convexity of the form (3.7), and such that 𝑓argmin𝑓Γ𝑑[0,1],𝐶Φ(𝑓).
(iii) The estimates (3.20), (3.21), (3.23) still hold when the set Γ𝐵,𝐶 is replaced by Γ[0,1]𝑑,𝑑,𝐶 since Γ[0,1]𝑑,𝑑,𝐶Γ𝐵,𝐶 for 𝐵=[0,1]𝑑, whereas formulas (3.26)–(3.29) are obtained likewise formulas (3.24) and (3.25), by applying Proposition 2.4 instead of Theorem 2.2.

4. Discussion

Classes of function-approximation and functional optimization problems have been investigated for which, for a given desired error, certain variable-basis approximation schemes with sigmoidal computational units require less parameters than fixed-basis ones. Previously known bounds on the accuracy have been extended, with better rates, to families of functions whose effective number of variables 𝑑 is much smaller than the number of their arguments 𝑑.

Proposition 3.2 shows that there is a strict connection between certain problems of function approximation and functional optimization. For such two classes of problems, indeed, the approximation error rates for the first class can be converted into rates of approximate optimization for the second one and vice versa. In particular, for 𝑑>2, 𝑋=Γ[0,1]𝑑,𝐶, and any linear approximation scheme span{1,2,,𝑛}, the estimates (3.21) and (3.25) show families of functional optimization problems for which the error in approximate optimization with variable-basis schemes of sigmoidal type is smaller than the one associated with the linear scheme. For 𝑑>2 and 𝑋=Γ[0,1]𝑑,𝑑,𝐶, a similar remark can be made for the estimates (3.21) and (3.27) and for the bounds (3.21) and (3.29). Finally, the bound (3.23) shows that for large 𝑛 any approximate minimizer 𝑓𝑛 of the form (2.4) differs slightly from the true minimizer 𝑓, even though the error in approximate optimization (3.22) and the associated approximation error (3.23) have different rates. In contrast, the estimates (3.24), (3.26), and (3.28) show that, for any linear approximation scheme span{1,2,,𝑛}, there exists a functional optimization problem whose minimizer 𝑓 cannot be approximated with the same accuracy by the linear scheme.

The results presented in the paper provide some theoretical justification for the use of variable-basis approximation schemes (instead of fixed-basis ones) in function approximation and functional optimization.

Acknowledgment

The author was partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Adaptive State Estimation and Optimal Control.”