#### Abstract

Fixed-basis and variable-basis approximation schemes are compared for the problems of function approximation and functional optimization (also known as infinite programming). Classes of problems are investigated for which variable-basis schemes with sigmoidal computational units perform better than fixed-basis ones, in terms of the minimum number of computational units needed to achieve a desired error in function approximation or approximate optimization. Previously known bounds on the accuracy are extended, with better rates, to families of $d$-variable functions whose actual dependence is on a subset of the variables, where the indices of these variables are not known a priori.

#### 1. Introduction

In functional optimization problems, also known as infinite programming problems, functionals have to be minimized with respect to functions belonging to subsets of function spaces. Function-approximation problems, the classical problems of the calculus of variations [1] and, more generally, all optimization tasks in which one has to find a function that is optimal in a sense specified by a cost functional belong to this family of problems. Such functions may express, for example, the routing strategies in communication networks, the decision functions in optimal control problems and economic ones, and the input/output mappings of devices that learn from examples.

Experience has shown that optimization of functionals over admissible sets of functions made up of linear combinations of relatively few basis functions with a simple structure and depending nonlinearly on a set of “inner” parameters (e.g., feedforward neural networks with one hidden layer and linear output activation units) often provides surprisingly good suboptimal solutions. In such approximation schemes, each function depends on both external parameters (the coefficients of the linear combination) and inner parameters (the ones inside the basis functions). These are examples of *variable-basis approximators*, since the basis functions are not fixed: their choice depends on that of the inner parameters. In contrast, classical approximation schemes (such as the *Ritz method* in the calculus of variations [1]) do not use inner parameters but employ *fixed basis functions*, and the corresponding approximators depend only linearly on the external parameters. They are therefore called *fixed-basis* or *linear approximators*. In [2], certain variable-basis approximators were applied to obtain approximate solutions to functional optimization problems. This technique was later formalized as the *extended Ritz method* (ERIM) [3] and was motivated by the innovative and successful application of feedforward neural networks in the late 1980s. For experimental results and theoretical investigations about the ERIM, see [2–7] and the references therein.
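The distinction between the two kinds of approximators can be made concrete with a minimal sketch. It assumes a scalar input, a polynomial fixed basis, and the logistic function as the sigmoidal unit; the function names `fixed_basis` and `variable_basis` are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, one common example of a sigmoidal unit (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))

def fixed_basis(x, coeffs):
    """Linear approximator: the basis 1, x, x^2, ... is chosen in advance,
    so the output depends only (and linearly) on the external coefficients."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def variable_basis(x, outer, inner):
    """Variable-basis approximator: `outer` holds the coefficients of the linear
    combination, `inner` holds the (weight, bias) pair inside each basis function."""
    return sum(c * sigmoid(a * x + b) for c, (a, b) in zip(outer, inner))

# The fixed-basis output is linear in its parameters ...
print(fixed_basis(2.0, [1.0, 0.0, 3.0]))
# ... whereas changing an inner parameter reshapes the basis function itself.
print(variable_basis(0.3, [2.0], [(5.0, 0.0)]), variable_basis(0.3, [2.0], [(50.0, 0.0)]))
```

In the second call, only the inner weight changes, yet the unit moves from a smooth ramp towards a step: this is the nonlinear dependence on the inner parameters that fixed-basis schemes lack.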

The basic motivation to search for suboptimal solutions of these forms is quite intuitive: when the number of basis functions becomes sufficiently large, the convergence of the sequence of suboptimal solutions to an optimal one may be ensured by suitable properties of the set of basis functions, the admissible set of functions, and the functional to be optimized [1, 5, 8]. Computational feasibility requirements (i.e., memory occupancy and time needed to find sufficiently good values for the parameters) make it crucial to estimate the minimum number of computational units needed by an approximator to guarantee that suboptimal solutions are “sufficiently close” to an optimal one. Such a number plays the role of “model complexity” of the approximator and can be studied with tools from linear and nonlinear approximation theory [9, 10].

As compared to fixed-basis approximators, in variable-basis ones the nonlinearity of the parametrization of the variable basis functions may cause the loss of useful properties of best approximation operators [11], such as uniqueness, homogeneity, and continuity, but may allow improved rates of approximation or approximate optimization [9, 12–14]. Then, to justify the use of variable-basis schemes instead of fixed-basis ones, it is crucial to investigate families of function-approximation and functional optimization problems for which, for a given desired accuracy, variable-basis schemes require a smaller number of computational units than fixed-basis ones. This is the aim of this work.

In the paper, the approximate solution of certain function-approximation and functional optimization problems via fixed- and variable-basis schemes is investigated. In particular, families of problems are presented, for which variable-basis schemes of a certain kind perform better than any fixed-basis one, in terms of the minimum number of computational units needed to achieve a desired worst-case error. Propositions 2.4, 2.7, 2.8, and 3.2 are the main contributions, which are presented after the exposition of results available in the literature.

The paper is organized as follows. Section 2 compares variable- and fixed-basis approximation schemes for function-approximation problems, which are particular instances of functional optimization. Section 3 extends the estimates to some more general families of functional optimization problems through the concepts of modulus of continuity and modulus of convexity of a functional. Section 4 is a short discussion.

#### 2. Comparison of Bounds for Fixed- and Variable-Basis Approximation

Here and in the following, the “big $O$,” “big $\Omega$,” and “big $\Theta$” notations [18] are used. For two functions $f, g : (0, +\infty) \to \mathbb{R}$, one writes $f = O(g)$ if and only if there exist $M > 0$ and $x_0 > 0$ such that $|f(x)| \leq M |g(x)|$ for all $x > x_0$, $f = \Omega(g)$ if and only if $g = O(f)$, and $f = \Theta(g)$ if and only if both $f = O(g)$ and $f = \Omega(g)$ hold. In order to be able to use such notations also for multivariable functions, in the following it is assumed that all their arguments are fixed with the exception of one of them (more precisely, the argument $\varepsilon$).

Two approaches have been adopted in the literature to compare the approximation capabilities of fixed- and variable-basis approximation schemes (see also [15] for a discussion on this topic). In the first one, one fixes the family of functions to be approximated (e.g., the unit ball in a Sobolev space [16]), then one finds bounds on the worst-case approximation error for functions belonging to such a family, for various approximation schemes (fixed- and variable-basis ones). The second approach, initiated by Barron [12, 17], fixes a variable-basis approximation scheme (e.g., the set of one-hidden-layer perceptrons with a given upper bound on the number of sigmoidal computational units) and searches for families of functions that are well approximated by such an approximation scheme. Then, for these families of functions, the approximation capability of the variable-basis approximation scheme is compared with the ones of fixed-basis approximation schemes. In this context, one is interested in finding cases for which, the number of computational units being the same, one has upper bounds on the worst-case approximation error for certain variable-basis approximation schemes that are smaller than corresponding lower bounds for any fixed-basis one, implying that such variable-basis schemes have better approximation capabilities than every fixed-basis one.

One problem of the first approach is that, for certain families of smooth functions to be approximated, the bounds on the worst-case approximation error obtained for fixed- and variable-basis approximation schemes are very similar. In particular, typically one obtains the so-called *Jackson rate* of approximation [4] $n = O(\varepsilon^{-d/m})$, where $n$ is the number of computational units, $\varepsilon$ is the worst-case approximation error, $m$ is a measure of smoothness, and $d$ is the number of variables on which such functions depend. Following the second approach, it was shown in [12, 17] that, for certain function-approximation problems, variable-basis schemes exhibit some advantages over fixed-basis ones (see Sections 2.1 and 2.2, where extensions of some results from [12, 17] are also derived).
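To get a feeling for how a Jackson-type rate scales with the number of variables, the following sketch evaluates the order of magnitude $\varepsilon^{-d/m}$. It assumes that the rate has this form and drops every multiplicative constant; the helper name `jackson_units` is hypothetical.

```python
import math

def jackson_units(eps, d, m):
    """Order-of-magnitude unit count eps**(-d/m) for error eps, d variables and
    smoothness m. Assumption: all constants hidden in the big-O are dropped."""
    return math.ceil(eps ** (-d / m))

# For a fixed error and smoothness, the count grows exponentially with d.
for d in (2, 4, 8, 16):
    print(d, jackson_units(0.5, d, m=2))
```

Increasing the smoothness $m$ together with $d$ keeps the exponent $d/m$ under control, which is why such rates are informative mainly for very smooth families.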

In Section 2.1, some bounds in the $L_2$-norm are considered, whereas Section 2.2 investigates bounds in the supnorm. Estimates in the $L_2$-norm can be applied, for example, to investigate the approximation of the optimal policies in static team optimization problems [19]. Estimates in the supnorm are required, for example, to investigate the approximation of the optimal policies in dynamic optimization problems with a finite number of stages [20]. Indeed, for such problems, the supnorm can be used to analyze the error propagation from one stage to the next one, while this is not the case for the $L_2$-norm [20]. Moreover, the supnorm provides guarantees on the approximation errors in the design of the optimal decision laws.

##### 2.1. Bounds in the $L_2$-Norm

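The following Theorem 2.1 from [12] describes a quite general set of functions of $d$ real variables (described in terms of their Fourier distributions) whose approximation from variable-basis approximation schemes with sigmoidal computational units requires $O(\varepsilon^{-2})$ computational units, where $\varepsilon$ is the desired worst-case approximation error measured in the $L_2$-norm. Recall that a sigmoidal function is defined in general as a bounded measurable function $\sigma : \mathbb{R} \to \mathbb{R}$ such that $\sigma(y) \to 1$ as $y \to +\infty$ and $\sigma(y) \to 0$ as $y \to -\infty$ [21]. For $C > 0$, $d$ a positive integer, and $B$ a bounded subset of $\mathbb{R}^d$ containing 0, by $\Gamma_{B,C}$ we denote the set of functions $f : \mathbb{R}^d \to \mathbb{R}$ having a Fourier representation of the form

$$f(x) = \int_{\mathbb{R}^d} e^{i \langle \omega, x \rangle}\, \tilde{F}(d\omega) \tag{2.1}$$

for some complex-valued measure $\tilde{F}(d\omega) = e^{i\theta(\omega)} F(d\omega)$ (where $F(d\omega)$ and $\theta(\omega)$ are the magnitude distribution and the phase at the pulsation $\omega$, resp.) such that

$$\int_{\mathbb{R}^d} \sup_{x \in B} |\langle \omega, x \rangle|\, F(d\omega) \leq C, \tag{2.2}$$

where $\langle \cdot, \cdot \rangle$ is the standard inner product on $\mathbb{R}^d$. Functions in $\Gamma_{B,C}$ are continuously differentiable on $B$ [12]. When $B$ is the hypercube $[-1,1]^d$, the inequality (2.2) reduces to

$$\int_{\mathbb{R}^d} \|\omega\|_1\, F(d\omega) \leq C, \tag{2.3}$$

where $\|\cdot\|_1$ denotes the $l_1$-norm.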

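For a probability measure $\mu$ on $B$, we denote by $L_2(B,\mu)$ the Hilbert space of functions $g : B \to \mathbb{R}$ with inner product $\langle g_1, g_2 \rangle_{L_2(B,\mu)} = \int_B g_1(x)\, g_2(x)\, \mu(dx)$ and induced norm $\|g\|_{L_2(B,\mu)}$. When there is no risk of confusion, the simpler notation $\|\cdot\|_{L_2}$ is used instead of $\|\cdot\|_{L_2(B,\mu)}$.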

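Theorem 2.1 (see [12, Theorem 1]). *For every $f \in \Gamma_{B,C}$, every sigmoidal function $\sigma$, every probability measure $\mu$ on $B$, and every $n \geq 1$, there exist $a_k \in \mathbb{R}^d$, $b_k, c_k \in \mathbb{R}$, and $f_n$ of the form*

$$f_n(x) = \sum_{k=1}^{n} c_k\, \sigma(\langle a_k, x \rangle + b_k) + c_0 \tag{2.4}$$

*such that*

$$\|f - f_n\|_{L_2(B,\mu)} \leq \frac{2C}{\sqrt{n}}. \tag{2.5}$$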

Variable-basis approximators of the form (2.4) are called *one-hidden-layer perceptrons* with $n$ computational units. Formula (2.5) shows that *at most*

$$n = \left\lceil \left( \frac{2C}{\varepsilon} \right)^{2} \right\rceil \tag{2.6}$$

computational units are required to guarantee a desired worst-case approximation error $\varepsilon$ in the $L_2$-norm, when variable-basis approximation schemes of the form (2.4) are used to approximate functions belonging to the set $\Gamma_{B,C}$.
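As a rough illustration of how an error bound that decays like the inverse square root of the number of units translates into a unit count, the sketch below inverts a bound of the form $K/\sqrt{n} \leq \varepsilon$. The constant `K` is a hypothetical stand-in for the constant appearing in (2.5), and `units_for_error` is an illustrative name.

```python
import math

def units_for_error(eps, K):
    """Smallest n with K / sqrt(n) <= eps, that is, n >= (K / eps)**2.
    K is a hypothetical constant standing in for the one in (2.5)."""
    return math.ceil((K / eps) ** 2)

# Halving the target error quadruples the number of units; the number of
# variables does not enter the count at all.
for eps in (0.5, 0.25, 0.125):
    print(eps, units_for_error(eps, K=2.0))
```

The absence of the number of variables from this count is the feature that the comparison with fixed-basis schemes exploits.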

In contrast to this, Theorem 2.2 from [12] shows that, when is the unit hypercube and is the uniform probability measure on , for the same set of functions the best linear approximation scheme requires computational units in order to achieve the same worst-case approximation error . The set of all linear combinations of fixed basis functions in a linear space is denoted by .

Theorem 2.2 (see [12, Theorem 6]). *For every and every choice of fixed basis functions , one has
*

*Remark 2.3.* Inspection of the proof of [12, Theorem 6] shows that the factors and , which appear in the original statement of the theorem, have to be replaced by and in (2.7), respectively.

Inspection of the proof of Theorem 2.2 in [12] shows also that the lower bound (2.7) still holds if the set is replaced by either
or
where denotes any multi-index and its norm (i.e., the sum of the components of , which are nonnegative). Obviously, when is the unit hypercube , the upper bound (2.5) still holds under one of these two replacements, since .

The inequality (2.7) implies that for a uniform probability measure on , *at least *
computational units are required to guarantee a desired worst-case approximation error in the -norm, when fixed-basis approximation schemes of the form are used to approximate functions in . Then, at least for a sufficiently small value of , Theorems 2.1 and 2.2 show that for , variable-basis approximators of the form (2.4) provide a smaller approximation error than any fixed-basis one for functions in , the number of computational units being the same.

It should be noted that, for fixed and , the estimate (2.6) is constant with respect to , whereas the one (2.10) goes to 0 as goes to . So, too small a value of in the bound (2.10) for fixed-basis approximation may make the theoretical advantage of variable-basis approximation of little practical use, since for large it would be guaranteed only for sufficiently small (depending on , too). In the following, families of $d$-variable functions are considered for which this drawback is mitigated. These are families of $d$-variable functions whose actual dependence is on a subset of variables, where the indices of these variables are not known a priori. These families are of interest, for example, in machine learning applications, for problems with redundant or correlated features. In this context, each of the real variables represents a feature (e.g., a measure of some physical property of an object), and one is interested in learning a function of these features on the basis of a set of supervised examples. As often happens in applications, only a small subset of the features is useful for the specific task (typically, classification or regression), due to the presence of redundant or correlated features. Then, one may assume that the function to be learned depends only on a subset of features, but one may not know a priori which particular subset it is. The problem of finding such a subset (or finding a subset of features of sufficiently small cardinality on which the function mostly depends, when the function depends on all the features) is called the *feature-selection problem* [22].

For a positive integer and its multiple, denotes the subset of functions in that depend only on of their possible arguments.

Proposition 2.4. *For every and every choice of fixed basis functions , for one has
**
and for *

*Proof.* The proof is similar to that of [12, Theorem 6]. The following is a list of the changes to that proof needed to derive (2.11) and (2.12). We denote by the number of nonzero components of the multi-index . Proceeding as in the proof of [12, Theorem 6], we get
where is the smallest positive integer such that the number of multi-indices with norm and that satisfy the constraint is larger than or equal to . More precisely, (2.13) is obtained by observing that for such an integer the set contains at least orthogonal cosinusoidal functions with -norm equal to and applying [12, Lemma 6], which states that for any orthonormal basis of a -dimensional space, there does not exist a linear subspace of dimension having distance smaller than from every basis function in such an orthonormal basis. The constraint is not present in the proof of [12, Theorem 6] and is due to the specific form of the set . Because of such a constraint, the functions in with do not belong to .

Then we get
Indeed, for the equality (2.14) follows recalling that the number of different ways of placing identical objects in distinct boxes is [23, Theorem 5.1], and for this case it is the same estimate as the one obtained in the proof of [12, Theorem 6]. Similarly, for the constraint is redundant and we get again (2.14). Finally, for a positive integer larger than 1 and , the upper bound in (2.15) is obtained ignoring the constraint , whereas the lower bound is obtained as follows. First, we partition the set of variables into subsets of cardinality , and then we apply to each subset the estimate obtained by replacing by in (2.14). In this way, the multi-index is counted times (once for each subset), but the final estimate so obtained holds since for there are at least other multi-indices that have not been counted in this process.
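The counting argument used above can be checked by brute force on small cases. The sketch below enumerates multi-indices with a given norm, compares their number with the classical "identical objects in distinct boxes" count, and then counts those with a limited number of nonzero components. Two assumptions are made for illustration: the norm is taken to be exactly `m`, and the sparsity constraint is read as "at most `d_eff` nonzero components"; all names are hypothetical.

```python
from itertools import product
from math import comb

def multi_indices(d, m):
    """All d-tuples of nonnegative integers whose components sum to m."""
    return [r for r in product(range(m + 1), repeat=d) if sum(r) == m]

def count_sparse(d, m, d_eff):
    """Number of multi-indices of norm m with at most d_eff nonzero components
    (an assumed reading of the constraint used in the proof)."""
    return sum(1 for r in multi_indices(d, m)
               if sum(1 for c in r if c != 0) <= d_eff)

d, m = 4, 3
# Placing m identical objects in d distinct boxes: C(m + d - 1, d - 1) ways.
assert len(multi_indices(d, m)) == comb(m + d - 1, d - 1)
print(len(multi_indices(d, m)), count_sparse(d, m, d_eff=2))
```

The sparse count is never larger than the unconstrained one, which mirrors how the upper bound in the proof is obtained by ignoring the constraint.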

In the following, we apply (2.14) and (2.15) for and , respectively. For , the condition becomes
so for . This, combined with (2.13), proves (2.11).

Now, as in the proof of [12, Theorem 6], for we exploit a bound from Stirling’s formula, according to which , so the condition holds if we impose
which is equivalent to
(note that, for , the value of provided by (2.18) is indeed larger than 1, as required for the application of (2.15)). Since
we conclude that for . This, together with (2.13), proves the statement (2.12).

For the case considered by Proposition 2.4, a uniform probability measure on , and , formulas (2.11) and (2.12) show that *at least*
computational units are required to guarantee a desired worst-case approximation error in the -norm, when fixed-basis approximation schemes of the form are used to approximate functions in .

*Remark 2.5.* The quantity in Proposition 2.4 has to be interpreted as an *effective number of variables* for the family of functions to be approximated. Roughly speaking, the flexibility of the neural network architecture (2.4) allows one to identify, for each , the variables on which it actually depends, whereas fixed-basis approximation schemes lack this flexibility. Indeed, unlike the lower bound (2.10), for fixed , , and the lower bound (2.20) goes to as goes to . Finally, remarks similar to those in Remark 2.3 apply to Proposition 2.4.

##### 2.2. Bounds in the Supnorm

The next result is from [17] and is analogous to Theorem 2.1, but it measures the worst-case approximation error in the supnorm.

Theorem 2.6 (see [17, Theorem 2]). *For every and every , there exists of the form (2.4) such that
*

Upper bounds in the supnorm similar to the one from Theorem 2.6 are given, for example, in [24, 25]. Moreover, for , the following estimate holds.

Proposition 2.7. *For every and every , there exists of the form (2.4) such that
*

*Proof.* Each function depends on arguments; let be their indices. Let be defined by , where , and all the other components of are arbitrary in . Then , so by Theorem 2.6 there exists an approximation made up of sigmoidal computational units and a constant term such that . Finally, we observe that can be extended to a function of the form (2.4) such that , from which one obtains (2.22).
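The extension step at the end of the proof can be sketched in code: a network built on the relevant variables becomes a network on all variables by giving zero inner weights to the irrelevant coordinates. The example assumes the logistic function as the sigmoidal unit and uses arbitrary illustrative numbers; `perceptron` and `extend_weights` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, weights, biases, outer, c0):
    """c0 + sum_k outer[k] * sigmoid(<weights[k], x> + biases[k])."""
    return c0 + sum(
        c * sigmoid(sum(a_i * x_i for a_i, x_i in zip(a, x)) + b)
        for c, a, b in zip(outer, weights, biases)
    )

def extend_weights(weights, active, d):
    """Embed each inner weight vector into R^d, with zeros outside `active`."""
    extended = []
    for a in weights:
        full = [0.0] * d
        for idx, a_i in zip(active, a):
            full[idx] = a_i
        extended.append(full)
    return extended

# Hypothetical network on 2 relevant variables, embedded into 5 variables.
weights, biases, outer, c0 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.4], [2.0, -1.0], 0.3
active, d = [1, 3], 5
big = extend_weights(weights, active, d)
x = [9.0, 0.2, -7.0, 0.6, 4.0]  # coordinates 0, 2 and 4 are irrelevant
small_out = perceptron([x[i] for i in active], weights, biases, outer, c0)
big_out = perceptron(x, big, biases, outer, c0)
assert abs(small_out - big_out) < 1e-12
```

The extended network uses the same number of units and the same outer coefficients, so the approximation error is unchanged by the embedding.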

The estimates (2.21) and (2.22) show that *at most *
computational units, respectively, are required to guarantee a desired worst-case approximation error in the supnorm, when variable-basis approximation schemes of the form (2.4) are used to approximate functions belonging to the sets and , respectively.

The next proposition, combined with Theorem 2.6 and Proposition 2.7, allows one to compare the approximation capabilities of fixed- and variable-basis schemes in the supnorm, showing cases for which the upper bounds (2.21) and (2.22) are smaller than one of the corresponding lower bounds (2.24)–(2.26), at least for sufficiently large.

Proposition 2.8. *For every and every choice of fixed bounded and -measurable basis functions , the following hold.*(i)*For the approximation of functions in , one has
*(ii)*For the approximation of functions in , for , one has
whereas for *

*Proof.* For each bounded and -measurable function , we get
so
Then we get the lower bounds (2.24)–(2.26) by (2.7), (2.11), and (2.12), respectively.

For the case considered by Proposition 2.8, the estimate (2.24) implies that *at least * computational units are required to guarantee a desired worst-case approximation error in the supnorm, when fixed-basis approximation schemes of the form are used to approximate functions in . Similarly, for , the bounds (2.25) and (2.26) imply that *at least * computational units are required when is replaced by . One can observe that, for each , and , each of the lower bounds (2.25) and (2.26) is larger than (2.24). Moreover, all the other parameters being fixed, the lower bound (2.24) goes to 0 as tends to , whereas for , the lower bound (2.25) holds, and it does not depend on the specific value of . Finally, for , the upper bound (2.21) is smaller than the lower bound (2.24) for sufficiently large, and similarly, for , the upper bound (2.22) is smaller than the lower bounds (2.25) and (2.26) for sufficiently large. For instance, in the latter case and for sufficiently small with respect to , this happens for and for
where and .

Remarks similar to those in Remark 2.3 can be made about the bounds in the supnorm derived in this section.

#### 3. Application to Functional Optimization Problems

The results of Section 2 can be extended, with the same rates of approximation or similar ones, to the approximate solution of certain functional optimization problems. This can be done by exploiting the concepts of modulus of continuity and modulus of convexity of a functional, provided that continuity and uniform convexity assumptions are satisfied. The basic ideas are the following (see also [5] for a similar analysis).

##### 3.1. Rates of Approximate Optimization in Terms of the Modulus of Continuity

Let be a normed linear space, , and a functional. Suppose that the functional optimization problem
has a solution , and let be a nested sequence of subsets of such that
for some , where as . Then, if the functional is continuous, too, one has
where defined by is the *modulus of continuity* of at . For instance, if is Lipschitz continuous with Lipschitz constant , one has , and by (3.2)
Then, if an upper bound on in terms of is known (e.g., under the assumptions of Theorem 2.1, where and is the set of functions of the form (2.4)), then the same upper bound (up to a multiplicative constant) holds on . So, investigating the approximating capabilities of the sets is useful for functional optimization purposes, too.
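A finite-dimensional toy case may help to see why a Lipschitz functional transfers approximation rates to optimization rates. In the sketch below, the functional is the sum of absolute deviations from a target vector, which is Lipschitz with respect to the Euclidean norm with constant $\sqrt{d}$; the functional, the target, and the approximating sequence are all hypothetical choices made only for illustration.

```python
import math

def phi(x, target):
    """Toy functional: sum_i |x_i - target_i|, minimized at x = target."""
    return sum(abs(a - b) for a, b in zip(x, target))

def dist(x, y):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

target = [0.5, -1.0, 2.0]
lipschitz = math.sqrt(len(target))  # since ||v||_1 <= sqrt(d) * ||v||_2

# A sequence approaching the minimizer: the optimization gap is controlled
# by the Lipschitz constant times the approximation error.
for n in (1, 10, 100, 1000):
    x_n = [t + (i + 1) / n for i, t in enumerate(target)]
    gap = phi(x_n, target) - phi(target, target)
    assert gap <= lipschitz * dist(x_n, target) + 1e-12
    print(n, gap, lipschitz * dist(x_n, target))
```

Both printed columns decay at the same rate in `n`, which is the mechanism by which a bound on the approximation error yields the same bound, up to the Lipschitz constant, on the optimization gap.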

##### 3.2. Rates of Approximate Optimization in Terms of the Modulus of Convexity

When dealing with suboptimal solutions from a set , the following question arises: suppose that is such that
for some , where as . This can be guaranteed, for example, if the functional is continuous, the sets satisfy the property (3.2), and one chooses assuming, almost without loss of generality, that such a set is nonempty. If this is not the case, then one can proceed as follows. For , let . Then one obtains estimates similar to the ones of this section (obtained assuming that is nonempty) by choosing , where is a constant. Does the estimate (3.5) imply an upper bound on the approximation error ? A positive answer can be given when the functional is uniformly convex. Recall that a functional is called *convex* on a convex set if and only if for all and all , one has and it is called *uniformly convex* if and only if there exists a nonnegative function such that , for all , and for all and all , one has
Any such function is called a *modulus of convexity* of [26]. The terminology is not unified: some authors use the term “strictly uniformly convex” instead of “uniformly convex” and reserve the term “uniformly convex” for the case where merely satisfies and for some (see, e.g., [27, 28]). Note that when is a Hilbert space and has the quadratic expression
for some constant , the condition (3.6) is equivalent to the convexity of the functional . Indeed, the latter property means that, for all and all , one has
and this is equivalent to
since one can show through straightforward computations that, for a Hilbert space, one has
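The kind of Hilbert-space identity invoked here can be checked numerically. The sketch below verifies, in $\mathbb{R}^3$ with the Euclidean inner product, that $\lambda \|f\|^2 + (1-\lambda) \|g\|^2 - \|\lambda f + (1-\lambda) g\|^2 = \lambda (1-\lambda) \|f - g\|^2$; it is assumed, not confirmed by the text above, that this is the identity referred to, and the vectors are arbitrary.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm_sq(u):
    return dot(u, u)

def convexity_gap(f, g, lam):
    """lam*||f||^2 + (1 - lam)*||g||^2 - ||lam*f + (1 - lam)*g||^2."""
    mix = [lam * a + (1 - lam) * b for a, b in zip(f, g)]
    return lam * norm_sq(f) + (1 - lam) * norm_sq(g) - norm_sq(mix)

f, g, lam = [1.0, -2.0, 0.5], [3.0, 0.0, -1.5], 0.25
diff = [a - b for a, b in zip(f, g)]
# The gap equals lam * (1 - lam) * ||f - g||^2 in any inner-product space.
assert abs(convexity_gap(f, g, lam) - lam * (1 - lam) * norm_sq(diff)) < 1e-12
```

Expanding the squared norm of the convex combination with the inner product gives the identity directly, which is why it fails in general normed spaces that are not Hilbert spaces.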

One of the most useful properties of uniform convexity is that implies the lower bound for any (see, e.g., [5, Proposition 2.1(iii)]). When the modulus of convexity has the form (3.7), this implies (together with (3.5)) When (3.2) holds, too, and has modulus of continuity at , one can take in (3.12), thus obtaining Again, this allows one to extend rates of function approximation to functional optimization, supposing, as in Section 3.1, that is also Lipschitz continuous with Lipschitz constant and that . Then, one obtains (from the choice (3.13) for and formula (3.14))
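The way uniform convexity turns a small optimization gap into a small distance from the minimizer can be illustrated on a weighted quadratic functional in $\mathbb{R}^3$. The sketch assumes a quadratic modulus of convexity of the form $c\,t^2$, with $c$ equal to the smallest weight; the functional and the test points are hypothetical.

```python
import math

def phi(x, target, weights):
    """Weighted quadratic functional, minimized at x = target."""
    return sum(a * (xi - ti) ** 2 for a, xi, ti in zip(weights, x, target))

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_bound(gap, c):
    """Bound on the distance from the minimizer implied by c * t**2 <= gap."""
    return math.sqrt(gap / c)

target, weights = [1.0, -1.0, 2.0], [2.0, 3.0, 5.0]
c = min(weights)  # assumed modulus of convexity: delta(t) = c * t**2

for x in ([1.5, -1.0, 2.0], [0.0, 0.0, 0.0], [1.01, -0.98, 2.03]):
    gap = phi(x, target, weights) - phi(target, target, weights)
    assert dist(x, target) <= distance_bound(gap, c) + 1e-12
    print(gap, dist(x, target), distance_bound(gap, c))
```

Because the bound involves a square root of the gap, the distance from the minimizer decays at a slower rate than the optimization gap itself.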

*Remark 3.1.* In [29], a greedy algorithm is proposed to construct a sequence of sets corresponding to variable-basis schemes and functions that achieve the rate (3.15) for certain uniformly convex functional optimization problems. Such an algorithm can be interpreted as an extension to functional optimization of the greedy algorithm proposed in [12] for function approximation by sigmoidal neural networks.

Finally, it should be noted that the rate (3.15) is achieved in general by imposing some structure on the sets and . For instance, the set in [29] is the convex hull of some set of functions , that is,
whereas, for each , the set in [29] is
Functional optimization problems have in general a natural domain larger than (or its closure in the norm of the ambient space ). Therefore, the choice of a set of the form (3.17) as the domain of the functional might seem unmotivated. This is not the case, because there are several examples of functional optimization problems for which, for suitable sets and a natural domain larger than (resp., ), the set
has a nonempty intersection with (resp., ), or it is contained in it. This issue is studied in [20] for dynamic optimization problems and in [19] for static team optimization ones, where structural properties (e.g., smoothness) of the minimizers are studied.

##### 3.3. Comparison between Fixed- and Variable-Basis Schemes for Functional Optimization

The following proposition is obtained by combining the results derived in Sections 2.1, 3.1, and 3.2.

Proposition 3.2. *Let the functional be Lipschitz continuous with Lipschitz constant and uniformly convex with modulus of convexity of the form (3.7), , any probability measure on , , and suppose that there exists a minimizer . Then the following hold.*(i)*For every there exists of the form (2.4) such that
For each such one has
and if of the form (2.4) is such that
then
*(ii)*For , equal to the uniform probability measure on , every , and every choice of fixed-basis functions , there exists a uniformly convex functional (such a functional can be also chosen to be Lipschitz continuous with Lipschitz constant , but this is not needed in the inequalities (3.24)–(3.29), since they do not contain ) with modulus of convexity of the form (3.7) and minimizer such that for every one has
*(iii)*The statements (i) and (ii) still hold by replacing the set by , for a multiple of . The only difference is that the estimates (3.24) and (3.25) are replaced, respectively, by
for and by
for .*

*Proof.* (i) The estimate (3.20) follows from Theorem 2.1. The bound (3.21) follows from (3.20), the definition of modulus of continuity, and the assumption of Lipschitz continuity of . Finally, (3.23) is obtained from property (3.11) of the modulus of convexity and its expression (3.7).

(ii) (3.24) comes from Theorem 2.2: the constant is introduced in order to remove the supremum with respect to in formula (2.7) and replace it with the choice , where is any function that achieves the bound (2.7) up to the constant factor ; (3.25) follows from (3.24), (3.11), and (3.7), choosing as any functional that is uniformly convex with modulus of convexity of the form (3.7), and such that .

(iii) The estimates (3.20), (3.21), and (3.23) still hold when the set is replaced by since for , whereas formulas (3.26)–(3.29) are obtained in the same way as formulas (3.24) and (3.25), by applying Proposition 2.4 instead of Theorem 2.2.

#### 4. Discussion

Classes of function-approximation and functional optimization problems have been investigated for which, for a given desired error, certain variable-basis approximation schemes with sigmoidal computational units require fewer parameters than fixed-basis ones. Previously known bounds on the accuracy have been extended, with better rates, to families of functions whose effective number of variables is much smaller than the number of their arguments .

Proposition 3.2 shows that there is a close connection between certain problems of function approximation and functional optimization. Indeed, for these two classes of problems, the approximation error rates for the first class can be converted into rates of approximate optimization for the second one and vice versa. In particular, for , , and any linear approximation scheme , the estimates (3.21) and (3.25) show families of functional optimization problems for which the error in approximate optimization with variable-basis schemes of sigmoidal type is smaller than the one associated with the linear scheme. For and , a similar remark can be made for the estimates (3.21) and (3.27) and for the bounds (3.21) and (3.29). Finally, the bound (3.23) shows that for large any approximate minimizer of the form (2.4) differs only slightly from the true minimizer , even though the error in approximate optimization (3.22) and the associated approximation error (3.23) have different rates. In contrast, the estimates (3.24), (3.26), and (3.28) show that, for any linear approximation scheme , there exists a functional optimization problem whose minimizer cannot be approximated with the same accuracy by the linear scheme.

The results presented in the paper provide some theoretical justification for the use of variable-basis approximation schemes (instead of fixed-basis ones) in function approximation and functional optimization.

#### Acknowledgment

The author was partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Adaptive State Estimation and Optimal Control.”