#### Abstract

We introduce in this work an extension of the generalization complexity measure to continuous input data. The measure, originally defined in Boolean space, quantifies the complexity of data in relation to the prediction accuracy that can be expected when using a supervised classifier such as a neural network or an SVM. We first extend the original measure to continuous functions and then, using an approach based on the set of Walsh functions, consider the case of a finite number of data points (input/output pairs), which is the usual practical case. Using a set of trigonometric functions, we construct a model that relates the size of the hidden layer of a neural network to the complexity. Finally, we demonstrate the application of the introduced complexity measure, by using the generated model, to the problem of estimating an adequate neural network architecture for real-world data sets.

#### 1. Introduction

Feed-forward neural networks trained by back-propagation have become a standard technique for classification and prediction tasks given their good generalization properties. However, the process of selecting an adequate neural network architecture for a given problem is still a controversial issue. Several important contributions regarding the number of hidden neurons needed to implement a given function in a neural architecture have been made using different methods. Baum and Haussler [1] obtained some bounds on the number of neurons in an architecture related to the number of training examples that can be learnt using networks composed of linear threshold units. Barron [2] made an important contribution about the approximation capabilities of feed-forward networks, computing an estimate of the number of hidden nodes necessary to optimize the approximation error. Camargo and Yoneyama [3] obtained a result for estimating the number of nodes needed to implement a function using Chebyshev polynomials and previous results from Scarselli and Chung Tsoi [4] about the number of nodes needed for approximating a given function by polynomials. Hunter et al. [5] focused on the importance of selecting the learning algorithm to train closer to optimal architectures. Methods based on the geometry of output classes [6–8], singular value decomposition [9], information entropy [10], and the signal-to-noise ratio [11] have been used to obtain an approximation to the size of the hidden layer in a neural architecture.

Some of the previous studies tried to determine the adequate architecture depending on the complexity of the data set available for a given problem, but as expected measuring the complexity of data is a difficult task. Firstly, it has to be clearly defined what exactly the measure tries to quantify, as complexity can be related to several aspects of the data. Even if different complexity measures related to the size of the architectures needed to implement the data or to the complexity of learning have been proposed in the past [12–14], they have not been applied to the neural network architecture selection problem, in principle because they have not been proposed with this focus.

Moreover, several approaches have been proposed within the learning theory area to analyze the relationship between generalization and complexity. Ho et al. [15, 16] studied the complexity that characterizes the difficulty of a classification problem, and they suggest using this value to guide the selection of classifier. Sánchez et al. [17] tried to characterize the behavior of the k-NN rule when working under certain situations. More specifically, their analysis focused on the use of some data complexity measures to describe class overlapping, feature space dimensionality, and class density and discover their relation with the practical accuracy of this classifier. Duch et al. [18] suggested that the identification of datasets with high complexity is important to test new methods in computational intelligence.

However, most of these analyses focused on the complexity of the architectures and on the error obtained at the end of the training process rather than on the intrinsic complexity of the data. Recently, Franco and colleagues [19, 20] have proposed a complexity measure named “generalization complexity” (GC) that aims to quantify the level of generalization ability that can be expected when Boolean data are used in a classification algorithm. The measure has also been used in the process of architecture selection involved in the implementation of a neural network, as it is expected that for more complex data larger neural network architectures might be more adequate [21]. Nevertheless, the proposed measure can only be applied to Boolean input data. In this work, the Boolean generalization complexity is therefore first extended to the continuous input case, and a series of tests is then performed to validate the proposal using a set of continuous functions with parametrized complexity. Also, by using the set of orthonormal Walsh functions, we extend the proposal for its use with patterns of data. Finally, a model is built from which it is possible to estimate the adequate feed-forward neural network architecture for real-world benchmark data sets by choosing the number of neurons to include in the hidden layer, as the size of the input and output layers is determined by the problem.

#### 2. The Generalization Complexity Measure and Its Extension to Real Input Values

Our main goal in this work is to extend the GC measure defined in for real input and real output functions . The choice of the intervals for the input and for the output is arbitrary and it is used for simplicity with no restrictions for the general case. We will analyze the more general case of having a continuous output as this case can later be easily particularized to the Boolean output case, more related to classification problems.

The original definition of the GC measure [19, 20] comprises two terms accounting for the first and second nearest neighbor pairs of input data points (), where the neighborhood is defined in terms of their Hamming distance. Let be the total number of examples (or equivalently patterns) considered and the number of first nearest neighbors that every example () has; that is, examples that are the closest Hamming distance. The first term of the GC measure, , known to be the more influential, is defined in Boolean space as where the first factor is a normalization one taking into account the number of pairs considered. Essentially, (1) measures the proportion of neighboring pairs that have different output, that is, belong to different output classes.
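As a concrete illustration of the Boolean first term described above, the following minimal sketch counts, for every input, the neighbors at Hamming distance 1 whose output differs, and normalizes by the number of (ordered) pairs considered. The function names and the test functions are illustrative choices, not taken from the original work.

```python
from itertools import product


def gc_first_term(f, n_bits):
    """Proportion of Hamming-distance-1 pairs whose outputs differ."""
    total = 0
    for bits in product((0, 1), repeat=n_bits):
        for i in range(n_bits):
            neighbor = list(bits)
            neighbor[i] ^= 1  # flip one bit: Hamming distance 1
            total += abs(f(bits) - f(tuple(neighbor)))
    n_examples = 2 ** n_bits   # all Boolean inputs
    n_neighbors = n_bits       # first nearest neighbors of each input
    return total / (n_examples * n_neighbors)


def parity(bits):
    return sum(bits) % 2


def constant(bits):
    return 0


def majority(bits):
    return int(sum(bits) > len(bits) / 2)


if __name__ == "__main__":
    for name, fn in [("parity", parity), ("constant", constant), ("majority", majority)]:
        print(name, gc_first_term(fn, 3))
```

Under this definition every neighboring pair of the parity function differs, which is why parity reaches the maximum value of the first term, while a constant function gives zero.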

In the previous equation, the distance between pairs of inputs is measured by the Hamming distance, but this measure is not applicable for real valued input data. Instead, we will opt for a straightforward choice and use the Euclidean distance. We consider first the 1-dimensional (1D) case corresponding to a single continuous input variable, starting the process by discretizing the input interval in subintervals of length . In this way a data point, , will be indicated by the subinterval in which its coordinates are included , where (), with and . The total number of examples in the 1D case is equal to , while, for an arbitrary dimension , the discretization of every variable in the same way leads to examples.

Let us define for 1D as the value of the function at the center of subinterval : , and also we assume that and . For fixed , we will say that two input data points are first nearest neighbors if they are at distance (this would be the equivalent of Hamming distance 1 in Boolean space).

In this way, (1) can be generalized as where . For we can obtain the first term of the complexity measure, , for continuous input data using a grid with subintervals: where we used , , and substituted the sum over the two neighboring pairs by a forward sum over the sites. Defining the complexity measure density , we can write which in the limit () converges to

In terms of notation we will use for the first term of the original Boolean GC measure, for the discretized version for continuous functions, and will denote continuous generalization complexity density (CGC).
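The discretize-and-take-the-limit construction above can be checked numerically. The sketch below is written under stated assumptions: the limiting density is taken to be the integral of |f'(x)| over [0, 1], the forward sum is closed periodically at the boundary, and any constant normalization (for example by the output range) is omitted. All function names are hypothetical.

```python
import numpy as np


def discrete_density_1d(f, n_cells):
    """Forward sum of |f_{i+1} - f_i| over the cell centers of [0, 1].

    Assumption: periodic closure, so the last cell is compared with the first.
    """
    x = (np.arange(n_cells) + 0.5) / n_cells
    values = f(x)
    return float(np.sum(np.abs(np.roll(values, -1) - values)))


def continuous_density_1d(f_prime, n_quad=200_000):
    """Midpoint-rule estimate of the integral of |f'(x)| over [0, 1]."""
    x = (np.arange(n_quad) + 0.5) / n_quad
    return float(np.mean(np.abs(f_prime(x))))


def f(x):
    return np.sin(3 * np.pi * x)


def f_prime(x):
    return 3 * np.pi * np.cos(3 * np.pi * x)


if __name__ == "__main__":
    print("continuous:", continuous_density_1d(f_prime))
    for n_cells in (10, 100, 1000):
        print(n_cells, discrete_density_1d(f, n_cells))
```

As the grid is refined, the forward sum approaches the value of the integral, which is the behavior the limit in the text relies on.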

Equation (5) will be our proposal for the first term of the GC for continuous value input data for . Clearly, this function will be larger for more fluctuating functions as expected. For , we have where is the value of the function within the square with coordinates , . The previous expression can be written more compactly as

If takes alternatively the maximum and minimum values () on neighboring sites, , taking care of counting only once the difference between neighboring sites. Defining the complexity measure density as before, and following the same steps, we get The above procedure can be straightforwardly generalized to arbitrary dimension obtaining We observe that (9) is not bounded; that is, there is not a function with maximum complexity. This seems to be an intrinsic difficulty as for a real function the number of maxima and minima can grow indefinitely. In any case, (8) can be useful because it can measure complexities relative to a given function.

Along similar lines, we can build the continuous version of the second term of the complexity measure, . In its original version for Boolean functions this term accounts for the output difference of pair of data points located at Hamming distance 2: For the continuous case we can write, for , Defining the second-order complexity density as , we obtain in the limit Hence, for , we have that . For , we have that in the limit leads to Equation (14) will be our proposal for the continuous version of the second term of the GC measure.
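For reference, the Boolean second term that serves as the starting point here can be sketched in the same way as the first term, now comparing inputs that differ in exactly two bits. The normalization by the number of ordered pairs is an assumption made by analogy with the first term, and the function names are illustrative.

```python
from itertools import combinations, product


def gc_second_term(f, n_bits):
    """Proportion of Hamming-distance-2 pairs whose outputs differ."""
    bit_pairs = list(combinations(range(n_bits), 2))
    total = 0
    for bits in product((0, 1), repeat=n_bits):
        for i, j in bit_pairs:
            neighbor = list(bits)
            neighbor[i] ^= 1  # flip two distinct bits: Hamming distance 2
            neighbor[j] ^= 1
            total += abs(f(bits) - f(tuple(neighbor)))
    # Assumed normalization: number of inputs times second neighbors per input
    return total / (2 ** n_bits * len(bit_pairs))


def parity(bits):
    return sum(bits) % 2


def first_bit(bits):
    return bits[0]


if __name__ == "__main__":
    print("parity:", gc_second_term(parity, 4))
    print("first bit:", gc_second_term(first_bit, 4))
```

Flipping two bits never changes the parity, so the parity function, which is maximal for the first term, gives zero for this term.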

##### 2.1. Testing the Generalization Complexity on a Set of Continuous Functions

Having introduced an extension of the complexity measure for a set of continuously distributed data (9) and (14), we now would like to test the proposal, and for that we will use a set of trigonometric functions with parametrized complexity. The set in dimension is defined by with taking integer values , even if real values can be also considered (e.g., ). Dividing the -dimensional hypercube by using a grid of spacing leads to a function that cancels at the borders of the hypercubes of side , taking alternatively the values ±1 on nearest neighbour cells. This function is precisely the well-known parity Boolean function, having a very high complexity among the set of Boolean functions [19]. Measured by the first term of the GC measure, the parity function achieves maximum complexity of 1, and thus, given a value of the discretization spacing of , it makes sense to consider only values of up to a maximum value .

From the definition of the first term () of the continuous GC measure (CGC) (9), the complexity of the set of trigonometric functions defined by (15) can be obtained: We observe that the complexity of the set of functions grows linearly with , which is proportional to the density of points where the function vanishes, a sensible measure of the variation of the function.
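The linear growth can be reproduced numerically. The sketch below assumes that the family of functions is a product of sines, one factor sin(nπx) per input variable, and that the D-dimensional density is the integral of the sum of |∂f/∂x_i| divided by D. Both are assumptions about the form of (15) and (9), and any further normalization constant is omitted.

```python
import numpy as np


def trig_family(n, *coords):
    """Assumed form of the test family: product of sin(n*pi*x_i)."""
    out = np.ones_like(coords[0])
    for x in coords:
        out = out * np.sin(n * np.pi * x)
    return out


def grid_density(n, dim, n_cells=400):
    """Grid estimate of (1/D) * integral of sum_i |df/dx_i| on [0, 1]^D."""
    axis_points = (np.arange(n_cells) + 0.5) / n_cells
    mesh = np.meshgrid(*([axis_points] * dim), indexing="ij")
    values = trig_family(n, *mesh)
    total = 0.0
    for axis in range(dim):
        # each forward difference approximates |df/dx_i| * h
        total += np.sum(np.abs(np.diff(values, axis=axis)))
    h = 1.0 / n_cells
    # multiply by the remaining h**(dim - 1) of the cell volume
    return float(total * h ** (dim - 1) / dim)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, grid_density(n, 1), grid_density(n, 2))
```

Under these assumptions the one-dimensional value is 2n and the two-dimensional value is 4n/π, so doubling n doubles the estimate in both cases.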

The family of functions (15) can be generalized to consider different variation indexes according to the spatial direction; namely, where . The complexity can also be easily computed and leads to We use the family of functions (15) to compare the behavior of the discrete and continuous complexity measures introduced in the previous section. To do that we computed numerically the discrete complexities and as a function of for and , for a fixed value of the discretization . Figure 1 shows the complexity values obtained for the continuous and discrete first terms ( and , resp.) for one and two dimensions (Figures 1(a) and 1(b)), noting that for relatively low values of , that is, when , the agreement is quite good, while for larger values, the discrete version underestimates the true complexity. A similar behaviour is observed for both plotted dimensions, noting that as the dimension increases the maximum complexity decreases by a factor (cf. (18)). The evaluation of the second term of the continuous complexity measure () is more cumbersome but it can be obtained with the aid of numerical integration software. In particular, for , the calculations lead to Figure 2 shows the results for the second term of the complexity measure for the 2D set of functions. In the figure and are shown as a function of . The continuous complexity grows linearly according to what has been obtained in (19), showing a different behaviour with respect to the discrete version counterpart with a nonmonotonic curve. The quadratic-like shape of (in Boolean space) has been previously analyzed [19] and its behaviour independently of does not hold for the continuous case. The fact that the value of is proportional to (for the set of sinusoidal benchmark functions, cf. (15)) implies that the second term does not contain independent information from what is provided by the first term.
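The underestimation by the discrete version at large variation index can be seen with a short one-dimensional experiment. This is only a sketch: it assumes the test function sin(nπx), compares the sampled variation on a fixed grid with the exact total variation 2n, and ignores the specific normalizations of the discrete and continuous terms.

```python
import numpy as np


def sampled_variation(n, n_cells):
    """Sum of |f_{i+1} - f_i| for sin(n*pi*x) sampled at the cell centers."""
    x = (np.arange(n_cells) + 0.5) / n_cells
    values = np.sin(n * np.pi * x)
    return float(np.sum(np.abs(np.diff(values))))


if __name__ == "__main__":
    n_cells = 64  # fixed discretization
    for n in (1, 4, 16, 48):
        exact = 2 * n  # total variation of sin(n*pi*x) on [0, 1]
        approx = sampled_variation(n, n_cells)
        print(n, exact, round(approx, 3), round(approx / exact, 3))
```

The sampled variation can never exceed the exact one, and the ratio drops once the oscillation period becomes comparable to the grid spacing.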

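**Figure 1:** Complexity values obtained for the continuous and discrete first terms in (a) one dimension and (b) two dimensions.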

#### 3. Use of Walsh Functions for Testing and Estimation of GC

The set of Walsh functions introduced by Walsh in 1923 [22] is a set of orthonormal binary functions with continuous input. Walsh functions have been widely applied in signal processing [23, 24] and are also well known because of their relationship to the Hadamard transform [25]. The approach developed in the previous section cannot be applied to a set of patterns (the standard case for practical problems) as it requires knowing the analytic expression of the underlying function. In this section, we first compute the complexity of the set of Walsh functions, showing that it leads to sensible results for the estimation of GC. After this test, we apply the set of Walsh functions to approximate the GC for a set of patterns. The choice of the set of Walsh functions is motivated by the fact that the original GC defined in Boolean space can be computed almost straightforwardly for this set given its discrete output. Also, the intrinsic discretization of the input space as the order of the Walsh functions is increased favors their application to continuous input problems.

##### 3.1. The GC of the Set of Walsh Functions

The proposed complexity measure (9) can be applied to the set of Walsh functions by introducing an appropriate limiting procedure. Let us consider first the one-dimensional case, namely, the set of Walsh functions defined on the real interval , where the index is chosen so that it coincides with the number of nodes of the function. For instance, for all , if , if , and so forth.

We will introduce a set of continuous parametric functions to approach the Walsh functions. can be constructed in such a way that it has the same nodes as ; it is differentiable in the neighborhood of all the nodes of and . The functions can be constructed by combining sigmoidal functions centered at the nodes of and constant functions taking values ±1 between them, joined smoothly by any interpolation procedure, such as a spline or polynomial method. Figure 3 shows two Walsh functions approximated by using hyperbolic tangent functions combined with constant ones.

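**Figure 3:** Two Walsh functions, panels (a) and (b), approximated by hyperbolic tangent functions combined with constant ones.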

Let us consider for simplicity a finite set of Walsh functions up to order (for some fixed integer value of ). Then, the location of the nodes of every one of these functions belong to the set of values , . Let be an arbitrary interval enclosing only one particular node . Then the following properties hold: Hence, we can write where the coefficients can take the values (if has no node at ) and ; otherwise is a real function sharp peaked around which satisfies , being a Dirac delta function [26]. Then, we can define the complexity of the Walsh functions as

From (5), (21), and (22), it follows that . The extension to higher dimension is straightforward. Let be a D-dimensional Walsh function, where is a set of one-dimensional Walsh indexes, defined as before. From (9) we obtain
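The limiting procedure of this subsection can be illustrated numerically in one dimension. The sketch below builds Walsh functions ordered by their number of sign changes (obtained here by sorting the rows of a Sylvester Hadamard matrix, an implementation choice not taken from the original work), replaces each jump by a hyperbolic tangent step of width `eps`, and measures the total variation of the smooth approximant. With outputs ±1 and no further normalization, each node contributes a jump of size 2.

```python
import numpy as np


def walsh_matrix(order):
    """Rows are Walsh functions on 2**order equal cells, sorted by sign changes."""
    h = np.array([[1]])
    for _ in range(order):
        h = np.kron(np.array([[1, 1], [1, -1]]), h)
    sign_changes = np.sum(h[:, 1:] != h[:, :-1], axis=1)
    return h[np.argsort(sign_changes)]


def smooth_walsh(row, x, eps):
    """Approximate a Walsh function by a product of tanh steps at its nodes."""
    n_cells = len(row)
    nodes = (np.flatnonzero(row[1:] != row[:-1]) + 1) / n_cells
    values = np.full_like(x, float(row[0]))
    for node in nodes:
        # factor is about +1 left of the node and about -1 right of it
        values = values * np.tanh((node - x) / eps)
    return values


def total_variation(values):
    return float(np.sum(np.abs(np.diff(values))))


if __name__ == "__main__":
    walsh = walsh_matrix(3)
    x = np.linspace(0.0, 1.0, 200_001)
    for k, row in enumerate(walsh):
        print(k, total_variation(smooth_walsh(row, x, eps=1e-3)))
```

For a sharp enough step, the variation of the approximant depends only on the number of nodes, consistent with a complexity proportional to the Walsh index.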

##### 3.2. GC Estimation for a Set of Data Points Using the Base of Walsh Functions

Suppose that we want to compute the coefficients, , for a given function using a set of Walsh functions defined in the
given a limited set of sampling data points . We will solve the estimation of the coefficients solving a minimization problem of the square error :
where *≡* .

To find the minimum of the error function, , we compute the first derivative and make it equal to 0:
from which
Define the vector *≡* as
and matrix with
Equation (27) takes the linear form = , whose solution is given by
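For orientation, the steps of a standard least-squares fit of this kind can be written out as follows. The notation is assumed for illustration only ($P$ sampling points $\mathbf{x}_k$, basis functions $W_i$, coefficients $c_i$, vector $\mathbf{b}$, matrix $A$) and may differ from the symbols used in the original equations.

```latex
\begin{align}
  E(\mathbf{c}) &= \sum_{k=1}^{P} \Big( f(\mathbf{x}_k) - \sum_{i} c_i\, W_i(\mathbf{x}_k) \Big)^{2}, \\
  \frac{\partial E}{\partial c_j} &= -2 \sum_{k=1}^{P} \Big( f(\mathbf{x}_k) - \sum_{i} c_i\, W_i(\mathbf{x}_k) \Big) W_j(\mathbf{x}_k) = 0, \\
  \sum_{i} c_i \sum_{k=1}^{P} W_i(\mathbf{x}_k)\, W_j(\mathbf{x}_k) &= \sum_{k=1}^{P} f(\mathbf{x}_k)\, W_j(\mathbf{x}_k), \\
  b_j \equiv \sum_{k=1}^{P} f(\mathbf{x}_k)\, W_j(\mathbf{x}_k), \qquad
  A_{ji} &\equiv \sum_{k=1}^{P} W_i(\mathbf{x}_k)\, W_j(\mathbf{x}_k), \\
  A\,\mathbf{c} = \mathbf{b} \quad &\Longrightarrow \quad \mathbf{c} = A^{-1}\mathbf{b}.
\end{align}
```

The last line assumes that $A$ is invertible, which requires the sampled basis functions to be linearly independent on the available points.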

A practical issue of the previous procedure is the computational cost involved, as for D-dimensional input data a matrix of size has to be inverted (cf. (30)), where is the maximum spacing used for the construction of the 1D set of Walsh functions. Nevertheless, such a computation has to be done only once for given values of and , being independent of the data.

Once the Walsh coefficients of a function (or data) have been obtained, the CGC can be approximated by the same limiting procedure of the previous section. For instance, in one dimension we have where we have used (21). For an expansion of a D-dimensional function on a finite set of Walsh functions with (, ), we obtain similarly where indicates the approximation of the CGC using the set of Walsh basis functions. We carried out an experiment to analyze the accuracy of the proposed approximation and obtained a graph similar to the one shown in Figure 1(a), indicating that the approximation works correctly. The fact that the graph obtained is almost identical to the one in Figure 1(a) is consistent with what can be expected, as both are discrete approximations of the continuous value of the complexity.
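A one-dimensional version of the whole estimation procedure can be sketched as follows. The code assumes inputs in [0, 1], fits the Walsh coefficients by least squares with `numpy.linalg.lstsq` (instead of an explicit matrix inversion), and approximates the CGC as the sum of the absolute jumps of the piecewise-constant Walsh expansion at the cell boundaries, with no further normalization. The function names and the test signal are illustrative.

```python
import numpy as np


def walsh_matrix(order):
    """Rows are Walsh functions on 2**order equal cells, sorted by sign changes."""
    h = np.array([[1]])
    for _ in range(order):
        h = np.kron(np.array([[1, 1], [1, -1]]), h)
    sign_changes = np.sum(h[:, 1:] != h[:, :-1], axis=1)
    return h[np.argsort(sign_changes)]


def estimate_cgc_1d(x, y, order):
    """Estimate the 1D complexity density from samples (x_k, y_k), x_k in [0, 1]."""
    walsh = walsh_matrix(order)
    n_cells = walsh.shape[1]
    # cell index of every sample; x == 1.0 is assigned to the last cell
    cells = np.minimum((x * n_cells).astype(int), n_cells - 1)
    design = walsh[:, cells].T.astype(float)  # design[k, i] = W_i(x_k)
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    reconstruction = walsh.T @ coeffs  # value of the expansion on each cell
    # each cell boundary contributes the absolute jump of the expansion
    return float(np.sum(np.abs(np.diff(reconstruction))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(2000)
    y = np.sin(3 * np.pi * x)
    print(estimate_cgc_1d(x, y, order=5))  # compare with the exact value 6
```

With 32 cells the estimate falls somewhat below the exact total variation, since averaging within each cell flattens the extrema of the function.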

#### 4. Application to Real-World Input Data

In order to test the developed procedures in practice, we first construct a model based on the extension of the complexity measure proposed previously and then apply this model to the estimation of an adequate neural network architecture for real-world problems. The model was estimated using the set of trigonometric functions defined by (15) for . For each of the analyzed data sets we calculated the complexity with the above method, finding values in the range between 0 and 0.5. The generalization ability was computed for a set of single hidden layer neural architectures with between 2 and 50 neurons in the hidden layer, choosing the one that leads to the lowest validation error computed in a cross-validation procedure to avoid overfitting (early stopping), where the training is performed by the standard back-propagation algorithm. From the number of neurons obtained for each of the analyzed cases, a quadratic fit was applied to obtain the final model, shown in Figure 4 by the solid line.
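The final fitting step can be sketched with a quadratic polynomial fit. The numbers below are synthetic placeholders chosen only to make the example run; they are not the values measured in this work, and the helper name is hypothetical. The prediction is clipped to the range of 2 to 50 hidden neurons explored in the simulations.

```python
import numpy as np

# Synthetic placeholder data, NOT the values obtained in this work:
# complexity of each training function and its best hidden-layer size.
complexity = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.50])
best_hidden = np.array([3, 4, 8, 14, 23, 34])

# Quadratic fit: hidden neurons as a function of complexity
coeffs = np.polyfit(complexity, best_hidden, deg=2)


def suggest_hidden_neurons(cgc):
    """Evaluate the fitted model and clip to the explored range [2, 50]."""
    raw = float(np.polyval(coeffs, cgc))
    return int(np.clip(round(raw), 2, 50))


if __name__ == "__main__":
    for value in (0.1, 0.25, 0.45):
        print(value, suggest_hidden_neurons(value))
```

Given a CGC estimate for a new data set, the fitted curve returns a suggested hidden-layer size that can serve as a starting point before any fine tuning.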

Figure 4 shows the application of the developed method, described in Section 3.2, to obtain the value of CGC for a given data set. Using the constructed model (the solid line in Figure 4), it is then possible to use the obtained CGC value to get an estimate of an adequate neural architecture to implement the function. The figure also shows the best architecture found by intensive numerical simulations (see Table 1 for the numerical values).

Table 1 shows the results obtained by applying the developed method to 10 four-dimensional benchmark data sets. The data set problems are taken from the UCI repository and for each problem 4 input variables were selected. The columns show the identifier of the function, the name of the benchmark data set with the 4 input variables used (indicated as a superscript), the estimated generalization complexity obtained from (22), the number of neurons in the hidden layer estimated by the model (), and the best number of neurons found from exhaustive simulations (). The results obtained show a quite good correlation between the estimated and best found values (, value = 0.002), suggesting the validity of the approach, even if there are some cases, like the function indicated in the table by , for which the estimation is not very accurate. Nevertheless, some discrepancies are always expected, as choosing an adequate neural architecture is a complex problem with no exact solution: it depends on the particular set of patterns presented and the training process used, and thus it is an intrinsically noisy process.

#### 5. Discussion and Conclusions

We have introduced in this work an extension of the generalization complexity (GC) measure to continuous input data. The analysis of the new measure on a set of trigonometric functions with parametrized complexity shows that the new proposal is consistent with the expected results and with the spirit of the original measure, as the GC essentially measures, for a set of data, the output variations as the inputs are modified. Nevertheless, a difference between the continuous and discrete cases exists regarding the role of the second term of the GC: in the continuous case this term is no longer independent of the first term (at least for the set of trigonometric functions), and thus it does not add extra information about the complexity of the data. We have also introduced an approach based on the set of Walsh functions for computing the CGC measure for data expressed as a set of patterns, the typical case in most practical applications. A model relating architecture size to function complexity was fitted and then applied to the problem of selecting an adequate neural network architecture in ten real-world benchmark problems. The application of the method to the benchmark data shows that the estimated neural architectures are quite close to the best values found, indicating the suitability of the developed approach for the architecture selection problem. The method is more efficient than the trial-and-error alternative for choosing a proper neural network architecture, as the computationally heavy part of the procedure is a matrix inversion that has to be done only once for a given dimension and thus, once computed, can be reused with different data sets.

The GC measure provides an estimate of the complexity of the data, and as such it could possibly be used not only for choosing an adequate architecture for neural networks, but also with other predictive models (such as SVMs or decision trees), for example, for choosing the magnitude of the penalization term of the model complexity (regularization).

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors acknowledge support from CICYT (Spain) through Grants TIN2008-04985 and TIN2010-16556 (including FEDER funds), from Junta de Andalucía through Grants P08-TIC-04026 and P10-TIC-5770, and from CONICET (Argentina) and SECyT Universidad Nacional de Córdoba (Argentina).