#### Abstract

Multivariate noise in the learning process is usually assumed to follow a standard multivariate normal distribution. This hypothesis often fails to hold in real-world situations. In this paper, we consider an approach based on the multivariate skew-normal distribution, which allows a continuous variation from normality to nonnormality. We extend the generalized least squares error function to multivariate nonlinear regression in order to learn imprecise data. A simulation study and an application to real datasets, both based on multilayer perceptron (MLP) neural networks with a bivariate, continuous, and asymmetric response, revealed a significant gain in precision when the new quadratic error function is used for these types of data instead of the classical generalized least squares error function with an arbitrary covariance matrix.

#### 1. Introduction

Let $X$ and $Y$ be, respectively, $\mathbb{R}^p$- and $\mathbb{R}^q$-valued random vectors defined on the same probability space, and let $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$ be independent and identically distributed (i.i.d.) replications of $(X, Y)$. One seeks to build the best estimator of the regression function $f: \mathbb{R}^p \to \mathbb{R}^q$ in the sense of a loss function $L$, where the bias and variance terms must be minimized jointly, the bias-variance balance quantifying the accuracy of the estimator. Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ be a parametric regression model family, not necessarily nonlinear, with $\Theta \subset \mathbb{R}^m$. The function $f_\theta(x)$ is continuous in $\theta$ for fixed $x$ and measurable in $x$ for all fixed $\theta$, and $\Theta$ is a compact (i.e., closed and bounded) subset of the set of possible parameters of the regression model family. We assume that $D_n$ is an observed data set coming from a true model whose regression function is $f_{\theta_0}$, for a (possibly not unique) $\theta_0$ in the interior of $\Theta$.

Nonlinear regression is often used to model empirical data for forecasting purposes, where the relationships underlying the measurements can be strongly nonlinear, nonunivocal, noisy, and dynamic in nature [1–6]. Nonlinear regression techniques such as projection-pursuit regression, support vector machines, and multilayer perceptrons are applied in various fields [7–10]. For instance, in agronomy, crop yield can be predicted with $Y$ = {agricultural yields, insurance policy} and $X$ = {agronomic, edaphic, climatic and economic parameters}; in climatology, long wave atmospheric radiation can be estimated with $Y$ = {heating rates profile, radiation fluxes} and $X$ = {temperature, moisture, , , cloud parameters profiles, surface fluxes, etc.}; and in finance, financial characteristics can be estimated with $Y$ = {future stock index, exchange rate, option price, etc.} and $X$ = {financial data up to a certain time}.

Parameters are estimated using learning techniques. Learning consists of estimating the true parameter based on the observations . This can be done by minimizing the mean square error (MSE) function [11]: . Minimizing this quadratic error function leads to a biased under the assumption , when is imprecise. To address this misspecification of the quadratic error function, Badran and Thiria [1] proposed another function, obtained with a Bayesian approach by supposing . This hypothesis is arbitrary, if not false, given the nature of the data in some fields of application [12]. In this work, we extend the results of Badran and Thiria [1] using new hypotheses on . One of these hypotheses is that follows the multivariate skew-normal distribution. This distribution refers to a parametric class of probability density functions (p.d.f.) that extends the normal distribution with an additional shape parameter that regulates the skewness, allowing a continuous variation from normality to nonnormality. In addition, we compare the newly established error function with that of Badran and Thiria [1] for an arbitrary covariance matrix, in particular when the matrix is diagonal with either constant or nonidentical diagonal elements. This comparison is made on simulated and real data in order to evaluate the performance of the proposed error function.

In Section 2 of the paper, we recall some existing concepts. In Section 3, we establish the main results. Section 4 presents the simulation study and an application to a real dataset. Section 5 concludes. The proofs of the main results are presented in the Appendix.

#### 2. Preliminary Definitions

##### 2.1. Imprecise Data and Multivariate Skew-Normal Distribution

*Definition 1.* Imprecise data are data that are inconsistent in the way they are collected and/or recorded (rounding, grouping, etc.). They are data with asymmetry, or data whose distribution shows a slight or heavy tail.

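*Definition 2.* A random vector $Z$ of dimension $k$ is said to follow a multivariate skew-normal distribution, denoted by $Z \sim SN_k(\Omega, \alpha)$, if it is continuous and its density function is given by (Azzalini and Dalla Valle [13])

$$f_k(z) = 2\,\phi_k(z; \Omega)\,\Phi(\alpha^{\top} z), \quad z \in \mathbb{R}^k, \tag{1}$$

where $\phi_k(z; \Omega)$ is the density function of a $k$-dimensional multivariate normal distribution with standardized marginals and correlation matrix $\Omega$, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\alpha$ is a $k$-dimensional vector that regulates departure from symmetry.

Consider the variance-covariance matrix $\Sigma = \omega\,\Omega\,\omega$ with $\omega^{-1} = \mathrm{diag}(1/\sigma_1, \ldots, 1/\sigma_k)$, the diagonal matrix of dimension $k$ of the inverses of the standard deviations $\sigma_1, \ldots, \sigma_k$. The expression of the probability density function of $Y = \omega Z$ is then

$$f_k(y) = \frac{2}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\, y^{\top} \Sigma^{-1} y\right) \Phi\!\left(\alpha^{\top} \omega^{-1} y\right), \tag{2}$$

where $|\Sigma|$ denotes the determinant of matrix $\Sigma$. For more details, see Azzalini and Dalla Valle [13].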
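As an illustration of Definition 2, the following Python sketch evaluates the skew-normal density in the bivariate case ($k = 2$). It assumes the standard Azzalini and Dalla Valle form $2\,\phi_2(z; \Omega)\,\Phi(\alpha^{\top} z)$; the function names and the values of `rho` and `alpha` are illustrative choices, not values taken from this study.

```python
import math


def std_normal_cdf(t):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def bivariate_normal_pdf(z1, z2, rho):
    """Bivariate normal density with standardized marginals and correlation rho."""
    det = 1.0 - rho * rho  # determinant of the 2x2 correlation matrix
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def bivariate_skew_normal_pdf(z1, z2, rho, alpha):
    """Skew-normal density 2 * phi_2(z; Omega) * Phi(alpha' z) for k = 2."""
    a1, a2 = alpha
    skewing = std_normal_cdf(a1 * z1 + a2 * z2)
    return 2.0 * bivariate_normal_pdf(z1, z2, rho) * skewing


# Illustrative values only: rho and alpha are hypothetical.
density_at_origin = bivariate_skew_normal_pdf(0.0, 0.0, rho=0.5, alpha=(3.0, -2.0))
```

With `alpha = (0, 0)` the skewing factor equals 1/2 everywhere, so the density reduces to the bivariate normal one.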

##### 2.2. Bayes Approach

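Suppose that $y = (y_1, \ldots, y_n)$ is a vector of $n$ observations whose probability density function $p(y \mid \theta)$ depends on the values of an $m$-dimensional parameter $\theta$. Suppose also that $\theta$ itself has a probability density function $p(\theta)$. Then, denoting by $p(y)$ and $p(\theta \mid y)$ the marginal probability density function of $y$ and the conditional probability density function of $\theta$ given $y$, respectively, one has

$$p(y \mid \theta)\,p(\theta) = p(y, \theta) = p(\theta \mid y)\,p(y), \tag{3}$$

so that, given the observed data $y$, the conditional probability density function of $\theta$ when $p(y) \neq 0$ is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)}. \tag{4}$$

By considering $c = 1/p(y)$, we may write (4) alternatively as

$$p(\theta \mid y) = c\,p(y \mid \theta)\,p(\theta). \tag{5}$$

Therefore, Bayes formula in (5) is often written as $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$, which denotes that $p(\theta \mid y)$ is proportional to $p(y \mid \theta)\,p(\theta)$.

*Definition 3.* The prior distribution of $\theta$, $p(\theta)$ in (5), is the probability distribution of $\theta$ without knowledge of $y$. In contrast, the posterior distribution of $\theta$ given $y$, $p(\theta \mid y)$ in (5), is the probability distribution of $\theta$ given $y$.

A Bayesian statistical model is built from a parametric statistical model $p(y \mid \theta)$ and a prior distribution $p(\theta)$ of the parameter $\theta$.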
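The proportionality form of Bayes formula can be illustrated on a finite grid of parameter values. The sketch below is a toy example under that discretization assumption; the function name `posterior` and the numerical values of the prior and the likelihood are hypothetical.

```python
def posterior(prior, likelihood):
    """Posterior on a finite parameter grid: normalize likelihood * prior."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # marginal density of the data
    return [u / evidence for u in unnormalized]


# Hypothetical prior and likelihood values on a three-point parameter grid.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.7]
post = posterior(prior, likelihood)

# With a uniform prior, the posterior is the normalized likelihood.
flat = posterior([1.0 / 3.0] * 3, likelihood)
```

The uniform-prior case matters here because, under such a prior, maximizing the posterior amounts to maximizing the likelihood.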

##### 2.3. Learning

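Learning consists of finding the estimate of the regression function that minimizes the loss function. A simple expression of this function is the generalized least squares (GLS) error function given by

$$E = \iint \left(y - f(x)\right)^{\top} \Sigma_x^{-1} \left(y - f(x)\right) p(x, y)\, dx\, dy, \tag{6}$$

where $x$ and $y$ denote the input and output vectors, $f(x)$ is the output of the regression function at $x$, $p(x, y)$ is the joint probability density function of $(x, y)$, and $\Sigma_x^{-1}$ is the inverse of the covariance matrix of the conditional random variable $y \mid x$, which is a symmetric positive definite matrix of order equal to the dimension of $y$.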
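In practice the GLS error is evaluated on a finite sample. The following Python sketch computes a sample average of the quadratic form $(y_i - f(x_i))^{\top} A\,(y_i - f(x_i))$, where $A$ stands for the inverse covariance matrix, assumed known and constant. The function names are hypothetical, and the normalization (a plain average) is an assumption that may differ from the one used in this paper by a constant factor.

```python
def quadratic_form(residual, precision):
    """Compute r' A r for a residual vector r and a square matrix A."""
    q = len(residual)
    return sum(residual[i] * precision[i][j] * residual[j]
               for i in range(q) for j in range(q))


def empirical_gls_error(targets, predictions, precision):
    """Average of (y_i - f_i)' A (y_i - f_i) over the sample."""
    total = 0.0
    for y, f in zip(targets, predictions):
        residual = [yk - fk for yk, fk in zip(y, f)]
        total += quadratic_form(residual, precision)
    return total / len(targets)


# Toy bivariate example with hypothetical targets and predictions.
targets = [[1.0, 2.0], [3.0, 4.0]]
predictions = [[0.0, 0.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gls_identity = empirical_gls_error(targets, predictions, identity)
```

With the identity matrix as `precision`, the function returns the ordinary mean squared norm of the residuals.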

#### 3. Main Results

##### 3.1. Existence of the Minimum of the GLS Error Function

The following result provides the solution of the minimization problem of (6).

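Theorem 1. *Let $\Sigma_x$ be the unknown covariance matrix of the conditional random variable $y \mid x$, $f(x)$ be the output vector of the regression function, and $\mathbb{E}[y \mid x]$ be the conditional mean vector of the observations $y$. For any value of $\Sigma_x$, the minimum of (6) with respect to $f$ exists and is given by*

$$f^{*}(x) = \mathbb{E}[y \mid x] = \int y\, p(y \mid x)\, dy. \tag{7}$$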

The proof of Theorem 1 is given in the Appendix.

However, from (7) cannot be analytically obtained because and the marginal probability density function are unknown. We consider the case where is known and does not depend on . Thus, the solution of the minimization problem of (6) leads to the following theorem.

Theorem 2. *Let be the orthogonal matrix of the eigenvectors of and , the eigenvalue associated to . Suppose does not depend on with , the inverse diagonal matrix. Then, from (6), exists and is given by*

The proof of Theorem 2 is given in the Appendix.

*Remark 1.* Under the conditions of Theorem 1 or 2, the minimum of problem (6) with respect to is reached at .

The minimizer of the theoretical cost function (6) with respect to is difficult to obtain because and are unknown. However, we can obtain a finite set of independent observations , , which allows us to define the discrete approximation of the theoretical cost function.

Therefore, is obtained by applying the optimization procedure to the empirical cost function given by

with

which is a discrete approximation of the theoretical cost function given by (7).

In the next subsection, we use a Bayesian approach to take into account the uncertainties that appear during the regression process, by considering some hypotheses linked to the nature of . In this case, the maximum likelihood model provides an extension of the empirical cost function (9).

##### 3.2. Extension of GLS considering the Output Noise

Results considered below are established under the following assumptions related to the random i.i.d. noises . A1. and ; , A2. A3. is not spoiled and does not depend on

Theorem 3. *Assume that conditions A1–A3 hold. Let be the empirical cost function, be the posterior distribution of the model parameters () given the finite set of independent observations , be the prior distribution of , which is assumed to be uniform, and be the likelihood of given . Then*

*with .*

The proof of Theorem 3 is given in the Appendix.

Expression (11) appears as an extension of the empirical expression (9). The terms and added to expression (9) help take the imprecise data into account. They may be known or unknown; if unknown, they have to be estimated during the minimization process.

Corollary 1. *Under the conditions of Theorem 3, if the vector is equal to the null vector, we obtain the same result as Badran and Thiria [1], which is*
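To see how a skew-normal noise assumption changes the quantity being minimized, the sketch below computes an average negative log-likelihood, up to additive constants, for residuals $r_i = y_i - f(x_i)$ whose density is assumed to be $2\,\phi(r; \Sigma)\,\Phi(\alpha^{\top} D r)$, with $D$ the diagonal matrix of inverse standard deviations. It is an illustration under these assumptions, not a transcription of (11): the function names, the omission of the determinant term (the covariance is treated as known), and the example values are hypothetical.

```python
import math


def log_two_std_normal_cdf(t):
    """ln(2 * Phi(t)); may fail for very negative t, where Phi underflows to 0."""
    return math.log(math.erfc(-t / math.sqrt(2.0)))


def skew_normal_cost(targets, predictions, precision, inv_std, alpha):
    """Mean of 0.5 * r' A r - ln(2 * Phi(alpha' D r)) over residuals r = y - f."""
    total = 0.0
    for y, f in zip(targets, predictions):
        r = [yk - fk for yk, fk in zip(y, f)]
        quad = sum(r[i] * precision[i][j] * r[j]
                   for i in range(len(r)) for j in range(len(r)))
        skew_arg = sum(a * d * rk for a, d, rk in zip(alpha, inv_std, r))
        total += 0.5 * quad - log_two_std_normal_cdf(skew_arg)
    return total / len(targets)


# Toy bivariate example with hypothetical values.
targets = [[1.0, 2.0], [3.0, 4.0]]
predictions = [[0.0, 0.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
cost_symmetric = skew_normal_cost(targets, predictions, identity, [1.0, 1.0], [0.0, 0.0])
cost_skewed = skew_normal_cost(targets, predictions, identity, [1.0, 1.0], [1.0, 1.0])
```

When `alpha` is the null vector, the skewing term vanishes and the cost is half the average quadratic form used in generalized least squares, which is the situation described in Corollary 1.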

The simulation study presented below is based on multilayer perceptron neural networks (MLP) with one hidden layer. The output variable is continuous and bivariate. This output variable has a deterministic part (its true value) plus an additional random part which is asymmetric. The same MLP architecture is used to predict the bivariate response from simulated data at different sample sizes. Three different empirical cost functions have been used in the learning process of the imprecise data generated for prediction purposes. These are (i) the classical generalized least squares function () (see (10)), (ii) the extension of the generalized least squares function proposed by [1] () (see (12)), and (iii) the new extension of the generalized least squares function that we proposed () (see (11)).

The coefficient of determination, $R^2$, and the mean absolute percentage error, MAPE, were used to compare the performance of the estimation of the nonlinear regression function based on these three empirical cost functions.

#### 4. Simulation Study and Real Data Applications

In this section, we consider the regressor as a multilayer perceptron neural network (MLP).

An MLP is a set of units (neurons) organized in successive layers (the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers), which are interconnected by weights (). Connections are always directed from lower layers to upper layers, and neurons of the same layer are not interconnected.

The MLP model () considered here has four input neurons noted (), three hidden units which represent of input neurons, and two output neurons denoted , (see Figure 1).

Consider as a vector of parameters between the input and hidden layers, as a vector of parameters between the hidden and output layers and as a compact subset of possible parameters of the regression model family . and (real value functions) are considered in the model as the output and hidden-unit activation function, respectively.
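The forward pass of such a network, with four inputs, three hidden units, and two outputs, can be sketched in Python as follows. The activation functions are assumptions made for illustration (hyperbolic tangent in the hidden layer, identity in the output layer), as are the bias terms, the function name, and all weight values.

```python
import math


def mlp_forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """One-hidden-layer MLP: tanh hidden units, identity output units."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_weights, output_bias)
    ]


# Hypothetical weights for a 4-3-2 architecture.
x = [0.5, -1.0, 2.0, 0.0]
hidden_weights = [[0.1, 0.2, -0.1, 0.0],
                  [0.0, -0.3, 0.2, 0.1],
                  [0.4, 0.0, 0.0, -0.2]]
hidden_bias = [0.0, 0.1, -0.1]
output_weights = [[1.0, -1.0, 0.5],
                  [0.2, 0.3, -0.4]]
output_bias = [0.0, 0.5]
y_hat = mlp_forward(x, hidden_weights, hidden_bias, output_weights, output_bias)
```

The parameter vector of the regression model then collects all entries of the two weight matrices and, under the assumption made here, the bias terms.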

##### 4.1. Simulation Plan

Let be a vector of input variables and , a bivariate output variable. We have where and is a vector of residual variables. To guarantee the nonlinearity of , we chose

The coefficients used in the simulation study were prespecified as ; ; ; . The components of the vector follow, respectively, normal, log-normal, negative binomial, and Poisson laws, with , , , and . This mix of distributions for the components yields an unknown probability distribution of , as in the previous theorems. The additional random part at is with , where corresponds to a case of bivariate skew-normal distribution. Consider any covariance matrix , assumed to be known. We arbitrarily chose so that the errors are centered, reduced (unit variance), and uncorrelated, and so that the errors are centered, reduced, and correlated (see Figure 2).

[Figure 2, panels (a) and (b)]

The conditions A1–A3 and the previous information made it possible to randomly generate a population of size .
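Asymmetric bivariate noise of the kind described above can be generated with a simple sign-flip construction. The Python sketch below assumes the noise has density $2\,\phi_2(z; \Omega)\,\Phi(\alpha^{\top} z)$: it draws a correlated normal pair and reflects it through the origin with a probability that depends on $\alpha^{\top} z$. This is a generic sampling technique for skew-normal variates, not necessarily the procedure used in this study; the function name, the seed, and the values of `rho` and `alpha` are hypothetical.

```python
import math
import random


def sample_bivariate_skew_normal(rho, alpha, rng):
    """Draw one vector with density 2 * phi_2(z; Omega) * Phi(alpha' z)."""
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = g1
    x2 = rho * g1 + math.sqrt(1.0 - rho * rho) * g2  # correlation rho with x1
    u = rng.gauss(0.0, 1.0)  # independent standard normal draw
    if u <= alpha[0] * x1 + alpha[1] * x2:
        return (x1, x2)
    return (-x1, -x2)  # otherwise reflect the pair through the origin


# Hypothetical setting: uncorrelated components, skewness on the first one only.
rng = random.Random(2020)
noise = [sample_bivariate_skew_normal(0.0, (3.0, 0.0), rng) for _ in range(20000)]
mean_first = sum(z[0] for z in noise) / len(noise)
mean_second = sum(z[1] for z in noise) / len(noise)
```

Note that skew-normal noise generated this way is not centered in the skewed direction; centering it requires subtracting its mean.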

The simulation algorithm proceeds in the following steps:

Step 1. Generate the input variables to and from their respective distributions for .

Step 2. Calculate using (13) from the values of the coefficients to and the generated input variables to .

Step 3. Generate the outcome variables for , from each distribution added to .

Step 4. Draw a multivariate bootstrap sample of size () from the populations generated in the previous steps.

Step 5. For each bootstrap sample, run the MLP model on of the data (training dataset); the remaining (test dataset) are used for predictive performance analysis. The characteristics considered for the model are as follows:

(1) Activation functions ( and );

(2) Number of hidden neurons = 3;

(3) Learning rate = ;

(4) The initial value of ;

(5) Learning algorithm = standard backpropagation, where each of the three empirical cost functions (, , ) is considered in turn.

Step 6. For each combination of and distribution, repeat the previous step 1000 times and compute the mean of the performance criteria.

The *R* software (v. 3.3.6) [14] is used for data generation and model implementation.
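The resampling part of this procedure (Steps 4 to 6) can be sketched in Python as follows. The sketch is schematic: the 80% training fraction, the function names, and the toy model and criterion are hypothetical stand-ins, since the actual split proportion, the MLP, and the cost functions are not reproduced here.

```python
import random


def bootstrap_split(population, sample_size, train_fraction, rng):
    """Resample with replacement, then split into training and test parts."""
    sample = [population[rng.randrange(len(population))] for _ in range(sample_size)]
    n_train = int(round(train_fraction * sample_size))
    return sample[:n_train], sample[n_train:]


def average_over_replications(population, sample_size, n_replications, evaluate, rng):
    """Average a performance criterion over repeated bootstrap replications."""
    scores = []
    for _ in range(n_replications):
        # The 0.8 training fraction is an assumed value for illustration.
        train, test = bootstrap_split(population, sample_size, 0.8, rng)
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)


def mean_abs_error(train, test):
    """Toy criterion: the 'model' is the training mean, scored on the test part."""
    center = sum(train) / len(train)
    return sum(abs(t - center) for t in test) / len(test)


rng = random.Random(1)
population = [float(i) for i in range(1000)]
score = average_over_replications(population, 100, 50, mean_abs_error, rng)
```

In a full implementation, `evaluate` would train the network on the training part with one of the three cost functions and return a performance criterion computed on the test part.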

##### 4.2. Performance Criteria

Two performance criteria are used on test data: the coefficient of determination, $R^2$, and the mean absolute percentage error, MAPE [15, 16].

An $R^2$ close to 1 indicates good performance. MAPE is a measure of the accuracy of the forecast; a value close to 0 indicates good accuracy.
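Both criteria can be computed with a few lines of Python. The sketch below uses the usual textbook definitions for a single output component, which is an assumption about the exact formulas used in this study; MAPE is expressed in percent and requires nonzero observed values. The function names and the example values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical observed and predicted values for one output component.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2_value = r_squared(observed, predicted)
mape_value = mape(observed, predicted)
```

For a bivariate response, the two functions can be applied to each output component separately.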

##### 4.3. Findings Related to Simulation Study

Comparing the performance of nonlinear regression models based on multilayer perceptron neural networks using three different quadratic error functions in the learning process reveals that the larger the sample size, the more accurate the model (low values of MAPE and high values of $R^2$). When the errors are centered, reduced, and uncorrelated, the error functions and give the same results, but gives a better result than both (Table 1). For centered, reduced, and correlated errors, the three error functions lead to different results. The best performance, regardless of sample size, is obtained with the quadratic error function , followed by and finally (Table 1).

##### 4.4. Real Data Applications: AIS Dataset

In this subsection, we apply the new quadratic error function and the two others to real-world data. The data refer to 13 biological characteristics of 102 male and 100 female athletes recorded at the Australian Institute of Sport, courtesy of Richard Telford and Ross Cunningham. The data have been presented and examined by [17] and are available in the package “sn” of the *R* software under the name “ais.” The set is used repeatedly in the literature on skewed distributions. In our application, we use only a subset of these characteristics, namely those used by Azzalini [18] for fitting linear models with a skew-elliptical error term, in addition to some other biological characteristics. The bivariate output variable considered is body mass index (BMI) and lean body mass (LBM) with and . The input variables are red cell count (RCC), white cell count (WCC), hematocrit (Hc), hemoglobin (Hg), sum of skin folds (SSF), and body fat percentage (Bfat). These data were fitted to nonlinear regression models based on multilayer perceptron neural networks, as in the simulation study.

##### 4.5. Findings Related to Real Data Applications

Analysis of the performance of the MLP model on real data using the quadratic error functions , , and in a context where the errors are centered and correlated (Table 2) shows that gives the best precision and the lowest bias compared with the other two functions (low value of MAPE and high value of $R^2$).

#### 5. Conclusion

For the prediction of imprecise data, nonlinear regression models are more appropriate. In the context of parametric models with a multivariate response, parameter estimation based on the traditional generalized least squares error function () leads to biases, and the models obtained are less efficient. The results of our simulation study show the value of developing a new quadratic error function to learn these types of data rather than using a traditional error function. Our proposed error function provides more flexibility to take into account the nature of the noise in imprecise data during the learning process. The next step will be to develop a learning algorithm and a package to learn imprecise data for prediction in nonlinear systems.

#### Appendix

#### A. Proof of Theorem 1

Let

be the least squares error function (cost function) in the case of a nonlinear regression model (). Replacing by in equation (A.1), we have

Using the Bayes rules, , in equation (A.2) gives

We can now rewrite (A.2) as

with

Using the distributivity rule with respect to and doing some additional simplifications, equation (A.5) becomes

In equation (A.6), does not depend on , and does not depend on ; we then have

The minimization of with respect to depends only on . We therefore have

#### B. Proof of Theorem 2

Replacing in equation (A.2) gives

And the discrete form of can be written as follows:

Let us consider

Expanding the inner term of B by adding , it becomes

Therefore,

#### C. Proof of Theorem 3

Consider the following expression:

Using the Bayes formula in (4), we have

From this equation, is constant with respect to (see (5)). is a regularization term and corresponds to an additional constraint on the distribution. By considering , and do not depend on . We can therefore remove and in the minimization process of ln . By considering the independent observations, we get

Its decomposition gives

The first term of (C.4) is the probability that comes from when the conditional probability is generated by the nonlinear regression model . In the second term, considering that the hypothesis A3 is satisfied, it can be removed in the minimization process.

Recall that is the stochastic part of () and depends on , with being the deterministic part of . It satisfies Definition 1. Assuming that hypotheses A1 and A2 hold, we have from (2) of Definition 2:

with

Assume an acceptable nonlinear regression model such that ; then, we can simplify the term by replacing with and have

Hence, we can define the empirical risk from (C.7) as

with

#### Data Availability

The Australian Institute of Sport (AIS) data used to support the results of this study have been posted on https://rdrr.io/cran/MoEClust/man/ais.html or https://www.rdocumentation.org/packages/sn/versions/1.5-_4/topics/ais. In addition, these data were made public in the book by Cook and Weisberg [17].

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.