Mathematical Problems in Engineering
Volume 2011, Article ID 570509, 15 pages
http://dx.doi.org/10.1155/2011/570509
Research Article

An Algorithm for Optimally Fitting a Wiener Model

1Intel Corporation, Hillsboro, OR 97124, USA
2Department of Statistics, Iowa State University, Ames, IA 50010, USA
3Department of Chemical & Biological Engineering, Iowa State University, Ames, IA 50010, USA
4BodyMedia Inc., 420 Fort Duquesne Boulevard, Pittsburgh, PA 15222, USA

Received 2 September 2010; Accepted 15 September 2011

Academic Editor: Elbert Macau

Copyright © 2011 Lucas P. Beverlin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The purpose of this work is to present a new methodology for fitting Wiener networks to datasets with a large number of variables. Wiener networks have the ability to model a wide range of data types, and their structures can yield parameters with phenomenological meaning. There are several challenges to fitting such a model: model stiffness, the nonlinear nature of a Wiener network, possible overfitting, and the large number of parameters inherent with large input sets. This work describes a methodology to overcome these challenges by using several iterative algorithms under supervised learning and fitting subsets of the parameters at a time. This methodology is applied to Wiener networks that are used to predict blood glucose concentrations. The predictions of validation sets from models fit to four subjects using this methodology yielded a higher correlation between observed and predicted observations than other algorithms, including the Gauss-Newton and Levenberg-Marquardt algorithms.

1. Introduction

Wiener networks are widely used in modeling complex nonlinear systems. These networks have the ability to model a wide range of data types, such as gas concentrations [1], blood glucose concentrations [2], and pH levels [3], and their structure can yield parameters with phenomenological meaning [2]. In this work, Wiener networks are used to first convert inputs into their corresponding dynamic responses and then to pass these dynamic responses through a second-order linear regression function to obtain the fitted output response. The parameters needed to convert the inputs into dynamic responses are referred to as dynamic parameters and the parameters of the regression function as static parameters. However, estimating these parameters can be quite challenging, as the behavior of these networks can be highly nonlinear in the dynamic parameters, and the number of parameters, which can be large, increases rapidly as inputs are added. Overfitting can also be a major issue. Given that Wiener networks utilize differential equations for the conversion to dynamic responses, stiffness, the situation where the derivative increases or decreases rapidly while the solution to the differential equation does not [4], is another concern. This phenomenon causes an algorithm to take very small steps (i.e., progress slowly) in order to reach an optimal solution. Another issue is that the process being modeled could change over time due to degradation of equipment, an increase in production, or one of many other reasons. Although this change could be very gradual, it implies that the fitted model can eventually degrade. If the process is online, then a new model will need to be found expediently to minimize downtime and to take into account the new conditions and inputs.

The basic purpose of this article is to present a new methodology for fitting Wiener networks to large input datasets which can overcome these challenges. In the process control literature this is called “process identification.” By fitting subsets of the parameters iteratively, we can deal with a large number of parameters and their nonlinearity, as numerical instability in the next iteration is less likely when fitting a subset of the parameters. During optimization, parameter step size is controlled by the value of the objective function, which deals with stiffness. To avoid overfitting, we will utilize what is called “supervised learning” in the statistics literature [5]. In supervised learning, the dataset is broken up into three subsets: training, validation, and test. The model is fit to the training set with the validation set used to determine the number of iterations to use to guard against overfitting. The test set is scrutinized at the end of the optimization process to further evaluate if overfitting has occurred. While our methodology does not perform the optimization as fast as some of the other popular algorithms, utilizing parallel computing has sped up the optimization of a Wiener network using our methodology by roughly 25% in MATLAB [6].

There are other methods for fitting Wiener networks. Due to the Wiener network's nonlinear nature, these are iterative techniques. Note that a nonlinear modeling problem is one with unknown parameters that are functionally nonlinear. Here the objective is to obtain a set of parameters that minimize the sum of squared residuals (i.e., the least squares objective function). Popular techniques for this optimization objective include the Gauss-Newton algorithm and the Levenberg-Marquardt algorithm [7]. Many mathematics/statistics software packages, such as Minitab (Minitab, Inc., State College, PA.), R [8], SAS (SAS Institute Inc., Cary, NC), and MATLAB can implement both of these algorithms. The Levenberg-Marquardt algorithm, which is given in detail in the appendix, is a compromise between the Gauss-Newton algorithm and the method of steepest descent. We have found that fitting all parameters simultaneously using either algorithm can result in overfitting. Even fitting subsets of parameters with just one algorithm has resulted in a model that performs worse than our methodology. Given that the Wiener network used here has a conditionally linear structure since fixing the dynamic parameters yields a linear model, another potential approach is that of Barham and Drane [9]. They fit four different models to argue that alternating between using the usual least squares formula for estimating the linear parameters and a modification of the Gauss-Newton algorithm suggested by Hartley [10] for the nonlinear parameters will generally perform better than either Hartley's modified Gauss-Newton algorithm or the Levenberg-Marquardt algorithm alone. However, using least squares to fit the linear parameters tended to badly overfit the model irrespective of the dynamic parameters. A final method we reviewed in this work is the GRG2 algorithm [11], which is utilized in the Solver program in Microsoft Excel. 
This was used to fit the Wiener networks used to model blood glucose concentrations in [2]. It attempts to fit models using a generalized reduced gradient algorithm, which can handle constraints on parameters and is found to be fairly robust. However, for supervised learning with this implementation in Excel, the algorithm must be paused after each iteration for inspection, which becomes very time consuming if there are more than a few parameters in the model.

The remainder of this paper is organized as follows. First, a detailed description of the Wiener network for multiple inputs and a single output is given to establish the problem context. The details of the methodology are then presented, and an example is given in the fourth section to illustrate the algorithm. Finally, concluding remarks and some ideas for future work are given in the last section.

2. The Wiener Network

In this section, a detailed description of a Wiener network is given to establish the context of the problem. These networks have a powerful structure for modeling nonlinear dynamic systems. A block diagram with 𝑝 inputs and one output is given in Figure 1.

Figure 1: A graphical representation of the Wiener model.

As shown in Figure 1, each input 𝐱𝑖 is first passed through a dynamic linear block, denoted 𝑔(𝐱𝑖), and converted into its corresponding dynamic variable 𝐯𝑖. Following Rollins et al. [2], we will use the following second-order plus-lead plus-dead-time differential equation:
\[
\tau_i(t,X)^2 \, \frac{d^2 v_i(t)}{dt^2} + 2\,\tau_i(t,X)\,\zeta_i(t,X)\,\frac{d v_i(t)}{dt} + v_i(t,X)
= \tau_{a_i}(t,X)\,\frac{d x_i(t-\theta_i)}{dt} + x_i(t-\theta_i), \tag{2.1}
\]
where 𝜏𝑖 is a time constant, 𝜏𝑎𝑖 is a lead parameter, 𝜁𝑖 is a damping coefficient, and 𝜃𝑖 denotes dead time. For simplicity, we will assume that the dynamic parameters are time and space invariant, that is, 𝜏𝑖(𝑡,𝑋)=𝜏𝑖, 𝜏𝑎𝑖(𝑡,𝑋)=𝜏𝑎𝑖, and 𝜁𝑖(𝑡,𝑋)=𝜁𝑖, and for the rest of this section, fix 𝜃𝑖=0. Also, when referring to 𝑣𝑖(𝑡,𝑋), we will write 𝑣𝑖(𝑡) henceforth. We will use 𝝉, 𝝉𝑎, and 𝜻 to denote all time constants, lead parameters, and damping coefficients, respectively.

To find a recursive definition for 𝐯𝑖, a forward difference approximation to (𝑑𝑣𝑖/𝑑𝑡)(𝑡) will be used. First let 𝑗=𝑡/Δ𝑡 and 𝑑𝑖=𝜃𝑖/Δ𝑡 so that 𝑣𝑖(𝑡)=𝑣𝑖𝑗 and 𝑥𝑖(𝑡−𝜃𝑖)=𝑥𝑖,𝑗−𝑑𝑖. Thus,
\[
\frac{d v_{ij}}{dt} \approx \frac{v_i(t) - v_i(t-\Delta t)}{\Delta t} = \frac{v_{ij} - v_{i,j-1}}{\Delta t}, \tag{2.2}
\]
\[
\frac{d^2 v_{ij}}{dt^2} \approx \frac{d v_{ij}/dt - d v_{i,j-1}/dt}{\Delta t} \tag{2.3}
\]
\[
\approx \frac{\left(v_{ij} - v_{i,j-1}\right)/\Delta t - \left(v_{i,j-1} - v_{i,j-2}\right)/\Delta t}{\Delta t} \tag{2.4}
\]
\[
= \frac{v_{ij} - 2 v_{i,j-1} + v_{i,j-2}}{\Delta t^2}. \tag{2.5}
\]
By substituting (2.2) and (2.5) into (2.1),
\[
v_{ij} = \frac{2\tau^2 + 2\tau\zeta\Delta t}{\tau^2 + 2\tau\zeta\Delta t + \Delta t^2}\, v_{i,j-1}
- \frac{\tau^2}{\tau^2 + 2\tau\zeta\Delta t + \Delta t^2}\, v_{i,j-2}
+ \frac{\tau_a \Delta t + \Delta t^2}{\tau^2 + 2\tau\zeta\Delta t + \Delta t^2}\, x_{i,j-d_i}
- \frac{\tau_a \Delta t}{\tau^2 + 2\tau\zeta\Delta t + \Delta t^2}\, x_{i,j-d_i-1}. \tag{2.6}
\]
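The recursion (2.6) is straightforward to implement. Below is a minimal Python sketch, assuming zero initial conditions and a zero input history before the first sample; the function name and interface are illustrative rather than taken from the paper:

```python
import numpy as np

def dynamic_block(x, tau, zeta, tau_a, d, dt):
    """Approximate the dynamic response v of one input series x via the
    forward-difference recursion (2.6). d is the dead time in samples."""
    D = tau**2 + 2*tau*zeta*dt + dt**2            # common denominator
    n = len(x)
    v = np.zeros(n)                               # zero initial conditions
    for j in range(2, n):
        xd  = x[j - d]     if j - d     >= 0 else 0.0   # x_{i,j-d_i}
        xd1 = x[j - d - 1] if j - d - 1 >= 0 else 0.0   # x_{i,j-d_i-1}
        v[j] = ((2*tau**2 + 2*tau*zeta*dt) * v[j-1]
                - tau**2 * v[j-2]
                + (tau_a*dt + dt**2) * xd
                - tau_a*dt * xd1) / D
    return v
```

A quick sanity check on this discretization: for a unit step input the recursion settles to 1, consistent with the unit steady-state gain of (2.1).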

Next, all of these dynamic variables are passed through a static nonlinear block, denoted 𝑓(𝐯) in Figure 1, in order to obtain the predicted value of the response variable at time 𝑡. Following Rollins et al. [2], we use a second-order regression function with linear terms, quadratic terms, and second-order interaction terms, giving
\[
y_j = a_0(t,X) + \sum_{i=1}^{p} a_i(t,X)\, v_{ij} + \sum_{i=1}^{p} b_i(t,X)\, v_{ij}^2
+ \sum_{i=1}^{p-1} \sum_{k=i+1}^{p} c_{ik}(t,X)\, v_{ij} v_{kj} + \epsilon_j, \tag{2.7}
\]
where 𝜖𝑗 is a normally distributed error term with mean 0 and variance 𝜎², and for any 𝑘≠𝑗, 𝜖𝑗 and 𝜖𝑘 are independent. Again assume that the parameters are invariant with respect to time and space, for example, 𝑎0(𝑡,𝑋)=𝑎0. The vector of parameters corresponding to the linear terms will be denoted by 𝐚, the quadratic terms by 𝐛, and the interaction terms by 𝐜.
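The static block (2.7) can likewise be sketched in a few lines of Python. Here V holds the dynamic variables for n time points and p inputs, and the parameter names mirror (2.7); the interface itself is hypothetical:

```python
import numpy as np

def static_block(V, a0, a, b, C):
    """Second-order regression surface (2.7): V is an (n, p) array of dynamic
    variables, a (p,) holds linear terms, b (p,) quadratic terms, and C (p, p)
    upper-triangular interaction terms c_{ik}."""
    n, p = V.shape
    y = a0 + V @ a + (V**2) @ b          # intercept, linear, quadratic terms
    for i in range(p - 1):               # second-order interactions v_i * v_k
        for k in range(i + 1, p):
            y += C[i, k] * V[:, i] * V[:, k]
    return y
```

For example, with p = 2, V = [[1, 2]], a0 = 1, a = (1, 1), b = (1, 0), and c_{12} = 2, the prediction is 1 + 3 + 1 + 4 = 9.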

3. The Proposed Parameter Estimation Algorithm

In this section we describe the proposed algorithm for solving the nonlinear regression problem given in the previous section. Following Rollins et al. [2], the objective of this modeling problem is to maximize the true but unknown correlation coefficient between the measured and fitted glucose concentrations, defined by 𝜌𝑦,̂𝑦 and estimated by 𝑟t. More specifically, under this objective a model is declared useful if and only if
\[
\rho_{y,\hat{y}} > 0. \tag{3.1}
\]
The meaning of this criterion is that predictions of blood glucose concentrations from the model decrease and increase with measured blood glucose concentrations beyond some degree of mere chance; that is, there is true positive correlation. Notwithstanding, the closer this value is to the upper limit of 1, the more useful the model. Therefore, to achieve this objective, one seeks to identify a model with a sufficiently large value of 𝑟t. To this end, the data are separated into a set for training and a set for validation and/or testing. The training set is used to build the model, and the validation (or testing) set is used to evaluate the model against data that were not directly used by the optimization process to estimate the model parameters. However, due to the highly complex mapping of the parameters into the response space of 𝑟t, the following indirect criterion is used:
\[
\text{Maximize } r_{\mathrm{t}} \text{ by minimizing } \mathrm{SSE}(\boldsymbol{\Theta}) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\quad \text{subject to } \zeta_i > 0,\; \tau_i > 0,\; \theta_i \ge 0 \;\; \forall i, \tag{3.2}
\]
where 𝚯 is a vector representing the estimated dynamic and static parameters 𝝉, 𝜻, 𝝉𝑎, 𝜽, 𝐚, 𝐛, 𝐜 and 𝑛 is the number of observations in the training set. The objective criterion is used under the assumption that minimizing SSE is equivalent to maximizing 𝑟t. While there is no formal proof for this assumption, experimental evidence supports a strong tendency for this relationship [2].

A model that is nonlinear in its parameters, such as the proposed structure, does not satisfy the condition that the sum of the residuals equals 0, as in the case of linear regression. However, under (3.2), the sum of the residuals in the training set should be small, and thus a secondary criterion on the closeness of 𝑦𝑖 and ̂𝑦𝑖 can be used to measure accuracy. This measure, denoted the average absolute error (AAE), is defined as
\[
\mathrm{AAE} = \frac{\sum_{j=i_{\mathrm{init}}}^{i_{\mathrm{n}}} \left| y_j - \hat{y}_j \right|}{i_{\mathrm{n}} - i_{\mathrm{init}}}, \tag{3.3}
\]
where 𝑖init is the initial observation used for calculation of this statistic and 𝑖n is the final observation used. Hence accuracy is judged to increase as AAE decreases.
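As a quick illustration, (3.3) can be computed directly; the zero-based indexing is incidental to the sketch, and the i_n − i_init divisor follows the text:

```python
def aae(y, yhat, i_init, i_n):
    """Average absolute error (3.3) over observations i_init..i_n inclusive,
    using the i_n - i_init divisor as stated in the text."""
    total = sum(abs(y[j] - yhat[j]) for j in range(i_init, i_n + 1))
    return total / (i_n - i_init)
```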

Thus, in addition to sufficiently large 𝑟t values for both the training and test/validation datasets, an acceptable model must also have a relatively small value of AAE in training. This secondary criterion is not imposed in testing/validation because (3.2) forces small residuals for training data only. Furthermore, if a model is capable of a high 𝑟t as demonstrated in training then high accuracy can be obtained with effective feedback correction or feedback control to reduce or eliminate bias.

To obtain a useful model, the proposed method fits a subset of the parameters using an iterative approach, and the validation set is used to terminate optimization to guard against overfitting. As mentioned before, this was done because attempts to fit with respect to every parameter at once tend to overfit the training set. To determine which iteration is best, each iteration was given a score based on a linear combination of two statistics whose maximum is 1: 𝑟t and the correlation 𝑟Val between observed and predicted response values in the validation set. To establish notation, one can write the correlation between datasets {𝑥𝑖}𝐾𝑖=1 and {𝑡𝑖}𝐾𝑖=1 as
\[
r_{\mathrm{Val}} = \frac{\sum_{i=1}^{K} \left(x_i - \bar{x}\right)\left(t_i - \bar{t}\right)}
{\sqrt{\sum_{i=1}^{K} \left(x_i - \bar{x}\right)^2 \sum_{i=1}^{K} \left(t_i - \bar{t}\right)^2}},
\qquad \bar{x} = \frac{1}{K} \sum_{i=1}^{K} x_i, \quad \bar{t} = \frac{1}{K} \sum_{i=1}^{K} t_i, \tag{3.4}
\]
where 𝐾 is the total number of observations in the dataset. This statistic was used in order to ensure that the predictions in the validation set properly track changes found in the actual data.
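The sample correlation (3.4) can be sketched as follows; NumPy is used only for convenience:

```python
import numpy as np

def pearson(x, t):
    """Sample correlation (3.4) between two equal-length datasets."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    xc, tc = x - x.mean(), t - t.mean()          # center both series
    return (xc @ tc) / np.sqrt((xc @ xc) * (tc @ tc))
```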

After setting the starting values for the parameters, we then execute Algorithm 1 of our methodology. Note that we divide 𝑟t by .9 and 𝑟Val by .8 before applying the coefficients. This was done because in practice these values appear to be the maximum attainable values for 𝑟t and 𝑟Val in the example in the following section. These values can be changed if one has a rough guess as to what the maximum attainable values are. The score is then calculated as .1𝑟t/.9 + .9𝑟Val/.8. Starting with 𝑎1, we will successively add 0.2 until the score decreases. We then subtract 0.2 in order to retain the maximum score. This is done for each 𝑎𝑖. If choosing 𝑎𝑖 = 0.2 yields a lower score than setting 𝑎𝑖 = 0, then we successively subtract 0.2 from 𝑎𝑖 until the score decreases and re-add 0.2 once a decrease is observed.
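The score and the coordinate-wise grid search described above can be sketched as follows. The score function interface is hypothetical, and the back-off logic reflects our reading of Algorithm 1 (step up in 0.2 increments while the score improves, retract the last step, and search downward if the first upward step hurt):

```python
def score(r_t, r_val):
    """Initial iteration score with assumed attainable maxima .9 and .8."""
    return 0.1 * r_t / 0.9 + 0.9 * r_val / 0.8

def coordinate_search(a, i, score_fn, step=0.2):
    """One pass of the Algorithm 1 grid search on a single coefficient a[i].
    score_fn is a hypothetical callable taking the full coefficient vector."""
    best = score_fn(a)
    for direction in (step, -step):
        improved = False
        while True:
            a[i] += direction
            s = score_fn(a)
            if s > best:
                best = s
                improved = True
            else:
                a[i] -= direction      # undo the move that lowered the score
                break
        if improved:                   # upward search worked: skip downward
            break
    return a, best
```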

Algorithm 1: Starting algorithm.

We then attempt to fit a model using all parameters with the Levenberg-Marquardt and Gauss-Newton algorithms [7]. The Levenberg-Marquardt algorithm is given in the appendix, as the Gauss-Newton algorithm used here is simply the Levenberg-Marquardt algorithm with 𝛼 = 0. The Levenberg-Marquardt algorithm is run nine times, each time with the same starting parameters. The only difference between these trials is the value of the damping parameter 𝛼 used, which was 100, 10, 1, .01, .0001, 1e-6, 1e-8, 1e-10, or 0. Note that 𝛼 affects the step size of each iteration, and each model responds differently to each value of 𝛼 chosen. For each iteration of each trial, a score is calculated as before, and the parameter estimates corresponding to the iteration with the highest score that satisfies the parameter constraints among these trials are returned at the end of this loop.
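A single Levenberg-Marquardt step with a fixed damping parameter can be sketched as below; with alpha = 0 it reduces to a Gauss-Newton step. This is only a sketch of one step under a hypothetical residual/Jacobian interface, not the paper's full nine-trial loop:

```python
import numpy as np

def lm_step(theta, residual_fn, jac_fn, alpha):
    """One Levenberg-Marquardt step: solve (J'J + alpha*I) h = -J'r and
    take theta + h. alpha = 0 gives the Gauss-Newton step."""
    r = residual_fn(theta)                      # residual vector
    J = jac_fn(theta)                           # Jacobian of the residuals
    A = J.T @ J + alpha * np.eye(len(theta))    # damped normal equations
    return theta - np.linalg.solve(A, J.T @ r)

# The nine damping values tried in the text:
DAMPING_TRIALS = [100, 10, 1, 1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 0]
```

For a linear least-squares problem, the alpha = 0 (Gauss-Newton) step reaches the solution in one iteration.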

The next stage of this methodology takes the parameter estimates from the previous stage and proceeds to fit subsets of these parameters at a time. Since using one algorithm exclusively has generally yielded weaker results, we chose to use three different algorithms and compare their results. The first algorithm attempted is the BFGS algorithm [7]. This is a quasi-Newton method in that it approximates the Hessian used in Newton's method with a rank-2 update that depends on the Jacobian. Note that we avoid the difficulty of inverting a matrix by using the Sherman-Morrison-Woodbury formula [12]. We also used a trust-region version of this algorithm, in that we fixed its maximum step size. However, a step from this algorithm is not guaranteed to decrease the objective function. Hence the second algorithm we use is the conjugate gradient algorithm [13]. This algorithm requires very little storage and is based on the idea of conjugate directions, although these directions lose conjugacy if the model is not well approximated by a quadratic. Moreover, the conjugate gradient algorithm tends to take a large step followed by several small steps, and these large steps tend to overfit when fitting Wiener networks. In addition, the derivatives of the objective function with respect to the dynamic parameters must be approximated and can at times be unreliable. Thus the final algorithm we use is the Nelder-Mead algorithm [12]. This algorithm first generates 𝑛+1 points equidistant from one another and from the starting values, called a simplex. It then uses function evaluations to move the simplex toward a minimum as well as to expand or shrink it. Thus it uses more function evaluations in place of derivatives. Details on these algorithms are given in the appendix.
Note that to run the Nelder-Mead algorithm, the built-in function fminsearch in MATLAB was used. The number of iterations run for the Nelder-Mead algorithm per iteration of this stage and the trust-region radius of the BFGS algorithm were chosen based on the value of 𝑟t at the start of each respective algorithm.
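The idea of fitting only a subset of the parameters while holding the rest fixed can be sketched generically. Here the optimizer argument stands in for any of the three algorithms above, and the whole interface is hypothetical:

```python
def fit_subset(theta, idx, sse_fn, optimizer):
    """Optimize only the parameters at positions idx, holding the rest fixed.
    optimizer takes (objective, x0) and returns the minimizing sub-vector."""
    def sub_objective(sub):
        trial = list(theta)                 # full vector with the rest fixed
        for j, v in zip(idx, sub):
            trial[j] = v
        return sse_fn(trial)
    best_sub = optimizer(sub_objective, [theta[j] for j in idx])
    for j, v in zip(idx, best_sub):         # write the winners back
        theta[j] = v
    return theta
```

Cycling such calls over different index sets is what lets a poorly conditioned full problem be attacked a well-conditioned piece at a time.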

As each iteration of each algorithm is determined from a subset of the parameters, once again a score is calculated, and the parameter estimates with the highest score, where 𝑟t has increased over the value found from the starting parameters and each constraint is satisfied, are chosen as the new parameter estimates. There are four methods of choosing the subsets in our methodology. For the first method, we fit 𝐚 first, then 𝐛, 𝐜, 𝝉, 𝜻, and finally 𝝉𝑎. This cycle is repeated, up to three times, as long as the score of the parameter estimates improves by at least 0.0001 over the score of the estimates used as starting values to fit 𝐚. For the second method, the first subset of parameters to be fit is the set of static parameters that depend on the first input, that is, {𝑎1, 𝑏1, 𝑐12, 𝑐13, …, 𝑐1,11}. The second subset is the set of static parameters that depend on the second input, that is, {𝑎2, 𝑏2, 𝑐12, 𝑐23, 𝑐24, …, 𝑐2,11}. Since we have eleven inputs, there will be eleven such subsets. The final three subsets of parameters to be fit are 𝝉, 𝜻, and 𝝉𝑎. Again this is repeated up to three times if a score increase of at least 0.0001 is observed each time. For the third method, we fit the same subsets of static parameters as in the second method but different subsets of the dynamic parameters. We first fit three subsets of 𝝉 in this order: {𝜏1, 𝜏2, 𝜏3}, {𝜏4, 𝜏5, 𝜏6, 𝜏7}, {𝜏8, 𝜏9, 𝜏10, 𝜏11}. Then the corresponding subsets of 𝜻 are fit in the same order, followed by these subsets of 𝝉𝑎. For the last method, the first subsets of parameters fit are {𝑎1, 𝑏1}, {𝑎2, 𝑏2}, …, {𝑎11, 𝑏11}. Next we fit subsets of parameters corresponding to the interaction terms: first the interaction parameters corresponding to the first input, {𝑐12, 𝑐13, …, 𝑐1,11}, then the second input, {𝑐12, 𝑐23, 𝑐24, …, 𝑐2,11}, and so on. After these parameters we fit the same subsets of dynamic parameters as in the third method.
Assuming the value of 𝑟t has increased by 0.002 since the start of the first stage, we will update the coefficients for calculating the scores and the algorithm will return to the first stage. If not, then we attempt to fit a model one parameter at a time. If after cycling through every parameter three times the value of 𝑟t has not increased from the first stage by 0.001, then the algorithm exits. Otherwise the coefficients are updated, and the algorithm returns to the first stage. After two iterations through each of the four methods, each stage may only be repeated twice instead of three times.
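As an illustration, the input-wise subsets of the second cycling method above can be enumerated programmatically; the parameter-name strings here are purely illustrative:

```python
def method_two_subsets(p=11):
    """Subsets for the second cycling method with p inputs: for each input i,
    the static parameters {a_i, b_i, c_ik for all k != i}, followed by the
    tau, zeta, and tau_a blocks. Interaction names use c_{min,max} so that
    c_{ik} appears in the subsets of both input i and input k."""
    subsets = []
    for i in range(1, p + 1):
        names = [f"a{i}", f"b{i}"]
        names += [f"c{min(i, k)},{max(i, k)}" for k in range(1, p + 1) if k != i]
        subsets.append(names)
    subsets += [["tau"], ["zeta"], ["tau_a"]]   # final three dynamic blocks
    return subsets
```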

Finally we discuss how to update the coefficients of the score. For the first iteration of this methodology, the score is calculated as before: .1𝑟t/.9 + .9𝑟Val/.8. To update the coefficients, first let 𝑤 = 𝑟t + 4𝑟Val. Then they are updated to be 𝑟t/𝑤 and 4𝑟Val/𝑤, respectively. This is done to force the correlation between the predicted and observed values in the validation set to be weighted heavily. The weighting can be altered depending on how important it is to the researcher to achieve a high 𝑟Val. The weight of 𝑟t is large enough that a large increase in it can be chosen if a small enough decrease in 𝑟Val is observed.
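The coefficient update can be sketched directly from the formulas above:

```python
def update_weights(r_t, r_val):
    """Re-weight the score so r_Val counts four times as much as r_t;
    the returned coefficients sum to 1."""
    w = r_t + 4 * r_val
    return r_t / w, 4 * r_val / w
```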

4. An Example: Blood Glucose Concentration Prediction of Type 2 Diabetics

We now illustrate our methodology and compare it to the other algorithms mentioned in this paper. Several subjects who exhibit significant variation in their blood glucose concentrations participated in a study to determine whether their blood glucose concentrations can be adequately predicted from a Wiener network using activity variables, food consumption, and time of day. Since type 2 diabetes affects each subject differently, a model was built for each individual. Due to time constraints to meet the submission deadline, four of the subjects are evaluated in this work.

In order to obtain blood glucose concentrations, the Medtronic MiniMed CGMS System Gold continuous glucose monitor (Medtronic MiniMed, Northridge, Calif) was used. The SenseWear Pro3 Body Monitoring System (BodyMedia, Inc., Pittsburgh, PA) was used to measure the activity variables used in building this model. From these devices, measurements of activity and blood glucose concentrations were obtained every five minutes. Subjects were also asked to record the food that they ingested during this time onto a PDA, which used the Weightmania Pro software (Edward A. Greenwood, Inc., Cambridge, Mass). Other than the necessary downtime to download the data from these devices, data were collected twenty-four hours a day for four weeks. While the SenseWear Pro3 Body Monitoring System can measure over 30 activity variables, it was decided after much trial and error to use only 7 of these variables in the Wiener network. Of the other four input variables, three of them, carbohydrates, fat, and protein, are food variables that measure the amount of each consumed in grams every five minutes. The final one, time of day, allows one to capture the circadian rhythm of each individual's body [14]. It assumes values from 0, denoting midnight, to 1439, denoting 11:59 pm. A table of all inputs is given in Table 1.

Table 1: A table of inputs used in this type 2 diabetes study.

Due to the amount of data available for each subject, the first week of a subject's data was used to fit an individual model for that subject, and the subsequent two weeks served as a validation set. The starting values were set to 0 for each 𝑎𝑖, 𝑏𝑖, and 𝑐𝑖𝑘 other than 𝑎0, which was set to the mean training response ̄𝑦Tr. The dynamic parameters were set to estimates found from fitting a pilot model. We have fit models using six different methods. The first method was the proposed methodology (PM). The second method utilized the Gauss-Newton (GN) algorithm, and a third used the GN algorithm in a manner similar to PM; more specifically, the parameters were fit using the GN algorithm with the same subsets as in PM. This particular method is called the modified GN algorithm. The fourth method utilized the Levenberg-Marquardt (LM) algorithm, and the fifth method was a modified LM algorithm, where the modifications were made in the same manner as in the modified GN algorithm. Finally, models were fit using the Excel Solver (ES) routine. Other than the ES routine, all models were fit using MATLAB on a computer with a 2.66 GHz Intel Core 2 Quad processor and 4 GB of RAM. The comparative results are given in Table 2. For each subject, the correlation between predicted and observed blood glucose concentrations in the validation set is at least 0.48. Here we desire a high 𝑟Val because we wish to track the actual blood glucose concentration closely: we do not want to miss sudden changes in blood glucose concentration, particularly drops to very low levels (<40 mg/dL), due to the health consequences. This is why we chose the coefficients for the score as stated earlier.

Table 2: Training and validation statistics for Wiener networks fit to model blood glucose concentrations for four diabetic subjects. Note that PM is the proposed methodology, GN is the modified Gauss-Newton algorithm, LM is the modified Levenberg-Marquardt algorithm, and ES is the Excel Solver methodology. Note that ES was fit manually.

We have split the results into two tables. Table 2 compares the methods that fit subsets of the parameters, and Table 3 compares the proposed methodology to the generic GN and LM algorithms. Looking at Table 2, we see that the modified GN algorithm had difficulty with subject 1, as 𝑟t for this subject was 0.40. This indicates that 𝐽ᵀ𝐽 is ill conditioned for this subject at the starting values, and since this could happen for other subjects, the Gauss-Newton algorithm alone would not be a good choice for fitting a Wiener network to these subjects. The LM algorithm does not typically have such a difficulty due to its damping parameter 𝛼, and the modified LM algorithm generally outperformed the modified GN algorithm. However, it should be noted that 𝛼 was set depending on the value of 𝑟t: it was set to 100 if 𝑟t < 0.3, 1 if 0.3 ≤ 𝑟t < 0.5, 1e-3 if 0.5 ≤ 𝑟t < 0.7, and 1e-6 if 𝑟t ≥ 0.7. This was done in order to alleviate ill conditioning and to allow for more aggressive steps once ill conditioning is no longer a problem. We see that the modified LM and GN algorithms fit much faster than the proposed method, but this is not a major problem since we are fitting these models offline. As for the GRG2 algorithm in Excel, it appears to be very competitive with the proposed method for finding a model with a large 𝑟t value, but the proposed method outperforms it on every subject's validation set.

Table 3: Training and validation statistics for Wiener networks fit to model blood glucose concentrations for four diabetic subjects. Note that PM is the proposed methodology, GN is the Gauss-Newton algorithm, and LM is the Levenberg-Marquardt algorithm.

As for Table 3, we compare the proposed methodology to fitting models under supervised learning with all parameters simultaneously using either the GN algorithm or the LM algorithm. Of course these other algorithms will fit much faster as there is only one set of parameters to be fit. However, we see that 𝑟t and 𝑟Val are greater for each subject when fitting a model with the proposed methodology than either such algorithm.

5. Concluding Remarks

This paper presents a methodology that appears to find better parameter estimates for Wiener networks than the other algorithms considered. This methodology uses a score based on two statistics, 𝑟t and 𝑟Val, in order to avoid overfitting. It also uses a grid search and the Levenberg-Marquardt algorithm in order to improve on the starting parameters. Subsets of the parameters are then fit using the Nelder-Mead, BFGS, and conjugate gradient algorithms in order to overcome issues such as stiffness, poorly approximated derivatives, and nonlinearity. However, we believe there are a few things that can be done to enhance the algorithm.

First, instead of solving the differential equation in order to calculate the dynamic variables used in the example, they were approximated. If possible, one should obtain exact values for the dynamic variables, as this will reduce error in the model. In the example presented, this is not possible, since there is no closed-form solution for the differential equation used in the dynamic blocks due to the 𝑑𝑥𝑖(𝑡−𝜃𝑖)/𝑑𝑡 term.

Secondly, the parameter 𝜃𝑖 was fixed for each input. This was done because 𝜃𝑖 must be a multiple of 5 due to the fact that measurements were taken every five minutes. Since the objective function used here would not be continuous with respect to 𝜃𝑖, one would be unable to calculate derivatives. Revising Algorithm 1 such that 5 is added or subtracted from ̂𝜃𝑖 would be one possible method of overcoming this. This could be done at the end of each stage of the algorithm as this would be done one ̂𝜃𝑖 at a time.
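Such a ±5-minute refinement of a single dead time could be sketched as follows; score_at is a hypothetical callable returning the score with the candidate ̂𝜃𝑖 substituted into the model:

```python
def refine_dead_time(theta_i, score_at, step=5):
    """Try moving the dead time theta_i by +/- one sampling interval (5 min)
    and keep the first candidate that improves the score; theta_i stays
    nonnegative. score_at is a hypothetical scoring callable."""
    base = score_at(theta_i)
    for cand in (theta_i + step, max(theta_i - step, 0)):
        if score_at(cand) > base:
            return cand
    return theta_i
```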

Another possible improvement is the elimination of constraints in the parameter space. Here the parameters 𝜏𝑖 and 𝜁𝑖 must be larger than 0. One way to deal with this issue is reparameterization, which is suggested in [15]. By setting 𝜏𝑖=𝑒𝜆𝑖 and 𝜁𝑖=𝑒𝛾𝑖, one can optimize with respect to 𝜆𝑖 and 𝛾𝑖 instead of 𝜏𝑖 and 𝜁𝑖. This would eliminate the need to check whether 𝜏𝑖>0 or 𝜁𝑖>0 for any iteration of our methodology. The only concern is that approximating the Jacobian and the gradient will become even more difficult if this reparameterization is performed.
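The suggested reparameterization is a simple log/exp transform, sketched here:

```python
import math

def to_unconstrained(tau, zeta):
    """Map tau, zeta > 0 to unconstrained lambda, gamma via tau = e**lambda,
    zeta = e**gamma, so the optimizer can move freely over the reals."""
    return math.log(tau), math.log(zeta)

def to_constrained(lam, gam):
    """Invert the map; any real lambda, gamma yield valid tau, zeta > 0."""
    return math.exp(lam), math.exp(gam)
```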

One last thing to discuss is the importance of starting parameters. While finding starting dynamic parameter values that would yield useful models regardless of the data would make implementation easier, there may be properties of the experiment worth exploiting. It is common knowledge that carbohydrates are digested and metabolized faster than fats. Thus the amount of time that a step change in carbohydrates affects the system is less than that of fat. This “residence time” can be calculated for input 𝑖 to be 2𝜏𝑖𝜁𝑖. For starting parameters, one idea could be to set the starting values for 𝜁1 and 𝜁2 to be equal and choose 𝜏1 to be less than 𝜏2.

Appendix

For these algorithms, let 𝜃 denote the vector of all parameters, ̃𝜃 denote the subset of parameters to be fit, and 𝑓(𝐱|𝜃) denote the model of interest; see Algorithms 2, 3, 4, and 5. Also note that checks for convergence have been left out. First we give a short legend of the notation used in this appendix:
̂𝜃: current estimate of the parameters,
𝐟: [𝑓(𝐱1|̂𝜃) 𝑓(𝐱2|̂𝜃) ⋯ 𝑓(𝐱𝑛|̂𝜃)]ᵀ,
𝐹: 𝐹(𝐱|̂𝜃),
𝐹(𝐱|̂𝜃): Σ𝑖 (𝑦𝑖 − 𝑓(𝐱𝑖|̂𝜃))²,
𝐽: Jacobian of 𝐹(𝐱|̂𝜃).

Algorithm 2: Levenberg-Marquardt algorithm used.

Algorithm 3: Conjugate Gradient algorithm used.

Algorithm 4: Nelder-Mead algorithm used. (Taken from [6, 12]).

Algorithm 5: BFGS algorithm used.

Acknowledgments

The authors would like to thank Jeanne Stewart and Kaylee Kotz for their help with data collection and BodyMedia, Inc. for their funding and donation of equipment.

References

  1. A. H. Tan and K. Godfrey, “Modeling of direction-dependent processes using Wiener models and neural networks with nonlinear output error structure,” IEEE Transactions on Instrumentation and Measurement, vol. 53, no. 3, pp. 744–753, 2004.
  2. D. K. Rollins, N. Bhandari, J. Kleinedler et al., “Free-living inferential modeling of blood glucose level using only noninvasive inputs,” Journal of Process Control, vol. 20, no. 1, pp. 95–107, 2010.
  3. G. Pajunen, “Adaptive control of Wiener type nonlinear systems,” Automatica, vol. 28, no. 4, pp. 781–785, 1992.
  4. J. D. Faires and R. Burden, Numerical Methods, Brooks/Cole, Pacific Grove, Calif, USA, 2nd edition, 1998.
  5. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer, New York, NY, USA, 2nd edition, 2009.
  6. MATLAB Version 7.10.0, The MathWorks, Natick, Mass, USA, 2010.
  7. K. Madsen, H. B. Nielsen, and O. Tingleff, Methods for Non-Linear Least Squares Problems, Informatics and Mathematical Modelling, Technical University of Denmark, Kongens Lyngby, Denmark, 2nd edition, 2004.
  8. R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010.
  9. R. H. Barham and W. Drane, “An algorithm for least squares estimation of nonlinear parameters when some of the parameters are linear,” Technometrics, vol. 14, pp. 757–766, 1972.
  10. H. O. Hartley, “The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares,” Technometrics, vol. 3, pp. 269–280, 1961.
  11. L. S. Lasdon and A. D. Waren, “Generalized reduced gradient software for linearly and nonlinearly constrained problems,” in Design and Implementation of Optimization Software, H. J. Greenberg, Ed., pp. 335–362, Sijthoff and Noordhoff, Groningen, The Netherlands, 1978.
  12. J. Nocedal and S. J. Wright, Numerical Optimization, Springer Series in Operations Research, Springer, New York, NY, USA, 1999.
  13. J. R. Shewchuk, “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, Edition 1 1/4,” 1994, http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf.
  14. E. Van Cauter, K. S. Polonsky, and A. J. Scheen, “Roles of circadian rhythmicity and sleep in human glucose regulation,” Endocrine Reviews, vol. 18, no. 5, pp. 716–738, 1997.
  15. D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its Applications, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, NY, USA, 1988.