#### Abstract

Reinforcement learning (RL) agents can learn to control a nonlinear system without using a model of the system. However, having a model brings benefits, mainly in terms of a reduced number of unsuccessful trials before achieving acceptable control performance. Several modelling approaches have been used in the RL domain, such as neural networks, local linear regression, or Gaussian processes. In this article, we focus on techniques that have not been used much so far: symbolic regression (SR), based on genetic programming and local modelling. Using measured data, symbolic regression yields a nonlinear, continuous-time analytic model. We benchmark two state-of-the-art methods, SNGP (single-node genetic programming) and MGGP (multigene genetic programming), against a standard incremental local regression method called RFWR (receptive field weighted regression). We have introduced modifications to the RFWR algorithm to better suit the low-dimensional continuous-time systems we are mostly dealing with. The benchmark is a nonlinear, dynamic magnetic manipulation system. The results show that using the RL framework and a suitable approximation method, it is possible to design a stable controller of such a complex system without the necessity of any haphazard learning. While all of the approximation methods were successful, MGGP achieved the best results at the cost of higher computational complexity. Index Terms–AI-based methods, local linear regression, nonlinear systems, magnetic manipulation, model learning for control, optimal control, reinforcement learning, symbolic regression.

#### 1. Introduction

A reinforcement learning (RL) agent interacts with the system to be controlled by measuring its states and applying actions according to a policy so that a given goal state is attained. The policy is iteratively adapted in such a way that the agent receives the highest possible cumulative reward, which is a scalar value accumulated over trajectories in the system’s state space. The reward associated with each transition in the state space is described by a predefined value function.

Existing RL algorithms can be divided into critic-only, actor-only, and actor-critic variants. The critic-only variants optimize the value function (-function) that is then used to derive the policy; the actor-only variants work directly on the policy optimization without any need for a value function; and actor-critic variants optimize both functions simultaneously. An example of the actor-only RL variant, often called Q-learning, can be found in [1] and that of the actor-critic variant in [2].

From a different point of view, RL algorithms can be also divided into model-based and model-free variants. Examples of both approaches can be found in [3, 4]. The model-based variants include a model representation of the system to be controlled and can be pretrained in simulation (offline) and then updated when controlling the actual system (online). Model-free methods learn online exclusively through trial and error. Both variants have their specific advantages and disadvantages. We can often find remarks about the model-free approach requiring much more data, especially in high-dimensional cases [1]. In this paper, we employ the model-based, critic-only variant without any online training so that we can compare different modelling approaches.

We focus on two promising categories of approximation algorithms: genetic programming and local linear regression. Our aim is to contribute to the methodology of choosing the optimal out of dozens of existing modelling algorithms when presented with a specific RL task. This problem arises not only in connection with the RL framework (see [5]) but in modelling of a dynamical system in general [6, 7].

Genetic algorithms (GA) and their many variations are well established as a tool for modelling or parameter estimation of dynamical systems [8, 9]. However, genetic programming as a modelling approach used within RL is relatively new and promises good results with high-dimensional systems where other approaches fail. It creates a continuous-time, globally nonlinear model described by an analytical equation built of combinations of predefined functions [10]. As it is common with genetic optimization algorithms, these methods tend to be computationally demanding. On the other hand, local regression is a well-established modelling approach for model-based RF agents where the model is composed of local linear models, offering fast and computationally cheap approximation. There are several variants of local modelling methods; comprehensive examples of grid-based local linear model structure and data-based local linear regression (LLR) are described in [11, 12], respectively. Even though the use of local regression techniques within RL has been researched in the past, it was mainly based on simple, memory-based approximation methods such as the LLR, which is thoroughly described and examined in [13, 14], and more complex incremental methods such as the receptive field weighted regression (RFWR) [15, 16] or locally weighted projection regression (LWPR) [17] were omitted, with the exception of [18], where the RFWR algorithm was used as a critic approximator. The RFWR and LWPR methods provide significant benefits in lower memory use and higher stability by employing optimization-based (RFWR) or statistical (LWPR) methods to discover the optimal distribution of the local models’ areas of validity, that is, the receptive fields.

It is important to benchmark the modelling methods because of the large number of existing approaches, which aim at similar tasks, while there are no simple guidelines on the method choice. Also, the presented algorithms are not yet well established within the RL domain. Finally, studying control algorithms for magnetic manipulation systems has importance on its own because of its application in many industrial fields (medical applications, magnetic levitation systems, etc.), thus leading to the two separate aims of this paper: exploring control algorithms suitable for control of precise magnetic manipulator systems and benchmarking different modelling approaches.

When dealing with real magnetic manipulator systems, we also need to address practical issues that are often neglected in simulations, that is, nonlinearities such as actuator dead zones, saturations, Coulomb friction, signal delays, and so on. These present significant obstacles then implementing the control algorithm on a real system. In some cases, the dead zone and saturation problem can be addressed by nonlinear or adaptive control laws. For example, [19] shows an approach using fuzzy control with Gaussian membership functions, which is in practice similar to the RFWR method, and [20] describes a gain-scheduling adaptive approach to deal with internal system bounds. Using RL to find a control law for a nonlinear system also has the advantage that it can often deal with such disturbances on its own through the optimization process; for example, only a limited range of the actor outputs may be limited, which is the approach utilized in this paper.

In this paper, we also present minor adjustments to the RFWR algorithm in Section 3, proposed to lower the computational complexity while preserving stability when working with low-dimensional problems.

#### 2. Methods

##### 2.1. Magnetic Manipulator

Genetic programming was already applied to nonlinear systems like an inverted pendulum or a collaborative robot [2, 10, 21]. To further investigate the approximation capabilities of these methods, we use a different system–a magnetic manipulator (Magman). This system consists of four coils that are independently operated by separate current controllers and a steel ball that can move freely over the coils; see Figure 1. To ensure that the ball moves only in the measured direction with limits on the edges, it is placed in a groove with 10 mm in size. In this case, we decided to limit the system to the first two coils only, as a system with four inputs is much more complex in terms of the RL computational complexity, while it does not enrich the system with different nonlinearities as it only spatially repeats the same kind of nonlinear behaviour.

The steel ball can be positioned by properly controlling the current and thereby the magnetic force of the coils. The magnetic force a coil exerts on the ball is highly dependent on the distance of the ball from the coil’s centre, which introduces a significant nonmonotonic nonlinearity [22–24].

All experiments and simulation were scripted in MATLAB. The coil currents are controlled by stabilized current source modules, which communicate that MATLAB through a USB/RS232 transceiver using the virtual COM port (VCP) protocol on Windows OS. As the ball position is measured with a laser sensor with analog (voltage) output, the Humusoft MF634 IO card was used to measure the signal in real time from the MATLAB environment with a sampling period of 5 ms. Even though the Window OS is not an RTOS, with this sampling frequency, the period jitter is negligible (below 0.1%), and thus, the system can be considered real time.

Table 1 lists the parameters of the magnetic manipulator we use in our experiments. With the task being a precise positioning of an object in a magnetic field, similar concepts can be found in many real-world applications, for example, maglev, microrobots, contactless stirring of chemicals, and so on.

Approximate equations of motion inferred using the first principle method can be found in [24]. The system parameters were either measured directly or estimated using MATLAB parameter estimation toolbox based on measured data.

Generally, the system can be described by a continuous-time, nonlinear state-space model as follows:where is the state vector composed of the position and velocity of the ball, forming the continuous system state space ; is the state vector derivative; and is the input (action) vector composed of the coil currents. form the system input space . The nonlinear vector function thus describes the system dynamics.

In this paper, by modelling the system, we mean approximating the underlying real function using various methods, which all build upon experimentally measured input-output data. Each data point is formed by corresponding assumed inputs and outputs of the function – . In practice, these data points are corrupted by noise and other disturbances that are assumed to be with zero means.

##### 2.2. SNGP

Single-node genetic programming (SNGP) is a graph-based genetic programming algorithm evolving a population organized as an ordered linear array of interlinked individuals, each representing a single program node [2, 10, 21]. Generally, symbolic regression algorithms try to find a model in the form of an analytical expression for a given data set by forming and evolving the expression out of elemental functions and operations. In our case, the algorithm is based on the assumption that the nonlinear function in (1) can be efficiently approximated by the following equation:where the nonlinear function , called the feature, is developed by means of genetic programming with being the maximum number of features, number of states, and the coefficients estimated by the least-squares method. The features are constructed from a list of elementary functions that are assumed to be able to produce the required fitting approximation of the presented data. The features can be combined by common operators or nested, but the maximal depth of the expression is limited to avoid overfitting. The symbolic model is evolved so that the mean-squared error over the training data is minimized.

##### 2.3. MGGP

The second GP algorithm we used is called multigene genetic programming (MGGP). As opposed to SNGP, it combines the features defined also by 2 into tree-like structured expressions called genes. The final expression is formed by a linear combination of these genes, which act as the individual features in equation (2). The parameters of this top-level linear combination are again estimated through least squares. Further details about the algorithm can be found in [25]. The actual MGGP implementation we used is extended with linear combinations of features [26] that enable the algorithm to find affine transformations of the feature space via a backpropagation-like technique, thus making it easier for the driving genetic programming algorithm to approximate the nonlinearities.

##### 2.4. Receptive Field Weighted Regression

Receptive field weighted regression (RFWR) is an incremental approximation method that creates a set of local linear models and the corresponding Gaussian basis functions called the receptive fields and gradually updates them to fit the input-output data. The set of local linear models is updated with new data points (called the query points) using a weighted variant of the recursive least squares (RLS) method and the basis functions are updated through a gradient search with the help of heuristic decision rules. It can continually improve the set of models while still providing the best estimation of the approximated function at each query point based on the previously provided data. The original algorithm, first presented in [15, 16], which is the basis we build upon, can be best described by the following pseudocode:(1)For each new query point (2)For each existing local model(3)Calculate model weight according to (4)(4)If activation limit (5)Update model parameters using RLS according to (6) and (7)(6)Update the corresponding receptive field using (12) and (14)(7)End(8)End(9)If no model was activated(10)Place a new model at the query point using (15)(11)Else if two or more models were activated with weight pruning limit (12)Prune the model with the smaller receptive field(13)End(14)Calculate the model output as a weighted average of the activated local models(15)End

Usually, the receptive field activation limit is set as . This parameter represents the weight limit for a local model to be updated according to the new data and to be included in the output estimation through a weighted average with another activated model. The pruning limit is usually set as , which represents the highest acceptable overlap of neighbouring receptive fields.

The RFWR variant described in this paper follows the main outline of the original algorithm [15] with several adjustments and improvements for the sake of stability and computational complexity for low-dimensional problems. This mainly concerns the rules for adding new local models, adjusting their receptive fields, and generalizing the algorithm in a way that the receptive fields are placed and optimized in a lower number of dimensions than the order of the models. This is especially useful in cases when the nonlinearities are significant mainly in one or two dimensions of the state space of the system. This algorithm, in its original implementation, is successfully being used to approximate inverse models of nonlinear systems to be used as a feedforward compensator [27, 28]. Figure 2 shows an example approximation of a complex univariate nonlinear function by the RFWR algorithm.

Each of the local models is represented by a parameter vector . With the input vector (a query point) , the output is calculated by

The weight of a local model at a query point is determined by its Gaussian receptive field as follows:with the vector of model centre coordinates and the distance inducing matrix of the basis function (receptive field). The overall output is then calculated as a weighted average of the outputs of the activated local models.

The output estimate of the set of local models and their receptive fields is calculated by the following equation:

We modified the original RFWR algorithm described in [15] to be used for low-dimensional problems. These modifications consist of the following:(1)Precise placement of new receptive fields that takes into account the location and dimensions of the existing surrounding receptive fields (see section 3.3)(2)Heuristic rules for stable updating of the receptive fields (see section 3.1)(3)Receptive fields can be distributed along a lower number of dimensions than the dimension of the data space (see section 3.2)

###### 2.4.1. Updating parameters of Local Models

When a new query point is acquired, the activated local models are updated using the recursive least-squares algorithm (RLS) according towhere is the covariance matrix of the estimate, is a forgetting parameter, and is the acquired output for the actual system state called the query point. The covariance matrix needs is usually initialized as a diagonal matrix.

###### 2.4.2. Updating dimensions of Basis Functions

To avoid calculating the matrix inversion in (4) for every local model, an upper triangular matrix is used instead of . Because of symmetry and positive definiteness, these matrices relate according to

To update the receptive field, we update using a gradient-descent optimization

of the cost function as follows:where is the activated receptive field weight, is the estimated output of the respective model at the query point , and is the number of local models. The parameter is the gradient optimization step size. As the calculation of the cost function according to (9) is computationally very complex, we simplified the optimization algorithm through a set of heuristic decision rules and implemented the optimization as follows:

This implementation introduces a parameter , which is an expression of a simple heuristic to decide whether the value of a basis function (weight) at the actual query point should be increased or decreased. This enables to stop updating the distance inducing matrix when a precision criterion is met and to limit the maximal number of local models to avoid overfitting.

Parameter can be determined by various decision rules. A simple yet effective set, which was used in this research, can be created by using a long-term (cumulated over time) MSE of a particular model according to the data points, which can be described by

###### 2.4.3. Adding New Local Models

During the optimization process, it is possible that no model exceeds the activation limit . In such a case, a new local model with a receptive field is added to the approximation set. The centre of the receptive field is automatically placed at the actual query point, and the model parameters are initialized to fit the measured output of the approximated system. What needs to be determined is the area in the state space that should be covered by the newly created receptive field. The original algorithm uses a default diagonal distance inducing matrix for every local model. However, an optimal distance inducing matrix can be determined. Intuitively, the new receptive field should cover the gap between the already existing models. The distance inducing matrix should be initialized as a diagonal matrix with parameters that ensure that the new receptive field does not overlap with any existing one more than a preset limit. In our case, the limit was set to . Since this would be a complex optimization task not suitable for real-time calculation, we simplified the criterion so that the maximal overlapping weight of two models is analyzed only over the line segment connecting their centres. In that case, the distance parameter for initializing the distance inducing matrix can be determined by the following equation.where is a vector between the new centre and the centre of a neighbouring receptive field . A two-dimensional example is shown in Figure 3. This method yields a better estimate of the distance inducing matrix of the new receptive field than the fixed initial dimension matrix in the original algorithm as it requires fewer iterations to stabilize and to cover the gap between neighbouring receptive fields.

The distance parameter has to be calculated for every existing local model, and the minimal distance is used to initialize the distance inducing matrix according towhere is a unity matrix of the corresponding order.

In the specific case of the magnetic manipulator, the inputs of the local models would correspond to and the output to .

##### 2.5. Reinforcement Learning

Consider the following discrete deterministic state-space model of a system to be controlled:where denotes discrete time instants, is the state vector, and is the input vector. An RL agent learns to control the system so that it achieves the maximal cumulated reward on a trajectory from the initial state to the desired state [10]. At each state transition, as described by (14), the agent receives a scalar reward according to

The reward function is usually based on the distance of the current state to the goal state. The optimal control law, called the policy, is determined as follows such that it maximizes the cumulative reward, called the return:where is the discount factor and the initial state is selected from the state space domain . The return for any permissible initial state is captured by the value function defined as follows:

An approximation of the optimal -function can be found by solving the Bellman equation as follows:

The optimal action can be found as the action that steers the system to a state with maximal value [21]. This corresponds to maximization of the right-hand side of (18):

#### 3. Experimental Results

We prepared training and validation I/O data sets measured on the magnetic manipulator with random input signals as a list of data points in the form . The random input signals (coil currents) were generated in the way that only one coil was active at a time that eliminated possible electromagnetic interactions between them (the coil current was controlled by an HW-based current feedback controller module rendering the transient times negligible). Figure 4 shows an example of a training data set.

Even though the ball’s position measurement is very precise, it still contains significant noise. For that reason, the time-domain derivatives of the position (velocity and acceleration) needed for the dynamic model approximation were determined using the Savitzky–Golay filter, which is an FIR filter based on least-squares polynomial approximation able to perform numerical differentiation while filtering the noise simultaneously [29, 30].

Especially for the RFWR implementation, it is important to note that the system’s nonlinearity is mainly significant along the position of the ball and the system can be seen as linear in parameters along the other dimensions (acceleration and velocity). In this case, the general model (1) can be rearranged as follows:where the function represents the significant nonlinearity suitable for local approximation, the term represents a simple model of dry friction, the term represents the viscous friction, and the last term models nonlinear damping caused by electromagnetic induction influencing the steel ball while moving rapidly through a magnetic field. Despite being nonlinear, all of the terms are linear in their parameters and can be modelled globally, which means that the local models share parameters through .

The term in the (20) is quite important in practical situations where Coulomb friction is not negligible. The sign function is often being used to approximate the effects of Coulomb friction wherever there is no significant stiction (difference between static and dynamic friction effects). There are better approximations for simulation purposes, for example, the sigmoid function; however, most of them are not linear in parameters and thus not applicable for RLS parameter estimation.

The same data set was presented to all of the approximation methods (RFWR, SNGP, and MGGP). Due to the stochastic nature of the two algorithms based on genetic programming, the same process was repeated with different pseudorandom seeds. Overall, 30 runs for SNGP and MGGP and 1 run for RFWR were made. Table 2 shows the summary of the MSE results.

Since the MSE of the models with respect to the training set is not sufficient to decide which models are better, two separate data sets were measured: one was used for the training of the models and one for validation. However, since the magnetic manipulator is not open-loop stable, the common open-loop validation is not suitable, as every model diverges quickly even though the parameters may be close to ideal due to errors introduced by numerical integration. Therefore, the models were validated in several steps-ahead prediction mode. As we also suspected that only one-sample-ahead validation could be influenced by the remaining noise in the measured signals, we validated the model for 1, 3, 5, 10, 50, 100, and 250 samples ahead. We used the five-step-ahead prediction (5-SAP) as a baseline for selecting the best models for further experiments. The reason to choose five samples is based on an experimentally validated assumption that shorter intervals do not show the model’s imprecision and longer intervals cause even very precise models to diverge randomly.

The *n*-step-ahead prediction validation is based on a moving frame of *n* consecutive data points, where the first data point is applied as the initial condition for numerical integration (using the ode45 solver) of the dynamical model being tested. When the simulation reaches the *n*-th step, an MSE residual is calculated between the corresponding data point and the model prediction. The resulting model validation metric is then calculated as the sum of residuals over each of the prediction frames.

Figure 5 shows examples of the models constructed by each algorithm. As the transition model of the system is four-dimensional, for visualization purposes, the figures show a two-dimensional situation for the input vector set to . This corresponds to the situation when the first coil is turned off and the current through the second coil is set to the maximal value. The plotted functions were cropped in the parts of the state space that are not reachable by the system (typically high velocities close to the edge of the frame).

Based on the 5-SAP validation results, 21 best models (10x SNGP, 10x MGGP, and 1x RFWR) were chosen to be used for RL control of the magnetic manipulator. Based on these models, we calculated the approximations of the optimal -functions using the fuzzy value iteration [10, 21]. Furthermore, we used equation (19) to calculate the corresponding policies. Figure 6 shows examples of the value functions that resulted in the policies shown in Figure 7.

All the policies were tested on the actual magnetic manipulator. A sequence of five consecutive goal states was chosen as a goal state trajectory, and a corresponding -function and policy were calculated for each one. During the actual control process, a new policy is used every time the goal state changes. Figure 8 shows an example of the control experiments with an RL controller based on the model SNGP 10. Furthermore, Figure 9 shows the ball’s trajectory in the state space plotted over the -function and the policy.

To conduct all of the experiments, we measured the ball’s position using a laser distance sensor, and a PCIe I/O card Humusoft MF634 was used to capture the sensor’s analog output signal. The coil currents were driven by a custom dual-channel current controller. Both of the devices are operated from MATLAB.

##### 3.1. Results

We compared the performance of various models of a complex nonlinear system created with three different approximation methods, two of which were based on genetic programming and the third was based on a modified local linear approximation algorithm (RFWR). Based on these models, we used a model-based critic-only RL agent to control the system and validate the results.

Table 2 shows the resulting MSE values from the estimation, validation, and control processes.

Most of the models selected for the actual control experiments were successful in achieving stable control, although they differ in precision. The histogram in Figure 10 shows the number of models for the two GP-based algorithms (SNGP and MGGP), which fall into several MSE categories. The MSE describes the control precision as the mean-squared error between the goal and the actual trajectory of the closed-loop controlled system.

#### 4. Conclusions

First of all, the results show that it is possible to construct even such a complex nonlinear system using the RL framework. The results are not significant in terms of control precision, which depends highly on the specific system, amount of experimental data, and many other factors. An important achievement is a fact that all of the modelling algorithms demonstrated in this paper provide a viable alternative to the commonly used methods while being much less computationally expensive (in the case of RFWR) or much more user-friendly (in the cases of SNGP and MGGP). Also, it was proven that the commonly used RL framework may be built even on tom of imperfect models.

It is clear from the results that the methods that at least partially depend on random number generation (SNGP and MGGP) need to be run repeatedly in search for the best solution, which clearly outperforms the result of the local approximation method (RFWR). On the other hand, the RFWR method requires significantly less computational power than the GP-based methods. Also, it seems that both the SNGP and the MGGP are able to find similarly precise models with the MGGP having a higher probability of converging to the best solution. It is interesting to note that models with better training or validation fit are not always better for control, as can be seen in Table 2. All of the methods provide a useful tool to be used within the reinforcement learning framework with the main advantage of the GP-based approximation method being a form of output (analytical expression) that is understandable and readable and whose complexity is controllable through intuitive parameters. Considering the simplifications made during the simulations and experiments and a relative imprecision during the control processes, there is space for future research in modifying these methods to be suitable for higher-dimensional systems, implementing GPU, handling specific nonlinearities (friction, hysteresis, etc.), and using them also for an approximation of the -functions and policies.

#### Data Availability

Research data in the form of simulation and experimental results and MATLAB files are available from the corresponding (first) author upon request ([email protected]).

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.