Abstract

In this paper, a recent variant of the stochastic gradient descent (SGD) approach, namely, the adaptive moment estimation (Adam) approach, is improved by adding the standard error to its updating rule. The aim is to accelerate the convergence rate of the Adam algorithm. This improvement is termed the Adam with standard error (AdamSE) algorithm. In addition, the mean-variance portfolio optimization model is formulated from the historical rates of return of the S&P 500 stock, the 10-year Treasury bond, and the money market. The application of the SGD, Adam, adaptive moment estimation with maximum (AdaMax), Nesterov-accelerated adaptive moment estimation (Nadam), AMSGrad, and AdamSE algorithms to solve the mean-variance portfolio optimization problem is then investigated. During the calculation procedure, the iterative solutions converge to the optimal portfolio solution, and the AdamSE algorithm requires the fewest iterations. The results show that the convergence rate of the Adam algorithm is significantly enhanced by the AdamSE algorithm. In conclusion, the efficiency of the improved Adam algorithm using the standard error is demonstrated, and the applicability of the SGD, Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms in solving the mean-variance portfolio optimization problem is validated.

1. Introduction

Recently, the application of the stochastic gradient descent (SGD) approach to machine learning and deep learning has been actively explored. Owing to the ability of the SGD approach to handle stochastic optimization problems [1] and to solve optimization problems under uncertainty [2, 3], the SGD approach and its variants have developed rapidly. Meanwhile, the mean-variance portfolio optimization problem [4], which deals with risk and return, has attracted the attention of the investment community. An optimal portfolio selection decision is needed, in which a scientific approach is employed to maximize the return at minimum risk [5]. However, such an optimal decision is difficult to make in advance.

In this paper, the main disadvantage of the SGD approach, namely, its slow convergence [6, 7], is addressed. To overcome this weakness, the standard error from sampling theory is added to the updating rule of the adaptive moment estimation (Adam) algorithm [8], which is a recent variant of the SGD approach. On this basis, the convergence rate of the Adam algorithm is improved significantly. This improved version is referred to as the Adam with standard error (AdamSE) algorithm. In addition, the application of the SGD approach and its variants, namely, the Adam, adaptive moment estimation with maximum (AdaMax), Nesterov-accelerated adaptive moment estimation (Nadam), AMSGrad, and AdamSE algorithms, to solving the mean-variance portfolio optimization problem is further studied. For this purpose, the historical rates of return of the S&P 500 stock, the 10-year Treasury bond, and the money market are employed, and the mean-variance portfolio optimization model is formulated. During the calculation procedure, the iterative solution converges to the optimal portfolio solution, and the performance of these algorithms is presented.

The rest of the paper is organized as follows. In Section 2, the mean-variance portfolio optimization problem is described, and the expected return and the covariance matrix are defined. In Section 3, the enhancement of the convergence rate of the Adam algorithm by using the standard error from sampling theory is discussed, and the calculation procedures for the SGD, Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms are summarized. In Section 4, a mean-variance portfolio optimization model is formulated using the historical data of the S&P 500 stock, 10-year Treasury bond, and money market; the model is then solved by the algorithms discussed, and the results are presented. Finally, some concluding remarks are made.

2. Problem Description

Consider a general mean-variance portfolio optimization problem for $n$ risky assets given by
$$\min_{w}\; w^{\top}\Sigma w \quad \text{subject to} \quad \mu^{\top}w = r_{T}, \qquad \mathbf{1}^{\top}w = 1,$$
where $w$ is the vector of portfolio weights of the assets and $\Sigma$ represents the covariance matrix of the assets. Here, $w^{\top}\Sigma w$ gives the variance of the portfolio, $\mu$ is the vector of portfolio return means, and $\mathbf{1}$ is the vector whose elements are all equal to 1. Note that the targeted expected return $r_{T}$ depends on the risk tolerance of investors.

Furthermore, by using the geometric mean, the portfolio return mean is computed from
$$\mu_{i} = \left(\prod_{t=1}^{T}\left(1 + r_{it}\right)\right)^{1/T} - 1,$$
where $r_{it}$ is the rate of return of asset $i$ at time $t$ and $\mu_{i}$ is the mean of the rate of return for asset $i$, whereas the covariance matrix of the assets is defined as
$$\sigma_{ij} = \frac{1}{T-1}\sum_{t=1}^{T}\left(r_{it} - \mu_{i}\right)\left(r_{jt} - \mu_{j}\right),$$
for $i, j = 1, 2, \ldots, n$. The aim of the mean-variance portfolio optimization defined in (1) is to minimize the risk of an investment, which is represented by the variance, while satisfying the targeted return of the portfolio [9]. Since risk is always related to randomness and uncertainty [10], a stochastic optimization approach will be used to solve the optimization problem defined in (1).
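
As an illustration of how (2) and (3) can be evaluated in practice, the following Python/NumPy sketch computes the geometric-mean return vector and the sample covariance matrix. The return series is a small hypothetical placeholder rather than the Table 1 data, and the $T-1$ normalization of the covariance is an assumption about the exact form of (3).

import numpy as np

# Hypothetical annual rates of return (rows: years, columns: assets);
# the series actually used in Section 4 spans 1961-2003 (Table 1).
R = np.array([
    [0.12, 0.03, 0.02],
    [-0.05, 0.05, 0.03],
    [0.20, 0.01, 0.02],
    [0.07, 0.04, 0.01],
])
T = R.shape[0]

# Geometric mean of the rate of return for each asset, cf. (2).
mu = np.prod(1.0 + R, axis=0) ** (1.0 / T) - 1.0

# Sample covariance matrix of the assets, cf. (3); np.cov uses the T - 1 normalization.
Sigma = np.cov(R, rowvar=False)

print("mean returns:", mu)
print("covariance matrix:\n", Sigma)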

3. Stochastic Optimization Method

Now, let us define the Lagrange function as follows:
$$\mathcal{L}(w, \lambda) = w^{\top}\Sigma w + \lambda_{1}\left(r_{T} - \mu^{\top}w\right) + \lambda_{2}\left(1 - \mathbf{1}^{\top}w\right),$$
where $\lambda = (\lambda_{1}, \lambda_{2})^{\top}$ is the vector of the Lagrange multipliers. Then, the following first-order necessary conditions are derived:
$$\frac{\partial \mathcal{L}}{\partial w} = 2\Sigma w - \lambda_{1}\mu - \lambda_{2}\mathbf{1} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_{1}} = r_{T} - \mu^{\top}w = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_{2}} = 1 - \mathbf{1}^{\top}w = 0.$$

3.1. Analytical Optimal Solution

From (5), the optimal weighted value of the portfolio is calculated from
$$w^{*} = \frac{1}{2}\Sigma^{-1}\left(\lambda_{1}\mu + \lambda_{2}\mathbf{1}\right).$$

Referring to (6), the targeted expected return is provided by
$$\mu^{\top}w^{*} = r_{T}.$$

Then, substitute (9) into (10) to express the targeted expected return in terms of the Lagrange multipliers, that is,
$$\frac{1}{2}\left(\lambda_{1}\,\mu^{\top}\Sigma^{-1}\mu + \lambda_{2}\,\mu^{\top}\Sigma^{-1}\mathbf{1}\right) = r_{T}.$$

Rewrite (7) to be
$$\mathbf{1}^{\top}w^{*} = 1,$$
and substitute (9) into (12):
$$\frac{1}{2}\left(\lambda_{1}\,\mathbf{1}^{\top}\Sigma^{-1}\mu + \lambda_{2}\,\mathbf{1}^{\top}\Sigma^{-1}\mathbf{1}\right) = 1.$$

From (11) and (13), after some algebraic manipulation, the Lagrange multipliers are computed from
$$\lambda_{1} = \frac{2\left(c\,r_{T} - b\right)}{ac - b^{2}}$$
and
$$\lambda_{2} = \frac{2\left(a - b\,r_{T}\right)}{ac - b^{2}},$$
where $a = \mu^{\top}\Sigma^{-1}\mu$, $b = \mu^{\top}\Sigma^{-1}\mathbf{1}$, and $c = \mathbf{1}^{\top}\Sigma^{-1}\mathbf{1}$. Therefore, from the discussion above, the analytical solution of the mean-variance portfolio optimization problem defined in (1), which is given by (9), (14), and (15), is obtained. However, this analytical solution is assumed to be unavailable because of the uncertainty and randomness of the variables.
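
The closed-form solution above can be checked numerically. The Python/NumPy sketch below solves the two-equation system for the Lagrange multipliers and recovers the optimal weights; the one-half scaling follows the Lagrangian convention used in this section, and the covariance matrix, mean returns, and target return are hypothetical placeholders.

import numpy as np

def analytic_portfolio(Sigma, mu, r_target):
    """Closed-form mean-variance weights via the Lagrange multipliers."""
    n = len(mu)
    ones = np.ones(n)
    Sinv = np.linalg.inv(Sigma)
    a = mu @ Sinv @ mu
    b = mu @ Sinv @ ones
    c = ones @ Sinv @ ones
    # Solve the 2x2 system coming from the two constraints, cf. (14) and (15).
    lam1 = 2.0 * (c * r_target - b) / (a * c - b**2)
    lam2 = 2.0 * (a - b * r_target) / (a * c - b**2)
    # Stationarity condition, cf. (9): w* = 0.5 * Sigma^{-1} (lam1*mu + lam2*1).
    w = 0.5 * Sinv @ (lam1 * mu + lam2 * ones)
    return w, lam1, lam2

# Hypothetical inputs for a quick check.
Sigma = np.array([[0.0290, 0.0025, 0.0002],
                  [0.0025, 0.0081, 0.0005],
                  [0.0002, 0.0005, 0.0007]])
mu = np.array([0.11, 0.07, 0.05])
w, lam1, lam2 = analytic_portfolio(Sigma, mu, r_target=0.06)
print(w, w.sum(), mu @ w)   # the weights sum to 1 and attain the target return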

3.2. Stochastic Gradient Descent Algorithm

Referring to the mean-variance portfolio optimization problem defined in (1), let us introduce an augmented objective function as
$$J(w, \lambda) = w^{\top}\Sigma w + \lambda_{1}\left(r_{T} - \mu^{\top}w\right) + \lambda_{2}\left(1 - \mathbf{1}^{\top}w\right),$$
with $\lambda = (\lambda_{1}, \lambda_{2})^{\top}$. Owing to the existence of uncertainty, the augmented objective function defined in (16) can be rewritten as the expected objective function, given by
$$J(w, \lambda) = \mathbb{E}\left[J_{i}(w, \lambda)\right],$$
where $J_{i}(w, \lambda)$ is the element of the augmented objective function that is uniformly sampled at random and $\mathbb{E}[\cdot]$ is the expectation operator. By virtue of this, the sampled gradient is denoted as an unbiased estimator, that is,
$$\mathbb{E}\left[\nabla J_{i}(w, \lambda)\right] = \nabla J(w, \lambda).$$

Notice that the first-order necessary condition for (16) is equivalent to the first-order necessary condition (5). That is,
$$\nabla_{w} J(w, \lambda) = 2\Sigma w - \lambda_{1}\mu - \lambda_{2}\mathbf{1} = 0.$$

For convenience, define the stochastic gradient
$$g_{i}\left(w^{(k)}\right) = \nabla_{w} J_{i}\left(w^{(k)}, \lambda\right),$$
which can be calculated from (5). The updating rule of the SGD approach is given by
$$w^{(k+1)} = w^{(k)} - \alpha\, g_{i}\left(w^{(k)}\right),$$
with the step size $\alpha$, which is also known as the learning rate, the iteration number $k$, and the random index $i$ indicating the sampled gradient used.

Hence, the calculation procedure of the SGD algorithm is summarized as in Algorithm 1.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: update the vector $w^{(k+1)}$ from (21). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$.
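
As a companion to Algorithm 1, the following Python/NumPy sketch illustrates the SGD iteration. Since the exact sampled element $J_{i}$ of (16) is not reproduced above, the per-sample gradient below is an assumption for illustration only: it combines one randomly chosen time period of the covariance average with quadratic penalties for the two constraints, and the return history, target return, penalty weight, iteration budget, and tolerance are hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(0.05, 0.10, size=(43, 3))      # hypothetical return history
mu = R.mean(axis=0)
r_target, pen = 0.05, 10.0                    # hypothetical target return and penalty weight
alpha, tol, max_iter = 0.001, 1e-6, 100_000   # learning rate as in the Remark; tol is a placeholder

def sampled_grad(w, t):
    """Gradient of one sampled element of the (penalized) objective."""
    d = R[t] - mu
    g = 2.0 * (w @ d) * d                                # one-period piece of grad(w' Sigma w)
    g += 2.0 * pen * (mu @ w - r_target) * mu            # return-constraint penalty
    g += 2.0 * pen * (w.sum() - 1.0) * np.ones(len(w))   # budget-constraint penalty
    return g

w = np.full(3, 1.0 / 3.0)                     # initial value
for k in range(max_iter):
    t = rng.integers(R.shape[0])              # Step 3: random index
    g = sampled_grad(w, t)                    # Step 2: stochastic gradient, cf. (20)
    w = w - alpha * g                         # Step 4: SGD update, cf. (21)
    if np.linalg.norm(g) < tol:               # stopping test on the gradient norm
        break
print(w, w.sum(), mu @ w)
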
3.3. Adaptive Moment Estimation Algorithm

In the Adam approach [11], the exponentially decaying averages of the past gradients and the past squared gradients are considered as follows:
$$m_{k} = \beta_{1} m_{k-1} + \left(1 - \beta_{1}\right) g_{k},$$
$$v_{k} = \beta_{2} v_{k-1} + \left(1 - \beta_{2}\right) g_{k}^{2},$$
where $g_{k}$ is the gradient and $\beta_{1}$ and $\beta_{2}$ are the decay rates, which are close to 1. Notice that $m_{k}$ and $v_{k}$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. Because $m_{k}$ and $v_{k}$ are initialized as vectors of zeros, these estimates are biased toward zero, especially during the initial iterations. These biases are counteracted by using the bias-corrected first- and second-moment estimates, given by
$$\hat{m}_{k} = \frac{m_{k}}{1 - \beta_{1}^{k}}, \qquad \hat{v}_{k} = \frac{v_{k}}{1 - \beta_{2}^{k}}.$$

Thus, the Adam updating rule is presented as follows:
$$w^{(k+1)} = w^{(k)} - \frac{\alpha\, \hat{m}_{k}}{\sqrt{\hat{v}_{k}} + \epsilon},$$
where $\epsilon$ is the smoothing term used to avoid division by zero.

The calculation procedure of the Adam algorithm is summarized as in Algorithm 2.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: compute the decaying averages of the past and past squared gradients $m_{k}$ and $v_{k}$ from (22) and (23).
Step 5: calculate the bias-corrected first- and second-moment estimates $\hat{m}_{k}$ and $\hat{v}_{k}$ from (24) and (25).
Step 6: update the vector $w^{(k+1)}$ from (26). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The default values for the decay rates are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, and the smoothing term is the small positive constant $\epsilon$, while the tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$.
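
A minimal Python/NumPy sketch of the Adam recursions (22)–(26) is given below. It is applied to a generic noisy quadratic gradient as a stand-in for the sampled portfolio gradient, with the decay rates and learning rate quoted in the Remark and the commonly used smoothing term $10^{-8}$; the matrix A, vector b, noise level, and iteration budget are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def stoch_grad(x):
    """Noisy gradient of a toy quadratic 0.5 x'Ax - b'x (a stand-in objective)."""
    return A @ x - b + 0.01 * rng.standard_normal(x.shape)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
x = np.zeros(2)
m = np.zeros_like(x)
v = np.zeros_like(x)
for k in range(1, 20_001):
    g = stoch_grad(x)
    m = beta1 * m + (1.0 - beta1) * g              # first-moment estimate, cf. (22)
    v = beta2 * v + (1.0 - beta2) * g**2           # second-moment estimate, cf. (23)
    m_hat = m / (1.0 - beta1**k)                   # bias correction, cf. (24)
    v_hat = v / (1.0 - beta2**k)                   # bias correction, cf. (25)
    x -= alpha * m_hat / (np.sqrt(v_hat) + eps)    # Adam update, cf. (26)
print(x, np.linalg.solve(A, b))                    # compare with the exact minimizer
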
3.4. Adaptive Moment Estimation with Maximum

AdaMax, which is the adaptive moment estimation with maximum [11], is a variant of the Adam optimizer that uses the infinity norm ($\ell_{\infty}$), while the Adam optimizer itself uses the $\ell_{2}$-norm of the gradients for optimization. When the Adam algorithm is generalized to the $\ell_{p}$-norm, and hence in AdaMax, the accumulator update becomes the maximum between the exponentially weighted past gradients and the current gradient, which is shown as
$$u_{k} = \max\left(\beta_{2} u_{k-1}, \left|g_{k}\right|\right).$$

Then, the updating rule of AdaMax is
$$w^{(k+1)} = w^{(k)} - \frac{\alpha\, \hat{m}_{k}}{u_{k}}.$$

The calculation procedure of the AdaMax algorithm is summarized as in Algorithm 3.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: compute the decaying average of the past gradients $m_{k}$ from (22) and the weighted infinity norm $u_{k}$ from (27).
Step 5: calculate the bias-corrected first-moment estimate $\hat{m}_{k}$ from (24).
Step 6: update the vector $w^{(k+1)}$ from (28). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The default values for the decay rates are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, the tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$.
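
The corresponding AdaMax sketch replaces the second-moment recursion with the infinity-norm accumulator of (27); the toy gradient, iteration budget, and the small guard added to the denominator are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def stoch_grad(x):
    return A @ x - b + 0.01 * rng.standard_normal(x.shape)

alpha, beta1, beta2 = 0.001, 0.9, 0.999
x = np.zeros(2)
m = np.zeros_like(x)
u = np.zeros_like(x)                                   # infinity-norm accumulator
for k in range(1, 20_001):
    g = stoch_grad(x)
    m = beta1 * m + (1.0 - beta1) * g                  # first-moment estimate, cf. (22)
    u = np.maximum(beta2 * u, np.abs(g))               # max of past and current gradients, cf. (27)
    m_hat = m / (1.0 - beta1**k)                       # bias correction, cf. (24)
    x -= alpha * m_hat / (u + 1e-12)                   # AdaMax update, cf. (28); tiny guard added
print(x)
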
3.5. Nesterov-Accelerated Adaptive Moment Estimation

Nadam, which is the Nesterov-accelerated adaptive moment estimation, combines Adam with the Nesterov accelerated gradient (NAG) [12]. The Nadam algorithm is employed for noisy gradients or gradients with high curvature. The NAG algorithm allows a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient. The learning process is accelerated by summing up the exponential decay of the moving averages of the previous and current gradients, which results in a slightly faster training time than the Adam algorithm. Its updating rule is given by
$$w^{(k+1)} = w^{(k)} - \frac{\alpha}{\sqrt{\hat{v}_{k}} + \epsilon}\left(\beta_{1}\hat{m}_{k} + \frac{\left(1 - \beta_{1}\right) g_{k}}{1 - \beta_{1}^{k}}\right).$$

The calculation procedure of the Nadam algorithm is summarized as in Algorithm 4.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: compute the decaying averages of the past and past squared gradients $m_{k}$ and $v_{k}$ from (22) and (23).
Step 5: calculate the bias-corrected first- and second-moment estimates $\hat{m}_{k}$ and $\hat{v}_{k}$ from (24) and (25).
Step 6: update the vector $w^{(k+1)}$ from (29). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The default values for the decay rates are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, and the smoothing term is the small positive constant $\epsilon$, while the tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$.
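
A minimal Nadam sketch follows, using the Nesterov-style look-ahead form of the update in (29) on the same kind of toy stochastic gradient; the toy problem and iteration budget are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def stoch_grad(x):
    return A @ x - b + 0.01 * rng.standard_normal(x.shape)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
x = np.zeros(2)
m = np.zeros_like(x)
v = np.zeros_like(x)
for k in range(1, 20_001):
    g = stoch_grad(x)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    m_hat = m / (1.0 - beta1**k)
    v_hat = v / (1.0 - beta2**k)
    # Nadam update, cf. (29): Nesterov-style look-ahead on the first moment.
    x -= (alpha / (np.sqrt(v_hat) + eps)) * (beta1 * m_hat + (1.0 - beta1) * g / (1.0 - beta1**k))
print(x)
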
3.6. AMSGrad

In settings where the Adam algorithm converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients; however, since these minibatches occur only rarely, exponential averaging diminishes their influence, which leads to poor convergence. To fix this behaviour, a new algorithm, known as the AMSGrad algorithm [13], uses the maximum of the past squared gradients rather than their exponential average to update the parameters:
$$\hat{v}_{k} = \max\left(\hat{v}_{k-1}, v_{k}\right).$$

Therefore, the updating rule of AMSGrad is
$$w^{(k+1)} = w^{(k)} - \frac{\alpha\, m_{k}}{\sqrt{\hat{v}_{k}} + \epsilon}.$$

The calculation procedure of the AMSGrad algorithm is summarized as Algorithm 5.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: compute the decaying averages of the past and past squared gradients $m_{k}$ and $v_{k}$ from (22) and (23).
Step 5: calculate the moment estimate $\hat{v}_{k}$ based on (30).
Step 6: update the vector $w^{(k+1)}$ from (31). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The default values for the decay rates are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, and the smoothing term is the small positive constant $\epsilon$, while the tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$.
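
A minimal AMSGrad sketch on the same kind of toy stochastic gradient is given below. It follows the common formulation with a running maximum of the squared-gradient accumulator and no bias correction of the first moment, which may differ in minor details from (30) and (31).

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def stoch_grad(x):
    return A @ x - b + 0.01 * rng.standard_normal(x.shape)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
x = np.zeros(2)
m = np.zeros_like(x)
v = np.zeros_like(x)
v_hat = np.zeros_like(x)
for k in range(1, 20_001):
    g = stoch_grad(x)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)                 # running maximum of past squared gradients, cf. (30)
    x -= alpha * m / (np.sqrt(v_hat) + eps)      # AMSGrad update, cf. (31)
print(x)
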
3.7. Improved Adaptive Moment Estimation Algorithm

Consider the standard error (SE) from sampling theory [14],
$$\mathrm{SE} = \frac{\sigma}{\sqrt{n}},$$
where $\sigma$ is the population standard deviation and $n$ is the sample size. Thus, to improve the updating rule of the Adam algorithm, assume that the standard error of the bias-corrected first-moment estimate is defined by
$$\mathrm{SE}\left(\hat{m}_{k}\right) = \frac{s_{k}}{\sqrt{k}},$$
where $s_{k}$ represents the sample standard deviation of the gradient and $k$ is the number of iterations. From (26), the updating rule of the Adam algorithm is modified to be
$$w^{(k+1)} = w^{(k)} - \frac{\alpha\left(\hat{m}_{k} + \mathrm{SE}\left(\hat{m}_{k}\right)\right)}{\sqrt{\hat{v}_{k}} + \epsilon}.$$

With this modification, the improved Adam algorithm is known as the Adam with standard error (AdamSE) algorithm [11, 14].

The calculation procedure for the AdamSE algorithm is summarized as Algorithm 6.

Data: given the initial value $w^{(0)}$, the number of samples $N$, the step size $\alpha$, and the tolerance $\varepsilon_{\mathrm{tol}}$. Initialize the iteration counter $k$.
Step 1: evaluate the augmented objective function $J$ from (16).
Step 2: compute the stochastic gradient $g_{i}$ from (20).
Step 3: set the random index $i$.
Step 4: compute the decaying averages of the past and past squared gradients $m_{k}$ and $v_{k}$ from (22) and (23).
Step 5: calculate the bias-corrected first- and second-moment estimates $\hat{m}_{k}$ and $\hat{v}_{k}$ from (24) and (25).
Step 6: calculate the standard error of the bias-corrected first-moment estimate $\mathrm{SE}(\hat{m}_{k})$ from (33).
Step 7: update the vector $w^{(k+1)}$ from (34). If $\left\| g_{i}\left(w^{(k)}\right)\right\| < \varepsilon_{\mathrm{tol}}$, then stop the iteration. Otherwise, set $k = k + 1$ and repeat from Step 1.
Remark:
The default values for the decay rates are $\beta_{1} = 0.9$ and $\beta_{2} = 0.999$, and the smoothing term is the small positive constant $\epsilon$, while the tolerance is $\varepsilon_{\mathrm{tol}}$, and the learning rate is $\alpha = 0.001$, the same as in the Adam algorithm.
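
Since the exact forms of (33) and (34) are not reproduced above, the Python/NumPy sketch below rests on two assumptions: the sample standard deviation $s_{k}$ is taken as the running (componentwise) standard deviation of the gradients observed up to iteration $k$, and the resulting standard error $s_{k}/\sqrt{k}$ is added to the bias-corrected first moment before the usual Adam step. The toy gradient and iteration budget are again hypothetical.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])

def stoch_grad(x):
    return A @ x - b + 0.01 * rng.standard_normal(x.shape)

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
x = np.zeros(2)
m = np.zeros_like(x)
v = np.zeros_like(x)
# Welford running statistics of the gradients, used for the sample standard deviation s_k.
g_mean = np.zeros_like(x)
g_M2 = np.zeros_like(x)
for k in range(1, 20_001):
    g = stoch_grad(x)
    delta = g - g_mean                           # update running mean/variance of the gradients
    g_mean += delta / k
    g_M2 += delta * (g - g_mean)
    s = np.sqrt(g_M2 / (k - 1)) if k > 1 else np.zeros_like(x)
    se = s / np.sqrt(k)                          # assumed standard error of the first moment, cf. (33)
    m = beta1 * m + (1.0 - beta1) * g            # cf. (22)
    v = beta2 * v + (1.0 - beta2) * g**2         # cf. (23)
    m_hat = m / (1.0 - beta1**k)                 # cf. (24)
    v_hat = v / (1.0 - beta2**k)                 # cf. (25)
    # Assumed AdamSE step: the standard error is added to the bias-corrected first moment.
    x -= alpha * (m_hat + se) / (np.sqrt(v_hat) + eps)
print(x)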

4. Illustrative Example

Consider a portfolio optimization problem [15], where the portfolio selection is based on three securities, namely, S&P 500 stock, 10-year Treasury bond, and money market (MM). The corresponding historical data of the annual rate of return for these securities, which are dated from 1961 to 2003, are shown in Table 1.

By using (2) and (3), the mean return vector and the covariance matrix of this portfolio selection are calculated from the data in Table 1.

Consequently, the mean-variance portfolio optimization model in (1) is formulated with these values, together with the given initial weights of the three securities and the targeted expected return.

As a result, the optimal portfolio in percentage is shown in Table 2, where the final iterative solutions obtained from the SGD, Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms are compared. The optimal solution from [15] is included as the benchmark solution. It can be noticed that these algorithms are able to provide the optimal weights for the portfolio selection, which are given by (2.63, 10.24, 87.13), and the corresponding Lagrange multipliers are also obtained.

The performance of the respective methods in solving the mean-variance portfolio optimization problem is shown in Table 3, where the numbers of iterations are presented.

Apparently, the AdamSE algorithm requires the smallest number of iterations, which is an 86 percent reduction compared with the Adam algorithm, while the Adam algorithm brings the iterative solution to convergence about 1 percent faster than the SGD algorithm. At the same time, some variants of the Adam algorithm, namely, AdaMax, Nadam, and AMSGrad, require more iterations than the Adam algorithm. The convergence behaviour of each algorithm, represented by the norm of the stochastic gradient, is shown in Figures 1–6, respectively. For better visualization, the iterative results for the first 300 iterations of the SGD and AdaMax algorithms are presented in Figures 1 and 3, respectively. Therefore, the modification of the Adam algorithm by equipping it with the standard error significantly enhances its convergence rate, and the efficiency of the AdamSE algorithm is clearly demonstrated.

In addition, the objective function attains a minimum risk of $5.0182 \times 10^{-4}$, and the changes in the variance produced by each algorithm during the iteration procedure are shown in Figures 7–12, respectively. From these figures, it is noticed that the variance increases dramatically until it reaches a peak point for the different algorithms, and then it decreases gradually toward the minimum variance of $5.0182 \times 10^{-4}$. This behaviour indicates that a temporary divergence is encountered before the peak point is reached; once the constraints are satisfied, the optimal solution is successfully determined, giving the optimal weights of the portfolio selection.

5. Concluding Remarks

The enhancement of the convergence rate of the Adam algorithm by using the standard error was discussed in this paper, and the improved version is known as the AdamSE algorithm. In addition, the application of the SGD, Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms to solving the mean-variance portfolio optimization problem was studied. The results obtained showed that the AdamSE algorithm is an efficient approach, particularly for solving the mean-variance portfolio optimization problem. In conclusion, the practicality of the SGD algorithm and its variants, namely, the Adam, AdaMax, Nadam, AMSGrad, and AdamSE algorithms, is validated for the mean-variance portfolio optimization problem.

Data Availability

The data used are shown in Table 1.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Ministry of Education Malaysia (MOE) for supporting this research under the Fundamental Research Grant Scheme (vot. no. FRGS/1/2018/STG06/UTHM/02/5). This research was partially sponsored by Universiti Tun Hussein Onn Malaysia.