#### Abstract

The autoregressive moving average (ARMA) model is a simple but powerful model in financial engineering for representing time series with long-range statistical dependency. However, the traditional maximum likelihood (ML) estimator minimizes a loss function that is inherently symmetric due to Gaussianity. Consequently, when the data of interest are asset returns and the main goal is to maximize profit through accurate forecasting, the ML objective may be less appropriate, potentially leading to a suboptimal solution. It is more reasonable to adopt an asymmetric loss in which a prediction in the same direction as the true return is penalized less than a prediction in the opposite direction. We propose an asymmetric cost-sensitive loss function and incorporate it into ARMA model estimation. On the online portfolio selection problem with real stock return data, we demonstrate that the investment strategy based on predictions by the proposed estimator can be significantly more profitable than the one based on the traditional ML estimator.

#### 1. Introduction

In modeling time-series data, capturing the statistical dependency of the variables of interest at the current time on the historic data is central to accurate forecasting and faithful data representation. For financial time-series data in particular (e.g., daily asset prices or returns), where a large number of potential prognostic indicators are available, the development and analysis of sensible dynamic models and effective parameter estimation algorithms have been investigated extensively.

To account for statistical properties specific to financial sequences, several sophisticated dynamic time-series models have been developed: fairly natural autoregressive and/or moving average models [1], the conditional heteroscedastic models that represent dynamics of volatilities (variances) of the asset returns [2–4], and nonlinear models [5, 6] including bilinear models [7], threshold models [8], and regime switching models [9, 10].

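Among those, the autoregressive moving average (ARMA) model [1] is the simplest yet essential, in the sense that most other models are equipped with at least the basic ARMA components. ARMA models have recently appeared in a wide spectrum of applications, including filter design in signal processing [11], time-series analysis and model selection in computational statistics [12], and jump (large change) modeling for asset prices in quantitative finance [13], to name just a few. For a time series $x_1, x_2, \ldots, x_T$ (e.g., $x_t$ is the asset return on the $t$th day), the ARMA($p$, $q$) model determines $x_t$ by

$$x_t = \boldsymbol{\alpha}^\top \mathbf{x}_{t-p:t-1} + \boldsymbol{\beta}^\top \boldsymbol{\epsilon}_{t-q:t-1} + \epsilon_t, \tag{1}$$

where $\mathbf{x}_{a:b}$ indicates the vector $[x_a, x_{a+1}, \ldots, x_b]^\top$. Here $\epsilon_t$ is the stochastic Gaussian error term at time $t$, where we assume $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ iid across $t$'s. In (1), the coefficient vectors $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ and the error variance $\sigma^2$ are the model parameters. That is, $x_t$ depends on the $p$ previous asset returns, the $q$ historic errors, and the current error $\epsilon_t$.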
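As a minimal sketch of the ARMA recursion described above, the following Python snippet simulates a series and forms the one-step forecast when the past errors are known. The function names, the argument names `alpha`, `beta`, and `sigma`, and the zero-valued initial conditions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_arma(alpha, beta, sigma, T, seed=0):
    """Simulate x_t = alpha . x_{t-p:t-1} + beta . eps_{t-q:t-1} + eps_t.

    alpha[0] and beta[0] multiply the oldest value in each window.
    Initial observations and errors are assumed to be zero (an assumption).
    """
    rng = np.random.default_rng(seed)
    p, q = len(alpha), len(beta)
    x = np.zeros(T + p)      # first p entries: zero-valued initial observations
    eps = np.zeros(T + q)    # first q entries: zero-valued initial errors
    for t in range(T):
        eps[q + t] = rng.normal(0.0, sigma)
        x[p + t] = alpha @ x[t:t + p] + beta @ eps[t:t + q] + eps[q + t]
    return x[p:], eps[q:]

def one_step_forecast(alpha, beta, x_past, eps_past):
    """Mean of x_t given the p past returns and q past errors (E[eps_t] = 0)."""
    return alpha @ x_past + beta @ eps_past
```

In practice the past errors are not observed and must be inferred, which is why the forecast in a fitted model depends on estimated rather than true error terms.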

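In this paper we consider a more general, recent stochastic extension of ARMA (abbreviated as sARMA) [14] that, in contrast to the deterministic equation (1), adds Gaussian noise to (1). Moreover, extra covariates $\mathbf{z}_t$ (called *cross predictors*) are assumed available at time $t$; typical examples are economic indicators, market indices, and the previous returns of other related assets. With $\mathbf{x}_{t-p:t-1}$ denoting the $p$ previous returns and $\boldsymbol{\epsilon}_{t-q:t-1}$ the $q$ previous error terms, the model can be written as

$$x_t \sim \mathcal{N}(\mu_t, \gamma^2), \tag{2}$$

where

$$\mu_t = \boldsymbol{\alpha}^\top \mathbf{x}_{t-p:t-1} + \boldsymbol{\beta}^\top \boldsymbol{\epsilon}_{t-q:t-1} + \epsilon_t + \boldsymbol{\eta}^\top \mathbf{z}_t. \tag{3}$$

Here $\boldsymbol{\eta}$ is the weight vector (model parameters) for the cross predictor.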

Hence, sARMA deals with noisy Gaussian observations (with variance $\gamma^2$), and it reduces exactly to the ARMA model in the limiting case $\gamma \to 0$. The noisy observation modeling of sARMA is beneficial in several respects: not only does it account for the underlying noise process in the observation, but the model also becomes fully stochastic, which allows principled probabilistic inference and model estimation even with missing data [14].

Given the observed sequence data, the parameters of the sARMA model can be estimated by the expectation maximization (EM) algorithm [15]. Compared to the traditional Levenberg-Marquardt method for ARMA model estimation [1], the EM algorithm handles latent variables (i.e., the error terms) as well as any missing observations in an efficient and principled way. However, both estimators essentially aim at data likelihood maximization (ML) under the Gaussian model (2) [14] (with vanishing observation noise variance corresponding to ARMA).

Due to the Gaussian observation modeling in sARMA, the ML estimation inherently aims to minimize a *symmetric* loss. In other words, letting $\hat{x}_t$ and $x_t$ be the model forecast and the true value at time $t$, respectively, an incorrect prediction with prediction error $e = |\hat{x}_t - x_t|$ incurs the same amount of loss for both $\hat{x}_t = x_t + e$ and $\hat{x}_t = x_t - e$ (i.e., regardless of over- or underestimation). This strategy is far from optimal, especially for asset return data, as argued in the following.
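To make the symmetry explicit, it may help to write out the negative log-density of a single Gaussian observation. The symbols $x_t$ (true value), $\hat{x}_t$ (forecast, taken as the Gaussian mean), and $\gamma^2$ (observation variance) are notation assumed here for illustration.

```latex
-\log \mathcal{N}\!\left(x_t;\, \hat{x}_t,\, \gamma^2\right)
  = \frac{(x_t - \hat{x}_t)^2}{2\gamma^2} + \frac{1}{2}\log\!\left(2\pi\gamma^2\right)
```

The forecast enters only through $(x_t - \hat{x}_t)^2$, so a forecast of $x_t + e$ and a forecast of $x_t - e$ receive exactly the same penalty, whatever the sign of $x_t$.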

The main goal is to maximize profit through accurate forecasting of asset returns, whose signs (directions) determine profits. The traditional maximum likelihood (ML) estimator minimizes a loss function that is inherently symmetric and hence cannot exploit this property of asset return data, leading to a suboptimal solution.

Suppose that our data form a sequence of daily stock log-returns, encoded as $x_t > 0$ ($< 0$) indicating that the stock price moves up (declines) on the $t$th day relative to the previous day. Now, consider a portfolio selection algorithm that makes an investment based on the forecast $\hat{x}_t$ given the information up to time $t-1$. The investment yields positive revenue when the signs of $\hat{x}_t$ and $x_t$ are equal, and negative revenue otherwise. Hence, the prediction loss should be inherently asymmetric. Furthermore, when $x_t > 0$ and $\hat{x}_t > 0$, even an underestimation (i.e., $0 < \hat{x}_t < x_t$) should be penalized less than a prediction in the opposite direction (i.e., $\hat{x}_t < 0$), because the former does not incur any loss in revenue but the latter does.

To address this issue, we propose a reasonable cost function that effectively captures the above idea of the intrinsic asymmetric profit/loss structure regarding asset return data. Our cost function effectively encodes the goodness of matching in directions between true and model predicted asset returns, which is directly related to ultimate profits in the investment. We also provide an efficient optimization strategy based on the subgradient descent using the trust-region approximation, whose effectiveness is empirically demonstrated for the portfolio selection problem with real-world stock return data.

It is worth mentioning that several other asymmetric loss functions similar to ours have been proposed in the literature. However, existing loss models focus merely on asymmetry with respect to the ground-truth value. For instance, the linex function [16, 17] is defined as a linear-exponential function of the difference between the predicted and ground-truth values. The linlin method [18] adopts a piecewise linear function whose change point is simply the ground-truth value. To the best of our knowledge, we are the first to derive a loss based on matching the directions (signs) of the predicted and ground-truth returns. This enables incorporating the critical information about the directions of profits/losses, in turn leading to a more accurate forecasting model.

The rest of the paper is organized as follows. In the next section we present a novel sARMA estimation algorithm based on the cost-sensitive loss function: beginning with the overall objective, we derive the one-step predictor for the sARMA model in Section 2.1, provide details of the proposed cost function in Section 2.2, and state the optimization strategy in Section 2.3. The statistical inference algorithm for the sARMA model is derived in full in Section 2.4. In the empirical study in Section 3, we demonstrate the effectiveness of the proposed algorithm on the online portfolio selection problem with real data, where the proposed approach attains a significantly higher total profit than the investment based on the traditional ML-estimated sARMA model.

#### 2. Cost-Sensitive Estimation

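The proposed estimator for sARMA is based on the cost-sensitive loss of the model-predicted one-step forecast value (denoted by $\hat{x}_t$) at each time $t$ with respect to the true one (denoted by $x_t$) available from data. More specifically, for a given data sequence $x_1, \ldots, x_T$, we aim to solve the optimization problem

$$\min_{\theta} \; \sum_{t=1}^{T} c\big(x_t, \hat{x}_t(\theta)\big) + \lambda R(\theta). \tag{4}$$

Here $\theta$ denotes the sARMA model parameters, $c(x_t, \hat{x}_t)$ is the cost of predicting the asset return as $\hat{x}_t$ when the true value is $x_t$, and $R(\theta)$ is a parameter regularizer weighted by a constant $\lambda > 0$. In Section 2.2 we define a reasonable cost function that faithfully incorporates the idea of asymmetric cost-sensitive loss discussed in the introduction.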

In the objective, we also simultaneously minimize $R(\theta)$, a parameter regularizer that penalizes nonsmooth sARMA models in favor of smooth ones (effectively achieved by encouraging the regression parameters in $\theta$ to be close to $0$). Specifically, we use the L2 penalty. The constant $\lambda$ ($> 0$) trades off the regularization against the prediction error cost.

Note also that in (4) we write $\hat{x}_t(\theta)$ to emphasize the dependency of the model-predicted $\hat{x}_t$ on the parameters $\theta$. We use the principled maximum a posteriori (MAP) predictor estimated under the sARMA model, which is fully described in Section 2.1. The predictor is evaluated based on inference on the latent error terms, which can be computed recursively; detailed derivations for the inference are given in Section 2.4.

##### 2.1. One-Step Predictor for sARMA

Under the sARMA model, the predictive model at time , given all available information , is for . From this predictive model, one can make deterministic decision on the asset return at , typically as the maximum-a-posteriori (MAP) estimation:

Note that in the sARMA model, it is always assumed that we have at least $p$ previous observations and $q$ previous error terms. These error terms are simply assumed to be $0$ throughout the paper. Due to the linear Gaussianity of sARMA's local conditional densities, the predictive density is Gaussian, and the MAP predictor (5) coincides exactly with its mean.

In this section we derive the MAP (or mean) prediction as a function of the sARMA model parameters , which can then be used in gradient evaluation for the optimization in (4). As is shown, the predictive distributions heavily resort to the posterior distributions of the error terms, namely, for . They are also Gaussians, and we denote them by infor . Note that and have dimensions and , respectively. The full derivation of the error term posteriors is provided in Section 2.4.

In deriving , one may need to differentiate three cases for : (i) , (ii) , and (iii) . The first case simply forms the initial condition which immediately follows from the local conditional model with marginalization of . That is, when ,where we define for .

We distinguish the second and third cases for the following reason: at time , the previous error terms are* fully* included in the time window in the latter case, while they are* partially* included in the former. Hence in the second case, we additionally deal with the error terms which are always given as 0. Specifically, in the second case (), the terms are partitioned into (), and we haveIn (12), we let be the subvector of corresponding to .

In the third case (), we only need to deal with error terms , and the predictive density is derived as follows:In (14), we introduce and as submatrices of and taking the indices from to only.

In summary, the one-step predictor at with all available information can be written asNote here that the means of the error term posteriors (and their subvectors ) have also dependency on the model parameters .

##### 2.2. Proposed Cost Function

In this section we propose a cost function (used in (4)) that encodes the intrinsic asymmetric profit/loss structure of asset return data. To meet the motivating idea discussed in Section 1, we deal with two main cases: the true return $x_t$ is positive, or $x_t$ is negative. In each case, we further consider a margin $\delta$ (a small positive constant), where observing $x_t > \delta$ indicates a positive return with high certainty; on the other hand, $0 < x_t \le \delta$ indicates only weak positivity and might be considered noise. For negative returns, we have two analogous regimes of different certainty levels.

We discuss the first case, $x_t > \delta$, where $x_t$ is the true return, $\hat{x}_t$ the prediction, and $\delta$ the margin. Depending on the value of $\hat{x}_t$, the cost function changes over four intervals: (i) $\hat{x}_t < 0$ incurs the highest loss, with a superlinear penalty in the magnitude of $\hat{x}_t$ (we choose a convex quadratic function); (ii) $\hat{x}_t \ge x_t$, that is, overestimation, is penalized the least, and we opt for an increasing linear function with a small slope; (iii) $\delta \le \hat{x}_t < x_t$ is an underestimation, but the prediction has certainty greater than the margin and is thus penalized less (we choose a linear function with a slope slightly higher than in the second interval); and (iv) $0 \le \hat{x}_t < \delta$ predicts the correct direction, but because of its weak certainty below the margin, we penalize it more severely than in the previous two regimes.

Our specific cost definition for is as follows:where we choose the constants as follows: , , , . To make the cost function continuous, we define two offsets as follows: and .

In the case of , we exactly penalize the prediction in the same way as the first situation. Specifically, the cost definition for iswhere the same constants are used, and the offsets are now set as and .

For the uncertain (within the margin ) return, we still conform to the strategy of encouraging the same direction as the true return. In the case of , we assign small penalty for overestimation as long as it is in the correct direction, while rapidly growing quadratic loss for the prediction toward opposite direction. To summarize, the cost for iswhere we set (for continuity) . The other case of is similarly defined aswhere for continuity of the cost function.

Finally when , we can have a symmetric loss; for instance, .
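The piecewise structure described in this section can be sketched in Python as follows. This is only an illustration of the shape of such a cost: the constants (`delta`, `a`, `s_over`, `s_under`, `s_weak`) and the exact functional form of each piece are hypothetical choices, not the values or formulas used in the paper. The two offsets `off1` and `off2` play the role of the continuity offsets mentioned above.

```python
def directional_cost(x, x_hat, delta=0.001, a=100.0,
                     s_over=0.1, s_under=0.2, s_weak=1.0):
    """Asymmetric cost of predicting x_hat when the true return is x.

    All constants are illustrative assumptions. Wrong-direction predictions
    get a quadratic penalty; right-direction predictions get linear ones.
    """
    if x < 0:
        # Negative returns are handled by mirroring the positive case.
        return directional_cost(-x, -x_hat, delta, a, s_over, s_under, s_weak)
    if x == 0:
        return a * x_hat ** 2                      # symmetric loss
    if x <= delta:                                 # weakly positive return
        if x_hat >= 0:
            return s_over * abs(x_hat - x)         # small linear penalty
        return a * x_hat ** 2 + s_over * x         # offset keeps continuity at 0
    # Confidently positive return (x > delta): four intervals for x_hat.
    off2 = s_under * (x - delta)                   # continuity at x_hat = delta
    off1 = off2 + s_weak * delta                   # continuity at x_hat = 0
    if x_hat >= x:
        return s_over * (x_hat - x)                # overestimation: cheapest
    if x_hat >= delta:
        return s_under * (x - x_hat)               # confident underestimation
    if x_hat >= 0:
        return s_weak * (delta - x_hat) + off2     # right sign, weak certainty
    return a * x_hat ** 2 + off1                   # wrong direction: quadratic
```

With these placeholder constants, a true return of 0.02 predicted as 0.05 costs far less than the same return predicted as -0.01, even though both forecasts miss by 0.03.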

##### 2.3. Optimization Strategy

In this section we briefly describe the optimization strategy for (4). We basically follow the subgradient descent [19, 20] where the derivative of the cost function with respect to can be derived asHere, due to the nondifferentiability of the cost function (albeit continuous), we use the* subgradient* in place of the second part of RHS of (21).

Evaluating the first part, that is, the gradient of the predictor $\hat{x}_t(\theta)$ with respect to $\theta$, requires further effort. According to the functional form of the predictor in (16), it has a complex recursive dependency on $\theta$, mainly through the error posterior means. Instead of computing the exact derivative, we evaluate an approximate gradient by treating the posterior means as constants (evaluated at the current iterate). As a consequence, the predictor becomes a linear function of $\theta$, and the gradient can be computed easily. However, the approximation (i.e., posterior means constant with respect to $\theta$) is valid only in the vicinity of the current iterate. Hence, to reduce the approximation error, we restrict the search so that the next iterate lies within a small-radius ball centered at the current iterate. Our optimization strategy is closely related to the trust-region method [21], where the objective is approximated in the vicinity of the current parameters.
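One way to picture a single iteration of this scheme is sketched below in Python. It assumes the predictor has already been linearized as `features @ theta`, where each row of `features` holds the regressors for one time step (past returns, error posterior means, cross predictors) frozen at the current iterate. The stand-in loss `linlin_subgrad`, the step size, and the radius are hypothetical; the paper's own cost function and constants are not reproduced here.

```python
import numpy as np

def linlin_subgrad(x, x_hat, s_over=0.1, s_under=0.2):
    """Subgradient w.r.t. x_hat of a stand-in asymmetric absolute loss."""
    return s_over if x_hat > x else -s_under

def trust_region_step(theta, features, targets, lam=0.01, lr=1.0, radius=0.05):
    """One subgradient step on sum_t c(x_t, phi_t . theta) + lam * ||theta||^2.

    The move is clipped so the new iterate stays within `radius` of theta,
    where the linearized predictor is assumed to remain accurate.
    """
    preds = features @ theta
    grad = 2.0 * lam * theta                       # gradient of the L2 penalty
    for phi, x, x_hat in zip(features, targets, preds):
        grad = grad + linlin_subgrad(x, x_hat) * phi   # chain rule through phi . theta
    step = -lr * grad
    norm = np.linalg.norm(step)
    if norm > radius:
        step = step * (radius / norm)              # project onto the trust region
    return theta + step
```

In a full loop, the frozen regressors would be recomputed by rerunning inference at the new iterate before taking the next step.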

##### 2.4. Inference in sARMA

In this section we give full derivations for statistical inference on the latent error variables for each , conditioned on the historic observations and the cross predictors in the sARMA model. That is, the posterior densities for are fully derived. In essence, these are all Gaussians, and as denoted in (6), we find the recursive formulas for the means and covariances (). We also denote the inverse covariance by .
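For reference, the standard identity for the product of two Gaussian densities over the same variable is stated below in precision (inverse covariance) form. The symbols $\mathbf{m}_1, \mathbf{m}_2, \boldsymbol{\Lambda}_1, \boldsymbol{\Lambda}_2$ are generic notation assumed here, not the paper's.

```latex
\mathcal{N}\!\left(\boldsymbol{\epsilon};\, \mathbf{m}_1,\, \boldsymbol{\Lambda}_1^{-1}\right)
\mathcal{N}\!\left(\boldsymbol{\epsilon};\, \mathbf{m}_2,\, \boldsymbol{\Lambda}_2^{-1}\right)
\;\propto\;
\mathcal{N}\!\left(\boldsymbol{\epsilon};\, \mathbf{m},\, \boldsymbol{\Lambda}^{-1}\right),
\qquad
\boldsymbol{\Lambda} = \boldsymbol{\Lambda}_1 + \boldsymbol{\Lambda}_2,
\quad
\mathbf{m} = \boldsymbol{\Lambda}^{-1}\left(\boldsymbol{\Lambda}_1 \mathbf{m}_1 + \boldsymbol{\Lambda}_2 \mathbf{m}_2\right)
```

Because precisions add under multiplication, tracking the inverse covariance alongside the mean keeps recursive posterior updates of this kind simple.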

Similarly as one-step predictive distributions, we consider three cases: (i) initial , (ii) where fully contains what we need to infer, that is, , and (iii) where we have to infer three groups of variables . The initial case () is straightforwardly derived as follows:In (23), as before, and we use . The theorem of product of two Gaussians is applied to yield (24) from (23). This forms the initial posterior mean and inverse covariance as follows:

We next deal with the second case; that is, . We partition into three parts: , , and . The parameter vector for is accordingly divided into subvectors (for ) and (for ). We only need to infer ; thus and and the conditional density can be derived as follows:

To derive and , we rearrange the exponent of (28) as a canonical quadratic form in terms of . It is not difficult to have the following formulas after some algebra:where

Finally, for the third case (), the variables to be inferred (i.e., ) are partitioned into three groups of variables: , , and . Here, and , when concatenated, yield a vector of the same dimension as , and we partition accordingly as and . Similarly, is partitioned into blocks, and we denote them by for . The posterior can then be written as

Similar to the second case, we derive and by rearranging the exponent of (33) as a canonical quadratic form in terms of . The resulting formulas are as follows:where

#### 3. Empirical Study

In this section we empirically test the effectiveness of the proposed sARMA estimation method. In particular we deal with the task of portfolio selection on the real-world dataset comprising daily closing prices from Dow Jones Industrial Average (DJIA).

We begin with a brief description of the online portfolio selection (OLPS) problem. Assume there are $N$ different stocks to invest in on a daily basis. At the beginning of day $t$, the historic closing stock prices up to day $t-1$, denoted by $\mathbf{p}_0, \ldots, \mathbf{p}_{t-1}$, are available, where $\mathbf{p}_s$ is an $N$-dim vector whose $i$th element $p_s(i)$ is the price of the $i$th ticker. Using this information, one decides the portfolio allocation vector $\mathbf{b}_t$, a nonnegative $N$-dim vector that sums to 1 (i.e., $\sum_{i=1}^{N} b_t(i) = 1$). Assuming no short positions are allowed, $b_t(i)$ is the proportion of the whole budget to be invested in the $i$th stock for $i = 1, \ldots, N$.

The portfolio strategy is thus a function that maps the historic market information (say ) to the price prediction . The sARMA-based portfolio strategy can be built by estimating sARMA models, one for each stock ticker , for the stock log-return data; namely, (here, we drop the dependency on for simplicity). Then the predicted can be used to decide the proportion of the budget to be invested in the th ticker at time . A reasonable strategy is to make no investment (i.e., ) if , while forcing to be proportional to if .

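To evaluate the performance of a portfolio strategy, we use the popular (running) relative cumulative wealth (RCW), defined as $\mathrm{RCW}_t = W_t / W_0$, where $W_t$ is the total budget at time $t$. Thus $\mathrm{RCW}_t$ indicates the total budget return at time $t$ relative to the initial budget, and a portfolio strategy that yields high $\mathrm{RCW}_t$ for many epochs $t$ is regarded as a good strategy. Assuming that there is no transaction cost, it is not difficult to see that $\mathrm{RCW}_t = \prod_{s=1}^{t} \mathbf{b}_s^\top \mathbf{r}_s$, where $\mathbf{b}_s$ is the portfolio allocation vector on day $s$ and we define the price relative vector $\mathbf{r}_s = \mathbf{p}_s / \mathbf{p}_{s-1}$ (element-wise division of the closing price vectors).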
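As a small illustration of the wealth computation, the following Python sketch evaluates the running RCW from a price matrix and a sequence of allocation vectors, assuming no transaction costs. The function name `running_rcw` and the array layout (one row per day, one column per ticker) are assumptions made for this example.

```python
import numpy as np

def running_rcw(prices, portfolios):
    """Running relative cumulative wealth under zero transaction cost.

    prices:     array of shape (T + 1, N), closing prices for days 0..T
    portfolios: array of shape (T, N), allocation chosen for days 1..T
    Returns an array of length T with the wealth relative to the initial budget.
    """
    prices = np.asarray(prices, dtype=float)
    portfolios = np.asarray(portfolios, dtype=float)
    relatives = prices[1:] / prices[:-1]                  # price relative vectors
    daily_growth = np.sum(portfolios * relatives, axis=1) # b_s . r_s for each day
    return np.cumprod(daily_growth)
```

For example, holding only the first stock while it doubles and then only the second stock while it doubles gives a running RCW of 2 and then 4.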

Hence, in the sARMA-based strategy, it is crucial to forecast the returns accurately, and we compare the model estimated with our cost-sensitive loss against the one estimated by traditional ML. For each approach, we estimate $N$ sARMA models, one for each stock return, and once the predicted returns $\hat{x}_t(i)$ at time $t$ are obtained, the allocations $b_t(i)$ are decided as follows: $b_t(i) = 0$ if $\hat{x}_t(i) \le 0$, and $b_t(i) = 1/N_t^{+}$ otherwise, where $N_t^{+}$ is the number of $i$'s with $\hat{x}_t(i) > 0$. We also contrast them with the fairly standard market portfolio strategy, which sets $b_t(i)$ proportional to the total market capitalization (i.e., the product of the price and the total number of shares) of ticker $i$.
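The allocation rule can be sketched in Python as below, under the reading that the budget is split equally among the tickers with a positive predicted return. The function name `allocate` is hypothetical, and returning an all-zero vector (no investment) when no prediction is positive is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def allocate(pred_returns):
    """Equal-weight allocation over tickers with positive predicted return."""
    pred = np.asarray(pred_returns, dtype=float)
    positive = pred > 0
    n_pos = int(positive.sum())
    b = np.zeros_like(pred)
    if n_pos > 0:
        b[positive] = 1.0 / n_pos    # split the budget equally
    return b                         # all zeros if nothing looks profitable
```

Because only the sign of each prediction affects this allocation, the rule rewards directional accuracy rather than accuracy in magnitude.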

We test the above-mentioned three portfolio strategies on the real-world data, the 30 tickers’ daily closing prices from Dow Jones Industrial Average (DJIA) for about 15 months beginning on January 14, 2001, which amounts to about 340 daily records. The dataset is available publicly (http://www.mysmu.edu.sg/faculty/chhoi/olps/datasets.html, http://www.cs.technion.ac.il/~rani/portfolios), and the detailed description can be found in [22]. The stock tickers appear to be considerably correlated with one another and include GE, Microsoft, AMEX, GM, COCA-COLA, and Intel.

In the sARMA estimation, we set , and the cross predictors are defined to be the returns of the other 29 stocks at day . The parameter in our cost-sensitive estimation is empirically chosen. First, the average costs attained, that is, , which are further averaged over different models, are 114.0994 for the ML-estimated sARMA and 0.0030 for the proposed cost-sensitive sARMA. This implies that the proposed estimation method yields a far more accurate prediction performance than the traditional ML method in terms of the proposed cost function.

Next, we depict the running RCW scores for the three competing portfolio strategies in Figure 1. As shown, the proposed approach (sARMA-cost) achieves the highest profits consistently for almost all $t$ during the time horizon, significantly outperforming the market strategy. The ML-based sARMA estimator performs the worst, which can be explained by its attempt to fit a model to the overall data without accounting for the asymmetric loss structure of asset return data, especially regarding the directions of return predictions. At the end of the horizon, the proposed method gives a positive return (i.e., RCW $> 1$), whereas the other two methods suffer from substantial budget loss (RCW $< 1$). This again signifies the effectiveness of the cost-sensitive loss minimization in the return prediction.

#### 4. Conclusion

In this paper we have introduced a novel ARMA model identification method that exploits the asymmetric loss structure of financial asset return data. The proposed cost function effectively encodes the goodness of matching in directions between true and model-predicted asset returns, which is directly related to ultimate profits in the investment. We have provided a subgradient-based optimization using the trust-region approximation, which has been empirically shown to work well for the portfolio selection problem in a real-world setting.

#### Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This study is supported by National Research Foundation of Korea (NRF-2013R1A1A1076101).