#### Abstract

Many researchers have tried to optimize pairs trading as the number of opportunities for arbitrage profit has gradually decreased. Pairs trading is a market-neutral strategy; it profits if the given condition is satisfied within a given trading window, and if not, there is a risk of loss. In this study, we propose an optimized pairs-trading strategy using deep reinforcement learning, in particular the deep Q-network (DQN), with various trading and stop-loss boundaries. More specifically, if the spread hits a trading threshold and reverts to the mean, the agent receives a positive reward. However, if the spread hits a stop-loss threshold or fails to revert to the mean after hitting the trading threshold, the agent receives a negative reward. The agent is trained to select the optimum level of discretized trading and stop-loss boundaries given a spread to maximize the expected sum of discounted future profits. Pairs are selected from stocks on the S&P 500 Index using a cointegration test. We compare our proposed method with traditional pairs-trading strategies that use constant trading and stop-loss boundaries. We find that our proposed model is trained well and outperforms traditional pairs-trading strategies.

#### 1. Introduction

Pairs trading is a method for obtaining arbitrage profit when there is a statistical difference between the prices of two stocks with similar characteristics that are cointegrated or highly correlated. This is possible because the spread formed by the two stocks tends to revert to its mean in the long run [1]. In the early days, pairs-trading methods were popular because of the opportunity to obtain arbitrage profit [1–4]. However, as many investors, including hedge funds, sought these arbitrage opportunities by executing the pairs-trading strategy, its profitability began to deteriorate [5, 6]. To overcome this shortcoming, significant research has been conducted to improve the pairs-trading strategy [7–10].

The mechanism of pairs trading is as follows. First, a pair of stocks with similar trends is identified. Second, a regression method such as ordinary least squares (OLS), total least squares (TLS), or an error correction model (ECM) is used to calculate the spread of these stocks. Finally, if the spread hits a preset boundary, investors open a portfolio that takes a long position in the undervalued stock and a short position in the overvalued stock. Subsequently, if the spread reverts to the mean, investors close the portfolio by taking the positions opposite to those opened, and thereby obtain an arbitrage profit. However, there is a risk that the spread does not revert to the mean. In such a situation, investors are at high risk because they cannot close the portfolio. By setting a stop-loss boundary, investors can limit this risk [11–13].

Many researchers have applied various statistical methods to improve the efficiency and performance of pairs trading. In particular, they focused on using the spread as a trading signal. The study in [1] formed pairs of stocks by minimizing the sum of squared deviations between the two stocks and then executed the trading strategy when the difference between the pair reached twice the standard deviation of the spread. They used normalized US stock price data from 1962 to 2002 to test the profitability of pairs trading. The study in [14] used the cointegration approach to protect the pairs-trading strategy from severe losses. They applied an OLS method to create a spread and set various conditions that translated into trading actions. With these models, they obtained a trading strategy with a minimum level of profit that is protected from the risk of loss. The results showed about an 11% annualized excess return over the entire period. The research in [15] compared the distance and cointegration approaches on both high-frequency and daily datasets to check whether pairs trading is profitable for Norwegian seafood companies. The performance of the two approaches was similar. Reference [16] used a Kalman filter to calculate the spread, which was then used as a high-frequency trading signal, on the shares constituting the KOSPI 100 Index. The author found that the pairs-trading strategy's performance was significant on the KOSPI and was better around the daily market opening and closing. Moreover, [7] formulated a pairs-trading system as a stochastic control problem. They used the Ornstein-Uhlenbeck process to calculate the spread as a trading signal and tested their model with simulated data; the results showed that their strategy performs well. In addition, [17] suggested using the Ornstein-Uhlenbeck process to model market microstructure noise, which is used as a trading signal in the pairs-trading strategy. The performance under this method is better than under traditional estimators such as ARIMA(1,1) and maximum likelihood. Reference [18] applied a cointegration method to Chinese commodity futures from 2006 to 2016 to check whether pairs trading was suitable in that market. They used OLS regression to create spreads from the pairs. Furthermore, [10] applied a cointegration test to assorted pairs of stocks and a vector error-correction model to create a trading signal.

It is important to set a boundary to optimize the pairs-trading strategy. This boundary is the criterion for deciding whether to execute a pairs-trading strategy. If a low boundary is set, many trades will be executed, but the profit per trade will be lower; if a high boundary is set, investors will get high returns when the strategy is executed. However, all this assumes that mean reversion occurs. If the spread does not return to the average within the specified trading window, losses will be incurred. If a low boundary is set, the loss will be small; if the strategy is executed with a high boundary, the loss will be larger. Therefore, the performance of pairs trading depends on how the boundary is set. Reference [14] suggested imposing a minimum-profit condition, which could be efficient in reducing losses in a pairs-trading system. They set trading rules with diverse open conditions: for example, opening when the spread is above 0.3, 0.5, 0.75, 1.0, or 1.5 standard deviations. They used the daily closing prices from January 2, 2001, to August 30, 2002, of two stocks, the Australia New Zealand Bank and the Adelaide Bank. The results showed that, as the open condition value decreases, the number of trades and the profits increase. Reference [19] also suggested optimal preset boundaries calculated from estimated parameters for the average trade duration, intertrade interval, and number of trades and used them to maximize the minimum total profit. They used the daily closing price data from January 2, 2004, to June 30, 2005, of seven pairs of stocks on the Australian Stock Exchange. The results showed that their proposed method was efficient in making profits using the pairs-trading strategy. Reference [18] examined whether the pairs-trading strategy could be applied to the daily returns of Chinese commodity futures from 2006 to 2016 using three methods: classical, closed-loop, and dynamic stop-loss. The closed-loop method uses only a stop-profit barrier to execute the strategy and does not consider the risk that arises when spreads do not revert to the mean. The classical method adds stop-loss boundaries to the closed-loop method. The dynamic stop-loss method uses a variety of stop-profit and stop-loss barriers to fit the spreads if the spread is larger than the standard deviation, which is set using criteria based on the historical average of spreads. The results showed that these methods obtained an annualized return of over 15%; the closed-loop method yielded the highest profit, 26.94%. In addition, [20] experimented with fixed optimal threshold selection, conditional volatility, percentile, spectral analysis, and neural network thresholds in the pairs-trading strategy. Of these, the neural network threshold outperformed all other strategies.

Following the success of reinforcement learning, demonstrated by its performance at Atari games [21], many researchers have attempted to apply this algorithm to financial trading systems. Reference [22] proposed a deep Q-trading system that applies Q-learning to trade automatically. They set a delta price using data from the past 120 days, used three discrete actions (buy, hold, and sell), and used long-term profit as the reward. They used daily data from January 01, 2001, to December 31, 2015, of the Hang Seng Index and the S&P 500 Index. The experimental results showed that their proposed method outperformed buy-and-hold strategies and recurrent reinforcement learning methods. Reference [23] proposed three steps to apply reinforcement learning to the financial trading system. First, they reduced relative replay size to fit financial trading. Second, they proposed an action-augmentation technique that provides more feedback from the action to the agent. Third, they used long sequences as reinforcement data to conduct recurrent neural network training. The experimental data comprised tick-by-tick data of 12 forex currency pairs from January 2012 to December 2017. The results showed that the action-augmentation technique yielded more profit than an epsilon-greedy policy. Reference [10] used an N-armed bandit problem to optimize the pairs-trading strategy. They computed the spread using an error-correction model and found the parameters using a grid-search algorithm. They compared their proposed model with a constant-parameter model, which was similar to a traditional pairs-trading strategy. They used intraday one-minute data of some stocks in the FactSet database from June 2015 to January 2016. The performance of their proposed model was better than that of the constant-parameter model.

We investigate not only whether a dynamic boundary based on the spread in each trading window can achieve higher profit than the fixed boundary used in the traditional pairs-trading strategy but also whether deep reinforcement learning methods can be trained to follow this mechanism. To this end, we propose a new method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks (DQN), since the pairs-trading strategy can be thought of as a game. After a portfolio position is opened, the profit depends on whether the portfolio is closed, reaches the stop-loss boundary, or is exited at the end of the trading window. Therefore, if we frame this strategy as a game in which the boundaries are optimized for the spread in each trading window, we can achieve more profit than traditional pairs-trading strategies. In particular, we set the pairs-trading system to be a kind of game and obtain the optimal boundaries, that is, the trading and stop-loss thresholds, according to the calculated spread. The reason for this construction is that a portfolio that is opened and then closed within the trading window is always profitable, whereas a portfolio that reaches the stop-loss boundary or does not converge to the mean may incur a loss. We therefore train the DQN by rewarding it positively when the portfolio is closed and negatively when it reaches the stop-loss or exit thresholds. We conducted the following experiments to verify that our proposed method improves on the conventional method. First, we used different spreads calculated using OLS and TLS to see how the results differ depending on the spread used as input. Second, the spread and hedge ratio vary with the formation window and trading window. We therefore tested a total of six window sizes to select the one with the best performance.
Finally, we compared the proposed method with the traditional pairs-trading strategy using the test data with the optimal window size. In this experiment, we use the daily adjusted closing prices from January 2, 1990, to July 31, 2018, of 50 stocks in the S&P 500 Index. Experimental results show that our proposed method outperforms the traditional pairs-trading strategy across all the pairs. In addition, we can confirm that the performance measure varies according to the spread.

The main contributions of this study are as follows. First, we propose a novel method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks with trading and stop-loss boundaries. The experimental results show that our method can be applied in the pairs-trading system and also in various other fields, including finance and economics, when there is a need to make a rule-based strategy more efficient. Second, we propose an optimized dynamic boundary based on the spread in each trading window. Our proposed method outperforms the traditional pairs-trading strategy, which sets a fixed boundary. Last, we find that our method outperforms the traditional pairs-trading strategy in all pairs formed from constituent stocks of the S&P 500. Since our method selects optimal boundaries based on spreads, it can be applied to other stock markets such as the KOSPI, Nikkei, and Hang Seng. It should be noted that the present work is part of a master's thesis [24].

The rest of this paper is organized as follows. Section 2 explains the technical background. Section 3 describes the materials and methods. Section 4 presents and discusses the experimental results. Section 5 concludes the study.

#### 2. Technical Background

##### 2.1. The Traditional Pairs-Trading Strategy

Pairs trading is a representative market-neutral trading strategy which simultaneously longs an undervalued stock and shorts an overvalued stock. This strategy is a form of statistical arbitrage trading that assumes the movements of the prices of the two assets will be similar to previous trends [1]. It follows the assumption that asset prices will return to the long-term equilibrium. This strategy started from the idea that arbitrage opportunities exist when the price gap between two assets expands to or past a certain level. It is also based on the belief that historical price movements will not change significantly in the future.

In Figure 1, the blue line is the spread of two cointegrated stocks, the red lines are the trading boundaries, and the green lines are the stop-loss boundaries. When the spread reaches a trading boundary, the portfolio is opened, and it is closed only when the spread returns to the average. However, losses are incurred when the spread reaches a stop-loss boundary after the portfolio is opened and does not return to the average. Furthermore, if the trading signal does not revert to the mean during the trading window after the portfolio is opened, the portfolio is closed by force; this is called the exit position of the portfolio.
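The open, close, stop-loss, and exit outcomes described above can be illustrated with a minimal sketch. It assumes the spread is already expressed as a Z-score with mean zero and that the boundaries are symmetric around the mean; the function name `run_trading_window` and the outcome labels are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def run_trading_window(zscores: List[float], trade_b: float, stop_b: float) -> Tuple[str, int]:
    """Classify one trading window of spread Z-scores.

    Returns (outcome, index), where outcome is "none" (never opened),
    "close" (reverted to the mean), "stop_loss", or "exit" (forced close
    at the end of the window).
    """
    side = 0  # 0 = flat, +1 = opened on a high spread, -1 = opened on a low spread
    for i, z in enumerate(zscores):
        if side == 0:
            # Open when the spread touches a trading boundary.
            if z >= trade_b:
                side = 1
            elif z <= -trade_b:
                side = -1
        else:
            # side * z is the deviation measured in the direction of the open.
            if side * z >= stop_b:
                return "stop_loss", i
            if side * z <= 0.0:
                return "close", i
    last = len(zscores) - 1
    return ("exit", last) if side != 0 else ("none", last)


print(run_trading_window([0.2, 1.1, 0.6, -0.1], trade_b=1.0, stop_b=2.0))
print(run_trading_window([0.0, -1.2, -2.5], trade_b=1.0, stop_b=2.0))
```

The sketch ignores prices and position sizes; it only shows how a single pair of boundaries partitions a window into the four outcomes.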

###### 2.1.1. The Cointegration Test

There are many approaches for pair selection, such as the distance approach [11, 25–27], the cointegration approach [10, 16, 27], and the stochastic approach [7, 8]. In this study, we use the cointegration approach to choose pairs which have a long-term equilibrium. Generally, a linear combination of nonstationary variables is also nonstationary. Assume that $x_t$ and $y_t$ have unit roots; as previously mentioned, a linear combination of these variables is in general nonstationary. However, the combination can be stationary if the nonstationary variables are cointegrated. A regression between such variables must therefore be checked to determine whether it is spurious or reflects cointegration. Johansen's method is widely used to test for cointegration [28]. In this method, the number of cointegration relations and the parameters of the model are estimated and tested using maximum likelihood estimation (MLE). Since all variables are regarded as endogenous, there is no need to select a dependent variable, and multiple cointegration relationships can be identified. In addition, MLE is used to estimate the cointegration relation within a vector autoregression model and to determine the cointegration coefficient based on the likelihood-ratio test. An advantage of this method is therefore that, beyond merely testing for cointegration, it supports various hypothesis tests on the estimated cointegration parameters and on other model settings when cointegration is present.

##### 2.2. Spread Calculation

###### 2.2.1. Ordinary Least Squares

In regression analysis, OLS is widely used to estimate parameters by minimizing the sum of the squared errors [29]. Assume that $x_t$, $y_t$, and $\varepsilon_t$ are an independent variable, a dependent variable, and an error term, related by $y_t = \alpha + \beta x_t + \varepsilon_t$. We can estimate $\beta$ by setting the partial derivatives of the sum of squared errors to zero, which gives

$$\hat{\beta} = \frac{\sum_{t}(x_t - \bar{x})(y_t - \bar{y})}{\sum_{t}(x_t - \bar{x})^{2}}. \tag{5}$$

The value obtained from equation (5) is used for the number of stock orders. The epsilon value is also used as a trading signal after Z-scoring, within the state formed over the formation window.
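As a minimal sketch of the OLS step, the closed-form least-squares slope and intercept can be computed directly, with the residuals kept for later use as the trading signal. The function name `ols_fit` and the toy data are illustrative assumptions; this is the textbook simple-regression estimator, not the authors' code.

```python
from typing import List, Tuple


def ols_fit(x: List[float], y: List[float]) -> Tuple[float, float, List[float]]:
    """Fit y = alpha + beta * x + eps by ordinary least squares.

    Returns (alpha, beta, residuals); beta plays the role of the hedge ratio.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = s_xy / s_xx            # slope that minimizes the squared vertical errors
    alpha = mean_y - beta * mean_x
    residuals = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    return alpha, beta, residuals


alpha, beta, eps = ols_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(alpha, beta)
```

Because OLS attributes all error to the dependent variable, regressing `x` on `y` instead generally gives a slope that is not the reciprocal of `beta`.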

###### 2.2.2. Total Least Squares

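TLS estimates parameters by minimizing the sum of squared distances between the observations and the regression line measured in both the horizontal and vertical directions, that is, the perpendicular distance [30]. Since this perpendicular distance does not change when the X and Y coordinates are exchanged, the value of $\beta$ is calculated consistently. In the TLS method, the observed values of $x_t$ and $y_t$ have the following error terms:

$$x_t = x_t^{*} + \eta_t, \qquad y_t = y_t^{*} + \varepsilon_t,$$

where $x_t^{*}$ and $y_t^{*}$ are true values and $\eta_t$ and $\varepsilon_t$ are error terms following independent identical distributions. It is assumed that there is a linear relationship between the true values, $y_t^{*} = \alpha + \beta x_t^{*}$. For convenience, we represent the error variance ratio in equation (10):

$$\delta = \frac{\sigma_{\varepsilon}^{2}}{\sigma_{\eta}^{2}}. \tag{10}$$

The orthogonal regression estimator is calculated by minimizing the weighted sum of squared horizontal and vertical deviations in equation (11):

$$\min_{\alpha,\, \beta,\, x_t^{*}} \sum_{t} \left[ \frac{(y_t - \alpha - \beta x_t^{*})^{2}}{\sigma_{\varepsilon}^{2}} + \frac{(x_t - x_t^{*})^{2}}{\sigma_{\eta}^{2}} \right], \tag{11}$$

$$\hat{\beta} = \frac{s_{yy} - \delta s_{xx} + \sqrt{(s_{yy} - \delta s_{xx})^{2} + 4 \delta s_{xy}^{2}}}{2 s_{xy}}, \tag{12}$$

where $s_{xx}$, $s_{yy}$, and $s_{xy}$ are the sample variances and covariance of $x_t$ and $y_t$. The value obtained from equation (12) is used in the same way as that obtained from equation (5), and the epsilon value is also used as a trading signal after Z-scoring, within the state formed over the formation window.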
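A short sketch of the orthogonal-regression slope follows, assuming the standard Deming closed form with error variance ratio `delta` (the variance of the errors in `y` divided by that in `x`); `delta = 1.0` corresponds to orthogonal regression. The function name `tls_fit` and the sample data are hypothetical.

```python
import math
from typing import List, Tuple


def tls_fit(x: List[float], y: List[float], delta: float = 1.0) -> Tuple[float, float]:
    """Return (alpha, beta) for the errors-in-variables line y = alpha + beta * x.

    Requires a nonzero sample covariance between x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    diff = s_yy - delta * s_xx
    beta = (diff + math.sqrt(diff ** 2 + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)
    alpha = mean_y - beta * mean_x
    return alpha, beta


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
_, beta_xy = tls_fit(xs, ys)
_, beta_yx = tls_fit(ys, xs)
print(beta_xy * beta_yx)  # close to 1: the fit is the same line either way
```

The product check illustrates the property the text relies on: with `delta = 1.0`, exchanging the two series inverts the slope exactly, so the hedge ratio does not depend on which stock is treated as the reference.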

##### 2.3. Reinforcement Learning and the Deep Q-Network

The idea of reinforcement learning is to find an optimal policy which maximizes the expected sum of discounted future rewards [31]. These rewards come from selecting the optimal value of each action, called the optimal Q-value. Reinforcement learning solves problems defined as a Markov decision process (MDP), which consists of a tuple $(S, A, P, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma$ is a discount factor. In environment $\mathcal{E}$, the agent observes state $s_t$ at time $t$ and selects action $a_t$. As a result, the environment returns feedback to the agent in the form of reward $r_t$ and next state $s_{t+1}$. An action is selected by the action-value function $Q(s, a)$, which represents the expected sum of discounted future rewards. From this action-value function, we find an optimal action-value function $Q^{*}(s, a)$, following an optimal policy which maximizes the expected sum of discounted future rewards:

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi \right].$$

This optimal action-value function can be formulated as the Bellman equation:

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \right].$$

The DQN uses a nonlinear function approximator $Q(s, a; \theta)$ to estimate the action-value function. This network is trained by minimizing a sequence of loss functions $L_i(\theta_i)$, which changes with each iteration $i$. The weight $\theta_i$ is updated as the sequence progresses:

$$L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^{2} \right].$$
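To make the Bellman target and the squared-error loss concrete, here is a small sketch that computes both for a minibatch of transitions. The Q-function is passed in as a plain callable returning one value per action; the toy function `toy_q`, the transition layout, and the `done` flag are illustrative assumptions rather than details from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[Sequence[float], int, float, Sequence[float], bool]
QFunction = Callable[[Sequence[float]], List[float]]


def td_targets(batch: List[Transition], q_values: QFunction, gamma: float) -> List[float]:
    """Bellman targets: r for terminal transitions, r + gamma * max_a' Q(s', a') otherwise."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_values(next_state)))
    return targets


def mse_loss(batch: List[Transition], targets: List[float], q_values: QFunction) -> float:
    """Mean squared difference between each target and Q(s, a) for the action taken."""
    errors = [(y - q_values(s)[a]) ** 2 for (s, a, _r, _ns, _d), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def toy_q(state: Sequence[float]) -> List[float]:
    # Hypothetical two-action Q-function used only to exercise the code.
    return [state[0], 2.0 * state[0]]


batch = [([1.0], 0, 1.0, [2.0], False), ([1.0], 1, -1.0, [0.0], True)]
ys = td_targets(batch, toy_q, gamma=0.9)
print(ys, mse_loss(batch, ys, toy_q))
```

In a DQN, `q_values` would be the network's forward pass and the loss would be minimized by gradient descent on its weights; the sketch only evaluates the quantities involved.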

#### 3. Materials and Methods

##### 3.1. Data

In this study, 50 stocks from the S&P 500 Index were selected based on their trading volume and market capitalization. To carry out the experiment, the data must cover the same period. Therefore, only the stocks with data covering the same period were retained, leaving a total of 25 stocks. Table 1 lists the stock names, their abbreviations, and their respective sectors. We collected the adjusted daily closing prices using Thomson Reuters' database. The training dataset covers the period from January 2, 1990, to December 31, 2008, comprising 4792 data points; the test dataset covers the period from January 2, 2009, to July 31, 2018, comprising 2411 data points. From these datasets, pairs of stocks are selected over the training period using the cointegration test.

##### 3.2. Selecting Pairs Using the Cointegration Test

It is necessary to pair stocks which have long-run statistical relationships or similar price movements. It is possible to determine the degree to which two stocks have had similar price movements through the correlation value. Furthermore, the long-term equilibrium of a pair of stocks is an important characteristic for the execution of pairs trading. In this study, we used the cointegration approach to select pairs of stocks. Through Johansen’s method, we selected 11 pairs of stocks that have long-run equilibria. Table 2 shows the resulting pairs of stocks that were identified based on t-statistics and Figure 2 shows price movements of the cointegrated stocks XOM and CVX. Using this dataset, we will verify whether our proposed method has better performance than the traditional pairs-trading method.

##### 3.3. Trading Signal

After selecting the pairs, it is necessary to extract the signal for trading. To extract signals, we use the OLS or TLS method. First, because stock prices follow a random walk [32], we verify through the augmented Dickey-Fuller test that each price series is integrated of order one, I(1). Subsequently, a stationary I(0) process is created by taking the logarithmic difference of the stock prices, which is then passed to the OLS and TLS methods. In equation (18), the regression of one log-differenced price series on the other has a constant term, a hedge ratio $\beta$ (which is used as the trading size), and an error term $\varepsilon_t$. We convert the values of $\varepsilon_t$ into a Z-score, which is used as the trading signal. For example, if the trading signal reaches the threshold, we short one share of the overvalued stock and long $\beta$ shares of the undervalued stock. The hedge ratio depends on the window size. We set a total of six discrete window sizes to find the optimal window size for the experiment. Each trading window is half the length of its formation window. The spread obtained here is used as the state when applying reinforcement learning (i.e., as the input of the DQN).
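The conversion of regression residuals into a Z-score signal over the formation window can be sketched as follows. The helper names `zscore_signal` and `trading_window_size` are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not state which estimator it uses.

```python
from statistics import mean, pstdev
from typing import List


def zscore_signal(residuals: List[float], formation_window: int) -> List[float]:
    """Standardize the last `formation_window` residuals to zero mean and unit variance."""
    if len(residuals) < formation_window:
        raise ValueError("not enough residuals for the formation window")
    window = residuals[-formation_window:]
    mu = mean(window)
    sigma = pstdev(window)
    if sigma == 0:
        raise ValueError("residuals are constant; Z-score is undefined")
    return [(e - mu) / sigma for e in window]


def trading_window_size(formation_window: int) -> int:
    """The trading window is half of the formation window."""
    return formation_window // 2


state = zscore_signal([1.0, 2.0, 3.0, 4.0, 5.0], formation_window=5)
print(state)
print(trading_window_size(30))  # 15
```

The returned list corresponds to the state fed to the agent; its length equals the formation-window size.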

##### 3.4. Proposed Method: Optimized Pairs-Trading Strategy Using the DQN Method

In this study, we optimize the pairs-trading strategy by framing it as a type of game and applying the DQN. We attempt to implement an optimal pairs-trading strategy by selecting the trading and stop-loss boundaries that best fit the given spread, since performance in pairs trading depends on how these boundaries are set [14]. Figure 3 shows the mechanism of our proposed pairs-trading strategy. Through the cointegration test, we identify pairs and, using regression analysis, obtain a hedge ratio used as the trading volume and a spread used as the trading signal and state. The DQN has two hidden layers, and the number of neurons, chosen through trial and error, is half of the input size. The action space consists of the six discrete actions in Table 3. Each action corresponds to a pair of values for the trading and stop-loss boundaries.

A pairs-trading system can make a profit if the spread touches the trading threshold and returns to the average such that the portfolio is closed within the trading window. On the other hand, if the trading boundary is touched and the stop-loss boundary is then reached, the system tries to minimize losses by stopping the trade. If the spread touches the trading boundary but fails to return to the average, the portfolio is exited and may end up with a profit or a loss. In this study, the pairs-trading strategy is therefore considered as a kind of game; closing a portfolio yields a positive reward and a portfolio that reaches its stop-loss threshold yields a negative reward. Although an exited portfolio may generate a positive profit, there is also a possibility of a loss, so an exit is also set to yield a negative reward. We set the reward for the other conditions (such as maintaining the portfolio or not opening one) to zero so as to concentrate on the close, stop-loss, and exit positions. We fix the reward values of portfolio close, stop-loss, and exit to +1000, −1000, and −500, respectively. When we update the Q-values, the reward is a significant component of efficiently training the DQN. We therefore set the reward value to have a range similar to that of the Q-value. Additionally, we include the corresponding profit or loss value in the reward to reflect its weight after the trade ends. Equation (19) expresses this in terms of the stock orders of the two stocks and their prices at two points in time.
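The fixed part of the reward scheme can be written as a simple lookup. This sketch covers only the base values stated in the text (+1000, −1000, −500, and 0 otherwise); the additional profit-or-loss term of equation (19) is deliberately left out because its exact form is not recoverable here. The outcome labels and the name `base_reward` are hypothetical.

```python
# Base rewards for the three terminal outcomes of an opened portfolio.
BASE_REWARDS = {
    "close": 1000.0,       # spread reverted to the mean
    "stop_loss": -1000.0,  # spread reached the stop-loss boundary
    "exit": -500.0,        # forced close at the end of the trading window
}


def base_reward(outcome: str) -> float:
    """Return the fixed reward for an outcome; holding or not opening earns 0."""
    return BASE_REWARDS.get(outcome, 0.0)


for outcome in ("close", "stop_loss", "exit", "hold"):
    print(outcome, base_reward(outcome))
```

Because an exit is penalized half as much as a stop-loss, an agent trained on these values would be expected to prefer an exit over a stop-loss when it cannot achieve a close.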

Algorithm 1 shows the process of our proposed method. Before training, we set the replay memory and batch size and select pairs using the cointegration test. At each epoch, we initialize the total profit to 1.0. In the training scheme, we set a state consisting of the spreads within the formation window and select an action, which determines the trading and stop-loss boundaries. Throughout the trading window, we execute the traditional pairs-trading strategy using the selected action. After executing the strategy, we obtain a reward based on the outcome of the portfolio. Finally, for the Q-learning process, we update the Q-network by performing a gradient descent step.

Initialize replay memory $D$ and batch size

Initialize deep Q-network $Q$ with weights $\theta$

Select pairs using cointegration test

(1) For each epoch do

(2) Profit = 1.0

(3) For steps t = 1, … until end of training data set do

(4) Calculate spreads using OLS or TLS methods

(5) Obtain initial state $s_t$ by converting spread to Z-score based on formation window

(6) Using epsilon-greedy method, with probability $\epsilon$ select a random action $a_t$

(7) Otherwise select $a_t = \arg\max_{a} Q(s_t, a; \theta)$

(8) Execute traditional pairs-trading strategy based on the action selected

(9) Obtain reward $r_t$ by performing the pairs-trading strategy

(10) Set next state $s_{t+1}$

(11) Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$

(12) Sample minibatch of transitions $(s_j, a_j, r_j, s_{j+1})$ from $D$

(13) Set $y_j = r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta)$

(14) Update Q-network by performing a gradient descent step on $(y_j - Q(s_j, a_j; \theta))^{2}$

(15) End

(16) End
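Two building blocks of the algorithm, epsilon-greedy action selection and the replay memory, can be sketched with the standard library alone. The class and function names below are illustrative, and the capacity, batch size, and six-action Q-row are arbitrary values chosen for the example rather than settings from the paper.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

Transition = Tuple[Sequence[float], int, float, Sequence[float]]


def select_action(q_row: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


class ReplayMemory:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def store(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Sample without replacement, capped at the current buffer size.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self) -> int:
        return len(self.buffer)


rng = random.Random(0)
memory = ReplayMemory(capacity=3)
for step in range(5):
    action = select_action([0.1, 0.9, 0.3, 0.0, 0.0, 0.0], epsilon=0.1, rng=rng)
    memory.store(([float(step)], action, 0.0, [float(step + 1)]))
print(len(memory), len(memory.sample(2, rng)))
```

Sampling minibatches uniformly from stored transitions breaks the temporal correlation between consecutive trading windows, which is the usual motivation for a replay memory in DQN training.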

##### 3.5. Performance Measure

We evaluate our experimental results based on profit, maximum drawdown, and the Sharpe ratio. Profit is commonly used as a performance measure for trading strategies. It is calculated as the sum of returns net of trading cost. Since a large number of trades can increase total profit, the total profit must account for transaction costs, which depend on trading volume. In this study, we set a trading cost of 5 bp; equation (21) is almost the same as equation (19), but it does not include the absolute value and it includes a trading-cost term. Maximum drawdown represents the maximum cumulative loss from the highest to the lowest value of the portfolio during a given investment period:

$$\mathrm{MDD}(T) = \max_{\tau \in (0, T)} \left[ \max_{t \in (0, \tau)} X(t) - X(\tau) \right], \tag{22}$$

where $X(t)$ is the value of the portfolio and $T$ is the terminal time. The Sharpe ratio is an indicator of the degree of excess profit from investing in risky assets and is used in evaluating portfolios [33]:

$$\text{Sharpe ratio} = \frac{E[R_p] - R_f}{\sigma_p}. \tag{23}$$

In equation (23), $E[R_p]$ is the expected sum of portfolio returns, $R_f$ is the risk-free rate, which we set to 0, and $\sigma_p$ is the standard deviation of portfolio returns.
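Maximum drawdown and the Sharpe ratio can be sketched in a few lines. Two assumptions are made here that the text does not settle: drawdown is measured as an absolute drop in portfolio value (not a percentage of the peak), and the Sharpe ratio uses the mean and sample standard deviation of per-period returns. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List


def max_drawdown(values: List[float]) -> float:
    """Largest drop from a running peak to a later value of the portfolio."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)            # highest value seen so far
        worst = max(worst, peak - v)   # deepest fall below that peak
    return worst


def sharpe_ratio(returns: List[float], risk_free: float = 0.0) -> float:
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


portfolio = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0]
print(max_drawdown(portfolio))
print(sharpe_ratio([0.01, 0.03, 0.02]))
```

With the risk-free rate set to 0, as in the text, the Sharpe ratio reduces to the mean return divided by its standard deviation.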

#### 4. Results and Discussion

We use the stock pair XOM and CVX, which rejects the null hypothesis at the 1% significance level, to verify whether our proposed model is trained well. The lengths of the window sizes such as the formation window and trading window are selected from the performance results with the training dataset. From these results, we select an optimized window size and compare our proposed model with traditional pairs trading, which takes a constant set of actions with the test dataset.

##### 4.1. Training Results

To find the optimum window size for the optimized pairs-trading system, we experimented with six window sizes; the results for each window size are calculated by averaging the top-5 results over a total of 11 pairs. From Tables 4 and 5, we find that the best performance, based on the profit generated by both the OLS and TLS methods, is obtained when the formation and trading windows are 30 and 15 days, respectively. When we trained our networks, we set the rewards to favor more closed positions and fewer stop-loss and exit positions. The lowest ratio of closed positions to open positions (0.68) occurs with formation and trading windows of 30 and 15 days. In contrast, the highest ratio of closed positions (0.73) occurs with formation and trading windows of 120 and 60 days. However, the highest profits are reported for formation and trading windows of 30 and 15 days. This can be explained by the ratio of stop-loss positions: it is 0.13 when the formation and trading windows are 30 and 15 days but 0.20 when they are 120 and 60 days. This result indicates that it is important to reduce the number of stop-loss positions while increasing the number of closed positions. In addition, we can see that the trading signals made with the TLS method are better than those made with the OLS method in all six of the discrete window sizes. The reason for this lies in the difference between the hedge ratios of the two methods. In OLS, one side is taken as the reference, and the relative change of the other side is estimated. Since the assumption is that there is no error component on the reference side and there is an error only on the other side, the hedge ratio varies depending on the side used as the reference. However, in TLS, the hedge ratio is the same regardless of which side is used as the reference. For this reason, the experimental results confirm that the TLS method is better able to determine when to execute the pairs-trading strategy. Based on these results, we use the optimum window size when we verify our proposed method on the test dataset. However, we first need to ensure that the model we proposed is well-trained.

It is important to check whether our reinforcement learning algorithm is trained well. Reference [21] suggested that a steadily increasing average of Q-values is evidence that the DQN is learning well. Figure 4(a) shows the average Q-values of HON and TXN as training progressed. We find that the average Q-values steadily increased, indicating that our proposed model is properly trained. In addition, we provide a positive reward when the portfolio closes and a negative reward when the portfolio reaches the stop-loss threshold or exits. Figure 4(b) shows the ratio of the number of portfolio positions as training progressed. The ratio of closed to open portfolio positions increased and the ratio of portfolios reaching their stop-loss thresholds to open portfolio positions decreased. We also find that the ratio of portfolio exits to open portfolio positions slightly increased. It is possible that the rewards given for an open portfolio position compared to those given for a closed portfolio position are relatively small. The DQN is therefore trained to prevent portfolios from reaching their stop-loss thresholds (the more important objective) over exiting them. This result can also serve as a basis for judging whether the proposed model is being trained properly.

Figure 4: (a) average Q-values and (b) ratio of the number of portfolio positions as training progressed.

Tables 6 and 7 show the performance results for XOM and CVX on the training dataset. We call our proposed model pairs-trading DQN (PTDQN) and refer to traditional pairs trading with constant action values as pairs trading with action 0 (PTA0) through pairs trading with action 5 (PTA5). From these results, we can confirm that our proposed method is more profitable than the constant pairs-trading strategies. In addition, the TLS method has higher profitability than the OLS method. From PTA0 to PTA5, the trading and stop-loss boundaries grow larger, and the numbers of open portfolios, closed portfolios, and portfolios that reach their stop-loss thresholds decrease. In other words, there is less opportunity for profit, but the probability of loss is also reduced. It is important not only to take many closed positions but also to take the best action to open and close the portfolio. For example, within the same spread, a portfolio opened and closed using the boundary of action 0 yields a different profit from one opened and closed using the boundary of action 1. Assuming that mean reversion is certain to occur, opening a portfolio at the maximum boundary yields a larger profit than opening it at a smaller boundary. The PTDQN returns are higher than those of the best-performing constant-action strategy among the traditional pairs-trading strategies. Figures 5–8 show the trading and stop-loss boundaries chosen by the DQN method and those of the most profitable constant action during the training period, using OLS and TLS.

Figures 5 and 6 compare PTDQN and PTA1 using the TLS method. Figure 5 shows the spread together with the trading and stop-loss boundaries. The trading and stop-loss boundaries take different values in PTDQN, showing that it has learned to select a boundary according to each spread. In contrast, PTA1 in Figure 6 has constant trading and stop-loss boundaries. Figures 7 and 8 exhibit the same features as Figures 5 and 6. The difference between these methods lies in the spreads: different results can be obtained depending on the spreads used. Constructing better spreads can therefore improve performance.

Figures 9 and 10 show the profit of the DQN and of the constant actions using TLS and OLS. Reference [34] suggested that an average over multiple trials should be presented to demonstrate the reproducibility of deep reinforcement learning, because results can differ considerably across trials and random seeds owing to high variance. We therefore conducted five trials with different random seeds. The profit graph of the DQN shows the average profit over these trials, with the filled region spanning the minimum and maximum profit values. PTDQN had a higher profit than the traditional pairs-trading strategies during the training period. This shows that, even for the same spread, profit changes as the boundaries change. In other words, finding the optimal boundary for the spread is an important factor in optimizing the profitability of pairs trading.
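The aggregation over trials described above can be sketched as follows. The function name and the toy profit paths are hypothetical; the only assumption is that every trial records a cumulative-profit path of the same length.

```python
from statistics import mean


def summarize_trials(profit_paths):
    """Collapse per-trial profit paths into mean, minimum, and maximum paths.

    `profit_paths` holds one list of cumulative profits per trial (random seed),
    all of equal length.
    """
    per_step = list(zip(*profit_paths))  # regroup values by time step
    mean_path = [mean(step) for step in per_step]
    min_path = [min(step) for step in per_step]
    max_path = [max(step) for step in per_step]
    return mean_path, min_path, max_path


if __name__ == "__main__":
    # Toy paths for three seeds; real runs would use five, as in the text.
    paths = [[0.0, 1.0, 2.0], [0.0, 0.5, 2.5], [0.0, 1.5, 1.5]]
    avg, low, high = summarize_trials(paths)
    print(avg, low, high)
```

The mean path is the plotted line, and the minimum and maximum paths bound the filled region.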

##### 4.2. Test Results

Tables 8 and 9 show the average performance measures of each pair tested by applying the top-5 trained models. The constant action with the highest return differs from pair to pair, and, as in the training results, the TLS method yields a higher profit than the OLS method for all pairs. PTDQN also performs better than the traditional pairs-trading strategies. The pair with the highest profit under the proposed method is HON and TXN (3.2755); it also shows the largest difference between the DQN method and the best constant action (0.9377). When the TLS method is used, the proposed method has a higher Sharpe ratio for all pairs except MO and UTX. Including the Sharpe ratio alongside total profit in the objective function could yield a more optimized pairs-trading system. These results support the robustness of our proposed method on our dataset. The proposed method can also be applied to pairs of stocks in other global markets.
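For reference, one common way to compute a Sharpe ratio from a series of per-period returns is sketched below. The annualization factor of 252 periods and the zero risk-free rate are assumptions for the example; the exact definition used for Tables 8 and 9 may differ.

```python
from math import sqrt
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (needs at least two values).

    `risk_free` is the per-period risk-free return; both defaults are assumptions.
    """
    excess = [r - risk_free for r in returns]
    volatility = stdev(excess)  # sample standard deviation
    if volatility == 0:
        return float("nan")  # undefined when returns never vary
    return sqrt(periods_per_year) * mean(excess) / volatility


if __name__ == "__main__":
    daily_returns = [0.010, -0.004, 0.006, 0.002, -0.001]
    print(sharpe_ratio(daily_returns))
```

A quantity of this kind could be combined with total profit if a risk-adjusted objective were desired.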

In Figure 11, we can see that our proposed method, PTDQN, outperforms the traditional constant-action pairs-trading strategies in the test dataset. The crucial aspect of this method is that, for each spread, it selects the boundary that yields the highest profit among the constant actions. The trend of its profit is therefore the same as that of the traditional pairs-trading strategies; however, because the boundaries with the highest profit for each spread are combined, PTDQN achieves a higher profit than the traditional strategies. This method can therefore be applied in various fields when there is a need to optimize the efficiency of a rule-based strategy [35, 36]. In this study, we consider the spread and the boundaries to be the important factors of a pairs-trading strategy. We therefore optimized the pairs-trading strategy over various trading and stop-loss boundaries using deep reinforcement learning, and our method outperforms the rule-based strategies. Optimizing the key parameters of a rule-based method can thus improve its performance.

**Figure 11.** Profit of PTDQN and the constant-action strategies in the test dataset for each pair: (a) MSFT/JPM, (b) MSFT/TXN, (c) BRKa/ABT, (d) BRKa/UTX, (e) JPM/T, (f) JPM/HON, (g) JPM/GE, (h) JNJ/WFC, (i) XOM/CVX, (j) HON/TXN, (k) GE/TXN.

Pairs trading uses two stocks that follow the same trend. However, this relationship can break down due to various factors such as economic issues and company risk. In this situation, the spread between the two stocks becomes extremely large. Although this situation cannot be avoided, we hedge the risk by taking a dynamic boundary. In this case, taking the lowest stop-loss boundary is the best choice, since the position can then be exited with the least loss. By taking a dynamic boundary using the deep reinforcement learning method, not only are profits increased, but losses are also reduced compared with taking a fixed boundary.

#### 5. Conclusions

We propose a novel approach to optimizing the pairs-trading strategy using deep reinforcement learning, specifically deep Q-networks. We posed two key research questions. First, if we set a dynamic boundary based on the spread in each trading window, can it achieve a higher profit than the traditional pairs-trading strategy? Second, can a deep reinforcement learning method be trained to follow this mechanism? To investigate these questions, we collected pairs selected using the cointegration test. We examined how the results varied according to the spread and the method used, constructing different spreads with the OLS and TLS methods as the input of the DQN and as the trading signal. To conduct this experiment, we set up a formation window and a trading window. The hedge ratio, which is an important factor in determining how much of each stock to hold, depends on these windows. We therefore applied the OLS and TLS methods and searched for the optimal window size by varying the formation window and the trading window.
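The difference between an OLS and a TLS hedge ratio can be sketched for the two-stock case as follows. This is an illustration under assumptions, not the authors' implementation: `x` and `y` stand for the two price series within one formation window, the spread is taken as the regression residual, and TLS is computed with the closed-form orthogonal-regression slope.

```python
from math import sqrt
from statistics import mean


def _moments(x, y):
    """Means and centered sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols_fit(x, y):
    """Ordinary least squares: minimizes vertical distances. Returns (alpha, beta)."""
    mx, my, sxx, _, sxy = _moments(x, y)
    beta = sxy / sxx
    return my - beta * mx, beta


def tls_fit(x, y):
    """Total least squares: minimizes orthogonal distances. Returns (alpha, beta).

    Requires a nonzero sample covariance between x and y.
    """
    mx, my, sxx, syy, sxy = _moments(x, y)
    beta = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return my - beta * mx, beta


def spread(x, y, alpha, beta):
    """Residual series y - alpha - beta * x, used here as the spread."""
    return [b - alpha - beta * a for a, b in zip(x, y)]


if __name__ == "__main__":
    x = [10.0, 10.5, 11.2, 10.8, 11.5]
    y = [20.3, 21.4, 22.1, 21.9, 23.2]
    print(ols_fit(x, y), tls_fit(x, y))
```

On noisy data the two slopes differ because OLS attributes all error to `y`, whereas TLS treats both series as noisy, so the resulting spreads and trading signals differ as well.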

Tables 6 and 7 show the average performance values for the formation windows and trading windows in the training dataset. The results show that, for all six window sizes, performance was higher with TLS spreads than with OLS spreads. In addition, profitability gradually increased as the formation and trading windows decreased, for both TLS and OLS. The reason is that, although the ratio of closed portfolio positions is the lowest for the formation and trading windows we selected, the ratio of stop-loss positions is also the lowest among the window sizes considered. This means that reducing stop-loss positions is as important as increasing closed positions for making a profit. Using the optimal window size, we then checked whether our DQN was properly trained. Over the training epochs, the average Q-value steadily increased, the ratio of closed portfolios increased, and the ratio of portfolios that reached their stop-loss thresholds decreased, confirming that our DQN is trained well. Based on these results, our proposed model, with a formation window of 30 and a trading window of 15, produced results superior to those of traditional pairs-trading strategies on the out-of-sample test dataset. In Figure 11, the profit path of PTDQN is similar to those of PTA0 to PTA5 but is higher. This shows that taking dynamic boundaries based on our method is effective in optimizing the pairs-trading strategy. During periods of economic uncertainty, managing pairs-trading strategies, including our proposed method, can be risky. However, our reward function accounts for cases in which the spread suddenly becomes large, and because the network is trained to maximize the expected sum of future rewards, it learns to guard against this situation by taking a smaller stop-loss boundary. Therefore, compared with a traditional pairs-trading strategy with a fixed boundary, our proposed method can reduce the risk when economic risks appear.
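The formation and trading windows mentioned above can be laid out over a price series as in the following sketch. The defaults match the window sizes of 30 and 15 reported in the text; advancing by one trading window at a time, so that trading windows do not overlap, is an assumption made for the example.

```python
def rolling_windows(n_obs, formation=30, trading=15):
    """Yield (formation_slice, trading_slice) pairs over a series of length n_obs.

    The hedge ratio would be estimated on the formation slice and the spread
    traded on the trading slice that follows it. The step size equals the
    trading window, which is an assumption.
    """
    start = 0
    while start + formation + trading <= n_obs:
        split = start + formation
        yield slice(start, split), slice(split, split + trading)
        start += trading


if __name__ == "__main__":
    for form, trade in rolling_windows(75):
        print(form, trade)
```

Each pair of slices can be applied directly to a list of prices, for example `prices[form]` and `prices[trade]`.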

The experimental results show that our method can be applied in a pairs-trading system. It can also be applied in various fields, including finance and economics, when there is a need to optimize the efficiency of a rule-based strategy. Furthermore, our method outperforms the traditional pairs-trading strategy for all pairs formed from constituent stocks of the S&P 500. The study focused only on spreads made by two stocks that have long-term equilibrium patterns. Since our method selects optimal boundaries based on spreads, it can be applied to other stock markets such as KOSPI, Nikkei, and Hang Seng, provided that appropriate cointegrated pairs are selected.

In future work, we can develop our proposed model as follows. First, because profit was set as the objective function in this study, the performance of the model is lower than that of traditional pairs trading on other performance measures. It may therefore be possible to create a better-optimized pairs-trading strategy by including these other performance indicators in the objective function. Second, we can use other statistical methods, such as the Kalman filter and error-correction models, to construct more diverse spreads. Finally, it may be possible to create a more optimized pairs-trading strategy by making the discrete sets of window sizes and boundaries continuous. We will address these issues in future studies.

#### Data Availability

The data used to support the findings of this study have been deposited in the figshare repository (DOI: 10.6084/m9.figshare.7667645).

#### Disclosure

The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. This work represents a part of the study conducted as a master's thesis in Financial Engineering between 2016 and 2018 at the University of Ajou, Republic of Korea.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT: Ministry of Science and ICT) (No. NRF-2017R1C1B5018038).