#### Abstract

In this study, we analyze the term structure of credit default swaps (CDSs) and predict future term structures using the Nelson–Siegel model, recurrent neural network (RNN), support vector regression (SVR), long short-term memory (LSTM), and group method of data handling (GMDH), with CDS term structure data from 2008 to 2019. Furthermore, we evaluate the change in the forecasting performance of the models through a subperiod analysis. According to the empirical results, we confirm that the Nelson–Siegel model can be used to predict not only the interest rate term structure but also the CDS term structure. Additionally, we demonstrate that machine-learning models, namely, SVR, RNN, LSTM, and GMDH, outperform the model-driven method (in this case, the Nelson–Siegel model). Among the machine-learning approaches, GMDH demonstrates the best performance in forecasting the CDS term structure. According to the subperiod analysis, the performance of all models varied with the data period: all the models forecasted less accurately in highly volatile data periods than in less volatile periods. This study will enable traders and policymakers to invest efficiently and make policy decisions based on the current and future risk factors of a company or country.

#### 1. Introduction

A credit default swap (CDS) is a credit derivative based on credit risk, similar to a bond. The prices of both CDSs and bonds change depending on the risk of the reference entity. If the reference entity has a higher risk, then the CDS spread is set higher. To manage credit risk, we can use a CDS contract. The CDS seller (protection seller) insures the protection buyer’s risk in the event of a credit default, such as bankruptcy of the reference entity, debt repudiation, or, in the case of a sovereign bond, a moratorium. There are two ways for a protection seller to compensate the protection buyer’s loss. The first is to buy the underlying asset at face value; the second is to pay the difference between the remaining value and the face value. In this way, the protection buyer can hedge his or her credit risk and give the CDS spread to the protection seller.

A CDS spread is an insurance fee that a protection buyer pays to the protection seller, often quarterly. Its value is determined by factors such as the probability of credit default and recovery rate. The recovery rate is the percentage of the bond value that the reference entity offers to the protection buyer when a credit default happens. Therefore, if the recovery rate is high, the CDS spread will be low. The CDS spread will be high if the default rate is high, which indicates a high probability of credit default. Because the CDS spread indicates the bankruptcy risk of institutions or countries, it is an important economic index that is being actively traded. According to the Bank for International Settlements, the total outstanding notional amount of CDS contracts was $7809 billion in the first half of 2019.

To date, numerous studies have been conducted on the prediction of financial asset values. For example, Li and Tam [1] forecasted stock price movements of different volatilities using a recurrent neural network (RNN) and support vector machine (SVM). Chen et al. [2] predicted the movement of the Chinese stock market using a long short-term memory- (LSTM-) based model. Gao et al. [3] also used LSTM to predict stock prices. However, few studies have been conducted on forecasting the CDS term structure. Shaw et al. [4] used the Nelson–Siegel model to make 1-, 5-, and 10-day forecasts of the CDS curve and compared its efficiency with that of the random-walk method. They showed that, although the 1-day forecast was not very effective, the accuracy of the 5- and 10-day forecasts outperformed those of the random-walk model. Avino and Nneji [5] predicted daily quotes of iTraxx Europe CDS indices using linear and nonlinear forecasting models, such as autoregressive (AR) and Markov switching AR models. They found that the AR model often outperforms Markov switching models, but Markov switching models offer a good in-sample fit for iTraxx index data. Sensoy et al. [6] used permutation entropy to test the weak-form efficiency of CDS markets in some countries. They found that CDS markets could be efficient during crisis periods, which implies that the impact of a crisis on CDS market efficiency is limited, and Asian markets outperformed the other tested markets in terms of efficiency. In addition, they showed a negative linear correlation between a country’s CDS efficiency and daily CDS levels. Neftci et al. [7] asserted that CDS markets provide unique information on default probability. They showed that the information provided by a CDS regarding the default risk of a sovereign bond is more accurate than the information from a bond spread provided by the corresponding treasury using a stochastic differential equation based on the Markov process. 
Duyvesteyn and Martens [8] used the structural model for a sovereign bond from Gray et al. [9] to predict how exchange rate returns and volatility changes affect market CDS spread movements. The model results, such as default probability and spreads, were strongly correlated with CDS spreads. Their results also rejected their hypothesis that changes in sovereign credit spreads are correlated to changes in sovereign market spreads.

As mentioned above, several studies have attempted to predict various financial market indices with machine-learning methods; however, research on CDS term structure is limited. CDS term structure reflects the conditions for monetary policy and companies’ future risk expectations. CDS spread can be classified into two types. The first one is sovereign CDS, which has a country as its reference entity. Sovereign CDS spreads reflect the creditworthiness of a country. That is, the sovereign CDS spread can be considered as a measure of the sovereign credit risk [10]. Furthermore, the sovereign CDS spreads contain some components that are attributed to global risk, according to Pan and Singleton [11] and Longstaff et al. [12]. Studies on sovereign CDS include Pan and Singleton [11], Longstaff et al. [12], Blommestein et al. [10], Galariotis et al. [13], Srivastava et al. [14], Ho [15], and Augustin [16]. The other type of CDS is written with respect to one single reference entity, the so-called single-name CDS. In addition, CDS sector indices are based on the most liquid 5-year term, are equally weighted, and reflect an average midspread calculation of the given index’s constituents. However, single-name CDS spreads are much less liquid than indices [17–19]. In several studies, the creditworthiness of individual industries was investigated using CDS sector data [19–22].

The CDS term structure is important because it integrates the future risk expectations of both markets and companies by offering CDS spreads over time. Thus, we can confirm various types of information from the CDS term structure, such as firm leverage and volatility, as shown by Han and Zhou [23]. Furthermore, understanding the implications of the term structure also provides us with a method of extracting this information and predicting the effect of financial events and risk on it. Despite the large number of studies on CDS, studies that attempt to forecast its term structure remain few.

In this study, we analyze the CDS term structure, particularly sovereign CDS, forecast it using machine-learning models, and identify the most suitable model for predicting CDS term structure. We consider model-driven and data-driven methods: the Nelson–Siegel model, RNN, SVR, LSTM, and GMDH. The Nelson–Siegel model, as a model-driven method, was devised to fit the yield term structure; however, in this study, it was fitted to the CDS term structure to extract the term structure parameters and forecast the CDS term structure with the AR(1) model. RNN, SVR, LSTM, and GMDH are machine-learning models that specialize in predicting time-series data. RNN memorizes previous information and uses it to predict future information. LSTM is basically the same as RNN; however, it memorizes only significant information based on some calculations. SVR is derived from the structural risk minimization principle [24] and has been used for prediction in many fields [25–27]. Among the machine-learning methods, a GMDH network is a system identification method that has been used in various fields of engineering to model and forecast the nature of unknown or complex systems based on a given set of multi-input-single-output data pairs [28–30].

Machine learning is widely used in various fields to analyze data and forecast future flow. For example, Yan and Ouyang [31] compared the efficiency of the LSTM model in predicting financial time-series data with that of other machine-learning models, such as SVM and K-nearest neighbor. Baek and Kim [32], Yan and Ouyang [31], Cao et al. [33], and Fischer and Krauss [34] also analyzed and forecasted financial data using machine learning. Machine learning is widely used in medical research. Thottakkara et al. [35], Motka et al. [36], Boyko et al. [37], and Tighe et al. [38] studied and predicted various illnesses and clinical data with machine-learning models. Many studies have also been performed to predict weather conditions using machine learning. Choi et al. [39], Haupt and Kosovic [40], Rhee and Im [41], and James et al. [42] conducted research on forecasting weather conditions. Ma et al. [43] and Li et al. [44] used a convolutional neural network (CNN) to predict a transportation network. Furthermore, GMDH has been widely used for time-series prediction [45–47]. As in these studies, we will apply machine-learning methods to forecast the CDS term structure and identify the most efficient method. There are not many studies on financial data using machine-learning methods compared to other areas, and to the best of our knowledge, this work is the first to present a forecasting model for CDS data. Therefore, although there are many prediction methods, we especially focus on methods which are generally used in the prediction of time-series data, such as LSTM, RNN, SVR, and GMDH.

Methodologically, we adopt Nelson–Siegel as a model-driven method and RNN, LSTM, SVR, and GMDH as data-driven methods to predict the CDS term structure for the period 2008–2019. We optimize the data-driven models using a grid search algorithm implemented with the Python technological stack. Furthermore, we conduct subperiod analyses to investigate changes in model performance over the experimental period. Specifically, we split the entire sample period into two subperiods: October 2008–December 2011 (subperiod 1) and January 2012–October 2019 (subperiod 2), because subperiod 1 contains financial market turbulence due to the global financial crisis and European debt crisis. Through this subperiod analysis, we investigate the change in the forecasting performance of all methods in both high-variance and relatively low-variance data. This kind of subperiod analysis is common in other studies [48–51].

In time-series forecasting, sequence models, either RNN, LSTM, or a combination of both, are frequently used because they take the ordering of observations in time into account. A sequence model treats time as an order and can track how the data change along that order; therefore, it can be applied to data such as weather and financial series. According to Siami-Namini and Namin [52] and McNally et al. [53], neural network (NN) models, such as RNN and LSTM, outperformed conventional algorithms, such as the autoregressive integrated moving average (ARIMA) model, when using financial data or Bitcoin prices. McNally et al. [53] also evaluated the performance of LSTM using volatile Bitcoin data, and Cortez et al. [54] used data from the Republic of Guatemala to predict emergency events. Furthermore, LSTM is generally regarded as better than RNN because it was designed to correct the disadvantages of RNN; however, this advantage appears to depend on the dataset. For example, Samarawickrama and Fernando [55] demonstrated that LSTM exhibited higher accuracy than RNN when predicting stock prices. However, Selvin et al. [56] also compared RNN with LSTM in forecasting stock prices and found that RNN outperformed LSTM. Therefore, in this study, we used both RNN and LSTM to confirm whether LSTM outperforms RNN when forecasting CDS spreads. Ultimately, the motivation for conducting this study is to compare the CDS forecasting performance of the Nelson–Siegel model with that of the RNN, LSTM, SVR, and GMDH models, to determine the difference between model-driven and data-driven methods.

This paper is organized as follows: in the next section, we review our dataset and present a statistical summary of the CDS term structure; we describe our methods: Nelson–Siegel, RNN, SVR, LSTM, and GMDH, and we explain hyperparameter optimization and its application to the CDS term structure; Section 3 presents our forecasting results on CDS term structure with various error estimates and demonstrates the performance of each model; and Section 4 provides a summary and concluding remarks.

#### 2. Data Description and Methods

##### 2.1. Data Description

The CDS spread can be classified into several categories. The classification method usually depends on the frame of the credit event. The full restructuring clause is the standard term. Under this condition, any restructuring event could be a credit event. The modified restructuring clause limits the scope of opportunistic behavior by sellers when restructuring agreements do not result in a loss. While restructuring agreements are still considered as credit events, the clause limits the deliverable obligations to those with a maturity of less than 30 months after the termination date of the CDS contract. Under the modified contract option, any restructuring event, except the restructuring of bilateral loans, could be a credit event. Additionally, the modified-modified restructuring term is introduced because modified restructuring has been too severe in its limitation of deliverable obligations. Under this term, the remaining maturity of deliverable assets must be less than 60 months for restructured obligations and 30 months for all other obligations. Under the no restructuring contract option, all restructuring events are excluded under the contract as “trigger events.”

Among these contract types, we use a full restructuring sovereign CDS spread dataset because other datasets are unavailable for long periods. The sovereign CDS spread reflects the market participants’ perceptions of a country’s credit rating. Our data cover the period from October 2008 to October 2019 and maturities of six months and 1, 2, 3, 4, 5, 7, 10, 20, and 30 years. All data were sourced from Datastream and correspond to the daily closing price of the CDS spread. The term structure of the CDS spread normally shows upward sloping curves, as seen in Figure 1. Furthermore, CDS spreads seem to be lower as they get closer to the current date with no exceptions. Table 1 provides summary statistics of the CDS data. We can also verify that spreads with longer maturities are higher in terms of both mean and percentile. It is interesting to note that the standard deviation is also higher when the maturity is longer, which implies that the market predictions are highly unstable for longer periods.

##### 2.2. Nelson–Siegel Model

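Nelson and Siegel [57] proposed a parsimonious model that is widely used to predict the interest rate term structure. The formula is as follows:

$$y(\tau) = \beta_0 + \beta_1 \frac{1 - e^{-\lambda\tau}}{\lambda\tau} + \beta_2 \left( \frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau} \right), \tag{1}$$

where $\lambda$ is the time-decay parameter; $\tau$ is the maturity; and $\beta_0$, $\beta_1$, and $\beta_2$ are the three Nelson–Siegel parameters. $\beta_0$ is the long-term component of the yield curve, as its loading does not decay to 0 and remains constant for all maturities. $\beta_1$ is the short-term factor, whose loading starts at 1 but quickly decays to 0. Finally, the loading of $\beta_2$ starts at 0 and increases before decaying back to 0; hence, it is the medium-term factor, which creates a hump in the yield curve.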
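As a minimal sketch, the curve can be evaluated in Python, assuming the common parameterization in which the short-term loading is $(1 - e^{-\lambda\tau})/(\lambda\tau)$ and the medium-term loading is that quantity minus $e^{-\lambda\tau}$. The function names and the parameter values below are hypothetical and are chosen only for illustration; they are not estimates from our data.

```python
import math

def ns_loadings(tau, lam):
    """Return the three Nelson-Siegel factor loadings at maturity tau (> 0)."""
    x = lam * tau
    slope = -math.expm1(-x) / x          # (1 - exp(-x)) / x: near 1 for small tau, decays to 0
    curvature = slope - math.exp(-x)     # near 0 at both ends, hump in between
    return 1.0, slope, curvature         # the long-term loading is constant

def nelson_siegel(tau, beta0, beta1, beta2, lam):
    """Evaluate the Nelson-Siegel curve at maturity tau (in years)."""
    level, slope, curvature = ns_loadings(tau, lam)
    return beta0 * level + beta1 * slope + beta2 * curvature

# Hypothetical parameter values (assumption, for illustration only).
maturities = [0.5, 1, 2, 3, 4, 5, 7, 10, 20, 30]
curve = [nelson_siegel(t, beta0=120.0, beta1=-80.0, beta2=30.0, lam=0.5)
         for t in maturities]
for t, spread in zip(maturities, curve):
    print(f"maturity {t:>4} y: {spread:7.2f}")
```

With a negative $\beta_1$, as in this illustration, the curve slopes upward from roughly $\beta_0 + \beta_1$ at very short maturities toward $\beta_0$ at very long maturities.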

The Nelson–Siegel model is a simple but effective method for modeling a term structure, and various studies have used the model to predict the yield curve or other term structures. For example, Shaw et al. [4] forecasted CDS using the Nelson–Siegel model to fit the CDS curve. Guo et al. [58] used the Nelson–Siegel model to model the term structure of implied volatility. GrØnborg and Lunde [59] used it to model the term structure of future oil contracts and forecast the prices of these contracts, while West [60] determined the future price of agricultural commodities. In particular, the CDS term structure has a strong relationship with the interest rate term structure. For example, Chen et al. [61] found that interest rate factors not only affected credit-spread movements but also forecasted future credit risk dynamics. They claimed that the different frequency components of interest rate movements affected the CDS term structure in various industrial sectors and credit rating classes. Specifically, worsening credit conditions tend to lead to future easing of monetary policy, leading to lower current forward interest rate curves. On the contrary, positive shocks to the interest rate narrow the credit spread at long maturities. Tsuruta [62] tried to decompose the yield and CDS term structure into risk and nonrisk structures and found that credit risk components have a negative relationship to the local equity market.

In this study, we fit the Nelson–Siegel model to the CDS curve by estimating the time-decay parameter $\lambda$ and the Nelson–Siegel parameters $\beta_0$, $\beta_1$, and $\beta_2$. We can estimate the Nelson–Siegel parameters using various models, such as autoregressive-moving-average (ARMA) and ARIMA, and select the most accurate model. For example, Shaw et al. [4] used the AR(1) process to estimate $\beta_0$, $\beta_1$, and $\beta_2$. Here, we used the AR(1) process to estimate the Nelson–Siegel parameters and the time-decay parameter. We use the mean squared error (MSE), root MSE (RMSE), mean percentage error (MPE), mean absolute percentage error (MAPE), and mean absolute error (MAE) as error measures to compare the efficiency of this method with that of other methods, such as RNN or LSTM.
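The AR(1) step can be sketched as follows, assuming each parameter series is modeled separately as $x_t = c + \phi x_{t-1} + e_t$ and fitted by ordinary least squares. The helper names and the sample series are hypothetical; this is an illustration of the technique, not the exact estimation routine used in this study.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_{t-1} + e_t; returns (c, phi)."""
    x_prev, x_next = series[:-1], series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi

def forecast_ar1(last_value, c, phi, steps=1):
    """Iterate the fitted recursion `steps` times from the last observed value."""
    value = last_value
    for _ in range(steps):
        value = c + phi * value
    return value

# Hypothetical daily estimates of one Nelson-Siegel parameter (assumption).
beta0_history = [118.0, 119.5, 121.0, 120.2, 122.4, 123.1, 122.0, 124.3]
c_hat, phi_hat = fit_ar1(beta0_history)
print(forecast_ar1(beta0_history[-1], c_hat, phi_hat, steps=1))
```

Forecasting each parameter in this way and substituting the forecasts into the Nelson–Siegel formula yields a forecast of the whole curve.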

##### 2.3. SVR

SVR is a field of machine-learning models derived from SVM. SVM is an algorithm that returns a hyperplane that separates the training samples into two labels, positive and negative. We refer to the distance between the closest point and the hyperplane as the “margin,” and the goal of SVM is to identify the hyperplane that maximizes the margin. There are two types of margin. The first type is a hard margin, which is for linearly separable datasets, meaning that every point does not violate its label. In other words, all the points can be classified into their labels with a hyperplane. The second one is a soft margin, which is for nonseparable cases. In this case, some points in the dataset, called “outliers,” are incorrectly classified. There are two ways to select a soft margin hyperplane. On the one hand, we can make the margin larger and take more errors (outliers). This is usually used for datasets that have only a small number of outliers. On the other hand, we can choose a hyperplane that has a small margin and minimize the empirical errors. This is useful for datasets with dense point distributions, where it is difficult to separate the data explicitly.

Additionally, the kernel trick can be used for linearly nonseparable datasets. A kernel represents a function that maps the original data points into a higher-dimensional space in which they become separable. The reason it is called the “kernel trick” is that, although the dimension of the dataset is increased, the cost of the algorithm does not increase much.
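A small numerical check illustrates why the cost does not grow much. For the degree-2 homogeneous polynomial kernel on two-dimensional inputs (chosen here only as an example), the kernel value computed in the original space equals the inner product of an explicit three-dimensional feature map, so the higher-dimensional vectors never need to be formed. The function names are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel K(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit map to 3-D space whose inner product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
# Both expressions give the same value up to floating-point rounding.
print(poly_kernel(x, z), dot(feature_map(x), feature_map(z)))
```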

SVM originated from the statistical learning theory introduced by Vapnik and Chervonenkis. The characteristic idea of SVM is to minimize the structural risk, while artificial neural networks (ANNs) minimize the empirical risk. Furthermore, SVM theoretically demonstrates better forecasting than artificial neural networks, according to Gunn et al. [63] and Haykin [64].

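SVR is derived from SVM. It is a nonlinear kernel-based approach, and the main idea is to identify a function whose deviation from the actual data is located within a predetermined scale. SVR is applied to a given dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the input vector, $y_i$ is the output, and $n$ is the total number of data points. The following formulation was introduced by Pérez-Cruz et al. [65]. SVR assumes that the function is a nonlinear function of the form $f(x) = w^{T}\varphi(x) + b$, where $w$ and $b$ are the weight and constant, respectively. $\varphi(x)$ denotes a mapping function in the feature space. Then, the weight vector $w$ and the constant $b$ are estimated by solving the following optimization problem:

$$\min_{w,\, b,\, \xi,\, \xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \tag{2}$$

subject to, for $i = 1, \ldots, n$,

$$y_i - w^{T}\varphi(x_i) - b \le \varepsilon + \xi_i, \tag{3}$$

$$w^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \qquad \xi_i,\ \xi_i^{*} \ge 0, \tag{4}$$

where $\varepsilon$ is the prespecified value and $\xi_i$ and $\xi_i^{*}$ are slack variables indicating the upper and lower constraints, respectively. Constraints (3) and (4) correspond to the $\varepsilon$-insensitive loss function introduced by Vapnik. $C$ is the regularization parameter, and $\varphi$ is a nonlinear transformation to a higher dimensional space, also known as feature space.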
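To make the role of the slack variables concrete, the following sketch evaluates the $\varepsilon$-insensitive loss and the resulting primal objective. It assumes the identity mapping $\varphi(x) = x$ (a linear model) purely for brevity, and it uses the standard fact that, at the optimum, the sum of the two slack variables of a point equals its $\varepsilon$-insensitive loss. All names and numbers are hypothetical.

```python
def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)

def primal_objective(w, b, xs, ys, C, eps):
    """0.5 * ||w||^2 + C * (sum of eps-insensitive losses) for f(x) = w . x + b."""
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
        loss += eps_insensitive_loss(y, f_x, eps)
    return reg + C * loss

# Hypothetical one-dimensional data (assumption, for illustration only).
xs, ys = [[1.0], [2.0]], [2.0, 5.0]
print(primal_objective(w=[2.0], b=0.0, xs=xs, ys=ys, C=10.0, eps=0.5))
```

A larger $C$ penalizes points outside the tube more heavily, whereas a larger $\varepsilon$ widens the tube and lets more points incur no loss.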

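Using Lagrange multipliers and the Karush–Kuhn–Tucker conditions, the dual problem for the optimization problem (2)–(4) can be obtained:

$$\max_{\alpha,\, \alpha^{*}} \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})\, \varphi(x_i)^{T}\varphi(x_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^{*}) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^{*})$$

subject to

$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C, \quad i = 1, \ldots, n,$$

where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers.

To solve the above problem, we do not need to identify the nonlinear function $\varphi$ explicitly. The solution can be obtained as

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, K(x_i, x) + b,$$

where $K$ is called the kernel function, defined as $K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j)$. Any function satisfying Mercer’s condition can be used as the kernel function (see Mohri et al. [66]).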

The selection of the kernel has a significant impact on the forecasting performance of SVR. It is a common practice to evaluate a range of candidate settings and use cross-validation over the training set to determine the best one. In this research, we use three kernel functions: polynomial, Gaussian, and sigmoid, as presented in Table 2.
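The three kernel families and the kernel expansion of the SVR solution can be sketched as follows. The parameterization (`gamma`, `coef0`, `degree`) follows a widely used convention and is an assumption here; it is not necessarily identical to the definitions in Table 2. The support vectors and dual coefficients in the example are hypothetical numbers, not fitted values.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, z) + coef0) ** degree

def gaussian_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

def svr_predict(x, support_vectors, dual_coefs, b, kernel):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i^* for support vector i.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + b

# Hypothetical support vectors and coefficients (assumption).
svs, coefs = [(0.0,), (1.0,)], [1.0, -1.0]
print(svr_predict((0.0,), svs, coefs, b=0.5, kernel=gaussian_kernel))
```

Because the prediction depends on the inputs only through `kernel`, switching between the three families requires no other change to the predictor.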

Cao and Tay [67] examined the sensitivity of SVMs to their free parameters, including the regularization parameter *C*. Because these parameters play an important role in the performance of SVR, they must be chosen properly.

##### 2.4. RNN

An ANN is a classification or prediction process that imitates human neurons. The output of a simple ANN model is generated by multiplying weights assigned to input data. After comparing the output data and the real values to be predicted, we create new weights adjusted according to the error. The step in which weights are multiplied by the input data is called forward propagation, and the step in which the error is calculated and weights are adjusted is called backpropagation. The final goal of the ANN model is to determine the weights that minimize the error between the predicted and target values.

A CNN is a machine-learning method that uses a neural network algorithm. It consists of convolution layers, pooling layers, and neural network layers. A convolution layer uses a “filter” to analyze data, typically vectorized image data. The filter analyzes small sections while moving over the entire dataset, and each section expresses a “feature” of the data with pooling layers.

An RNN is another representative neural network model that has a special hidden layer. While a simple neural network has a backpropagation algorithm and adjusts its weights to reduce prediction errors, the RNN has a hidden layer that is modified by the hidden layer of the previous state. Each time the algorithm operates, the RNN hidden layer affects the next hidden layer of the algorithm. Because of these characteristics, RNN is well suited to analyzing and predicting nonlinear time-series data, such as stock prices. It is an algorithm operating in sequence with input and output data. It can return a single output from one or more input data and return more than one output from one or more input data. One of its characteristics is that it returns the output in every hidden time-step layer and simultaneously sends it as input data to the next layer; we demonstrate the simplified structure in Figure 2. RNN has a memory cell in the hidden layer, which returns the output through various activation functions, such as the sigmoid and softmax functions. The memory cell memorizes the output from the previous time-step and uses it as input data recurrently. For instance, at a specific time $t$, the output $h_{t-1}$ of the previous time-step and the input $x_t$ of time-step $t$ are used as input data, and the output $h_t$ is among the input data of the next time-step $t+1$.
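The recurrence can be sketched with a scalar hidden state, assuming an Elman-style update with a tanh activation (the activation and the scalar weights are assumptions made for brevity; the networks used in practice have vector-valued states and weight matrices). The names `w_x`, `w_h`, and `b` are hypothetical.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent update: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run the cell over a sequence and return the hidden state at every time-step."""
    states, h = [], h0
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)   # the previous output feeds the next step
        states.append(h)
    return states

# Hypothetical scalar weights and a short input sequence (assumption).
states = rnn_forward([0.5, -0.1, 0.3], w_x=0.8, w_h=0.5, b=0.0)
print(states)
```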

The greatest difference between RNN and CNN or multilayer perceptron (MLP) is that CNN and MLP do not consider previous state data in later steps, but RNN considers both the output of the previous state and the input of the present state. Furthermore, as it is optimized to deal with sequential data, it is used in text, audio, and visual data processing.

However, RNN has a vanishing gradient problem in long backpropagation processes. The algorithm of an RNN is based on gradient descent and modifies its weights in each time-step after one forward propagation process. Weights are modified with error differentials so that these rapidly converge to zero with repetitive backpropagation—this is called the vanishing gradient problem. To solve this problem in long-term time-series data, LSTM is widely used.
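The effect can be illustrated numerically. For a scalar recurrence $h_t = \tanh(w_x x_t + w_h h_{t-1})$ (an assumption made for brevity), the chain rule gives $\partial h_t / \partial h_{t-1} = w_h (1 - h_t^2)$, so the sensitivity of a late hidden state to an early one is a product of such factors. When each factor is smaller than one in magnitude, the product shrinks geometrically. The weights below are hypothetical.

```python
import math

def gradient_through_time(inputs, w_x, w_h, h0=0.0):
    """Return d h_t / d h_0 for every t, where h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, grad, history = h0, 1.0, []
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
        grad *= w_h * (1.0 - h * h)   # chain rule factor d h_t / d h_{t-1}
        history.append(grad)
    return history

# Hypothetical weights and a constant input sequence of 50 steps (assumption).
grads = gradient_through_time([0.5] * 50, w_x=0.8, w_h=0.5)
print(grads[0], grads[9], grads[49])
```

The printed values decrease rapidly with the number of steps, which is why early inputs have almost no influence on the weight updates in long sequences.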

##### 2.5. LSTM

To solve the vanishing gradient problem of RNN, Hochreiter and Schmidhuber [68] proposed LSTM, while Gers and Schmidhuber [69] added a forget gate to improve it. RNN considers all previous time-step memories, whereas LSTM chooses only the necessary memories to convey to the next time-step, using an algorithm in a special cell called the LSTM cell. Each of the cells has a forget gate, an input gate, an output gate, and the long-term and short-term memories ($c_t$, $h_t$) that pass through these cells, as shown in Figure 3.

Input data are deleted, filtered, and added to the long-term memory in the forget gate. The forget gate $f_t$ generally uses a sigmoid function as an activation function that maps the input data and short-term memory to numbers ranging from zero to one. This implies that if the output of the forget gate is close to zero, then most of the information will not pass through; if the output is close to one, then most of the information will pass to the next cell. Next, the input gate decides which data from the input and short-term memory must be added, after they are substituted into $i_t$ and $g_t$.

$g_t$ generates new candidate vectors that could be added to the present cell state, and $i_t$ decides how much of the information generated by $g_t$ to save. $i_t$ uses the sigmoid function in the same way as the forget gate, with the same meaning; i.e., if the value of $i_t$ is close to one, then most of $g_t$ will pass through, and if it is close to zero, then most of $g_t$ will not be taken into this cell. $c_t$ is computed with the input gate value and forget gate value. By multiplying $f_t$ with $c_{t-1}$, the amount of information from the previous time-step cell that will be memorized is determined. Finally, the output gate decides which data will be the output of each cell, considering the memory term $c_t$ and $o_t$.

The processes performed by each gate are expressed as follows:

$$
\begin{aligned}
f_t &= \sigma\left(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f\right), \\
i_t &= \sigma\left(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i\right), \\
g_t &= \tanh\left(W_{xg}\, x_t + W_{hg}\, h_{t-1} + b_g\right), \\
o_t &= \sigma\left(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
$$

$W_x$ and $W_h$ are the weights of $x_t$ and $h_{t-1}$, respectively. For example, $W_{xi}$ is the weight of input data $x_t$ to input gate $i_t$. In addition, $b$ denotes the bias of each gate, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
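A single LSTM update can be sketched with scalar states, assuming the standard gate equations: sigmoid forget, input, and output gates, a tanh candidate $g_t$, the long-term memory update $c_t = f_t c_{t-1} + i_t g_t$, and the output $h_t = o_t \tanh(c_t)$. The parameter dictionary and its values are hypothetical, and the scalar form is used only to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM update; p maps names such as 'W_xf', 'W_hf', 'b_f' to floats."""
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["b_f"])    # forget gate
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["b_i"])    # input gate
    g_t = math.tanh(p["W_xg"] * x_t + p["W_hg"] * h_prev + p["b_g"])  # candidate values
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["b_o"])    # output gate
    c_t = f_t * c_prev + i_t * g_t    # long-term memory
    h_t = o_t * math.tanh(c_t)        # short-term memory, also the cell output
    return h_t, c_t

# Hypothetical parameter values (assumption, for illustration only).
params = {name: 0.5 for name in
          ("W_xf", "W_hf", "W_xi", "W_hi", "W_xg", "W_hg", "W_xo", "W_ho")}
params.update({"b_f": 0.0, "b_i": 0.0, "b_g": 0.0, "b_o": 0.0})

h, c = 0.0, 0.0   # both memory terms start at zero
for x_t in [0.2, 0.4, 0.1]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```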

To develop an LSTM model, we must assign the initial values of $c_0$ and $h_0$. As mentioned by Zimmermann et al. [70], we set both initial memory term values as zero. LSTM is broadly applied to forecast time-series data; however, owing to its complexity, Chung et al. [71] designed a simpler model called a gated recurrent unit (GRU) while adopting the advantages of LSTM. GRU consists of a reset gate, which decides how to add new input data to the previous cell memory, and an update gate, which decides the amount of memory of the previous cell to save. However, as our dataset is not very large, we used the LSTM model and compared its performance in forecasting the CDS term structure with RNN.

##### 2.6. GMDH

GMDH is a machine-learning method based on the principle of heuristic self-organizing, proposed by Ivakhnenko [72]. The advantage of GMDH is that various considerations, including the number of layers, neurons in hidden layers, and optimal model structure, are determined automatically. In other words, we can apply GMDH to model complex systems without a priori knowledge of the systems.

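Suppose that there is a set of $n$ input variables $x_1, x_2, \ldots, x_n$ and one output variable $y$. The GMDH algorithm represents a model as a set of neurons in which different pairs in each layer are connected via quadratic polynomials, and they generate new neurons in the next layer [28, 73]. Figure 4 shows the simplified structure. The formal identification problem of the GMDH algorithm is to identify a function $\hat{f}$ that can be used, instead of the actual function $f$, to forecast the output $\hat{y}$ for a given input vector $X = (x_1, x_2, \ldots, x_n)$ as close as possible to its actual output $y$. Therefore, we can describe the $M$ observations of multi-input and single-output data pairs as follows:

$$y_i = f\left(x_{i1}, x_{i2}, \ldots, x_{in}\right), \qquad i = 1, 2, \ldots, M.$$

We train a GMDH network to predict the output $\hat{y}_i$ for any given input vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})$, which is given as

$$\hat{y}_i = \hat{f}\left(x_{i1}, x_{i2}, \ldots, x_{in}\right), \qquad i = 1, 2, \ldots, M.$$

Now, the GMDH network is determined by minimizing the squared sum of differences between sample outputs and model predictions, that is,

$$\sum_{i=1}^{M} \left[ \hat{f}\left(x_{i1}, x_{i2}, \ldots, x_{in}\right) - y_i \right]^{2} \rightarrow \min.$$

The general connection between input and output variables can be expressed by a series of Volterra functions:

$$y = a_0 + \sum_{p=1}^{n} a_p x_p + \sum_{p=1}^{n} \sum_{q=1}^{n} a_{pq} x_p x_q + \sum_{p=1}^{n} \sum_{q=1}^{n} \sum_{r=1}^{n} a_{pqr} x_p x_q x_r + \cdots, \tag{10}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the input variable vector and $A = (a_0, a_p, a_{pq}, a_{pqr}, \ldots)$ is the weight vector. Equation (10) is known as the Kolmogorov–Gabor polynomial [28, 45, 72, 74, 75].

In this study, we use the second-order polynomial function of two variables, which is written as

$$\hat{y} = G\left(x_p, x_q\right) = a_0 + a_1 x_p + a_2 x_q + a_3 x_p x_q + a_4 x_p^{2} + a_5 x_q^{2}. \tag{11}$$

The main objective of the GMDH network is to build the general mathematical relation between the inputs and output variables given in equation (10). The weights $a_0, \ldots, a_5$ in equation (11) are estimated using regression techniques so that the difference between the actual output ($y$) and the calculated output ($\hat{y}$) is minimized, described as

$$E = \frac{1}{M} \sum_{i=1}^{M} \left( y_i - \hat{y}_i \right)^{2} \rightarrow \min.$$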
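The estimation of a single GMDH neuron can be sketched as an ordinary least-squares problem, assuming the second-order polynomial of two inputs has the common form $a_0 + a_1 u + a_2 v + a_3 uv + a_4 u^2 + a_5 v^2$. The sketch solves the normal equations with a small Gaussian elimination routine so that it needs no external library; the helper names and the toy data are hypothetical, and a full GMDH network would repeat this fit for every input pair and keep the best neurons in each layer.

```python
def design_row(u, v):
    """Regressors of the quadratic neuron for one observation of the input pair (u, v)."""
    return [1.0, u, v, u * v, u * u, v * v]

def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_neuron(pairs, targets):
    """Least-squares weights a0..a5 from the normal equations (X^T X) a = X^T y."""
    rows = [design_row(u, v) for u, v in pairs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(6)]
    return solve_linear(xtx, xty)

def predict_neuron(weights, u, v):
    return sum(w * z for w, z in zip(weights, design_row(u, v)))

# Toy data generated from a known quadratic (assumption, for illustration only).
true_weights = [1.0, 2.0, -1.0, 0.5, 0.3, -0.2]
pairs = [(float(i), float(j)) for i in range(4) for j in range(3)]
targets = [predict_neuron(true_weights, u, v) for u, v in pairs]
weights = fit_neuron(pairs, targets)
print([round(w, 6) for w in weights])
```

On noise-free data such as this, the fitted weights reproduce the generating weights up to rounding error.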

These parameters can be obtained from multiple regression using the least squares method, and we can compute them by solving some matrix equations. Refer to [28, 29, 46, 76] for a detailed description of the parameter estimation process. The GMDH network can be associated with various algorithms, such as the genetic algorithm [77, 78], singular value decomposition [28], and backpropagation [29, 46, 73, 79–81]. We also improved the GMDH network using backpropagation.

##### 2.7. Hyperparameter Optimization

Hyperparameter optimization refers to the problem of determining the optimal values of hyperparameters, which must be set before training and which determine how well the trained model generalizes. In a deep-learning model, for example, the learning rate and batch size can be regarded as hyperparameters; in some cases, hyperparameters that determine the structure of the model, such as the number of layers and the convolution filter size, are also included as search targets. Common approaches to hyperparameter optimization include manual search, grid search, and random search.

Manual search is a way for users to set hyperparameters individually and compare performances according to their intuition. After selecting the candidate hyperparameter values and performing training using them, the performance results measured against the validation dataset are recorded, and this process is repeated several times to select the hyperparameter values that demonstrate the highest performance. This is the most intuitive method; however, it has some problems. First, it is relatively difficult to ensure that the selected hyperparameter value is actually optimal because the search process is influenced by the user’s selections. Second, the problem becomes more complicated when attempting to search for several types of hyperparameters at once. Because some hyperparameters affect one another, it is difficult to apply an existing intuition to each single hyperparameter.

Grid search is a method of selecting candidate hyperparameter values at regular intervals within a specific range to be searched, recording the performance results measured for each of them, and selecting the hyperparameter values that demonstrated the highest performance (see Hsu et al. [82]). The user determines the search target, the length of the range, the interval, etc., but a more uniform and global search is possible than in manual search. However, the more hyperparameters are searched at one time, the longer the overall search takes, and the search time increases exponentially with the number of hyperparameters.
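The procedure can be sketched in a few lines. The scoring function below is a hypothetical stand-in for "train the model with these hyperparameters and return its validation error"; the hyperparameter names and candidate values are likewise assumptions made for illustration.

```python
import itertools

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid; lower scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_error(learning_rate, units):
    """Hypothetical error surface standing in for a real training run (assumption)."""
    return (learning_rate - 0.01) ** 2 + ((units - 32) / 100.0) ** 2

grid = {"learning_rate": [0.001, 0.01, 0.1], "units": [16, 32, 64]}
best_params, best_score = grid_search(grid, toy_validation_error)
print(best_params, best_score)
```

With three candidate values for each of two hyperparameters, the loop runs nine evaluations; adding a third hyperparameter with three values would raise this to 27, which is the exponential growth noted above.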

Random search (see Bergstra and Bengio [83]) is similar to grid search but differs in that the candidate hyperparameter values are selected through random sampling. This method can reduce the number of unnecessary repetitions and can also test values that lie between the predetermined grid points, so that the optimal hyperparameter value can be determined more quickly. Random search has the disadvantage that unexpected results can be obtained because it tests combinations other than the values set by the user.
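
For comparison, a random search can be sketched as follows. Again, this is only an illustrative sketch under stated assumptions: `random_search`, the sampling ranges, and the toy error function are hypothetical and are not taken from this study.

```python
import random


def random_search(score_fn, space, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best one.

    space: dict mapping each hyperparameter name to a function that
           draws one value from a random.Random instance.
    Returns (best_params, best_score); lower scores are better.
    """
    rng = random.Random(seed)  # fixed seed makes the search reproducible
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in space.items()}
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_error(params):
    # Hypothetical stand-in for training a model and measuring its error.
    return (params["learning_rate"] - 0.01) ** 2 + 1e-6 * (params["units"] - 32) ** 2


# Assumed search ranges: log-uniform learning rate, integer layer width.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "units": lambda rng: rng.randint(8, 128),
}

best_20 = random_search(toy_validation_error, space, n_trials=20)
best_200 = random_search(toy_validation_error, space, n_trials=200)
```

Unlike the grid, the sampled learning rates are not restricted to a few preset values, and the budget `n_trials` is chosen independently of the number of hyperparameters.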

The grid search and random search algorithms are illustrated in Figure 5. In this study, we use the grid search algorithm because it is the simplest and is most widely used for determining optimal hyperparameters [84]. Although a random search can perform much better than grid search for high-dimensional problems, according to Hutter et al. [85], our data are simple time-series data, and the candidate parameter set is limited; thus, we use the grid search algorithm [86, 87]. The experiments were carried out in Python. We implemented the machine-learning algorithms and grid search via “Keras,” “TensorFlow,” and “GmdhPy.”

Figure 5: Grid search and random search algorithms (panels (a) and (b)).

#### 3. Empirical Results

We used 2886 daily time-series data points on CDS term structure from October 2008 to October 2019. Because international financial markets from 2008 to 2011 were unstable, we divided these data into two subperiods and measured the forecasting performance of the five methods on both high-variance and relatively low-variance data. The first training dataset is from 1st October 2008 to 22nd January 2019 (full period), the second one is from 1st October 2008 to 9th September 2011 (subperiod 1), and the third one is from 2nd January 2012 to 22nd January 2019 (subperiod 2). For the full period and subperiod 2, the test dataset is the last 200 days (from 23rd January 2019 to 29th October 2019, test dataset 1) for each maturity; for subperiod 1, it is the last 80 days (from 12th September 2011 to 30th December 2011, test dataset 2). There is a gap between the training sets of subperiod 1 and subperiod 2 because test dataset 2 follows the subperiod 1 training set. All these cases are summarized in Table 3. Summary statistics for the test datasets are provided in Tables 4 and 5. Test dataset 2 has higher standard deviations than test dataset 1. Through this subperiod analysis, we compared the prediction power of the models in a relatively volatile period (subperiod 1) and a less volatile period (subperiod 2). We used grid search to optimize the parameters in RNN, LSTM, SVR, and GMDH and calculated the RMSE, MSE, MAPE, MPE, and MAE to compare the performance of these five models. Figures 6–11 show the performance of the Nelson–Siegel, RNN, LSTM, SVR, and GMDH models with the test datasets for each maturity.
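
The five error measures can be computed as in the sketch below. The definitions follow common conventions and are assumptions here: in particular, errors are taken as actual minus forecast, and MAPE and MPE are expressed in percent. The function name `forecast_errors` and the sample numbers are hypothetical and are not data from this study.

```python
import math


def forecast_errors(actual, forecast):
    """Return MSE, RMSE, MAE, MAPE, and MPE for two equal-length sequences.

    Assumed conventions: error = actual - forecast; MAPE and MPE are in
    percent. Actual values must be nonzero (CDS spreads are positive).
    """
    n = len(actual)
    if n == 0 or n != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - f for a, f in zip(actual, forecast)]
    relative = [e / a for e, a in zip(errors, actual)]
    mse = sum(e * e for e in errors) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(r) for r in relative) / n,
        "MPE": 100.0 * sum(relative) / n,  # signed, so errors can cancel
    }


# Illustrative spreads in basis points (made-up numbers).
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
metrics = forecast_errors(actual, forecast)
```

Because MPE keeps the sign of each error, over- and under-predictions offset one another, so it measures bias rather than accuracy; MAPE and MPE are also scale-free, whereas MSE, RMSE, and MAE grow with the level of the spread.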

Figures 6–11: Performance of the Nelson–Siegel, RNN, LSTM, SVR, and GMDH models with the test datasets for each maturity (multi-panel figures).

Our main findings can be summarized as follows. First, as shown in Figures 6–11, every model provides accurate predictions of the CDS term structure. Figures 12–14 also show that the machine-learning methods have similar accuracy and outperform the Nelson–Siegel with AR(1) model. This indicates that machine-learning models can be applied to forecasting CDS time-series data and that the Nelson–Siegel model fits the CDS term structure as well as the interest rate term structure. Furthermore, GMDH, SVR, and RNN have very similar accuracies in all periods and maturities. Second, in terms of predictive power, the Nelson–Siegel model shows the poorest performance of the five models for all test sets. That is, machine-learning algorithms are more effective in predicting the CDS spread than the Nelson–Siegel model, which is based on the interest rate term structure, an important determinant of CDS spread levels. Third, among the machine-learning methods, GMDH presents the best prediction results: its error is the lowest among the five methods, as shown in Tables 6–8. In addition, we expected LSTM to outperform RNN, but the RNN model slightly outperformed the LSTM model. However, this result remains debatable, as mentioned in the Introduction, because performance comparisons between machine-learning algorithms reach different conclusions in different studies [55, 56, 88–91]. Fourth, the periods with higher standard deviations are generally harder to predict accurately, as seen in Tables 7 and 8. Additionally, the maturities with higher standard deviations are generally harder to predict accurately, as seen in Figures 12–14. The changes in the standard deviation and in the forecasting error are similar for most error measures except MAPE and MPE, as shown in Figure 13.

Figures 12–14: Forecasting error measures of the models across maturities (panels (a)–(f) in each figure).

#### 4. Summary and Concluding Remarks

The purpose of this study is to compare the predictions of the CDS term structure produced by the Nelson–Siegel, RNN, LSTM, SVR, and GMDH models. We determined which of these models is most suitable for predicting time-series data, especially the CDS term structure. The CDS spread is a default risk index for a country or company; hence, this study is useful because it not only identifies the best-performing forecasting model among those compared but also offers a way to predict future risk.

Existing studies on the prediction of CDS term structure and other risk indicators using machine-learning models remain few; most focus on stock price prediction. This study is significant because it demonstrates that various machine-learning models can be applied to other time-series data, and further research on various time-series data using machine-learning models is expected. This study also confirmed that data-driven methods, such as RNN, LSTM, SVR, and GMDH, outperform the model-driven Nelson–Siegel method, which is usually used in analyzing the CDS term structure. The performance of model-driven methods can decline if the data have a significant number of outliers because these methods depend on the assumption that the dataset can be described by a specific formula. In our dataset, the presence of outliers made it difficult to make predictions with model-driven methods. In contrast, data-driven methods were not affected by outliers (see Solomatine et al. [92]) because they rely only on the dataset itself, outliers included. As most data available today have many outliers, it is not surprising that data-driven methods outperform model-driven ones.

Some studies show that linear models such as AR are better than ANNs [93–95] for forecasting time series. However, CDS series data are volatile and not persistent, as shown in Figure 1, so the Nelson–Siegel model based on the AR process performs more poorly than the machine-learning methods. In other words, because of this nonlinearity, machine-learning techniques can be successfully used for modeling and forecasting time series [96–100].
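
To make the role of persistence concrete, the following sketch fits an AR(1) process, x(t+1) = c + φ·x(t), by ordinary least squares and iterates it forward. It is a generic illustration rather than the estimation code of this study; the function names and the noise-free sample series are hypothetical.

```python
def fit_ar1(series):
    """Fit x[t+1] = c + phi * x[t] by ordinary least squares."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(c, phi, last_value, steps):
    """Iterate the fitted recursion forward from the last observation."""
    forecasts, value = [], last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Made-up series generated exactly by x[t+1] = 1 + 0.5 * x[t].
series = [10.0, 6.0, 4.0, 3.0, 2.5, 2.25]
c, phi = fit_ar1(series)
forecasts = forecast_ar1(c, phi, series[-1], steps=2)
```

A coefficient φ close to 1 corresponds to a persistent series, for which the last observation is highly informative about the next one. When φ is well below 1 in absolute value, as in this toy series, the forecasts revert quickly toward the long-run mean c / (1 − φ), so a linear AR forecast carries little information about abrupt, volatile movements.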

Based on the empirical findings given in Section 3, we draw three implications. The first is that the data-driven methods have greater predictive power than the theoretical model, which consists of theoretical variables that influence a financial asset’s price. Of course, the data-driven methods have a much larger number of parameters than the model-driven method and run much more slowly. However, a machine-learning algorithm can be used without prior knowledge, such as the interest rate term structure, to predict the CDS term structure more accurately. Second, we need to improve the existing Nelson–Siegel model. We showed that the machine-learning models outperform the Nelson–Siegel model for all three cases, which implies both that the machine-learning methodologies excel at this task and that there is a factor in the CDS term structure that the Nelson–Siegel model does not reflect. Nelson–Siegel still has room for improvement in its performance, especially in forecasting applications. Third, the performance of all models varied with the data period. In the highly volatile data period (subperiod 1), all models predicted less accurately than in the less volatile data period (subperiod 2). In both approaches, the model performance is not stable when the data are highly volatile. Figure 1 shows that the CDS term structure seems regular from 2012 to 2019 but has some unpredictable points related to the financial turbulence from 2008 to 2011. This unusual volatility is one of the factors that reduced the forecasting performance of all models. Therefore, it is necessary to consider a new approach that can achieve solid forecasting performance regardless of the volatility of the data.

Our findings can help investors and policymakers analyze the risk of companies or countries. The CDS spread is an index that represents the probability of credit default; thus, this study offers a measure to predict future risk. For instance, Zghal et al. [101] showed that CDS can function as a strong hedging mechanism against European stock market fluctuations, and Ratner and Chiu [19] confirmed the hedging and safe-haven characteristics of CDS against stock risks in the U.S. Researchers can also apply machine-learning models to forecast financial risk time-series data.

Future studies should apply this same experiment to datasets other than CDS data, such as the implied volatility surface, to compare the forecasting performance of model-driven and data-driven methods. The implied volatility surface is a fundamental concept for pricing various financial derivatives. Therefore, many researchers have long been working on it, and various models have been developed [102–106]. Because it is a key part of the evaluation of financial derivatives, comparisons of performance between existing volatility models and data-driven models in predicting implied volatility should draw attention from academics and practitioners. GMDH showed the best predictive performance for the CDS term structure used in this study. It is now necessary to verify whether GMDH also performs best for other term structures, such as volatility term structures and yield curves, or for other CDS contracts, for example, corporate CDS and CDS indices. As a possible future study, extended Nelson–Siegel models, such as the regime-switching model [107] and the Nelson–Siegel–Svensson model [108], can be used to forecast the CDS term structure. Whereas the machine-learning algorithms were optimized through grid search, we expect that the forecasting power of the Nelson–Siegel model can be increased by using extended models rather than by optimizing its parameters.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors are grateful to the editor Baogui Xin for the valuable comments which helped to significantly improve this paper. This work was supported by the Gachon University Research Fund of 2018 (GCU-2018-0295) and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (no. 2019R1G1A1010278).