Forecasting models with high-order interaction has become popular in many applications since researchers gradually notice that an additive linear model is not adequate for accurate forecasting. However, the excessive number of variables with low sample size in the model poses critically challenges to predication accuracy. To enhance the forecasting accuracy and training speed simultaneously, an interpretable model is essential in knowledge recovery. To deal with ultra-high dimensionality, this paper investigates and studies a two-stage procedure to demand sparsity within high-order interaction model. In each stage, square root hard ridge (SRHR) method is applied to discover the relevant variables. The application of square root loss function facilitates the parameter tuning work. On the other hand, hard ridge penalty function is able to handle both the high multicollinearity and selection inconsistency. The real data experiments reveal the superior performances to other comparing approaches.

1. Introduction

Sparse representation has attracted a large deal of attention from researchers in different scientific fields including financial engineering and risk management computational biology and machine learning community [1]. For instance, in spectrum data analysis, scientists are exploring how to select few exact frequencies on which the beam energy is concentrated to avoid lower leakage problem [2]. The main goal of demanding sparsity is to discover quite a few numbers of relevant variables from a large pool of candidate variables which are essential when establishing an interpretable forecasting model. A model with high precision, low model complexity, and sensitivity to the model parameters can not only avoid the expensive cost of excessive supply but also reduce the time cost, service interruption, and resource waste [3]. Furthermore, researchers have noticed that a linear additive model using main effects only cannot provide accurate forecasting results. This motivates them to turn their attentions to high-order interaction model which considers interaction terms between variables. Interaction model has advantage over a linear model which does not consider possible relationship between variables. In microarray data analysis, for example, biologists are particular interested in gene-gene interactions and gene-environment interactions than a single gene because interactions of single nucleotide polymorphisms (SNPs) are more important in cancer diagnosis [4]. However, new studies have also revealed the interactive effect of multiple environment factors. For example, a child with a poor quality environment would be more sensitive to a poor environment as an adult which ultimately led to higher psychological distress scores. This depicts a three-way interaction term between gene and environment. Furthermore, interaction with multiple genes is also considered to be essentially important in the molecular analysis [5]. This motivates us to consider a high-order interaction model which includes both two-way and three-way interaction terms in this paper:where is the intercept term which vanishes after centering, represents the Hadamard product of , , and which denotes the three-way interaction term. The noise term follows Gaussian distribution. , , and are corresponding coefficients for main effects, two-way, and three-way interaction terms, respectively. Obviously, this high-order interaction model is more complicate than two-way interaction model.

Although high-order interactions play an important role in depicting the nonlinear relationship among variables, sparse representation of high-order interaction model is quite challenging since the number of interaction terms is huge (of order or even ) if there are main effects in the model. For instance, even with 102 variables in the model, a high-order interaction model has to consider 104 or 106 variables which make variable selection procedure more difficult. Furthermore, computation cost is another issue because of the huge number of variables in the high-order interaction model. Even worse, severe multicollinearity exists among variables caused by the high dimensionality. There is a large body of literature on sparse representation methods. Subset selection methods based on L0 type penalty have been discovered for decades. These methods including Akaike information criterion (AIC) [6], Bayesian information criterion (BIC) [7], and mallow’s [8] which are NP hard since all the possible combinations of variables are taken into account. Namely, if there are m variables, subset models have to be checked. To improve the computational efficiency, sparse representation approaches based on convex optimization are proposed. For instance, Tibshrani (1996) proposed the least absolute shrinkage and selection operator (LASSO) which applies L1 type penalty to pursue sparsity. In wavelet study, it is famous as basis pursuit method [9]. It is more computational efficient than subset selection methods. Efron (2004) studied the least-angle regression (LAR) which solved the same problem as LASSO but in a forward selection manner. They have shown that the solution path of LAR is pretty similar as that of LASSO. Yuan and Lin (2006) investigated group LASSO method which selected group variables based on “all in and all out” principle. To overcome the shortcoming of LASSO and further improve the selection accuracy, Zou (2006) studied adaptive LASSO using weights for each variable to adjust the sparsity degree. Friedman (2012) summarized and provided a review of sparse regression and classification methods [10].

However, convex optimization has drawbacks since it is difficult to discover all the relevant variables if the model coherence is sufficiently high. To overcome the difficulties, nonconvex penalties including smoothly clipped absolute deviation (SCAD) [11], minimax concave penalty (MCP) [12], and hard ridge (HR) [13, 14] are proposed to boost the accuracy. Ye et al. (2017) studied sparse least squares support vector machine with Lp norm function applied to select the relevant support vectors [15]. To facilitate parameter tuning work and increase the forecasting accuracy, Jiang (2018) studied square root nonconvex optimization (SRNO) and showed its superior performance [2] via simulation and real data studies.

Sparse representation in the two-way interaction model has been studied for years. Bach (2008) explored hierarchical multiple learning and studied its selection property [16]. Zhao et al. (2009) proposed composite absolute penalty (CAP) which considers a mixture of Lp and Lq norm functions which are given by [17]. Choi et al. (2010) studied SHIM method considering . They also designed a nonconvex optimization to perform sparsity representation. However, the number of features is only 10 which is pretty small. Thus, there is no guarantee that their method can work in the high dimensional setup [18]. Bickel et al. (2010) used a two-step algorithm to purse sparsity. In the first step, LASSO was applied and backward procedure was used in the second step [19]. Wu et al. (2010) proposed a stagewise procedure which screened the main effects in the first step and the selected variable will be cleaned based on t-test statistic [20]. Radchenko and Jame (2010) investigated VANISH constructing the main effects and interactions from two small sets of orthogonal functions [21]. Bien et al. (2013) studied hierarchical LASSO (HL) which considered a L1 optimization problem using a nonconvex constraint [22]. She et al. (2014) studied GRESH method which enforced sparsity to main effects and two-way interactions simultaneously [23]. Hao et al. (2014) screened the main effects using forward regression in the first step and fitted the quadratic model with selected main effects and associated interactions [24]. Simon and Hastie (2015) proposed “Glinternet” to discover the two-way interaction terms between categorical variables and continuous variables [25]. Yan and Bien (2017) advocated VARXL method to implement sparse representation in a high dimensional time series data [26]. To the best of our knowledge, most of variable selection methods mentioned above only consider two-way interaction terms. Few studies have been done on the high-order interaction model with both two-way and three-way interaction terms. To tackle this issue, this paper explores sparse representation in a high-order interaction model. The main contributions of this paper are given as follows:(i)A two-stage square root hard ridge method (TSSRHR) for sparse representation in high-order interaction model is proposed.(ii)In computation, a fast and simple-to-implement algorithm is designed.(iii)In theory, we have shown the prediction error bound and estimation error bound of the TSSRHR estimate.(iv)Real data examples are shown to demonstrate the efficiency and superior performance of the proposed method over other existing competitors.

The rest of this paper is organized as follows. Section 2 represents the general framework of TSSRHR. Section 3 designs a fast and simple-to-implement algorithm with theoretical guarantee of its convergence and optimality. The theory part of TSSRHR is given in Section 4. In Section 5, real data analyses are shown to exhibit superior performance of the proposed method over other approaches. The conclusion of this paper is given in Section 6.

2. Two-Stage Square Root Hard Ridge Method

We investigate and study the square root hard ridge (SRHR) method which combines the benefits of square root loss function and nonconvex hard ridge penalty function for high-order interaction selection. We describe the two-stage SRHR algorithm at first. At stage 1, only main effects are selected by SRHR with all of high-order interactions including two-way and three-way interactions kept out. The selected variables are put in the set A. At stage 2, the selected set in stage 1. is expanded by adding both associated two-way and three-way interactions within . SRHR is applied again to implement variable selection task.(i)Stage 1. Define which represents all the main effects. Select the important effects using SRHR method by considering the following optimization problem:where and are two regularization parameters which are supposed to be determined. The index for selected variables are given in and the corresponding solution path is given by .(ii)Stage 2. Update . Select the important main effects, two-way and three-way interaction using SRHR method again by considering the following problem (for notation simplicity, and are denoted by and , respectively):where denotes the indicator function, represents the Cartesian product, and and are two regularization parameters which need to be selected carefully. The solution path is provided by and the selected variable are given in .

A penalized sparse representation method considers the sum of two components: the loss function and the penalty function. The loss function often applies the negative log-likelihood of the error and in Gaussian setup, it is a squared error loss function which involves the noise level which is difficult to estimate. Specifically, in a Gaussian linear regression setting , the negative likelihood function of is , where is a function of and is a constant. SRHR uses square root error loss function which can facilitate the parameter tuning work by avoiding estimating the noise level or using other forms of estimates [27]. Hard ridge penalty has two parts; it is obvious that the first part which is L0 norm penalized the number of nonzero elements using the regularization parameter . The second part is L2 type ridge penalty. It can compensate the shrinkage given by the L0 norm and is able to handle the high collinearity of the high-order interaction model. Therefore, SRHR combines the advantages of square root loss function and hard ridge penalty.

The choices of regularization parameters (, and ) play an important role in balancing a model’s in sample fit and out-of-sample forecasting ability. We grid starting from the largest value such that all the coefficients are shrunk to zeros. Then let the smallest value be a small proportion of , that is, and the grid values are generated between and exponentially. For , the same strategy is also applied. The optimal values for and are obtained from a grid of values .

To choose the optimal regularization parameters, extended Bayesian information criterion (EBIC) [28] and high dimensional Bayesian information criteria (HDBIC) [29] are applied in this paper. Particularly, EBIC is defined asand HDBIC is defined aswhere and are represent the cardinalities of and . In the first stage, and will be selected by the EBIC since large number of main effects are considered. It worth mentioning that different from [24], we did not apply BIC in the second stage. Although the number of variables is reduced dramatically, we still use HDBIC which is a high dimensional version of BIC because both two-way and three-way interaction terms are considered which still cause the dimensions of the model to be high.

3. Algorithm

In this section, for notation simplicity, two design matrices are designed as follows:

andFurthermore, let . Clearly, represents the design matrix for two-way interaction terms, denotes the design matrix for three-way interaction terms, and is the overall design matrix for main effects, two-way interaction terms, and three-way interaction terms. For the coefficients to be estimated, let and be the coefficient matrix for two-way and three-way interaction terms, respectively. Then (1) can be written aswhere for any . Define . To design an algorithm for TSSRHR, we consider the following optimization problem in the first stage:and in the second stage, we considerTo solve (9) and (10) accurately and efficiently, threshold rule is defined at first.

Definition (threshold rule). A threshold rule is a real-valued function defined for and such that(1);(2) for ;(3); and(4) for .

From the definition, it is easy to see that is an odd monotone unbounded shrinkage rule for , at any . A vector version of is defined component-wise if either or is replaced by a vector. Now define the HR thresholding rule, which is closely associated with the HR penalty, as where , are two regularization parameters. Using HR thresholding function, (9) and (10) can be solved based on the following iterations:andwhere is a vector version of HR thresholding function with replaced by a vector.

Algorithm 1 exhibits the details of TSRHR algorithm with , , , and defined. Consider the nonconvex optimization is not easy to solve using ADMM algorithm [30]. HR thresholding function is applied in the algorithm. The basic structure of this algorithm follows coordinate decent algorithm. The error tolerance of tol1, tol2 and maximum number of iterations m1 and m2 are determined based on trial and error. Usually, the default values of , and and are 1e-4, 1e-4, 500, and 500, respectively. TSSRHR algorithm has some advantages. Firstly, it does not involve any matrix inversion. Secondly, is applied to enforce elementwise sparsity. The step sizes and are able to guarantee the convergence of the algorithm. The theoretical results show that and provide satisfactory results. The step sizes do not depend on line search which takes a lot of computation time. The TSSRHR algorithm is simple to implement and computational efficient. To further speed up the convergence, a fast iterative shrinkage thresholding algorithm (FISTA) [31] is used to obtain the TSSRHR estimate. The following theorem provides the theoretical guarantee of convergence of Algorithm 1.

Inputs: the dataset
: maximum number of iterations used in the SRHR algorithm of the first stage;
: maximum number of iterations used in the SRHR algorithm of the second stage;
: the error tolerance used in the SRHR algorithm of the first stage;
: the error tolerance used in the SRHR algorithm of the second stage;
: scale parameter used in the SRHR algorithm of the first stage;
: scale parameter used in the SRHR algorithm of the second stage;
Output: The forecasting test error and selected pattern;
Randomly divide the original data into the training dataset and
test dataset .
The first stage using SRHR algorithm:
Generate grid values of and .
Initialization: ,
while or do
Step 1.
Step 2.
Step 3.
Step 4:
end while
end for
end for
Obtain the solution path and the corresponding sparsity patterns based on
EBIC criterion.
The second stage using SRHR algorithm:
Generate the high-order interaction model and based
on the sparsity pattern .
Generate grid values of and .
while or do
Step 1.
Step 2.
Step 3.
Step 4:
end while
end for
end for
Obtain the solution path and update the sparsity patterns using
HDBIC criterion.
Calculate the test error using test dataset

4. Prediction and Estimation

Notice that does not follow the triangle inequality but follows the following inequality for any vectors . This poses critically challenges to the derivation the forecasting and estimation error bounds. Namely, the ways of performing error rate analysis on square root lasso [27] are not applicable to SRHR because of its nonconvexity. Therefore, a novel approach is urgently needed to derive the forecasting and estimation error bounds.

Assumption 1 (overfitting assumption). Suppose that , then does the tasks as the Ordinary Least Square (OLS) estimator .

Assumption 2 (weak decomposability assumption). A norm in is called weakly decomposable for an index set , if there exists a norm on such thatFurthermore, a set is called allowed set if is a weakly decomposable norm for this set.

We can show that hard ridge penalty is weakly decomposable as follows: define and . Then we can obtain weak decomposability for any set This is due to the weak decomposability property of the penalty and the penalty.

Definition 3 (M-effective sparsity). For an allowed set of a norm , and a constant, the eigenvalue is defined as Then the M-effective sparsity is defined asM-effective sparsity is a variant of regularity condition used to establish the nonasymptotic oracle inequality, such as Restricted Eigenvalue condition [32], Mutual Coherence condition [33], Compatible Condition [34], Cone invertibility factor [35], and Restricted invertibility factor [36]. With Assumptions 1 and 2 satisfied, based on the Theorem 1 in [37], we can obtain the forecasting and estimation error bounds of the TSSRHR estimator in the following remark.

Remark. Chooseandwhere with and . Under the overfitting assumption and weak decomposability for , the following inequalities hold with probability andwhere , and are some positive constants.

This remark provides prediction error bound and estimation error bound for the TSSRHR estimate. The proof of remark follows Theorem 1 and Corollaries 1 and 2 in [37] directly. Thus, the details are omitted here.

5. Real World Data

In this section, twinsine signal data including 8 scenarios and microarray gene expression data about inbred mouse are treated as real data examples to test the performances of sparse representation methods. It worth mentioning both of these two real data examples are ultra-high dimensional data when three-way interaction terms are considered. Comparisons will be made between the proposed two-stage hard ridge (TSSRHR) and other two-stage sparse representation methods including two-stage LASSO (TSLASSO), two-stage SCAD (TSSCAD), and two-stage square root lasso (TSSRL) in terms of forecasting accuracy, sparsity, and computational efficiency. The unknown parameters in TSLASSO and TSSCAD are determined using 10-fold cross validation. TSSRL selects the unknown parameter based on a theoretical choice: where is the number of variables and is the cumulative distribution function of Gaussian distribution. Here we remind that TSSRHR selects the regularization parameters by EBIC and HDBIC (cf. Section 2). The forecasting model will be trained with 75% of the samples and the remaining 25% will be used to calculate the test errors. The entire procedure will be repeated for 150 times and the median of evaluation criteria including number of nonzero elements (NZ), mean square error (MSE), root mean square error (RMSE), and computational cost (CC) will be reported.

5.1. Twinsine Signal Data

Spectral estimation studying the distribution over frequencies is widely applied in speech coding and radar solar signal processing. Nevertheless, it is critically difficult and challenging when there are a large number of frequencies which causes high resolution problem. This type of problem is called super resolution spectral estimation. To avoid the power leakage and noisy observation caused by the super resolution, a novel method is urgently needed to choose the important frequencies which plays a dominant role in super-resolution estimation. The twinsine signal is generated from the target function:where , , , , and follows white Gaussian noise with variance given by . The frequence resolution is chosen as 0.002HZ to detect and distinguish the two sinusoidal components. for and the frequency atoms are given by and . In our experiments, we will consider the following 8 different scenarios:(i)Scenario 1: (ii)Scenario 2: (iii)Scenario 3: (iv)Scenario 4: (v)Scenario 5: (vi)Scenario 6: (vii)Scenario 7: (viii)Scenario 8: (ix)Scenario 9: (x)Scenario 10:

The results of forecasting, variable selection, and computation are summarized in Table 1 and Figures 1, 2, 3, and 4. It is not difficult to observe that TSSRHR obtains the most accurate outcome in all the scenarios including the most challenging scenario when and . TSSRL outperforms TSLASSO and TSSCAD method by delivering lower MSE and its forecasting results are comparable with TSSRHR in scenario 6. From the prospective of model sparsity, TSSRL achieves the most interpretable model followed by TSSRHR. TSLASSO and TSSCAD tend to select a large number of variables which can be found in all the scenarios except scenario 5 and scenario 6. For instance, TSLASSO selects 31458 variables and TSSCAD chooses 29113 variables in scenario 8. A large number of nuisance variables are included in the models given by TSLASSO and TSSCAD which will affect the forecasting results negatively. In terms of computational efficiency, TSSCAD spends more computational time in scenarios 1-2 and scenarios 5-6 when the dimensions are moderate. As the dimensions increase, all of the forecasting approaches take more computation cost. For instance, TSSRL took 1215.93 seconds and 1529.38 seconds in scenarios 4 and 8 when . TSLASSO runs fastest by using the convex penalty. However, notice that when D is large, although TSSRL also uses the convex penalty, it is not computational efficient since the design of the TSSRL is based on COORD algorithm which runs much slower than coordinate descent algorithm. In scenario 9 and 10 where the sample sizes are large, TSSRHR still performs better than other methods. Same as other scenarios, TSSRL gives the most sparse model with fewer variables. Also notice that the train speed of TSSRHR is acceptable. Therefore, TSSRHR outperforms other forecasting methods in terms of forecasting accuracy, model sparsity, and computation speed.

5.2. Microarray Gene Expression Data

The inbred mouse microarray gene expression dataset with 31 female mice and 29 male mice is considered to demonstrate the effectiveness of the proposed TSSRHR method. The gene expression values of 22,690 genes are measured by each array and the response gene is a continuous phenotypic variable measured by stearoyl-coenzyme desaturase (scd1) with probe set ID 1415965 at. The remaining 22,689 probe set IDs are regarded as candidate variables. We are seeking a good model which is able to describe the relationship between the response variable and other candidate genes. The performances of forecast models on inbred mouse microarray gene expression data are shown in Table 2. It worth mentioning that a gene filtering preprocessing procedure is applied to the inbred mouse data to reduce the expensive computational cost. TSSRHR delivers the best forecasting performances by giving the lowest MSE and RMSE followed by TSSRL, TSLASSO, and TSSCAD. TSLASSO selects a model with 6311 three-way interaction terms while TSSCAD only selects 3 variables which is fewer than those of TSSRL and TSSRHR. This can be verified in Figure 5 which shows the selection pattern of the compared methods. The computational cost of TSSRL is almost 4 times of TSSRHR. Comparing with TSSRHR and TSSCAD, TSLASSO and TSCAD spend less computational cost. Overall, TSSRHR provides the lowest forecasting error with sufficient model sparsity and acceptable computational time.

6. Conclusion

Two-way interaction model has been widely used in numbers of scientific fields. However, researchers find that three-way interaction terms play more important role than two-way interaction terms when establishing forecasting model. To this end, this paper studies high-order interaction model which has both two-way and three-way interaction terms. To establish an interpretable model and reduce the computation time, sparse representation method using a two-stage square root hard ridge (TSSRHR) is proposed and studied. Based on the property of two-stage method, the number of interactions will decrease dramatically so that the computational issue is solved. To implement SRHR in each stage, a simple and efficient algorithm is designed. Furthermore, the prediction and estimation error bounds of TSSRHR are also exhibited using two regularity assumptions. For real data analysis, a twinsine signal data and microarray gene expression data are provided to show the forecasting and selection performances of TSSRHR comparing with other stagewise sparse representation methods. A hierarchical variable selection problem on the high-order interaction model will be conducted for the future research.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.


This research was supported by the National Natural Science Foundation of China (Grants no. 71861012 and no. 71761016), the Natural Science Foundation of Jiangxi, China (Grants no. 20181BAB211020 and no. 20171BAA218001), and the China Postdoctoral Science Foundation (Grants no. 2017M620277 and no. 2018T110654).