#### Abstract

Financial data are not only characterized by time-domain correlations but are also heavily influenced by numerous market factors. In stock price analysis, the prediction of short-term movements is of great interest to investors and traders. In this paper, we consider forecasting price movements with ensemble machine learning models, which is generally viewed as a challenging task due to the noise inherent in the data and the uncertainties in the various forms of financial information related to stock prices. To enhance the accuracy of trend predictions, we propose to use wavelet packet decomposition (WPD) and kernel-based smoothing techniques to remove high-frequency noise from the data, and we then perform feature engineering on the denoised data to obtain a comprehensive list of multidimensional technical features. Subsequently, we employ the light gradient boosting machine (LGBM) algorithm to classify the direction of the price change that occurs over ten trading days. Numerical results on the Shanghai composite index show that the proposed approach has noticeable advantages over traditional statistical and machine learning methods when predicting near-term price trends.

Index terms: ensemble machine learning, feature correlation, financial data, LGBM, wavelet denoising.

#### 1. Introduction

Analyses of the stock market have received the attention of numerous traders and researchers. Specifically, the stock price forecast provides an important reference for setting a trading strategy or determining the appropriate timing of a transaction. Various theories and numerical techniques have been applied to the stock market for decades, seeking to analyze the laws governing price movements. Changes in the direction of price trends inevitably depend on a large number of factors, such as positive and negative news, company profiles, historical prices, and risk tolerances [1]. It is almost impossible to construct an all-encompassing model incorporating the aforementioned factors. Moreover, the stock market itself is affected by transient and ubiquitous incidents involving individual companies and by external incidents [2], such as diplomatic issues, and is impacted by random noise from market participants who hold different perspectives on the economic outlook. In addition, personalized features, such as investors’ sentiments, individual risk-bearing capacities, and even trading days or dates, also significantly affect the stock market, which makes trend prediction a highly complex task [2].

Based on the generalized target, stock price prediction can be categorized into the tasks of classification and regression, respectively. The regression model obtains the estimated stock price directly, while the classification model produces the probability of the increment or the decrement over a certain time span. The resulting trading strategy is based on the rise or fall of the predicted price and hence provides traders with recommendations to buy or sell, respectively [3].

In the earlier stages of financial data analytics, researchers resorted to a number of statistical and classical learning methods for forecasting [2], e.g., the support vector machine (SVM), multiple linear regression (MLR), the extra-trees (ET) algorithm, and the autoregressive moving average (ARMA) model. In reference [1], Asghar et al. predicted closing prices on the Karachi Stock Exchange (KSE)-100 index dataset based on a multilayered machine learning model. In reference [4], the authors used the ARMA model to make a forecast on the New York Stock Exchange (NYSE). In reference [5], the ARMA-GARCH algorithm was optimized with the aim of finding near-optimal parameters that deliver the best returns for traders. However, due to the highly fluctuating and nonstationary nature of the stock market, statistics-based models are not effective in tackling the volatility and correlation structure in the data.

In recent years, the forecasting task has benefited significantly from the development of machine learning approaches, which has led to notable advances on numerical benchmarks. In reference [6], a support vector regression (SVR) algorithm is used to predict short-term returns. In reference [7], Nabipour et al. provide a comprehensive analysis and comparison of various models and evaluate their performance on the Tehran Stock Exchange (TSE). Their experimental results show significant improvements in short-term return rates when machine learning models are used for binary classification tasks rather than for regression on continuous data. In reference [8], a deep learning-based zero-inflated model is designed to analyze financial data characterized by irregularly spaced time points. In reference [9], a fluctuation prediction model is presented to form trading strategies by processing a synthetic combination of online news, financial capacities, and social interest indicators.

While it is impossible to take into consideration all relevant multimodal data affecting the stock market, we believe that the impacts of these factors are manifested quantitatively in the numerical features of candlestick charts, which are also known as K-lines and are typically used to represent both short-term and long-term fluctuations of stock prices. Hence, we construct a machine learning model by drawing on empirical technical indicators and by deriving a set of signal characteristics as the model inputs. In this paper, we propose to use the state-of-the-art light gradient boosting machine (LGBM) model [10] to realize a binary classification task, i.e., to predict whether the closing price in ten trading days rises or falls relative to the corresponding value on the current date. The LGBM is an ensemble tree-based machine learning framework and has key advantages over other models in that it exploits the sparsity of training data and is also viewed as an interpretable model [11, 12]. The novelties of the proposed approach in this paper are as follows.

(1) We use advanced signal processing methods, including wavelet filtering, to remove high-frequency noise components inherent in trading data and improve classification accuracies.

(2) We combine financial technical indicators with domain-specific signal characteristics to form a comprehensive list of features.

(3) Through numerical evaluations based on real-world market data, we demonstrate that the proposed approach achieves much higher accuracies than classical statistical models and conventional machine learning models by resorting to effective data preprocessing and feature extraction techniques.

The rest of the paper is organized as follows: in Section 2, the proposed approach that incorporates wavelet filtering and feature engineering is presented. A brief description of the LGBM model is also included. In Section 3, we present numerical results on the Shanghai Stock Exchange (SSE). In Section 4, the conclusion is drawn.

#### 2. Proposed Architecture

In this paper, we first present an introduction of the wavelet transform to preprocess stock data. Subsequently, we extract technical indicators to obtain representative information related to analyzing stock prices. Inspired by financial indicators originating from mechanical engineering, we obtain a set of features typically used for analyzing machine vibration signals. The extracted features are combined, normalized, and fed into a lightGBM model to make a binary classification on the rise or fall of the closing price over an interval of ten trading days.

##### 2.1. Wavelet Denoising

The wavelet transform can be applied to remove high-frequency noise components from time sequences in signal processing. The set of wavelet functions [13] is derived from a wavelet function *h*(*t*), which is dilated by *a* = 2^{m}, translated by *b* = *k*2^{m}, and normalized by 1/√*a* [2]:

$$h_{m,k}(t) = \frac{1}{\sqrt{a}}\, h\!\left(\frac{t-b}{a}\right) = 2^{-m/2}\, h\!\left(2^{-m} t - k\right),$$

where *m* and *k* are defined by solving an expansion equation as indicated in [14, 15]. When the sequence *x*(*n*) has *N* = 2^{s} values, its expansion can be evaluated as described in [2].

In this paper, we choose the Haar wavelet as the wavelet function after extensive numerical experiments, where the basis function *h*_{k}(*z*) is defined as in [2] and the index *k* is obtained by *k* = 2^{p} + *q* − 1.

The original signal can be decomposed by the wavelet transform to obtain the approximation components and the detail components, which are subsequently thresholded for the purpose of reconstructing a denoised version of the original signal. The decomposition procedure may be viewed as a multiresolution analysis [16, 17] and involves the following steps. First, we construct the transformation basis by using a scaling function and a wavelet function, which are defined by [18]

$$\varphi_{j,k}(t) = 2^{j/2}\, \varphi\!\left(2^{j} t - k\right), \qquad \psi_{j,k}(t) = 2^{j/2}\, \psi\!\left(2^{j} t - k\right),$$

where *j* denotes the dilation or the visibility in frequency and *k* specifies the position.

To obtain the wavelet decomposition of a signal, we need to ensure that the scaling function of the signal is orthogonal to its translated variant. Moreover, the subspaces that are obtained based on spanning the scaling function at low scales are required to be nested within those obtained at higher scales.
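The decompose, threshold, and reconstruct procedure described above can be sketched in a few lines. This is a minimal illustration with the orthonormal Haar wavelet, not the implementation used in the experiments: the function names, the soft-thresholding rule, and the `levels` and `threshold` arguments are assumptions introduced here, since the text does not state the thresholding scheme.

```python
import math

def haar_step(x):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Invert one Haar level, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(v, t):
    """Shrink a coefficient toward zero by t (assumed thresholding rule)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def haar_denoise(x, levels, threshold):
    """Decompose, threshold the detail coefficients, and reconstruct.

    len(x) must be divisible by 2**levels.
    """
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append([soft_threshold(v, threshold) for v in d])
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx
```

With `threshold=0` the input is reconstructed exactly, which is a convenient sanity check; larger thresholds progressively suppress the high-frequency detail coefficients.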

##### 2.2. Financial Technical Indicators

Figure 1 shows the daily candlestick chart of the stock with the code 000001.SZ ranging from April to May 2019. The chart is a type of financial representation that shows the price action for an investment market. It consists of specific candlesticks that denote the opening, closing, and high and low prices each day over a given time interval, which makes it more useful than traditional lines that simply connect the dots of closing prices.

Moreover, the chart can be used for identifying trading patterns that help technical analysts to establish trading modes. From a pragmatic perspective, candlestick patterns can be formed by grouping two or more candlesticks in a certain fashion. The pattern trend provides an intuition for predicting the direction of the price movements. In Figure 1, the rectangular part of the daily candlestick is called the real body; it shows the link between the opening and closing prices and represents the price gain or loss for the specified period. The thin lines above and below the real bodies are called shadows, also referred to as wicks, and show the highest and lowest prices in the trading session. A real body filled in red means that the price closed higher than its opening price, while green indicates that the session closed lower. Although the candlestick can be interpreted using a variety of methods, the relationship between the opening and closing prices is considered the most vital information on price movements. In particular, the closing price is generally considered the most important indicator to assist a trader in forming a short-term trading strategy.

Over the past years, researchers have developed a large number of technical indicators based on the statistics of the candlestick chart to analyze price fluctuations. The moving average convergence and divergence (MACD) indicator is regarded as one of the best-known trend-following momentum oscillators and represents quantitatively the relationship between moving averages (MA) of the closing price. Mathematically, the standard MACD is calculated based on the difference (DIF) between the fast (typically 12-day) and slow (26-day) exponential moving averages (EMA). The time periods used for the calculation can be changed to accommodate a trader’s specific targets or a particular type of trading. The EMA [19] is a type of MA that places a larger weight on the most recent data points and therefore responds sensitively to near-term price changes [20, 21].

The MACD histogram is a mathematical tool to evaluate the signed distance between the MACD line and its signal line, where the signal line is the 9-day EMA of the DIF and is also known as the divergence exponential average (DEA). The calculations of the MACD are given as follows [2]:

$$\mathrm{EMA}_{x}(n) = \frac{2}{x+1}\, C_{n} + \frac{x-1}{x+1}\, \mathrm{EMA}_{x}(n-1),$$

$$\mathrm{DIF}(n) = \mathrm{EMA}_{12}(n) - \mathrm{EMA}_{26}(n),$$

$$\mathrm{DEA}(n) = \frac{2}{10}\, \mathrm{DIF}(n) + \frac{8}{10}\, \mathrm{DEA}(n-1),$$

where *x* is the number of days and *C*_{n} is the closing price on the *n*th day. Typically, traders use the MACD histogram to anticipate changes in the market momentum. For instance, when DIF and DEA are both positive and the MACD line crosses above the signal line, the resulting uptrend divergence produces a buy suggestion. When DIF and DEA are both negative and the MACD line crosses below the signal line, the negative divergence produces a sell recommendation.
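As an illustration of the MACD quantities discussed above, the following sketch computes the DIF, DEA, and histogram from a list of closing prices. It is a sketch under stated assumptions: the function names are hypothetical, the EMA is seeded with the first observation, and the histogram is taken as DIF minus DEA without additional scaling (some charting conventions multiply this difference by two).

```python
def ema(values, span):
    """Exponential moving average with smoothing factor 2 / (span + 1).

    Seeding with the first value is an assumption; other seeds are common.
    """
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd(close, fast=12, slow=26, signal=9):
    """Return (DIF, DEA, histogram) for a list of closing prices."""
    dif = [f - s for f, s in zip(ema(close, fast), ema(close, slow))]
    dea = ema(dif, signal)                     # signal line
    hist = [d - e for d, e in zip(dif, dea)]   # signed distance DIF - DEA
    return dif, dea, hist
```

For a steadily rising price series the fast EMA stays above the slow EMA, so the DIF is positive, whereas a flat series gives zeros throughout.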

As the MACD alone may generate false predictions, experienced traders rely on complementary trend measurement indicators. A commonly used indicator is the KDJ index [21], which is otherwise known as the random index. It is a practical technical indicator that is commonly used in short-term trend analysis. It is derived from the stochastic oscillator but differs from the latter by including an extra J line. The values of the K and D lines show whether a stock is overbought or oversold, while the J line represents the divergence of the D line from the K line [22]. The indicator incorporates price levels and accounts for the amplitude of fluctuations in the prices. The three indices K, D, and J are calculated as in [21, 23], where *n* denotes the *n*th trading day and *C*_{n} denotes the closing price on the *n*th day. Note that *H*_{n} and *L*_{n} denote the highest and lowest prices within *n* days, respectively.
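The KDJ computation can be sketched as follows. The recursion shown here is the commonly used form (a raw stochastic value smoothed with weights 2/3 and 1/3, initial K and D of 50, and J = 3K − 2D over a 9-day window); these constants and the function name are assumptions, since the exact definition adopted from [21, 23] is not reproduced in the text.

```python
def kdj(high, low, close, window=9):
    """Return (K, D, J) lists; the window length and seeds are assumed."""
    k_prev, d_prev = 50.0, 50.0
    ks, ds, js = [], [], []
    for i in range(len(close)):
        start = max(0, i - window + 1)
        hh = max(high[start:i + 1])   # highest high in the window
        ll = min(low[start:i + 1])    # lowest low in the window
        # Raw stochastic value; fall back to 50 for a flat window.
        rsv = 100.0 * (close[i] - ll) / (hh - ll) if hh > ll else 50.0
        k = 2.0 / 3.0 * k_prev + rsv / 3.0
        d = 2.0 / 3.0 * d_prev + k / 3.0
        ks.append(k)
        ds.append(d)
        js.append(3.0 * k - 2.0 * d)
        k_prev, d_prev = k, d
    return ks, ds, js
```

Because J amplifies the gap between K and D, it can leave the 0 to 100 range, which is why it is read as a divergence measure rather than as a bounded oscillator.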

Another popular technical indicator is the relative strength index (RSI), which is typically employed in technical analysis to evaluate the magnitude of price changes. Hence, it is feasible to use this indicator to estimate whether the trading condition is overbought or oversold. The RSI measures both the speed and the rate of change of price movements. RSI values are typically estimated over a 14-day period, fluctuate between zero and 100, and can be obtained as [24]

$$\mathrm{RSI} = 100 - \frac{100}{1 + \mathrm{Up}/\mathrm{Down}},$$

where Up and Down represent the upward and downward movements in terms of the closing price, respectively. Other indicators include the on-balance volume (OBV) to describe changes in volume, the Williams %R to show the current closing price relative to the high and low prices of the past time period, and the price channels to identify an upward thrust that signals the start of an uptrend.
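A short sketch of the RSI follows. It uses plain sums of the upward and downward closing-price moves over the window, which is one of several conventions (Wilder's exponential smoothing is another); the function name and the value of 50 returned for a completely flat window are assumptions.

```python
def rsi(close, period=14):
    """RSI over a rolling window of `period` price changes."""
    out = []
    for i in range(period, len(close)):
        window = close[i - period:i + 1]
        diffs = [b - a for a, b in zip(window, window[1:])]
        up = sum(d for d in diffs if d > 0)
        down = sum(-d for d in diffs if d < 0)
        if up + down == 0:
            out.append(50.0)  # flat window: neutral value (assumption)
        else:
            # Algebraically equal to 100 - 100 / (1 + up / down).
            out.append(100.0 * up / (up + down))
    return out
```

The first value is available only after `period` price changes have been observed, so the output is `period` elements shorter than the input.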

The calculation of the abovementioned five most popular financial indicators provides us with numerical metrics to quantitatively characterize the moving trend of stock prices. Furthermore, inspired by reference [24] that effectively extended the mechanical engineering concept RSI to the field of financial data analysis, we perform extensive numerical evaluations of the features that are specifically used in the empirical modeling of mechanical vibration signals [25, 26] and select the following twenty indicators based on the criterion of performance optimization, as shown in Table 1. Each feature reflects an attribute of a time sequence since the vibration signal and the stock price can be both viewed as time sequences and bear much resemblance with each other. Certain mechanical features place more significance on time-domain properties such as magnitudes and energy differences; while the others indirectly represent frequency-domain properties in terms of the zero crossing rate and the position change of frequency bands.
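To make the idea of vibration-style features concrete, the sketch below computes a few descriptors that are widely used for mechanical signals. The selection is illustrative only: the twenty indicators of Table 1 are not reproduced here, and the function name and the particular features (root mean square, peak-to-peak value, crest factor, and zero-crossing rate of the mean-removed signal) are assumptions.

```python
import math

def vibration_features(x):
    """Illustrative time-domain descriptors of a price window x."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    # Count sign changes of the mean-removed signal between neighbours.
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return {
        "rms": rms,
        "peak_to_peak": max(x) - min(x),
        "crest_factor": peak / rms if rms > 0 else 0.0,
        "zero_crossing_rate": crossings / (n - 1),
    }
```

The first three features describe magnitude and energy, while the zero-crossing rate is a cheap proxy for how quickly the series oscillates around its mean.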

Combining financial indicators with mechanical signal characteristics, we form a comprehensive list of extracted features as the inputs to the proposed model. As various features differ noticeably in magnitudes, we apply the min-max scaling to perform the data standardization for each feature.
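Min-max scaling maps each feature to the range from 0 to 1 by subtracting its minimum and dividing by its range. A minimal sketch is given below; the function name and the handling of a constant column are assumptions.

```python
def min_max_scale(column):
    """Scale a list of feature values to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

For example, `min_max_scale([10, 20, 30])` gives `[0.0, 0.5, 1.0]`. To avoid leaking information, the minimum and maximum should be taken only from data available at prediction time, for instance from the current sliding window.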

In the numerical experiments, a 30-day sliding window is used to form the model inputs and the corresponding binary targets. Considering that we have a two-dimensional array of extracted features for each target, we further extract four basic statistics, i.e., the maximum, minimum, mean, and standard deviation of the input features, as shown in (9) and (10), to reduce the model input to a one-dimensional (1-D) vector for each target [27].
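The reduction from a two-dimensional window of features to a 1-D vector can be sketched as follows. The function name, the ordering of the statistics within the output vector, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def window_statistics(window):
    """Collapse a window (rows = days, columns = features) to a 1-D vector.

    For each feature column, emit max, min, mean, and standard deviation.
    """
    vector = []
    for col in zip(*window):
        vector.extend([
            max(col),
            min(col),
            statistics.fmean(col),
            statistics.pstdev(col),  # population std; sample std is also common
        ])
    return vector
```

A window with F features therefore yields a vector of length 4F regardless of the window length, which keeps the model input size fixed.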

In financial analysis, it is worth mentioning that the date is deemed an important feature. For instance, the date of the mutual fund redemption tends to result in market volatilities. To fully exploit this feature, we convert trading dates into monthly, weekly, and daily variables, respectively, by resorting to the one-hot encoding technique, thus turning these categorical variables into numerical vectors.
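One possible reading of the date encoding described above is sketched here: the month of the year, the day of the week, and the day of the month are each one-hot encoded and concatenated. This interpretation, the function names, and the restriction to five weekdays are assumptions.

```python
from datetime import date

def one_hot(index, size):
    """Return a list of zeros of length `size` with a one at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_date(d):
    """Concatenate one-hot vectors for month, weekday, and day of month."""
    month = one_hot(d.month - 1, 12)
    # Monday=0 ... Friday=4; trading days are assumed to exclude weekends,
    # so a weekend date raises IndexError here.
    weekday = one_hot(d.weekday(), 5)
    day = one_hot(d.day - 1, 31)
    return month + weekday + day
```

Each date becomes a 48-element binary vector with exactly three ones, which can be appended to the numerical features of the same sample.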

Finally, we have a tabular form of processed data, where each row corresponds to a number of derived features to the model, which are composed of technical indicators, mechanical characteristics, and date variables, as well as a binary target, to be predicted to signify the rise or fall of the closing price in ten trading days as compared with the current date.
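The binary target can be constructed directly from the closing prices. The sketch below assumes that a label of 1 denotes a rise and 0 denotes a fall or no change; the function name and this tie-breaking rule are assumptions.

```python
def make_labels(close, horizon=10):
    """Label day i with 1 if close[i + horizon] > close[i], else 0.

    The last `horizon` days have no label because their future price
    is not yet known.
    """
    return [
        1 if close[i + horizon] > close[i] else 0
        for i in range(len(close) - horizon)
    ]
```

Note that the label list is `horizon` elements shorter than the price series, so the feature rows for the final ten trading days must be dropped before training.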

##### 2.3. LGBM Model

The decision tree model [28] seeks to maximize the differences between class probabilistic distributions. By building a tree structure that satisfies division conditions, samples are classified and predicted based on the optimized model. The gradient boosting decision tree (GBDT) algorithm is a family of boosting tree models seeking to improve the performance of the decision tree model by using the technique of classification and regression trees (CART). By initializing a weak classifier *f*_{0}(*x*), the negative gradient of the loss function with respect to the current prediction can be obtained for each sample [29].

Hence, the obtained residual is used as the updated ground-truth value of the sample, and the training data can be updated for the next decision tree. Following this procedure, the final learner is obtained by calculating the best fitting value for each leaf region and repeatedly updating the strong learner.
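The residual-fitting loop described above can be illustrated with a toy implementation. For simplicity, this sketch uses the squared loss, for which the negative gradient equals the ordinary residual, one-dimensional inputs, and single-split stumps as weak learners; the function names, the learning rate, and the number of rounds are assumptions and do not correspond to the LGBM implementation.

```python
def fit_stump(x, r):
    """Best single-split regression stump on 1-D inputs (squared error).

    Requires at least two distinct values in x.
    """
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def gradient_boost(x, y, rounds=50, lr=0.1):
    """Fit an additive model by repeatedly fitting stumps to residuals."""
    f0 = sum(y) / len(y)          # initial weak learner: the mean
    pred = [f0] * len(y)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f.
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return lambda v: f0 + lr * sum(s(v) for s in stumps)
```

Each round shrinks the remaining residual by a factor governed by the learning rate, so the additive model approaches the targets gradually rather than in one step.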

In order to find a suitable split point, the GBDT algorithm needs to scan all data subsets. Hence, it has an excessively slow computing speed and incurs a large amount of memory. To overcome these limitations, an improved memory-efficient version, i.e., the gradient boosting machine (GBM) algorithm is proposed in [10].

Despite the popularity of deep learning in recent years, the GBM algorithm generally performs better in the tasks of analyzing tabular data. In many scenarios, this algorithm is preferred in practical implementations due to its interpretability, fast convergence, and possibility to incorporate modularly domain-specific prior knowledge. It uses the additive models of weak learners to optimize a specific loss function and fine-tune hyper-parameters based on the gradient descent algorithm. Specifically, there are two categories of powerful GBM algorithms, i.e., the extreme GBM (XGB) and the LGBM models. Both algorithms obtain performance comparable to deep-learning-based convolutional neural networks (CNN) in data analytic tasks.

In the training process, the XGB algorithm traverses the dataset multiple times and generally displays a much slower convergence behavior [30]. In contrast, the LGBM model distributes computations across multiple nodes and employs a parallelized hierarchical learning approach to derive inherent patterns from large-scale data. For completeness, we illustrate the abovementioned process in Figure 2. Specifically, the LGBM algorithm performs the evaluation on a subset of training data to obtain an entropy metric, based on which a nearly optimal segmentation can be made. Hence, it effectively achieves a reduction in terms of memory usage, communication costs, and the computational resources needed to obtain gains for tree splits. The theory shows that this method does not suffer from a loss of accuracy and is capable of achieving good performance on large datasets with a significant reduction in training time as compared with the XGB algorithm.

Figure 3 shows that the construction of the LGBM follows a leaf-wise approach, reducing more training losses than the conventional level-wise algorithms [30]. When growing on an equivalent leaf, the leaf-wise algorithm optimizes the target function more efficiently than the level-wise algorithm and leads to better classification accuracies, which can rarely be achieved by other boosting algorithms. In this paper, we use the LGBM model incorporating a depth-limited leaf growth strategy in numerical experiments. A comparison with the XGB algorithm is also made by evaluating their accuracies on typical stocks across four industry sectors.

Figure 4 shows the schematic flowchart of the proposed approach. The algorithm consists of the wavelet filtering module to denoise the raw data.

Based on the filtered sequence, we proceed to derive empirically validated financial technical indicators including MACD, KDJ, RSI, and OBV. Furthermore, we extract a number of mechanical features typically used in the analysis of machine vibration signals to reflect both time-domain and frequency-domain properties. To achieve numerical stabilities in experiments, we perform the standardization of the derived features over a sliding window of 30 trading days and normalize all values to the range of 0 to 1.0 based on the min-max scaling. The target of the prediction is formed by calculating the difference between the current close price and the closing price obtained over the period of ten trading days, and thus generating a positive or negative binary label depending on the rise or fall of the trend. Finally, we apply the random-search technique to optimize the LGBM parameters and use the optimized model to predict the trend of the closing price. After an extensive search for optimized parameters, we set the initial learning rate to 0.05, the number of leaves to 120, and the maximum depth of the LGBM tree to 6.
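The reported settings can be written as a parameter dictionary. The keys below follow LightGBM's documented parameter names (`objective`, `learning_rate`, `num_leaves`, `max_depth`), but the dictionary itself and the helper function are an illustrative sketch rather than the configuration file used in the experiments.

```python
# Hypothetical parameter dictionary mirroring the values reported in the text.
params = {
    "objective": "binary",   # rise/fall classification
    "learning_rate": 0.05,
    "num_leaves": 120,
    "max_depth": 6,
}

def max_leaves_for_depth(depth):
    """A binary tree limited to `depth` levels of splits has at most 2**depth leaves."""
    return 2 ** depth

# The tighter of the two limits determines how large each tree can grow.
effective_leaf_cap = min(params["num_leaves"], max_leaves_for_depth(params["max_depth"]))
```

Under these values the depth limit allows at most 64 leaves per tree, so, assuming the depth limit is enforced as documented, it is the depth rather than the leaf count of 120 that bounds the tree size.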

#### 3. Numerical Experiments

We use Tushare [31] to retrieve stock price datasets across the real estate, coal, electric power, and cement segments in the Shanghai Stock Market. Tushare is a convenient tool for the retrieval, cleansing, and storage of financial data due to its simple application programming interface (API) and short response time. Moreover, it is equipped with a set of ready-to-use visualization functions to check the stock price data, which are often susceptible to data errors or outliers.

A typical metric used to perform the evaluation on classification models is the confusion matrix, which shows the errors (i.e., confusions) among different classes. The results of correct classifications are displayed on the diagonal of the matrix, while incorrect results are expressed as off-diagonal entries. Based on the confusion matrix, we further calculate a simple metric, i.e., accuracy, as the percentage of correct predictions out of the total number of samples. Accuracy is recognized as the most widely used empirical metric in the literature. Hence, we use it in this paper to benchmark the performance of various algorithms.
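For the binary case, the confusion matrix and the accuracy can be computed as in the sketch below, where rows index the true class and columns index the predicted class. The function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix m[true][pred] for labels in {0, 1}."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def accuracy(m):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return (m[0][0] + m[1][1]) / sum(sum(row) for row in m)
```

Because accuracy ignores how the errors are distributed, it is most informative when the two classes are roughly balanced; the off-diagonal entries show whether the model errs mainly on predicted rises or on predicted falls.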

Of utmost importance in the task of trend prediction is eliminating any possibility of data leakage, i.e., using validation data in the training procedure and thus obtaining an unreasonably high, though attractive, accuracy. To ensure a fair comparison with other algorithms, we form strictly non-overlapping subsets of training data and validation data. That is, the data ranging from January 2015 to January 2019 are used to train the model, while validation is conducted on the financial data over the interval from February 2019 to September 2019 [23].
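A chronological split along these lines can be sketched as follows. The exact boundary days, the function name, and the row layout are assumptions, since only the months are given. The sketch additionally drops training samples whose ten-day-ahead label date falls beyond the training period; this is one conservative way to keep the label computation from touching validation prices, and it is not stated whether the experiments apply it.

```python
from datetime import date

TRAIN_END = date(2019, 1, 31)     # assumed last day of the training period
VALID_START = date(2019, 2, 1)    # assumed first day of the validation period
VALID_END = date(2019, 9, 30)     # assumed last day of the validation period

def chronological_split(rows):
    """rows: list of (sample_date, label_date, payload) tuples.

    label_date is the day whose closing price determines the target.
    """
    train = [r for r in rows if r[1] <= TRAIN_END]
    valid = [r for r in rows if VALID_START <= r[0] <= VALID_END]
    return train, valid
```

Since the label date is never earlier than the sample date, filtering the training rows by label date guarantees that neither the features nor the targets of the training set depend on validation-period prices.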

For illustration purposes, Table 2 shows the calculation of certain financial indicators by taking a sliding window over 30 trading days. It is noted that various indicators have a vastly different and dynamic range, which implies that normalization is a necessary operation to ensure the stability of the model in the training process and also enable the model to learn discriminant features across various domains.

To show quantitative results, Table 3 shows the accuracies of the predicted trends by the proposed method, which are numerically evaluated on a number of stocks belonging to the coal industry.

Table 3 shows that the proposed approach obtains an accuracy of nearly 70% for most companies in the coal industry by employing financial technical indicators and signal-domain features. This accuracy is considered impressive for a short-term prediction task. In Figure 5, we graphically present the prediction results of several stocks, e.g., 000552.SZ, 000937.SZ, and 600714.SH, for illustration purposes, where colored triangles denote the trend of the closing price. The date on which a triangle is drawn represents the current trading date. Specifically, red triangles denote a rise that is predicted to occur over the specified interval, while green triangles denote a predicted decline. In Table 4, we present the predictions for the real estate industry. It is of interest to observe that the proposed approach obtains better results than in the coal industry. Figure 6 shows a visual presentation of several stocks. It is shown that the trends are predicted accurately on most trading dates prior to abrupt changes in the curve.

Similarly, Tables 5 and 6 show the prediction accuracies over the electric power and the cement industry, while the graphs of individual stocks are presented in Figures 7 and 8, respectively. By presenting the numerical results of the proposed model across four industries, we have established a benchmark to compare the performance with other models including the conventional SVM and RF models as well as the powerful XGB model. A comparison with deep-learning CNN and attention-based transformer models would be considered as future work.

On the same test set, we perform classification based on the SVM, random forest (RF) [32], ARMA, and autoregressive integrated moving average model (ARIMA). For a fair comparison, we have generated the same set of features when compared with the conventional machine-learning models such as SVM and RF. For statistical models, note that we have to access the test set so as to predict the trend of closing prices due to the requirements of the models. Table 7 shows the predicted accuracies on the test set across four industries. We have included various ensembled models in the comparisons, i.e., the RF model that is composed of random trees and an ensembling approach to average the prediction probabilities of the RF and the SVM models. The XGB model [33], which is generally viewed as a powerful ensembled learning algorithm, is also evaluated on the validation set. It is shown that the proposed LGBM model based on a combination of domain-specific financial statistics and signal features performs very well in this binary classification task. By ensembling a large number of leaf-wise growing trees, the proposed approach results in a noticeable increase of the forecasting accuracies, e.g., 8% for the real estate industry and nearly 6% for the cement and coal industries, respectively, as compared with the XGB model.
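The ensembling baseline that averages the prediction probabilities of two models can be sketched as follows. The function name, the equal weighting, and the 0.5 decision threshold are assumptions.

```python
def average_probabilities(p_a, p_b, threshold=0.5):
    """Average two models' rise probabilities and threshold the result."""
    avg = [(a + b) / 2.0 for a, b in zip(p_a, p_b)]
    return [1 if p >= threshold else 0 for p in avg]
```

For instance, probabilities of 0.9 and 0.7 from the two models average to 0.8 and yield a predicted rise, whereas 0.6 and 0.3 average to 0.45 and yield a predicted fall.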

Table 7 also shows that the ARMA model performs better than the SVM model and approaches the accuracies of the proposed method on the real estate and coal industries. However, the ARMA model performs much worse when evaluated on the other two industries. The ARIMA model does not deliver good performance across these four industries and is not considered a suitable candidate for short-term predictions, which may be attributed to the fact that it eliminates the influence of fluctuation trends by including a differencing operation in the computation. It is worth mentioning that the proposed method does not access the validation data at all and hence has better generalization capabilities. It effectively uses a combination of financial indicators and mechanics-specific signal features, which are obtained based on the training data only. In contrast, we have to resort to the validation data in constructing the ARMA model in the short-term trend analysis.

#### 4. Conclusion

In this paper, we proposed a novel method to perform price trend prediction based on the LGBM model and a variety of feature engineering techniques. The wavelet transform is used to filter high-frequency noise from time sequences, thus alleviating instabilities inherent in the financial data analysis. Furthermore, we proposed to derive multidimensional features as inputs to the model based on domain-specific technical indicators and the expertise on the mechanical signal analysis. The derived features enable the model to deliver significantly better performance as compared with statistical models and conventional machine-learning algorithms. The proposed model, however, still requires a computationally intensive optimization of LGBM hyper-parameters. For future work, we will investigate the ensembling of tree-based models and CNN models. Transformer architectures that incorporate long-range attention mechanisms will also be studied for sequence-to-sequence prediction tasks.

#### Data Availability

The stock price data used to support the findings of this study are included within the article and cited as reference [31].

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The study was funded by the Weihai Beiyang Electrical Group Co. Ltd., China, and the Shandong University, China.