Abstract

Succinylation is an important posttranslational modification of proteins, which plays a key role in regulating protein conformation and controlling cellular function. Many studies have shown that succinylation of protein lysine residues is closely related to the occurrence of many diseases. To understand the mechanism of succinylation profoundly, it is necessary to identify succinylation sites in proteins accurately. In this study, we develop a new model, IFS-LightGBM (BO), which utilizes the incremental feature selection (IFS) method, the LightGBM feature selection method, the Bayesian optimization algorithm, and the LightGBM classifier to predict succinylation sites in proteins. Specifically, pseudo amino acid composition (PseAAC), position-specific scoring matrix (PSSM), disorder status, and Composition of k-Spaced Amino Acid Pairs (CKSAAP) are firstly employed to extract feature information. Then, the combination of the LightGBM feature selection method and the incremental feature selection (IFS) method is used to select the optimal feature subset for the LightGBM classifier. Finally, to increase prediction accuracy and reduce the computational load, the Bayesian optimization algorithm is used to optimize the parameters of the LightGBM classifier. The results reveal that the IFS-LightGBM (BO)-based prediction model performs better when evaluated by common metrics, such as accuracy, recall, precision, Matthews Correlation Coefficient (MCC), and F-measure.

1. Introduction

Posttranslational modification (PTM) is the chemical modification of a precursor protein after translation, such as the addition of a small molecule or the introduction of a functional group, so that the inactive precursor protein can obtain biological functions. There are many forms of posttranslational modifications of proteins, such as ubiquitination, glutarylation, sumoylation, palmitoylation, acetylation, and methylation. Succinylation is a PTM that occurs on lysine. Lysine is an α-amino acid encoded by the codons AAA and AAG and is easily modified [1]. Succinylation is a broadly conserved protein posttranslational modification that exists in prokaryotic and eukaryotic cells and can coordinate various biological processes [2–4]. Compared with the methylation and acetylation that occur on lysine, succinylation causes more substantial changes in the chemical structure of lysine [5]. Moreover, succinylated proteins are involved in a variety of cell functions, including metabolism and epigenetic regulation.

Some studies have shown that abnormalities and variations of succinylation are associated with the pathogenesis of many diseases, including tumors [6–10], cardiac metabolic diseases [11, 12], liver metabolic diseases [13], and nervous system diseases [7, 14, 15]. Therefore, understanding succinylation and identifying succinylation sites will help determine the pathogenesis of related diseases and develop targeted drugs [16].

Nowadays, many biological experimental methods have been developed to detect succinylated proteins or succinylation sites, such as high-performance liquid chromatography assays, spectrophotometric assays, and radioactive chemical labeling [17, 18]. However, detecting protein succinylation through experiments is arduous work, which inevitably wastes a great deal of time and money. Machine learning, in contrast, can perform a large number of experiments in a short time and is not affected or restricted by external conditions, which is helpful for recognizing succinylation sites. Zhao et al. [19] built a predictor called SucPred based on SVM using position amino acid weight composition, normalized van der Waals volume, grouped weight-based encoding, and autocorrelation functions. Using SVM, Xu et al. [20] developed iSuc-PseAAC, which implemented a pseudo amino acid composition (PseAAC) scheme. Xu et al. [21] then developed another predictor named SuccFind, which considered several amino acid-based composition encodings, including amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), and amino acid index (AAindex). Hasan et al. [1] proposed the SuccinSite predictor with the RF classifier by integrating multiple sequence features. Ning et al. [22] built a predictor called PSuccE based on SVM using amino acid composition (AAC), binary encoding (BE), physicochemical property (PCP), and grey pseudo amino acid composition (GPAAC).

Although many methods for predicting succinylation sites based on machine learning have been proposed, there is still much room for improvement. Firstly, it is not yet clear which protein sequence features are most informative for predicting succinylation sites. Secondly, multi-information fusion produces high-dimensional feature vectors, which introduce redundant and noisy information; an efficient feature selection method is therefore urgently needed to rank the importance of features and select the best feature subset. Finally, advances in experimental technology have produced a large amount of succinylation data, so designing effective prediction algorithms that make full use of these experimental data is necessary.

In this study, we propose an IFS-LightGBM (BO) prediction framework based on machine learning to identify succinylation sites of lysines in protein sequences. Four sequence-based features are firstly used to represent the peptides: (1) pseudo amino acid composition (PseAAC), (2) position-specific scoring matrix (PSSM), (3) disorder status, and (4) Composition of k-Spaced Amino Acid Pairs (CKSAAP). Secondly, the LightGBM method is used to prioritize the 2501-D feature vector obtained from these four sequence-based features. Thirdly, we combine the incremental feature selection (IFS) method and machine learning methods to perform effective feature fusion and determine the optimal feature subset, which eliminates the noise and redundant information in the original feature vector. Finally, the Bayesian optimization (BO) algorithm is used to optimize the parameters of the LightGBM classifier. This work provides not only a better understanding of the sequence characteristics of protein succinylation modification but also an effective algorithm for directly predicting succinylation sites in proteins.

2. Materials and Methods

2.1. Dataset

In this work, the training data were collected from dbPTM [23, 24] (http://dbptm.mbc.nctu.edu.tw/index.php), which integrates published literature, public resources, and a total of eleven biological databases related to PTMs. We obtained 2599 protein sequences, from which 5049 experimentally verified lysine succinylation sites and 5526 nonsuccinylation sites were retained. A window size of 2n + 1 was used to extract the corresponding peptide fragment centered on lysine (K): the central position ("+1") is a lysine (K), and n was equal to 10, meaning that 10 AA (amino acid) residues were selected from the upstream and downstream of the lysine; finally, a polypeptide segment with a length of 21 was obtained. Among them, the positive samples took the succinylated residue as the central site.
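As an illustration of this windowing step, the following is a minimal Python sketch; the example sequence, the function name extract_fragments, and the handling of lysines near the termini (simply skipped here, whereas a real pipeline might pad them) are illustrative assumptions rather than details from the original work.

```python
# Minimal sketch of extracting 21-residue fragments centered on lysine (K),
# assuming n = 10 residues are taken upstream and downstream; lysines closer
# than n to either terminus are skipped here (a real pipeline might pad them).
def extract_fragments(sequence, n=10):
    fragments = []
    for i, residue in enumerate(sequence):
        if residue == "K" and i >= n and i + n < len(sequence):
            fragments.append((i, sequence[i - n:i + n + 1]))  # 2n + 1 = 21 residues
    return fragments

# Example with an arbitrary sequence: each fragment has length 21 with K at its center.
print(extract_fragments("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAK"))
```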

2.2. Description and Representation of Samples

Each polypeptide segment must be encoded as a numeric vector and input into the model as features. This is the most critical step in building an effective predictive model. Therefore, it is necessary to use high-quality sequence encoding methods to generate features that can effectively predict succinylation sites. In this study, we use four types of amino acid feature encoding methods: CKSAAP, disorder, pseudo amino acid composition, and PSSM.

2.2.1. CKSAAP Encoding

The CKSAAP is one of the most classic encoding methods and has been widely used in bioinformatics tasks [1, 25–31]. In this study, we use polypeptide segments of length 21. Take AxA as an example, whose space number (gap) is equal to 1. For a polypeptide segment composed of the 20 basic amino acids (i.e., A, R, D, C, ..., W, Y, V), the residue pairs that need to be extracted for a given k are AA, AR, AD, ..., VV (each pair separated by k residues); namely, there is a total of 400 (20 × 20) amino acid pair types.

The following formula is used to calculate the feature vector:

$f_{pair} = \frac{N_{pair}}{N_{total}}, \quad pair \in \{AA, AR, AD, \ldots, VV\}$

where $N_{pair}$ represents the number of amino acid pairs with a distance of $k$ in the fragment and $N_{total}$ is the total number of k-spaced residue pairs in the fragment. In this study, the optimal $k$ is set as 4 (i.e., $k = 0, 1, 2, 3, 4$); then $N_{total}$ will be 20, 19, 18, 17, and 16, obtained by calculating $N_{total} = 21 - k - 1$. Thus, a polypeptide segment with a length of 21 is transformed into a 2000-dimensional (400 × 5) AA composition feature vector.
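A minimal sketch of this encoding for a single fragment is shown below, assuming k ranges over 0–4 as in this study; the function name and the decision to skip non-standard residues are illustrative choices, not specifications from the original work.

```python
from itertools import product

# Minimal sketch of CKSAAP encoding for one 21-residue fragment with k = 0..4,
# producing 400 pair-type frequencies per gap (2000 features in total).
AAS = "ARNDCQEGHILKMFPSTWYV"
PAIRS = ["".join(p) for p in product(AAS, repeat=2)]  # AA, AR, ..., VV (400 types)

def cksaap(fragment, k_max=4):
    features = []
    for k in range(k_max + 1):
        counts = {pair: 0 for pair in PAIRS}
        n_total = len(fragment) - k - 1          # 20, 19, 18, 17, 16 for length 21
        for i in range(n_total):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:                   # skip pairs with non-standard residues
                counts[pair] += 1
        features.extend(counts[pair] / n_total for pair in PAIRS)
    return features

vec = cksaap("IAKQRQISFVKSHFSRQLEER")
print(len(vec))  # 2000
```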

2.2.2. Disorder

One of the important indicators of the degree of protein structure is its intrinsic disorder. A large number of research results show that the intrinsic disorder of proteins plays a very important role in the prediction of protein structure and function [32–34]. In order to characterize this property, we use the VSL2B [35] program to predict the disorder score. By running the tool, two types of results are obtained, namely, qualitative and quantitative. The quantitative result is a value in the range [0, 1]. It is generally believed that 0.5 is the boundary value for distinguishing order from disorder: if the score exceeds 0.5, the residue is considered disordered, otherwise ordered. For a polypeptide segment with a length of 21, a 21-dimensional disorder feature vector is eventually obtained.

2.2.3. Pseudo Amino Acid Composition (PseAAC)

In order to avoid complete loss of sequence-order information, the PseAAC [36] was proposed by Chou. PseAAC is an extended form of AAC, and it can identify salient hidden information [37–39]. In this work, Type-1 PseAAC is used as the dominating sequence representation. Let $P$ be a polypeptide segment with the length of $L$, and let $R_i$ be the $i$-th residue of $P$:

$P = R_1 R_2 R_3 \cdots R_L$

The Type-1 PseAAC is a function of $P$, which produces a $(20 + \lambda)$-D vector. The mathematical formulation of Type-1 PseAAC is

$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \tau_j}, & 1 \le u \le 20 \\[2ex] \dfrac{\omega \tau_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \tau_j}, & 20 < u \le 20 + \lambda \end{cases}$

where the first 20 features are obtained based on the composition of a given polypeptide segment, and the hydrophobicity, side-chain mass, and hydrophilicity of AAs are used to calculate the remaining $\lambda$ features [38]. The sequence correlation factor $\tau_j$ can be computed by Equations (4) and (5):

$\tau_j = \frac{1}{L - j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \quad j = 1, 2, \ldots, \lambda$

$\Theta(R_i, R_k) = \frac{1}{3}\left\{\left[H_1(R_k) - H_1(R_i)\right]^2 + \left[H_2(R_k) - H_2(R_i)\right]^2 + \left[M(R_k) - M(R_i)\right]^2\right\}$

where $f_u$ reflects the occurrence frequency of the 20 standard amino acids, $\tau_j$ is the $j$-th tier sequence correlation factor, $\omega$ is the weight balance parameter for the sequence order effect, and $\lambda$ is the lag parameter. $H_1(R_i)$ and $H_2(R_i)$ are the normalized values of hydrophilicity and hydrophobicity of residue $R_i$, respectively, and $M(R_i)$ is the normalized value of its side-chain mass. Therefore, with $\lambda = 20$, each polypeptide segment is transformed into a 40-dimensional (20 + λ) AA composition feature vector.
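The following is a minimal sketch of Type-1 PseAAC under the definitions above. The normalized hydrophilicity, hydrophobicity, and side-chain mass tables are placeholders (zeros) that must be replaced with the real normalized property values, and the weight ω = 0.05 is an assumed default rather than a value stated in the text.

```python
import numpy as np

# Minimal sketch of Type-1 PseAAC. H1, H2, M should hold the normalized
# hydrophilicity, hydrophobicity, and side-chain mass of the 20 standard
# amino acids; zeros are used here only so the snippet runs on its own.
AAS = "ARNDCQEGHILKMFPSTWYV"
H1 = {aa: 0.0 for aa in AAS}  # normalized hydrophilicity (placeholder)
H2 = {aa: 0.0 for aa in AAS}  # normalized hydrophobicity (placeholder)
M  = {aa: 0.0 for aa in AAS}  # normalized side-chain mass (placeholder)

def theta(r1, r2):
    """Correlation function Theta(R_i, R_k) over the three properties."""
    return ((H1[r2] - H1[r1]) ** 2 + (H2[r2] - H2[r1]) ** 2 + (M[r2] - M[r1]) ** 2) / 3.0

def type1_pseaac(seq, lam=20, w=0.05):
    """Return the (20 + lam)-dimensional Type-1 PseAAC vector of `seq`."""
    L = len(seq)
    freqs = np.array([seq.count(aa) / L for aa in AAS])                 # f_1..f_20
    taus = np.array([sum(theta(seq[i], seq[i + j]) for i in range(L - j)) / (L - j)
                     for j in range(1, lam + 1)])                       # tau_1..tau_lam
    denom = freqs.sum() + w * taus.sum()
    return np.concatenate([freqs / denom, w * taus / denom])

vec = type1_pseaac("IAKQRQISFVKSHFSRQLEER")
print(len(vec))  # 40
```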

2.2.4. PSSM (Position-Specific Scoring Matrix)

PSI-BLAST [40] is used to search against the SWISS-PROT database to obtain protein evolution information, where the BLAST local protein database is an authoritative protein database established by the University of Geneva and the European Bioinformatics Institute (EBI) [41]. By running the PSI-BLAST tool, two matrices can be obtained. The first one is the position-specific scoring matrix (PSSM), which represents the conservation scores of the 20 standard AAs occurring at specific sequence positions during evolution. The second one is the position-specific frequency matrix (PSFM), which contains the frequency of occurrence for a given amino acid at a specific sequence position. For the PSSM, we expand it into a 440-dimensional vector (21 × 20 + 20) in the following way. Firstly, the 21 × 20 PSSM of each polypeptide segment is expanded row by row into a 420-dimensional vector:

$F_{\text{row}} = \left[P_{1,1}, P_{1,2}, \ldots, P_{1,20}, P_{2,1}, \ldots, P_{21,20}\right]$

In addition, 20 features can be extracted from the PSSM by averaging each column:

$\bar{P}_j = \frac{1}{L} \sum_{i=1}^{L} P_{i,j}, \quad j = 1, 2, \ldots, 20$

where $L$ is the length of each polypeptide segment and $P_{i,j}$ are the values in the matrix. The features of the PSSM are achieved by the integration of the 420 row-expanded features and the 20 column-averaged features.
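A minimal sketch of the two PSSM encodings (row expansion and column averaging) is given below; the random matrix stands in for a real PSI-BLAST output, and the variable names are illustrative.

```python
import numpy as np

# Minimal sketch of the two PSSM encodings used here, assuming `pssm` is the
# 21 x 20 matrix produced by PSI-BLAST for one polypeptide segment.
pssm = np.random.randn(21, 20)       # stand-in for a real PSSM

row_expanded = pssm.flatten()        # 21 x 20 = 420 features, expanded by row
col_averaged = pssm.mean(axis=0)     # 20 features, averaged by column

pssm_features = np.concatenate([row_expanded, col_averaged])
print(pssm_features.shape)           # (440,)
```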

We combine the four characterizations to obtain a 2501-D feature vector (2000 + 21 + 40 + 440) from each polypeptide segment. The distribution of the 2501 features is listed in Table 1. Then, several feature selection methods will be used to rank the 2501-dimensional features according to their importance.

2.3. Feature Selection Method

Before models are developed, it is necessary to select the optimal feature subset through feature selection, because the dataset may contain features that are unrelated to the target value or interfered with by noise; reducing the dimension of the feature space further decreases the risk of overfitting. At the same time, the generalization performance and the prediction ability of the models can be further improved when irrelevant features are removed ahead of training. In this article, the LightGBM feature selection method is used to select the independent variables that form the optimal feature subset.

LightGBM is an efficient GBDT implementation designed by Microsoft Research Asia [42]. The LightGBM algorithm provides two importance types: one is "split" and the other is "gain." "Split" reflects the number of times a feature is used in a model; when building the boosted trees, the important features are used more frequently, and the rest are used to improve on the residuals. In this study, the importance type is "gain." Different from "split," "gain" measures the actual decrease in node impurity. The gain-based feature ranking can be obtained after LightGBM fitting [43], as sketched below.
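The following minimal sketch shows gain-based feature ranking through LightGBM's scikit-learn interface; the random X and y are stand-ins for the real 2501-dimensional fused features and succinylation labels.

```python
import numpy as np
import lightgbm as lgb

# Minimal sketch of gain-based feature ranking; X and y below are random
# stand-ins for the 2501-D fused features and the 0/1 succinylation labels.
X = np.random.rand(200, 2501)
y = np.random.randint(0, 2, 200)

ranker = lgb.LGBMClassifier(importance_type="gain", n_estimators=100)
ranker.fit(X, y)

# Features sorted by decreasing gain importance form the feature list F.
feature_list = np.argsort(ranker.feature_importances_)[::-1]
print(feature_list[:10])
```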

2.4. Incremental Feature Selection Method

The combination of the IFS method and a feature selection method is helpful to select the optimal feature subset. First, the feature selection method is used to construct the feature list. Specifically, each feature selection method first generates the importance scores of all features according to its calculation criteria, then arranges the features in descending order of importance score, and finally constructs a corresponding feature list based on the importance scores, where the feature list is denoted as F = [f1, f2, ..., fN]. Next, incremental feature selection (IFS) [44] makes a series of feature subsets. IFS is a process of building an increasing number of feature subsets by gradually adding features. In this study, the incremental step length of the IFS method is set to 1, thereby constructing a series of feature subsets which can be denoted as S1, S2, ..., SN, in which the feature subset Si is constructed using the i features ranked first in the feature list; for example, S2 has the top 1 and top 2 features, namely, S2 = {f1, f2}. Then, the generated feature subsets are sequentially input into the classifier, which is evaluated using the 10-fold cross-validation method. Finally, when the fitness function of the predictor reaches its maximum value on a certain feature subset, that feature subset is taken as the best feature subset. A minimal sketch of this procedure is given below.
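The sketch below assumes X, y, and feature_list from the previous snippet; scoring="f1" stands in for the F-measure fitness used in this study, and the function name and max_features limit are illustrative.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Minimal sketch of IFS with step length 1: evaluate the top-k feature subset
# S_k with 10-fold cross-validation and keep the subset with the best fitness.
def incremental_feature_selection(X, y, feature_list, max_features=None):
    best_score, best_k = -np.inf, 0
    limit = max_features or len(feature_list)
    for k in range(1, limit + 1):
        subset = feature_list[:k]                       # top-k features S_k
        clf = lgb.LGBMClassifier(n_estimators=100)
        score = cross_val_score(clf, X[:, subset], y, cv=10, scoring="f1").mean()
        if score > best_score:
            best_score, best_k = score, k
    return feature_list[:best_k], best_score
```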

2.5. LightGBM

LightGBM has been successfully applied in the fields of classification, ranking, and many other ML tasks [43, 45–48]. It is an ensemble model based on decision trees. In order to reduce memory usage and increase training speed, LightGBM uses the Histogram Algorithm, which discretizes every feature (continuous floating-point feature values) in the dataset into a small number of bins. These bins are then used to construct histograms whose width equals the number of bins. Once all the samples are traversed, each histogram has accumulated the required statistics, namely the sum of gradients and the number of samples in each bin. After accumulating these statistics, the optimal segmentation point can be found based on the maximum gain obtained when the bins are divided into two parts.

Besides the Histogram Algorithm, LightGBM proposes gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), and leaf-wise growth to further improve computational efficiency without hurting accuracy. When finding the best split in GOSS, the absolute values of the samples' gradients are firstly used to sort the data instances. Then, the data instances with higher gradients are preserved, and instances with low gradients are randomly sampled. Finally, GOSS calculates the variance gain of feature $j$ at splitting point $d$ to determine the best split node:

$\tilde{V}_j(d) = \frac{1}{n}\left[\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}\right]$

where $A$ is the instance set with high gradients and $B$ is the instance set with low gradients ($A_l$, $B_l$ and $A_r$, $B_r$ denote their parts falling to the left and right of the split), $g_i$ is the gradient of each sample, $a$ is the sampling ratio of large gradient data, $b$ is the sampling ratio of small gradient data, and $n$ is the number of data instances.

Furthermore, EFB can speed up the training process of GBDT via bundling exclusive features into a single “big” feature. Compared with the level-wise growth strategy, the leaf-wise growth strategy will choose the leaf with max delta loss to grow, which can reduce more errors and obtain better accuracy under the same splitting times.
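As a concrete illustration, the sketch below shows how GOSS and leaf-wise growth are exposed through LightGBM's scikit-learn interface; the parameter values are illustrative only, not the tuned values reported later in this paper, and the boosting_type="goss" spelling follows the LightGBM 2.x/3.x interface consistent with the Python 3.6 environment used here.

```python
import lightgbm as lgb

# Minimal sketch of a LightGBM classifier with GOSS sampling and leaf-wise growth.
clf = lgb.LGBMClassifier(
    boosting_type="goss",   # gradient-based one-side sampling (2.x/3.x interface)
    num_leaves=31,          # leaf-wise growth is bounded by the leaf budget
    max_bin=255,            # number of histogram bins for feature discretization
    n_estimators=100,
)
```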

2.6. Model Construction and Performance Evaluation

For convenience, the succinylation site prediction method proposed in this study is called IFS-LightGBM (BO). A flowchart can show the internal mechanism of model construction more intuitively; therefore, we draw Figure 1 to show the overall framework of IFS-LightGBM (BO). The framework is implemented using PyCharm 2020 and Python 3.6. The experiment is conducted on a computer with a 2.30 GHz CPU, 32.0 GB RAM, and the Windows operating system.

The specific steps of IFS-LightGBM (BO) for succinylation site prediction are described as follows:

(1) Dataset. The dataset is used to predict succinylation sites, and it is divided into two categories: positive samples containing succinylation sites and negative samples without succinylation sites; the numbers are 5049 and 5526, respectively.

(2) Feature Extraction and Feature Selection. In this study, four feature extraction methods are used: PseAAC, disorder, PSSM, and CKSAAP, which transform the protein sequence signal into a numerical signal. For PSSM, two kinds of encoding methods are used, namely, averaged by column and expanded by row. Then, these four feature extraction methods are fused to predict the succinylation site. Following that, the LightGBM method is used to prioritize the features and generate the feature list.

(3) IFS-LightGBM (BO) Model Construction. The IFS method builds a series of feature subsets based on the feature list, and then the generated feature subsets are input into the LightGBM classifier in turn. The 10-fold cross-validation is used to evaluate the predictive ability of the classifier. When the performance of the classifier reaches its best, the corresponding feature subset is the optimal feature subset. The selected best feature subset is used as the input features of the model, and then the BO algorithm is used to optimize the hyperparameters of IFS-LightGBM (BO). Five measurement metrics including ACC and F-measure are used to evaluate model performance.

Protein PTM site prediction is essentially a binary classification problem, which can be measured using the notation in Table 2.

TP and FP represent the numbers of true positives and false positives, and the numbers of true negatives and false negatives are represented by TN and FN, respectively. To evaluate the prediction performance of our proposed method, five measures including accuracy (ACC), recall, precision, Matthews Correlation Coefficient (MCC), and F-measure are used [49]. These metrics are calculated as follows:

$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Recall} = \frac{TP}{TP + FN}$

$\mathrm{Precision} = \frac{TP}{TP + FP}$

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

ACC describes the proportion of samples that are correctly predicted, and its value ranges from 0 to 1, where 1 indicates the best prediction. Recall is the ability to identify positive cases, and precision denotes the class agreement of the data labels with the positive labels given by the classifier. MCC is a correlation coefficient describing the relationship between actual classification and predicted classification. F-measure combines recall and precision into a single class-specific accuracy, which is selected as the key measurement in this study.
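For completeness, a minimal sketch of computing the five metrics from confusion-matrix counts is shown below; the tp/fp/tn/fn values in the example call are arbitrary illustrative numbers.

```python
import math

# Minimal sketch of the five evaluation metrics from confusion-matrix counts.
def evaluate(tp, fp, tn, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f_measure = 2 * precision * recall / (precision + recall)
    return acc, recall, precision, mcc, f_measure

print(evaluate(tp=80, fp=20, tn=90, fn=10))  # illustrative counts only
```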

3. Results and Discussion

3.1. Integrated Optimal Feature Extraction

CKSAAP characterizes the short sequence motif information of polypeptide segments, PSSM reflects evolutionary information, disorder features reflect natively disordered residues recognized by VSL2B [35], and PseAAC reflects the physicochemical information of polypeptide segments. Therefore, it is possible that the integration of such four kinds of encodings will characterize the sequence and structural features of polypeptide segments better. But at the same time, the dataset after feature fusion may have features that are unrelated to the target value, so the LightGBM ranking algorithm will be used to sort the fused features and generate a feature list, and then, the IFS method will be used to select the optimal feature subset. This section uses 10-fold cross-validation to evaluate the performance of the model. The corresponding results are shown in Table 3. “All (IFS)” represents the fusion of the four feature extraction methods and the dimensionality reduction through the IFS and LightGBM feature selection method. “All” represents the fusion of the four feature extraction methods without dimensionality reduction.

It can be seen that when using the PseAAC, disorder, CKSAAP, PSSM, and "All" feature extraction methods, the F-measure scores are 69.37%, 56.79%, 68.29%, 67.56%, and 71.03%, respectively. The prediction F-measure of "All (IFS)" is 72.32%, which is 1.29% higher than that of "All" and 15.53% higher than that of disorder. We can clearly see that when the four feature extraction methods are fused, as well as the IFS and LightGBM are used to select the optimal feature subset, the prediction F-measure is significantly improved. Moreover, we can see that the prediction accuracy of "All (IFS)" is 2.99%, 16.51%, 3.96%, 3.75%, and 1.07% higher than those of PseAAC, disorder, CKSAAP, PSSM, and "All." In summary, fusing the four feature extraction methods and using IFS and LightGBM to select the optimal feature subset can improve the prediction performance of protein succinylation sites. Hence, we combine the four feature extraction methods of PseAAC, disorder, CKSAAP, and PSSM to characterize information of each polypeptide segment.

3.2. Results of the IFS and Feature Selection Methods

Using multiple feature extraction methods can better characterize polypeptide segments, but at the same time, it will increase the risk of feature redundancy. In order to select the optimal features, the combination of the IFS method and different feature selection methods will be introduced. Meanwhile, in order to clearly reflect the superiority of the LightGBM feature selection method, we also use ReliefF [50], LinearSVR [51], XGBoost [52], and ANOVA [53] to find the optimal feature subset.

It can be seen from Table 4 that for the dataset, different feature selection methods have a great impact on the accuracy of succinylation site prediction. Among them, the LightGBM feature selection method enables the classifier to obtain the best prediction performance. When the top 351 features are selected as the feature subset, the ACC, recall, precision, MCC, and F-measure are 73.60%, 72.23%, 72.40%, 47.08%, and 72.32%, respectively. The F-measure of LightGBM is 0.51%, 0.73%, 0.04%, and 0.56% higher than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. The ACC is 0.41%, 0.56%, 0.09%, and 0.44% higher than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. The number of optimal features of LightGBM is 1963, 1756, 212, and 852 less than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. Furthermore, we also try the dimensionality reduction methods of principal component analysis (PCA) [54] and t-Distributed Stochastic Neighbor Embedding (t-SNE) [55] without using the IFS method. As we can see from Table 4, when PCA is used to select the best feature subset, the F-measure value of the model is 0.6746, which is 4.86% less than LightGBM and 4.13% less than LinearSVR. Meanwhile, when t-SNE is used to select the best feature subset, the F-measure value of the model is 0.5218, which is 20.14% less than LightGBM and 19.41% less than LinearSVR. The experimental results indicate that although they consume less time, their predictive performances are worse than those of the feature selection methods combined with IFS when processing the data in this study. Thus, we choose to combine IFS with different feature selection methods to select the optimal feature subset.

To analyze further, we draw the IFS curve plots of the predicted values for each feature selection method (i.e., LightGBM, ReliefF, LinearSVR, XGBoost, and ANOVA) as shown in Figure 2. It can be seen from Figure 2(b) that LightGBM achieves the optimal F-measure result through a subset with 351 features, while ReliefF, LinearSVR, XGBoost, and ANOVA achieve their optimal F-measure values when using the subsets of the top 2314, 2107, 563, and 1203 features in their own feature lists, respectively. Meanwhile, LightGBM achieves the optimal ACC result using the subset of the top 351 features, while ReliefF, LinearSVR, XGBoost, and ANOVA achieve their optimal ACC values when using the subsets of the top 1465, 2130, 421, and 1203 features in their own feature lists, respectively. In this section, the feature lists are different because they are constructed through different feature selection methods, but the feature subsets generated based on the feature lists are input to the same classifier. The number of features in the optimal feature subset determined by the LightGBM method is the smallest, while its corresponding ACC and F-measure scores are the highest. Conversely, the best feature subset of LinearSVR contains the largest number of features, and its prediction accuracy and F-measure are lower. Given the above, the prediction effect of the LightGBM feature selection method is better than that of ReliefF, LinearSVR, XGBoost, and ANOVA. Thus, we use the LightGBM feature selection method to select the best feature subset.

3.3. Selection of Classification Algorithms

The choice of classifier plays a crucial role in constructing an effective prediction model of succinylation sites. According to the discussion in Section 3.1, the 2501-dimensional feature vector is obtained by fusing four feature extraction methods, namely PseAAC, disorder, CKSAAP, and PSSM. According to the analysis in Section 3.2, LightGBM is used as the feature selection method and combined with the IFS method to construct the optimal feature subset. For the sake of clearly reflecting the superiority of the LightGBM classifier in predicting succinylation sites, the Random Forest (RF) [56], ExtraTree (ET) [57], Gradient Boosting Decision Tree (GBDT) [58], k-nearest neighbor (kNN) [59], XGBoost, and Naïve Bayes (NB) [60] algorithms are introduced and compared with it. The number of neighbors in kNN is 5, the number of base decision trees in RF is 100, and the number of iterations of LightGBM, XGBoost, and GBDT is 100. It can be seen from Table 5 that when the LightGBM classifier uses a subset of the top 351 features in the LightGBM feature list, the F-measure value achieves the best classification performance, which is 0.7% higher than GBDT and 13.64% higher than ET. Meanwhile, the ACC value of the LightGBM classifier is 0.7360, which is 1.78%, 13.5%, 0.93%, 7.64%, 0.89%, and 4.82% higher than that of RF, ET, GBDT, kNN, XGBoost, and NB, respectively.

Moreover, Figure 3 depicts the IFS curves of the ACC value and F-measure value of each classifier. It can be seen from the curves in the figure that when a subset of the top 351 features in the LightGBM feature list is used by the LightGBM classifier, both the F-measure and ACC achieve the optimal classification performance. For the RF, ET, GBDT, kNN, XGBoost, and NB algorithms, the highest F-measure values are obtained when they use the top 37, 33, 94, 32, 427, and 805 features in the LightGBM feature list, respectively. And when separately using the top 41, 9, 131, 16, 427, and 44 features in the LightGBM feature list, they obtain the highest ACC.

To prove that the LightGBM classifier is better than the other six machine learning algorithms, we conduct 100 rounds of 10-fold cross-validation on the different classifiers by setting different random seeds for the cross-validation method, in which the optimal subsets constructed by IFS act as the training set for the different classifiers. To further measure the performances of the seven machine learning methods, we calculate the mean of the maximum ACC (Max_ACC mean), the mean of the maximum F-measure (Max_F-measure mean), the standard deviation of the maximum ACC (Max_ACC std), and the standard deviation of the maximum F-measure (Max_F-measure std) of each classifier. The results are listed in Tables 6 and 7, respectively. As shown in Tables 6 and 7, the prediction performance of the LightGBM classifier is better than that of the other six machine learning methods because its "Max_F-measure mean" and "Max_ACC mean" are higher than those of the RF, ET, GBDT, kNN, XGBoost, and NB algorithms. The "Max_F-measure std" and "Max_ACC std" of NB are smaller than those of the other algorithms, which shows that it has a relatively stable prediction performance. It can be seen from Figure 4, which shows the histogram of the results of the 100 experiments, that the F-measure results basically fit a normal distribution, and the continuity of the evaluation metric distributions of RF, GBDT, kNN, XGBoost, and NB is better than that of LightGBM. The maximum F-measure of the 100 experiments of the LightGBM method is concentrated between 0.712 and 0.718. In conclusion, the LightGBM model has better predictive performance than the other six models, so we use LightGBM as the classifier. A minimal sketch of the repeated cross-validation is given below.
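The sketch below assumes X_opt holds the optimal feature subset and y the labels; only LightGBM is shown, but the same loop applies to the other six classifiers, and scoring="f1" again stands in for the F-measure.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Minimal sketch of the 100 x 10-fold cross-validation with varying random seeds.
def repeated_cv(clf, X_opt, y, repeats=100):
    scores = []
    for seed in range(repeats):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores.append(cross_val_score(clf, X_opt, y, cv=cv, scoring="f1").mean())
    return np.mean(scores), np.std(scores)

# Example usage (X_opt and y must be provided):
# mean_f, std_f = repeated_cv(lgb.LGBMClassifier(n_estimators=100), X_opt, y)
```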

3.4. Parameter Tuning and Performance Analysis

In order to further improve the performance of the LightGBM model, the Bayesian optimization (BO) algorithm is used to optimize the model parameters. Bayesian optimization is an extremely powerful method, which uses a surrogate function to estimate the noisy, expensive black-box functions. The core idea of the Bayesian hyperparameter optimization algorithm is to establish a probabilistic model that defines a distribution over objective functions from the input space to the objective of interest [61]. BO uses prior knowledge to approach the relatively cheap posterior distribution and then infers where to explore the next optimum hyperparameter combination according to the distribution. In this study, the Gaussian process (GP) approach is selected as a surrogate model and Expected Improvement (EI) is selected as an acquisition function.

For the surrogate model with hyperparameters $\theta$, let $\{x_n, y_n\}_{n=1}^{N}$ be the observations of input-target pairs, $\sigma^2(x; \{x_n, y_n\}, \theta)$ be the predictive variance function, and $\mu(x; \{x_n, y_n\}, \theta)$ be the predictive mean value, and define

$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x; \{x_n, y_n\}, \theta)}{\sigma(x; \{x_n, y_n\}, \theta)}$

$a_{\mathrm{EI}}(x; \{x_n, y_n\}, \theta) = \sigma(x; \{x_n, y_n\}, \theta)\left[\gamma(x)\,\Phi(\gamma(x)) + \mathcal{N}(\gamma(x); 0, 1)\right]$

where $f(x_{\text{best}})$ is the lowest observed value, and $\Phi$ and $\mathcal{N}(\cdot; 0, 1)$ are the standard cumulative distribution and normal density functions, respectively [62].

The Bayesian optimization (BO) algorithm is used to optimize some critical hyperparameters of the LightGBM classifier, as shown in Table 8. The F-measure between the training values and predicted values under 10-fold cross-validation is defined as the fitness function for the hyperparameter optimization of the LightGBM classifier. In order to distinguish it from the LightGBM without hyperparameter optimization, the classifier optimized by the BO algorithm is called IFS-LightGBM (BO). A minimal sketch of this optimization step is given below.
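The sketch uses scikit-optimize's gp_minimize as one possible GP-plus-EI implementation (the paper does not name a library); X_opt and y are assumed to be the optimal 351-feature subset and the labels, the search ranges are illustrative, and scoring="f1" stands in for the F-measure fitness.

```python
import lightgbm as lgb
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from sklearn.model_selection import cross_val_score

# Minimal sketch of GP-based Bayesian optimization with the EI acquisition
# function over the seven LightGBM hyperparameters tuned in this study.
space = [
    Real(0.01, 0.3, name="learning_rate"),
    Integer(3, 30, name="max_depth"),
    Integer(8, 255, name="max_bin"),
    Real(0.0, 1.0, name="reg_alpha"),
    Categorical(["gbdt", "goss"], name="boosting_type"),
    Integer(8, 64, name="num_leaves"),
    Integer(100, 800, name="n_estimators"),
]

def objective(params):
    lr, depth, bins, alpha, boost, leaves, trees = params
    clf = lgb.LGBMClassifier(learning_rate=lr, max_depth=depth, max_bin=bins,
                             reg_alpha=alpha, boosting_type=boost,
                             num_leaves=leaves, n_estimators=trees)
    f1 = cross_val_score(clf, X_opt, y, cv=10, scoring="f1").mean()
    return -f1   # gp_minimize minimizes, so negate the F-measure

result = gp_minimize(objective, space, acq_func="EI", n_calls=50, random_state=0)
```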

For the IFS-LightGBM (BO) model, which uses the optimal feature subset with 351 features, the parameters learning_rate, max_depth, max_bin, reg_alpha, boosting_type, num_leaves, and n_estimators need to be optimized by the BO algorithm. The IFS-LightGBM (BO) model achieves the highest F-measure when learning_rate is 0.0274, max_depth is 20, max_bin is 10, reg_alpha is 0.9647, boosting_type is goss, num_leaves is 11, and n_estimators is 600, which gives an F-measure of 0.7255. For the sake of clearly reflecting the superiority of the BO algorithm in tuning parameters, we also employ a grid search (GS) method to optimize the parameters of the LightGBM classifier; the optimization process of the hyperparameters is shown in Table 9. Table 10 lists other measurement results of IFS-LightGBM (BO), LightGBM, and IFS-LightGBM (GS). In Tables 10 and 11, "LightGBM (GS)" represents the model tuned with the GS method, and "LightGBM" represents the model without hyperparameter optimization.

To confirm that the performance of the LightGBM classifier has been further improved by the BO algorithm, we conduct 100 rounds of 10-fold cross-validation on the IFS-LightGBM (BO) model. Meanwhile, the LightGBM classifier and the LightGBM (GS) model are also evaluated by 10-fold cross-validation 100 times. Table 11 shows the average and standard deviation of the maximum F-measure of the three models, IFS-LightGBM (BO), LightGBM, and LightGBM (GS), over the 100 random tests. As can be seen from Table 11, the "Max_F-measure mean" of the IFS-LightGBM (BO) model is higher than those of the LightGBM (GS) model and the LightGBM classifier. Meanwhile, compared with the grid search method, the Bayesian optimization algorithm consumes less time in searching for the optimal parameters of the LightGBM classifier. Therefore, the Bayesian optimization algorithm is used to optimize the parameters of LightGBM to construct the optimal model.

Next, we further analyze the model output. On the basis of the output from Section 3.2, we find that when we use the subset of the top 351 features in the LightGBM feature list, the F-measure and ACC reach their highest values. It can be seen from Figure 5(a) that among the top 100 features in the feature list, the numbers of PseAAC, disorder, CKSAAP, and PSSM features are 38, 16, 9, and 37, respectively; the physicochemical information (i.e., PseAAC) and the evolution information (i.e., PSSM) account for 38% and 37% of the total. Among the features ranked 101-200 in the feature list, the numbers of PseAAC, disorder, CKSAAP, and PSSM features are 2, 5, 29, and 64, respectively. At the same time, among the features ranked 201-351 in the feature list, the numbers of CKSAAP and PSSM features are 46 and 105, respectively.

4. Conclusions

The study of succinylation and its sites plays an important role in determining the pathogenesis of related diseases and in the development of targeted drugs. In this study, we propose a machine learning model, IFS-LightGBM (BO), for the prediction of succinylation sites. For the dataset, four feature extraction methods (PseAAC, disorder, PSSM, and CKSAAP) are used to extract the sequence information and physicochemical information of the polypeptide segments. At the same time, the IFS method and different feature selection methods are introduced and combined with the LightGBM classifier to eliminate redundant and noisy information and determine the best feature subset. Through comparison, we find that compared with the ReliefF, LinearSVR, XGBoost, and ANOVA methods, the LightGBM feature selection method can search for the optimal feature subset faster, and its corresponding model performance evaluation metrics are better. Finally, the BO algorithm is used to adjust the parameters of the LightGBM classifier to establish the best model. The results show that the IFS-LightGBM (BO) model is a very effective way to predict succinylation sites, achieving an ACC of 0.7392, recall of 0.7219, MCC of 0.4771, precision of 0.7291, and F-measure of 0.7255.

Data Availability

The data used to support the findings of this study are from previously reported studies and public databases, which have been cited.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This study was supported by the Natural Science Foundation of Shanghai (17ZR1412500) and the Science and Technology Commission of Shanghai Municipality (STCSM) (18dz2271000).