Abstract

Emerging evidence demonstrates that post-translational modification plays an important role in several complex human diseases. However, because classical in vitro experiments are costly and time-consuming, increasing attention has been paid to developing efficient and accessible computational tools that identify potential modification sites at the protein level. In this work, we propose a machine learning-based model called CirBiTree for identifying potential citrullination sites. Specifically, we first use the biprofile Bayesian encoding to extract peptide sequence information. Then, a flexible neural tree and a fuzzy neural network are employed as the classification model. Finally, the most suitable peptide length is selected for this model. To evaluate the proposed method, several state-of-the-art methods are used for comparison. The experimental results demonstrate that the proposed method outperforms the other methods: CirBiTree achieves 83.07% in sensitivity (Sn), 80.50% in specificity (Sp), 0.8201 in F1, and 0.6359 in MCC.

1. Introduction

The Human Genome Project was completed in the early 2000s, and more than 20,000 protein-coding genes have been reported. These genes encode the proteins that carry out biological processes. Nevertheless, this information alone can hardly explain the relationships between proteins and human biological processes [1, 2]. With the development of proteomics, several types of post-translational modification (PTM) have been reported at the protein level. These modifications help shape protein structure and maintain protein stability. Building on the basic protein composition, PTMs contribute to the processing of translated peptides [3, 4]. A great number of PTMs can alter physiological activity, and several PTMs are reversible. PTMs are also involved in several diseases. For instance, PTM enzymes are involved in neurodegenerative diseases, especially in patients with Alzheimer’s disease (AD) and Parkinson’s disease [5–7]. A good understanding of PTMs is therefore critical for understanding basic biological functions, detecting human diseases, and identifying drug targets [8, 9]. An increasing number of modification sites can be identified with machine learning methods. Nevertheless, the majority of machine learning approaches and experimental ones are inherently expensive and time-consuming. Therefore, constructing an accurate and effective identification algorithm is an urgent issue in computational biology.

Citrullination, which can be regarded as a special type of deamination, is one of the most common types of post-translational modification [10, 11]. It has been reported for cytoplasmic, nuclear, and membrane proteins [12]. To understand the mechanisms of citrullination, one of the most important steps is the effective and accurate discrimination between modification sites and nonmodification sites. Several proteomics approaches, including immunodetection [13], colorimetric detection [14], and mass spectrometry [15, 16], have been used in this field. Nevertheless, these experimental approaches are time-consuming to some degree [17, 18].

With the development of machine learning and artificial intelligence, in silico methods have been widely used in bioinformatics. Computational tools have been proposed for predicting phosphorylation [19], methylation [20], acetylation [21], ubiquitination [22], carbonylation [23–25], succinylation [26], malonylation, S-sulfenylation [27], and S-nitrosylation sites [28]. Recently, Zhang et al. [29] first proposed a computational approach for identifying citrullination sites. That work is also able to remove some noisy and redundant features [30, 31]. However, the limited performance of such algorithms cannot be neglected. To design an effective and accurate algorithm for classifying citrullination sites, we note that suitable features and the classification model are the two basic elements of this classification problem.

In this work, we propose CirBiTree, a citrullination site identification method based on a fuzzy neural network and a flexible neural tree. First, we use the biprofile Bayesian encoding to extract peptide sequence information. Second, a flexible neural tree and a fuzzy neural network are employed as the classification model. Finally, the most suitable peptide length is selected. To evaluate the proposed method, several state-of-the-art methods are used for comparison. CirBiTree achieves 83.07% in sensitivity (Sn), 80.50% in specificity (Sp), 0.8201 in F1, and 0.6359 in MCC. An outline of the method is shown in Figure 1.

2. Materials and Methods

2.1. Dataset

In this work, we use the dataset established by Zhang et al. [29] to train and test the proposed algorithm. The dataset contains 116 citrullination sites and 332 noncitrullination sites. Each sample is represented as a peptide whose central amino acid residue is the potential modification site, so the peptide length must be chosen in this experiment. We consider peptide segments with lengths ranging from 15 to 21. For example, a sample in the dataset can be represented as a peptide segment of length 21. To ensure that all samples have the same length, vacant positions are filled with the placeholder residue ‘X’.
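As an illustration of the windowing described above, the following minimal Python sketch extracts fixed-length peptide segments centered on candidate sites and pads vacant positions with ‘X’. It assumes that candidate sites are arginine residues (the residue that citrullination modifies) and that padding is applied symmetrically at both termini; the function name `extract_windows` and the toy sequence are hypothetical and are not taken from the dataset of Zhang et al.

```python
def extract_windows(sequence, window=21, target="R", pad="X"):
    """Return (position, peptide) pairs centered on each target residue.

    Assumption: the window length is odd, so the candidate site sits
    exactly in the middle with the same number of residues on each side.
    """
    if window % 2 == 0:
        raise ValueError("window length must be odd so the site is centered")
    flank = window // 2
    # Pad both termini so that sites near the ends still yield full windows.
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == target:
            # Residue i sits at index i + flank in the padded string,
            # so the slice starting at i is centered on it.
            windows.append((i, padded[i:i + window]))
    return windows


# Toy example with a short window so the padding is easy to see.
print(extract_windows("RAKR", window=5))
# [(0, 'XXRAK'), (3, 'AKRXX')]
```

Calling the function with `window` set to 15, 17, 19, or 21 produces the candidate segment lengths considered in this work.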

2.2. Biprofile Bayesian

The biprofile Bayesian (BPB) feature set is an encoding approach used in bioinformatics [32] that is based on statistical theory. Each sample is a peptide segment of length n, made up of the candidate modification residue at the center together with its upstream and downstream flanks. The samples are divided into two groups, a positive group and a negative group, denoted Cp and Cn, respectively: in Cp the central site is citrullinated, and in Cn it is not. Assuming that the amino acid residues are mutually independent, the posterior probabilities of a peptide for the two classes are given by the following equations:

Then, equations (1) and (2) can be rewritten in index form as follows:

The prior distribution is assumed to be uniform, so the prior probabilities of the negative and positive classes are equal. The discriminant function is then given as follows:

With Shao’s approach, equation (5) can be redefined as follows:
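As a minimal sketch of how a BPB feature vector could be computed in practice, the Python code below estimates position-specific residue frequencies from the positive and negative training peptides and concatenates the two profiles for a query peptide, giving a vector of length 2n. The function names and the toy peptides are illustrative assumptions rather than the authors’ implementation, and assigning a probability of zero to residues never observed at a position is a simplification made here.

```python
from collections import Counter


def position_frequencies(peptides):
    """Relative frequency of each residue at each position of a peptide set."""
    n = len(peptides[0])
    counts = [Counter() for _ in range(n)]
    for pep in peptides:
        if len(pep) != n:
            raise ValueError("all peptides must have the same length")
        for j, aa in enumerate(pep):
            counts[j][aa] += 1
    total = len(peptides)
    return [{aa: c / total for aa, c in col.items()} for col in counts]


def bpb_encode(peptide, pos_profile, neg_profile):
    """Encode a peptide as its positive-profile and negative-profile values.

    The first n entries come from the positive set (Cp) and the last n
    from the negative set (Cn); unseen residues get 0.0 (an assumption).
    """
    pos_part = [pos_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    neg_part = [neg_profile[j].get(aa, 0.0) for j, aa in enumerate(peptide)]
    return pos_part + neg_part


# Toy peptides of length 3 (hypothetical data, for illustration only).
positives = ["ARA", "GRA"]
negatives = ["ARG", "ARG"]
pos_profile = position_frequencies(positives)
neg_profile = position_frequencies(negatives)
print(bpb_encode("ARA", pos_profile, neg_profile))
# [0.5, 1.0, 1.0, 1.0, 1.0, 0.0]
```

In a real experiment the profiles would be estimated on the training folds only, so that the encoding of a test peptide does not use its own label.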

2.3. Flexible Neural Tree

The flexible neural tree (FNT), which can be regarded as a special type of neural network, was proposed by Bao et al. [18, 33]. This model can adjust the neural network using dedicated strategies, and it has been widely used in machine learning. The main steps of FNT are described below.

First, the flexible neural tree uses an instruction set to generate the population, as given by the following equations:

where the instruction set consists of two subsets, the operator subset and the variable subset. The operator set, with elements +i, contains the operations, and the variable set, with elements xi, contains the input values. The flexible activation function used at each node is then described by the following equation:

Next, the output of each neural node is computed recursively. For each operator node +i, the total excitation is calculated from the inputs to that node as follows:

The output of node +i is then computed as follows:
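The recursive evaluation of a flexible neural tree can be sketched in Python as follows. This sketch assumes the Gaussian-type flexible activation f(a, b, x) = exp(-((x - a)/b)^2) and a weighted-sum excitation, which are commonly used with FNTs; the exact forms used in this work, the tuple-based tree representation, and the function names are assumptions made for illustration.

```python
import math


def flexible_activation(a, b, x):
    """Gaussian-type flexible activation with parameters a and b (assumed form)."""
    return math.exp(-((x - a) / b) ** 2)


def evaluate(node, inputs):
    """Recursively evaluate a tree node.

    Leaf node:     ("x", index)                      -> returns inputs[index]
    Operator node: ("+", a, b, [(weight, child), ...])
                   -> weighted sum of child outputs passed through the
                      flexible activation with that node's a and b.
    """
    if node[0] == "x":
        return inputs[node[1]]
    _, a, b, children = node
    net = sum(weight * evaluate(child, inputs) for weight, child in children)
    return flexible_activation(a, b, net)


# A small hypothetical tree: a two-input operator nested in a root operator.
inner = ("+", 0.0, 1.0, [(1.0, ("x", 0)), (-1.0, ("x", 1))])
root = ("+", 1.0, 2.0, [(0.5, inner), (1.0, ("x", 2))])

print(evaluate(inner, [0.5, 0.5, 0.0]))  # net = 0, so the output is 1.0
print(evaluate(root, [0.5, 0.5, 0.5]))   # net = 0.5 * 1.0 + 0.5 = 1.0 -> 1.0
```

In an actual FNT, the tree structure and the parameters a, b, and the weights would be optimized during training; here they are fixed by hand only to show the forward computation.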

2.4. Fuzzy Neural Network

In this section, we introduce a special type of fuzzy neural network, the reinforced hybrid interval fuzzy neural network (RHIFNN), which can be employed as a classification model in machine learning. In this model, the membership intervals are obtained from the membership grades produced by fuzzy C-means run with two different values of the fuzzification parameter: one membership grade is produced with the parameter set to m1, and the other with the parameter set to m2.
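To make the idea of membership intervals concrete, the following Python sketch computes standard fuzzy C-means membership grades for two fuzzification parameters m1 and m2 and combines them into an interval for each cluster. Taking the smaller grade as the lower bound and the larger as the upper bound is an assumption made here, as are the function names and the fixed cluster centers; the construction actually used in RHIFNNs may differ.

```python
import math


def fcm_memberships(x, centers, m):
    """Standard fuzzy C-means membership of point x in each cluster (m > 1)."""
    dists = [math.dist(x, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def interval_memberships(x, centers, m1, m2):
    """Membership interval per cluster from two fuzzification parameters.

    Assumption: the interval is [min, max] of the two membership grades.
    """
    u1 = fcm_memberships(x, centers, m1)
    u2 = fcm_memberships(x, centers, m2)
    return [(min(a, b), max(a, b)) for a, b in zip(u1, u2)]


# One-dimensional toy example with two hypothetical cluster centers.
centers = [[0.0], [1.0]]
for lower, upper in interval_memberships([0.25], centers, m1=2.0, m2=3.0):
    print(round(lower, 4), round(upper, 4))
# 0.75 0.9
# 0.1 0.25
```

A larger fuzzification parameter yields softer memberships, so the width of each interval reflects how sensitive the assignment of a sample is to that parameter.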

The consequent part of each fuzzy rule, Ypi, is treated as an interval, in which the indexes refer to the fuzzy class in this model. The model output is then calculated from these interval-valued consequents.

3. Results and Discussions

3.1. Performance Measurements

In this classification problem, samples fall into two types: positive and negative. Positive samples are peptide segments whose central arginine residue carries the citrullination modification; negative samples are peptide segments whose central arginine residue does not. Comparing the predictions with these labels yields four possible outcomes, from which sensitivity (Sn), specificity (Sp), accuracy (Acc), the F1 score, and the Matthews correlation coefficient (MCC) are computed as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where P and N denote positive and negative samples, and T and F denote true (correct) and false (incorrect) predictions, so that TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
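For reference, these measures can be computed from the four confusion-matrix counts with a few lines of Python. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, F1, and MCC; the function name and the example counts are hypothetical, and the code assumes that both classes are present so that the denominators of Sn and Sp are nonzero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, F1, and MCC from confusion counts."""
    sn = tp / (tp + fn)                      # fraction of positives recovered
    sp = tn / (tn + fp)                      # fraction of negatives recovered
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sn, "sp": sp, "acc": acc, "f1": f1, "mcc": mcc}


# Hypothetical counts, not results from this study.
metrics = classification_metrics(tp=8, tn=7, fp=3, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four counts, it remains informative when the positive and negative classes are imbalanced, which is why it is reported alongside Sn, Sp, and F1.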

Table 1 summarizes the comparison between the proposed method and several different types of features. All of these features, namely, binary encoding, AA composition, grouping AA composition, physicochemical properties, KNN Features, Secondary Tendency Structure, PSSM, and BPB, have been tested. The proposed method achieves 78.19% in Sn, 79.28% in Sp, 0.7862 in F1, and 0.5747 in MCC.

Table 2 compares the proposed algorithm with several state-of-the-art tools and approaches for a peptide length of 15.

Table 3 shows the results of several state-of-the-art methods. In particular, the proposed algorithm achieves 80.09% in Sn, 78.86% in Sp, 0.7960 in F1, and 0.5896 in MCC. We also find that different features contribute differently to the classification of this type of modification site.

Table 4 compares the proposed algorithm with several state-of-the-art tools and approaches for a peptide length of 17.

Table 5 shows that the proposed method achieves 81.01% in Sn, 80.09% in Sp, 0.8064 in F1, and 0.6111 in MCC. Again, different features contribute differently to the classification of this type of modification site.

Table 6 compares the proposed algorithm with several state-of-the-art tools and approaches for a peptide length of 19.

Table 7 shows that the proposed method achieves 83.07% in Sn, 80.50% in Sp, 0.8201 in F1, and 0.6359 in MCC. Again, different features contribute differently to the classification of this type of modification site.

Table 8 compares the proposed algorithm with several state-of-the-art tools and approaches for a peptide length of 21. The ROC curves of the state-of-the-art methods are shown in Figure 2.

The compared features and state-of-the-art approaches perform well on this classification problem, but the proposed method is more accurate at the candidate lengths. We also find that different peptide lengths lead to different performance. We conclude that the most suitable length among the candidates is 21, which corresponds to 10 residues upstream and 10 residues downstream of the central site.

To demonstrate the performance of CirBiTree, several state-of-the-art machine learning methods, including random forest, neural network, support vector machine (SVM), and k-nearest neighbor (KNN), are compared with it. The ROC curves of the different machine learning methods are shown in Figure 3.

Table 9 shows that the proposed method performs better than the other machine learning methods on this task.

4. Conclusions and Discussions

In this study, a novel predictor named CirBiTree has been designed to predict citrullination residues with a classification model based on a fuzzy neural network and a flexible neural tree. As far as we know, this is the first time these classification algorithms have been applied to distinguishing citrullination samples from noncitrullination ones. The experimental results demonstrate that CirBiTree achieves excellent performance and could be a useful bioinformatics tool for the accurate identification of citrullination sites.

At the same time, several key elements of the citrullination site prediction problem should be considered. First, effective descriptions and the discovery of suitable features are among the most important elements of this classification problem. On the one hand, classical methods should continue to be used in this field. On the other hand, further potential information could be uncovered with deep learning approaches. Second, highly effective classification algorithms should be developed in machine learning and artificial intelligence, and with the development of deep learning, such methods can be applied in this field. Finally, real-time capability should be taken into account when constructing the model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

C.S. conceived and designed the method, designed the website for the algorithm, and conducted the experiments. H.W. wrote the main manuscript text. All authors reviewed the manuscript.