Special Issue: Scalable Data Mining Algorithms in Computational Biology and Biomedicine
Research Article | Open Access
Positive-Unlabeled Learning for Pupylation Sites Prediction
Pupylation, a crucial posttranslational modification in prokaryotes, plays a key role in regulating various protein functions. To understand the molecular mechanism of pupylation, it is important to identify pupylation substrates and sites accurately. Several computational methods have been developed to identify pupylation sites because traditional experimental methods are time-consuming and labor-intensive. In the existing computational methods, the experimentally annotated pupylation sites are used as the positive training set and the remaining nonannotated lysine residues as the negative training set to build classifiers that predict new pupylation sites in unknown proteins. However, the remaining nonannotated lysine residues may contain pupylation sites that have not yet been experimentally validated. Unlike previous methods, this study used the experimentally annotated pupylation sites as the positive training set and the remaining nonannotated lysine residues as the unlabeled training set. A novel method named PUL-PUP is proposed to predict pupylation sites by using a positive-unlabeled learning technique. Our experimental results indicate that PUL-PUP significantly outperforms the other methods in predicting pupylation sites. As an application, PUL-PUP was also used to predict the most likely pupylation sites among the nonannotated lysine sites.
Recently, a prokaryotic ubiquitin-like protein (Pup) has been identified in prokaryotes [1, 2]. Pup is an intrinsically disordered protein of 64 amino acids that marks target proteins for degradation [3, 4]. The process by which Pup is linked to substrate lysines through isopeptide bonds is named pupylation; it plays an important role in regulating protein degradation and signal transduction in prokaryotic cells. Although pupylation and ubiquitylation are functional analogues, the enzymology involved is different. In contrast to ubiquitylation, which requires three enzymes, E1 (activating enzyme), E2 (conjugating enzyme), and E3 (protein ligase), pupylation requires only two: the deamidase of Pup (DOP) and the proteasome accessory factor A (PafA).
To understand the molecular mechanisms of pupylation, it is important to identify pupylation substrates and sites accurately. As large-scale proteomics methods [8–11] are usually time-consuming and labor-intensive, several computational methods have been developed in recent years to predict pupylation sites. Liu et al. developed the first predictor, GPS-PUP, on the basis of the group-based prediction system (GPS) 2.2 algorithm; Tung developed a predictor, iPUP, by using the SVM algorithm and the composition of k-spaced amino acid pairs (CKSAAP) feature; Chen et al. proposed an SVM-based predictor named PupPred, in which an amino acid pairs feature was employed to encode lysine-centered peptides. Recently, Hasan et al. introduced a profile-based composition of k-spaced amino acid pairs for the prediction of protein pupylation sites and built a web server named pbPUP.
Note that in the aforementioned computational methods, the experimentally annotated pupylation sites are used as the positive training set and the remaining nonannotated lysine residues are used as the negative training set to build classifiers that predict new pupylation sites in unknown proteins. However, owing to the limitations of experimental conditions and techniques, the remaining nonannotated lysine residues may contain some pupylation sites that have not yet been experimentally validated [13, 14]. Thus, the classifiers are actually trained on a noisy negative set, and their performance may be worse than it would be with a clean negative set.
In contrast to existing prediction methods, this study used the experimentally annotated pupylation sites as the positive training set and the remaining nonannotated lysine residues as the unlabeled training set. We developed a novel method to predict pupylation sites by using the positive-unlabeled (PU) learning technique. This method is called PUL-PUP (PU learning for pupylation sites prediction). Experimental results show that our method significantly outperforms the other methods on both the training and test sets. As an application, the most likely pupylation sites among the nonannotated lysine sites were predicted by the proposed method. The PUL-PUP Matlab software package is freely accessible at https://pul-pup.github.io/.
2. Materials and Methods
Tung’s training set and independent test set were used in this study. The training set consisted of 162 proteins with 183 experimentally annotated pupylation sites and 2258 nonannotated lysine sites; the independent test set consisted of 20 proteins with 29 experimentally annotated pupylation sites and 408 nonannotated lysine sites. Because pupylation occurs only at lysine residues (K), a sliding window method was used to encode every lysine residue in the dataset. Following previous work, the window size was set to 21 in our study.
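A minimal sketch of the sliding window step is given below. It assumes that fragments near the protein termini are padded with a placeholder symbol `X`; the padding character, the function name `lysine_windows`, and the example sequence are illustrative choices and are not taken from the original study.

```python
def lysine_windows(sequence, window=21, pad="X"):
    """Return (position, fragment) for every lysine in the sequence.

    position is 1-based; each fragment has `window` residues with the
    lysine in the centre. Termini are padded with `pad` (an assumption).
    """
    half = window // 2
    padded = pad * half + sequence + pad * half
    fragments = []
    for idx, residue in enumerate(sequence):
        if residue == "K":
            # Residue idx sits at padded index idx + half, so the window
            # centred on it starts at padded index idx.
            fragments.append((idx + 1, padded[idx:idx + window]))
    return fragments


# Hypothetical toy sequence with lysines at positions 2 and 8.
for position, fragment in lysine_windows("MKTAYIAKQR"):
    print(position, fragment)
```

Each returned fragment is one sample; whether it belongs to the positive or the unlabeled set depends on whether its central lysine is experimentally annotated.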
2.2. Feature Extraction and Feature Selection
The CKSAAP encoding has been widely used for predicting the sites of various posttranslational modifications [16–18]. The CKSAAP features [13, 19] with k = 0, 1, 2, 3, and 4 were used to encode each lysine fragment in this study. Thus, each sample was represented by 2205 features. In Tung’s paper, a chi-square test and a backward feature elimination algorithm were used to remove irrelevant and redundant features. First, the chi-square test was employed to rank the importance of the 2205 features. Then, the backward feature elimination algorithm removed the 50 lowest-ranked features in each iteration. Here, the top 150 CKSAAP features were selected as the optimal feature set, the same as in Tung’s paper.
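The feature count can be checked with a short sketch of the CKSAAP encoding. It assumes an alphabet of 21 symbols (the 20 standard amino acids plus a padding symbol `X`), which gives 21 × 21 = 441 pairs for each of the five values of k and hence 441 × 5 = 2205 features. Normalizing each count by the number of residue pairs in the window is a common CKSAAP convention and is also an assumption here; the names `ALPHABET`, `PAIRS`, and `cksaap` are hypothetical.

```python
from itertools import product

# 20 standard amino acids plus a padding symbol (assumed to be "X").
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 pairs


def cksaap(fragment, k_values=(0, 1, 2, 3, 4)):
    """Encode a fragment as the composition of k-spaced residue pairs."""
    features = []
    for k in k_values:
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = len(fragment) - k - 1  # pairs separated by k residues
        for i in range(n_pairs):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:  # symbols outside ALPHABET are ignored
                counts[pair] += 1
        # Assumed normalisation: frequency of each pair among all pairs.
        features.extend(counts[p] / n_pairs for p in PAIRS)
    return features


vector = cksaap("XXXXXXXXXMKTAYIAKQRXX")
print(len(vector))
```

Feature selection would then keep 150 of these 2205 columns.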
2.3. Development of PUL-PUP
The experimentally annotated pupylation sites were used as the positive training set and the remaining nonannotated lysine residues were used as the unlabeled training set to build the classifier in this study. The training set therefore comprised two subsets: (1) the positive dataset $P$ and (2) the unlabeled dataset $U$. Thus our problem became one of learning from positive and unlabeled samples. We proposed a novel PU learning algorithm named PUL-PUP to predict pupylation sites. The core learning algorithm of PUL-PUP is the support vector machine (SVM), which has been widely used in various biological problems [20–22]. The flowchart of the PUL-PUP algorithm is shown as follows:
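Input
(i) positive training data $P$
(ii) unlabeled data $U$

Stage 1 (selection of initial reliable negatives).
(i) Select the initial reliable negative set from $U$ by the maximum distance rule.

Stage 2 (expansion of reliable negative example set).
(i) Initialize the iteration counter;
(ii) Repeat
(iii) Increment the iteration counter;
(iv) Construct a two-class SVM based on $P$ and the current reliable negative set;
(v) Classify the current unlabeled set by this SVM;
(vi) Take as the predicted negative set the unlabeled samples whose decision value is below the threshold;
(vii) Build the representative negative set from the negative SVs of the SVM and the surrounding points of these SVs in the current reliable negative set, subject to the size limit;
(viii) Update the reliable negative set and the unlabeled set with the predicted negatives;
(ix) until the stopping condition is met.

Stage 3 (acquisition of final classifier).
(i) Train a final SVM classifier on the positive set $P$ and the representative reliable negative set.

The three stages of the PUL-PUP algorithm are described below.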
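Stage 1 (selection of initial reliable negatives). PUL-PUP selected the initial reliable negative set $RN_0$ from the unlabeled set $U$ by the maximum distance rule. $RN_0$ should be located as far away from the positive set $P$ as possible, so that the reliable negatives are the samples most dissimilar from the positives. Therefore, the samples of $RN_0$ were chosen to maximize their distance from $P$, where the distance between two feature vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ is the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}.$$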
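The maximum distance rule can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: the distance from an unlabeled sample to the positive set is taken as its Euclidean distance to the nearest positive sample, and the number of samples to keep is passed in as a hypothetical parameter `n_select`.

```python
import math


def initial_reliable_negatives(positives, unlabeled, n_select):
    """Return the n_select unlabeled samples farthest from the positives."""

    def distance_to_positives(x):
        # Assumed set distance: Euclidean distance to the nearest positive.
        return min(math.dist(x, p) for p in positives)

    ranked = sorted(unlabeled, key=distance_to_positives, reverse=True)
    return ranked[:n_select]


# Hypothetical two-dimensional feature vectors.
P = [(0.0, 0.0), (1.0, 1.0)]
U = [(5.0, 5.0), (1.0, 2.0), (9.0, 9.0), (0.0, 1.0)]
print(initial_reliable_negatives(P, U, n_select=2))
```

Other set distances (for example, the mean distance to all positives) would fit the same description; the choice mainly affects samples that lie between clusters of positives.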
Stage 2 (expansion of reliable negative example set). After the selection of the initial reliable negative set, the PUL-PUP algorithm iteratively trained a series of two-class SVM classifiers and gradually extended the reliable negative set. Specifically, at the $i$th iteration, an SVM classifier $f_i$ was first trained on the positive set $P$ and the current reliable negative training set $RN_i$; then $f_i$ was used to classify the current unlabeled set $U_i$ and to compute a decision value for each sample. To guarantee the reliability of the negative set, samples with a decision value below a threshold $T$ were selected as the newly predicted negatives $N_i$; here $T$ was set to −0.25. To overcome class imbalance during the iteration, the negative support vectors of $f_i$ and their surrounding points in $RN_i$, together denoted $R_i$, were used to represent the existing negative set $RN_i$, and the size of $R_i$ was kept below a preset limit. At the $(i+1)$th iteration, $RN_{i+1} = R_i \cup N_i$ and $U_{i+1} = U_i \setminus N_i$, and the classifier $f_{i+1}$ was trained on the positive set $P$ and the current reliable negative training set $RN_{i+1}$. As this process continues, the predicted negatives become less and less reliable; therefore, the iteration should be terminated at some point. The iteration was repeated until the size criterion of the stopping rule fell below its threshold; the threshold parameter was set to 4.
Stage 3 (acquisition of final classifier). After the extraction of the representative reliable negative set, a final SVM classifier was trained on the positive set $P$ and the representative reliable negative set.
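The control flow of the expansion stage can be illustrated with a deliberately simplified sketch. It is not the PUL-PUP implementation: a nearest-centroid scorer (`make_scorer`, hypothetical) stands in for the SVM, the representative-set step based on negative support vectors is omitted, and the loop simply stops when no new negatives are found or after `max_iter` rounds. Only the threshold of −0.25 is taken from the text.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def make_scorer(positives, negatives):
    """Stand-in for SVM training: positive score = closer to the positives."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    return lambda x: math.dist(x, c_neg) - math.dist(x, c_pos)


def expand_negatives(positives, initial_negatives, unlabeled,
                     threshold=-0.25, max_iter=20):
    """Iteratively move confidently negative samples out of the unlabeled set."""
    negatives, remaining = list(initial_negatives), list(unlabeled)
    for _ in range(max_iter):
        score = make_scorer(positives, negatives)
        new_negatives = [x for x in remaining if score(x) < threshold]
        if not new_negatives:  # simplified stopping rule (assumption)
            break
        negatives = negatives + new_negatives
        remaining = [x for x in remaining if score(x) >= threshold]
    return negatives, remaining


# Hypothetical toy data: positives near (5, 5), one initial negative at (0, 0).
P = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
RN0 = [(0.0, 0.0)]
U = [(1.0, 0.0), (0.0, 1.0), (5.5, 5.5), (2.0, 2.0)]
negatives, remaining = expand_negatives(P, RN0, U)
print(len(negatives), remaining)
```

In the real algorithm the scorer would be an RBF-kernel SVM, and the negative set would be compressed to support vectors and their surrounding points at each round to keep the two classes balanced.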
2.4. SVM Parameter Selection
The core learning algorithm of PUL-PUP is the support vector machine (SVM) with a radial basis function (RBF) kernel. Libsvm was used for training the SVM models, and the grid search method was applied to tune the parameters in cross-validation. The penalty parameter $C$ and the kernel parameter $\gamma$ were each selected from a predefined grid of candidate values. The SVM parameters were fixed during the expansion of the reliable negative example set.
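For reference, the RBF kernel used by LIBSVM has the form K(x, y) = exp(−γ‖x − y‖²). The short sketch below evaluates it directly; the function name `rbf_kernel` and the example value of gamma are illustrative only and are not the tuned values from this study.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays as the vectors move apart.
print(rbf_kernel((0.0, 0.0), (0.0, 0.0), gamma=0.5))
print(rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5))
```

A larger gamma makes the kernel more local, which is why gamma is tuned jointly with the penalty parameter in the grid search.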
2.5. Performance Evaluation of PUL-PUP
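Five widely accepted measurements, including sensitivity (Sn), specificity (Sp), accuracy (ACC), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC), were used to evaluate the prediction performance of PUL-PUP. The first four are defined as
$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.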
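The four count-based measures can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of Sn, Sp, ACC, and MCC; the function name and the example counts are hypothetical, and returning 0 for MCC when the denominator vanishes is a common convention rather than something stated in the text.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, ACC and MCC computed from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, not results from this study.
print(classification_metrics(tp=8, tn=6, fp=2, fn=4))
```

AUC is not covered here because it depends on the ranking of decision values rather than on a single set of counts.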
3. Results and Discussions
3.1. Performance of 10-Fold Cross-Validation on Training Set
To evaluate the effect of the selected representative reliable negative samples on pupylation site prediction, we compared our method with two other methods, SVM_balance and PSoL, on the training set, because the core learning algorithm of our method is SVM and our method was inspired by PSoL. For the PUL-PUP and PSoL algorithms, the nonannotated lysine sites were used as the unlabeled training samples, and 10-fold cross-validation was performed on the positive set $P$ and the representative reliable negative set. For SVM_balance, a balanced negative training set of the same size as the positive training set was randomly selected from the nonannotated lysine sites, and 10-fold cross-validation was performed on the positive training set and the balanced negative training set to find the best SVM parameters. The 10-fold cross-validation results of the three methods are shown in Table 1. As shown in Table 1, PUL-PUP reached the highest Sn, Sp, ACC, MCC, and AUC values of 82.24%, 91.57%, 88.92%, 0.74, and 0.92, respectively, on the training dataset. Owing to the selected representative reliable negative samples, PUL-PUP achieved an excellent performance on the training set.
3.2. Comparison of PUL-PUP with Other Methods on Independent Test Set
To further evaluate the performance of PUL-PUP in predicting pupylation sites, we first compared it with PSoL and SVM_balance on the independent test set. The results of the different methods are shown in Table 2. Although SVM_balance avoids the class imbalance problem, it does not perform as well as PUL-PUP because its negative training set is randomly selected and does not reflect the true distribution of the negative set well. It should be pointed out that stage 2 of PUL-PUP is similar to the negative set expansion in PSoL. In PUL-PUP, however, the existing negative set is represented by the negative support vectors together with their surrounding points rather than by the negative support vectors alone. Thus, more information from the negative set is retained, which makes our algorithm more effective than PSoL.
We also compared our method with three existing pupylation site predictors, GPS-PUP, iPUP, and pbPUP, on the independent test set. Three thresholds, “High,” “Medium,” and “Low,” were defined for PUL-PUP, corresponding to SVM scores higher than 0.9672, 0.4032, and 0.1088, respectively. The performances of PUL-PUP and the three existing predictors are shown in Table 3. As Table 3 shows, PUL-PUP outperformed the three existing predictors significantly. Taking the “Medium” threshold as an example, the MCC of PUL-PUP (0.24) was higher than that of GPS-PUP (0.14), iPUP (0.16), and pbPUP (0.07). Moreover, PUL-PUP achieved the highest AUC value (0.77). We attribute this improvement to the iterative training of our classifier on the positive set and the reliable negative set. This suggests that PUL-PUP is more suitable for predicting pupylation sites than the other methods.
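The three cutoffs can be applied to an SVM score with a few lines of code. The sketch below assumes that a site is reported at the strictest threshold whose cutoff its score exceeds and that scores at or below the “Low” cutoff are not reported; the function name `confidence_level` and the example scores are illustrative.

```python
# Cutoffs from the text, ordered from strictest to most permissive.
THRESHOLDS = (("High", 0.9672), ("Medium", 0.4032), ("Low", 0.1088))


def confidence_level(score):
    """Return the strictest threshold name the score exceeds, else None."""
    for name, cutoff in THRESHOLDS:
        if score > cutoff:
            return name
    return None


# Hypothetical SVM scores.
for s in (1.2, 0.5, 0.2, 0.05):
    print(s, confidence_level(s))
```

A site labeled “High” is also a positive prediction under the “Medium” and “Low” thresholds, since each lower cutoff admits every score above it.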
3.3. Prediction of the Most Likely Pupylation Sites in Nonannotated Lysine Sites
For the 183 pupylated proteins in PupDB, there are 212 experimentally annotated pupylation sites and 2666 nonannotated lysine sites. As mentioned earlier, these nonannotated lysine sites may contain some pupylation sites that have not yet been experimentally validated. To predict the most likely pupylation sites among the nonannotated lysine sites, we ran the PUL-PUP algorithm on all data in PupDB. The top 20 most likely pupylation sites among the nonannotated lysine sites are listed in Supplementary S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2016/4525786). These predictions are only hypotheses; whether these sites are actually pupylated remains to be experimentally verified.
In this study, we have developed a novel pupylation site prediction method, PUL-PUP, by using PU learning. To the best of our knowledge, this is the first time PU learning has been applied to predict pupylation sites. Experimental results have shown that our method significantly outperformed the existing pupylation site predictors. Moreover, the most likely pupylation sites among the nonannotated lysine sites were predicted by using PUL-PUP. We believe that our method can also be applied to predict other types of posttranslational modification sites. In future research, we will develop a web server for PUL-PUP.
The authors declare that they have no competing interests.
This work was supported by the National Natural Science Foundation of China (61502074), the Social Science and Technology Development Program of Dongguan, China (2013108101007), “Strategy of Enhancing School with Innovation” in Higher Education of Guangdong, China (2014KQNCX221), Dalian University of Technology Fundamental Research Fund (no. DUT15RC(3)030), and the China Postdoctoral Science Foundation (Grant no. 2016M591430).
‘Uniprot_AC’ is the accession number of the protein in the UniProt database; ‘Site’ is the lysine site in the protein; ‘Score’ is the SVM score, where a higher SVM score indicates a more reliable pupylation site.
- M. J. Pearce, J. Mintseris, J. Ferreyra, S. P. Gygi, and K. H. Darwin, “Ubiquitin-like protein involved in the proteasome pathway of Mycobacterium tuberculosis,” Science, vol. 322, no. 5904, pp. 1104–1107, 2008.
- K. E. Burns, W.-T. Liu, H. I. M. Boshoff, P. C. Dorrestein, and C. E. Barry III, “Proteasomal protein degradation in mycobacteria is dependent upon a prokaryotic ubiquitin-like protein,” The Journal of Biological Chemistry, vol. 284, no. 5, pp. 3069–3075, 2009.
- X. Chen, W. C. Solomon, Y. Kang, F. Cerda-Maira, K. H. Darwin, and K. J. Walters, “Prokaryotic ubiquitin-like protein pup is intrinsically disordered,” Journal of Molecular Biology, vol. 392, no. 1, pp. 208–217, 2009.
- S. Liao, Q. Shang, X. Zhang, J. Zhang, C. Xu, and X. Tu, “Pup, a prokaryotic ubiquitin-like protein, is an intrinsically disordered protein,” Biochemical Journal, vol. 422, no. 2, pp. 207–215, 2009.
- J. Herrmann, L. O. Lerman, and A. Lerman, “Ubiquitin and ubiquitin-like proteins in protein regulation,” Circulation Research, vol. 100, no. 9, pp. 1276–1291, 2007.
- C.-W. Tung, “PupDB: a database of pupylated proteins,” BMC Bioinformatics, vol. 13, no. 1, article 40, 2012.
- F. Striebel, F. Imkamp, M. Sutter, M. Steiner, A. Mamedov, and E. Weber-Ban, “Bacterial ubiquitin-like modifier Pup is deamidated and conjugated to substrates by distinct but homologous enzymes,” Nature Structural & Molecular Biology, vol. 16, no. 6, pp. 647–651, 2009.
- C. Poulsen, Y. Akhter, A. H.-W. Jeon et al., “Proteome-wide identification of mycobacterial pupylation targets,” Molecular Systems Biology, vol. 6, no. 1, 2010.
- R. A. Festa, F. McAllister, M. J. Pearce et al., “Prokayrotic ubiquitin-like protein (Pup) proteome of Mycobacterium tuberculosis,” PLoS ONE, vol. 5, no. 1, Article ID e8589, 2010.
- J. Watrous, K. Burns, W.-T. Liu et al., “Expansion of the mycobacterial ‘PUPylome’,” Molecular BioSystems, vol. 6, no. 2, pp. 376–385, 2010.
- F. A. Cerda-Maira, F. McAllister, N. J. Bode, K. E. Burns, S. P. Gygi, and K. H. Darwin, “Reconstitution of the Mycobacterium tuberculosis pupylation pathway in Escherichia coli,” EMBO Reports, vol. 12, no. 8, pp. 863–870, 2011.
- Z. Liu, Q. Ma, J. Cao, X. Gao, J. Ren, and Y. Xue, “GPS-PUP: computational prediction of pupylation sites in prokaryotic proteins,” Molecular BioSystems, vol. 7, no. 10, pp. 2737–2740, 2011.
- C.-W. Tung, “Prediction of pupylation sites using the composition of k-spaced amino acid pairs,” Journal of Theoretical Biology, vol. 336, pp. 11–17, 2013.
- X. Chen, J.-D. Qiu, S.-P. Shi, S.-B. Suo, and R.-P. Liang, “Systematic analysis and prediction of pupylation sites in prokaryotic proteins,” PLoS ONE, vol. 8, no. 9, Article ID e74002, 2013.
- M. M. Hasan, Y. Zhou, X. Lu, J. Li, J. Song, and Z. Zhang, “Computational identification of protein pupylation sites by using profile-based composition of k-spaced amino acid pairs,” PLoS ONE, vol. 10, no. 6, article e0129635, 2015.
- Z. Ju, J. Z. Cao, and H. Gu, “iLM-2L: a two-level predictor for identifying protein lysine methylation sites and their methylation degrees by incorporating K-gap amino acid pairs into Chou's general PseAAC,” Journal of Theoretical Biology, vol. 385, pp. 50–57, 2015.
- Z. Ju, J.-Z. Cao, and H. Gu, “Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou's general PseAAC,” Journal of Theoretical Biology, vol. 397, pp. 145–150, 2016.
- Z. Ju and H. Gu, “Predicting pupylation sites in prokaryotic proteins using semi-supervised self-training support vector machine algorithm,” Analytical Biochemistry, vol. 507, pp. 1–6, 2016.
- X.-B. Wang, L.-Y. Wu, Y.-C. Wang, and N.-Y. Deng, “Prediction of palmitoylation sites using the composition of k-spaced amino acid pairs,” Protein Engineering, Design and Selection, vol. 22, no. 11, pp. 707–712, 2009.
- J. Z. Zeng, Y. L. Liao, Y. S. Liu, and Q. Zou, “Prediction and validation of disease genes using HeteSim Scores,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016.
- L. Wei, M. Liao, Y. Gao, R. Ji, Z. He, and Q. Zou, “Improved and promising identification of human microRNAs by incorporating a high-quality negative set,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 1, pp. 192–201, 2014.
- Q. Zou, J. Z. Zeng, L. J. Cao, and R. R. Ji, “A novel features ranking metric with application to scalable visual and bioinformatics data classification,” Neurocomputing, vol. 173, part 2, pp. 346–354, 2016.
- C.-C. Chang and C.-J. Lin, “LIBSVM: a Library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
- C. Wang, C. Ding, R. F. Meraz, and S. R. Holbrook, “PSoL: a positive sample only learning algorithm for finding non-coding RNA genes,” Bioinformatics, vol. 22, no. 21, pp. 2590–2596, 2006.
Copyright © 2016 Ming Jiang and Jun-Zhe Cao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.