Abstract

Recombination presents a nonuniform distribution across the genome. Genomic regions that present relatively higher frequencies of recombination are called hotspots while those with relatively lower frequencies of recombination are recombination coldspots. Therefore, the identification of hotspots/coldspots could provide useful information for the study of the mechanism of recombination. In this study, a new computational predictor called SVM-EL was proposed to identify hotspots/coldspots across the yeast genome. It combined Support Vector Machines (SVMs) and Ensemble Learning (EL) based on three features including basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). These features are able to incorporate the nucleic acid composition and their order information into the predictor. The proposed SVM-EL achieves an accuracy of 82.89% on a widely used benchmark dataset, which outperforms some related methods.

1. Introduction

Meiotic recombination is the exchange of alleles between homologous chromosomes during meiosis [1]. It provides material for natural selection by producing diverse gametes, and it might also contribute to the evolution of the genome via gene conversion or mutagenesis [2–4].

Although the exact locations where recombination happens in the genome and the mechanism of recombination are still unclear, it is well established that recombination plays an important role in promoting genome evolution. Several studies performed on chromosomes [5–7] have found that recombination presents a nonuniform distribution across the genome. Genomic regions that present relatively higher frequencies of recombination are called hotspots, while those with relatively lower frequencies of recombination are called recombination coldspots [8, 9]. With the number of sequenced genomes growing explosively, more reliable methods are urgently needed to identify recombination spots.

The prediction of recombination hotspots or coldspots is still a challenging task, although much information can be acquired from the experiments. Recently, several computational models have been presented to identify the recombination hotspots/coldspots. For example, Liu et al. [10], based on sequence Kmer frequencies, proposed a model which combines the increment of diversity with quadratic discriminant analysis (IDQD). Later, this method was improved by adding gaps into the kmers [11]. Chen et al. presented a predictor called iRSpot-PseDNC trained with pseudo dinucleotide composition features [12].

The aforementioned methods extract features from DNA sequences from different perspectives. For example, the model based on oligonucleotide frequencies considers the nucleic acid composition information, whereas iRSpot-PseDNC incorporates both the local nucleic acid composition information and the global information of the DNA sequences. Therefore, it is reasonable to combine these complementary predictors to further improve the performance of recombination hotspot/coldspot identification. In this regard, three basic predictors trained with basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14, 15], and pseudo dinucleotide composition (PseDNC) [16], respectively, were combined within an ensemble learning framework, yielding a novel predictor called SVM-EL. All these features can be easily generated by a recently proposed tool called Pse-in-One [17], which generates various features based only on DNA, RNA, or protein sequence information.

2. Materials and Methods

2.1. Benchmark Dataset

The benchmark dataset was obtained from Liu et al. [10]:

$$S = S^{+} \cup S^{-}, \tag{1}$$

where the subset $S^{+}$ contains 490 recombination hotspots, the subset $S^{-}$ contains 591 recombination coldspots, and the symbol $\cup$ represents the “union” in set theory.

2.2. Feature Vectors Generated by Pse-in-One

SVM-EL is developed by combining the outcomes of three individual predictors which were trained by different features, including basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14, 15], and pseudo dinucleotide composition (PseDNC). These basic features can be generated by using Pse-in-One [17] which provides two approaches to generate feature vectors. One way is through the web server (http://bioinformatics.hitsz.edu.cn/Pse-in-One/) and another way is through the stand-alone tool (http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/).

Suppose a DNA sequence is

$$D = R_1 R_2 R_3 \cdots R_i \cdots R_L, \tag{2}$$

where $L$ represents the DNA sequence length and $R_i$ is the nucleic acid at position $i$. The three basic features used in the current study can then be described as follows.

2.2.1. Kmer

Kmer [13] is an approach representing DNA sequences by the occurrence frequencies of kmers. The Kmer contains the local sequence-order information and it can be generated with the help of Pse-in-One by the following steps.

For the web server approach, first choose DNA sequences (PseDAC-General), select Kmer in the Mode tab, and set the value of $k$. Then input or upload the DNA sequence file in FASTA format and click the Submit button; the results are then displayed and can be downloaded as a text file (Figure 1).

For the stand-alone approach, Kmer features can be generated by using the following command line:

    ./kmer.py -f svm -l +1 3 DNA

where -f svm specifies the format of the output file, which is the LIBSVM training data format, -l +1 indicates that the input file contains positive samples only, $k$ equals 3, and the sequence type is DNA.
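To make the Kmer representation concrete, the following is a minimal Python sketch of k-mer occurrence frequencies. It is not the Pse-in-One implementation: the function name `kmer_frequencies`, the lexicographic ordering of k-mers, and the normalization by the number of valid windows are assumptions made for illustration only.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the 4**k k-mer occurrence frequencies of a DNA sequence.

    Assumption: frequencies are normalized by the number of windows that
    contain only A, C, G, T; the tool used in the paper may normalize differently.
    """
    seq = seq.upper()
    # All possible k-mers in lexicographic order (AAA, AAC, ..., TTT for k = 3).
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing non-ACGT symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]
```

With $k = 3$, as in the command line above, each sequence is mapped to a vector of $4^3 = 64$ values.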

2.2.2. Dinucleotide-Based Auto-Cross Covariance (DACC)

Dinucleotide-based auto-cross covariance (DACC) [14, 15] is the combination of DAC [14, 15, 19] and DCC [14, 15]. The DAC measures the correlation between two dinucleotides for one DNA property [17]. The DCC approach measures the correlation between two dinucleotides for two different properties [17].

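Given a DNA sequence represented as (2), the DAC feature can be calculated as [17]

$$\mathrm{DAC}(u, \mathrm{lag}) = \frac{1}{L - \mathrm{lag} - 1} \sum_{i=1}^{L-\mathrm{lag}-1} \left(P_u(R_i R_{i+1}) - \bar{P}_u\right)\left(P_u(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_u\right), \tag{3}$$

where $u$ is the dinucleotide property index; $L$ is the length of the DNA sequence; lag represents the distance between two dinucleotides; $P_u(R_i R_{i+1})$ represents the value of the dinucleotide $R_i R_{i+1}$ at position $i$ for the dinucleotide property index $u$; and $\bar{P}_u$ represents the average value of $P_u$ over the DNA sequence.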

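Given a DNA sequence represented as (2), the DCC feature can be calculated as [17]

$$\mathrm{DCC}(u_1, u_2, \mathrm{lag}) = \frac{1}{L - \mathrm{lag} - 1} \sum_{i=1}^{L-\mathrm{lag}-1} \left(P_{u_1}(R_i R_{i+1}) - \bar{P}_{u_1}\right)\left(P_{u_2}(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_{u_2}\right), \tag{4}$$

where $u_1$ and $u_2$ are two different dinucleotide property indices; $L$ is the DNA sequence length; lag is the distance between two dinucleotides; $P_{u_1}(R_i R_{i+1})$ ($P_{u_2}(R_i R_{i+1})$) represents the value of the dinucleotide $R_i R_{i+1}$ at position $i$ for the dinucleotide property index $u_1$ ($u_2$); and $\bar{P}_{u_1}$ ($\bar{P}_{u_2}$) represents the average value of $P_{u_1}$ ($P_{u_2}$) over the DNA sequence.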
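The DAC and DCC definitions can be illustrated with a short Python sketch. It assumes the standard auto-covariance and cross-covariance forms over dinucleotide property values; the helper names and the two toy properties (G/C count and purine count of each dinucleotide) are placeholders invented for this example and are not the fifteen physicochemical properties of Table 1.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Toy placeholder properties (NOT the Table 1 values): G/C count and purine count.
TOY_PROPS = [
    {d: float(sum(c in "GC" for c in d)) for d in DINUCS},
    {d: float(sum(c in "AG" for c in d)) for d in DINUCS},
]


def _values(seq, prop):
    """Property value of every overlapping dinucleotide R_i R_{i+1}."""
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def dcc(seq, prop1, prop2, lag):
    """Cross covariance between two properties at the given lag."""
    v1, v2 = _values(seq, prop1), _values(seq, prop2)
    m1, m2 = sum(v1) / len(v1), sum(v2) / len(v2)
    n = len(seq) - lag - 1  # number of dinucleotide pairs separated by lag
    return sum((v1[i] - m1) * (v2[i + lag] - m2) for i in range(n)) / n


def dac(seq, prop, lag):
    """Auto covariance: the cross covariance of a property with itself."""
    return dcc(seq, prop, prop, lag)


def dacc(seq, props, max_lag):
    """Concatenate DAC (per property) and DCC (per ordered property pair)."""
    lags = range(1, max_lag + 1)
    feats = [dac(seq, p, lag) for p in props for lag in lags]
    feats += [dcc(seq, p1, p2, lag)
              for a, p1 in enumerate(props)
              for b, p2 in enumerate(props) if a != b
              for lag in lags]
    return feats
```

Under this sketch, N properties and a maximum lag of LAG give N × N × LAG values per sequence, so the two toy properties with lag up to 3 yield 12 features.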

The DACC features contain global sequence-order information and can be generated via Pse-in-One [17], which offers two generation approaches. The generation steps for the DACC feature are as follows.

For the web server approach, first choose the DNA sequences (PseDAC-General) option, select DACC in the Mode tab, and set the value of lag. Second, upload a user-defined physicochemical index file called user_property; the values of the fifteen dinucleotide physicochemical properties are shown in Table 1. Finally, input or upload the DNA sequence file in FASTA format and click the Submit button; the results are then displayed and can be downloaded as a text file (Figure 2).

For the stand-alone approach, DACC features can be generated by using the following command line:

    ./acc.py -e user_property -f svm -l +1 3 DNA DACC

where -e user_property specifies the user-defined physicochemical index file, -f svm and -l +1 have the same meaning as in the Kmer command line, the parameter lag equals 3, the sequence type is DNA, and the method used is DACC.

2.2.3. Pseudo Dinucleotide Composition (PseDNC)

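Given a DNA sequence represented as (2), the PseDNC feature vector can be defined as [17]

$$\mathbf{D} = [d_1\ d_2\ \cdots\ d_{16}\ d_{16+1}\ \cdots\ d_{16+\lambda}]^{T}, \tag{5}$$

where

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 16, \\[2ex] \dfrac{w\,\theta_{u-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 17 \le u \le 16 + \lambda, \end{cases} \tag{6}$$

where $f_u$ ($u = 1, 2, \ldots, 16$) represents the normalized frequency of dinucleotides along the DNA sequence; $w$ represents the weight factor; $\lambda$ is the number of top counted tiers of the correlation in a DNA sequence; and $\theta_j$ measures the correlation between dinucleotides in the DNA, which is defined as

$$\theta_j = \frac{1}{L - j - 1} \sum_{i=1}^{L-j-1} \Theta(R_i R_{i+1}, R_{i+j} R_{i+j+1}), \quad j = 1, 2, \ldots, \lambda, \tag{7}$$

where

$$\Theta(R_i R_{i+1}, R_j R_{j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[P_u(R_i R_{i+1}) - P_u(R_j R_{j+1})\right]^2, \tag{8}$$

where $\mu$ represents the number of dinucleotide property indices and $P_u(R_i R_{i+1})$ represents the value of the dinucleotide $R_i R_{i+1}$ at position $i$ for the dinucleotide property index $u$.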

Pseudo dinucleotide composition (PseDNC) [17] not only incorporates the local nucleic acid composition information and the global or long range information along the DNA sequences, but also incorporates the dinucleotide properties into feature vectors.
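A compact Python sketch may help show how the local part (dinucleotide frequencies) and the global part (tiered correlations) end up in one PseDNC vector. It follows the commonly published PseDNC form and is not taken from Pse-in-One; the function name `pse_dnc` is hypothetical, property values are assumed to be already standardized, and the sequence is assumed to contain only A, C, G, T with length greater than `lam + 1`.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def pse_dnc(seq, props, lam, w):
    """Return a (16 + lam)-dimensional PseDNC-style vector.

    props: list of dicts mapping each dinucleotide to a property value
           (assumed standardized beforehand).
    lam:   number of correlation tiers (lambda).
    w:     weight factor balancing composition and correlation terms.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Local information: normalized dinucleotide frequencies f_1..f_16.
    f = [pairs.count(d) / len(pairs) for d in DINUCS]

    def big_theta(a, b):
        # Mean squared difference of the property values of two dinucleotides.
        return sum((p[a] - p[b]) ** 2 for p in props) / len(props)

    # Global information: j-tier correlation factors theta_1..theta_lam.
    theta = []
    for j in range(1, lam + 1):
        n = len(seq) - j - 1
        theta.append(sum(big_theta(pairs[i], pairs[i + j]) for i in range(n)) / n)

    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]
```

With `lam=7` and `w=0.3`, the values used for the stand-alone command in this section, the vector has 16 + 7 = 23 components, and by construction the components sum to 1.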

For the web server approach, the generation steps of the feature vectors are similar to those for DACC; an example is shown in Figure 3.

For the stand-alone approach, the command line is

    ./pse.py -e user_property -f svm -l +1 7 0.3 DNA PseDNC

where -e user_property, -f svm, and -l +1 have the same meaning as in the DACC command line, $\lambda$ equals 7, the weight factor $w$ equals 0.3, the sequence type is DNA, and the method used is PseDNC.

The meanings of all the parameters for these scripts are described in [17].

2.3. Support Vector Machine (SVM)

Support Vector Machine (SVM) is an algorithm based on statistical learning theory proposed by Vapnik [20–22], and it has been widely used for many bioinformatics tasks [23–27].

In the current study, the LIBSVM package version 3.21 [18] was employed. The two SVM parameters, the kernel width parameter $\gamma$ and the regularization parameter $C$, were optimized via the grid tool provided by LIBSVM [18].
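The parameter search can be pictured as an exhaustive scan over a log2-spaced grid of the regularization and kernel width parameters. The Python sketch below is only an illustration of that idea: the default ranges are assumed to mirror the defaults of LIBSVM's grid.py (the paper does not state which ranges were searched), and `cv_accuracy` is a hypothetical callback standing in for a cross-validated SVM training run.

```python
def log2_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Enumerate (C, gamma) pairs on a log2 grid.

    Each range is (begin, end, step) with the end value included.
    Assumption: these defaults mirror LIBSVM's grid.py defaults.
    """
    def steps(begin, end, step):
        values, v = [], begin
        while (step > 0 and v <= end) or (step < 0 and v >= end):
            values.append(v)
            v += step
        return values

    return [(2.0 ** c, 2.0 ** g) for c in steps(*log2c) for g in steps(*log2g)]


def grid_search(cv_accuracy, grid):
    """Return the (C, gamma) pair with the highest cross-validation accuracy.

    cv_accuracy is a hypothetical user-supplied function (C, gamma) -> float.
    """
    return max(grid, key=lambda pair: cv_accuracy(*pair))
```

In practice `cv_accuracy` would train and validate an RBF-kernel SVM for each pair, which is the expensive part of the search.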

In the current study, three basic predictors are proposed, including SVM-Kmer, SVM-DACC, and SVM-PseDNC.

The values of SVM-Kmer’s parameters are shown as follows:

The values of SVM-DACC’s parameters are shown as follows:

The values of SVM-PseDNC’s parameters are shown as follows:

2.4. Ensemble Learning

In machine learning, ensemble learning is the process by which multiple classifiers are constructed on the same dataset and combined to obtain better performance than a single classifier [28, 29]; existing popular multiobjective evolutionary optimization algorithms can also be used for ensemble learning [30, 31]. Ensemble classifiers have also performed well on several bioinformatics problems. In the current study, the basic framework of the ensemble classifier is illustrated in Figure 4. The final results are obtained by fusing the outcomes of the three individual classifiers, as illustrated below.

Suppose the ensemble classifier is defined as

$$C^{E} = C_1 \,\forall\, C_2 \,\forall\, C_3, \tag{12}$$

where $C_1$ represents the classifier SVM-Kmer, $C_2$ represents the classifier SVM-DACC, and $C_3$ represents the classifier SVM-PseDNC. The symbol $\forall$ denotes the fusing operator.

Therefore, the process of the ensemble classifier can be formulated as follows:

$$Y_j(D) = \frac{1}{3} \sum_{i=1}^{3} P_i(D, S^{j}), \quad j \in \{+, -\}, \tag{13}$$

where $S^{+}$ is the set containing only recombination hotspots and $S^{-}$ is the set of recombination coldspots. $P_i(D, S^{j})$ is the probability that the DNA sequence $D$ belongs to category $S^{j}$, as obtained by the $i$th basic classifier.

Thus, the category to which the query DNA belongs is determined by its average probability calculated by (13); that is, suppose that

$$Y_{\xi}(D) = \max\{Y_{+}(D), Y_{-}(D)\}, \tag{14}$$

where the operator max selects the larger value in the brackets, and the subscript $\xi$ indicates that the query DNA belongs to category $S^{\xi}$.
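The fusion rule amounts to averaging the class probabilities of the basic classifiers and choosing the class with the larger average. A minimal Python sketch follows; it assumes that each basic classifier outputs a hotspot probability with the coldspot probability as its complement, and the function name `fuse` and the tie-breaking toward "hotspot" are illustrative choices, not details stated in the paper.

```python
def fuse(hotspot_probs):
    """Average per-classifier probabilities and pick the larger class score.

    hotspot_probs: one hotspot probability per basic classifier, e.g. the
    outputs of SVM-Kmer, SVM-DACC, and SVM-PseDNC for one query sequence.
    Assumption: the coldspot probability of each classifier is 1 - p.
    """
    n = len(hotspot_probs)
    y_hot = sum(hotspot_probs) / n
    y_cold = sum(1.0 - p for p in hotspot_probs) / n
    # Ties are resolved toward "hotspot" here; this is an arbitrary choice.
    label = "hotspot" if y_hot >= y_cold else "coldspot"
    return label, y_hot, y_cold
```

For example, `fuse([0.9, 0.4, 0.6])` labels the query a hotspot even though one of the three classifiers votes the other way, because the averaged hotspot probability exceeds 0.5.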

2.5. Criteria for Performance Evaluation

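The prediction results can be divided into true positive (TP), false negative (FN), false positive (FP), and true negative (TN) [32]. In the current study, the jackknife test [33–37] was employed and four evaluation indexes were adopted: Sensitivity (Se), Specificity (Sp), Accuracy (Acc), and Matthews Correlation Coefficient (Mcc). They are described as

$$\begin{aligned} \mathrm{Se} &= \frac{TP}{TP + FN}, \\ \mathrm{Sp} &= \frac{TN}{TN + FP}, \\ \mathrm{Acc} &= \frac{TP + TN}{TP + FN + TN + FP}, \\ \mathrm{Mcc} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}. \end{aligned} \tag{15}$$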
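The four indexes follow directly from the confusion-matrix counts, as the short Python sketch below shows using their standard definitions. The function name `evaluate` and the convention of returning 0 for Mcc when the denominator vanishes are assumptions of this example.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Compute Se, Sp, Acc, and Mcc from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: recall on hotspots
    sp = tn / (tn + fp)                    # specificity: recall on coldspots
    acc = (tp + tn) / (tp + fn + fp + tn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0 when Mcc is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "Acc": acc, "Mcc": mcc}
```

In a jackknife (leave-one-out) test, each sequence is predicted by a model trained on all remaining sequences, and the counts accumulated over all sequences are passed to such a function once at the end.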

3. Results and Discussion

3.1. Performance of the Three Basic Classifiers

As an inherent property, sequence order is important for the classification of DNA sequences. Three basic methods based on sequence-order information were therefore adopted to identify recombination hotspots/coldspots. Table 2 shows the performance of the three methods. According to the table, SVM-DACC and SVM-PseDNC outperform SVM-Kmer in prediction accuracy. The main reason is that SVM-Kmer is based only on local sequence-order information, while both SVM-DACC and SVM-PseDNC also contain global sequence-order information.

3.2. The Performance of the Three Basic Predictors Can Be Further Improved by Using Ensemble Learning

Based on the analysis above, we have proposed three basic predictors for identifying recombination hotspots/coldspots. These methods capture DNA information from different aspects. Therefore, we presented a complementary method SVM-EL which can fuse these basic methods to improve the prediction performance. The performance of SVM-EL is shown in Table 2, from which we can see that SVM-EL outperforms the three basic methods. Besides, the corresponding receiver operating characteristic (ROC) curves of the four classifiers were drawn in Figure 5. AUC, the area under the ROC curve, is often used to indicate the performance of a classifier: the larger the value, the better the classifier.

As shown in Figure 5, the predictor SVM-EL showed the top performance, outperforming three basic methods: SVM-Kmer, SVM-DACC, and SVM-PseDNC.

3.3. Comparison with Other Related Predictors

Two state-of-the-art methods, IDQD [10] and iRSpot-PseDNC [12], were selected for comparison with the proposed SVM-EL. Table 3 shows the results of the various methods on the benchmark dataset.

According to Table 3, we can see that SVM-EL outperforms the other methods. The main reason is that IDQD and SVM-Kmer only consider local sequence-order information, and iRSpot-PseDNC, SVM-DACC, and SVM-PseDNC improved them by incorporating global sequence-order information. However, SVM-EL not only incorporates the local nucleic acid information, but also incorporates the global information. Therefore, we conclude that SVM-EL would be a useful tool for hotspots/coldspots identification.

4. Conclusion

In this article, we proposed a predictor called SVM-EL for yeast hotspot/coldspot identification, which combines Support Vector Machines (SVMs) with Ensemble Learning (EL). Combining predictors trained with different features contributes to the improvement of prediction accuracy. SVM-EL is trained with three features: basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). All these features can be generated by Pse-in-One [17], which is a powerful web server for generating various DNA, RNA, or protein features. It also provides a stand-alone version, which is easy to use. In the jackknife test, the predictor outperformed the other predictors compared. In the future, we will consider using other approaches for yeast hotspot/coldspot identification, such as bioinspired computing models [38–45].

Competing Interests

The authors declare no competing financial interests.

Authors’ Contributions

Bingquan Liu conceived the study, designed the experiments, and participated in drafting the manuscript and performing the statistical analysis. Yumeng Liu participated in coding the experiments and drafting the manuscript. Dong Huang participated in performing the statistical analysis. All authors read and approved the final manuscript. Bingquan Liu and Yumeng Liu contributed equally to this paper.

Acknowledgments

This work was supported by the National High Technology Research and Development Program of China (863 Program) (2015AA015405), the National Natural Science Foundation of China (nos. 61300112, 61573118, 61272383, and 61572151), the Natural Science Foundation of Guangdong Province (2014A030313695), Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008), and Scientific Research Foundation in Shenzhen (Grant no. JCYJ20150626110425228).