Abstract

MicroRNAs (miRNAs) are a set of short (21–24 nt) noncoding RNAs that play significant regulatory roles in cells. In the past few years, research on miRNA-related problems has become a hot field of bioinformatics because of miRNAs’ essential biological function. miRNA-related bioinformatics analysis is beneficial in several aspects, including the functions of miRNAs and other genes, the regulatory network between miRNAs and their target mRNAs, and even biological evolution. Distinguishing miRNA precursors from other hairpin-like sequences is important and is an essential procedure in detecting novel microRNAs. In this study, we employed backpropagation (BP) neural network together with 98-dimensional novel features for microRNA precursor identification. Results show that the precision and recall of our method are 95.53% and 96.67%, respectively. Results further demonstrate that the total prediction accuracy of our method is nearly 13.17% greater than the state-of-the-art microRNA precursor prediction software tools.

1. Introduction

MicroRNAs are some of the most important noncoding RNA genes with rather short length. They regulate the expression of whole organism genes at the posttranscriptional level [1]. miRNA is widely involved in the metabolic activity of the body as well as in many important life processes, including cell proliferation and apoptosis, cell differentiation, growth and development of plants and animals, and organ formation [24]. Recently, several studies have shown that microRNAs are related to several cancers [57] and other diseases [810]. Caligiuri et al. [11] proposed that methods and compositions involving miRNAs are useful for the treatment of various diseases and cancers. Some miRNAs are regarded as potential therapeutic targets for various diseases [12]. Recently, the target gene (cancer gene) drugs, which developed in accordance with the theory on miRNA’s gene silencing, have been used for incurable disease that has become a threat to human health problems for years [13]. In addition, the viral genome can encode a large number of miRNAs by itself. Through combination with target genes and coding by viruses or host cell, these miRNAs can lead to immune escape or antiviral effect against the host cell. Therefore, the accurate prediction of miRNA and its target genes, as well as the correct understanding of miRNA mechanism, has important practical significance in medical treatments. Thus, the research on novel miRNA identification is rather essential.

Feature selection mainly dominated the performance of the prediction model in the machine learning process [1420]. In addition, effective features can represent the characteristics of the entire sequence data, which enables easy-to-build better prediction model. To represent the microRNA precursors, Xue et al. [21] proposed 32D novel triplet features, which involved secondary structure information. Jiang et al. [22] found that random rearrangement of the sequence could help obtain significant free-energy features. However, the free-energy computation for many random rearrangement sequences is very time consuming. Wei et al. [23] combined Xue et al.’s features and triplet nucleotide frequency to 98D features and obtained good performance result in human pre-miRNA identification. However, more features would not mean better performance because of some irrelevant and redundant features in the high dimensional or ultra-high dimensional feature set. The purpose of feature selection is to eliminate the irrelevant and redundant features of the feature set. In addition, the training time could be reduced effectively by the feature selection optimization [24]. Some studies focus on developing computational predictors by incorporating the sequence-order or structure-order effects [25, 26]. Several works indicated that proper features could improve the prediction performance of classification in a certain extent. For example, Wang et al. [27] employed the feature selection techniques to optimize the features in miR-SF. They proved that an optimized feature subset could improve the prediction performance. In addition, some popular recently proposed multiobjective optimization evolutionary algorithms can also be used as a possibly promising feature selection approach [28, 29].

Another factor that affects the performance of machine learning prediction method is the classifier algorithm. The selection of different classifiers often leads to the difference of classification results. Several different classifiers and strategies were employed for miRNA identification. Bayesian classifier algorithm was tested for predicting miRNA across different species in 2006 [30]. The method also utilized the multiple species of miRNA sequences and structural features. It proved that miRNA genes could be detected effectively in large scale of different species genomes.

MiPred classifier was tested for predicting miRNA in 2007 [22]. The method utilized random forest classifier algorithm. The prediction accuracy of MiPred is 10% higher than that of Triplet-SVM; the sensitivity and specificity of MiPred can reach to 95.09% and 98.21%. CSHMM classifier was also used for mining miRNA sequences from the genome [31], which utilized the Markov model. Overall, the accuracy of machine learning algorithm was up to 90%. The machine learning method is more accurate than the other methods.

In this study, we chose backpropagation neural network as the classifier. It has three advantages, including better generalization performance, faster learning speed, and good learning ability.

2. miRNA Identification with BP Neural Network

2.1. Pre-miRNA Features
2.1.1. -Gram Frequency

Some studies showed that the local primary sequence is crucial to the pre-miRNA sequence [32]. Thus, the -gram frequency is often applied for the feature map in the selection of the primary sequence feature [33, 34]. However, no good methods are still available for tuning the value of . In general, we choose by comparing the effect of -gram frequency with different -values. In our feature set, we select the different values () for comparison. The different frequency characteristics have almost the same effect on the classifier. Thus, consider that its base and adjacent base have practical biological significance. We chose as 3. A total of 64 ()-dimensional frequency features were calculated.

2.1.2. Triple Structure Sequence

In addition to high specificity of the primary sequence features, the secondary structure sequence of pre-miRNA is also a contributing factor. To analyze the contribution of the secondary structure, the secondary structure prediction software RNAfold is used to calculate the potential structures. In the secondary structure, each nucleotide of the sequence corresponds to two states, matching and nonmatching: record matching as “(” or “)” and nonmatching as “·.” In the structure, three character groups are considered as a unit, and every “)” is replaced as “(.” Thus, 8 () different combinations are available as a unit, including “(((,” “((·,” “(·(,” “·((,” “(··,” “·(·,” “··(,” and “···.”

To characterize pre-miRNA sequence better, the first nucleotide of the corresponding subsequence was added to the front of each structure unit. This provides 32 different combinations, that is, “A(((,” “U((·,” …, “G·((,” “G···.” For a sequence, the occurrence frequency of each combination is determined and coded into the 32D feature vector as the input of the classifier. This calculated 32D triple structure sequence feature is used to train the SVM classifier; the inclusion of the SVM classifier significantly improved the classification ability of pre-miRNA sequences [21].

2.1.3. Energy Characteristics

The real pre-miRNA sequences are generally more stable and show a lower minimum of free energy (MFE) than the randomly generated pre-miRNA. Therefore, energy characterization is often used to describe the structure pre-miRNA sequence as an aspect of feature extraction of the pre-miRNA sequence. To do this, the MFE value is obtained by using RNAfold to calculate the structure.

2.1.4. Structural Diversity Characteristics

The potential for nucleotide pairing in the sequence is a significant characteristic that can also be used to describe the pre-miRNA sequence. This includes both traditional Watson-Crick nucleotide pairing (A–U pairing and C–G pairing) and also other forms of nucleotide pairing, such as the G–U pairing that can occur in the loop of RNA hairpin structures. We included possible G–U pairing in our description of base pairing.

To summarize, we extracted 98 features for the input of the neural network, including 64-dimensional -gram frequency characteristics, 32-dimensional triple structure sequence characteristics, one-dimensional energy feature, and one-dimensional structural diversity characteristics.

2.2. Fixing the Number of Nodes in the Hidden Layer

In general, to select the number of nodes in the hidden layer in changing the BP neural network structure is difficult. Technically, a hidden layer could facilitate operation. However, too many hidden layers can reduce the operation rate.

Currently, no theoretical methods are available to fix the number of nodes in the hidden layer. However, the number generally depends on the empirical formula, as calculated in where represents the neuron number of the hidden layers, is the neuron number of the input layers, is the neuron number of the output layers, and is a constant between 1 and 10.

In this study, and . Therefore, (1) can be used for any values between 11 and 20. A comprehensive analysis of the training results with different numbers of nodes in the hidden layer was performed with the error set to 0.0001. A total of 621 samples were used to train the network, and one sample was used to test the network. The results are shown in Table 1.

From the data shown in Table 1, the increased number of nodes in the hidden layer did not result in better convergence. Additionally, the increased number of nodes increased the network parameters and greatly increased the amount of calculation of the classifier. Thus, keeping 13 nodes in the hidden layers required relatively less training times and less error and still produced relatively good training effects.

2.3. Fixing the Number of Nodes in the Output Layer

Two kinds of output exist, positive and negative, which are represented as 1 for a positive sample and 0 for a negative sample. The topology structure of this prediction method based on BP neural network is shown in Figure 1.

2.4. Selecting Training and Test Model Samples

The collection and organization of training samples are often limited by the objective conditions. Appropriate numbers of training samples are required to achieve sufficient precision. Therefore, it refers to the rule of experience:where represents the numbers of training samples and is the total of network connection weight equal to the sum of nodes of the input and hidden layers. In this study, 2236 samples were used for training.

The data set used for the pre-miRNAs was downloaded from http://bioinf.sce.carleton.ca/SMIRP [35], and these data include negative and positive samples for Arabidopsis lyrata. The FASTA file was converted to ARFF file using a jar package written by Java converting the reference index to numerical form. We randomly selected real pre-miRNAs and pseudo pre-miRNAs to evaluate our algorithm.

2.5. Error Evaluation Steps Based on BP

The structure of the intelligent diagnosis model contains three layers of 98-13-1. First, we set the nodes of the input, output, and hidden layers as , , and , respectively. Assuming the training sample set , the weight matrix between the input and hidden layers can be written as , where and . We assume the connection weight matrix between the hidden and output layers as , where , . Then, respectively, take and as the activation function of each node of the hidden and output layers. To simplify the derivation, we use the vector function for , where . After input of the sample , the actual output can be calculated by

The error function is defined in

Objectively, the target of BP training is to compute the and to minimize the solution of the error function . With this, a combination of gradient descent, common, and simple derivatives was used. To simplify the derivation process, we derive

Then, the error function can be written as

The corresponding gradient function of and can then be expressed as

For arbitrary initial values of and , gradient descent rules to modify the weight of the BP learning algorithm are applied in where represents the learning efficiency. is the partial derivative of the error function relative to . is the partial derivative of the error function relative to .

2.6. Selection of Training Functions and Related Parameters

The above analysis allows fixing of the BP neural network structure. Table 2 shows the chosen training functions and the relevant parameters.

This condition allows establishment of a complete classifier based on BP neural network structure. The model generation and training are summarized in Figure 2.

2.7. Measurement

The use of pattern recognition and machine learning methods can be used as a two-way classification problem. Four kinds of prediction results are presented in Table 3.

The four kinds of prediction results are true positive (TP), the number of positive cases that were correctly predicted; false positive (FP), the number of positive cases represented by error prediction; true negative (TN), the number of counter negative examples that were correctly predicted; and false negative (FN), the number of negative cases represented by error prediction.

Many evaluation indicators can be used for the classification results. First, the accuracy rate (ACC) is the ratio of the correctly predicted cases for the entire data set. Precision and recall can also be used as evaluation indicators in tests of pattern recognition models. Precision is expressed as the ratio of the correctly predicted values for the entire positive data set and recall reflects the number correctly judged as positive examples in the positive example test set [36]. The above three indicators are expressed in

Additionally, sensitivity and specificity parameters may be used to evaluate the function of the model. Sensitivity record (SE) is the same as the recall and specificity record (SP) calculated in accordance with

A challenge may be presented if the positive and negative test sets are unbalanced in the study of biological information. In most cases, the number of positive samples is far less than the number of negative samples. In a few cases, the number of positive samples may be much larger than the number of negative samples. We can easily obtain ACC-SP when the number of positive samples is greater than the negative samples. In this case, the classifier only reflects the classification effect of the negative samples and is unable to accurately express the prediction effect of the classifier on the entire test data set. To solve this problem, researchers typically use the geometric mean (Gm) as described in

Matthew’s correlation coefficient (MCC) [16, 21, 37, 38] can provide more equitable response forecast ability when a large difference exists between the number of positive samples and the number of negative samples. MCC can be expressed as

Currently, studies on miRNA commonly use one or more of these above evaluation indices. In this work, we estimate the overall performance of the classifier by analysis of ACC, SE, SP, Gm, and MCC.

3. Results and Discussion

3.1. Analysis of Feature Set Performance

To select a better feature set for classification, we needed to determine the effect of different feature subsets on the performance of the classifier. To do this, we used the BP neural network method with the same training set (553 positive samples and 1150 samples) to test different feature sets, with the results shown in Table 4.

From Table 4, we learn that the accuracy of the entire feature sets can be as high as 93.42%. This result indicates that our feature set is more effective for processing of a more complex structure or sequence diversity. Considering that the feature sets used here are not very large and each feature subset is highly independent, reducing the dimension of the feature vector is no longer needed.

3.2. Performances of BP

-fold cross-validation with moderate computational complexity is widely used for model selection. The selection of is important because not only determines the number of samples but also determines the computational complexity. Usually, a value of between 5 and 10 is selected based on experience. Statistical performance shows little improvement when selection is greater than 10. Again, computational complexity must be considered; thus a value between 5 and 10 is best [32].

We divided the samples into two cases for training and testing. In the first one, a large difference was observed between the number of positive and negative samples: 518 positive samples and 1078 negative samples as the training set and 166 positive samples and 366 negative samples for the test set. The second case included equal numbers of positive and negative samples: 552 positive samples and 552 negative samples as the training set and 138 positive samples and 138 negative samples for the test set. These training and testing were repeated five times. The testing performance is shown in Figures 3 and 4.

From comparison of the data in Figures 3 and 4, no significant difference was observed between the actual output and the expected output of each test. As described above, the evaluation of the reference index is shown in Table 5.

From the data presented in Table 5, the number of samples affects the accuracy and recall rate of the positive samples. In particular, the precision and recall rate of the negative samples decreased with the decrease in the number of negative samples in the training set. This result indicates that the more the samples in the training process, the better the classification effect of the classifier. At the same time, the precision and recall rate of the number of positive samples were affected. With the number of negative samples in the training set increased, the number of correct predictions increased by four and the number of error predictions was reduced by eight. This result shows that the precision and recall rate of the positive samples decreased with the increase in the number of the negative samples.

3.3. Comparison with Other Methods

The performance of our method was compared with other methods: J48, random forest, LibD3C [39], Adaboost, string kernel SVM [40], LibSVM, and GBDT, which were classified on the same data set. The data set contains 691 real pre-miRNAs and 1437 pseudo pre-miRNAs. As shown in Table 6 and Figure 5, the results demonstrate that the total prediction accuracy of our method is 13.64% greater than the string kernel SVM model and nearly 2% greater than the LibD3C and LibSVM models. The overall performance of the models as measured by MCC was in the following order: GBDT (0.8682), BP (0.8662), LibSVM (0.8510), LibD3C (0.8510), Adaboost (0.8120), random forest (0.7720), J48 (0.7200), and string kernel SVM (0.6002).

Thus, we conclude that the BP method allows improved recognition accuracy.

3.4. Performance on Different Species

To demonstrate the validity and the universal applicability of the BP method, we analyzed six other species: Anolis carolinensis, Arabidopsis thaliana, Drosophila melanogaster, Drosophila pseudoobscura, Epstein-Barr virus, and Xenopus tropicalis. The results shown in Figure 6 indicate that the accuracy of the GBDT is better than BP method in some situations, but the BP method has been achieved fairly good results in terms of ACC, precision, recall, and MCC.

4. Conclusions

Identification of miRNAs is the first step toward understanding their biological characteristics. Many approaches have been proposed to predict pre-miRNAs in recent years. However, feature extraction in these methods can result in information redundancy. To overcome this drawback, a BP neural network algorithm together with optimal 98D features was employed for this analysis. We compare our method with the existing methods of J48, random forest, LibD3C, Adaboost, GBDT, string kernel SVM, and LibSVM, which were trained on the same training data set. The results demonstrate that the total prediction accuracy of our method is 13.17% greater than the string kernel SVM model and nearly 2% greater than LibD3C and LibSVM.

After the identification step, functional analysis is also important for miRNA research. If human miRNA and diseases were focused on, two main approaches would be employed to predict the relationship. The first one is the statistical comparison analysis for the miRNA or isomiR expression [41]. The second one is the network analysis and prediction for miRNA-disease relationship [4245]. Several advanced machine learning, network techniques, and bioinspired models can be utilized on this problem, including random forest [46], semisupervised learning [47], HeteSim Scores [48], spiking neural P systems [4952], and membrane computing_ENREF_51 [5357]. Functional analysis of the novel detected miRNAs would be our future works.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The work was supported by the Natural Science Foundation of China (no. 61370010 and no. 61302139) and the State Key Laboratory of Medicinal Chemical Biology in China.