Machine Learning and Network Methods for Biology and Medicine 2021
Genomic Island Prediction via Chi-Square Test and Random Forest Algorithm
Genomic islands are related to microbial adaptation and carry genomic characteristics that differ from those of the host. Many methods have therefore been proposed to distinguish genomic islands from the rest of the genome by evaluating sequence composition. Many sequence features have been proposed, but most of them have not been applied to the identification of genomic islands. In this paper, we present a scheme to predict genomic islands using the chi-square test and the random forest algorithm. We extract seven kinds of sequence features and select the important features with the chi-square test. The selected features are then input into the random forest to predict genomic islands. Experiments on three datasets and comparisons with existing methods show that the proposed method achieves the best performance. This understanding can be useful for designing more powerful methods for genomic island prediction.
1. Introduction
Horizontal gene transfer (HGT) is one of the main factors affecting bacterial adaptability. Hacker et al. found virulence gene clusters in E. coli genomes that did not exist in closely related species and denoted them pathogenicity islands (PAIs). Since then, at least a dozen kinds of islands have been detected, such as the “secretion island,” “antimicrobial island,” and “metabolic island.” These regions were subsequently generalized as genomic islands (GIs) and further classified according to functions related to complex niche changes. For example, GIs are responsible for the type III secretion system, iron uptake, and toxin and adhesin secretion, which enhance the survival of pathogens in the host and lead to disease [4, 5]. Some researchers reported that pathogenicity can be regulated by the selective loss or recovery of specific GIs [6, 7], and that PAIs can be spontaneously excised from chromosomes at a detectable rate, resulting in different pathogenic phenotypes [8, 9]. Therefore, the detection of GIs has become an important part of research on microbial evolution and function.
With the help of large-scale comparative genomics, researchers found that GIs are characterized by distinct sequence composition, flanking direct repeats, mobility genes, and tRNA genes. In turn, exploring and utilizing these features can lead to better detection of GIs [3, 10–12]. GIs are sporadically distributed among closely related strains and carry patterns that differ from those of the host. Researchers can identify distant relatives by comparing differences in 16S rRNA or other homologous sequences. Some alignment-based methods have been developed to detect GIs, such as the basic local alignment method and the whole-genome alignment method. These tools rely on the observation that, compared with conserved regions, genomic regions that are not aligned across multiple genomes, or are aligned with only one genome, are more likely to be putative GIs. For complex cases, several methods that construct and apply multilayer or large-scale genome comparisons have been reported. For example, MobilomeFINDER first finds shared tRNA genes in several related genomes and then uses Mauve to search for GIs in the upstream and downstream regions of homologous tRNA genes. Because the GIs identified with this method are associated with tRNA disruption, GIs that do not use a tRNA gene as the insertion site are missed. To solve this problem, MOSAIC identifies strain-specific regions that are not necessarily inserted at tRNA genes. Unfortunately, inversions and translocations are often mistaken for strain-specific regions. IslandPick is one of the most widely used tools for GI detection. Given a genome, IslandPick first automatically selects appropriate comparison genomes without bias and then uses Mauve to construct the whole-genome alignment. To avoid duplication, IslandPick uses BLAST as a secondary filter to recheck the regions aligned by Mauve. IslandPick has been integrated into the IslandViewer website, where datasets of precomputed GIs can be downloaded [19–21].
In addition to comparative genomics, composition-based methods are also very sensitive in GI detection. Because GIs usually show a sequence composition significantly different from that of the host, an effective detection algorithm can distinguish the atypical region from the rest of the genome according to the compositional deviation. In practice, composition-based methods are desirable because they can rapidly detect GIs in the analyzed sequences without the need for additional genomes. G + C content and oligonucleotides with lengths of 2–9 are widely used to describe sequence composition in GI detection [10, 22–25]. For example, PAI-Finder calculates G + C content anomalies and codon usage deviation to detect GIs and evaluates a candidate PAI only when a PAI-like region partially or completely overlaps a GI. PAI-Finder has been integrated into the PAI database, where comprehensive information on all annotated and predicted PAIs in prokaryotic genomes can be downloaded [27, 28]. Hidden Markov models (HMMs) have also been introduced to detect atypical regions with compositional deviations [22, 29–31]. For example, SIGI-HMM builds an HMM on codon usage bias and removes ribosomal regions [29, 30], and IslandPath-DIMOB uses HMMs to identify mobility genes by searching each predicted gene against the PFAM mobility gene profiles. Alien_Hunter introduced a scoring system based on k-mers and refines the boundaries of predicted GIs using an HMM.
Although the above algorithms perform well, some problems remain. (1) Comparative genomics relies heavily on the genomes used in the comparison, so it is applicable mainly during annotation or when closely related genomes are available. Even when more genomes are available, researchers have to spend more time selecting genomes from the species of interest. (2) Although HMM-based methods show better performance in GI detection, they involve relatively many parameters and substantial training computation, so detecting GIs takes a long time. (3) In recent years, different sequence features have been proposed, but these features are rarely applied to genomic island prediction. Fusing and selecting effective features is therefore another way to improve the efficiency of genomic island detection.
With the above problems in mind, we present a scheme to predict genomic islands using the chi-square test and the random forest algorithm. We first extract seven kinds of widely used sequence features and compare their performance in GI detection. The chi-square test is then used to select the important features. Finally, the selected features are input into the random forest to detect genomic islands. Through a comprehensive comparison and discussion, we obtain some new guidelines for the use of sequence features, feature selection, and prediction methods.
2. Materials and Methods
2.1. Datasets
Three standard datasets are used in this study. The first dataset, PICK108, consists of 108 complete bacterial genome sequences and their annotations. The numbers of positive and negative GIs in this dataset are 3868 and 679, respectively. The second dataset, referred to as CF15, consists of 15 complete bacterial genome sequences and their annotations. The numbers of positive and negative GIs in this dataset are 6070 and 5833, respectively. The third dataset, denoted RGP104, consists of 104 complete bacterial genomes and their annotations. The numbers of positive and negative GIs in this dataset are 1846 and 3267, respectively.
2.2. Sequence Features
Seven kinds of widely used sequence features are extracted for genomic island detection: composition of k-spaced nucleic acid pairs (CKSNAP), dinucleotide composition (DNC), nucleic acid composition (NAC), pseudodinucleotide composition (PseDNC), electron-ion-interaction pseudopotentials of trinucleotide (PSeEIIP), reverse complement k-mer (RCKmer), and trinucleotide composition (TNC). These features are obtained with iLearn, a comprehensive Python-based toolkit that integrates feature extraction, calculation, feature analysis, and predictor construction.
2.2.1. Reverse Complement k-Mer (RCKmer)
The reverse complement k-mer is a variant of the k-mer in which a k-mer and its reverse complement are counted as a single feature. For example, there are 16 types of 2-mer: “AA,” “CC,” “GG,” “TT,” “AC,” “CA,” “GA,” “TA,” “AG,” “CG,” “GC,” “GT,” “AT,” “CT,” “TC,” and “TG.” Because “TT” is the reverse complement of “AA,” it can be left out. Therefore, only 10 kinds of 2-mer remain in this method: “AA,” “CC,” “AC,” “CA,” “GA,” “AG,” “CG,” “GC,” “AT,” and “TA.” The frequency of each k-mer is then calculated.
2.2.2. Composition of k-Spaced Nucleic Acid Pairs (CKSNAP)
The CKSNAP feature represents the composition of nucleotide pairs that are separated by k (k = 0, 1, 2, …, 5) nucleotides, and it reflects the short-range interactions of nucleic acids within the sequence. Using k = 0 as an example, there are 16 0-spaced nucleotide pairs (i.e., “AA,” “AC,” “AG,” “AT,” “CA,” “CC,” “CT,” “CG,” “GA,” “GC,” “GG,” “GT,” “TA,” “TC,” “TG,” and “TT”). Then, a feature vector is defined as

$$\left(\frac{N_{AA}}{N_{\mathrm{total}}}, \frac{N_{AC}}{N_{\mathrm{total}}}, \frac{N_{AG}}{N_{\mathrm{total}}}, \ldots, \frac{N_{TT}}{N_{\mathrm{total}}}\right)_{16},$$

where $N_{xy}$ is the number of occurrences of the k-spaced pair $xy$ and $N_{\mathrm{total}}$ is the total number of k-spaced pairs in the sequence.
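The merging of each k-mer with its reverse complement can be sketched as follows. This is a minimal illustration rather than the iLearn implementation; the function names are hypothetical, and choosing the lexicographically smaller member of each pair as the representative is an assumption that happens to reproduce the ten 2-mers listed above.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def rckmer(sequence, k=2):
    """Normalized frequency of each canonical k-mer in `sequence`."""
    sequence = sequence.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter(
        canonical(sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= set("ACGT")  # skip ambiguous bases
    )
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}


features = rckmer("AATTGC", k=2)
print(len(features), features)
```

For k = 2 the dictionary has 10 entries, because "TT" is counted under "AA", "GT" under "AC", and so on.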
In this study, all nucleotide pairs for k = 0, 1, …, 5 were considered, and they are encoded as a 96-dimensional numerical vector (six values of k × 16 pairs) formed by concatenating the six 16-dimensional vectors.
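A minimal sketch of this encoding, assuming the sequence contains only A, C, G, and T and that each 16-dimensional block is normalized by the number of k-spaced pairs found for that k. The function name `cksnap` and the parameter `max_gap` are illustrative, not taken from iLearn.

```python
from itertools import product

PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def cksnap(sequence, max_gap=5):
    """Concatenate k-spaced pair frequencies for k = 0..max_gap."""
    sequence = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        # Pair the base at position i with the base gap + 1 positions later.
        for i in range(len(sequence) - gap - 1):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


vector = cksnap("ACGTACGTAC")
print(len(vector))
```

With the default `max_gap=5`, the result has 6 × 16 = 96 entries, matching the dimensionality stated above.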
2.2.3. Dinucleotide Composition (DNC)
DNC expresses the composition of consecutive pairs of nucleotides [36, 39]. The DNC encoding uses 16 descriptors defined as follows:

$$D(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in \{A, C, G, T\},$$

where $N_{rs}$ denotes the number of dinucleotides represented by nucleotide types $r$ and $s$, and $N$ is the length of the sequence.
2.2.4. Trinucleotide Composition (TNC)
TNC refers to the composition of three consecutive nucleotides in biological sequences. The TNC encoding uses 64 descriptors (“AAA,” “AAC,” “AAG,” “AAT,” …, “TTT”), which are defined as

$$D(r, s, t) = \frac{N_{rst}}{N - 2}, \quad r, s, t \in \{A, C, G, T\},$$

where $N_{rst}$ denotes the number of trinucleotides represented by nucleotide types $r$, $s$, and $t$, and $N$ is the length of the sequence.
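DNC and TNC differ only in the word length, so both can be sketched with one helper. The function name `kmer_composition` is hypothetical; the denominator is the number of overlapping windows, which equals the sequence length minus 1 for dinucleotides and minus 2 for trinucleotides.

```python
from itertools import product


def kmer_composition(sequence, k):
    """Frequency of every k-mer over the overlapping windows of `sequence`."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1  # N - 1 for k = 2, N - 2 for k = 3
    for i in range(max(windows, 0)):
        word = sequence[i:i + k]
        if word in counts:  # ignore windows containing ambiguous bases
            counts[word] += 1
    return {w: (counts[w] / windows if windows > 0 else 0.0) for w in kmers}


dnc = kmer_composition("AACGT", 2)  # 16 descriptors
tnc = kmer_composition("AACGT", 3)  # 64 descriptors
print(dnc["AA"], tnc["AAC"])
```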
2.2.5. Pseudodinucleotide Composition (PseDNC)
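PseDNC converts the local sequence arrangement and global sequence information into the feature vector. The PseDNC is expressed as follows:

$$D = [d_1, d_2, \ldots, d_{16}, d_{16+1}, \ldots, d_{16+\lambda}]^{T},$$

$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16, \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 17 \le k \le 16 + \lambda, \end{cases}$$

where $f_k$ reflects the normalized frequency of occurrence of the dinucleotides, $\lambda$ represents the highest counted rank of the correlation along the biological sequence, $w$ (0 to 1) is the weight factor, and $\theta_j$ is the $j$-tier correlation factor, which is defined as

$$\theta_j = \frac{1}{L - j - 1} \sum_{i=1}^{L-j-1} \Theta(R_i R_{i+1}, R_{i+j} R_{i+j+1}), \quad j = 1, 2, \ldots, \lambda; \ \lambda < L,$$

where $L$ is the length of the sequence and the correlation function is defined as

$$\Theta(R_i R_{i+1}, R_j R_{j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[P_u(R_i R_{i+1}) - P_u(R_j R_{j+1})\right]^2,$$

where $\mu$ denotes the number of physicochemical indexes, $P_u(R_i R_{i+1})$ is the numerical value of the $u$th physicochemical index of the dinucleotide $R_i R_{i+1}$ at position $i$, and $P_u(R_j R_{j+1})$ denotes the corresponding value of the dinucleotide $R_j R_{j+1}$ at position $j$.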
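The construction can be sketched as below. Note the loud assumption: `TOY_PROPERTIES` is a placeholder table built from two easily computed quantities (G/C count and purine count per dinucleotide), standing in for the real normalized physicochemical indexes, which are not reproduced here. The names `psednc`, `lam`, and `weight` are illustrative, and the sequence is assumed to contain only A, C, G, and T.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Placeholder indexes (assumption): NOT the published physicochemical values.
TOY_PROPERTIES = {
    d: [sum(b in "GC" for b in d), sum(b in "AG" for b in d)] for d in DINUCS
}


def correlation(d1, d2, props):
    """Mean squared difference over the mu property indexes."""
    p1, p2 = props[d1], props[d2]
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)


def psednc(sequence, lam=3, weight=0.05, props=TOY_PROPERTIES):
    """Return the 16 + lam PseDNC components of `sequence`."""
    seq = sequence.upper()
    length = len(seq)
    if lam >= length - 1:
        raise ValueError("lam must be smaller than len(sequence) - 1")
    dinucs = [seq[i:i + 2] for i in range(length - 1)]
    # Normalized dinucleotide frequencies f_1..f_16.
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCS]
    # j-tier correlation factors theta_1..theta_lam.
    thetas = []
    for j in range(1, lam + 1):
        terms = [correlation(dinucs[i], dinucs[i + j], props)
                 for i in range(length - j - 1)]
        thetas.append(sum(terms) / len(terms))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = psednc("ACGTACGTAC", lam=2, weight=0.5)
print(len(vector))
```

By construction the components sum to 1, which is a quick sanity check on any implementation.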
2.2.6. Nucleic Acid Composition (NAC)
NAC assesses the frequency of each nucleic acid along the sequence. The frequencies of all 4 natural nucleic acids (i.e., “ACGT”) can be calculated as

$$f(t) = \frac{N(t)}{N}, \quad t \in \{A, C, G, T\},$$

where $N(t)$ represents the number of nucleic acids of type $t$, while $N$ is the length of the nucleotide sequence.
2.2.7. Electron-Ion-Interaction Pseudopotentials of Trinucleotide (PSeEIIP)
$EIIP_A$, $EIIP_T$, $EIIP_G$, and $EIIP_C$ represent the EIIP values of nucleotides A, T, G, and C, respectively. The mean EIIP value of the trinucleotides in each sample is used to construct the feature vector, which is described as follows:

$$V = [EIIP_{AAA} \cdot f_{AAA}, EIIP_{AAC} \cdot f_{AAC}, \ldots, EIIP_{TTT} \cdot f_{TTT}],$$

where $f_{xyz}$ represents the normalized frequency of the trinucleotide $xyz$, $EIIP_{xyz} = EIIP_x + EIIP_y + EIIP_z$ represents the EIIP value of a trinucleotide, and $x, y, z \in \{A, C, G, T\}$.
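A minimal sketch of this descriptor follows. The per-nucleotide EIIP values used here (A 0.1260, C 0.1340, G 0.0806, T 0.1335) are the commonly cited ones and are not stated in the text above, so treat them as an assumption; the function name `pseeiip` is illustrative.

```python
from itertools import product

# Commonly cited EIIP values per nucleotide (assumption, not from this text).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}
TRINUCS = ["".join(p) for p in product("ACGT", repeat=3)]


def pseeiip(sequence):
    """64 values: trinucleotide EIIP sum times its normalized frequency."""
    seq = sequence.upper()
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    valid = [w for w in windows if set(w) <= set(EIIP)]
    total = len(valid)
    features = []
    for tri in TRINUCS:
        freq = valid.count(tri) / total if total else 0.0
        features.append(sum(EIIP[base] for base in tri) * freq)
    return features


vector = pseeiip("ACGTACGT")
print(len(vector))
```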
2.3. Chi-Square Test
All kinds of sequence features are fused together to improve the prediction efficiency, but the redundancy among different features cannot be ignored. Therefore, one of the primary tasks in genomic island prediction is to select the best features from the given dataset to achieve the best prediction. This work uses the chi-square test to select the best features for genomic island prediction.
The chi-square test measures the deviation from the expected distribution [40, 41]. Statistically, $X^2$ tests the independence of two variables, where two variables $A$ and $B$ are defined as independent if $P(AB) = P(A)P(B)$ or, equivalently, $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. In feature selection, the two variables are the occurrence of the feature and the occurrence of the class. The statistic is computed as follows:

$$X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}},$$

where $N_{e_t e_c}$ is the observed frequency in the dataset $D$ and $E_{e_t e_c}$ is the expected frequency. $e_t$ and $e_c$ are the values of two random variables: $U$ takes the values $e_t = 1$ (the feature $t$ is present) and $e_t = 0$ (the feature $t$ is absent), and $C$ takes the values $e_c = 1$ (the sample is in class $c$) and $e_c = 0$ (the sample is not in class $c$). We write $U_t$ and $C_c$ if it is not clear from context which feature $t$ and class $c$ we are referring to. The $N_{e_t e_c}$ are counts of samples that have the values of $e_t$ and $e_c$ indicated by the two subscripts. For example, $N_{10}$ is the number of samples that contain $t$ ($e_t = 1$) and are not in $c$ ($e_c = 0$). $N_{1.} = N_{10} + N_{11}$ is the number of samples that contain $t$ ($e_t = 1$), counted independently of class membership ($e_c \in \{0, 1\}$). $N = N_{00} + N_{01} + N_{10} + N_{11}$ is the total number of samples. Under independence, the expected count is $E_{e_t e_c} = N \cdot P(e_t) \cdot P(e_c)$; for example, $E_{11} = N \cdot \frac{N_{11} + N_{10}}{N} \cdot \frac{N_{11} + N_{01}}{N}$.
$X^2$ is a measure of how much the expected counts $E$ and the observed counts $N$ deviate from each other. A high value of $X^2$ indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. An arithmetically simpler way of computing $X^2$ is the following:

$$X^2(D, t, c) = \frac{(N_{11} + N_{10} + N_{01} + N_{00}) \times (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11} + N_{01}) \times (N_{11} + N_{10}) \times (N_{10} + N_{00}) \times (N_{01} + N_{00})}.$$
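The two ways of computing the statistic can be checked against each other on a 2 × 2 table of counts. This is a sketch under the assumption that each feature has been reduced to present/absent; how continuous feature values are binarized is not specified in the text, and the function names are hypothetical.

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Chi-square from observed counts and expected counts under independence."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}  # feature present / absent
    col = {1: n11 + n01, 0: n10 + n00}  # in class / not in class
    total = 0.0
    for (e_t, e_c), obs in observed.items():
        expected = row[e_t] * col[e_c] / n
        total += (obs - expected) ** 2 / expected
    return total


def chi_square_shortcut(n11, n10, n01, n00):
    """Arithmetically simpler closed form for the same 2 x 2 table."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator


counts = (30, 10, 20, 40)  # illustrative counts, not from the datasets
print(chi_square_2x2(*counts), chi_square_shortcut(*counts))
```

Features would then be ranked by this score and the highest-scoring ones kept. Both functions assume that no row or column total is zero.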
2.4. Prediction Algorithm
Random forest (RF) is among the best classification algorithms and is widely applied to many biological problems. It works by building a set of weak classifiers and then combining them into a strong classifier. It is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes predicted by the individual trees. It is a combination of tree predictors in which each tree depends on the values of a random vector that is sampled independently and with the same distribution for all trees in the forest.
A random forest is a collection of tree predictors $h(x; \theta_k)$, $k = 1, \ldots, K$, where $x$ represents the observed input (covariate) vector of length $p$ with associated random vector $X$, and the $\theta_k$ are independent and identically distributed random vectors. As mentioned, we focus on the regression setting, for which we have a numerical outcome $Y$, but we make some points of contact with classification (categorical outcome) problems. The observed (training) data are assumed to be independently drawn from the joint distribution of $(X, Y)$ and comprise $n$ $(p + 1)$-tuples $(x_1, y_1), \ldots, (x_n, y_n)$.
For regression, the random forest prediction is the unweighted average over the collection:

$$\bar{h}(x) = \frac{1}{K} \sum_{k=1}^{K} h(x; \theta_k).$$

As $K \to \infty$, the law of large numbers ensures

$$E_{X,Y}\left(Y - \bar{h}(X)\right)^2 \to E_{X,Y}\left(Y - E_{\theta} h(X; \theta)\right)^2.$$

The quantity on the right is the prediction (or generalization) error for the random forest, denoted as $PE^{*}_{\mathrm{forest}}$. The convergence implies that random forests do not overfit. Now, define the average prediction error for an individual tree $h(X; \theta)$ as

$$PE^{*}_{\mathrm{tree}} = E_{\theta} E_{X,Y}\left(Y - h(X; \theta)\right)^2.$$

Assume that for all $\theta$ the tree is unbiased, i.e., $EY = E_X h(X; \theta)$. Then,

$$PE^{*}_{\mathrm{forest}} \le \bar{\rho}\, PE^{*}_{\mathrm{tree}},$$

where $\bar{\rho}$ is the weighted correlation between the residuals $Y - h(X; \theta)$ and $Y - h(X; \theta')$ for independent $\theta$ and $\theta'$. The above inequality pinpoints what is required for accurate random forest regression: low correlation between the residuals of differing tree members of the forest and low prediction error for the individual trees. Further, the random forest will decrease the individual tree error ($PE^{*}_{\mathrm{tree}}$) by the factor $\bar{\rho}$.
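The aggregation step alone (averaging for regression, majority vote for classification) can be sketched as follows. The trees here are hypothetical one-split stumps with hand-picked thresholds, used only to make the example runnable; training the trees on bootstrap samples with random feature subsets is not shown.

```python
from collections import Counter
from statistics import mean


def forest_regress(trees, x):
    """Unweighted average of the K individual tree predictions."""
    return mean(tree(x) for tree in trees)


def forest_classify(trees, x):
    """Majority vote (mode) of the K individual tree predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def make_stump(feature_index, threshold):
    """Hypothetical one-split tree: 1 (island) if the feature exceeds threshold."""
    return lambda x: 1 if x[feature_index] > threshold else 0


# Three illustrative trees; thresholds are made up for the example.
trees = [make_stump(0, 0.5), make_stump(1, 0.2), make_stump(0, 0.7)]
sample = [0.6, 0.1]
print(forest_regress(trees, sample), forest_classify(trees, sample))
```

Using an odd number of trees avoids ties in the vote for a two-class problem.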
2.5. Performance Evaluation
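This work uses cross-validation to evaluate the proposed method and calculates accuracy, precision, recall (sensitivity), specificity, and the F1-measure as standard performance indicators. They are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \mathrm{Sensitivity} = \frac{TP}{TP + FN},$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \quad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $TP$ is the number of true positives, $FP$ is the number of false positives, $TN$ is the number of true negatives, and $FN$ is the number of false negatives.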
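These indicators can be computed directly from the four confusion-matrix counts, as in the sketch below. The Matthews correlation coefficient (MCC) is included because it is reported in the results, using its standard definition; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Illustrative counts only, not results from the paper.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```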
3. Results and Discussion
3.1. Performance of the Proposed Prediction Method
To build the prediction model, seven kinds of sequence features are extracted, fused, and filtered by the chi-square test and then input into the random forest for genomic island prediction. Accuracy, F1, MCC, precision, recall, and AUC are calculated based on 10-fold cross-validation and are summarized in Figure 1.
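The evaluation protocol can be sketched as a plain index splitter, assuming that the cross-validation referred to here is the usual 10-fold scheme with shuffling. The function name `k_fold_indices` and the fixed seed are illustrative; the authors' exact splitting procedure is not described.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, n_folds=10):
    # Fit feature selection and the classifier on `train`, score on `test`.
    pass
```

Feature selection should be fitted inside each training fold so that the held-out fold does not influence which features are kept.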
Figure 1 shows that the proposed method achieves good performance on all three datasets. For PICK108, its accuracy, precision, recall, F1, AUC, and MCC are 94.6%, 95.1%, 85.7%, 89.5%, 96.8%, and 80.3%, respectively. For CF15, the overall accuracy is 94.9%, and the precision, recall, F1, AUC, and MCC are 94.8%, 94.0%, 94.4%, 95.6%, and 88.8%, respectively. For RGP104, its accuracy, precision, recall, F1, AUC, and MCC are 95.4%, 94.4%, 95.2%, 95.4%, 94.5%, and 90.9%, respectively.
We further compare the proposed method with current methods. For convenience, we compare our results with the published results of the existing methods. Consequently, the evaluation metrics differ between datasets; the comparisons are summarized in Tables 1–3.
For PICK108, the proposed method is compared with Centroid, INDeGenIUS, MTGIpick, SigHunt, and Zisland Explorer. Table 1 indicates that the proposed method achieves the highest accuracy, precision, and recall, with values of 94.6%, 95.1%, and 85.7%, respectively. Compared with the second best method, MTGIpick, the accuracy, precision, and recall of the proposed method are 8.4%, 22.3%, and 38.5% higher, respectively.
On the RGP104 dataset, PanRGP, IslandViewer [19, 20], IslandPath-DIMOB, IslandCafe, and SIGI-HMM [29, 30] are compared with the proposed method. Table 2 shows that the proposed method outperforms the others in terms of MCC, F1, accuracy, and recall. Specifically, the MCC, F1, accuracy, and recall of the proposed method are 11%, 12.4%, 3.2%, and 15.2% higher, respectively, than those of the PanRGP model, but its precision is 0.1% lower than that of the PanRGP model.
In the CF15 experiment, IslandCafe, IslandViewer [19, 20], IslandPath-DIMOB, Zisland Explorer, and SIGI-HMM [29, 30] are compared with the proposed method. Table 3 indicates that the proposed method achieves the highest recall, precision, F1, and MCC, with values of 95.4%, 95.4%, 95.4%, and 90.9%, respectively, which are 23.4%, 28.4%, 29.4%, and 28.9% higher than those of the next most competitive method.
The above results show that the proposed method outperforms the available genomic island prediction methods, indicating that the combination of different features, chi-square feature selection, and the prediction algorithm is effective in improving the prediction. This understanding can be used to develop more powerful genomic island prediction methods.
3.2. Influence of the Different Features
To predict genomic islands, we use seven kinds of sequence features: reverse complement k-mer (RCKmer), composition of k-spaced nucleic acid pairs (CKSNAP), dinucleotide composition (DNC), trinucleotide composition (TNC), pseudodinucleotide composition (PseDNC), nucleic acid composition (NAC), and electron-ion-interaction pseudopotentials of trinucleotide (PSeEIIP). To evaluate the contribution of each kind of sequence feature, we compare the accuracies of the seven kinds of sequence features in Figure 2.
Figure 2 indicates that each feature makes its own positive contribution to the predictions, although different features perform differently on different datasets. On the whole, PSeEIIP, RCKmer, and TNC achieve the best performance among all kinds of sequence features. PSeEIIP and RCKmer not only reflect the composition but also capture the local sequence arrangement and global sequence information, and PSeEIIP additionally encodes the energy of delocalized electrons in nucleotides as the electron-ion interaction. Compared with NAC and DNC, PSeEIIP and RCKmer are more closely related to genomic islands, which is why they achieve better performance in genomic island prediction.
3.3. Influence of the Different Feature Selections
A key element of the proposed method is feature selection based on the chi-square test. To better understand the feature selection, we select feature sets with sizes from 5 to 120. All experiments are performed with each selected feature set using 10-fold cross-validation, and overall accuracy is chosen as the score. Figure 3 shows the overall accuracies of all experiments with the selected feature sets for the three datasets.
As expected, the overall accuracy first increases and then decreases as the size of the selected feature set continues to increase. All datasets reach their best prediction when the selected feature set contains fewer than 30 features. As the number of selected features increases further, the overall accuracy decreases. The chi-square test is further compared with feature importance (FI), Pearson correlation (PC), ROC-AUC, mutual information gain (MIG), linear discriminant analysis (LDA), and principal component analysis (PCA), and the chi-square test achieves the best performance among the seven feature selection methods.
3.4. Influence of the Different Prediction Algorithms
Random forest (RF) was employed as the classifier in this work. To compare the performance of different classifiers, support vector machine (SVM), k-nearest neighbor (KNN), gradient boosting (GB), AdaBoost (AB), decision tree (DT), bagging, extra trees (ET), stochastic gradient descent (SGD), and multilayer perceptron (MLP) were also applied to genomic island prediction. All experiments are performed with each selected feature set using 10-fold cross-validation, and overall accuracy is chosen as the score. Figure 4 summarizes the overall accuracies of all experiments with the different prediction algorithms for the three datasets.
Figure 4 shows that the random forest (RF) achieves the best performance among the ten classifiers. Specifically, its average overall prediction accuracy is 95% across the PICK108, RGP104, and CF15 datasets, compared with 91% for gradient boosting (GB) and 92% for bagging. These results indicate that the random forest is a more powerful classifier for genomic island prediction.
4. Conclusions
Genomic islands are related to the rapid adaptation of prokaryotes and have important medical, economic, and environmental significance. Existing methods usually evaluate sequence composition and focus on whether the local features of a region differ significantly from those of the host. Although these methods have achieved good experimental results, the many feature extraction methods proposed in recent years are rarely used to predict genomic islands. With these problems in mind, we present a scheme to predict genomic islands using the chi-square test and the random forest algorithm. We extract seven kinds of widely used sequence features and select the important features with the chi-square test. Finally, the selected features are input into the random forest to predict genomic islands. Experimental results on three datasets show that the proposed method performs better than previous methods.
The first contribution concerns the influence of the different features: we find that PSeEIIP, RCKmer, and TNC are more closely related to genomic islands and achieve the best performance among all kinds of sequence features. The second contribution concerns the influence of the different feature selection methods: the chi-square test achieves the best performance among the seven feature selection methods. The final contribution concerns the influence of the different prediction algorithms: the random forest (RF) achieves the best performance among the ten classifiers, with an accuracy 3% higher than that of the next best one. This understanding can then be used to develop more powerful methods for genomic island prediction.
Data Availability
All the data used to support the findings of this study are available at https://github.com/Onesime243/Chi_square_Genomic_Islands_predicton_data-and-result.git.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61772028) and research grants from the Zhejiang Provincial Natural Science Foundation of China (LY20F020016).
References
S. H. Yoon, C.-G. Hur, H.-Y. Kang, Y. H. Kim, T. K. Oh, and J. F. Kim, “A computational approach for identifying pathogenicity islands in prokaryotic genomes,” BMC Bioinformatics, vol. 6, no. 1, pp. 1–11, 2005.
R. Merkl, “SIGI: score-based identification of genomic islands,” BMC Bioinformatics, vol. 5, no. 1, pp. 1–14, 2004.
S. Waack, O. Keller, R. Asper et al., “Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models,” BMC Bioinformatics, vol. 7, no. 1, pp. 1–12, 2006.
R. D. Finn, J. Tate, J. Mistry et al., “The Pfam protein families database,” Nucleic Acids Research, vol. 36, suppl_1, pp. D281–D288, 2007.
D. Che, T. Shafer, and P. Tian, “Classification of endangered languages using decision tree based algorithms,” in 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 1814–1821, Guilin, China, 2017.
P. P. Shinde and S. Shah, “A review of machine learning and deep learning applications,” in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–6, Pune, India, 2018.
M. R. Segal, “Machine Learning Benchmarks and Random Forest Regression,” UCSF: Center for Bioinformatics and Molecular Biostatistics, 2004.