Abstract
Single-cell RNA sequencing (scRNAseq) is emerging as a promising technology. A scRNAseq dataset contains a huge number of genes. Some of them are high quality genes, whereas others are noisy or irrelevant because of nonspecific technical reasons. These noisy and irrelevant genes may strongly influence downstream data analyses, such as cell classification, gene function analysis, and cancer biomarker detection. It is therefore important to eliminate the irrelevant genes and to choose high quality genes by means of gene selection methods. In this study, a novel gene selection and classification method is presented by combining the information gain ratio and the genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and then the genetic algorithm with dynamic crossover (DCGA) is utilized to choose high quality genes finely from the preliminary gene subset. The main difference between the IGRDCGA and the existing methods is that the DCGA and IGR are first integrated and then used to select genes from scRNAseq data. We run the IGRDCGA and several competing methods on a number of real-world scRNAseq datasets. The obtained results demonstrate that the IGRDCGA chooses high quality genes effectively and efficiently and outperforms the competing methods in terms of both dimensionality reduction and classification accuracy.
1. Introduction
In scRNAseq data, the number of genes is often large and may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks, and they may seriously affect the efficiency of downstream data analysis. If all genes are utilized in data classification, both the classification accuracy and the classification efficiency may be low. In order to eliminate these irrelevant genes and select high quality genes, an effective and efficient gene selection algorithm is vital.
Feature selection (FS) problems can be regarded as large-scale global optimization problems, as Wang et al. [1] did; therefore, bio-inspired intelligence optimization algorithms can be used to address them. Nakisa et al. [2] utilized evolutionary computation (EC) to search for the optimal feature subset. Eroglu and Kilic [3] integrated a genetic local search algorithm and a k-nearest neighbor classifier to select feature subsets. Maleki and Zeinali [4] used a hybrid genetic algorithm (GA) to address dimension reduction and applied it to the classification of lung cancer. Tahir et al. [5] presented a binary chaotic GA to select features for healthcare datasets. However, how to correctly use the GA to address the gene selection and classification of scRNAseq data is a significant issue that must be considered first. To the best of our knowledge, only a few studies have addressed it so far.
The study integrates the IGR and DCGA to address the gene selection and classification of scRNAseq data and proposes a novel gene selection and classification algorithm IGRDCGA. The IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset and then employs the DCGA to choose high quality genes finely from the preliminary gene subset.
The rest of this study is organized as follows. Section 2 briefly describes the information gain ratio and the genetic algorithm. Section 3 states three evaluation metrics. The datasets and the preprocessing method used in the study are described in Section 4. The coding and the other details of the IGRDCGA are described in Section 5. The numerical results of the IGRDCGA and several competing algorithms are given in Section 6. The conclusion of the study is drawn in Section 7.
2. Related Work
In this section, the information gain ratio and genetic algorithm are described as follows.
2.1. Information Gain Ratio
The information gain is a metric derived from information entropy, often used to evaluate the mutual dependence level between two random variables. Namely, it is a symmetrical metric of dependency [6].
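For two discrete random variables X and Y, their information entropies can be calculated, respectively, in terms of the following formulas:
$$H(X) = -\sum_{i} p(x_i) \log p(x_i), \tag{1}$$
$$H(Y) = -\sum_{j} p(y_j) \log p(y_j), \tag{2}$$
where p(x_{i}) and p(y_{j}) represent the marginal probabilities of x_{i} and y_{j}, respectively.
The conditional entropy and information gain [6–8] of X versus Y can be calculated in terms of the first two following formulas, respectively. The information gain ratio of X versus Y is the ratio of the information gain to the information entropy, which is formulated in the last following formula.
$$H(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j), \tag{3}$$
$$\mathrm{IG}(X \mid Y) = H(X) - H(X \mid Y), \tag{4}$$
$$\mathrm{IGR}(X \mid Y) = \frac{\mathrm{IG}(X \mid Y)}{H(Y)}. \tag{5}$$
The IGR ranges from 0 to 1, where 1 represents that X completely leads to Y and 0 represents that X and Y are completely independent.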
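The entropy, conditional entropy, information gain, and information gain ratio described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the base-2 logarithm, the use of H(Y) as the normalizing entropy, the convention of returning 0 when H(Y) is 0, and the premise that gene expression has already been discretized are all assumptions made here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): entropy of x inside each group of equal y, weighted by p(y)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain_ratio(x, y):
    """IG(X | Y) / H(Y); returns 0.0 when H(Y) is 0 (an assumed convention)."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0
    return (entropy(x) - conditional_entropy(x, y)) / h_y


# Hypothetical toy data: class labels of four cells and two discretized genes.
labels = [0, 0, 1, 1]
gene_a = [0, 0, 1, 1]  # determines the labels completely
gene_b = [0, 1, 0, 1]  # carries no information about the labels
igr_a = information_gain_ratio(labels, gene_a)
igr_b = information_gain_ratio(labels, gene_b)
print(igr_a, igr_b)
```

In this toy case the first gene reaches the upper end of the range and the second gene the lower end, which matches the interpretation of the two extreme values given above.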
2.2. Genetic Algorithm
The genetic algorithm (GA) is a bio-inspired intelligence optimization algorithm. It is inspired by the process of natural selection and belongs to the family of evolutionary algorithms (EAs) [9–13]. It is commonly utilized to generate feasible solutions for optimization problems by performing operators such as selection, crossover, and mutation [14–16]. The selection operator chooses a part of the chromosomes in the previous population for the crossover operator. Frequently used selection operators are randomized strategies, such as the tournament selection strategy. The crossover operator exchanges one or more genes between two parents chosen by the selection operator. It simulates reproduction or recombination in the biological evolution process. The GA determines whether or not to perform the crossover operator according to the crossover probability. Through the crossover operator, two parents may generate two or more offspring, depending on the crossover strategy. Frequently used crossover operators include single-point crossover, two-point crossover, and multipoint crossover. The mutation operator modifies one or more genes in a certain chromosome. It simulates gene mutation in the biological evolution process. Similarly, the GA determines whether or not to perform the mutation operator according to the mutation probability. Frequently used mutation operators include locus mutation, exchange mutation, and insertion mutation. The mutation operator acts on only one chromosome, while the crossover operator acts on two chromosomes. Generally, the crossover probability is much larger than the mutation probability, which accords with the biological evolution process.
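Two of the classical operators mentioned above, tournament selection and single-point crossover, can be sketched in Python as follows. This is a generic textbook-style sketch for fixed-length chromosomes; the function names, the tournament size, and the toy population are assumptions made here for illustration and are not taken from the study.

```python
import random


def tournament_select(population, fitness, size=2, rng=random):
    """Draw `size` distinct chromosomes at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitness[i])]


def single_point_crossover(p1, p2, rng=random):
    """Classical single-point crossover for two parents of equal length."""
    point = rng.randint(1, len(p1) - 1)  # cut strictly inside the chromosome
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


# Hypothetical toy population of binary chromosomes with made-up fitness values.
rng = random.Random(1)
population = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
fitness = [0.1, 0.9, 0.5]
winner = tournament_select(population, fitness, size=3, rng=rng)
child_a, child_b = single_point_crossover(population[0], population[1], rng)
print(winner, child_a, child_b)
```

With a tournament size equal to the population size, the fittest chromosome always wins; smaller tournaments keep some randomness in the selection.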
3. Evaluation Metrics
To evaluate the performance of the IGRDCGA, we utilize the following evaluation metrics in this study: NMI (normalized mutual information) [17–19], ARI (adjusted Rand index) [20, 21], and purity [22].
3.1. NMI (Normalized Mutual Information)
The NMI is a frequently used evaluation metric. It is often used to evaluate how well the obtained clustering results agree with the ground truth results.
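For two discrete random variables X and Y, their MI (mutual information) can be calculated in terms of the following formula:
$$\mathrm{MI}(X, Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}, \tag{6}$$
where p(x_{i}) and p(y_{j}), respectively, denote the marginal probabilities of x_{i} and y_{j}, and p(x_{i}, y_{j}) denotes their joint probability.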
The normalized MI is taken as the NMI, which is calculated in terms of formula (7) from MI(X, Y) and the entropies H(X) and H(Y) of X and Y. The NMI ranges from 0 to 1, where 1 is the optimal score and represents that X and Y are fully concordant. A larger NMI signifies higher concordance between X and Y.
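Example 1. Suppose that the ground truth class labels of a single-cell dataset are y_{1} = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3] and that the classification result of an algorithm is y_{2} = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3]. Both y_{1} and y_{2} take the unique values [1 2 3]. The probability values in formula (6) are obtained from the label counts over the 17 cells. The marginal probabilities are p(y_{1} = 1) = 6/17, p(y_{1} = 2) = 6/17, and p(y_{1} = 3) = 5/17 for y_{1}, and p(y_{2} = 1) = 8/17, p(y_{2} = 2) = 5/17, and p(y_{2} = 3) = 4/17 for y_{2}. Each joint probability is the number of cells that share a given pair of labels divided by 17; for example, p(y_{1} = 1, y_{2} = 1) = 5/17. Substituting these values into formula (6) gives the MI, formula (1) or (2) gives the entropies of y_{1} and y_{2}, and formula (7) then gives the NMI.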
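The quantities in Example 1 can be recomputed with the following Python sketch. It is a hedged illustration only: the base-2 logarithm and, in particular, the normalization of the MI by the arithmetic mean of the two entropies are assumptions made here, so the printed NMI need not coincide with the value produced by formula (7) if the study normalizes differently. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """MI computed from the empirical joint and marginal label frequencies."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())


def nmi(x, y):
    """MI divided by the mean of the two entropies (an assumed normalization)."""
    denom = (entropy(x) + entropy(y)) / 2
    return mutual_information(x, y) / denom if denom > 0 else 1.0


y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
joint_counts = Counter(zip(y1, y2))  # e.g. joint_counts[(1, 1)] cells have both labels 1
print(mutual_information(y1, y2), entropy(y1), entropy(y2), nmi(y1, y2))
```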
3.2. ARI (Adjusted Rand Index)
The ARI is another widely used evaluation metric for measuring the concordance between two clustering results. The RI (Rand index) measures the similarity between two results. It considers all pairs of samples and counts the pairs that are placed in the same cluster or in different clusters in both the predicted and the ground truth clusterings.
For two clustering results X and Y containing n elements, their RI is calculated in terms of the following formula:
$$\mathrm{RI}(X, Y) = \frac{a + d}{a + b + c + d}, \tag{8}$$
where a is the number of pairs in identical class in X and identical cluster in Y; b is the number of pairs in identical class in X but not identical cluster in Y; c is the number of pairs that are not in identical class in X but in identical cluster in Y; d is the number of pairs that are neither in identical class in X nor in identical cluster in Y.
The overlap between X and Y can be summarized in a contingency table, in which each element represents the number of objects in common between X and Y. The ARI is the adjusted RI, which can be calculated in terms of the following formula:
$$\mathrm{ARI}(X, Y) = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right] - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{n}{2}}, \tag{9}$$
where n_{ij} are values from the contingency table, a_{i} is the sum of the ith row of the contingency table, b_{j} is the sum of the jth column of the contingency table, and \binom{\cdot}{2} denotes a binomial coefficient.
The ARI is the corrected-for-chance version of the RI. The RI only yields values between 0 and 1, whereas the ARI can yield negative values if the index is less than the expected index. The optimal score of the ARI is 1, which represents that two clustering results are identical. A larger ARI signifies higher concordance between X and Y.
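The RI and ARI described above can be computed from the contingency table with a few lines of Python. The sketch below is illustrative only; the function names are hypothetical, the label vectors reuse those of Example 1, and returning 1 in the degenerate case where the denominator vanishes is an assumed convention.

```python
from collections import Counter
from math import comb


def pair_counts(x, y):
    """Pair sums over the contingency table cells, its rows, and its columns."""
    sum_ij = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    sum_a = sum(comb(c, 2) for c in Counter(x).values())
    sum_b = sum(comb(c, 2) for c in Counter(y).values())
    return sum_ij, sum_a, sum_b


def rand_index(x, y):
    """RI = (a + d) / (a + b + c + d), with a, b, c, d as defined in the text."""
    sum_ij, sum_a, sum_b = pair_counts(x, y)
    total = comb(len(x), 2)
    a = sum_ij            # same class in X, same cluster in Y
    b = sum_a - sum_ij    # same class in X, different cluster in Y
    c = sum_b - sum_ij    # different class in X, same cluster in Y
    d = total - a - b - c
    return (a + d) / total


def adjusted_rand_index(x, y):
    """Chance-corrected RI; returns 1.0 if the denominator is 0 (assumed)."""
    sum_ij, sum_a, sum_b = pair_counts(x, y)
    expected = sum_a * sum_b / comb(len(x), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
print(rand_index(y1, y2), adjusted_rand_index(y1, y2))
```

For these two label vectors the pair sums are 20 over the cells, 40 over the rows, and 44 over the columns, out of 136 pairs in total, so the RI equals 92/136 while the ARI is noticeably lower because of the chance correction.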
3.3. Purity
The purity [22] is another simple and transparent evaluation metric for evaluating the clustering performance. For purity, each identified cluster is assigned to the label that is most frequent in the cluster, and then the accuracy of this label assignment is computed by counting the number of correctly assigned cells and dividing by the total number of cells. This mapping is not one-to-one and may be biased toward the class that has the largest size. Nonetheless, it provides a simple metric.
For two clustering results X and Y containing n elements, their purity is calculated in terms of the following formula:
$$\mathrm{Purity}(X, Y) = \frac{1}{n} \sum_{q \in Y} \max_{p \in X} \left| p \cap q \right|, \tag{10}$$
where p and q are the elements of X and Y, respectively, that is, a class in X and a cluster in Y.
The purity ranges from 0 to 1, where 1 is the optimal score and denotes that every cluster in Y contains cells of a single class in X. A larger purity signifies higher concordance between X and Y.
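The purity computation described above can be sketched in Python as follows. The function name is hypothetical, and the label vectors reuse those of Example 1 purely for illustration.

```python
from collections import Counter


def purity(truth, predicted):
    """Fraction of cells that carry the majority true label of their cluster."""
    clusters = {}
    for t, p in zip(truth, predicted):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(counts.values()) for counts in clusters.values()) / len(truth)


y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
singletons = list(range(len(y1)))  # every cell placed in its own cluster
print(purity(y1, y2), purity(y1, singletons))
```

In this example the three clusters contribute 5, 4, and 3 correctly assigned cells, so the purity is 12/17. The second call shows why the metric should be read with care: placing every cell in its own cluster yields a purity of 1, because the mapping from clusters to classes is not one-to-one.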
4. Dataset and Preprocessing
4.1. Dataset
In the study, 33 publicly available scRNAseq datasets are utilized to test the performance of the IGRDCGA, which are shown in Table 1. The datasets contain various single-cell gene expression data that were published in many different studies. Each row in the datasets denotes an observation or cell, while each column denotes a feature or gene. Table 1 shows the features of the datasets, such as GSE, name, the number of cells (#cell), the number of genes (#gene), the number of the ground truth classes (#class), and references.
4.2. Data Preprocessing
For a scRNAseq dataset containing N cells and M genes, if d_{ij} represents the gene expression level of the ith cell versus the jth gene, then its data matrix can be expressed as D = (d_{ij})_{N×M}.
To lower the impact of large gene expression levels on small gene expression levels, all the data in D are normalized according to the following formula:
$$d_{ij} \leftarrow \frac{d_{ij} - d_{j}^{\min}}{d_{j}^{\max} - d_{j}^{\min} + \varepsilon}, \tag{11}$$
where d_{j}^{min} and d_{j}^{max}, respectively, represent the minimum and maximum values of the jth gene in D. The parameter ε is a very small value that prevents the denominator from being 0, because d_{j}^{min} and d_{j}^{max} may both be equal to 0 in a scRNAseq dataset.
It is obvious that all the elements are normalized according to each column (each column denotes a gene). Therefore, after preprocessing, the gene expression level of each gene ranges from 0 to 1, and all the elements in D range from 0 to 1 as well.
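The column-wise normalization described in this subsection can be sketched in Python as follows. The function name, the list-of-rows representation of the data matrix, and the value 1e-10 for the small constant are assumptions made here for illustration; the study does not state them in this form.

```python
def min_max_normalize(data, eps=1e-10):
    """Scale every gene (column) of a cells-by-genes matrix into [0, 1].

    `eps` keeps the denominator nonzero when a gene is constant, for example
    when it is 0 in every cell.
    """
    n_genes = len(data[0])
    mins = [min(row[j] for row in data) for j in range(n_genes)]
    maxs = [max(row[j] for row in data) for j in range(n_genes)]
    return [[(row[j] - mins[j]) / (maxs[j] - mins[j] + eps)
             for j in range(n_genes)]
            for row in data]


# Hypothetical expression matrix: 3 cells (rows) and 3 genes (columns).
raw = [[0.0, 5.0, 0.0],
       [2.0, 15.0, 0.0],
       [4.0, 10.0, 0.0]]
normalized = min_max_normalize(raw)
print(normalized)
```

The third gene is 0 in every cell; without the small constant its normalization would divide by 0, whereas here it is simply mapped to 0.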
5. The Proposed Algorithm
In this study, we present a novel algorithm to address the gene selection and classification for scRNAseq data by combining information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA for short). The coding and the other details of the IGRDCGA are as follows.
5.1. Coding and Initialization
To choose high quality genes from a huge number of genes, we need to design a good coding for each chromosome. For a scRNAseq dataset containing N cells and M genes, we design a coding of variable length. We number the M genes from 1 to M; the coding of a chromosome can then be expressed as follows:
$$C = [c_{1}, c_{2}, \ldots, c_{l}], \tag{12}$$
where c_{1}, c_{2}, and c_{l}, respectively, denote the first locus, the second locus, and the lth locus of the coding, and they are integer values between 1 and M; l denotes the length of the coding of the chromosome, which is variable for different chromosomes.
Obviously, the coding of a chromosome signifies a combination of genes, namely, a result of gene selection. Therefore, different chromosomes can signify different gene selection, and coding of variable length can signify gene selection of variable length.
According to the above coding rules of a chromosome, Algorithm 1 presents the initialization algorithm to generate the initial population whose population size is Npop.
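Since the initialization only has to produce chromosomes that obey the coding rules above, one possible reading can be sketched in Python as follows. This is a hypothetical sketch and not a reproduction of Algorithm 1: the function name, the length bounds, and the choice of drawing distinct gene indices within a chromosome are assumptions made here.

```python
import random


def init_population(n_genes, n_pop, min_len=1, max_len=None, seed=None):
    """Generate n_pop variable-length chromosomes of gene indices in 1..n_genes."""
    rng = random.Random(seed)
    max_len = n_genes if max_len is None else max_len
    population = []
    for _ in range(n_pop):
        length = rng.randint(min_len, max_len)  # each chromosome has its own length
        # Assumption: a gene index appears at most once within a chromosome.
        population.append(rng.sample(range(1, n_genes + 1), length))
    return population


# Hypothetical sizes: 50 candidate genes, 10 chromosomes of at most 8 genes.
population = init_population(n_genes=50, n_pop=10, max_len=8, seed=0)
print(population[:3])
```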

5.2. Crossover Operator
As the length of the coding of a chromosome is variable, we need to design a new crossover operator to fit the coding of variable length. Given the coding of two parents C_{1} and C_{2}, their lengths are, respectively, l_{1} and l_{2}, and the minimum value of l_{1} and l_{2} is marked as l_{3}. The number of the genes to exchange in crossover operator should be less than or equal to l_{3}. The study presents a new crossover operator as shown in Algorithm 2.
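The constraint stated above, that the number of exchanged genes must not exceed the length of the shorter parent, can be illustrated with the following Python sketch. It is a hypothetical operator written for illustration and is not a reproduction of Algorithm 2; in particular, it does not model the dynamic aspect of the crossover, and the function name and the random choice of the exchanged loci are assumptions made here.

```python
import random


def variable_length_crossover(c1, c2, rng=random):
    """Swap k randomly chosen loci between two parents, 1 <= k <= min length."""
    l3 = min(len(c1), len(c2))  # upper bound on the number of exchanged genes
    k = rng.randint(1, l3)
    child1, child2 = list(c1), list(c2)
    pos1 = rng.sample(range(len(c1)), k)  # k distinct loci in the first parent
    pos2 = rng.sample(range(len(c2)), k)  # k distinct loci in the second parent
    for i, j in zip(pos1, pos2):
        child1[i], child2[j] = child2[j], child1[i]
    return child1, child2


# Hypothetical parents of lengths 5 and 3.
rng = random.Random(0)
parent1, parent2 = [1, 2, 3, 4, 5], [6, 7, 8]
child1, child2 = variable_length_crossover(parent1, parent2, rng)
print(child1, child2)
```

Each child keeps the length of its parent, so chromosomes of different lengths can be recombined without padding or truncation. The sketch does not prevent a gene index from appearing twice in a child when the parents share genes; a real operator would need a rule for that case.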

5.3. Mutation Operator
Similarly, we also need to design a new mutation operator to fit the coding of variable length. Given the coding of one parent C_{3}, its length is l_{5}. The location of the gene to mutate in the mutation operator should be less than or equal to l_{5}. The study presents a new mutation operator as shown in Algorithm 3.
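A mutation that respects this constraint can be sketched in Python as follows. It is a hypothetical illustration rather than a reproduction of Algorithm 3: the function name and the choice of replacing the gene at one random locus by a random gene index are assumptions made here.

```python
import random


def mutate(chromosome, n_genes, rng=random):
    """Replace the gene at one random locus by a random index in 1..n_genes."""
    child = list(chromosome)
    locus = rng.randrange(len(child))  # the locus never exceeds the chromosome length
    child[locus] = rng.randint(1, n_genes)
    return child


# Hypothetical parent with three genes drawn from 20 candidate genes.
rng = random.Random(0)
parent = [3, 7, 9]
child = mutate(parent, n_genes=20, rng=rng)
print(child)
```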

5.4. Detailed Steps of IGRDCGA
To illustrate the main process of the proposed algorithm, Figure 1 shows the flow chart of the IGRDCGA.
The detailed steps of the IGRDCGA are summarized as follows.
Step (1). Compute the information gain ratio (IGR) of each gene in terms of formula (5). Then, eliminate those genes whose IGR is 0.
Step (2). To eliminate irrelevant genes and choose high quality genes, the threshold method is used to keep only those genes whose IGR ranks among the top best, as determined by the threshold parameter listed in Section 6.1.
Step (3). Implement Algorithm 1 described in Section 5.1 to obtain the initial population P_{1}.
Step (4). Compute the fitness and obtain the best chromosome. (1) For each chromosome p_{i} in P_{1}, select those genes determined by p_{i} to obtain a new dataset; then conduct k-means clustering on this dataset to compute the NMI metric as the fitness of p_{i}. (2) Obtain the best chromosome and its fitness by comparing the fitness values.
Step (5). Increase the number of iterations t by 1; then judge whether t exceeds the maximal number of iterations t_{max}. If yes, then output the best chromosome and its fitness, and the IGRDCGA terminates; otherwise, turn to the next step.
Step (6). The tournament selection strategy [55, 56] is used to choose chromosomes from P_{1} to obtain the new population P_{2}.
Step (7). Generate a random number r. If r is less than or equal to the crossover probability, then randomly select two chromosomes from P_{2} to perform the crossover operator described in Section 5.2. The final population is marked as P_{3}.
Step (8). Generate a random number r. If r is less than or equal to the mutation probability, then randomly select one chromosome from P_{3} to perform the mutation operator described in Section 5.3. The final population is marked as P_{4}.
Step (9). Compute the fitness value of each chromosome in P_{4} and obtain the best chromosome of P_{4} and its fitness. If this fitness is larger than the best fitness found so far, then update the best chromosome and the best fitness accordingly. The population P_{4} then serves as P_{1} in the next iteration.
Step (10). Turn to Step 5.
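The iterative part of these steps (Steps 4 to 10) can be outlined as a compact Python skeleton. It is only a sketch under stated assumptions: the fitness function, the crossover operator, and the mutation operator are passed in as hypothetical callables, the default probabilities, iteration count, and tournament size are placeholder values rather than the settings of the study, and the toy fitness below replaces the k-means and NMI evaluation of Step 4.

```python
import random


def run_ga(population, fitness, crossover, mutate,
           p_c=0.8, p_m=0.1, t_max=50, tournament_size=2, seed=0):
    """Skeleton of Steps 4-10: returns the best chromosome and its fitness."""
    rng = random.Random(seed)
    scores = [fitness(c) for c in population]                  # Step 4
    best_fit, best = max(zip(scores, population), key=lambda s: s[0])
    best = list(best)
    for _ in range(t_max):                                      # Steps 5 and 10
        # Step 6: tournament selection builds a population of the same size.
        selected = []
        for _ in population:
            contenders = rng.sample(range(len(population)), tournament_size)
            winner = max(contenders, key=lambda i: scores[i])
            selected.append(list(population[winner]))
        # Step 7: with probability p_c, cross two randomly chosen chromosomes.
        if rng.random() <= p_c:
            i, j = rng.sample(range(len(selected)), 2)
            selected[i], selected[j] = crossover(selected[i], selected[j], rng)
        # Step 8: with probability p_m, mutate one randomly chosen chromosome.
        if rng.random() <= p_m:
            i = rng.randrange(len(selected))
            selected[i] = mutate(selected[i], rng)
        # Step 9: re-evaluate and keep the best chromosome seen so far.
        population = selected
        scores = [fitness(c) for c in population]
        gen_fit, gen_best = max(zip(scores, population), key=lambda s: s[0])
        if gen_fit > best_fit:
            best_fit, best = gen_fit, list(gen_best)
    return best, best_fit


# Toy usage with hypothetical operators and a hypothetical fitness function.
informative = {1, 2, 3}
toy_fitness = lambda c: len(informative & set(c)) / len(c)
toy_crossover = lambda a, b, rng: (a[:1] + b[1:], b[:1] + a[1:])
toy_mutate = lambda c, rng: c[:-1] + [rng.randint(1, 10)]
rng0 = random.Random(1)
start = [rng0.sample(range(1, 11), rng0.randint(2, 5)) for _ in range(8)]
best, best_fit = run_ga(start, toy_fitness, toy_crossover, toy_mutate)
print(best, best_fit)
```

Because the best chromosome is tracked across iterations, the returned fitness can never be lower than the best fitness of the initial population, which corresponds to the non-decreasing iteration curves that such a scheme produces.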
5.5. Time Complexity Analysis of IGRDCGA
Suppose that a scRNAseq dataset contains N cells and M genes, and the population size in GA is Npop. The time complexity of the IGRDCGA is analyzed as follows.
The time complexity of Step 1 is O (M), the time complexity of Step 2 is O (1), and the time complexity of Algorithm 1 in Step 3 is O (Npop). In Step 4, the time complexity of the k-means clustering is O (N), and that of obtaining the best chromosome is O (Npop). Step 5 to Step 10 form the main iteration process of the IGRDCGA; their time complexities determine the time complexity of the IGRDCGA. Steps 5 and 6 are both executed t_{max} times (the maximal number of iterations), and their time complexities are both O (t_{max}). In Step 7, the time complexity of selecting two chromosomes is O (t_{max}), while the crossover operator is executed l3 times (the minimum value of the lengths of the two chromosome codes), so that its time complexity is O (l3 t_{max}). Therefore, the time complexity of Step 7 is O (l3 t_{max}). Similarly, the time complexity of Step 8 is O (l6 t_{max}). In Step 9, the time complexity of computing the fitness is O (t_{max} Npop N), and the time complexity of obtaining the best chromosome is O (t_{max} Npop). Namely, the time complexity of Step 9 is O (t_{max} Npop N). Obviously, the time complexity of Step 10 is O (t_{max}).
To sum up, the time complexity of the IGRDCGA involves the following time complexities: O (1), O (M), O (Npop), O (t_{max}), O (l3 t_{max}), O (l6 t_{max}), O (t_{max} Npop), and O (t_{max} Npop N). Obviously, O (1) < O (M), O (Npop) < O (t_{max} Npop) < O (t_{max} Npop N), and O (t_{max}) < O (t_{max} Npop) < O (t_{max} Npop N). Therefore, the decisive time complexities of the IGRDCGA are O (M), O (l3 t_{max}), O (l6 t_{max}), and O (t_{max} Npop N). As l3 < M and l6 < M, it follows that O (l3 t_{max}) < O (M t_{max}), O (l6 t_{max}) < O (M t_{max}), and O (M) < O (M t_{max}). Thus, the time complexity of the IGRDCGA is determined by O (t_{max} Npop N) and O (M t_{max}). From Table 1, it can be clearly seen that, for most single-cell datasets, the number of cells N is much smaller than the number of genes M, and the population size Npop is commonly small as well. Consequently, for most single-cell datasets, the time complexity of the IGRDCGA can be considered as O (M t_{max}).
6. Numerical Results
In order to evaluate the performance of the IGRDCGA, it is compared with two frequently used clustering algorithms, k-means and spectral clustering [57], and with a state-of-the-art single-cell classification algorithm, SIMLR [58]. To assess its gene selection performance, the IGRDCGA is also compared with two frequently used dimensionality reduction algorithms, PCA and tSNE. For the SIMLR, we use its MATLAB program, which is available at https://github.com/BatzoglouLabSU/SIMLR. For the k-means, PCA, and tSNE, the study utilizes the built-in functions in MATLAB.
6.1. Parameter Values
The related parameters of the IGRDCGA are as follows.
(i) Parameter for IGR: the threshold used to keep the top-ranked genes.
(ii) Parameters and terminal condition for the IGRDCGA: the population size Npop, the crossover probability, the mutation probability, and the maximal number of iterations t_{max}.
The other competing algorithms use their default parameter values.
6.2. Results
6.2.1. Comparisons of the IGR
The IGR of each gene shows its relevance. The IGR ranges from 0 to 1 while 0 represents that the gene has no relevance for the classification. Therefore, this can be employed to obviate irrelevant genes roughly. We compute the IGR of each gene for all scRNAseq datasets and obtain the number of the irrelevant genes (#irrelevant genes) and their percentages in respect of the total genes. The obtained results are shown in Table 2.
From Table 2, we can see that irrelevant genes exist in most datasets. The percentage of irrelevant genes among the total genes indicates the gene redundancy rate. There are 24 datasets whose gene redundancy rate is larger than 0 in Table 2, which account for 72.73% of the 33 datasets. This demonstrates that the criterion IGR = 0 is a useful way to eliminate irrelevant genes.
However, Table 2 also shows that there are 9 datasets whose redundancy rate is 0. Namely, no irrelevant genes can be found in these 9 datasets by means of the criterion IGR = 0. This also demonstrates that the criterion IGR = 0 can identify only part of the irrelevant genes and cannot identify all of them. Therefore, our proposed algorithm IGRDCGA additionally utilizes a threshold method to eliminate irrelevant genes.
Comparatively speaking, the datasets containing more genes possess higher redundancy rates of genes and vice versa. Nevertheless, this is not absolute, as can be shown in Table 2.
6.2.2. Comparisons of Evaluation Metrics
We perform the IGRDCGA and several competing algorithms on the scRNAseq datasets described in Section 4.1. Three evaluation metrics, NMI, ARI, and purity, are employed to evaluate the performance of the IGRDCGA and the other five competing algorithms. All the algorithms are independently performed for 20 runs to obtain their average values. For all 33 datasets, the average values of the NMI, ARI, and purity metrics are shown in Tables 3–5, respectively.
From Table 3, we can see that, for 22 of the 33 datasets (66.67%), the IGRDCGA gains the largest NMI among the six algorithms. Meanwhile, we can also observe that, for 6 of the remaining 11 datasets, the difference between the largest NMI and the NMI obtained by the IGRDCGA is very small. For the k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest NMI are, respectively, 0, 1, 6, 2, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of NMI.
Table 4 shows that, for 24 of the 33 datasets (72.73%), the IGRDCGA acquires the maximal ARI among the six algorithms. For the k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the maximal ARI are, respectively, 0, 0, 3, 1, and 7. By contrast, the IGRDCGA is superior to the other five competing algorithms in terms of ARI.
From Table 5, it can be clearly seen that, for 21 of the 33 datasets (63.64%), the purity metrics obtained by the IGRDCGA are the largest among the six algorithms. For the k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest purity metrics are, respectively, 0, 1, 6, 1, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of the purity metric.
By comparing Tables 3–5, we summarize the best evaluation metrics obtained by six algorithms in Table 6, where the NMI, ARI, and purity metrics are, respectively, denoted by 1, 2, and 3.
From Table 6, we can clearly see that, for 15 of the 33 datasets (45.45%), the NMI, ARI, and purity metrics obtained by the IGRDCGA are all the best among the six algorithms. For the k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the best NMI, ARI, and purity metrics are, respectively, 0, 0, 2, 1, and 6. Thus, the IGRDCGA outperforms the other five competing algorithms in terms of the NMI, ARI, and purity metrics.
In the meantime, Table 6 shows that the IGRDCGA gains the best NMI, ARI, and purity metrics only on part of the datasets. For the Allodiploid, the IGRDCGA, SIMLR, and PCA all obtain the best NMI, ARI, and purity metrics. We can observe from Table 1 that the Allodiploid possesses far fewer cells and genes than the other datasets. Namely, for datasets with smaller dimensions, an algorithm can easily obtain the best NMI, ARI, and purity metrics. In this case, the PCA is the fittest method, as it is the simplest and easiest to implement of the above three algorithms. For the Ting, Chung, and Li2, the IGRDCGA obtains the best NMI and ARI, while the SIMLR obtains the best purity metrics. For the Camp15, Camp17, Grun, and Muraro, the IGRDCGA obtains the best ARI and purity metrics, while the SIMLR obtains the best NMI. Comparing the IGRDCGA with the SIMLR, we can clearly see that the number of best metrics obtained by the IGRDCGA is far larger than that obtained by the SIMLR. This demonstrates that the IGRDCGA is superior to the SIMLR in terms of the NMI, ARI, and purity metrics.
For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron datasets, the tSNE gains the best NMI, ARI, and purity metrics. However, for the other datasets, the three evaluation metrics obtained by the tSNE are far worse than those obtained by the IGRDCGA. As shown in Table 3, there are 17 datasets whose NMI of the tSNE is less than 0.1, while there are no datasets whose NMI of the IGRDCGA is less than 0.1. In Table 4, there are 20 datasets whose ARI of the tSNE is less than 0.1, while there are no datasets whose ARI of the IGRDCGA is less than 0.1. In Table 5, there are 19 datasets whose purity metrics of the tSNE are less than 0.5, while there are only 4 datasets whose purity metrics of the IGRDCGA are less than 0.5. Namely, for most datasets except the above six datasets, the NMI, ARI, and purity metrics obtained by the IGRDCGA are superior to those obtained by the tSNE. This attests that the IGRDCGA outperforms the tSNE for most datasets in terms of the NMI, ARI, and purity metrics.
A detailed analysis of the above six datasets and the other datasets reveals the following principal features and differences. To begin with, the six datasets all possess very low redundancy rates of genes. Table 2 shows that the redundancy rates of the Zeisel and Baron are, respectively, 9.8 and 0.11, and those of the other four datasets are all 0. For datasets with low redundancy rates, the IGRDCGA may lose some genes, which decreases the classification performance. In addition, the six datasets have relatively low NMI and ARI for all six algorithms. From Table 3, we can easily see that all the NMI metrics of the Nestorowa, Wang, and Manno_m are, respectively, less than 0.38, 0.33, and 0.48. From Table 4, it can be seen that all the ARI metrics of the Nestorowa, Wang, Manno_m, and Baron are, respectively, less than 0.26, 0.20, 0.19, and 0.33. Thirdly, the six datasets are almost completely sparse. From our provided appendix file, it can be observed that, in all datasets, the data values within [0, 0.1] outnumber those in the other data intervals. Nevertheless, this is far more pronounced in the six datasets, which become almost completely sparse if the data values within [0, 0.1] are set to 0. For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron, the data values within [0, 0.1], respectively, account for more than 88%, 95%, 73%, 91%, 80%, and 91% of all values, which is so high that they nearly drown out the data values within the other ranges.
6.2.3. Iteration Plots and Heat Maps
In order to clearly illustrate the performance of the IGRDCGA, three scRNAseq datasets, Allodiploid, Yeo, and Grun, are selected as representative datasets according to the different values of the NMI metrics in Table 3. Their iteration plots are, respectively, illustrated in Figures 2–4. For the Allodiploid, the NMI metric obtained by the IGRDCGA is 1, which indicates that the clustering results are fully consistent with the ground truth class labels. It can be clearly seen from Figure 2 that the iteration plot of the Allodiploid is parallel to the x-axis and its value is the maximal NMI of 1. Namely, the IGRDCGA obtains the maximal NMI at the first iteration, so the later iterations are unnecessary. Among the 33 scRNAseq datasets, only the Allodiploid displays the plot illustrated in Figure 2.
In Figure 3, the iteration plot of Yeo ascends as the number of iterations increases. The ascending trend is very obvious in the early period of the iteration and then stops as the number of iterations increases further. Namely, the NMI metric settles at a fixed value in the later period of the iteration. Among the 33 scRNAseq datasets, most datasets display a plot similar to that shown in Figure 3.
From Figure 4, we can see that the iteration plot of Grun keeps ascending as the number of iterations increases. The ascending trend remains obvious over the whole period of the iteration. Among the 33 scRNAseq datasets, a few datasets display a plot similar to that shown in Figure 4.
In order to illustrate the classification performance of the IGRDCGA more visually, the heat maps of the Allodiploid, Sasagawa, and Yan are illustrated in Figures 5–7, respectively.
The Allodiploid contains 16 cells, 2406 genes, and 2 classes, whose ground truth class labels are illustrated in the first column of Figure 5. From Figure 5, it can be seen that the results obtained by the IGRDCGA, SIMLR, and PCA are fully concordant with the ground truth class labels; those obtained by the k-means and spectral clustering fall almost entirely into one class; and those obtained by the tSNE are oscillatory. Obviously, the results obtained by the k-means, spectral clustering, and tSNE are much worse than those obtained by the IGRDCGA, SIMLR, and PCA; the results of the tSNE are the worst among the six algorithms.
The dataset Sasagawa contains 23 cells, 32700 genes, and 3 classes, whose ground truth labels are illustrated in the first column of Figure 6. From Figure 6, it can be observed that the results obtained by the IGRDCGA are fully concordant with the ground truth class labels; those obtained by spectral clustering fall almost entirely into one class; those obtained by the tSNE are oscillatory and the worst. In the results obtained by the k-means, SIMLR, and PCA, the results of subclass 2 (the ground truth class labels are 2) are fully concordant with the ground truth class labels, while the other subclasses are confused. Therefore, for the Sasagawa, the IGRDCGA gains the best results while the tSNE gains the worst results.
The first column of Figure 7 illustrates the ground truth labels of the Yan, which contains 90 cells, 20214 genes, and 6 classes. From Figure 7, we can clearly see that none of the results obtained by the six algorithms are fully concordant with the ground truth class labels. However, the results obtained by the IGRDCGA are closer to the ground truth class labels than those obtained by the other algorithms, and its results for subclasses 2, 3, 4, 5, and 6 are fully concordant with the ground truth class labels. The results obtained by the tSNE are oscillatory and the worst among the six algorithms. The results obtained by the other algorithms are all confused. Therefore, for the Yan, the IGRDCGA obtains the best results whereas the tSNE obtains the worst results.
7. Conclusion and Future Work
In this study, we present a novel algorithm to address the gene selection and classification for scRNAseq data. It combines the information gain ratio (IGR) and the genetic algorithm with dynamic crossover (DCGA) and is abbreviated as IGRDCGA. It utilizes the information gain ratio to eliminate irrelevant genes roughly and utilizes the DCGA to choose high quality genes finely. We have run the IGRDCGA and several competing algorithms on 33 publicly available scRNAseq datasets. The obtained results demonstrate that the IGRDCGA can eliminate irrelevant genes and choose high quality genes effectively, and that it is superior to the competing algorithms in terms of classification accuracy.
The algorithm is still being enhanced and improved. One direction is to utilize a more efficient coding to improve its convergence rate and stability. Another direction is to extend the IGRDCGA to classification algorithms for other high dimensional problems.
Data Availability
The datasets supporting this study are publicly available and can be downloaded from EMBL-EBI (https://www.ebi.ac.uk/) or the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/).
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was supported by the Science Research Foundation for High-level Talents of Yulin Normal University (no. G2021ZK17).