Abstract

Single-cell RNA sequencing (scRNA-seq) is an emerging and promising technology. A scRNA-seq dataset contains a huge number of genes; while some genes are of high quality, others are noisy or irrelevant because of nonspecific technical factors. These noisy and irrelevant genes may strongly influence downstream analyses such as cell classification, gene function analysis, and cancer biomarker detection. It is therefore important to eliminate the irrelevant genes and select high quality genes with gene selection methods. In this study, a novel gene selection and classification method is presented that combines the information gain ratio and a genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and the genetic algorithm with dynamic crossover (DCGA) is then used to select high quality genes finely from this preliminary subset. The main difference between the IGRDCGA and existing methods is that the DCGA and the IGR are integrated for the first time and applied to gene selection in scRNA-seq data. We run the IGRDCGA and several competing methods on real-world scRNA-seq datasets. The results demonstrate that the IGRDCGA selects high quality genes effectively and efficiently and outperforms the competing methods in terms of both dimensionality reduction and classification accuracy.

1. Introduction

A scRNA-seq dataset often contains a large number of genes, which may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks and may seriously degrade the efficiency of downstream data analysis. If all genes are used for classification, both the classification accuracy and the classification efficiency may be low. An effective and efficient gene selection algorithm is therefore vital for eliminating these irrelevant genes and selecting high quality genes.

Feature selection (FS) problems can be regarded as large-scale global optimization problems [1]; therefore, bioinspired intelligence optimization algorithms can be used to address them. Wang et al. [1] treated FS problems as large-scale global optimization problems. Nakisa et al. [2] utilized evolutionary computation (EC) to search for the optimal feature subset. Eroglu and Kilic [3] integrated a genetic local search algorithm with a k-nearest neighbor classifier to select feature subsets. Maleki and Zeinali [4] used a hybrid genetic algorithm (GA) for dimension reduction and applied it to the classification of lung cancer. Tahir et al. [5] presented a binary chaotic GA to select features for healthcare datasets. However, how to properly use the GA for gene selection and classification of scRNA-seq data is an important question to consider first; to the best of our knowledge, only a few studies have addressed it so far.

This study integrates the IGR and the DCGA to address gene selection and classification of scRNA-seq data and proposes a novel gene selection and classification algorithm, IGRDCGA. The IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and then employs the DCGA to select high quality genes finely from that subset.

The rest of this study is organized as follows. Section 2 briefly describes the information gain ratio and the genetic algorithm. Section 3 states three evaluation metrics. The datasets and the preprocessing method used in the study are described in Section 4. The coding and the other details of the IGRDCGA are described in Section 5. The numerical results of the IGRDCGA and several competing algorithms are given in Section 6. Section 7 concludes the study.

2. Information Gain Ratio and Genetic Algorithm

In this section, the information gain ratio and the genetic algorithm are briefly described.

2.1. Information Gain Ratio

The information gain is a metric derived from information entropy and is often used to evaluate the degree of mutual dependence between two random variables; that is, it is a symmetric measure of dependency [6].

For two discrete random variables X and Y, their information entropies can be calculated, respectively, as

H(X) = -\sum_{x} p(x) \log_2 p(x), \qquad H(Y) = -\sum_{y} p(y) \log_2 p(y),

where p(x) and p(y) represent the marginal probabilities of X = x and Y = y, respectively.

The conditional entropy and information gain [6–8] of X versus Y can be calculated, respectively, as

H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log_2 p(x \mid y),

IG(X, Y) = H(X) - H(X \mid Y).

The information gain ratio of X versus Y is the ratio of the information gain to the information entropy:

IGR(X, Y) = \frac{IG(X, Y)}{H(X)}.

The IGR ranges from 0 to 1, where 1 represents that X completely determines Y and 0 represents that X and Y are completely independent.
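
For concreteness, the following Python sketch estimates H(X), the information gain, and the IGR from two discrete label vectors. The function names and the toy data are ours, and the use of H(X) as the denominator follows the reconstruction above; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label vector (base-2 logarithm)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) for two discrete label vectors."""
    y_vals, y_counts = np.unique(y, return_counts=True)
    h = 0.0
    for y_val, y_count in zip(y_vals, y_counts):
        h += (y_count / len(y)) * entropy(x[y == y_val])
    return h

def information_gain_ratio(x, y):
    """IGR(X, Y) = (H(X) - H(X|Y)) / H(X); returns 0 if H(X) is 0."""
    hx = entropy(x)
    if hx == 0:
        return 0.0
    return (hx - conditional_entropy(x, y)) / hx

# toy usage: class labels x and a discretized gene y
x = np.array([1, 1, 1, 2, 2, 2])
y = np.array([0, 0, 1, 1, 1, 1])
print(information_gain_ratio(x, y))
```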

2.2. Genetic Algorithm

The genetic algorithm (GA) is a bioinspired intelligence optimization algorithm. It is inspired by the process of natural selection and belongs to the family of evolutionary algorithms (EAs) [9–13]. It is commonly used to generate feasible solutions to optimization problems by applying operators such as selection, crossover, and mutation [14–16]. The selection operator chooses a subset of chromosomes from the previous population for the crossover operator; frequently used selection operators are stochastic strategies such as tournament selection. The crossover operator exchanges one or more genes between the two parents chosen by the selection operator, simulating reproduction or recombination in biological evolution; the GA decides whether to perform the crossover operator according to the crossover probability. Through the crossover operator, two parents may generate two or more offspring, depending on the crossover strategy; frequently used crossover operators include single-point, two-point, and multipoint crossover. The mutation operator modifies one or more genes in a single chromosome, simulating gene mutation in biological evolution; similarly, the GA decides whether to perform the mutation operator according to the mutation probability. Frequently used mutation operators include locus mutation, exchange mutation, and insertion mutation. The mutation operator acts on one chromosome, whereas the crossover operator acts on two chromosomes. Generally, the crossover probability is much larger than the mutation probability, which accords with the biological evolution process.
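
As a minimal illustration of the classical operators mentioned above, the sketch below applies single-point crossover and locus mutation to fixed-length integer chromosomes. It is a generic textbook-style sketch with hypothetical helper names, not the variable-length operators proposed later in Section 5.

```python
import random

def single_point_crossover(parent1, parent2):
    """Exchange the tails of two equal-length chromosomes at a random cut point.
    Assumes both parents have length >= 2."""
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def locus_mutation(chromosome, n_genes):
    """Replace the value at one randomly chosen locus with a random gene index."""
    mutant = chromosome[:]
    locus = random.randrange(len(mutant))
    mutant[locus] = random.randint(1, n_genes)
    return mutant

p1, p2 = [1, 5, 9, 12], [3, 7, 2, 8]
c1, c2 = single_point_crossover(p1, p2)
m1 = locus_mutation(c1, n_genes=20)
```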

3. Evaluation Metrics

To evaluate the performance of the IGRDCGA, the study utilizes the following evaluation metrics: NMI (normalized mutual information) [17–19], ARI (adjusted Rand index) [20, 21], and purity [22].

3.1. NMI (Normalized Mutual Information)

The NMI is a frequently used evaluation metric. It is often used to evaluate the agreement, and the difference, between an obtained clustering result and the ground truth result.

For two discrete random variables X and Y, their MI (mutual information) can be calculated as

MI(X, Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)},

where p(x) and p(y) denote the marginal probabilities of X = x and Y = y, respectively, and p(x, y) denotes their joint probability.

The normalized MI is taken as the NMI, which can be calculated as

NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X)\, H(Y)}},

where H(X) and H(Y) denote the entropies of X and Y, respectively. The NMI ranges from 0 to 1, where 1 is the optimal score and indicates that X and Y agree completely. A larger NMI signifies higher concordance between X and Y.

Example 1. Suppose that the ground truth class labels of a single-cell dataset are y1 = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3] and that the classification result of an algorithm is y2 = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3]. Both y1 and y2 take values in {1, 2, 3}. The marginal and joint probabilities are first estimated from y1 and y2; the MI and the entropies H(y1) and H(y2) are then computed, and finally the NMI follows from the formula above.
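
The value of Example 1 can be reproduced with a few lines of Python. The scikit-learn call below is our own illustration (its 'geometric' averaging matches the square-root denominator above), and the printed value is not quoted from the original text.

```python
from sklearn.metrics import normalized_mutual_info_score

y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]  # ground truth labels
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]  # predicted labels

# 'geometric' divides MI by sqrt(H(X) * H(Y)), matching the NMI formula above
nmi = normalized_mutual_info_score(y1, y2, average_method='geometric')
print(round(nmi, 4))
```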

3.2. ARI (Adjusted Rand Index)

The ARI is another widely used evaluation metric for measuring the concordance between two clustering results. The RI (Rand index) measures the similarity between two results: it considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in the predicted and ground truth clusterings.

For two clustering results X and Y containing n elements, their RI is calculated as

RI = \frac{a + d}{a + b + c + d} = \frac{a + d}{\binom{n}{2}},

where a is the number of pairs that are in the same class in X and the same cluster in Y; b is the number of pairs that are in the same class in X but not in the same cluster in Y; c is the number of pairs that are not in the same class in X but are in the same cluster in Y; and d is the number of pairs that are neither in the same class in X nor in the same cluster in Y.

The overlap between X and Y can be summarized in a contingency table, in which each element n_{ij} represents the number of objects in common between the i-th class of X and the j-th cluster of Y. The ARI is the adjusted RI, which can be calculated as

ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2}\right] - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] / \binom{n}{2}},

where the n_{ij} are values from the contingency table, a_i is the sum of the i-th row of the contingency table, b_j is the sum of the j-th column, and \binom{\cdot}{2} denotes a binomial coefficient.

The ARI is the corrected-for-chance version of the RI. While the RI only yields values between 0 and 1, the ARI can take negative values when the index is less than its expected value. The optimal ARI score is 1, which indicates that the two clustering results are identical. A larger ARI signifies higher concordance between X and Y.
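
A minimal way to obtain the ARI of two label vectors is to use scikit-learn's implementation, as sketched below with hypothetical toy labels, rather than evaluating the contingency-table formula by hand.

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
labels_pred = [1, 1, 2, 2, 2, 2, 3, 3, 1]

# ARI is 1 for identical partitions and can be negative for worse-than-chance agreement
print(adjusted_rand_score(labels_true, labels_pred))
```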

3.3. Purity

The purity [22] is another simple and transparent evaluation metric for assessing clustering performance. For purity, each identified cluster is assigned the label that is most frequent within it, and the accuracy of this assignment is then computed by counting the number of correctly assigned cells and dividing by the total number of cells N. This mapping is not one-to-one and may be biased toward the class with the largest size. Nonetheless, it provides a simple metric.

For two clustering results X and Y containing n elements, their purity is calculated as

purity(X, Y) = \frac{1}{n} \sum_{q \in Y} \max_{p \in X} |p \cap q|,

where p and q denote the clusters of X and Y, respectively, regarded as sets of cells.

The purity ranges from 0 to 1, where 1 is the optimal score and indicates that every cluster in Y contains only cells from a single class of X. A larger purity signifies higher concordance between X and Y.
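
Because purity has no standard library function, the following sketch implements the majority-label computation described above; the helper name and toy labels are ours.

```python
import numpy as np

def purity_score(labels_true, labels_pred):
    """Assign each predicted cluster its most frequent true label and
    return the fraction of correctly assigned cells."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    correct = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        # count of the most frequent ground truth label within this cluster
        correct += np.bincount(members).max()
    return correct / len(labels_true)

print(purity_score([1, 1, 1, 2, 2, 2], [1, 1, 2, 2, 2, 2]))
```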

4. Dataset and Preprocessing

4.1. Dataset

In the study, 33 publicly available scRNA-seq datasets, shown in Table 1, are utilized to test the performance of the IGRDCGA. The datasets contain various single-cell gene expression data published in different studies. Each row of a dataset denotes an observation (cell), and each column denotes a feature (gene). Table 1 lists the characteristics of the datasets, including the GSE accession, name, number of cells (#cell), number of genes (#gene), number of ground truth classes (#class), and references.

4.2. Data Preprocessing

For a scRNA-seq dataset containing N cells and M genes, let d_{ij} represent the gene expression level of the i-th cell for the j-th gene; then the data matrix can be expressed as D = (d_{ij})_{N \times M}.

To lower the impact of genes with large expression levels on genes with small expression levels, all the entries of D are normalized according to

d'_{ij} = \frac{d_{ij} - \min_j}{\max_j - \min_j + \varepsilon},

where \min_j and \max_j represent the minimum and maximum values of the j-th gene (column) in D, respectively. The parameter \varepsilon is a very small positive value that prevents the denominator from being 0, because \min_j and \max_j may both be equal to 0 in a scRNA-seq dataset.

The elements are thus normalized column by column (each column denotes a gene). Therefore, after preprocessing, the expression level of each gene ranges from 0 to 1, and all the elements of D range from 0 to 1 as well.
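
A column-wise min-max normalization matching the formula above can be sketched as follows; the constant eps plays the role of the small parameter that prevents a zero denominator, and its value here is only illustrative.

```python
import numpy as np

def normalize_genes(D, eps=1e-12):
    """Min-max normalize each column (gene) of the cell-by-gene matrix D to [0, 1]."""
    D = np.asarray(D, dtype=float)
    col_min = D.min(axis=0)
    col_max = D.max(axis=0)
    return (D - col_min) / (col_max - col_min + eps)

D = np.array([[0.0, 5.0], [2.0, 5.0], [4.0, 5.0]])
print(normalize_genes(D))
```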

5. The Proposed Algorithm

In this study, we present a novel algorithm to address gene selection and classification for scRNA-seq data by combining the information gain ratio and a genetic algorithm with dynamic crossover (IGRDCGA for short). The coding and the other details of the IGRDCGA are described as follows.

5.1. Coding and Initialization

To select high quality genes from a huge number of genes, we need to design a suitable coding for each chromosome. For a scRNA-seq dataset containing N cells and M genes, we design a variable-length coding. Numbering the M genes from 1 to M, the coding of a chromosome can be expressed as

C = (g_1, g_2, \ldots, g_l),

where g_1, g_2, \ldots, g_l denote the first, second, ..., and l-th locus of the coding and are integer values between 1 and M, and l denotes the length of the coding, which may differ between chromosomes.

Obviously, the coding of a chromosome represents a combination of genes, that is, a gene selection result. Therefore, different chromosomes represent different gene selections, and the variable-length coding allows gene subsets of different sizes.

According to the above coding rules, Algorithm 1 presents the initialization procedure that generates the initial population of size Npop; a Python sketch of this procedure is given after the listing.

(1)Generate one random integer l in the interval [1, M].
(2)Generate l random integers in the interval [1, M].
(3)Repeat steps (1) and (2) Npop times to obtain the initial population.
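
A minimal Python sketch of Algorithm 1 might look as follows; we additionally assume that the gene indices within a chromosome are distinct, since a chromosome encodes a gene subset, and the function name is ours.

```python
import random

def initialize_population(n_pop, n_genes):
    """Generate n_pop variable-length chromosomes; each chromosome is a list of
    distinct gene indices in [1, n_genes] (Algorithm 1)."""
    population = []
    for _ in range(n_pop):
        length = random.randint(1, n_genes)                         # step (1): random length l
        chromosome = random.sample(range(1, n_genes + 1), length)   # step (2): l gene indices
        population.append(chromosome)
    return population

pop = initialize_population(n_pop=5, n_genes=100)
```
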
5.2. Crossover Operator

As the length of the chromosome coding is variable, we need to design a new crossover operator that fits variable-length codings. Given two parents C1 and C2 with coding lengths l1 and l2, respectively, let l3 denote the minimum of l1 and l2. The number of genes exchanged by the crossover operator must be less than or equal to l3. The study presents the new crossover operator shown in Algorithm 2; a Python sketch follows the listing.

(1)Generate one random integer l4 that is less than or equal to l3.
(2)For C1, generate the location to exchange loc1, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l1.
(3)For C2, generate the location to exchange loc2, which contains l4 different random integers that are larger than or equal to 1 and less than or equal to l2.
(4)The coding bits at the positions loc1 in C1 and the coding bits at the positions loc2 in C2 are exchanged with each other.
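
Under the same assumptions, Algorithm 2 can be sketched as follows; positions are 0-based here, whereas the text uses 1-based loci, and the function name is ours.

```python
import random

def dynamic_crossover(c1, c2):
    """Variable-length crossover (Algorithm 2): exchange l4 randomly chosen
    positions between the two parent chromosomes."""
    c1, c2 = c1[:], c2[:]
    l3 = min(len(c1), len(c2))
    l4 = random.randint(1, l3)                   # step (1): number of positions to exchange
    loc1 = random.sample(range(len(c1)), l4)     # step (2): positions in c1
    loc2 = random.sample(range(len(c2)), l4)     # step (3): positions in c2
    for i, j in zip(loc1, loc2):                 # step (4): exchange coding bits
        c1[i], c2[j] = c2[j], c1[i]
    return c1, c2

child1, child2 = dynamic_crossover([3, 17, 42, 8], [5, 9, 21])
```
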
5.3. Mutation Operator

Similarly, we also need to design a new mutation operator that fits the variable-length coding. Given one parent C3 with coding length l5, the loci to mutate must be less than or equal to l5. The study presents the new mutation operator shown in Algorithm 3; a Python sketch follows the listing.

(1)Generate one random integer l6 that is less than or equal to l5.
(2)For C3, generate the location to mutate loc3, which contains l6 different random integers that are larger than or equal to 1 and less than or equal to l5.
(3)The coding bits at the positions loc3 in C3 are replaced with other integers that are not equal to any coding bit of C3.
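
Algorithm 3 can be sketched in the same spirit; the sketch assumes that the total number of genes is much larger than the chromosome length so that enough unused gene indices exist, and the function name is ours.

```python
import random

def dynamic_mutation(c3, n_genes):
    """Variable-length mutation (Algorithm 3): replace l6 randomly chosen loci
    with gene indices not already present in the chromosome."""
    mutant = c3[:]
    l6 = random.randint(1, len(mutant))            # step (1): number of loci to mutate
    loc3 = random.sample(range(len(mutant)), l6)   # step (2): loci to mutate
    unused = [g for g in range(1, n_genes + 1) if g not in set(c3)]
    replacements = random.sample(unused, l6)       # step (3): new, distinct gene indices
    for pos, gene in zip(loc3, replacements):
        mutant[pos] = gene
    return mutant

mutant = dynamic_mutation([3, 17, 42, 8], n_genes=100)
```
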
5.4. Detailed Steps of IGRDCGA

To illustrate the main process of the proposed algorithm, Figure 1 shows the flow chart of the IGRDCGA.

The detailed steps of the IGRDCGA are summarized as follows; a Python sketch of the fitness evaluation used in Step 4 is given after the listing.
Step (1). Compute the information gain ratio IGR of each gene as defined in Section 2.1. Then, eliminate the genes whose IGR is 0.
Step (2). To further eliminate irrelevant genes and retain high quality genes, a threshold method is used to keep the genes whose IGR values rank in the top portion; the threshold used in the numerical experiments is given in Section 6.1.
Step (3). Run Algorithm 1 described in Section 5.1 to obtain the initial population P1.
Step (4). Compute the fitness of every chromosome and determine the best chromosome. (1) For each chromosome pi in P1, select the genes determined by pi to obtain a reduced dataset; then run k-means clustering on the reduced dataset and take the NMI metric as the fitness of pi. (2) Obtain the best chromosome and its fitness by comparing the fitness values.
Step (5). Judge whether the maximal number of iterations tmax has been reached. If yes, output the best chromosome and its fitness, and the IGRDCGA terminates; otherwise, go to the next step.
Step (6). Tournament selection [55, 56] is used to choose chromosomes from P1 and obtain the new population P2.
Step (7). Generate a random number r. If r is less than or equal to the crossover probability, randomly select two chromosomes from P2 and apply the crossover operator described in Section 5.2. The resulting population is denoted as P3.
Step (8). Generate a random number r. If r is less than or equal to the mutation probability, randomly select one chromosome from P3 and apply the mutation operator described in Section 5.3. The resulting population is denoted as P4.
Step (9). Compute the fitness of each chromosome in P4 and obtain its best chromosome and fitness. If this fitness is better than the best fitness found so far, update the best chromosome and its fitness. Increase the iteration counter by one.
Step (10). Go to Step 5.
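
A sketch of the fitness evaluation used in Step 4 is given below; the k-means settings and function names are our assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def fitness(chromosome, D, labels_true, n_classes, seed=0):
    """Fitness of a chromosome (Step 4): cluster the cells using only the selected
    genes and return the NMI against the ground truth labels."""
    genes = [g - 1 for g in chromosome]     # 1-based gene indices -> 0-based columns
    D_sub = D[:, genes]
    pred = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(D_sub)
    return normalized_mutual_info_score(labels_true, pred)
```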

5.5. Time Complexity Analysis of IGRDCGA

Suppose that a scRNA-seq dataset contains N cells and M genes, and the population size in GA is Npop. The time complexity of the IGRDCGA is analyzed as follows.

The time complexity of Step 1 is O(M), that of Step 2 is O(1), and that of Algorithm 1 in Step 3 is O(Npop). In Step 4, the time complexity of the k-means clustering is O(N), and that of obtaining the best chromosome is O(Npop). Steps 5 to 10 form the main iteration loop of the IGRDCGA, and their time complexities determine the overall complexity. Steps 5 and 6 are each executed tmax times (the maximal number of iterations), so their time complexities are both O(tmax). In Step 7, selecting two chromosomes costs O(tmax) over all iterations, and the crossover operator exchanges at most l3 positions (the minimum of the two chromosome coding lengths), so the time complexity of Step 7 is O(l3 tmax). Similarly, the time complexity of Step 8 is O(l6 tmax). In Step 9, computing the fitness costs O(tmax Npop N) and obtaining the best chromosome costs O(tmax Npop), so the time complexity of Step 9 is O(tmax Npop N). Obviously, the time complexity of Step 10 is O(tmax).

To sum up, the time complexity of the IGRDCGA involves the following terms: O(1), O(M), O(Npop), O(tmax), O(l3 tmax), O(l6 tmax), O(tmax Npop), and O(tmax Npop N). Obviously, O(1) < O(M), O(Npop) < O(tmax Npop) < O(tmax Npop N), and O(tmax) < O(tmax Npop) < O(tmax Npop N). Therefore, the decisive terms are O(M), O(l3 tmax), O(l6 tmax), and O(tmax Npop N). As l3 < M and l6 < M, it follows that O(l3 tmax) < O(M tmax), O(l6 tmax) < O(M tmax), and O(M) < O(M tmax). Thus, the time complexity of the IGRDCGA is determined by O(tmax Npop N) and O(M tmax). As Table 1 shows, for most single-cell datasets the number of genes M is far larger than the number of cells N, and the population size Npop is commonly small, so that Npop N < M. Consequently, for most single-cell datasets, the time complexity of the IGRDCGA can be considered to be O(M tmax).

6. Numerical Results

In order to evaluate the performance of the IGRDCGA, two frequently used clustering algorithms, k-means and spectral clustering [57], and a state-of-the-art single-cell classification algorithm, SIMLR [58], are employed for comparison. To assess the gene selection performance of the IGRDCGA, two frequently used dimensionality reduction algorithms, PCA and tSNE, are also included in the comparison. For SIMLR, we use its MATLAB program, which is available at https://github.com/BatzoglouLabSU/SIMLR. For k-means, PCA, and tSNE, the study uses the built-in functions of MATLAB.

6.1. Parameter Values

The related parameters of the IGRDCGA are described as follows.
(i)Parameter for the IGR: the threshold used to retain the top-ranked genes.
(ii)Parameters and terminal condition for the DCGA: the population size, the crossover probability, the mutation probability, and the maximal number of iterations tmax.

The other competing algorithms use their default parameter values.

6.2. Results
6.2.1. Comparisons of the IGR

The IGR of each gene reflects its relevance to the classification. The IGR ranges from 0 to 1, and 0 indicates that the gene has no relevance for the classification. Therefore, the IGR can be employed to eliminate irrelevant genes roughly. We compute the IGR of each gene for all scRNA-seq datasets and report the number of irrelevant genes (#irrelevant genes) and their percentages with respect to the total number of genes. The results are shown in Table 2.

From Table 2, we can see that irrelevant genes exist in most datasets. The percentage of irrelevant genes with respect to the total genes indicates the gene redundancy rate. In Table 2, there are 24 datasets whose gene redundancy rate is larger than 0, which account for 72.72% of the total 33 datasets. This demonstrates that the criterion IGR = 0 is a good way to eliminate irrelevant genes.

However, Table 2 also shows that 9 datasets have a redundancy rate of 0; that is, the criterion IGR = 0 cannot find any irrelevant genes in these 9 datasets. This also demonstrates that IGR = 0 can only identify some of the irrelevant genes and cannot detect all of them. Therefore, the proposed IGRDCGA additionally uses a threshold method to eliminate irrelevant genes.

Comparatively speaking, datasets containing more genes tend to have higher gene redundancy rates and vice versa. Nevertheless, this is not absolute, as can be seen in Table 2.

6.2.2. Comparisons of Evaluation Metrics

We run the IGRDCGA and the competing algorithms on the scRNA-seq datasets described in Section 4.1. Three evaluation metrics, NMI, ARI, and purity, are employed to evaluate the performance of the IGRDCGA and the other five competing algorithms. All algorithms are run independently 20 times to obtain their average values. For all 33 datasets, the average NMI, ARI, and purity values are shown in Tables 3–5, respectively.

From Table 3, we can see that for 22 of the 33 datasets the IGRDCGA achieves the largest NMI among the six algorithms, which accounts for 66.67%. Meanwhile, for 6 of the remaining datasets, the differences between the largest NMI and the NMI obtained by the IGRDCGA are very small. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest NMI are 0, 1, 6, 2, and 6, respectively. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of NMI.

Table 4 shows that for 24 of the 33 datasets the IGRDCGA achieves the maximal ARI among the six algorithms, which accounts for 72.72%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the maximal ARI are 0, 0, 3, 1, and 7, respectively. By contrast, the IGRDCGA is superior to the other five competing algorithms in terms of ARI.

From Table 5, it can be clearly seen that for 21 of the 33 datasets the purity obtained by the IGRDCGA is the largest among the six algorithms, which accounts for 63.63%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest purity are 0, 1, 6, 1, and 6, respectively. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of purity.

By comparing Tables 3–5, we summarize in Table 6 the best evaluation metrics obtained by the six algorithms, where the NMI, ARI, and purity metrics are denoted by 1, 2, and 3, respectively.

From Table 6, we can clearly see that for 15 of the 33 datasets the NMI, ARI, and purity obtained by the IGRDCGA are all the best among the six algorithms, which accounts for 45.45%. For k-means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the best NMI, ARI, and purity are 0, 0, 2, 1, and 6, respectively. Thus, the IGRDCGA outperforms the other five competing algorithms in terms of NMI, ARI, and purity.

At the same time, Table 6 shows that the IGRDCGA obtains the best NMI, ARI, and purity only on a subset of the datasets. For the Allodiploid dataset, the IGRDCGA, SIMLR, and PCA all obtain the best NMI, ARI, and purity. Table 1 shows that Allodiploid has far fewer cells and genes than the other datasets; that is, on datasets of smaller dimension, an algorithm can more easily reach the best NMI, ARI, and purity. In this case, PCA is the most suitable of these three algorithms, as it is the simplest and easiest to implement. For Ting, Chung, and Li2, the IGRDCGA obtains the best NMI and ARI, while SIMLR obtains the best purity. For Camp15, Camp17, Grun, and Muraro, the IGRDCGA obtains the best ARI and purity, while SIMLR obtains the best NMI. Comparing the IGRDCGA with SIMLR, the number of best metrics obtained by the IGRDCGA is far larger than that obtained by SIMLR, which fully demonstrates that the IGRDCGA is superior to SIMLR in terms of NMI, ARI, and purity.

For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron datasets, tSNE achieves the best NMI, ARI, and purity. However, for the other datasets, the three evaluation metrics obtained by tSNE are far worse than those obtained by the IGRDCGA. As shown in Table 3, there are 17 datasets on which the NMI of tSNE is less than 0.1, whereas there is no dataset on which the NMI of the IGRDCGA is less than 0.1. In Table 4, there are 20 datasets on which the ARI of tSNE is less than 0.1, whereas there is no dataset on which the ARI of the IGRDCGA is less than 0.1. In Table 5, there are 19 datasets on which the purity of tSNE is less than 0.5, whereas there are only 4 datasets on which the purity of the IGRDCGA is less than 0.5. That is, for most datasets other than the above six, the NMI, ARI, and purity obtained by the IGRDCGA are superior to those obtained by tSNE. This attests that the IGRDCGA outperforms tSNE on most datasets in terms of NMI, ARI, and purity.

A closer analysis of the above six datasets and the remaining datasets reveals the following principal features and differences. First, the six datasets all have very low gene redundancy rates: Table 2 shows that the redundancy rates of Zeisel and Baron are 9.8 and 0.11, respectively, and those of the other four datasets are all 0. For datasets with low redundancy rates, the IGRDCGA may discard some informative genes, which decreases the classification performance. Second, the six datasets have relatively low NMI and ARI for all six algorithms. From Table 3, all the NMI values of Nestorowa, Wang, and Manno_m are less than 0.38, 0.33, and 0.48, respectively. From Table 4, all the ARI values of Nestorowa, Wang, Manno_m, and Baron are less than 0.26, 0.20, 0.19, and 0.33, respectively. Third, the six datasets are nearly completely sparse. From the provided appendix file, it can be observed that, for all datasets, the data values within [0, 0.1] outnumber those in the other intervals. Nevertheless, the situation is much more pronounced for these six datasets, which become nearly completely sparse if the data values within [0, 0.1] are set to 0. For Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron, the data values within [0, 0.1] account for more than 88%, 95%, 73%, 91%, 80%, and 91%, respectively, which is so high that they nearly cover up the data values in the other ranges.

6.2.3. Iteration Plots and Heat Maps

In order to clearly illustrate the performance of the IGRDCGA, three scRNA-seq datasets, Allodiploid, Yeo, and Grun, are selected as representative datasets according to their different NMI values in Table 3, and their iteration plots are illustrated in Figures 2–4, respectively. For Allodiploid, the NMI obtained by the IGRDCGA is 1, which indicates that the classification result is fully consistent with the ground truth class labels. It can be clearly seen from Figure 2 that the iteration plot of Allodiploid is parallel to the X-axis at the maximal NMI of 1; that is, the IGRDCGA reaches the maximal NMI at the first iteration, and the later iterations bring no further improvement. Among the 33 scRNA-seq datasets, only Allodiploid displays the behavior illustrated in Figure 2.

In Figure 3, the iteration plot of Yeo rises as the number of iterations increases. The ascending trend is very obvious in the early period of the iteration and gradually stops as the iterations proceed; that is, the NMI converges to a fixed value in the later period of the iteration. Among the 33 scRNA-seq datasets, most datasets display a plot similar to that shown in Figure 3.

From Figure 4, we can see that the iteration plot of Grun keeps ascending as the number of iterations increases, and the ascending trend remains obvious throughout the whole iteration period. Among the 33 scRNA-seq datasets, a few datasets display a plot similar to that shown in Figure 4.

In order to illustrate the classification performance of the IGRDCGA more visually, the heat maps of Allodiploid, Sasagawa, and Yan are illustrated in Figures 5–7, respectively.

The Allodiploid dataset contains 16 cells, 2406 genes, and 2 classes, whose ground truth class labels are illustrated in the first column of Figure 5. From Figure 5, it can be seen that the results obtained by the IGRDCGA, SIMLR, and PCA are fully concordant with the ground truth class labels; those obtained by k-means and spectral clustering collapse into nearly one class; and those obtained by tSNE are oscillatory. Obviously, the results obtained by k-means, spectral clustering, and tSNE are much worse than those obtained by the IGRDCGA, SIMLR, and PCA, and the results of tSNE are the worst among the six algorithms.

The Sasagawa dataset contains 23 cells, 32700 genes, and 3 classes, whose ground truth labels are illustrated in the first column of Figure 6. From Figure 6, it can be observed that the results obtained by the IGRDCGA are fully concordant with the ground truth class labels; those obtained by spectral clustering collapse into nearly one class; and those obtained by tSNE are oscillatory and the worst. In the results obtained by k-means, SIMLR, and PCA, subclass 2 (ground truth label 2) is fully concordant with the ground truth, whereas the other subclasses are confused. Therefore, for Sasagawa, the IGRDCGA obtains the best results and tSNE the worst.

The first column of Figure 7 illustrates the ground truth labels of the Yan dataset, which contains 90 cells, 20214 genes, and 6 classes. From Figure 7, we can see that none of the results obtained by the six algorithms are fully concordant with the ground truth class labels. However, the results obtained by the IGRDCGA are closer to the ground truth than those obtained by the other algorithms, and its results for subclasses 2, 3, 4, 5, and 6 are fully concordant with the ground truth. The results obtained by tSNE are oscillatory and the worst among the six algorithms, and those obtained by the remaining algorithms are all confused. Therefore, for Yan, the IGRDCGA obtains the best results whereas tSNE obtains the worst.

7. Conclusion and Future Work

In this study, we present a novel algorithm, abbreviated as IGRDCGA, to address gene selection and classification for scRNA-seq data. It combines the information gain ratio (IGR) with a genetic algorithm with dynamic crossover (DCGA): the IGR is used to eliminate irrelevant genes roughly, and the DCGA is used to select high quality genes finely. We have run the IGRDCGA and several competing algorithms on 33 publicly available scRNA-seq datasets. The results demonstrate that the IGRDCGA can eliminate irrelevant genes and select high quality genes effectively, and that it is superior to the competing algorithms in terms of classification accuracy.

The algorithm can be further enhanced and improved. One direction is to adopt a more efficient coding to speed up its convergence rate and improve its stability. Another is to extend the IGRDCGA to classification problems in other high-dimensional domains.

Data Availability

The datasets supporting this study are publicly available and they can be downloaded from EMBL-EBI (https://www.ebi.ac.uk/) or the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by Science Research Foundation for High-level Talents of Yulin Normal University (no. G2021ZK17).