Gene Selection and Classification of scRNA-seq Data Combining Information Gain Ratio and Genetic Algorithm with Dynamic Crossover

Feng, Junhong; Niu, Xishuan; Zhang, Jie; Wang, Jian-Hong

doi:https://doi.org/10.1155/2022/9639304

Wireless Communications and Mobile Computing

Special Issue

Innovative Artificial Intelligence-Based Internet of Things for Smart Cities and Smart Homes

Research Article | Open Access

Volume 2022 | Article ID 9639304 | https://doi.org/10.1155/2022/9639304

Junhong Feng,¹ Xishuan Niu,¹ Jie Zhang,¹ and Jian-Hong Wang²

Academic Editor: Chao-Yang Lee

Received 02 Dec 2021

Accepted 07 Jan 2022

Published 31 Jan 2022

Abstract

Single-cell RNA sequencing (scRNA-seq) is emerging as a promising technology. A scRNA-seq dataset contains a huge number of genes. Some of them are high quality genes, whereas others are noisy or irrelevant because of unspecific technical reasons. These noisy and irrelevant genes may strongly influence downstream data analyses, such as cell classification, gene function analysis, and cancer biomarker detection. It is therefore important to remove the irrelevant genes and choose high quality genes by gene selection methods. In this study, a novel gene selection and classification method is presented by combining the information gain ratio and the genetic algorithm with dynamic crossover (abbreviated as IGRDCGA). The information gain ratio (IGR) is employed to eliminate irrelevant genes roughly and obtain a preliminary gene subset, and then the genetic algorithm with dynamic crossover (DCGA) is utilized to choose high quality genes finely from the preliminary gene subset. The main difference between the IGRDCGA and the existing methods is that the DCGA and IGR are integrated for the first time and used to select genes from scRNA-seq data. We run the IGRDCGA and several competing methods on real-world scRNA-seq datasets. The obtained results demonstrate that the IGRDCGA can choose high quality genes effectively and efficiently and outperforms the competing methods in terms of both dimensionality reduction and classification accuracy.

1. Introduction

In scRNA-seq data, the number of genes is often large and may reach tens of thousands. Some genes are irrelevant or unsuitable for classification tasks, and they may seriously affect the efficiency of downstream data analysis. If all genes are used in data classification, both the classification accuracy and the classification efficiency may be low. To remove these irrelevant genes and select high quality genes, an effective and efficient gene selection algorithm is vital.

Feature selection (FS) problems can be regarded as large-scale global optimization problems [1]; therefore, bioinspired intelligence optimization algorithms can be used to address them. Wang et al. [1] treated FS problems as large-scale global optimization problems. Nakisa et al. [2] utilized evolutionary computation (EC) to search for the optimal feature subset. Eroglu and Kilic [3] integrated a genetic local search algorithm and a k-nearest neighbor classifier to select a feature subset. Maleki and Zeinali [4] used a hybrid genetic algorithm (GA) to address dimension reduction and applied it to the classification of lung cancer. Tahir et al. [5] presented a binary chaotic GA to select features for healthcare datasets. However, how to correctly use the GA to address the gene selection and classification of scRNA-seq data is a significant issue that must be considered first. To the best of our knowledge, only a few studies have addressed it so far.

The study integrates the IGR and DCGA to address the gene selection and classification of scRNA-seq data and proposes a novel gene selection and classification algorithm IGRDCGA. The IGRDCGA utilizes the IGR to eliminate irrelevant genes roughly and obtain a preliminary gene subset and then employs the DCGA to choose high quality genes finely from the preliminary gene subset.

The rest of this study is organized as follows. Section 2 briefly describes the information gain ratio and the genetic algorithm. Section 3 states three evaluation metrics. The datasets and the preprocessing method used in the study are described in Section 4. The coding and the other details of the IGRDCGA are described in Section 5. The numerical results of the IGRDCGA and several competing algorithms are given in Section 6. The conclusion of the study is drawn in Section 7.

2. Related Work

In this section, the information gain ratio and the genetic algorithm are described.

2.1. Information Gain Ratio

The information gain is a metric derived from information entropy, often used to evaluate the mutual dependence level between two random variables. Namely, it is a symmetrical metric of dependency [6].

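For two discrete random variables X and Y, their information entropy can be calculated, respectively, in terms of the following formulas:

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i), \tag{1}$$

$$H(Y) = -\sum_{j} p(y_j) \log_2 p(y_j), \tag{2}$$

where $p(x_i)$ and $p(y_j)$ represent the marginal probability of $x_i$ and $y_j$, respectively.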

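The conditional entropy and information gain [6–8] of X versus Y can be calculated in terms of the first two following formulas, respectively. The information gain ratio of X versus Y is the ratio of the information gain to the information entropy, which is formulated in the last following formula.

$$H(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log_2 p(x_i \mid y_j), \tag{3}$$

$$IG(X \mid Y) = H(X) - H(X \mid Y), \tag{4}$$

$$IGR(X \mid Y) = \frac{IG(X \mid Y)}{H(Y)}. \tag{5}$$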

The IGR ranges from 0 to 1, where 1 represents that X completely determines Y and 0 represents that X and Y are completely independent.
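The entropy-based quantities of this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: both variables are already discrete (continuous expression values would first have to be discretized, which the text does not describe), the ratio is taken with respect to H(Y), and a zero H(Y) is mapped to an IGR of 0. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y): entropy of x inside each group of equal y, weighted by p(y)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)


def information_gain_ratio(x, y):
    """IG(X|Y) / H(Y). Assumption: dividing by H(Y), and returning 0.0 if H(Y) is 0."""
    h_y = entropy(y)
    if h_y == 0.0:
        return 0.0
    return information_gain(x, y) / h_y


# Two variables that determine each other, and two that are independent.
print(information_gain_ratio([0, 0, 1, 1], [0, 0, 1, 1]))
print(information_gain_ratio([0, 1, 0, 1], [0, 0, 1, 1]))
```

The two calls illustrate the end points of the range: the first pair of variables is fully dependent and the second pair is independent.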

2.2. Genetic Algorithm

The genetic algorithm (GA) is a bioinspired intelligence optimization algorithm. It is inspired by the process of natural selection and belongs to the family of evolutionary algorithms (EAs) [9–13]. It is commonly utilized to generate feasible solutions for optimization problems by performing operators such as selection, crossover, and mutation [14–16]. The selection operator chooses a part of the chromosomes from the previous population for the crossover operator. Frequently used selection operators are random selection strategies, such as the tournament selection strategy. The crossover operator exchanges one or more genes between two parents chosen by the selection operator. It simulates reproduction or recombination in the biological evolution process. The GA determines whether to perform the crossover operator according to the crossover probability. Through the crossover operator, two parents may generate two or more offspring, depending on the crossover strategy. Frequently used crossover operators include single-point crossover, two-point crossover, and multipoint crossover. The mutation operator modifies one or more genes in a certain chromosome. It simulates gene mutation in the biological evolution process. Similarly, the GA determines whether to perform the mutation operator according to the mutation probability. Frequently used mutation operators include locus mutation, exchange mutation, and insertion mutation. The mutation operator acts on only one chromosome, while the crossover operator acts on two chromosomes. Generally, the crossover probability is much larger than the mutation probability, which accords with the biological evolution process.

3. Evaluation Metrics

To evaluate the performance of the IGRDCGA, we utilize the following evaluation metrics: NMI (normalized mutual information) [17–19], ARI (adjusted Rand index) [20, 21], and purity [22].

3.1. NMI (Normalized Mutual Information)

The NMI is a frequently used evaluation metric. It is often used to evaluate how closely the obtained clustering results agree with the ground truth results.

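For two discrete random variables X and Y, their MI (mutual information) can be calculated in terms of the following formula:

$$MI(X, Y) = \sum_{i} \sum_{j} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}, \tag{6}$$

where $p(x_i)$ and $p(y_j)$, respectively, denote the marginal probability of $x_i$ and $y_j$, and $p(x_i, y_j)$ denotes their joint probability.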

The normalized MI is taken as the NMI, which is calculated in terms of formula (7) by normalizing $MI(X, Y)$ with $H(X)$ and $H(Y)$, where $H(X)$ and $H(Y)$ denote the entropy of X and Y, respectively. The NMI ranges from 0 to 1, where 1 is the optimal score and represents that X and Y are fully concordant. A larger NMI signifies higher concordance between X and Y.

Example 1. Suppose that the ground truth class labels of a single-cell dataset are y₁ = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3] and the classification result of an algorithm is y₂ = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3]. Both y₁ and y₂ take the unique values [1 2 3]. The probability values in formula (6) follow from the label counts over the 17 cells: the marginal probabilities of the values 1, 2, and 3 are 6/17, 6/17, and 5/17 in y₁ and 8/17, 5/17, and 4/17 in y₂, and each joint probability is the number of cells carrying the corresponding pair of labels divided by 17. Substituting these probabilities into formula (6) gives the MI, formulas (1) and (2) give the entropies of y₁ and y₂, and formula (7) then gives the NMI.
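The computation in Example 1 can be reproduced with a short Python sketch. The MI follows the definition in terms of joint and marginal probabilities; the normalization is an assumption made here for illustration, namely dividing by the arithmetic mean of the two entropies, because the normalizing term cannot be read off the text. Other common choices (geometric mean, maximum) would give different values. Function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """MI from joint and marginal label frequencies (base 2)."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def nmi(x, y):
    """Assumed normalization: 2 * MI / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0.0:
        return 1.0  # both labelings are constant, hence identical partitions
    return 2.0 * mutual_information(x, y) / (h_x + h_y)


# Labels from Example 1.
y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
print(nmi(y1, y2))
```

Comparing a labeling with itself yields 1 under this normalization, which is a quick sanity check of the implementation.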

3.2. ARI (Adjusted Rand Index)

The ARI is another widely used evaluation metric for measuring the concordance between two clustering results. The RI (Rand Index) measures the similarity between two results. It calculates all pairs of samples, including the pairs in identical or different clusters, and pairs in the predicted and ground truth clusters.

For two clustering results X and Y containing n elements, the RI of them is calculated in terms of the following formula:

$$RI = \frac{a + d}{a + b + c + d} = \frac{a + d}{\binom{n}{2}}, \tag{8}$$

where a is the number of pairs in an identical class in X and an identical cluster in Y; b is the number of pairs in an identical class in X but not in an identical cluster in Y; c is the number of pairs that are not in an identical class in X but are in an identical cluster in Y; d is the number of pairs that are neither in an identical class in X nor in an identical cluster in Y.

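The overlap between X and Y can be summarized in a contingency table, in which each element represents the number of objects in common between X and Y. The ARI is the adjusted RI, which can be calculated in terms of the following formula:

$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2}\right] - \left[\sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2}\right] \Big/ \binom{n}{2}}, \tag{9}$$

where $n_{ij}$ are values from the contingency table, $a_i$ is the sum of the i-th row of the contingency table, $b_j$ is the sum of the j-th column of the contingency table, and $\binom{\cdot}{\cdot}$ denotes a binomial coefficient.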

The ARI is the corrected-for-chance version of the RI. The RI may only yield a value between 0 and 1; the ARI can yield negative values if the index is less than the expected index. The optimal score of the ARI is 1, which represents that two clustering results are identical. A larger ARI signifies higher concordance between X and Y.
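A minimal Python sketch of both indices is given below. It counts pairs through the contingency table rather than enumerating them, assumes at least two elements, and returns 1 for the degenerate case in which the maximum and expected index coincide (a convention chosen here). Function names are hypothetical.

```python
from collections import Counter
from math import comb


def _pair_counts(x, y):
    """Pairs grouped together in both labelings, in x only counts, and in y only counts."""
    same_both = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    same_x = sum(comb(c, 2) for c in Counter(x).values())
    same_y = sum(comb(c, 2) for c in Counter(y).values())
    return same_both, same_x, same_y


def rand_index(x, y):
    """RI = (a + d) / C(n, 2), with a and d derived from the contingency table."""
    total = comb(len(x), 2)
    a, same_x, same_y = _pair_counts(x, y)
    d = total - same_x - same_y + a  # pairs separated in both labelings
    return (a + d) / total


def adjusted_rand_index(x, y):
    """Chance-corrected RI; can be negative when agreement is below expectation."""
    total = comb(len(x), 2)
    index, same_x, same_y = _pair_counts(x, y)
    expected = same_x * same_y / total
    maximum = (same_x + same_y) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (index - expected) / (maximum - expected)


print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The second call shows a case where the RI stays between 0 and 1 while the ARI drops below 0, because the two labelings agree less than would be expected by chance.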

3.3. Purity

The purity [22] is another simple and transparent evaluation metric for evaluating the clustering performance. For purity, each identified cluster is assigned to the label which is most frequent in the cluster, and then the accuracy of this label assignment is computed by counting the number of correctly assigned cells and dividing by the number of cells N. This mapping is not one-to-one and may be biased to the class which has the largest size. Nonetheless, it provides us a simple metric.

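For two clustering results X and Y containing n elements, the purity of them is calculated in terms of the following formula:

$$purity(X, Y) = \frac{1}{n} \sum_{q \in Y} \max_{p \in X} |p \cap q|, \tag{10}$$

where p and q are the elements of X and Y, respectively, that is, p is a ground truth class in X and q is an identified cluster in Y.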

The purity ranges from 0 to 1 while 1 is the optimal score, which denotes that X and Y have identical clustering accuracy. A larger purity signifies higher concordance between X and Y.
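The label assignment described above can be sketched in Python as follows: every identified cluster votes for its most frequent ground truth label, and the votes that match are counted. The function name is hypothetical, and the example reuses the labels of Example 1.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_labels):
    """Fraction of cells whose true label is the majority label of their cluster."""
    members = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(truth)
    # For each cluster, count the cells that carry its most frequent true label.
    correct = sum(max(Counter(labels).values()) for labels in members.values())
    return correct / len(true_labels)


y1 = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]  # ground truth
y2 = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]  # identified clusters
print(purity(y1, y2))  # (5 + 4 + 3) / 17
```

For these labels the three clusters contain 5, 4, and 3 cells of their majority class, so the purity is 12/17, roughly 0.706.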

4. Dataset and Preprocessing

4.1. Dataset

In the study, 33 publicly available scRNA-seq datasets are utilized to test the performance of the IGRDCGA; they are shown in Table 1. The datasets contain various single-cell gene expression data published in many different publications. Each row in the datasets denotes an observation or cell, while each column denotes a feature or gene. Table 1 lists the characteristics of the datasets, such as the GSE accession, name, the number of cells (#cell), the number of genes (#gene), the number of the ground truth classes (#class), and references.

4.2. Data Preprocessing

For a scRNA-seq dataset containing N cells and M genes, if $d_{ij}$ represents the gene expression level of the j-th gene in the i-th cell, then its data matrix can be expressed as $D = (d_{ij})_{N \times M}$.

To lower the impact of large gene expression levels on small gene expression levels, every element $d_{ij}$ of D is normalized according to the following formula:

$$d_{ij} \leftarrow \frac{d_{ij} - d_j^{\min}}{d_j^{\max} - d_j^{\min} + \varepsilon}, \tag{11}$$

where $d_j^{\min}$ and $d_j^{\max}$, respectively, represent the minimum and maximum values of the j-th gene in D. The parameter $\varepsilon$ is a very small value that prevents a denominator of 0, because $d_j^{\min}$ and $d_j^{\max}$ may both be equal to 0 in a scRNA-seq dataset.

It is obvious that all the elements are normalized according to each column (each column denotes a gene). Therefore, after preprocessing, the gene expression level of each gene ranges from 0 to 1, and all the elements in D range from 0 to 1 as well.
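The column-wise normalization can be sketched in plain Python as below. The data layout (a list of rows, one row per cell) and the default value of the small constant are assumptions for illustration; the value used in the study is not stated in the text.

```python
def normalize_columns(data, eps=1e-10):
    """Min-max scale every column (gene) of a cells x genes table into [0, 1].

    eps is a small constant (assumed value) that keeps the denominator nonzero
    when a gene has the same expression level in all cells.
    """
    n_genes = len(data[0])
    col_min = [min(row[j] for row in data) for j in range(n_genes)]
    col_max = [max(row[j] for row in data) for j in range(n_genes)]
    return [
        [(row[j] - col_min[j]) / (col_max[j] - col_min[j] + eps) for j in range(n_genes)]
        for row in data
    ]


# Two cells, three genes; the last gene is zero everywhere.
expression = [[0.0, 5.0, 0.0], [10.0, 15.0, 0.0]]
for row in normalize_columns(expression):
    print(row)
```

A gene that is zero in every cell stays at zero after normalization instead of triggering a division by zero.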

5. The Proposed Algorithm

In this study, we present a novel algorithm to address the gene selection and classification for scRNA-seq data by combining information gain ratio and genetic algorithm with dynamic crossover (IGRDCGA for short). The coding and the other details of the IGRDCGA are as follows.

5.1. Coding and Initialization

To choose high quality genes from a huge number of genes, we need to design a good coding for each chromosome. For a scRNA-seq dataset containing N cells and M genes, we design a coding of variable length. We number the M genes as 1∼M; the coding of a chromosome can be expressed as follows:

$$C = (c_1, c_2, \ldots, c_l), \tag{12}$$

where $c_1, c_2, \ldots, c_l$, respectively, denote the first locus, the second locus, and the l-th locus of the coding, and they are integers in the interval [1, M]; l denotes the length of the coding of the chromosome, which is variable for different chromosomes.

Obviously, the coding of a chromosome signifies a combination of genes, namely, a result of gene selection. Therefore, different chromosomes can signify different gene selection, and coding of variable length can signify gene selection of variable length.

According to the above coding rules of a chromosome, Algorithm 1 presents the initialization algorithm to generate the initial population whose population size is Npop.

(1)	Generate one random integer l in the interval [1, M].
(2)	Generate l random integers in the interval [1, M].
(3)	Loop the above (1) and (2) Npop times to obtain the initial population.
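The three initialization steps above can be sketched in Python as follows. One assumption goes beyond the text: the gene indices inside a chromosome are drawn without repetition, since a repeated gene would not change the selected subset. The function name and the `rng` argument are hypothetical.

```python
import random


def init_population(n_genes, n_pop, rng=random):
    """Create n_pop variable-length chromosomes over genes numbered 1..n_genes."""
    population = []
    for _ in range(n_pop):
        length = rng.randint(1, n_genes)  # step (1): random coding length l
        # step (2): l random gene indices in [1, M]; drawn without repetition (assumption)
        population.append(rng.sample(range(1, n_genes + 1), length))
    return population  # step (3): repeated n_pop times


population = init_population(n_genes=50, n_pop=10, rng=random.Random(0))
print([len(chromosome) for chromosome in population])
```

Passing a seeded `random.Random` instance makes the sketch reproducible, which is convenient when comparing runs.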

5.2. Crossover Operator

As the length of the coding of a chromosome is variable, we need to design a new crossover operator to fit the coding of variable length. Given the coding of two parents C₁ and C₂, their lengths are, respectively, l₁ and l₂, and the minimum value of l₁ and l₂ is marked as l₃. The number of the genes to exchange in crossover operator should be less than or equal to l₃. The study presents a new crossover operator as shown in Algorithm 2.

(1)	Generate one random integer l₄ that is less than or equal to l₃.
(2)	For C₁, generate the location to exchange loc₁, which contains l₄ different random integers that are larger than or equal to 1 and less than or equal to l₁.
(3)	For C₂, generate the location to exchange loc₂, which contains l₄ different random integers that are larger than or equal to 1 and less than or equal to l₂.
(4)	The coding bits at the locations loc₁ in C₁ and the coding bits at the locations loc₂ in C₂ are exchanged with each other.
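A possible Python rendering of the crossover steps above is sketched here. It uses 0-based positions instead of the 1-based locations of the text, assumes that at least one gene is exchanged, and does not repair duplicate genes that an exchange may introduce, since the text does not say how such cases are handled. Names are hypothetical.

```python
import random


def crossover(c1, c2, rng=random):
    """Exchange the same number of randomly located genes between two parents."""
    l3 = min(len(c1), len(c2))
    l4 = rng.randint(1, l3)                 # step (1): number of genes to exchange
    loc1 = rng.sample(range(len(c1)), l4)   # step (2): distinct positions in c1
    loc2 = rng.sample(range(len(c2)), l4)   # step (3): distinct positions in c2
    child1, child2 = list(c1), list(c2)
    for i, j in zip(loc1, loc2):            # step (4): swap the paired positions
        child1[i], child2[j] = c2[j], c1[i]
    return child1, child2


parent1, parent2 = [3, 8, 15, 21, 40], [5, 9, 33]
print(crossover(parent1, parent2, random.Random(1)))
```

Both offspring keep the lengths of their parents, so chromosome lengths in the population change only through initialization in this sketch.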

5.3. Mutation Operator

Similarly, we also need to design a new mutation operator to fit the coding of variable length. Given the coding of one parent C₃, its length is l₅. The number of genes to mutate in the mutation operator should be less than or equal to l₅. The study presents a new mutation operator as shown in Algorithm 3.

(1)	Generate one random integer l₆ that is less than or equal to l₅.
(2)	For C₃, generate the location to mutate loc₃, which contains l₆ different random integers that are larger than or equal to 1 and less than or equal to l₅.
(3)	The coding bits at the locations loc₃ in C₃ are replaced with other integers that are not equal to any of the coding bits of C₃.
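The mutation steps above might look as follows in Python. Two assumptions are made for this sketch: the replacement genes are distinct from each other, and the number of mutated loci is capped by the number of genes not yet present in the chromosome. Positions are 0-based, and all names are hypothetical.

```python
import random


def mutate(c3, n_genes, rng=random):
    """Replace randomly chosen loci with genes that do not occur in the chromosome."""
    l5 = len(c3)
    used = set(c3)
    unused = [g for g in range(1, n_genes + 1) if g not in used]
    l6 = rng.randint(1, l5)        # step (1): number of loci to mutate
    l6 = min(l6, len(unused))      # cap by the available unused genes (assumption)
    loc3 = rng.sample(range(l5), l6)        # step (2): distinct positions to mutate
    replacements = rng.sample(unused, l6)   # step (3): genes absent from c3
    child = list(c3)
    for position, gene in zip(loc3, replacements):
        child[position] = gene
    return child


print(mutate([3, 7, 9], n_genes=20, rng=random.Random(2)))
```

If a chromosome already contains every gene, no replacement is possible and the sketch returns an unchanged copy.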

5.4. Detailed Steps of IGRDCGA

To illustrate the main process of the proposed algorithm, Figure 1 shows the flow chart of the IGRDCGA.

The detailed steps of the IGRDCGA are summarized as follows.
Step (1). Compute the information gain ratio (IGR) of each gene in terms of formula (5). Then, eliminate those genes whose IGR is 0.
Step (2). To eliminate irrelevant genes and choose high quality genes, use the threshold method to keep only the genes with the top-ranked IGR values, as set by a threshold (see Section 6.1).
Step (3). Implement Algorithm 1 described in Section 5.1 to obtain the initial population P₁.
Step (4). Compute the fitness and obtain the best chromosome. (1) For each chromosome p_i in P₁, select the genes determined by p_i to obtain a new dataset; then conduct k means clustering on this dataset and compute the NMI metric as the fitness of p_i. (2) Obtain the best chromosome and its fitness by comparing the fitness values.
Step (5). Update the number of iterations t; then judge whether the maximal number of iterations t_max has been reached. If yes, output the best chromosome and its fitness, and the IGRDCGA terminates; otherwise, turn to the next step.
Step (6). The tournament selection strategy [55, 56] is used to choose chromosomes from P₁ to obtain the new population P₂.
Step (7). Generate a random number r. If r is less than or equal to the crossover probability, randomly select two chromosomes from P₂ to perform the crossover operator described in Section 5.2. The resulting population is marked as P₃.
Step (8). Generate a random number r. If r is less than or equal to the mutation probability, randomly select one chromosome from P₃ to perform the mutation operator described in Section 5.3. The resulting population is marked as P₄.
Step (9). Compute the fitness value of each chromosome in P₄ and obtain the best chromosome and its fitness. If this fitness is better than the best fitness found so far, update the best chromosome and its fitness. The population P₄ then serves as P₁ in the next iteration.
Step (10). Turn to Step 5.
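The iteration part of these steps (Steps 4 to 10) can be outlined as a generic loop in Python. This is a rough sketch and not the authors' implementation: the fitness, crossover, and mutation functions are passed in as callables, the tournament size of 2 is an assumption, and the toy operators and parameter values in the usage lines are placeholders rather than the settings of the study. In the IGRDCGA the fitness would be the NMI of a k means clustering on the genes selected by a chromosome.

```python
import random


def tournament_select(population, scores, rng, k=2):
    """Step 6: fill a new population with the winners of k-way tournaments."""
    selected = []
    for _ in range(len(population)):
        contenders = rng.sample(range(len(population)), k)
        winner = max(contenders, key=lambda i: scores[i])
        selected.append(list(population[winner]))
    return selected


def run_ga(population, fitness, crossover, mutate, p_c, p_m, t_max, seed=0):
    """Outline of Steps 4 to 10; needs at least two chromosomes."""
    rng = random.Random(seed)
    scores = [fitness(c) for c in population]                     # Step 4
    best_i = max(range(len(population)), key=lambda i: scores[i])
    best, best_score = list(population[best_i]), scores[best_i]
    for _ in range(t_max):                                        # Steps 5 and 10
        population = tournament_select(population, scores, rng)   # Step 6
        if rng.random() <= p_c:                                   # Step 7
            i, j = rng.sample(range(len(population)), 2)
            population[i], population[j] = crossover(population[i], population[j], rng)
        if rng.random() <= p_m:                                   # Step 8
            i = rng.randrange(len(population))
            population[i] = mutate(population[i], rng)
        scores = [fitness(c) for c in population]                 # Step 9
        i = max(range(len(population)), key=lambda i: scores[i])
        if scores[i] > best_score:
            best, best_score = list(population[i]), scores[i]
    return best, best_score


# Toy usage with placeholder operators: the fitness rewards chromosomes containing gene 1.
toy_population = [[2, 3], [4, 5, 6], [1, 7], [8]]
toy_fitness = lambda c: 1.0 if 1 in c else 0.0
toy_crossover = lambda a, b, rng: ([b[0]] + a[1:], [a[0]] + b[1:])
toy_mutate = lambda c, rng: [rng.randint(1, 8)] + c[1:]
best, score = run_ga(toy_population, toy_fitness, toy_crossover, toy_mutate,
                     p_c=0.9, p_m=0.1, t_max=5)
print(best, score)
```

Keeping the best chromosome outside the population means that the reported result never gets worse from one iteration to the next, even if selection, crossover, or mutation happens to lose it.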

5.5. Time Complexity Analysis of IGRDCGA

Suppose that a scRNA-seq dataset contains N cells and M genes, and the population size in GA is Npop. The time complexity of the IGRDCGA is analyzed as follows.

The time complexity of Step 1 is O(M), the time complexity of Step 2 is O(1), and the time complexity of Algorithm 1 in Step 3 is O(Npop). In Step 4, the time complexity of the k means clustering is O(N), and that of obtaining the best chromosome is O(Npop). Steps 5 to 10 form the main iteration process of the IGRDCGA, so their time complexities determine the time complexity of the IGRDCGA. Steps 5 and 6 are each executed t_max times (t_max is the maximal number of iterations), and their time complexities are both O(t_max). In Step 7, the time complexity of selecting two chromosomes is O(t_max), while the crossover operator exchanges at most l₃ genes (l₃ is the minimum of the lengths of the two chromosome codings), so its time complexity is O(l₃ t_max). Therefore, the time complexity of Step 7 is O(l₃ t_max). Similarly, the time complexity of Step 8 is O(l₆ t_max). In Step 9, the time complexity of computing the fitness is O(t_max Npop N), and the time complexity of obtaining the best chromosome is O(t_max Npop). Namely, the time complexity of Step 9 is O(t_max Npop N). Obviously, the time complexity of Step 10 is O(t_max).

To sum up, the time complexity of the IGRDCGA involves the following terms: O(1), O(M), O(Npop), O(t_max), O(l₃ t_max), O(l₆ t_max), O(t_max Npop), and O(t_max Npop N). Obviously, O(1) < O(M), O(Npop) < O(t_max Npop) < O(t_max Npop N), and O(t_max) < O(t_max Npop) < O(t_max Npop N). Therefore, the decisive time complexities of the IGRDCGA are O(M), O(l₃ t_max), O(l₆ t_max), and O(t_max Npop N). As l₃ < M and l₆ < M, it follows that O(l₃ t_max) < O(M t_max), O(l₆ t_max) < O(M t_max), and O(M) < O(M t_max). Thus, the time complexity of the IGRDCGA is determined by O(t_max Npop N) and O(M t_max). Table 1 shows that, for most single-cell datasets, the number of genes M is far larger than the number of cells N. Consequently, for most single-cell datasets, the time complexity of the IGRDCGA can be considered as O(M t_max).

6. Numerical Results

To evaluate the performance of the IGRDCGA, it is compared with two frequently used clustering algorithms, k means and spectral clustering [57], and with a state-of-the-art single-cell classification algorithm, SIMLR [58]. To evaluate its gene selection performance, the IGRDCGA is also compared with two frequently used dimensionality reduction algorithms, the PCA and tSNE. For the SIMLR, we use its MATLAB program, which is available at https://github.com/BatzoglouLabSU/SIMLR. For the k means, PCA, and tSNE, the study utilizes the built-in functions in MATLAB.

6.1. Parameter Values

The related parameters of the IGRDCGA are as follows.
(i) Parameter for IGR: the threshold.
(ii) Parameters and terminal condition for the IGRDCGA: the population size Npop, the crossover probability, the mutation probability, and the maximal number of iterations t_max.

The other competing algorithms use their default parameter values.

6.2. Results

6.2.1. Comparisons of the IGR

The IGR of each gene shows its relevance. The IGR ranges from 0 to 1 while 0 represents that the gene has no relevance for the classification. Therefore, this can be employed to obviate irrelevant genes roughly. We compute the IGR of each gene for all scRNA-seq datasets and obtain the number of the irrelevant genes (#irrelevant genes) and their percentages in respect of the total genes. The obtained results are shown in Table 2.

From Table 2, we can see that irrelevant genes exist in most datasets. The percentage of irrelevant genes with respect to the total genes indicates the gene redundancy rate. There are 24 datasets in Table 2 whose gene redundancy rate is larger than 0, which account for 72.73% of the 33 datasets. This demonstrates that the criterion IGR = 0 is a good way to remove irrelevant genes.

However, Table 2 also shows that there are 9 datasets whose redundancy rate is 0. Namely, no irrelevant genes can be found in these 9 datasets by means of IGR = 0. This also demonstrates that IGR = 0 can only identify completely irrelevant genes and cannot identify genes of low relevance. Therefore, our proposed algorithm IGRDCGA utilizes a threshold method to eliminate irrelevant genes.

Comparatively speaking, the datasets containing more genes possess higher redundancy rates of genes and vice versa. Nevertheless, this is not absolute, as can be shown in Table 2.

6.2.2. Comparisons of Evaluation Metrics

We perform the IGRDCGA and several competing algorithms on the scRNA-seq datasets described in Section 4.1. Three evaluation metrics, NMI, ARI, and purity, are employed to evaluate the performances of the IGRDCGA and the five competing algorithms. All the algorithms are independently run 20 times to obtain their average values. For all 33 datasets, the average values of the NMI, ARI, and purity metrics are shown in Tables 3–5, respectively.

From Table 3, we can see that, for 22 of 33 datasets, the IGRDCGA gains the largest NMI among the six algorithms, which accounts for 66.67% of the datasets. Meanwhile, for 6 of the remaining 11 datasets, the difference between the largest NMI and the NMI obtained by the IGRDCGA is very small. For the k means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest NMI are, respectively, 0, 1, 6, 2, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of NMI.

Table 4 shows that, for 24 of 33 datasets, the IGRDCGA acquires the maximal ARI among the six algorithms, which accounts for 72.73%. For the k means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the maximal ARI are, respectively, 0, 0, 3, 1, and 7. By contrast, the IGRDCGA is superior to the other five competing algorithms in terms of ARI.

From Table 5, it can be clearly seen that, for 21 of 33 datasets, the purity metrics obtained by the IGRDCGA are the largest among the six algorithms, which accounts for 63.64%. For the k means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the largest purity metrics are, respectively, 0, 1, 6, 1, and 6. By comparison, the IGRDCGA outperforms the other five competing algorithms in terms of the purity metric.

By comparing Tables 3–5, we summarize the best evaluation metrics obtained by six algorithms in Table 6, where the NMI, ARI, and purity metrics are, respectively, denoted by 1, 2, and 3.

From Table 6, we can clearly see that, for 15 of 33 datasets, the NMI, ARI, and purity metrics obtained by the IGRDCGA are all the best among the six algorithms, which accounts for 45.45%. For the k means, spectral clustering, SIMLR, PCA, and tSNE, the numbers of datasets on which they obtain the best NMI, ARI, and purity metrics together are, respectively, 0, 0, 2, 1, and 6. Thus, the IGRDCGA outperforms the other five competing algorithms in terms of NMI, ARI, and purity metrics.

Meanwhile, Table 6 shows that the IGRDCGA gains the best NMI, ARI, and purity metrics on only part of the datasets. For the Allodiploid, the IGRDCGA, SIMLR, and PCA all obtain the best NMI, ARI, and purity metrics. We can observe from Table 1 that the Allodiploid has far fewer cells and genes than the other datasets. Namely, for datasets with smaller dimensions, it is easy for an algorithm to obtain the best NMI, ARI, and purity metrics. In this case, the PCA is the most suitable method, as it is the simplest and easiest to implement of these three algorithms. For the Ting, Chung, and Li2, the IGRDCGA obtains the best NMI and ARI, while the SIMLR obtains the best purity metrics. For the Camp15, Camp17, Grun, and Muraro, the IGRDCGA obtains the best ARI and purity metrics, while the SIMLR obtains the best NMI. By comparing the IGRDCGA with the SIMLR, we can clearly see that the number of best metrics obtained by the IGRDCGA is far larger than that obtained by the SIMLR. This demonstrates that the IGRDCGA is superior to the SIMLR in terms of NMI, ARI, and purity metrics.

For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron datasets, the tSNE gains the best NMI, ARI, and purity metrics. However, for the other datasets, the three evaluation metrics obtained by the tSNE are far worse than those obtained by the IGRDCGA. As shown in Table 3, there are 17 datasets on which the NMI of the tSNE is less than 0.1, while there is no dataset on which the NMI of the IGRDCGA is less than 0.1. In Table 4, there are 20 datasets on which the ARI of the tSNE is less than 0.1, while there is no dataset on which the ARI of the IGRDCGA is less than 0.1. In Table 5, there are 19 datasets on which the purity metric of the tSNE is less than 0.5, while there are only 4 datasets on which the purity metric of the IGRDCGA is less than 0.5. Namely, for most datasets except the above six, the NMI, ARI, and purity metrics obtained by the IGRDCGA are superior to those obtained by the tSNE. This attests that the IGRDCGA outperforms the tSNE for most datasets in terms of NMI, ARI, and purity metrics.

A detailed analysis of the above six datasets and the other datasets reveals the following principal features and differences. To begin with, the six datasets all possess very low redundancy rates of genes. Table 2 shows that the redundancy rates of the Zeisel and Baron are, respectively, 9.8 and 0.11, and those of the other four datasets are all 0. For datasets with low redundancy rates, the IGRDCGA may discard some useful genes, which decreases the classification performance. In addition, the six datasets have relatively low NMI and ARI for all six algorithms. From Table 3, we can see that all the NMI metrics of the Nestorowa, Wang, and Manno_m are, respectively, less than 0.38, 0.33, and 0.48. From Table 4, all the ARI metrics of the Nestorowa, Wang, Manno_m, and Baron are, respectively, less than 0.26, 0.20, 0.19, and 0.33. Thirdly, the six datasets are almost completely sparse. From our provided appendix file, it can be observed that, in all datasets, the data values within [0, 0.1] are more numerous than those in the other data intervals. Nevertheless, this is far more pronounced in the six datasets, which become almost completely sparse if the data values within [0, 0.1] are set to 0. For the Nestorowa, Wang, Patel, Manno_m, Zeisel, and Baron, the data values within [0, 0.1], respectively, account for more than 88%, 95%, 73%, 91%, 80%, and 91%, which is so high that they nearly drown out the data values within the other ranges.

6.2.3. Iteration Plots and Heat Maps

In order to clearly illustrate the performance of the IGRDCGA, three scRNA-seq datasets, Allodiploid, Yeo, and Grun, are selected as representative datasets according to the different values of the NMI metrics in Table 3. Their iteration plots are, respectively, illustrated in Figures 2–4. For the Allodiploid, the NMI metric obtained by the IGRDCGA is 1, which represents that the obtained results are fully consistent with the ground truth class labels. It can be clearly seen from Figure 2 that the iteration plot of the Allodiploid is parallel to the X-axis and its value is the maximal NMI of 1. Namely, the IGRDCGA obtains the maximal NMI at the first iteration, so the later iterations are unnecessary. Among the above 33 scRNA-seq datasets, only the Allodiploid displays the plot illustrated in Figure 2.

In Figure 3, the iteration plot of Yeo is ascending as the number of iterations increases. The ascending trend is very obvious at the early period of the iteration, and the ascending trend stops as the number of iterations increases. Namely, the NMI metric turns into a fixed number at the later period of the iteration. In the above 33 scRNA-seq datasets, most datasets display a similar plot shown in Figure 3.

From Figure 4, we can obviously see that the iteration plot of Grun is always ascending as the number of iterations increases. The ascending trend is always very obvious at the whole period of the iteration. In the above 33 scRNA-seq datasets, a few datasets display a similar plot shown in Figure 4.

In order to illustrate the classification performance of the IGRDCGA more visually, the heat maps of the Allodiploid, Sasagawa, and Yan are illustrated in Figures 5–7, respectively.

The Allodiploid contains 16 cells, 2406 genes, and 2 classes, whose ground truth class labels are illustrated in the first column of Figure 5. From Figure 5, it can be obviously seen that the results obtained by the IGRDCGA, SIMLR, and PCA are fully concordant with the ground truth class labels; those obtained by the k means and spectral clustering are nearly one class; and those obtained by the tSNE are oscillatory. Obviously, the results obtained by the k means, spectral clustering, and the tSNE are a great deal worse than those obtained by the IGRDCGA, SIMLR, and PCA; the results of the tSNE are the worst in the six algorithms.

The dataset Sasagawa contains 23 cells, 32700 genes, and 3 classes, whose ground truth labels are illustrated in the first column of Figure 6. From Figure 6, it can be obviously observed that the results obtained by the IGRDCGA are fully concordant with the ground truth class labels; those obtained by spectral clustering are nearly one class; those obtained by the tSNE are oscillatory and the worst. In the results obtained by the k means, SIMLR, and PCA, the results of the subclass 2 (the ground truth class labels are 2) are fully concordant with the ground truth class labels while the other subclasses are confused. Therefore, for the Sasagawa, the IGRDCGA gains the best results while the tSNE gains the worst results.

The first column of Figure 7 illustrates the ground truth labels of the Yan, which contains 90 cells, 20214 genes, and 6 classes. From Figure 7, we can clearly see that none of the results obtained by six algorithms are concordant with the ground truth class labels. However, the results obtained by the IGRDCGA are closer to the ground truth class labels than those obtained by the other algorithms, and the results of the subclasses 2, 3, 4, 5, and 6 are fully concordant with the ground truth class labels. The results obtained by the tSNE are oscillatory and the worst in the six algorithms. The results obtained by the other algorithms are all confused. Therefore, for the Yan, the IGRDCGA obtains the best results whereas the tSNE obtains the worst results.

7. Conclusion and Future Work

In this study, we present a novel algorithm for gene selection and classification of scRNA-seq data. It combines the information gain ratio (IGR) with a genetic algorithm with dynamic crossover (DCGA) and is abbreviated as IGRDCGA. The algorithm first uses the information gain ratio to coarsely eliminate irrelevant genes and then uses the DCGA to finely select high quality genes from the remaining subset. We have run the IGRDCGA and several competing algorithms on 33 publicly available scRNA-seq datasets. The results demonstrate that the IGRDCGA eliminates irrelevant genes and selects high quality genes effectively, and that it is superior to the competing algorithms in terms of classification accuracy.
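As a reminder of what the coarse first stage computes, the sketch below scores a single gene by its information gain ratio with respect to the class labels. It uses the textbook definition (information gain divided by the entropy of the feature) and assumes that expression values have already been discretized into levels such as "low" and "high". The discretization, the function names, and the toy data are illustrative assumptions and are not taken from the paper's implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain ratio of a discretized feature with respect to labels.

    IGR = (H(labels) - H(labels | feature)) / H(feature); a constant
    feature has zero split information and is scored 0.
    """
    n = len(labels)
    # Group the class labels by feature level.
    groups = {}
    for level, label in zip(feature, labels):
        groups.setdefault(level, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (assumed): four cells in two classes and two discretized genes.
labels = [0, 0, 1, 1]
gene_a = ["low", "low", "high", "high"]   # separates the classes perfectly
gene_b = ["low", "high", "low", "high"]   # carries no class information

print(gain_ratio(gene_a, labels))
print(gain_ratio(gene_b, labels))
```

In a filter of this kind, genes whose score falls below some cutoff would be dropped before the genetic algorithm runs; how that cutoff is chosen is not restated in this section.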

The algorithm is still being improved. One direction is to adopt a more efficient coding scheme to improve its convergence speed and stability. Another is to extend the IGRDCGA to classification tasks in other high-dimensional problems.

Data Availability

The datasets supporting this study are publicly available and they can be downloaded from EMBL-EBI (https://www.ebi.ac.uk/) or the NCBI Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/).

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Science Research Foundation for High-level Talents of Yulin Normal University (no. G2021ZK17).

Copyright

Copyright © 2022 Junhong Feng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
