Abstract

Computational analysis of microarray data has provided an effective way to identify disease-related genes. Traditional methods for selecting disease genes from microarray data, such as statistical tests, focus on genes that are differentially expressed between sample classes and prioritize each gene individually. These methods may miss differentially coexpressed (DEC) gene subsets because they ignore the interactions between genes. In this paper, the MIClique algorithm is proposed to identify DEC gene subsets based on mutual information and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples. Clique analysis is commonly used in biological networks, where a clique generally represents a biological module of similar function. When MIClique is applied to real gene expression data, DEC gene subsets that are correlated under one experimental condition but uncorrelated under another are detected in the graphs of the colon and leukemia datasets.

1. Introduction

Microarray data can provide much useful information for disease gene identification and medical diagnosis because microarrays measure the expression levels of thousands of genes simultaneously [1]. Among this huge number of genes, only a small fraction show strong correlation with a given phenotype. Many statistical and supervised methods, such as the t-test and neural networks, are used to mine genes that are differentially expressed under different conditions [2, 3]. However, these gene selection techniques are usually based on individual gene prioritization, measuring the correlation of each gene with particular disease types. An individual gene prioritization list does not indicate interaction relationships among genes, so these traditional techniques may overlook differentially coexpressed (DEC) gene subsets, which are defined as highly correlated under one experimental condition but uncorrelated under another [4]. Disease-related differentially coexpressed genes are those which exhibit similar expression patterns in normal samples but share no similarity in disease samples. Figure 1 depicts simulated differentially coexpressed disease genes in normal samples (samples 1–20) and disease samples (samples 21–40). The coexpression pattern present in normal samples disappears in disease samples.

Identification of disease-specific DEC gene subsets is very helpful for disease diagnosis and clinical treatment. DEC genes should be analyzed as gene subsets rather than as individual genes. Clustering algorithms are often used to find gene groups which display similar expression profiles [5, 6]. However, DEC genes show highly correlated expression patterns only in one biological state, not across the entire dataset. Biclustering identifies gene subsets exhibiting consistent patterns over a subset of experimental conditions, but it is still not well suited to identifying DEC gene groups because the selected experimental conditions may not belong to the same biological state [7, 8].

Kostka and Spang proposed the first method to investigate DEC gene subsets, using an additive model and a stochastic search algorithm [9]. AlteredExpression is an improved algorithm based on the additive model that detects optimal DEC gene subsets with the best RRV (ratio of residual variance between two different samples) and minimal F-score [10]. Varadan and Anastassiou proposed an approach called Entropy Minimization and Boolean Parsimony (EMBP) to identify gene subsets whose joint expression state predicts the presence or absence of a particular disease with minimum uncertainty [4]. coXpress was developed to identify groups of genes that are differentially coexpressed in different biological states, using a resampling method to calculate a p-value for each clustered group [11]. These methods consider all possible gene subsets by searching the whole dataset, which becomes a huge computational burden as the number of genes increases.

In this paper, the MIClique algorithm is proposed to explore DEC gene subsets in an intuitive way based on mutual information (MI) and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples, and the resulting symmetric mutual information matrices are then binarized using two threshold values. The adjacency matrix of a graph is obtained from the binarized matrices by a logical operation, with vertices corresponding to genes and edges corresponding to relationships between genes. Gene cliques detected by MIClique represent DEC gene subsets, which are highly correlated under one experimental condition but uncorrelated under another condition.

2. Materials and Methods

2.1. Mutual Information (MI)

The interaction relationships among genes are very complex and include both linear and nonlinear dependencies. Compared with linear similarity measures such as Euclidean distance and Pearson correlation [12, 13], mutual information is a general measure of statistical dependence between variables that is capable of detecting any type of functional relationship, and it is widely used in gene expression analysis [14]. To apply MI to gene expression data, the continuous experimental data need to be partitioned into discrete intervals or bins [15]. Entropy and MI are two central concepts of Shannon's theory of information [16]. Table 1 describes the related concepts of MI.

The physical meaning of the mutual information $I(X;Y)$ is the reduction of the uncertainty of $X$ due to knowledge of $Y$ (or vice versa). Note that $I(X;X) = H(X)$, where $H(X)$ is the entropy of $X$, and so entropy is the self-information. The nonnegative $I(X;Y)$ equals zero if and only if $X$ and $Y$ are statistically independent, meaning that the variables $X$ and $Y$ do not follow any kind of dependence.
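The quantities above can be illustrated with a minimal Python sketch. It assumes equal-width binning, base-2 logarithms, and the plug-in estimate $I(X;Y) = H(X) + H(Y) - H(X,Y)$; the function names, the number of bins, and the log base are illustrative choices and are not specified in the text.

```python
import math
from collections import Counter

def discretize(values, n_bins):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant profile: everything in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y, n_bins=4):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from binned data."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))

# Hypothetical expression profile of one gene across eight samples.
x = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.75]
print(mutual_information(x, x), entropy(discretize(x, 4)))  # self-information equals entropy
```

The estimate depends on the binning, so MI values are comparable only when computed with the same discretization.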

2.2. Clique Enumeration of Graph Theory

Graph theoretical concepts are useful for the description and analysis of relationships in biological systems. Clique analysis is a core graph technique in many biological applications, such as gene expression network analysis, cis-regulatory motif finding, and matching of three-dimensional molecular structures [17]. A clique generally represents a biological module whose members share similar functions and biological annotations.

For a simple undirected graph $G = (V, E)$ with vertex set $V$ and edge set $E$, two vertices are called adjacent if they are joined by an edge. The degree of a vertex is the number of edges incident to it; thus the degree of an isolated vertex is zero. The weight of an edge is a value assigned to the connection between its two vertices, which might represent cost, length, correlation, and so forth. A complete graph is a graph in which every pair of nodes is joined by an edge. A clique is a complete subgraph, that is, all pairs of vertices in the clique are connected. A maximal clique is a clique not contained in any other complete subgraph. The adjacency matrix $A$ of an undirected graph is a symmetric matrix in which the entry $a_{ij} = 1$ if node $i$ and node $j$ are connected by an edge and $a_{ij} = 0$ otherwise. If the graph is a clique, then $A$ is a matrix with 1 off the diagonal and 0 on the diagonal. If the graph contains a clique, the adjacency matrix of that clique is a submatrix of $A$. Identification of all maximal cliques in a graph is the problem of clique enumeration [18]. Bioconductor, the open project for the analysis and comprehension of genomic data, provides a large collection of software for working with graphs and cliques [19]. Some social network analysis tools are also efficient in clique analysis [20].
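As an illustration of clique enumeration on an adjacency matrix, the following sketch uses the standard Bron–Kerbosch algorithm with pivoting. This is one common technique and is not necessarily the routine used by the tools cited above; the function name and the toy graph are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj: symmetric 0/1 adjacency matrix (list of lists) with a zero diagonal.
    Returns a sorted list of cliques, each a sorted list of vertex indices.
    Isolated vertices are reported as cliques of size one.
    """
    n = len(adj)
    neighbors = [{j for j in range(n) if adj[i][j]} for i in range(n)]
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(neighbors[v] & p))
        for v in list(p - neighbors[pivot]):
            expand(r | {v}, p & neighbors[v], x & neighbors[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(range(n)), set())
    return sorted(cliques)

# Toy graph: a triangle on vertices 0, 1, 2 plus the extra edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(maximal_cliques(adj))
```

A minimum clique size can be imposed afterwards by filtering the returned list on `len(clique)`.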

However, for imperfect systems or experimental data, the requirement of complete connectivity for maximal cliques is stringent, so more general notions of cohesive subgroups should be considered, including n-cliques, k-plexes, and k-cores [21]. For an undirected and unweighted graph, a commonly used measure of network cohesion is density, which is the ratio of the number of edges actually present in the graph to the maximum possible number of edges. A large density indicates high interconnectedness and cohesion in the network. The density of a clique is 1.
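The density measure can be written as a short function. This is a sketch under the definition given above (edges present divided by the maximum possible number of edges, $n(n-1)/2$ for $n$ vertices); the function name is illustrative. The second call uses the vertex and edge counts reported for the colon-data subgroup in Section 3.1.

```python
def graph_density(n_vertices, n_edges):
    """Density of a simple undirected graph: edges present / edges possible."""
    max_edges = n_vertices * (n_vertices - 1) / 2
    if max_edges == 0:          # graphs with fewer than two vertices have no possible edges
        return 0.0
    return n_edges / max_edges

print(graph_density(4, 6))              # a clique on four vertices (all 6 edges present)
print(round(graph_density(8, 19), 2))   # 8 vertices and 19 edges, as in Section 3.1
```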

2.3. The Main Process of MIClique

For each set of microarray data $X$ involving $n$ genes from $m$ samples, $x_{ij}$ is the expression value of the $i$th gene in the $j$th sample. The sample set $S$ is divided into two subsets, $S_1$ (normal samples) and $S_2$ (disease samples), so $X$ is also divided into $X_1$ and $X_2$. Differentially coexpressed disease genes are those with high mutual information values in normal samples but low MI values in disease samples.

The detailed process of MIClique is as follows.

Step 1. Calculate the mutual information of each pair of genes in $X_1$ (normal samples) and $X_2$ (disease samples); two square symmetric mutual information matrices $\mathrm{MI}_1$ and $\mathrm{MI}_2$ are obtained. A large value of $\mathrm{MI}_1(i,j)$ means that gene $i$ and gene $j$ are strongly coexpressed in normal samples, while a low value represents weak coexpression.

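Step 2. Binarize the mutual information matrices by selecting two threshold values $T_1$ and $T_2$ ($T_1 > T_2$) for $\mathrm{MI}_1$ (normal samples) and $\mathrm{MI}_2$ (disease samples), respectively:
(i) If $\mathrm{MI}_1(i,j) > T_1$, then $B_1(i,j) = 1$, else $B_1(i,j) = 0$.
(ii) If $\mathrm{MI}_2(i,j) < T_2$, then $B_2(i,j) = 1$, else $B_2(i,j) = 0$.
(iii) $B = B_1 \wedge B_2$.
(iv) If $i = j$, then $B(i,j) = 0$.
The matrices $B_1$ and $B_2$ are the binarized mutual information matrices for $\mathrm{MI}_1$ and $\mathrm{MI}_2$. $B$ is a logical symmetric matrix obtained by the “AND” operation on $B_1$ and $B_2$. If $B(i,j)$ is 1, gene $i$ and gene $j$ are coexpressed in normal samples but this coexpression is altered in disease samples.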

Step 3. The binarized matrix $B$ from Step 2 can be taken as the adjacency matrix of a graph $G$ with vertices corresponding to genes and edges corresponding to biological interactions. There is an edge between vertices $i$ and $j$ in $G$ if $B(i,j) = 1$. The DEC disease genes, which present a similar expression pattern in normal samples but suffer a distinct alteration in disease samples, are represented as a completely connected subgraph. The problem of identifying DEC disease gene subsets is thus converted into clique detection on the adjacency matrix.
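Steps 2 and 3 can be sketched as follows, taking two precomputed MI matrices as input. The function and variable names, the strictness of the inequalities, and the toy MI values are assumptions made for illustration; only the direction of the comparisons (high MI in normal samples, low MI in disease samples) follows from the text.

```python
def miclique_adjacency(mi_normal, mi_disease, t1, t2):
    """Build the 0/1 adjacency matrix from two symmetric MI matrices.

    An edge (i, j) is kept when the pair is strongly coexpressed in normal
    samples (MI above t1) and weakly coexpressed in disease samples (MI below t2).
    """
    n = len(mi_normal)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                           # zero diagonal, no self-loops
            b1 = mi_normal[i][j] > t1              # binarized normal-sample MI
            b2 = mi_disease[i][j] < t2             # binarized disease-sample MI
            adj[i][j] = int(b1 and b2)             # logical AND of the two
    return adj

# Hypothetical MI matrices for four genes (values invented for illustration).
mi_normal = [
    [3.0, 2.5, 2.4, 0.3],
    [2.5, 3.0, 2.3, 0.4],
    [2.4, 2.3, 3.0, 0.2],
    [0.3, 0.4, 0.2, 3.0],
]
mi_disease = [
    [3.0, 0.5, 0.6, 0.4],
    [0.5, 3.0, 1.8, 0.3],
    [0.6, 1.8, 3.0, 0.5],
    [0.4, 0.3, 0.5, 3.0],
]
adjacency = miclique_adjacency(mi_normal, mi_disease, t1=2.2, t2=1.0)
for row in adjacency:
    print(row)
```

In this toy example genes 1 and 2 remain coexpressed in the disease samples (MI of 1.8), so no edge joins them even though their normal-sample MI exceeds the threshold.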

2.4. Threshold Selection

The choice of the threshold values $T_1$ (applied to the normal-sample MI matrix) and $T_2$ (applied to the disease-sample MI matrix) is very important for the biological interpretation of the results, since different threshold values lead to different results. If $T_1$ is high and $T_2$ is low, the graph has few edges and many isolated vertices. As $T_1$ decreases and $T_2$ increases, more edges are added to the graph, until it is completely connected. A graph with a large number of isolated vertices will generally fail to form cliques, whereas too many edges produce many overlapping cliques, which are also not very informative for data analysis. Proper thresholds lead to a reasonable percentage of isolated vertices and to reasonable experimental results. The threshold values depend on the data source, the data type, and so forth, and they can be selected using the graph density and the percentage of isolated vertices. Figure 2 gives the gene networks obtained by the MIClique algorithm for normalized simulated gene data. The percentage of isolated vertices decreases and the number of edges increases as $T_1$ decreases and $T_2$ increases.
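The two selection criteria named above, graph density and percentage of isolated vertices, can be tabulated over candidate threshold pairs. The sketch below is illustrative only: the function name, the strict inequalities, the toy MI matrices, and the threshold grid are assumptions rather than values from the text.

```python
def threshold_stats(mi_normal, mi_disease, t1, t2):
    """Return (density, fraction of isolated vertices) for thresholds t1, t2."""
    n = len(mi_normal)
    degree = [0] * n
    n_edges = 0
    for i in range(n):
        for j in range(i + 1, n):                 # each unordered pair once
            if mi_normal[i][j] > t1 and mi_disease[i][j] < t2:
                n_edges += 1
                degree[i] += 1
                degree[j] += 1
    max_edges = n * (n - 1) / 2
    density = n_edges / max_edges if max_edges else 0.0
    isolated = sum(d == 0 for d in degree) / n if n else 0.0
    return density, isolated

# Hypothetical MI matrices for four genes (values invented for illustration).
mi_normal = [
    [3.0, 2.5, 2.4, 0.3],
    [2.5, 3.0, 2.3, 0.4],
    [2.4, 2.3, 3.0, 0.2],
    [0.3, 0.4, 0.2, 3.0],
]
mi_disease = [
    [3.0, 0.5, 0.6, 0.4],
    [0.5, 3.0, 1.8, 0.3],
    [0.6, 1.8, 3.0, 0.5],
    [0.4, 0.3, 0.5, 3.0],
]
# Lowering t1 and raising t2 adds edges and removes isolated vertices.
for t1, t2 in [(2.45, 1.0), (2.2, 1.0), (0.1, 5.0)]:
    density, isolated = threshold_stats(mi_normal, mi_disease, t1, t2)
    print(f"T1={t1} T2={t2} density={density:.2f} isolated={isolated:.0%}")
```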

3. Results and Discussion

Two real gene expression datasets, the colon dataset and the leukemia dataset, are selected to illustrate the application of the proposed MIClique algorithm [22, 23]. The colon dataset contains the expression levels of the 2000 genes with the highest minimal intensity, selected from 6500 genes, across 62 samples (40 tumor samples and 22 normal samples). The dataset was normalized before further data analysis. The leukemia dataset contains gene expression profiles of two types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), measured using Affymetrix high-density oligonucleotide arrays. The dataset contains 7129 human genes, 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL), and 25 cases of AML. Only 3374 genes remained after data preprocessing.

3.1. Results of Colon Dataset

Different threshold values are tested for the colon dataset. Figure 3 gives the percentage of isolated vertices and the density of the graph (the number of edges present in the graph divided by the maximum possible number of edges, which is $n(n-1)/2$ for a graph with $n$ vertices). The final thresholds for the colon data are selected as 2.2 (normal samples) and 1.0 (disease samples). The data are then transformed into a gene network by the MIClique algorithm.

The maximal cliques are detected from this gene network, with a minimum clique size of four. An overlapped clique group with six cliques and eight genes is found. Table 2 lists the gene accession numbers in each clique and Figure 4 displays the overlapped clique group graphically. These tightly overlapped cliques form a cohesive subgroup. There are eight vertices and 19 edges in the cohesive subgroup, giving a density of 0.68 (the maximum possible number of edges is $8 \times 7 / 2 = 28$).

Figure 5 shows the MI values of the eight genes, where each plot is a representation of the MI matrix in either the normal samples or the disease samples. Each MI value in the matrix is represented as a square whose color indicates the magnitude of the value. The color scale runs from black to white, with black representing the smallest MI value and white the largest. The MI values range from 2.072 to 2.477 in normal samples and from 0.508 to 1.095 in disease samples. This view shows all the MI values in an intuitive way. These eight genes form a differentially coexpressed gene subset, which is a disease-related gene module identified by the MIClique algorithm. Table 3 lists the GenBank accession numbers, the gene symbols, the accession numbers in UniProtKB (UniProt Knowledgebase), and the gene descriptions given by the colon data. UniProtKB is the central hub for the collection of information on proteins, such as amino acid sequence, protein name or description, taxonomic data, and biological ontology [24]. Figure 6 depicts the gene expression profiles of the eight genes in normal and disease samples. As shown in Figure 6, these genes are highly coexpressed in normal samples (samples 1–22), while the coexpression pattern disappears in disease samples (samples 23–62).

Table 4 lists the gene annotations of the eight genes from Gene Ontology (GO), obtained with the AmiGO search tool. GO is a database that supports biologically meaningful annotation describing the molecular function, biological process, and cellular component of gene products [25]. As observed in Table 4, some of the genes share common biological functions and are involved in the same biological processes, such as muscle development, calcium ion binding, and regulation of striated muscle contraction. The results of Aigner et al. showed that ZEB1 is associated with human colorectal cancer and that ZEB1 is a key player in the pathologic epithelial to mesenchymal transition (EMT) associated with tumour progression [26]. Claeskens et al. showed that Hevin is downregulated in many cancers and that Hevin may be a potential target for cancer diagnosis and therapy [27]. The results obtained by MIClique on the colon dataset also coincide with those of other researchers. For example, all eight genes are included in the differentially expressed genes selected for the colon dataset by a unified framework [28], and some of these genes are consistent with the results of other researchers [29–31].

3.2. Comparisons with Other Similarity Measures

The definition of the similarity measure is very important for identifying the relationships among genes. Euclidean distance and the correlation coefficient are traditional similarity measures commonly used in gene expression analysis, but both are unsuitable for nonlinear relationships that might exist between expression patterns. Euclidean distance also fails to detect genes that are simultaneously upregulated or downregulated when their absolute expression levels differ greatly in amplitude. Compared with Euclidean distance and the Pearson correlation coefficient, the MI measure yields better performance [32].
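The limitation of linear measures on nonlinear relationships can be illustrated with a small synthetic example, which is not taken from the datasets in this paper. The sketch below compares the Pearson correlation coefficient with a binned MI estimate for the dependence $y = x^2$ on a symmetric grid; the helper names, the four equal-width bins, and the base-2 logarithm are assumptions, and the MI helper expects non-constant input.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def binned_mi(x, y, n_bins=4):
    """Plug-in MI estimate (bits) after equal-width binning of x and y."""
    def bins(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / n_bins            # assumes v is not constant
        return [min(int((a - lo) / width), n_bins - 1) for a in v]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    bx, by = bins(x), bins(y)
    return entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))

# y depends deterministically on x, but the relationship is not linear.
x = [i / 10 for i in range(-20, 21)]          # symmetric grid from -2.0 to 2.0
y = [a * a for a in x]
print("Pearson r:", pearson(x, y))            # close to zero by symmetry
print("Binned MI:", binned_mi(x, y))          # clearly above zero
```

The linear coefficient misses the dependence entirely, while the MI estimate remains well above zero.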

Figures 7 and 8 show, respectively, the Euclidean distance matrices and the Pearson correlation coefficient matrices of the eight genes identified by MIClique from the colon dataset. The Euclidean distance values range from 2.025 to 7.073 in normal samples and from 1.676 to 5.497 in disease samples. The Pearson correlation coefficient values range from 0.151 to 0.946 in normal samples and from 0.242 to 0.891 in disease samples. Neither figure shows any indication of differential coexpression patterns among the eight genes.

3.3. Leukemia Data

The samples of the leukemia dataset are divided into two subclasses of disease samples: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The MIClique algorithm is applied to the preprocessed and normalized leukemia dataset with and . A group of eight DEC genes is identified, which are coexpressed in ALL samples but not in AML samples. The MI values of these eight genes range from 1.944 to 3.348 in ALL samples and from 0.764 to 1.225 in AML samples, with average MI values of 2.550 in ALL samples and 0.934 in AML samples, respectively. Table 5 lists the GenBank accession numbers, gene symbols, and gene descriptions given by the leukemia dataset. In addition, MIClique can identify DEC genes that are correlated in AML but not in ALL. All these DEC genes are helpful for understanding the disease pathogenesis of leukemia and the biological function of gene modules.

4. Conclusions

The difference between MIClique and supervised gene selection methods is that the MIClique algorithm evaluates the contributions of genes to the phenotype by gene subsets rather than by individual genes. Although the aim of MIClique is not to select discriminative genes between normal and disease tissues, or between different types of disease samples, the identified genes are still very informative for sample classification. For example, most of the genes identified by MIClique from the colon dataset are also differentially expressed genes, which is consistent with the results of other studies.

The results on the colon and leukemia datasets suggest that the MIClique algorithm is effective in identifying DEC genes. DEC analysis focuses on the interactions among gene pairs and on disease-related gene networks, which is very important for understanding disease pathogenesis and the biological function of gene modules. The MIClique algorithm provides a new and intuitive approach for biological and clinical cancer research.