Abstract

Diagnostic genes are usually used to distinguish different disease phenotypes. Most existing methods for diagnostic gene finding are based on either the individual or the combinatorial discriminative power of genes. However, both ignore the common expression trends among genes. In this paper, we devise a novel kind of sequence rule, namely, top-k covering irreducible contrast sequence rules (TopkIRs for short), which helps to build a sample classifier of high accuracy. Furthermore, we propose an algorithm called MineTopIRs to efficiently discover TopkIRs. Extensive experiments conducted on synthetic and real datasets show that MineTopIRs is significantly faster than the previous methods and achieves higher classification accuracy. Additionally, many of the discovered diagnostic genes provide new insight into disease diagnosis.

1. Introduction

It has been proved that many diseases are closely related to genes [1–3]. In bioinformatics, such genes are called diagnostic genes. Capturing these genes is an important task, which helps in the diagnosis, prediction, and treatment of diseases [4].

According to biological theory, only a small number of genes are directly related to a certain disease [5]. Biologists always want to use fewer genes to achieve higher disease prediction accuracy. In practice, picking out these diagnostic genes to distinguish different disease phenotypes from a massive amount of gene expression data is often an intractable problem.

Many studies have shown that contrast rules are very promising for this problem. Contrast rules refer to rules that appear frequently in one class but rarely in the other classes, denoted as X ⇒ c, where X represents a set of diagnostic genes and c represents a certain disease phenotype. Most such methods can be divided into two categories, that is, single-discrimination based [6] and combinatorial-discrimination based [7]. The former evaluates every gene according to its individual discriminative power with respect to the target classes and then selects the top-ranked genes. The latter often models the problem as a subset search problem and focuses on the combinatorial discriminative power of a set of genes. However, neither of them exploits the relationships among genes, so some important diagnostic genes may be missed.

In this paper, we tackle the problem by utilizing the order relationship among genes. Below is a real example that gives an immediate sense of our basic idea.

Example 1. Figure 1 consists of two subfigures. In the top subfigure, 4 genes are expressed over 25 samples; samples 1–16 are cancerous and samples 17–25 are normal. In the bottom subfigure, another set of 3 genes is expressed over the same set of samples. The existing singleton or combination discriminability-based methods cannot distinguish the two phenotypes. Since most genes have similar average expression values in the two phenotypes, they will not be selected by the singleton approach. Moreover, all genes are expressed in both phenotypes, so the combination approach based on the co-occurrence of genes will not select them either. Both methods ignore the hidden interrelation among genes. In the top subfigure, the genes follow a fixed order over the cancerous samples, and this order is disturbed in the normal samples. In the bottom subfigure, the genes follow a fixed order over the normal samples, while in the cancerous samples no such order exists. Based on the ordered expression values, the two disease phenotypes (the two shadowed “blocks”) are well identified.

Example 1 indicates that contrast sequence rules may be a promising solution to the aforementioned problem. Another advantage of incorporating sequence rules into diagnostic gene finding is that we may obtain higher disease prediction accuracy with fewer genes. This is intuitively because the order contains both individual and combinatorial information. In [8], we proposed a contrast sequence rule mining algorithm, namely, NRMINER, and showed its effectiveness and efficiency. However, there are still some issues demanding further consideration.

Given n genes, there are up to 2^n subsets of genes. Moreover, each subset of m genes corresponds to m! permutations. Thus, the number of candidate contrast sequence rules is, in theory, at least exponential in the number of genes, as shown below. On one hand, such a massive number of rules poses a crucial challenge for biologists to interpret and validate. On the other hand, mining them may take so much time that the method is not practically feasible. In practice, we often need only a small set of representative contrast sequence rules instead of all the rules. This is also the so-called top-k problem in the database and data mining communities. Accordingly, the goal of this paper is to discover top-k covering irreducible contrast sequence rules (TopkIRs for short) from a given gene expression dataset.
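Written out under the assumption that every permutation of every nonempty gene subset is a candidate antecedent, the bound reads:

```latex
\#\{\text{candidate contrast sequence rules}\}
\;\ge\; \sum_{m=1}^{n} \binom{n}{m}\, m!
\;=\; \sum_{m=1}^{n} \frac{n!}{(n-m)!}.
```

Already for n = 20 genes the last term alone, 20!, exceeds 2 × 10^18, which is why enumerating and reporting all rules is hopeless.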

Compared with the existing methods, our contributions in this paper are as follows.
(1) We propose the concept of the top-k covering irreducible contrast sequence rule, which greatly reduces the burden on biologists to interpret and validate the results and makes an efficient diagnostic gene finding method practical.
(2) We devise criteria for ranking irreducible contrast sequence rules. Based on these criteria, we can pick out shorter and fewer but more representative rules to build a classifier with higher classification accuracy.
(3) We develop a novel algorithm called MineTopIRs to directly discover top-k covering irreducible contrast sequence rules without postprocessing. To the best of our knowledge, few works address this problem in the context of sequence mining.

The rest of this paper is organized as follows. In Section 2, we introduce some preliminaries and give our problem definition. Section 3 introduces the criteria of ranking rules. Section 4 details the MineTopIRs algorithm. Section 5 includes the experimental results and analysis. Finally, Section 6 concludes this paper.

2. Preliminary

In this section, we first introduce some basic concepts useful for further discussion and then formalize the problem to be addressed in this paper.

2.1. Basic Concepts

A microarray dataset D is an n × m matrix, with n samples S = {s1, s2, …, sn} and m genes G = {g1, g2, …, gm}. A real value dij in D represents the expression value of gene gj on sample si. An example microarray dataset of 7 genes and 6 samples is shown in Table 1, where the last column lists the class label of each sample.

As mentioned, we want to tackle the problem from the gene order perspective. Accordingly, we propose the EWave model, a sequence model to represent the gene expression data. Next are some necessary concepts.

Definition 2. Given an expression matrix D over a sample set S and a gene set G, a grouping threshold δ, and a sample si ∈ S, if there exists a subset E of genes satisfying both conditions (1) and (2), we say E is an equivalent dimension group, or an EDG in short, of the sample si. Intuitively, condition (1) requires that the expression values of the genes in E on si are close to one another (within δ), while condition (2) further restricts which of these genes may be grouped together; the formal statements are given in [8].

Specifically, we call a gene satisfying condition (1) but excluded from an EDG by condition (2) a “breakpoint.” The method of creating EDGs is detailed in [8]. It is worth noting that no order is considered within an EDG, since the expression values in it show no significant differences.

An EWave model can be used to represent the sequences of EDGs. Figure 2 shows the EWave model corresponding to the running example in Table 1. In each row of an EWave model, all genes are ordered increasingly according to their expression values on the corresponding sample, and a pointer from one gene to another indicates an EDG starting at the former and ending at the latter. We omit pointers that point from a gene to itself.
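To make the construction concrete, here is a minimal Python sketch of building one EWave row. The paper allows EDGs to overlap; for simplicity this sketch groups consecutive genes (in the sorted order) whose expression values stay within δ of the first gene of the group, which is only an assumed approximation of conditions (1) and (2) in Definition 2, and the function name is illustrative.

```python
from typing import List, Tuple

def build_ewave_row(expr: List[float], delta: float) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Build one EWave row for a single sample.

    expr[j] is the expression value of gene j on this sample.
    Returns (order, edgs):
      order -- gene indices sorted increasingly by expression value
      edgs  -- EDGs as (start, end) positions in `order` (inclusive);
               here an EDG is assumed to be a maximal run of consecutive
               genes in the sorted order whose values stay within delta.
    """
    order = sorted(range(len(expr)), key=lambda j: expr[j])
    edgs = []
    start = 0
    for pos in range(1, len(order) + 1):
        # close the current group when the next value drifts more than delta
        # away from the group's first value, or when we run out of genes
        if pos == len(order) or expr[order[pos]] - expr[order[start]] > delta:
            edgs.append((start, pos - 1))
            start = pos
    return order, edgs

# toy usage: 4 genes on one sample, delta = 0.5
order, edgs = build_ewave_row([2.0, 0.1, 0.4, 3.1], delta=0.5)
print(order)  # [1, 2, 0, 3] -- genes sorted by expression value
print(edgs)   # [(0, 1), (2, 2), (3, 3)] -- genes 1 and 2 fall into one EDG
```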

Different from traditional sequence-like data, the EWave model allows different EDGs to overlap, so a gene in one EDG can also belong to several other EDGs at the same time. Given a sample s and a gene g, the sequence of EDGs of s is denoted as T_s. Then, we call the index of the first EDG in T_s containing g the head position of g with respect to s, and the index of the last EDG in T_s containing g the tail position of g with respect to s, denoted as head_s(g) and tail_s(g), respectively.

Example 3. For the sample shown in Figure 2, the head position of the considered gene is 2, and its tail position is 3.

Definition 4. Let T = E1E2⋯Em be a sequence of EDGs in an EWave model. We say a gene sequence α = b1b2⋯bp is contained by T, denoted as α ⊑ T, if there exist integers 1 ≤ i1 ≤ i2 ≤ ⋯ ≤ ip ≤ m such that bj ∈ Eij for each j. Further, we refer to a contained gene sequence in which no two genes lie in the same EDG as a significant chain.

Example 5. In Figure 2, the first gene sequence is a significant chain of the sample’s EDG sequence, but the second is not, since two of its genes coexist in the same EDG.

As mentioned above, we aim to capture the difference among sample phenotypes from a sequence point of view. Thus, the benefit of the EWave model is twofold. On one hand, gene expression data are not only very noisy but also often contain genes with very close expression values; by considering only significant chains, the differences between genes are large enough that the difficulty of determining the order among genes is overcome. On the other hand, the high dimensionality of gene expression data is greatly reduced at the same time. Next, we introduce some concepts related to the contrast sequence rule under the EWave model.

Definition 6. Let D′ be an EWave modeled gene expression dataset. For a given sequence rule R, denoted as α ⇒ c, where α is a significant chain and c is a given class label, the support of R is defined as the number of EDG sequences in D′ with class label c that contain α, denoted as supp(R), and the set of the corresponding samples is called the sample support set of R, denoted as ssup(R). The confidence of R is defined as the ratio of the number of EDG sequences with class label c containing α to the number of all EDG sequences containing α, denoted as conf(R).

Example 7. In Figure 2, let R be a rule whose antecedent is a significant chain of the EDG sequences of three samples carrying the consequent class label. Then supp(R) = 3 and ssup(R) consists of these three samples. Further, since the total number of EDG sequences containing the antecedent is 4, conf(R) = 3/4 = 75%.
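As a small illustration of Definition 6, the following Python sketch computes supp and conf for a rule, assuming a helper `contains(seq, alpha)` that tests whether a sample’s EDG sequence contains the antecedent as a significant chain (for example, via the Head-Tail matrix of Section 4.1); the function and parameter names are not from the paper.

```python
from typing import Callable, List, Tuple

def support_and_confidence(samples: List[Tuple[str, object]],
                           alpha: List[str],
                           label: str,
                           contains: Callable[[object, List[str]], bool]) -> Tuple[int, float]:
    """samples is a list of (class_label, edg_sequence) pairs for the whole dataset;
    contains(seq, alpha) tests whether alpha is a significant chain of seq."""
    covering = [cls for cls, seq in samples if contains(seq, alpha)]
    supp = sum(1 for cls in covering if cls == label)   # sequences of class c containing alpha
    conf = supp / len(covering) if covering else 0.0    # ... divided by all sequences containing alpha
    return supp, conf
```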

Definition 8. Let D′ be an EWave modeled gene expression dataset and c a specified class label. A set of rules Γ is a rule group with antecedent support set Σ and consequent c iff every rule α ⇒ c ∈ Γ satisfies ssup(α ⇒ c) = Σ and every rule α′ ⇒ c with ssup(α′ ⇒ c) = Σ belongs to Γ; that is, a rule group collects all rules with consequent c that are supported by exactly the same set of samples.

Example 9. In Figure 2, several rules share the same antecedent support set and the same consequent. Thus, they make up a rule group with that antecedent support set and the specified class label.

In this paper, we want to use contrast sequence rules to distinguish the sample phenotypes. However, the number of contrast sequence rules in a dataset is prohibitively large, and most of them are redundant. Discovering all contrast sequence rules is inefficient and yields many trivial results. Thus, we propose the concept of the irreducible contrast sequence rule, which is more concise and representative.

Definition 10. Let D′ be an EWave modeled gene expression dataset. A sequence rule of the form R: α ⇒ c is called a contrast sequence rule if supp(R) and conf(R) are no less than the minimum support threshold σ and the confidence threshold γ, respectively, where α is a gene sequence and c is a class label.

Example 11. For suitable thresholds σ and γ, the rule R in Figure 2 is a contrast sequence rule, since supp(R) ≥ σ and conf(R) ≥ γ.

Definition 12. For any given contrast sequence rule R: α ⇒ c of D′, we call it an irreducible contrast sequence rule if every subrule R′: α′ ⇒ c, where α′ is a proper subsequence of α, has conf(R′) < γ. In other words, no subrule of an irreducible contrast sequence rule is itself a contrast sequence rule.

Example 13. The rule R in Figure 2 is not an irreducible contrast sequence rule, since there exists a subrule R′ of R such that conf(R′) ≥ γ.

Definition 14. Given D′, an EWave modeled gene expression dataset, the top-k covering irreducible contrast sequence rules for a sample s are a set of k irreducible contrast sequence rules {R1, …, Rk}, where the antecedent of each Ri is contained by the EDG sequence of s and there exists no other irreducible contrast sequence rule covering s that can substitute any rule in {R1, …, Rk} based on the rule priority (Section 3). For brevity, we will use the abbreviation TopkIRs to refer to the top-k covering irreducible contrast sequence rules of each sample.

Example 15. Suppose k = 2. Then, for the sample in Figure 2, the top-k covering irreducible contrast sequence rules are a set of two rules R1 and R2. This is because the antecedents of both rules are contained by the sample’s EDG sequence, both R1 and R2 are irreducible contrast sequence rules, and there is no other rule that can substitute R1 or R2 under the rule priority.

2.2. Problem Description

Given a gene expression dataset D in which each sample is attached with a class label, the number of required rules k, the equivalent threshold δ, the minimum support threshold σ, and the confidence threshold γ, the problem is to efficiently discover the set of top-k covering irreducible contrast sequence rules for each sample.

3. Criteria of Ranking Rules

In this section, we introduce the criteria of ranking rules. In order to evaluate the (dis)similarity between sequences, we propose the concept of projection distance, which is more suitable for EWave modeled gene expression data. The reason is that the projection distance takes into account not only the differences at the same positions of two sequences but also the displacement between the corresponding items.

Assume α is a gene sequence and T_s is the gene sequence corresponding to sample s. The projection of α on T_s, denoted as P(α, T_s), refers to the sequence of all elements of α permuted according to their relative orders in T_s. Further, if a pair of items of α has the reversed relative order in P(α, T_s), we call it a reverse pair. Then, for an item g, if it is at the ith locus in α and at the jth locus in P(α, T_s), we call |i − j| the displacement of g between α and P(α, T_s), denoted as disp(g).

Definition 16. Given a gene sequence α and the gene sequence T_s corresponding to sample s, the projection distance between α and T_s, denoted as PD(α, T_s), is defined as the sum of the displacements of the items involved in reverse pairs, where a Boolean function B(·, ·) equals 1 if a pair of items is a reverse pair and 0 otherwise, so that only reverse pairs contribute to the sum.

Now, we adopt a similarity function defined based on the concept of projection distance (or simply PD) to identify the (dis)similarity between a sequence and its projection on a sample s. The similarity function is formally defined as follows.

Definition 17. Given a gene sequence α and the gene sequence T_s corresponding to sample s, the PD similarity between α and T_s, denoted as Sim(α, T_s), is defined as a decreasing function of PD(α, T_s) normalized by the length |α| of the gene sequence α, so that Sim(α, T_s) = 1 when PD(α, T_s) = 0.

From Definition 17, we can see that the smaller the projection distance between two sequences, the more similar the sequences are. If PD(α, T_s) = 0, then Sim(α, T_s) = 1, which means the two orders are exactly the same. Next, we introduce the criteria of ranking rules in two cases.
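Since the exact formulas behind Definitions 16 and 17 are not reproduced here, the following Python sketch shows one plausible instantiation: the projection, the reverse pairs and displacements, a projection distance that sums the displacements of the two genes in each reverse pair, and a similarity normalized by an upper bound that depends only on |α|. The function names and the normalization are assumptions, not the paper’s exact formulas.

```python
from typing import List, Sequence

def projection(alpha: Sequence[str], template: Sequence[str]) -> List[str]:
    """Elements of alpha reordered by their relative order in the sample template T_s."""
    rank = {g: i for i, g in enumerate(template)}
    return sorted(alpha, key=lambda g: rank[g])

def projection_distance(alpha: Sequence[str], template: Sequence[str]) -> int:
    """Sum, over every reverse pair, of the displacements of the two genes involved."""
    proj = projection(alpha, template)
    pos_a = {g: i for i, g in enumerate(alpha)}
    pos_p = {g: i for i, g in enumerate(proj)}
    disp = {g: abs(pos_a[g] - pos_p[g]) for g in alpha}     # displacement of each gene
    pd = 0
    for i in range(len(alpha)):
        for j in range(i + 1, len(alpha)):
            if pos_p[alpha[i]] > pos_p[alpha[j]]:           # (alpha[i], alpha[j]) is a reverse pair
                pd += disp[alpha[i]] + disp[alpha[j]]
    return pd

def pd_similarity(alpha: Sequence[str], template: Sequence[str]) -> float:
    """Normalized so that an identical order gives similarity 1; the divisor
    n*(n-1)^2 is simply an upper bound on the distance above."""
    n = len(alpha)
    if n < 2:
        return 1.0
    return 1.0 - projection_distance(alpha, template) / (n * (n - 1) ** 2)

# toy usage: the template orders the genes as g2 < g4 < g1 < g3
print(projection_distance(["g1", "g2", "g3"], ["g2", "g4", "g1", "g3"]))  # 2: (g1, g2) is a reverse pair
```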

Definition 18. The priority within the same rule group: given two rules R1: α1 ⇒ c and R2: α2 ⇒ c with ssup(R1) = ssup(R2), we say R1 is prior to R2 if the aggregate PD similarity between α1 and the gene sequences of the samples in the nonsupport set is smaller than that of α2.

From Definition 18, we can conclude that the more the antecedent of a rule differs from the gene sequences of the samples in its nonsupport set, the higher the priority the rule has.

Example 19. In Figure 2, two rules have the same support set, but by Definition 18 the antecedent of one of them is less similar to the samples in the nonsupport set, so that rule is prior to the other.

Definition 20. The priority between rule groups: given two rules R1: α1 ⇒ c and R2: α2 ⇒ c from different rule groups, we say R1 is prior to R2 if and only if one of the following three conditions is satisfied: (i) conf(R1) > conf(R2); (ii) conf(R1) = conf(R2) and supp(R1) > supp(R2); (iii) conf(R1) = conf(R2), supp(R1) = supp(R2), and R1 is discovered before R2.

Example 21. In Figure 2, consider three rules from different rule groups. One rule has a higher confidence than another and is therefore prior to it. Two further rules have equal confidence, but one of them has a larger support, so it is prior to the other.
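A compact way to read Definitions 18 and 20 together is as a comparator over rules. The sketch below assumes a hypothetical `Rule` record whose fields (support, confidence, sample support set, discovery order, and an aggregated PD similarity to the nonsupport samples) mirror the quantities used above; none of these names come from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

@dataclass
class Rule:
    antecedent: Tuple[str, ...]   # significant chain
    label: str                    # consequent class label c
    supp: int                     # support (Definition 6)
    conf: float                   # confidence (Definition 6)
    ssup: FrozenSet[int]          # sample support set
    order: int                    # discovery order (smaller = found earlier)
    nonsup_sim: float = 0.0       # aggregate PD similarity to nonsupport samples

def prior_to(r1: Rule, r2: Rule) -> bool:
    """True if r1 has higher priority than r2 (Definitions 18 and 20)."""
    if r1.ssup == r2.ssup:
        # same rule group: the antecedent LESS similar to the nonsupport
        # samples wins (Definition 18)
        return r1.nonsup_sim < r2.nonsup_sim
    # different rule groups: confidence, then support, then discovery order
    return (r1.conf, r1.supp, -r1.order) > (r2.conf, r2.supp, -r2.order)
```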

4. The MineTopIRs Algorithm

In this section, we present our algorithm, called MineTopIRs, to solve the problem described in Section 2.2. First, we give a naive method to construct a classifier based on contrast sequence rules.

Step 1. Discover all the frequent sequence patterns with a low minimum support threshold.

Step 2. Combine each sequence pattern with a class label to generate a sequence rule. Then, pick out the contrast sequence rule with highest confidence for each sample in the dataset.

Obviously, this naive two-step mining method generates too many rules in Step 1, which takes too much time. Moreover, selecting only one rule for each sample is often not enough. Instead, our algorithm is a one-pass process, which is much more efficient. Further, each sample is guaranteed to be covered by its top-k irreducible contrast sequence rules. In what follows, we detail the proposed MineTopIRs algorithm.

4.1. Head-Tail Matrix

The Head-Tail matrix is a useful structure to accelerate detecting whether a sequence is a significant chain with respect to some sample template sequence T_s, which is a necessary condition for the antecedent of a contrast sequence rule. Table 2 gives the Head-Tail matrix corresponding to the model shown in Figure 2, where each row represents a considered sample and each column represents a remaining gene. Every entry of the matrix records a two-dimensional vector, whose first component is the head position of the gene in the sample’s EDG sequence and whose second component is its tail position. For example, the entry at row 3 and column 1 of Table 2 records the head and tail positions of the corresponding gene on the corresponding sample in Figure 2.

An efficient way to decide whether a sequence α is a significant chain with respect to T_s is to consider only neighboring pairs of genes in α, say g_a followed by g_b: if tail_s(g_a) < head_s(g_b) always holds, then α must be a significant chain for T_s, the sequence of EDGs of sample s. Note that while computing the support of a gene sequence, we use the Head-Tail matrix built with the grouping threshold δ, which makes the order between genes in the sequence significant enough; however, when computing the projection distance of a gene sequence for some sample, we use the Head-Tail matrix built with a zero grouping threshold, which makes the displacement of a reverse pair easily determined.
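Continuing the EWave sketch above, the following shows how Head-Tail entries for one sample can be derived from its EWave row and how the neighboring-pair test is applied; `head_tail_row` and `is_significant_chain` are illustrative names, not the paper’s.

```python
from typing import Dict, Sequence, Tuple

def head_tail_row(order: Sequence[int], edgs: Sequence[Tuple[int, int]]) -> Dict[int, Tuple[int, int]]:
    """Head-Tail entries for one sample: gene -> (head, tail), where head/tail
    are the indices of the first/last EDG containing the gene."""
    ht: Dict[int, Tuple[int, int]] = {}
    for gene_pos, gene in enumerate(order):
        first = last = None
        for idx, (lo, hi) in enumerate(edgs):
            if lo <= gene_pos <= hi:
                if first is None:
                    first = idx
                last = idx
        ht[gene] = (first, last)
    return ht

def is_significant_chain(alpha: Sequence[int], ht: Dict[int, Tuple[int, int]]) -> bool:
    """alpha is a significant chain for this sample iff, for every neighboring
    pair (a, b) in alpha, tail(a) < head(b)."""
    if any(g not in ht for g in alpha):
        return False
    return all(ht[a][1] < ht[b][0] for a, b in zip(alpha, alpha[1:]))

# with the toy EWave row from the earlier sketch:
order, edgs = [1, 2, 0, 3], [(0, 1), (2, 2), (3, 3)]
ht = head_tail_row(order, edgs)
print(is_significant_chain([1, 0, 3], ht))   # True
print(is_significant_chain([1, 2, 0], ht))   # False: genes 1 and 2 share an EDG
```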

4.2. The Mining Algorithm

The search space of enumerating all gene sequences is prohibitively large. Thus, a suitable traversal framework with some effective pruning strategies is necessary.

In this paper, we adopt a breadth-first traversal framework. As we know, most sequence pattern mining methods such as BIDE [9] and FEAT [10] adopt a depth-first traversal. The benefit is that, by exploiting the antimonotonicity of support, a depth-first traversal can directly prune the search space based on the current sequence without generating a candidate set. However, a depth-first traversal is not suitable for the problem raised in this paper. The reason is that the confidence of an irreducible contrast sequence rule is not antimonotonic, which requires us to check whether all subrules of the current rule satisfy the condition in Definition 12, that is, whether the confidence of every subrule is below γ. For example, if the length of the current sequence rule is l, we need to check all of its subrules, whose number grows exponentially with l, so the computational cost is very high. Moreover, without building an index over the visited rules, many rules may be accessed repeatedly. Both cases are very time-consuming. In contrast, a breadth-first traversal solves these problems well: we only need to check whether all the l-size subrules of an (l+1)-size rule meet the conditions, and these subrules can be obtained by directly accessing the current rule candidate set, which is more efficient.

Formally, the algorithm is shown in Algorithm 1. The input parameters are the original dataset D, the required number of rules k, the equivalent threshold δ, the minimum support threshold σ, and the confidence threshold γ. Since the problem is solved from a gene sequence perspective, the algorithm first transforms D into the EWave model and then constructs the Head-Tail matrix, which accelerates the calculation of rule support. At the same time, the list of top-k covering irreducible contrast sequence rules for each sample s with consequent c, denoted as TopkIRs_s, is initialized. Also, we put all the 1-size rules that consist of a single gene into the rule candidate set Candi_R. Then the function breadthfirst_search is called to perform the breadth-first traversal that finds the top-k rules for each sample.

Input: a gene expression dataset D; the required number of rules k; the equivalent threshold δ; the support threshold σ;
the confidence threshold γ
Output: all top-k covering irreducible contrast sequence rules TopkIRs_s for each sample s with class label c
(1)  Convert dataset D into the EWave model D′, w.r.t. δ;
(2)  Construct the Head-Tail matrix;
(3)  Initiate a list TopkIRs_s of k rules, each with support and confidence values of 0, for every sample s with class
     label c;
(4)  Initiate the rule candidate set Candi_R with all 1-size sequence rules;
(5)  Call breadthfirst_search(Candi_R, σ, γ, 1);
(6)  Return TopkIRs_s for every s with class label c;
Function: breadthfirst_search(Candi_R, σ, γ, l)
(1)  while Candi_R ≠ ∅ do
(2)    foreach (l+1)-size rule R generated based on the l-size rules in Candi_R do
(3)      if some l-size subrule of R does not exist in Candi_R then
(4)        prune R and all its super rules;                        ▹ Pruning rule 1
(5)      else if supp(R) < σ then
(6)        prune R and all its super rules;                        ▹ Pruning rule 2
(7)      else if conf(R) < γ then
(8)        add R into Candi_R;
(9)      else
(10)       Check the kth covering rule of each sample in ssup(R) to find
           the lowest confidence minconf and the corresponding support sup;
(11)       if conf(R) < minconf or (conf(R) = minconf and supp(R) < sup) then
(12)         prune R and all its super rules;                      ▹ Pruning rule 3
(13)       else
(14)         Update TopkIRs_s for each sample s ∈ ssup(R) with R,
             based on Definitions 18 and 20;
(15)     end
(16)   end
(17)   Delete all the l-size rules in Candi_R;
(18)   l ← l + 1;
(19) end

The function breadthfirst_search takes four parameters: the rule candidate set Candi_R, the minimum support threshold σ, the confidence threshold γ, and the current rule size l. At level l, the algorithm generates all the (l+1)-size rules based on the l-size rules in Candi_R (line 2). For each (l+1)-size rule, it applies three pruning rules (lines 4, 6, and 12) to decide whether the rule will be put into Candi_R for further extension (line 8), used to update the top-k covering rules of the samples in its support set (line 14), or simply pruned. It is worth noting that the confidence of every rule in Candi_R must be below γ, because once the confidence of a rule reaches γ, none of its super rules can be an irreducible contrast sequence rule. After the end of each loop, the algorithm deletes all the l-size rules from Candi_R (line 17). The algorithm ends when Candi_R = ∅ (line 1).
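The level-wise loop can be summarized by the following Python sketch. Here `extend`, `evaluate`, `kth_threshold`, and `update_topk` are assumed helpers standing in for candidate generation, Definition 6, the Pruning Rule 3 threshold, and the top-k update, and `Rule` is the record sketched in Section 3; it is a sketch of the control flow, not the paper’s implementation (which was written in C).

```python
from typing import Callable, Dict, List, Set, Tuple

Alpha = Tuple[str, ...]   # a gene sequence used as a rule antecedent

def breadthfirst_search(candi_r: Set[Alpha], sigma: int, gamma: float, k: int,
                        topk: Dict[int, List["Rule"]],
                        extend: Callable, evaluate: Callable,
                        kth_threshold: Callable, update_topk: Callable) -> None:
    """Level-wise search with the three pruning rules; the four callables are
    assumed helpers (candidate generation, Definition 6, the Pruning-Rule-3
    threshold, and the top-k update of Definitions 18 and 20)."""
    l = 1
    while candi_r:
        next_level: Set[Alpha] = set()
        for alpha in extend(candi_r):                        # all (l+1)-size candidates
            subs = (alpha[:i] + alpha[i + 1:] for i in range(len(alpha)))
            if any(sub not in candi_r for sub in subs):      # Pruning rule 1
                continue
            r = evaluate(alpha)                              # fills supp, conf, ssup
            if r.supp < sigma:                               # Pruning rule 2
                continue
            if r.conf < gamma:
                next_level.add(alpha)                        # keep for further extension
                continue
            minconf, sup = kth_threshold(topk, r.ssup, k)
            if r.conf < minconf or (r.conf == minconf and r.supp < sup):
                continue                                     # Pruning rule 3
            update_topk(topk, r, k)                          # Definitions 18 and 20
        candi_r = next_level                                 # delete all l-size rules
        l += 1
```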

4.2.1. Pruning Strategies

We next illustrate the pruning techniques used in MineTopIRs. With the help of these pruning rules, we can find the top-k covering irreducible contrast sequence rules for each sample efficiently.

Pruning Rule 1. Let R: α ⇒ c be the currently considered sequence rule; if there exists a sequence rule R′: α′ ⇒ c, where α′ is a proper subsequence of α, such that conf(R′) ≥ γ, then the rule R itself and all its super rules can be pruned.

Proof. Based on the definition of an irreducible contrast sequence rule, if a sequence rule R: α ⇒ c is irreducible, every subrule R′: α′ ⇒ c must satisfy conf(R′) < γ. Thus, if any of its subrules does not satisfy this condition, R cannot be an irreducible contrast sequence rule. Similarly, none of its super rules can be an irreducible contrast sequence rule, since R′ is also a subrule of each of them.

Specific to our algorithm, we store in Candi_R, for further extension, each rule whose confidence and all of whose subrules’ confidences are below γ. When deciding whether a newly generated (l+1)-size rule is to be pruned, we only need to test whether all of its l-size subrules are in Candi_R. If not, we can safely prune this sequence rule and all its super rules.

Pruning Rule 2. Let R: α ⇒ c be the currently considered sequence rule and σ the minimum support threshold. If supp(R) < σ, then the current rule and all its super rules are pruned.

Proof. It follows immediately from the Apriori property of sequences [11] and Definition 12.

In MineTopIRs, we can also use the top-k constraint to prune rules. Combined with Definition 20, we compute minconf and sup, the critical TopkIRs thresholds for the samples in ssup(R), where minconf is the minimum confidence value among the kth-ranked rules of the discovered TopkIRs over all the samples in ssup(R) and sup is the corresponding support. Here we assume that the top-k covering irreducible contrast sequence rules of each sample are ranked according to the priority between rule groups.
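A sketch of computing minconf and sup (the `kth_threshold` helper assumed in the earlier sketch) is given below; it takes the weakest kth-ranked rule over all samples in ssup(R), treating missing slots as dummy rules with zero support and confidence, as in the initialization of Algorithm 1. The names are illustrative.

```python
from typing import Dict, FrozenSet, List, Tuple

def kth_threshold(topk: Dict[int, List["Rule"]], ssup: FrozenSet[int], k: int) -> Tuple[float, int]:
    """Weakest kth-ranked rule over the samples in ssup(R): returns (minconf, sup)."""
    best: Tuple[float, int] = (1.0, 10**9)       # initialized above any real rule
    for s in ssup:
        rules = topk.get(s, [])
        kth = (rules[k - 1].conf, rules[k - 1].supp) if len(rules) >= k else (0.0, 0)
        if kth < best:                            # lexicographic: confidence, then support
            best = kth
    return best
```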

Pruning Rule 3. Given the currently considered sequence rule R: α ⇒ c and the values minconf and sup computed as above, if R is less prior, based on the priority between rule groups (Definition 20), than a rule with confidence minconf and support sup, then R and all its super rules cannot become a rule in the top-k covering irreducible contrast sequence rules list of any sample and can be safely pruned.

If the current sequence rule R: α ⇒ c cannot be pruned by Pruning Rule 3, there are two situations. On one hand, when no rule in TopkIRs_s has the same sample support set as R, we only need to check whether R is prior to the kth rule in TopkIRs_s; if so, we substitute R for that rule. On the other hand, because in this paper we want to find top-k rules with different sample support sets for each sample, when some rule in TopkIRs_s has the same sample support set as R, we need to check whether R is prior to that rule based on the priority within the same rule group (Definition 18); if so, we replace that rule with R, which guarantees that the current rules in TopkIRs_s always have the highest priority.
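The two situations can be sketched as the following update routine, reusing the hypothetical `Rule` record and the `prior_to` comparator from Section 3; it is a sketch of the bookkeeping, not the paper’s data structure.

```python
from typing import Dict, List

def update_topk(topk: Dict[int, List["Rule"]], r: "Rule", k: int) -> None:
    for s in r.ssup:                               # every sample covered by r
        rules = topk.setdefault(s, [])
        same_group = next((i for i, q in enumerate(rules) if q.ssup == r.ssup), None)
        if same_group is not None:
            # one rule per sample support set: keep the prior one (Definition 18)
            if prior_to(r, rules[same_group]):
                rules[same_group] = r
        else:
            rules.append(r)                        # candidate for a new slot
        # keep only the k highest-priority rules (Definition 20)
        rules.sort(key=lambda q: (q.conf, q.supp, -q.order), reverse=True)
        del rules[k:]
```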

In addition, another optimization is utilized together with Pruning Rule 3. If we find that all TopkIRs have 100% confidence and the lowest support value among these rules is larger than σ, we dynamically increase the user-specified support threshold accordingly.

5. Performance Studies

In this section, we look at both the efficiency of our algorithm in discovering TopkIRs and the usefulness of the discovered rules. All our experiments were performed on an HP PC with a 2.33 GHz Intel Core 2 CPU, 2 GB RAM, and a 160 GB hard disk running Windows XP. The algorithms were coded in standard C.

Datasets. We use four real gene expression datasets for the experimental studies: Leukemia [1], DLBCL Tumor [2], Hereditary Breast Cancer (HBC) [3], and Prostate Cancer (PC) [12]. Table 3 shows the characteristics of the four datasets: the number of samples (#sample), the number of genes (#gene), and the class labels. The number of samples in every class is shown in the last column. Moreover, we generate synthetic datasets by using a specialized dataset generator [8].

5.1. Efficiency of MineTopIRs

In terms of efficiency, we compare MineTopIRs with R-FEAT and NRMINER [8]. On one hand, R-FEAT is adapted from the sequence generator mining algorithm FEAT [10]. Briefly, we apply FEAT to a given dataset; when a generator is found, we decide whether it could be a result by checking whether the rules formed from it satisfy the conditions of Definition 14. On the other hand, the NRMINER algorithm adopts a template-driven method to find all the interesting nonredundant contrast sequence rules, which are then checked against the conditions in Definition 14. We should point out that the rules discovered by MineTopIRs are a subset of those discovered by the above two existing methods.

In Figure 3, we study how the running time varies with #sample and #gene by increasing #sample from 10 to 30 while fixing #gene to 100 and then increasing #gene from 20 to 100 while fixing #sample to 30, where the synthetic datasets are utilized. Figures 3(a) and 3(b) show that the running time becomes longer as #sample and #gene increase, because the search space becomes larger. However, MineTopIRs is always much faster than the other two methods; the reason is that our algorithm directly discovers the results in one pass, whereas the other two are two-step mining methods, which need to first discover a bigger result set and then conduct postprocessing. Further, as the search space grows, the number of rules after the first mining step grows exponentially, which is very time-consuming.

Figure 4 shows the effect of varying k on the runtime. We observe similar tendencies on all datasets: it is quite reasonable that the running time of MineTopIRs increases monotonically with k. Also, as shown in Figure 5, the running time of MineTopIRs decreases monotonically with the equivalent threshold δ. Figures 6 and 7 show the effect of varying the minimum support threshold σ and the minimum confidence threshold γ on the four real gene expression datasets. Figures 6(a)–6(d) show the running time varying with the minimum support threshold σ, where the other two parameters γ and δ are set to 0.8 and 0, respectively. Note that the y-axes in Figures 6 and 7 are in logarithmic scale. We run MineTopIRs with a fixed k. In Figure 7, γ changes from 70% to 90% while σ and δ are fixed in every dataset. As seen from Figure 6, the running time decreases as σ increases, because increasing σ prunes more useless rules. We also find that MineTopIRs is usually one order of magnitude faster than the other two algorithms, especially at low minimum support. The reason MineTopIRs outperforms the other two algorithms is that R-FEAT and NRMINER discover a large number of rules at lower minimum support, while the number of rules discovered by MineTopIRs is bounded. Besides, MineTopIRs can use Pruning Rule 1 to prune the search space, whereas R-FEAT and NRMINER cannot exploit this property. Figure 7 shows that the running time of both NRMINER and R-FEAT does not change significantly as γ increases, because the pruning strategies of these methods are mainly based on the support threshold σ. However, the running time of MineTopIRs increases a little with increasing γ. This is because, with the increase of γ, the number of rules whose confidence is below γ also increases; thus, the pruning ability decreases a little. Despite this, MineTopIRs is still faster than the other two algorithms, for the reasons given for Figure 6.

5.2. Effectiveness of MineTopIRs

In terms of the effectiveness of MineTopIRs, the classification accuracy and the classifier complexity are used as the evaluation criteria. Moreover, the biological significance of the discovered genes is also discussed.

5.2.1. Accuracy and Complexity

We build a classifier called the TopkIR classifier based on the rules that MineTopIRs discovers. The TopkIR classifier is composed of k subclassifiers, denoted as IR1, …, IRk. Each subclassifier IRi is built based on all the top-i rules of each sample in the dataset. We call IR1 the main classifier, and IR2, …, IRk are backup classifiers. We use each subclassifier in order until the test sample is successfully classified. Besides the main and backup classifiers, we set a default class, which is the majority class of the training data. If a test sample cannot be classified by any of the subclassifiers, we put it into the default class.

When building each subclassifier, the score function proposed in [13] is adopted, in which one quantity represents the rules matching the test sample in class c and another represents all the rules in class c. The class to which a test sample should be assigned is decided by the matched rules with the highest score.
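The cascading use of the subclassifiers can be sketched as follows. The score used here, the sum of confidences of the matched rules normalized by the number of rules of that class, is only an illustrative stand-in for the exact score function of [13], and `matches` is an assumed containment test.

```python
from typing import Callable, Dict, List

def classify(sample, subclassifiers: List[List["Rule"]], default_class: str,
             matches: Callable) -> str:
    """Cascade IR1, IR2, ... until one of them matches the test sample."""
    for rules in subclassifiers:                          # IR1 first, then the backups
        scores: Dict[str, float] = {}
        sizes: Dict[str, int] = {}
        for r in rules:
            sizes[r.label] = sizes.get(r.label, 0) + 1    # all rules of class c
            if matches(sample, r):
                scores[r.label] = scores.get(r.label, 0.0) + r.conf
        if scores:
            return max(scores, key=lambda c: scores[c] / sizes[c])
        # otherwise fall through to the next backup subclassifier
    return default_class                                  # no subclassifier matched
```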

In the experiments, we adopt 10-fold cross validation to test the average classification accuracy of the TopkIR classifier and compare it with the NR [8], CBA, and IRG [14] classifiers. The results in Table 4 show that the TopkIR classifier performs much better than the CBA and IRG classifiers. Compared with CBA, which is built with only the top-1 covering irreducible contrast sequence rules, the TopkIR classifier classifies much less test data using the default class. The IRG classifier is built based on association rules, and the comparison illustrates that sequence rules can reflect the data characteristics better. The TopkIR classifier is more accurate than the NR classifier on most datasets; moreover, it uses far fewer rules to build the classifier than NR. In our experiments, the rules used in the NR classifier usually number more than ten thousand [8]. Furthermore, we find that the average length (AL for short) of the sequence rules used in the TopkIR classifier is shorter than that of IRG. This result verifies that MineTopIRs can provide high diagnostic accuracy using as few genes as possible, which is very valuable for biologists in the follow-up biological or clinical validation of the selected genes [15].

5.2.2. Biological Significance

Different from the traditional methods, MineTopIRs characterizes the pathogenesis of a disease from a sequence-like point of view, which incorporates the order among genes and can be seen as a disease-causing pathway. In this part, by showing some interesting results from the Leukemia dataset [1], we emphasize that MineTopIRs can not only find the genes revealed by the traditional methods but also find some genes ignored by them.

Table 5 lists the top-10 genes most frequently occurring in the discovered TopkIRs for the diagnosis of “AML” samples and “ALL” samples, where a marked gene means that it is also included in the benchmark resulting from eight statistics-based gene ranking methods [16]. The two most frequent genes in Table 5 also appear in the benchmark. Gene TIMP2 is a member of the TIMP gene family, whose encoded proteins are natural inhibitors of the matrix metalloproteinases. Reference [17] reveals that the transcription of TIMP2 in SHI-1 cells of AML is higher than in other leukemic cells. Gene ZFP36 expression is upregulated in human T-lymphotropic virus 1 (HTLV-1) infected cells, and HTLV-1 is associated with adult T-cell leukemia/lymphoma [18].

In addition, the genes without a mark, though not in the benchmark, still cannot be ignored. For example, the gene sequence including the frequent gene CCT5 in Table 5 appears in most “ALL” samples but occurs in far fewer “AML” samples. However, none of its subsequences has the ability to distinguish the samples, which indicates that every gene in the sequence is irreducible and well reflects the synergy among the genes. Thus, these genes also have important potential value for biologists to investigate further.

6. Conclusion

In this paper, we study an important problem in bioinformatics, that is, discovering diagnostic gene patterns from gene expression data. Unlike previous work on this topic, we tackle the problem by exploiting the ordered expression trend of genes, which can better reflect the gene regulation pathway. In order to achieve more accurate diagnosis using as few rules as possible, we propose the concept of top-k covering irreducible contrast sequence rules for each sample of gene expression data. Further, an efficient method called MineTopIRs is developed to find all TopkIRs. Considering the noisy nature of real gene expression data, we first use an EWave model, which, essentially different from the current models, characterizes gene expression data from a sequence-like perspective. Then, MineTopIRs discovers the bounded number of TopkIRs in one mining process, and the rules can directly be used to build a classifier. Extensive experiments conducted on both synthetic and real datasets show that MineTopIRs is both effective and efficient. It may offer biologists a new point of view on diagnostic gene discovery.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by National Natural Science Foundation of China (61272182, 61100028, 61073063, 61173030, and 61173029), 863 program (2012AA011004), 973 program (2011CB302200-G), National Science Fund for Distinguished Young Scholars (61025007), State Key Program of National Natural Science of China (61332014), New Century Excellent Talents (NCET-11-0085), China Postdoctoral Science Foundation (2012T50263 and 2011M500568), and Fundamental Research Funds for the Central Universities (N130504001).