Current microarray data mining methods such as clustering, classification, and association analysis heavily rely on statistical and machine learning algorithms for analysis of large sets of gene expression data. In recent years, there has been a growing interest in methods that attempt to discover patterns based on multiple but related data sources. Gene expression data and the corresponding literature data are one such example. This paper suggests a new approach to microarray data mining as a combination of text mining (TM) and information extraction (IE). TM is concerned with identifying patterns in natural language text and IE is concerned with locating specific entities, relations, and facts in text. The present paper surveys the state of the art of data mining methods for microarray data analysis. We show the limitations of current microarray data mining methods and outline how text mining could address these limitations.

1. Introduction

DNA microarrays facilitate the simultaneous measurement of the expression levels of thousands of genes [1, 2]. As a result, this high-throughput technology has led to an increased amount of gene expression data. Microarrays have been used for a variety of studies, including gene coregulation studies, gene function identification studies, identification of pathways and gene regulatory networks, predictive toxicology, clinical diagnosis, and sequence variance studies. For a complete description of microarrays and their analytical tasks, refer to the books [3–5]. Current microarray data mining methods such as clustering, classification, and association analysis are based on statistical and machine learning algorithms. Most of these techniques are purely data driven and do not incorporate significant amounts of biological knowledge. Considering the statistically ill-defined nature of microarray data (many more variables than observations) and the massive body of existing biological knowledge, it is imperative that we exploit that knowledge for the analysis and interpretation of microarray data. Text mining techniques constitute a promising technology for automating the incorporation of scientific knowledge into the microarray data mining process.

Applying domain knowledge is fundamental in any scientific discovery process. In biology, domain knowledge is available in vast collections of literature in natural language form, such as abstracts [6] and full-text journal articles [7, 8], and also as textual annotations in databases such as SwissProt [9] and GenBank [10]. For example, the biological abstract database PubMed comprises more than 19 million citations for biomedical articles from MEDLINE and life science journals. The rapid growth of literature databases and structured repositories renders it increasingly difficult for humans to access the required information within reasonable time constraints. Text mining and information extraction are computerized techniques facilitating the automated filtering and analysis of large amounts of electronic texts. Text mining can be defined as the process of identifying nontrivial, implicit, previously unknown, and potentially useful patterns in natural language text [11]. Information extraction, on the other hand, identifies specific predefined classes of entities, relations, or facts in natural language text and then records and presents this information in a structured format [12]. Put simply, the goal of information extraction is to extract nuggets of information from text, whereas the goal of text mining is to find new knowledge. However, both text mining and information extraction are complex processes composed of multiple tasks approached by disciplines such as statistics, information retrieval, natural language processing, machine learning, and artificial intelligence. For a complete description and review of TM and IE methods and natural language processing, refer to the books [13–16].

In recent years, there has been a growing interest in methods that attempt to simultaneously discover patterns occurring in multiple data sources. Gene expression data analysis is an example of such a scenario, as it is concerned with the analysis of the actual expression data in conjunction with existing relevant textual information about genes, proteins, diseases, and so on. The goal here is to address the statistical issues and limitations of current microarray data mining by integrating text mining.

The following sections describe in detail the current statistical limitations in all three stages of the microarray data mining pipeline, that is, (i) data preprocessing, (ii) data modeling or model construction, and (iii) postprocessing, and outline, with references to the related literature, how text mining could address these limitations.

2. Text Mining Perspectives in Microarray Data Mining

Can text mining be incorporated into the microarray analysis process, and if so, where and how? This review tries to provide an answer to this question. Common strategies of data mining in microarray analysis are (i) data preprocessing; (ii) data modeling or model construction; and (iii) data postprocessing. Text mining techniques constitute a promising technology for automating the incorporation of scientific knowledge in all of these data mining steps. For example, text mining outputs (i.e., new knowledge) could be combined and correlated with actual gene expression data in model construction (i.e., clustering, classification, and association analysis). Further, text mining can also be applied earlier, in data preprocessing, for feature selection, data transformation, and data enrichment, and in the postprocessing phase for interpretation and knowledge-based validation of microarray analysis results (see discussions below). The three major steps of microarray data mining and how text mining approaches could help are shown in Figure 1.

2.1. Data Preprocessing

The data preprocessing task in data mining focuses on the preparation and transformation of the “raw” data for further analysis and processing. The most important aspects of microarray data preprocessing include data transformation (normalization, centralization, and standardization), missing value imputation and other forms of error correction, data discretization, feature selection and dimension reduction, and data enrichment.

Of major interest is the task of feature selection. Feature selection and dimension reduction techniques are important for tackling the dimensionality problem inherent in microarray data (also known as the curse of dimensionality or the large-p-small-n problem). Feature selection can also directly lead to the discovery of marker genes, that is, genes associated with the phenotype under investigation (e.g., cancer class). To our knowledge, all publications addressing the issue of feature selection in the context of microarrays involve statistical or machine learning methods only, such as the p-metric [17], information gain ranking [18], and the chi-square statistic [19]. These methods take into account only the numerical values of the microarray matrix. However, these purely statistical approaches have been criticized by King and Sinha and by O’Neill et al. [20, 21]. West et al. [22] point out that the genes of main interest are not only the highly expressed ones but also those whose expression correlates strongly with the phenotype, regardless of the level of expression. Clearly, statistical significance does not necessarily imply biological relevance.
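
To make concrete what "taking into account only the numerical values" means, the following is a minimal sketch of information gain ranking for individual genes. The gene names, the expression values, and the choice of the per-gene median as the discretization threshold are illustrative assumptions; they are not taken from the cited studies.

```python
import math
import statistics
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(expr, labels, threshold):
    """Information gain obtained by splitting the samples at `threshold`."""
    high = [y for x, y in zip(expr, labels) if x > threshold]
    low = [y for x, y in zip(expr, labels) if x <= threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (high, low) if part)
    return entropy(labels) - conditional


# Toy data (invented for illustration): six samples, two classes.
labels = ["A", "A", "A", "B", "B", "B"]
toy_expression = {
    "gene1": [9.1, 8.7, 9.4, 2.0, 2.3, 1.9],  # separates A from B
    "gene2": [5.0, 4.8, 5.2, 5.1, 4.9, 5.0],  # uninformative
}

# Score each gene using its own median as the (assumed) split point.
scores = {
    gene: information_gain(values, labels, statistics.median(values))
    for gene, values in toy_expression.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that the ranking depends solely on the numbers in the matrix; nothing in the score reflects what is known about the function of either gene.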

The problem with methods relying on the numerical representation of gene expression can be illustrated with a simple example. Suppose that we are interested in identifying those genes whose expression profile correlates strongly with a cancer, and let us assume that we investigate two classes, A (cancer) and B (normal). If a gene is highly overexpressed in samples of type A but underexpressed in samples of type B, then the vast majority of currently applied feature selection methods would identify this gene as a marker gene. However, a large number (if not the majority) of genes found to be overexpressed in tumor cells simply reflect the fact that aggressive cells tend to be in active cell cycle and thus express genes that are known to play a pivotal role in the cell cycle. As a consequence, genes associated with general homeostasis functions such as metabolism, protein synthesis, and cytoskeletal structures are more highly expressed and are likely to dominate the expression profile. Genes that do play a pivotal role in cancerogenesis may therefore frequently be missed [21]. Furthermore, it is known that cancer is a multistep process [23]. However, microarray experiments in cancer research deal with “mature” cancers, that is, tumors that have already reached a critical mass that can be diagnosed. Therefore, it is problematic to identify the “trigger genes,” that is, those genes that are responsible for cancerogenesis in the first place. How can we prevent the genes associated with general homeostasis functions from obscuring the global picture? Current number-based approaches are unable to readjust the focus on those genes. Text mining could be the method of choice to address this problem. Text classification and clustering methods have the potential to identify, from the literature data, those genes that are involved in general homeostasis functions, thus allowing the exclusion of those genes from further analysis. Further, text clustering might be used to group genes according to their functions using the related literature, which allows the selection of functionally important genes from different clusters.
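
The idea of excluding homeostasis-related genes on the basis of their literature can be sketched as a very simple keyword-based text classifier. Everything in the sketch below is hypothetical: the term list, the threshold, the gene identifiers, and the abstract texts are invented for illustration, and a realistic system would use a trained text classifier rather than a fixed vocabulary.

```python
import re

# Hypothetical vocabulary signalling general homeostasis functions.
HOMEOSTASIS_TERMS = {"ribosomal", "metabolism", "cytoskeleton", "translation", "glycolysis"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def homeostasis_score(abstracts):
    """Fraction of tokens in a gene's abstracts that are homeostasis terms."""
    tokens = [t for abstract in abstracts for t in tokenize(abstract)]
    if not tokens:
        return 0.0
    return sum(t in HOMEOSTASIS_TERMS for t in tokens) / len(tokens)


def filter_genes(literature, threshold=0.2):
    """Keep genes whose literature does not look homeostasis-dominated."""
    return [gene for gene, docs in literature.items()
            if homeostasis_score(docs) < threshold]


# Invented toy literature, keyed by placeholder gene identifiers.
toy_literature = {
    "GENE_A": ["Ribosomal protein involved in translation and metabolism"],
    "GENE_B": ["Kinase that regulates apoptosis in tumor cells"],
}

kept = filter_genes(toy_literature)
print(kept)
```

Only the genes in `kept` would then be passed on to the number-based feature selection step.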

In contrast to number-crunching statistical and machine learning approaches, text mining can also select genes on the basis of other semantic criteria. This facilitates the filtering of genes that are known to be involved in specific pathways, to share similar functions, or to have the same cellular localization. For example, suppose that we are interested in comparing the expression of some genes of known or putative cancer inhibitory function in two different types of cancer. For each gene, good sources of scientific texts are available. By mining these resources, we can identify the genes that we are interested in and compare, for instance, their expression in different cancer types.

2.2. Data Modeling

The data modeling task focuses on the construction of data models from expression data, such as clusters, association rules, and classification systems.

Clustering of microarray data is currently performed on the basis of gene expression measurements represented in numerical format, that is, mostly as real numbers [24–26]. However, the results of clustering and other data mining methods depend on a number of factors, not only on the data to be analyzed. Different clustering algorithms, for example, usually detect different, yet meaningful, groups in the same data set. Some clustering algorithms, for example, k-means and fuzzy c-means, can produce different results for different parameter initializations [26]. Most algorithms compute some kind of similarity or distance function as the criterion for assigning objects to clusters. However, the notion of similarity is extremely difficult to define in high-dimensional data sets such as microarray data [27]. Further, biostatisticians do not agree on which number-based clustering technique should be applied under which circumstances [28]. Furthermore, biological validation and interpretation of cluster analysis results is a difficult task [29].
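
The sensitivity of k-means to its initialization can be reproduced with a few lines of code. The sketch below is a minimal one-dimensional implementation of the standard Lloyd iteration, run on invented toy values with two different sets of starting centroids; it is not the procedure used in any of the cited studies.

```python
def kmeans_1d(points, initial_centroids, max_iter=100):
    """Plain Lloyd-style k-means on one-dimensional data.

    Returns the clusters found when starting from `initial_centroids`.
    """
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters


# Invented toy expression values forming three natural groups.
values = [1, 2, 3, 11, 12, 13, 21, 22, 23]

# Same data, same k = 2, different starting centroids.
result_a = kmeans_1d(values, [1, 11])
result_b = kmeans_1d(values, [13, 23])
print(result_a)
print(result_b)
```

Both runs converge, yet the first merges the middle group with the upper one and the second merges it with the lower one, so the grouping reported to the biologist depends on an arbitrary starting choice.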

Text mining can pave the way to a fundamentally different concept of clustering: instead of focusing on the numeric expression values, one could group genes according to semantic concept membership, for example, with respect to their function, cell localization, or role in a disease, and inspect the expression profiles in the next step. For example, Raychaudhuri and colleagues [30] used text-clustering methods on the associated scientific literature to investigate whether the genes within a gene expression cluster share a common biological function. Another system, PubGene [31], is used for the functional interpretation of gene expression clusters using literature information. PubGene initially compiles a network of human genes by finding co-occurrences of gene names within relevant Medline abstracts. Next, it uses the gene names in this network to search for other literature references involving these genes and annotates the network with that literature information. These compiled annotations are then compared with the gene expression cluster results to assess their correlation.
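
The co-occurrence step described above can be illustrated with a short sketch. It is not PubGene's actual implementation: the function name, the whole-token matching of gene names, and the placeholder gene names and abstracts are assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_network(abstracts, gene_names):
    """Count how often each pair of gene names appears in the same abstract.

    Returns a Counter mapping (gene, gene) pairs to co-occurrence counts,
    which can be read as the weighted edges of a literature network.
    """
    lookup = {name.lower(): name for name in gene_names}
    edges = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        present = sorted(lookup[t] for t in tokens if t in lookup)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Placeholder gene names and invented abstract snippets.
genes = ["geneA", "geneB", "geneC"]
toy_abstracts = [
    "geneA interacts with geneB in yeast.",
    "geneA and geneB are coexpressed; geneC is not.",
    "geneC alone was studied here.",
]

network = cooccurrence_network(toy_abstracts, genes)
print(network.most_common())
```

In practice, gene name recognition is far harder than whole-token matching because of synonyms and ambiguous symbols, which is one reason the simple lookup here should be read only as a sketch.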

Various machine learning and data mining methods are currently used for classifying gene expression data. However, these methods have not been developed to address the specific requirements of gene microarray analysis. As pointed out by Sabatti [32], the main issues for microarray data analysis by machine learning and data mining methods include experimental design, noise level, measurement errors, and quality. Furthermore, such statistical techniques do not address the biologist’s requirement for sound mathematical confidence measures, nor do they account for misclassification costs; that is, they are indifferent to the costs associated with false positive and false negative classifications [33]. Therefore, purely statistical models are prone to a variety of problems. We believe that text mining methods have great potential to complement current statistical and machine learning techniques, thus representing a more adequate framework for the analysis and interpretation of microarray data. The following are a few examples of how text mining could help to address these issues.

Classification of microarray data requires methods for dimension reduction to overcome the curse of dimensionality. Using text mining, it may become possible to build more sophisticated classifiers that take into account existing background knowledge. For example, assume a classification problem in cancer subtype or outcome prediction [34]. In such a classification problem, the classifier might weigh the genes differently, depending on their function, instead of focusing on the numeric expression values alone. This is a potential area of research, and to our knowledge no work has been done along these lines. Instead, many groups have explored the possibilities of complementing these kinds of studies with text mining [35–37]. However, integrated analysis is a challenging and open problem.
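
One way to picture a classifier that weighs genes by function is a nearest-centroid rule with per-gene weights in the distance. The sketch below is purely illustrative and does not correspond to any published method: the weights are assumed to come from some literature-derived relevance score, and the centroids and sample values are invented.

```python
def weighted_distance(sample, centroid, weights):
    """Euclidean distance in which each gene contributes according to its weight."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(sample, centroid, weights)) ** 0.5


def nearest_centroid(sample, centroids, weights):
    """Return the class label whose centroid is closest to the sample."""
    return min(centroids,
               key=lambda label: weighted_distance(sample, centroids[label], weights))


# Two genes: the first is assumed to be a homeostasis gene, the second a
# gene that the literature links to the disease (both hypothetical).
class_centroids = {"subtype1": [10.0, 2.0], "subtype2": [4.0, 5.0]}
sample = [9.0, 5.0]

uniform_weights = [1.0, 1.0]      # purely numeric view
literature_weights = [0.1, 1.0]   # hypothetical knowledge-derived weights

unweighted_call = nearest_centroid(sample, class_centroids, uniform_weights)
weighted_call = nearest_centroid(sample, class_centroids, literature_weights)
print(unweighted_call, weighted_call)
```

With uniform weights the strongly expressed first gene dominates the decision; once it is down-weighted, the second gene determines the predicted class.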

For example, consider task 2 of the KDD Cup 2002 [35]. It explored the possibility of predicting a gene expression classification system using the literature and other data resources. The task was to analyze the knockout behavior of yeast genes in expression data using a text collection of about 15,000 scientific abstracts and other data resources such as gene interaction information, subcellular localization hierarchies, and gene functions. The winning solution combined information extraction and text mining methods. IE methods were used to extract key features/terms from the given literature, which were then used by a text classification algorithm to classify the genes.
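
The general pattern of extracting terms from text and feeding them to a text classifier can be sketched with a small multinomial naive Bayes model. This is not the winning KDD Cup solution, whose details are not given here; the tokenizer stands in for the term extraction step, and the training sentences and the labels "changed" and "unchanged" are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Stand-in for term extraction: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(documents):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        class_counts[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary


def predict(model, text):
    """Return the most probable label, using add-one smoothing."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for term in tokenize(text):
            if term in vocabulary:  # ignore terms never seen in training
                score += math.log((term_counts[label][term] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training snippets describing hypothetical knockout outcomes.
training = [
    ("deletion alters stress response expression", "changed"),
    ("knockout induces expression changes in stress pathway", "changed"),
    ("deletion has no detectable effect", "unchanged"),
    ("knockout shows no phenotype", "unchanged"),
]
model = train(training)
prediction_1 = predict(model, "no effect")
prediction_2 = predict(model, "stress expression changes")
print(prediction_1, prediction_2)
```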

Mining microarray data for associations between genes in the form of pathways can reveal new insights into underlying biological or pathological systems, but it is a computationally complex task because thousands of potentially interesting associations can be derived from a data set that comprises a large number of genes. Text mining methods, however, could supply predefined sets of relations and interactions between genes and proteins. The resulting interactions and associations may then be combined and correlated with the gene expression analysis. This problem has been addressed by Natarajan et al. [36], who showed how text mining could complement a purely data-driven approach in determining the impact of S1P on the invasivity of glioblastoma cell lines. Another text mining system, Literature Lab [37], was used to find the association between the pathway related to FOSB and genes with increased expression in metastatic prostate cancer.
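
Restricting the association search to gene pairs already suggested by the literature can be sketched as follows. The pair list, the expression profiles, and the correlation cutoff of 0.8 are assumptions made for illustration; the cited systems use their own, more elaborate procedures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def supported_pairs(text_pairs, expression, min_abs_r=0.8):
    """Keep literature-derived pairs whose expression profiles also correlate."""
    supported = []
    for gene_a, gene_b in text_pairs:
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= min_abs_r:
            supported.append((gene_a, gene_b, r))
    return supported


# Hypothetical relations extracted from text, and invented expression profiles.
literature_pairs = [("geneA", "geneB"), ("geneA", "geneC")]
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [5.0, 1.0, 4.0, 2.0],
}

confirmed = supported_pairs(literature_pairs, toy_profiles)
print(confirmed)
```

Only the candidate pairs are tested, so the number of correlations computed grows with the size of the literature-derived list rather than with the square of the number of genes.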

2.3. Data Postprocessing

This task focuses on the biological validation and interpretation of microarray analysis results. Applying statistical and/or machine learning methods to the analysis of microarray data requires a sound validation of the results. This validation could in principle involve three basic steps: (1) validation using statistics (p-values, confidence intervals, measures like specificity, sensitivity, etc.); (2) validation based on further biological experiments; (3) validation based on available domain knowledge (provided by human experts, obtained in semistructured format from scientific literature repositories, or as highly structured records in information bases such as SwissProt, GenBank, and others).

The validation of the obtained results using statistics is the most straightforward type of validation. For example, many studies focus on the classification of cancer types based on gene expression profiles [38, 39]. To assess the statistical significance of their results, many researchers use statistical tests, including random permutation tests. The validation of the expression study may also trigger additional biological experiments. Yeoh et al. [19], for example, validated their results by FISH experiments. Combining multiple cytogenetic experiments may also validate the results of a microarray experiment; see, for example, the work of Berrar et al. [40]. Inductive classification methods often employ simple accuracy measures (e.g., precision, lift) and cross-validation to validate the induced classifiers. Statistical validation is relatively straightforward and inexpensive (in terms of both money and time), whereas validating and interpreting microarray analysis results in terms of their biological relevance is not. It requires the combined expertise of the involved scientists, such as pathologists, cytogeneticists, chemists, and biologists. For example, a microarray study focusing on cancer-related genes might reveal that a specific gene is statistically strongly correlated with the cancer under investigation. Here, text mining comes into play by validating the result using existing knowledge bases such as scientific text. The analysis of the text repositories may reveal that the identified gene does play a pivotal role in a specific pathway that is related to the cancer under investigation. This approach is far from new; in fact, it is common practice to validate results by referring to already published material. However, this type of validation is still being performed in a highly manual fashion by the researchers. By sifting through the vast amount of already existing information, text mining could be the tool for finding new nuggets of knowledge and become a valuable complement to biological validation.
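
The statistical validation mentioned at the start of this section can be illustrated with a permutation test on the difference of group means for a single gene. The sketch below enumerates all relabelings exactly, which is feasible only for very small samples; for realistic sample sizes, a random subset of permutations would be drawn instead. The expression values are invented.

```python
from itertools import combinations


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means.

    Every way of relabeling the pooled samples into groups of the original
    sizes is enumerated, and the fraction of relabelings with a difference
    at least as extreme as the observed one is returned.
    """
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    extreme = total = 0
    for indices in combinations(range(len(pooled)), size_a):
        chosen = set(indices)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        difference = abs(sum(relabeled_a) / len(relabeled_a)
                         - sum(relabeled_b) / len(relabeled_b))
        if difference >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Invented expression values of one gene in tumor and normal samples.
tumor = [9.0, 10.0, 11.0, 12.0]
normal = [1.0, 2.0, 3.0, 4.0]

p_value = exact_permutation_pvalue(tumor, normal)
print(p_value)
```

A small p-value of this kind establishes statistical significance only; whether the gene is biologically relevant is exactly the question that literature-based validation is meant to address.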

3. Conclusion

The analysis of microarrays has become one of the main research areas in computational biology, and new methods and applications are continuously being developed. Here, we have reviewed current limitations of microarray data mining using statistical and machine learning methods and how text mining could help to overcome these limitations. The fact that free text information is far from unambiguous remains an open problem. As Chang and Altman pointed out [41], perhaps the most difficult problem is that our knowledge about living systems is fluid, as definitions of words and conceptual paradigms change.

Microarray technology has become a mature method for monitoring the expression of thousands of genes, yet the incorporation of existing domain knowledge remains an open problem. We believe that currently applied statistical and machine learning methods are too limited to exploit the full potential of microarray data. A plethora of information is already available in publicly accessible databases, and TM can help to incorporate this knowledge into the analysis process. The high-throughput technologies of the next generation, such as protein chips, are most likely to entail problems similar to those related to gene microarray data. We believe that text mining will become the tool to achieve the next quantum leap in analyzing these high-throughput data.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.


The author wishes to thank Werner Dubitzky and Daniel Berrar at the University of Ulster, UK, for their valuable suggestions and fruitful discussions.