Abstract

During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available for researchers.

1. Introduction

One of the most important research fields in data mining is mining interesting patterns (such as sequences, episodes, association rules, correlations, or clusters) in large data sets. Frequent itemset mining is one of the earliest such concepts. It originates from economic market basket analysis, which aims at understanding the behaviour of retail customers or, in other words, at finding frequent combinations and associations among items purchased together [1]. Market basket data can be considered as a matrix with transactions as rows and items as columns. If an item appears in a transaction it is denoted by 1 and otherwise by 0. The general goal of frequent itemset mining is to identify all itemsets that are contained in at least a required number of transactions, referred to as the minimum support threshold. By definition, all subsets of a frequent itemset are frequent. Therefore, it is also important to provide a minimal representation of all frequent itemsets without losing their support information. Such itemsets are called frequent closed itemsets. An itemset is defined as closed if none of its immediate supersets has exactly the same support count as the itemset itself. For comprehensive reviews of efficient frequent itemset mining algorithms, see [2, 3].

Independently of frequent itemset mining, biclustering, another important data mining concept, was proposed to complement and expand the capabilities of the standard clustering methods by allowing objects to belong to several or to none of the resulting clusters purely based on their similarities. This property makes biclustering a powerful approach, especially when it is applied to data with a large number of objects. During recent years, many biclustering algorithms have been developed, especially for the analysis of gene expression data [4]. With biclustering, genes with similar expression profiles can be identified not only over the whole data set but also across subsets of experimental conditions by allowing genes to simultaneously belong to several expression patterns. For comprehensive reviews on biclustering, see [4–6].

One of the most important properties of biclustering when applied to binary data is that it provides the same results as frequent closed itemset mining (Figure 1). Such biclusters, called inclusion-maximal biclusters (or IMBs), were introduced in [7] together with a mining algorithm, BiMAX, to discover all biclusters in a binary matrix that are not entirely contained in any other bicluster. By default an IMB can contain any number of genes and samples. Once an additional minimum support threshold is imposed, so that only clusters with at least the given minimum number of genes are discovered, BiMAX and all frequent closed itemset mining methods yield the same patterns.

In this paper we propose an efficient pattern mining method to find frequent closed itemsets/biclusters in binary high-dimensional data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment, rigorously tested on both synthetic and real data sets, and is freely available for researchers (http://pr.mk.uni-pannon.hu/Research/bit-table-biclustering/).

2. Problem Formulation

In this section we show how both market basket data and gene expression data can be represented as bit-tables before presenting a new mining method in the next section. For real gene expression data, it is common practice in the field of biclustering to transform the original gene expression matrix into a binary one such that gene expression values are mapped to 1 (expressed) or 0 (not expressed) using an expression cutoff (e.g., twofold change of the log2 expression values). The binarized data can then be treated as classic market basket data and defined as follows (Figure 2): let $T = \{t_1, \ldots, t_n\}$ be the set of transactions and let $I = \{i_1, \ldots, i_m\}$ be the set of items. The transaction database can be transformed into a binary matrix, $\mathbf{B}^0$, where each row corresponds to a transaction and each column corresponds to an item (right side of Figure 2). Therefore, the bit-table contains 1 if the item is present in the current transaction and 0 otherwise [8].
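The binarization step can be illustrated with a minimal sketch. It assumes that the expression values are already log2 fold changes and uses a hypothetical cutoff of 1 (a twofold change); the matrix `E` and all variable names are illustrative and are not taken from the published implementation.

```matlab
% Hypothetical 4 genes x 3 conditions matrix of log2 fold-change values.
E = [ 1.8  0.2  1.1;
      0.1  0.3  2.5;
      1.2  1.4  0.9;
     -0.5  0.0  0.4];

cutoff = 1;                 % assumed cutoff: log2(2) = 1, i.e. a twofold change
bM = double(E >= cutoff);   % bit-table: 1 = expressed, 0 = not expressed

disp(bM)
```

Storing the bit-table as `double` rather than `logical` is a design choice of this sketch, so that the table can be used directly in arithmetic such as matrix products.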

Using the above terminology, a transaction $t$ is said to support an itemset $X \subseteq I$ if it contains all items of $X$; that is, $X \subseteq t$. The support of an itemset is the number of transactions that support this itemset. Using $\sigma$ for support count, the support of itemset $X$ is $\sigma(X) = |\{t \in T : X \subseteq t\}|$. An itemset is frequent if its support is greater than or equal to a user-specified threshold $\mathit{minsup}$; that is, $\sigma(X) \geq \mathit{minsup}$. An itemset $X$ is called a $k$-itemset if it contains $k$ items from $I$; that is, $|X| = k$. An itemset $X$ is a frequent closed itemset if it is frequent and there exists no proper superset $Y \supset X$ such that $\sigma(Y) = \sigma(X)$.
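These definitions can be checked on a small example. The sketch below uses a hypothetical 5 x 4 bit-table and a helper `supp` introduced only for illustration; it computes the support of an itemset and tests closedness by trying every one-item extension (checking immediate supersets is sufficient, because any larger superset with equal support implies an immediate one with equal support).

```matlab
% Hypothetical bit-table: 5 transactions (rows) x 4 items (columns).
bM = [1 1 0 1;
      1 1 1 0;
      1 1 0 0;
      0 1 1 0;
      1 1 0 1];

% Support of an itemset = number of rows that contain all of its items.
supp = @(X) sum(all(bM(:, X), 2));

X  = [1 2];                     % itemset given by its column indices
sX = supp(X);                   % rows 1, 2, 3, and 5 contain both items

% X is closed if no one-item extension has the same support.
isClosed = true;
for j = setdiff(1:size(bM, 2), X)
    if supp([X j]) == sX
        isClosed = false;
    end
end
fprintf('support = %d, closed = %d\n', sX, isClosed);
```

In this example the single item in column 1 is not closed, because adding column 2 leaves its support unchanged.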

The problem of mining frequent itemsets was introduced by Agrawal et al. in [1] and the first efficient algorithm, called Apriori, was published by the same group in [9]. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of the previously determined frequent itemsets to identify longer and longer frequent itemsets. Mannila et al. proposed the same technique independently in [10], and both works were combined in [11]. In many cases, frequent itemset mining approaches have good performance, but they may generate a huge number of substructures satisfying the user-specified threshold. It can be easily seen that if an itemset is frequent then all its subsets are frequent as well (for more details, see “downward closure property” in [9]). Although increasing the threshold might reduce the number of resulting itemsets and thus solve this problem, it would also remove interesting patterns with low frequency. To overcome this, the problem of mining frequent closed itemsets was introduced by Pasquier et al. in 1999 [12], where frequent itemsets that have no proper superitemset with the same support value (or frequency) are searched. The main benefit of this approach is that the set of closed frequent itemsets contains the complete information regarding its corresponding frequent itemsets. During the following few years, various algorithms were presented for mining frequent closed itemsets, including CLOSET [13], CHARM [14], FPclose [15], AFOPT [16], CLOSET+ [17], DBV-Miner [18], and STreeDC-Miner [19]. The main computational task of closed itemset mining is to check whether an itemset is a closed itemset. Different approaches have been proposed to address this issue. CHARM, for example, uses a hashing technique on its TID (transaction identifier) values, while AFOPT, FPclose, CLOSET, CLOSET+, and STreeDC-Miner maintain the detected itemsets in an FP-tree-like pattern-tree. Further reading about closed itemset mining can be found in [20].

The formulations above reveal the close relationship between closed frequent itemsets and biclusters, since the goal of biclustering is to find biclusters $(R, C)$, with rows $R \subseteq T$ and columns $C \subseteq I$, such that every entry of $\mathbf{B}^0$ in the rows $R$ and columns $C$ equals 1. Therefore, while the size restriction on a bicluster corresponds to the frequency condition of itemsets, the “maximality” of a bicluster corresponds to the closedness of an itemset. Thus, if itemsets that are supported by fewer than min_rows rows are filtered out, the set of all closed frequent itemsets is equal to the set of all maximal biclusters.
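The correspondence can be made concrete with a short sketch. The bit-table, the chosen column set, and the value of `min_rows` below are hypothetical; the code only illustrates how a closed itemset and its supporting transactions form an all-ones submatrix that cannot be extended by a further column.

```matlab
% Hypothetical bit-table: 5 transactions (rows) x 4 items (columns).
bM = [1 1 0 1;
      1 1 1 0;
      1 1 0 0;
      0 1 1 0;
      1 1 0 1];

cols = [1 2];                          % a closed itemset = bicluster columns
rows = find(all(bM(:, cols), 2))';     % supporting transactions = bicluster rows

min_rows = 2;                          % assumed size threshold
sub = bM(rows, cols);                  % submatrix spanned by the bicluster
isBicluster = all(sub(:)) && numel(rows) >= min_rows;

% Maximality: no remaining column consists only of ones on these rows.
rest = setdiff(1:size(bM, 2), cols);
extendable = any(all(bM(rows, rest), 1));
```

Here `rows` is derived from `cols`, so the row set is maximal by construction; the `extendable` flag checks the column side, which is exactly the closedness condition of the itemset.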

3. Mining Frequent Closed Itemsets Using Bit-Table Operations

In this section we introduce a novel frequent closed itemset mining algorithm and propose an efficient implementation of the algorithm in the MATLAB environment. Note that the proposed method can also be applied to various biclustering application fields, such as gene expression data analysis, after a proper preprocessing (binarization) step. The schematic view of the proposed pipeline is shown in Figure 3.

3.1. The Proposed Mining Algorithm

The mining procedure is based on the Apriori principle. Apriori is an iterative algorithm that determines frequent itemsets level-wise in several steps (iterations). In any step $k$, the algorithm calculates all frequent $k$-itemsets based on the already generated $(k-1)$-itemsets. Each step has two phases: candidate generation and frequency counting. In the first phase, the algorithm generates a set of candidate $k$-itemsets from the set of frequent $(k-1)$-itemsets from the previous pass. This is carried out by joining frequent $(k-1)$-itemsets together. Two frequent $(k-1)$-itemsets are joinable if their lexicographically ordered first $k-2$ items are the same and their last items are different. Before the algorithm enters the frequency counting phase, it discards every new candidate itemset having a subset that is infrequent (utilizing the downward closure property). In the frequency counting phase, the algorithm scans through the database and counts the support of the candidate $k$-itemsets. Finally, candidates with support not lower than the minimum support threshold are added into the set of frequent itemsets.
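The candidate generation phase can be sketched as follows. This is an illustrative, unoptimized version of the classic Apriori join and prune steps, not the bit-table procedure of this paper; the matrix `L2` of frequent 2-itemsets is hypothetical.

```matlab
% Hypothetical frequent 2-itemsets, one per row, items sorted within each row.
L2 = [1 2; 1 3; 1 4; 2 3; 3 4];

C3 = [];                                   % candidate 3-itemsets
for a = 1:size(L2, 1)
    for b = a+1:size(L2, 1)
        % Join step: identical prefix, different (larger) last item.
        samePrefix = isequal(L2(a, 1:end-1), L2(b, 1:end-1));
        if samePrefix && L2(a, end) < L2(b, end)
            cand = [L2(a, :) L2(b, end)];
            % Prune step (downward closure): all (k-1)-subsets must be frequent.
            subsets = nchoosek(cand, numel(cand) - 1);
            if all(ismember(subsets, L2, 'rows'))
                C3 = [C3; cand];
            end
        end
    end
end
disp(C3)
```

In this example the join produces three candidates, and the candidate [1 2 4] is discarded because its subset [2 4] is not frequent.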

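A simplified pseudocode of the Apriori algorithm is presented in Pseudocode 1, which is extended by extracting only the closed itemsets in line 9. While the Generate procedure (row 4) generates the candidate itemsets $C_k$, the SupportCount method (in row 5) counts the support of all candidate itemsets and removes the infrequent ones.

1  $L_1$ = {frequent 1-itemsets}
2  $k = 2$
3  while $L_{k-1} \neq \emptyset$
4    $C_k$ = Generate($L_{k-1}$)
5    SupportCount($C_k$)
6    $L_k = C_k$
7    $k = k + 1$
8  end
9  extract the closed itemsets from $L_1, \ldots, L_{k-1}$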

The storage structure of the candidate itemsets is crucial to keep both memory usage and running time reasonable. In the literature, hash-tree [9, 11, 21] and prefix-tree [22, 23] storage structures have been shown to be efficient. The prefix-tree structure is more common, due to its efficiency and simplicity, but a naive implementation can still be very space consuming.

Our procedure is based on a simple and easily implementable matrix representation of the frequent itemsets. The idea is to store the data and itemsets in vectors. Then, simple matrix and vector multiplication operations can be applied to calculate the supports of itemsets efficiently.

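To indicate the iterative nature of our process, we define the input matrix ($\mathbf{B}^0$) as $\mathbf{B}^0 = [\mathbf{b}_1^0, \ldots, \mathbf{b}_m^0]$, where $\mathbf{b}_j^0$ represents the $j$th column of $\mathbf{B}^0$, which is related to the occurrence of the $j$th item in transactions. The support of item $i_j$ can be easily calculated as $\sigma(\{i_j\}) = (\mathbf{b}_j^0)^T \mathbf{b}_j^0$.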

Similarly, the support of itemset $\{i_p, i_q\}$ can be obtained by a simple vector product of the two related vectors, because when both items $i_p$ and $i_q$ appear in a given transaction the product of the two related bits equals the AND connection of the two items: $\sigma(\{i_p, i_q\}) = (\mathbf{b}_p^0)^T \mathbf{b}_q^0$. The main benefit of this approach is that counting and storing the itemsets are not needed; only matrices of the frequent itemsets are generated based on the element-wise products of the vectors corresponding to the previously generated frequent $(k-1)$-itemsets. Therefore, simple matrix and vector multiplications are used to calculate the support of the potential $k$-itemsets: $\mathbf{S}^k = (\mathbf{B}^{k-1})^T \mathbf{B}^{k-1}$, where the $(p, q)$th element of the matrix $\mathbf{S}^k$ represents the support of the itemset $L_p^{k-1} \cup L_q^{k-1}$, where $L^{k-1}$ represents the set of $(k-1)$-itemsets. As a consequence, only matrices of the frequent itemsets are generated, by forming the columns of $\mathbf{B}^k$ as the element-wise products of the columns of $\mathbf{B}^{k-1}$; that is, $\mathbf{b}^k = \mathbf{b}_p^{k-1} \circ \mathbf{b}_q^{k-1}$ for all $p, q$ whose entry in $\mathbf{S}^k$ reaches the minimum support threshold, where $\mathbf{A} \circ \mathbf{B}$ means the Hadamard product of matrices $\mathbf{A}$ and $\mathbf{B}$.
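The matrix-product view of support counting can be tried on a small example. The sketch below uses a hypothetical bit-table and threshold (`suppn` follows the naming of the implementation section); it computes all pairwise supports with one matrix product and then forms the next-level columns as Hadamard products.

```matlab
% Hypothetical bit-table: 5 transactions (rows) x 4 items (columns).
bM = [1 1 0 1;
      1 1 1 0;
      1 1 0 0;
      0 1 1 0;
      1 1 0 1];
suppn = 2;                           % assumed minimum support count

S = bM' * bM;                        % S(p,q) = support of the item pair {p,q}
                                     % diag(S) holds the single-item supports
[p, q] = find(triu(S, 1) >= suppn);  % frequent pairs with p < q
pairs = [p q];

% Columns of the next-level table: Hadamard products of the column pairs.
B2 = bM(:, p) .* bM(:, q);
s2 = sum(B2, 1)';                    % column sums reproduce the entries of S
```

Because the entries are binary, the column sum of an element-wise product equals the inner product of the two columns, so `s2` and the selected entries of `S` agree.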

The concept is simple and easily interpretable and supports compact and effective implementation. The proposed algorithm has a similar philosophy to the AprioriTid [24] method to generate candidate itemsets. Neither method has to revisit the original data table, $\mathbf{B}^0$, for computing the support of larger itemsets. Instead, our method transforms the table as it goes along with the generation of the $k$-itemsets, producing $\mathbf{B}^1, \ldots, \mathbf{B}^k$. $\mathbf{B}^1$ represents the data related to the frequent 1-itemsets. This table is generated from $\mathbf{B}^0$ by erasing the columns related to the nonfrequent items, to reduce the size of the matrices and improve the performance of the generation process.

Rows that do not contain any frequent items (the sum of the row is zero) in $\mathbf{B}^1$ are also deleted. If a column remains, the index of its original position is written into a matrix, $\mathbf{L}^1$, that stores only the indices (“pointers”) of the elements of the itemsets. When the matrices $\mathbf{L}^k$ holding the indices of the $k$-itemsets are ordered, it is easy to follow the heuristics of the Apriori algorithm, as only those itemsets will be joined whose first $k-1$ items are identical (the sets of these itemsets form the blocks of the $\mathbf{L}^k$ matrix).
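The reduction of the table can be sketched as follows, assuming a hypothetical bit-table in which one item is infrequent and one transaction contains only that item. The variable names `items1`, `B1`, and `keepRows` are introduced for this illustration only.

```matlab
% Hypothetical bit-table: 5 transactions (rows) x 5 items (columns).
bM = [1 1 0 1 0;
      1 1 1 0 0;
      0 0 0 0 1;
      0 1 1 0 0;
      1 1 0 1 0];
suppn = 2;                        % assumed minimum support count

s0 = sum(bM, 1);                  % support of every single item
items1 = find(s0 >= suppn);       % original column indices ("pointers")
B1 = bM(:, items1);               % erase the columns of infrequent items

keepRows = any(B1, 2);            % rows with at least one frequent item
B1 = B1(keepRows, :);             % erase rows whose sum is zero
```

Keeping `items1` alongside `B1` makes it possible to report itemsets in terms of the original column positions even though the reduced table is renumbered.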

Figure 4 represents the second step of the algorithm, using in the procedure.

3.2. MATLAB Implementation of the Proposed Algorithm

The proposed algorithm uses matrix operations to identify frequent itemsets and count their support values. Here we provide a simple but powerful implementation of the algorithm using the user-friendly MATLAB environment. MATLAB code 2 (Algorithm 1) and code 3 (Algorithm 2) present working code snippets of frequent closed itemset mining in only 34 lines of code.

(1)  s{1}=sum(bM); items{1}=find(s{1}>=suppn)'; s{1}=s{1}(items{1});  % frequent 1-itemsets
(2)  dum=bM'*bM; [i1,i2]=find(triu(dum,1)>=suppn); items{2}=[i1 i2];  % frequent 2-itemsets
(3)  k=3;
(4)  while ~isempty(items{k-1})
(5)    items{k}=[]; s{k}=[]; ci=[];
(6)    for i=1:size(items{k-1},1)
(7)      vv=prod(bM(:,items{k-1}(i,:)),2);  % bit-vector of the ith (k-1)-itemset
(8)      if k==3; s{2}(i)=sum(vv); end;
(9)      TID=find(vv>0);  % transactions supporting the itemset
(10)     pf=(unique(items{k-1}(find(ismember(items{k-1}(:,1:end-1), ...
(11)         items{k-1}(i,1:end-1),'rows')),end)));  % last items sharing the prefix
(12)     fi=pf(find(pf>items{k-1}(i,end)));
(13)     for jj=fi'
(14)       j=find(items{1}==jj);
(15)       v=vv(TID).*bM(TID,items{1}(j)); sv=sum(v);  % support of the candidate
(16)       if sv>=suppn; items{k}=[items{k}; items{k-1}(i,:) items{1}(j)]; s{k}=[s{k}; sv]; end
(17)     end
(18)   end
(19)   k=k+1;
(20) end

(1)  for k=1:length(items)-1
(2)    Citems{k}=[];
(3)    for i=1:size(items{k},1)
(4)      part=0;  % number of supersets with identical support
(5)      for j=1:size(items{k+1},1)
(6)        IS=intersect(items{k}(i,:), items{k+1}(j,:));
(7)        if and((sum(ismember(items{k}(i,:), IS))==k), s{k}(i)==s{k+1}(j))
(8)          part=part+1; end
(9)      end
(10)     if part==0
(11)       Citems{k}=[Citems{k}; items{k}(i,:)];  % keep closed itemsets only
(12)     end
(13)   end
(14) end
(15) Citems{k+1}=items{end};

The first code segment presents the second step of the discovery pipeline (see Figure 3). Preprocessed data is stored in the variable bM in bit-table format as discussed above. The first and second steps of the iterative procedure are presented in lines 1 and 2, where the frequent 1-itemsets (items{1}) and the frequent 2-itemsets (items{2}) are calculated. The Apriori principle is realized in the while loop in lines 4–19. Using the notation in Pseudocode 1, the $C_k$s are generated in lines 10-11 while the $L_k$s are prepared in the loop in lines 12–16.

MATLAB code 3 (Algorithm 2) shows what is usually the most expensive calculation, the generation of closed frequent itemsets, which is denoted by “extraction of frequent closed itemsets” in Figure 3. Using the set of frequent itemsets as the candidate frequent closed itemsets, our approach calculates the support as the sum of columns (see Section 3.1) and eliminates nonclosed itemsets from the candidate set (line 11). Again, an itemset $X$ is a frequent closed itemset if it is frequent and there exists no proper superset $Y \supset X$ such that $\sigma(Y) = \sigma(X)$. This is ensured by the loop in lines 5–9.

4. Experimental Results

In this section we compare our proposed method to BiMAX [7], which is a highly recognized reference method within the biclustering research community. As BiMAX is regularly applied to binary gene expression data, it serves as a good reference for the comparison. Using several biological and various synthetic data sets, we show that, while both methods are able to discover all patterns (frequent closed itemsets/biclusters), our pattern discovery approach outperforms BiMAX.

To compare the two mining methods and demonstrate the computational efficiency, we applied them to several real and synthetic data sets. The real data come from various biological studies previously used as reference data in biclustering research [25–28]. For the comparison of the computational efficiency, all biological data sets were binarized. For both the fold-change data (stem cell data sets) and the absolute expression data (Leukemia, Compendium, and Yeast-80), a fold-change cutoff of 2 was used. Results are shown in Table 1 (synthetic data) and Table 2 (real data). Both methods were able to discover all closed patterns for all synthetic and real data sets. The results show that our method outperforms BiMAX and provides the best running times in all cases, especially when the number of rows and columns is higher. Biological validation of the discovered patterns together with detailed explanations is given in [28].

5. Conclusion

In this paper we have proposed a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on a simple bit-table based matrix and vector multiplication approach and ensures that all patterns can be discovered in a fast manner. The proposed algorithm can be successfully applied to various bioinformatics problems dealing with high-density biological data including high-throughput gene expression data.

Disclosure

Attila Gyenesei is joint first author.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The financial support of the TÁMOP-4.2.2.C-11/1/KONV-2012-0004 project is gratefully acknowledged. The research of Janos Abonyi was realized in the frames of TÁMOP 4.2.4. A/2-11/1-2012-0001 “National Excellence Program Elaborating and operating an inland student and researcher personal support system.” The project was subsidized by the European Union and cofinanced by the European Social Fund.