Abstract

Although most research in density-based clustering algorithms focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even with overlapping features, density-based clustering methods do not define a probabilistic model of data. Therefore, it is hard to determine how “good” clustering, predicting, and clustering new data into existing clusters are. Therefore, a probability model for overlap density-based clustering is a critical need for large data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling the overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster. It can also predict the overlapping of cluster with existing clusters. Bayesian-OverDBC has been compared with other algorithms (nonoverlapping and overlapping models). The results show that Bayesian-OverDBC can be significantly better than other methods in analyzing microarray data.

1. Introduction

Clustering, that is, finding similar groups of objects in a dataset, is an interesting technique especially for large data. Usually clustering algorithms assume that every object must belong to one and only one cluster (single-membership), but there are several real situations in which objects belong to more than one group (overlapping or multiple-membership). One of the applications of overlapping clustering is in bioinformatics. In biology, genes have more than one function carried out by coding proteins that participate in multiple metabolic pathways. Therefore, overlapping clustering could be useful in microarray data, which assigns gene expression data to multiple clusters simultaneously [1].

Density-based clustering can find clusters of different shapes so that they are useful in finding overlapped clusters. Furthermore, it is rather robust concerning outliers [2] and is very effective in clustering microarray data. These methods, even with the ability of finding overlapping clusters, do not use a probabilistic model. So, it is difficult to determine the probability of events and to compare an overlapping method with other methods. Therefore, a probability density-based clustering model, which provides overlapping, is required.

In this paper, the Bayesian-OverDBC algorithm is presented. This algorithm is a novel density-based clustering algorithm that has several advantages over traditional algorithms. It defines a probabilistic model of data which can be used to predict distribution of overlapping clusters. Bayesian hypothesis could be tested to determine which of the clusters is an overlapping cluster and which ones are merged or even discarded. Therefore, the algorithm may be interpreted as a Dirichlet Process Mixture (DPM) model.

Bayesian-OverDBC is based on OverDBC [3]. In OverDBC, initial cores (points with high density values) are formed based on density functions. Clusters are formed around the core objects and can be improve through local search. These steps are also taken in Bayesian-OverDBC. But in this algorithm, the decision to create, merge, or delete overlapping clusters is made by using probabilistic models and Bayesian hypotheses. Similar work has been done by Heller and Ghahramani [4] for modeling overlapping clusters (IOMM). This method uses an exponential distribution to model each cluster and creates overlap clusters using the product of distributions.

Evaluation results show that the Bayesian-OverDBC algorithm could find overlapped clusters and works more effectively than DBSCAN (a nonoverlapping density-based clustering) and IBP (an overlap clustering model) in microarray data. Obviously, this method can be generalized to other datasets in different applications.

The main contributions of the paper can be summarized as follows:(1)It introduces a density function to find probable core objects.(2)It introduces a probabilistic Bayesian model for overlapping density-based algorithm. The traditional density-based algorithms do not define a probabilistic model of data, so comparison with other models is hard.(3)It introduces new parameters which affect overlapping and the possibility of their occurrence.

The rest of the paper is organized as follows. At first, in Section 2, we give a brief overview of some of the clustering methods (overlapping and nonoverlapping methods). In Section 3, the concepts of density-based clustering methods are reviewed. Section 4 includes concepts of the new Bayesian model. Bayesian-OverDBC is described in Section 5. In Section 6, the results of the evaluation synthetic microarray-like datasets and real datasets and also a comparison with other methods are described.

Different clustering methods are introduced in statistics, machine learning, and data mining. The idea of multiple-membership clustering has recently emerged as an important topic in some research areas. Multiple-membership clustering methods were divided into three categories [5]: Soft Models, Multiple-Membership Extensions to Hierarchical Agglomerative Clustering, and Similarity-Space Additive Clustering. In the following, these multiple-membership clustering techniques and their features are generally reviewed.(1)Soft Models: soft model algorithms allow a point to be a partial member of some or all clusters. There are two primary methods for soft clustering: soft -means [6] and SVD-like matrix decompositions [7].(2)Multiple-Membership Extensions to Hierarchical Agglomerative Clustering (HAC): HAC is a simple clustering algorithm and has served as the starting point for several multi-membership clustering algorithms. “Jardine-Sibson B-clustering and Articulation Point Cuts” [8] and “Pyramid Hierarchical Clustering” [9] are a straightforward extension of single-link agglomerative clustering.(3)Similarity-Space Additive Clustering: ADCLUS [10] is an additive method for modeling similarity matrices. ADCLUS provides a weight for each cluster which is convenient for interpretation and discards unimportant clusters.

In [11], a probabilistic model of a microarray dataset is proposed. This method (SBK) models each observed expression value as a sample drawn from a Gaussian sample. The mean is a sum of real-valued activations of the processes that a gene participates in. The problem then is to find (binary membership matrix) and (real-valued activity matrix) so as to maximize the joint probability (; ; ), where is the input data. This paper demonstrates the application of the algorithm on the yeast stress response dataset finding that the discovered overlapping clusters have much better performance (as determined by value) than clusters discovered by other overlapping methods.

SBK uses the expectation-maximization method [12], so it has the existing problems in this area such as the local maximum. In addition, the algorithm needs to define convergence threshold. Determining the threshold value is highly sensitive to data and may directly affect the convergence or nonconvergence of the algorithm. Also, the algorithm requires an automatic startup process, so it requires an initial value for the cluster membership matrix. The initial value usually is the output of the -means or hierarchical clustering algorithms. All of these algorithms, in initialization phase, increase time complexity and space requirements.

Cheng and Church in [13] give a biclustering (coclustering) algorithm for finding biclusters in microarray data. A bicluster is a submatrix (rows and columns ) that minimizes some objectives such as MSR (mean square residue). In [14] a Bayesian biclustering method is introduced which is named BCC. It allows a mixed membership to row and column clusters. BCC uses separate Dirichlet priors over the mixed membership and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. Some advantages of BCC are the following: the ability to handle sparse collections, being usable to diverse data types for all exponential family distributions, and flexible Bayesian priors using Dirichlet distributions; none of [13] or [14] provides overlapping functionality for clusters.

In [15] a probabilistic nonparametric Bayesian model for finding multiple clusters is introduced. This model can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. This model allows for not only learning the multiple clustering but also automatically learning the number of views and the number of clusters in each view.

This model and a similar model in [16] both assume that the features in each view are not overlapping. However, in many applications, some features may be shared among views. In other words, although the concept of multifeature clustering has been considered, the models are not able to find overlapping clusters.

A new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters, is presented in [4]. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions. The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters are modeled using an Indian Buffet Process (IBP) [17].

IOMM is implemented using a sampling method with a high repetition rate which needs a large time. Moreover, IOMM sampling method accepts all samples; the convergence of the algorithm is not provable in some datasets. In the next section some details of traditional density-based clustering algorithms, like DBSCAN, are reviewed. It also describes some of OverDBC features, that is, a density-based algorithm able to find overlapping clusters.

3. Traditional Density-Based Clustering

The key idea of density-based clustering is that each object in a cluster defines the neighborhood of a given radius with at least a minimum number of objects. Density-based clustering discovers clusters of arbitrary shapes in spatial databases with noise. Here density can be defined as the number of points within a specified radius. Density-based clustering techniques include mainly three techniques: DBSCAN (Density-Based Spatial Clustering of Application with Noise) [18], OPTICS (Ordering Points to Identify the Clustering Structure) [19], and DENCLUE (Density Clustering) [20].

The method presented in this paper (and also in OverDBC) uses the concepts of DBSCAN for clustering. So, some of the features of this algorithm are described. To find a cluster, DBSCAN starts with an arbitrary point and retrieves all points density-reachable from . An object is directly density-reachable from object if is within the -neighborhood of and is a core point. This procedure yields a cluster around the . If is a border point (points on the border of the cluster), no points are density-reachable from and DBSCAN visits the next point of the database.

There are several limitations to the traditional DBSCAN algorithm. The algorithm provides no guide to choosing the “correct” number of clusters. The quality of DBSCAN depends on the distance measure used in the algorithm. It is often difficult to know which distance metric to choose, especially for special data such as images or sequences and also for high-dimensional data. DBSCAN is not entirely deterministic; border points that are reachable from more than one cluster can be part of either cluster. This situation does not arise often but it is not inevitable.

OverDBC was introduced in Figure 1. It is a density-based algorithm for finding overlapping clusters which is based on DBSCAN. OverDBC allows objects to have multimembership in a restricted number of clusters where the total number of clusters is unbounded. In [3] it is proved that OverDBC is significantly better than nonoverlapping clustering algorithm such as DBSCAN in microarray data.

Traditional density-based algorithms do not define a probabilistic model of the data, so it is hard to ask how “good” a clustering is. Also, it is hard to compare this traditional method to other models, make predictions, or even cluster new data into existing clusters. In the following sections, statistical inference is used to overcome these limitations in OverDBC.

4. Bayesian-OverDBC Model

In this section, Bayesian-OverDBC is presented. It defines a probabilistic model of data which can predict the distribution of overlapping clusters. This model obtains the probability of overlap between a new cluster with previous clusters. If the overlapping probability with previous clusters is low, local search is carried out and a new cluster is formed. But, if the overlapping probability is high, will be invoked.

This function determines a lower bound on the number of shared objects of two clusters drown from a given dataset. It is defined based on double counting theory [21] and provides great improvement in overlap clustering. compares the new cluster with all previous overlapping clusters. If there is a large number of overlap data points, these clusters are merged.

To get the overlap probability, effective parameters and variables must be specified. A Bayesian graphical model can clearly show these relationships. Definition of variables, parameters, and hyperparameters in a graphical model is discussed in inferential statistics in which the value of a latent variable can be inferred based on the values of other variables.

In this paper, the overlap among clusters is shown with a binary matrix including rows and columns. If cluster th and cluster th overlap, then . One of the effective factors on overlapping of the th cluster is the dataset which is under investigation to be used for the th cluster formation. has some parameters based on which data are distributed (as shown with ). Most of the data, especially microarray data, follow a normal distribution; so the parameters of are the mean and the variance of data (in a vector form).

In addition to the data distribution in , clusters that were created before the th cluster can influence overlapping or nonoverlapping of cluster th. If is considered a symbol for each of the previous clusters, can have a value between 1 and . Hypothesis is defined by and it shows that the data in are independent. Alternative hypothesis, indicates that the data in are not independent and can be associated with two or more cores of clusters. The above idea is inspired by the assumption introduced in Heller and Ghahramani [22] for Bayesian hierarchical clustering.

Transaction matrix () is another effective variable on overlapping. is a binary matrix (a pattern of 0’s and 1’s) showing the membership of points in clusters. The parameter indicates the attraction probability for each of the core objects. will influence the value of transaction matrix and is considered as , where is the number of clusters. Each of the shows the attraction of the th core and consequently the presence probability of data objects in the th cluster.

Based on the above parameters, a Bayesian graphical model for overlapping clusters is presented in Algorithm 1. This graphical model is a head-to-head Bayesian model [23], in which a node has multiple parents which are independent of each other.

Overlapping Density Based Clustering Algorithm (OverDBC)
Input: Expression Matrix ()
output: Overlap clusters set ()
//phase 1: For each two point and in
  Find the value of Similarity matrix ()
//phase 2: Find_core_list();
//phase 3: For all of core_list do
     = next_cluster
    expandcluster(, , Neighbors);
    add to
//phase 4: Func_bound_over();
Expandcluster ( , , Neighbors)
= Link list.new ( );
For each point in neighbors
   If is in Volume()
  Neighbors = neighbors neighbors’
Return
Find_core_list ( )
Core_list = undefined
 For each in
If Density() > avg_Density
  Core_list.insert();
 Core_list. Sort( ) base on Closeness Centrality value
 Return Sorted Core_list;
Func_bound_over ( )
 Compute (the maximum number of overlap point)
For all , in C
If  >=  then
= merge();
Delete from
Return

This model shows the interaction among variables and parameters that affect overlap density-based clustering algorithms. As the graphic model Figure 1 shows that affected variables on overlapping , , and are independent. By considering an occurrence of transaction matrix , creation of clusters , and dataset , the probability of the overlap of th cluster with the th cluster () is computed by

In Section 4.1, we will show how the finite mixture model concepts can be used to compute . The computation of , data distribution , and conditional probability will be described in Section 4.2.

4.1. Probability of Transaction Matrix

The computation of has been done based on the finite mixture model [24]. In finite mixture models we assume that there are cores, each associated with a parameter , the attraction value of core th for all data points in . According to the graphical model in Figure 1,   is computed by

In finite mixture model, objects and cores are defined. The fact that object belongs to cluster is indicated by a binary variable . Each object may belong to multiple clusters, so th row of does not have any restrictions. The (, ), thus, forms a binary transaction matrix (). We will assume that each object belongs to cluster with probability of ; therefore, the clusters are generated independently. Under this model, given , the conditional probability of the matrix with having is computed by where is the number of data points that are in the neighborhood of core th and have a radius of less than . In the following, computation methods for and will be described.

In order to find , first we should compute distance between two points and (). can be computed by Euclidean distance between and th core, or it may be a distance measure which is obtained from Pearson correlation coefficient. In this paper, the Pearson correlation coefficient is used.

Lower indicates more correlation with the th core and, hence, the greater density of the core object. The parameter is the computed standard deviation for all data points in the neighborhood region of th core. A core density function of an object is the impact of all the data points on its neighborhood. For each of the points in the neighborhood of , the density function is defined by the following [25]:

So (the attraction value of core th for all data points in ) is the sum of the probability density functions for all points, which is specified in the following [25]: We define as the event of the attraction by the th core. So is specified in the following: If we assume that a prior on follows a beta distribution with the parameters and and is conjugate to the binomial. The probability of any under the Beta distribution and the concept of Bayesian inference [24] is given by where is the beta function and is computed byIf in (8) and ( is the concentration parameter for each core density and is the number of cores) are assumed then (7) is rewritten asEquation (8) is achieved by exploiting the recursive definition of the gamma function, where we have used the fact that for [24]. So (7) can be rewritten as

4.2. Probability Model of Clusters and Data Distribution

According to the Bayesian model presented in the previous section, computing methods of and will be described in this section. As defined in Section 4, the hypothesis shows that all the data in cluster are in fact generated independently and only belong to cluster . The alternative hypothesis states that data in cluster may belong to two or more clusters. Obviously, the relation (11) exists between and :Thus, considering the graphical model (Figure 1),   is computed by is the prior probability of . To compute from (12), first, is computed. If represents a point and is the value of transaction matrix for th point and th core (whose value is zero or one) then the expected number of presence in different clusters () will be computed byThe greater value for shows the more probability of the presence of in some clusters. If is computed for all expected in , then is obtained by

can be obtained based on (11) and (14).

To compute , the IBP model [24] will be used. IBP is a simple generative process obtained from the case of customers eating from Indian buffets. customers (i.e., data points in our clustering model) line up on one side of an Indian buffet with infinite number of dishes (i.e., clusters). The first customer serves himself from Poisson () dishes ( is the concentration parameter of clusters). The next customers serve themselves dishes in proportion to the dish popularity, such that customer serves herself the dish with probability , where is the number of previous customers which had served themselves with dish . is obtained in By using the IBP model, is computed bywhere is the number of objects in . is defined as a new symbol for the expected number of objects in presented in multiple clusters; the value of is equal to the number of in which for them is satisfied. Therefore, is computed by

Based on (17), It is clear that the greater value for reduces the probability of formation. can be obtained in a similar way in (15). By placing (14) and (15) in (12), the value of will be computed.

In the following, the computation of will be described. Graphical model in Figure 1 represents a dataset , which is generated independently and uniquely from a probability model with vector parameters . Each of the is a one-dimensional vector. Generally, microarray data (which are used for the evaluation algorithm) have a normal distribution, so could be normal distribution parameters ( and vectors) which are the median and variance, respectively. By using dataset , the conditional probability of can be computed by using the following [25]:

In the graphic model, some of the variables may be latent or unobserved. For example we might not know the mean and variance of the Gaussian distribution which generated our data, and we may also be interested in inferring these values. If there is information about , values of hidden variables could be inferred using Variational Bayes method [26].

To complete the last part of (1), that is, the overlap probability for th and th clusters, is computed based on the number of data points expected to be present in two clusters simultaneously by the following:

In (16), is the expected number of points in , is the expected number of points in , and is the expected number of points in both and . These parameters are computed using transaction matrix ().

By computing in (1), prediction of overlap degree between the new cluster and all other previous clusters is possible. In Section 5, this prediction will be used to provide an overlap Bayesian clustering algorithm.

5. Bayesian-OverDBC Algorithm

In this section, Bayesian-OverDBC algorithm will be introduced. This algorithm defines a probability model for the data which can be used to predict the distribution of overlap clusters. This algorithm completes OverDBC algorithm. More details of OverDBC are in [3]. OverDBC consists of four phases which are as follows:(1)Selection of the original core points.(2)Density estimation and determining whether a selected point is really a core or not.(3)Improving clustering by using local search around core points.(4)Merging clusters if it is possible (in case clusters have excessive same genes).

The first phases () in Bayesian-OverDBC are the same as in OverDBC. The primary difference of these two algorithms is in phase . In this phase, based on Bayesian model shown in Figure 1, the overlap probability of a new cluster is computed in comparison to all other previous clusters. If the overlap probability is smaller than , the local search is continued around the core and a new cluster is formed. Value of will be determined by trial and error method on the dataset. If the overlap probability is more than , is invoked. This function determines a lower bound on the number of shared objects of two clusters drown from a given dataset. The is defined based on double counting theory [21] and provides great improvement in overlap clustering. Output of the function is represented as . If the number of overlap points of two clusters is greater than , the two clusters should be merged to form a larger cluster. Obviously, with these changes the membership matrix is also changed.

Bayesian-OverDBC (Algorithm 2) has many advantages over traditional density-based clustering methods. It defines a probabilistic model of data which can be used to predicate distribution of overlap clusters. Bayesian hypothesis testing could be used to decide which of the clusters exists as overlap clusters and which one merges or even is discarded. In the next section the results of comparison of Bayesian-OverDBC with other algorithms will be described.

Baysian Overlapping Density Based Clustering Algorithm (OverDBC)
Input: Expression Matrix () , Data model
Output: Bayesian overlap clusters, (membership-matrix).
New cluster may be merged based on () probability.
//phase 1
(1) compute transaction matrix ()
//phase 2
(2) Find Core genes based on Density and Closeness Centrality
(3) Add gene to Core genes () based on density and Cc relations.
//phase 3
(4) For  All in (Set of Core Object) Repeat:
(5) If    Start Local Search to find nearest neighbors , Save cluster .
  Else For   to
   Based above probability select one of these paths:
    if    then Start local search to construct new cluster .
    Else  invoke func_bound_over( ) and return results
   End of For
End of If

6. Evaluation

Our evaluation experiments were performed on two different types of data: synthetic microarray-like data and real dataset of microarray. By using microarray techniques; it is possible to measure the expression levels of thousands of genes under several experimental conditions. Microarray data provide a lot of information about the molecule transaction in genome level, which is important for gene regulatory network detection. In a formal representation, microarray data were represented as a matrix. Rows represent genes and columns represent conditions. th and th matrix member shows the expression level of gene in condition . In [11], apart from demonstrating their approach on gene microarray data and evaluating standard biology databases, they also showed results on microarray-like synthetic data. We employed three synthetic datasets of different sizes:(1)Small-synthetic dataset: a dataset with .(2)Medium-synthetic dataset: a dataset with .(3)Large-synthetic dataset: a dataset with .

Bayesian OverDBC has been evaluated on two real datasets of microarray gene expression data. The algorithm has been implemented on the Arabidopsis thaliana abiotic stress dataset (DS1) [27] and on the yeast cell cycle dataset (DS2) [28].

DS1 is a 3D dataset from multiple sclerosis patients which has been published in 2003. The condition dimension consisted of 13 multiple-sclerosis patients, monitored over 7 time points after IFN- injection. The Arabidopsis thaliana datasets were composed of different abiotic stress stimulus experiments conducted in the root and shoot tissue.

DS2 was extracted from a dataset that shows the fluctuation of expression levels of approximately 6000 genes over two cell cycles (17 time points).

To evaluate the clustering results, precision, recall, and -measure were calculated over pairs of points. These measures try to determine whether the prediction of the pair existing in the same cluster was correct with respect to the underlying true categories in the data. Precision is calculated as the fraction of pairs correctly put in the same cluster. Recall is the fraction of actual pairs that were identified. -measure is the harmonic mean of precision and recall.

We compared Bayesian-OverDBC results with DBSCAN, which can only assign each object to a single cluster. We compared these algorithms using an score, which takes into accounts both precision and recall and which can be computed from the true gene assignments to clusters. Also, we compared Bayesian-OverDBC with IOMM that allows genes to belong to multiple overlap clusters (Table 1).

The first column is the name of the datasets, the second column is precision value, and the third and fourth columns are recall and measure.

Although in a few positions the values of precision or recall for Bayesian-OverDBC is lower than other algorithms, the measure has higher values in comparison with other methods indicating the good performance of Bayesian-OverDBC algorithm.

We compared our method with IOMM using omega index. The omega index extends the Adjusted Rand Index (ARI) [29] to overlapping clustering [30]. In addition to counting the number of common pairs occurring together in 0 cluster or 1 cluster, Omega index also counts the number of pairs occurring together in clusters. Using the terms from Table 2 omega index () and expected omega () are computed by the following, respectively:Table 2 shows parameters used in omega index. It contains symbols which are required to compare two methods of clustering.

In Table 2, clusters in the first algorithm (Bayesian-OverDBC) form rows and clusters in the second algorithm (IOMM) form columns. So, is the number of clusters in the first algorithm and is the number of clusters in the second algorithm. In this table, is the number of objects which are in th cluster by method 1 and in th cluster be method 2. ( as a brief form) is the number of objects which are the same clusters by method 1 and method 2. More details about omega index are in [30]. The omega index requires an adjustment to remove clusters sharing the same number of labels by chance which is computed by (22)

Of the other metrics such as NMI, PNMI, and aligned NMI [30], the omega index gives the most optimistic measure of multiple-membership similarity. We compared Bayesian-OverDBC and IOMM using omega index and we found for DS1 and for DS2. It indicates that Bayesian-OverDBC assigns data points to overlap clusters in a similar way with IOMM.

These results also show that Bayesian-OverDBC is an effective density-based method for overlap clustering and its performance in finding relevant pairs is very similar to or even better than IOMM. Furthermore, IOMM sampler should be run for 2000–3000 iterations. Time complexity of IOMM is and the time complexity of Bayesian-OverDBC is . As a result, Bayesian-OverDBC has better performance in time complexity compared to IOMM.

7. Discussion

This paper explained Bayesian-OverDBC which is a new density-based clustering method modelling overlapped clusters. The Bayesian-OverDBC extends traditional density-based model using probabilistic method to find and predict overlap clusters. While most of the research in this area has focused on disjoint clustering, many real microarray datasets, and as a result many gene regulatory networks, have inherent overlapping partitions. Density-based clustering methods, even with the ability of producing overlapping clusters, do not use a probabilistic model. So, it is difficult to determine probability of events and to compare an overlapping method with other methods. Therefore, a probability density-based clustering model, which provides overlapping, is required. It is proved that Bayesian overlapping clustering may be significantly better than other similar methods of clustering. As overlapping clustering is a still-developing field, there are several subjects for future development such as techniques for visualization and interpretation, new algorithms and new means of comparison, and techniques for model selection.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.