Bayesian-OverDBC: A Bayesian Density-Based Approach for Modeling Overlapping Clusters

Mirzaie, Mansooreh; Barani, Ahmad; Nematbakkhsh, Naser; Mohammad-Beigi, Majid

doi:https://doi.org/10.1155/2015/187053

Mathematical Problems in Engineering

On this page

Abstract Introduction Related Work Evaluation Discussion References Copyright Related Articles

Research Article | Open Access

Volume 2015 | Article ID 187053 | https://doi.org/10.1155/2015/187053

Bayesian-OverDBC: A Bayesian Density-Based Approach for Modeling Overlapping Clusters

Mansooreh Mirzaie,^1,2Ahmad Barani,²Naser Nematbakkhsh,²and Majid Mohammad-Beigi³

Academic Editor: Huaguang Zhang

Received18 Mar 2015

Revised14 Jun 2015

Accepted21 Oct 2015

Published13 Dec 2015

Abstract

Although most research in density-based clustering algorithms focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even with overlapping features, density-based clustering methods do not define a probabilistic model of data. Therefore, it is hard to determine how “good” clustering, predicting, and clustering new data into existing clusters are. Therefore, a probability model for overlap density-based clustering is a critical need for large data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling the overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster. It can also predict the overlapping of cluster with existing clusters. Bayesian-OverDBC has been compared with other algorithms (nonoverlapping and overlapping models). The results show that Bayesian-OverDBC can be significantly better than other methods in analyzing microarray data.

1. Introduction

Clustering, that is, finding similar groups of objects in a dataset, is an interesting technique especially for large data. Usually clustering algorithms assume that every object must belong to one and only one cluster (single-membership), but there are several real situations in which objects belong to more than one group (overlapping or multiple-membership). One of the applications of overlapping clustering is in bioinformatics. In biology, genes have more than one function carried out by coding proteins that participate in multiple metabolic pathways. Therefore, overlapping clustering could be useful in microarray data, which assigns gene expression data to multiple clusters simultaneously [1].

Density-based clustering can find clusters of different shapes so that they are useful in finding overlapped clusters. Furthermore, it is rather robust concerning outliers [2] and is very effective in clustering microarray data. These methods, even with the ability of finding overlapping clusters, do not use a probabilistic model. So, it is difficult to determine the probability of events and to compare an overlapping method with other methods. Therefore, a probability density-based clustering model, which provides overlapping, is required.

In this paper, the Bayesian-OverDBC algorithm is presented. This algorithm is a novel density-based clustering algorithm that has several advantages over traditional algorithms. It defines a probabilistic model of data which can be used to predict distribution of overlapping clusters. Bayesian hypothesis could be tested to determine which of the clusters is an overlapping cluster and which ones are merged or even discarded. Therefore, the algorithm may be interpreted as a Dirichlet Process Mixture (DPM) model.

Bayesian-OverDBC is based on OverDBC [3]. In OverDBC, initial cores (points with high density values) are formed based on density functions. Clusters are formed around the core objects and can be improve through local search. These steps are also taken in Bayesian-OverDBC. But in this algorithm, the decision to create, merge, or delete overlapping clusters is made by using probabilistic models and Bayesian hypotheses. Similar work has been done by Heller and Ghahramani [4] for modeling overlapping clusters (IOMM). This method uses an exponential distribution to model each cluster and creates overlap clusters using the product of distributions.

Evaluation results show that the Bayesian-OverDBC algorithm could find overlapped clusters and works more effectively than DBSCAN (a nonoverlapping density-based clustering) and IBP (an overlap clustering model) in microarray data. Obviously, this method can be generalized to other datasets in different applications.

The main contributions of the paper can be summarized as follows:(1)It introduces a density function to find probable core objects.(2)It introduces a probabilistic Bayesian model for overlapping density-based algorithm. The traditional density-based algorithms do not define a probabilistic model of data, so comparison with other models is hard.(3)It introduces new parameters which affect overlapping and the possibility of their occurrence.

The rest of the paper is organized as follows. At first, in Section 2, we give a brief overview of some of the clustering methods (overlapping and nonoverlapping methods). In Section 3, the concepts of density-based clustering methods are reviewed. Section 4 includes concepts of the new Bayesian model. Bayesian-OverDBC is described in Section 5. In Section 6, the results of the evaluation synthetic microarray-like datasets and real datasets and also a comparison with other methods are described.

Different clustering methods are introduced in statistics, machine learning, and data mining. The idea of multiple-membership clustering has recently emerged as an important topic in some research areas. Multiple-membership clustering methods were divided into three categories [5]: Soft Models, Multiple-Membership Extensions to Hierarchical Agglomerative Clustering, and Similarity-Space Additive Clustering. In the following, these multiple-membership clustering techniques and their features are generally reviewed.(1)Soft Models: soft model algorithms allow a point to be a partial member of some or all clusters. There are two primary methods for soft clustering: soft -means [6] and SVD-like matrix decompositions [7].(2)Multiple-Membership Extensions to Hierarchical Agglomerative Clustering (HAC): HAC is a simple clustering algorithm and has served as the starting point for several multi-membership clustering algorithms. “Jardine-Sibson B-clustering and Articulation Point Cuts” [8] and “Pyramid Hierarchical Clustering” [9] are a straightforward extension of single-link agglomerative clustering.(3)Similarity-Space Additive Clustering: ADCLUS [10] is an additive method for modeling similarity matrices. ADCLUS provides a weight for each cluster which is convenient for interpretation and discards unimportant clusters.

In [11], a probabilistic model of a microarray dataset is proposed. This method (SBK) models each observed expression value as a sample drawn from a Gaussian sample. The mean is a sum of real-valued activations of the processes that a gene participates in. The problem then is to find (binary membership matrix) and (real-valued activity matrix) so as to maximize the joint probability (; ; ), where is the input data. This paper demonstrates the application of the algorithm on the yeast stress response dataset finding that the discovered overlapping clusters have much better performance (as determined by value) than clusters discovered by other overlapping methods.

SBK uses the expectation-maximization method [12], so it has the existing problems in this area such as the local maximum. In addition, the algorithm needs to define convergence threshold. Determining the threshold value is highly sensitive to data and may directly affect the convergence or nonconvergence of the algorithm. Also, the algorithm requires an automatic startup process, so it requires an initial value for the cluster membership matrix. The initial value usually is the output of the -means or hierarchical clustering algorithms. All of these algorithms, in initialization phase, increase time complexity and space requirements.

Cheng and Church in [13] give a biclustering (coclustering) algorithm for finding biclusters in microarray data. A bicluster is a submatrix (rows and columns ) that minimizes some objectives such as MSR (mean square residue). In [14] a Bayesian biclustering method is introduced which is named BCC. It allows a mixed membership to row and column clusters. BCC uses separate Dirichlet priors over the mixed membership and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. Some advantages of BCC are the following: the ability to handle sparse collections, being usable to diverse data types for all exponential family distributions, and flexible Bayesian priors using Dirichlet distributions; none of [13] or [14] provides overlapping functionality for clusters.

In [15] a probabilistic nonparametric Bayesian model for finding multiple clusters is introduced. This model can discover several possible clustering solutions and the feature subset views that generated each cluster partitioning simultaneously. This model allows for not only learning the multiple clustering but also automatically learning the number of views and the number of clusters in each view.

This model and a similar model in [16] both assume that the features in each view are not overlapping. However, in many applications, some features may be shared among views. In other words, although the concept of multifeature clustering has been considered, the models are not able to find overlapping clusters.

A new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters, is presented in [4]. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions. The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters are modeled using an Indian Buffet Process (IBP) [17].

IOMM is implemented using a sampling method with a high repetition rate which needs a large time. Moreover, IOMM sampling method accepts all samples; the convergence of the algorithm is not provable in some datasets. In the next section some details of traditional density-based clustering algorithms, like DBSCAN, are reviewed. It also describes some of OverDBC features, that is, a density-based algorithm able to find overlapping clusters.

3. Traditional Density-Based Clustering

The key idea of density-based clustering is that each object in a cluster defines the neighborhood of a given radius with at least a minimum number of objects. Density-based clustering discovers clusters of arbitrary shapes in spatial databases with noise. Here density can be defined as the number of points within a specified radius. Density-based clustering techniques include mainly three techniques: DBSCAN (Density-Based Spatial Clustering of Application with Noise) [18], OPTICS (Ordering Points to Identify the Clustering Structure) [19], and DENCLUE (Density Clustering) [20].

The method presented in this paper (and also in OverDBC) uses the concepts of DBSCAN for clustering. So, some of the features of this algorithm are described. To find a cluster, DBSCAN starts with an arbitrary point and retrieves all points density-reachable from . An object is directly density-reachable from object if is within the -neighborhood of and is a core point. This procedure yields a cluster around the . If is a border point (points on the border of the cluster), no points are density-reachable from and DBSCAN visits the next point of the database.

There are several limitations to the traditional DBSCAN algorithm. The algorithm provides no guide to choosing the “correct” number of clusters. The quality of DBSCAN depends on the distance measure used in the algorithm. It is often difficult to know which distance metric to choose, especially for special data such as images or sequences and also for high-dimensional data. DBSCAN is not entirely deterministic; border points that are reachable from more than one cluster can be part of either cluster. This situation does not arise often but it is not inevitable.

OverDBC was introduced in Figure 1. It is a density-based algorithm for finding overlapping clusters which is based on DBSCAN. OverDBC allows objects to have multimembership in a restricted number of clusters where the total number of clusters is unbounded. In [3] it is proved that OverDBC is significantly better than nonoverlapping clustering algorithm such as DBSCAN in microarray data.

Traditional density-based algorithms do not define a probabilistic model of the data, so it is hard to ask how “good” a clustering is. Also, it is hard to compare this traditional method to other models, make predictions, or even cluster new data into existing clusters. In the following sections, statistical inference is used to overcome these limitations in OverDBC.

4. Bayesian-OverDBC Model

In this section, Bayesian-OverDBC is presented. It defines a probabilistic model of data which can predict the distribution of overlapping clusters. This model obtains the probability of overlap between a new cluster with previous clusters. If the overlapping probability with previous clusters is low, local search is carried out and a new cluster is formed. But, if the overlapping probability is high, will be invoked.

This function determines a lower bound on the number of shared objects of two clusters drown from a given dataset. It is defined based on double counting theory [21] and provides great improvement in overlap clustering. compares the new cluster with all previous overlapping clusters. If there is a large number of overlap data points, these clusters are merged.

To get the overlap probability, effective parameters and variables must be specified. A Bayesian graphical model can clearly show these relationships. Definition of variables, parameters, and hyperparameters in a graphical model is discussed in inferential statistics in which the value of a latent variable can be inferred based on the values of other variables.

In this paper, the overlap among clusters is shown with a binary matrix including rows and columns. If cluster th and cluster th overlap, then . One of the effective factors on overlapping of the th cluster is the dataset which is under investigation to be used for the th cluster formation. has some parameters based on which data are distributed (as shown with ). Most of the data, especially microarray data, follow a normal distribution; so the parameters of are the mean and the variance of data (in a vector form).

In addition to the data distribution in , clusters that were created before the th cluster can influence overlapping or nonoverlapping of cluster th. If is considered a symbol for each of the previous clusters, can have a value between 1 and . Hypothesis is defined by and it shows that the data in are independent. Alternative hypothesis, indicates that the data in are not independent and can be associated with two or more cores of clusters. The above idea is inspired by the assumption introduced in Heller and Ghahramani [22] for Bayesian hierarchical clustering.

Transaction matrix () is another effective variable on overlapping. is a binary matrix (a pattern of 0’s and 1’s) showing the membership of points in clusters. The parameter indicates the attraction probability for each of the core objects. will influence the value of transaction matrix and is considered as , where is the number of clusters. Each of the shows the attraction of the th core and consequently the presence probability of data objects in the th cluster.

Based on the above parameters, a Bayesian graphical model for overlapping clusters is presented in Algorithm 1. This graphical model is a head-to-head Bayesian model [23], in which a node has multiple parents which are independent of each other.

Overlapping Density Based Clustering Algorithm (OverDBC)
Input: Expression Matrix ()
output: Overlap clusters set ()
//phase 1: For each two point and in
Find the value of Similarity matrix ()
//phase 2: Find_core_list();
//phase 3: For all of core_list do
= next_cluster
expandcluster(, , Neighbors);
add to
//phase 4: Func_bound_over();

Expandcluster ( , , Neighbors)
= Link list.new ( );
For each point in neighbors
If is in Volume()
Neighbors = neighbors neighbors’
Return
Find_core_list ( )
Core_list = undefined
For each in
If Density() > avg_Density
Core_list.insert();
Core_list. Sort( ) base on Closeness Centrality value
Return Sorted Core_list;
Func_bound_over ( )
Compute (the maximum number of overlap point)
For all , in C
If >= then
= merge();
Delete from
Return

This model shows the interaction among variables and parameters that affect overlap density-based clustering algorithms. As the graphic model Figure 1 shows that affected variables on overlapping , , and are independent. By considering an occurrence of transaction matrix , creation of clusters , and dataset , the probability of the overlap of th cluster with the th cluster () is computed by

In Section 4.1, we will show how the finite mixture model concepts can be used to compute . The computation of , data distribution , and conditional probability will be described in Section 4.2.

4.1. Probability of Transaction Matrix

The computation of has been done based on the finite mixture model [24]. In finite mixture models we assume that there are cores, each associated with a parameter , the attraction value of core th for all data points in . According to the graphical model in Figure 1, is computed by

In finite mixture model, objects and cores are defined. The fact that object belongs to cluster is indicated by a binary variable . Each object may belong to multiple clusters, so th row of does not have any restrictions. The (, ), thus, forms a binary transaction matrix (). We will assume that each object belongs to cluster with probability of ; therefore, the clusters are generated independently. Under this model, given , the conditional probability of the matrix with having is computed by where is the number of data points that are in the neighborhood of core th and have a radius of less than . In the following, computation methods for and will be described.

In order to find , first we should compute distance between two points and (). can be computed by Euclidean distance between and th core, or it may be a distance measure which is obtained from Pearson correlation coefficient. In this paper, the Pearson correlation coefficient is used.

Lower indicates more correlation with the th core and, hence, the greater density of the core object. The parameter is the computed standard deviation for all data points in the neighborhood region of th core. A core density function of an object is the impact of all the data points on its neighborhood. For each of the points in the neighborhood of , the density function is defined by the following [25]:

So (the attraction value of core th for all data points in ) is the sum of the probability density functions for all points, which is specified in the following [25]: We define as the event of the attraction by the th core. So is specified in the following: If we assume that a prior on follows a beta distribution with the parameters and and is conjugate to the binomial. The probability of any under the Beta distribution and the concept of Bayesian inference [24] is given by where is the beta function and is computed byIf in (8) and ( is the concentration parameter for each core density and is the number of cores) are assumed then (7) is rewritten asEquation (8) is achieved by exploiting the recursive definition of the gamma function, where we have used the fact that for [24]. So (7) can be rewritten as

4.2. Probability Model of Clusters and Data Distribution

According to the Bayesian model presented in the previous section, computing methods of and will be described in this section. As defined in Section 4, the hypothesis shows that all the data in cluster are in fact generated independently and only belong to cluster . The alternative hypothesis states that data in cluster may belong to two or more clusters. Obviously, the relation (11) exists between and :Thus, considering the graphical model (Figure 1), is computed by is the prior probability of . To compute from (12), first, is computed. If represents a point and is the value of transaction matrix for th point and th core (whose value is zero or one) then the expected number of presence in different clusters () will be computed byThe greater value for shows the more probability of the presence of in some clusters. If is computed for all expected in , then is obtained by

can be obtained based on (11) and (14).

To compute , the IBP model [24] will be used. IBP is a simple generative process obtained from the case of customers eating from Indian buffets. customers (i.e., data points in our clustering model) line up on one side of an Indian buffet with infinite number of dishes (i.e., clusters). The first customer serves himself from Poisson () dishes ( is the concentration parameter of clusters). The next customers serve themselves dishes in proportion to the dish popularity, such that customer serves herself the dish with probability , where is the number of previous customers which had served themselves with dish . is obtained in By using the IBP model, is computed bywhere is the number of objects in . is defined as a new symbol for the expected number of objects in presented in multiple clusters; the value of is equal to the number of in which for them is satisfied. Therefore, is computed by

Based on (17), It is clear that the greater value for reduces the probability of formation. can be obtained in a similar way in (15). By placing (14) and (15) in (12), the value of will be computed.

In the following, the computation of will be described. Graphical model in Figure 1 represents a dataset , which is generated independently and uniquely from a probability model with vector parameters . Each of the is a one-dimensional vector. Generally, microarray data (which are used for the evaluation algorithm) have a normal distribution, so could be normal distribution parameters ( and vectors) which are the median and variance, respectively. By using dataset , the conditional probability of can be computed by using the following [25]:

In the graphic model, some of the variables may be latent or unobserved. For example we might not know the mean and variance of the Gaussian distribution which generated our data, and we may also be interested in inferring these values. If there is information about , values of hidden variables could be inferred using Variational Bayes method [26].

To complete the last part of (1), that is, the overlap probability for th and th clusters, is computed based on the number of data points expected to be present in two clusters simultaneously by the following:

In (16), is the expected number of points in , is the expected number of points in , and is the expected number of points in both and . These parameters are computed using transaction matrix ().

By computing in (1), prediction of overlap degree between the new cluster and all other previous clusters is possible. In Section 5, this prediction will be used to provide an overlap Bayesian clustering algorithm.

5. Bayesian-OverDBC Algorithm

In this section, Bayesian-OverDBC algorithm will be introduced. This algorithm defines a probability model for the data which can be used to predict the distribution of overlap clusters. This algorithm completes OverDBC algorithm. More details of OverDBC are in [3]. OverDBC consists of four phases which are as follows:(1)Selection of the original core points.(2)Density estimation and determining whether a selected point is really a core or not.(3)Improving clustering by using local search around core points.(4)Merging clusters if it is possible (in case clusters have excessive same genes).

The first phases () in Bayesian-OverDBC are the same as in OverDBC. The primary difference of these two algorithms is in phase . In this phase, based on Bayesian model shown in Figure 1, the overlap probability of a new cluster is computed in comparison to all other previous clusters. If the overlap probability is smaller than , the local search is continued around the core and a new cluster is formed. Value of will be determined by trial and error method on the dataset. If the overlap probability is more than , is invoked. This function determines a lower bound on the number of shared objects of two clusters drown from a given dataset. The is defined based on double counting theory [21] and provides great improvement in overlap clustering. Output of the function is represented as . If the number of overlap points of two clusters is greater than , the two clusters should be merged to form a larger cluster. Obviously, with these changes the membership matrix is also changed.

Bayesian-OverDBC (Algorithm 2) has many advantages over traditional density-based clustering methods. It defines a probabilistic model of data which can be used to predicate distribution of overlap clusters. Bayesian hypothesis testing could be used to decide which of the clusters exists as overlap clusters and which one merges or even is discarded. In the next section the results of comparison of Bayesian-OverDBC with other algorithms will be described.

Baysian Overlapping Density Based Clustering Algorithm (OverDBC)
Input: Expression Matrix () , Data model
Output: Bayesian overlap clusters, (membership-matrix).
New cluster may be merged based on () probability.
//phase 1
(1) compute transaction matrix ()
//phase 2
(2) Find Core genes based on Density and Closeness Centrality
(3) Add gene to Core genes () based on density and Cc relations.
//phase 3
(4) For All in (Set of Core Object) Repeat:
(5) If Start Local Search to find nearest neighbors , Save cluster .
Else For to

Based above probability select one of these paths:
if then Start local search to construct new cluster .
Else invoke func_bound_over( ) and return results
End of For
End of If

6. Evaluation

Our evaluation experiments were performed on two different types of data: synthetic microarray-like data and real dataset of microarray. By using microarray techniques; it is possible to measure the expression levels of thousands of genes under several experimental conditions. Microarray data provide a lot of information about the molecule transaction in genome level, which is important for gene regulatory network detection. In a formal representation, microarray data were represented as a matrix. Rows represent genes and columns represent conditions. th and th matrix member shows the expression level of gene in condition . In [11], apart from demonstrating their approach on gene microarray data and evaluating standard biology databases, they also showed results on microarray-like synthetic data. We employed three synthetic datasets of different sizes:(1)Small-synthetic dataset: a dataset with .(2)Medium-synthetic dataset: a dataset with .(3)Large-synthetic dataset: a dataset with .

Bayesian OverDBC has been evaluated on two real datasets of microarray gene expression data. The algorithm has been implemented on the Arabidopsis thaliana abiotic stress dataset (DS1) [27] and on the yeast cell cycle dataset (DS2) [28].

DS1 is a 3D dataset from multiple sclerosis patients which has been published in 2003. The condition dimension consisted of 13 multiple-sclerosis patients, monitored over 7 time points after IFN- injection. The Arabidopsis thaliana datasets were composed of different abiotic stress stimulus experiments conducted in the root and shoot tissue.

DS2 was extracted from a dataset that shows the fluctuation of expression levels of approximately 6000 genes over two cell cycles (17 time points).

To evaluate the clustering results, precision, recall, and -measure were calculated over pairs of points. These measures try to determine whether the prediction of the pair existing in the same cluster was correct with respect to the underlying true categories in the data. Precision is calculated as the fraction of pairs correctly put in the same cluster. Recall is the fraction of actual pairs that were identified. -measure is the harmonic mean of precision and recall.

We compared Bayesian-OverDBC results with DBSCAN, which can only assign each object to a single cluster. We compared these algorithms using an score, which takes into accounts both precision and recall and which can be computed from the true gene assignments to clusters. Also, we compared Bayesian-OverDBC with IOMM that allows genes to belong to multiple overlap clusters (Table 1).

The first column is the name of the datasets, the second column is precision value, and the third and fourth columns are recall and measure.

Although in a few positions the values of precision or recall for Bayesian-OverDBC is lower than other algorithms, the measure has higher values in comparison with other methods indicating the good performance of Bayesian-OverDBC algorithm.

We compared our method with IOMM using omega index. The omega index extends the Adjusted Rand Index (ARI) [29] to overlapping clustering [30]. In addition to counting the number of common pairs occurring together in 0 cluster or 1 cluster, Omega index also counts the number of pairs occurring together in clusters. Using the terms from Table 2 omega index () and expected omega () are computed by the following, respectively:Table 2 shows parameters used in omega index. It contains symbols which are required to compare two methods of clustering.

In Table 2, clusters in the first algorithm (Bayesian-OverDBC) form rows and clusters in the second algorithm (IOMM) form columns. So, is the number of clusters in the first algorithm and is the number of clusters in the second algorithm. In this table, is the number of objects which are in th cluster by method 1 and in th cluster be method 2. ( as a brief form) is the number of objects which are the same clusters by method 1 and method 2. More details about omega index are in [30]. The omega index requires an adjustment to remove clusters sharing the same number of labels by chance which is computed by (22)

Of the other metrics such as NMI, PNMI, and aligned NMI [30], the omega index gives the most optimistic measure of multiple-membership similarity. We compared Bayesian-OverDBC and IOMM using omega index and we found for DS1 and for DS2. It indicates that Bayesian-OverDBC assigns data points to overlap clusters in a similar way with IOMM.

These results also show that Bayesian-OverDBC is an effective density-based method for overlap clustering and its performance in finding relevant pairs is very similar to or even better than IOMM. Furthermore, IOMM sampler should be run for 2000–3000 iterations. Time complexity of IOMM is and the time complexity of Bayesian-OverDBC is . As a result, Bayesian-OverDBC has better performance in time complexity compared to IOMM.

7. Discussion

This paper explained Bayesian-OverDBC which is a new density-based clustering method modelling overlapped clusters. The Bayesian-OverDBC extends traditional density-based model using probabilistic method to find and predict overlap clusters. While most of the research in this area has focused on disjoint clustering, many real microarray datasets, and as a result many gene regulatory networks, have inherent overlapping partitions. Density-based clustering methods, even with the ability of producing overlapping clusters, do not use a probabilistic model. So, it is difficult to determine probability of events and to compare an overlapping method with other methods. Therefore, a probability density-based clustering model, which provides overlapping, is required. It is proved that Bayesian overlapping clustering may be significantly better than other similar methods of clustering. As overlapping clustering is a still-developing field, there are several subjects for future development such as techniques for visualization and interpretation, new algorithms and new means of comparison, and techniques for model selection.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

R. Harpaz and R. Haralick, “Exploiting the geometry of gene expression patterns for unsupervised learning,” in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), pp. 670–674, IEEE, Hong Kong, August 2006.
View at: Publisher Site | Google Scholar
P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2011.
M. Mirzaie, A. Barani, N. NematBakhsh, and M. Beigi, “OverDBC: a new density-based clustering method for detecting overlapped clusters from microarray data,” IDA Journal, vol. 19, no. 6, 2015.
View at: Google Scholar
K. A. Heller and Z. Ghahramani, “A nonparametric Bayesian approach to modelling overlapping clusters,” in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), vol. 2 of JMLR Workshop and Conference Proceedings, pp. 187–194, San Juan, Puerto Rico, March 2007.
View at: Google Scholar
A. H. Katherine and Z. Ghahramani, “A nonparametric Bayesian approach to modelling overlapping clusters,” in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), San Juan, Puerto Rico, March 2007.
View at: Google Scholar
X. Bai, S. Luo, and Y. Zhao, “Entropy based soft K-means clustering,” in Proceedings of the IEEE International Conference on Granular Computing (GrC '08), pp. 107–110, August 2008.
View at: Google Scholar
O. Alter, P. O. Brown, and D. Botstein, “Singular value decomposition for genome-wide expression data processing and modeling,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 18, pp. 10101–10106, 2000.
View at: Publisher Site | Google Scholar
N. Jardine and R. Sibson, “The construction of hierarchic and non-hierarchic classifications,” Computer Journal, vol. 11, no. 2, pp. 177–184, 1968.
View at: Publisher Site | Google Scholar
P. Bertrand and E. Diday, “A visual representation of the compatibility between an order and a dissimilarity index: the pyramids,” Computational Statistics Quarterly, vol. 2, no. 1, pp. 31–41, 1985.
View at: Google Scholar
R. N. Shepard and P. Arabie, “Additive clustering: representation of similarities as combinations of discrete overlapping properties,” Psychological Review, vol. 86, no. 2, pp. 87–123, 1979.
View at: Publisher Site | Google Scholar
E. Segal, A. Battle, and D. Koller, “Decomposing gene expression into cellular processes,” in Proceedings of the 8th Pacific Symposium on Biocomputing (PSB '03), pp. 89–100, Lihue, Hawaii, USA, January 2003.
View at: Google Scholar
A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society Series B: Methodological, vol. 39, no. 1, pp. 1–38, 1977.
View at: Google Scholar
Y. Cheng and G. M. Church, “Biclustering of expression data,” in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ICMB '00), pp. 93–103, La Jolla, Calif, USA, August 2000.
View at: Google Scholar
H. Shan and A. Banerjee, “Bayesian Co-clustering,” in Proceedings of the 8th IEEE International Conference on Data Mining (ICDM '08), pp. 530–539, Pisa, Italy, December 2008.
View at: Publisher Site | Google Scholar
Y. Guan, J. G. Dy, D. Niu, and Z. Ghahramani, “Variational inference for nonparametric multiple clustering,” in Proceedings of the Workshop on Discovering, Summarizing and Using Multiple Clustering at the ACM SIGKDD International Conference on Knowledge Discovering and Data Mining (MultiClust '10), 2010.
View at: Google Scholar
V. Mansinghka, E. Jonas, C. Petschulat, B. Cronin, P. Shafto, and J. Tenenbaum, “Cross-categorization: a method for discovering multiple overlapping clusterings,” in Proceedings of the Nonparametric Bayes Workshop at NIPS, Whistler, Canada, December 2009.
View at: Google Scholar
T. T. Grifths and Z. Z. Ghahramani, “Infinite latent feature models and the Indian buffet process,” in Proceedings of the 20th Neural Information Processing Systems, Vancouver, Canada, December 2006.
View at: Google Scholar
M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 291–316, Portland, Ore, USA, August 1996.
View at: Google Scholar
M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 49–60, Philadelphia, Pa, USA, June 1999.
View at: Google Scholar
A. Hinneburg and H.-H. Gabriel, “Denclue 2.0: fast clustering based on kernel density estimation,” in Advances in Intelligent Data Analysis VII, vol. 4723 of Lecture Notes in Computer Science, pp. 70–80, Springer, Berlin, Germany, 2007.
View at: Publisher Site | Google Scholar
S. Jukna, Extremal Combinatorics with Applications in Computer Science, Springer, New York, NY, USA, 2nd edition, 2011.
View at: Publisher Site | MathSciNet
K. A. Heller and Z. Ghahramani, “Bayesian hierarchical clustering,” in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 297–304, August 2005.
View at: Publisher Site | Google Scholar
E. Alpaydın, Introduction to Machine Learning, The MIT Press, Cambridge, Mass, USA, 2010.
T. L. Griffiths and Z. Ghahramani, “The Indian buffet process: an introduction and review,” Journal of Machine Learning Research, vol. 12, pp. 1185–1224, 2011.
View at: Google Scholar
E. Biçici and D. Yuret, “Locally scaled density based clustering,” in Adaptive and Natural Computing Algorithms, vol. 4431 of Lecture Notes in Computer Science, pp. 739–748, Springer, Berlin, Germany, 2007.
View at: Publisher Site | Google Scholar
C. Fox and S. Roberts, “A tutorial on variational Bayesian inference,” Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012.
View at: Publisher Site | Google Scholar
M. Garcia-Hernandez, T. Z. Berardini, G. Chen et al., “TAIR: a resource for integrated Arabidopsis data,” Functional & Integrative Genomics, vol. 2, no. 6, pp. 239–253, 2002.
View at: Publisher Site | Google Scholar
P. T. Spellman, G. Sherlock, M. Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
View at: Publisher Site | Google Scholar
W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
View at: Publisher Site | Google Scholar
L. M. Collins and C. W. Dent, “Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions,” Multivariate Behavioral Research, vol. 23, no. 2, pp. 231–242, 1988.
View at: Publisher Site | Google Scholar

Copyright

Copyright © 2015 Mansooreh Mirzaie et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

1521

Downloads

935

Citations

Mathematical Problems in Engineering

Bayesian-OverDBC: A Bayesian Density-Based Approach for Modeling Overlapping Clusters

Abstract

1. Introduction

2. Related Work

3. Traditional Density-Based Clustering

4. Bayesian-OverDBC Model

4.1. Probability of Transaction Matrix

4.2. Probability Model of Clusters and Data Distribution

5. Bayesian-OverDBC Algorithm

6. Evaluation

7. Discussion

Conflict of Interests

References

Copyright