Mathematical Problems in Engineering

Volume 2015, Article ID 187053, 9 pages

http://dx.doi.org/10.1155/2015/187053

## Bayesian-OverDBC: A Bayesian Density-Based Approach for Modeling Overlapping Clusters

^{1}Department of Computer Engineering, Golpayegan University of Technology, Isfahan 87717-65651, Iran

^{2}Department of Software Engineering, Faculty of Computer Engineering, University of Isfahan, Isfahan 81746-73441, Iran

^{3}Department of Bio-Medical Engineering, University of Isfahan, Isfahan 81746-73441, Iran

Received 18 March 2015; Revised 14 June 2015; Accepted 21 October 2015

Academic Editor: Huaguang Zhang

Copyright © 2015 Mansooreh Mirzaie et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Although most research on density-based clustering algorithms has focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even when they support overlap, density-based clustering methods do not define a probabilistic model of the data. It is therefore hard to determine how “good” a clustering is, to make predictions, or to assign new data to existing clusters. A probabilistic model for overlapping density-based clustering is thus a critical need for large data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster. It can also predict the overlap of a cluster with existing clusters. Bayesian-OverDBC has been compared with other algorithms (nonoverlapping and overlapping models). The results show that Bayesian-OverDBC can be significantly better than other methods in analyzing microarray data.

#### 1. Introduction

Clustering, that is, finding similar groups of objects in a dataset, is an interesting technique, especially for large data. Clustering algorithms usually assume that every object must belong to one and only one cluster (single-membership), but there are several real situations in which objects belong to more than one group (overlapping or multiple-membership). One application of overlapping clustering is in bioinformatics. In biology, genes have more than one function, carried out by coding proteins that participate in multiple metabolic pathways. Therefore, overlapping clustering, which assigns gene expression data to multiple clusters simultaneously, could be useful for microarray data [1].

Density-based clustering methods can find clusters of different shapes, so they are useful in finding overlapping clusters. Furthermore, they are rather robust to outliers [2] and are very effective in clustering microarray data. These methods, even when able to find overlapping clusters, do not use a probabilistic model. It is therefore difficult to determine the probability of events and to compare an overlapping method with other methods. A probabilistic density-based clustering model that supports overlap is thus required.

In this paper, the Bayesian-OverDBC algorithm is presented. This algorithm is a novel density-based clustering algorithm that has several advantages over traditional algorithms. It defines a probabilistic model of the data, which can be used to predict the distribution of overlapping clusters. Bayesian hypotheses can be tested to determine which clusters overlap and which ones are merged or even discarded. The algorithm may therefore be interpreted as a Dirichlet Process Mixture (DPM) model.

Bayesian-OverDBC is based on OverDBC [3]. In OverDBC, initial cores (points with high density values) are formed based on density functions. Clusters are formed around the core objects and can be improved through local search. These steps are also taken in Bayesian-OverDBC. In this algorithm, however, the decision to create, merge, or delete overlapping clusters is made by using probabilistic models and Bayesian hypotheses. Similar work on modeling overlapping clusters (IOMM) has been done by Heller and Ghahramani [4]. Their method uses exponential family distributions to model each cluster and creates overlapping clusters using the product of such distributions.

Evaluation results show that the Bayesian-OverDBC algorithm can find overlapping clusters and works more effectively than DBSCAN (a nonoverlapping density-based clustering algorithm) and IBP (an overlapping clustering model) on microarray data. The method can also be generalized to other datasets in different applications.

The main contributions of the paper can be summarized as follows:

(1) It introduces a density function to find probable core objects.

(2) It introduces a probabilistic Bayesian model for an overlapping density-based algorithm. The traditional density-based algorithms do not define a probabilistic model of data, so comparison with other models is hard.

(3) It introduces new parameters which affect overlapping and the possibility of its occurrence.

The rest of the paper is organized as follows. Section 2 gives a brief overview of some clustering methods (overlapping and nonoverlapping). In Section 3, the concepts of density-based clustering methods are reviewed. Section 4 presents the concepts of the new Bayesian model. Bayesian-OverDBC is described in Section 5. In Section 6, the results of the evaluation on synthetic microarray-like datasets and real datasets, together with a comparison with other methods, are described.

#### 2. Related Work

Different clustering methods have been introduced in statistics, machine learning, and data mining. The idea of multiple-membership clustering has recently emerged as an important topic in some research areas. Multiple-membership clustering methods were divided into three categories in [5]: Soft Models, Multiple-Membership Extensions to Hierarchical Agglomerative Clustering, and Similarity-Space Additive Clustering. In the following, these multiple-membership clustering techniques and their features are briefly reviewed.

(1) Soft Models: soft model algorithms allow a point to be a partial member of some or all clusters. There are two primary methods for soft clustering: soft $k$-means [6] and SVD-like matrix decompositions [7].

(2) Multiple-Membership Extensions to Hierarchical Agglomerative Clustering (HAC): HAC is a simple clustering algorithm and has served as the starting point for several multiple-membership clustering algorithms. “Jardine-Sibson B-clustering and Articulation Point Cuts” [8] and “Pyramid Hierarchical Clustering” [9] are straightforward extensions of single-link agglomerative clustering.

(3) Similarity-Space Additive Clustering: ADCLUS [10] is an additive method for modeling similarity matrices. ADCLUS provides a weight for each cluster, which is convenient for interpretation and for discarding unimportant clusters.
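The notion of partial membership in soft models can be illustrated with a minimal sketch. It assumes the common soft $k$-means formulation in which the membership of a point in a cluster is proportional to `exp(-beta * squared_distance)`; the function name `soft_memberships`, the stiffness parameter `beta`, and the toy data are illustrative assumptions and are not taken from [6].

```python
import math

def soft_memberships(points, centers, beta=1.0):
    """Return one row of membership weights per point; each row sums to 1."""
    rows = []
    for x in points:
        # Squared Euclidean distance from x to every cluster center.
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        # Shift by the smallest distance before exponentiating (numerical stability).
        smallest = min(d2)
        weights = [math.exp(-beta * (d - smallest)) for d in d2]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

# Hypothetical toy data: two centers and a point exactly halfway between them.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
centers = [(0.0, 0.0), (2.0, 0.0)]
memberships = soft_memberships(points, centers, beta=1.0)
```

Here the middle point receives equal membership in both clusters, whereas the two outer points belong almost entirely to their nearest center. Note that such memberships are fractional weights, which differs from the binary multiple-membership assignments discussed in the rest of this section.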

In [11], a probabilistic model of a microarray dataset is proposed. This method (SBK) models each observed expression value as a sample drawn from a Gaussian distribution. The mean is a sum of real-valued activations of the processes that a gene participates in. The problem then is to find $M$ (binary membership matrix) and $A$ (real-valued activity matrix) so as to maximize the joint probability $P(M, X, A)$, where $X$ is the input data. That paper demonstrates the application of the algorithm on the yeast stress response dataset, finding that the discovered overlapping clusters perform much better (as determined by $p$-value) than clusters discovered by other overlapping methods.
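The Gaussian observation model described above can be sketched as follows. This is a simplified illustration, not the SBK implementation: it assumes a fixed, shared noise standard deviation `sigma`, evaluates only the likelihood term of the data given the two matrices (priors on the matrices are omitted), and uses the hypothetical names `M` (membership), `A` (activity), and `X` (data) together with made-up toy values.

```python
import math

def matmul(M, A):
    """Plain-Python product of an n-by-k matrix and a k-by-d matrix."""
    return [[sum(M[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(M))]

def gaussian_log_likelihood(X, M, A, sigma=1.0):
    """log P(X | M, A) when X[i][j] ~ Normal((M A)[i][j], sigma^2)."""
    mean = matmul(M, A)
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for x_row, mu_row in zip(X, mean):
        for x, mu in zip(x_row, mu_row):
            total += const - (x - mu) ** 2 / (2.0 * sigma ** 2)
    return total

# Three genes, two processes; the second gene participates in both processes.
M = [[1, 0],
     [1, 1],
     [0, 1]]
# Activity of each process in each of two experiments (toy values).
A = [[1.0, 2.0],
     [0.5, -1.0]]
# Noise-free data: the row of the second gene is the sum of both activity rows.
X = matmul(M, A)

# A competing single-membership assignment for the second gene.
M_single = [[1, 0],
            [1, 0],
            [0, 1]]
```

Under these assumptions, `gaussian_log_likelihood(X, M, A)` exceeds `gaussian_log_likelihood(X, M_single, A)`, which is the sense in which an overlapping membership matrix can explain a gene whose expression is a sum of several process activations.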

SBK uses the expectation-maximization method [12], so it inherits the known problems of that approach, such as convergence to a local maximum. In addition, the algorithm needs a convergence threshold. The appropriate threshold value is highly sensitive to the data and may directly determine whether the algorithm converges. The algorithm also requires an initialization step that supplies an initial value for the cluster membership matrix. The initial value is usually the output of the $k$-means or hierarchical clustering algorithms. Running any of these algorithms in the initialization phase increases time complexity and space requirements.

Cheng and Church [13] give a biclustering (coclustering) algorithm for finding biclusters in microarray data. A bicluster is a submatrix (a subset of rows $I$ and a subset of columns $J$) that minimizes some objective such as the MSR (mean squared residue). In [14], a Bayesian biclustering method named BCC is introduced. It allows mixed membership in row and column clusters. BCC uses separate Dirichlet priors over the mixed memberships and assumes each observation to be generated by an exponential family distribution corresponding to its row and column clusters. Some advantages of BCC are the ability to handle sparse collections, applicability to diverse data types through the use of exponential family distributions, and flexible Bayesian priors using Dirichlet distributions. However, neither [13] nor [14] provides overlapping functionality for clusters.
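As an illustration of the MSR objective, the following sketch computes the mean squared residue of a submatrix, where the residue of an entry is the entry minus its row mean, minus its column mean, plus the overall mean of the submatrix. The function name `mean_squared_residue` and the toy matrix are assumptions made for this example; the sketch only scores a given bicluster and does not reproduce the search procedure of [13].

```python
def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix of `data` given by rows x cols."""
    sub = [[data[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: what remains after removing row, column, and overall effects.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)

# Toy expression matrix: the top-left 3x3 block follows a purely additive pattern.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]
msr_block = mean_squared_residue(data, [0, 1, 2], [0, 1, 2])
msr_full = mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2, 3])
```

The additive block has a residue of (numerically) zero, while the full matrix scores much higher, so a low MSR indicates rows that vary coherently across the selected columns.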

In [15], a probabilistic nonparametric Bayesian model for finding multiple clusterings is introduced. This model can simultaneously discover several possible clustering solutions and the feature subset views that generated each cluster partitioning. It allows not only for learning the multiple clusterings but also for automatically learning the number of views and the number of clusters in each view.

This model and a similar model in [16] both assume that the features in each view are not overlapping. However, in many applications, some features may be shared among views. In other words, although the concept of multifeature clustering has been considered, the models are not able to find overlapping clusters.

A new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters, is presented in [4]. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions. The IOMM allows an unbounded number of clusters, and assignments of points to (multiple) clusters are modeled using an Indian Buffet Process (IBP) [17].
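The product construction can be written compactly. The following is a sketch of the overlapping mixture likelihood as we understand it from [4]; the symbols $x_i$ (data point), $z_{ik} \in \{0, 1\}$ (membership of point $i$ in cluster $k$), $\theta_k$ (parameters of cluster $k$), and $c$ (normalizing constant) are notation chosen here for illustration.

```latex
P(x_i \mid z_i, \Theta) \;=\; \frac{1}{c} \prod_{k} P_k(x_i \mid \theta_k)^{z_{ik}},
\qquad z_{ik} \in \{0, 1\}
```

A point that belongs to several clusters is thus modeled by the normalized product of the corresponding cluster densities. When all $P_k$ come from the same exponential family, the product stays in that family because the natural parameters add, which is what makes the construction tractable.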

IOMM is implemented using a sampling method that requires a large number of iterations and is therefore time-consuming. Moreover, because the IOMM sampling method accepts all samples, the convergence of the algorithm cannot be proved for some datasets. In the next section, some details of traditional density-based clustering algorithms, such as DBSCAN, are reviewed. The section also describes some features of OverDBC, a density-based algorithm that is able to find overlapping clusters.

#### 3. Traditional Density-Based Clustering

The key idea of density-based clustering is that, for each object in a cluster, the neighborhood of a given radius must contain at least a minimum number of objects. Density-based clustering discovers clusters of arbitrary shapes in spatial databases with noise. Here density can be defined as the number of points within a specified radius. There are three main density-based clustering techniques: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [18], OPTICS (Ordering Points to Identify the Clustering Structure) [19], and DENCLUE (Density-Based Clustering) [20].

The method presented in this paper (and also OverDBC) uses the concepts of DBSCAN for clustering, so some features of this algorithm are described here. To find a cluster, DBSCAN starts with an arbitrary point $p$ and retrieves all points density-reachable from $p$. An object $q$ is directly density-reachable from an object $p$ if $q$ is within the $\varepsilon$-neighborhood of $p$ and $p$ is a core point. If $p$ is a core point, this procedure yields a cluster around $p$. If $p$ is a border point (a point on the border of a cluster), no points are density-reachable from $p$ and DBSCAN visits the next point of the database.

There are several limitations to the traditional DBSCAN algorithm. The algorithm provides no guide to choosing the “correct” number of clusters. The quality of DBSCAN depends on the distance measure used in the algorithm. It is often difficult to know which distance metric to choose, especially for special data such as images or sequences and for high-dimensional data. DBSCAN is also not entirely deterministic: border points that are reachable from more than one cluster can become part of either cluster, depending on the order in which the data are processed. This situation does not arise often, but it cannot be ruled out.
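The DBSCAN procedure and the border-point ambiguity described above can be illustrated with a minimal sketch. It is a simplified version written for this illustration, not the reference implementation of [18]: points are one-dimensional integers, the distance is the absolute difference, a point counts itself among its neighbors, and the names `dbscan`, `region_query`, `eps`, `min_pts`, and `order` are assumptions of the example.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

def dbscan(points, eps, min_pts, order=None):
    """Label each 1-D point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for p in (order if order is not None else range(len(points))):
        if labels[p] is not UNVISITED:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # not a core point; may later become a border point
            continue
        cluster += 1  # p is a core point: start a new cluster around it
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point: joins the cluster, not expanded
            if labels[q] is not UNVISITED:
                continue  # already assigned; the first cluster to reach it keeps it
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:  # q is a core point, keep expanding
                seeds.extend(q_neighbors)
    return labels

# Two dense groups with one point (value 22, index 4) lying between them.
points = [0, 4, 8, 12, 22, 32, 36, 40, 44]
forward = dbscan(points, eps=10, min_pts=4)
backward = dbscan(points, eps=10, min_pts=4, order=list(range(len(points)))[::-1])
```

In this toy dataset the point with value 22 is not a core point but lies within `eps` of a core point of each group. Processing the data from left to right attaches it to the left group, while processing from right to left attaches it to the right group, which is exactly the kind of border point that an overlapping method could assign to both clusters.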

OverDBC was introduced in [3] (see Figure 1). It is a density-based algorithm for finding overlapping clusters and is based on DBSCAN. OverDBC allows objects to have multiple membership in a restricted number of clusters, while the total number of clusters is unbounded. In [3], it is shown that OverDBC is significantly better than nonoverlapping clustering algorithms such as DBSCAN on microarray data.