Advances in Fuzzy Systems

Volume 2015, Article ID 729072, 8 pages

http://dx.doi.org/10.1155/2015/729072

## A Collaborative Framework for Privacy Preserving Fuzzy Co-Clustering of Vertically Distributed Cooccurrence Matrices

Osaka Prefecture University, 1-1 Gakuen-cho, Nakaku, Sakai, Osaka 599-8531, Japan

Received 10 February 2015; Revised 12 March 2015; Accepted 12 March 2015

Academic Editor: Rustom M. Mamlook

Copyright © 2015 Katsuhiro Honda et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In many real-world data analysis tasks, much more useful knowledge can be expected from utilizing multiple databases stored in different organizations, such as corporate groups, state organs, and allied countries. However, such organizations often hesitate to publish their databases because of privacy and security issues, even though they recognize the advantages of collaborative analysis. This paper proposes a novel collaborative framework for utilizing vertically partitioned cooccurrence matrices in fuzzy co-cluster structure estimation, in which cooccurrence information among objects and items is stored separately in several sites. In order to utilize such distributed data sets without fear of information leaks, a privacy preserving procedure is introduced into fuzzy clustering for categorical multivariate data (FCCM). While each element of the cooccurrence matrices is withheld, only object memberships are shared among the sites, and their (implicit) joint co-cluster structures are revealed through an iterative clustering process. Several experimental results demonstrate that collaborative analysis can contribute to revealing the global intrinsic co-cluster structures of the separate matrices, in contrast to individual site-wise analysis. The novel framework makes it possible for many private and public organizations to share common knowledge on data structure without fear of information leaks.

#### 1. Introduction

Data mining is a powerful tool for supporting efficient decision making in many private and public organizations, which have been utilizing various databases that are stored independently and securely in each organization. However, it is often quite expensive or impossible for a single organization to store enough data by itself, and many analysts believe that much more useful knowledge can be obtained by utilizing multiple databases stored in different organizations. In such collaborative data analysis, privacy is a significant problem. For example, in many corporations, customer segmentation by clustering is a fundamental approach in marketing, while customer privacy must be securely protected and individual data records such as purchase histories and personal profiles must not be published to other corporations or organizations. Similar situations are found in many other organizations, such as hospitals with clinical records and governments with military intelligence.

Privacy preserving data mining (PPDM) [1] is a fundamental approach for utilizing multiple databases including personal or sensitive information without fear of information leaks. A possible approach is a priori $k$-anonymization of databases for secure publication [2, 3], but such anonymization can bring information losses. Another approach for utilizing all distributed information is to analyze the information without revealing each element. In $k$-means clustering, several secure processes for estimating cluster centers were proposed [4, 5], in which the mean vector of each cluster is calculated with an encryption operation.

In this paper, a novel collaborative framework for utilizing vertically partitioned cooccurrence matrices in fuzzy co-cluster structure estimation is proposed, where cooccurrence information among objects and items is stored separately in several sites. In vertically distributed databases, it is assumed that all sites share common objects but that the objects are characterized by different, independent items in each site. The goal is to reveal the global co-cluster structures spanning all of the separate databases without publishing any element of an individual database to other sites.

The remaining parts of this paper are organized as follows: Section 2 gives a brief review of related work and Section 3 describes the open problems and a possible solution. Section 4 explains the conventional fuzzy co-clustering model and Section 5 proposes a novel collaborative framework for applying fuzzy co-clustering under privacy constraints. In Section 6, several experimental results demonstrate that collaborative analysis can contribute to revealing the global intrinsic co-cluster structures of the separate matrices, in contrast to individual site-wise analysis. Finally, a summary conclusion is given in Section 7.

#### 2. Background

Co-clustering is a fundamental technique for summarizing mutual cooccurrence information among objects and items. For example, in document clustering, mutual cooccurrence information of documents and keywords is utilized for revealing intrinsic document clusters together with their keyword summaries. In purchase history analysis, mutual connections among customers and their promising products are investigated considering purchase preferences. Co-clustering provides pairwise cluster structures among objects and items and has been widely investigated in both probabilistic [6] and heuristic contexts [7]. This paper focuses on fuzzy clustering approaches.

Fuzzy clustering has been shown to have many advantages over hard clustering from such viewpoints as sensitivity to noise and initialization. Fuzzy variants of co-clustering have also been demonstrated to be useful in such applications as document analysis [8] and collaborative filtering [9, 10]. The goal of fuzzy co-clustering is to simultaneously estimate the memberships of both objects and items from a cooccurrence information matrix. For example, in document analysis, each document (object) is characterized by several keywords (items) with their appearance frequencies (degrees of cooccurrence), and the goal is to extract document-keyword clusters with their fuzzy memberships for analyzing their contents.

Fuzzy clustering for categorical multivariate data (FCCM) [11] is a Fuzzy $c$-Means- (FCM-) type [12] co-clustering model, in which a co-cluster aggregation criterion is maximized with the support of entropy-based membership fuzzification [13, 14] in an FCM-like iterative optimization algorithm. Several fuzzy co-clustering models based on concepts similar to FCCM were also proposed, in which other fuzzification mechanisms were adopted [8, 15–18].

In order to analyze distributed databases in $k$-means-type clustering, several secure processes for estimating cluster centers were proposed [4, 5], in which the mean vector of each cluster is calculated with an encryption operation. However, in fuzzy co-clustering, the clustering criterion of cluster aggregation degree is defined without cluster centers, so the conventional secure framework cannot be adopted. A novel secure mechanism is therefore needed; the main problems that remain to be solved are summarized in the next section.

#### 3. Problems and Solution

In the $k$-means-type secure clustering model for vertically distributed data [4, 5], multiple sites share common objects, such as customers and patients, while having their own vector observations only, such as customer profiles of their own stores and clinical records in their own hospitals. In order to reveal the intrinsic object clusters without publishing each observation, each coordinate of cluster centers is separately calculated in each site and the derived coordinates are shared by all sites.

On the other hand, fuzzy co-clustering does not use cluster centers as cluster prototypes and utilizes only two types of fuzzy memberships. Therefore, the conventional secure framework for $k$-means-type clustering cannot be adopted, and a secure process for calculating the fuzzy memberships must be developed.

In the following, a novel framework for calculating fuzzy memberships in fuzzy co-clustering of vertically distributed cooccurrence matrices is proposed, following a brief review of the conventional fuzzy co-clustering models. In order to calculate *object* memberships, the sum of products of *item* memberships and cooccurrence observations is needed, and vice versa. In the proposed secure process, the sum calculation is securely achieved through an encryption operation, in which the sum can be calculated while concealing each value.
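To make the idea of summing values while concealing each one concrete, the following is a minimal sketch of a ring-style masked summation. It is an illustrative assumption, not the encryption operation of the proposed framework: the function name `secure_sum`, the constants `MODULUS` and `SCALE`, and the single-process simulation of the sites are all hypothetical, and the values are assumed to be nonnegative.

```python
import random

MODULUS = 2**61 - 1  # public modulus, assumed larger than any true scaled sum
SCALE = 10**6        # fixed-point scaling so real values can be added modulo MODULUS


def secure_sum(partial_sums, rng=random):
    """Simulate a ring-style masked summation over nonnegative partial sums.

    The initiating site draws a secret random mask, every site adds its own
    scaled value to the running total modulo MODULUS, and the initiating site
    finally removes the mask. Intermediate totals are masked, so a site that
    sees one does not learn the individual values added before it.
    """
    mask = rng.randrange(MODULUS)
    running = mask
    for value in partial_sums:
        running = (running + round(value * SCALE)) % MODULUS
    total = (running - mask) % MODULUS
    return total / SCALE


# Three sites each hold one partial sum of (item membership x cooccurrence).
result = secure_sum([0.25, 1.5, 0.125])
```

In a real deployment each addition would happen at a different site and the running total would be sent over a network; the loop above only simulates that message passing in one process.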

The novel framework is constructed in the FCCM context only, because FCCM is the basic model of fuzzy co-clustering. However, a similar extension is expected to be directly applicable to the other FCCM variants, because all of them are based on the FCCM updating process.

#### 4. Methodology of Fuzzy Co-Clustering

Assume that we have an $n \times m$ cooccurrence matrix $R = \{r_{ij}\}$ on $n$ objects and $m$ items, in which $r_{ij}$ represents the degree of cooccurrence of item $j$ with object $i$. The goal of co-clustering is to simultaneously partition objects and items into $C$ co-clusters by estimating two types of fuzzy memberships. Object partitions are represented by object memberships $u_{ci}$, the membership degree of object $i$ to cluster $c$, which are forced to be exclusive in the same way as in FCM such that $\sum_{c=1}^{C} u_{ci} = 1$. On the other hand, in order to avoid trivial solutions, item partitions are represented by item memberships $w_{cj}$, which mainly represent the relative typicality of the items within each cluster such that $\sum_{j=1}^{m} w_{cj} = 1$.

Oh et al. [11] proposed an FCM-type co-clustering model, called FCCM, by modifying the FCM algorithm to handle cooccurrence information, where the aggregation degree of each cluster is maximized:

$$L_{\mathrm{fccm}} = \sum_{c=1}^{C}\sum_{i=1}^{n}\sum_{j=1}^{m} u_{ci} w_{cj} r_{ij} - \lambda_u \sum_{c=1}^{C}\sum_{i=1}^{n} u_{ci} \log u_{ci} - \lambda_w \sum_{c=1}^{C}\sum_{j=1}^{m} w_{cj} \log w_{cj}.$$

The first term to be maximized measures the aggregation degree of objects and items in cluster $c$, such that it becomes larger when mutually familiar objects and items having a large $r_{ij}$ simultaneously have large memberships in the same cluster. Taken alone, this aggregation degree is suited only to hard partitions because it is a linear function of both $u_{ci}$ and $w_{cj}$, so that we always have $u_{ci} \in \{0, 1\}$ and $w_{cj} \in \{0, 1\}$. Therefore, in order to derive fuzzy memberships $u_{ci}$ and $w_{cj}$, the aggregation measure must be nonlinearized.

In FCCM, the entropy-based fuzzification method [13, 14] was adopted instead of the standard approach in FCM because the exponential weight in FCM works only in the minimization framework of positive objective functions. The weights $\lambda_u$ and $\lambda_w$ tune the degree of fuzziness of the object and item memberships, respectively, where a larger weight brings fuzzier partitions while a smaller weight brings crisper partitions.

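The clustering algorithm is an iterative process of updating $u_{ci}$ and $w_{cj}$ using the following rules:

$$u_{ci} = \frac{\exp\left(\lambda_u^{-1} \sum_{j=1}^{m} w_{cj} r_{ij}\right)}{\sum_{l=1}^{C} \exp\left(\lambda_u^{-1} \sum_{j=1}^{m} w_{lj} r_{ij}\right)},$$

$$w_{cj} = \frac{\exp\left(\lambda_w^{-1} \sum_{i=1}^{n} u_{ci} r_{ij}\right)}{\sum_{l=1}^{m} \exp\left(\lambda_w^{-1} \sum_{i=1}^{n} u_{ci} r_{il}\right)}.$$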
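The alternating updates can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names `softmax` and `fccm`, the random initialization, the fixed iteration count, and the toy matrix are all hypothetical choices. Object memberships are normalized over clusters and item memberships over items, each as a normalized exponential of the corresponding sum of products scaled by the fuzzification weight (`lam_u` or `lam_w`).

```python
import math
import random


def softmax(scores):
    """Normalized exponentials, shifted by the maximum to avoid overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fccm(R, n_clusters, lam_u, lam_w, n_iter=30, seed=0):
    """Alternate entropy-regularized item and object membership updates.

    R is a list of n rows (objects) with m columns (items).
    Returns U[i][c] (object memberships) and W[c][j] (item memberships).
    """
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    # Random initial object memberships, normalized over clusters.
    U = [softmax([rng.random() for _ in range(n_clusters)]) for _ in range(n)]
    W = [[1.0 / m] * m for _ in range(n_clusters)]
    for _ in range(n_iter):
        # Item memberships: normalized over the items within each cluster.
        for c in range(n_clusters):
            scores = [sum(U[i][c] * R[i][j] for i in range(n)) / lam_w
                      for j in range(m)]
            W[c] = softmax(scores)
        # Object memberships: normalized over the clusters for each object.
        for i in range(n):
            scores = [sum(W[c][j] * R[i][j] for j in range(m)) / lam_u
                      for c in range(n_clusters)]
            U[i] = softmax(scores)
    return U, W


# Toy cooccurrence matrix with two obvious object-item blocks.
R = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
U, W = fccm(R, n_clusters=2, lam_u=0.1, lam_w=1.0)
labels = [max(range(2), key=lambda c: U[i][c]) for i in range(4)]
```

On this toy matrix the first two objects end up in one cluster and the last two in the other; which cluster index each block receives depends on the random initialization.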

This FCCM process has also been reformulated with other fuzzification mechanisms. For example, Fuzzy CoDoK [8] utilized quadratic term-based regularization [19] for avoiding calculation overflows. Honda et al. [15] adopted K-L information-based regularization [20] for handling unbalanced cluster sizes. As discussed in Section 3, these extended models generally follow the original FCCM procedure and have similar characteristics. Therefore, in this paper, the novel collaborative framework is described in the FCCM context only.

#### 5. Fuzzy Co-Clustering with Privacy Consideration

##### 5.1. Privacy Consideration in $k$-Means Clustering

When each object $i$ is characterized by an $m$-dimensional observation $\mathbf{x}_i$, the $k$-means algorithm tries to minimize the within-cluster errors by iterating cluster center updating and nearest prototype assignment. Let $\mathbf{b}_c$ be the center of cluster $c$. In the case of distributed databases, privacy issues must be addressed in either of the two phases by adopting a technique such as an encryption operation [5].

For vertically distributed databases, where the elements of each observation $\mathbf{x}_i$ are stored separately in several sites, the distances between object $i$ and the cluster centers are calculated through the collaboration of all sites. Here, the clustering criterion is the sum of squared errors, which should be calculated while concealing each element of $\mathbf{x}_i$ from other sites. Once the nearest prototype assignment of each object is found, the new coordinates of each center $\mathbf{b}_c$ can be calculated independently in each site by sharing the object membership information.
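The reason this collaboration is possible is that a squared Euclidean distance is a sum over coordinates, so each site can compute the contribution of its own attributes locally. The sketch below illustrates only this decomposition; the helper names `partial_sq_dist` and `nearest_cluster` are hypothetical, and the plain `sum` stands in for whatever secure summation the sites would actually use.

```python
def partial_sq_dist(x_part, b_part):
    """Squared error contributed by the attributes held at one site."""
    return sum((x - b) ** 2 for x, b in zip(x_part, b_part))


def nearest_cluster(x_parts, center_parts):
    """Assign one object to its nearest center from site-wise partial distances.

    x_parts[t] holds the object's attributes stored at site t, and
    center_parts[c][t] holds the coordinates of center c kept at site t.
    """
    totals = []
    for parts_c in center_parts:
        # Only this total would need to be revealed, not its site-wise terms.
        totals.append(sum(partial_sq_dist(x, b)
                          for x, b in zip(x_parts, parts_c)))
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals


# One object split over two sites (two attributes and one attribute).
x_parts = [[1.0, 2.0], [3.0]]
center_parts = [
    [[1.0, 2.0], [2.0]],  # center 0, split the same way
    [[0.0, 0.0], [0.0]],  # center 1
]
best, totals = nearest_cluster(x_parts, center_parts)
```

After the assignment is known, each site can average its own coordinates over the objects of a cluster, which is why the center update needs no further exchange of raw attribute values.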

Although the above secure framework is also useful in many other $k$-means-type clustering algorithms such as FCM, it cannot be directly applied to co-clustering algorithms because co-clustering does not use cluster prototypes but considers two types of memberships.

In this paper, similar ideas are adapted to fuzzy co-clustering tasks.

##### 5.2. Fuzzy Co-Clustering with Privacy Consideration

Assume that $T$ sites ($t = 1, \ldots, T$) share $n$ common objects ($i = 1, \ldots, n$) and have different cooccurrence information on different items, which are summarized into $n \times m_t$ matrices $R_t = \{r^{t}_{ij}\}$, where $m_t$ is the number of items in site $t$ and $\sum_{t=1}^{T} m_t = m$. Figure 1 shows a visual image of vertically distributed cooccurrence matrices. For example, we have a group of corporations (or hospitals, countries, etc.) and each of them has its independent customer purchase history (or patients’ records, military intelligence, etc.).
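As a small illustration of this setting, the sketch below splits a cooccurrence matrix column-wise into site-wise blocks and checks that the sum of products of item memberships and cooccurrences for each object is the sum of the partial sums computed locally at each site. It only demonstrates this arithmetic identity and says nothing about how the partial sums are exchanged securely; the function names `split_columns` and `site_partial` and the toy numbers are hypothetical.

```python
def split_columns(R, sizes):
    """Split each row of R into consecutive column blocks of the given widths."""
    blocks, start = [], 0
    for width in sizes:
        blocks.append([row[start:start + width] for row in R])
        start += width
    return blocks


def site_partial(R_t, w_t):
    """Per-object sums of (item membership x cooccurrence) over one site's items."""
    return [sum(w * r for w, r in zip(w_t, row)) for row in R_t]


# Two objects, three items; site 1 holds two items and site 2 holds one.
R = [[1, 0, 2], [0, 3, 1]]
w_c = [0.5, 0.25, 0.25]  # item memberships of one cluster over all items
sizes = [2, 1]

R_blocks = split_columns(R, sizes)
w_blocks = [w_c[:2], w_c[2:]]
partials = [site_partial(R_t, w_t) for R_t, w_t in zip(R_blocks, w_blocks)]

# Adding the site-wise partial sums recovers the full sum for every object.
combined = [sum(values) for values in zip(*partials)]
full = site_partial(R, w_c)
```

Here each site needs only its own block of the matrix and its own item memberships to produce its partial sums.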