Abstract

Although the goal of clustering is to reveal structural information in unlabeled datasets, semi-supervised clustering is expected to improve partition quality when partial structural supervision is available. In many real applications, however, providing a sufficient number of supervised objects with class labels incurs additional cost. The virtual sample approach is a practical technique for improving classification quality in semi-supervised learning, in which additional virtual samples are generated from supervised objects. In this research, the virtual sample approach is adopted in semi-supervised fuzzy co-clustering, where the goal is to reveal object-item pairwise cluster structures from cooccurrence information among them. Several experimental results demonstrate the characteristics of the proposed approach.

1. Introduction

Clustering, or cluster analysis, is a basic technique for unsupervised classification, whose goal is to reveal intrinsic substructures in large-scale unlabeled datasets. In some applications, however, it is possible to utilize partial knowledge on the substructures [1], such as must-link and cannot-link constraints among some objects [2–4], class labels on a part of the objects [5–7], or predefined fuzzy membership degrees [8, 9], and partition quality is expected to be significantly improved by utilizing such partial knowledge. In this study, a semi-supervised situation is considered, in which some supervising objects are available together with their class labels. Semi-supervised clustering [5, 10] is a practical approach for utilizing partial supervised information with the goal of improving the partition quality in unsupervised classification. Such partial knowledge can be utilized in two phases: supervised initialization and supervised membership assignment [6, 7].

Although partition quality is expected to be improved with sufficient semi-supervision, in many real-world applications it is often difficult to obtain a sufficient number of supervised objects. For example, in many web data analyses, various open data are available, but most of them are unlabeled and their class labels cannot be provided without heavy costs. A promising approach to improve classification quality in semi-supervised learning is the virtual sample approach, in which additional virtual samples are artificially generated from several supervised objects [11–13].

In this paper, the virtual sample approach is adopted in semi-supervised fuzzy co-clustering, whereas the conventional model [13] was designed for semi-supervised classification in multidimensional data space. Fuzzy co-clustering is a fundamental technique for summarizing mutual cooccurrence information among objects and items, such as document-keyword frequencies in document analysis [14] and customer-product preferences in purchase history analysis [15]. The task of co-clustering is achieved by simultaneously estimating memberships of both objects and items from a cooccurrence information matrix. Fuzzy co-clustering induced by MMMs concept (FCCMM) [16] is a fuzzy co-clustering algorithm induced by multinomial mixture models (MMMs) [17], where statistical mixture models are interpreted as $k$-means-type classification models with regularized objective functions. The iterative algorithm is composed of updating two types of memberships, those of items and those of objects. In semi-supervised fuzzy co-clustering [18], partial knowledge was utilized in two phases, supervised initialization and supervised object membership assignment, in a similar manner to the framework of [6, 7]. In supervised initialization, initial item memberships are generated using labeled objects only. Then, in supervised object membership assignment, memberships of supervised objects are fixed in advance while the others are updated in the iterative algorithm.

The remaining parts of this paper are organized as follows: Section 2 provides a brief review of the MMMs-induced fuzzy co-clustering, which is then combined with a semi-supervised framework. An artificial process for generating virtual samples is introduced in Section 3. Section 4 presents experimental results and Section 5 gives summary conclusions.

2. MMMs-Induced Fuzzy Co-Clustering and Its Variant for Semi-Supervised Clustering

2.1. Fuzzy Co-Clustering Model Induced by MMMs Concept

Assume that we have an $n \times m$ cooccurrence matrix $R = \{r_{ij}\}$ on $n$ objects and $m$ items, in which $r_{ij}$ represents the cooccurrence degree of object $i$ and item $j$. For example, $r_{ij}$ can be the frequency of keyword (item) $j$ in document (object) $i$ in document analysis. Many previous results have shown that co-cluster structure analysis is useful for summarizing contents, such as the intrinsic chapter or category information of many documents together with their representative keywords. In the fuzzy co-clustering context, the task is reduced to the problem of finding pairwise clusters of mutually familiar objects and items, in which the goal is to estimate the fuzzy memberships of both objects and items such that mutually familiar object and item pairs have large memberships in the same cluster.

MMMs [17] are probabilistic mixture models for co-clustering, in which each component density is a multinomial distribution. The multinomial distribution is a multicategory extension of the binomial distribution, where the probability of an object $i$ with item cooccurrence vector $r_i = (r_{i1}, \ldots, r_{im})$ is defined as the joint probability of all items with their frequencies $r_{ij}$. MMMs construct a mixture distribution by iteratively estimating the item occurrence probabilities and the a priori probabilities of the generative distributions in conjunction with estimating the a posteriori probability of each object to each generative model.

Following the soft clustering interpretation of probabilistic mixture models [19], Honda et al. [16] introduced a fuzzy co-clustering-based interpretation of MMMs, where the pseudo-log-likelihood function was decomposed into an object-item aggregation measure and a K-L information-based fuzzification penalty. FCCMM [16] is an MMMs-induced fuzzy co-clustering model, in which the degree of object partition fuzziness can be tuned with an adjustable penalty weight $\lambda_u$ on the K-L information-based penalty. For extracting $C$ fuzzy co-clusters, the objective function to be maximized is defined as

$$L_{fccmm} = \sum_{c=1}^{C}\sum_{i=1}^{n}\sum_{j=1}^{m} u_{ci}\, r_{ij} \log w_{cj} + \lambda_u \sum_{c=1}^{C}\sum_{i=1}^{n} u_{ci} \log\frac{\alpha_c}{u_{ci}}, \tag{1}$$

where $u_{ci}$ and $w_{cj}$ are the fuzzy memberships of object $i$ and item $j$ to cluster $c$, respectively. $\alpha_c$ is the volume of cluster $c$ such that $\sum_{c=1}^{C}\alpha_c = 1$. From the MMMs viewpoint, $u_{ci}$ is the a posteriori probability of class $c$ given object $i$ and should be constrained to $\sum_{c=1}^{C} u_{ci} = 1$. On the other hand, $w_{cj}$ corresponds to the generative probability of item $j$ in component $c$ and should be constrained to $\sum_{j=1}^{m} w_{cj} = 1$. Thus, $u_{ci}$ is mainly responsible for the exclusive object partition, while $w_{cj}$ just represents the typicality of item $j$ within a generative model.

The term $\sum_{c=1}^{C}\sum_{i=1}^{n}\sum_{j=1}^{m} u_{ci}\, r_{ij} \log w_{cj}$ is the aggregation criterion of objects and items, which measures the degree of aggregation of familiar objects and items, and becomes larger when mutually familiar object-item pairs with large $r_{ij}$ have large memberships $u_{ci}$ and $w_{cj}$ in the same cluster $c$. From the viewpoint of $k$-means-type clustering with respect to object memberships $u_{ci}$, the aggregation criterion is essentially a (linear) hard clustering measure, and the K-L information term [20, 21] is responsible for estimating a soft partition of objects, in the same manner as the soft partition nature of Gaussian mixture models [19]. The penalty weight $\lambda_u$ tunes the degree of fuzziness of the object partition. When $\lambda_u = 1$, the objective function is reduced to the pseudo-log-likelihood function of MMMs with $C$ component distributions. In the case of $\lambda_u > 1$, FCCMM brings a fuzzier co-cluster partition than MMMs, while the object partition becomes crisper with $\lambda_u < 1$. As an extremal case, $\lambda_u \to 0$ implies a crisp co-clustering model. It was shown that careful tuning of the fuzziness degree can improve the partition quality of MMMs by reducing sensitivity to initialization and noise [16].
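As a brief check of the $\lambda_u = 1$ case, the following sketch assumes that the FCCMM objective is the sum of the aggregation term $\sum u_{ci} r_{ij} \log w_{cj}$ and the K-L penalty $\lambda_u \sum u_{ci} \log(\alpha_c / u_{ci})$, with $u_{ci}$, $w_{cj}$, $\alpha_c$, and $r_{ij}$ denoting object memberships, item memberships, cluster volumes, and cooccurrence degrees. The symbol names are assumptions wherever the text does not fix them.

```latex
% Assumed objective with \lambda_u = 1, grouped per object i:
L = \sum_{i=1}^{n}\sum_{c=1}^{C} u_{ci}\,
    \log\frac{\alpha_c \prod_{j=1}^{m} w_{cj}^{\,r_{ij}}}{u_{ci}}

% Substituting the maximizing memberships
%   u_{ci} = \alpha_c \prod_j w_{cj}^{r_{ij}} \Big/ \sum_{l} \alpha_l \prod_j w_{lj}^{r_{ij}}
% makes the argument of the logarithm independent of c, and \sum_c u_{ci} = 1 gives
L = \sum_{i=1}^{n} \log \sum_{c=1}^{C} \alpha_c \prod_{j=1}^{m} w_{cj}^{\,r_{ij}}
```

Under these assumptions, the last line is the log-likelihood of a multinomial mixture up to the constant multinomial coefficients, which is consistent with the statement that the model reduces to MMMs at $\lambda_u = 1$.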

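The updating rules for these model parameters are given as

$$u_{ci} = \frac{\alpha_c \prod_{j=1}^{m} (w_{cj})^{r_{ij}/\lambda_u}}{\sum_{l=1}^{C} \alpha_l \prod_{j=1}^{m} (w_{lj})^{r_{ij}/\lambda_u}}, \tag{2}$$

$$\alpha_c = \frac{1}{n}\sum_{i=1}^{n} u_{ci}, \tag{3}$$

$$w_{cj} = \frac{\sum_{i=1}^{n} u_{ci}\, r_{ij}}{\sum_{l=1}^{m}\sum_{i=1}^{n} u_{ci}\, r_{il}}. \tag{4}$$

The clustering algorithm is the 3-step iterative process composed of these updating rules.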

Algorithm of FCCMM

(1) Let $C$ be the number of clusters. Choose the fuzzification weight $\lambda_u$.

(2) Initialization: initialize object memberships $u_{ci}$, $i = 1, \ldots, n$, with random values such that $\sum_{c=1}^{C} u_{ci} = 1$.

(3) Iterative Process: iterate the following process until convergence of all $u_{ci}$.

(a) Update cluster volumes $\alpha_c$ using (3).

(b) Update item memberships $w_{cj}$ using (4).

(c) Update object memberships $u_{ci}$ using (2).
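The three-step iteration above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `fccmm`, the stopping tolerance, the smoothing constant `eps` that avoids `log(0)`, and the log-domain evaluation of the object membership update are all assumptions introduced here. Memberships are stored with objects along the rows, so `U[i, c]` plays the role of $u_{ci}$.

```python
import numpy as np

def fccmm(R, C, lam_u=1.0, n_iter=100, tol=1e-6, seed=0, eps=1e-12):
    """Sketch of FCCMM on an n x m cooccurrence matrix R with C clusters."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    # Step (2): random object memberships, each row summing to 1 over clusters.
    U = rng.random((n, C))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step (3a): cluster volumes, the mean membership of each cluster.
        alpha = U.mean(axis=0)
        # Step (3b): item memberships, normalised so each cluster's row sums to 1.
        # eps is an assumed smoothing term so that the logarithm below is finite.
        W = U.T @ R + eps                      # shape (C, m)
        W /= W.sum(axis=1, keepdims=True)
        # Step (3c): object memberships, evaluated in the log domain for stability.
        log_u = np.log(alpha + eps) + (R @ np.log(W).T) / lam_u   # shape (n, C)
        log_u -= log_u.max(axis=1, keepdims=True)
        U_new = np.exp(log_u)
        U_new /= U_new.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, W, alpha
```

Calling `fccmm(R, C=2, lam_u=1.0)` on a small 0/1 matrix returns the object memberships, item memberships, and cluster volumes; larger `lam_u` should give a fuzzier object partition, in line with the role of the fuzzification weight.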

2.2. Semi-Supervised Fuzzy Co-Clustering and Inductive Classification

When some objects have intrinsic class labels, the partition quality of unsupervised clustering is expected to be improved with their support [5, 10]. A possible semi-supervised framework for fuzzy co-clustering [18] utilizes such partial supervision at two levels: supervised initialization and supervised membership assignment. Assume that we have an $n \times m$ cooccurrence matrix, where only a part of the $n$ objects have class labels while the others do not. A framework of semi-supervised fuzzy co-clustering is given as follows [18].

Semi-Supervised Fuzzy Co-Clustering Framework

(1) Let $C$ be the number of clusters, which is usually equivalent to the class number of the supervised objects. Choose the fuzzification weight $\lambda_u$.

(2) Initialization: initialize the memberships of supervised objects such that $u_{ci} = 1$ in the labeled class and $u_{ci} = 0$ in others. For unsupervised objects $i$, $u_{ci} = \varepsilon$, where $\varepsilon$ is a small positive value. (The initial cluster volumes and item memberships are thus estimated by mainly reflecting the supervised objects.)

(3) Iterative Process: iterate the following process until convergence of all $u_{ci}$.

(a) Update cluster volumes $\alpha_c$ using (3).

(b) Update item memberships $w_{cj}$ using (4).

(c) Update memberships $u_{ci}$ of unsupervised objects using (2). (The memberships of supervised objects are fixed in advance and unchanged.)
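The framework above can be sketched as a small variation of the FCCMM loop. The code below is an illustrative sketch under assumptions, not the implementation of [18]: the function name, the encoding of unsupervised objects as label `-1`, the default value of the small initial membership `small`, and the smoothing constant `eps` are all hypothetical choices made here.

```python
import numpy as np

def semi_supervised_fccmm(R, labels, C, lam_u=1.0, n_iter=100,
                          tol=1e-6, small=1e-3, eps=1e-12):
    """labels[i] is a class index in 0..C-1 for supervised objects, -1 otherwise."""
    labels = np.asarray(labels)
    n, m = R.shape
    sup = np.flatnonzero(labels >= 0)          # indices of supervised objects
    # Initialization: crisp memberships for supervised objects and a small
    # constant for the others, so the first item memberships mainly reflect
    # the supervised objects.
    U = np.full((n, C), small)
    U[sup] = 0.0
    U[sup, labels[sup]] = 1.0
    for _ in range(n_iter):
        # Cluster volumes; their overall scale cancels in the normalisation below.
        alpha = U.mean(axis=0)
        # Item memberships, each cluster's row normalised to sum to 1.
        W = U.T @ R + eps
        W /= W.sum(axis=1, keepdims=True)
        # Object memberships in the log domain.
        log_u = np.log(alpha + eps) + (R @ np.log(W).T) / lam_u
        log_u -= log_u.max(axis=1, keepdims=True)
        U_new = np.exp(log_u)
        U_new /= U_new.sum(axis=1, keepdims=True)
        # Supervised objects keep their fixed crisp memberships.
        U_new[sup] = U[sup]
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, W, alpha
```

Because the supervised rows are restored after every update, they act as fixed anchors that pull unlabeled objects sharing their items into the same co-cluster.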

At the initialization level, plausible initial co-clusters are estimated considering only the supervised objects. This approach is available only if the number of supervised objects is sufficient for estimating cluster-wise item preferences, and each cluster index strictly corresponds to the intrinsic class index. When supervision is insufficient, this initialization step should instead be performed with the conventional procedure of random assignment to avoid overfitting to a few supervised objects.

Next, in the iterative optimization level, a fixed crisp object membership is assigned to each supervised object reflecting its class label. The partial supervision can contribute to guiding other unlabeled objects to plausible co-clusters. This second level is expected to be useful even if we have only a few supervised objects.

Once a co-cluster structure has been obtained, we can perform inductive classification of new (unlabeled) objects supported by the co-cluster information [18].

Inductive Classification Considering Fuzzy Co-Cluster Structure

(1) Assume that we have estimated the co-cluster model parameters $\alpha_c$ and $w_{cj}$. The goal is to predict the class of a new test object, which is associated with its own cooccurrence information vector.

(2) Membership Calculation: calculate the fuzzy memberships of the test object to all clusters by (2).

(3) Maximum Membership Assignment: search for the cluster with the largest membership and output its class label.
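The inductive step can be sketched as a single membership calculation followed by an argmax. This is a minimal sketch: the function name `classify_new_object` and the smoothing constant `eps` are assumptions introduced here, and the membership formula is the log-domain form of an update in which memberships are proportional to $\alpha_c \prod_j w_{cj}^{r_j/\lambda_u}$.

```python
import numpy as np

def classify_new_object(r_new, W, alpha, lam_u=1.0, eps=1e-12):
    """Return (predicted cluster index, membership vector) for one new object.

    r_new : length-m cooccurrence vector of the new object
    W     : (C, m) item memberships, alpha : (C,) cluster volumes
    """
    r_new = np.asarray(r_new, dtype=float)
    # Log of alpha_c * prod_j w_cj ** (r_j / lam_u) for every cluster c.
    log_u = np.log(alpha + eps) + (np.log(W + eps) @ r_new) / lam_u
    log_u -= log_u.max()                 # shift for numerical stability
    u = np.exp(log_u)
    u /= u.sum()                         # memberships sum to 1 over clusters
    return int(u.argmax()), u
```

The returned index refers to a cluster; it can be read as a class label only when cluster indices correspond to class indices, as in the supervised initialization described above.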

This inductive classification approach was shown to outperform supervised classification that uses only a small set of supervised objects [18]. This implies that, when supervision is insufficient, additionally using unlabeled objects estimates class-wise distributions more effectively than supervised learning alone.

3. Generation of Virtual Samples for Semi-Supervised Fuzzy Co-Clustering

Although the performance of semi-supervised clustering is expected to improve as the number of labeled objects becomes larger, generating a sufficient number of supervised objects may be costly in real applications. The virtual sample approach is a practical strategy for improving classification quality in semi-supervised learning without additional costs. In the remaining parts of this paper, the virtual sample approach is adopted in semi-supervised fuzzy co-clustering.

Sassano [13] proposed two methods for generating virtual samples for text classification based on the following assumption:

The category of a document is unchanged even if a small number of words are added or deleted.

Because the documents belonging to the same category usually contain several common keywords, deletion or addition of a small number of words is expected not to have a severe impact on classification quality. In [13], two strategies of Deletion and Addition were considered and were utilized in semi-supervised support vector machine learning of text documents. In GenerateByDeletion, virtual samples were generated by deleting some portions of the original supervised documents and were added to the class of the original ones. On the other hand, in GenerateByAddition, virtual samples were generated by adding a small number of words into the original supervised documents. The words to be added are taken from documents, the label of which is the same as that of the original document.

In this paper, the two generative strategies are introduced to semi-supervised fuzzy co-clustering tasks. Assume that $r_i = (r_{i1}, \ldots, r_{im})$ is the cooccurrence information vector on a supervised object $i$, where $r_{ij} \in \{0, 1\}$ just represents the appearance/absence of item $j$ in object $i$. Its virtual copy $\tilde{r}_i$ is given as an additional virtual (supervised) sample to be added to the partial supervision after slight revisions based on the two strategies.

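GenerateByDeletion

(1) Copy $r_i$ to a virtual sample $\tilde{r}_i$.

(2) For each item $j$ with $\tilde{r}_{ij} = 1$, if $\mathrm{rand}() \le t$ then set $\tilde{r}_{ij} = 0$, where $\mathrm{rand}()$ is a function which generates a random number from 0 to 1, and $t$ is a parameter to decide how many elements are deleted.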

This procedure implies that each virtual object is a virtual copy of the original supervised object having a smaller number of appearances. The virtual object is almost equivalent to the original one with $t \to 0$, while it becomes sparser as $t$ becomes larger.
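The deletion procedure can be sketched as follows for a 0/1 cooccurrence vector. The function name, the use of a NumPy random generator in place of the random-number function, and the argument names are assumptions made for illustration.

```python
import numpy as np

def generate_by_deletion(r, t, rng):
    """Return a virtual copy of the 0/1 vector r with some appearances removed.

    Each item that appears in r is deleted when a uniform random number
    in [0, 1) is at most t, so a larger t gives a sparser virtual object.
    """
    r = np.asarray(r)
    v = r.copy()                              # copy the supervised object
    present = np.flatnonzero(v == 1)          # items that appear in the object
    drop = rng.random(present.size) <= t      # one random trial per present item
    v[present[drop]] = 0
    return v

rng = np.random.default_rng(0)
virtual = generate_by_deletion([1, 0, 1, 1, 0, 1], t=0.3, rng=rng)
```

The virtual object inherits the class label of the original one; calling the function repeatedly with the same generator yields different virtual copies of the same supervised object.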

Here, the detailed process is demonstrated with a toy example using a set of 6 supervised objects shown in Table 1, where each cooccurrence information vector is composed of ten items and each object belongs to one of two supervised classes. Note that class 1 is mainly related to the first 5 items while class 2 is related to the others.

Using GenerateByDeletion, a class 1 object in Table 1 can generate virtual objects with label “1” in which some of its item appearances are replaced by absences. Similarly, a class 2 object can generate such virtual objects with label “2”.

GenerateByAddition

(1) Construct the set of all supervised objects whose class labels are the same as that of object $i$, and generate a temporary item set composed of all items that appear in this supervised object set.

(2) Copy $r_i$ to a virtual sample $\tilde{r}_i$.

(3) For each item $j$ with $r_{ij} = 1$, if $\mathrm{rand}() \le t$ then randomly select an item $k$ from the temporary item set and set $\tilde{r}_{ik} = 1$.

This procedure implies that each virtual object is a virtual copy of the original supervised object having a larger number of appearances. The virtual object is almost equivalent to the original one with $t \to 0$, while it becomes denser as $t$ becomes larger.
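The addition procedure can be sketched in the same style. The function and argument names are hypothetical, and it is assumed here that `same_class_rows` holds the 0/1 vectors of all supervised objects sharing the label of the original object, including the object itself, so that the item pool is never empty when a trial succeeds.

```python
import numpy as np

def generate_by_addition(r, same_class_rows, t, rng):
    """Return a virtual copy of the 0/1 vector r with some appearances added.

    Items are only added from the pool of items that appear in at least one
    supervised object of the same class, so other items always stay absent.
    """
    r = np.asarray(r)
    # Step (1): pool of items appearing in the supervised objects of the class.
    item_pool = np.flatnonzero(np.asarray(same_class_rows).sum(axis=0) > 0)
    # Step (2): copy the supervised object.
    v = r.copy()
    # Step (3): one random trial per item present in the original object.
    for _ in np.flatnonzero(r == 1):
        if rng.random() <= t:
            v[rng.choice(item_pool)] = 1      # may hit an item already present
    return v

rng = np.random.default_rng(0)
class_rows = np.array([[1, 1, 0, 0, 0],
                       [0, 1, 1, 0, 0]])
virtual = generate_by_addition(class_rows[0], class_rows, t=0.5, rng=rng)
```

A hybrid variant can simply call both generators on the same supervised object, which yields at least two virtual objects per original object.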

In the toy example of Table 1, GenerateByAddition first constructs a temporary item set for each of class 1 and class 2 from the items that appear in the supervised objects of that class. Then, a class 1 object can generate virtual objects with label “1” in which some additional items from the class 1 item set appear, and a class 2 object can likewise generate virtual objects with label “2”. Here, it should be noted that $\tilde{r}_{ij}$ is always 0 if item $j$ does not appear in any supervised object of the class.

The generated virtual objects are added to the set of supervised objects and are utilized as semi-supervision in semi-supervised learning.

4. Numerical Experiment

The classification quality of semi-supervised fuzzy co-clustering with virtual objects is investigated through numerical experiments in this section.

Two benchmark datasets, the CiteSeer dataset and the Cora dataset, were used, which are available from the LINQS webpage of the Statistical Relational Learning Group, UMD (http://linqs.cs.umd.edu/projects/index.shtml). The CiteSeer dataset consists of 3312 scientific publications (objects, $n = 3312$), each of which is described by a 0/1-valued word vector ($r_{ij} \in \{0, 1\}$) indicating the absence/presence of 3703 unique words (items, $m = 3703$) and is classified into one of six classes ($C = 6$). The Cora dataset similarly consists of 2708 scientific publications (objects, $n = 2708$) described by 1433 unique words (items, $m = 1433$) with $r_{ij} \in \{0, 1\}$, each of which is classified into one of seven classes ($C = 7$).

The classification quality was investigated through a 5-fold cross-validation scheme. In applying semi-supervised fuzzy co-clustering, the dataset was first partitioned into 5 disjoint subsets. Four subsets were used as the training set for semi-supervised fuzzy co-clustering and the remaining subset was used as the test set to validate the classification ability. These training/test trials were iterated 5 times, rotating the test subset. The number of clusters was set to the actual class number; that is, $C = 6$ for the CiteSeer dataset and $C = 7$ for the Cora dataset. The fuzzification weight $\lambda_u$ was set separately for the CiteSeer and Cora datasets so that the model is slightly fuzzier than MMMs with $\lambda_u = 1$, which can contribute to better performance [16].
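The rotation of training and test subsets can be sketched as follows. The paper does not state how the five subsets were drawn, so the random shuffling, the seed, and the function name are assumptions made only for illustration.

```python
import numpy as np

def five_fold_splits(n, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs over n objects, one pair per fold."""
    rng = np.random.default_rng(seed)
    # Shuffle object indices and cut them into disjoint, nearly equal subsets.
    folds = np.array_split(rng.permutation(n), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# Example with the number of CiteSeer objects given in the text.
for train_idx, test_idx in five_fold_splits(3312):
    pass  # fit the co-clustering model on train_idx, then classify test_idx
```

Within each training subset, only a chosen fraction of the objects would keep their class labels, with the labels of the remaining objects withheld.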

4.1. Preliminary Experiment

First, a preliminary experiment was performed with the goal of investigating the effects of the number of supervised objects, where the semi-supervised framework was implemented without virtual objects. The proportion of supervised objects available in the training data was varied over several ratios for the two benchmark datasets, while the class labels of the remaining objects were withheld. “0%” corresponds to the conventional unsupervised model, which utilizes no supervision. “100%” corresponds to the fully supervised model, where all training objects have their class information and the co-cluster models (cluster-wise item memberships) were independently estimated in each class. After co-cluster estimation, the classes of the unsupervised training objects were predicted by their largest memberships. The classes of the test objects were also predicted by the inductive classification scheme of Section 2.

The recognition rates are compared in Table 2 for the CiteSeer dataset and Table 3 for the Cora dataset. The entries for fully supervised training data are missing (shown with a dash) because no unsupervised objects remained in the training sets. In both the CiteSeer and Cora datasets, the recognition rates of unsupervised co-clustering (no supervision) are inferior to those of semi-supervised co-clustering. These results imply that semi-supervised co-clustering is more effective than unsupervised co-clustering when partial supervision is available, and that the partition quality becomes better as the amount of available supervision becomes larger. The recognition rates are almost comparable once the supervision ratio reaches a certain level, while the case with very little supervision is significantly inferior to the others. Thus, with very little supervision only, it is quite difficult to achieve sufficient recognition rates, and additional information is needed for improving the partition quality.

Incidentally, in these datasets, supervision beyond a certain ratio did not contribute to further improvement in the test evaluation. This may be because too much supervision causes overfitting to the training samples and poor generalization capability. Semi-supervised learning is thus expected to improve generalization capability relative to fully supervised learning.

In the following experiments, a situation with partial supervision was designed for simulating an insufficient partial knowledge case, where the goal is to demonstrate the advantage of additional virtual partial supervision.

4.2. Investigation of Effects of Virtual Objects

Next, the effect of virtual objects in semi-supervised fuzzy co-clustering was investigated. In this experiment, the classification abilities of semi-supervised fuzzy co-clustering with and without virtual objects were compared. Following the result of the above preliminary experiment, supervised objects were included in training sets for semi-supervised co-clustering as a simulation of the situation where a sufficient number of supervised objects is not available. The virtual objects were generated from each of these supervised objects by GenerateByDeletion and GenerateByAddition, and then they were added to the training sets and utilized as additional supervision. Supported by the additional information of virtual objects, the partition quality is expected to improve over the original supervision case.

Additionally, the influences of the number of virtual objects were also investigated, where the number of virtual objects was increased by iterating the generative procedures. Furthermore, a hybrid method was also applied using both GenerateByDeletion and GenerateByAddition, simultaneously. In this hybrid method, at least 2 virtual objects are generated from one supervised object.

The recognition rates of both training and test objects were estimated in the same manner as in the preliminary experiment. Tables 4 and 5 show the results with the CiteSeer dataset. Tables 6 and 7 show the results with the Cora dataset. Bold type indicates improved quality over the original supervision case (without virtual objects). “GBD” and “GBA” mean GenerateByDeletion and GenerateByAddition, respectively, and “GBD+GBA” is their hybrid method. In this experiment, the number of virtual objects per supervised object was varied by adopting different random seeds in GenerateByDeletion and GenerateByAddition, and likewise in the hybrid method. The addition/deletion rate parameter $t$ was also varied.

In both training and test cases, the recognition rates of semi-supervised co-clustering with virtual objects are superior to those without virtual objects when the number of virtual objects generated from a supervised object is relatively small. These results indicate that virtual objects can contribute to improving the classification ability of semi-supervised fuzzy co-clustering when sufficient semi-supervision is not available in the datasets.

However, it seems that too many virtual objects degrade the partition quality. In this experiment, the quality was not improved when 5 or more virtual objects per supervised object were generated. This indicates that a small number of virtual objects can contribute to improving the quality of semi-supervision, while too many virtual objects have a negative impact on the effect of semi-supervision. This may be because too many virtual objects cause overfitting to the virtual objects and poor generalization capability. It is therefore important to generate a proper number of virtual objects.

4.3. Validation of Class-Wise Partition Quality

Finally, the validity of class-wise partition quality is studied, where the best results of the virtual sample approach are compared with the original one given without virtual objects. The best results were given with for the test set of CiteSeer dataset and for the test set of Cora dataset, respectively.

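The quality of the result for each class was validated by the $F$-measure. Assume that the classification result in class $c$ is shown as in Table 8. The $F$-measure is defined as follows:

$$F = \frac{2PR}{P + R},$$

where $P$ and $R$ are precision and recall, respectively, as follows:

$$P = \frac{\text{number of objects classified to class } c \text{ with actual class label } c}{\text{number of objects classified to class } c}, \qquad R = \frac{\text{number of objects classified to class } c \text{ with actual class label } c}{\text{number of objects with actual class label } c}.$$

Precision means the proportion of objects with actual class label $c$ out of the objects classified to class $c$. Recall means the proportion of objects classified to class $c$ out of the objects with actual class label $c$. The $F$-measure is the harmonic mean of precision $P$ and recall $R$, and a larger $F$-measure implies better classification ability. Supported by the 5-fold cross-validation scheme, the average of 5 trials is compared.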
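The class-wise measure can be computed directly from predicted and actual labels, as in the sketch below. The function name is hypothetical, and returning `None` when precision or recall cannot be evaluated, or when both are zero, is an assumption about how undefined cases might be handled.

```python
import numpy as np

def class_f_measure(y_true, y_pred, c):
    """F-measure of class c, or None when it cannot be computed."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    hit = np.sum((y_pred == c) & (y_true == c))   # classified to c and actually c
    n_pred = np.sum(y_pred == c)                  # all objects classified to c
    n_true = np.sum(y_true == c)                  # all objects actually in c
    if n_pred == 0 or n_true == 0 or hit == 0:
        return None                               # precision or recall undefined or zero
    precision = hit / n_pred
    recall = hit / n_true
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
f0 = class_f_measure(y_true, y_pred, 0)   # precision 1, recall 2/3, so F = 0.8
```

In a cross-validation setting, the value would be computed per fold and averaged over the folds in which it is defined.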

Tables 9 and 10 show the $F$-measure of each class for test objects. A dash denotes a failure of the $F$-measure calculation caused by a missing $P$ or $R$, where the clustering model failed to find objects of the corresponding class in a trial after maximum membership classification with very fuzzy boundaries. These tables indicate that the results with virtual objects are superior to those without virtual objects for most classes, except for Agents and HCI in the CiteSeer dataset and Reinforcement_learning in the Cora dataset. Therefore, virtual objects can mostly contribute to improving partition quality from the viewpoint of class-wise recognition.

5. Conclusion

In this paper, the effect of virtual objects in semi-supervised fuzzy co-clustering was demonstrated. Following the previous study [13], two novel procedures for generating virtual objects for cooccurrence data analysis were proposed and their utility was investigated through numerical experiments. In the numerical experiments with two benchmark datasets, the effects of virtual objects in conjunction with the effects of the number of virtual objects per original supervised object were compared and it was indicated that classification quality of semi-supervised fuzzy co-clustering can be improved without additional cost for generating supervised objects by adding several virtual objects while the classification quality can be degraded with too many virtual objects.

Future work includes improving the quality of virtual samples. For example, it may be possible to improve the quality of the virtual samples by evaluating the plausibility of the additional samples with some cluster validity measures for fuzzy co-clustering [22]. Furthermore, besides the simple virtual copy of objects, virtual copy of items may be another possible direction.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Ministry of Education, Culture, Sports, Science and Technology, Japan, under Grant-in-Aid for Scientific Research (no. 26330281).