Abstract

Data analysis is the foundation of Internet of Things (IoT) based applications, and clustering is an effective technique for data analysis. Clustering ensemble integrates multiple base clustering results to obtain a consensus result and thus improves clustering performance in terms of stability and robustness. However, it is difficult for existing clustering ensemble algorithms to achieve a satisfactory ensemble result when the base clustering results are unreliable. Concerning this problem, we develop a new clustering ensemble model in this paper, which has several advantages over traditional algorithms: (i) structure information about the data is effectively extracted from the base clusterings; (ii) data characteristics and structure information are integrated in an elegant fashion when producing the consensus clustering result; and (iii) our model has a generative ability that allows it to achieve outstanding performance even when training samples are insufficient. In our model, the structure information is extracted by explicating the coupling relationships between base clusterings and between samples in clustering members. Then, data characteristics and structure information are combined in a generative graph representation learning framework, and the objectives of representation learning and consensus clustering are integrated into a unified optimization model, in which the prior distribution of the data is approximated by a Gaussian mixture model (GMM). Extensive experiments are conducted on multiple IoT datasets; the results show that our model not only performs better than conventional clustering ensemble algorithms but also outperforms state-of-the-art deep clustering methods.

1. Introduction

With the rapid development and widespread use of IoT technology, many types of data are constantly produced at unprecedented speed. The effective analysis and mining of these huge amounts of data have gradually become an important requirement for enhancing the value of IoT data [1]. As a typical data mining technology, clustering analysis plays an important role in many IoT data analysis scenarios, such as network energy saving [2, 3], privacy protection [4], attack detection [5, 6], service computing [7], and pattern discovery [8, 9]. The goal of clustering analysis is to divide unknown data into a set of clusters based on a certain similarity measurement between data samples, so that samples in the same cluster are close to each other and samples in different clusters differ from each other. With the help of clustering analysis, the distribution pattern hidden in unknown data can be easily identified. Typical clustering strategies mainly consist of partitioning methods, hierarchical methods, grid-based methods, and graph methods. In the past decades, a great deal of research has been conducted to improve clustering performance from multiple directions, such as similarity measurement, cluster number recognition, cluster structure optimization, atypical data processing, and performance evaluation [10, 11]. The performance and adaptability of clustering analysis in various data analysis scenarios have been improved significantly. However, as an unsupervised learning method, clustering analysis has the following limitations: (i) due to the lack of supervision information, the design of a clustering algorithm depends on subjective human assumptions; as a result, different clustering algorithms may produce distinct partitioning results on the same data. (ii) Searching for the optimal clustering result is often a nonconvex optimization problem; thus, the clustering result often depends heavily on the input parameters and initializations. (iii) Real-world data, such as IoT data, is often multidimensional or multisource; therefore, one cluster may have diverse distributions or structures from different perspectives, and it is difficult for any single clustering method to identify all cluster patterns completely.

To solve these problems, clustering ensemble has been proposed, driven by the idea of ensemble learning. Clustering ensemble can obtain a final partition that describes the inner cluster structure of the data more effectively by combining multiple clustering results on the data. Compared with any single clustering algorithm, the result of clustering ensemble has significant advantages in terms of reliability, robustness, interpretability, and scalability. Besides, clustering ensemble is friendly to parallel computation and distributed deployment. There are two main phases in clustering ensemble [12]: (i) generating a set of cluster partitions for the data, which are called base clusterings, and (ii) designing an efficient consensus function to integrate the base clusterings into a final partition. It has been shown that the validity of the ensemble result is closely related to the diversity of the base clusterings. To this end, several strategies are used to generate disparate base clusterings [13], such as using different clustering algorithms, setting different parameters or initializations for a clustering algorithm, extracting different subsets of data, and projecting the data onto different feature spaces. To produce a consensus clustering result by combining base clusterings, many works have focused on designing various consensus functions for the clustering ensemble model. Each consensus function abstracts the base clustering results into a specific form of ensemble-information matrix. Based on three general types of such matrices (i.e., the label-assignment matrix, the pairwise similarity matrix, and the binary cluster-association matrix), the consensus functions found in these works can be categorized into four major families: (i) relabeling strategies [14]. These algorithms find label correspondence, relabel each partition in accordance with a reference partition, and produce the final result with a combination method such as voting. (ii) Feature-based methods [15, 16]. These techniques predict cluster assignments using the nominal information originally obtained from base clusterings, without searching for correspondence among labels or relabeling. (iii) Pairwise similarity-based algorithms [17, 18]. This category of clustering ensemble methods is based principally on the pairwise similarity among data samples. (iv) Graph-based approaches [19, 20]. This family of strategies utilizes the graph structure to solve the clustering ensemble problem; they generally construct a weighted graph from the base clusterings and produce the final result by partitioning the graph using certain graph partitioning methods. In recent years, some clustering ensemble algorithms [18, 21] have been used in IoT data analysis and achieved performance superior to a single clustering algorithm. It is worth noting that IoT data usually includes a large amount of explicit characteristic information, as well as abundant structure information that describes the intrinsic organization of the data. Data characteristics and structure information describe IoT data from different aspects; therefore, both of them can provide valuable guidance on producing the final ensemble result. However, existing clustering ensemble algorithms, to our knowledge, produce the final partition either by employing base clusterings in feature space or by exploring structure relations; they seldom consider combining these two types of data information in the design of the consensus function.
This limitation means that the clustering ensemble result may suffer from unreliable base clusterings or incomplete data information.

To explore and utilize the various types of information implicit in IoT data comprehensively, we propose a novel clustering ensemble model in this paper, which integrates data characteristics with structure information in producing the consensus clustering result. As will be discussed, our work is devoted to solving the following two key problems to achieve this information integration: (i) how can we extract effective structure information hidden in the raw data? In general, structure information of the data can be expressed by certain relationships, such as the pairwise similarity among data samples, the nominal information originally obtained from an ensemble, and the associations between data samples or among clusters. In fact, the raw data and the base clustering results can be viewed as different organization forms of the same data samples. Therefore, extracting the structure information by solely focusing on the similarity between data samples or the associations among base clusterings is far from sufficient. This is the first key problem we intend to address. (ii) How can we integrate data characteristics and structure information into appropriate representations for producing the final ensemble result? Data characteristics and structure information describe the data in different spaces and interact with each other in the formation of cluster structure. However, they cannot be processed simultaneously in existing ensemble strategies. Therefore, how to combine these two different types of information elegantly and learn appropriate representations for clustering ensemble is another key problem.

For the first key problem, we consider capturing the coupled clustering and sample similarity from base clusterings to describe the structure information of the data. On the one hand, all the base clusterings are produced on the same data, so there must be some relationships among these ensemble members. On the other hand, samples from the data are more or less associated through certain coupling relationships rather than being independent. Based on this knowledge, we extract structure information by explicating and integrating the coupling relationships between base clusterings and between data samples.

To address the second key problem, we employ a variational graph autoencoder (VGAE) [22] module to learn, from both data characteristics and structure information, representations that are suitable for the clustering objective. By assuming the prior distribution of the latent representations to be a Gaussian mixture, we derive a joint optimization model that combines representation learning and cluster partitioning into a unified framework.

In this work, we propose a novel clustering ensemble model with the motivation of integrating data characteristics with structure information when aggregating clustering results, by employing the powerful representation ability of deep learning. In fact, our work can be viewed as an improvement of the clustering ensemble approach with structure constraints to handle data with complex distributions. Alternatively, it can also be viewed as an enhancement of deep clustering methods by imposing a global model explicitly in the latent space. Our main contributions can be summarized as follows:
(i) We discuss how to capture effective structure information for the data by exploiting the base clustering results comprehensively.
(ii) We design a specific encoder-decoder network to transform the integration of data characteristics and structure information into a graph representation learning problem, by treating the data as a graph organized by global structure relationships. We employ a mixture of Gaussians to approximate the prior distribution of the latent representation, which is by nature a tractable parametric model for clustering tasks.
(iii) We construct a unified optimization model that aggregates representation learning and consensus cluster partitioning and show how to train the network by maximizing the evidence lower bound (ELBO) using the stochastic gradient variational Bayes (SGVB) estimator and the reparameterization trick.
(iv) Extensive experiments on several IoT datasets demonstrate the superiority of our approach in comparison with several clustering ensemble algorithms and deep clustering methods.

The remainder of this paper is organized as follows. In Section 2, related works are introduced. In Section 3, our generative clustering ensemble model is proposed, and each component of the model is illustrated in detail. To evaluate the performance of the proposed model, a series of experiments are conducted and analyzed in Section 4. Finally, conclusions and a discussion of future work are given in Section 5.

2. Related Works

In this section, we introduce the most related works: autoencoder (AE) and variational autoencoder (VAE), representation learning for clustering, and graph representation learning.

2.1. Autoencoder and Variational Autoencoder

AE can be regarded as a nonlinear generalization of PCA for reducing data dimensionality, in which high-dimensional data is converted to low-dimensional codes by training a multilayer neural network with a small central layer. AE is composed of three basic elements: an adaptive, multilayer encoder network that transforms the high-dimensional data into a low-dimensional code; a similar decoder network that recovers the data from the code; and a loss function that evaluates the information lost in dimensionality reduction. There are several important characteristics of the autoencoder: (i) the model is learned automatically from data samples, (ii) the reconstructed data is degraded compared with the original data, and (iii) the model is data-specific.
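For concreteness, the following is a minimal sketch of this encoder-decoder structure in PyTorch; the layer sizes and the MSE reconstruction loss are illustrative choices, not the configuration used later in this paper.

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=10):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim))
        # Decoder: code -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# The reconstruction loss measures the information lost by dimensionality reduction.
model = AutoEncoder()
x = torch.rand(32, 784)
x_hat, _ = model(x)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()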

Using a similar encoder-decoder structure, the idea of VAE actually has relatively little to do with classical AE models but is deeply rooted in variational Bayesian methods [23, 24]. Instead of mapping the input to a fixed vector, the VAE maps it to a distribution in the latent space. By sampling from the latent distribution, the decoder network can be viewed as a generative module that creates new samples similar to, but not identical to, the training data. The assumptions of this model are relatively weak, and its training is fast via backpropagation. VAE does make an approximation, but the error introduced by this approximation is arguably small. These characteristics make VAE a popular generative model.
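The sketch below illustrates the two points that distinguish VAE from a classical AE: the encoder outputs a distribution (mean and log-variance) instead of a fixed code, and sampling is done with the reparameterization trick so that training can proceed by backpropagation. The standard normal prior, layer sizes, and Bernoulli reconstruction term are illustrative; the model in Section 3 replaces the standard normal prior with a Gaussian mixture.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, in_dim=784, z_dim=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, z_dim)
        self.fc_logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl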

2.2. Representation Learning for Clustering

Recently, some works have focused on exploiting the powerful representation ability of deep learning models to learn better data representations for the clustering task, in order to improve clustering performance. In [25], a two-stage deep clustering framework is constructed, in which deep learning is used to acquire feature representations in a subspace, and then these features are used to predict the cluster assignments. To guarantee that the learned features are suitable for the clustering task, many later works attempt to incorporate the clustering objective into the deep learning framework. Specifically, the deep embedded clustering (DEC) algorithm [26] learns low-dimensional data representations using an autoencoder construction without the decoder and proposes an auxiliary target distribution based on the soft cluster assignments produced on the learned representations. In DEC, the clustering optimization and the training of the encoder parameters are implemented simultaneously in a self-learning form. To overcome the misguidance issue in feature mapping, Guo et al. [27] utilized a complete autoencoder to improve DEC, in which the clustering loss not only accomplishes cluster partitioning but also guarantees that the learned data representations maintain the original local structure. Similarly, DECE [28] uses a convolutional autoencoder and a single-layer classifier to learn the data representation and the cluster distributions, respectively, in which the DNN is optimized by minimizing the reconstruction error and the relative entropy between the cluster distributions and their prior. In [29], the authors proposed a joint dimensionality reduction (DR) and k-means clustering algorithm to obtain "clustering-friendly" latent representations and a better clustering result simultaneously. Its optimization criterion is composed of three parts: dimensionality reduction realized by a SAE framework, data reconstruction, and cluster structure-promoting regularization. Bo et al. [30] developed a structural deep clustering network (SDCN) to integrate structural information into deep clustering, in which a delivery operator is designed to transfer the representations learned by the autoencoder to the corresponding GCN layer, and a dual self-supervised mechanism is constructed to unify these two DNN architectures and guide the update of the whole model.

In the above algorithms, AE learns effective representations for the input data and then reconstructs samples with a one-to-one correspondence to the original data. Unfortunately, such deep clustering models may suffer from overfitting caused by the AE network, due to its powerful learning capacity and the insufficiency of training samples. To overcome this drawback, some works replace AE with VAE as the network construction for deep clustering. In VAE, the encoder instead finds a mapping to a data distribution, and the latent variables sampled from the learned distribution can effectively capture the statistical characteristics of the data. The decoder, also named the generator, is then capable of generating new data samples for any cluster distribution. This mechanism helps the clustering model acquire more abundant information about the inherent cluster structure of the data. For example, Jiang et al. [31] designed an unsupervised generative deep clustering algorithm, variational deep embedding (VaDE), in which the cluster distributions are modeled by a GMM and the latent representations of the data are learned by a DNN. The solution of VaDE is realized by variational inference; specifically, its ELBO is optimized using the SGVB estimator and the reparameterization trick. Choong et al. [32] considered community structure discovery as a graph clustering problem and proposed a generative model, namely, the variational graph autoencoder for community detection (VGAECD). Unlike traditional approaches, VGAECD does not require a predefined community structure, and it is capable of exploiting the feature-rich information of a network. Hwang et al. [33] addressed the issue of clustering complex and high-dimensional wafer maps in semiconductor manufacturing by proposing a variational deep clustering algorithm, namely, one-step VAE+DPGMM. In this algorithm, a GMM is incorporated into a VAE framework to extract features more suitable for the clustering task, and a Dirichlet process is further applied in the variational autoencoder mixture framework for automated one-step clustering. Compared with conventional two-step clustering methods, the model can considerably increase the chance of distinguishing small differences in wafer map patterns.

All the algorithms above construct deep clustering frameworks based on deep networks, and they provide effective data representations for the clustering task by exploiting the powerful representational ability of deep networks. However, these algorithms all focus on the general clustering problem rather than clustering ensemble, and they predict cluster assignments using data characteristics alone while neglecting the structure information implicit in the data. This fact motivates us to consider how to take full advantage of deep networks to learn appropriate representations from multiple types of data information, in order to enhance the performance of the clustering ensemble model.

2.3. Graph Representation Learning

Structure information reveals the intrinsic relationships among data samples, which can provide important guidance for learning data representations for the clustering ensemble task. As a ubiquitous data organizational form, the graph has many intuitive advantages in describing the structure information of data: it captures interactions between data samples and allows the structure information to be efficiently recorded and accessed. In order to incorporate graph-formed structure information into a machine learning model, graph representation learning can be employed to encode the high-dimensional, non-Euclidean graph information into a feature vector. Graph representation learning aims to convert graph data into a low-dimensional, compact, and continuous feature space while preserving the topological structure, vertex content, and other side information of the graph as completely as possible. From the encoder-decoder perspective, various graph representation learning methods can be abstracted into a framework consisting of two key mapping functions [34]: an encoder, which maps each vertex to a low-dimensional vector or embedding, and a decoder, which reconstructs the graph data from the learned embeddings. Generally, the objective of the encoder-decoder graph representation learning model is optimized by minimizing the reconstruction error or the loss of pairwise vertex similarities between the input graph and the reconstructed graph. Most graph representation learning methods fall into two broad categories: (i) shallow representation approaches, which are largely inspired by classic matrix factorization techniques [35] or random walks [36, 37] and use an embedding lookup encoder function, and (ii) generalized encoder-decoder architectures, which use a more complex encoder, often based on DNNs [38, 39] and dependent more generally on the topological structure and vertex attributes [40] of the graph. In the latter category, VAE is a common DNN construction. Some works introduce VAE by adding a prior constraint to compress information about a node's local neighborhood. For example, the algorithm in [22] learns representations of an attributed network under the VAE framework by employing a graph convolutional network (GCN) encoder and an inner product decoder. To address the incomplete filtering issue encountered in traditional GCN-based graph autoencoders, [41] proposed graph convolutional autoencoders with colearning of graph structure and node attributes (GASN) based on VAE. GASN encodes and decodes the node attributes and graph structure comprehensively by using a completely low-pass graph encoder and a high-pass graph decoder.

3. The Proposed Model

Given a set of IoT data samples $X=\{x_1, x_2, \dots, x_N\}$ in $d$-dimensional space, each sample $x_i$ is represented by a vector of attribute values. Also, let $\Pi=\{\pi_1, \pi_2, \dots, \pi_M\}$ be a set of base clusterings and $\pi_m$ be the $m$th base clustering, such that $\pi_m=\{C_1^m, C_2^m, \dots, C_{k_m}^m\}$, where $k_m$ denotes the number of clusters in $\pi_m$ and $C_j^m$ is the $j$th cluster. For each $x_i$, $\pi_m(x_i)$ denotes the cluster label in the $m$th base clustering to which $x_i$ belongs. The label set containing all the different labels in base clustering $\pi_m$ is denoted by $L_m$, and the set of samples whose cluster label in $\pi_m$ is $l$ is specified as $X_l^m$. The problem of clustering ensemble is to find a new partition $\pi^*=\{C_1^*, C_2^*, \dots, C_K^*\}$ of $X$, where $K$ denotes the number of clusters in the final result.

3.1. Overview of the Model

In this section, we illustrate our proposed generative clustering ensemble (GCE) model, whose overall framework is shown in Figure 1. In this work, the data characteristics mainly refer to the data features, which express the properties of individual data objects, and the structure information refers to the relationships between data objects, which describe the intrinsic connections between different objects in the data. We first discuss how to effectively extract structure information from an ensemble of base clusterings. Then, we construct a variational network that combines representation learning and the clustering ensemble objective into a joint optimization process. In the representation learning phase, we treat data samples as the vertices of a relationship graph and integrate data characteristics with structure information in a VGAE module. In the production of the consensus clustering result, we utilize a GMM to approximate the prior distribution of the latent representation. Finally, we optimize the encoder-decoder module and the GMM jointly in the form of stochastic inference.

3.2. Structure Information Extracting

In our proposed GCE model, it is important to construct an appropriate affinity matrix to describe the structure information of the data. In many graph-based clustering ensemble methods, the affinity matrix is constructed by finding several nearest neighbors of a given sample and evaluating their similarities through a certain mapping function. The similarity defined in this way is limited to revealing local relationships between data samples rather than global relationships over the whole dataset. In the clustering ensemble context, we can actually acquire richer information about the data structure from the base clusterings. Different from typical methods, we believe that the global relationships among the data can be determined by two elements [42]: the coupling of base clusterings and the coupling of data samples.

3.2.1. Coupling of Base Clusterings

The coupling of base clusterings is defined to represent the relationships among different ensemble members, and it is composed of two components: intracoupling, which reflects the occurrence frequency of cluster labels within a base clustering, and intercoupling, which indicates the interaction between two base clusterings. Specifically, we define the coupled clustering similarity for clusters (CCSC) between two cluster labels in a certain base clustering to describe the coupling relationship of base clusterings, which can be calculated as

is the CCSC between and , is the intracoupled clustering similarity for clusters (IaCSC) between and , is the intercoupled relative similarity for clusters (IcRSC) between and based on another base clustering , and is the label set of . The IaCSC captures the base clustering frequency distribution by calculating the frequency of cluster labels within a base clustering, and it is defined as

The IcRSC characterizes the base clustering dependency aggregation by comparing cooccurrence of the cluster labels among different base clusterings. It is defined as where is the weight of base clustering , , and is calculated as

In equation (4), is a set that represents , and is the subset of cluster labels in base clustering for the corresponding samples .

3.2.2. Coupling of Data Samples

Similarly, the coupling relationships among data samples can also be discussed from the intraperspective and the interperspective. In terms of the intraperspective, the similarity between two data samples is represented by the intracoupled sample similarity (IaSS), which is defined as

The IaSS refers to the average of the CCSC between the associated cluster labels over all the base clusterings. From the interperspective, we can describe the interaction between different samples by mining the correlation among their neighbors. Accordingly, we define the intercoupled sample similarity (IeSS) between two samples using their common neighbors: where the IeSS between two samples is computed in terms of the other samples in their neighbor sets. The neighbor set of a sample is defined as where is the kernel function and is a threshold. For instance, with the Gaussian kernel, the kernel function is defined as where is a normalizer, is the width of the Gaussian kernel, and denotes a certain similarity measure for samples, such as the Euclidean distance for numeric attributes or the Jaccard coefficient for categorical attributes.

3.2.3. Construction of Structure Information

Obviously, the position of a data sample in a clustering depends on which cluster it belongs to. Thus, the clustering coupling and sample coupling can be integrated through the corresponding clusters. Specifically, we employ IaSS as the similarity measure in equation (8) to define coupled clustering and sample similarity (CCSS) between samples, and we have where is the CCSS between and and the neighbor sets are defined as and , respectively. The kernel function in equation (8) can be rewritten as

In this way, the CCSS not only takes into account both the intracoupled and intercoupled interactions between base clusterings but also incorporates both the intracoupled and intercoupled relationships between samples. Given a series of clustering members, we can define an affinity matrix that stores the structure information about the data, where each entry of the matrix denotes the global similarity between two samples. Specifically, we set the value of each entry as where is defined in equation (9).
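As a rough illustration of how such an affinity matrix can be assembled from an ensemble, the sketch below uses the plain co-association frequency between samples as a simplified stand-in for the CCSS (it ignores the intra- and intercoupling terms derived above) and zeroes out entries below a threshold, mirroring the sparsification role of equation (11); the function name and the threshold value are ours, not the paper's.

import numpy as np

def affinity_from_base_clusterings(labels, delta=0.5):
    """Simplified sketch: labels is an (m, n) array holding m base clusterings
    of n samples. The co-association frequency is used as a stand-in for the
    CCSS; entries below the threshold delta are set to zero."""
    m, n = labels.shape
    A = np.zeros((n, n))
    for p in range(m):
        A += (labels[p][:, None] == labels[p][None, :]).astype(float)
    A /= m                      # fraction of base clusterings that co-assign i and j
    A[A < delta] = 0.0          # keep only sufficiently similar pairs
    np.fill_diagonal(A, 1.0)
    return A

# Example: three base clusterings of five samples.
labels = np.array([[0, 0, 1, 1, 2],
                   [0, 0, 0, 1, 1],
                   [1, 1, 2, 2, 2]])
print(affinity_from_base_clusterings(labels))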

3.3. The Generative Clustering Ensemble Model

Based on the affinity matrix, the data can be viewed as a sample similarity graph carrying both data characteristics and structure information. To combine these two types of data description, we design a joint clustering ensemble model within the framework of VGAE, in which the prior of the latent representation is approximated as a mixture of Gaussian distributions.

3.3.1. Inference Model

Given a dataset and its structure information formed by an affinity matrix , the inference model is parameterized by a two-layer GCN:

Here, denotes the latent representation. For a certain data sample and its structure information vector, the corresponding latent vector in the low-dimensional space can be obtained by where and are the mean and variance of the latent vector. Each latent vector is sampled from a distribution produced by a GCN. The GCN structure is defined as where the function is a graph convolutional layer and and are learnable weight matrices of the first and second layers, respectively. The first-layer weight matrix is shared between the mean branch and the variance branch.
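A sketch of this inference network, following the standard VGAE construction of [22], is given below; the symmetric normalization of the affinity matrix, the layer widths, and the explicit weight sharing of the first layer are written out for illustration, while the exact dimensions used in our experiments are not specified here.

import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as commonly used with GCNs.
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

class GCNEncoder(nn.Module):
    """Two-layer GCN encoder with a shared first layer, producing the mean and
    log standard deviation of each node's latent representation."""
    def __init__(self, in_dim, hid_dim=32, z_dim=16):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hid_dim, bias=False)     # shared first layer
        self.W_mu = nn.Linear(hid_dim, z_dim, bias=False)
        self.W_sigma = nn.Linear(hid_dim, z_dim, bias=False)

    def forward(self, X, A_norm):
        H = F.relu(A_norm @ self.W0(X))
        mu = A_norm @ self.W_mu(H)
        log_sigma = A_norm @ self.W_sigma(H)
        # Reparameterization: z = mu + sigma * eps
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        return z, mu, log_sigma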

3.3.2. Generative Model

In our model, we assume the training data is generated from a Gaussian mixture distribution; i.e., the clustering ensemble result can be approximated by a GMM, and each sampled latent vector should lie in a cluster modeled by one Gaussian component with a certain probability. For each training data sample, we learn a latent representation and introduce a vector of prior cluster proportions, whose entries are nonnegative and sum to one, to indicate the prior cluster distribution of the data. The generative process can be modeled as follows:

(i) From the consensus clustering partition, sample a cluster according to the categorical distribution parameterized by the prior cluster proportions.
(ii) From the picked cluster, sample a latent vector from the corresponding Gaussian component, whose mean and variance characterize that cluster.
(iii) From the sampled latent vector, generate the observation. For binary data, a multivariate Bernoulli distribution is chosen, whose parameter is computed by the decoder; for real-valued data, a multivariate Gaussian distribution is chosen, whose mean and variance are learned by the decoder. The decoder is a nonlinear function parameterized by the network weights, and in our model, the inner product decoder is used.
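A minimal sketch of these three sampling steps, with illustrative dimensions and the inner-product decoder used for the reconstructed graph, is as follows; the variable names are ours.

import torch

def generate(pi, mu_c, sigma_c, n_samples):
    """Sketch of the generative process: pick a cluster from Cat(pi), sample a
    latent vector from the corresponding Gaussian component, and reconstruct
    the sample-similarity graph with the inner-product decoder."""
    c = torch.multinomial(pi, n_samples, replacement=True)   # cluster assignments
    eps = torch.randn(n_samples, mu_c.size(1))
    z = mu_c[c] + sigma_c[c] * eps                           # z ~ N(mu_c, diag(sigma_c^2))
    A_rec = torch.sigmoid(z @ z.t())                         # inner-product decoder
    return c, z, A_rec

# Example with K = 3 clusters in a 2-dimensional latent space.
pi = torch.tensor([0.5, 0.3, 0.2])
mu_c = torch.randn(3, 2)
sigma_c = torch.ones(3, 2)
c, z, A_rec = generate(pi, mu_c, sigma_c, n_samples=10)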

According to the above generative process, we can factorize the joint probability as

Since the data characteristics and the structure information are conditionally independent given the latent representation, we have

3.4. Learning Algorithm

Our GCE model can be tuned by maximizing the log-likelihood of the given data samples as

By using Jensen's inequality, we have where denotes the evidence lower bound (ELBO) and is the variational posterior approximating the true posterior. By assuming the variational posterior to be a mean field distribution, we can factorize it as

According to equations (16) and (22), the ELBO can be rewritten as equation (23). By substituting the inference model defined by equations (13), (17), (18), and (19) and using the Monte Carlo SGVB estimator, the ELBO can be further rewritten as equation (24), where is the total number of samples in the SGVB estimator, and are the dimensionalities of the training data and the latent vector, respectively, is the element of , and denote the and element of vector , respectively, and is calculated with the Monte Carlo sample picked from the distribution defined in equation (13). In order to employ gradient backpropagation through the stochastic layer, the reparameterization trick is used here, and the latent sample can be calculated as where the noise vector is sampled from a standard normal distribution, is the element-wise multiplication operator, and and are learned by the GCN network formulated by equation (14).

In our variational clustering ensemble framework, the consistent clustering result is obtained by finding the posterior distribution that maximizes the ELBO. By regrouping the like terms in equation (23), the ELBO can be further rewritten as where is the Kullback-Leibler divergence that measures the distance between two distributions and is a Gaussian prior distribution for the latent vector. The first term in equation (26) is independent of the cluster posterior, and the second term is nonnegative by the definition of KL divergence. Thus, the ELBO achieves its maximum value when the KL term vanishes. Consequently, the optimal distribution can be approximated by

Since the representation learning and the cluster partitioning are incorporated in an integrated framework, the latent vector is guaranteed to be an appropriate representation of for clustering ensemble, and we use as an approximation to . Meanwhile, the information loss introduced by the mean field assumption in equation (19) can be mitigated by the relationship between and captured in .
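To make the cluster assignment step concrete, the sketch below computes the responsibilities of the Gaussian components for a batch of latent vectors, i.e., the standard GMM posterior used here as an illustration of approximating the cluster posterior by the component probabilities given the latent vector; all names are ours, and the consensus label of each sample is the component with the largest responsibility.

import math
import torch

def cluster_responsibilities(z, pi, mu_c, sigma_c):
    """Responsibilities proportional to pi_c * N(z; mu_c, diag(sigma_c^2)),
    computed in log space for numerical stability; z is (n, d), pi is (K,),
    mu_c and sigma_c are (K, d)."""
    diff = (z.unsqueeze(1) - mu_c) / sigma_c                    # (n, K, d)
    log_pdf = -0.5 * (diff ** 2 + 2 * torch.log(sigma_c)
                      + math.log(2 * math.pi)).sum(dim=2)       # (n, K)
    log_post = torch.log(pi) + log_pdf                          # unnormalized log posterior
    gamma = torch.softmax(log_post, dim=1)                      # rows sum to 1
    return gamma, gamma.argmax(dim=1)

# Example with K = 2 components in a 2-dimensional latent space.
z = torch.randn(5, 2)
pi = torch.tensor([0.6, 0.4])
mu_c = torch.tensor([[0.0, 0.0], [3.0, 3.0]])
sigma_c = torch.ones(2, 2)
gamma, labels = cluster_responsibilities(z, pi, mu_c, sigma_c)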

To further explore how our optimal model could work on producing a consensus clustering result by incorporating data characteristics and structure information, we rewrite the ELBO in equation (21) as

It is obvious that the first term in equation (28) is a reconstruction component, which encourages our framework to employ the latent embeddings and the clustering ensemble result to explain the relationships among data samples effectively. The second term is the KL divergence between the variational posterior and the prior distribution modeled by a Gaussian mixture. This KL divergence can be considered a regularization term in our optimization objective that encourages the learned representations to lie on a Gaussian mixture manifold. As a result, two advantages can be clearly recognized: (i) from an overall perspective, our framework jointly optimizes the VGAE and the GMM to obtain effective data representations and an appropriate cluster partitioning; (ii) in particular, in the representation learning phase, the data characteristics and the structure information are integrated elegantly in a generative framework.

3.5. Overall Implementation

By integrating the above derivation steps and optimization solution, the implementation of the proposed GCE model is summarized in Algorithm 1.

Input: Data samples X, learning rate, number
   of Monte Carlo samples in SGVB
   estimator M, epochs L.
Output: Consistent clustering result.
1 Produce an ensemble of base clusterings for X;
2 From Equation (11), construct the affinity matrix
  A to represent structure information for X;
3 Initialize (pretrain) the encoder-decoder and the GMM parameters;
4 for epoch = 1 to L do
5  for m = 1 to M do
6    Compute the mean of the latent representation with the GCN encoder;
7    Compute the variance of the latent representation with the GCN encoder;
8    Sample a noise vector from the standard normal distribution;
9    Sample the latent vector with the reparameterization trick;
10       Generate the reconstructed affinity matrix with the decoder;
11       From Equation (24), compute the ELBO;
12       Backpropagate gradients;
13  end
14 end
15 From Equation (27), obtain the category
  assignment for each sample;
16 return the consistent clustering result

4. Results and Discussion

In this section, a number of experiments are conducted on several IoT datasets to evaluate the validity and superiority of the proposed model.

4.1. Datasets and Evaluation Metrics

Several widely known real datasets are employed for testing, namely, KDD'99, NSL-KDD, AWID, and UCI-IoT. KDD'99 is a comprehensive network flow dataset, which is usually used as a benchmark in intrusion detection tasks. NSL-KDD is a dataset suggested to solve some of the inherent problems of KDD'99; it removes redundant and duplicate records and is more suitable for comparing different intrusion detection methods. AWID is another commonly used network security dataset, which consists of both normal and intrusive network traffic records collected from real 802.11 wireless networks. UCI-IoT is a real traffic dataset collected from many commercial devices, whose goal is to recognize 11 different types of traffic situations: one normal operation and 10 malicious attacks. All these datasets are very large and complex, so it is difficult to conduct experiments on the whole datasets. Thus, preprocessing is required to construct appropriate datasets for our experiments. For the above four datasets, we randomly draw 20,000 samples from each of them. In these preprocessed datasets, symbolic features are mapped to a series of numeric values by one-hot encoding for ease of handling. Besides, we also employ Network Simulator 2 (NS2) to generate a wireless sensor network (WSN) security dataset, in which 50 sensor nodes are used to simulate attacker records. The sensor nodes transmit protocol messages and data messages according to the Ad hoc On-Demand Distance Vector (AODV) protocol with a constrained bit rate. These synthetic records contain normal messages and 7 types of security attack cases. The details of all the datasets used here are summarized in Table 1.

In the experiments, we use the clustering accuracy rate (CAR), adjusted Rand index (ARI), and normalized mutual information (NMI), which are widely utilized in clustering tasks, to evaluate the performance of different algorithms. For a set of data samples, the predicted clustering result and the true category assignment are compared: the numbers of samples in each predicted cluster and in each true category are counted, and the number of samples shared by a predicted cluster and a true category is recorded. Then, the three clustering performance indexes can be defined as follows.
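Since these indexes are standard, the sketch below shows how they would typically be computed (not necessarily how the authors computed them): ARI and NMI via scikit-learn, and the clustering accuracy via the optimal one-to-one matching between predicted clusters and true categories found with the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: best one-to-one mapping between predicted clusters
    and true categories, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)      # maximize matched samples
    return cost[row, col].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))            # 1.0
print(adjusted_rand_score(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0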

4.2. Contrastive Algorithms

In the experiments, we compare the proposed model with several representative clustering ensemble methods, which consist of four categories:
(i) Two relabeling-based approaches, including the selective voting (SV) and selective weighted voting (SWV) ensemble algorithms [14].
(ii) Two feature-based approaches, including the expectation maximization (EM) algorithm [15] and the coclustering ensemble (CoCE) approach [16].
(iii) Two pairwise similarity approaches, including the Weighted Connected-Triple (WCT) algorithm proposed by Iam-On et al. [17] and the Hierarchical Flexi Ensemble Clustering (HFEC) model [18].
(iv) Two graph-based approaches [19], including the cluster-based similarity partitioning algorithm (CSPA) and the ultrascalable spectral clustering (U-SPEC) algorithm [20].

From another perspective, the proposed model can be viewed as an improved deep clustering method that incorporates the structure information extracted from base clusterings. Thus, in our experiments, we also compare it with three deep clustering methods including DEC [26], IDEC [27], and VaDE [31].

4.3. Experimental Setup

To evaluate all algorithms under the same conditions, the following experimental settings are adopted:
(i) For all algorithms and all datasets, the numbers of clusters are set to the true numbers of categories, and the parameters of all the reference algorithms are set according to their authors' suggestions.
(ii) k-means is run 200 times independently with random and different initializations to produce 200 base clustering results. These results are equally divided into 10 subsets, each consisting of 20 base results. Each ensemble clustering algorithm in the experiments is then run on these subsets to produce 10 ensemble results, and the average values of these ensemble results are reported as the final outcomes for comparison.
(iii) For a fair comparison, the deep clustering algorithms employ the network architectures adopted in the GCE model. All the layers in the encoder-decoder framework are fully connected, and ReLU is selected as the activation function. The network constructions of the encoder and the decoder are mirrored, with the input layer size equal to the dimensionality of the data. To improve computational efficiency, the Adam optimizer is used, and the minibatch size is 100. The learning rate is initialized to 0.02 and decreases every 10 epochs with a decay factor of 0.9 (see the sketch after this list). To prevent the models from getting trapped in local optima or saddle points at the beginning of training, the pretraining method of DEC is adopted in these deep clustering algorithms. In the testing experiments, all the deep clustering algorithms are executed 50 times on each dataset, and the average results are used for comparison.
(iv) For VaDE and the proposed model, a stacked autoencoder is used to pretrain the encoder and decoder. In the proposed model, the Gaussian kernel is employed to extract the structure relationships between data samples, and the kernel width is tuned for different datasets with a step size of 0.1. Besides, the parameter in the proposed model is set to 0.5.
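As a small illustration of the optimization settings described in item (iii), the following snippet configures Adam with the stated initial learning rate and decay schedule; the model and the dummy minibatch loss are placeholders, not the networks or loss used in the paper.

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the encoder-decoder network
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)
# Decay the learning rate by a factor of 0.9 every 10 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(30):
    loss = model(torch.randn(100, 10)).pow(2).mean()   # dummy minibatch of size 100
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                    # step the schedule once per epoch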

4.4. Experimental Results
4.4.1. Performance Analysis

Tables 2–4 report the CAR, ARI, and NMI results of GCE and the compared algorithms on all datasets, respectively. The best results are marked in boldface. From these results, several notable points can be found:
(i) Compared with k-means, both the clustering ensemble algorithms and the deep clustering algorithms achieve better results in terms of CAR, ARI, and NMI. It can be concluded that the cluster assignments produced by k-means are quite unreliable, as each dataset consists of several linearly inseparable categories. Clustering ensemble algorithms enhance the clustering performance effectively by integrating multiple weak base results. Deep clustering algorithms learn representations in the latent space that are appropriate for the clustering objective; as a result, they recognize the cluster patterns of these datasets more reliably than k-means, which partitions the raw data directly.
(ii) Compared with the reference clustering ensemble algorithms, which produce the final clustering result solely by means of unreliable base clusterings, our GCE model outperforms them considerably on all datasets. Different from these contrastive clustering ensemble algorithms, our model not only explores the base clusterings to extract comprehensive structure information but also integrates the data characteristics in feature space with the extracted structure information to generate an effective reorganization of the data. In this way, the unreliable partitioning in the base clusterings can be rectified to a certain extent. That is why our model can easily outperform the other clustering ensemble algorithms.
(iii) From the experimental results, we can also find that the GCE model achieves better clustering performance in terms of CAR, ARI, and NMI than the reference state-of-the-art deep clustering algorithms. The results strongly demonstrate the effectiveness and superiority of our joint representation learning strategy, which elegantly integrates structure information and data characteristics for the prediction of cluster assignments. It is worth noting that, on most datasets, VaDE achieves the second best results after our model. This is mainly because these two algorithms can recognize cluster distributions more precisely from the randomly sampled subsets owing to their capability of generating samples, and both VaDE and GCE utilize the GMM as their classifier, which can approximate arbitrary distributions smoothly. Unlike VaDE, our model learns joint data representations that provide positive guidance for depicting the formation of clusters, depending not only on data characteristics but also on the structure information captured from the base clusterings. That is to say, the representations learned in GCE contain richer information about the intrinsic patterns of the data. Therefore, the GCE model outperforms VaDE.

To illustrate the impact of randomness on the GCE model, the standard deviations (std) of its results (CAR, ARI, and NMI) over the 10 base clustering subsets are listed in Table 5. It can be seen that the std value is very small on each dataset. Thus, we can conclude that the GCE model is strongly robust to randomness.

4.4.2. Parameter Analysis

In this part, we analyze the relationship between the neighborhood-threshold parameter and the ensemble result of the GCE model. This parameter controls the size of the neighbor set in structure information extraction, which in turn affects the construction of the affinity matrix for data reorganization. A smaller value brings more samples into the neighbor set of a given sample, and more neighbors involve more information that assists the cluster partitioning. However, the other side of the coin is that inconsistent neighbors are also more likely to be included, which may mislead the cluster partitioning and introduce additional computation. In the experiments, we run the GCE model 50 times on the 5 datasets with different values of this parameter, and the average results on each dataset are shown in Figure 2. According to these results, some observations can be made: (i) the proposed model achieves its best result when the parameter takes a certain intermediate value, and its performance degrades as the value gets too large or too small; that is to say, a neighbor set that is too large or too small may lead to a degradation of the clustering performance. (ii) No matter what value the parameter takes, the GCE model retains an advantage over the reference deep clustering algorithms. This demonstrates that introducing structure information into representation learning does enhance the ability to capture cluster distributions.

5. Conclusion

A novel clustering ensemble model is developed in this paper to effectively integrate data characteristics and structure information. Different from conventional clustering ensemble algorithms, which generate the final clustering result relying solely on potentially unreliable base clusterings, our model produces the consensus cluster assignment depending on both the base clusterings and the raw dataset. It first exploits structure information about the data from a set of base clusterings and then transforms the integration of data characteristics and structure information into a graph representation learning problem by reconstructing the data as a sample similarity graph. The learned data representations not only capture descriptions of the data from both the feature space and the structure space but are also suitable for the clustering objective; thus, the final consensus clustering result can be reliably acquired from them. We conduct experiments comparing the model with several state-of-the-art clustering ensemble and deep clustering algorithms on several IoT datasets. The experimental results demonstrate the effectiveness and superiority of the proposed model in contrast to the reference algorithms.

In future work, we will extend our model to more complex data types and application tasks. We also plan to improve it to handle very large-scale datasets by optimizing its execution mechanism.

Data Availability

The datasets used in the experiments can be acquired from the following websites: KDD’99—http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html; NSL-KDD—https://www.unb.ca/cic/datasets/nsl.html; AWID—https://icsdweb.aegean.gr/awid/; and UCI-IoT—http://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We wish to thank the authors of the compared algorithms for sharing their codes. This work is supported by the National Natural Science Foundation of China (Nos. 61902227, 62076154, 62022052, and U21A20513).