Abstract

Image clustering is a complex procedure that is significantly affected by the choice of image representation. Most existing image clustering methods treat representation learning and clustering separately, which usually brings two problems. On the one hand, image representations are difficult to select, and the learned representations are not necessarily suitable for clustering. On the other hand, these methods inevitably involve a separate clustering step, which may introduce errors and hurt the clustering results. To tackle these problems, we present a new clustering method that efficiently builds an image representation and precisely discovers cluster assignments. For this purpose, the image clustering task is regarded as a binary pairwise classification problem with local structure preservation. Specifically, we propose an approach to image clustering based on a fully convolutional autoencoder and deep adaptive clustering (DAC). To extract the essential representation and maintain the local structure, a fully convolutional autoencoder is applied. To map features into the clustering space and obtain a suitable image representation, the DAC algorithm participates in the training of the autoencoder. Our method can learn an image representation that is suitable for clustering and discover a precise cluster label for each image. A series of real-world image clustering experiments verifies the effectiveness of the proposed algorithm.

1. Introduction

Clustering is a basic unsupervised learning problem whose purpose is to divide data into several subgroups. Generally speaking, the elements in the same subgroup are similar to each other and different from the elements of the other subgroups [1]. Image clustering is one of the fundamental high-dimensional data clustering tasks in computer vision and machine learning [2]. Despite decades of development, reliable clustering of image data remains an open problem [3, 4].

From the perspective of image representation, there are two types of image clustering methods: traditional image clustering methods and deep clustering methods [5]. Traditional image clustering methods group data on handcrafted features and treat feature extraction and clustering separately [6]. Based on this insight, many attempts have been dedicated to developing feature extraction techniques suitable for clustering, such as manually designed feature descriptors, including Bag of Words (BoW) [7], Histogram of Oriented Gradients (HOG) [8], Principal Component Analysis (PCA) [9], and Scale-Invariant Feature Transform (SIFT) [10]. However, the representation ability of handcrafted features is limited because they do not depend on the distribution of the input data. How to establish an effective feature representation is a crucial problem that needs to be solved in image clustering.

In recent years, deep neural networks have been successfully applied to various supervised learning tasks [11, 12]. Their success is attributed to learning more essential representations of images by constructing networks with multiple hidden layers and training them on large amounts of data [13]. Motivated by this success in supervised learning, some unsupervised deep learning methods have been applied to image clustering. These methods are called deep clustering [14]. Most previous deep clustering studies are two-stage training schemes based on an autoencoder. First, they train an autoencoder to reduce the dimension of the image data. Then, the encoder acts as a feature extractor, and a clustering algorithm is applied to the extracted features. The two-stage clustering methods have been widely studied and successfully applied in many works [15–19]. Autoencoder-based methods are effective because they can preserve some properties of the data by adding prior knowledge to the objective. Thus, the encoder constructs a feature representation that can comprehensively describe the image information. However, since no clustering-driven objective participates in the training of two-stage clustering methods, the learned encoder may not be suitable for clustering.

Later, one-stage clustering methods that jointly accomplish feature transformation and clustering emerged. Deep adaptive image clustering (DAC) is a typical one-stage image clustering algorithm [20]. It defines an effective objective and proposes a self-learning scheme to realize image clustering. The defined objective function is used to update the parameters of a convolutional network by selecting highly confident image pairs, and the cluster assignment is integrated into the classification labels. However, two crucial factors affect the stability and effectiveness of the DAC algorithm. On the one hand, the initialization of the convolutional network strongly affects the performance of DAC. On the other hand, as the training of DAC proceeds, the local structure preservation of the representation cannot be guaranteed. Thus, the image representation in the distorted feature space will hurt the clustering performance.

To overcome the problems of DAC, we present a representation learning method for image clustering based on an autoencoder (AE) [21] and deep adaptive image clustering (DAC) [20]. Specifically, to obtain the essential features of the image and provide initial parameters for DAC, we incorporate a fully convolutional autoencoder into the DAC algorithm. As a clustering algorithm, DAC helps to train the autoencoder to obtain clustering-friendly features, while the autoencoder prevents the feature space from being distorted. The proposed method can learn an image representation suitable for clustering and simultaneously find the clustering label of each image. Extensive experiments verify the effectiveness of the proposed algorithm.

The main contributions of this paper can be summarized in three aspects:
(1) We propose a novel system based on an AE and DAC and use it to learn an informative image representation.
(2) Since AE and DAC complement each other, we use the learned image representation to realize image clustering.
(3) We conduct extensive experiments on four real-world datasets to verify the effectiveness of the proposed algorithm.

The rest of the paper is organized as follows. Section 2 briefly introduces the related work. Section 3 presents the proposed clustering algorithm and its details. Section 4 provides a series of experiments to verify the effectiveness of the proposed algorithm. The last section briefly concludes the paper.

2. Related Work

2.1. Deep Clustering

Deep clustering refers to clustering with deep neural networks [22]. Existing deep clustering algorithms mainly seek effective ways to combine deep feature learning with traditional clustering methods and fall into two categories: (I) two-stage methods that apply clustering after a representation is learned; (II) one-stage methods that jointly optimize representation learning and clustering [23].

Two-stage methods usually train an autoencoder at the first stage. Then, the encoder acts as a feature extractor, and a clustering algorithm is applied to obtain the clustering results. The autoencoder (AE) is a classical feature learning method based on deep neural networks and an image reconstruction loss function [18]. Recently, many image clustering algorithms have attempted to regularize the representation learning of the autoencoder with the loss function of a traditional clustering algorithm. For instance, Deep Embedding Clustering (DEC) utilizes the KL-divergence as a loss function to measure the distance between the distribution of image features and the target distribution [24]. Ghasedi Dizaji et al. propose a Stacked Auto-Encoder (SAE) to learn a deep latent feature representation and use it to improve classification [18]. Peng et al. propose a novel clustering method that minimizes the discrepancy between pairwise sample assignments for each data point [25]. Gaussian Mixture Variational Autoencoders (GMVAE) is a representative generation-based clustering algorithm that incorporates a Gaussian mixture distribution into the variational autoencoder [16]. The advantage of the AE is that it keeps the essential information of the features while the clustering algorithm is trained, so the learned representation is more suitable for clustering tasks. Thus, it can avoid degenerate solutions and improve clustering performance. The disadvantage of the two-stage methods is the mismatch between image representation and clustering. Specifically, the clustering algorithm does not participate in representation learning, so the representation is learned without regard to the clustering objective.

One-stage methods combine image representation learning with clustering. For instance, deep adaptive image clustering (DAC) is a typical one-stage image clustering algorithm [20]. It defines an effective objective and proposes an adaptive mechanism to realize image clustering. Guo et al. propose the Improved Deep Embedded Clustering (IDEC) algorithm to preserve the local structure of the data [26]. IDEC trains the AE and performs self-training simultaneously to realize local feature preservation. CatGAN uses a generative adversarial network (GAN) with an entropy-based loss function to realize data clustering [27]. JULE proposes a recurrent framework for joint unsupervised learning of deep representations and image clusters [17]. The effectiveness of these learning schemes has been demonstrated in theory and in practical experiments. However, two crucial factors affect the stability and effectiveness of these algorithms. On the one hand, the initialization of the convolutional network is an important factor. On the other hand, as training proceeds, the local structure preservation of the representation cannot be guaranteed.

3. Image Clustering Based on AE and DAC

3.1. Autoencoder

An AE is a type of artificial neural network that is used to learn efficient data codings in an unsupervised manner [28]. Generally speaking, the objective of an AE is to extract a feature (encoding) of the input data. In the field of computer vision, it is usually used to learn image representations and reduce image dimension [16, 29].

Let $X = \{x_i\}_{i=1}^{n}$ be a set of images, where $n$ denotes the number of images. An AE maps the images from a high-dimensional space $\mathbb{R}^{D}$ to a low-dimensional space $\mathbb{R}^{d}$ with $d \ll D$. The embedding of the dataset is denoted by $Z = \{z_i\}_{i=1}^{n}$, and the function that performs the embedding is denoted by $F$, so that $z_i = F(x_i)$. To guarantee that the learned representation adequately represents the input image information, the following reconstruction loss is used to train the autoencoder network:

$$L_r = \frac{1}{n} \sum_{i=1}^{n} \left\| G(F(x_i)) - x_i \right\|_2^2, \tag{1}$$

where $G$ is the decoder that maps the representation to the output.
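As a minimal sketch of the reconstruction loss described above, the following NumPy snippet computes the mean squared reconstruction error over a batch. The squared-error form, the function name `reconstruction_loss`, and the random linear encoder and decoder weights are illustrative assumptions, not the authors' network.

```python
import numpy as np


def reconstruction_loss(x, x_hat):
    """Mean over the batch of the squared L2 error between images and reconstructions."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    # Flatten every image so that the squared norm is taken per sample.
    diff = (x_hat - x).reshape(x.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Toy stand-in for an autoencoder: a fixed linear encoder/decoder pair.
# The weights are hypothetical and untrained; they only exercise the loss.
rng = np.random.default_rng(0)
images = rng.random((4, 8, 8))                 # 4 images of 8 x 8 pixels
w_enc = rng.normal(size=(64, 16)) * 0.1        # encoder: 64 -> 16 dimensions
w_dec = rng.normal(size=(16, 64)) * 0.1        # decoder: 16 -> 64 dimensions
z = images.reshape(4, -1) @ w_enc              # low-dimensional embedding
recon = (z @ w_dec).reshape(4, 8, 8)           # reconstruction
loss = reconstruction_loss(images, recon)
```

A perfect reconstruction gives a loss of zero, so the value of `loss` measures how much image information the 16-dimensional embedding fails to retain.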

In our algorithm, to extract essential features and preserve spatial locality of images, we adopt a fully convolutional autoencoder to realize the image feature extraction stream.

3.2. Deep Adaptive Image Clustering

DAC is a clustering algorithm that is realized by a convolutional neural network (CNN) and an adaptive training mechanism [20]. It employs some constraints on the classification output and generates a feature for image clustering.

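Let us assume that $x_i$ and $x_j$ are two unlabeled images and that $r_{ij}$ denotes the unknown binary label of the pair, where $r_{ij} = 1$ if $x_i$ and $x_j$ belong to the same cluster and $r_{ij} = 0$ otherwise. The similarity of the pair is $g(x_i, x_j; w) = f(x_i; w) \cdot f(x_j; w)$, the dot product of $f(x_i; w)$ and $f(x_j; w)$, where $f(\cdot\,; w)$ is a classification network with parameters $w$. Based on the similarity of the input image features, the binary labels are defined as follows:

$$r_{ij} = \begin{cases} 1, & \text{if } f(x_i; w) \cdot f(x_j; w) \ge u(\lambda), \\ 0, & \text{if } f(x_i; w) \cdot f(x_j; w) < l(\lambda), \\ \text{None}, & \text{otherwise}, \end{cases} \tag{2}$$

where $\lambda$ is an adaptive parameter, and $u(\lambda)$ and $l(\lambda)$ are two learnable thresholds.

For network training, the objective function of DAC is defined as follows:

$$\min_{w, \lambda} E(w, \lambda) = \sum_{i,j} v_{ij} \, L\big(r_{ij}, g(x_i, x_j; w)\big) + u(\lambda) - l(\lambda), \tag{3}$$

where $g(x_i, x_j; w)$ denotes the estimated similarity of $x_i$ and $x_j$ under the classification network parameters $w$, and $v$ is an indicator coefficient matrix that selects the training samples: $v_{ij} = 1$ means that the pair $(x_i, x_j)$ is selected to train the network, and $v_{ij} = 0$ otherwise. $v_{ij}$ is defined as follows:

$$v_{ij} = \begin{cases} 1, & \text{if } r_{ij} \in \{0, 1\}, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

$L$ is the loss function and is defined as follows:

$$L\big(r_{ij}, g(x_i, x_j; w)\big) = -r_{ij} \log g(x_i, x_j; w) - (1 - r_{ij}) \log\big(1 - g(x_i, x_j; w)\big). \tag{5}$$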
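The pair selection and pairwise loss described in this section can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `dac_pairwise_loss`, the binary cross-entropy form of the loss, the use of NaN to mark unlabeled pairs, and the threshold values 0.9 and 0.5 are all chosen here for demonstration. The rows of `label_features` are assumed to be non-negative with unit L2 norm so that every dot product lies in [0, 1].

```python
import numpy as np


def dac_pairwise_loss(label_features, upper, lower, eps=1e-7):
    """Select confident pairs by thresholding similarities and sum their loss.

    Returns the similarity matrix, the pseudolabels (NaN marks an unlabeled
    pair), the boolean indicator matrix, and the summed pairwise loss.
    """
    feats = np.asarray(label_features, dtype=np.float64)
    sim = feats @ feats.T                       # pairwise dot products
    r = np.full(sim.shape, np.nan)              # NaN: pair is not labeled
    r[sim >= upper] = 1.0                       # confidently similar
    r[sim < lower] = 0.0                        # confidently dissimilar
    v = ~np.isnan(r)                            # indicator of selected pairs
    p = np.clip(sim, eps, 1.0 - eps)            # keep the logarithms finite
    r_filled = np.where(v, r, 0.0)
    bce = -r_filled * np.log(p) - (1.0 - r_filled) * np.log(1.0 - p)
    return sim, r, v, float(np.sum(bce[v]))


# Four unit-norm, non-negative label features (illustrative values).
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
sim, r, v, loss = dac_pairwise_loss(features, upper=0.9, lower=0.5)
```

In this toy batch the fourth sample has similarities of 0.6 and 0.8 to the others, which fall between the two thresholds, so those pairs are left unlabeled and do not contribute to the loss.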

3.3. Network Architecture

The network architecture of the proposed clustering algorithm is shown in Figure 1. There are two streams in our network: the autoencoder stream and the DAC stream. The autoencoder stream is realized by several fully convolutional layers, and the DAC stream is composed of the autoencoder's encoder and several fully connected layers. Consider a dataset $X = \{x_i\}_{i=1}^{n}$ with $n$ input samples that need to be clustered. The number of clusters $k$ is prior knowledge. Let $z_i$ be the output of the encoder. Thus, the encoder can be defined as a nonlinear mapping $z_i = F(x_i; \theta_e)$ and the decoder as $G(\cdot\,; \theta_d)$, where $\theta_e$ and $\theta_d$ are the parameters of the encoder and decoder, respectively. $\hat{x}_i = G(z_i; \theta_d)$ denotes the output of the decoder, and the output of DAC can be represented as $f(x_i; w)$, where $f$ is a classification network formed by the encoder and the fully connected layers, $w = (\theta_e, \theta_c)$, and $\theta_c$ denotes the parameters of the fully connected layers.

3.4. Loss Function

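We aim to find an encoder that makes the extracted features more suitable for clustering, so the reconstruction loss of the autoencoder is added to the initialization and training process of the DAC algorithm. On the one hand, the reconstruction loss assists the learning of the image representation, as it helps to learn the essential features of the input images and avoids distortion of the feature space during the training of DAC. On the other hand, the loss of an autoencoder focuses only on image reconstruction and therefore loses useful information needed for clustering; the DAC loss can guide the encoder toward a representation better suited to clustering. Thus, we define the complete loss function as follows:

$$L = L_r + \gamma L_c, \tag{6}$$

where $\gamma$ is a balance coefficient, and $L_r$ and $L_c$ are the reconstruction loss and the clustering loss, respectively. The final objective function is as follows:

$$\min_{\theta_e, \theta_d, \theta_c, \lambda} \; \frac{1}{n} \sum_{i=1}^{n} \left\| G(F(x_i)) - x_i \right\|_2^2 + \gamma \Big[ \sum_{i,j} v_{ij} \, L\big(r_{ij}, f(x_i; w) \cdot f(x_j; w)\big) + u(\lambda) - l(\lambda) \Big], \tag{7}$$

where the definitions of $r_{ij}$ and $v_{ij}$ are as follows:

$$r_{ij} = \begin{cases} 1, & \text{if } f(x_i; w) \cdot f(x_j; w) \ge u(\lambda), \\ 0, & \text{if } f(x_i; w) \cdot f(x_j; w) < l(\lambda), \\ \text{None}, & \text{otherwise}, \end{cases} \qquad v_{ij} = \begin{cases} 1, & \text{if } r_{ij} \in \{0, 1\}, \\ 0, & \text{otherwise}. \end{cases}$$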

3.5. Network Training

In this section, we present the whole training process of our algorithm. To minimize the loss function proposed in (7), we first set aside the DAC stream and pretrain a convolutional autoencoder using only the reconstruction loss $L_r$. The trained encoder provides initial parameters for the DAC algorithm, so that the DAC algorithm selects more accurately labeled image pairs in the initial stage. Then, we simultaneously train the autoencoder stream and the DAC stream by minimizing (7). The detailed algorithm is formalized as Algorithm 1.

Input: input images; number of clusters; the upper and lower thresholds; the batch size and the learning rates; the balance coefficient
Output: reconstructed images; clustering labels of the input images.
(1) Pretrain a fully convolutional autoencoder;
(2) for each batch of input images do
(3)     Extract the image features in the batch;
(4)     Compute the similarity and pseudolabel based on (2);
(5)     Calculate the indicator coefficient based on (4);
(6)     Update the encoder, decoder, and fully connected layer parameters by minimizing (7);
(7) end for
(8) Update the adaptive threshold parameter by minimizing (7).
(9) for each input image do
(10)     Compute its reconstruction and its clustering label;
(11) end for
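The order of operations in Algorithm 1 can be sketched as a small framework-free driver. Everything here is hypothetical scaffolding: `run_schedule` and the three callbacks (`pretrain_step`, `joint_step`, `threshold_step`) are names introduced for illustration, and repeating the joint phase for a fixed number of epochs is an assumption, since the stopping rule is not stated in the text.

```python
def run_schedule(batches, pretrain_step, joint_step, threshold_step,
                 pretrain_epochs, joint_epochs):
    """Run pretraining, then joint training with a threshold update per epoch.

    The callbacks are placeholders for the real optimization steps; the
    returned log records the order in which they were invoked.
    """
    log = []
    # Phase 1: train the autoencoder alone on the reconstruction loss.
    for _ in range(pretrain_epochs):
        for batch in batches:
            pretrain_step(batch)
            log.append("pretrain")
    # Phase 2: train both streams on the combined objective.
    for _ in range(joint_epochs):
        for batch in batches:
            joint_step(batch)          # select pairs, update network weights
            log.append("joint")
        threshold_step()               # update the adaptive threshold parameter
        log.append("threshold")
    return log


# Dry run with no-op callbacks to show the resulting order of steps.
order = run_schedule(
    batches=["batch_0", "batch_1"],
    pretrain_step=lambda batch: None,
    joint_step=lambda batch: None,
    threshold_step=lambda: None,
    pretrain_epochs=1,
    joint_epochs=2,
)
```

Keeping the schedule separate from the update steps makes it easy to swap in real optimizer calls from a deep learning framework without changing the control flow.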

Once the autoencoder is trained, a clustering-friendly representation could seemingly be obtained by simply finetuning the autoencoder's encoder within the DAC algorithm. The encoder is connected to a number of fully connected layers to form a classification network, and the labels are calculated by the algorithm proposed in [20]. However, we suppose that this kind of finetuning can distort the representation space, which may weaken the expressive power and thereby hurt the clustering results. For this reason, while the DAC stream is trained, the autoencoder is also trained, so that the DAC algorithm keeps obtaining highly confident labeled image pairs.

4. Experiments

In this section, we carry out a series of experiments to verify the effectiveness of our algorithm. All the experiments are carried out in a TensorFlow and Keras environment running on Ubuntu 14.04 with an Intel(R) Core i7-4790 CPU at 3.6 GHz and a Titan X GPU with 12 GB of memory.

4.1. Datasets

In this part, four challenging image datasets, namely Fashion-MNIST, Cifar-10, Cifar-100, and STL-10, are selected to verify the effectiveness of our algorithm. We first briefly introduce the datasets.

4.1.1. Fashion-MNIST

Fashion-MNIST is a dataset of Zalando's article images, which includes a training set of 60,000 examples and a test set of 10,000 examples [30]. In the Fashion-MNIST dataset, each example is a 28 × 28 grayscale image associated with a label from 10 classes.

4.1.2. Cifar-10 and Cifar-100

Cifar-10 contains 50,000 training images and 10,000 test images from 10 classes [31]. Each image has a size of 32 × 32 pixels. Cifar-100 is similar to Cifar-10, except that it has 100 classes and therefore 10 times fewer images per class.

4.1.3. STL-10

The STL-10 dataset is an image dataset used to develop unsupervised feature learning, deep learning, and self-supervised learning algorithms [32]. It is inspired by the CIFAR-10 dataset but with some modifications. Its higher image resolution (96 × 96) makes it a challenging benchmark for developing more scalable unsupervised learning methods.

In our experiments, the training set and validation set of each dataset are jointly utilized. In particular, the 20 superclasses of Cifar-100 dataset are considered in all the experiments. We summarize the detailed information of each dataset in Table 1.

4.2. Evaluation Metrics

Three commonly used clustering metrics, namely accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI), are adopted to evaluate the performance of the clustering algorithms. These metrics reflect the clustering performance from different perspectives. ACC measures the best matching between the clustering labels and the ground truth labels. NMI measures the similarity between pairs of clusters [33]. ARI establishes a baseline by using the expected similarity of all pairwise comparisons between clusters specified by a random model [34].
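To make the "best matching" behind ACC concrete, here is a minimal pure-Python sketch. It searches all one-to-one assignments of cluster indices to class labels by brute force, which is only practical for a small number of clusters; larger problems are normally solved with the Hungarian algorithm. The function name `clustering_accuracy` is an illustrative choice, not taken from the paper.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correct under the best one-to-one label matching."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    k = max(len(true_ids), len(pred_ids))
    t_index = {c: i for i, c in enumerate(true_ids)}
    p_index = {c: i for i, c in enumerate(pred_ids)}
    # counts[p][t]: number of samples in predicted cluster p with true class t.
    counts = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        counts[p_index[p]][t_index[t]] += 1
    # Try every assignment of predicted clusters to true classes.
    best = max(
        sum(counts[p][perm[p]] for p in range(k))
        for perm in permutations(range(k))
    )
    return best / len(y_true)


# Cluster indices are arbitrary, so a pure relabeling still scores 1.0.
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
partial = clustering_accuracy([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
```

In the second call, the best matching maps predicted cluster 0 to class 1 and predicted cluster 1 to class 0, which gets five of the six samples right.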

4.3. Competitors

We compare the performance of our method with that of traditional clustering methods and deep clustering methods. Specifically, the traditional clustering methods include K-means [35] and Self-tuning Spectral Clustering (SSC) [36]. The deep clustering methods include Greedy Layer-Wise Training of Deep Networks (GLWTDN) [37], Consistent Inference of Latent Representations (CILR) [15], Gaussian Mixture Variational Autoencoders (GMVAE) [16], Categorical Generative Adversarial Networks (CatGAN) [27], Joint Unsupervised Learning (JULE) [17], Deep Embedding Clustering (DEC) [24], and DAC [20]. We summarize the clustering results of the mentioned methods on all datasets in Table 2. Next, we briefly introduce the competitors.

4.3.1. Traditional Image Clustering Methods

K-means++ and SSC: these methods first use Bag of Words (BoW) to encode the images, and then the image features are clustered to achieve image clustering.

4.3.2. Deep Clustering Methods

GLWTDN: it first trains an AE model to extract image features and then uses the K-means algorithm to cluster the image features [37].

CILR: it first adopts consistent inference of latent representations (CILR) to generate latent labeled data points of the inputs. Then, CILR is derived to pretrain DNNs by minimizing the distance between latent labeled data points to realize image clustering [15].

GMVAE: it uses the Gaussian mixture model as a prior distribution to improve the traditional variational autoencoder. It uses the improved latent vector as image representation and then clusters representation to realize image clustering [16].

CatGAN: it uses a generative adversarial network (GAN) with an entropy-based loss function to realize data clustering [27].

JULE: it proposes a recurrent framework for joint unsupervised learning of deep representations and image clustering [17].

DEC: it first learns image representations from an AE. Then, clusters are obtained by utilizing a typical K-means algorithm [24].

DAC: it formulates image clustering as a binary pairwise classification problem and identifies these pairs of images which should belong to the same cluster [20].

4.4. Experimental Settings

Following the setting in DAC, we set the initial thresholds which construct highly confident pseudolabel to and , respectively. The initialization of the parameter is set to 0. Considering the convergence of our algorithm, we set the learning rate of to . The balance coefficient is . For training, we adopt the well-known Adam optimizer with an initial learning rate . In addition, the batch size is set to in all the experiments. The detailed network architecture used in each dataset is shown in Table 3.

4.5. Results
4.5.1. Clustering Performance Comparison

In this part, we first compare our method with the competing methods, including K-means [35], Self-Tuning Spectral Clustering (SSC) [36], Greedy Layer-Wise Training of Deep Networks (GLWTDN) [37], Consistent Inference of Latent Representations (CILR) [15], Gaussian Mixture Variational Autoencoders (GMVAE) [16], Categorical Generative Adversarial Networks (CatGAN) [27], JULE-SF [17], JULE-RC [17], Deep Embedding Clustering (DEC) [24], and DAC [20]. We summarize the clustering results of the mentioned methods on all datasets in Table 2.

As shown in Table 2, for each dataset, the performance of the deep clustering methods is superior to that of the traditional clustering algorithms. Autoencoder-based methods outperform the traditional K-means algorithm by a large margin, which demonstrates the potential of the autoencoder in clustering tasks. Furthermore, the proposed method outperforms the other algorithms on all datasets. In terms of clustering accuracy, it exceeds all competitive baselines by margins of 4.79%, 4.47%, 4.95%, and 3.94% on Fashion-MNIST, Cifar-10, Cifar-100, and STL-10, respectively. These results verify the effectiveness of our method in image clustering tasks.

Figure 2 shows the confusion matrices of the clustering results for the Cifar-10 and STL-10 datasets. The values along the diagonal represent the percentage of samples correctly assigned to the corresponding categories. We can see that the accuracy is evenly distributed across categories and stable for these two datasets. This indicates that our method does not aggregate samples into a few categories and can effectively avoid the degenerate solution problem.

4.5.2. Visualization

In this part, we use two methods to visualize the clustering results of our algorithm. The first visualization experiment is conducted on the Fashion-MNIST dataset. We randomly sample 10,000 representations and map them to two-dimensional vectors by using t-SNE [30]. The experimental results are shown in Figure 3. In Figures 3(a)–3(f), different colors indicate different clusters, and the corresponding clustering accuracies are reported. The experimental results show that the proposed algorithm can effectively improve the separability of the data, which helps to improve the clustering accuracy.

In the second visualization experiment, we qualitatively analyze the clustering results on the Cifar-10 dataset. For each category, we first randomly select an image as the original image. Then, we pick several samples from the same cluster that have the smallest Euclidean distance to the original image. Finally, we pick the incorrectly clustered samples that are closest to the original image. All the picked images are shown in Figure 4, and we mark the incorrect samples with red boxes. From the visualization results, we can see that the successful cases not only depend on texture information but also contain some semantic information of the categories. The failure cases also contain a lot of texture information similar to the source images. This implies that our method captures not only image appearance information but also some abstract image information for image clustering, which helps to explain why the proposed method can precisely discover cluster assignments.
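The retrieval step used for this qualitative analysis can be sketched with NumPy as follows. The sketch covers only the selection of the nearest samples within the same cluster; the function name `nearest_in_cluster` and the toy feature values are assumptions made for illustration, and the features would in practice be the learned image representations.

```python
import numpy as np


def nearest_in_cluster(features, cluster_ids, query_index, top_k):
    """Indices of the top_k samples closest to the query within its cluster."""
    feats = np.asarray(features, dtype=np.float64)
    ids = np.asarray(cluster_ids)
    # Euclidean distance from every sample to the query sample.
    dist = np.linalg.norm(feats - feats[query_index], axis=1)
    # Keep samples assigned to the query's cluster, excluding the query itself.
    candidates = np.flatnonzero(ids == ids[query_index])
    candidates = candidates[candidates != query_index]
    order = np.argsort(dist[candidates], kind="stable")
    return candidates[order[:top_k]].tolist()


# Five toy 2-D features; the last sample belongs to a different cluster.
toy_features = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.5, 0.0], [10.0, 10.0]]
toy_clusters = [0, 0, 0, 0, 1]
neighbours = nearest_in_cluster(toy_features, toy_clusters, query_index=0, top_k=2)
```

Marking which of the retrieved neighbours are wrong additionally requires the ground truth labels, which are compared against the label matched to the query's cluster.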

4.5.3. On the Effect of Number of Clusters

In this part, we study the effect of the number of clusters on our algorithm and compare the results with the DAC algorithm. For each dataset, we conduct 6 experiments on different training sets. For Cifar-10 and STL-10 dataset, the number of training sets varies in the range of at equal intervals. For Cifar-100 dataset, the number of training samples varies in the range of with an interval of 2. We report the variation curves of clustering accuracy with the number of clusters in Figure 5. The detailed numerical results are shown in Table 4.

As shown in Figure 5 and Table 4, the clustering accuracy decreases gradually as the number of clusters increases. For all datasets, the clustering accuracy of our method is higher than that of the DAC algorithm for every number of clusters. In addition, the results for the other two metrics also show the superiority of the proposed algorithm. This is because the autoencoder can reduce the randomness of searching for labeled pairs and thereby improve the clustering performance of the DAC algorithm. The experimental results also show the stability of the algorithm.

4.5.4. On the Effect of the Parameter and

In this experiment, we mainly study the effect of the parameter and . The range of parameters is selected by the grid search of the region in a step size of . In Figure 6, we report the clustering accuracies with different . From the results, we can find that when and tend to be close, the clustering accuracy is the highest. This is mainly because the autoencoder can guarantee the local structure of image representation and prevent the distortion of feature space. It also means that autoencoder can promote the clustering performance of DAC, which explains the reason why our algorithm is effective.

5. Conclusion

In this paper, we present a novel representation learning method and use it to solve the image clustering problem. To generate more informative representations for clustering, we borrow the DAC algorithm and use it to help train a fully convolutional autoencoder. The proposed algorithm was evaluated on unsupervised clustering tasks using popular datasets and achieved competitive results compared with the current state of the art. The proposed algorithm may be improved further by applying other deep feature extraction models, e.g., the Variational AutoEncoder (VAE) and Generative Adversarial Networks (GANs). In particular, learning the distribution of the image data instead of reconstructing the image is an interesting direction. We leave this for future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (no. 2018YFB1307400) and the Science and Technology Project of the State Grid Corporation of China (no. SGSDDK00KJJS2000090). The authors appreciate the following research: https://ieeexplore.ieee.org/document/9184041.