Abstract

This paper describes a new image generation algorithm based on generative adversarial networks. With an information-theoretic extension to the autoencoder-based discriminator, the new algorithm is able to learn interpretable representations from the input images. Our model not only adversarially minimizes the Wasserstein distance-based losses of the discriminator and generator but also maximizes the mutual information between a small subset of the latent variables and the observation. We also train our model with proportional control theory to keep the equilibrium between the discriminator and the generator balanced, and as a result, our generative adversarial network can mitigate the convergence problem. Through experiments on real images, we validate our proposed method, which can manipulate the generated images as desired by controlling the latent codes of the input variables. In addition, the visual quality of the produced images is effectively maintained, and the model converges stably to the equilibrium. However, our model has difficulty in learning disentangled factors because it does not regularize the independence between the interpretable factors. Therefore, in future work, we will develop a generative model that can learn disentangled factors.

1. Introduction

The generative adversarial network (GAN) model, one of the prominent generative models, has been successfully applied to image generation tasks [1]. The basic idea of a GAN is to adversarially train two different models, a generator and a discriminator. The generator aims to generate fake samples similar to real samples from random noise variables. Meanwhile, the discriminator learns to distinguish real samples from the fake samples produced by the generator. When a GAN model reaches convergence, the generator can produce samples that are indistinguishable from real samples. In addition, distributed representations of the inputs can be obtained from the hidden representations of the GAN model, which reflect the underlying factors of variation that generate the data [2]. However, the original GAN model suffers from training difficulties and mode collapse [3].

To overcome these problems, many improved models have been proposed [4–6]. While the original GAN model has a binary classification discriminator that distinguishes fake samples from real ones, recent models construct more informative discriminator costs, such as an autoencoder cost that assigns lower cost to real samples and higher cost to fake ones [3, 6, 7]. Boundary Equilibrium GAN (BEGAN) [3] used an autoencoder cost as the discriminator cost and balanced the adversarial networks using proportional control theory; as a result, it converged to diverse images of high visual quality without a complex alternating training procedure. However, the learned representations of images are entangled and barely interpretable because each dimension of the distributed representations does not have a specific meaning.

Unsupervised learning is a general problem that requires extracting valuable information from unlabeled data. Representation learning is one popular framework, which tries to learn from unlabeled data a representation that can be easily decoded [8–10]. In generative models, if we can learn interpretable representations with unsupervised learning methods, they can be useful for generating new data. In fact, there are many generative models that construct high-quality new data from arbitrarily bad representations [11]. However, good generative models, incorporating the idea of unsupervised learning, are expected to learn an interpretable representation and synthesize new data that can be manipulated as desired.

On the assumption that a good representation can find the underlying causes of the samples and is interpretable, supervised or semisupervised generative models have been developed [12–17]. Early methods of representation learning were based on autoencoders or RBMs (Restricted Boltzmann Machines) [18, 19]. Recently, VAEs (Variational Autoencoders) [12] achieved splendid semisupervised results on the MNIST dataset [20], and GANs learned image representations that enable linear algebra in the coded space [4]. There were several attempts to learn disentangled or interpretable representations with supervised datasets [21]. Such methods train the model to match one class of representations to a supplied label. Similarly, adversarial autoencoders [22] and VAEs learned representations with the class label separated from other variations. To avoid explicitly labeling the variations, weakly supervised methods were introduced.

Because the VAE and GAN frameworks have been popularly used in generative modeling, many models for learning disentangled or interpretable representations based on them have been proposed [13, 16, 17]. In generative models, it is important that some factors can be manipulated to generate a new image. However, in supervised models, which disentangled factors are learned is determined when the model is constructed, so the factors that can be manipulated are also determined. Mathieu et al. developed a conditional generative model for learning to disentangle the factors of variation [13]. Their model considered two kinds of factors of variation: the specified factors and the remaining factors. The specified part is modeled with a VAE framework, and the remaining part is modeled with a GAN framework. Xiao et al. proposed DNA-GAN, a supervised learning model for learning disentangled factors of variation [17]. They iteratively trained the model to address the problem of unbalanced multiattribute datasets. Their model also consisted of attribute-relevant and attribute-irrelevant parts. Higgins et al. developed the Symbol-Concept Association Network (SCAN), which learns higher-level concepts based on disentangled visual primitives, and these concepts can be used to generate new images [23]. They measured accuracy and diversity, where high accuracy means that a model understands the meaning of a symbol and high diversity means that the samples vary in the unspecified attributes. The SCAN model showed good performance in both accuracy and diversity. Although these models were effective in learning disentangled representations, they required paired training data with labels and could not overcome the instability problem of the GAN objective.

Meanwhile, there exist methods that learn disentangled representations with unsupervised learning. Unsupervised learning is more desirable because not only is the labeling cost high but also the labels annotated by humans may be inconsistent and insufficient [24]. Kim and Mnih proposed an unsupervised FactorVAE, which enhanced disentanglement over β-VAE by introducing a total correlation penalty [24]. However, they conducted experiments only on artificial datasets. The hossRBM successfully learned disentangled representations in an unsupervised way on the Toronto Face Dataset, showing emotional changes in the generated images. However, hossRBM can only learn discrete latent factors, and its computation cost grows exponentially. The InfoGAN model made it possible to learn interpretable representations in a purely unsupervised way by introducing the mutual information between the latent code and the input representation [25]. InfoGAN can learn both discrete and continuous latent factors as interpretable representations. Also, InfoGAN usually requires about the same training time as typical GANs. Although this model learned interpretable and meaningful representations, it still had convergence and mode collapse problems because of its binary classification discriminator. Therefore, we aim at developing a generative model that learns interpretable distributed representations and improves the training of the GAN model.

In this study, we propose the IBEGAN model, which introduces mutual information regularization between the latent codes of the generator and the latent representations of the autoencoder model in the discriminator. In our model, the training procedure follows that of the BEGAN model for stable training and high visual quality.

In the remainder of the paper, we first review the related GAN models, BEGAN and InfoGAN, in Section 2. In Section 3, we describe our proposed model, information-based boundary equilibrium generative adversarial networks (IBEGANs), in terms of the GAN objective functions and the model architecture. In Section 4, we evaluate whether our model can learn interpretable representations by manipulating the generated images through the latent codes on real-world image datasets, and we analyze the image quality and the model convergence. Finally, Section 5 concludes this study.

2. Related Works

There are many GAN models that aim to improve the quality of the generated images and to obtain effective latent representations. Energy-Based GANs (EBGANs) [6] used an energy function based on an autoencoder model to construct the discriminator. Deep Convolutional GANs (DCGANs) [4] first used a convolutional architecture and produced significantly improved visual samples. Wasserstein GANs (WGANs) [26] proposed the Wasserstein distance as the distance measure, which led to stable convergence. In this section, we introduce two GAN models that improve the training procedure and learn interpretable representations: BEGAN and InfoGAN.

2.1. Boundary Equilibrium Generative Adversarial Networks

BEGAN constructs its GAN objective by using the Wasserstein distance between autoencoder loss distributions. This method balances the generator and the discriminator and also provides a new approximate convergence measure. BEGAN has a simpler architecture and an easier training procedure compared with other typical GANs. As a result, BEGAN can produce samples of human faces with excellent visual quality.

BEGAN proposed to match the loss distributions of autoencoders instead of matching the data distributions directly. BEGAN uses the following pixelwise autoencoder loss:

$$\mathcal{L}(v) = |v - D(v)|^{\eta}, \quad \eta \in \{1, 2\}, \tag{1}$$

where $D$ is the autoencoder function of the discriminator, $v$ is a sample, and $\eta$ is the target norm.
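For concreteness, this pixelwise loss with $\eta = 1$ could be computed as in the following sketch (PyTorch-style; the function and argument names are ours, not part of the original formulation):

```python
import torch

def pixelwise_autoencoder_loss(v, v_reconstructed, eta=1):
    # L(v) = |v - D(v)|^eta, averaged over all pixels (eta = 1 shown here)
    return torch.abs(v - v_reconstructed).pow(eta).mean()
```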

Let $\mu_1$ and $\mu_2$ be the loss distributions of the autoencoder, $\Gamma(\mu_1, \mu_2)$ be the set of couplings of $\mu_1$ and $\mu_2$, and $m_1, m_2$ be the respective means. Then, the Wasserstein distance is bounded by Jensen's inequality as follows in [3]:

$$W_1(\mu_1, \mu_2) = \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \mathbb{E}_{(x_1, x_2) \sim \gamma}\big[|x_1 - x_2|\big] \geq |m_1 - m_2|, \tag{2}$$

where $x_1$ and $x_2$ have the same marginal distributions as $\mu_1$ and $\mu_2$.

From this inequality, the GAN loss is designed to maximize the lower bound of the Wasserstein distance between the autoencoder losses. Let $\theta_D$ and $\theta_G$ be the parameters of the discriminator and the generator, respectively, $\mathcal{L}_D$ and $\mathcal{L}_G$ be the losses of the discriminator and the generator, respectively, and $z_D$ and $z_G$ be samples of the latent variable $z$. Then, the GAN objective can be expressed as follows:

$$\mathcal{L}_D = \mathcal{L}(x; \theta_D) - \mathcal{L}(G(z_D; \theta_G); \theta_D) \quad \text{for } \theta_D, \qquad \mathcal{L}_G = -\mathcal{L}_D \quad \text{for } \theta_G. \tag{3}$$

In the GAN model, balancing the losses of the discriminator and the generator is difficult because the discriminator usually wins over the generator easily. Therefore, BEGAN introduces a new hyperparameter $\gamma \in [0, 1]$ for stable convergence as well as for the balancing goal:

$$\gamma = \frac{\mathbb{E}[\mathcal{L}(G(z))]}{\mathbb{E}[\mathcal{L}(x)]}. \tag{4}$$

If $\gamma$ becomes lower, the discriminator focuses more on learning how to autoencode the real images, so image diversity decreases. For this reason, BEGAN calls $\gamma$ the diversity ratio.

BEGAN proposed to use proportional control theory to keep the equilibrium $\mathbb{E}[\mathcal{L}(G(z))] = \gamma\,\mathbb{E}[\mathcal{L}(x)]$ balanced by using a new variable $k_t \in [0, 1]$, which controls how much to focus on the loss of the generator during gradient descent. Finally, the full BEGAN objective is as follows:

$$\begin{aligned} \mathcal{L}_D &= \mathcal{L}(x) - k_t\,\mathcal{L}(G(z_D)) && \text{for } \theta_D, \\ \mathcal{L}_G &= \mathcal{L}(G(z_G)) && \text{for } \theta_G, \\ k_{t+1} &= k_t + \lambda_k\,\big(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z_G))\big) && \text{for each training step } t, \end{aligned} \tag{5}$$

where $\lambda_k$ is the learning rate for $k$.
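A minimal sketch of this proportional control update is shown below; the default values are illustrative only ($\gamma = 0.5$ matches the setting used later in our experiments, while $\lambda_k$ is an assumed small learning rate):

```python
def update_k(k_t, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    # proportional control: push E[L(G(z))] toward gamma * E[L(x)]
    k_next = k_t + lambda_k * (gamma * loss_real - loss_fake)
    return min(max(k_next, 0.0), 1.0)  # keep k_t within [0, 1]
```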

BEGAN has stable convergence and a simple training procedure because it is not required to pretrain the discriminator or to train the discriminator and the generator alternately. Nevertheless, this model cannot obtain an interpretable representation, which is useful for generating new samples from a generative model. Therefore, in the following section, we explain the InfoGAN model, which learns meaningful latent codes.

2.2. Information Maximizing Generative Adversarial Networks

InfoGAN introduced an information-theoretic modification to the original GAN. It enabled the model to learn interpretable representations, which inspired our proposed method. InfoGAN relates the latent variable to the input variable by maximizing a lower bound of their mutual information.

InfoGAN decomposes the input noise vector into the incompressible noise $z$ and the structured semantic latent code $c$. The latent code $c$ is a concatenation of latent variables $c_1, \ldots, c_L$ that are assumed to have the factored distribution $P(c_1, \ldots, c_L) = \prod_{i=1}^{L} P(c_i)$. In the InfoGAN model, the incompressible noise and the latent code are both fed to the generator network, so the form of the generator becomes $G(z, c)$, but the generator can ignore the information that $c$ contains. To solve this problem, the mutual information $I(c; G(z, c))$ is maximized to restrict the generator from using the code in a highly entangled way, where a lower bound of the mutual information can be derived by defining an auxiliary distribution $Q(c\,|\,x)$ that approximates the posterior $P(c\,|\,x)$.

Hence, the minimax game of InfoGAN is defined as follows:

$$\min_{G, Q}\,\max_{D}\; V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_I(G, Q), \tag{6}$$

where $V(D, G)$ is the standard GAN objective and $L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}[\log Q(c\,|\,x)] + H(c) \le I(c; G(z, c))$ is the variational lower bound of the mutual information.
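For a single categorical code, the lower bound $L_I$ reduces to an expected log-likelihood under $Q$, as in the following sketch (PyTorch-style; the names are illustrative, and the constant entropy term $H(c)$ is omitted):

```python
import torch.nn.functional as F

def info_lower_bound(q_logits, c_onehot):
    # q_logits: Q's unnormalized scores for the latent code of a generated image
    # c_onehot: the categorical code that was actually fed to the generator
    log_q = F.log_softmax(q_logits, dim=1)
    # E[log Q(c|x)]; H(c) is constant for a fixed prior and can be dropped
    return (c_onehot * log_q).sum(dim=1).mean()
```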

InfoGAN successfully captured disentangled representations on the MNIST dataset, a 3D rendered image dataset, and the SVHN dataset. Those tasks were all unsupervised, and the interpretable representations learned by InfoGAN were competitive with several representations learned by existing supervised methods. However, InfoGAN could not be applied to high-quality image examples because a binary classification discriminator was used instead of a pixelwise error. Therefore, we propose a new information-based GAN model that effectively reflects the pixelwise error.

3. Proposed Method

In this study, we propose the IBEGAN model to generate and control new images with high visual quality and diversity. We use an autoencoder model with a pixelwise error and proportional control theory to construct the discriminator loss, as in the BEGAN model, to alleviate the convergence and diversity-quality balance problems. However, with an autoencoder model, we cannot restrict the kinds of latent representations, such as discrete or continuous variables. IBEGAN uses both a continuous latent variable and a discrete latent code. Therefore, we construct a new model architecture to learn interpretable representations while the discriminator uses an autoencoder model.

First, our generator has two latent inputs: the incompressible noise $z$ and the latent code $c$, which are fed to the generator $G(z, c)$, where $z$ learns a compact representation of the image and $c$ learns an interpretable factor that can manipulate the generated images. An autoencoder model with a pixelwise loss is constructed, and its loss distributions are used to construct the losses of the discriminator and the generator. Unlike the InfoGAN model, IBEGAN can obtain a reproducible encoding because an image can be obtained by decoding the embedding in our model. However, like the InfoGAN model, the posterior distribution should be approximated by $Q(c\,|\,x)$ because we cannot explicitly estimate the posterior $P(c\,|\,x)$. If a new network architecture were used for $Q$, the computational cost would increase, so we share the encoder network to construct the network of $Q$, where $Q(c\,|\,x)$ is obtained from the encoder embedding. In addition, we use a convolutional architecture to improve image quality. Figures 1 and 2 show the architecture of IBEGAN. The discriminator is a convolutional deep neural network [27, 28], designed as a deep autoencoder.
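As an illustration of this weight sharing, the discriminator could be organized as in the minimal PyTorch-style sketch below; the class and attribute names are ours, and the encoder and decoder submodules are assumed to be defined separately:

```python
import torch.nn as nn

class IBEGANDiscriminator(nn.Module):
    # encoder maps an image to the embedding h, decoder maps h back to an image,
    # and an extra head estimates Q(c|x) from the shared embedding h
    def __init__(self, encoder, decoder, embed_dim, total_code_dim):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.q_head = nn.Linear(embed_dim, total_code_dim)

    def forward(self, x):
        h = self.encoder(x)
        reconstruction = self.decoder(h)
        q_logits = self.q_head(h)
        return reconstruction, q_logits
```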

The generator has the same convolutional structure as the discriminator's decoder except for the initial inputs and the first fully connected layer. This keeps the architecture simple and makes the training procedure simpler. The input of the generator is the concatenation of $z$ and $c$, where $z$ is sampled uniformly.

As a result, we construct the following loss functions from the networks:

$$\begin{aligned} \mathcal{L}_D &= \mathcal{L}(x) - k_t\,\mathcal{L}(G(z_D, c_D)) && \text{for } \theta_D, \\ \mathcal{L}_G &= \mathcal{L}(G(z_G, c_G)) + \lambda\,\mathcal{L}_I && \text{for } \theta_G, \\ k_{t+1} &= k_t + \lambda_k\,\big(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z_G, c_G))\big), \end{aligned} \tag{7}$$

where $\mathcal{L}(\cdot)$ is the autoencoder loss used in Equation (3) and $\mathcal{L}_I$ is the latent-code regularizer defined in Equation (8).

We train the autoencoder network with the discriminator loss $\mathcal{L}_D$ and the generator network with the generator loss $\mathcal{L}_G$, where $\lambda\,\mathcal{L}_I$ regularizes the generator so that it does not ignore the latent code $c$. In our objectives, the meaning of $k_t$ is preserved as in the BEGAN model because we update $k_t$ regardless of the regularizer $\lambda\,\mathcal{L}_I$. Therefore, we can measure the convergence of the GAN as in [3].
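To make the interplay of the three updates in Equation (7) concrete, a condensed PyTorch-style sketch of one training iteration is given below. The discriminator D is assumed to return the reconstruction together with the Q logits as in the sketch above, and the helper names and all hyperparameter values are our illustrative assumptions rather than the reference implementation:

```python
import torch
import torch.nn.functional as F

def l1_recon(v, v_hat):
    # pixelwise L1 autoencoder loss L(v)
    return torch.abs(v - v_hat).mean()

def train_step(D, G, opt_D, opt_G, x, z, c_onehot, k, gamma, lam, lambda_k):
    # discriminator update: L_D = L(x) - k_t * L(G(z, c))
    fake = G(z, c_onehot).detach()
    loss_real = l1_recon(x, D(x)[0])
    loss_fake = l1_recon(fake, D(fake)[0])
    loss_D = loss_real - k * loss_fake
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # generator update: L_G = L(G(z, c)) + lambda * L_I
    fake = G(z, c_onehot)
    recon, q_logits = D(fake)
    loss_rec = l1_recon(fake, recon)
    loss_info = F.cross_entropy(q_logits, c_onehot.argmax(dim=1))
    loss_G = loss_rec + lam * loss_info
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # proportional control on k_t; the regularizer does not enter this update
    k = k + lambda_k * (gamma * loss_real.item() - loss_rec.item())
    return min(max(k, 0.0), 1.0)
```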

4. Experiments

The goal of the experiments is to evaluate whether our proposed method can learn interpretable representations from image datasets while ensuring that the newly produced images maintain high visual quality. Therefore, our experiments consist of two parts. First, we show the generated images while changing the learned latent code $c$ and fixing the latent variable $z$, and we interpret the results. To evaluate the learned interpretable codes, we measure the accuracy of the generated samples. Second, we measure whether IBEGAN converges to the equilibrium by inspecting the change of the losses and using the convergence measure.

4.1. Data Description

In this paper, we used two real-world image datasets: CelebA and LSUN bedroom.

Figure 3 shows the CelebA dataset, which contains 202,599 images of celebrities. CelebA includes various facial poses, various races, and rotations around the camera axis [29]. Our proposed model generated images that look realistic and captured meaningful and manipulable representations.

Figure 4 shows the LSUN bedroom dataset, which contains 3,033,042 images of bedrooms [30]. Although LSUN is a high-resolution dataset with a large number of samples, the proposed model can successfully learn the representations and generate high-quality samples.

In our experiments, we resized the images of both the CelebA data and the LSUN bedroom data to 64 × 64, and for the CelebA data, we used a center crop of each image. The face data used in training are illustrated in Figure 5.

4.2. Experimental Design

In the experiments, we construct the network architectures of the discriminator and the generator as in Figures 1 and 2, where 3 × 3 convolutions with ELU (exponential linear unit) layers are repeatedly applied to the previous outputs [31]. Each layer is repeated twice because the repetition of convolutions can produce better visual quality. In the encoder network, downsampling is done by subsampling with stride 2. In the decoder and generator networks, upsampling is done by the nearest-neighbor method. At the end of the encoder, the intermediate output is passed to a fully connected layer to obtain the embedding $h$, the new input of the decoder. To capture the auxiliary distribution $Q(c\,|\,x)$, the embedding $h$ is passed to another fully connected layer whose output dimension equals the sum of the dimensions of the latent codes.
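The convolutional blocks described above could be sketched as follows (PyTorch-style; implementing the stride-2 subsampling as a strided convolution, as well as the channel counts and block boundaries, are our assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    # two 3x3 convolutions with ELU, followed by stride-2 subsampling
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        x = F.elu(self.conv1(x))
        x = F.elu(self.conv2(x))
        return F.elu(self.down(x))

class UpBlock(nn.Module):
    # two 3x3 convolutions with ELU, followed by nearest-neighbor upsampling
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x):
        x = F.elu(self.conv1(x))
        x = F.elu(self.conv2(x))
        return F.interpolate(x, scale_factor=2, mode='nearest')
```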

We estimate the auxiliary distribution $Q(c\,|\,x)$ by using a network that shares most of its layers with the encoder network, and we construct $\mathcal{L}_I$ from the softmax cross entropy between the latent code $c$ and its estimate $\hat{c}$:

$$\mathcal{L}_I = -\sum_{i=1}^{L}\sum_{j=1}^{d_i} c_{ij}\,\log \hat{c}_{ij}, \tag{8}$$

where $L$ is the number of latent codes, $d_i$ is the dimension of the latent code $c_i$, $c_{ij} \in \{0, 1\}$ indicates whether the $j$th category is the category of the latent code $c_i$, and $\hat{c}_{ij}$ is the probability of the $j$th category estimated by $Q$. This regularizer makes IBEGAN learn meaningful latent codes. We inspect this property in Section 4.3.1.
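Under the assumption that $Q$ outputs one logit vector per categorical code, Equation (8) amounts to a sum of softmax cross entropies, as in this sketch (PyTorch-style; names are illustrative):

```python
import torch.nn.functional as F

def latent_code_loss(q_logits_per_code, code_labels):
    # q_logits_per_code: one (batch, d_i) logit tensor per latent code c_i
    # code_labels: one (batch,) tensor of category indices per latent code c_i
    total = 0.0
    for logits, labels in zip(q_logits_per_code, code_labels):
        total = total + F.cross_entropy(logits, labels)  # softmax cross entropy
    return total
```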

In [23], the effectiveness of the disentangled factors was demonstrated by measuring accuracy, but IBEGAN cannot be verified in that way because its interpretable factors are learned in an unsupervised way. Although IBEGAN does not have any supervised label for the latent code, unlike [23], the latent code must still be specified to generate samples. Therefore, we use the latent code as the supervised label and estimate the most probable latent code from the auxiliary distribution. We generate artificial images by changing the latent codes $c$, estimating $\hat{c}$ from the auxiliary distribution, and comparing the estimated codes $\hat{c}$ with the latent codes $c$. We calculate the accuracy of this estimation to provide an objective measure for the interpretability of our generated images.
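A sketch of this evaluation loop for a single categorical code is given below (PyTorch-style; the sampling ranges, dimensions, and function names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def code_recovery_accuracy(G, D, n_samples, z_dim, code_dim):
    # sample codes, generate images, recover the codes from Q, and compare
    z = torch.rand(n_samples, z_dim) * 2 - 1            # uniform noise in [-1, 1]
    labels = torch.randint(code_dim, (n_samples,))      # codes used as "labels"
    c = F.one_hot(labels, code_dim).float()
    _, q_logits = D(G(z, c))                            # Q shares the encoder
    predicted = q_logits.argmax(dim=1)
    return (predicted == labels).float().mean().item()
```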

To train the model, both losses $\mathcal{L}_D$ and $\mathcal{L}_G$ in Equation (7) are minimized in each iteration, and the variable $k_t$ is updated to control how much to concentrate on the loss of the generator versus the loss of the data $x$ during gradient descent. We used different basic settings of some hyperparameters for the CelebA and LSUN bedroom datasets because the images in the CelebA dataset share more common features than those in LSUN. In both settings, we used a small batch size to avoid memory errors, and the Adam optimizer was used to minimize the losses [32]. In addition, the diversity ratio $\gamma$ was fixed at 0.5. However, we set the dimension of the hidden representation to 64 for CelebA and 128 for LSUN bedroom. For the learning rate, we used 0.001 for the CelebA dataset and 0.0005 for the LSUN bedroom dataset.
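For reference, the per-dataset settings described above can be summarized as in the following sketch; the batch size is only described as "small" in the text, so the value shown here is a placeholder assumption:

```python
# Basic per-dataset settings; the batch size is an assumption, not a reported value.
settings = {
    "celeba":       {"hidden_dim": 64,  "learning_rate": 1e-3},
    "lsun_bedroom": {"hidden_dim": 128, "learning_rate": 5e-4},
}
gamma = 0.5       # diversity ratio, fixed in both settings
batch_size = 16   # placeholder for "a small batch size"
# Both losses are minimized with the Adam optimizer (torch.optim.Adam) [32].
```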

We should carefully select the control parameter λ of the regularization term in Equation (7). Unlike the InfoGAN model, we could not set λ to 1 because we used an autoencoder-based discriminator loss instead of a classifier-based loss. We initialized λ to a small value and updated it along with the learning rates of the generator and the discriminator. However, unlike the learning rates, λ was increased when it was updated.

To measure the convergence of the model, we used the convergence measure proposed in [3]:

$$\mathcal{M}_{\text{global}} = \mathcal{L}(x) + \big|\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z_G, c_G))\big|. \tag{9}$$

Because IBEGAN uses the proportional control algorithm, $\mathcal{M}_{\text{global}}$ stabilizes when the model reaches its final stage. If the model collapses, $\mathcal{M}_{\text{global}}$ cannot stabilize. We analyze the convergence of our model using the various loss values and this convergence measure in Section 4.3.2.
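As a small helper, the convergence measure of Equation (9) can be computed from the two scalar autoencoder losses (a sketch; γ = 0.5 follows the setting used in our experiments):

```python
def convergence_measure(loss_real, loss_fake, gamma=0.5):
    # M_global = L(x) + |gamma * L(x) - L(G(z, c))|, Equation (9)
    return loss_real + abs(gamma * loss_real - loss_fake)
```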

4.3. Experimental Results
4.3.1. Interpretable Representation

In this section, we evaluate our generative model in terms of interpretable representations. From the learned latent codes, we can find meaningful features and tendencies despite the absence of any additional label information. In addition, we quantitatively verify that our model can generate manipulated images by changing the interpretable latent code $c$.

To generate artificial face images like those in the CelebA dataset, we first choose to model the latent codes with two categorical codes, $c_1$ and $c_2$. In this setting, our model learns to represent the latent factors of gender and hair style. Figure 6 demonstrates that the variation of $c_1$ is related to gender. Figure 7 shows that the variation of $c_2$ changes the hair style of the generated images. We can control the generated image of a person to look more masculine by changing $c_1$, and we can try different hair styles on images by controlling $c_2$, which shows that our proposed model successfully learned interpretable representations.

The aforementioned experimental results showed the latent factors of hair style and masculinity. These two factors are highly correlated because men and women usually have different hair styles. We also experimented with a different configuration of the categorical codes on the CelebA dataset. In this experiment, we again selected two categorical codes, $c_1$ and $c_2$. In this setting, both latent factors indicated independent features of the image.

Figure 8 shows the second experimental result on the CelebA dataset. The figure illustrates the variation of both $c_1$ and $c_2$ with the same incompressible noise $z$. Each group of nine adjoined images is generated from the same incompressible noise $z$, where the rows denote varying $c_1$ and the columns denote varying $c_2$.

We find that $c_1$ captures the angle of the face, since the generated faces tend to look to the left side as $c_1$ changes. Also, $c_2$ refers to masculinity, since the gender of the generated image differs as $c_2$ changes. Moreover, we found that images with the same $z$ have their background, skin color, and facial expressions such as smiling in common. We can conclude that the incompressible noise $z$ denotes the overall air of the image.

The two experiments on CelebA showed that the interpretable representations in the various latent codes $c_1$ and $c_2$ can be either independent or dependent because our model learns the latent codes without predetermining the desirable factors. In our first experiment on CelebA, the two latent codes were correlated because the first code captured masculinity and the second code denoted hair style. In the second experiment, the two latent codes captured the angle of the face and masculinity, and these two codes were independent.

We also applied our model to the LSUN bedroom dataset, and the result is illustrated in Figure 9. In this experiment, we used a single categorical latent code $c_1$ with ten categories. We generated the images by fixing 10 different incompressible noises $z$ and changing the one-hot encoding of the latent code $c_1$. Therefore, in Figure 9, each row corresponds to the same $z$ and each column corresponds to the same $c_1$. In this experiment, $c_1$ captures the size of the bed in each image. We can see in Figure 9 that the size of the bed changes as the latent code $c_1$ changes. In addition, similarly to CelebA, the incompressible noise $z$ captures the overall air of the image.

In this section, we also verify the effectiveness of the interpretable latent codes. We generate new images and calculate the auxiliary distribution $Q(c\,|\,x)$. To generate an image with the generator network, the latent codes must be fed in, so we can naturally obtain the label information of the generated image. Table 1 shows the classification results based on the auxiliary distribution $Q(c\,|\,x)$. We repeated the experiment 100 times and report the average and the standard deviation of the accuracy.

In Table 1, the first and second rows show the accuracy for each individual factor, whereas the last row shows the accuracy for the two factors at the same time. As expected, the learned interpretable codes are well reflected in the generated images, showing high accuracies and small standard deviations in all cases.

4.3.2. Convergence of Models

In this section, we demonstrate that the IBEGAN model can produce images of high visual quality and can be trained stably. These properties are inherited from the BEGAN model.

We aim not only to learn interpretable representations but also to obtain high-quality images. To verify this, we compared the CelebA images generated by our model with images generated by the InfoGAN model. The result is illustrated in Figure 10. Both sets of generated images change hair style as the latent code $c$ changes. We can see that our model shows better image quality than the previous model.

To inspect the convergence of our model, we plot all the loss values in Figure 11. We constructed the generator and discriminator losses from the autoencoder losses of real and fake data. Therefore, we plot the discriminator loss $\mathcal{L}_D$ in Figure 11(a), the generator loss $\mathcal{L}_G$ in Figure 11(b), the autoencoder loss of real data $\mathcal{L}(x)$ in Figure 11(c), and the autoencoder loss of fake data $\mathcal{L}(G(z, c))$ in Figure 11(d).

From the plots, we find that the pattern of the discriminator loss is similar to the autoencoder loss of real data and the pattern of the generator loss is similar to the autoencoder loss of fake data. This is because we set λ, $\lambda_k$, and the initial $k_t$ to small values. While the discriminator loss declined stably, the generator loss was unstable in the beginning of training but eventually stabilized. In this experiment, we set the balance parameter $\gamma$ to 0.5. When the model converges, $\gamma$ satisfies the balancing Equation (4), where $\gamma$ is the ratio of $\mathbb{E}[\mathcal{L}(G(z, c))]$ to $\mathbb{E}[\mathcal{L}(x)]$. At the end of training, the autoencoder loss of real data is indeed about twice the autoencoder loss of fake data.

In addition, we evaluate the convergence of our model with the convergence measure in Equation (9). Figure 12 shows the change of the generated images along with $\mathcal{M}_{\text{global}}$. The fidelity of the images is low when $\mathcal{M}_{\text{global}}$ is high. Also, once $\mathcal{M}_{\text{global}}$ stabilizes, the images hardly change. We find that the IBEGAN model converges stably.

5. Conclusion

This paper proposed IBEGAN, a generative adversarial model that does not require any kind of supervision and is still able to learn interpretable representations. In addition, the IBEGAN model has a simple training procedure and generates high-quality images by introducing the equilibrium concept and proportional control theory as in the BEGAN model. In the experiments, the produced images maintained high visual quality, and they could be manipulated as desired by changing the values of the latent codes.

In this study, we used an autoencoder model to make the discriminator more informative, but unlike the BEGAN model, we had to construct an additional generator network because the deterministic autoencoder cannot control the distribution of its hidden representation. Therefore, the IBEGAN model could be extended to share the generator network with the encoder network by using a variational autoencoder.

However, the IBEGAN model had difficulty in learning the latent codes independently. The model could be extended by adding a regularization term that encourages the latent factors to be independent. In this way, the IBEGAN model would produce more robust, disentangled latent codes.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. 2016R1A2B3014030). Also, it was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) (No. 2017R1A5A1015626).