Abstract

Unpaired image translation is a challenging problem in computer vision. Existing generative adversarial network (GAN) models rely mainly on the adversarial loss and a few additional constraints, but the degree of constraint imposed on the generator and the discriminator is insufficient, which degrades image quality. In addition, we find that current GAN-based models have not yet exploited an auxiliary domain to constrain the generator. To address these problems, we propose a multiscale and multilevel GANs (MMGANs) model for image translation. In this model, we add an auxiliary domain to constrain the generator: the auxiliary domain is combined with the original domains for modelling and helps the generator learn the detailed content of the images. We then use multiscale and multilevel feature matching to constrain the discriminator, with the aim of keeping the training process as stable as possible. Finally, we conduct experiments on six image translation tasks, and the results verify the effectiveness of the proposed model.

1. Introduction

Image translation [1] is similar to language translation: it converts an input image from a source domain into a corresponding image in a target domain, for example, turning an image of a pear into an image of an apple. There are many methods [1–6] for the image translation problem, but GAN-based methods [1, 7–11] have gained increasing attention. In these methods, the input image from the source domain is fed to a generator, which produces fake samples intended to deceive a discriminator; the discriminator is then responsible for judging whether its inputs are real samples from the target domain or generated fake samples. Deep convolutional or deconvolutional neural networks [12–17] are commonly used to construct the generator and the discriminator.

According to whether the datasets are paired or unpaired, GAN-based image translation methods can be roughly divided into two categories: paired and unpaired. Paired methods [1, 18] require paired datasets, which are difficult and expensive to prepare in practical applications [19]. To reduce this cost, [20–22] propose unpaired methods, which are unsupervised domain translation methods [23].

However, both paired and unpaired models are inadequate at generating the detailed information of images. Producing better translation results remains challenging because of the following problems: (1) how to constrain the generator so that it generates the detailed content of the image and (2) how to stabilize the training process so as to obtain a better model, in terms of both generated images and generalization performance.

On the one hand, there is the question of how to constrain the generator. GAN-based methods gradually approach the real data distribution by adjusting the generator parameters; in other words, they do not require a pre-specified model, so the modelling is very unconstrained. When the image contains many pixels, this freedom makes the generation process hard to control. To address this problem, [24] proposes constraining the generator and discriminator by adding a condition variable. Inspired by this idea, many researchers have proposed models from different perspectives. The first is to add text or semantic information to the model. Reference [25] adds text information to cascaded GANs, which generate high-definition images from text. Reference [26] proposes structural GANs, which incorporate semantic information into a conditional generative model. The second is to add a regularization term. Reference [27] incorporates mutual information into adversarial modelling, which effectively adds a mutual-information regularization term. Reference [7] introduces a cycle consistency constraint to achieve cross-domain image translation. Reference [28] uses the Wasserstein distance and a gradient penalty for adversarial modelling. In addition, because the Nash equilibrium points of the original GANs are not asymptotically stable, [29] adds a regularization term to the gradient update and proves that, with this term, the equilibrium points of the original GANs become locally asymptotically stable.

By looking at similar objects in real life, we find that they share a certain similarity in appearance or structure. In addition, few methods constrain the generator by adding an auxiliary domain [30]. Inspired by this, we add an auxiliary domain that helps the generator learn the detailed information in the image during translation.

On the other hand, there is the question of how to stabilize the training process. To obtain better generated images, many researchers have put forward ideas for stabilizing training. First, there is the perspective of missing modes (mode collapse). Reference [31] analyses the problems of the original GAN objective function, which easily leads to vanishing gradients and missing modes when training GANs; [28] then improves the training process. Since the convergence of the original GAN training was not proved, [32] proposes a two time-scale update rule for training GANs, which makes the training process converge to a local Nash equilibrium. Second, there is the probabilistic perspective. Reference [33] studies the distribution of the squared singular values of the input-output Jacobian of the generator. Third, there is the multiscale discriminator. Reference [34] proposes a multiscale discriminator to stabilize the training process and generate high-resolution images. Inspired by these works, in order to stabilize the training process and improve the discriminating ability of the discriminator, we use multiscale and multilevel ideas to constrain the discriminator, implemented with deep convolutional neural networks [15, 35–37].

In this paper, we focus on the unpaired image translation task with GAN-based methods. We try to solve two problems: (1) in GAN-based methods [7–9, 20, 21], the generator lacks control over detailed information during image generation, so we constrain the generator to produce more detailed content during translation; (2) to obtain better generated images and better generalization performance, we stabilize the GAN training process as much as possible.

To address these concerns, we propose a novel unpaired image translation framework that constrains the generator and the discriminator simultaneously. On the one hand, to constrain the generator, we add an auxiliary domain to the model, which helps the generator learn the detailed information in the image; we combine this auxiliary domain with the domains to be learned and design multiple generators and discriminators for image translation. On the other hand, to improve the ability of the discriminator and stabilize training, we use multiscale and multilevel feature matching to constrain the discriminator. Finally, we constrain the proposed model with multiple generator losses, multiscale discriminator losses, multilevel feature matching losses, and full cycle consistency losses.

Our main contributions are as follows.

(1) We propose an unpaired, multiscale and multilevel feature matching generative adversarial network (MMGANs) that adds an auxiliary domain to achieve cross-domain image translation.

(2) We modify the original GAN model by constraining the generator and the discriminator simultaneously. In our model, we add an auxiliary domain as auxiliary information to help the generator learn detailed information while generating images, and we constrain the discriminator with multiscale and multilevel feature matching losses.

(3) We conduct experiments on six image translation tasks. According to the proposed evaluation method and the generated images, the experimental results show that our model achieves better performance.

The rest of this paper is organized as follows. Section 2 describes the proposed method and the detailed model. Section 3 provides the results and discussion. Section 4 concludes this paper.

2. Materials and Methods

In this paper, inspired by the constrained generator, the cycle consistency of CycleGAN, the multiscale discriminator, and multilevel feature matching, we design the MMGANs model. On the one hand, we constrain the generator by adding an additional domain [30] as auxiliary information. On the other hand, we use multiscale and multilevel feature matching to constrain the discriminator. To realize this model, we specifically design multiple generator losses, multiscale discriminator losses, full cycle constraint losses, and multilevel feature matching losses.

2.1. Formulation Description

We focus on the unpaired image translation problem. For convenience in the following description, we suppose that image translation is carried out between the domains X and Y, and the added auxiliary domain is Z. We have samples x, y, and z from the X, Y, and Z domains, respectively. To implement the proposed model, we design six generators (G1, G2, G3 and F1, F2, F3), which map between the three domains in both directions, and three discriminators (DX, DY, and DZ). In particular, to stabilize the training process, our discriminators operate at multiple scales. Moreover, to improve the quality of the generated images, we add multilevel feature matching in the discriminators.

For example, the generator G3 generates samples of one target domain from its input domain, and the generator F1 does the same for its own pair of domains. The corresponding discriminator then determines whether the generated samples and the real samples of its domain are real or fake, and the other two discriminators behave analogously. The framework of the MMGANs model and its discriminator model are shown in Figures 1 and 2.

2.2. Adversarial Loss with Multiscale and Multilevel

The proposed MMGANs model is inspired by [30] and consists of multiple generators and discriminators. In the proposed model, we use multiscale and multilevel constraints on the discriminator. In this way, the training process is stabilized and the discriminator attends to the detailed content of the image.

Concretely, there are three inputs x, y, and z, and the discriminators need to distinguish real from fake samples for each of them. To stabilize the training process, we design multiscale discriminator losses. The adversarial losses with the multiscale discriminator are as follows, where the subscript Dis denotes the multiscale discriminator terms.
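For a single translation direction, a representative form of such a multiscale adversarial loss (written with assumed notation rather than the paper's exact symbols: a generator G into domain Y, its discriminator D_Y, n_s scales, and D_Y^{(s)} denoting the discriminator applied to the input downsampled to scale s) is

\[
\mathcal{L}_{\mathrm{Dis}}(G, D_Y) = \sum_{s=1}^{n_s} \Big( \mathbb{E}_{y \sim p(y)} \big[ \log D_Y^{(s)}(y) \big] + \mathbb{E}_{x \sim p(x)} \big[ \log \big( 1 - D_Y^{(s)}(G(x)) \big) \big] \Big),
\]

with symmetric terms for the remaining generators and domains.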

To further improve the quality of the generated images, we add a constraint on the discriminators that focuses their judgment on the detailed content of the generated image. Specifically, we add multilevel feature matching losses for the discriminator, as follows, where the subscript F denotes the norm used in the multilevel feature matching losses.
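To make the two constraints concrete, the following is a minimal TensorFlow sketch rather than our exact networks: the layer sizes, the PatchGAN-style structure, and the use of two scales are illustrative assumptions, and only the multiscale forward pass and a multilevel L1 feature matching term are shown.

import tensorflow as tf

def make_level_discriminator():
    # A small PatchGAN-style discriminator that also returns its intermediate
    # activations, so they can be reused for multilevel feature matching.
    inp = tf.keras.Input(shape=(None, None, 3))
    feats, x = [], inp
    for filters in (64, 128, 256):
        x = tf.keras.layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
        feats.append(x)
    logits = tf.keras.layers.Conv2D(1, 4, padding="same")(x)
    return tf.keras.Model(inp, [logits] + feats)

def multiscale_forward(discriminators, image):
    # Feed the same image to one discriminator per scale, halving the
    # resolution between scales (the multiscale constraint).
    outputs = []
    for d in discriminators:
        outputs.append(d(image))
        image = tf.keras.layers.AveragePooling2D(pool_size=2)(image)
    return outputs  # one [logits, level-1, level-2, level-3] list per scale

def feature_matching_loss(real_outs, fake_outs):
    # L1 distance between real and generated features, summed over all
    # scales and levels (the multilevel feature matching constraint).
    loss = 0.0
    for real, fake in zip(real_outs, fake_outs):
        for fr, ff in zip(real[1:], fake[1:]):
            loss += tf.reduce_mean(tf.abs(fr - ff))
    return loss

discriminators = [make_level_discriminator() for _ in range(2)]  # two scales assumed
real_y = tf.random.uniform([1, 128, 128, 3])
fake_y = tf.random.uniform([1, 128, 128, 3])  # stands in for a generator output
print(feature_matching_loss(multiscale_forward(discriminators, real_y),
                            multiscale_forward(discriminators, fake_y)))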

2.3. Objective Function of MMGANs

Our overall objective function includes the losses of the multiple generators and multiscale discriminators, cycle losses, full cycle losses, and multilevel feature matching losses. The optimization aim of the MMGANs model and the total loss L are given below, where the weight parameters are row vectors of dimension 1 × 3, and the cycle loss and full cycle loss follow [30].
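A hedged sketch of the overall structure of this objective, with assumed symbols λ1, λ2, λ3 for the three 1 × 3 weight vectors, is

\[
\min_{G_1,\ldots,F_3} \; \max_{D_X, D_Y, D_Z} \; \mathcal{L} = \sum \lambda_1 \mathcal{L}_{\mathrm{Dis}} + \sum \lambda_2 \mathcal{L}_{\mathrm{F}} + \sum \lambda_3 \mathcal{L}_{\mathrm{cyc}} + \mathcal{L}_{\mathrm{full}},
\]

where the sums run over the generator-discriminator pairs, L_cyc denotes the cycle loss, and L_full denotes the full cycle loss of [30].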

We use the Adam optimizer to train this model; the procedure is described in Algorithm 1.

(1) Initialize the hyperparameters, the parameters of the six generators G1, G2, G3, F1, F2, F3, and the parameters of the three discriminators DX, DY, DZ; denote the whole parameter set by θ.
(2) while the maximum number of training iterations has not been reached do
(3)   Sample one image at a time from each of the domains X, Y, and Z.
(4)   Compute the generator losses G1_loss, F1_loss, G2_loss, F2_loss, G3_loss, and F3_loss according to formula (8).
(5)   for i = 1 to the number of scales do
(6)     for j = 1 to the number of feature levels do
(7)       Compute the discriminator losses DX_loss, DY_loss, and DZ_loss according to formulas (1), (2), (3), (4), (5), and (6).
(8)     end for
(9)   end for
(10)  Update θ with the Adam optimizer using the losses (G1_loss, F1_loss, G2_loss, F2_loss, G3_loss, F3_loss, DX_loss, DY_loss, DZ_loss).
(11)  Increment the iteration counter.
(12) end while
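As a complement to Algorithm 1, the following is a minimal single-direction TensorFlow sketch of one update step. It covers only one generator pair and one discriminator, uses tiny placeholder networks and a least-squares adversarial loss, and omits the auxiliary domain, the multiscale and multilevel terms, and the full cycle loss; it illustrates the update scheme rather than our exact implementation.

import tensorflow as tf

def make_generator():
    # Placeholder generator; the real networks follow the CycleGAN architecture.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")])

def make_discriminator():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(1, 3, padding="same")])

G1, F1, D_Y = make_generator(), make_generator(), make_discriminator()
g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
mse, mae = tf.keras.losses.MeanSquaredError(), tf.keras.losses.MeanAbsoluteError()

def train_step(x, y, lam_cyc=10.0):
    with tf.GradientTape(persistent=True) as tape:
        fake_y = G1(x, training=True)          # X -> Y translation
        rec_x = F1(fake_y, training=True)      # Y -> X reconstruction (cycle)
        d_real = D_Y(y, training=True)
        d_fake = D_Y(fake_y, training=True)
        g_loss = mse(tf.ones_like(d_fake), d_fake) + lam_cyc * mae(x, rec_x)
        d_loss = mse(tf.ones_like(d_real), d_real) + mse(tf.zeros_like(d_fake), d_fake)
    g_vars = G1.trainable_variables + F1.trainable_variables
    g_opt.apply_gradients(zip(tape.gradient(g_loss, g_vars), g_vars))
    d_vars = D_Y.trainable_variables
    d_opt.apply_gradients(zip(tape.gradient(d_loss, d_vars), d_vars))
    return g_loss, d_loss

x = tf.random.uniform([1, 128, 128, 3])
y = tf.random.uniform([1, 128, 128, 3])
print(train_step(x, y))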

In addition, this article is an extended version of our previous work [30], with the following major improvements.

Firstly, we modify the discriminator with multiscale and multilevel constraints. In this way, the training process is more stable and the discriminator pays more attention to matching features in the generated image, with the goal of making the details of the generated image more realistic. Secondly, we add the UNIT method [9] as a baseline in the experiments. Furthermore, in the performance evaluation, the AMT evaluation index is further modified to make the evaluation more objective and comprehensive.

3. Results and Discussion

Our experimental environment is Ubuntu 16.04 with TensorFlow and Python 3; GPU: GeForce GTX 1080 Ti; memory: 64 GB; CPU: Intel, 36 cores.

3.1. Training and Testing Datasets

We use fruit and season datasets for the experiments in this paper. The fruit dataset includes images of apples, oranges, and pears, and the season dataset covers images of summer, autumn, and winter. We adopt the datasets of [30] and resize all images to 128 × 128. The training and testing sets are shown in Table 1.

3.2. Baseline

To evaluate the performance of the MMGANs, we compare our method with CycleGAN [7], DualGAN [8], and UNIT [9].

CycleGAN: it uses the adversarial loss and a cycle consistency constraint to achieve unpaired, unsupervised image translation from domain X to domain Y or from domain Y to domain X.

DualGAN: it takes advantage of dual learning to constrain the generator and discriminator. DualGAN uses two generators and two discriminators to achieve unpaired, unsupervised image translation, expanding the basic GAN into two coupled GANs.

UNIT: it encodes the images of two domains into a shared latent space through weight-sharing encoders and then realizes unsupervised cross-domain image translation with GANs.

3.3. Performance Evaluation

Amazon Mechanical Turk (AMT) [8] is one of the methods for evaluating images generated by GANs. It asks observers only to pick out the real or fake image among the testing images. Selecting only real or fake samples does not characterize the performance of a model well. Instead, the generated images can be classified by the observers into four categories: better, worse, both bad, or both good.

Based on the above discussion, we propose a more comprehensive way to evaluate the model. Firstly, several observers label each result as worse, better, both good, or both bad, and the numbers of images in the four categories are counted. Then we take the mean of these counts over the observers and calculate the corresponding proportions.

Suppose the total number of testing images is fixed. Compared with the CycleGAN, DualGAN, and UNIT models, we compute the mean numbers of worse, better, both-good, and both-bad images generated by the MMGANs model, and then calculate the corresponding percentages, respectively:
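With assumed symbols, let N be the total number of testing images and N_w, N_b, N_g, and N_bad the mean numbers of worse, better, both-good, and both-bad images; the four percentages are then

\[
P_{\mathrm{w}} = \frac{N_{\mathrm{w}}}{N}, \qquad P_{\mathrm{b}} = \frac{N_{\mathrm{b}}}{N}, \qquad P_{\mathrm{g}} = \frac{N_{\mathrm{g}}}{N}, \qquad P_{\mathrm{bad}} = \frac{N_{\mathrm{bad}}}{N}.
\]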

Finally, we analyze the models by comparing these four quantitative indicators.

3.4. Results of Testing

To be fair, all models are trained for 200,000 iterations and tested on the datasets described in this paper. The experiments focus on two points: (1) we compare the images generated by CycleGAN, DualGAN, UNIT, and MMGANs on the different datasets, and (2) we use our comprehensive performance evaluation method to calculate quantitative ratios and analyze them.

For convenience of narration, we set up six experimental cases of image translation, as shown in Table 2.

3.4.1. Setting Training Parameters

In this section, we discuss the parameter settings of the experiment, which include the nine loss weight parameters and the choice of the multiscale setting of the discriminator. In addition, we set the norm used in training. In the specific implementation, we adopt the network structure of the CycleGAN model.

We choose the parameters by balancing the training time against the quality of the generated images. Weight parameters: for all cases, three of the weights are set to 10, one to 5, two to 0, and the remaining three to 1. The multiscale and multilevel parameters are shown in Table 3. For example, Figure 3 shows the results for different multiscale and multilevel values on the apple-to-orange task, which illustrates that different multiscale and multilevel values affect the generated images. Therefore, we use different multiscale and multilevel parameters for different image translation tasks.

3.4.2. Generated Images

To compare the models fairly, we present images generated by CycleGAN, DualGAN, UNIT, and MMGANs. The auxiliary domains are, respectively, pear, orange, apple, winter, autumn, and summer. The testing results are shown in Figures 4, 5, and 6.

In Figures 4, 5, and 6, the results show that the images generated by MMGANs are more realistic: the outline, color, background, and foreground are better than in the images generated by CycleGAN, DualGAN, and UNIT. Moreover, for images with multiple target entities, our model also works better, while the fidelity of the images generated by CycleGAN, DualGAN, and UNIT is noticeably worse. These results indicate that the added auxiliary domain information helps the MMGANs model focus on generating image details: the generator is constrained by the auxiliary domain, which makes the generated images more realistic.

3.4.3. Performance Results

According to our proposed evaluation method, we ask a few observers to sort the images in the testing results. For example, when testing apple2orange, based on their first impression, the observers judge whether each image in the testing results is worse, better, both good, or both bad.

Compared with CycleGAN, DualGAN, and UNIT, the testing results show that our model obtains better performance in terms of the four indicator values. They also show that the probability of producing better-quality generated images is higher than that of CycleGAN, DualGAN, and UNIT. Concretely, in Tables 4, 5, and 6 the proportion of better images is higher than the proportion of worse images, except for summer2autumn with DualGAN. In terms of the average ratio over all image translation tasks, however, our MMGANs are better than CycleGAN, DualGAN, and UNIT. In addition, the worse-image proportion reflects the cases in which the compared model produces the better image, and, on the contrary, the both-bad proportion reflects the probability of failure during image translation.

Finally, through experiments on the six image translation tasks, the results show that the constrained generator and the multiscale and multilevel discriminator losses give our model better generalization performance. In comparison, our MMGANs have a higher probability of generating better images.

3.5. Discussion

In this paper, we propose an approach to enhance the ability of GANs in image translation by using an added auxiliary domain to constrain the generator and multiscale and multilevel losses to constrain the discriminator. In this regard, we have studied the controllability of GANs; through this study, our MMGANs make the generated images more realistic in image translation.

However, our model also fails on some image translation cases. Possible reasons include the following. (1) There are bad samples in our training and testing sets, which may lead to bad generated samples at test time, as shown in Figures 7 and 8. (2) The performance of the model needs further improvement, for example, by introducing an attention mechanism [38–41]. In addition, to diversify the generated images and make the generation process interactive, we will consider using semantics to control image generation in future work.

4. Conclusion

We presented the MMGANs model for image translation, which constrains the generator and the discriminator with an added auxiliary domain, a full cycle consistency loss, and multiscale and multilevel losses. In particular, the constraint on the generator allows it to learn the detailed content of the image, while the constraint on the discriminator makes the whole training process more stable and the produced images more lifelike. Through experiments on six image translation tasks, our model achieves better performance than CycleGAN, DualGAN, and UNIT.

Data Availability

The apple, orange, pear, summer, autumn, and winter datasets used to support the findings of this study were downloaded from ImageNet and Flickr. This paper cites [7], and researchers can easily access the data from ImageNet and Flickr.

Conflicts of Interest

There are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61773093), Key R&D Program (Intelligent Processing Technology of Multi-source Litigation Letters and Visits National 2018YFC0831800), Important Science and Technology Innovation Projects in Chengdu (2018-YF08-00039-GX), and Research Programs of Sichuan Science and Technology Department (2016JY0088, 17ZDYF3184).