Review Article

Deep CNN and Deep GAN in Computational Visual Perception-Driven Image Analysis

Table 5

Evolution of the deep GAN.

Each numbered entry below lists the model, its application, network architecture, methodology, key advantage, and major limitation.

1. SRGAN [9]
Application: Image super-resolution.
Network architecture: Generator: two convolutional layers with small 3 × 3 kernels and 64 feature maps, batch normalization layers, and parametric ReLU. Discriminator: leaky ReLU activation function.
Methodology: A perceptual loss function is proposed, composed of a content loss and an adversarial loss.
Key advantage: Low-resolution images are converted into high-resolution images at a 4× upscaling factor.
Major limitation: Texture detail is not realistic enough and is accompanied by some noise.
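The perceptual loss described above can be sketched numerically. This is a minimal NumPy illustration, not SRGAN's implementation: the feature maps are assumed to come from a pretrained network (SRGAN uses VGG), and the adversarial weight of 10⁻³ follows the paper's setting.

```python
import numpy as np

def content_loss(feat_real, feat_fake):
    # MSE between feature maps of a pretrained network (e.g., VGG);
    # the features are passed in directly as arrays here.
    return np.mean((feat_real - feat_fake) ** 2)

def adversarial_loss(d_fake):
    # -log D(G(x)): penalizes generated images the discriminator rejects.
    return -np.mean(np.log(d_fake + 1e-12))

def perceptual_loss(feat_real, feat_fake, d_fake, adv_weight=1e-3):
    # Perceptual loss = content loss + weighted adversarial loss.
    return content_loss(feat_real, feat_fake) + adv_weight * adversarial_loss(d_fake)
```

When the generated features match the target and the discriminator is fully fooled, the loss approaches zero.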

2. ACGAN [15]
Application: Image synthesis.
Network architecture: Generator: a series of deconvolutional (transposed convolutional) layers. Discriminator: a set of 2D convolutional layers with leaky ReLU, followed by linear layers with a softmax and a sigmoid function for its two outputs.
Methodology: Two variants of the model were trained to generate images at 64 × 64 and 128 × 128 spatial resolutions.
Key advantage: Accuracy can be assessed for individual classes.
Major limitation: Ignores the loss component arising from class labels when a label is unavailable for a given training image.
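The two-headed discriminator output described above can be sketched as follows; this is a schematic NumPy example with hypothetical weight matrices, not ACGAN's actual layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def acgan_discriminator_heads(features, w_src, w_cls):
    # ACGAN's discriminator ends in two heads over shared features:
    # a sigmoid "source" head (real vs. fake) and a softmax class head.
    src = sigmoid(features @ w_src)  # shape (batch, 1)
    cls = softmax(features @ w_cls)  # shape (batch, n_classes)
    return src, cls
```

Training sums the source loss and the class loss; the per-class softmax output is what allows accuracy to be assessed class by class.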

3. CGAN [16]
Application: Image-to-image translation, image tagging, and face generation (Gauthier, J. (2014)).
Network architecture: Generator and discriminator are conditioned on some arbitrary external information, with the ReLU activation function and a sigmoid output layer.
Methodology: Minimize the value function for G and maximize it for D.
Key advantage: CGAN can easily accept a multimodal embedding as the conditional input.
Major limitation: CGAN is not strictly unsupervised; some form of labeling is required for it to work.
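The conditioning described above can be sketched in its simplest form: the label (here one-hot encoded, an assumption for illustration) is concatenated with the generator's noise input.

```python
import numpy as np

def one_hot(labels, n_classes):
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def condition_input(z, labels, n_classes):
    # CGAN conditions both G and D by feeding the external information
    # alongside the usual input; the simplest form is concatenation.
    return np.concatenate([z, one_hot(labels, n_classes)], axis=1)
```

Any embedding of the conditioning signal (including a multimodal one) can replace the one-hot vector in the same position.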

4. InfoGAN [1, 2]
Application: Facial image generation.
Network architecture: Generator: up-convolutional architecture with the standard ReLU activation. Discriminator: leaky ReLU with leak rate 0.1 applied to hidden layers as the nonlinearity.
Methodology: Learns disentangled representations by maximizing mutual information.
Key advantage: Capable of learning disentangled, interpretable representations.
Major limitation: Sometimes requires adding noise to the data to stabilize the network.
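The mutual-information objective above can be sketched as follows. This is a simplified NumPy illustration with a single categorical code: the latent input is split into unstructured noise z and a structured code c, and an auxiliary network Q (whose predictions are assumed given here) is trained to recover c, which lower-bounds the mutual information I(c; G(z, c)).

```python
import numpy as np

def infogan_latent(batch, noise_dim, n_cat, rng):
    # InfoGAN splits the latent input into unstructured noise z
    # and a structured code c (here: one categorical code).
    z = rng.standard_normal((batch, noise_dim))
    cat = rng.integers(0, n_cat, size=batch)
    c = np.zeros((batch, n_cat))
    c[np.arange(batch), cat] = 1.0
    return np.concatenate([z, c], axis=1), cat

def mutual_info_loss(q_probs, cat):
    # Cross-entropy between Q's predicted code distribution and the sampled
    # code; minimizing it maximizes a lower bound on I(c; G(z, c)).
    return -np.mean(np.log(q_probs[np.arange(len(cat)), cat] + 1e-12))
```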

5. DCGAN [91]
Application: General image representations.
Network architecture: Generator: ReLU activation with batch normalization, except for the output layer, which uses the tanh function. Discriminator: leaky ReLU with batch normalization for modeling high-resolution inputs.
Methodology: A hierarchy of representations is learned, from object parts upward.
Key advantage: Stable training, good image representations, easy convergence.
Major limitation: Gradients can vanish or explode.
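The generator above is built from transposed ("fractionally strided") convolutions that progressively upsample the latent vector. The standard output-size formula makes the typical doubling per layer concrete; the 4 → 64 progression below is a common DCGAN configuration, given as an assumed example.

```python
def deconv_out_size(h_in, kernel, stride, padding):
    # Output height/width of a transposed convolution, the upsampling
    # layer DCGAN's generator is built from.
    return (h_in - 1) * stride - 2 * padding + kernel

# With kernel 4, stride 2, padding 1, each layer doubles the spatial
# size: 4 -> 8 -> 16 -> 32 -> 64.
```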

6. LAPGAN [10]
Application: Image generation.
Network architecture: Laplacian pyramid of convolutional networks.
Methodology: Images are generated in a coarse-to-fine fashion.
Key advantage: Each pyramid level is trained independently.
Major limitation: Nonconvergence and mode collapse.
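The coarse-to-fine decomposition above rests on the Laplacian pyramid, which can be sketched in NumPy. The blur/subsample operators here are simplified stand-ins (average pooling and nearest-neighbour upsampling) chosen so that reconstruction is exact; LAPGAN trains one generator per residual level and reconstructs the image the same way, coarsest level first.

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling (a stand-in for blur-and-subsample).
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Nearest-neighbour upsampling by a factor of 2.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels):
    # Each level stores the residual between the image and the
    # upsampled coarser image.
    pyramid = []
    for _ in range(levels):
        coarse = downsample(img)
        pyramid.append(img - upsample(coarse))
        img = coarse
    pyramid.append(img)  # coarsest image
    return pyramid

def reconstruct(pyramid):
    # Invert the pyramid: upsample and add residuals, coarse to fine.
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = upsample(img) + residual
    return img
```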

7. SAGAN [11]
Application: Image generation.
Network architecture: Self-attention modules are added to the convolutional GAN, and spectral normalization is applied to the generator.
Methodology: Realistic images are generated.
Key advantage: The inception score is boosted and the Fréchet inception distance is reduced.
Major limitation: The attention mechanism is not extended further.
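Spectral normalization, which SAGAN applies to the generator as well as the discriminator, divides each weight matrix by its largest singular value to constrain the layer's Lipschitz constant. A minimal NumPy sketch using power iteration (in practice a single iteration per training step is reused, not a fresh run to convergence):

```python
import numpy as np

def spectral_normalize(w, n_iter=100):
    # Estimate the largest singular value of w via power iteration,
    # then divide it out so the normalized matrix has spectral norm ~1.
    u = np.random.default_rng(0).standard_normal(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # approximate largest singular value
    return w / sigma
```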

8. GRAN [12]
Application: Generating realistic images.
Network architecture: Recurrent CNN with constraints.
Methodology: Images are generated by incremental updates to a canvas using a recurrent network.
Key advantage: Sequential generation of images.
Major limitation: Samples collapse when trained for a long duration.
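The incremental-canvas idea above can be sketched in a few lines. The `delta_fn` below is a hypothetical placeholder for GRAN's recurrent update network; the point is only that the image accumulates over time steps rather than being emitted in one pass.

```python
import numpy as np

def gran_generate(steps, canvas_shape, delta_fn):
    # GRAN builds the image over several recurrent steps: each step
    # adds an update ("delta") to a running canvas.
    canvas = np.zeros(canvas_shape)
    for t in range(steps):
        canvas += delta_fn(canvas, t)
    return canvas
```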

9. VAE-GAN [92]
Application: Facial image generation.
Network architecture: A GAN discriminator is used to learn the loss function in place of the VAE's pixel-wise reconstruction error.
Methodology: Combines a GAN and a VAE to produce an encoder of data into the latent space.
Key advantage: GAN and VAE are combined in a single model.
Major limitation: The blurry output the VAE generates.

10. BiGAN [18]
Application: Image generation.
Network architecture: Generator: in addition to G, an encoder maps data to latent representations. Discriminator: discriminates jointly in the data and latent spaces.
Methodology: Learns features for related semantic tasks and uses them in unsupervised settings.
Key advantage: Minimization of the reconstruction loss.
Major limitation: Generator and discriminator are highly dependent on each other during training.

11. AAE [19]
Application: Dimensionality reduction, data visualization, disentangling image style, and unsupervised clustering.
Network architecture: Generative autoencoder.
Methodology: Variational inference is performed by matching the autoencoder's aggregated posterior over the hidden code vector to an arbitrary prior distribution.
Key advantage: Balanced approximation; the method extends to semisupervised learning and outperforms variational autoencoders.
Major limitation: Generated samples are blurry and oversmoothed (Zhang, J. et al., 2018).

12. Pix2Pix [93]
Application: Image-to-image translation.
Network architecture: Generator: U-Net architecture. Discriminator: PatchGAN classifier. ReLU activation function; batch normalization.
Methodology: The network learns a loss tailored to the data and the task, making it applicable to a wide variety of tasks.
Key advantage: Parameter reduction; realistic images are generated.
Major limitation: Training images must be paired one-to-one.
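The parameter reduction noted above comes from PatchGAN classifying overlapping patches rather than the whole image, so the discriminator only needs enough layers to cover one patch. The patch size equals the network's receptive field, which the standard recurrence below computes; the layer list in the test reflects the common "70 × 70 PatchGAN" configuration (kernel 4; strides 2, 2, 2, 1, 1), given here as an assumed example.

```python
def receptive_field(layers):
    # Receptive field of a stack of conv layers, each given as
    # (kernel, stride); this is the patch size a PatchGAN
    # discriminator judges as real or fake.
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump  # growth scaled by cumulative stride
        jump *= stride
    return r
```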