Review Article

Deep CNN and Deep GAN in Computational Visual Perception-Driven Image Analysis

Table 5

Evolution of the deep GAN.

Each numbered entry below lists the model, its application, network architecture, methodology, key advantage, and major limitation.

1. SRGAN [9]
Application: Image super-resolution.
Network architecture: Generator: two convolutional layers with small 3 × 3 kernels and 64 feature maps, batch normalization layers, and parametric ReLU. Discriminator: leaky ReLU activation function.
Methodology: A perceptual loss function is proposed, composed of a content loss and an adversarial loss.
Key advantage: Low-resolution images are converted into high-resolution images at a 4× upscaling factor.
Major limitation: Texture detail is not realistic enough and is accompanied by some noise.
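The perceptual loss described above can be sketched numerically. This is a minimal NumPy illustration, not SRGAN's implementation: the feature maps are assumed to come from a pretrained network (SRGAN uses VGG), and the adversarial weight of 10⁻³ follows the paper's setting.

```python
import numpy as np

def content_loss(feat_real, feat_fake):
    # MSE between feature maps of a pretrained network (e.g., VGG);
    # the features are passed in directly as arrays here.
    return np.mean((feat_real - feat_fake) ** 2)

def adversarial_loss(d_fake):
    # -log D(G(x)): penalizes generated images the discriminator rejects.
    return -np.mean(np.log(d_fake + 1e-12))

def perceptual_loss(feat_real, feat_fake, d_fake, adv_weight=1e-3):
    # Perceptual loss = content loss + weighted adversarial loss.
    return content_loss(feat_real, feat_fake) + adv_weight * adversarial_loss(d_fake)
```

When the generated features match the target and the discriminator is fully fooled, the loss approaches zero.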

2. ACGAN [15]
Application: Image synthesis.
Network architecture: Generator: a series of deconvolutional (transposed convolutional) layers. Discriminator: a set of 2D convolutional layers with leaky ReLU, followed by linear layers with a softmax and a sigmoid function for its two outputs.
Methodology: Two variants of the model were trained to generate images at 64 × 64 and 128 × 128 spatial resolutions.
Key advantage: Accuracy can be assessed for individual classes.
Major limitation: Ignores the loss component arising from class labels when a label is unavailable for a given training image.
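The two-headed discriminator output described above can be sketched as follows; this is a schematic NumPy example with hypothetical weight matrices, not ACGAN's actual layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def acgan_discriminator_heads(features, w_src, w_cls):
    # ACGAN's discriminator ends in two heads over shared features:
    # a sigmoid "source" head (real vs. fake) and a softmax class head.
    src = sigmoid(features @ w_src)  # shape (batch, 1)
    cls = softmax(features @ w_cls)  # shape (batch, n_classes)
    return src, cls
```

Training sums the source loss and the class loss; the per-class softmax output is what allows accuracy to be assessed class by class.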

3. CGAN [16]
Application: Image-to-image translation, image tagging, and face generation (Gauthier, J. (2014)).
Network architecture: Generator and discriminator are conditioned on some arbitrary external information, with the ReLU activation function and a sigmoid output layer.
Methodology: Minimize the value function for G and maximize it for D.
Key advantage: CGAN can easily accept a multimodal embedding as the conditional input.
Major limitation: CGAN is not strictly unsupervised; some form of labeling is required for it to work.
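The conditioning described above can be sketched in its simplest form: the label (here one-hot encoded, an assumption for illustration) is concatenated with the generator's noise input.

```python
import numpy as np

def one_hot(labels, n_classes):
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def condition_input(z, labels, n_classes):
    # CGAN conditions both G and D by feeding the external information
    # alongside the usual input; the simplest form is concatenation.
    return np.concatenate([z, one_hot(labels, n_classes)], axis=1)
```

Any embedding of the conditioning signal (including a multimodal one) can replace the one-hot vector in the same position.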

4. InfoGAN [1, 2]
Application: Facial image generation.
Network architecture: Generator: up-convolutional architecture with the standard ReLU activation. Discriminator: leaky ReLU with leak rate 0.1 applied to hidden layers as the nonlinearity.
Methodology: Learns disentangled representations by maximizing mutual information.
Key advantage: Capable of learning disentangled, interpretable representations.
Major limitation: Sometimes requires adding noise to the data to stabilize the network.
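The mutual-information objective above can be sketched as follows. This is a simplified NumPy illustration with a single categorical code: the latent input is split into unstructured noise z and a structured code c, and an auxiliary network Q (whose predictions are assumed given here) is trained to recover c, which lower-bounds the mutual information I(c; G(z, c)).

```python
import numpy as np

def infogan_latent(batch, noise_dim, n_cat, rng):
    # InfoGAN splits the latent input into unstructured noise z
    # and a structured code c (here: one categorical code).
    z = rng.standard_normal((batch, noise_dim))
    cat = rng.integers(0, n_cat, size=batch)
    c = np.zeros((batch, n_cat))
    c[np.arange(batch), cat] = 1.0
    return np.concatenate([z, c], axis=1), cat

def mutual_info_loss(q_probs, cat):
    # Cross-entropy between Q's predicted code distribution and the sampled
    # code; minimizing it maximizes a lower bound on I(c; G(z, c)).
    return -np.mean(np.log(q_probs[np.arange(len(cat)), cat] + 1e-12))
```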

5. DCGAN [91]
Application: General image representations.
Network architecture: Generator: ReLU activation with batch normalization, except for the output layer, which uses the tanh function. Discriminator: leaky ReLU with batch normalization for modeling high-resolution inputs.
Methodology: A hierarchy of representations is learned, from object parts upward.
Key advantage: Stable training, good image representations, easy convergence.
Major limitation: Gradients can vanish or explode.
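The generator above is built from transposed ("fractionally strided") convolutions that progressively upsample the latent vector. The standard output-size formula makes the typical doubling per layer concrete; the 4 → 64 progression below is a common DCGAN configuration, given as an assumed example.

```python
def deconv_out_size(h_in, kernel, stride, padding):
    # Output height/width of a transposed convolution, the upsampling
    # layer DCGAN's generator is built from.
    return (h_in - 1) * stride - 2 * padding + kernel

# With kernel 4, stride 2, padding 1, each layer doubles the spatial
# size: 4 -> 8 -> 16 -> 32 -> 64.
```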

6. LAPGAN [10]
Application: Image generation.
Network architecture: Laplacian pyramid of convolutional networks.
Methodology: Images are generated in a coarse-to-fine fashion.
Key advantage: Each pyramid level is trained independently.
Major limitation: Nonconvergence and mode collapse.
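The coarse-to-fine decomposition above rests on the Laplacian pyramid, which can be sketched in NumPy. The blur/subsample operators here are simplified stand-ins (average pooling and nearest-neighbour upsampling) chosen so that reconstruction is exact; LAPGAN trains one generator per residual level and reconstructs the image the same way, coarsest level first.

```python
import numpy as np

def downsample(img):
    # 2x2 average pooling (a stand-in for blur-and-subsample).
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Nearest-neighbour upsampling by a factor of 2.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels):
    # Each level stores the residual between the image and the
    # upsampled coarser image.
    pyramid = []
    for _ in range(levels):
        coarse = downsample(img)
        pyramid.append(img - upsample(coarse))
        img = coarse
    pyramid.append(img)  # coarsest image
    return pyramid

def reconstruct(pyramid):
    # Invert the pyramid: upsample and add residuals, coarse to fine.
    img = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        img = upsample(img) + residual
    return img
```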

7. SAGAN [11]
Application: Image generation.
Network architecture: Self-attention modules are added to the convolutional GAN, and spectral normalization is applied to the generator.
Methodology: Realistic images are generated.
Key advantage: The inception score is boosted and the Fréchet inception distance is reduced.
Major limitation: The attention mechanism is not extended further.
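Spectral normalization, which SAGAN applies to the generator as well as the discriminator, divides each weight matrix by its largest singular value to constrain the layer's Lipschitz constant. A minimal NumPy sketch using power iteration (in practice a single iteration per training step is reused, not a fresh run to convergence):

```python
import numpy as np

def spectral_normalize(w, n_iter=100):
    # Estimate the largest singular value of w via power iteration,
    # then divide it out so the normalized matrix has spectral norm ~1.
    u = np.random.default_rng(0).standard_normal(w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v  # approximate largest singular value
    return w / sigma
```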

8. GRAN [12]
Application: Generating realistic images.
Network architecture: Recurrent CNN with constraints.
Methodology: Images are generated by incremental updates to a canvas using a recurrent network.
Key advantage: Sequential generation of images.
Major limitation: Samples collapse when trained for a long duration.
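The incremental-canvas idea above can be sketched in a few lines. The `delta_fn` below is a hypothetical placeholder for GRAN's recurrent update network; the point is only that the image accumulates over time steps rather than being emitted in one pass.

```python
import numpy as np

def gran_generate(steps, canvas_shape, delta_fn):
    # GRAN builds the image over several recurrent steps: each step
    # adds an update ("delta") to a running canvas.
    canvas = np.zeros(canvas_shape)
    for t in range(steps):
        canvas += delta_fn(canvas, t)
    return canvas
```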

9. VAE-GAN [92]
Application: Facial image generation.
Network architecture: A GAN discriminator is used to learn the loss function in place of the VAE's pixel-wise reconstruction error.
Methodology: Combines a GAN and a VAE to produce an encoder of data into the latent space.
Key advantage: GAN and VAE are combined in a single model.
Major limitation: The blurry output the VAE generates.

10. BiGAN [18]
Application: Image generation.
Network architecture: Generator: in addition to G, an encoder maps data to latent representations. Discriminator: discriminates jointly in the data and latent spaces.
Methodology: Learns features for related semantic tasks and uses them in unsupervised settings.
Key advantage: Minimization of the reconstruction loss.
Major limitation: Generator and discriminator are highly dependent on each other during training.

11. AAE [19]
Application: Dimensionality reduction, data visualization, disentangling image style, and unsupervised clustering.
Network architecture: Generative autoencoder.
Methodology: Variational inference is performed by matching the autoencoder's aggregated posterior over the hidden code vector to an arbitrary prior distribution.
Key advantage: Balanced approximation; the method extends to semisupervised learning and outperforms variational autoencoders.
Major limitation: Generated samples are blurry and oversmoothed (Zhang, J. et al., 2018).

12. Pix2Pix [93]
Application: Image-to-image translation.
Network architecture: Generator: U-Net architecture. Discriminator: PatchGAN classifier. ReLU activation function; batch normalization.
Methodology: The network learns a loss tailored to the data and the task, making it applicable to a wide variety of tasks.
Key advantage: Parameter reduction; realistic images are generated.
Major limitation: Training images must be paired one-to-one.
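The parameter reduction noted above comes from PatchGAN classifying overlapping patches rather than the whole image, so the discriminator only needs enough layers to cover one patch. The patch size equals the network's receptive field, which the standard recurrence below computes; the layer list in the test reflects the common "70 × 70 PatchGAN" configuration (kernel 4; strides 2, 2, 2, 1, 1), given here as an assumed example.

```python
def receptive_field(layers):
    # Receptive field of a stack of conv layers, each given as
    # (kernel, stride); this is the patch size a PatchGAN
    # discriminator judges as real or fake.
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump  # growth scaled by cumulative stride
        jump *= stride
    return r
```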