Abstract

Computational visual perception, also known as computer vision, is a field of artificial intelligence that enables computers to process digital images and videos in a manner similar to biological vision. It involves developing methods that replicate, and ultimately aim to surpass, the capabilities of biological vision in extracting useful information from visual data. The massive volume of data generated today is one of the driving factors behind the tremendous growth of computer vision. This survey provides an overview of existing applications of deep learning in computational visual perception. It explores various deep learning techniques adapted to solve computer vision problems using deep convolutional neural networks and deep generative adversarial networks. The pitfalls of deep learning and their solutions, namely, dropout and data augmentation, are briefly discussed; the results show a significant improvement in accuracy when dropout and data augmentation are used. Applications of deep convolutional neural networks, namely, image classification, localization and detection, document analysis, and speech recognition, are discussed in detail, and an in-depth analysis of deep generative adversarial network applications, namely, image-to-image translation, image denoising, face aging, and facial attribute editing, is presented. The deep generative adversarial network was proposed as an unsupervised learning framework, but adding a certain number of labels in practical applications can improve its generative ability. Acquiring a large number of data labels is challenging, whereas a small number can usually be obtained; therefore, combining semisupervised learning with generative adversarial networks is one of the future directions. This article surveys recent developments in this direction, provides a critical review of the related significant aspects, investigates the current opportunities and future challenges in all the emerging domains, and discusses the current opportunities in many emerging fields such as handwriting recognition, semantic mapping, webcam-based eye trackers, lumen center detection, query-by-string word spotting, intermittently closed and open lakes and lagoons, and landslides.

1. Introduction

Computer vision (CV), the core component of machine intelligence, is an interdisciplinary field enabling computers to achieve a visual understanding of digital images. For a machine to see as animals or people do, it relies on computer vision. Table 1 contains the list of abbreviations used in the manuscript and their expansions. CV is a booming field applied to many of our everyday activities, including face detection, object detection, biometrics, medical diagnosis from faces, self-checkout kiosks, autonomous vehicles, image recognition, image enhancement, image deblurring, motion tracking, video surveillance, control of robots, and analysis of mammography and X-rays [22–26]. The fundamental goal of all these applications is to replicate a human observer in interpreting a scene in a broad sense and to perform decision-making for the task at hand [27]. CV and image processing are often confused and used interchangeably. Image processing receives images as the input, processes them, and outputs images, while CV receives images as the input, processes them, and interprets them; the output generated by CV is an abstract representation of a constituent of the image or of the entire image. D-GAN was proposed as an unsupervised learning framework, but adding a certain number of labels in practical applications can improve its generative ability. However, acquiring a large number of data labels is challenging, whereas a small number can be obtained; therefore, combining semisupervised learning and GAN is one of the future directions.

Figure 1(a) represents the general architecture of the deep convolutional neural network (D-CNN). The D-CNN is similar to a standard neural network in that it is built with neurons having learnable weights and biases [30]. However, in recent times, the D-CNN has been widely used instead of a standard fully connected neural network because it is faster and computationally less expensive. An image, which is a matrix of pixels, must be flattened before being fed into a fully connected network. To flatten an image of size 40 × 30 × 1, 1200 neurons are required at the input layer; here, the complexity is still manageable with a fully connected network. Colored images, however, have three channels corresponding to RGB, which makes the number of neurons required at the input layer very high. When an image of size 1024 × 1024 × 3 has to be fed into the network, 3,145,728 neurons are required at the input layer, which is computationally expensive. The number of neurons needed at the input layer grows rapidly with the size of the image.
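As a quick illustration of this growth (a minimal sketch using the flattened-input counts quoted above; the 3 × 3, 32-filter convolutional layer is an assumed example, not a layer from the figure), the following Python snippet computes the input-layer neuron counts for a fully connected network and, for comparison, the parameter count of a single convolutional layer, which does not depend on the image resolution:

```python
# Input-layer neuron count for a fully connected network fed with a flattened image.
def flattened_input_neurons(height, width, channels):
    return height * width * channels

print(flattened_input_neurons(40, 30, 1))       # 1200
print(flattened_input_neurons(1024, 1024, 3))   # 3145728

# Parameter count of one convolutional layer: (kernel_h * kernel_w * in_channels + 1 bias) * filters.
# It depends only on the kernel size and channel counts, not on the image resolution.
def conv_layer_params(kernel_size, in_channels, filters):
    return (kernel_size * kernel_size * in_channels + 1) * filters

print(conv_layer_params(3, 3, 32))              # 896 parameters, regardless of image size
```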

Figure 1(b) represents the architecture of deep generative adversarial networks (D-GAN), where the generator G captures the data distribution and generates fake data whose distribution is p_g. The generative model improvises to make p_g approach p_data, the real data distribution. The discriminator D is fed either with an actual data sample x or with a generated data sample G(z). The discriminator outputs a probability D(x) in (0, 1), indicating the source of the data. The generative model is trained to capture the data distribution of the original data. The discriminative model predicts the probability that a sample has originated from the original training data rather than from the generator G. The generator's goal is to generate data close to the distribution of real data and deceive the discriminator, whereas the purpose of the discriminator is to identify the fake data generated by the generator. Model distributions are generated by feeding random noise through a multilayer perceptron, and the discriminator is also a multilayer perceptron with a classifier at the end [31].

The generator and discriminator compete until the counterfeits generated by the generator are indistinguishable from the real data distribution. Through adversarial training, the quality of data generated by the generator gradually improves [32]: the quality of the generated samples and the discriminator's identifying capability improve interactively with each iteration. The generator can be any neural network, such as an artificial neural network, convolutional neural network, recurrent neural network, or long short-term memory, whose task is to learn the data distribution, while the discriminator is essentially a binary classifier capable of classifying the input as real or fake. The entire network is trained using backpropagation to fine-tune the training: the error is estimated from the sample label and the discriminator output, and the parameters of G are updated using an error backpropagation algorithm. D-GAN is inspired by the two-player minimax game, in which one player benefits at the loss of the other, and is represented by the following equation [7]:

$$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))],$$

where p_data is the real data distribution and p_g is the generated data distribution; z is noise sampled from the prior p_z.

The output of the discriminator is a probability indicating the origin of the data sample. A probability of 1, or a number very close to 1, indicates that the sample is real data, while a probability of 0, or a number close to 0, indicates fake data. When the probability is close to 0.5, the discriminator finds it hard to identify counterfeit samples. G is trained repeatedly to make D's output approach 1 for the samples generated by G. The model is trained until a Nash equilibrium is achieved, at which point neither player can improve its outcome by unilaterally changing its strategy. The Nash equilibrium is reached when the generator has gained the capability to generate data close to the real data and the discriminator can no longer distinguish real data from generated data. The generator is then considered to have learned the real-data distribution.

1.1. Contributions of This Survey

This paper presents the general architecture of the D-CNN, the various methodologies adopted, and its application-based performance. An overview of D-GANs is also given, with their existing variants and their applications in different domains. Furthermore, this paper identifies GANs' advantages, disadvantages, and recent advancements in the field of computer vision and presents a comprehensive survey of the essential applications of GANs, covering crucial areas of research along with their architectures. Figure 1(c) shows the structure of the survey. This survey presents a detailed description of the D-CNN and D-GAN with their architectures, discusses their recent advancements and applications, describes the activation functions used in CNNs, and examines various pitfalls of deep learning with their possible solutions. Applications of the D-CNN and D-GAN are analyzed in Sections 4 and 5. Table 2 shows a comparison between the current survey and existing surveys on the D-CNN and D-GAN.

2. Biological Vision vs. Computer Vision

Biological vision has tremendous capabilities in retrieving vital information from visual data and analyzing it for functional needs. The perceptual mechanisms used by people and animals to interpret the visual world are diverse. Research on biological vision is an excellent source of inspiration for CV and focuses on computationally understanding the brain mechanisms behind visual interpretation; understanding the perceptual mechanism of biological vision is the initial step towards interpreting visual data. The computational understanding of biological vision in current research studies is based on the framework defined by David Marr [40]. Biological vision can perform tasks with high reliability even if the visual data are noisy, cluttered, and ambiguous, and it can efficiently solve computationally complex problems that are still challenging for CV. The fundamental goal of CV, the science of image analysis, is to automate computational methods that extract visual information and understand the image's content for decision-making [41, 42]. From CV's perspective, an image is a grid of square pixels arranged as an array or matrix. At a higher level, the structures of biological and computer vision are not the same. Nevertheless, both systems' objective is the same: to extract and represent visual data as useful information for taking actions.

3. Deep Learning

Deep learning, or hierarchical learning, has emerged as a subfield of machine learning, which, in turn, is a subfield of artificial intelligence [43]. Artificial intelligence is an effort to make machines think and automatically perform intellectual tasks otherwise performed by humans. Classical AI is a programming paradigm in which humans craft rules for the data and the machine outputs the answers. The question arose as to whether a machine could automatically learn data processing rules by looking at the data, and machine learning, a new programming paradigm, came into existence as the answer. With machine learning, data and the corresponding solutions are fed to the machine, which crafts the rules; a machine learning model is trained rather than being explicitly programmed. Machine learning and deep learning came into existence when a need arose to solve fuzzy and more complex problems such as language translation, speech recognition, and image classification [44, 45]. At their core, deep learning and machine learning are about learning a representation of the data at hand to obtain the expected output. Deep learning models are capable of learning complex relationships between inputs and outputs, using multiple processing layers to discover the data's intricate structure at multiple levels of abstraction [46]. The "deep" in deep learning refers to the successive layers of representation. The multiple nonlinear hidden layers in a deep learning model are parameterized by weights; for a network to correctly map its inputs to its targets, proper values must be set for the weights of all layers in the network.

Any deep learning problem has an actual target value and a predicted value, and the loss function measures the difference between them. A distance score, which represents the network's performance, is computed from the real and predicted values. Initially, the weights are randomly assigned, so the predicted output is far from the actual output and the distance score is very high. The weights are then adjusted, the training loop is repeated, and the distance score decreases. Tens or hundreds of iterations are performed over thousands of examples, and a minimal loss value indicates that the outputs are close to the targets. Deep learning has dramatically improved the state of the art in object detection, speech recognition, and many other domains [47]. Figure 2 shows the deep learning models. The D-CNN has excelled in processing images, speech, audio, and video, while the RNN has brought a breakthrough in processing sequential data.
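As a minimal sketch of this training loop (an illustrative example using a made-up one-layer linear model and mean squared error, not a model from the survey), the following Python code randomly initializes the weights, measures the loss, and repeatedly adjusts the weights so that the distance between predictions and targets decreases:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # 1000 training examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)    # targets with a little noise

w = rng.normal(size=3)                          # weights are randomly assigned at first
lr = 0.1                                        # learning rate

for epoch in range(100):                        # tens or hundreds of iterations
    pred = X @ w                                # forward pass: predicted values
    loss = np.mean((pred - y) ** 2)             # distance score between prediction and target
    grad = 2 * X.T @ (pred - y) / len(y)        # gradient of the loss w.r.t. the weights
    w -= lr * grad                              # adjust the weights to reduce the loss

print(loss, w)                                  # loss approaches a minimal value; w approaches true_w
```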

3.1. Deep Convolutional Neural Network

The deep convolutional neural networks, popularly known as D-CNNs or D-ConvNets, are a robust class of ANNs and the most established deep learning algorithm; they have become dominant in computer vision and many other applications [48]. Convolutional layers, pooling layers, and fully connected layers are the building blocks of the D-CNN [49, 50]. The D-CNN is designed to process data that arrive in the form of multiple arrays or grids through these building blocks. The role of the convolutional and pooling layers is to extract features, while the fully connected layers map the extracted features to the output. A deep CNN has multiple convolutional and pooling layers, followed by one or more fully connected layers. The input passes through these layers and is transformed into the output through forward propagation. The convolution operation and the activation function are the backbones of the D-CNN [51].

Tensor, kernel, and feature map are the three essential terminologies in the convolution operation. The tensor is the input, a multidimensional array; the kernel is a small array of numbers; and the feature map is the output tensor, as shown in Figure 3. The convolution operation is a linear process in which a dot product is performed between the kernel and a local region of the input tensor. Each element of the kernel is multiplied with the corresponding tensor element and summed up to produce the output value placed in the corresponding position of the feature map [52]. The convolution operation is defined by the kernel size and the number of kernels. The kernel size may be 3 × 3, 5 × 5, or 7 × 7 based on the size of the input tensor, and the number of kernels is arbitrary, each kernel representing a different characteristic of the input. The convolution operation is repeated for each kernel. Figures 3(a)–3(c) show an example of a convolution operation: here, a 3 × 3 kernel is applied across the input tensor to perform a dot product and return the corresponding value of the output tensor. A drawback of the convolution operation is that the feature map shrinks in size compared to the input tensor, because the kernel center cannot be placed over the bordering elements of the input tensor. With a 5 × 5 input tensor, the feature map shrinks to 3 × 3. For an n × n input tensor and a k × k kernel, the size of the feature map is determined using the following formula:

$$(n - k + 1) \times (n - k + 1).$$

Applying this formula, a 49-pixel input shrinks into a 25-pixel feature map, which shrinks further when the process is repeated, resulting in the loss of essential features of the input. To address this issue in deep CNN models with many layers, padding is used, where rows and columns of zeroes are added on all sides of the input tensor. This allows the center of the kernel to be placed over the bordering elements of the input tensor and keeps the size of the feature map the same as that of the input tensor. Figure 4 shows zero padding, where rows and columns are added to all sides of the input tensor; as a result, the feature map's size is 5 × 5, the same as the input tensor.

The number of pixels the kernel shifts over the input tensor is called the stride. When the stride is 1, the kernel shifts 1 pixel at a time; when the stride is 2, the kernel shifts 2 pixels at a time, and so on. Frequently used activation functions are the logistic sigmoid, hyperbolic tangent, and ReLU. Table 3 presents the recent advancements of the D-CNN in computer vision.
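The convolution mechanics described above can be sketched in a few lines of numpy (an illustrative single-channel implementation, not code from the surveyed works); with padding p and stride s, the feature-map size generalizes to (n + 2p − k)/s + 1 per dimension:

```python
import numpy as np

def conv2d(tensor, kernel, padding=0, stride=1):
    """Single-channel 2D convolution with zero padding and stride."""
    n, k = tensor.shape[0], kernel.shape[0]              # assume square input and kernel
    if padding > 0:
        tensor = np.pad(tensor, padding)                  # add rows/columns of zeroes on all sides
    out = (n + 2 * padding - k) // stride + 1             # feature-map size formula
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            region = tensor[i * stride:i * stride + k, j * stride:j * stride + k]
            fmap[i, j] = np.sum(region * kernel)          # dot product of kernel and local region
    return fmap

x = np.arange(25, dtype=float).reshape(5, 5)              # 5 x 5 input tensor
k = np.ones((3, 3))                                       # 3 x 3 kernel

print(conv2d(x, k).shape)                                 # (3, 3): feature map shrinks without padding
print(conv2d(x, k, padding=1).shape)                      # (5, 5): zero padding preserves the input size
print(conv2d(x, k, padding=1, stride=2).shape)            # (3, 3): stride 2 shifts the kernel 2 pixels
```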

3.2. Activation Functions

The deep learning mechanism is as follows: the input is fed into the network, a bias is added to the product of the input and the weights, an activation function is applied to the result, and the same process is repeated until the last layer is reached. Activation functions play a significant role in a neural network by defining a neuron's output for a given set of inputs. The activation function takes the weighted sum of inputs and performs a transformation to compress the output between a lower and an upper limit. Activation functions are of two types: linear and nonlinear. Deep learning uses nonlinear activation functions for its classification problems so that the output lies between 0 and 1. Without nonlinearity, each layer of the network would perform only a linear transformation, in which case an equivalent single layer could replace all the hidden layers. For backpropagation to be executed, the activation function must be differentiable; hence, for deep learning, an activation function has to be both nonlinear and differentiable. Some of the standard activation functions in deep learning are sigmoid, tanh, ReLU, and leaky ReLU.

Table 4 shows the most frequently used activation functions. Sigmoid transforms the output to lie between 0 and 1. In recent times, sigmoid has become one of the least used activation functions because of its drawbacks. First, it causes gradients to vanish when the neuron's activation saturates close to 0 or 1, where the gradients are close to zero. The second drawback is that its output is not zero-centered. tanh is another activation function that performs better than sigmoid; its output lies in the range of −1 to 1 [69]. ReLU is the most popular and frequently used activation function in deep learning. ReLU overcomes two problems of the S-shaped activation functions: slow training and the vanishing gradient [70, 71]. Mathematically, ReLU outputs 0 when the input is negative and acts as a linear (identity) function otherwise [72], so the range of the output is between 0 and infinity.

Since ReLU outputs zero for negative input values, the gradient at these points is zero, and the network will not respond to any variations in the input or the error. This problem, called the dying ReLU problem, can make part of the network passive because of dead neurons, and leaky ReLU can overcome it. Leaky ReLU is similar to ReLU, except that it does not set negative inputs to zero; instead, it scales them by a small nonzero value of 0.01 in the negative regime, so its range is between −∞ and ∞. The purpose of leaky ReLU is to minimize the dying neuron problem. In multiclass classification problems, the output layer employs softmax as the classification function [73]. Softmax produces outputs that sum up to 1, so the output of softmax specifies a probability distribution over the n classes of the target; this is why softmax is used explicitly in multiclass classification problems. Due to the vanishing gradient problem, the tanh and sigmoid activation functions are not commonly employed in the D-CNN.
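The activation functions discussed above can be written in a few lines of Python (a minimal numpy sketch for illustration; the 0.01 slope of leaky ReLU matches the value quoted above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # output in (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                          # output in (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)                  # 0 for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small nonzero slope in the negative regime

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract max for numerical stability
    return e / e.sum()                         # outputs sum to 1: a probability distribution

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(z), leaky_relu(z), softmax(z).sum())   # softmax output sums to 1.0
```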

3.3. Pitfalls of a Deep Learning Model

A model’s performance indicates how well the model is trained and how well it generalizes to new or unseen data, and evaluating it is a crucial step in data science. The common barriers to creating high-performance models are overfitting, underfitting, and significantly few training data. Figure 5 illustrates overfitting, which is said to occur when a model performs very well on the training set but its performance deteriorates on the validation set: the loss during the training phase decreases, but the loss during the validation phase increases. This happens because the model learns even the unnecessary information from the training set; hence, the model’s performance is too good on the training set.

Nevertheless, the model fails to perform on the validation set. Overfitting can be addressed by improving the model and obtaining more training data. The model can be improved by randomly omitting feature detectors from the model’s architecture, a technique called dropout, developed by Geoff Hinton [74]. A vast number of different networks can be trained in a reasonable time using random dropouts, so a different network is effectively presented for each training case. In a nutshell, the dropout technique mutes a randomly selected portion of the network for each training case [75]. This is a useful technique as it prevents any single neuron within the network from becoming excessively influential, so the model does not rely too much on any specific feature of the data. Dropout is used when there is overfitting, and it can improve the validation accuracy in later epochs even if there is no overfitting. Dropout is added based on experimentation, and the usual dropout range is between 20 and 50 percent of the neurons. Figures 5(a) and 5(b) show the accuracy and loss curves after two dropout layers with rates of 0.25 and 0.5 have been added; accuracy improves and loss decreases after adding dropout.
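Conceptually, dropout can be sketched in a few lines of numpy (an illustrative "inverted dropout" mask applied to one layer's activations during training; the 0.25 rate matches the range quoted above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.25, training=True):
    """Randomly mute a fraction `rate` of the units; rescale the rest to keep the expected sum."""
    if not training:
        return activations                        # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob         # inverted dropout: rescale surviving units

h = np.ones((4, 8))                               # pretend layer activations for a batch of 4
print(dropout(h, rate=0.25))                      # roughly 1 in 4 units muted per training case
```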

In many cases, the data available may not be sufficient to train a model. With significantly few training data, the model may not learn patterns from the training data, which inhibits the model’s capability to generalize to unseen data [76]. It is challenging, expensive, and time-consuming to collect the new data required to train the model [77]. Under such circumstances, data augmentation is warranted: a powerful and computationally inexpensive technique for mitigating overfitting by artificially inflating the training data size with the data in hand, without collecting new data [71, 78]. One or more deformations are applied to the data, while the semantic meaning of the labels is preserved during the transformation [79]. With more data provided to the model, it generalizes better on the validation data. Some popularly used data augmentation techniques are rotation, flip, skew, crop, contrast and brightness adjustment, and zoom in/out [80, 81]. Figure 6 shows different augmentations applied to a single image: flip, rescale, zoom, and height and width shift augmentations applied to a cat image. Training the model with the additional deformed data makes the model generalize better to unseen data. Figure 7 shows an improvement in accuracy after adding dropout and applying different augmentations to the data.
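A minimal sketch of the augmentations named above, assuming a Keras/TensorFlow setup (the directory path, image size, and parameter values are illustrative, not taken from the survey):

```python
import tensorflow as tf

# Augmentation pipeline: rescale, flip, rotation, zoom, and height/width shifts, as in Figure 6.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,          # rescale pixel values to [0, 1]
    horizontal_flip=True,       # random horizontal flip
    rotation_range=15,          # random rotation in degrees
    zoom_range=0.2,             # random zoom in/out
    width_shift_range=0.1,      # random width shift
    height_shift_range=0.1,     # random height shift
)

# Hypothetical directory of training images; each deformed copy keeps its original label.
train_flow = datagen.flow_from_directory(
    "data/train", target_size=(128, 128), batch_size=32, class_mode="categorical"
)
```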

The next problem usually faced by AI practitioners is underfitting. The model is said to underfit when it cannot learn the patterns even from the training set and exhibits poor performance on the training set. This can be addressed by increasing the training data, increasing the model complexity, and increasing the number of training epochs; deeper models with more neurons per layer can avoid underfitting [82]. Imbalanced datasets are another crucial problem, as they widely exist in real-world situations and have proven to be one of the greatest challenges in classification problems. Data access has become easier with the advancement of technology; however, data imbalance has become ubiquitous in most collected datasets. For example, in medical data, most people are healthy, and unhealthy people are proportionally fewer, which significantly affects classification accuracy. The classes with adequate samples are called the majority classes, and the classes with inadequate samples are called the minority classes. Prediction of a minority class becomes problematic because it has few or insufficient samples.

Several techniques are adopted to handle data imbalance in a dataset, among them weight balancing, over- and undersampling (resampling), and penalizing algorithms [83]. Weight balancing modifies the weights carried by the training samples when computing the loss. Resampling is one of the most frequently adopted techniques, where undersampling removes samples from the majority class or oversampling adds more samples to the minority class; oversampling is done by duplicating random records from the minority class. A penalizing algorithm is another technique, where the cost of classification mistakes on the minority class is increased. Label noise is another problem in deep learning; some of its sources are nonexpert labeling, automatic labeling, and data poisoning by adversaries. Hendrycks et al. [84] recommended a loss correction technique that utilizes trusted data with clean labels; the authors effectively used trusted data to overcome the effects of label noise on classification.
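The two most common remedies described above, class weighting and random oversampling of the minority class, can be sketched as follows (an illustrative numpy example with made-up labels, not from the surveyed works):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)                 # imbalanced labels: 950 majority, 50 minority

# Weight balancing: give each class a weight inversely proportional to its frequency,
# so mistakes on the minority class cost more in the loss.
classes, counts = np.unique(y, return_counts=True)
class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(class_weight)                                # e.g. {0: ~0.53, 1: 10.0}

# Random oversampling: duplicate random records from the minority class
# until both classes have the same number of samples.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=counts[0] - counts[1], replace=True)
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))                     # [950 950]
```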

3.4. Deep Generative Adversarial Network

Generative adversarial networks are a framework proposed by Goodfellow et al. in 2014 [7]. With GANs, significant improvements were achieved in computer vision applications such as image super-resolution [85], image classification [86], image steganography [87], image transformation [88], video generation, image synthesis [89], video super-resolution [90], and image style transformation. Variants of the D-GAN model have also been proposed in recent times.

Figure 8 shows the D-GAN architecture, where the generator generates fake images of human faces and the discriminator’s job is to distinguish the real faces from the fake ones. In general, plausible data are generated by the generator, and these generated data become negative examples for training the discriminator. The discriminator, a binary classification neural network, takes in the real samples and the samples from the generator and learns to distinguish the real samples from the generator’s fake samples. Two loss functions, the generator loss and the discriminator loss, are backpropagated to the generator and discriminator, respectively; the discriminator ignores the generator loss. The generator and the discriminator update their weights based on these losses. For a minibatch of m noise samples z^(i) and m data samples x^(i), with i ranging from 1 to m, the discriminator ascends its stochastic gradient

$$\nabla_{\theta_d} \frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right)+\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)\right],$$

while the generator descends

$$\nabla_{\theta_g} \frac{1}{m}\sum_{i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right).$$
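A compact sketch of this alternating training (an illustrative Keras/TensorFlow implementation of the standard GAN recipe from [7]; the layer sizes, noise dimension, and flattened 784-pixel samples are assumptions, not the face-image architecture of Figure 8):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

noise_dim, data_dim = 100, 784                      # e.g. flattened 28 x 28 images

# Discriminator: binary classifier that outputs the probability a sample is real.
discriminator = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Generator: maps random noise to a fake sample.
generator = models.Sequential([
    layers.Dense(256, activation="relu", input_shape=(noise_dim,)),
    layers.Dense(data_dim, activation="sigmoid"),
])

# Combined model used to train the generator; the discriminator's weights are frozen here.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch, batch_size=64):
    noise = np.random.normal(size=(batch_size, noise_dim))
    fake_batch = generator.predict(noise, verbose=0)

    # Discriminator step: real samples labeled 1, generated samples labeled 0.
    d_loss_real = discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))

    # Generator step: try to make the discriminator output 1 for generated samples.
    g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
    return d_loss_real + d_loss_fake, g_loss
```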

3.5. Evolution of the Deep GAN

With the advent of the deep GAN by Goodfellow, several variants of the deep GAN have been proposed for various CV applications. These deep GAN variants have their own architectures, methodologies, advantages, and disadvantages but share the same two-player minimax game as the base. Figure 9 shows the D-GAN’s evolution with conditions, encoders, loss functions, and the ability to process discrete data. Table 5 shows the D-GAN’s evolution with its applications, architectures, methodologies, advantages, and disadvantages.

The D-GAN is successfully used in many computer vision applications, with image generation at the forefront. The D-GAN generates images, gradually enhancing the resolution and quality of the generated images. Variants of the D-GAN are used for applications such as image transformation, image deraining [88], increasing image resolution, facial attribute transformation, and image fusion. Table 6 shows some of the progressively increasing applications of the D-GAN in computer vision.

4. Applications of the D-CNN in Computer Vision

Most of the D-CNN applications are related to images, while applications of the D-GAN are related to data generation. This section will progress through the essential applications of the D-CNN.

4.1. Image Classification Using the D-CNN

There are several image classification tasks performed using the D-CNN [11, 104–109]. One of the vital image classification tasks is handwritten digit recognition, which recognizes the digits 0 through 9; the data from the MNIST database are used to predict the correct label for the handwritten digits. MNIST is a database of handwritten digits widely used as a testbed for various deep learning applications. It has 70,000 images, of which 60,000 are training images and 10,000 are testing images [110]. Figure 10 shows sample images from the MNIST dataset. The images are greyscale with 28 × 28 pixels, as represented in Figure 10. The 28 × 28 pixels are flattened into a 1D vector of size 784 pixels, and each of these pixels has a value between 0 and 255: a black pixel takes the value 255, a white pixel takes the value 0, and the various shades of grey take values in between. Handwritten digits are recognized using the D-CNN, which is considered the most suitable model for performing this task.

Data are downloaded from the MNIST database; downloading takes some time during the first run, and subsequent runs fetch the cached data. The data obtained from the MNIST database have features and labels. The features range from 0 to 255, corresponding to the pixels of the 28 × 28 images representing digits 0 through 9, and the labels represent the digits 0–9 of the respective images. The data are normalized to scale them between 0 and 1. The actual image from the MNIST dataset and the normalized image are represented in Figures 11(a) and 11(b).

Handwritten digit recognition is implemented using eight hidden layers. The first layer is a convolutional layer used for feature extraction, with ReLU as the activation function, 32 filters, and a kernel size of 3 × 3 pixels. Another convolutional layer follows, with ReLU as the activation function, 64 filters, and a kernel size of 3 × 3 pixels. The next hidden layer is a pooling layer using max pooling with a 2 × 2 pool size. Next to the pooling layer is dropout, a technique adopted to prevent overfitting in the neural network [111]; its key idea is to randomly drop a few units and their connections from the network during training to reduce overfitting significantly. A dropout rate of 0.25 randomly drops 1 in 4 units from the network. Between the convolutional layers and the fully connected output layer is the flatten layer, which transforms the 2D matrix into a 1D vector fed into the fully connected layer. The flattened 1D vector is then passed on to a fully connected layer with ReLU as the activation function, followed by another dropout with a rate of 0.50. Finally, the output layer with the softmax activation function is used, since its role is to specify the probability distribution over the ten classes. Since the task at hand is multiclass classification, the output layer has ten nodes or perceptrons, one for each of the ten categories, to predict each class’s probability. The perceptron with the highest probability is picked, and its associated label is returned as the output. The model is fit over 12 epochs. The test accuracy achieved is 98.56%, and the test loss is 0.0513, as represented in Figures 12(a) and 12(b); accuracy increases with the number of epochs, and loss decreases as accuracy increases.
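A minimal Keras sketch of the architecture described above (the layer order, filter counts, kernel sizes, dropout rates, and 12 epochs follow the text; the dense-layer width of 128 is an assumed value, since the text does not state it):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load MNIST and normalize the features to [0, 1], as described above.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = (x_train / 255.0)[..., None]            # shape (60000, 28, 28, 1)
x_test = (x_test / 255.0)[..., None]

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.25),                         # drops roughly 1 in 4 units
    layers.Flatten(),                             # 2D feature maps -> 1D vector
    layers.Dense(128, activation="relu"),         # width of 128 is an assumed value
    layers.Dropout(0.50),
    layers.Dense(10, activation="softmax"),       # probability distribution over the 10 digits
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=12, validation_data=(x_test, y_test))
```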

Image classification is a classical problem of computer vision and deep learning. It is challenging because of variations in images due to lighting effects and misalignments [112]. Image classification, in a computing sense, is the process of grouping and categorizing images and labeling them based on their features and attributes [80]. It trains computers to use a well-defined dataset to interpret and classify images so as to narrow the gap between human vision and computer vision [113]. Some existing use cases of image classification are gender classification; social media applications such as Facebook and Snapchat, which use image classification to enhance the user experience; and self-driving cars, where various objects on the path, namely, vehicles, people, and other moving objects, are recognized [114].

Amerini et al. [115] proposed a novel framework called FusionNet that combines two D-CNN architectures to identify the source social network of an image. A 1D-CNN learns discriminative features, while a 2D-CNN architecture infers unique attributes from the image; the learned features are fused using FusionNet, and then the classification is performed. Distinctive traces that social networks embed in the images are exploited to identify the source. The full-frame images are broken into fixed-dimension patches, and the patches are classified independently: each image patch is processed with the D-CNN, and its prediction is obtained. To obtain the prediction at the image level, a voting strategy is applied across the patches, and the label with the majority vote is assigned as the final prediction. An average accuracy of 94.77% is achieved at the patch level.
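The patch-to-image voting step can be illustrated as follows (a generic majority-vote sketch with made-up patch predictions, not the authors' exact implementation):

```python
import numpy as np

# Hypothetical per-patch predictions for one image: class indices of candidate source networks.
patch_predictions = np.array([2, 2, 0, 2, 1, 2, 2, 0])

# Image-level prediction: the label that receives the majority of the patch votes.
votes = np.bincount(patch_predictions)
image_label = int(np.argmax(votes))
print(votes, image_label)        # [2 1 5] -> class 2 wins the vote
```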

4.2. Image Localization and Detection Using the D-CNN

At a glance over an object, human vision can detect the object, its size, location, and various other features. Object detection using deep learning allows computers to play a crucial role in many real-world applications such as robots, smart vehicles, and self-driving cars [116]. Object detection is one of the most challenging problems and an essential goal of computer vision. It involves identifying different objects in an image using bounding boxes, and the identified objects can be further analyzed at a granular level to dig deeper into the image. Earlier, object identification was performed by splitting the image into multiple pieces and then passing them into a classifier for object detection. Splitting the image into multiple pieces is performed using a sliding window algorithm: the detection window is slid through the actual image at multiple positions, and each grid is a smaller piece of the image. Robust visual descriptors needed for object detection are extracted from the image using image processing, and the convolutional neural network uses these visual descriptors to make object or nonobject decisions [117]. Since the process has to be repeated many times, it is computationally expensive.
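A minimal sketch of the sliding window step (an illustrative generator of window crops and positions; the window size, stride, and the downstream classifier are placeholders, not from the surveyed works):

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, crop) for every window position over the image."""
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            yield x, y, image[y:y + window[1], x:x + window[0]]

image = np.zeros((256, 256))                      # placeholder greyscale image
crops = list(sliding_windows(image))
print(len(crops))                                 # 49 windows (7 x 7 positions), each passed
                                                  # to the object/nonobject classifier
```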

To overcome the shortcomings of the sliding window algorithm, object detection was performed using image segmentation. Segmentation approaches are categorized into thresholding, region-based, and boundary-based methods. When a digital image is passed, neurons are synchronized based on pixels with similar intensities to form a connected region [118]. Segmentation can be contextual, if spatial relationships in an image are considered, or noncontextual, if they are not. The goal of image segmentation is to alter the representation of an image into a form that is meaningful and easier to analyze, and the accuracy of object detection depends on the quality of the image segmentation. Similar to image classification problems, deeper networks exhibit better performance in object detection. Object localization is the next level beyond object classification, where the object’s position is also determined with a bounding box along with its label. The difference between localization and detection is that classification with localization handles only one object, whereas detection finds and labels multiple objects in an image.

Tu et al. [119] proposed a method to detect passion fruits based on multiple-scale faster region-based CNN. The detection phase involves multiple-scale feature extractors that extract low- and high-level features. Data augmentation is done to enlarge the training data size. Pretrained residual neural network-101 architecture is used for object detection.

4.3. Document Analysis Using the D-CNN

Documents are a source of information for several cognitive processes, namely, graphic understanding, document retrieval, and OCR. Document analysis plays a crucial role in cognitive computing for extracting information from document images and is performed by identifying and categorizing images based on regions of interest. There are several existing methods for document analysis, such as pixel-based classification methods, region-based classification methods, and connected component classification methods. The region-based classification method segments document images into zones and classifies them into semantic classes. Pixel-based classification methods perform document analysis by labeling each pixel with the classifier to generate labeled images. The connected component method creates object hypotheses using local information, which are further inspected, refined, and classified [120].

The D-CNN is widely adopted for document analysis to reduce computational complexity, cost, and data requirements without compromising accuracy. With the D-CNN, it is possible to classify images directly from segmented objects without extracting handcrafted features. Rhanoui et al. [121] performed document-level sentiment analysis using a combination of a D-CNN and a bidirectional long short-term memory network and achieved an accuracy of 90.66%. The features are extracted by the D-CNN, and the extracted features are passed as the input to the long short-term memory network. Vectors built with word embeddings are passed as the input to the CNN. Four filters are applied, and a max pooling layer is applied after each filter; the results of max pooling are concatenated and passed as the input to the bidirectional long short-term memory network, whose output is passed to a fully connected layer. The fully connected layer connects each piece of information from the input with the output information. Finally, the softmax function is applied as the activation function to produce the desired output by assigning classes to articles.
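An illustrative Keras sketch of such a CNN + bidirectional LSTM pipeline follows (the vocabulary size, embedding dimension, sequence length, filter sizes, and class count are assumptions; this is a generic reconstruction of the described design, not the authors' exact model):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, seq_len, num_classes = 20000, 100, 400, 3   # assumed values

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)          # word-embedding vectors fed to the CNN

# Four convolutional filters of (assumed) different sizes, each followed by max pooling.
branches = []
for k in (2, 3, 4, 5):
    b = layers.Conv1D(64, k, activation="relu", padding="same")(x)
    b = layers.MaxPooling1D(pool_size=2)(b)
    branches.append(b)
x = layers.Concatenate()(branches)                            # concatenate the pooled feature maps

x = layers.Bidirectional(layers.LSTM(64))(x)                  # bidirectional LSTM over CNN features
x = layers.Dense(64, activation="relu")(x)                    # fully connected layer
outputs = layers.Dense(num_classes, activation="softmax")(x)  # assign a class to each article

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```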

4.4. Speech Recognition Using the D-CNN

Human-machine interaction with intelligent devices, namely, domestic robots, smartphones, and autonomous cars, is becoming increasingly common in daily life; hence, noise-robust automatic speech recognition has become crucial for the human-machine interface. The basic idea behind audiovisual speech recognition is to utilize the visual information of the speaker’s lip movement to complement the corrupted audio speech inputs. Automatic speech recognition models the relationship between phones and acoustic speech signals by extracting features and classifying speech signals. This is usually performed in two steps: in the first step, the raw speech signal is transformed into features using dimensionality reduction and information selection; the second step estimates phonemes using generative or discriminative models. Phoneme class conditional probabilities can be estimated using the D-CNN with the raw speech signal as the input, and features are learned from the raw speech signal in both continuous speech recognition and phoneme recognition tasks.

Noda et al. [122] proposed a CNN-based approach for audiovisual speech recognition. The authors first used a denoising autoencoder to acquire noise-robust audio features. Then, they used a CNN to extract visual features from mouth area images, with raw images and their corresponding phoneme labels as the training data. Lastly, they applied a multistream hidden Markov model to integrate the audio and visual hidden Markov models trained with the corresponding features. The model achieved a 65% word recognition rate with denoised mel-frequency cepstral coefficients under a signal-to-noise ratio of 10 dB for the audio signal input.

5. Applications of the D-GAN in Computer Vision

5.1. Image-to-Image Translation Using the Deep GAN

Remarkable progress has been achieved in image-to-image translation with the advent of the deep GAN. Image-to-image translation aims to learn a mapping that translates an image between two different domains, from a source to a target domain, without losing the identity of the original image and while reducing the reconstruction loss. Some of the essential image-to-image translations are converting real-world images into cartoon images, coloring greyscale images, and changing a nighttime picture into a daylight picture. The D-GAN’s role is to confuse the discriminator by generating images that are close to the real images. The D-GAN is incredibly successful in super-resolution, representation learning [123], image generation [124, 125], and image-to-image translation.

Kim et al. proposed a novel method for image-to-image translation by incorporating a learnable normalization function and a new attention module. Existing attention-based models lagged behind in handling geometric changes, whereas this model is incredibly successful in translating images with large shape changes. An auxiliary classifier is used to obtain an attention map that distinguishes between the source and target domains; this is done to focus on the regions of interest, ignoring minor regions. Attention maps are inserted in both the generator and the discriminator to focus on the regions of interest: the attention map embedded in the generator focuses on the areas that specifically distinguish the two domains, while the attention map embedded in the discriminator focuses on distinguishing the real and fake images of the target domain. The choice of the normalization mechanism dramatically improves the quality of the transformed images: Adaptive Layer-Instance Normalization adaptively selects a ratio between layer normalization and instance normalization, with the parameters learned during the training process. The class activation map gives the discriminative image regions used to determine the class. The model’s performance is superior to existing state-of-the-art methods on both style transfer and object transfiguration.

5.2. Image Denoising

Image denoising removes noise from images while retaining image detail, and it has been significantly improved with the advancement of the D-CNN [126]. However, D-CNN models focus mainly on reducing the mean squared error, resulting in images lacking high-frequency details. To overcome this issue, the D-GAN is applied to remove noise from images [127, 128]. Zhong et al. proposed a method to remove noise from images using the D-GAN. The generator in this D-GAN has a convolutional block and eight dense blocks; each block comprises a convolutional layer, batch normalization, and ReLU activation, and each layer, except the last layer in the network, is connected to every previous layer using skip connections, which effectively reduces the vanishing gradient problem. The convolutional layer extracts low-level features, while the dense blocks extract high-level features; the generator network learns the residual difference between the ground truth and the noisy image, and a final 3 × 3 convolutional layer generates the output images. The discriminator network differentiates the fake and ground truth images, making the final denoised image visually appealing. The model can handle different types of noise, but it cannot handle unknown real noise.

5.3. Face Aging and Facial Attribute Editing

Deep GAN-based methods have been proposed to alter facial attributes to anticipate a person’s future look, and the conditional GAN has been widely adopted to perform face aging [129]. D-GANs are also applied to facial attribute editing, manipulating the attributes of facial images to generate a face with the desired attribute while retaining the other details of the facial image. The latent representation of the facial image is decoded to edit the facial attributes. GAN-based methods have been proposed for facial attribute editing that change only the desired attributes and preserve the other identity details of the facial image [130]. The work uses reconstruction learning to preserve the attribute details and “only change what you want.” The authors applied an attribute classification constraint to the generated image, rather than imposing constraints on the latent representation, to guarantee the correct change of the desired attributes. The facial attributes manipulated include with and without a beard, black hair and brown hair, mouth open and mouth closed, brown hair and blond hair, and young and old. Shen et al. [131] interpreted the latent codes of trained models such as StyleGAN and progressive GAN and encoded various semantics in the interpreted latent space; given a synthesized face, different face attributes such as pose, age, and expression can be edited without having to retrain the D-GAN model. Table 7 shows a comparison of handwritten digits generated by D-GAN variants.

6. Open Problems and Future Opportunities for Computational Visual Perception-Driven Image Analysis

This paper discussed the development of computational visual perception with the D-GAN and D-CNN. The advantages and disadvantages of various D-GAN architectures are discussed. Future research with the D-GAN can address model collapse, nonconvergence, and training difficulties. Various other shortcomings of the CNN and their solutions are also reviewed. Table 8 lists the challenges and open problems for computational visual perception-driven image analysis.

(i) Handwriting recognition mainly relies on the language model furnished to the system and the quality of character modeling. Research can be performed to obtain a handwriting recognition system that is faster and more accurate.

(ii) Semantic mapping is promising in autonomous vehicles, but state-of-the-art methods still need improvement to produce reliable tools. To do this, better 3D geometry must be included in mapping to achieve more accurate results in the semantic segmentation process, and map updating has to be done to ensure that maps remain coherent with reality.

(iii) The biggest problem with calibration in webcam-based eye trackers is variation in the head pose, which has to be handled without modifying the base components of the system. Future work can be directed towards 3D model-based head tracking, geometrically calculated gaze estimation, and accurate iris segmentation methods.

(iv) Lumen center detection is based on the geometry and appearance of the lumen. Future work can be directed towards lumen segmentation using the previously computed lumen center point as the seed and towards filling tracheal ring discontinuities to improve segmentation accuracy.

(v) In query-by-string word spotting, the latent semantic model’s performance improves when more samples are used to build the model. However, acquiring the transcription of handwritten documents can be tricky, so synthetic information can be used to train the whole framework. Another problem that requires a solution is the vast number of possible parameter combinations of the bag-of-visual-words model; an adaptive framework could automatically generate the spatial representation and codebook based on the indexation errors of the model training phase.

(vi) Coastal lagoons alternate between being open and closed to the ocean, resulting in intermittently closed and open lakes and lagoons (ICOLLs). These are characterized by a berm, originating from sediments and sand deposited by tides, winds, and waves from the ocean, which helps prevent ocean water from flowing further into the lagoon, although rain can cause the lagoon to overflow. Future research can explore the adoption of existing computer vision technology and techniques to monitor ICOLLs, including obtaining water level measurements and berm height and improving decisions on when to open or close a lagoon entrance.

(vii) Presently, scientists are working towards the development of a vaccine to control coronavirus disease, and clinicians and scientists are making every attempt to prevent, treat, and control its spread. However, there is an urgent need for research into the use of SARS-CoV-2 in suitable animal models to examine replication, transmission, and pathogenesis. The task remains to identify the source of the coronavirus, the immunological basis of the virus, the immune responses, and whether the mutation-prone positive-sense ssRNA virus will become endemic or evolve into more lethal forms.

(viii) The medical imaging community utilizes transfer learning where large-scale annotated data are required. Here, the performance of a transfer learning model can be improved by carefully selecting the source and target domains, and residual learning can lead to significant advancements if effectively utilized in medical image analysis tasks.

(ix) With the potential use of image processing techniques in computer vision, disaster management can be improved. Satellite images of areas prone to landslides can be taken at specific intervals and registered to each other using image registration techniques, and the recorded movement can help predict future landslides. This technique will be beneficial when satellite images of landslide-prone areas are available; deviations in the positions of hills can be estimated to predict future landslides.

(x) Research related to underwater imaging is evolving to reveal undiscovered underwater species. It is not possible to identify all underwater species by continuously reviewing recorded underwater videos; therefore, an automated system to classify or detect underwater species is required.

7. Conclusions

This paper presented a comprehensive survey of the deep CNN and deep GAN, their basic principles, GAN variants, and their computational visual perception applications. A comparison between biological and computer vision is made to better understand the background of computer vision and the architecture of neural networks. This survey presents an extensive comparison with existing surveys and extensively reviews the applications of the D-GAN and D-CNN. The building blocks of the CNN, recent advancements of the CNN, activation functions, and the evolution and recent advancements of the D-GAN are discussed in detail. The pitfalls of deep learning and their solutions are discussed briefly. Both state-of-the-art and classical applications of the deep GAN and deep CNN are evaluated, and developments in computational visual perception, or computer vision, with the D-GAN and D-CNN in recent years are reviewed. The D-GAN has proven to solve the problem of insufficient data and to improve the quality of image generation. Experimental results are discussed to explore the abilities of D-GAN variants, and the advantages, disadvantages, and network architectures of different D-GAN models are presented. Besides, D-CNN and D-GAN applications that have achieved remarkable results in various computer vision tasks are discussed. Furthermore, the D-GAN has a wide variety of applications when combined with other deep learning algorithms.

This article extensively surveyed the current opportunities and future challenges in the emerging domains, discussing the current opportunities in fields such as handwriting recognition, semantic mapping, webcam-based eye trackers, lumen center detection, query-by-string word spotting, intermittently closed and open lakes and lagoons, and landslides. Future research with the D-GAN has to be directed towards model collapse, nonconvergence, and training difficulties; although there have been substantial improvements, such as weight regularization, weight pruning, and Nash equilibrium-based training, further research in this area is still needed. The D-GAN has additional research scope in the security domain, as adversarial attacks on neural networks have become very common: a slight perturbation of samples may lead a neural network to a wrong classification, so, to withstand adversarial attacks, it is necessary to make D-GANs more robust to them. Although the D-GAN was put forward as unsupervised learning, adding labels to the data can significantly improve the quality of the data the D-GAN generates, and modifying the D-GAN in this way is one of the future research directions.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

N. A. R, P. D. R. V, and C-Y. C. performed the conceptualization and supervised the data. C-Y. C. carried out the funding acquisition. N. A. R, P. D. R. V, K. S, and C-Y. C. investigated the data and performed the methodology. C-Y. C. and K. S. carried out the project administration and validated the data. N. A. R, P. D. R. V, K. S., and U. T wrote, reviewed, and edited the manuscript. All authors read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported by the Ministry of Science and Technology, Taiwan (Grant no. MOST 103-2221-E-224-016-MY3). This research was partially funded by the “Intelligent Recognition Industry Service Research Center” from the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and the APC was funded by the aforementioned project.