Computational Intelligence in Image Processing 2020 (Special Issue)
Blind Stereo Image Quality Evaluation Based on Convolutional Network and Saliency Weighting
With the rapid development of stereo image applications, there is an increasing demand for a versatile tool to evaluate the perceived quality of stereo images. In this study, a blind stereo image quality evaluation (SIQE) algorithm based on a convolutional network and saliency weighting is proposed. The main network framework of the algorithm is a quality map generation network, which is trained on a distorted image dataset with quality map labels to obtain an optimal network. The left view, right view, and cyclopean view of the stereo image are then each used as input to the trained network, and the resulting scores are combined by weighted fusion into the final stereo image quality score. The experimental results reveal that the proposed SIQE algorithm improves the accuracy of image quality prediction to a certain extent and has good generalization ability.
With the rapid development of stereo image applications, many related stereo image technologies and services have been introduced in our daily lives as well as in many professional fields [1–9]. A variety of distortions can occur during the collection, transmission, processing, and display of stereo images [10–19]. Therefore, it is of immense practical significance to establish a high-performance stereo image quality evaluation method. Stereo image quality evaluation (SIQE) is classified into objective and subjective evaluation. In subjective evaluation, human observers rate the images directly. Because the human visual system is the ultimate recipient of images, subjective evaluation is very persuasive. However, in practical applications, subjective evaluation is extremely time-consuming and laborious and is difficult to apply to real-time systems. Therefore, objective evaluation plays a dominant role in SIQE. There are three main types of objective stereo image quality evaluation methods: full-reference (FR) evaluation [4–8], reduced-reference (RR) evaluation, and no-reference (NR) evaluation [10–22]. The FR evaluation method compares the undistorted original image with the distorted image to obtain the difference between them. The RR evaluation method uses only partial information from the undistorted original image. The NR evaluation method does not use the undistorted original image at all. Because the original undistorted image is difficult to obtain in practical applications, the NR evaluation method has a higher research value.
Many SIQE methods are designed based on typical 2D image quality evaluation methods [1–3]. A typical full-reference SIQE method, proposed by Chen et al. , synthesizes a cyclopean (central eye) image from a stereo image pair, a disparity image, and Gabor filter responses, and then applies an FR 2D image quality assessment method to the cyclopean image to predict the 3D quality score. Zhou et al.  proposed a new NR SIQE method based on binocular self-similarity and deep neural networks. In , the stereo image block is first input to a convolutional neural network and then pooled. The final image quality score is obtained through a multilayer perceptron, where the initial parameters of the convolutional neural network are trained on a large number of natural images. Jiang et al.  performed SIQE by learning color visual characteristics based on nonnegative matrix factorization and considering binocular interaction. The study  proposes an NR stereo image quality assessment method based on the combination of wavelet decomposition and statistical models. The SIQE methods proposed in [15, 16] and the method proposed in  are based on the binocular vision mechanism. The method proposed in  is based on learning gradient dictionary-based color visual features. Wu et al.  proposed an evaluation method based on depth edge information and color signals, which also uses segmented autoencoders. Liu et al.  proposed an SIQE method based on classification and prediction. Jiang et al. proposed an SIQE algorithm for processing multiple distortions , which characterizes the local receptive field properties of the visual cortex by learning monocular and binocular local visual primitives, for quality assessment of both multiply distorted and singly distorted stereo images. Bensalma and Larabi  proposed a perceptual quality metric for stereo images based on binocular energy. Zhou et al.  proposed a blind SIQE metric based on binocular combination and an extreme learning machine.
Although the above methods have achieved certain effects on the stereo image evaluation problem, they do not consider versatility or image saliency weighting. Moreover, few studies consider how to assign weights to the left and right views of a stereo image. Therefore, in this study, an NR SIQE method based on convolutional networks and saliency weighting is proposed. The main contributions of this work are as follows.
First, the training dataset used in the proposed method is a self-made distorted image dataset, and the quality maps obtained by a high-performance FR image quality evaluation method are used as the corresponding labels.
Second, the main network framework used by the algorithm is the quality map generation network, which is trained on the distorted image dataset with the quality map labels to obtain an optimal network framework.
Finally, the left view, the right view, and the cyclopean view of the stereo image are each used as input to the optimal quality map generation network, and the corresponding quality maps are predicted. Weighted fusion is then performed to obtain the final stereo image quality score.
2. Model Methods
Figure 1 illustrates the overall structure of the proposed method. The inputs to the network are a distorted left-view image, a distorted right-view image, and the cyclopean view of the left and right views. The output is a predicted quality score for the distorted stereo image. As can be seen from the figure, the overall framework can be divided into three submodules: the quality map generation network, the saliency weighting module, and the weighted summation module of the left, right, and cyclopean views. Each module is described in detail in the following sections.
2.1. Quality Map Generation Network
In the proposed method, the quality map generation network is the main component of the overall framework. The requirement for the quality map generation network is that it outputs a quality map of the same size as the input image. We construct an improved U-Net framework, an extension of fully convolutional networks, as the base of the quality map generation network because U-Net integrates the hierarchical representations in the subsampling layers with the corresponding features in the upsampling layers. Hence, these layers increase the resolution of the output. For localization, high-resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.
A major component of the quality map generation network is the convolutional layer. In the encoder part, the basic module consists of two convolutional layers followed by one pooling layer. The number of convolution kernels in each convolutional layer is depicted in Figure 2. The size of the input distorted image is $w \times h \times 3$, the size of the output quality map is $w \times h$, and the size of the convolution kernel is 3 × 3. The activation function is the rectified linear unit (ReLU). Let $W_i^l$ denote the parameters of the $l$th filter kernel in the $i$th convolution layer and $b_i^l$ denote its corresponding bias. Then, the $l$th feature map produced in the $i$th convolution layer is represented by
$$h_i^l = \sigma\left(W_i^l * h_{i-1} + b_i^l\right), \tag{1}$$
where $h_{i-1}$ denotes the feature maps output by the $(i-1)$th convolution layer, $h_0$ corresponds to the input image, $*$ denotes convolution, and $\sigma(\cdot)$ is the ReLU function.
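As an illustration of the per-layer feature map computation described above, the following is a minimal single-channel sketch in plain Python. It is not the authors' Keras implementation: the function name `conv3x3_relu`, the zero padding (chosen here so that the output keeps the input size), and the single-channel restriction are assumptions made for illustration. As in most deep learning frameworks, the kernel is applied without flipping.

```python
def relu(value):
    """Rectified linear unit: max(0, value)."""
    return value if value > 0.0 else 0.0


def conv3x3_relu(image, kernel, bias=0.0):
    """Apply one 3x3 filter with zero padding, then ReLU.

    `image` is a list of rows (single channel); `kernel` is a 3x3 list.
    Zero padding is an assumption so the output has the input's size.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = bias
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Pixels outside the image contribute zero.
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            output[r][c] = relu(acc)
    return output


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
feature_map = conv3x3_relu([[1.0, -2.0], [3.0, 4.0]], identity)
```

With the identity kernel, the layer simply passes positive pixels through and clamps negative ones to zero, which makes the role of the ReLU easy to see.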
The supervisory label used in the network is the structural similarity (SSIM) quality map. Computing SSIM locally, which yields a map, works better as a label than computing it globally.
In this work, we first select 100 source images from a dataset  for the training set. The source images have different scenes (all reference images of the dataset are shown in Figure 2, resized to a fixed size of 528 × 400). Four commonly observed distortion types, namely, JPEG 2000 (JP2K) compression, JPEG compression, Gaussian blur (GB), and white noise (WN), are used to generate the distorted images. Finally, the SSIM metric is employed to generate the ground-truth objective quality/similarity maps as training labels. In the proposed method, the SSIM quality map used is the local mapping matrix. Let $x$ and $y$ be two image signals, the original and the distorted image, respectively. A quality map based on their similarity is then obtained, and the SSIM quality map is defined as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \tag{2}$$
where $\mu_x$ is the mean of image $x$, $\mu_y$ is the mean of image $y$, $\sigma_x$ is the standard deviation of image $x$, $\sigma_y$ is the standard deviation of image $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$ and $C_2$ denote small positive constants that increase stability when the denominator approaches zero. In this paper, we set $C_1 = C_2 = 0.085$ for our experiments.
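To make the local SSIM quality map concrete, the sketch below evaluates the standard SSIM expression over a sliding window in plain Python. Several details are assumptions rather than the paper's settings: the uniform 3 × 3 window (the paper does not state its window), the absence of border padding (so the map is smaller than the input by `win - 1` pixels in each dimension), and the function name `ssim_map`. Only the constants C1 = C2 = 0.085 come from the text.

```python
def ssim_map(x, y, win=3, c1=0.085, c2=0.085):
    """Local SSIM between two equally sized grayscale images (lists of rows).

    Uses a uniform win x win window with no padding (an assumption),
    so the result has (H - win + 1) x (W - win + 1) entries.
    """
    height, width = len(x), len(x[0])
    n = win * win
    result = []
    for r in range(height - win + 1):
        row = []
        for c in range(width - win + 1):
            px = [x[r + i][c + j] for i in range(win) for j in range(win)]
            py = [y[r + i][c + j] for i in range(win) for j in range(win)]
            mx, my = sum(px) / n, sum(py) / n
            var_x = sum((a - mx) * (a - mx) for a in px) / n
            var_y = sum((b - my) * (b - my) for b in py) / n
            cov = sum((a - mx) * (b - my) for a, b in zip(px, py)) / n
            numerator = (2 * mx * my + c1) * (2 * cov + c2)
            denominator = (mx * mx + my * my + c1) * (var_x + var_y + c2)
            row.append(numerator / denominator)
        result.append(row)
    return result


reference = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
flat = [[0.5] * 3 for _ in range(3)]
same = ssim_map(reference, reference)
degraded = ssim_map(reference, flat)
```

Comparing an image with itself gives a map of ones, while replacing the texture with a flat patch lowers the local score, which is the behaviour the training labels rely on.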
2.2. Weighted Fusion Module
The weighted fusion module is one of the important components of the proposed method, and its main structure is depicted in Figure 1.
In this study, the fusion of the quality scores of the left and right views is based on the widely used Bayesian theory , in which the binocular quality score can be obtained by
$$p(\vartheta \mid I_L, I_R) \propto p(I_L)\, p(\vartheta \mid I_L) + p(I_R)\, p(\vartheta \mid I_R). \tag{3}$$
In (3), $I_L$ and $I_R$ denote the left retinal view and the disparity/depth-compensated right retinal view, respectively; $\vartheta$ denotes the quality; $p(I_L)$ and $p(I_R)$ denote feature distributions that are utilized to balance the roles of the binocular visual mechanism in determining the overall visual quality; and the likelihoods $p(\vartheta \mid I_L)$ and $p(\vartheta \mid I_R)$ denote the quality scores of the left and right views.
The left- and right-view weights of the distorted stereo image follow the weight assignment method of the paper . Equations (4) and (5) give the weights of the left and right views, respectively, each computed from the significance level estimates of the left and right views. These weights are then used to combine the left- and right-view quality scores into a weighted quality score Qlr (equation (6)). Smap,l and Smap,r denote the saliency maps of the left and right views, respectively (we use the method of reference  to derive saliency maps from the distorted views). Qmap,l and Qmap,r denote the predicted quality maps of the left and right views, respectively.
In addition to the weights used to combine the left- and right-view quality scores, a further weight combines the weighted left/right score Qlr with the cyclopean-view quality score. Smap,m denotes the saliency map of the cyclopean view, and Qmap,m denotes the predicted quality map of the cyclopean view. For a distorted stereo image, the overall quality score is obtained as a weighted combination of these two scores, where α is the weight. The selection of the weight α is discussed in the experimental section.
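The fusion stage can be sketched as follows. The exact formulas are not reproduced in this text, so everything here is an assumption for illustration: the saliency-weighted pooling is taken to be a normalized weighted average of the quality map, the left and right weights are assumed to sum to 1, and α is assumed to multiply the cyclopean-view score. The function names are hypothetical.

```python
def saliency_weighted_score(quality_map, saliency_map):
    """Pool a quality map into one score, weighting pixels by saliency.

    Assumed form: sum(S * Q) / sum(S). Both maps are lists of rows
    with identical shape.
    """
    weighted = sum(
        q * s
        for q_row, s_row in zip(quality_map, saliency_map)
        for q, s in zip(q_row, s_row)
    )
    total = sum(s for s_row in saliency_map for s in s_row)
    if total == 0:
        raise ValueError("saliency map sums to zero")
    return weighted / total


def fuse_views(q_left, q_right, q_cyclopean, w_left, w_right, alpha=0.2):
    """Combine per-view scores into one stereo score.

    Assumptions: w_left + w_right == 1, and alpha weights the
    cyclopean score while (1 - alpha) weights the left/right score.
    """
    q_lr = w_left * q_left + w_right * q_right
    return alpha * q_cyclopean + (1.0 - alpha) * q_lr


q_map = [[0.9, 0.5], [0.7, 0.3]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
focused = [[1.0, 0.0], [0.0, 0.0]]
mean_score = saliency_weighted_score(q_map, uniform)
peak_score = saliency_weighted_score(q_map, focused)
overall = fuse_views(0.8, 0.6, 0.7, w_left=0.5, w_right=0.5, alpha=0.2)
```

With uniform saliency the pooled score reduces to the plain mean of the quality map, while a saliency map concentrated on one pixel returns that pixel's quality, which shows how salient regions dominate the score.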
3. Experimental Results and Analysis
3.1. Experimental Database
In the experiment, the performance of the proposed method was verified using two published three-dimensional image quality evaluation databases: LIVE 3D Phase I  and LIVE 3D Phase II . Basic information of the two databases is provided below.
LIVE 3D Phase I: This database includes 20 pairs of original undistorted stereo images and 365 pairs of distorted stereo images. The size of each view is 640 × 360. The distorted images cover five different types of distortion: JPEG distortion, JPEG 2000 distortion, additive white Gaussian noise (WN) distortion, fast-fading channel (FF) distortion, and Gaussian blur (GB) distortion. In addition, each pair of distorted images has a differential mean opinion score (DMOS), which is a human subjective image quality score obtained through a large number of experiments.
LIVE 3D Phase II: This database (with subsets LIVE 3D Phase II-Symmetric and LIVE 3D Phase II-Asymmetric) includes 8 pairs of original undistorted stereo images and 360 pairs of distorted stereo images. The size of each view is 640 × 360. As in LIVE 3D Phase I, the distorted images cover five different types of distortion. For each type of distortion, three symmetrically distorted stereo image pairs and six asymmetrically distorted stereo image pairs are generated for each pair of original stereo images. “Asymmetric” means that the left and right views of the stereo image have different types or different levels of distortion. Each pair of stereo images has a DMOS value.
3.2. Experimental Training Step
The experiment was run on a 64-bit computer with an NVIDIA GTX 1080 Ti GPU. The deep network was implemented in the Keras deep learning framework with TensorFlow as the backend. The network is trained with the Adam optimizer, also known as “adaptive moment estimation,” a parameter updating method that computes an adaptive learning rate for each parameter. Some parameter updating methods adjust the learning rate globally, so that all network parameters are adjusted equally; this makes the learning rate difficult to tune and requires very good initial network parameters. Compared with these methods, Adam has a clear advantage. We use the Adam  optimization method with a learning rate of 1 × 10^−4, decayed by a factor of 0.5 every 50 epochs, and a weight decay of 1 × 10^−5 for regularization. The batch size is 16, and the loss function is the mean square error (MSE).
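The learning-rate schedule described above can be written as a small step-decay function. This is a sketch under one reading of the text, namely that the rate is multiplied by 0.5 after every 50 completed epochs, with epochs counted from zero; the function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.5, step=50):
    """Learning rate for a zero-indexed epoch under step decay.

    Assumed reading of the schedule: multiply by `factor` once for
    every `step` completed epochs.
    """
    return base_lr * factor ** (epoch // step)


# Learning rate at a few representative epochs.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 49, 50, 100, 149)}
```

In Keras, a function of this shape can be passed to the `LearningRateScheduler` callback, which calls it with the epoch index at the start of each epoch.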
3.3. Analysis of Experimental Results
Three classical performance indexes are used to evaluate the proposed SIQE method, namely, the Pearson linear correlation coefficient (PLCC), the Spearman rank-order correlation coefficient (SROCC), and the root mean square error (RMSE). PLCC evaluates the prediction accuracy of the NR image quality evaluation algorithm: the higher the correlation between the objective prediction scores and the subjective scores, the closer PLCC is to 1, indicating better performance. SROCC evaluates prediction monotonicity in the same way, and a lower RMSE indicates a smaller prediction error.
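The three indexes can be computed as in the following plain-Python sketch. It works directly on raw predicted and subjective scores; image quality studies often apply a nonlinear (for example, logistic) mapping before computing PLCC and RMSE, but the text does not say whether one was used, so that step is omitted here. The function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x, y):
    """Root mean square error between two score lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
plcc = pearson(predicted, subjective)
srocc = spearman(predicted, subjective)
error = rmse([1.0, 2.0], [1.0, 4.0])
```

In this toy case the relation is monotonic but not linear, so SROCC reaches 1 while PLCC stays below 1, which is why the two indexes are reported together.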
Since the planar database we created contains only four types of distortion, namely, JPEG, JPEG 2000, WN, and GB, only four types of distortions are tested and analyzed in the experiment. The data shown in Table 1 are the data for all four distortions.
Table 1 shows the results of the weight selection experiment, in which the weighted left- and right-view quality score is combined with the cyclopean-view quality score. The experiment shows that, on the stereo distortion database, the best performance is obtained with α = 0.2.
The overall test results are shown in Table 2, where the final result was obtained with α = 0.2. The performance indexes show that the image quality evaluation method proposed in this paper has good generalization ability.
In Tables 3 and 4, two performance indexes, PLCC and SROCC, are used to compare the SIQE method proposed in this paper with earlier methods, including three 2D full-reference image quality evaluation methods (SSIM , FSIM , and GMSD ) and three SIQE methods (Chen , Bensalma and Larabi , and Zhou ). Table 3 shows that the stereo image evaluation method proposed in this paper is better than the other NR image evaluation methods. Table 4 shows that the proposed method is more advantageous for evaluating asymmetric stereo images. Since the proposed method does not require original undistorted images or human subjective scores, its evaluation performance may be slightly lower than that of the FR evaluation methods, but it achieves the desired effect of no-reference SIQE.
A good image quality evaluation method not only yields high performance but also provides good computational efficiency. Prediction speed is an important performance indicator because it determines the practicality of the algorithm. Here, we present the computation time required for a pair of stereo images. On the LIVE 3D Phase I database, the prediction time for a pair of stereo images on the NVIDIA 1080 Ti GPU is approximately 0.034 s (more than 25 frames per second), while on the LIVE 3D Phase II database it is approximately 0.039 s (also more than 25 frames per second). In terms of speed, the method proposed in this paper can achieve real-time prediction (a real-time system runs at 25 frames per second).
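The real-time claim follows from inverting the reported per-pair prediction times, as this short check shows. The 25 frames-per-second threshold and the two timings are taken from the text; the helper name is hypothetical.

```python
def frames_per_second(seconds_per_pair):
    """Throughput implied by a per-pair prediction time."""
    return 1.0 / seconds_per_pair


REAL_TIME_FPS = 25.0  # real-time threshold stated in the text

fps_phase1 = frames_per_second(0.034)  # LIVE 3D Phase I timing
fps_phase2 = frames_per_second(0.039)  # LIVE 3D Phase II timing
is_real_time = fps_phase1 > REAL_TIME_FPS and fps_phase2 > REAL_TIME_FPS
```

Both timings correspond to a little over 25 pairs per second, so the margin on LIVE 3D Phase II is narrow.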
In this paper, we propose an effective SIQE method that differs from existing image evaluation methods based on deep learning. First, the training dataset of the proposed method is a self-made distorted image dataset, with the corresponding quality maps obtained by a high-performance FR image quality evaluation method as labels. Second, the main network framework used by the algorithm is the quality map generation network, which is trained on the distorted image dataset with the quality map labels to obtain an optimal network framework. The physiological function of the neurons ensures that the predicted results are highly consistent with the original quality map. Finally, the left view, the right view, and the cyclopean view of the stereo image are each input into the quality map generation network to predict the corresponding quality maps, from which three quality scores are obtained using saliency weighting. Weighted fusion is then performed to obtain the final stereo image quality score. Experiments were performed on two LIVE 3D databases to verify the effectiveness of the proposed algorithm. The experimental results show that the algorithm improves the accuracy of image quality prediction and has good generalization ability.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (grant no. 61502429), the Zhejiang Provincial Natural Science Foundation of China (grant no. LY18F020012), the Zhejiang Open Foundation of the MOST Important Subjects, and the China Postdoctoral Science Foundation (grant no. 2015M581932).
G. Jiang, H. Xu, M. Yu, T. Luo, and Y. Zhang, “Stereoscopic image quality assessment by learning non-negative matrix factorization-based color visual characteristics and considering binocular interactions,” Journal of Visual Communication and Image Representation, vol. 46, pp. 269–279, 2017.
J. Yang, P. An, J. Ma, K. Li, and L. Shen, “No-reference stereo image quality assessment by learning gradient dictionary-based color visual characteristics,” in Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, Florence, Italy, May 2018.
D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” 2014, http://arxiv.org/abs/1412.6980.