Abstract

To improve video visual resolution quality and detail clarity, a novel learning-based video superresolution reconstruction algorithm using spatiotemporal nonlocal similarity is proposed in this paper. Objective high-resolution (HR) estimations of low-resolution (LR) video frames can be obtained by learning the LR-HR correlation mapping and fusing spatiotemporal nonlocal similarities between video frames. With the objective of improving algorithm efficiency while guaranteeing superresolution quality, a visual saliency-based LR-HR correlation mapping strategy between LR and HR patches is proposed based on semicoupled dictionary learning. Moreover, to improve the performance and efficiency of spatiotemporal similarity matching and fusion, an improved spatiotemporal nonlocal fuzzy registration scheme is established using a similarity weighting strategy based on pseudo-Zernike moment feature similarity and structural similarity, together with a self-adaptive regional correlation evaluation strategy. The proposed spatiotemporal fuzzy registration scheme does not rely on accurate estimation of subpixel motion, and therefore it can be adapted to complex motion patterns and is robust to noise and rotation. Experimental results demonstrate that the proposed algorithm achieves competitive superresolution quality compared to other state-of-the-art algorithms in terms of both subjective and objective evaluations.

1. Introduction and Motivation

Factors such as environmental changes, inaccurate focusing, optical or motion blur, subsampling, and noise disturbance can have a negative effect on video visual quality. Superresolution (SR) reconstruction technology [1-4] aims to reconstruct high-resolution (HR) video sequences from their low-resolution (LR) counterparts. With the rapid development of computer vision, there is a growing need for HR videos. Video visual resolution quality plays an important role in accurate moving-target tracking and recognition in intelligent video surveillance systems, because HR video provides more important details of moving targets. HR medical videos are also very useful for doctors in making correct diagnoses. Therefore, video SR has great research significance and application potential.

In recent years, SR reconstruction technology has been one of the most active research fields in smart image and video analytics and processing. SR techniques have evolved from frequency-domain to spatial-domain approaches. Current studies fall into three main categories: interpolation-based SR methods [5, 6], multiframe-based SR methods [7-9], and learning-based SR methods [10, 11]. Interpolation-based SR methods have relatively low computational cost and are therefore well suited for real-time applications. However, their degradation models are not applicable if blur and noise characteristics vary across LR video frames. Moreover, additional video details cannot be effectively recovered by these methods because some of the details of interest have usually been blurred.

Multiframe-based SR methods produce HR video sequences by fusing several LR video frames, making full use of complementary and redundant information with similar but not exactly identical details between adjacent video frames at different spatiotemporal scales. At present, two main fields of research address this kind of method. One branch is based on accurate estimation of subpixel motion using methods such as the projections onto convex sets (POCS) method, the maximum a posteriori (MAP) estimation method, and the iterative back projection (IBP) method, which can be applied only to video sequences with relatively simple motions such as global translation. These methods cannot be adapted to more complex motion patterns such as local motion or angles of rotation. The second branch [12, 13] is based on a recently proposed novel probabilistic motion-estimation scheme based on nonlocal similarity, which does not rely on accurate estimation of subpixel motion and can be adapted to more complex motion patterns. Using this novel scheme, Protter et al. [14] proposed a nonlocal fuzzy registration scheme-based SR reconstruction framework based on a 3D nonlocal mean filter (3D NLM) [15]. Subsequently, Gao et al. [16] improved the nonlocal similarity matching method based on Zernike moment feature similarity and proposed a novel Zernike moment-based SR method which improved the noise robustness and rotation invariance of the NLM-based SR process. However, multiframe-based SR methods cannot be adapted to a larger magnification factor and usually fail when insufficient complementary and redundant information between video frames is provided.

In recent years, learning-based SR methods [17-19] have received much attention. These methods estimate the missing high-frequency details in the input LR images by learning the relationship between LR image patches and the corresponding HR patches from a training set of LR and HR image pairs. This kind of method can be adapted to larger magnification factors and can produce better superresolved results. This paper concentrates on the learning-based SR method for video SR. Until now, nearly all studies of this kind of method have focused on SR for static images. In this paper, by combining the spatiotemporal similarities between video frames, learning-based SR methods will be extended to the video SR field. In the learning-based image SR field, the representative methods are the neighbor embedding-based SR methods (NESR) and the sparse representation-based SR methods (SRR).

Motivated by locally linear embedding (LLE), Chang et al. [20] first proposed a neighbor embedding-based SR method, which reconstructed HR patches by learning a mapping from the local geometry of the LR image patch manifold to that of the HR image patch manifold. Since then, numerous other methods have been proposed and have achieved good performance. Gao et al. [21] extended this method using sparse neighbor embedding, in which the k-nearest neighbors (k-NN) of each LR patch were adaptively chosen by describing local structural information using the histogram of oriented gradients (HoG) feature. Timofte et al. [22] proposed a novel anchored neighborhood regression method for fast example-based SR, in which the nearest neighbors were computed using correlation with dictionary atoms rather than Euclidean distance. However, when dealing with a huge number of training patches, searching for the nearest neighbors can be prohibitively slow and memory-intensive. Moreover, with increasing magnification factor, the correlation between LR patches and their corresponding HR patches becomes ambiguous [23].

Recently, sparse representation and dictionary learning have been proven to be very effective for SR. In sparse representation-based SR methods, some coupled dictionary learning methods [24, 25] have been proposed for superresolution. Lin and Tang [26] proposed a novel coupled subspace learning strategy to learn mappings between different styles. They first used correlative component analysis to find the hidden spaces for each style to preserve correlative information and then learned a bidirectional transform between the two subspaces. Yang et al. [27] proposed a coupled dictionary learning model for image superresolution. They assumed that coupled HR and LR image dictionaries exist which have the same sparse representation for each pair of HR and LR patches. After learning the coupled dictionary pair, the HR patch was reconstructed on the HR dictionary with sparse coefficients coded by the LR image patch over the LR dictionary. This coupled dictionary learning-based SR method assumes that the representation coefficients of the image pair are strictly equal in the coupled subspace. However, this assumption is too strong to address the flexibility of image structures at different resolutions. To overcome this problem, in [28], a semicoupled dictionary learning-based SR method was proposed, which relaxed the above assumption and assumed that there exists a dictionary pair over which the representations of HR and LR image patches have a stable correlation mapping. He et al. [29] used a beta process for sparse coding, establishing a mapping function between HR and LR coefficients. Moreover, in the methods described in [28-30], nonlocal similarities were used to enhance SR performance.

However, these learning-based methods consider nonlocal similarities only in the spatial region of the single image. Therefore, they cannot be directly adapted to video superresolution because they do not make full use of spatiotemporal correlation between video frames, which will influence video spatiotemporal consistency to some extent. This paper aims to solve this problem by extending the concept of single frame-based nonlocal similarities to spatiotemporal nonlocal similarities. A novel learning-based video superresolution method using spatiotemporal nonlocal similarity constraint is proposed which can be adapted to larger magnification factors while effectively preserving video spatiotemporal consistency.

This paper presents a novel learning-based video superresolution reconstruction algorithm using spatiotemporal nonlocal similarity (LBST-SR). The novelty and contributions of this paper are as follows:
(1) By combining LR-HR correlation mapping learning and spatiotemporal nonlocal similarity, video SR performance is further improved via fusion of nonlocal similarity structural redundancies at different spatiotemporal scales.
(2) With the aim of improving algorithm efficiency while guaranteeing SR quality, the authors propose a novel visual saliency-based correlation mapping strategy between LR and HR patches based on semicoupled dictionary learning. In addition, a self-adaptive regional correlation evaluation strategy based on regional average energy and structural similarity is used in spatiotemporal similarity matching.
(3) An improved spatiotemporal nonlocal fuzzy registration scheme using pseudo-Zernike moment (PZM) and structural similarity is proposed for spatiotemporal similarity matching with the aim of further improving SR accuracy and robustness.

The remainder of the paper is organized as follows. Section 2 gives the observation model for video superresolution reconstruction. Section 3 presents the details of the proposed LBST-SR algorithm. Section 4 gives the experimental results and analysis. Conclusions are presented in Section 5.

2. Observation Model for Video Superresolution Reconstruction

The observation model for video superresolution reconstruction shown in Figure 1, which describes the relationship between HR and LR video frames for superresolution reconstruction, can be formulated as follows:

$$y_t = D B_t M_t x_t + n_t, \quad t = 1, 2, \ldots, T, \tag{1}$$

where $x_t$ denotes the $t$th original HR video frame and $y_t$ denotes the $t$th observed LR video frame, which is produced by warping $M_t$, blurring $B_t$, downsampling $D$, and noise disturbance $n_t$. $M_t$ describes the motions which occur during video acquisition, such as global or local translation and rotation. $T$ denotes the frame number in the video sequence.
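To make the observation model concrete, the following Python sketch synthesizes one LR frame from an HR frame under the assumptions of a global translational warp, Gaussian blur, integer-factor decimation, and additive Gaussian noise. The function name `degrade_frame` and all parameter values are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def degrade_frame(x_hr, dx=0.5, dy=-0.3, blur_sigma=1.2, scale=3, noise_sigma=5.0, rng=None):
    """Simulate y_t = D B_t M_t x_t + n_t for a single frame (illustrative parameters)."""
    rng = np.random.default_rng(0) if rng is None else rng
    warped = shift(x_hr.astype(float), (dy, dx), order=3, mode='nearest')  # M_t: global translation
    blurred = gaussian_filter(warped, sigma=blur_sigma)                    # B_t: blur
    decimated = blurred[::scale, ::scale]                                  # D: downsampling
    return decimated + rng.normal(0.0, noise_sigma, decimated.shape)       # n_t: noise disturbance
```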

3. Proposed LBST-SR Algorithm

3.1. Algorithm Architecture and Mathematical Formulation

On the basis of LR-HR correlation mapping learning between LR patches and the corresponding HR patches, this paper aims to further improve the performance of video superresolution reconstruction by combining spatiotemporal nonlocal similarity structural redundancies at different spatiotemporal scales. Therefore, a novel learning-based video superresolution reconstruction algorithm using spatiotemporal nonlocal similarity (LBST-SR) is proposed. Objective HR estimations of LR video frames can be obtained by learning LR-HR correlation mapping and fusing spatiotemporal nonlocal similarity information between video frames. With the aim of improving algorithm efficiency while guaranteeing superresolution quality, LR-HR correlation mapping is performed only for the visually salient object region, and an improved nonlocal fuzzy registration scheme using the pseudo-Zernike moment feature and structural similarity is then used for spatiotemporal similarity matching and fusion. The advantages of the proposed LBST-SR algorithm mainly lie in the following three aspects: it does not rely on accurate estimation of subpixel motion and therefore can be adapted to complex motion patterns (local motions, angles of rotation, etc.); it offers effective rotation invariance and is robust to noise and illumination; and it can be adapted to larger superresolution magnification factors. The proposed algorithm architecture is shown in Figure 2. It includes two main processes: LR-HR correlation mapping learning, and spatiotemporal nonlocal fuzzy registration and fusion.

Given an input LR video sequence $Y$ and a set of LR and HR training pairs, the objective is to infer the corresponding HR video sequence $X = \{x_t\}$, where $t = 1, 2, \ldots, T$ and $T$ denotes the video frame number. The mathematical model of the proposed LBST-SR algorithm is formulated as minimizing the following objective energy function:

$$\hat{X} = \arg\min_{X} \big\{ E_{\mathrm{map}}\big(X \mid \Omega_s\big) + \lambda\, E_{\mathrm{ST}}(X) \big\}, \tag{2}$$

where $X$ denotes the HR estimation of the video sequence $Y$, and $(i, j)$ indexes a pixel in the LR sequence $Y$. $\Omega_s$ denotes the salient object region in $Y$, and $\Omega_{ns}$ denotes the nonsalient region in $Y$. $E_{\mathrm{map}}$ denotes an LR-HR correlation mapping energy element, $E_{\mathrm{ST}}$ denotes a spatiotemporal nonlocal similarity regularization constraint element, and $\lambda$ is the balancing parameter between the two elements. Aiming at improving algorithm time efficiency while guaranteeing superresolution quality, the LR-HR correlation mapping is established only for the human-eye concentrated salient object region $\Omega_s$.

3.2. LR-HR Correlation Mapping Learning

The HR estimations of LR video frames can be obtained by learning the correlation mapping between LR and HR patches. With the objective of improving algorithm efficiency while guaranteeing SR quality, the LR-HR correlation mapping is established only for the human-eye concentrated salient object region $\Omega_s$ in the LR video frame $y$. In this paper, a saliency optimization method based on robust background detection [31] is used to detect and extract the visually salient region. The learning process for LR-HR correlation mapping can be formulated as follows: given the LR patch set $X_l$ and the HR patch set $X_h$, the mapping process can be described as a process of seeking a mapping function $f$ from space $X_l$ to space $X_h$: $f: X_l \rightarrow X_h$.

The correlation learning model based on a coupled dictionary assumes that each pair of HR and LR patches has the same sparse representation coefficients. This assumption is too strong to address the flexibility of frame structures at different resolutions, which restricts superresolution performance. Therefore, in this research, a more flexible and stable semicoupled dictionary learning method has been used to establish correlation mapping between HR and LR patches, which assumes that there exists a stable correlation mapping between the sparse representation coefficients of HR and LR patches. In the LR-HR correlation learning process based on semicoupled dictionary learning, the LR-HR dictionary pair $\{D_l, D_h\}$ and the correlation mapping matrix $W$ can be obtained by minimizing the objective energy function given in

$$\min_{D_h, D_l, W, \Lambda_h, \Lambda_l} \|X_h - D_h \Lambda_h\|_F^2 + \|X_l - D_l \Lambda_l\|_F^2 + \gamma \|\Lambda_h - W \Lambda_l\|_F^2 + \lambda_h \|\Lambda_h\|_1 + \lambda_l \|\Lambda_l\|_1 + \lambda_W \|W\|_F^2 \quad \text{s.t. } \|d_{l,i}\|_2 \le 1,\ \|d_{h,i}\|_2 \le 1 \ \ \forall i, \tag{3}$$

where $\gamma$, $\lambda_l$, $\lambda_h$, and $\lambda_W$ denote the regularization parameters needed to balance the terms in the objective function; $\Lambda_l$ and $\Lambda_h$ are the sparse representation coefficients of LR and HR patches, respectively; $\|X_l - D_l \Lambda_l\|_F^2$ and $\|X_h - D_h \Lambda_h\|_F^2$ denote the reconstruction errors; $\|\Lambda_h - W \Lambda_l\|_F^2$ denotes the mapping error; and $d_{l,i}$ and $d_{h,i}$ denote the atoms of $D_l$ and $D_h$, respectively.

To solve the minimization problem for the objective energy function in (3), it can be separated into three subproblems: sparse coding for training samples; dictionary updating; and mapping updating.

Sparse Coding for Training Samples. With the initialization of $W$ and the dictionary pair $\{D_l, D_h\}$, the sparse coding coefficients $\Lambda_l$ and $\Lambda_h$ can be obtained by solving (4) using $\ell_1$-optimization algorithms:

$$\min_{\Lambda_h, \Lambda_l} \|X_h - D_h \Lambda_h\|_F^2 + \|X_l - D_l \Lambda_l\|_F^2 + \gamma\big(\|\Lambda_h - W_h \Lambda_l\|_F^2 + \|\Lambda_l - W_l \Lambda_h\|_F^2\big) + \lambda_h \|\Lambda_h\|_1 + \lambda_l \|\Lambda_l\|_1, \tag{4}$$

where $W_h$ denotes the mapping from $\Lambda_l$ to $\Lambda_h$ and $W_l$ denotes the mapping from $\Lambda_h$ to $\Lambda_l$. $\|\Lambda_h - W_h \Lambda_l\|_F^2$ denotes the mapping error generated when $\Lambda_l$ is mapped to $\Lambda_h$, and $\|\Lambda_l - W_l \Lambda_h\|_F^2$ denotes the mapping error generated when $\Lambda_h$ is mapped to $\Lambda_l$.
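As a rough illustration of this subproblem, the sketch below sparse-codes the LR and HR training patches over their respective dictionaries with a plain ℓ1 solver from scikit-learn. For brevity it drops the cross-mapping penalty of (4), which a full semicoupled solver would fold into the data term, so it is only an approximation of the coding step; the function name and regularization values are illustrative.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def sparse_code_patches(X_l, X_h, D_l, D_h, lam_l=0.1, lam_h=0.1):
    """l1 sparse coding of LR/HR patch sets over their dictionaries.

    X_l, X_h: (n_patches, patch_dim) arrays, rows are vectorized patches.
    D_l, D_h: (n_atoms, patch_dim) arrays, rows are dictionary atoms.
    The coupling term gamma*||A_h - W A_l||^2 of (4) is omitted here.
    """
    A_l = sparse_encode(X_l, D_l, algorithm='lasso_lars', alpha=lam_l)
    A_h = sparse_encode(X_h, D_h, algorithm='lasso_lars', alpha=lam_h)
    return A_l, A_h  # each (n_patches, n_atoms)
```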

Dictionary Updating. With $\Lambda_l$ and $\Lambda_h$ fixed, the dictionary pair $\{D_l, D_h\}$ can be updated using

$$\min_{D_h, D_l} \|X_h - D_h \Lambda_h\|_F^2 + \|X_l - D_l \Lambda_l\|_F^2 \quad \text{s.t. } \|d_{l,i}\|_2 \le 1,\ \|d_{h,i}\|_2 \le 1 \ \ \forall i. \tag{5}$$

Mapping Updating. With the dictionary pair $\{D_l, D_h\}$, $\Lambda_l$, and $\Lambda_h$ fixed, the mapping $W$ can be updated as follows:

$$\min_{W} \|\Lambda_h - W \Lambda_l\|_F^2 + \frac{\lambda_W}{\gamma} \|W\|_F^2. \tag{6}$$

By solving (6), the following expression can be derived:

$$W = \Lambda_h \Lambda_l^{T} \left( \Lambda_l \Lambda_l^{T} + \frac{\lambda_W}{\gamma} I \right)^{-1}, \tag{7}$$

where $I$ is an identity matrix.
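The closed-form update in (7) is a standard ridge-regression solution and can be checked numerically in a few lines. The sketch below assumes the coefficient matrices are stored column-wise (one column per training patch), which is a layout choice of this illustration rather than something specified in the paper.

```python
import numpy as np

def update_mapping(A_h, A_l, lam_w=0.01, gamma=1.0):
    """Closed-form update W = A_h A_l^T (A_l A_l^T + (lam_w/gamma) I)^(-1).

    A_h, A_l: (n_atoms, n_patches) coefficient matrices, one column per patch.
    """
    k = A_l.shape[0]
    gram = A_l @ A_l.T + (lam_w / gamma) * np.eye(k)
    # Solve gram^T W^T = (A_h A_l^T)^T instead of forming an explicit inverse.
    W = np.linalg.solve(gram.T, (A_h @ A_l.T).T).T
    return W
```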

After obtaining the LR-HR correlation mapping using the above learning process, the superresolution reconstruction is performed by using it to derive the HR estimation of the salient object region in the video frame. For the salient object region in LR video frame $y$, the following optimization problem given in (8) is solved to obtain its HR estimation:

$$\min_{\alpha_l, \alpha_h} \|y_p - D_l \alpha_l\|_2^2 + \|x_p - D_h \alpha_h\|_2^2 + \gamma \|\alpha_h - W \alpha_l\|_2^2 + \lambda_l \|\alpha_l\|_1 + \lambda_h \|\alpha_h\|_1, \tag{8}$$

where $y_p$ is a patch of LR video frame $y$ and $x_p$ is the corresponding patch in the initial estimation $x^0$ of HR video frame $x$. An initial estimation of $x$ can be obtained using a Bicubic interpolator. Equation (8) can be solved by alternately updating $\alpha_l$ and $\alpha_h$. The objective HR estimation of each patch in the salient object region of $x$ can be derived by solving

$$\hat{x}_p = D_h \hat{\alpha}_h. \tag{9}$$
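A minimal sketch of how a salient LR patch could be lifted to HR with the learned triple (D_l, D_h, W): code the patch over D_l, map the coefficients with W, and synthesize over D_h. This is a one-pass approximation of the alternating scheme in (8)-(9); the function name and the row-major dictionary layout are assumptions of this illustration.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def super_resolve_patch(y_patch, D_l, D_h, W, lam_l=0.1):
    """One-pass approximation of (8)-(9).

    y_patch: vectorized LR patch, shape (lr_dim,).
    D_l: (n_atoms, lr_dim), D_h: (n_atoms, hr_dim), W: (n_atoms, n_atoms).
    """
    a_l = sparse_encode(y_patch[None, :], D_l, algorithm='lasso_lars',
                        alpha=lam_l)[0]          # LR sparse coefficients
    a_h = W @ a_l                                # map to the HR coefficient space
    return D_h.T @ a_h                           # HR patch estimate D_h * a_h
```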

3.3. Spatiotemporal Nonlocal Fuzzy Registration and Fusion

The superresolution process based on the learned LR-HR correlation mapping uses only the spatial information within the video frame and the LR-HR mapping. It therefore does not make full use of the spatiotemporal relationship between video frames and cannot preserve video temporal consistency. Large quantities of spatiotemporal nonlocal similarity information exist between video frames, and these nonlocal redundancies are very useful for video superresolution reconstruction. Therefore, in this research, video spatiotemporal nonlocal similarity was used to further enhance the performance of the proposed superresolution algorithm based on LR-HR correlation learning. With the objective of improving the performance and efficiency of spatiotemporal nonlocal similarity matching, the spatiotemporal nonlocal fuzzy registration scheme was improved using the similarity weighting strategy based on PZM feature similarity and structural similarity, together with the self-adaptive regional correlation evaluation strategy.

3.3.1. Improved Spatiotemporal Nonlocal Fuzzy Registration Scheme Using PZM and Structural Similarity (ZSFR)

Considering the good rotation, translation, and scale-invariance properties of the PZM feature, as well as its insensitivity to noise and illumination, the nonlocal fuzzy registration scheme can be further improved by using this feature, which yields a more accurate and robust similarity measure between regional features in the nonlocal spatiotemporal domain for weight calculation. In this way, the performance and robustness of SR reconstruction can be further improved. Unlike traditional methods, the improved spatiotemporal nonlocal fuzzy registration scheme does not rely on accurate estimation of subpixel motion, and therefore it can be adapted to complex motion scenes and is robust to noise and rotation.

Let $P_i$ and $P_j$ represent two PZM feature vectors of the local regions corresponding to pixel $i$ and pixel $j$ in the nonlocal search region of pixel $i$, which can be calculated as

$$P_i = \big[A_{00}(R_i), A_{11}(R_i), A_{20}(R_i), A_{22}(R_i), A_{31}(R_i), A_{33}(R_i)\big], \tag{10}$$

where the PZM feature $A_{nm}$ with order $n$ and repetition $m$ of video frame $f$ is defined as

$$A_{nm} = \frac{n+1}{\pi} \sum_{x} \sum_{y} f(x, y)\, V_{nm}^{*}(r, \theta), \quad x^2 + y^2 \le 1, \tag{11}$$

where $r$ and $\theta$ are the radius and angle, respectively, of the pixels in the polar coordinate system, $n \ge 0$, and $|m| \le n$. The function $V_{nm}(r, \theta) = R_{nm}(r) e^{jm\theta}$ is the basis of the PZM feature, and $V_{nm}^{*}$ denotes the complex conjugate of $V_{nm}$.
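For reference, the sketch below evaluates the low-order pseudo-Zernike moments of a square patch by direct summation over the inscribed unit disk, following (10)-(11). The coordinate normalization and the use of moment magnitudes (for rotation invariance) are implementation choices of this illustration, not prescriptions from the paper.

```python
import numpy as np
from math import factorial

def pzm_radial(n, m, r):
    """Pseudo-Zernike radial polynomial R_nm(r)."""
    m = abs(m)
    R = np.zeros_like(r, dtype=float)
    for s in range(n - m + 1):
        c = ((-1) ** s * factorial(2 * n + 1 - s)
             / (factorial(s) * factorial(n - m - s) * factorial(n + m + 1 - s)))
        R += c * r ** (n - s)
    return R

def pzm_feature(region, orders=((0, 0), (1, 1), (2, 0), (2, 2), (3, 1), (3, 3))):
    """|A_nm| values of a square patch for the low-order moments used in the paper.

    The patch is mapped onto the unit disk; pixels outside the disk are ignored,
    and magnitudes are taken so the feature is rotation invariant.
    """
    h, w = region.shape
    y, x = np.mgrid[0:h, 0:w]
    xn = (2 * x - (w - 1)) / (w - 1)          # normalize coordinates to [-1, 1]
    yn = (2 * y - (h - 1)) / (h - 1)
    r = np.sqrt(xn ** 2 + yn ** 2)
    theta = np.arctan2(yn, xn)
    mask = r <= 1.0
    feats = []
    for n, m in orders:
        V_conj = pzm_radial(n, m, r) * np.exp(-1j * m * theta)   # conjugate basis V_nm^*
        A_nm = (n + 1) / np.pi * np.sum(region[mask] * V_conj[mask])
        feats.append(np.abs(A_nm))
    return np.asarray(feats)
```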

The nonlocal fuzzy registration scheme based on PZM relies on a similarity match in the nonlocal spatiotemporal domain between video frames at different spatiotemporal scales, which is measured by the Euclidean distance between regional PZM feature vectors. The weight of each pixel in the nonlocal spatiotemporal region is calculated based on this similarity as follows:

$$w(i, j) = \frac{1}{C(i)} \exp\!\left( -\frac{\|P_i - P_j\|_2^2}{h^2} \right), \tag{12}$$

where $h$ controls the decay rate of the exponential function and the weight. $C(i)$ is a normalization constant, which is calculated as follows:

$$C(i) = \sum_{j \in N(i)} \exp\!\left( -\frac{\|P_i - P_j\|_2^2}{h^2} \right), \tag{13}$$

where $N(i)$ denotes the nonlocal spatiotemporal search region of pixel $i$.
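Given the PZM feature vectors, the weights of (12)-(13) reduce to a normalized exponential of squared feature distances. A minimal sketch follows; the vectorized candidate layout and the default value of h are illustrative.

```python
import numpy as np

def pzm_weights(P_ref, P_candidates, h=10.0):
    """Fuzzy-registration weights of (12)-(13), normalized to sum to one.

    P_ref: PZM feature vector of the reference region, shape (d,).
    P_candidates: feature vectors of regions in the nonlocal search area, shape (n, d).
    """
    d2 = np.sum((P_candidates - P_ref) ** 2, axis=1)
    w = np.exp(-d2 / h ** 2)
    return w / np.sum(w)     # the sum plays the role of C(i)
```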

Note that the higher the PZM order is, the more sensitive the PZM is to noise. Therefore, in the experiments performed in this study, only the moments up to the third order, namely PZM00, PZM11, PZM20, PZM22, PZM31, and PZM33, were calculated.

By analyzing the weight calculation formula for the PZM-based nonlocal fuzzy registration scheme in (12), it is clear that the time complexity is much too high and increases with the number of LR video frames and the amplification factor. To achieve further improvements in time efficiency and the edge detail-preserving ability of the superresolution algorithm, a novel spatiotemporal nonlocal fuzzy registration scheme (ZSFR) was established by improving the PZM-based spatiotemporal nonlocal fuzzy registration scheme using the similarity weighting strategy based on PZM feature similarity and structural similarity and the self-adaptive regional correlation evaluation strategy.

The improvements in the ZSFR involve two main aspects: with the aim of improving algorithm efficiency, a self-adaptive regional correlation evaluation strategy based on regional average energy and regional structural similarity was constructed for nonlocal similarity matching; and an improved similarity weighting strategy based on regional PZM feature similarity and regional structural similarity was proposed for spatiotemporal nonlocal similarity matching, with the aim of further improving SR performance. To describe this improved ZSFR scheme, the following three definitions are required.

Definition 1 (regional average energy). The video frame is divided into many regions of equal size, and each region is divided into 5 × 5 patches. The total number of pixels in each region is Num, and the energy value of the $k$th pixel is denoted by $e_k$. $E(i)$ is defined as the regional average energy centered on pixel $i$ and is calculated as

$$E(i) = \frac{1}{\mathrm{Num}} \sum_{k=1}^{\mathrm{Num}} e_k. \tag{14}$$

Definition 2 (PZM feature similarity). Given two regions centered on pixels $i$ and $j$, denoted by $R_i$ and $R_j$, respectively, the corresponding feature vectors extracted from these two regions are $P_i$ and $P_j$. The parameter $h$ controls the decay rate of the exponential function. The PZM feature similarity between these two regions is defined as

$$S_{\mathrm{PZM}}(i, j) = \exp\!\left( -\frac{\|P_i - P_j\|_2^2}{h^2} \right). \tag{15}$$

Definition 3 (regional structural similarity). Given two regions centered on pixels $i$ and $j$, denoted by $R_i$ and $R_j$, respectively, $\mu_i$ and $\mu_j$ are the means of these two regions, $\sigma_i$ and $\sigma_j$ are the standard deviations of these two regions, and $\sigma_{ij}$ is the covariance between the two regions. $c_1$ and $c_2$ are two constants. Then, the structural similarity between the two regions is defined as

$$\mathrm{SSIM}(R_i, R_j) = \frac{(2\mu_i \mu_j + c_1)(2\sigma_{ij} + c_2)}{(\mu_i^2 + \mu_j^2 + c_1)(\sigma_i^2 + \sigma_j^2 + c_2)}. \tag{16}$$
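The three definitions translate directly into code. In the sketch below, the per-pixel energy of Definition 1 is taken as the squared intensity, which is one common convention and an assumption here, and c1, c2 follow the usual (0.01·255)^2 and (0.03·255)^2 choice.

```python
import numpy as np

def regional_average_energy(region):
    """Definition 1: mean per-pixel energy of a region (squared intensity assumed)."""
    return np.mean(region.astype(float) ** 2)

def pzm_similarity(P_i, P_j, h=10.0):
    """Definition 2: exp(-||P_i - P_j||^2 / h^2)."""
    return float(np.exp(-np.sum((P_i - P_j) ** 2) / h ** 2))

def regional_ssim(R_i, R_j, c1=6.5025, c2=58.5225):
    """Definition 3: single-window SSIM between two equal-sized regions."""
    R_i, R_j = R_i.astype(float), R_j.astype(float)
    mu_i, mu_j = R_i.mean(), R_j.mean()
    var_i, var_j = R_i.var(), R_j.var()
    cov = np.mean((R_i - mu_i) * (R_j - mu_j))
    return ((2 * mu_i * mu_j + c1) * (2 * cov + c2)
            / ((mu_i ** 2 + mu_j ** 2 + c1) * (var_i + var_j + c2)))
```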

In the improved spatiotemporal nonlocal fuzzy registration scheme, the regional correlation is first evaluated to divide the local regions centered on all pixels in the nonlocal search region of pixel $i$ into related and unrelated regions. Only related regions are used to calculate the weight, an approach which further improves time efficiency and is beneficial for mining the most similar patterns when calculating the similarity weight. The regional correlation is calculated by combining the regional average energy and the regional structural similarity. Moreover, a self-adaptive threshold $T(i)$ is introduced, which yields a self-adaptive regional correlation evaluation mechanism. Two regions are judged to be related when the combination of their regional average energy and regional structural similarity satisfies the self-adaptive threshold $T(i)$ defined below.

The self-adaptive threshold $T(i)$ is adaptively determined by the average energy of the region centered on pixel $i$, which leads to a more accurate regional correlation evaluation. $T(i)$ is calculated as

$$T(i) = a \cdot E(i), \tag{18}$$

where $a$ is an adjustment factor that controls $T(i)$. Experiments have confirmed that the best SR quality is obtained when $a$ is set to 0.08.

With the aim of further improving superresolution accuracy and detail-preserving ability, the similarity weight is improved on the basis of the weighting strategy given in (12) by combining the two factors of regional PZM feature similarity and regional structural similarity. The improved similarity weight is calculated as follows:

$$w(i, j) = \frac{1}{C(i)}\, S_{\mathrm{PZM}}(i, j)\, \mathrm{SSIM}(R_i, R_j), \tag{19}$$

where $i$ denotes the pixel to be superresolved and $j$ denotes a pixel in the nonlocal search region centered on pixel $i$. The parameter $h$ controls the decay rate of the exponential function, as well as the weight. $C(i)$ is a normalization constant:

$$C(i) = \sum_{j \in N(i)} S_{\mathrm{PZM}}(i, j)\, \mathrm{SSIM}(R_i, R_j). \tag{20}$$
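Putting the pieces together, the sketch below gates candidate regions with the self-adaptive threshold of (18) and weights the survivors by the product of PZM feature similarity and regional SSIM as in (19)-(20). The relatedness test used here (absolute energy difference within T(i)) is only a plausible reading of the correlation criterion, whose exact form is not reproduced above, and the helpers are the ones sketched after Definition 3.

```python
import numpy as np

def zsfr_weights(ref_region, cand_regions, P_ref, P_cands, h=10.0, a=0.08):
    """Improved ZSFR weights: self-adaptive gating plus the
    PZM-similarity x regional-SSIM weighting of (19)-(20).
    The gating rule below is an assumption, not the paper's exact criterion."""
    E_ref = regional_average_energy(ref_region)        # helper sketched above
    T = a * E_ref                                      # self-adaptive threshold (18)
    w = np.zeros(len(cand_regions))
    for k, (R_j, P_j) in enumerate(zip(cand_regions, P_cands)):
        if abs(regional_average_energy(R_j) - E_ref) <= T:   # assumed relatedness test
            w[k] = pzm_similarity(P_ref, P_j, h) * regional_ssim(ref_region, R_j)
    total = w.sum()
    return w / total if total > 0 else w
```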

3.3.2. Spatiotemporal Nonlocal Similarity Information Fusion Based on ZSFR

Spatiotemporal nonlocal similarity information fusion is based on the improved nonlocal fuzzy registration scheme using PZM feature similarity and structural similarity. By learning spatiotemporal nonlocal similarities between video frames, the similarity weight is calculated according to (19). The HR estimation of the video frame to be superresolved can then be obtained by spatiotemporal information fusion, which is implemented by a weighted average based on spatiotemporal nonlocal similarities.

Once the weight has been determined, the HR estimation of each pixel in the video frame to be superresolved can be obtained using the weighted average of the pixels in the nonlocal spatiotemporal region. The objective superresolution energy function based on spatiotemporal nonlocal similarity can be expressed as follows:

$$E_{\mathrm{ST}}(x_{t_0}) = \sum_{t \in [t_0 - \Delta t,\, t_0 + \Delta t]} \sum_{j \in N(i)} w(i, j, t)\, \big\| x_{t_0}(i) - \tilde{y}_t(j) \big\|_2^2, \tag{21}$$

where $[t_0 - \Delta t, t_0 + \Delta t]$ denotes a 3D spatiotemporal region (temporal sliding window) and $\tilde{y}_t$ denotes the interpolated LR frame $t$. By minimizing the objective energy function in (21), the HR estimation of each video frame can be obtained as follows:

$$\hat{x}_{t_0}(i) = \sum_{t \in [t_0 - \Delta t,\, t_0 + \Delta t]} \sum_{j \in N(i)} w(i, j, t)\, \tilde{y}_t(j), \tag{22}$$

where $t_0$ denotes the video frame to be superresolved.
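A compact sketch of the fusion step in (22): each HR pixel is a weighted average of candidate pixels gathered across the temporal sliding window, with weights already normalized as in (19)-(20). The flat candidate-list format is an implementation convenience of this illustration.

```python
def fuse_spatiotemporal(frames_up, weights, coords):
    """Weighted-average fusion of (22) for one HR pixel.

    frames_up: list of interpolated frames (2D arrays) inside the temporal window.
    weights:   normalized similarity weights, one per candidate pixel.
    coords:    list of (frame_index, row, col) triples matching `weights`.
    """
    value = 0.0
    for w, (t, r, c) in zip(weights, coords):
        value += w * frames_up[t][r, c]
    return value
```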

Consequently, the proposed learning-based video superresolution reconstruction using spatiotemporal nonlocal similarity can be performed as follows:

$$\hat{X} = \arg\min_{X} \big\{ E_{\mathrm{map}}(X) + \lambda\, E_{\mathrm{ST}}(X) \big\}, \tag{23}$$

where $E_{\mathrm{map}}$ denotes the energy function defined in (8) and $\lambda$ is a balancing parameter.

3.4. Implementation Steps of the Proposed LBST-SR Algorithm

The LBST-SR algorithm implementation includes the following steps, as shown in Algorithm 4.

Algorithm 4. LBST-SR algorithm implementation steps are as follows:
Input. LR video sequence $Y$, scale amplification factor $s$, HR training dataset $T_h$, LR training dataset $T_l$, nonlocal search region size, local region size for similarity weight calculation, weight-controlling filter parameter $h$, and iteration scale $K$.
Output. The superresolved HR video sequence $\hat{X}$.
Training Process
Step 1. Sample the LR and HR patch sets $X_l$ and $X_h$ from the LR and HR training datasets $T_l$ and $T_h$, respectively.
Step 2. Train the LR-HR dictionary pair $\{D_l, D_h\}$ and the correlation mapping matrix $W$ by LR-HR correlation learning according to (3).
Superresolution Reconstruction Process
Step 1. Initialize the LR video sequence $Y$ using the Bicubic interpolator to obtain its HR initial estimation $X^0$.
Step 2. According to the learned dictionary pair $\{D_l, D_h\}$ and the LR-HR correlation mapping $W$, map each LR patch of the salient region of each video frame to its HR estimation using (8) and (9).
Step 3. Update the HR estimation using the improved spatiotemporal nonlocal similarity regularization constraint in (23).
Step 4. Iteratively refine the fusion result for further optimization. Update the counter, $k = k + 1$. If $k \le K$, return to Step 2; otherwise, end the process.
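The steps above can be organized as the following high-level Python sketch. `detect_salient_region`, `map_salient_patches`, and `spatiotemporal_nonlocal_fusion` are hypothetical placeholders standing in for the operations of Sections 3.2 and 3.3 (they are not defined in the paper), so the sketch shows the control flow of Algorithm 4 rather than a complete implementation.

```python
from scipy.ndimage import zoom

def lbst_sr(lr_frames, D_l, D_h, W, scale=3, n_iters=2):
    """Control-flow sketch of Algorithm 4 (helper functions are hypothetical)."""
    # Step 1: bicubic initialization of every LR frame
    hr_frames = [zoom(y, scale, order=3) for y in lr_frames]
    for _ in range(n_iters):                                  # Step 4: iterative refinement
        for t, y in enumerate(lr_frames):
            # Step 2: LR-HR correlation mapping on the salient object region only
            mask = detect_salient_region(y, scale)                        # hypothetical helper
            hr_frames[t] = map_salient_patches(y, hr_frames[t], mask,
                                               D_l, D_h, W)               # hypothetical helper
            # Step 3: spatiotemporal nonlocal similarity regularization, cf. (23)
            hr_frames[t] = spatiotemporal_nonlocal_fusion(hr_frames, t)   # hypothetical helper
    return hr_frames
```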

4. Experimental Results and Analysis

4.1. Experimental Dataset and Evaluation Indices

The experimental datasets in this paper consist of the benchmark video sequences taken from the http://trace.eas.asu.edu/yuv/index.html website and the spatial video sequences taken from the YOUKU website (http://www.youku.com/). The superresolution effects were validated in terms of both subjective visual evaluation and four objective quantitative indices: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), feature similarity (FSIM), and root-mean-square error (RMSE), which were calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big( f(i, j) - \hat{f}(i, j) \big)^2},$$

$$\mathrm{SSIM}(f, \hat{f}) = \frac{(2\mu_f \mu_{\hat{f}} + c_1)(2\sigma_{f\hat{f}} + c_2)}{(\mu_f^2 + \mu_{\hat{f}}^2 + c_1)(\sigma_f^2 + \sigma_{\hat{f}}^2 + c_2)},$$

$$\mathrm{FSIM}(f, \hat{f}) = \frac{\sum_{x \in \Omega} S_L(x)\, S_C(x)\, \mathrm{PC}_m(x)}{\sum_{x \in \Omega} \mathrm{PC}_m(x)},$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big( f(i, j) - \hat{f}(i, j) \big)^2},$$

where $M$ and $N$ denote the length and width of the video frame; $\hat{f}$ and $f$ denote the reconstructed frame and the original frame, respectively; $\mu_f$ and $\mu_{\hat{f}}$ are the means and $\sigma_f$ and $\sigma_{\hat{f}}$ are the standard deviations of the original and reconstructed frames; $\sigma_{f\hat{f}}$ is the covariance between the original and reconstructed frames; $c_1$ and $c_2$ are constants; $\Omega$ denotes the whole spatial domain of the video frame; $S_L(x)$ is a similarity measure of the phase congruency and gradient magnitude features between $f$ and $\hat{f}$; $S_C(x)$ is a chrominance similarity measure between $f$ and $\hat{f}$; and $\mathrm{PC}_m(x)$ is used to weight the importance of $S_L(x) S_C(x)$ in the overall similarity between $f$ and $\hat{f}$, where $S_L$, $S_C$, and $\mathrm{PC}_m$ are calculated according to [32]. The greater the PSNR is, the closer the reconstructed frame is to the original. The closer SSIM (0 ≤ SSIM ≤ 1) is to 1, the greater the similarity between the original and reconstructed frame structures. The closer FSIM (0 ≤ FSIM ≤ 1) is to 1, the greater the similarity between the original and reconstructed frame features. The smaller the RMSE is, the closer the reconstructed frame is to the original.
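PSNR, SSIM, and RMSE can be reproduced with standard library routines, as in the sketch below (an 8-bit intensity range is assumed); FSIM requires the phase-congruency machinery of [32] and is therefore omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(original, reconstructed):
    """PSNR, SSIM, and RMSE for one grayscale frame (8-bit range assumed)."""
    original = original.astype(float)
    reconstructed = reconstructed.astype(float)
    rmse = np.sqrt(np.mean((original - reconstructed) ** 2))
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
    ssim = structural_similarity(original, reconstructed, data_range=255)
    return psnr, ssim, rmse
```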

4.2. Experimental Results and Analysis

This section describes the experiments that were carried out to evaluate the performance of the proposed LBST-SR superresolution reconstruction algorithm and compares the results with five recently proposed representative state-of-the-art superresolution algorithms in terms of both visual quality and objective quantitative indices: the learning-based ANRSR [22], DPSR [30], and ScSR [27] algorithms, the 3D nonlocal mean-based NL-SR [14] algorithm, and the Zernike moment-based ZM-SR [16] algorithm. In the experiments, ten benchmark and two spatial video sequences were used: “Forman,” “Calendar,” “Coastguard,” “Suzie,” “Mother_Daughter,” “Miss_America,” “Ice,” “Football,” “Carphone,” “Akiyo,” “Satellite-1,” and “Satellite-2.” Based on their motion content, these video sequences are divided into three categories: “Calendar,” “Suzie,” “Mother_Daughter,” “Miss_America,” and “Akiyo” contain small-motion objects; “Forman,” “Coastguard,” “Carphone,” “Satellite-1,” and “Satellite-2” contain moderate-motion objects; and “Ice” and “Football” contain fast-motion objects. Some complex motion scenes exist in these dynamic sequences, such as local motion patterns and rotations. Each video sequence was decimated by a factor of 1:3 and then contaminated by additive Gaussian white noise. In the proposed LBST-SR algorithm, a 3D spatiotemporal region (temporal sliding window) was used for the similarity weight calculation in the nonlocal similarity matching process. Superresolution with a magnification factor of three was implemented in these experiments.

4.2.1. Objective Quantitative Evaluations

The average SSIM, PSNR, FSIM, and RMSE index values of the ANRSR, DPSR, ScSR, NL-SR, ZM-SR, and LBST-SR algorithms for the twelve video sequences are shown in Tables 1, 2, 3, and 4, respectively. Figures 3-7 show the PSNR, SSIM, and RMSE values of the six algorithms for the “Satellite-1,” “Satellite-2,” “Forman,” “Calendar,” and “Coastguard” sequences. The results indicate that, in most cases, the proposed LBST-SR algorithm yields better performance, with higher PSNR, SSIM, and FSIM values and smaller RMSE values than the other five algorithms. In only a few cases does the ZM-SR algorithm achieve slightly better results in terms of some indices than the proposed LBST-SR algorithm. Moreover, the SSIM and FSIM index values demonstrate that the results generated by the proposed LBST-SR algorithm are much closer to the original frames than those of the other five algorithms in terms of structural similarity and feature similarity, because LR-HR correlation mapping learning and spatiotemporal similarity can recover high-frequency details of video frames more accurately.

In terms of the time efficiency of the spatiotemporal similarity matching process for the ten benchmark video sequences and two spatial video sequences, the average time per video frame for the spatiotemporal nonlocal fuzzy registration scheme using PZM (ZFR) and the proposed improved nonlocal fuzzy registration scheme using PZM and structural similarity (ZSFR) is given in Table 5. Clearly, compared to the ZFR scheme, the proposed ZSFR scheme improves time efficiency significantly while guaranteeing the similarity matching effect. The reason lies mainly in the use of the self-adaptive regional correlation evaluation strategy based on regional average energy and regional structural similarity, which is an improvement over the ZFR scheme.

4.2.2. Subjective Visual Evaluations

Figure 8 shows the SR reconstruction visual effects of the six algorithms (ANRSR, DPSR, ScSR, NL-SR, ZM-SR, and LBST-SR) for Frame 6 of the “Forman” sequence, with the magnified local textures marked by the red rectangular box. The frame contains moderate-motion objects in the “Forman” sequence, such as local motions of the head and mouth and rotational motion of the eyes. By analyzing global and local detail effects (such as the regions around the eyes), it is clear that the proposed LBST-SR algorithm obtains a better visual effect than the other five algorithms. The learning-based ANRSR, DPSR, and ScSR algorithms produce annoying spot artifacts and unnatural visual effects in the face regions. Edge detail blurring is produced by the ZM-SR algorithm. Some annoying block artifacts are generated by the NL-SR algorithm, mainly because local complex motions influence the accuracy of nonlocal similarity matching and fusion between video frames. The proposed LBST-SR algorithm is able to solve this problem because its spatiotemporal similarity matching process can be adapted to complex motion patterns. In comparison, the proposed LBST-SR algorithm not only produces clearer edges and contours but also yields smoother effects in the face region.

The superresolved results for Frame 29 of the “Calendar” sequence are shown in Figure 9, with the magnified local textures marked by red and blue rectangular boxes. The results demonstrate that the proposed algorithm generates the best visual effects and produces clearer contours and details. The “Calendar” sequence contains complex object motions, including translation motion, occluded areas, and newly appearing object areas. The proposed algorithm still performed well under such complex motion scenes, benefitting mainly from the improved spatiotemporal nonlocal fuzzy registration scheme based on PZM feature and structural similarity, which is robust to complex motion scenes. The local magnified details indicate that ANRSR, DPSR, and ScSR algorithms introduce noticeable annoying artifacts around the edges of each number. ZM-SR algorithm shows some blurring effects. In the local detail area marked by the red rectangle, the quality of the proposed algorithm is comparable to the NL-SR algorithm, but in the magnified road details area marked by the blue rectangular box, the proposed algorithm produces smoother effects, whereas discontinuous edges and annoying block artifacts are generated in the NL-SR algorithm.

Figure 10 shows the superresolved results for Frame 18 and the magnified local details of the “Coastguard” sequence. The “Coastguard” sequence contains the complex backgrounds and motions of both object and camera. Moreover, complex motions such as translation, occluded areas, and newly appearing object areas exist in this sequence. Under such complex motion scenes, the proposed LBST-SR algorithm still performed better than the other five algorithms. As can be observed from the magnified local details marked by the red rectangle and details in the background regions, annoying black spots and block artifacts are generated in the ANRSR, DPSR, and ScSR algorithms. ZM-SR algorithm produces blurred edges and details, especially in the complex stone bank background area. Annoying block artifacts and discontinuous edges are generated in the NL-SR algorithm because its nonlocal similarity matching strategy cannot be well adapted to the complex motion scenes.

The superresolved results for Frame 1 of the “Akiyo” sequence are shown in Figure 11, with local details magnified to emphasize visual quality. The magnified visual effects of the face region marked by the red rectangle demonstrate that the proposed algorithm is superior to the other five algorithms and produces a more natural and smoother visual effect. The ANRSR, DPSR, and ScSR algorithms produce annoying artifacts in the face region and unnatural skin colors. The NL-SR algorithm produces block effects, and some blurring is generated by the ZM-SR algorithm.

The “Satellite-2” sequence contains local motions, light variation, and more object details. Figure 12 shows the superresolved results for Frame 3 of the “Satellite-2” sequence. The magnified details marked by the red rectangle demonstrate that the proposed LBST-SR algorithm generates the best visual effects, producing more natural visual effects and clearer details. Annoying black spot artifacts and unnatural visual effects are produced in the ANRSR, DPSR, and ScSR algorithms. Block artifacts and jagged effects are generated by the NL-SR algorithm, and ZM-SR algorithm produces blurred object details.

5. Conclusions

A novel learning-based algorithm to implement video SR reconstruction using spatiotemporal nonlocal similarity was proposed in this paper. On the basis of LR-HR correlation mapping, spatiotemporal nonlocal similarity structural redundancies were used to improve SR quality further. With the objective of improving algorithm efficiency while guaranteeing SR quality, LR-HR correlation mapping was performed only for the salient object region of the video frame, following which an improved spatiotemporal nonlocal fuzzy registration scheme was established for spatiotemporal similarity matching and fusion using the similarity weighting strategy based on pseudo-Zernike moment feature similarity and structural similarity and the self-adaptive regional correlation evaluation strategy. The proposed spatiotemporal nonlocal fuzzy registration scheme does not rely on accurate estimation of subpixel motion, and therefore it can be adapted to complex motion patterns and is robust to noise and rotation. Experimental results demonstrated that the proposed algorithm achieves competitive SR quality compared to other state-of-the-art algorithms in terms of both subjective and objective evaluations.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) 2012CB821200 (2012CB821206) and the National Natural Science Foundation of China (no. 61320106006).