Special Issue: Image Enlargement Using Multiple Sensors
Multiscale and Multitopic Sparse Representation for Multisensor Infrared Image Superresolution
Methods based on sparse coding have been successfully used in single-image superresolution (SR) reconstruction. However, traditional sparse representation-based SR reconstruction for infrared (IR) images usually suffers from three problems. First, IR images lack detailed information. Second, a traditional sparse dictionary is learned from patches with a fixed size, which may not capture the exact information of the images and ignores the fact that images naturally come at different scales. Finally, traditional sparse dictionary learning methods aim at learning a universal and overcomplete dictionary, but many different local structural patterns exist, and one dictionary is inadequate in capturing all of them. We propose a novel IR image SR method to overcome these problems. First, we combine the information from multiple sensors to improve the resolution of the IR image. Then, we use multiscale patches to represent the image in a more efficient manner. Finally, we partition the natural images into documents and group these documents to determine the inherent topics and to learn the sparse dictionary of each topic. Extensive experiments show that the proposed method yields better results, in both quantitative metrics and visual perception, than many state-of-the-art algorithms.
1. Introduction
High-resolution (HR) infrared (IR) images are desired in various electronic imaging applications, such as medical diagnosis, criminal investigation, surveillance, remote sensing, and aerospace. However, given the inherent limitation of relevant imaging devices or other factors, obtaining images at a desired resolution is difficult. Therefore, many efforts have been devoted to improving the spatial resolution of the IR image. Superresolution (SR) is one of the most promising methods in the research community.
At present, a large number of SR methods have been developed successfully. The existing methods for image SR can be divided into three general categories: interpolation-based methods [1, 2], reconstruction-based methods [3–6], and learning-based methods [7–11].
The interpolation-based [1, 2] scheme applies the correlation of neighboring image pixels to approximate the fundamental HR pixels. These types of methods can be easily implemented at a high speed. However, these methods may lead to the loss of detailed information.
Reconstruction-based approaches utilize additional information from low-resolution (LR) images to synthesize an HR image. These approaches are ill-posed estimation problems and require a priori information on images to regularize the solution. Therefore, various regularization methods have been proposed to improve the performance of SR reconstruction, such as projection onto convex sets, maximum a posteriori (MAP) estimation [4, 5], and regularization-based methods. Compared with interpolation-based schemes, the reconstruction-based methods deliver better performance when the desired magnification factor is small. However, the most common defect of multiframe SR reconstruction is that, as the magnification factor increases, the LR inputs cannot provide sufficient information to maintain a high-quality SR reconstruction result.
Learning-based methods presume that the high-frequency details lost in the LR image can be predicted by learning the cooccurrence relationship between LR training patches and their corresponding HR patches. Freeman et al. first introduced the learning idea for SR reconstruction, using a Markov random field model to learn the relationship between local regions of images and their underlying scenes. Various effective tools have been proposed to learn prior information, such as neighbor embedding- (NE-) based methods [8, 12], regression-based methods [9, 10], and sparse coding- (SC-) based methods [11, 13–15]. The NE-based methods estimate each desired HR image patch by linearly combining its neighbor training HR image patches. Chang et al. introduced locally linear embedding from manifold learning to the image SR task. Zhang et al. proposed a partially supervised NE method. However, given the lack of prior textures and details, NE-based methods are weak in recovering textures and details. The regression-based methods directly estimate the desired HR pixels using complicated statistical models. Wang and Tang proposed a principal component analysis-based SR reconstruction method to estimate the desired HR image. Wu et al. used the kernel partial least squares regression model to handle the one-to-many mapping problem. Wu’s method requires searching the neighbors in the entire training database and using the same number of principal components to synthesize the desired HR feature patches, which results in high computational costs. The SC-based SR method can better retain the most relevant reconstruction neighbors and can restore more image information than the other two learning-based approaches discussed above. Yang et al. proposed an approach based on sparse representation, with the assumption that the HR and LR images share the same set of sparse coefficients. Therefore, the HR image can be reconstructed by combining the trained HR dictionary and the sparse coefficients of the corresponding LR image.
The abovementioned SC-based SR methods suffer from three problems. First, owing to the inherent limitations of imaging devices and other factors, IR images lack detailed information, which leads to unsatisfactory IR image reconstruction results. Multiple images acquired by different sensors provide complementary information on the same scene. As such, a reasonable way to improve the resolution of the IR image is to combine the complementary information from images acquired by multiple sensors. Second, a traditional sparse dictionary is learned from patches with a fixed size, which cannot capture the exact information of the images. The local structures of an image tend to repeat themselves many times across natural images, not only within the same scale but also across different scales. Details missing in a local structure at a smaller scale can be estimated from similar patches at a larger scale. Moreover, different images prefer different patch sizes for optimal representation. Therefore, jointly representing an image at different scales is important. Considering these cues, we propose a simple model that generates pyramid images and divides them into multiscale patches, from which the dictionaries are learned. Finally, dictionary learning is a key issue of the sparse representation model, and considerable effort has been devoted to learning dictionaries from example image patches, leading to state-of-the-art results in image reconstruction. Many dictionary learning methods aim at learning a universal and overcomplete dictionary that represents various image structures. However, natural images contain a large number of different local structural patterns, and the contents can vary significantly across different images or across different patches in a single image. One dictionary is inadequate in capturing all of the different structures.
Multiple dictionaries [15, 17] are more effective in representing various contents in an image and provide better reconstruction results than one universal dictionary. Based on these observations, our algorithm categorizes the training patches into multiple groups based on their visual characteristics, and a subdictionary is then learned for each group. Unsuitable training sample groups used in dictionary learning lead to artifacts in example learning-based methods. In this study, we apply the probabilistic latent semantic analysis (pLSA) model to group the patches into several categories and to determine the inherent topics; each category corresponds to a topic. We then learn the sparse dictionary for each topic. Our framework treats each group individually, thereby leading to dictionaries that model the patch distribution more accurately. We conduct semantic analysis on a given patch to assign it to a topic. The given patch can be better represented by the selected topic subdictionary. Thus, the entire image can be more accurately reconstructed using this method than using a universal dictionary, as validated by our experiments.
In summary, this study makes the following three main contributions:
(1) IR images lack detailed information, whereas visible (VI) images contain abundant object edges and details and provide a more perceptual description of a scene for human eyes. This study combines the complementary information from images acquired by multiple sensors to improve the resolution of the IR image.
(2) To learn a sparse dictionary that represents similar redundancies of local patterns within the same scale and across different scales, this study builds pyramid images by downsampling the images and then divides the pyramid images into multiscale patches, thereby representing the image in a more efficient manner and providing a more global look of the image.
(3) The pLSA model is applied to determine the inherent topics and to group the training patches with similar patterns. Each dictionary is learned from example patches with the same topic, and multiple dictionaries are learned simultaneously.
Extensive experimental results show that our proposed method achieves competitive performance compared to state-of-the-art methods.
2. The Proposed SR Scheme
This study proposes a novel sparse representation algorithm, which aims to combine the information of visible images, provide a more global look of the IR image, and simultaneously utilize the inherent topics of IR images in a unified framework. The proposed method can be divided into three steps: (a) combining the information of images from multisensors, (b) obtaining multiscale patches, and (c) learning multitopic sparse dictionaries. In combining the information of visible images, our framework improves the resolution of the IR image when learning the LR sparse dictionary. In obtaining multiscale patches, we build pyramid images and extract multiscale patches from such images, which can provide a more global look of the images. In presenting different structural patterns more accurately, we partition the natural images into documents and group them to determine the inherent topics using the pLSA. A compact subdictionary can then be learned for each topic.
2.1. Combining the Information of Multisensors
Given an observed LR IR image $Y$, which is a downsampled and blurred version of the HR image $X$ of the same scene, we have
$$Y = SHX,$$
where $S$ denotes a downsampling operator and $H$ is a blurring filter. The goal of single-image SR is to reconstruct the HR image $X$ from the LR image $Y$ as accurately as possible.
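As an illustration of this degradation model, the following minimal sketch blurs an HR image and then keeps every third pixel, which corresponds to applying a blurring filter followed by a downsampling operator. The Gaussian kernel, its size and width, and the function names are assumptions made for the example; the text does not specify the blur actually used.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel (assumed form of the blur; size must be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur(image, kernel):
    """Filter the image with the kernel, replicating the border pixels."""
    kh, kw = kernel.shape
    h, w = image.shape
    padded = np.pad(image.astype(float), ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros((h, w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def degrade(hr, factor=3, size=5, sigma=1.0):
    """Blur the HR image, then keep every `factor`-th pixel in each direction."""
    return blur(hr, gaussian_kernel(size, sigma))[::factor, ::factor]

hr = np.random.default_rng(0).random((30, 30))  # stand-in for an HR IR image
lr = degrade(hr, factor=3)
print(lr.shape)  # (10, 10)
```

The factor of 3 matches the scale factor used in the experiments; all other parameters are placeholders.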
Given the LR image $Y$, let $\{y_k\}$ be the set of patch features extracted from $Y$:
$$y_k = F_k Y,$$
where $F_k$ is an operator that extracts the feature of patch $k$ from image $Y$.
Images acquired by multiple sensors provide complementary information on the same scene. IR images lack detailed information, whereas visible (VI) images contain abundant object edges and details and provide a more perceptual description of a scene for human eyes. As such, combining the detailed information in visible images to improve the resolution of the IR image is reasonable; that is, the information of an LR IR image and the information of the corresponding HR visible image are used for reconstructing an HR IR image.
Applying these four filters, we obtain four description feature vectors for each patch of the LR IR image and its corresponding HR visible image, which are concatenated as one vector in the final gradient representation of the LR patch. The information of the LR IR image and the information of its corresponding HR visible image are combined together to learn the LR sparse dictionary.
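The feature construction described above can be sketched as follows. The four filters themselves are not reproduced in this text, so the sketch assumes the first- and second-order derivative filters that are common in sparse coding-based SR, namely $[-1, 0, 1]$, $[1, 0, -2, 0, 1]$, and their transposes. It also assumes that the LR IR image has already been interpolated to the grid of the visible image so that patches align. All function names are hypothetical.

```python
import numpy as np

# ASSUMPTION: these filters are a common choice in sparse-coding SR; the text does not list them.
F1 = np.array([-1.0, 0.0, 1.0])            # first-order derivative
F3 = np.array([1.0, 0.0, -2.0, 0.0, 1.0])  # second-order derivative

def filter_rows(img, f):
    """Correlate every row of img with the 1-D filter f (zero padding, same size)."""
    pad = len(f) // 2
    p = np.pad(img, ((0, 0), (pad, pad)))
    return sum(f[k] * p[:, k:k + img.shape[1]] for k in range(len(f)))

def gradient_maps(img):
    """Four gradient maps: horizontal and vertical, first and second order."""
    img = np.asarray(img, dtype=float)
    return [filter_rows(img, F1), filter_rows(img.T, F1).T,
            filter_rows(img, F3), filter_rows(img.T, F3).T]

def patch_feature(ir_maps, vi_maps, top, left, size):
    """Concatenate the IR and visible gradient patches into one feature vector."""
    parts = [m[top:top + size, left:left + size].ravel() for m in ir_maps + vi_maps]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
ir_up = rng.random((12, 12))  # stand-in: LR IR image interpolated to the visible grid
vi = rng.random((12, 12))     # stand-in: HR visible image
feat = patch_feature(gradient_maps(ir_up), gradient_maps(vi), top=4, left=4, size=4)
print(feat.shape)  # (128,)
```

Each 4 × 4 patch contributes four gradient vectors per sensor, so the concatenated feature has 2 × 4 × 16 = 128 entries.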
With the sparse generative model, each patch feature $y_k$ can be projected over the LR dictionary $D_l$, which characterizes the LR patches. This projection produces a sparse representation of $y_k$ via $\alpha_k$, expressed as follows:
$$y_k = D_l \alpha_k,$$
where $\alpha_k$ denotes the sparse representation coefficients over the dictionary atoms. For the HR IR image, high-frequency information is used to represent the HR patch. The corresponding set of HR patch features $\{x_k\}$ is extracted from the HR image $X$ as follows:
$$x_k = F_k^{h} X,$$
where $F_k^{h}$ extracts the high-frequency feature of patch $k$ from $X$.
Reapplying the sparse generative model, we have
$$x_k = D_h \alpha_k,$$
where $D_h$ is the HR dictionary that characterizes the HR patches and is coupled with $D_l$ in the sense that each atom in $D_h$ has its corresponding LR version in $D_l$ and vice versa. We assume that the sparse representation of an LR patch in terms of $D_l$ can be directly used to recover the corresponding HR patch from $D_h$; namely, $\hat{x}_k = D_h \alpha_k$. The process of sparse representation-based SR by combining the information of visible images is described in Figure 1.
As such, the reconstructed HR image $\hat{X}$ can be built by computing the sparse representation $\alpha_k$ of each $y_k$ over $D_l$ and then using the estimated $\alpha_k$ with $D_h$ to obtain each $\hat{x}_k$; the patches $\hat{x}_k$ together form the image $\hat{X}$.
The SC is clearly a bridge between the LR and HR patches. The dictionaries $D_l$ and $D_h$ have a key role in generating such SC. They can be generated from a set of samples using algorithms such as K-SVD and efficient SC [13, 14, 17, 22].
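The coupling between the LR and HR dictionaries can be illustrated with a small sketch: an LR feature is sparsely coded over the LR dictionary, and the same coefficients are applied to the HR dictionary. Orthogonal matching pursuit is used here only as a simple stand-in sparse solver, and the random dictionaries are placeholders for dictionaries trained with K-SVD or efficient SC; neither choice is taken from the paper.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Greedy orthogonal matching pursuit: find a sparse alpha with y ≈ D @ alpha."""
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with the residual
        if k in support:
            break
        support.append(k)
        # Refit all selected atoms by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

rng = np.random.default_rng(0)
n_atoms = 50
D_l = rng.standard_normal((20, n_atoms))  # placeholder LR dictionary
D_l /= np.linalg.norm(D_l, axis=0)        # unit-norm atoms
D_h = rng.standard_normal((36, n_atoms))  # placeholder HR dictionary, same atom order

y = rng.standard_normal(20)               # stand-in LR patch feature
alpha = omp(D_l, y, n_nonzero=3)          # sparse code over the LR dictionary
x_hat = D_h @ alpha                       # HR patch estimate from the shared code
```

The key design point is that the two dictionaries share one column ordering, so a coefficient on the $j$th LR atom selects the $j$th HR atom.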
2.2. Obtaining Multiscale Patches
Different images have been observed to prefer different patch sizes for optimal performance. Oversmoothing artifacts have even been reported when unsuitable patch sizes are used. An explanation for this phenomenon is that dictionary learning from patches with a fixed size cannot capture the exact information of the images. One size of the sample patches corresponds to one scale. However, selecting the exact patch size for an image is difficult. As such, having a multiscale dictionary avoids selecting the patch size in advance. A multiscale treatment can help represent the image in a more efficient manner. In our proposed multiscale framework, we focus on simultaneously obtaining the multiscale patches. First, pyramid images are built by downsampling the images, so that the sparse dictionary can represent similar redundancies of local patterns within the same scale and across different scales. Second, multiscale patches are extracted from the pyramid images.
Pyramid transform is an effective multiresolution analysis approach. During pyramid transform, each pixel in the low spatial pyramid is obtained by downsampling from its adjacent low-pass filtered HR image. Sequential pyramid images are constructed, as shown in Figure 2. Pyramid images can be generated by Gaussian smooth filtering, as shown in Figure 3.
Let $I$ denote the original image. The downsampled version $I_l$ at the $l$th level is obtained by convolving $I$ with a Gaussian kernel $G_l$, as follows:
$$I_l = (I * G_l)\downarrow_{s_l},$$
where $\downarrow_{s_l}$ denotes the downsampling operator, with the factor $s_l$ at the $l$th level.
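A minimal sketch of this pyramid construction is given below: each level is obtained by low-pass filtering the original image with a Gaussian kernel and then subsampling. The step of 2 between levels and the rule that the Gaussian width grows with the downsampling factor are assumptions made for the example, as are the function names.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian low-pass filter with edge-replicated borders."""
    radius = max(1, int(3 * sigma + 0.5))
    ax = np.arange(-radius, radius + 1)
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()
    h, w = img.shape
    p = np.pad(img.astype(float), radius, mode="edge")
    rows = sum(g[k] * p[:, k:k + w] for k in range(g.size))     # filter along x
    return sum(g[k] * rows[k:k + h, :] for k in range(g.size))  # filter along y

def build_pyramid(img, levels=3, step=2, sigma=0.8):
    """Level 0 is the input; level l is the blurred input subsampled by step**l."""
    pyramid = [img.astype(float)]
    for level in range(1, levels):
        factor = step ** level
        # ASSUMPTION: the blur width scales with the downsampling factor.
        smooth = gaussian_blur(img, sigma * factor)
        pyramid.append(smooth[::factor, ::factor])
    return pyramid

image = np.random.default_rng(0).random((32, 32))  # stand-in image
pyramid = build_pyramid(image, levels=3)
print([level.shape for level in pyramid])  # [(32, 32), (16, 16), (8, 8)]
```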
After generating the pyramid images, we use the quadtree model to extract multiscale patches from the pyramid images, as shown in Figure 4. We consider a set of large root patches of size $N \times N$ extracted from the sequential pyramid images. Each root patch is then divided into subpatches of side length $N/2^{s}$ along the tree, where $s$ is the depth in the tree. After obtaining the multiscale patches, we can learn dictionaries from the patches of different scales. Figure 5 illustrates the process of extracting multiscale patches from the pyramid images.
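The quadtree extraction can be sketched as follows. The sketch assumes that a tree of depth 3 on a 16 × 16 root patch yields patches of side 16, 8, and 4, which is consistent with the patch sizes compared in Experiment 3 but is not stated explicitly; the stride between root patches and the function names are also assumptions.

```python
import numpy as np

def root_patches(img, size=16, stride=8):
    """Extract overlapping root patches on a regular grid."""
    h, w = img.shape
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]

def quadtree_split(root, depth=3):
    """Return {side length: list of patches} for each level of the quadtree."""
    levels = {}
    current = [root]
    for _ in range(depth):
        size = current[0].shape[0]
        levels[size] = current
        half = size // 2
        # Split every patch into its four quadrants (row-major order).
        current = [p[r:r + half, c:c + half]
                   for p in current for r in (0, half) for c in (0, half)]
    return levels

image = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in pyramid level
roots = root_patches(image)
tree = quadtree_split(roots[0], depth=3)
print({size: len(patches) for size, patches in tree.items()})  # {16: 1, 8: 4, 4: 16}
```

In a full pipeline the same split would be applied to the root patches of every pyramid level, and one dictionary would be learned per patch size.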
2.3. Learning the Multitopic Dictionary
We partition the natural images into documents and group them to determine the inherent topics using pLSA, so that the different structural patterns can be represented more accurately. Each dictionary is learned from example patches with the same topic, and multiple dictionaries are learned simultaneously. The example image patches are classified into topics by the pLSA model. Given that each topic consists of many patches with similar patterns, a compact subdictionary can be learned for each topic. For an image patch to be coded, the subdictionary that is most relevant to the given patch is selected. Considering that the given patch can be better represented by the selected subdictionary, the entire image can be more accurately reconstructed than when a universal dictionary is used, as validated by our experiments. The use of multitopic dictionary learning has two main advantages: (1) the training patches are divided into topics, which ensures that each subdictionary represents the statistical model of its example patches more accurately, and (2) learning on each topic separately speeds up dictionary learning and improves the final reconstruction accuracy through the transfer of knowledge between topics.
2.3.1. Standard pLSA
The pLSA, which is an extension of LSA, provides a probabilistic formulation to model documents in a text collection. The pLSA assumes that the words are generated from a mixture of latent aspects, which can be decomposed from a document. The pLSA model has been used successfully in image classification, image retrieval, and image annotation. The pLSA model ignores the order of words in a document and instead uses the counts of words occurring in a document. We briefly outline the principle of the pLSA in this subsection. More details can be found in the pLSA literature.
A corpus that contains $N$ documents is denoted by $D = \{d_1, \ldots, d_N\}$, and each document is represented with the count of its words from a vocabulary $W = \{w_1, \ldots, w_M\}$. The entire corpus is summarized by the $N \times M$ cooccurrence matrix, where each entry $n(d_i, w_j)$ indicates the count of the word $w_j$ in the document $d_i$. In the framework of the pLSA, the observed word $w_j$ is conditionally independent of the document $d_i$ given a latent variable $z_k$, $k = 1, \ldots, K$, which is referred to as the “latent aspect.” The graphical model shown in Figure 6(a) illustrates the form of the joint probability of $(d_i, w_j)$ in the pLSA model. The joint probability of the observed variables is obtained by marginalizing over the latent aspect $z_k$:
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i). \quad (6)$$
Equation (6) expresses each document as a convex combination of aspect vectors, which results in matrix decomposition, as shown in Figure 6(b). Each document is essentially modeled as a mixture of aspects, the histogram for a particular document being composed of a mixture of the histograms corresponding to each aspect.
The model parameters of pLSA are the two conditional distributions $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$, which are estimated using the expectation-maximization (EM) algorithm on a set of training documents. $P(w_j \mid z_k)$ characterizes each aspect and remains valid for documents out of the training set. By contrast, $P(z_k \mid d_i)$ is relative only to the specific documents and cannot carry any prior information to an unseen document.
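The EM algorithm is used to compute the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ by maximizing the log-likelihood of the observed data:
$$L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j).$$
The steps of the EM algorithm are described as follows:
E-step: the conditional distribution $P(z_k \mid d_i, w_j)$ is computed from the previous estimate of the parameters:
$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)}.$$
M-step: the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ are updated with the new expected value $P(z_k \mid d_i, w_j)$:
$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m) P(z_k \mid d_i, w_m)}, \qquad P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)}{n(d_i)},$$
where $n(d_i) = \sum_{j=1}^{M} n(d_i, w_j)$ is the total word count of document $d_i$.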
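The EM iteration for pLSA can be sketched compactly in NumPy. The sketch follows the standard pLSA updates; the random initialization, the fixed number of iterations, the small constants that guard against division by zero, and the toy count matrix are assumptions made for the example.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. counts[d, w] = n(d, w). Returns P(w|z) (K x M) and P(z|d) (N x K)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), proportional to P(w|z) P(z|d); shape (N, K, M).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate both distributions from n(d,w) P(z|d,w).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_z, p_z_d

def log_likelihood(counts, p_w_z, p_z_d):
    """Sum of n(d,w) log P(w|d); the P(d) term is constant and omitted."""
    p_w_d = p_z_d @ p_w_z
    return float(np.sum(counts * np.log(np.maximum(p_w_d, 1e-12))))

# Toy corpus: 6 documents over an 8-word vocabulary.
counts = np.random.default_rng(1).integers(1, 6, size=(6, 8)).astype(float)
p_w_z, p_z_d = plsa(counts, n_topics=2)
topics = np.argmax(p_z_d, axis=1)  # most probable aspect per document
```

The three-dimensional posterior array is convenient for small problems; for a vocabulary of 1,000 atoms and many documents, a loop over aspects would use far less memory.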
2.3.2. Our Method
Given a collection of IR images, we intend to determine the inherent topics of the images. We use general terms, such as topics, documents, and words, which are mostly used in the text analysis literature. In our application, we define the atoms of the sparse dictionary as the “words” of the vocabulary and a sliding window on the image as the “document.” Each sliding window consists of several patches. Figure 7 shows the sliding window (large blue square) and one patch (small red square) in it. All of the documents are grouped by “topic” based on the cooccurrences of different words within and across the documents. Our method has the following five steps: vocabulary formulation, document representation, topic learning, subdictionary construction, and superresolution image reconstruction (SRIR). Our method is illustrated in Figure 8.
Vocabulary Formulation. We need to represent each document by a collection of words from a vocabulary. A general sparse dictionary is learned over all of the patches to construct the vocabulary. Each atom of this dictionary is defined as a word $w_j$ of the vocabulary, and all of its atoms together form the vocabulary for the pLSA model.
Document Representation. Each document $d_i$ contains several patches. We represent each patch in the document as a sparse linear combination of atoms from the general dictionary and record which atoms are used to represent it. Counting, over all patches of $d_i$, how often each atom $w_j$ is used gives the word count $n(d_i, w_j)$. We then use the pLSA model to learn the latent topics of the documents.
Topic Learning. All of the documents can be summarized by the cooccurrence matrix, where each entry $n(d_i, w_j)$ indicates the count of the word $w_j$ in document $d_i$ and $N$ is the total number of documents. The EM algorithm is used to compute the parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ by maximizing the log-likelihood of the observed data. After learning, $P(z_k \mid d_i)$ represents the mixture proportions of each document. For each document $d_i$, the topic $z_k$ with the maximum value of $P(z_k \mid d_i)$ is taken as the topic assignment of the document.
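The document representation and the topic assignment can be illustrated as follows. The sketch assumes that an atom counts as “used” by a patch whenever its sparse coefficient is nonzero; the exact counting rule is not recoverable from the text, and the function names and toy values are hypothetical.

```python
import numpy as np

def document_counts(codes_per_doc, n_atoms, tol=1e-8):
    """Build the document-word count matrix from sparse codes.

    codes_per_doc[d] has shape (patches in document d, n_atoms).
    ASSUMPTION: an atom is counted once for every patch that uses it.
    """
    counts = np.zeros((len(codes_per_doc), n_atoms))
    for d, codes in enumerate(codes_per_doc):
        counts[d] = (np.abs(codes) > tol).sum(axis=0)
    return counts

def assign_topics(p_z_d):
    """Assign each document to the topic with the largest mixture proportion."""
    return np.argmax(p_z_d, axis=1)

# Two toy documents with two patches each, over a 4-atom vocabulary.
doc0 = np.array([[0.5, 0.0, 0.0, 0.0], [0.2, -0.1, 0.0, 0.0]])
doc1 = np.array([[0.0, 0.0, 0.7, 0.0], [0.0, 0.0, 0.3, 0.9]])
counts = document_counts([doc0, doc1], n_atoms=4)
print(counts.tolist())  # [[2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 1.0]]

p_z_d = np.array([[0.9, 0.1], [0.2, 0.8]])  # stand-in for the learned proportions
print(assign_topics(p_z_d).tolist())  # [0, 1]
```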
Subdictionary Construction. We assume $K$ determined topics. All of the documents are then classified into $K$ groups $G_1, \ldots, G_K$. For each document group $G_k$, we collect all of the patches that belong to its documents and denote this patch set as $S_k$. As such, we obtain the patch sets $S_1, \ldots, S_K$, from which we aim to learn $K$ compact subdictionaries $D_1, \ldots, D_K$. The patches in each $S_k$ are expected to share the same distinctive patterns. We apply sparse dictionary learning to each $S_k$ to obtain the subdictionary $D_k$ of the corresponding topic, such that the most suitable subdictionary for each given local image patch can be selected using the pLSA model.
SRIR. We divide the LR image into overlapping documents and the documents into overlapping patches. Then, we represent each document in the same manner as during topic discovery. Each document is analyzed by using the EM algorithm to determine its topic assignment. Each patch of a document is reconstructed by using the subdictionary that corresponds to the topic of the document. We do this for all of the documents in the test image and then average all overlapping portions to obtain the reconstructed HR image.
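The final averaging of overlapping portions can be sketched as follows. Patches are accumulated at their positions and each pixel is divided by the number of patches that cover it. The patch size, the stride, and the function names are assumptions for the example, and the patches here are copied from a test image rather than reconstructed from subdictionaries.

```python
import numpy as np

def extract_overlapping(img, size, stride):
    """Return overlapping patches and their top-left positions."""
    h, w = img.shape
    positions = [(r, c)
                 for r in range(0, h - size + 1, stride)
                 for c in range(0, w - size + 1, stride)]
    patches = [img[r:r + size, c:c + size].copy() for r, c in positions]
    return patches, positions

def average_overlaps(patches, positions, shape):
    """Place patches at their positions and average wherever they overlap."""
    acc = np.zeros(shape)
    weight = np.zeros(shape)
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape
        acc[r:r + h, c:c + w] += patch
        weight[r:r + h, c:c + w] += 1.0
    return acc / np.maximum(weight, 1.0)  # uncovered pixels stay at zero

image = np.random.default_rng(0).random((8, 8))
patches, positions = extract_overlapping(image, size=4, stride=2)
rebuilt = average_overlaps(patches, positions, image.shape)
print(np.allclose(rebuilt, image))  # True
```

Because the patches in this example are exact copies, averaging reproduces the input; with reconstructed patches, the averaging smooths inconsistencies between neighboring estimates.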
3. Experimental Results
3.1. Samples and Settings
In our experiments, the IR images and corresponding visible images were obtained from http://www.dgp.toronto.edu/~nmorris/data/IRData/. Samples of the training images are shown in Figure 9. The LR images used in all the experiments were generated by downsampling the corresponding HR images with a scale factor of 3.
We employed the peak signal-to-noise ratio (PSNR) and the structural similarity measurement (SSIM) to evaluate the superresolved image and assess the performance of the proposed method. The mean values of the PSNR and SSIM over all of the test images were used as the quality index. The PSNR evaluates the reconstruction quality based on the pixel intensity. The SSIM measures the similarity between two images based on their structural information. The SSIM metric needs a “perfect” reference image for comparison and provides a normalized value between 0 and 1, where “0” indicates that the two images are totally different, whereas “1” confirms that the two images are the same. Thus, higher values of PSNR and SSIM indicate a result with better quality.
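For reference, the PSNR can be computed as in the following minimal sketch, which assumes 8-bit images with a peak value of 255. SSIM is omitted here because it requires local windowed statistics.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and an estimate."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimate, dtype=float)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4))
print(psnr(ref, ref))        # inf
print(psnr(ref, ref + 255))  # 0.0
```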
3.2. Reconstruction Results
In this section, we conduct several experiments to evaluate the effectiveness of the proposed method.
Experiment 1 (comparison with the state-of-the-art algorithms). The proposed method was tested on several IR images to validate the effectiveness of the proposed resolution enhancement method in terms of visual fidelity and objective criteria. We compare our algorithm with some well-known image SR algorithms, such as the nearest neighbor method, the cubic B-spline interpolation method, and Yang’s method, to validate the efficiency of our method. In our method, the root patch size is 16 × 16, the depth of the tree is 3, and the number of training patches in the training process is 100,000. For the multitopic dictionary, the number of atoms in the general dictionary is 1,000. The number of atoms is the same in the multitopic dictionaries. The number of topics is fixed in advance. For Yang’s method, the number of atoms is 1,000. We present the SR results of images (with a scale factor of 3) obtained using different methods in Figure 10. We magnify the region within the red box to show the details after SR. We observe that the bicubic interpolation method blurs the edges and misses some fine details in the reconstructed images. Yang’s method recovers a significant number of details but produces many jagged and ringing artifacts along edges and details. The proposed method obtains better visual quality than all of the other three competing methods.
Moreover, the PSNR and SSIM values of the SR results on LR images using various algorithms are listed in Table 1. The proposed method attains higher average PSNR and SSIM than Yang’s method and the bicubic interpolation method, which shows that the SR results from the proposed method have better objective quality.
Experiment 2 (effect of multisensor). To validate the effectiveness of combining the information of visible images, we compared the multisensor SRIR with a traditional SRIR algorithm, namely, Yang’s method. The number of training patches in the training process is 100,000. For the dictionary learning step, the number of atoms in the dictionary is 1,000. Figure 11 shows the SR results of the IR image. Figure 11(c) shows the results of the traditional SRIR algorithm (Yang’s method), where severely jagged artifacts along the edges and distracting details are produced. The SR result is limited. Figure 11(d) shows the results of combining the information of visible images. We observe that the result is significantly improved qualitatively and quantitatively. The PSNR and SSIM values of the SR results on LR images using various algorithms are listed in Table 2.
Experiment 3 (effect of multiscale patches). We compare the SR results obtained from the dictionaries using multiscale patches and one fixed-scale patch. In the multiscale patches-based method, the root patch size is 16 × 16. In the fixed-scale case, three different patch sizes, 4 × 4, 8 × 8, and 16 × 16, are analyzed. The number of training patches in the training process is 100,000. For the dictionary learning step, the number of atoms in the dictionary is 1,000. The reconstruction results are shown in Figure 12. We observe that different images prefer different patch sizes for optimal performance. The multiscale treatment can help represent the image in a more efficient manner and provides a more global look of the image. The reconstructed HR images obtained from the multiscale patches-based method, as shown in Figure 12(f), are better in terms of quantitative metrics and visual perception than those obtained from the single-scale patches-based methods, as shown in Figures 12(c) to 12(e). The PSNR and SSIM values of the SR results on LR images using various algorithms are listed in Table 3.
4. Conclusion
We proposed a novel sparse representation-based image SR method. The algorithm combines detailed information in visible images to improve the resolution of the IR image. Given the complementary nature of these types of information, the proposed method can generate state-of-the-art results in SR tasks. Considering the fact that the optimal sparse domains of natural images can vary significantly across different images and different image patches in a single image, the proposed method uses a simple model that generates pyramid images and divides the pyramid images into multiscale patches to represent the image in a more efficient manner. We also partition the natural images into documents and group the documents to determine the inherent topics using pLSA and to learn the sparse dictionary of each topic using the sparse dictionary learning technique. Extensive experimental results show that our proposed method can achieve competitive performance compared to state-of-the-art methods.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research is sponsored by the National Natural Science Foundation of China (nos. 61271330 and 61411140248), the Research Fund for the Doctoral Program of Higher Education (no. 20130181120005), the National Science Foundation for Postdoctoral Scientists of China (no. 2014M552357), the Science and Technology Plan of Sichuan Province (no. 2014GZ0005), and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
References
T. S. Huang and R. Y. Tsai, “Multi-frame image restoration and registration,” Advances in Computer Vision and Image Processing, vol. 1, pp. 317–339, 1984.
J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,” in Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), pp. 1033–1040, December 2008.
P. Purkait and B. Chanda, “Image upscaling using multiple dictionaries of natural image patches,” in Computer Vision—ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5–9, 2012, Revised Selected Papers, Part III, vol. 7726 of Lecture Notes in Computer Science, pp. 284–295, Springer, Berlin, Germany, 2013.