#### Abstract

A novel parallel tool for large-scale image enhancement/reconstruction and postprocessing of radar/SAR sensor systems is addressed. The proposed parallel tool performs the following intelligent processing steps: image formation, for the application of different system-level effects of image degradation with a particular remote sensing (RS) system and simulation of random noising effects, enhancement/reconstruction by employing nonparametric robust high-resolution techniques, and image postprocessing using the fuzzy anisotropic diffusion technique which incorporates a better edge-preserving noise removal effect and faster diffusion process. This innovative tool allows the processing of high-resolution images provided with different radar/SAR sensor systems as required by RS endusers for environmental monitoring, risk prevention, and resource management. To verify the performance implementation of the proposed parallel framework, the processing steps are developed and specifically tested on graphic processing units (GPU), achieving considerable speedups compared to the serial version of the same techniques implemented in C language.

#### 1. Introduction

The amount of data acquired by imaging satellites has been growing steadily in recent years. Many techniques of parallel computing and distributed systems are used by such imaging systems in novel remote sensing (RS) applications, which require timely responses for swift decisions. Relevant examples include monitoring of natural disasters like earthquakes and floods, military applications, tracking of man-induced hazards, forest fires, oil spills, and other types of biological agents. In addition, the acquisition of large-scale RS images with radar/SAR systems collects huge data of information [1]. This amount of data requires significant high computationally resources for applying image processing techniques in order to improve the quality of the images. Therefore, one solution to achieve the required computationally demanded issue goes through parallel approaches in a high performance computing (HPC) sense using PC clusters, grids, and clouds, among others [2]. In this regard, significant efforts have been realized to develop parallel tools for processing remotely sensed data [3–5]. In [3], Bernabé et al. present a classification for remotely sensing imaginary tool which implements the well-known unsupervised k-means and the ISODATA clustering techniques with Graphic Processing Units (GPU). However, this tool is focused only on classification issues. Another RS tool, developed by Shkvarko et al. [4], presents a simulation software for intelligent postprocessing of large scale RS imaginary. However, only computer simulations of the enhancement/reconstruction regularization of RS images, without parallel implementations, were carried out. On the other hand, an end-to-end image simulation tool for space and airborne imaging systems, called PICASO, was developed in [5]. This tool seeks to optimize image quality within real world constrains, but the algorithms are not implemented in a parallel scheme.

In this paper, a new parallel tool for processing remotely sensed images is addressed. The proposed parallel tool performs the following processing steps: image enhancement/reconstruction and postprocessing. First, the image formation technique applies different system-level effects in order to degrade the images (i.e., along the range and azimuth directions) and adding random noising effects. Second, the high-resolution enhancement/reconstruction of the power spatial spectrum pattern (SSP) of the wave field scattered from the extended remotely sensed scene is employed via the descriptive regularization approach with the Robust Space Filter (RSF) and the Robust Adaptive Space Filter (RASF) algorithms [6]. Third, the fuzzy anisotropic diffusion post-processing technique is applied. This last step is performed via the modification of the well-known Perona-Malik (PM) anisotropic diffusion technique [7] via the incorporation of fuzzy logic sets. The *model-free* fuzzy anisotropic diffusion post-processing technique, the “*model-based*” enhancement/reconstruction regularization algorithms, and the image formation step are unified into a new tool for providing intelligent processing of RS imagery.

The set of steps described above will provide to the end user a tool with the aim of improving the quality of the RS images acquired from radar/SAR systems; however, there is an important drawback for its high computational cost. Therefore, the proposed parallel tool is developed using GPU. Nowadays, these specialized hardware devices have evolved into a highly parallel, multithreaded, many-core processors with tremendous computational speed and very high memory bandwidth [2]. As a result, the proposed framework provides very powerful data processing and analysis capabilities coupled to the parallel computing and data resources in a manner that is transparent to the user.

The rest of the paper is organized as follows. In Section 2, a brief summary of the intelligent data processing steps employed in the proposed system is presented. The GPU-based data processing techniques for the parallel algorithm implementation is performed and detailed in Section 3. In Section 4, the results of the proposed approach and the performance analysis with the GPU platform are presented. The application of two case studies for the intelligent data processing of large-scale images with the proposed parallel tool is included in order to verify the accuracy and the time performance of the system. Finally, the concluding remarks are presented in Section 5.

#### 2. Intelligent Data Processing Design Steps for Remote Sensing Imaging Problems

In this section, we present a brief overview of the general processing chain employed in this study. The addressed methodology for real-time formation/enhancement/reconstruction/post-processing of the RS imagery acquired with radar/SAR systems is described. Figure 1 presents the design flow methodology considered in this work.

##### 2.1. Image Formation

In this section, we present the summary of the RS imaging problem that was previously developed in [6, 8]. Let us consider the measurement data wave field modeled as a superposition of the echo signals and additive noise assumed to be available for observations and recordings within the prescribed time-space observation domain , where defines the time-space points in the observation domain . The model of observation wave field is specified by the linear stochastic equation of observation (EO) with operator form [9]: ; ; ; . ?Next, we take into account the conventional finite-dimensional vector form approximation [6, 8, 10] of the continuous-form EO, where the data acquisition model is defined by a set of equations as
for methods/systems to be aggregated/fused, that is, , where represent the system/method degradation operators usually referred to as the imaging system point spread functions (PSF), and vector represents the noise in the actually acquired image, respectively. The mean of the random scattering vector **e **has a statistical meaning of the average power scattering function traditionally referred as the spatial spectrum pattern (SSP), where the asterisk indicates the complex conjugate. This SSP is a second-order statistics of the scattered field that represent the brightness reflectivity of the image scene, represented in a conventional pixel format over the rectangular scene frame [9].

##### 2.2. Image Enhancement/Reconstruction

In this stage, we estimate as a discrete-form representation of the desired SSP over the pixel-formatted object scene remotely sensed with an employed array radar/SAR. Thus, one can seek to estimate given the data correlation matrix preestimated by some means, for example, via averaging the correlations over independent snapshots [11, 12]; , and by determining the solution operator that we also refer to as the signal image formation operator (SO) **G** such that
Such image enhancement/reconstruction processing tasks can be mathematically formalized in terms of the following optimization problem [9]:
in which represents the objective function where the first term (systematic risk component) is performed over the randomness of the distorted SFO with the uncertainty conditioned by the statistical bound . The regularization parameter balances the systematic risk component (specified by the first term) and the fluctuation risk component (specified by the second term) in , while induces the weighted metrics structure in the systematic risk. The regularization parameter *a* and the invertible weight matrix determine the user adjustable “degrees of freedom.”

In this regard, the solution to the optimization problem derived in previous studies [8, 13] is described as follows:
where defines the so-called reconstruction operator (with the regularization parameter *a* and stabilizer ), and is the inverse of the diagonal loaded noise correlation matrix [14], that is, , attributing the unknown correlated noise component. Thus, for different methods, different cost functions are employed. Here we exemplify two practically motivated high-resolution reconstructive imaging techniques [6, 15, 16] that will be used at the parallel toolbox: (i) the robust spatial filtering (RSF) and (ii) the robust adaptive spatial filtering (RASF) methods.(i)*RSF*. The RSF method implies no preference to any prior model information (i.e., ) and balanced minimization of the systematic and noise error measures by adjusting the regularization parameter to the inverse of the signal-to-noise ratio (SNR). In that case the SO becomes the Tikhonov-type robust spatial filter (RSF) [6]: , in which the RSF regularization parameter is adjusted to a particular operational scenario model, namely, for the case of a certain operational scenario, where represents the white observation noise power density and is the average a priori SSP value.(ii)*RASF*. In this method, and are adjusted in an adaptive fashion following the minimum risk strategy [9], that is, , the diagonal matrix with the estimate at its principal diagonal, in which case the SO becomes itself the solution-dependent operator that results in the following robust adaptive spatial filter (RASF): . In all practical RS scenarios (and, specifically, in SAR uncertain imaging applications [17–19]), it is a common practice to accept the robust white additive noise model, that is, , attributing the unknown correlated noise component.Any other feasible adjustments in the degrees of freedom (the regularization parameter and the weight matrix ) provide other possible SSP reconstruction techniques that we do not consider in this study.

##### 2.3. Image Postprocessing

In this section, the aggregation of the fuzzy anisotropic diffusion and the regularization-based methods are described for the reconstruction/post-processing image processing. The authors consider that the aggregation of such methods in the proposed parallel toolbox increases the flexibility and robust edge definition in the reconstructed image. That is, the noise and the edges are both high frequency image components, and the conventional image enhancement/reconstruction techniques oriented for large-scale remote sensing imaging do not work well for edge-preserving smoothing of image corrupted with additive and speckle noise (i.e., a tradeoff between sharpening and blurring must be selected). In this study, the gradient as the edge factor in the anisotropic diffusion is replaced by a rule-based fuzzy anisotropic diffusion. Also, a fuzzy inference system is employed to replace the edge stopping function providing a better control on the diffusion process.

The aggregation of the reconstruction/post-processing framework offers the possibility to preserve high spatial resolution performances via anisotropic diffusion image post-processing, which implies anisotropic regularizing windowing (WO) over the reconstructed solution in the image space. Next, the following equation represents the celebrated Perona-Malik anisotropic diffusion method [7]: for the continuous 2D rectangular scene frame . In (5), represents the evolutionary reconstructed SSP estimate , which provides edge preservation in the scene regions with high gradient contrast while performing smoothed windowing over the homogeneous image zones corrupted by speckle noise. Also, represents the diffusion coefficient, denotes the gradient of the image, and is called edge stopping function, which is selected as a decreasing function of the gradient of the reconstructed image.

Now, the discrete version of the anisotropic diffusion equation of (5) is represented as where is equal to the number of pixels in the neighborhood that is normally at the boundaries in the north, south, east, and west directions. is the discrete gradient in one of four diffusion directions. The diffusion coefficient is next defined as follows: where parameter is called the scale parameter, which should be selected to be smaller than the gradient at edges and larger than the gradient at noise. Furthermore, because of inherent noise, the calculated edge strength based on gradient is not a true reflection of the edge strength. Therefore, the ambiguity of choosing a suitable value for parameter and thus the uncertainty in the diffusion coefficient justify the use of fuzzy set theory in such situations. The discrete version of the anisotropic diffusion equation of (6) is represented as in which through represents the diffusion coefficients, , , and are the distance between the pixels, and through indicate the nearest-neighbor differences for the corresponding direction.

In this study, we propose to change each nearest-neighbor differences ( in the discrete representation) that define the edge factor (algorithmically the same as the Laplacian filter in each direction) in the traditional Perona-Malik algorithm of (6) with a Fuzzy Logic System (FLS) approach that calculates the edges in order to avoid the noise of the image [20].

Fuzzy logic system (FLS) is a rule-based theory in which an input is first fuzzified (converted from a crisp number to a fuzzy set) and subsequently processed by an inference engine that retrieves knowledge in the form of fuzzy rules contained in a rule-base. The fuzzy sets computed by the fuzzy inference as the output of each rule are then composed and defuzzified (converted from a fuzzy set to a crisp number). The following fuzzy rules are defined in Table 1.

The proposed fuzzy anisotropic diffusion approach is next described in detail as follows: in which through represent the diffusion coefficients computed using (7), , , and are the distance between the pixels, and through represent the edge in each direction as a result of the fuzzy logic system.

Now, we are ready to describe the GPU-based implementation of the intelligent parallel processing toolbox for remotely sensed imagery.

#### 3. GPU-Based Parallel Toolbox Implementation

In order to speed up the processing design flow methodology described in the previous section, a parallel framework is used. This parallel tool was developed using the graphic user interface (GUI) of MATLAB and the algorithmic implementation (i.e., image formation/reconstruction/post-processing) with GPU. The proposed approach also allows other functionalities, such as gray-scale image representation, random noising effects, image enhancement/reconstruction, fuzzy edge detection representation, and loading/storing of results for different radar/SAR systems. The developed parallel tool is shown in Figure 2. From the analysis of Figure 2, it is easy to observe that the toolbox presents the results for each corresponding processing stage.

In addition, the algorithmic implementation of the processing chain framework uses several processors working with independent data partitions [2]. However, before getting the details of the implementation let us remark that the implementation uses CUDA-enabled GPU, which are separate devices that are installed in a host computer, run asynchronously to the host processor, and have their own physical memory.

The basic unit of work on the GPU is a thread. Every thread acts as if it has its own processor with separate registers and identity that happens to run in a shared memory environment [2]. CUDA programs utilize kernels, which are subroutines callable from the host that work on the CUDA device. The extern function is called from the host, and it calls the different kernels. A kernel utilizes many threads to perform the work defined in the kernel source code. The kernel should make partitions of the data to be processed by each thread, taking care of not overlapping thread processing on any memory section in order to avoid undesired results.

The way the GPU is divided is in Streaming Multiprocessors (SMs), and each SM contains cores that execute and identical instruction set, or sleep; up to 32 threads may be scheduled at a time, called a warp, but maximum 24 warps are active in one SM. The threads in the same multiprocessor can share “Shared Memory” by synchronizing their execution for coordinating accesse to memory. The SM divides registers among threads and threads access the register memory as local; they have access to local cache memories in the multiprocessor, while the multiprocessors have access to the global GPU (device) memory.

The GPU blocks represent the number of parallel blocks that we will launch in our grid. Hardware is free to assign blocks to any processor at any time, and at execution time they tell GPU how many threads to launch per block [21]. Figure 3 shows how block and thread configuration would look in a small, 6 block, 100-thread kernel.

One of the first decisions to make before developing the GPU implementation is about the memory mapping of data, that is, how the GPU memory will be used for an improved performance. To do that, we assume that image fits into GPU memory, and in future works this problem will be addressed for larger images. However, instead of just copying and using the values for each pixel, we use textures. Textures are bound to global memory and can provide both cache and some limited, 9-bit processing capabilities [2]. The global memory is allocated as CUDA arrays; when the arrays are bound, textures are cache optimized for spatial locality and allow interpolation, wrapping, and clamping.

On the other hand, we use fuzzy rules which are common to each thread and do not change over time, and for them we use another kind of memory known as constant memory. Constant memory is used for data that will not change over the course of a kernel execution. In some situations, using constant memory rather than global memory will reduce the required memory bandwidth [21].

Considering the aggregation of parallel techniques with GPU computing, we next describe the efficient GPU-based implementation of each processing stage of the proposed parallel tool: GPU implementation of the (i) image formation; (ii) image enhancement/reconstruction; (iii) image post-processing.

Now, we describe the specific functions and CUDA kernels used for efficient implementation of image formation/reconstruction/post-processing on the GPU.

First, a kernel is implemented to compute the image formation processing applying the conventional Matched Space Filter (MSF) algorithm [22]. Due to the fact that this filter can be expressed as a Toeplitz matrix, the processing stage is highly improved due to its potential for parallelization in a massively parallel scheme. The *image-formation* kernel is launched with many threads as pixels contain the RS image, where each thread executes the corresponding operation of the MSF method. Additionally, the time processing of the algorithm is also reduced with the efficient management of data transfers from the CPU and GPU. The device overlap function was activated with the capacity to simultaneously execute a CUDA kernel while performing a copy between CPU to GPU memory [2, 21]. Multiple CUDA streams are created to perform this overlap of computation in data transfer.

Second, the computational procedures for the implementation of the RSF/RASF reconstructive algorithms are described. We start with the memory configuration management using the “mmap” library of GNU C. With this library, the image can be accessed from the CPU with a simple pointer without the necessity to read each pixel value. Next, the matrix-matrix multiplication operations are computed. A specific CUDA kernel is implemented using one grid of blocks, in which each block processes the corresponding sub-matrix operation in a parallel scheme. The results are next loaded in the shared memory of the GPU. Note that several additional operations must be implemented in this stage. In this study, we propose to use the *reduction* algorithm [23].

Next, the extern function Perona-Malik is called from the host, it binds the texture to array, calls the “fuzzy” kernel for image enhancement, and normalizes the resulting array using the optimized NVIDIA Performance Primitives [24]. The Perona fuzzy kernel is computed for the image enhancement post-processing. In this stage, the first operation is to read the texture at subpixel precision and take differences in a neighbourhood window of pixels; this corresponds to through values for the fuzzy rules of Table 1. The fuzzy input vectors are used as membership function, and that this means the *Zero*, *Black*, and *White* values are used in the fuzzy rules as depicted in Table 1. We have 2 membership functions defined for each of the 8 possible directions (neighbours) of a pixel, resulting in 16 fuzzy values.

After that, it is applied the implication method (min function), afterwards, an aggregation method (max function), and finally defuzze the values applying the anisotropic diffusion.

The implication method is computed Imp_{Y} as the minimum between the resulting and the fuzzy input value for *White* or *Black*, depending on the membership function used, where is equal to through , and is equal to White and Black. The aggregation method takes the maximum between Imp_{White} and Imp_{Black}. Finally the anisotropic diffusion was applied using (9).

The resulting pixel values are independent from others, and hence the fuzzy calculations are suitable for a GPU implementation. Each thread is responsible for computing the fuzzy anisotropic diffusion of a part of the image. This image part is selected as shown in Figure 3; that is, it is done considering the block’s size, block’s index, and thread’s index.

The resulting pixel values are float numbers; hence, in order to ease the visualization, they are normalized to unsigned char values for each channel. This task is performed before displaying the resulting image but is optional in case of post-processing. We take advantage of optimized functions provided by the Nvidia Performance Primitives [24], using the *nppsMinMax_32f* to get the minimum and maximum float value resulting from the fuzzy kernel and *nppsNormalize_32f* to normalize the image with the resulting minimum value at 0 and maximum at 255.

#### 4. Results and Performance Analysis

In this section, we carry out the experimental validation of the proposed parallel toolbox system. Also, the time performance analysis is performed in order to demonstrate the time processing improvement achieved with the GPU-based implementation. The experimental case studies are next described for a high-resolution RS image acquired from an SAR system.

##### 4.1. Experimental Validation of the Intelligent Parallel Processing

In these experiments, the validation was performed with large scales (1*K*-by-1*K* pixel format) RS images borrowed from real-world high-resolution terrain SAR imagery (lower Manhattan region, NY, USA and the Pentagon region near Washington, DC, USA [25]). In the image formation stage, we considered the conventional side-looking synthetic aperture radar (SAR) with the fractionally synthesized aperture as an RS imaging system [11, 12]. The regular SFO of such SAR is factored along two axes in the image plane using the MSF method: the azimuth or cross-range coordinate (horizontal axis, ) and the slant range (vertical axis, ), respectively. We considered the conventional triangular SAR range ambiguity function (AF) [11] and Gaussian approximation [22] of the SAR azimuth AF. Also, the chi-squared additive noise of 20?dB signal-to-noise ratio (SNR) was incorporated to test the performances of the particular RS processing chain.

In the reconstruction and post-processing stage, we conduct the evaluation of the GPU-based implementation of the well-known reconstructive and enhancement post-processing algorithms. The validation has been realized between the original scene frame, the degraded RS image, the reconstructive RASF algorithm, the Perona-Malik (PM) technique, and the fuzzy anisotropic-diffusion technique.

In analogy to the image reconstruction for quantitative evaluation of the RS reconstruction performances, the quality metric defined as an improvement in the output signal-to-noise ratio (IOSNR) is employed. This metric is defined as follows: ; , where is related to the size of the RS image, represents the value of the th element (pixel) of the original image, represents the value of the th element (pixel) of the degraded image formed applying the MSF technique, and represents a value of the th pixel of the image reconstructed with two considered reconstructive-related methods; , where corresponds to the RASF-PM algorithm and corresponds to the RASF-fuzzy anisotropic diffusion algorithm, respectively. The quantitative measures of the image enhancement/reconstruction performance gains achieved with the particular employed RASF-PM and RASF-Fuzzy anisotropic diffusion techniques, are evaluated via the IOSNR metric, as reported in Table 2.

According to this quality metric, the higher the IOSNR (the average over 100 realizations), the better the improvement of the image reconstruction/post-processing with the particular employed algorithm.

Next, the qualitative results are presented in Figures 4 and 5. Figures 4(a) and 5(a) show the original test scene image. Figures 4(b) and 5(b) present the noised low-resolution (degraded) scene image formed with the conventional MSF algorithm. Figures 4(c) and 5(c) present the scene image reconstructed with the RASF algorithm. Figures 4(d) and 5(d) present the scene image enhancement employing the RASF-PM post-processing algorithm, Figures 4(e) and 5(e) present the result of the fuzzy edge detection, and Figures 4(f) and 5(f) present the results of the image enhancement using the implemented RASF-fuzzy anisotropic diffusion post-processing algorithm.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

From the analysis of the qualitative and quantitative simulation results reported in Figures 4 and 5 and Table 2, one may deduce that the aggregation of the reconstructive model-based RASF method and the model-free fuzzy anisotropic diffusion method overperforms the classical PM anisotropic diffusion method in the experimental validation scenario of the intelligent parallel processing toolbox.

##### 4.2. Time Performance Analysis

Next, we compared the required processing time for two different implementation schemes as reported in Tables 3 and 4. In the first case, the reference image formation/reconstruction/post-processing procedures for the intelligent parallel processing of remotely sensed imagery were implemented in the conventional C language in a Dell PowerEdge Server with an Intel Xeon E5603 Quad Core processor at 1.6?GHz and 24?GB of RAM. In the second case, the same algorithms were implemented using the proposed parallel toolbox approach which aggregates the graphic user interface of MATLAB with the parallel multiprocessor scheme based on GPU. In Table 3, a comparative feature analysis of the employed GPU is presented.

Next, considering the implementation on a Dell PowerEdge Server, which has a Nvidia Tesla C2075 GPU, we have achieved kernel-processing times up to 30?ms, which means 10+ times faster than desktop GPU processing with a Nvidia Gforce GTS 450.

The overall process for each stage of aggregation of the reconstructive model-based RASF method and the model-free fuzzy anisotropic diffusion method is 98?ms for Tesla and 897?ms for GTS, and considering only the fuzzy anisotropic kernel the times are 39?ms for Tesla and 357?ms for GTS. The speedup of each stage is over 200 times in comparison to a Matlab implementation, and 10 times a Desktop GPU implementation, as presented in the comparative processing time analysis of Table 4. Note that Matlab implementation does not need any transfers or GPU management; hence, it is only presented the overall processing time.

#### 5. Conclusions

In this paper, a parallel tool for large-scale remote sensing images was presented. This tool allows the image formation and the reconstruction of images using two well-known techniques: the robust spatial filter (RSF) and the robust adaptive spatial filter (RASF). It also applies a post-processing stage based on the Perona-Malik algorithm jointly with a Fuzzy Logic System (FLS) named in this paper as fuzzy anisotropic diffusion technique. This toolbox was implemented using a Matlab graphic user interface (GUI) and the mathematical operations were computed using graphics processor units (GPU). Under this paradigm, it was possible to save significant processing time due to the adaptation of the algorithms for its parallel implementation. For the presented case of studies, it was shown that the processing time in comparison with a PC based implementation was reduced +200 times. Thus, this approach represents a powerful parallel tool for processing huge amount of large scale RS images.

#### Acknowledgment

This study was supported by Consejo Nacional de Ciencia y Tecnologia (CONACYT), Mexico, under Grant CB-2010- 01-158136.