Abstract

This paper investigates a cross-scale semantic feature coherent image inpainting approach, since existing image inpainting methods struggle to fuse semantic feature information effectively. First, we learn feature semantic relevance step by step from the attention mechanism of the high-level semantic feature map, and then apply what is learned to the preceding low-level feature map. To preserve the visual and semantic coherence of the repaired image, the missing content is filled by shifting attention from deep to shallow layers in a multiscale manner. Partial convolution provides a broader receptive field, and semantic feature relevance is achieved using a multiscale cross-feature-space attention mechanism based on semantic attention. By reconstructing the semantic information of different feature spaces, this technique improves the extensibility and continuity of the restored images, accounting not only for the reuse of features within the existing semantic space but also for features across spaces. The experimental results demonstrate improvements in PSNR, SSIM, and L1 performance of 10.50%, 0.13%, and 47.09%, respectively, with clear benefits.

1. Introduction

To make the restored image look natural and difficult to distinguish from the undamaged image, an inpainting algorithm must fill the missing areas in accordance with the image itself or with information from the training set. According to existing research, even a tiny discrepancy between the filled content and the undamaged area is quite noticeable. High-quality image inpainting therefore requires both the content semantics and the generated texture to be sufficiently realistic and clear.

Current image inpainting techniques fall into two primary categories. The first is the traditional texture synthesis method, whose fundamental concept is to fill the missing area with matching pixel patches selected from the undamaged part of the image. The second uses deep learning to encode the image as a feature in a high-dimensional hidden space, which is subsequently decoded to produce the recovered image. High-quality inpainting requires the missing areas of the damaged image to be filled with appropriate information, but present approaches either generate semantically consistent content from the context of the region or fill the region by replicating image patches, overlooking the need for both visual and semantic credibility. As a result, these two families of techniques have drawbacks in maintaining adequate semantics and distinct texture.

The motivation of this research is to further enhance the semantic consistency of image restoration: we gradually learn regional semantic relevance from the attention in the high-level semantic feature map and apply the learned attention to the preceding low-level feature map. Because the missing content is filled by moving attention from deep to shallow layers in a multiscale manner, the visual and semantic coherence of the repair is preserved. In addition, the attention mechanism in a neural network is a resource allocation scheme that assigns computing resources to the most important tasks first, alleviating information overload when computing resources are limited, for example in automotive systems [1, 2] for self-driving applications.

We have developed a robust strategy for learning semantic feature maps across feature spaces, with which the generative model can produce semantically consistent results for missing areas. We propose a framework for multiscale image inpainting based on a deep learning model; it emphasizes a cross-scale semantic correlation inpainting technique that takes into account both the current feature scale space and the cross-scale feature space. By combining a cross-feature-space attention mechanism with a semantic attention mechanism, we achieve semantically coherent image restoration.

This approach achieves high-quality images by realizing image restoration from a semantic standpoint and combining multiscale feature space information. Additionally, the experimental results demonstrate that our technique performs better in terms of the PSNR, SSIM, and L1 metrics. Our primary contributions are as follows:
(1) A cross-scale method for semantically coherent image restoration with four scales is proposed. Cross-scale semantic feature extraction is realized with our novel method, and high-quality image restoration with semantic coherence is achieved through our search-and-generation strategy.
(2) A reconstruction module called cross-scale coherent semantic attention (CCSA) is proposed. An attention score is calculated to reconstruct the sibling features of the lower-level semantic network module, and reasoning operations are utilized to identify the useful regions. With this technique, the semantic features of several feature spaces can be combined, and the feature information is then transferred to the subsequent layer for fusion. The experimental results show that the cross-scale reconstruction technique changes PSNR, SSIM, and L1 performance by 5.34%, −0.14%, and 33.86%, respectively.
(3) A semantic residual attention (SRA) module is proposed, which further enhances the network's performance and increases the semantic coherence of image restoration through the semantic residual structure while reducing the network's residual error. This module improves PSNR and L1 performance by 3.43% and 27.51%, respectively.

2. Related Work

The approaches to image restoration can be divided into two main types: the first is the classical texture synthesis method, and the second is the deep learning method [3].

2.1. Classical Matching Approach

Such methods do not require a training set; the DIP approach [4], for example, needs only the single damaged image for the entire procedure. The CDD model [6] enhanced the TV (total variation) model [5], addressing the TV model's inability to restore the visual connectedness of images. Criminisi's classical brute-force block-matching method [7, 8] does not always produce pleasing results, because the search considers only how closely the portion of a patch outside the hole matches the rest of the image. Barnes's PatchMatch [9] is a clever patch-matching technique that exploits the local correlation of images to accelerate matching; although it attains an approximate global optimum, it cannot guarantee that every patch finds its best match. Because they require extensive processing to achieve pixel-level filling, these traditional approaches are typically slow. Another significant flaw is the absence of semantic knowledge and in-depth understanding of the image: such strategies cannot handle the restoration of complicated semantic scenes and struggle to produce semantically plausible results.

2.2. Deep Learning-Based Regular Filling

Pathak's context encoder [10–12] is an unsupervised visual feature learning system driven by context-based pixel prediction. It generates acceptable results for semantic filling and can produce content for any image area based on its surroundings. The GL technique [13, 14] introduces global and local discriminators, producing locally and globally consistent images; its fully convolutional network can fill a missing region of any shape, giving it advantages over patch-based approaches such as PatchMatch [9]. Liu's partial convolution technique [15, 16] reduces color differences, blurring, and other flaws, and is particularly beneficial for irregular holes.

2.3. Deep Learning-Based Progressive Filling

Edge-guided repair methods [17, 18], for instance, need the edges to be determined in advance, and different parameters yield different edge features, which influence the repair results. The RFR model [19] differs from prior one-shot filling models and is comparable to an RNN framework: the feature map is the first input to the RFR module, the module's output is fed back as its next input, and after a number of cycles the subsequent stage of feature fusion is initiated. In Zhang's PGN [20], progressive filling at the image level is accomplished by chaining GANs together with an LSTM. Using partial convolution and dilation, Guo's FRRN [21] stacks eight full-resolution residual modules to achieve progressive filling. These processes frequently demand substantial computational resources and are time-consuming.

2.4. Attention-Based Deep Learning

By utilizing advanced semantic feature learning, deep learning models can produce semantically consistent results for missing areas; nevertheless, it remains difficult to obtain aesthetically realistic outcomes from small latent features. Yu's Deepfillv1 [22], for instance, proposes an improved GCA structure based on contextual attention [23], filling the target area with similar texture from the source area of the feature map; the content learned by the contextual attention layer is the key feature information used to repair the missing area of a damaged image. The enhanced Deepfillv2 [24] uses gated convolution: when the damaged area is free-form, the gated convolutions are optimized to handle gaps near the filling edge, and the image is divided into patches so that each local region is identified by a spectrally normalized discriminator. Pyramid-style layer-by-layer repair [25] uses an attention transfer network (ATN) designed to transfer the features of the known area to the missing area for a better filling effect; its generator adopts an encoder-decoder structure with a pyramid encoder. Diversified repair [26–28] developed a novel probabilistic framework that combines prior conditions and latent variables along several parallel paths in order to produce multivariate results with appropriate confidence. A variational autoencoder [29, 30] transforms the image into a hidden space, where the restoration operation is then carried out. The realism-diversity dynamic balancing approach [30] holds that pixels near the hole center should have more degrees of freedom while those close to the hole edge should be more predictable; it dynamically balances authenticity and diversity within the missing area [31], making the generated content more diversified towards the hole center and the hole boundary more similar to the adjacent image content. Zeng et al. [32] proposed context-reconstruction-assisted repair, teaching patch-match behavior to an attention-free generator through joint training, which encourages the generated output to be plausible even when it is reconstructed from the surrounding areas. Wide-ranging focus [33] introduces a novel attention-aware layer (AAL) to better exploit the high-frequency properties of long-distance correlation and enhance the appearance consistency between the visible and generated regions.

Few studies have addressed multiscale semantic feature fusion, and the majority of existing approaches consider image restoration at only one scale. It is therefore important to investigate semantically consistent image inpainting techniques from a cross-scale perspective.

3. Our Method

3.1. Overall Structure of Our Method

Figure 1 depicts the overall structure of our network, which is primarily composed of several basic blocks (BBs), shown in Figure 2, connected by cross-scale coherent semantic attention (CCSA) and semantic residual attention (SRA) blocks. Each BB individually learns the feature information of the current scale, while the semantic coherent attention module and semantic residual attention module connect the different scale spaces. Within each BB, the input is split into two paths, and pixel-wise concatenation combines the outputs of the last two BBs. To restore more information while maintaining visual quality, the two paths are max-pooled and average-pooled, respectively.
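As a rough illustration of this two-path design, the following PyTorch sketch shows one way such a basic block could be organized; the channel counts, 3 × 3 convolutions, and 1 × 1 fusion layer are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of a BB: the input is split into two paths, one max-pooled and
    one average-pooled, and the results are concatenated channel-wise.
    All layer choices here are illustrative assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.path_a = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.path_b = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.max_pool = nn.MaxPool2d(2)
        self.avg_pool = nn.AvgPool2d(2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # merge the two paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.max_pool(self.path_a(x))            # max-pooled path
        b = self.avg_pool(self.path_b(x))            # average-pooled path
        return self.fuse(torch.cat([a, b], dim=1))   # pixel-wise concat + fusion
```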

The semantic correlation attention method in the backbone network realizes cross-scale semantic correlation learning, which can exploit features at several scales. The main purpose of this structure is to propagate feature information between two adjacent BBs. Our network includes four BBs, each representing a distinct scale space, which essentially satisfies the requirements; three cross-scale attention structures are consequently needed to enable semantic feature transmission.

The semantic residual attention module mainly realizes the transmission of semantic residuals across adjacent BB modules, further reducing the semantic residuals between modules and enhancing network performance. The experimental results demonstrate that introducing the semantic residual module improves the network's overall performance, confirming its value in reducing the semantic residual of the network.

Search and generation are the two key steps in the realization of semantic attention learning, through which cross-scale, semantically associated image restoration is realized. Our network does not directly employ standard convolutional layers for feature learning; instead, we employ partial convolution to obtain a larger receptive field and further boost learning effectiveness.
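For reference, below is a condensed sketch of the partial-convolution idea from Liu et al. [15]: the convolution output is renormalized by the fraction of valid pixels under each kernel window, and the mask is updated so that newly covered positions become valid. The real module handles further details (multi-channel masks, dilation) that we omit here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    """Minimal partial convolution sketch; mask is (N, 1, H, W) with
    1 = known pixel and 0 = hole."""
    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        with torch.no_grad():
            ones = torch.ones(1, 1, *self.kernel_size, device=x.device)
            valid = F.conv2d(mask, ones, stride=self.stride, padding=self.padding)
            ratio = ones.numel() / valid.clamp(min=1e-8)  # renormalization factor
            new_mask = (valid > 0).float()                # updated validity mask
        out = super().forward(x * mask)                   # convolve masked input
        if self.bias is not None:
            b = self.bias.view(1, -1, 1, 1)
            out = (out - b) * ratio + b
        else:
            out = out * ratio
        return out * new_mask, new_mask

# Example: out, mask = PartialConv2d(3, 64, 3, padding=1)(x, mask)
```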

3.2. Feature Reconstruction Based on Semantic Coherent Attention

We believe it is insufficient to reconstruct M solely by considering the relationship between M and M′ (which denote the missing area and the known area in the feature map, respectively), as this ignores the correlation among the generated image patches themselves, particularly their semantic correlation, and may leave the restoration results lacking ductility and continuity.

To resolve this weakness, we investigate the semantic residual and the semantic correlation between the generated restoration patches and propose an SCA layer, whose implementation comprises search and generation steps. Figure 3 illustrates how the SCA layer works, with M and M′ again denoting the missing area and the known area in the feature map, respectively.

In order to initialize $m_i$ during the search, the SCA layer searches the known region M′ for the context patch $\overline{m}_i$ that most closely matches the $i$th patch $m_i$ in M.

Then, in order to recover $m_i$ during generation, we set $\overline{m}_i$ as the primary component and the previously generated patch as the secondary part. The weights of the two parts are determined using the following cross-correlation measures:

$$D_{ad_i} = \frac{\langle m_i,\, m_{i-1} \rangle}{\lVert m_i \rVert\,\lVert m_{i-1} \rVert}, \qquad D_{max_i} = \frac{\langle m_i,\, \overline{m}_i \rangle}{\lVert m_i \rVert\,\lVert \overline{m}_i \rVert},$$

where $D_{ad_i}$ denotes the similarity between two adjacent generated patches and $D_{max_i}$ represents the similarity between $m_i$ and the context area's most similar patch $\overline{m}_i$. Normalized by their sum, $D_{max_i}$ and $D_{ad_i}$ give the weights of the context-patch part and the previously generated part, respectively. The two steps are detailed below.

3.2.1. Search

In the search step, we extract patches from M′, transform them into convolution filters, and apply these filters to M. This procedure yields the correlation between each patch in M and every patch in M′. Based on it, we initialize each generated patch $m_i$ with the context patch $\overline{m}_i$ that is most similar to it and, for the subsequent operation, record the maximum cross-correlation value $D_{max_i}$.

3.2.2. Generation

We start the generation process from the upper-left patch of M (marked $m_1$ in Figure 3). Since $m_1$ has no preceding patch, $D_{ad_1} = 0$ and we simply replace $m_1$ with $\overline{m}_1$, making $m_1 = \overline{m}_1$. The generated patch $m_1$ then serves as an additional reference for the next patch $m_2$: we treat $m_1$ as a convolution filter to obtain the cross-correlation measure $D_{ad_2}$ between $m_1$ and $m_2$. Then $D_{ad_2}$ and $D_{max_2}$ are normalized into the weights of $m_1$ and $\overline{m}_2$, respectively, to update the value of $m_2$. The generation process from $m_1$ to $m_n$ can be summarized as follows:

$$m_i = \frac{D_{ad_i}}{D_{ad_i} + D_{max_i}}\, m_{i-1} + \frac{D_{max_i}}{D_{ad_i} + D_{max_i}}\, \overline{m}_i, \qquad i = 1, \dots, n.$$

This is a recursive process, and the method described above determines the content of the repair area.
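The following sketch illustrates the search and generation recursion on flattened patches. The cosine form of the cross-correlation, the patch shapes, and the clipping of negative similarities are our simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def sca_fill(hole_patches: torch.Tensor, context_patches: torch.Tensor):
    """Sketch of the SCA search/generation recursion.
    hole_patches:    (n, d) flattened patches m_1..m_n from the missing region.
    context_patches: (k, d) candidate patches from the known region."""
    # Search: cosine cross-correlation of every m_i with every context patch.
    sim = F.cosine_similarity(hole_patches.unsqueeze(1),
                              context_patches.unsqueeze(0), dim=-1)   # (n, k)
    d_max, idx = sim.max(dim=1)
    d_max = d_max.clamp(min=0.0)       # D_max_i (negatives clipped for clarity)
    matched = context_patches[idx]     # \bar{m}_i used to initialize each m_i

    # Generation: recursive blend of the best context match with the
    # previously generated patch, weighted by the two correlations.
    out = [matched[0]]                 # m_1 = \bar{m}_1, since D_ad_1 = 0
    for i in range(1, hole_patches.size(0)):
        d_ad = F.cosine_similarity(out[-1], matched[i], dim=0).clamp(min=0.0)
        w = (d_ad + d_max[i]).clamp(min=1e-8)
        m_i = (d_ad * out[-1] + d_max[i] * matched[i]) / w
        out.append(m_i)
    return torch.stack(out)            # reconstructed patches m_1..m_n
```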

3.3. Image Reconstruction Based on Cross-Scale Semantic Coherence

We propose employing the semantic correlation between the high-level and low-level semantic modules to reconstruct feature maps, preserving as much low-level semantic information as possible. Since the high-level semantic module deals with smaller missing regions (relative to the low-level module), we utilize reasoning operations to identify the useful regions. In more detail, patches of the low-level semantic module's feature map are recombined by deconvolution with the similarity scores computed from the high-level semantic module's features, thereby reconstructing the filled feature map.

Formally, let $f_i^j$ denote the $i$th-layer feature of the $j$th module of the cross-scale semantic feature reconstruction network. The sibling-feature similarity shared by adjacent modules is defined as

$$s^j_{(x,y),(x',y')} = \left\langle \frac{f_i^j(x,y)}{\lVert f_i^j(x,y) \rVert},\; \frac{f_i^j(x',y')}{\lVert f_i^j(x',y') \rVert} \right\rangle,$$

where $s^j_{(x,y),(x',y')}$ is the measure of similarity between the unknown position $(x, y)$ and the known position $(x', y')$. The scores of adjacent pixels are then smoothed to further enhance continuity:

$$\hat{s}^j_{(x,y),(x',y')} = \frac{1}{\lvert \mathcal{N}(x,y) \rvert} \sum_{(u,v)\in\mathcal{N}(x,y)} s^j_{(u,v),(x',y')},$$

where $\mathcal{N}(x,y)$ is the neighborhood of $(x, y)$.

The output mappings of the hierarchical modules represent different semantic levels with different semantic properties. To retain the semantic information from the preceding module, we additionally include a trainable parameter $\lambda$.

Finally, the final attention score is used to reconstruct the sibling features of the lower-level semantic network module as follows:

$$\hat{f}_i^{\,j-1}(x,y) = \lambda \sum_{(x',y')} \hat{s}^j_{(x,y),(x',y')}\, f_i^{\,j-1}(x',y') + (1-\lambda)\, f_i^{\,j-1}(x,y).$$
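A sketch of how such a cross-scale reconstruction could be wired up is given below; the cosine scoring, softmax normalization, 3 × 3 smoothing window, and the assumption that the two feature maps share a spatial size are ours, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def cross_scale_reconstruct(f_high: torch.Tensor, f_low: torch.Tensor,
                            lam: float = 0.5) -> torch.Tensor:
    """Attention scores learned on the high-level map f_high (n, c, h, w) are
    reused to recombine the low-level map f_low (n, c_low, h, w)."""
    n, c, h, w = f_high.shape
    q = F.normalize(f_high.flatten(2), dim=1)        # (n, c, h*w) unit vectors
    scores = torch.bmm(q.transpose(1, 2), q)         # (n, hw, hw) cosine scores
    attn = F.softmax(scores, dim=-1)

    # Smooth scores over adjacent query pixels (3x3 average) for continuity.
    attn = attn.view(n, h, w, h * w).permute(0, 3, 1, 2)   # (n, hw, h, w)
    attn = F.avg_pool2d(attn, 3, stride=1, padding=1)
    attn = attn.permute(0, 2, 3, 1).reshape(n, h * w, h * w)

    # Reconstruct the low-level features as attention-weighted sums.
    v = f_low.flatten(2)                              # (n, c_low, hw)
    recon = torch.bmm(v, attn.transpose(1, 2)).view_as(f_low)
    return lam * recon + (1 - lam) * f_low            # keep original semantics
```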

3.4. Multiscale Feature Selection and Fusion

A deeper module is then employed to extract features from the feature map. Cross-scale procedures keep the deep network's low-level semantic information flowing, but they may also introduce some misleading background details. With this technique, we intend to use multiscale feature extraction to gather information from a wide receptive field: features are extracted at four distinct scales. To preserve the balance between performance and efficiency, we use a 3 × 3 convolution kernel with a distinct dilation rate for each scale. Let $\Phi_{k,r}(\cdot)$ denote the convolution operation with kernel size $k$ and dilation rate $r$. The feature selection operation is then defined as

$$A_r = \sigma\!\left(\Phi_{3,1}\!\left(\left[F_{max};\, F_{avg}\right]\right)\right),$$

where $F_{max}$ and $F_{avg}$ are the channel-wise maximum and average values, which must also be computed. The attention score $A_r$ of each scale can then be obtained, where the dilation rate $r$ takes the values [1, 2, 4, 8]. Finally, the cumulative output is

$$F_{out} = \sum_{r \in \{1,2,4,8\}} A_r \odot \Phi_{3,r}(F).$$
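Under the formulation above, the multiscale selection step could be sketched as follows; the CBAM-style sigmoid gate over the channel-wise max and mean is our reading of the feature selection operation, not a confirmed detail.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch: 3x3 convolutions with dilation rates [1, 2, 4, 8], each gated
    by a spatial attention score from the channel-wise max and mean."""
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.gates = nn.ModuleList(
            nn.Conv2d(2, 1, 3, padding=1) for _ in rates)   # one gate per scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = 0
        for branch, gate in zip(self.branches, self.gates):
            feat = branch(x)
            mx = feat.max(dim=1, keepdim=True).values       # channel-wise max
            avg = feat.mean(dim=1, keepdim=True)            # channel-wise mean
            score = torch.sigmoid(gate(torch.cat([mx, avg], dim=1)))
            out = out + score * feat                        # cumulative output
        return out
```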

Low-level semantic information may be lost or corrupted when feature maps travel through the low-level semantic modules. To ensure that low-level semantic information can be transmitted throughout the network, the high-level semantic module must be paired with the low-level semantic feature module. To this end, we concatenate the reconstructed feature with the feature from the high-level semantic module along the channel dimension and fuse them with a 1 × 1 convolution kernel to form the output feature. Assuming the original input feature is $F$ and the reconstructed feature is $\hat{F}$, the output feature can be expressed as

$$F' = \Phi_{1,1}\!\left(\left[F;\, \hat{F}\right]\right).$$
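A minimal sketch of this fusion step, with illustrative channel sizes:

```python
import torch
import torch.nn as nn

# The original feature F and the reconstructed feature are concatenated along
# the channel axis and merged by a 1x1 convolution (channel sizes assumed).
fuse = nn.Conv2d(2 * 64, 64, kernel_size=1)
f_in = torch.randn(1, 64, 32, 32)     # original input feature F
f_rec = torch.randn(1, 64, 32, 32)    # feature reconstructed from above
f_out = fuse(torch.cat([f_in, f_rec], dim=1))
```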

4. Experimental Results

4.1. Training Platform, Data, and Evaluation Metrics

To validate our image inpainting strategy, this research compares results on commonly used test datasets: Urban100 [34], DTD [35], and CelebA [36]. The main performance evaluation metrics are PSNR [37], SSIM [38], and the L1 error. PSNR, the peak signal-to-noise ratio, is the most popular objective measure of image quality. The structural similarity index (SSIM) evaluates image quality by comparing brightness, contrast, and structure. The training platform and related parameters employed in this work are shown in Table 1.
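For concreteness, minimal reference implementations of PSNR and the L1 error are shown below; SSIM is more involved, and a standard implementation is available as skimage.metrics.structural_similarity.

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images (higher is better)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def l1_error(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute (L1) error between two images (lower is better)."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```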

Training settings: Adam optimizer with learning_rate = 0.0001, β₁ = 0.9, β₂ = 0.999, on the Keras 2.7 training platform.
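On the Keras platform named above, this optimizer configuration corresponds to roughly the following (the compiled model itself is a placeholder):

```python
from tensorflow.keras.optimizers import Adam

# Adam with learning rate 1e-4, beta_1 = 0.9, beta_2 = 0.999, per Table 1.
optimizer = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
# model.compile(optimizer=optimizer, loss="mae")
```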

In the experiments comparing against other methods, the comparison methods were reproduced according to the literature, and training was conducted on the same platform and datasets. Each training epoch comprises 500 iterations, and the total number of training epochs is 1000.

4.2. Comparison with Existing Methods

Our proposed method shows good results on the various datasets, as shown in Table 2, and is optimal on almost all metrics. For the PSNR metric, our method is the best on all datasets for all mask ratios. For the SSIM metric, our method is the best on all datasets for all mask ratios, except on the Urban100 dataset with mask ratio 0.4, where it ranks second with 0.9883 and the PSR method ranks first with 0.9884. For the L1 error, our method performs best except on the CelebA dataset, where it ranks second for mask ratios 0.3 and 0.4.

Thus, while the new technique does not produce the best results on every dataset, it does so on the majority of them, and its performance is significantly enhanced compared with the original methods, demonstrating its clear benefits.

Table 2 compares the PSNR, SSIM, and L1 metrics of the different approaches to illustrate the comparison more clearly. CelebA serves as the training dataset: 30000 pairs of training data are obtained after preprocessing, of which 10% are used as the test set.

Model size and inference time for a 128 × 128 image: the inference time of our method is 4.69 ms, slightly higher than that of the Pconv and PSR methods. The estimated inference time for a 720p image (1280 × 720) is therefore below 263.8 ms (56.25 patches at 128 × 128 resolution). Inference time details for the different methods are shown in Table 3.
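A quick back-of-envelope check of this estimate:

```python
# A 1280x720 frame covers 56.25 patches of 128x128; at 4.69 ms per patch
# the upper bound is about 263.8 ms.
patches = (1280 * 720) / (128 * 128)   # 56.25
print(patches, patches * 4.69)         # 56.25  263.8125 (ms)
```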

5. Ablation Experiment

The experiments follow the single-variable control principle: in each group of trials we modify a single variable and leave the others unaltered, so that the impact of a single variable or single structure on system performance can be examined.

Four groups of comparison experiments were designed to verify the operation of each module of the cross-scale semantic feature restoration approach: RES0ATT0 (the baseline), RES0ATT1, RES1ATT0, and RES1ATT1, where RES denotes the semantic residual structure, ATT denotes cross-scale feature attention, and 1 or 0 indicates that the component is included or excluded. Thus RES1ATT1 uses both cross-scale feature attention and semantic residuals, while RES0ATT0 includes neither. RES1ATT1 demonstrates the combined impact of cross-scale feature reconstruction and semantic residuals. The results show that the network's PSNR, SSIM, and L1 performance improves with the addition of these two components: the PSNR, SSIM, and L1 metrics improve by 10.5%, 0.13%, and 47.09%, respectively, over the baseline RES0ATT0, with PSNR and L1 showing the largest gains. The detailed results are shown in Table 4.

RES1ATT1 outperforms both RES0ATT1 and RES1ATT0, indicating that the two new structures each play an important role in promoting network performance. With a single structure, the PSNR, SSIM, and L1 metrics change by 5.34%, −0.14%, and 33.86% (RES0ATT1) and by 3.43%, −0.10%, and 27.51% (RES1ATT0), respectively, compared with the baseline network. PSNR and especially L1 improve significantly, while the SSIM indicator decreases slightly, although the decline is almost negligible.

Figure 4 illustrates the PSNR results for the various experimental settings. The cross-scale attention structure is included in both the RES1ATT1 and RES0ATT1 strategies. Compared with the other strategies, the PSNR of RES0ATT1 increased more quickly at first, but after around 400 epochs its growth slowed and RES1ATT1 overtook it. Similarly, RES1ATT1 overtook RES1ATT0 after around 500 epochs. The RES0ATT0 approach performs worst of all, showing that both the cross-scale semantic feature learning structure and the semantic residual structure are effective in promoting network performance.

Figure 5 illustrates similar findings. The similarity between the two figures can be explained by the L1 error, since it indicates the overall level of inaccuracy between images. The L1 error of RES1ATT1 is lower than that of the other settings, demonstrating the benefit of incorporating the semantic residual structure and the cross-scale feature attention structure.

Naturally, the SSIM indicator shows comparable results. Because SSIM approaches saturation early, its values are less distinguishable between the settings than the previous two metrics. Consequently, no in-depth comparison of SSIM is provided here, but Table 4 lists the average of the last 20 results.

5.1. Visual Performance Comparison

To further validate the visual performance of the various image restoration techniques, this experiment compares the visual results of RDN, Deepfill, PCONV, RFR, PSR, and other approaches. With 500 iterations per epoch, the number of epochs is set to 1000, and all models run on the same training and validation datasets. PSNR, SSIM, and L1 are the primary evaluation metrics. The experimental results demonstrate that the strategy based on cross-scale semantic feature attention produces the best performance.

As an illustration, Figure 6 shows that our technique achieves PSNR, SSIM, and L1 values of 33.518, 0.982, and 0.014, while the RFR and PSR methods yield 29.602, 0.930, and 0.023 and 32.372, 0.966, and 0.016, respectively. The outcomes demonstrate that our strategy outperforms the other methods on these metrics. Figures 6–9 display similar outcomes.

6. Conclusions

In view of the difficulty of semantic-level inpainting in previous image repair methods, this paper proposes a cross-scale semantic feature image repair method to address the lack of ductility and continuity in existing methods.

This approach can capture semantic feature information from various scale spaces in addition to that of the current scale space, which aids the inpainting process and enables higher-quality restoration. The experimental results indicate that integrating the cross-scale semantic feature restoration method accelerates the propagation of semantic features, which is advantageous for semantic-level image restoration.

Data Availability

Data are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to acknowledge the support from the National Natural Science Foundation of China (No. 62162027); Science and Technology Project of Jiangxi Provincial Department of Education (No. GJJ210646); and Key R&D Projects of Jiujiang City (No. 2020069).