Abstract

In recent years, deep learning-based watermarking algorithms have received extensive attention. However, existing algorithms mainly use an autoencoder to insert the watermark automatically and do not use prior knowledge to guide the watermark embedding. In this paper, an end-to-end framework based on embedding guidance is proposed for robust image watermarking. It contains four modules: a prior knowledge extractor, an encoder, an attacking simulator, and a decoder. To guide the watermark embedding, the prior knowledge extractor provides chrominance and edge information of the cover images, which is used to modify the cover images before the encoder inserts the watermark. To enhance the robustness of watermark extraction, the attacking simulator applies various differentiable attacks to the encoded images before the decoder extracts the watermark. Experimental results show that the proposed algorithm achieves a good balance between invisibility and robustness and is superior to state-of-the-art algorithms.

1. Introduction

The unauthorized distribution of copies has become a threat to the sharing of multimedia products. Hence, how to declare the ownership of these products is an urgent problem [1]. Digital watermarking technologies are widely used in copyright protection by embedding copyright information into digital products [2], such as digital literature, music, films, photographs, and face portraits. Robustness against different attacks is essential for the practical application of digital watermarking. Traditionally, watermarking algorithms mainly rely on hand-crafted features to improve robustness, such as applying various transforms [3–5] or using perceptual masking [6, 7]. The drawback of these hand-crafted algorithms is that they are not simultaneously robust to several types of distortions, because different types of distortions often require different techniques [8]. Consequently, some deep learning-based algorithms have been presented [9–23]. They usually use a convolutional neural network (CNN) to design an end-to-end architecture with an encoder and a decoder. To further improve robustness, several improvements have been proposed. These improvements can be categorized into two classes, i.e., attacking simulation and model architecture design [10]. A summary of different watermarking algorithms is given in Table 1.

1.1. Attacking Simulation

Zhu et al. [11] were the first to propose a robust watermarking network, HiDDeN, with an attacking simulator. The attacking simulator was inserted into the network to allow end-to-end training. However, HiDDeN can only be robust to a single attack, such as JPEG compression, Gaussian blur, crop, or dropout. Then, Mellimi et al. [12] and Ahmadi et al. [13] improved the attacking simulator to resist combined attacks. Since the JPEG compression attack is nondifferentiable, some works [14–16] focused on JPEG compression simulation and improved the JPEG simulator with various differentiable methods to enhance the robustness against JPEG compression.

1.2. Model Architecture Design

Dhaya et al. [17] proposed a lightweight convolutional neural network (LW-CNN) for digital watermarking, which was more resilient than other standard approaches. Fang et al. [18] exploited a template-based approach combined with U-Net to achieve better robustness. Cun et al. [19] used SplitNet and RefineNet to smooth watermarked regions for a better quality of watermarked images. Mun et al. [20] introduced an attention mechanism into the watermarking field to achieve good robustness against attacks. In addition, some notable algorithms with adversarial training [21–23] have greatly improved the perceptual quality of the watermarked images.

However, these existing CNN-based robust watermarking algorithms focus on attacking simulation and model architecture design before the watermark extraction. They do not consider prior knowledge to guide the watermark embedding. To further balance invisibility and robustness, and motivated by the traditional algorithms, some prior knowledge, such as the chrominance and edge saliency of cover images, is considered before the watermark embedding. The major contributions are as follows.

(1) We propose a prior knowledge extractor to obtain the chrominance and edge saliency of cover images for guiding watermark embedding.

(2) We propose an embedding-guided end-to-end framework for robust watermarking based on the proposed prior knowledge extractor and attacking simulator.

(3) We conduct extensive experiments to evaluate the performance of the proposed algorithm in terms of invisibility and robustness. Experimental results demonstrate that our algorithm achieves a good balance between invisibility and robustness and performs better than state-of-the-art algorithms.

2. Methods

In this section, the proposed framework is described in detail. The overall architecture and loss functions are presented in subsection 2.1. Then, each module is explained in subsections 2.2–2.6 one by one, i.e., prior knowledge extractor, encoder, attacking simulator, decoder, and discriminator.

2.1. Model Architecture

The main framework is presented in Figure 1. As shown in Figure 1, the proposed model is based on an autoencoder structure, which consists of four modules: a prior knowledge extractor, an encoder, an attacking simulator, and a decoder. The prior knowledge extractor obtains prior knowledge to modify cover images for guiding watermark insertion. After that, the encoder hides the watermark in the modified cover image. Then, the attacking simulator, acting as a network layer, performs various simulated attacks on the encoded images. Finally, the decoder extracts the watermark from the attacked (or unattacked) encoded images. These modules achieve their objectives through the following loss functions.

The encoder aims to insert the watermark into the cover image invisibly. So, the distortion loss is used to limit the distortion of the encoded image by (1), where Ico and Ien represent the cover image and encoded image, respectively.

The decoder aims to recover the watermark from the encoded images as completely as possible. So, the reconstruction loss is adopted to improve the quality of the extracted watermark by (2), where M and Mout are the original watermark and extracted watermark, respectively.

The discriminator is used to judge whether the generated images are similar enough to the cover images. The discriminator and encoder compete with each other. So, the adversarial loss is considered to optimize the visual quality of the encoded image by (3), where D represents the discriminator.

Therefore, the total loss for the proposed framework is given by (4), where α, β, and γ are three hyper-parameters.
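As a rough illustration of how the three loss terms could be combined, the following NumPy sketch may help. It is not the formulation of this paper: the mean squared error form of the distortion and reconstruction losses, the log-based adversarial term, the pairing of α, β, and γ with the three losses in the order they are introduced, and all function names are assumptions made only for this example. The argument `d_encoded` is a hypothetical array holding the discriminator's probability that each encoded image is a clean cover image.

```python
import numpy as np

def distortion_loss(cover, encoded):
    # Assumed form: mean squared error between cover and encoded images.
    return float(np.mean((cover - encoded) ** 2))

def reconstruction_loss(wm, wm_out):
    # Assumed form: mean squared error between original and extracted watermarks.
    return float(np.mean((wm - wm_out) ** 2))

def adversarial_loss(d_encoded, eps=1e-12):
    # Assumed form: the encoder is penalized when the discriminator assigns a
    # low "clean cover" probability to encoded images.
    return float(np.mean(-np.log(d_encoded + eps)))

def total_loss(cover, encoded, wm, wm_out, d_encoded,
               alpha=0.3, beta=0.7, gamma=0.001):
    # Weighted sum of the three terms; the weight-to-term pairing is an assumption.
    return (alpha * distortion_loss(cover, encoded)
            + beta * reconstruction_loss(wm, wm_out)
            + gamma * adversarial_loss(d_encoded))
```

With identical cover and encoded images, a perfectly recovered watermark, and a discriminator output of 1, the sketch returns a value of essentially zero, which is the behaviour one would expect from any reasonable choice of the three terms.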

2.2. Prior Knowledge Extractor Module

Most existing deep learning-based algorithms use the autoencoder to insert the watermark automatically and do not use prior knowledge to guide the watermark embedding. According to the human visual system (HVS), people are less sensitive to modifications in regions with rich chrominance and edge information [24–29]. So, the chrominance and edge saliency proposed in [30] are taken as prior knowledge in this paper. The cover image is modified before watermark insertion to make the watermarking robust. Figure 2 depicts the flow diagram of the proposed prior knowledge extractor.

To obtain the chrominance information of the cover images, the cover image is first converted into the YCbCr color space by (5), where Y represents the luminance component and Cb and Cr represent the chrominance components.
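The color space conversion can be sketched as follows. The coefficients below are the full-range BT.601 values used by JPEG/JFIF; whether this paper uses exactly this variant is an assumption, and the function name `rgb_to_ycbcr` is illustrative only.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an H x W x 3 RGB array with values in [0, 255] to YCbCr.

    Assumption: full-range JPEG/JFIF (BT.601) coefficients.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    matrix = np.array([
        [0.299, 0.587, 0.114],           # Y  (luminance)
        [-0.168736, -0.331264, 0.5],     # Cb (blue-difference chrominance)
        [0.5, -0.418688, -0.081312],     # Cr (red-difference chrominance)
    ])
    offset = np.array([0.0, 128.0, 128.0])
    # Apply the 3 x 3 matrix to the last (channel) axis of every pixel.
    return rgb @ matrix.T + offset
```

For a gray pixel (R = G = B), the sketch yields Cb = Cr = 128, that is, no chrominance, which matches the intuition that chrominance saliency should be driven by colorful regions.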

Then, the chrominance saliency SC(x) of a point x is obtained by (6), where fb(x) and fc(x) are the normalization mappings of the Cb and Cr components, respectively, and δ is a parameter set as 0.25 in this paper.

To obtain the edge information of the cover images, the Canny operator [31] is used. The edge saliency SE(x) of a point x is computed by (7), where Canny(x) represents the result calculated by the Canny operator for a given point x and τ is a threshold set as 2 in this paper.

Finally, the stronger the chrominance and edge saliency are, the less sensitive the human eye is to modification. So, the cover image is modified by (8), where Ico is the original cover image after normalization and Iin is its modified version. According to (8), the greater the chrominance and edge saliency, the smaller the modification of the cover pixel and, consequently, the relatively greater the change that can be made to that pixel during watermark insertion.

2.3. Encoder Module

The architecture of the encoder network is illustrated in Figure 3. As shown in Figure 3, the encoder network has two parallel branches corresponding to the cover image and the watermark image, respectively. One branch uses several convolutional layers to extract shallow detail features and deep semantic features of the normalized input watermark images. The other branch uses a sequence of convolutional layers to extract features of the input cover image, which are merged with the features extracted from the watermark image. Specifically, to embed watermark images into cover images, the encoder concatenates the feature maps extracted from each alternate layer of the watermark branch with the corresponding output features of the cover branch. As in [32], this concatenation is repeated four times. Finally, the cover image and watermark are fully fused into the encoded image.

2.4. Attacking Simulator Module

To be robust against a variety of image distortions, as shown in Figure 1, an attacking simulator is inserted between the encoder and decoder to simulate various attacks by differentiable methods. Its parameters do not need to be updated during network training. Note that each iteration randomly selects one type of attack with equal probability. Specifically, following [33] and as shown in Figure 4, our attacking simulator includes four types of attacks: Gaussian blur, crop, JPEG compression, and dropout.
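The per-iteration choice of attack described above can be sketched in a few lines. The attack names in `ATTACKS` and the function `pick_attack` are placeholders introduced for this illustration, not identifiers from the paper.

```python
import random

# Hypothetical labels for the four simulated attacks.
ATTACKS = ("gaussian_blur", "crop", "jpeg", "dropout")

def pick_attack(rng=random):
    # Each training iteration draws one attack type with equal probability.
    return rng.choice(ATTACKS)
```

Passing a seeded `random.Random` instance as `rng` makes the sequence of attacks reproducible across training runs.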

2.4.1. Gaussian Blur

Gaussian blur is also called Gaussian smoothing. It blurs the encoded images by performing a convolution operation with a Gaussian kernel. The larger the size of the convolution kernel, the stronger the blur attack.
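A minimal NumPy sketch of this attack on a single-channel image is given below. The standard deviation `sigma`, the edge-replicating border handling, and the restriction to odd kernel widths are assumptions; the text only specifies the kernel width.

```python
import numpy as np

def gaussian_kernel(width=3, sigma=1.0):
    # Normalized 2-D Gaussian kernel built from a 1-D profile.
    ax = np.arange(width) - (width - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def gaussian_blur(img, width=3, sigma=1.0):
    """Blur a 2-D image with an odd-width Gaussian kernel.

    Assumption: borders are handled by replicating edge pixels.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    img = np.asarray(img, dtype=np.float64)
    kernel = gaussian_kernel(width, sigma)
    pad = width // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Accumulate shifted copies of the image weighted by the kernel entries.
    for dy in range(width):
        for dx in range(width):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out
```

Because the operation is a fixed linear filter, it is differentiable with respect to the input image, which is what allows it to sit between the encoder and decoder during training.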

2.4.2. Crop

Crop operation is simulated by randomly cropping out a small rectangle from the encoded images, namely, by replacing all the pixel values in this rectangle with zero. Specifically, the attack is simulated by multiplying with a 0–1 mask of the same size as the encoded image. In this mask, the region with pixel value 0 represents the cropped region, while the region with pixel value 1 represents the remaining region.
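The mask-based crop can be sketched as follows. Interpreting the ratio p as the fraction of the image area that is zeroed, and giving the rectangle the same aspect ratio as the image, are assumptions made for this example; `crop_attack` is an illustrative name.

```python
import numpy as np

def crop_attack(img, ratio=0.1, rng=None):
    """Zero out a random rectangle covering roughly `ratio` of the image area."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # Assumption: the rectangle keeps the aspect ratio of the image.
    ch = max(1, int(round(h * np.sqrt(ratio))))
    cw = max(1, int(round(w * np.sqrt(ratio))))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    # 0-1 mask: 0 marks the cropped region, 1 marks the remaining region.
    mask = np.ones((h, w), dtype=np.float64)
    mask[top:top + ch, left:left + cw] = 0.0
    if img.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over the channel axis
    return img * mask, mask
```

For a 128 × 128 image and ratio 0.1, this sketch zeroes a 40 × 40 rectangle, slightly under 10% of the pixels.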

2.4.3. JPEG Compression

JPEG compression consists of color space transformation, discrete cosine transform, quantization, and entropy coding. The sampling and discrete cosine transform steps are modeled by a max-pooling layer and a convolution layer, respectively. In particular, as shown in Figure 5, the nondifferentiable quantization step is approximated by applying a JPEG-mask to the feature maps [11].
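To illustrate the idea of replacing quantization with a mask, the sketch below zeroes all but the low-frequency coefficients of each 8 × 8 DCT block. It computes the block DCT with an explicit orthonormal DCT matrix rather than a convolution layer, and the choice of keeping a 5 × 5 low-frequency corner is an assumption; neither detail is taken from this paper.

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: row k is the k-th cosine basis vector.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def jpeg_mask(channel, keep=5, block=8):
    """Keep only the keep x keep low-frequency DCT coefficients of each block."""
    channel = np.asarray(channel, dtype=np.float64)
    h, w = channel.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    c = dct_matrix(block)
    mask = np.zeros((block, block))
    mask[:keep, :keep] = 1.0
    out = np.empty((h, w), dtype=np.float64)
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeffs = c @ channel[y:y + block, x:x + block] @ c.T   # forward 2-D DCT
            out[y:y + block, x:x + block] = c.T @ (coeffs * mask) @ c  # inverse 2-D DCT
    return out
```

Every step is a linear map, so gradients pass through the masked coefficients unchanged, whereas true JPEG quantization involves rounding and has zero gradient almost everywhere.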

2.4.4. Dropout

Dropout attack is a common noise in image processing. It is implemented by arbitrarily replacing a certain ratio of pixels with zero. The detailed processing is similar to crop attack by multiplying with a 0–1 mask. The difference is that the pixel values 0 and 1 are randomly distributed in the mask.
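A sketch of the dropout attack with a randomly distributed 0–1 mask is shown below. Drawing each mask entry independently, so that the zeroed fraction only approximates the ratio p, is an assumption of this example.

```python
import numpy as np

def dropout_attack(img, ratio=0.15, rng=None):
    """Replace roughly `ratio` of the pixels with zero using a random 0-1 mask."""
    rng = np.random.default_rng() if rng is None else rng
    # Each pixel is dropped independently with probability `ratio` (assumption).
    mask = (rng.random(img.shape[:2]) >= ratio).astype(np.float64)
    if img.ndim == 3:
        mask = mask[..., None]  # drop all channels of a pixel together
    return img * mask
```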

2.5. Decoder Module

In the end-to-end training, the decoder carries out the decoding procedure after encoding or attacking. The structure of the decoder is shown in Figure 6. The decoder takes the encoded or attacked image as input and extracts the watermark image. It uses seven Conv-BN-ReLU blocks to extract the watermark image from the input image. In this process, the function of convolutional operation is to extract features, and batch normalization (BN) speeds up the calculation while ReLU activation plays the filtering role. The final convolutional layer with a 3 × 3 kernel outputs watermark images.

2.6. Discriminator Module

The primary role of the discriminator is to improve the visual similarity between the encoded and cover images by adversarial training. The architecture of the discriminator is presented in Figure 7. It is similar to that of the decoder. The difference is that the discriminator outputs binary classification results to judge whether the image contains the watermark or not. Therefore, the discriminator is built with five Conv-BN-ReLU blocks, an adaptive average pooling layer, a linear layer with a single output unit, and a Sigmoid activation layer.

3. Experimental Results and Analysis

In this section, experiments are carried out to prove the effectiveness and robustness of the proposed algorithm. The training datasets and experimental details are described in subsection 3.1. Then, the ablation experiments in subsection 3.2 are performed to demonstrate the improvements in the proposed algorithm. Finally, the robustness of the model for different types of attacks is tested in subsection 3.3.

3.1. Experimental Datasets, Implementation Details, and Evaluation Metrics
3.1.1. Experimental Datasets

5,000 images randomly selected from the COCO dataset [34] are used as cover images. Three types of images are taken as watermark images for the experiments: 5,000 logo images randomly selected from Logo-2K+ [35], 5,000 digit images from MNIST [36], and 5,000 general images from ImageNet [37]. These watermarks are converted into grayscale images before embedding. For each watermark type, the 5,000 cover images and the 5,000 watermark images form 5,000 pairs for the following experiments. Then, the cover images and watermark images are each divided into training/validation/testing sets according to the ratio of 8:1:1 and resized to 128 × 128.
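The 8:1:1 split of the 5,000 pairs can be sketched as follows. Shuffling with a fixed seed before splitting is an assumption, since the text does not say how the pairs are assigned to the three sets; `split_pairs` is an illustrative name.

```python
import random

def split_pairs(n_pairs=5000, ratios=(8, 1, 1), seed=0):
    """Split pair indices into training/validation/testing index lists."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    total = sum(ratios)
    n_train = n_pairs * ratios[0] // total
    n_val = n_pairs * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

With the default arguments, the three lists contain 4,000, 500, and 500 indices, respectively.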

3.1.2. Implementation Details

The proposed watermarking model is trained iteratively using the Adam optimizer [38] with an initial learning rate of 1.0e-3. The batch size is set as 16. The weights in the loss function shown in (4) are set as α = 0.3, β = 0.7, and γ = 0.001. In addition, each simulated attack has a hyperparameter governing its strength: the kernel width ω of Gaussian blur is 3; the quality factor QF of JPEG compression is 90; and the ratios p of crop and dropout are 0.1 and 0.15, respectively.

3.1.3. Evaluation Metrics

The visual quality of images is commonly evaluated by the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). Their definitions are given in the following.

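Given two images U and V, the PSNR can be defined as

$$\mathrm{PSNR}(U, V) = 10 \log_{10} \frac{L^{2}}{\mathrm{MSE}(U, V)},$$

where L is the maximum pixel value, which is usually set as 255, and MSE is the mean squared error defined as

$$\mathrm{MSE}(U, V) = \frac{1}{n} \sum_{i=1}^{n} \left(U_i - V_i\right)^{2},$$

where n is the number of pixels and U_i and V_i are the i-th pixel values of U and V.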

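The SSIM between two images U and V is defined as

$$\mathrm{SSIM}(U, V) = \frac{\left(2\mu_U \mu_V + C_1\right)\left(2\sigma_{UV} + C_2\right)}{\left(\mu_U^{2} + \mu_V^{2} + C_1\right)\left(\sigma_U^{2} + \sigma_V^{2} + C_2\right)},$$

where μU and μV are the means, σU and σV are the standard deviations, σUV is the cross-covariance of U and V, and C1 and C2 are two constants used to avoid a null denominator.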
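The two metrics can be computed along the following lines. The SSIM here uses statistics of the whole image rather than the sliding-window average found in common implementations, and the constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 are the widely used defaults; both choices are assumptions, as the text does not specify them.

```python
import numpy as np

def psnr(u, v, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    mse = np.mean((u - v) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(u, v, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM computed from global image statistics (no sliding window)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    c1 = (k1 * max_val) ** 2  # assumed default constant
    c2 = (k2 * max_val) ** 2  # assumed default constant
    mu_u, mu_v = u.mean(), v.mean()
    var_u, var_v = u.var(), v.var()
    cov_uv = np.mean((u - mu_u) * (v - mu_v))
    numerator = (2 * mu_u * mu_v + c1) * (2 * cov_uv + c2)
    denominator = (mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2)
    return float(numerator / denominator)
```

As a sanity check, two 8-bit images that differ by exactly one gray level at every pixel have an MSE of 1 and therefore a PSNR of 10 log10(255^2), about 48.13 dB.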

3.2. Ablation Experiments

Here, some ablation experiments are conducted to validate the proposed model. All the experiments are performed under the combined attacks with all four different types of attacks.

First, we analyze the contribution of the prior knowledge extractor. Table 2 shows the average PSNR and SSIM values of 5,000 encoded images and 5,000 extracted watermarks with and without the extractor. As the results show, the visual quality of both the encoded images and the extracted watermarks is improved after introducing the prior knowledge extractor. This is because the extractor obtains prior knowledge to find more suitable locations for watermark embedding.

Then, we verify the effectiveness of the attacking simulator by comparing the proposed model trained without and with the attacking simulator. The results are shown in Table 3. When the attacking simulator is used, the quality of the encoded images is sacrificed slightly, but the quality of the extracted watermarks improves significantly: the PSNR of the extracted watermarks increases from 24.12 dB to 38.19 dB and the SSIM from 0.7824 to 0.9722.

The experimental results in Tables 2 and 3 show that both the prior knowledge extractor and the attacking simulator are important for robust watermarking.

3.3. Comparison Experiments with Other Algorithms

To further evaluate the performance of the proposed algorithm, it is compared with some existing deep learning-based algorithms [39–41] in terms of both invisibility and robustness.

3.3.1. Invisibility

The challenge for digital watermarking is to improve robustness while keeping invisibility. Figure 8 shows a visual comparison of different watermarking algorithms, and Table 4 presents the corresponding PSNR and SSIM values. It can be observed from Figure 8 and Table 4 that, for the proposed algorithm, the watermarks are invisible in the encoded images, with high PSNR and SSIM values, whereas the other three algorithms suffer from a slight color bias. This is due to the use of prior knowledge for guiding watermark insertion in our algorithm.

3.3.2. Robustness

To test the robustness, the encoded images are subjected to five different types of attacks. Table 5 presents the average PSNR and SSIM values of 5,000 encoded images and 5,000 watermark images for the four compared algorithms. In addition, Figure 9 shows some visual samples of the extracted watermarks. It can be observed from Table 5 and Figure 9 that the proposed algorithm achieves the best performance for all five types of attacks, both numerically and visually, especially for the combined attack. Although the encoded images are distorted under the various attacks, our algorithm preserves watermark fidelity to a great extent, with few errors. This is not the case for the other three algorithms, whose extracted watermarks contain errors and visible noise. The difference is attributable to the guidance of prior knowledge and the use of the attacking simulator in our algorithm. Regarding the three types of watermark images, all the compared algorithms perform best on the MNIST watermarks. The main reason is that the other two types of watermark images are more complex and contain more semantic information, which makes watermark extraction more difficult.

In addition, we evaluate the generalization performance of different watermarking algorithms against attacks different from those in the training stage in two aspects, i.e., different attack levels and different attack types.

3.3.3. Different Attack Levels

Figure 10 shows the average PSNR values of the extracted watermark images under different attack levels. As shown in Figure 10, our algorithm still performs better than the other three algorithms under different levels of the various attacks. In addition, the performance of all four compared algorithms decreases as the attack level increases.

3.3.4. Different Attack Types

To evaluate the performance in resisting attacks that were not considered during the training stage, we select four kinds of black-box image attacks (resizing, median blur, salt and pepper noise, and Gaussian noise) to test the model. The levels of these attacks are as follows: the scaling factor T of resizing is 2; the kernel width ω of median blur is 3; the ratio p of salt and pepper noise is 0.2; and the standard deviation σ of Gaussian noise is 1.0. Table 6 shows the average PSNR values of the watermark images extracted by the different algorithms. As can be seen from Tables 5 and 6, the proposed algorithm still maintains higher PSNR values than the other three algorithms, though its performance decreases when facing attacks different from those in the training stage.
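As an example of one of these unseen attacks, salt and pepper noise can be sketched as below. The sketch assumes images normalized to [0, 1] and an equal share of salt (maximum value) and pepper (minimum value) pixels; these details and the function name are assumptions, since the text only gives the ratio p.

```python
import numpy as np

def salt_and_pepper(img, ratio=0.2, rng=None, low=0.0, high=1.0):
    """Set roughly `ratio` of the pixels to `low` or `high` at random."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(img, dtype=np.float64, copy=True)
    r = rng.random(out.shape[:2])
    # Assumption: half of the corrupted pixels become pepper, half become salt.
    out[r < ratio / 2] = low
    out[(r >= ratio / 2) & (r < ratio)] = high
    return out
```

Unlike the attacks in the training-time simulator, this operation replaces pixel values outright and is only applied at test time, so its lack of a useful gradient is not an issue.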

4. Conclusion

In this paper, we propose an embedding-guided end-to-end framework for robust image watermarking. In this algorithm, a prior knowledge extractor and an attacking simulator are introduced to guide watermark embedding and to enhance the robustness of watermark extraction, respectively. The experimental results demonstrate that, compared with the existing algorithms, the proposed algorithm performs better in both invisibility and robustness. However, the proposed algorithm does not consider other attacks that are common in practical applications, such as printing, screen photography, and geometric transformation. Therefore, in the future, we will focus on simulating these attacks and study deep learning-based watermarking algorithms that resist them.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Open Fund of Advanced Cryptography and the System Security Key Laboratory of Sichuan Province (Grant no. SKLACSS-202113), the National Natural Science Foundation of China (Grant no. 62072251), and the PAPD fund.