Abstract

We present a scheme for commutative watermarking-encryption (CWE) of audio data and demonstrate its robustness against an important class of attacks, Time-Scale Modifications (TSM). In addition, we show how the proposed CWE scheme can be integrated into a cryptographic protocol enabling public verification of the embedded mark without disclosing the mark or the watermarking key used for embedding.

1. Introduction

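Commutative watermarking-encryption (CWE) means the combination of watermarking and encryption in such a way that the encryption and watermarking functions commute; that is,

$$\mathcal{M}_k(\mathcal{E}_K(X), m) = \mathcal{E}_K(\mathcal{M}_k(X, m)), \quad (1)$$

where $\mathcal{E}$ is the encryption function, $\mathcal{M}$ is the watermarking (embedding) function, $K$ is the encryption key, $k$ is the watermarking key, $X$ is the cleartext media data, and $m$ is the mark to be embedded.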

If encryption and watermarking commute, their combination can serve as an important building block within a Digital Rights Management (DRM) system, as detailed further in Section 2. In the present paper, an existing CWE concept for still images [1] is extended to audio files. To the best of our knowledge, this is the first CWE scheme for audio files to appear in the literature. In addition, we show that the presented CWE scheme can be integrated into a modified version of a protocol due to Craver and Katzenbeisser [2], enabling zero-knowledge verification of the watermark, meaning that a verifier can verify the presence of a watermark without disclosure of the mark or the watermarking key.

The rest of the paper is organized as follows: in Section 2, we motivate the need for CWE schemes and identify some basic requirements. In Section 3, we briefly review existing CWE schemes for still images and encryption/watermarking techniques for audio files, with a special emphasis on those algorithms using techniques similar to ours. In Section 4, we present our CWE scheme in detail. Section 5 provides experimental results on the robustness and fidelity of the watermarking part. Section 6 presents the integration of the CWE scheme into a zero-knowledge protocol for verifying the mark, and Section 7 concludes the paper.

2. Motivation for CWE

The concept of commutative watermarking-encryption (CWE) was first discussed in [3] with a special emphasis on watermarking in the encrypted domain. From the left-hand side of (1) it is clear that the watermarking function must be able to act in the encrypted domain, which means that only a limited set of audiovisual features (if any) is available to the embedder and can be used to embed the mark.

2.1. Dispute Resolve Protocols

The prime motivation to look at CWE schemes originates from the need to implement so-called Dispute Resolve Protocols, in which a rights owner provides a digital media object to a distributor, who in turn sells it to some customer. In this scenario, a number of attacks are possible, most importantly the case where one of the parties sells a copy of the media object in his own right. In particular, if such a copy is detected, the Dispute Resolve Protocol must be able to identify the rightful owner of the media object and to identify the offending party.

An obvious solution is that the rights owner embeds a watermark identifying the owner into the media object and provides the marked object to the distributor. The distributor in turn marks the object for each customer with an additional watermark uniquely identifying that customer. Unfortunately, in this scenario the distributor is able to generate identical copies of a marked object and sell them to other customers. If these copies are marked with the identifier of some specific customer, the distributor can repudiate having generated the copies, and that customer could be held responsible for the offence of the distributor.

The basic problem here is that the distributor has access to the marked object in plaintext. If a CWE scheme is available, however, the following protocol between a generic seller and a generic buyer becomes possible, as proposed in [4]:
(1) The seller encrypts the media object with her symmetric key.
(2) The seller sends the encrypted object to the buyer, together with an individual mark that the buyer is to embed into it.
(3) The buyer embeds the mark into the encrypted object and encrypts the result with his own key. The final result is sent to the seller.
(4) The seller verifies that the received object contains the mark as watermark. If the verification is successful, the seller removes her own encryption and sends the result back to the buyer.
(5) The buyer removes his encryption from the received object and is in possession of the individually marked object.

If the distributor takes the role of the seller in this protocol and the rights holder performs the encryption and decryption operations in steps (1) and (4), respectively, the problem mentioned above can be solved, provided a CWE scheme for the media object is available. The need for a CWE scheme becomes obvious in steps (3) and (4), where an encrypted media object is watermarked and the presence of a watermark is verified in an encrypted object, respectively. Moreover, steps (3) and (4) call for a public key watermarking scheme, where there is a private embedding key and a public detection key, or an asymmetric scheme, where it is possible to verify the existence of a watermark without fully disclosing the embedding key or the watermark itself.

2.2. DRM Systems

In Digital Rights Management (DRM) Systems [5], encryption and watermarking are often combined in a natural way: the media data are transferred to a compliant media player in encrypted form, so that access to the plaintext data happens only under control of the compliant player. In addition, watermarks are embedded into the media data to have an additional layer of protection which is present even after the data have been decrypted. These watermarks can be used to claim copyright, enforce copying restrictions, or track illegal copies offered on the Internet. If a CWE scheme is used, compliant media players have the opportunity to detect and insert watermarks even in encrypted data. More generally, it should be possible to protect multimedia data throughout the distribution chain in a flexible way by allowing the encryption and watermarking operations to commute [6].

2.3. Searching in Encrypted Databases

With the advent of cloud computing, new security challenges have arisen. For example, cloud computing clients need to secure their data, not only to protect their data from public attacks, but also to protect their data from their cloud service provider [7]. Thus, clients need to encrypt their data in the cloud. On the other hand, a cloud service provider or a client often has the need to search through the client data according to certain metadata or tags. It is therefore highly desirable to provide techniques which can protect the clients’ privacy and offer a large amount of accessibility at the same time. CWE schemes can provide such a solution, if metadata are used as watermarks and embedded into the encrypted data.

3.1. CWE Schemes for Image Data

To the best of our knowledge, no CWE schemes for audio data have been proposed so far. However, there have been a number of attempts aimed at still images, of which we only review the so-called invariant encryption approach, as it is also used in our audio CWE scheme. For a more comprehensive review of existing CWE schemes for still images, see [8].

The invariant encryption approach to CWE as introduced in [1] is to encrypt the media data completely (as opposed to the partial encryption approach, which leaves part of the data unencrypted to host the watermark), but to use a weaker cipher that leaves a feature space of the media data invariant. This invariant feature space can be used to embed a watermark. For example, a permutation cipher can be used for encryption, leaving the global first-order statistics of the image untouched. The invariant feature space is therefore represented by the image histogram and a histogram-based algorithm can be used to embed the mark. The advantage of the invariant encryption approach is that all media data are encrypted (and not just a subset as in partial encryption schemes). The disadvantage, besides using a weaker cipher, is an inherent lack of robustness of the watermark.
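The histogram invariance can be checked numerically. The following minimal Python sketch uses a randomly generated 8-bit test image and a random NumPy permutation as stand-ins (both are assumptions made for illustration, not the cipher of [1]) and confirms that permuting positions leaves the histogram unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8-bit "image", flattened to a vector of pixel values.
pixels = rng.integers(0, 256, size=64 * 64)

# A permutation cipher only moves values to new positions.
perm = rng.permutation(pixels.size)
encrypted = pixels[perm]

# First-order statistics (the histogram) are identical before and after.
hist_plain = np.bincount(pixels, minlength=256)
hist_encrypted = np.bincount(encrypted, minlength=256)
histogram_invariant = bool(np.array_equal(hist_plain, hist_encrypted))
print(histogram_invariant)
```

Any watermark that is a function of the histogram alone can therefore be read from the plaintext and from the ciphertext alike.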

3.2. Audio Ciphering Methods

Let the audio signal consist of a sequence of sample values. Most existing audio ciphering methods, such as [9] or [10], substitute the audio sample values and change (i.e., flatten) the global histogram of the amplitudes of the sample values. The flattening of the histogram makes it impossible to use the histogram for embedding a watermark. In [11], however, a permutation cipher is used to permute the sample values in the time domain, thereby keeping the histogram invariant. This shows that it is possible to transfer the invariant encryption approach to audio data.

3.3. Audio Watermarking Methods

From the host of existing audio watermarking methods (see [12] for an overview), the method by Xiang et al. [13] is the most important for our work, as it uses (a part of) the amplitude histogram for embedding the mark. The range of the audio sample values is split into equal-sized bins. The amplitude histogram is then a vector whose $i$th component $h_i$ denotes the number of samples falling into the $i$th bin. The relevant part of the amplitude histogram consists of the bins covering an interval that is determined by the mean value of the absolute amplitude values and some fixed parameter. This condition makes sure that the bins in the relevant part of the histogram are "well filled." To embed a watermarking bit, a triple of consecutive histogram bins is used. For one bit value, a relation between the three bin heights and a predefined threshold value should hold. If the relation is not satisfied by the three bins, a certain number of samples is shifted from the first and third bin of the triple into the second bin by adding and subtracting, respectively, a bin width to the samples. An analogous process is carried out for the other bit value.

As embedding the mark has altered the mean value of the amplitude values, the correct mean value has to be searched for within a search space during extraction. For each candidate mean value in the search space, the corresponding histogram is formed and the distance between the first extracted bits and a known synchronization sequence sync is computed. The mean value associated with the minimum distance is used to extract the remaining watermark bits.

The described synchronization process helps to make the watermark robust against TSM attacks (cf. Section 5.2). Although the watermarking scheme is based on the histogram, it cannot be used in conjunction with a permutation cipher to form a CWE scheme, because only a certain number of sample values in a histogram bin are modified. Therefore, after application of the permutation cipher, different sample values than before are modified, which destroys the commutativity property. Moreover, the scheme by Xiang et al. does not use a secret watermarking key.

4. The Proposed CWE Scheme

The proposed scheme is based on the earlier ideas [1, 13] described in Section 3. In order to apply them in the audio domain and in order to make the overall scheme more robust to TSM attacks, some modifications were necessary, which are described in the following paragraphs.

4.1. Ciphering Algorithm

An analogue audio signal is transferred into the digital domain by sampling the time-continuous signal at a certain discrete sampling rate. At the same time, the obtained samples are quantized according to the bit depth available, the result being a set of sample values $x_t$, where $t$ can be seen as a discrete time coordinate. Common bit depths for representing audio are 16, 20, or 24 bits. The general idea is to permute the discrete points in time, while leaving the sample values untouched.

In order to generate the permutations, the discrete version of Arnold's Cat Map [14] was used, because it is a well-known chaotic map used by many authors for generating permutations in image ciphering (see, e.g., [15]). The discrete Cat Map is a two-dimensional map defined on a square grid of size $N \times N$ by

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & a \\ b & ab+1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \bmod N,$$

where $a$ and $b$ are parameters that can serve as the secret key if the function is used for encryption purposes. Two-dimensional permutations of the square grid can be quickly generated by repeated application of the Cat Map. Note, however, that there are only $N^2$ different keys. Therefore, it has been proposed in [16] to change the secret parameters in every iteration of the Cat Map.

In order to apply the Cat Map to a discrete audio signal of length $L$, the audio signal is rearranged into a square grid of size $N \times N$ with $N = \lceil \sqrt{L} \rceil$. If $L$ is not a square number, the signal is padded with random sample values having the same probability distribution (i.e., the same histogram) as the original signal. This makes sure that the padded values cannot be distinguished from the original values by an attacker. Moreover, the original histogram is largely unchanged by the padding (cf. Figure 1). Figure 2 shows the effect of the Cat Map after five iterations on the waveform of an example signal. The resulting PSNR between original and enciphered signal is 16.47 dB.
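The ciphering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the generalized Cat Map matrix with entries 1, a, b, ab+1 is assumed, the parameters are kept fixed across iterations (rather than changed per iteration as suggested in [16]), padding values are drawn by resampling the signal itself, and the function names `cat_map_permutation`, `encrypt`, and `decrypt` are hypothetical.

```python
import math
import numpy as np

def cat_map_permutation(n, a, b, rounds):
    """Flat target index of every cell of an n x n grid after `rounds` Cat Map iterations."""
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    for _ in range(rounds):
        # (x, y) -> (x + a*y, b*x + (a*b + 1)*y) mod n; determinant 1, hence a bijection.
        x, y = (x + a * y) % n, (b * x + (a * b + 1) * y) % n
    return (x * n + y).ravel()

def encrypt(samples, a, b, rounds, rng):
    """Pad to a square length, then move every sample to its Cat Map target position."""
    samples = np.asarray(samples)
    n = math.isqrt(len(samples))
    if n * n < len(samples):
        n += 1
    pad_len = n * n - len(samples)
    # Padding is resampled from the signal, so it follows the same histogram (assumption).
    pad = rng.choice(samples, size=pad_len) if pad_len else samples[:0]
    padded = np.concatenate([samples, pad])
    out = np.empty_like(padded)
    out[cat_map_permutation(n, a, b, rounds)] = padded
    return out

def decrypt(cipher, a, b, rounds, length):
    """Undo the permutation and strip the padding."""
    n = math.isqrt(len(cipher))
    return cipher[cat_map_permutation(n, a, b, rounds)][:length]
```

For example, `decrypt(encrypt(signal, 3, 5, 5, rng), 3, 5, 5, len(signal))` returns the original signal, while the enciphered array holds the same values (plus padding) at scrambled time positions.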

4.2. Watermarking Algorithm
4.2.1. Basic Principles

The design goals for the watermarking algorithm to be used within our proposed CWE scheme were as follows:
(i) The watermarking algorithm should commute with the permutation cipher in the sense of (1).
(ii) It should be robust against Time-Scale Modification (TSM) attacks (see Section 5.2).
(iii) It should be able to use a long watermarking key in order to prevent an attacker from inserting her own watermark.

These goals call for a combination of the watermarking concepts described in [1, 13]: in order to have full commutativity with the permutation cipher, it is necessary to swap entire histogram bins. These swaps can be randomized using a secret watermarking key as described in [1]. However, this procedure can imply a substantial change of the histogram mean. In order to deploy a synchronization procedure for robustness against TSM attacks as described in [13], the original mean needs to be transferred separately to the verifier. Moreover, as a TSM attack may change the height relation of two histogram bins, only bins whose heights are sufficiently different are selected for swapping, where the required difference is set by a strength parameter governing the robustness of the scheme. Note that this relation is unchanged if the two histogram bins are swapped. However, it might be changed by a TSM attack, which could lead to the detector choosing the wrong bin pairs for extracting the mark. Therefore, the bin pairs used to embed the mark now need to serve as watermarking key, as opposed to an initial seed for a pseudorandom number generator as in [1].

4.2.2. Algorithm Details

Embedding. Let the watermark consist of a sequence of bits. The first bits of the watermark are used as synchronization sequence sync and should be known to the detector. As a first step, the embedder generates the amplitude histogram of the audio signal and forms its relevant part as described in Section 3.3. For each watermark bit, the embedder computes a histogram bin pair in the following way:
(i) Generate a random number that selects one of the bins within the relevant part of the histogram.
(ii) Find the correspondingly numbered unused bin within the relevant part. Generate another random number that selects a candidate partner bin.
(iii) If the partner bin has not been used before and the two bin heights are sufficiently different, save the pair. The watermarking bit is now embedded in the following way:
(a) For one bit value, a fixed order relation between the heights of the two bins must hold. If this is not the case, swap the bins by assigning new values to all samples in the bins.
(b) For the other bit value, the opposite relation must hold. If this is not the case, swap the bins by assigning new values to all samples in the bins.
(iv) If the partner bin has been used before or the two bin heights are not sufficiently different, generate a new random number.
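The embedding steps can be sketched in Python as below. This is a simplified illustration, not the authors' code. Several points are assumptions: bit 1 is encoded as "first bin higher than second bin" and bit 0 as the opposite, the "sufficiently different" test is an absolute height difference `min_diff`, integer samples and an integer bin width are used, and the names `bin_index`, `choose_key`, `embed`, and `extract` are hypothetical.

```python
import numpy as np

def bin_index(samples, lo, width):
    """Histogram bin number of every (integer) sample."""
    return (np.asarray(samples) - lo) // width

def choose_key(hist, n_pairs, min_diff, rng):
    """Randomly pick disjoint bin pairs whose heights differ by at least min_diff."""
    unused = list(rng.permutation(len(hist)))
    key = []
    while len(key) < n_pairs and len(unused) >= 2:
        p = unused.pop()
        partner = next(
            (q for q in unused if abs(int(hist[p]) - int(hist[q])) >= min_diff), None
        )
        if partner is not None:
            unused.remove(partner)
            key.append((int(p), int(partner)))
    return key

def embed(samples, key, bits, lo, width):
    """Swap whole bins so that (h_p > h_q) encodes 1 and (h_p < h_q) encodes 0."""
    out = np.array(samples, dtype=np.int64)
    idx = bin_index(out, lo, width)
    for (p, q), bit in zip(key, bits):
        in_p, in_q = idx == p, idx == q
        if (in_p.sum() > in_q.sum()) != bool(bit):
            # Move every sample of bin p into bin q and vice versa.
            out[in_p] += (q - p) * width
            out[in_q] -= (q - p) * width
            idx[in_p], idx[in_q] = q, p
    return out

def extract(samples, key, lo, width):
    """Read one bit per bin pair from the height relation."""
    idx = bin_index(samples, lo, width)
    return [int(np.sum(idx == p) > np.sum(idx == q)) for p, q in key]
```

Because every sample of a bin is moved, the operation depends only on sample values and not on their positions, which is what allows it to commute with a permutation cipher.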

Detection. The detector needs to know the original mean value of the samples of the unmarked cover work, along with the synchronization sequence sync and the sequence of bin pairs used for embedding the mark, which serves as a watermarking key.

As described in [13], for finding the correct mean value after a potential TSM attack, the detector first computes a search space of candidate mean values, where a parameter governs the size of the search space. Now, for each member of the search space, the corresponding relevant histogram part is formed and a candidate synchronization sequence is extracted from it by comparing the heights of the first histogram bin pairs given in the watermarking key. For each histogram part, the Rogers-Tanimoto dissimilarity [17] between the extracted sequence and sync is computed as

$$d = \frac{2\,(c_{10} + c_{01})}{c_{11} + c_{00} + 2\,(c_{10} + c_{01})},$$

where $c_{ij}$ is the number of positions at which a sync bit is $i$ and the corresponding extracted bit is $j$. The histogram part leading to the minimum dissimilarity is used to extract the remaining bits.
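The synchronization search can be sketched as follows. The Rogers-Tanimoto dissimilarity is implemented in its standard form for binary vectors. The helper `pick_mean` and its callback `extract_sync` are hypothetical names, and how the candidate mean values are generated is left to the caller, since it depends on the search-space parameter of the scheme.

```python
import numpy as np

def rogers_tanimoto(u, v):
    """Rogers-Tanimoto dissimilarity of two equal-length, non-empty bit sequences."""
    u, v = np.asarray(u, dtype=bool), np.asarray(v, dtype=bool)
    mismatches = int(np.sum(u != v))   # c_10 + c_01
    matches = int(np.sum(u == v))      # c_11 + c_00
    return 2 * mismatches / (matches + 2 * mismatches)

def pick_mean(candidates, extract_sync, sync):
    """Return the candidate mean whose extracted sync bits are closest to `sync`.

    extract_sync(mean) is assumed to build the relevant histogram part for
    that mean and return the first len(sync) extracted bits.
    """
    return min(candidates, key=lambda mean: rogers_tanimoto(extract_sync(mean), sync))
```

The dissimilarity is 0 for identical sequences and 1 for complementary ones, with mismatches weighted twice as heavily as matches.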

4.3. Security Analysis
4.3.1. Watermarking

By construction, the proposed watermarking algorithm is highly sensitive to histogram-changing modifications of the cover audio files, like random exchange of histogram bins or histogram flattening operations. This kind of operation is able to remove the watermark partly or even completely, as our experiments with StirMark (cf. Section 5.2) have shown.

Unauthorised embedding and detection of the watermark, on the other hand, is difficult without knowledge of the watermarking key. In what follows, we provide a lower bound on the number of possible keys.

The watermarking key consists of bin pairs chosen from the relevant part of the amplitude histogram containing bins. We divide the bins into equal parts of size . Assuming that, in each pair, the first bin comes from a different part and the second bin is chosen in a distance from the first, there are possibilities to choose a single bin pair. Because there are bin pairs and their order is important, we arrive at a bound for the number of keys. Typical parameter choices like and lead to a bound . Note that if the histogram pairs in the key are revealed, but not their order, as in the protocol described in Section 6.2, a watermark length of is still sufficient to provide a key length of about 200 bits.

4.3.2. Permutation Cipher

As mentioned above in Section 4.1, the Cat Map suffers from a low number of possible keys, which can be remedied only if the key parameters are changed from iteration to iteration. Because in principle the required permutations can be generated in a different way, it is more interesting in this context to look at the security of permutation ciphers for audio files in general. In [18] the authors investigated the security of permutation ciphers as applied to images with greyvalues and found that if plaintexts are known, attacks with a complexity are possible, requiring frequent key updates. Applying these results to an audio file of length means that known plaintexts are sufficient to break the cipher, where is the number of possible sample values. Because the bit depth of an audio file is usually higher than that of an image file (16 bits per sample as opposed to 8 bits per pixel), the key for the permutation needs to be updated twice as often as for an image file of comparable size.

5. Experimental Results

The following experiments were carried out with a collection of audio files provided by the European Broadcast Union (EBU) for sound quality assessment. The audio files include artificially generated signals as well as speech, single instruments, and pop music (https://tech.ebu.ch/publications/tech3253).

5.1. Perceptibility

In order to measure the perceptibility of an embedded mark, the Peak Signal-to-Noise Ratio (PSNR) between the marked work and the original work was computed, as is common in the literature. Figure 3 shows the PSNR between original file and marked file for increasing length of the watermark and seven example sound files. The parameter set used for embedding was . According to [19], noise becomes perceptible at a PSNR < 35 dB. Figure 3 therefore shows that, using these parameters, it is possible to embed up to 512 bits without perceptible distortion.
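For reference, the PSNR can be computed as in the following Python sketch. The exact definition used in the experiments is not given above, so the common form based on the mean squared error is assumed, with a peak value of 32767 for 16-bit audio as a further assumption.

```python
import numpy as np

def psnr(original, marked, peak=32767.0):
    """Peak Signal-to-Noise Ratio in dB between two equally long sample arrays."""
    diff = np.asarray(original, dtype=float) - np.asarray(marked, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical signals
    return float(10.0 * np.log10(peak ** 2 / mse))
```

With this definition, a constant error of one tenth of the peak amplitude yields 20 dB, so the 35 dB perceptibility threshold quoted from [19] corresponds to a root-mean-square error of roughly 1.8% of the peak.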

If the histogram bins are broadened (i.e., if the number of bins is decreased), the capacity goes down, as fewer bins are available for embedding, while more samples are affected by the embedding, making the watermark more perceptible, but also more robust. Figure 4 shows the PSNR values for 70 test files for and a 32-bit mark and for and a 56-bit mark, respectively. The solid lines indicate the average PSNR values. The watermarking parameters are chosen in such a way that a good robustness against TSM attacks is achieved (cf. Section 5.2); however, the noise introduced by the watermark is at the border of being noticeable.

5.2. Robustness
5.2.1. TSM Attacks

TSM attacks basically try to desynchronize embedder and detector by compressing or extending the time axis of the audio file. A common requirement is that an audio watermark should be able to survive a rescaling of about 10% [20]. Moreover, the human auditory system is relatively insensitive to TSM attacks, which makes even higher percentages seem realistic. In resample mode, certain audio samples are repeated or removed in order to extend or compress the time axis. In pitch-invariant mode, the speed of the audio file is modified without changing the samples. In order to implement these attacks in practice, the popular open-source tool Audacity V2.1.1 (http://www.audacityteam.org) was used.
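A resample-mode attack can be approximated by the following Python sketch, which repeats or drops samples at regular intervals. This is a crude stand-in for illustration only and does not reproduce the resampling performed by Audacity; the function names are hypothetical.

```python
import numpy as np

def resample_tsm(samples, factor):
    """Extend (factor > 1) or compress (factor < 1) the time axis by repeating or dropping samples."""
    samples = np.asarray(samples)
    new_len = int(round(len(samples) * factor))
    src = np.minimum((np.arange(new_len) / factor).astype(int), len(samples) - 1)
    return samples[src]

def relative_histogram(samples, bins, value_range):
    """Histogram normalized to relative bin heights."""
    hist, _ = np.histogram(samples, bins=bins, range=value_range)
    return hist / hist.sum()
```

Since such an attack changes the number of samples but only reuses existing values, the relative bin heights change little, which is the intuition behind the TSM robustness of histogram-based marks.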

Figure 5 shows the Bit Error Rates (BER) when retrieving watermarks of 32 bit and 56 bit length after TSM attacks. For these parameter choices, the proposed watermarking algorithm is extremely robust against TSM attacks in resample mode, while the robustness against pitch-invariant mode is slightly worse, but still very good. In particular, the required robustness against 10% rescaling is fulfilled. In general, the robustness is quite sensitive to the choice of parameters. In particular, if the number of histogram bins is further increased, the robustness decreases accordingly.

5.2.2. StirMark

In order to evaluate the robustness against general signal manipulations, the well-known benchmarking tool StirMark for Audio, V1.3.2 (https://sourceforge.net/projects/stirMark) was used to simulate common attacks. Although a low number of histogram bins was chosen ( or ) to achieve a higher robustness, the algorithm performed very differently, depending on the type of attack. For example, low-pass filtering or inserting a sine signal into the audio file is able to destroy the mark completely. On the other hand, sample manipulations like inserting zero samples, periodic deletion of samples (cropping, or CutSamples in StirMark), or manipulating the least significant bits (LSBs) of the samples do not affect the watermark. Table 1 (for ) and Table 2 (for ) give the detailed results of our experiments with StirMark. In general, it is hard to devise a CWE algorithm which is robust against a wide class of attacks, because in a CWE scheme the embedder must be able to operate in the encrypted domain and therefore cannot rely upon important perceptual features of the cover work. It is possible, however, to achieve some robustness against a well-defined class of attacks, as the proposed algorithm shows.

5.3. Commutativity with Encryption

The presented CWE scheme is not fully commutative in theory because of the padding needed in the encryption step: as the embedding step changes the original histogram, the padding samples introduced by a mark-then-encrypt operation can be slightly different from the padding in an encrypt-then-mark operation. We tested the influence of this issue by verifying a mark which was embedded into seven test files in the following three scenarios:
(i) The mark is embedded into the encrypted cover work, and then extracted.
(ii) The mark is embedded into the plaintext cover work. The marked cover work is encrypted and the mark is extracted.
(iii) The mark is embedded into the encrypted cover work, then the work is decrypted and the mark extracted.

In all scenarios the mark could be extracted without any errors from all test files. In practice, the number of padding samples is very small compared to the overall number of samples and they are rarely influenced by the watermarking.

Note that the noncommutativity is not intrinsic to the overall scheme, but results from the special way the permutations are generated, namely, by deploying the two-dimensional discrete Cat Map. If the permutation is generated by some alternative mechanism like, for example, the one proposed in [21], which does not require padding, the scheme is fully commutative by construction.
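The padding-free case can be checked with a few lines of Python. The sketch below is an illustration under assumptions: a random NumPy permutation stands in for the cipher, a single bin swap stands in for the embedding of one bit, and `swap_bins` is a hypothetical helper.

```python
import numpy as np

def swap_bins(samples, p, q, lo, width):
    """Move every sample of bin p into bin q and vice versa (integer samples)."""
    idx = (samples - lo) // width
    out = samples.copy()
    out[idx == p] += (q - p) * width
    out[idx == q] -= (q - p) * width
    return out

rng = np.random.default_rng(7)
x = rng.integers(0, 400, size=1000)   # toy signal, 8 bins of width 50
perm = rng.permutation(x.size)        # stand-in permutation cipher, no padding

mark_then_encrypt = swap_bins(x, 1, 6, 0, 50)[perm]
encrypt_then_mark = swap_bins(x[perm], 1, 6, 0, 50)
commutes = bool(np.array_equal(mark_then_encrypt, encrypt_then_mark))
print(commutes)
```

The two results agree because the bin swap acts on each sample value independently of its position, while the cipher acts on positions independently of the values.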

6. Minimum Knowledge Verification of the Mark

The discussion in Section 2 has shown that in order to be useful in a generic buyer-seller protocol, there ought to be a way to verify the watermark without fully disclosing either the mark or the watermarking key. In [2], Craver and Katzenbeisser propose a probabilistic protocol which is in principle able to integrate any symmetric watermarking algorithm. Here, we make a few modifications to this protocol so that it works with the proposed audio watermarking algorithm. These modifications strive to make full use of the special properties of the proposed CWE scheme and are able to eliminate a certain weakness of the scheme by Craver and Katzenbeisser.

6.1. The Protocol by Craver and Katzenbeisser

In this protocol, a prover Alice wants to prove the presence of a watermark to some verifier Bob without disclosing or the watermarking key . The cover work is viewed as an array of samples. The watermark , which has also length , is embedded by the prover using some symmetric watermarking algorithm. The result is the marked work . Alice now generates some secret permutation and publishes along with a random graph having nodes and .

In order to prove the presence of the watermark within the marked work to Bob, Alice and Bob engage in a multistep protocol. Each step consists of the following substeps:
(1) Alice generates two permutations and with the property . Then she computes and .
(2) Alice generates a so-called ownership ticket, where is a secure hash function and and are encrypted versions of and , respectively. Alice sends to Bob.
(3) Bob flips a coin and, depending on the outcome, asks Alice either to decrypt or for him.
(4) If is opened, Bob can compute and and verify the hash values contained in . Having thus verified to be in possession of the correct , Bob goes on to verify that is present within .
(5) If is opened, Bob computes and and verifies the hash values contained in . In this case, Alice's knowledge of is verified.

Craver and Katzenbeisser go on to show that if these steps are repeated times, Alice has probability to fool Bob into believing that her watermark is contained in .

In our opinion, this protocol, while being very ingenious, has two drawbacks: first, the verifier gets to know the marked work together with and can therefore get some information about the secret permutation (the same is true for and , but in this case getting information about means being able to solve an instance of the graph isomorphism problem [22]).

Second, and more importantly, it is not clear how exactly in step (4) the presence of in the scrambled work should be verified without disclosure of the watermarking key.

6.2. The Modified Version

In our modified version of the protocol, we take advantage of the special structure of our watermarking algorithm and strive to eliminate the two drawbacks mentioned above. As in Craver and Katzenbeisser's original protocol, the prover Alice generates a secret permutation and a graph . She marks the cover work with the watermark using the algorithm described in Section 4.2 to get the marked work . She then publishes and the permuted watermarking key , but not or . Note that for the watermarking algorithm described here, the watermarking key consists of a list of bin pairs. If this list is used for extracting the watermark in permuted form, the result will be the permuted watermark. The modified protocol now proceeds in steps. Each step consists of the following substeps:
(1) Alice generates two permutations with the property . Then she computes and .
(2) Alice generates an ownership ticket, where is a secure hash function and are encrypted versions of and , respectively. Alice sends to Bob.
(3) Bob flips a coin and, depending on the outcome, asks Alice either to decrypt or for him.
(4) If is opened, Bob can compute and verify the hash value contained in . Having thus verified to be in possession of the correct , Bob goes on to compute and uses to extract from . He can check the correctness of by verifying the hash value in .
(5) If is opened, Bob computes and verifies the hash value contained in . In this case, Alice's knowledge of is verified.

In this modified version of the protocol, it is not necessary to publish because the permuted mark can be extracted directly from . Moreover, it is clear how to do this without knowledge of the watermarking key, as the permuted key will yield the permuted mark . Another interesting aspect of the modified protocol in connection with the watermarking algorithm described here is that it can be applied to a permuted marked work in exactly the same way, where is some permutation independent of the secret permutation used in the interactive verification protocol.
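The property that a permuted key yields the permuted mark follows from the fact that each bit is read from its own bin pair. A minimal Python sketch, assuming that a bit is 1 when the first bin of a pair is higher than the second, and using made-up bin heights and hypothetical helper names:

```python
import random

def extract_bits(hist, key):
    """One bit per bin pair (p, q): 1 if bin p is higher than bin q (assumed convention)."""
    return [int(hist[p] > hist[q]) for p, q in key]

def permute(seq, rho):
    """Reorder a sequence according to the index permutation rho."""
    return [seq[i] for i in rho]

hist = [40, 12, 7, 25, 31, 9, 18, 3]      # made-up bin heights of a marked work
key = [(0, 1), (2, 3), (5, 4), (6, 7)]    # watermarking key: ordered list of bin pairs

rho = list(range(len(key)))
random.Random(42).shuffle(rho)            # stand-in for Alice's secret permutation

mark = extract_bits(hist, key)
permuted_mark = extract_bits(hist, permute(key, rho))
```

Here `permuted_mark` equals `permute(mark, rho)`, so a verifier holding only the permuted key can recover the permuted mark, but not the order of the original bits.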

The proposed protocol has been implemented and extensively tested in Python. For a 1024 bit watermark, the amount of data to be transferred between prover and verifier is about 4 Kilobytes per step.

A serious drawback of the protocol is the fact that publication of the permuted watermarking key reveals the bin pairs where the watermarking bits are embedded. It is therefore an easy task for an attacker to modify those bins accordingly in order to remove the watermark. By using , the attacker could also embed a permuted mark . The attacker would not, however, be able to prove knowledge of or itself.

7. Conclusion

We have shown that existing approaches for commutative watermarking-encryption for images can be transferred to audio files. While it is hard in this context to define a watermarking algorithm which is generally robust, we could achieve a relatively strong robustness against Time-Scale Modification (TSM) attacks. By introducing some modifications into an earlier interactive verification protocol, it was possible to fit the watermarking algorithm defined here into an interactive verification protocol with minimum knowledge verification, that is, without disclosure of the watermark or the watermarking key. Our further research efforts will focus on the question of how the proposed watermarking algorithm can be made more robust in general.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.