Abstract

We live in a digital era in which communication is largely based on the exchange of digital information over data networks. Communication is often pictured as a sender transmitting a digital file to a receiver. The file travels from a source to a destination and, to have quick and immediate communication, we need an encoding strategy that is efficient and simple, yet secure. Such communication can be organized as a pipeline of two operations applied to the original file: data compression, which provides efficiency, and encryption, which provides security. The two operations are heterogeneous and in some cases conflicting. The aim of this work is to study the combination of compression and encryption techniques in digital documents. In this paper we test combinations of some state-of-the-art compression and cryptography techniques on various kinds of digital data.

1. Introduction

We live in a digital era in which the way we communicate has dramatically changed. We can communicate digital files at any time, on any device, and with anyone on the planet.

Data compression has been one branch of computer science that made this digital revolution possible. By compressing the input data, for example, to 1/k of its original size (the compressed file size is one k-th of the original file size), it is possible to send the same information k times faster, to send the file at once on a transmission channel that has capacity C/k, or even to send k files in parallel on a channel that has capacity C.

This digital communication has a price: we have to accept that the digital message we are sending can potentially be intercepted and read on its way to the destination.

Cryptography might be a solution to this issue: if the sender encrypts the message, assuming that only the destination has a way to decrypt it, then privacy will be ensured.

Therefore digital communication should be based on a pipeline of two operations that are heterogeneous and in some cases conflicting, but that both must be applied to the original file to obtain efficiency and security. These two operations are data compression and encryption.

The aim of this work is to study the combination of compression and encryption techniques on digital documents. In this paper we test the state-of-the-art compression and cryptography techniques on various kinds of digital data.

The next section shall present an introduction to the most commonly used methods in data compression and cryptography, together with a short survey of past work.

Sections 3, 4, and 5 will show the experimental results obtained on one-dimensional, two-dimensional, and three-dimensional data.

Section 6 contains our conclusions and future research directions.

2. Data Compression and Cryptography

Today the way we communicate has dramatically changed. We communicate digitally, and we aim to have efficient and secure communication. Research in this field is devoted to improving the way we communicate in order to meet ever stronger requirements of efficiency and security, where efficiency is provided by data compression and security by encryption.

Data compression is today essential for digital communication. Without data compression we would not have digital television, smartphones, satellite communications, the Internet, etc.

Information theory tells us that there is a limit to the amount of compression we can gain on a digital file and that this limit is the Shannon entropy [1]: the higher the entropy, the lower the achievable compression.
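
As a concrete illustration of this limit, the following Python sketch (ours, added for illustration and not part of the original experiments) estimates the zeroth-order empirical entropy of a byte stream and the corresponding lower bound, in bytes, on its losslessly compressed size:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Zeroth-order empirical Shannon entropy of a byte stream, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_bound(data: bytes) -> int:
    """Approximate lower bound (in bytes) on the losslessly compressed size."""
    return math.ceil(len(data) * byte_entropy(data) / 8)

if __name__ == "__main__":
    sample = b"abracadabra" * 100
    print(f"entropy: {byte_entropy(sample):.3f} bits/byte")
    print(f"lower bound: {entropy_bound(sample)} bytes (original: {len(sample)})")
```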

Therefore the enemy of compression is randomness. But, on the other hand, encryption needs to introduce randomness into the digital data to provide security. This is why, when we have to perform both compression and encryption, we always compress first and then encrypt, as shown in the workflow of Figure 1.
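
A minimal sketch of the compress-then-encrypt workflow of Figure 1, written in Python with the standard zlib compressor and AES in CTR mode from the third-party cryptography package (these are stand-ins chosen for illustration; the algorithms and libraries actually tested in this paper are listed in Sections 2.1 and 2.2):

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def compress_then_encrypt(data: bytes, key: bytes) -> tuple[bytes, bytes]:
    """Compress first, then encrypt: the recommended ordering."""
    compressed = zlib.compress(data, 9)
    nonce = os.urandom(16)                       # fresh nonce per message
    encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce, encryptor.update(compressed) + encryptor.finalize()

def decrypt_then_decompress(nonce: bytes, blob: bytes, key: bytes) -> bytes:
    """At the receiver the operations are applied in reverse order."""
    decryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    return zlib.decompress(decryptor.update(blob) + decryptor.finalize())

key = os.urandom(32)                             # AES-256 key
message = b"some highly redundant text " * 1000
nonce, blob = compress_then_encrypt(message, key)
assert decrypt_then_decompress(nonce, blob, key) == message
```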

There are therefore two interesting questions to be posed: What is the cost of encryption, in terms of file size, after performing compression? How bad is performing first encryption and then compression?

Here we give, experimentally, an answer to both these questions.

2.1. Data Compression

Data Compression can be defined as the coding of data to minimize its representation. The compression process is called lossless if the original data can be exactly reconstructed from the compressed copy; otherwise it is called lossy.

The theoretical background of the data compression techniques is strong and well established.

It dates back to the seminal work of Shannon who, more than half a century ago, gave a precise limit on the performance of any lossless compression algorithm: this limit is the entropy of the source we want to compress.

Data compression techniques are specifically dependent on the type of data that has to be compressed and on the desired performance.

In this paper we will experiment with widely used compression methods such as Run Length Encoding, Huffman Coding, Arithmetic Coding, LZW, JPEG, and JPEG2000.

Run Length Encoding (RLE) [1] is one of the simplest compression algorithms: each run of repeated characters is coded using only two elements, a counter that records how long the run is and the element that is repeated in the run. If a data item d occurs n consecutive times in the input stream, RLE replaces the n occurrences with the single pair (n, d).
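
A toy in-memory version of this scheme (a sketch for illustration; practical RLE coders pack the (n, d) pairs into a byte-oriented format and handle runs longer than the counter range):

```python
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Replace each run of a repeated byte with a (count, byte) pair."""
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        runs.append((j - i, data[i]))
        i = j
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    """Expand each (count, byte) pair back into the original run."""
    return bytes(b for count, b in runs for _ in range(count))

sample = b"aaaabbbcccccd"
encoded = rle_encode(sample)          # [(4, 97), (3, 98), (5, 99), (1, 100)]
assert rle_decode(encoded) == sample
```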

Huffman Coding was introduced in 1952 by Huffman [2]. The output of Huffman's algorithm is a variable-length code in which more common symbols are generally represented using fewer bits than less common symbols.
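
The following sketch builds such a variable-length prefix code with a priority queue; it is our illustrative version, not the implementation of [12] that is actually tested later in the paper:

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Build a Huffman prefix code: frequent symbols get shorter codewords."""
    # Each heap entry: (frequency, tiebreak, [(symbol, codeword), ...])
    heap = [(f, i, [(s, "")]) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in left] + [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return dict(heap[0][2])

text = b"this is an example of huffman coding"
codes = huffman_code(text)
encoded_bits = "".join(codes[b] for b in text)
print(len(encoded_bits), "bits instead of", 8 * len(text))
```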

Arithmetic Coding [3] is a form of entropy encoding that encodes a message into a single number: an arbitrary-precision fraction q with 0 ≤ q < 1.

In Huffman Coding there is a limit: each source character has to be coded with at least one bit. This limitation does not apply to Arithmetic Coding.
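
A floating-point toy that illustrates the idea by narrowing the interval [0, 1) symbol by symbol; any number q in the final interval identifies the message. Real arithmetic coders use integer arithmetic with renormalization to avoid the precision loss of this sketch:

```python
def arithmetic_interval(message: str, probs: dict[str, float]) -> tuple[float, float]:
    """Narrow [low, high) for each symbol; any q in the final interval encodes the message."""
    # Cumulative probability ranges for each symbol of the model.
    cum, ranges = 0.0, {}
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        lo, hi = ranges[sym]
        low, high = low + width * lo, low + width * hi
    return low, high

low, high = arithmetic_interval("ABBA", {"A": 0.5, "B": 0.3, "C": 0.2})
print(low, high)   # any q with low <= q < high encodes "ABBA" under this model
```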

Textual substitution methods, often called dictionary methods or Lempel-Ziv methods after the important papers by Ziv and Lempel [4, 5], maintain a constantly changing dictionary of strings to adaptively compress a stream of characters by replacing common substrings with indices into the dictionary. Lempel and Ziv proved that these schemes were practical as well as asymptotically optimal for a general source model.

In 1984 Welch published [6], in which he described a practical implementation of the method outlined in [5]. This scheme, called LZW, is the compression method used in the UNIX COMPRESS program and in earlier versions of the MS-DOS ARC program.
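
A compact illustrative version of LZW (ours, for exposition; production coders also bound the dictionary size and pack the output indices into a variable-width bit stream):

```python
def lzw_compress(data: bytes) -> list[int]:
    """Replace repeated substrings with indices into a growing dictionary."""
    dictionary = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = len(dictionary)          # add the new phrase
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the dictionary on the fly while decoding."""
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = dictionary[k] if k in dictionary else w + w[:1]   # special KwKwK case
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return b"".join(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decompress(lzw_compress(sample)) == sample
```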

Digital images can be defined as a set of two-dimensional arrays of integer data (the samples), represented with a given number of bits per component.

Because of the many applications of digital images there have been various processes of standardization involving image compression.

Perhaps the most widely used standard is JPEG (an acronym for Joint Photographic Experts Group, the unofficial name of the standardization committee that promoted the standard; see [7]).

JPEG has four modes of operation. One of them is Lossless JPEG, the lossless version of the standard. Today Lossless JPEG is far from the state of the art in lossless image compression, but it is used in this paper to show the performance of a simple image compression method when coupled with encryption.

We generally refer to JPEG as the baseline lossy method of the compression standard.

Today JPEG compression is still very popular and widespread.

JPEG 2000 [8] is a more recent standard for still image coding developed by the Joint Photographic Experts Group.

The standard is intended to offer unprecedented access into the image while it is still in the compressed domain. Thus, images can be accessed, manipulated, edited, transmitted, and stored in compressed form. The lossless mode of the standard is very close to the actual state of the art in lossless image compression.

Since the beginning of the 1980s video compression has been an attractive research area, because a digital video provides much more information than a single image frame. The huge computational complexity and memory space required for video processing have become affordable thanks to the computational capability available today.

MPEG4 [9] is a standard for the encoding of digital audio and video developed by the ISO/IEC Moving Picture Experts Group (MPEG). It is used primarily for applications such as video telephony and digital television, for the transmission of movies over the web, and for storage.

Current research in data compression also focuses on the compression of other three-dimensional data, e.g., hyperspectral images or medical images (see, for example, [10, 11]).

In this paper we have used the implementation of RLE, Huffman Coding, and Arithmetic Coding described in [12].

For JPEG and JPEG 2000 we used the open source libraries LibJpegTurbo [13] and OpenJpeg [14]. For MPEG4 we tested an online codec.

2.2. Encryption

Cryptography is crucial in many aspects of today's world: from Internet banking and e-commerce to email and web-based business processes.

A common way to encrypt a message is in blocks; such systems are naturally known as block ciphers. One of the most important and historically most widely used block ciphers is the Data Encryption Standard, or DES.

Triple DES, often denoted 3DES, was developed by W. Tuchman. It is a block cipher that builds on DES by applying it three times, increasing the effective key length so that brute-force attacks become practically impossible.

In 1997 NIST started the search for a successor to DES, called the Advanced Encryption Standard (AES); it would be an unclassified, public encryption scheme. In 2000 the Rijndael scheme, named after its Belgian inventors Joan Daemen and Vincent Rijmen of the COSIC Laboratory, was selected as the new AES.

For a complete, comparative analysis on the above symmetric encryption block cipher algorithms see [15].

RC4 (Rivest Cipher 4) is a stream cipher. In the past it was widely used because of its simplicity and speed in software, but after multiple vulnerabilities were discovered it is no longer considered secure [16]. It is still used in noncritical applications.
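
For exposition, the complete cipher fits in a few lines of Python; as stated above, this is shown only to illustrate its simplicity and it should not be used where security matters:

```python
def rc4_keystream(key: bytes, n: int) -> bytes:
    """Generate n keystream bytes with RC4 (KSA + PRGA). For illustration only: RC4 is broken."""
    # Key-scheduling algorithm (KSA)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA)
    out, i, j = bytearray(), 0, 0
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

def rc4_crypt(data: bytes, key: bytes) -> bytes:
    """Encryption and decryption are the same XOR with the keystream."""
    return bytes(d ^ k for d, k in zip(data, rc4_keystream(key, len(data))))

ciphertext = rc4_crypt(b"attack at dawn", b"secretkey")
assert rc4_crypt(ciphertext, b"secretkey") == b"attack at dawn"
```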

The code we used for the encryption experiments is taken from the open source library OpenSSL [17].

2.3. Other Approaches to Compression and Security

The need for secure and efficient data transmission has also had an impact on research in other applications of compression and security.

The delicate balance between compression and cryptography might be framed in a slightly wider context, and it is possible to consider other points of view from which this deep and intriguing question has been studied; see, for example, [18, 19].

In this digital era information owners are reluctant to distribute their data in an insecure environment. In fact they fear that digital products such as music files, videos, and images could be copied by anyone who has access to the network.

This has led to the idea of protecting the products by using a watermark that could be visible or invisible.

Watermarking can be coupled with data compression to achieve a safer transmission of a digital object. This combination of compression and steganography has been tested in various domains; see, for example, [20–22].

There are other papers in the literature that study the combination of compression and encryption techniques, generally by considering only one-dimensional data or specific compression algorithms; see, for example, [23–26].

3. Text and One-Dimensional Digital Data

One-dimensional data, for example, textual data, programs, and object code, are generally losslessly coded because the end user, a human in the case of textual data or a computer in the case of object code, would not accept a single bit error in the decoded message.

The test data set on which we have experimented includes 17 files from the well-known Calgary Corpus [27]. The Calgary Corpus is a collection of text and binary files created in 1987 by Ian Witten, Tim Bell, and John Cleary. It is named after the University of Calgary, where the three creators worked.

We have experimented on one-dimensional data with four standard compression algorithms, Huffman Coding, Arithmetic Coding, Lempel-Ziv-Welch Coding, and Run Length Encoding, and with four standard encryption algorithms, DES, 3DES, AES, and RC4.

Figure 2 shows the workflow of the tests.

In a first set of tests the input files have been first compressed and then encrypted; then in a second round of testing the same input files have been first encrypted and then compressed.
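
The measurement itself reduces to comparing output sizes for the two orderings. The sketch below reproduces the idea using zlib as a stand-in compressor and AES-CTR (via the third-party cryptography package) as a stand-in cipher; the actual experiments use the RLE, Huffman Coding, Arithmetic Coding, and LZW implementations and the OpenSSL ciphers described in Section 2:

```python
import os
import zlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def aes_ctr(data: bytes, key: bytes, nonce: bytes) -> bytes:
    """AES in CTR mode as a stand-in cipher; the paper also tests DES, 3DES, and RC4."""
    encryptor = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return encryptor.update(data) + encryptor.finalize()

def compare_orderings(path: str) -> None:
    """Report output sizes for compress->encrypt and encrypt->compress on one file."""
    with open(path, "rb") as f:
        data = f.read()
    key, nonce = os.urandom(32), os.urandom(16)
    comp_then_enc = aes_ctr(zlib.compress(data, 9), key, nonce)
    enc_then_comp = zlib.compress(aes_ctr(data, key, nonce), 9)
    print(f"{path}: original={len(data)}  "
          f"compress->encrypt={len(comp_then_enc)}  "
          f"encrypt->compress={len(enc_then_comp)}")

# e.g. compare_orderings("bib") for a Calgary Corpus file in the working directory
```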

Tables 1, 2, 3, and 4 show the experimental results obtained in first compressing and then encrypting the files.

The first column shows the file name, the second column shows the file size in bytes, the third column shows the file size after compression, and the remaining columns show the file size in bytes after first compression and then encryption.

The experiments summarized in the four tables show that the cost of encryption after compression is negligible for this kind of data.

Table 5 sums up the results. Its first column is the compression method examined, its second column is the test set size in bytes, the third column is the size in bytes of compressing the test set with that particular compression method, and the columns from 4 to 7 include the size in bytes of compressing and then encrypting with a specific encryption method and also the average cost of encryption in terms of percentage of the original file.

The cost of encryption is almost zero. For example, the table shows that this cost, as expected, is zero for RC4 and between 0.0021% and 0.0053% of the original file size for all the other methods.
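
In these tables the encryption overhead is the number of extra bytes added by encrypting the compressed file, expressed as a percentage of the original file size; explicitly (the absolute sizes in the comment are invented, for illustration only):

```python
def encryption_overhead_pct(original: int, compressed: int, encrypted: int) -> float:
    """Extra bytes added by encrypting the compressed file, as a % of the original size."""
    return 100.0 * (encrypted - compressed) / original

# e.g. a 100,000-byte file that compresses to 60,000 bytes and grows to 60,008 bytes
# after encryption has an overhead of 100 * 8 / 100000 = 0.008%.
```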

As shown in Figure 1 we have also tested the opposite approach in which we first encrypt the file and only after that we compress the encrypted file.

Table 6 is the analogue of Table 5 for this opposite approach. Its rows represent the encryption method used, and the columns from 4 to 7 represent the encryption plus compression process, showing for each compression method the size in bytes obtained after compressing the encrypted files.

In this case it is clear that compressing after an encryption does not help; on the contrary, it produces a file that is larger than both the encrypted and the original file. We have reported for each compression method the amount, in bytes, of the increase in size with respect to the original file.

Compression after encryption does not work; the randomness induced by the encryption algorithms, for example, causes Arithmetic Coding to almost double the original file size.

4. Images and Two-Dimensional Digital Data

Two-dimensional data, e.g., digital images, are generally lossy compressed because the human end user is not able to notice small pixel errors.

Lossless image compression is required in applications where the pictures are subject to further processing, intensive editing, or repeated compression/decompression.

It is also generally the choice for images obtained at great cost, or in applications where the desired quality of the rendered image is not yet known. Thus, precious artworks to be preserved, medical images, remotely sensed images, prepress material, and image archival systems are generally losslessly compressed.

The test data set on which we have experimented includes 6 images from [28].

We have experimented on two-dimensional data with three standard compression algorithms, JPEG, Lossless JPEG, and JPEG 2000 in lossless mode, and with four standard encryption algorithms, DES, 3DES, AES, and RC4.

Figure 3 shows the workflow of the tests.

In a first set of tests the input files have been first compressed and then encrypted; then in a second round of testing the same input files have been first encrypted and then compressed.

Tables 7, 8, and 9 show the experimental results obtained in first compressing and then encrypting the files. The first column shows the file name, the second column shows the file size in bytes, the third column shows the file size after compression, and the remaining columns show the file size in bytes after first compression and then encryption.

Lossless JPEG and JPEG 2000 in lossless mode are of course lossless compression algorithms. JPEG, being lossy, has been tested with an “average” quality setting.

Again the tables show that, even for image compression, the cost of encryption after compression is negligible for this kind of data.

Table 10 sums up the results in the case of encryption after compression for digital images.

Its first column is the compression method examined, its second column is the test set size in bytes, the third column is the size in bytes when compressing the test set with that particular compression method, and the columns from 4 to 7 include the size in bytes of compressing and then encrypting with a specific encryption method and also the average cost of encryption in terms of percentage of the original file.

For example, the table shows that this encryption cost, as expected, is zero for RC4 and between 0.0003% and 0.0008% of the original file size for all the other methods.

Table 11 is the analogue of Table 10 for this opposite approach. Its rows represent the encryption method used, and the columns from 4 to 7 represent the encryption plus compression process, showing for each compression method the size in bytes obtained after compressing the encrypted files.

Also in this case it is clear that compressing after an encryption does not help; instead it increases the file size with respect to both the encrypted and the original file sizes for lossless compression algorithms. We have reported for each compression method the amount, in bytes, of the increase in size with respect to the original file.

Do not be misled by column 4: JPEG is a lossy algorithm and it still compresses the encrypted file, with an average compression ratio, for our test data set, of 4.6 for RC4, 4.4 for DES, 4.4 for 3DES, and 4.4 for AES. But this compares poorly with the compression ratio of 18.7 obtained on the test data set by JPEG, with the same settings, before encryption.
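
Here the compression ratio is the usual quotient of original size over compressed size; the following toy calculation (the absolute sizes are invented, only the ratios 18.7 and 4.4 come from our measurements) makes the gap concrete:

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Ratio of original to compressed size; higher means better compression."""
    return original_size / compressed_size

# e.g. a hypothetical 1,000,000-byte image compressed by JPEG to about 53,500 bytes
# has ratio ~18.7, while the same image encrypted first and then JPEG-compressed
# to about 227,000 bytes has ratio ~4.4.
```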

In order to obtain the best result in terms of privacy and size for digital images, it is therefore necessary to apply compression first and, only after compression, encryption.

Operating these techniques in the reverse order, although possible, does not guarantee the best results in terms of compression ratio, and it also requires coping with problems related to the encryption itself. Applying an encryption algorithm to a digital image encrypts all the bytes of the image, including the bytes that define its header. Therefore a compression algorithm designed for images, such as JPEG, JPEG2000, or Lossless JPEG, would not recognize the data as a digital image and would not have enough information to determine how many rows and columns the image has.

To avoid this problem, in our tests we have implemented a simple algorithm that reconstructs, starting from the encrypted file, the original image header, which is then used by the compression algorithms to recognize the data as an image.
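
The paper does not fix a particular image format, so the following sketch illustrates the idea for the simple binary PGM format (an assumption made here for concreteness): the known width, height, and sample depth of the original image are used to rebuild a valid header, which is then grafted onto the encrypted samples before they are handed to the image compressor:

```python
def rebuild_pgm_header(encrypted_pixels: bytes, width: int, height: int,
                       maxval: int = 255) -> bytes:
    """Prepend a valid PGM header so a compressor treats the encrypted bytes as an image.

    The original header is destroyed by encryption, so it is rebuilt from the known
    image dimensions (hypothetical PGM layout, assumed for this illustration).
    """
    header = f"P5\n{width} {height}\n{maxval}\n".encode("ascii")
    body = encrypted_pixels[:width * height]      # keep exactly one image worth of samples
    return header + body

# Example: a 512x512 8-bit image whose samples were encrypted as a raw byte stream.
# pgm_bytes = rebuild_pgm_header(encrypted_data, 512, 512)
# open("encrypted_as_image.pgm", "wb").write(pgm_bytes)
```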

5. Video and Three-Dimensional Digital Data

Digital video is an example of three-dimensional data. It is generally lossy coded because the target end user is human and because the size of digital video files is such that lossy video compression is essential for transmission or storage purposes.

The test data set on which we have experimented includes 5 small synthetic videos in the .avi format.

On these 5 files we have tested the effects of the MPEG4 compression standard, paired with the DES and RC4 encryption algorithms.

In this set of tests the input files have been first compressed and then encrypted.

Table 12 shows the experimental results obtained in first compressing and then encrypting the files. The first column shows the file name, the second column shows the file size in bytes, the third column shows the file size after compression, and the remaining columns show the file size in bytes after first compression and then encryption.

Table 13 sums up the results in the case of encryption after MPEG4 compression for digital videos.

Its first column is the compression method examined (MPEG4), its second column shows the test set size in bytes, the third column is the size in bytes when compressing the test set with that compression method, and columns 4 and 5 include the size in bytes after compressing and then encrypting with DES or RC4, together with the average cost of encryption in terms of percentage of the original file.

From Table 13 it is clear, once more, that the cost of encryption is negligible and that this cost tends to become even smaller as the file sizes grow.

6. Conclusions and Future Research

We live in a digital era in which the way we communicate has dramatically changed. We can communicate digital files at any time, on any device, and with anyone on the planet.

We communicate digitally, and we aim to have efficient and secure communication. We keep improving the way we communicate in order to meet ever stronger requirements of efficiency and security, where efficiency means data compression and security means encryption.

Digital communication can therefore be based on a pipeline of two operations that are heterogeneous and in some cases conflicting, but that both must be applied to the original file to obtain efficiency and security. These two operations are data compression and encryption.

The enemy of compression is randomness but, on the other hand, encryption needs to bring randomness into the digital data to provide security. This is why, when we have to perform both compression and encryption, we always compress the data first and then encrypt it.

In this paper we have given an experimental answer to two interesting questions: “What is the cost of encryption in terms of file size after performing compression?” And “how bad is performing first encryption and then compression?”

The answer to the first question is that, happily, the cost of security is negligible if we perform first compression and then encryption. This cost has been examined by testing some of the state-of-the-art compression and encryption algorithms on one-dimensional, two-dimensional, and three-dimensional data.

The answer to the second question is the answer the theory already gave us. It is not efficient at all to do first encryption and then compression. The file size will definitely grow and in some cases the resulting output will be far larger than the original input.

Future research will involve more testing and the development of ad hoc algorithms that combine and exploit data compression and security.

Data Availability

All the data and software we have used in our experiments are available online.

Conflicts of Interest

The author declares that they have no conflicts of interest.

Acknowledgments

The author would like to thank the students Alessandro Sacco, Luigi Amitrano, Paolo Anastasio, and Marco Castaldo for performing preliminary experiments related to this paper.