Abstract

A distributed storage system (DSS) is a fundamental building block in many distributed applications. It applies linear network coding to achieve an optimal tradeoff between storage and repair bandwidth when node failures occur. Additively homomorphic encryption is compatible with linear network coding: the homomorphic property ensures that a linear combination of ciphertext messages decrypts to the same linear combination of the corresponding plaintext messages. In this paper, we construct a linearly homomorphic symmetric encryption scheme that is designed for a DSS. Our proposal provides simultaneous encryption and error correction by applying linear error correcting codes. We show its IND-CPA security for a limited number of messages based on binary Goppa codes and the following assumption: when a scrambled generator matrix is divided into a first part and a last part, it is infeasible to distinguish the last part from random and to find a statistical connection between the two parts. Our infeasibility assumptions are closely related to those underlying the McEliece public key cryptosystem but are considerably weaker. We believe that the proposed problem has independent cryptographic interest.

1. Introduction

The world’s ability to generate, process, and store information is growing at an exponential rate [1]. The Internet of Things (IoT) has enabled objects to collect and share vast amounts of data, enabling new applications and improving efficiency. In distributed IoT, intelligence is pushed to the very edge of the networks. Such a decentralized approach has created challenges regarding the security and privacy of the collected data [2]. A distributed storage system (DSS) is a widely used technology for storing data in a reliable way. It is one of the essential building blocks for distributed applications. Such a system consists of a collection of storage nodes that may be individually unreliable but apply redundancy to make the system reliable as a whole. Coding schemes are applied to ensure its reliability and to reduce the bandwidth required for repair. In particular, linear network coding has turned out to offer good performance both in theory and in practice.

Complications arise if we cannot be certain that the storage nodes are well-behaved. Encryption needs to be applied to ensure the confidentiality of the data. However, traditional cryptographic primitives are ill-suited for network coding, which requires that data packets from different nodes can be combined according to the coding scheme. Secure network coding [3-6] has been applied to ensure confidentiality in the information-theoretic security model. However, secure network coding incurs a cost on the storage capacity of the system: the capacity decreases exponentially with the number of compromised nodes [7]. Furthermore, in many cases the storage nodes are provided by a third-party storage service provider, leading to systems with zero secrecy capacity [8].

In this paper, we consider the confidentiality of network coding and, in particular, distributed storage systems in a setting where the adversary has complete control of the nodes but is computationally bounded. We devise a linear error correcting code based symmetric additively homomorphic encryption scheme that is compatible with linear network coding. There are several advantages of our scheme compared to ordinary encryption:
(1) Linear network coding can be applied as if working directly with the plaintext messages. Linear operations on the ciphertext space transfer to the plaintext space upon decryption.
(2) The encrypted parts of the file do not disclose which part is which. The part information can be kept in the plaintext domain. It makes it impossible for the storage nodes or the adversary to eavesdrop on which subsets of the data the user requests.
(3) The plaintext data can be first authenticated and then encrypted. For storage systems, this ordering is often desirable to ensure plaintext integrity. Our scheme can support this functionality with an additively homomorphic message authentication code such as [9], meaning that all linear combinations of the plaintext messages are authenticated.
(4) Our scheme provides simultaneous encryption and error correction.

There are encryption schemes possessing additively homomorphic properties such as the Goldwasser-Micali scheme [9] and the Paillier cryptosystem [10]. However, to apply coding schemes for distributed storage we need flexibility in choosing the ciphertext space field which, for efficiency reasons, is often an extension field of the binary field when working with big data [11]. The required flexibility is not provided by existing proposals.

We construct a symmetric encryption scheme that is homomorphic from a plaintext vector space to a ciphertext vector space over a finite field. In particular, our security proofs are given for the binary field, resulting in a scheme that is homomorphic between additive groups of binary vectors. We also show that our construction is semantically (IND-CPA) secure in the standard model (over the binary field) for a fixed number of messages, showing that it provides indistinguishability for each individual part of the file. The security relies on problems that are closely related to those underlying the McEliece cryptosystem [12]. In particular, we formulate an assumption that is related to the pseudorandomness of the McEliece generator matrix. However, our assumption is much weaker. We believe that the corresponding problem has cryptographic interest in its own right.

The paper is organized as follows. In Section 2 we present work that is related to ours. Section 3 describes the preliminaries for the rest of the paper. We formulate our encryption scheme in Section 4. We show that the scheme is IND-CPA secure for a limited number of messages in Sections 5 and 6. In Section 7 we consider the infeasibility of the applied problems and discuss how the scheme can be applied in practice with compact keys. Finally, Section 8 provides the conclusion.

2. Related Work

The theory of confidentiality of distributed storage is related to that of network coding. Cai and Yeung were the first to consider secure network coding [3, 4]. In their security model, a passive wiretapper is able to eavesdrop on a subset of the links between nodes. The adversary is computationally unbounded and privacy is considered information-theoretically. A similar model was considered in [13-16]. The security model of eavesdropping nodes, which is more natural for distributed storage, was suggested by Pawar et al. [8]. In their model, a computationally unbounded eavesdropper can access data on her selection of the nodes. The maximum file size that can be stored with information-theoretic security in the DSS using an optimal bandwidth MDS code (with exact repair) is called the secrecy capacity of the DSS. Regenerating codes achieving the secrecy capacity were suggested by Shah et al. [5]. Regenerating codes and locally repairable secure codes that achieve minimum storage requirements for a DSS were suggested by Rawat et al. [6]. Multiple simultaneous node failures, cooperative regenerating codes, and their secrecy capacity were considered in [17]. Kosut et al. considered networks where a node behaves traitorously [18]. Multiple nodes containing adversarial errors were considered by Dikaliotis et al. [19]. Pawar et al. considered an active omniscient adversary that has complete knowledge of the data on all nodes and can corrupt a bounded number of nodes [20].

The concept of homomorphic encryption was introduced by Rivest et al. [21]. While fully homomorphic encryption enables arbitrary computations on ciphertexts, many proposed schemes have homomorphic properties over specific operations. For example, RSA [22] is homomorphic over multiplication. Additively homomorphic schemes enable the computation of linear combinations of the ciphertexts. For the Goldwasser-Micali scheme [9] and the Paillier cryptosystem [10], multiplication in the ciphertext space corresponds to addition in the plaintext space. The scheme proposed by Lyubashevsky et al. is additively homomorphic with a polynomial ring as the ciphertext space [23]. Other asymmetric schemes with additively homomorphic properties can be found, for example, in [24-29]. The functionality of public key encryption incurs a computational burden that is not needed in certain situations. For many applications, symmetric encryption suffices. Few symmetric schemes with the additive homomorphic property have been proposed. Some constructions, mostly concentrating on realizing fully homomorphic encryption, can be found in [30-33]. In addition, the ciphertext and plaintext spaces in these schemes cannot be easily applied with linear network coding, where we want to work with extension fields of the binary field for efficiency reasons.
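The multiplicative homomorphism of RSA mentioned above can be illustrated with a minimal sketch. It uses unpadded textbook RSA with toy parameters chosen here purely for illustration (the primes, exponent, and function names are assumptions, and the setup is insecure); it is not taken from [22].

```python
# Textbook (unpadded) RSA with toy parameters; insecure, for illustration only.
p, q = 61, 53
n = p * q                     # modulus
phi = (p - 1) * (q - 1)
e = 17                        # public exponent, coprime to phi
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

def rsa_encrypt(m: int) -> int:
    return pow(m, e, n)

def rsa_decrypt(c: int) -> int:
    return pow(c, d, n)

a, b = 7, 12
# Multiplying ciphertexts modulo n corresponds to multiplying plaintexts modulo n.
product_ct = (rsa_encrypt(a) * rsa_encrypt(b)) % n
assert rsa_decrypt(product_ct) == (a * b) % n
```

An additively homomorphic scheme behaves analogously with respect to addition, which is the property that linear network coding requires.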

3. Preliminaries

3.1. Notation

Standard notation will be used for probabilistic algorithms [34]. We denote by the result of running a probabilistic algorithm on input with randomness and setting to be equal to the output. We denote the uniform probability distribution on a set by . If is a random variable and is a distribution, we denote when is distributed according to . A probability ensemble is a collection of random variables indexed by the integers. The problem of computationally distinguishing between two probability ensembles and is denoted by .

Whenever we refer to indistinguishability of probability ensembles, we mean computational indistinguishability unless stated otherwise. Security proofs are considered in the standard model. That is, all algorithms are considered to be probabilistic polynomial time (PPT) and time complexity is considered in the average case. The success probability (called the advantage) of an adversary on a problem is considered asymptotically as a function of a security parameter and is denoted by . A function is negligible if for every there is such that for every . A problem is considered infeasible if for all PPT algorithms the advantage is negligible.
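For reference, the negligibility condition described above is commonly written as follows. The symbols $\varepsilon$, $c$, $N$, and $n$ are assumed here for illustration and may differ from the notation used elsewhere in the paper.

```latex
\varepsilon : \mathbb{N} \to \mathbb{R} \ \text{is negligible}
\iff
\forall c > 0 \;\; \exists N \in \mathbb{N} \;\; \forall n > N : \;
|\varepsilon(n)| < n^{-c}.
```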

3.2. Dynamic Distributed Storage

Let a file consist of $M$ elements from a finite field $\mathbb{F}$. A dynamic distributed storage system (DSS) consists of $n$ live nodes each storing $\alpha$ symbols over $\mathbb{F}$. These nodes can be individually unreliable, but the system is designed to apply redundancy in a clever way to achieve robust and efficient data recovery against failures. The file is encoded into a codeword consisting of $n$ blocks, and the $i$th block is stored in node $i$. During operation, some of the nodes of the DSS may fail. If node $i$ fails, a new node is added to the network. It contacts $d$ live nodes and downloads $\beta$ symbols from each. The total amount of downloaded data, $\gamma = d\beta$, is called the repair bandwidth. The new node processes these symbols to reconstruct the lost block. The repair process is conducted so that the data stored at any $k$ nodes allows the file to be completely reconstructed (the “$k$ out of $n$ property”). A DSS satisfying such a property is often referred to as an $(n, k)$-DSS.

There is a tradeoff between the repair bandwidth and the amount of data that can be stored in each node [35]. Dimakis et al. suggested network coding [36, 37] for distributed data storage in order to reduce the bandwidth of node repair [35]. They introduced regenerating codes that achieve the optimal tradeoff between storage and repair bandwidth. This tradeoff can be achieved with linear network coding [20]. See Figure 1 for an example of a DSS and the repair process after node failure.
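The two extreme points of this tradeoff can be computed with a short sketch. The formulas are the standard minimum-storage (MSR) and minimum-bandwidth (MBR) regenerating points from the regenerating codes literature; the function names and the parameter names `file_size`, `k`, and `d` (file size, number of nodes sufficient for reconstruction, number of helper nodes) are assumptions made here for illustration.

```python
from fractions import Fraction

def msr_point(file_size: int, k: int, d: int):
    """Minimum-storage point: (per-node storage, repair bandwidth)."""
    if not 1 <= k <= d:
        raise ValueError("expected 1 <= k <= d")
    alpha = Fraction(file_size, k)
    gamma = Fraction(file_size * d, k * (d - k + 1))
    return alpha, gamma

def mbr_point(file_size: int, k: int, d: int):
    """Minimum-bandwidth point: per-node storage equals repair bandwidth."""
    if not 1 <= k <= d:
        raise ValueError("expected 1 <= k <= d")
    gamma = Fraction(2 * file_size * d, 2 * k * d - k * k + k)
    return gamma, gamma

# Example with a unit-size file, k = 5, and d = 9 helper nodes.
print("MSR (storage, bandwidth):", msr_point(1, 5, 9))
print("MBR (storage, bandwidth):", mbr_point(1, 5, 9))
```

In this example the MSR point stores less per node but needs more repair traffic than the MBR point, which is the tradeoff described above.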

3.3. Mutual Information

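The mutual information of two random variables $X$ and $Y$ is
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)},$$
where $p(x,y)$ is the joint probability distribution function of $X$ and $Y$, $p(x)$ is the marginal probability distribution function of $X$, and $p(y)$ is the marginal probability distribution function of $Y$. We say that $X$ and $Y$ are dependent if
$$I(X;Y) > 0.$$
Generalizing this to probability ensembles $X = \{X_n\}$ and $Y = \{Y_n\}$, we say that $X$ and $Y$ are dependent if
$$I(X_n;Y_n) > 0$$
for every $n$.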
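As a small numerical illustration of this definition, the following sketch computes the mutual information (in bits) of a discrete joint distribution given as a dictionary. The function name and the two example distributions are assumptions introduced here.

```python
import math

def mutual_information(joint):
    """Mutual information in bits; `joint` maps (x, y) pairs to probabilities."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p     # marginal distribution of X
        py[y] = py.get(y, 0.0) + p     # marginal distribution of Y
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:                      # terms with p = 0 contribute nothing
            total += p * math.log2(p / (px[x] * py[y]))
    return total

# Two independent fair bits: the joint distribution factorizes.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y is an exact copy of a fair bit X: fully dependent.
copied = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))
print(mutual_information(copied))
```

The first distribution yields zero mutual information, while the second yields a strictly positive value, so the two variables are dependent in the sense defined above.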

3.4. McEliece Cryptosystem and Related Problems

The McEliece scheme applies binary Goppa codes [38] to enable asymmetric encryption. The key generation algorithm outputs a private/public key pair such that the private key consists of three matrices $(S, G, P)$ with entries in $\mathbb{F}_2$, where $P$ is an $n \times n$ permutation matrix, $S$ is a nonsingular $k \times k$ matrix, and $G$ is the $k \times n$ generator matrix for a binary Goppa code that is able to correct up to $t$ errors. The public key is the composition matrix $\hat{G} = SGP$. A message $m$ is encrypted by computing $c = m\hat{G} + e$, where $e$ is a randomly chosen error vector of Hamming weight $t$. For the decryption, the receiver first computes $cP^{-1}$ and then decodes the corresponding Goppa codeword to obtain $mS$. Since $S$ is nonsingular, the message is computed by multiplying with $S^{-1}$ from the right. A semantically secure version of the scheme can be found in [39]. Here, semantic security refers to indistinguishability of ciphertexts under chosen plaintext attack. For details on semantic security, see, for example, [40].

The security of is based on a certain assumption on the generator matrix . Let denote the random variable determined by the probability distribution of sampling a generator matrix according to , where is a security parameter. Let the probability ensemble . Let denote the probability ensemble of random matrices with the same size as . The following hardness assumption was first formulated in [41].

Assumption 1 (pseudorandomness of McEliece generator matrix). There exists a negligible function such that for every .

In addition to this pseudorandomness assumption, the McEliece scheme relies on the hardness of the learning parity with noise problem. However, we do not need that assumption for our scheme.

4. Additively Homomorphic Symmetric Encryption Scheme

In this section, we give a construction of a symmetric encryption scheme that is homomorphic from the additive group to , where and . Due to linearity, it will be compatible with linear network coding. Our construction is inspired by the symmetric scheme suggested in [42], the homomorphic scheme suggested in [43] and the McEliece public key encryption scheme [12], and, especially, its IND-CPA variant [39]. Similarly to the McEliece scheme, our scheme is based on binary Goppa error correcting codes [38]. However, contrary to the McEliece scheme, we do not disclose the scrambled generator matrix. We also do not add any errors while encrypting which means that the full error correction capacity of the code can be utilized in applications. It would also be easy to adapt our proposal to apply other codes on an arbitrary finite field . However, binary fields and their extensions are useful for many applications since they enable efficient data combination due to efficiency of addition modulo 2 [11].

In general, the scheme operates as follows. Suppose that our file is divided into parts that constitute the plaintext messages. Each of these messages is padded with a random suffix and encrypted by encoding the result with a scrambled generator matrix of a linear error correcting code. Note that the resulting ciphertexts can be linearly combined, and the corresponding combination translates back to the plaintext space upon decoding due to the linearity of the code. Furthermore, since the generator matrix is scrambled and not disclosed, an adversary is not able to determine the applied code and thus is not able to decrypt the ciphertexts. In the following, we rigorously formulate this construction and the related computational assumptions. Based on computational indistinguishability, we then proceed to show its semantic security.

Definition 2 (). The symmetric encryption schemeconsists of a three-tuple of algorithms given in the following: (1): based on the security parameter , chooses a randomization length , a linear -error correcting Goppa code over with a generator matrix such that . It also samples a random nonsingular matrix and a random permutation matrix . It then sets the cleartext length to be such that , where and sets as public parameters and outputs as the secret key.(2): the input consists of a key , a plaintext . It then samples a randomand encodes the concatenation using to obtain a ciphertext message(3): the input consists of a key and a ciphertext . The plaintext message is obtained by decoding using the Goppa code, mapping the decoded message by and discarding the last bits.

The key generation, encryption, and decryption processes are depicted in Algorithms 1, 2, and 3, respectively.

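Algorithm 1: Gen (key generation).
(1) Determine the randomization length
(2) Generate a random nonsingular matrix, a Goppa generator matrix, and a random permutation matrix
(3) Set the cleartext length
(4) Output the public parameters
(5) Output the secret key

Algorithm 2: Enc (encryption).
(1) Generate a random suffix
(2) Add the suffix to the plaintext
(3) Encode with the scrambled generator matrix

Algorithm 3: Dec (decryption).
(1) Permute (undo the permutation)
(2) Decode with the Goppa code
(3) Demix (undo the nonsingular matrix)
(4) Discard the suffix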

Note that, contrary to the McEliece cryptosystem, the scrambled generator matrix is not public. Instead, it is kept as a secret key. In addition, no error vectors are added in the encryption process.
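The three algorithms can be sketched end to end on a toy example. The sketch below is an illustration under loud assumptions, not the scheme as specified: a [7,4] Hamming code stands in for the binary Goppa code, the sizes (2 plaintext bits, 2 suffix bits) are far too small to offer any security, and all names (`gen`, `enc`, `dec`, `G_hat`, and so on) are hypothetical.

```python
import random

MSG_LEN, SUFFIX_LEN = 2, 2     # toy plaintext length and randomization length
K, N = 4, 7                    # dimension and length of the stand-in code

# Systematic [7,4] Hamming code: G = [I | A], and H^T = [A ; I] satisfies G H^T = 0.
A = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
G = [[int(i == j) for j in range(K)] + A[i] for i in range(K)]
H_T = A + [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

def vec_mat(v, M):
    """Row vector times matrix over GF(2)."""
    return [sum(v[i] & M[i][j] for i in range(len(v))) % 2 for j in range(len(M[0]))]

def mat_mul(X, Y):
    return [vec_mat(row, Y) for row in X]

def inverse(M):
    """Gauss-Jordan inversion over GF(2); returns None if M is singular."""
    n = len(M)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gen(rng):
    """Sample a nonsingular S and a permutation, and form the scrambled matrix S*G*P."""
    while True:
        S = [[rng.randrange(2) for _ in range(K)] for _ in range(K)]
        S_inv = inverse(S)
        if S_inv is not None:
            break
    perm = list(range(N))
    rng.shuffle(perm)
    P = [[int(perm[i] == j) for j in range(N)] for i in range(N)]
    return {"G_hat": mat_mul(mat_mul(S, G), P), "S_inv": S_inv, "perm": perm}

def enc(key, m, rng):
    """Append a random suffix and encode with the secret scrambled matrix."""
    r = [rng.randrange(2) for _ in range(SUFFIX_LEN)]
    return vec_mat(m + r, key["G_hat"])

def dec(key, c):
    w = [c[key["perm"][i]] for i in range(N)]     # undo the permutation
    syndrome = vec_mat(w, H_T)
    if any(syndrome):                             # correct a single bit error
        w[H_T.index(syndrome)] ^= 1
    u = vec_mat(w[:K], key["S_inv"])              # systematic part, then demix
    return u[:MSG_LEN]                            # discard the random suffix

rng = random.SystemRandom()
key = gen(rng)
m1, m2 = [1, 0], [1, 1]
c1, c2 = enc(key, m1, rng), enc(key, m2, rng)
c_sum = [a ^ b for a, b in zip(c1, c2)]
assert dec(key, c_sum) == [0, 1]                  # sum of ciphertexts -> sum of plaintexts
```

The final assertion exercises the additive homomorphism: the XOR of two ciphertexts decrypts to the XOR of the two plaintexts. Because no error vector is added during encryption, the stand-in code can also still correct one flipped ciphertext bit.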

We shall now proceed to show the IND-CPA security of our construction. Our plan is the following. We first show that Enc can be divided into two parts, Enc1 and Enc2, such that the output of Enc is the sum of the outputs of these two algorithms. We then proceed to show that Enc2 produces a probability ensemble that is indistinguishable from random under a certain (reasonable) assumption. We then consider the sum of the outputs of these two algorithms and proceed to show that (under another reasonable assumption) the complete encryption algorithm produces ciphertexts that are indistinguishable from random.

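We start by showing that Enc can be expressed as a sum of two algorithms. Let the scrambled generator matrix $\hat{G}$ be partitioned row-wise into two submatrices $\hat{G}_1$ and $\hat{G}_2$ such that $\hat{G} = [\hat{G}_1^T \; \hat{G}_2^T]^T$, where $T$ denotes transpose, $\hat{G}_1$ consists of the rows that multiply the plaintext, and $\hat{G}_2$ consists of the rows that multiply the random suffix. Then, for a plaintext $m$ and randomness $r$, we have
$$\mathrm{Enc}(m; r) = (m \,\|\, r)\hat{G} = m\hat{G}_1 + r\hat{G}_2 = \mathrm{Enc}_1(m) + \mathrm{Enc}_2(r),$$
where Enc1 is deterministic PT, Enc2 is PPT, and $r$ is the internal randomness used by Enc.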
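The decomposition is a direct consequence of linearity and can be checked numerically. In the following sketch, a random binary matrix stands in for the scrambled generator matrix, and the names `G_hat`, `G1`, `G2`, `enc1`, `enc2`, and the toy dimensions are assumptions made for illustration.

```python
import random

def vec_mat(v, M):
    """Row vector times matrix over GF(2)."""
    return [sum(v[i] & M[i][j] for i in range(len(v))) % 2 for j in range(len(M[0]))]

rng = random.Random(1)                 # fixed seed so the demonstration is repeatable
msg_len, rand_len, code_len = 3, 4, 10

# Stand-in for the scrambled generator matrix, split row-wise into two parts.
G_hat = [[rng.randrange(2) for _ in range(code_len)] for _ in range(msg_len + rand_len)]
G1, G2 = G_hat[:msg_len], G_hat[msg_len:]

def enc1(m):
    return vec_mat(m, G1)              # deterministic part, depends only on the plaintext

def enc2(r):
    return vec_mat(r, G2)              # depends only on the random suffix

def enc_full(m, r):
    return vec_mat(m + r, G_hat)       # encode the concatenation of plaintext and suffix

m = [1, 0, 1]
r = [rng.randrange(2) for _ in range(rand_len)]
assert enc_full(m, r) == [a ^ b for a, b in zip(enc1(m), enc2(r))]
```

Addition over the binary field is XOR, so the check compares the full encoding with the bitwise XOR of the two partial encodings.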

Now, adds a different element to the output of determined by the randomness . Suppose that we are encrypting messages and that the output of is a truly random for every . Then for every and every plaintext message the output of would be characterized byand would satisfy perfect secrecy for encryptions. In reality, the output of is not truly random. However, in the following we show that it is indistinguishable from random under a certain assumption. Then we consider the connection between and and, finally, the indistinguishability of encryptions from random. For easier reference, variables used in the description of the scheme, as well as in the following proofs, have been collected into Table 1. Similarly, the used random variables have been collected into Table 2.

5. The Probability Ensemble Induced by Enc2

In the following, we consider the probability ensemble induced by for encryptions. That is, we have a -tuplesuch that for every , where and . Note that , and depend on the security parameter . In the following, we have made the dependence explicit. We can consider as a random variable over by setting , where is a matrix chosen uniformly at random. For convenience, we assume that is written in such a matrix form.

5.1. Indistinguishability of from Random

Our plan is to show the indistinguishability of from random for all . In order to do that we want to be also indistinguishable from random. We could apply the McEliece assumption (Assumption 1) that states that the complete generator matrix satisfies this property. However, such an assumption is too strong in our case. We derive a weaker assumption that relates only to .

Definition 3. Let denote a probability ensemble of McEliece generator matrices (chosen according to some schema) such that is distributed over matrices of size for every . Let and denote the probability ensembles such that for every , where is distributed over matrices of size and is distributed over matrices of size , where and and are chosen according to .

Assumption 4 ( indistinguishable from random). Let denote the uniform probability ensemble such that for every . For every PPT algorithm , there is a negligible function such thatfor every .

If the generator matrix satisfies the formulated assumption, then cannot be distinguished from random. Suppose that is exchanged with truly random matrix. Let be a probability ensemble such that , where and . Let denote the uniform probability ensemble such that , where is determined by . Clearly, the statistical distancefor every and since all of the elements of are uniformly random.

We shall now provide a connection between Assumption 4 and the indistinguishability of from for .

Proposition 5. For every PPT algorithm there is a PPT algorithm such thatfor every and .

Proof. The reduction is straightforward. Let be given and let be a PPT algorithm considered as a distinguisher for and . Let us define the distinguisher for and that is shown in Algorithm 4.
If , then is invoked with rows of a McEliece generator matrix. By the description of , is queried with a matrix sampled according to . Let now . Then is invoked with an element sampled according to and since outputs the same bit as , we have

(1) procedure is a matrix
(2)
(3)
(4)return  
(5) end procedure

A direct consequence of Proposition 5 is the result we aimed for: indistinguishability of from random under Assumption 4.

Proposition 6. For every PPT algorithm and ,for every .

6. Semantic Security for Messages

Let us now turn to the probability ensemble induced by the complete encryption algorithm . We establish the semantic security of by proving that it satisfies ciphertext indistinguishability for up to messages under two assumptions: Assumption 4 and a new one regarding independence of and . Let be any plaintext messages. Let such that , where and . As before, let us consider in the matrix form. Set also . That is, the rows of consist of the plaintext messages. We call the message matrix of .

6.1. Computational Independence

Assumption 4 concerns the last part of the generator matrix . However, we need to also make an assumption regarding . For example, suppose that it was possible that . Then would be easily distinguishable with high probability by choosing , the identity matrix. To foil such attempts, we want and to be sufficiently independent of each other. We shall formulate an assumption concerning the mutual information of and .

Let us define the following experiment in which we attempt to determine whether two probability ensembles are dependent. Suppose that we have three probability ensembles , and . Suppose also that is indistinguishable from . Furthermore, suppose that while for every . We define the experiment that is shown in Algorithm 5.

(1) procedure Dependability experiment
(2)
(3)if    then
(4)
(5)else
(6)
(7)end if
(8)if   then
(9)output 1
(10)else
(11)output 0
(12)end if
(13) end procedure

In the experiment, is either given an element from such that or an element from that is indistinguishable from such that . Since and are indistinguishable, succeeds in this experiment with nonnegligible probability only if it is able to find the dependability of from .

Definition 7. Let be probability ensembles. We say that and are computationally independent if for every PPT algorithm and every probability ensemble such that is computationally indistinguishable from and for every there is a negligible function such thatfor every . If this does not hold, then we say that and are noticeably dependent.

Note that it follows from the definition of thatfor every . We formulate the following assumption concerning the relationship between and .

Assumption 8 ( and computationally independent). For every probability ensemble indistinguishable and independent from and every PPT algorithm there is a negligible function such thatfor every .

The assumption states that it is not feasible to find any information that links and . The assumption is still weaker than the McEliece assumption that states that the whole is indistinguishable from random. (If they are, then necessarily and are computationally independent.) However, Assumption 8 does not require to be indistinguishable from random. In fact, our proofs do not depend at all on the structure of as long as and are computationally independent. To make the scheme faster, we could, for instance, omit and from affecting the first rows of the generator matrix .

We are now ready to show the semantic security of by showing the indistinguishability of from random.

Proposition 9. has indistinguishable encryptions for messages under Assumptions 4 and 8.

Proof. Suppose that Assumption 4 holds. We establish the claim by showing that for every set of plaintext messages and every PPT algorithm there is a PPT algorithm such thatfor , where is induced by . Then, under Assumption 8, the advantage of is negligible.
Since is truly random, we have . In addition, by Assumption 4, is computationally indistinguishable from and therefore is well defined. Let the security parameter be fixed and let be any messages. Let be the message matrix of . Written in the matrix form, we have and the elements are of the formwhere and .
Let be any PPT algorithm considered as a distinguisher for . Using , we construct an algorithm that determines the dependability of and (see Algorithm 6).
Suppose that the input is random matrix. Thenis a truly random matrix. Therefore, was invoked with a matrix sampled according to . Suppose now that . Thenand was sampled according to . Since outputs the same bit as , we have

(1) procedure   is either or a random matrix
(2)
(3)
(4)
(5)output  
(6) end procedure

Consequently, the scheme is IND-CPA secure under Assumptions 4 and 8 whenever the number of queries that the adversary makes to the encryption oracle (the test query included) does not exceed the message bound of Proposition 9. Considering a DSS, whenever the dataset is divided into at most that many parts, each of those parts remains secret even under a chosen plaintext attack where the adversary is able to choose each of those parts separately and adaptively.

7. Infeasibility, Key Size, and Error Correction Capacity

7.1. Infeasibility of the Problems

Let us briefly consider the infeasibility of the underlying problems related to . The IND-CPA security is based on assumptions that are weaker but closely related to the ones underlying the McEliece scheme. The selection of parameters for the McEliece scheme has been considered in [44] and the best performing attacks are based on information set decoding. In addition, due to algebraic attacks against Goppa codes [45, 46] the rate cannot be close to one and the degree of the Goppa polynomial has to satisfy , where is the smallest integer satisfying , where and [47]. Choosing maximizes the complexity of information set decoding attacks [44].

For , the attacker is not given the generator matrix. Instead, the attacker gets at most scrambled messages under an adaptive chosen plaintext attack. Therefore, can be drastically lower for . We suggest and so that randomization length is slightly more than half of the input. The rate should be kept close to a constant. We suggest choosing a rate that is close to due to information set decoding attacks [44].

7.2. Key Size

The key size of the scheme is large if truly random matrices are used. In a practical setting, we want to use pseudorandom matrices for the nonsingular scrambling matrix and the permutation matrix. The key size is dramatically decreased by replacing these matrices with a short seed and generating them using a pseudorandom generator. The generator matrix of the Goppa code can be derived from the Goppa polynomial and pseudorandom elements produced by the same generator. Therefore, in practice, the key can be compactly represented by the seed and the Goppa polynomial.
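A minimal sketch of this idea is given below. It assumes SHAKE-256 from Python's `hashlib` as a stand-in for the pseudorandom generator; the function names, the domain-separation labels, and the toy sizes are hypothetical, and derivation of the Goppa generator matrix from the polynomial is not covered.

```python
import hashlib

def byte_stream(seed: bytes, label: bytes):
    """Yield pseudorandom bytes expanded from the seed (SHAKE-256 as a PRG stand-in)."""
    counter = 0
    while True:
        yield from hashlib.shake_256(seed + label + counter.to_bytes(4, "big")).digest(64)
        counter += 1

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    rows, rank = list(rows), 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot                       # lowest set bit of the pivot row
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def derive_scrambler(seed: bytes, k: int):
    """Derive a nonsingular k x k binary matrix (rows as bitmasks) by rejection sampling."""
    stream = byte_stream(seed, b"S")
    nbytes = (k + 7) // 8
    while True:
        rows = [int.from_bytes(bytes(next(stream) for _ in range(nbytes)), "big")
                & ((1 << k) - 1) for _ in range(k)]
        if rank_gf2(rows) == k:
            return rows

def derive_permutation(seed: bytes, n: int):
    """Derive a permutation of range(n) with a Fisher-Yates shuffle driven by the stream."""
    stream = byte_stream(seed, b"P")
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        # Two bytes reduced modulo i + 1: slightly biased, acceptable for a sketch only.
        j = ((next(stream) << 8) | next(stream)) % (i + 1)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

seed = b"short secret seed"
S_rows = derive_scrambler(seed, 8)
perm = derive_permutation(seed, 16)
```

Both matrices are recomputed deterministically from the seed, so only the seed (together with the Goppa polynomial) needs to be stored as key material.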

Typically, in a distributed storage system we want to encrypt files or file systems that are huge. If a large file is divided into few parts, we do not want to consider each part as a single plaintext message, since such an approach would require the code parameters to be at least as large as the length of the file part. In such a case, we can further divide the part into smaller blocks and encrypt those blocks independently. Such an approach enables us to select small and efficient values for the code parameters. Note that such a division does not affect the homomorphic property of the scheme provided that each of the file parts is processed similarly and encrypted with the same keys. It also does not have an effect on the key size since the keys of those individual blocks can be derived from the same seed and the same Goppa polynomial.

7.3. Error Correction

Due to requirements of semantic security and error correction, ciphertexts contain overhead compared to plaintext messages. For example, with , where the rate , plaintexts of length will be encrypted into ciphertexts of length . The scheme can correct up to errors, where is the degree of the Goppa generator polynomial. With these parameters, we should choose [47]. Choosing the smallest , which results in the most efficient implementation, enables us to correct up to 5 errors in each 256 bits meaning that the plaintext messages are correctly decrypted with high probability whenever the error rate is less than %. If more error correction capacity is needed, then a higher degree Goppa generator polynomial needs to be selected and/or the rate should be lowered. As a final remark, we note that the binary Goppa code can be exchanged with another linear code on a finite field . However, we have only shown the security of based on the indistinguishability of a scrambled Goppa generator matrices. The applied linear code has to satisfy a similar infeasibility result.
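The arithmetic behind such parameter choices can be sketched with the standard bounds for binary Goppa codes: a code of length at most $2^m$ defined by a degree-$t$ polynomial has dimension at least $n - mt$ and corrects $t$ errors. The function name and the returned field names are assumptions, and the example only reuses the "5 errors in each 256 bits" figure from the text; it does not claim to reproduce the exact parameters recommended above.

```python
def goppa_bounds(n: int, t: int):
    """Standard bounds for a binary Goppa code of length n correcting t errors."""
    m = (n - 1).bit_length()          # smallest m with n <= 2**m
    k_min = n - m * t                 # lower bound on the code dimension
    if k_min <= 0:
        raise ValueError("t is too large for this code length")
    return {
        "m": m,
        "k_min": k_min,
        "rate_min": k_min / n,        # lower bound on the code rate
        "error_fraction": t / n,      # fraction of bit errors that can be corrected
    }

params = goppa_bounds(n=256, t=5)
print(params)
```

Raising `t` increases the correctable error fraction but lowers the dimension bound, which matches the remark that more error correction capacity requires a higher degree polynomial and/or a lower rate.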

8. Conclusion

We propose an additively homomorphic symmetric encryption scheme that is compatible with linear network coding: a linear combination of ciphertext messages decrypts to the same linear combination of corresponding plaintext messages. The scheme can be used for the encryption of data stored in a distributed storage system (DSS), for example, in the distributed Internet of Things. We show that the scheme is semantically secure (IND-CPA) and provides computational indistinguishability for each individual part of the file stored in the DSS. In combination with an additively homomorphic MAC, our scheme supports the authenticate-then-encrypt paradigm that ensures plaintext integrity. Finally, based on Goppa codes, our scheme offers simultaneous error correction. Our proofs are shown for the binary field, which is commonly used for the implementation of a DSS for reasons of computational efficiency. We also discuss the selection of secure parameters for the scheme and explain how it can be applied with compact keys.

Disclosure

Work related to this manuscript has first appeared in the author’s doctoral thesis [48].

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.