Abstract

In recent years, the combination of power systems and the IoT (Internet of Things) has given rise to a new generation of smart grid. Cloud computing is used to provide storage and computing services for smart grid data due to its convenience. Searchable encryption technology is generally considered a feasible way to guarantee data security while supporting search functionality. Li et al. proposed a searchable symmetric encryption scheme based on pseudo-random functions for smart grid data in 2019. In this paper, we propose a more efficient and more secure searchable symmetric encryption scheme for smart grid data. The scheme improves search efficiency by introducing the Bloom filter and changing the structure of the index. Specifically, we first narrow the search scope and then perform a second search, which eliminates the false positives caused by the introduction of the Bloom filter. At the same time, we assign an ID to each piece of data (a row of data) in a tabular dataset and store the salted hash value of the ID in the index; the search results returned by the server thus contain no plaintext ID information, which improves security. Experiments on real data show that our scheme is 52% more efficient than the previous scheme.

1. Introduction

The smart grid is regarded as the next-generation power system because of its reliability, flexibility, sustainability, and efficiency, and it has attracted extensive attention in academia and industry. The smart grid collects and uses data through sensors, smart meters, and other IoT devices to realize intelligence. For example, the smart grid can predict faults through real-time monitoring and data analysis, intelligently control equipment, reduce power consumption, and improve operational efficiency. However, with the growth of power data, the traditional smart grid can meet neither the growing needs of storage and data management nor the power industry's need to quickly extract knowledge and information from massive data [1]. Cloud computing was later adopted to solve these problems: it provides huge storage space and powerful computing ability and can store and process power grid data properly.

The power grid is one of the most important national infrastructures, as electricity is tied to everyone's life and even to the development of the country and society. In order to better apply the smart grid and counter the harm caused by malicious attacks and data leakage, we must make data security a priority. From the perspective of external factors, the cloud server may be attacked [2]; the attacker could directly obtain the plaintext grid data, including users' personal information, grid statistics, and other private information. From the perspective of internal factors, the third-party cloud service provider (CSP) has full operation permissions over the data. Moreover, the grid data contain users' personal information, such as names, addresses, and telephone numbers, so the third-party CSP may take an interest in the grid data for commercial reasons. How to ensure the security of grid data on the cloud server has therefore become an important issue. One method is to encrypt the data before uploading it to the cloud server and to download the ciphertext to a local machine when the user needs it, but this method consumes too much bandwidth.

Searchable encryption has been widely studied since it was proposed 20 years ago [3]. On the premise of ensuring data security, it enables retrieval over ciphertext, which has made it a feasible method for solving the data security problem in the smart grid. The main idea of searchable symmetric encryption is that the data owner builds an index corresponding to the keywords and sends the index and ciphertext to the server. When the client performs a search, it generates a trapdoor from the keywords to be searched and sends the trapdoor to the server. The server then runs the search algorithm over the index using the trapdoor and returns the corresponding encrypted files to the client. The client decrypts the ciphertext and finally obtains the corresponding plaintext. Song et al. [3] proposed the first searchable encryption scheme in 2000. Since then, many scholars have striven to achieve efficient and secure searchable encryption schemes [4–16]. Among them, schemes [4–7] construct more efficient indexes and search algorithms but only support single-keyword search. Early multikeyword retrieval schemes [8, 9] support Boolean keyword queries, i.e., conjunctive and disjunctive retrieval of keywords: querying documents that contain several keywords at the same time and documents that contain at least one of several keywords. However, this retrieval method is not flexible enough. Later, Cao et al. [10] proposed a ranked multikeyword search scheme in 2011, the first true multikeyword search scheme. It uses the vector space model (VSM) [11] from text retrieval as the basis of multikeyword retrieval, combined with the ASPE algorithm proposed by Wong et al. [12], to achieve a more secure and efficient multikeyword search. In recent years, new multikeyword search schemes [13–16] and fuzzy search schemes [17, 18] have appeared. In 2017, Andola et al. [19] proposed a public-key encryption with keyword search scheme that is immune to online keyword guessing attacks (ON-KGA) in a malicious server environment by adding a nonce to the keyword. Recently, in 2022, Andola et al. [20] proposed a secure searchable encryption scheme using hash-based indexing that reduces the computational load. In 2017, Andola et al. [21] proposed a scheme that reduces the communication overhead between the users and the cloud server by introducing a manager, while preserving the same level of security; Andola et al. [22] then enhanced the security of [21] in 2018. However, these schemes target scenarios where the data is a combination of random keywords, such as mail and logs. Because smart grid data is frequently updated and has a fixed format, the current typical searchable symmetric encryption schemes are not applicable to it [23]. Li et al. [23] proposed a searchable symmetric encryption scheme for smart grid data in 2019; similar to [5], it uses pseudo-random functions to construct a scheme that is simple and easy to update. Zhu et al. [24] improved the scheme of [23] in 2021 by adding multikeyword and fuzzy keyword search using the N-Gram algorithm and Hamming distance. Moreover, some searchable encryption schemes have been constructed on the Bloom filter principle [4, 25, 26], but most of them use the Bloom filter directly as the index, which makes them hard to update and prone to false positives. Andola et al. [27] surveyed existing searchable encryption techniques and analysed their robustness to attacks in 2022.

Furthermore, scheme [23] is not time-efficient enough, and the search results returned to the user contain plaintext-related information, which reduces the security of the scheme. In this paper, building on scheme [23], we propose a more efficient, more secure, easily updatable, and false-positive-free searchable symmetric encryption scheme for smart grid data by combining the principles of the pseudo-random function and the Bloom filter [28]. More specifically, we improve time efficiency by introducing the Bloom filter principle and changing the structure of the index. In addition, security is improved by introducing self-defined data IDs to hide the ID information in the search results. The main work of this paper can be summarized as follows:

(1) A new searchable symmetric encryption scheme for smart grid data based on the Bloom filter and pseudo-random functions is proposed, whose search is 52% faster than the previous scheme.

(2) We customize an ID for each piece of data and insert the hash value of the ID into the index; this hash value is what the server uses for searching. Indexes and search results contain no plaintext ID information, and we strengthen the hash by adding salt. Therefore, compared with the previous scheme, we improve the overall security.

(3) We implemented the proposed scheme and evaluated it on real datasets; the results show that the proposed scheme achieves higher search efficiency and security.

The rest of the paper is organized as follows: Section 2 presents the formulation of the problem in this paper. In Section 3, we introduce the main ideas of the previous scheme [23]. The details of our scheme are described in Section 4. We perform an experimental verification in Section 5 and discuss the security analysis in Section 6. Finally, the conclusion and future work are given in Section 7.

2. Problem Formulation

2.1. Preliminaries

Some algorithms and tools used in this paper are as follows:

(1) MD5 Algorithm: The MD5 message digest algorithm, a cryptographic hash function, maps a message of arbitrary length to a 128-bit (16-byte) hash, usually expressed as a 32-character hexadecimal sequence. To facilitate comparison with the original scheme, we first use the MD5 algorithm in our scheme. In addition, we also test our scheme with the KECCAK256 algorithm to achieve a higher level of security.

(2) Pseudo-random Function: A pseudo-random function is a keyed function f: K × {0,1}* → {0,1}^l that is computationally indistinguishable from a random function. If the adversary cannot determine whether a pair (x, y) satisfies y = f_k(x), even knowing any number of other pairs (x_i, f_k(x_i)) (where k is the secret key of the pseudo-random function), we say the pseudo-random function is secure. We use the AES algorithm as the pseudo-random function in our scheme; the Advanced Encryption Standard is one of the most popular algorithms for symmetric-key encryption and offers high efficiency and security.

(3) BKDR Hash: BKDR hash is a string hashing algorithm that amplifies small differences in the original data by repeatedly multiplying with a prime seed. The hash values derived by this algorithm have a small collision probability, but they are very large and cannot be mapped directly to bit-array addresses. In our scheme, we therefore reduce the hash value modulo the bit-array length and use the remainder as the address in the bit array. A comparison with other string hash functions [29] is shown in Table 1; BKDR hash works best, so we choose it as the string hash function in our scheme.

(4) Bloom Filter: The general way to determine whether an element is in a set is to compare the element with the elements of the set one by one, which is inefficient. A Bloom filter instead maps an element to positions in a bit array using hash functions and sets those positions to 1. At query time, it only checks whether the bit array holds 1 at the corresponding positions to decide whether the element is in the set. For an element x, {h_1(x), h_2(x), ..., h_t(x)} is the hash sequence of x. When adding an element x to the Bloom filter, we simply set the positions given by the hash sequence of x to 1. To determine whether an element x is in the Bloom filter, we check whether all positions given by the hash sequence of x are 1. A Bloom filter is more space-efficient and search-efficient than most alternatives, with the drawback that false positives can occur.

(5) Salt: Many common numbers or strings have MD5 values whose plaintexts can easily be recovered by querying an MD5 dictionary. Adding salt means appending a predefined string to the original characters so that it becomes difficult to look up the plaintext through a hash dictionary. This method is widely used for storing user passwords on servers.
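To make the BKDR hash and the (single-hash) Bloom filter concrete, here is a minimal Python sketch; the seed 131 and the array length are illustrative choices, not parameters fixed by the paper:

```python
# Minimal sketch: BKDR string hash reduced to a bit-array address, and a
# Bloom filter with a single hash function, as the scheme in Section 4 uses.

def bkdr_hash(s: str, m: int, seed: int = 131) -> int:
    """BKDR hash of s, taken modulo m to get a bit-array address."""
    h = 0
    for ch in s:
        h = h * seed + ord(ch)  # prime seed amplifies small differences
    return h % m

class BloomFilter:
    def __init__(self, m: int):
        self.m = m
        self.bits = [0] * m

    def add(self, element: str) -> None:
        self.bits[bkdr_hash(element, self.m)] = 1

    def might_contain(self, element: str) -> bool:
        # True may be a false positive; False is always correct.
        return self.bits[bkdr_hash(element, self.m)] == 1

bf = BloomFilter(1 << 16)
bf.add("2019")
assert bf.might_contain("2019")
```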

2.2. Notations

The notations and descriptions used in this paper are shown in Table 2.

2.3. System Model

As shown in Figure 1, our scenario involves two parties, namely, the Data Owner (DO) and the cloud server (CS). The DO owns the original plaintext grid data and continuously collects newly generated data. The DO first builds indexes based on the plaintext and then encrypts the plaintext with a common encryption algorithm. Finally, the DO uploads the indexes and ciphertext together to the CS. The CS provides data storage and searching services for the DO. After receiving the trapdoor from the DO, it matches the index files and sends the calculation results to the DO.

Similar to general searchable encryption schemes, our threat model assumes that the cloud server is honest but curious, i.e., the cloud server will provide the services as the user wishes but will also analyse the user data and learn additional information out of curiosity. This paper relies on a characteristic of smart grid data: each piece of data has the same format with the same number of data attributes, including year, month, and other attributes, as in the U.S. Energy Information Administration's EIA-861M form, the standard format for monthly reporting data in the U.S. electric power industry. In addition, the data owner can customize the data attributes in a format similar to that shown in Table 3.

3. The Scheme of Li

Typical searchable symmetric encryption schemes include the following four functions:

(i) KeyGen(1^λ) → k: The client generates the private key k from the security parameter λ.
(ii) Trapdoor(k, w) → t_w: The client takes the private key k and the keyword w as input and outputs the trapdoor t_w for the keyword w.
(iii) BuildIndex(k, D) → I: The client takes the private key k and the data set D as input and outputs the index file I of D.
(iv) Search(t_w, I): The server takes the trapdoor t_w and the index I as input and outputs 1 if I contains w, otherwise 0.
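As a hedged interface sketch (the names KeyGen/Trapdoor/BuildIndex/Search are the standard SSE ones, and the Python signatures below are illustrative, not the paper's code):

```python
# Illustrative interface for the four standard SSE functions.
from typing import Protocol

class SSEScheme(Protocol):
    def keygen(self, security_parameter: int) -> bytes:
        """Generate the private key k."""
    def trapdoor(self, k: bytes, w: str) -> bytes:
        """Compute the trapdoor t_w for keyword w."""
    def build_index(self, k: bytes, dataset: list[list[str]]) -> list:
        """Build the index file I of the data set D."""
    def search(self, t_w: bytes, index: list) -> bool:
        """Return True (1) if the index contains the keyword, else False (0)."""
```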

Similar to the general searchable symmetric encryption scheme, the cloud server in the scheme proposed by Li et al. [23] stores the encrypted files and index files. When searching, the data owner generates the trapdoor t_w for keyword w. The cloud server performs Search(t_w, I_i) for each piece of index I_i in I to determine whether the corresponding plaintext contains the keyword w. Finally, the IDs of all data in D that contain the keyword w are output and returned to the data owner. The specific steps are as follows:

3.1. The Data Owner Builds Index

The data owner runs the BuildIndex function and does the following steps for each piece of data d_i in the plaintext:

(a) For each data unit w of data d_i, it calculates the hash value h(w), where h is a hash function.
(b) It calculates the trapdoor t_w = f(k, h(w)) for w, where f is a pseudo-random function.
(c) It calculates the codeword c_w = F(t_w, ID_i) for w, where F is another pseudo-random function and ID_i is the ID of this piece of data. The scheme takes the row number of the data as the ID.
(d) The n codewords of a piece of data are randomly inserted into a list of length n. Finally, the index I_i of d_i is obtained.

Through the previous steps, the corresponding index I_i is generated for each piece of data, and the indexes are stacked in the original order to finally obtain the index I of the data set D. Then, the data owner sends the index I to the cloud server.
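For illustration, a minimal Python sketch of this per-row construction follows, assuming MD5 as the hash h, single-block AES as the pseudo-random functions f and F (as in Section 5), and the row number as the ID; the padding and key handling are our own simplifications:

```python
import random
from hashlib import md5
from Crypto.Cipher import AES  # pyCryptodome, as used in Section 5

def prf(key: bytes, msg: bytes) -> bytes:
    """AES on one 16-byte block, used as a pseudo-random function."""
    return AES.new(key, AES.MODE_ECB).encrypt(msg.ljust(16, b"\0")[:16])

def build_row_index_li(row: list[str], row_id: int, k: bytes) -> list[bytes]:
    codewords = []
    for w in row:
        hw = md5(w.encode()).digest()                     # h(w)
        t_w = prf(k, hw)                                  # t_w = f(k, h(w))
        codewords.append(prf(t_w, str(row_id).encode()))  # c_w = F(t_w, ID)
    random.shuffle(codewords)  # codewords are inserted in random order
    return codewords

k = b"0123456789abcdef"  # illustrative 16-byte AES key
index_row = build_row_index_li(["2019", "July", "AZ"], row_id=1, k=k)
```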

3.2. The Data Owner Builds Trapdoor

The data owner proposes a keyword w to be searched, runs the Trapdoor function to compute the trapdoor t_w = f(k, h(w)) for w using the secret key k, and sends t_w to the cloud server.

3.3. The Cloud Server Retrieval

After the cloud server receives the trapdoor t_w for w, it runs the Search function and performs the following calculation and judgment for each index I_i: the cloud server calculates the codeword c_w = F(t_w, ID_i) in combination with the corresponding row number; if the codeword is in this index I_i, it returns 1 and outputs the row number (i.e., the ID) of this index; otherwise, it returns 0. All data IDs containing the keyword are inserted into a list that is eventually returned to the data owner.
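A minimal sketch of this linear search (prf as in the sketch above) makes the exposure of row numbers explicit:

```python
def search_li(trapdoor: bytes, index: list[list[bytes]]) -> list[int]:
    hits = []
    for row_id, row_index in enumerate(index, start=1):
        # The server recomputes the codeword from the row number itself,
        # so the matching row numbers (IDs) are revealed in the result.
        codeword = prf(trapdoor, str(row_id).encode())  # c_w = F(t_w, ID)
        if codeword in row_index:  # codeword-by-codeword comparison
            hits.append(row_id)
    return hits
```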

The scheme is simple and practical, with low space complexity, and it is easy to update: an update only needs to append the new index to the original index, which does not affect the structure of the original index. The main steps are shown in Figure 2.

The time consumption in [23] divides into three main parts: data owner index construction time, data owner trapdoor construction time, and cloud server search time. The index and trapdoor construction times consist of the necessary hash and pseudo-random function computations. The cloud server search time is therefore the most critical part; it is directly visible to users, and its length determines the user experience. The cloud server search time is mainly spent determining whether the computed codewords exist in the stored indexes, and the search is linear: the codeword derived from the trapdoor is compared with the codewords in the indexes one by one. As the codeword strings are quite long, the search efficiency is low. In addition, the search result is the row number of the data that meets the search requirements, which exposes the relationship between the trapdoor and the corresponding data row numbers to the server.

4. Our Improved Scheme

In view of the abovementioned problems, we use the principle of the Bloom filter to speed up the search process. Moreover, we introduce a data ID and store only the hash of the ID in the index, instead of storing the ID in plaintext. Since the data ID decouples each row from its position, the order of the entries in the index can be shuffled, and the plaintext and index no longer correspond to each other row by row. In this way, we enhance the overall security. Our scheme also contains the typical four functions from Section 3. The specific steps of our scheme are as follows:

4.1. The Data Owner Builds Index

The data owner first adds an ID before each piece of data, incrementing from 1, or uses another custom ID assignment method. After that, it runs the BuildIndex function and does the following for each piece of data d_i in the plaintext:

(a) For each data unit w of data d_i, it calculates the hash value h(w), where h is a hash function.
(b) It calculates the trapdoor t_w = f(k, h(w)) for w, where f is a pseudo-random function.
(c) It calculates the codeword c_w = F(t_w, h(ID_i + salt)) for w, where F is another pseudo-random function, ID_i is the ID of this piece of data, and salt is a custom string added to the calculation of the ID hash.
(d) It inserts the n codewords of a piece of data into a list of length n and shuffles the list.
(e) It calculates the hash value h(ID_i + salt) of the data ID and inserts this value at the head position of the list from (d) to get a list of length n + 1 as the index I_i of this piece of data.

After doing the previous process for each piece of data, M lists are obtained; we then shuffle their order to get the index I of the data set D and send it to the cloud server.
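A minimal sketch of our per-row construction (prf and md5 as in the sketch in Section 3.1; the salt value is an illustrative placeholder):

```python
import random
from hashlib import md5

SALT = "predefined-salt"  # illustrative predefined string

def build_row_index_ours(row: list[str], row_id: int, k: bytes) -> list[bytes]:
    id_hash = md5((str(row_id) + SALT).encode()).digest()  # h(ID + salt)
    codewords = []
    for w in row:
        t_w = prf(k, md5(w.encode()).digest())  # t_w = f(k, h(w))
        codewords.append(prf(t_w, id_hash))     # c_w = F(t_w, h(ID + salt))
    random.shuffle(codewords)                   # step (d): shuffle the list
    return [id_hash] + codewords                # step (e): salted ID hash at head
```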

4.2. The Cloud Server Builds BF Index

The cloud server does the following for each index I_i in the index file I:

(a) It creates a bit array of length m and initializes all positions to 0. It calculates the BKDR hash value BKDRhash(c_w) mod m for each remaining codeword c_w in I_i, i.e., for every element except the first (the hash value of the data ID). The resulting values are distributed between 0 and m − 1.
(b) It takes this value as the position of the codeword in the bit array and sets that position to 1 to obtain the BF index B_i of the data d_i.

After doing the previous process for each index, we get M bit arrays as the BF index B of the data set D. This step only needs to be performed once and is reused for all subsequent searches.
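A minimal sketch of the BF-index construction; the bit-array length M_BITS is an illustrative parameter, and the BKDR hash is the byte-string variant of the one in Section 2.1:

```python
M_BITS = 1 << 16  # illustrative bit-array length m

def bkdr_hash_bytes(b: bytes, m: int, seed: int = 131) -> int:
    """BKDR hash over a byte string, reduced modulo m."""
    h = 0
    for byte in b:
        h = h * seed + byte
    return h % m

def build_bf_index(row_index: list[bytes]) -> tuple[bytes, list[int]]:
    """Turn one row index [id_hash, c_1, ..., c_n] into (id_hash, bit array)."""
    id_hash, codewords = row_index[0], row_index[1:]
    bits = [0] * M_BITS
    for c_w in codewords:  # the head element (salted ID hash) is skipped
        bits[bkdr_hash_bytes(c_w, M_BITS)] = 1
    return id_hash, bits
```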

4.3. The Data Owner Builds Trapdoor

The data owner proposes a keyword w to be searched, runs the Trapdoor function, computes the trapdoor t_w = f(k, h(w)) for w using the secret key k, and sends t_w to the cloud server.

4.4. The Cloud Server Retrieval

After the cloud server receives the trapdoor t_w for keyword w, it runs the Search function and performs the following calculations and judgments for each BF index B_i:

(a) It calculates the codeword c_w = F(t_w, h(ID_i + salt)) using the first element (the hash of the data ID) stored in the index I_i.
(b) If the position BKDRhash(c_w) mod m in this BF index (i.e., in the bit array) holds 1, it returns 1 and outputs the first element (the hash of the data ID) of this index; otherwise, it returns 0.

After outputting the hash values of all eligible data IDs, we obtain the preresult. Then, (c) for each element (an ID hash) in the preresult, the server uses the element to calculate the codeword c_w = F(t_w, h(ID_i + salt)); if the codeword is in the corresponding index I_i, it returns 1 and outputs the hash value of this data ID; otherwise, it returns 0. The main steps are shown in Figure 3.

Due to the false-positive nature of the Bloom filter, there may be redundant results in the preresult. We therefore perform another linear comparison on the preresult (i.e., step (c)) and finally return the exact results to the data owner. Since the preresult has already been screened once, the amount of remaining data is very small, and the linear comparison takes only a very small amount of time. The data owner maintains a dictionary mapping IDs to their hash values and can find the corresponding data ID from an ID hash in a few microseconds. If necessary, the cloud server later returns the corresponding ciphertext to the data owner by the hash value of the ID. Our scheme is efficient, more secure, and easy to update: when updating, we simply append the new index to the original index, without changing the original index.
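Putting the two phases together, a minimal sketch of the server-side search (prf, bkdr_hash_bytes, and M_BITS as in the sketches above; row_indexes maps each salted ID hash to its full row index):

```python
def search_ours(trapdoor: bytes,
                bf_index: list[tuple[bytes, list[int]]],
                row_indexes: dict[bytes, list[bytes]]) -> list[bytes]:
    """Return the salted ID hashes of matching rows; no row numbers leak."""
    # Phase 1: Bloom filter screening (steps (a)-(b)); may contain
    # false positives.
    preresult = []
    for id_hash, bits in bf_index:
        c_w = prf(trapdoor, id_hash)  # c_w = F(t_w, h(ID + salt))
        if bits[bkdr_hash_bytes(c_w, M_BITS)] == 1:
            preresult.append(id_hash)
    # Phase 2 (step (c)): exact linear comparison on the small preresult.
    return [id_hash for id_hash in preresult
            if prf(trapdoor, id_hash) in row_indexes[id_hash][1:]]
```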

5. Performance Evaluation

To verify the efficiency of our scheme, we experimentally compared it with the scheme of [23]. The experimental device runs Windows 10 with a 3.2 GHz Intel Core i5 processor and 8 GB RAM. The generic cryptographic algorithms provided by the pyCryptodome library [30] are used, with MD5 and AES as the hash function and pseudo-random function, respectively. We also use the KECCAK256 algorithm in place of MD5 to achieve a higher level of security. The comparison mainly covers the cloud server search time and the data owner index construction time. Since the trapdoor construction time is exactly the same for both schemes, no comparison is made. In addition, a theoretical comparison with other typical searchable symmetric encryption schemes is shown in Table 4 (n is an extremely small constant compared with N).

5.1. Experimental Data

The data used for the experiments come from a public data set: the AMI (advanced metering infrastructure) statistics provided on the official website of the EIA (U.S. Energy Information Administration). The data follow the EIA-861M tabular format, the standard format for monthly reports in the power industry [31]. The report collects information such as monthly electricity sales and revenues from a statistical sample of U.S. electric utilities in a CSV file set that includes 22 data attributes, including month and year, customer address, and customer ID. The data types include strings and unsigned integers, and the experimental data volume is incremented from 2,000 to 16,000 items. Through experiments we select suitable parameters, including the number of bits of the computed codeword string and the length of the bit array; the goal is to find the right balance between the Bloom filter search time and the linear comparison time so as to minimize the total time.
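As a rough guide for this trade-off, the standard Bloom filter estimate, stated here for the single hash function per codeword that our scheme uses (n codewords inserted into a bit array of length m), gives the per-row false-positive probability

$$p \approx 1 - \left(1 - \frac{1}{m}\right)^{n} \approx 1 - e^{-n/m},$$

so enlarging m shrinks the expected preresult handed to the linear comparison, at the cost of a longer bit array.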

5.2. Experimental Results

To guarantee the accuracy and correctness of the experimental results, each data point in the figures is the average of three repeated experiments. For different data volumes, we use line graphs to compare the time efficiency of the two schemes. As can be seen from Figure 4, the search time of our scheme is on average about 52% faster than that of the scheme in [23]. The reasons are twofold. First, we speed up the search process using the Bloom filter principle and eliminate false positives with a secondary search, improving search efficiency while guaranteeing exact search. Second, we change the structure of the index: the step of calculating the hash value of the data ID is omitted from the search process. In [23], the hash of the row number is calculated when the cloud server searches; in our scheme, the hash of the custom ID is included in the index and can be used directly. This not only reduces the search time but also improves security. It can be seen from Figure 5 that our scheme takes slightly more time to build the index than [23], because the cloud server in our scheme spends a small amount of extra time constructing the Bloom filter index. Our experimental code (including the KECCAK256 version) is available at [32], and the code of scheme [23], used in our comparative experiments, is available at [33].

6. Security Analysis

The cloud server cannot obtain any plaintext information from the index, because a hash function and a pseudo-random function are applied to the plaintext keywords to derive the codewords; recovering them would require the attacker to break the pseudo-random function (AES), which is computationally infeasible. In [23], the authors use the row number of the data as its ID, so that the same keyword in two pieces of data yields different codewords, achieving index unlinkability. However, this has two disadvantages. First, the ordering of the plaintext data and the ordering of the index are exactly the same, which leaks the plaintext order to the server. Second, the search results (the row numbers of all data containing the queried keyword) are directly exposed to the cloud server, which can thus learn the correspondence between a trapdoor and the row numbers of the plaintext data containing its keyword. In our scheme, we customize an ID for each piece of data, and after generating the index, we can shuffle its order, improving security. In addition, the search result is a hash of the data ID, so the cloud server cannot learn the correspondence between any trapdoor and the data IDs containing its keywords. For further mathematical and security proofs, we refer the reader to [4, 34, 35].

In the system model of the scheme, there are two roles: the Data Owner (DO) and the cloud server (CS). We assume that the CS is honest-but-curious, which means that the CS can be trusted to execute the DO's commands. Meanwhile, the CS is curious about the data and may analyse the ciphertext, index, and search results to obtain information about the plaintext [20].

6.1. Security Model of Index Unlinkability (IND-UN)

In this security model, we assume the adversary A to be a malicious cloud server and the challenger C to be a sender. IND-UN guarantees that an adversary cannot learn whether two pieces of indexes share the same keyword w. In the following, we write x ←$ X to denote that x is drawn uniformly at random from the set X. We set n as the number of data units (keywords) in a piece of data; h is a hash function; f and F are pseudo-random functions; salt is a predefined string.

Theorem 1. If an adversary A can break IND-UN with non-negligible advantage, then there exists another adversary B which can break the pseudo-random function and the hash function with non-negligible advantage.

Proof.
(i) Setup. The challenger C gets a private key k.
(ii) Query. The challenger C chooses two pieces of data d_0, d_1 ∈ D (D is the plaintext data set).
(iii) Challenge. The challenger C lets b = 1 if there exists a common keyword in the two pieces of data; otherwise, b = 0. The challenger computes the index of each piece of data: for each keyword w in d_j (j ∈ {0, 1}), it computes the codeword c_w = F(f(k, h(w)), h(ID_j + salt)). It inserts the n codewords of the data into a list of length n and shuffles the list. It calculates the hash value h(ID_j + salt) of the data ID and inserts this value at the head position of the list to get a list of length n + 1 as the index. Then, the challenger gets the indexes I_0, I_1 for d_0, d_1 and sends them to the adversary A.
(iv) Response. The adversary A makes a guess: it lets b′ = 1 if there exists a common keyword in the two pieces of indexes; otherwise, b′ = 0, and it outputs b′.
We define the advantage of the adversary A to be Adv_A = |Pr[b′ = b] − 1/2|. To gain a non-negligible advantage, the adversary A would have to act as the adversary B and break the pseudo-random function and the hash function. The adversary A cannot do so, since A is a polynomial-time attacker and breaking the hash function (KECCAK256) and the pseudo-random function (AES) are hard problems. Therefore, the advantage Adv_A is negligible, which means our scheme meets index unlinkability. We achieve this because the same keyword has different codewords in two different pieces of indexes.

6.2. Security Model of Trapdoor Indistinguishability (TD-IND)

In this security model, we assume the adversary A to be a malicious cloud server and the challenger C to be a sender. TD-IND guarantees that the trapdoors constructed from two keywords w_0 and w_1 are indistinguishable for an adversary. In the following, we write x ←$ X to denote that x is drawn uniformly at random from the set X. We set n as the number of data units (keywords) in a piece of data; h is a hash function; f is a pseudo-random function.

Theorem 2. If an adversary A can break TD-IND with non-negligible advantage, then there exists another adversary B which can break the pseudo-random function and the hash function with non-negligible advantage.

Proof.
(i) Setup. The challenger C gets a private key k.
(ii) Query. The challenger C chooses two keywords w_0, w_1 ∈ D (D is the plaintext data set).
(iii) Challenge. The challenger C samples b ←$ {0, 1} and computes the trapdoor of keyword w_b: t_{w_b} = f(k, h(w_b)). Then, the challenger sends the trapdoor t_{w_b} to the adversary A.
(iv) Response. The adversary A makes a guess b′ and outputs it.
We define the advantage of the adversary A to be Adv_A = |Pr[b′ = b] − 1/2|. To gain a non-negligible advantage, the adversary A would have to act as the adversary B and break the pseudo-random function and the hash function. The adversary A cannot do so, since A is a polynomial-time attacker and breaking the hash function (KECCAK256) and the pseudo-random function (AES) are hard problems. Therefore, the advantage Adv_A is negligible, which means our scheme meets trapdoor indistinguishability.

7. Conclusion and Future Work

In this paper, we propose a simple, efficient, and easily updatable searchable symmetric encryption scheme for smart grid data. High efficiency is achieved using the Bloom filter principle, and the scheme was experimentally verified to be about 52% faster than the latest scheme [23]. In addition, the introduction of a custom data ID allows the index order to be shuffled so that it no longer corresponds to the plaintext row by row. The search result is the hash of the ID, which no longer exposes the relationship between trapdoor and data ID, improving overall security. The datasets selected in our experiments are public, but in practice the power sector holds a great deal of other data, involving user privacy, that cannot be published. These data are much larger in volume and contain more string-type attributes about users' personal privacy, which makes them very suitable for our scheme.

There are three priorities in our future research work. The first is range search over numerical data to meet the statistics and financial auditing needs of smart grid data; for example, querying which users' electricity consumption fell within a certain range last year. We can use cryptographic methods such as OPE (order-preserving encryption) and HE (homomorphic encryption) to realize range search. The second is fuzzy keyword search: when searching, users may enter wrong keywords, for example with missing or mistyped letters; we can use LSH (locality-sensitive hashing) [17] or fuzzy sets to achieve fuzzy keyword search. The third is forward security, which aims to reduce information leakage and further enhance the security of the scheme; to achieve it, we need to prevent data information from being disclosed when the data in the cloud server is updated. More information can be found in [36].

Data Availability

The links of the experimental data of this study and related code are included within this article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Grant no. LQ20F020019) and the Technology on Communication Security Laboratory (Grant no. 6142103190105).