Abstract

Currently, smart devices in the Internet of Things (IoT) generate massive amounts of data for different applications. However, sensitive information may be exposed to external users during IoT data collection, transmission, and mining. In this paper, we propose a novel indexing and searching schema based on homocentric hyperspheres and similarity-aware asymmetric LSH (H2SA-ALSH) for privacy-preserving data collection and mining in IoT environments. The H2SA-ALSH collects multidimensional data objects and indexes their features according to the Euclidean norm and cosine similarity. Additionally, we design a c-k-AMIP searching algorithm based on H2SA-ALSH. Our approach boosts the performance of maximum inner product (MIP) queries and top-k queries for a given query vector using the proposed indexing schema. Experiments on real-world datasets show that our algorithm outperforms other ALSH-based algorithms in accuracy and efficiency. At the same time, our indexing scheme protects user privacy by generating similarity-based indexing vectors without exposing raw data to external users.

1. Introduction

In recent years, Internet of Things (IoT) technology has been applied to a wide range of applications [1, 2], mainly driven by the rising number of Internet-connected devices, which already amount to several billion [3]. The devices of IoT [4] aim to connect everyday objects, such as humans, plants, and even animals, to the Internet to enable interactions among these objects [5]. Applications of IoT have been widely developed in medical healthcare [6, 7], vehicular networks [8, 9], and industrial IoT [10]. With the widespread popularity of IoT, a massive amount of data is generated and disseminated at high velocity.

Thus, applications in different IoT domains have seen an explosion of information generated from heterogeneous devices every day. Recently, data collection and mining over IoT data streams have attracted increasing research interest [11–15].

However, due to weak privacy and security protection in IoT devices, some smart IoT applications expose sensitive data and user privacy to security threats [16]. Thus, data mining over raw data will collect and expose user-sensitive information. As with stream data mining [17], which extracts interesting knowledge, regularities, or high-level information, such processes can easily violate privacy protection policies. At present, the MIP (maximum inner product) search is prominent and has been used in a wide range of applications, such as matrix factorization-based recommendation systems [18–20], multiclass label prediction [21, 22], SVM classification [23], and even deep learning [24]. However, it is time-consuming to conduct the MIP search in high-dimensional space. Moreover, it may cause leakage of user privacy, since a query system needs to collect the raw data from the devices of the IoT system. Many studies try to construct an appropriate approximate structure for the search, usually called approximate maximum inner product (AMIP) search [25–29], in which, given a query q and a set S of target data objects, the AMIP algorithms compute the approximate maximum inner product result for q in S.

The techniques of AMIP are often based on locality-sensitive hashing (LSH) [30], which can solve the AMIP problem in sublinear time. Many LSH-based algorithms have been proposed, such as L2-ALSH [28], Sign-ALSH [31], Simple-ALSH [27], XBOX [32], and H2-ALSH [33]. Additionally, many data mining tasks over massive datasets apply LSH-based algorithms [34–37] to accelerate the MIP search. The common AMIP data mining algorithms, such as L2-ALSH [28] and Simple-ALSH [27], convert the AMIP searching problem into a c-ANN searching problem. Recently, some novel methods have been proposed to solve high-dimensional AMIP search problems by introducing approximate features into the indexing vectors.

Motivated by these promising techniques, we can extract target features from the raw data objects of IoT devices, conduct the maximum similarity search between the input vector and the set of extracted indexing vectors, and transmit only the object with the matched features to the user for a final decision. Thus, user privacy is protected from sensitive information collection and exposure to third-party query services [38, 39]. The contributions of the paper are as follows:
(1) We propose a novel privacy-preserving indexing and searching schema, termed H2SA-ALSH, for high-dimensional data object collection and mining. The indexing scheme is based on homocentric hyperspheres and a similarity-aware algorithm (H2SA). The search computes the cosine similarity between a query vector and the data objects. The proposed schema can support AMIP search, top-k search, etc., without exposing raw data.
(2) We optimize the proposed indexing solution to fit IoT data collection and mining. In the process of IoT data collection, we establish an incremental indexing mechanism, which indexes each input item immediately when it arrives. For IoT data mining, we design SRP-LSH-based filtering to accelerate the search by skipping low-similarity objects. Moreover, the algorithm is not sensitive to the data, i.e., it presents acceptable performance over datasets with different distributions.
(3) We conduct comprehensive experiments to evaluate the H2SA-ALSH indexing and searching scheme using three real-world data sets. The experimental results show that the proposed approach is more accurate and efficient than state-of-the-art algorithms. As a result, a search query is never conducted directly over the raw data objects in IoT environments.

2. Problem Definition

In this section, we briefly present preliminaries of the proposed techniques and state our research problem formally, using the common notations in the AMIP literature to define the MIP search and the corresponding AMIP searching problem.

Definition 1. Maximum inner product (MIP) search. For a data collection S that has already received n data objects and an arbitrary query q, the MIP search aims to find o* ∈ S that satisfies o* = argmax_{o ∈ S} ⟨o, q⟩.
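As a concrete baseline, the MIP search of Definition 1 can be implemented by an exhaustive scan. The following is a minimal Python sketch; the data values are purely illustrative:

```python
def mip_search(S, q):
    """Exhaustive MIP search: return the object in S maximizing <o, q>."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(S, key=lambda o: dot(o, q))

S = [(1.0, 0.0), (0.6, 0.8), (2.0, 2.0)]
q = (1.0, 1.0)
print(mip_search(S, q))  # (2.0, 2.0): its inner product with q is 4.0
```

This exact scan costs time linear in |S| per query, which motivates the sublinear approximate structures discussed below.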

Definition 2. The c-approximate maximum inner product (c-AMIP) search. Given an approximation ratio c (0 < c < 1), the goal of the c-AMIP search is to construct an approximate structure from which a user can find an approximate result o ∈ S satisfying ⟨o, q⟩ ≥ c · ⟨o*, q⟩ for a query q, where o* is the exact result of the MIP search.

In this paper, we convert the c-AMIP search problem into the c0-ANN problem. The c0-ANN problem aims to find the nearest neighbour according to the Euclidean distance. The definition of the c0-ANN problem is as follows:

Definition 3. Given an approximation ratio c0 (c0 > 1) and a query vector q, c0-ANN aims to find a data object o ∈ S which satisfies ‖o − q‖ ≤ c0 · ‖o* − q‖, where o* is the exact nearest neighbour of q.

LSH is a common method for solving the c0-ANN problem. We use the definition of the nearest neighbour under a distance measure d(·, ·) to depict the LSH paradigm. Let h be a hash function that maps an item to a hash value; the corresponding definition is as follows.

Definition 4. A hash family H is called (r1, r2, p1, p2)-sensitive when it meets the following conditions. For multidimensional data objects o1 and o2, every hash function h from H satisfies: (1) if d(o1, o2) ≤ r1, then the probability of h(o1) = h(o2) is at least p1; (2) if d(o1, o2) ≥ r2, then the probability of h(o1) = h(o2) is at most p2, where r1 < r2 and p1 > p2, respectively.

We adopt the common LSH technology to solve the ANN search problem: similar data objects have a higher probability of producing the same hash values than those with lower similarity. Thus, LSH can solve the nearest neighbour and similarity problems over multidimensional data in sublinear time [40].

Furthermore, we transform the AMIP search problem into the nearest neighbour problem via asymmetric locality-sensitive hashing (ALSH). There has been considerable research on ALSH technologies [27, 30, 31, 32, 33]. In this paper, we use the QNF transformation [32]. For a data object o and a query q, the transformation is as follows:

P(o) = [o; √(M² − ‖o‖²)],  (3)
Q(q) = [q; 0].  (4)

In formulas (3) and (4), the constant M presents the largest Euclidean norm among the data collection S. The maximum Euclidean norm may constantly change as more data is collected from IoT devices. In our schema, we assign an appropriate M to each ALSH unit, where M is the maximum Euclidean norm of that unit, so that ‖o‖ ≤ M for every data object o in the unit. Through the QNF transformation, the AMIP search problem can be converted into the nearest neighbour search problem. The following identity is used in the transform:

‖Q(q) − P(o)‖² = ‖q‖² + M² − 2⟨q, o⟩.  (5)

In Equation (5), ‖q‖ and M are constant for a given query q, so we have

argmin_{o ∈ S} ‖Q(q) − P(o)‖ = argmax_{o ∈ S} ⟨q, o⟩.
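As a sanity check of this transformation, the following minimal Python sketch (assuming the QNF forms P(o) = [o; √(M² − ‖o‖²)] and Q(q) = [q; 0]; the data values are illustrative) verifies that the nearest transformed point coincides with the MIP object:

```python
import math

def qnf_P(o, M):
    """Pad a data object with sqrt(M^2 - ||o||^2) so every P(o) lies on a sphere of radius M."""
    n = math.sqrt(sum(x * x for x in o))
    return list(o) + [math.sqrt(max(M * M - n * n, 0.0))]

def qnf_Q(q):
    """Pad the query with 0; its norm is unchanged."""
    return list(q) + [0.0]

S = [(1.0, 0.0), (0.6, 0.8), (2.0, 2.0)]
q = (3.0, 1.0)
M = max(math.sqrt(sum(x * x for x in o)) for o in S)

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

mip = max(S, key=lambda o: dot(o, q))                      # exact MIP object
nn = min(S, key=lambda o: dist(qnf_P(o, M), qnf_Q(q)))     # nearest neighbour after QNF
print(mip == nn)  # True: argmin distance after QNF equals argmax inner product
```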

The resulting minimization is a nearest neighbour search problem, which can be solved quickly by ℓ2-LSH. We now present the signed random projections LSH (SRP-LSH) and the ℓ2-LSH, whose similarity measures are the correlation similarity and the ℓ2 distance, respectively. For the correlation similarity, let θ(o, q) ∈ [0, π] be the angle between the two multidimensional vectors o and q. The distance under correlation similarity is 1 − cos(θ(o, q)).

The correlation similarity is cos(θ(o, q)), and SRP-LSH can solve the maximum correlation similarity search. The procedure can be depicted as follows: first, a random vector a whose entries are drawn from the standard normal distribution is generated. The random vector determines a hash function h_a, which returns binary results: if aᵀx ≥ 0, then h_a(x) = 1; otherwise, h_a(x) = 0. The LSH family is formed by several such random vectors. For SRP-LSH, we can conclude

Pr[h_a(o) = h_a(q)] = 1 − θ(o, q)/π,

i.e., θ(o, q) = π · (1 − Pr[h_a(o) = h_a(q)]).
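The SRP-LSH procedure above can be sketched in Python; the angle is recovered from the empirical collision rate. The number of hash functions and the test vectors below are arbitrary illustrative choices:

```python
import math
import random

random.seed(7)

def srp_hash(a, x):
    """Signed random projection: 1 if a.x >= 0, else 0."""
    return 1 if sum(ai * xi for ai, xi in zip(a, x)) >= 0 else 0

def estimate_angle(o, q, num_hashes=20000):
    """Estimate theta(o, q) via Pr[h_a(o) = h_a(q)] = 1 - theta/pi."""
    dim = len(o)
    collisions = 0
    for _ in range(num_hashes):
        a = [random.gauss(0.0, 1.0) for _ in range(dim)]
        collisions += srp_hash(a, o) == srp_hash(a, q)
    return math.pi * (1.0 - collisions / num_hashes)

o, q = (1.0, 0.0), (0.0, 1.0)  # true angle is pi/2
theta = estimate_angle(o, q)
print(abs(theta - math.pi / 2) < 0.1)  # estimate is close to the true angle
```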

Now, we briefly propose the indexing schema based on the asymmetric LSH scheme for high-dimensional AMIP search. We adopt both the ℓ2-LSH and SRP-LSH. The indexing features from IoT devices are calculated by the Euclidean norm and the cosine similarity among the data. In more detail, when a data object arrives, the schema calculates the object's Euclidean norm and keeps the feature in an exact block according to the cosine similarity. The exact block and the exact bucket determine the data item's storage unit. When conducting a query, the schema adopts the QNF transformation and searches the c-AMIP results through the ℓ2-LSH, precisely through QALSH [32]. We keep the block partition principle of H2-ALSH: the blocks are divided by the Euclidean norms of the data objects with a division ratio. Besides, we consider another factor determining the inner product, namely, the angle between the given query and the data objects. We use SRP-LSH to divide one block into buckets, so the buckets are the minimum storage units in our schema. The overview of our indexing schema is shown in Figure 1.

When we conduct the AMIP search, we traverse the blocks in order, from the block with the largest Euclidean norms to the one with the smallest. Then, within one block, we traverse the buckets from high to low cosine similarity.

In our schema, the computation can focus on the candidate data objects, which have a higher possibility of becoming the AMIP search results, and the search process finishes when there is no need to traverse the remaining data objects. Thus, filtering out the unnecessary data objects allows the schema to reach remarkable time performance.

Our work is different from the article [32], in which the data objects are treated as static items and all data are divided into buckets only by the Euclidean norm. Our schema considers IoT environments, where the data is updated frequently and we cannot sort the whole set statically. Instead, each input object is inserted into our H2SA-ALSH unit when it arrives. This incremental indexing construction does not decrease the accuracy of subsequent queries. Therefore, our indexing schema is more appropriate for IoT scenarios where the features are dynamically generated by distributed devices and applications.

3. Indexing Construction

Given a continuous object series S and an incoming object o, we first calculate the Euclidean norm ‖o‖. To effectively divide the blocks, we introduce the interval ratio b (0 < b < 1). Given the AMIP search approximation ratio c and the query angle θ in the bucket, c0 is the approximation ratio of the ANN search, and b can be expressed in terms of c, c0, and θ.

We now explain the blocks and use B_i to denote the i-th block. We assign each data object to a specific block and a specific bucket. There are several buckets in each block B_i, and different buckets represent classes of objects according to the cosine similarity. Every indexing unit has a unique identifier that consists of a block identifier (bid) and a bucket identifier (sid). The schema determines the specific bucket identifier of a data object according to the hash family of SRP-LSH. Assuming that the hash family of SRP-LSH uses m hash functions, the bucket identifier can be expressed as m bits. The bucket can be initialized lazily. All data objects whose norms fall into the norm interval of block B_i will be assigned to B_i. As data objects are put into the buckets, the number of objects per bucket grows. We set a threshold N0: once a bucket reaches N0 objects, the bucket uses QNF to convert its d-dimensional data into (d+1)-dimensional data and then builds a QALSH index. For the buckets whose numbers of data objects are less than the threshold N0, the raw features are stored directly. When dividing the blocks, the schema determines the first block based on the first data object of S. The maximum norm of this block is the norm of that first object, and the block is set as the benchmark block. Then, we can determine the blocks of the other data objects: for subsequent data, we calculate the specific block based on the object's norm and the benchmark norm M1. The process is presented in Algorithm 1.

Input: a time series S with n objects, an interval ratio b, and a threshold N0
Output: the number of disjoint sets K, and K disjoint sets with blocks B_1, …, B_K.
i ← 1;
Compute the benchmark norm M1 = ‖o_1‖;
While an object o_i arrives do
   Determine the block B_j of o_i from ‖o_i‖, M1, and b;
   Compute the bucket of the block B_j using the SRP-LSH hash family;
   Insert o_i into the bucket;
   i ← i + 1;
   If the size of the bucket reaches N0 then
      Build hash tables for the objects of the bucket using ALSH;
   End
   If the size of the bucket exceeds N0 then
      Insert o_i into the QALSH index of the bucket;
   End
End
K ← the number of nonempty blocks;
Return K, B_1, …, B_K.
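A simplified executable sketch of Algorithm 1's incremental assignment follows: the block is chosen by the Euclidean norm with interval ratio b relative to the benchmark norm, and the bucket by m SRP bits. The block-numbering rule and all names here are illustrative assumptions, not the paper's exact implementation:

```python
import math
import random
from collections import defaultdict

random.seed(1)

DIM, M_BITS, B_RATIO = 3, 4, 0.5
PROJ = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(M_BITS)]
index = defaultdict(list)  # (block_id, bucket_id) -> objects
benchmark = None           # M1: norm of the first arriving object

def norm(o):
    return math.sqrt(sum(x * x for x in o))

def srp_bits(o):
    """Bucket id: m signed-random-projection bits."""
    return tuple(1 if sum(a * x for a, x in zip(p, o)) >= 0 else 0 for p in PROJ)

def insert(o):
    """Assign o to block i such that b^i * M1 < ||o|| <= b^(i-1) * M1 (hypothetical rule)."""
    global benchmark
    n = norm(o)
    if benchmark is None:
        benchmark = n
    block = max(1, 1 + math.floor(math.log(n / benchmark, B_RATIO)))
    index[(block, srp_bits(o))].append(o)

for obj in [(3.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.5, 0.5, 0.0)]:
    insert(obj)
print(len(index) >= 2)  # objects with sufficiently different norms land in distinct blocks
```

A real implementation would additionally trigger the QNF/QALSH build once a bucket reaches the threshold N0, which is omitted here for brevity.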

4. Similarity-Aware AMIP Searching

To respond to an arbitrary maximum inner product query q, we first calculate the Euclidean norm ‖q‖ and initialize the running MIP value. Since the maximum norm of the data objects in the first block is the largest one, that block is most likely to contain the MIP data object. Thus, we traverse the blocks from B_1 to B_K. Each block contains many buckets organized by angle similarity. Moreover, the MIP data objects are most likely to have high cosine similarity with the query q, so the traversal of the buckets is performed in ascending order of the angle.

For a block B_i, the AMIP process can be described in three main stages. First, for a query q and block B_i, we calculate a termination condition based on the upper bound of the inner product in the block: for all objects o in B_i, ⟨o, q⟩ ≤ ‖o‖ · ‖q‖ ≤ M_i · ‖q‖, where M_i is the maximum norm of B_i. In the AMIP algorithm, we consider both the effect of the data norm ‖o‖ and the angle between the query q and o. Within each bucket, we use SRP-LSH to estimate the cosine similarity between q and the bucket. If the similarity of a bucket satisfies the given similarity threshold, the schema conducts the AMIP search process in that bucket. The cosine similarity estimation introduces errors, and we quantify this error in a later section. Then, we use the termination bound and the given cosine similarity for AMIP in the buckets: (1) before starting to traverse the block B_i, the schema stops traversing the remaining blocks if the current best inner product already exceeds the bound M_i · ‖q‖, and the algorithm then returns the AMIP data object; (2) otherwise, we traverse the buckets in the block, and if the cosine similarity of a bucket does not satisfy the given similarity, the schema skips that bucket and traverses the others.

Input: the query q, the threshold N0, the number of disjoint sets K, and the structure of H2SA-ALSH: B_1, …, B_K;
Output: the approximate MIP object o
Compute ‖q‖;
o ← null;
p ← −∞;
For i = 1 to K do
   Compute the inner-product bound M_i · ‖q‖ of the block B_i;
   If p ≥ M_i · ‖q‖ then
      Break
   End
   For each bucket in B_i do
      Compute the hash value of q using the SRP-LSH hash family;
      If q causes a hash collision with the bucket using the "hash banding" technique then
         If the bucket has a QALSH index then
            Search the QALSH index of the bucket;
         Else
            Scan the raw features stored in the bucket;
         End
         Update o and p with the best object found;
      End
   End
End
Return o
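The early-termination logic of Algorithm 2 can be illustrated with a simplified sketch: blocks are scanned in descending order of their maximum norms, and scanning stops once the best achievable inner product of the next block (bounded by the block's maximum norm times ‖q‖, via Cauchy–Schwarz) cannot beat the current best. Names and structure here are illustrative, not the paper's exact implementation:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def amip_search(blocks, q):
    """blocks: list of (max_norm, objects) pairs, sorted by max_norm descending."""
    best, best_ip = None, -math.inf
    qn = norm(q)
    for max_norm, objects in blocks:
        # Cauchy-Schwarz: no object in this (or any later) block can exceed this bound.
        if best_ip >= max_norm * qn:
            break
        for o in objects:
            ip = dot(o, q)
            if ip > best_ip:
                best, best_ip = o, ip
    return best

blocks = [(3.0, [(3.0, 0.0), (2.0, 2.0)]), (1.0, [(0.6, 0.8)])]
q = (1.0, 1.0)
print(amip_search(blocks, q))  # (2.0, 2.0): the second block is pruned by the bound
```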

In the process of cosine similarity searching, we apply hash banding to improve calculation accuracy. In detail, an identifier of a bucket can be represented by m bits. When we use hash banding, the m bits are divided into r bands, and each band has m/r bits. For a query q, the SRP-LSH hash functions also produce m bits, which are likewise divided into r bands. If any of the r bands of q is the same as the corresponding band of the bucket's identifier, we say a hash similarity collision occurs, and the angle meets our calculation requirement. The overall AMIP searching algorithm is described in Algorithm 2.
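A minimal sketch of the banding test (m bits split into r bands; a bucket "collides" if any band matches exactly; all names and bit patterns are illustrative):

```python
def bands(bits, r):
    """Split a bit tuple into r equal bands."""
    w = len(bits) // r
    return [tuple(bits[i * w:(i + 1) * w]) for i in range(r)]

def banding_collision(query_bits, bucket_bits, r):
    """True if at least one of the r bands matches exactly."""
    return any(a == b for a, b in zip(bands(query_bits, r), bands(bucket_bits, r)))

q_bits = (1, 0, 1, 1, 0, 0)
b1 = (1, 0, 0, 0, 1, 0)  # first band (1, 0) matches the query's first band
b2 = (0, 1, 0, 0, 1, 1)  # no band matches
print(banding_collision(q_bits, b1, 3), banding_collision(q_bits, b2, 3))  # True False
```

Requiring only one matching band (rather than all m bits) raises the collision probability for high-similarity buckets, which is the usual motivation for banding.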

5. Theoretical Analysis

5.1. Accuracy Analysis

Theorem 5. Set the approximation ratio of the c-AMIP search to c, and let c0 be the approximation ratio of the underlying c0-ANN search. With these parameters set appropriately, the probability that the result returned by Algorithm 2 meets the c-AMIP requirement is at least 1/2 − δ, where δ is the error rate of QALSH.

Proof. According to [33, 37], the probability that QALSH returns the result of c0-ANN is at least 1/2 − δ. If we fix the QALSH error rate to δ, then the AMIP algorithm that searches for MIP returns a result of c-AMIP with the same probability. Below, we focus on proving that the returned result meets the c-AMIP requirement.

We first derive the expression for the inner product, assuming o* is the MIP result for a query q and lies in the block B_i. According to the previous formulas, we have

As with c0-ANN, according to [33], for the transformed vectors P(o) and Q(q), QALSH returns a result o of c0-ANN satisfying ‖Q(q) − P(o)‖ ≤ c0 · ‖Q(q) − P(o*)‖; let θ be the angle between o and q. Combining the above formula, we have

By SRP-LSH, we know

Now we calculate the probability. Assume θ0 is the angle variable that serves as a threshold for a query, and θ_j represents the angle between the j-th data item and q, where θ_j ∈ [0, π]. For the theoretical analysis, we assume the angles of the data items obey the uniform distribution, i.e., θ_j ~ U[0, π], and θ_min is the angle of the highest-similarity bucket traversed in B_i for the query. Then, assuming that the number of data items in a block is N, we have

Thus, we can get the cumulative distribution function of θ_min as follows:

Also, we can take the derivative to get the probability density function:

Assuming θ0 as the threshold, we have

Finally, we have

Accordingly, the interval ratio b of the blocks can be derived from the above bound.
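Under the uniformity assumption above (θ_j ~ U[0, π], i.i.d. over the N items of a block), one plausible reconstruction of the distribution steps is the standard minimum-order-statistic argument:

```latex
% Each angle is uniform on [0, pi]:
%   \Pr[\theta_j \le \theta_0] = \theta_0 / \pi .
% CDF of the minimum angle over the N items of a block:
F_{\theta_{\min}}(\theta_0)
  = \Pr\!\big[\min_j \theta_j \le \theta_0\big]
  = 1 - \Big(1 - \frac{\theta_0}{\pi}\Big)^{N},
\qquad
% PDF obtained by differentiating the CDF:
f_{\theta_{\min}}(\theta_0)
  = \frac{N}{\pi}\Big(1 - \frac{\theta_0}{\pi}\Big)^{N-1}.
```

This is a sketch of the argument consistent with the stated assumptions, not necessarily the paper's exact derivation.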

5.2. Complexity Analysis

In this section, we conduct an analysis of the space and time complexity of our algorithm.

Theorem 6. Given an approximation ratio c for a c-AMIP search, the space needed to construct the indexing structure is dominated by the QALSH indexes, and a c-AMIP object search costs, at most, the total query time over all disjoint units.

Proof. The storage structure of the H2SA-ALSH has no essential difference from that of H2-ALSH, and we also use QALSH to store and index the data. Algorithm 1 has two parts of overhead: the space occupied by the arriving data and the space occupied by the LSH index (QALSH). According to H2-ALSH [33], the space overhead of the QALSH hash tables dominates, so the space overhead of Algorithm 1 is of the same order. To answer a c-AMIP query, in the worst case, Algorithm 2 needs to check the objects in all disjoint units, so the search costs the sum of the query times of all units.

In more detail, this worst-case query time is pessimistic. For real data sets, the H2SA-ALSH filters out most of the data units, even when the data distribution is random or skewed. The H2SA-ALSH stops within the first few blocks, and within one block, the schema only searches a few buckets. Therefore, the average query time of a c-AMIP object search is much better than the worst case.

6. Experimental Evaluation

We conduct experiments on three real-world data sets (Mnist [41], Sift [42], and YearPredictionMSD [43], termed Year) and compare our algorithm with three state-of-the-art AMIP algorithms. The experiments mainly evaluate the precision of the AMIP results, the time efficiency of index construction, and the query efficiency. We run all experiments on an Intel Xeon E5-2620 machine with eight cores and 32 GB of memory. All algorithms in the experiments are implemented in C++ and run on CentOS 7.

The main evaluation metrics of the experiments are the recall and precision of the AMIP results, the overall approximation ratio, and the running time of the AMIP search. To evaluate the performance of our algorithm, we compare our approach with Simple-ALSH [27], H2-ALSH [32], and Sign-ALSH [31]. The experiments verified the performance of all methods for 0.5-k-AMIP search by varying k from 1 to 10 to show the evaluation results of recall and precision. Thus, we get the top-k MIP objects by 0.5-AMIP. Figures 2–4 describe the recall and precision curves of the evaluation. As we can see from the curves in Figures 2–4, the H2SA-ALSH performs better than the other algorithms in top-k searching, which means that the H2SA-ALSH obtains more precise search results than the other algorithms (Simple-ALSH, Sign-ALSH, and H2-ALSH).

Furthermore, we use the approximation ratio metric to evaluate the precision of the search results. For the approximate c-k-AMIP search, we set the given approximation ratio to 0.5. Then, we compare the approximation ratios of our algorithm with those of the other algorithms. The comparison is conducted for c-k-AMIP searching using top-k searching (k = 1, 2, 5, and 10).

The approximation ratio is expressed as the ratio of the returned inner product to the exact maximum inner product, whose value is at most 1. The overall approximation ratio is the average approximation ratio over all queries, which reflects precision; the greater the ratio, the better the AMIP search results. As shown in Figure 5, the overall approximation ratios of all algorithms are higher than the given ratio of 0.5. Our algorithm has a better approximation ratio than all the other algorithms, which means that our algorithm reaches better precision for an arbitrary query.
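The overall approximation ratio described above can be computed as in the following sketch (the per-query inner-product values here are hypothetical; in practice the exact MIP value per query is obtained by brute force):

```python
def approximation_ratio(approx_ips, exact_ips):
    """Average of <o,q>/<o*,q> over all queries; 1.0 means exact results."""
    ratios = [a / e for a, e in zip(approx_ips, exact_ips)]
    return sum(ratios) / len(ratios)

# Hypothetical inner-product values for four queries:
print(approximation_ratio([3.8, 4.5, 2.0, 7.0], [4.0, 5.0, 2.5, 7.0]))  # 0.9125
```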

To examine the query efficiency, we evaluate our algorithm's performance on approximate object searching. We compare the average computation time for a query with that of the latest H2-ALSH algorithm. Figure 6 shows that the average query time of our algorithm is less than that of H2-ALSH over the three data sets. Especially on the Year dataset, the query efficiency of our approach improves by nearly 60% compared with H2-ALSH.

7. Conclusion

In this paper, we propose a novel indexing and searching schema, termed H2SA-ALSH, for IoT environments. The H2SA-ALSH constructs indexes for multidimensional data objects according to the Euclidean norm and cosine similarity without collecting the raw data objects. At the same time, the extracted indexing features are built with approximate disturbance elements incorporated into them. By collecting and indexing the disturbed features on the fly, we design a c-k-AMIP searching algorithm to achieve accurate and efficient maximum inner product searching and top-k searching for a given vector. Experiments demonstrate the accuracy and efficiency improvements of our approach compared with three AMIP algorithms on real-world data sets.

Data Availability

The authors declare that all the data and materials in this manuscript are available.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant 2021YFB3101305 and by the National Natural Science Foundation of China under Grant 61931019.