Efficient Two-Step Protocol and Its Discriminative Feature Selections in Secure Similar Document Detection
The risk of information disclosure has recently been increasing significantly. Accordingly, privacy-preserving data mining (PPDM) is being actively studied to obtain accurate mining results while preserving data privacy. Here, we focus on secure similar document detection (SSDD), which identifies similar documents of two parties while neither party discloses its own sensitive documents to the other party. In this paper, we propose an efficient two-step protocol that exploits a feature selection as a lower-dimensional transformation, and we present discriminative feature selections to maximize the performance of the protocol. The proposed protocol consists of two steps: the filtering step and the postprocessing step. For the feature selection, we first consider the simplest one, random projection (RP), and propose its two-step solution, SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF, which selects a few dimensions locally frequent in the current querying vector, and SSDD-GF, which selects ones globally frequent in the set of all document vectors. We finally propose a hybrid one, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed two-step protocol significantly outperforms the previous one-step protocol by three or four orders of magnitude.
Similar document detection is the problem of finding similar documents of two parties, Alice and Bob, and has been widely used in the version management of files, copyright protection, and plagiarism detection. Secure similar document detection (SSDD) has been recently introduced to identify similar documents while preserving the privacy of each party’s documents, as shown in Figure 1. That is, SSDD finds similar document pairs whose cosine similarity exceeds the given tolerance while not disclosing document vectors to the other party. SSDD is a typical example of privacy-preserving data mining (PPDM) [4–6] and has the following applications:
(i) Detection of double submissions [2, 7]: in two or more conferences that do not allow double submissions, SSDD finds the double-submitted papers while not disclosing the papers to the other conference(s).
(ii) Detection of insurance fraud: in an insurance fraud detection system, SSDD searches for similar accident cases of two or more insurance companies while not providing sensitive or private cases to the other company or companies.
(iii) Sharing of similar symptom cases: doctors or patients of a hospital want to find similar symptom cases with the prescriptions or medical treatment histories of other hospitals without disclosing each other’s medical or health information.
Jiang et al. proposed a novel solution for SSDD by exploiting secure multiparty computations (SMCs) [9, 10] in a semihonest model. Their solution preserves the privacy of two parties by using the secure scalar product in computing the cosine similarity between document vectors. They suggested the use of the random matrix or homomorphic encryption methods, which are representative methods of secure scalar product. In this paper, we use the SSDD solution using the random matrix method by Jiang et al. as the base protocol, which we call SSDD-Base. However, SSDD-Base has a critical problem in that it incurs severe computation and communication overhead: it requires a secure scalar product for every pair of documents taken from Alice’s and Bob’s document sets. In many cases, the dimension of the document vectors reaches tens of thousands or even hundreds of thousands, and SSDD-Base thus incurs a very high complexity, which makes it impractical for a large volume of document databases. In particular, if there are many parties or frequent changes in document databases, the overhead becomes much more critical.
To alleviate the computation and communication overhead of SSDD-Base, in this paper, we present a two-step protocol that exploits feature selection as a lower-dimensional transformation. The feature selection transforms high-dimensional document vectors into low-dimensional feature vectors; in general, it selects tens to hundreds of dimensions from thousands to tens of thousands of dimensions. We call this feature selection FS. Representative FS methods include random projection (RP), document frequency (DF), and linear discriminant analysis (LDA). In this paper, we use RP and DF because they are known to be simple but efficient feature selections. To devise a two-step protocol, we need to find an upper bound of the cosine similarity for the filtering process. Thus, we first present an upper bound of FS and formally prove its correctness. Using the upper bound property of FS, we then propose a generic two-step protocol, called SSDD-FS. The proposed SSDD-FS works as follows: in the first filtering step, it converts the high-dimensional document vectors into much lower-dimensional feature vectors and applies the secure protocol to these feature vectors to filter out nonsimilar high-dimensional vectors; in the second postprocessing step, it applies the base protocol SSDD-Base to the nonfiltered high-dimensional vectors. In the filtering step, SSDD-FS prunes many nonsimilar high-dimensional vectors by comparing low-dimensional vectors at a much lower cost, and thus, it significantly improves the performance compared with SSDD-Base.
For an efficient SSDD-FS, the FS should be highly discriminative; that is, the FS should filter out as many high-dimensional vectors as possible if they are nonsimilar. In this paper, we analyze SSDD protocols in detail and propose four different techniques as the discriminative FS implementation. An RP can first be considered the easiest way of implementing the FS. The RP randomly selects dimensions from dimensions. An RP is easy, but its filtering effect will be very low owing to the randomness. To solve this problem, we exploit the DF that selects the feature dimensions based on the frequencies in all document vectors. In particular, by referring to the DF concept, we present three variants of DF, called the local frequency (LF), global frequency (GF), and hybrid frequency (HF). First, the LF considers the term frequencies of Alice’s current querying vector (which we call the current vector) and selects high-frequency dimensions from the current vector. The LF focuses on the locality, which means that considering the current vector only might be sufficient to decrease the upper bound of the cosine similarity. Second, the GF means the DF itself; that is, the GF counts the number of documents containing each term (dimension), constructs a frequency vector from those counts (which we call the whole vector), and selects high-frequency dimensions from the whole vector. The GF focuses on the globality because it considers all document vectors. To implement the GF, however, we need to make a secure protocol for obtaining the whole vector from both Alice’s and Bob’s document sets. For this, we propose a protocol, SecureDF, as a secure implementation of the DF. Third, the HF takes advantage of both locality of the LF and globality of the GF. The HF computes a difference vector between the current and whole vectors and selects high-valued dimensions from the difference vector. 
This is because the HF tries to maximize the value difference between Alice’s and Bob’s vectors for each selected dimension and eventually decreases the upper bound of the cosine similarity. Table 1 summarizes these four feature selections and their corresponding SSDD protocols, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF, which are described in Section 4.
In this paper, we empirically evaluate the base protocol, SSDD-Base, and our four SSDD-FS protocols (SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF) using various datasets. The experimental results show that SSDD-FS protocols significantly outperform SSDD-Base. This means that the proposed two-step protocols effectively prune a large number of nonsimilar sequences early in the filtering step. In particular, SSDD-HF, which takes advantage of both the locality of SSDD-LF and the globality of SSDD-GF, shows the best performance. Compared with SSDD-Base, SSDD-HF significantly reduces the execution time of SSDD by three or four orders of magnitude.
The rest of this paper is organized as follows. Section 2 explains related work and background of the research. Section 3 presents the FS-based two-step protocol, SSDD-FS, and proves its correctness. Section 4 introduces four novel feature selections, RP, LF, GF, and HF, and proposes their corresponding secure protocols. Section 5 explains experimental results on various datasets. We finally summarize and conclude the paper in Section 6.
2. Related Work and Background
With the increasing need to preserve the privacy of big data, there have been many efforts on PPDM. PPDM solutions can be classified into four categories: data perturbation, k-anonymization, distributed privacy preservation, and privacy preservation of mining results. Among these categories, secure multiparty computation (SMC) has been used in distributed privacy preservation. As a recent secure multiparty computation of principal component analysis (PCA), Won et al. proposed S-PCA, which computes PCA without disclosing the sensitive data of individual nodes. SSDD can also be regarded as an application of distributed privacy preservation. Shah and Joshi summarized PPDM technologies in a distributed environment. Representative SMCs include secure comparison, secure sum, and secure scalar product, and SSDD is an interesting application that exploits the secure scalar product.
Jiang et al. proposed a novel solution for SSDD by exploiting SMCs in a semihonest model. Their solution preserves the privacy of two parties by using the secure scalar product in computing the cosine similarity between document vectors. They also proposed n-gram based SSDD protocols that are effective in detecting local similarity, unlike the existing vector space model, which is effective in detecting global similarity. Buyrukbilen and Bakiras introduced a solution based on simhash to reduce the high computational and communication costs. Simhash is a dimensionality reduction technique that encodes all the document terms and their frequencies into a fixed-size bit vector, which makes it possible to reduce the similarity calculation to a secure XOR computation between two bit vectors.
We use the cosine similarity as the basic operation of similar document detection. The cosine similarity of two -dimensional vectors and , which are normalized to size 1, is computed as , where is the scalar product of and ; that is, . If we can compute securely in two parties, we can also compute securely. There are two representative methods for a secure scalar product . The first is a random matrix method, where two parties share the same random matrix, which they use to securely compute the scalar product. The second is the homomorphic encryption method , where two parties use a homomorphic probability key system for the secure computation of scalar products. In this paper, we use the random matrix method because it is more efficient than the homomorphic encryption method; however, we can instead use the homomorphic encryption method for the protocols discussed later. Without a loss of generality, we assume that vectors and are normalized to size . That is, , and thus, simply .
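As a plain (non-secure) illustration of why the scalar product suffices, the following minimal Python sketch normalizes two term-frequency vectors to unit length and checks that their scalar product equals the cosine similarity. The function names and the sample vectors are assumptions made for illustration only; in the protocols discussed here, `scalar_product` would be replaced by a secure scalar product.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def scalar_product(u, v):
    """Plain (non-secure) scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Textbook cosine similarity for vectors of arbitrary nonzero length."""
    norm_u = math.sqrt(scalar_product(u, u))
    norm_v = math.sqrt(scalar_product(v, v))
    return scalar_product(u, v) / (norm_u * norm_v)

# Hypothetical term-frequency vectors of two documents.
doc_a = [3, 0, 1, 2]
doc_b = [1, 1, 0, 2]

unit_a, unit_b = normalize(doc_a), normalize(doc_b)
# After normalization, the scalar product alone gives the cosine similarity.
assert abs(scalar_product(unit_a, unit_b) - cosine_similarity(doc_a, doc_b)) < 1e-12
```

Because only the scalar product remains after normalization, any secure scalar product method can be plugged in without changing the rest of the similarity computation.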
Algorithm 1 shows the protocol of SSDD-Base, the recent SSDD solution by Jiang et al. . SSDD-Base uses a random matrix method for secure scalar products, where Alice and Bob share the same matrix and securely determine whether two vectors and are similar. For the correctness and a detailed description of Protocol SSDD-Base, readers are referred to . In SSDD, we apply SSDD-Base for each pair of document vectors. More formally, if and are sets of document vectors owned by Alice and Bob, respectively, we apply SSDD-Base for each pair , where and , respectively. As we mentioned in Section 1, however, SSDD-Base incurs the severe computation and communication overhead of , which will be very serious if there are several parties or if a large number of documents are changed dynamically. To alleviate this critical overhead, in this paper, we discuss a two-step solution for SSDD.
In text and time-series mining, many lower-dimensional transformations have been proposed to solve the dimensionality curse problem [19–21] of high-dimensional vectors. We can classify lower-dimensional transformations into feature extractions and feature selections . First, the feature extraction creates a few new features from an original high-dimensional vector. Representative examples of feature extractions include latent semantic indexing (LSI) , locality preserving indexing (LPI), discrete Fourier transform (DFT) , discrete Wavelet transform (DWT), and piecewise aggregate approximation (PAA) . In contrast, the feature selection selects a few discriminative features from the original (or transformed) high-dimensional vectors. Representative examples of feature selections include RP, DF, LDA, and principal component analysis (PCA) [11, 12]. In this paper, we use RP and DF with the appropriate variations. This is because RP and DF are much simpler than other transformations, and accordingly, they are easily applied to SSDD with low complexity; on the other hand, LSI, LPI, LDA, and PCA may provide very accurate feature vectors, but they are too complex to be applied to SSDD. For a detailed description of lower-dimensional transformations for text mining, readers are referred to [22, 23].
For a detailed explanation of the problems and solutions of the PPDM categories of data perturbation and k-anonymization, readers are referred to the survey papers [4, 13].
3. Feature Selection Based Secure Two-Step Protocol
We use the FS for the secure two-step protocol. To transform an -dimensional vector into an -dimensional vector, the FS chooses randomly or highly frequent dimensions from dimensions, and thus, its transformation process is very simple. In this section, we first assume that the FS can select dimensions from dimensions in a secure manner, and we then propose a secure two-step protocol of SSDD using the secure FS.
To use a lower-dimensional transformation for SSDD, we need to find an upper bound function that satisfies (1), where and are -dimensional feature vectors transformed from -dimensional vectors, and , respectively, by the transformation . In (1), , , , and . The transformation should satisfy (1) because the SSDD using should not incur any false dismissal; this requirement is analogous to the lower bound property of the Euclidean distances that follows from Parseval’s theorem in time-series matching. To obtain an upper bound of the lower-dimensional transformation , we first define an upper bound of as follows.
Definition 1. If a lower-dimensional transformation transforms -dimensional vectors, and , to -dimensional vectors, and , respectively, we define the upper bound function of , denoted by , through (2), where is the squared Euclidean distance between and ; that is, .
In this paper, we want to use the FS as a lower-dimensional transformation , and thus, we formally prove that the upper bound function of the FS satisfies (1), the upper bound property of the cosine similarity.
Theorem 2. If a feature selection FS transforms -dimensional vectors, and , to -dimensional vectors, and , respectively, is an upper bound of ; that is, (3) holds.
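The upper bound property can be checked numerically. The sketch below assumes that the upper bound function takes the form 1 − D²/2, where D² is the squared Euclidean distance between the selected features; this follows from the identity |u − v|² = 2 − 2(u · v) for unit-length vectors, together with the fact that dropping dimensions can only shrink the squared distance. The exact notation of (2) and (3) is not reproduced here, and all function names are hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit Euclidean length (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def select(vec, indexes):
    """Feature selection: keep only the chosen dimensions (no renormalization)."""
    return [vec[i] for i in indexes]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def upper_bound(feat_u, feat_v):
    """Assumed form of the bound: 1 - D^2 / 2 on the selected features."""
    return 1.0 - squared_distance(feat_u, feat_v) / 2.0

def check_bound(trials=1000, n=50, k=5, seed=7):
    """Return True if the bound never falls below the true cosine similarity."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = normalize([rng.random() for _ in range(n)])
        v = normalize([rng.random() for _ in range(n)])
        indexes = rng.sample(range(n), k)
        cosine = sum(a * b for a, b in zip(u, v))
        if upper_bound(select(u, indexes), select(v, indexes)) < cosine - 1e-12:
            return False
    return True
```

Note that the vectors are normalized in the original space and the selected features are used as they are; renormalizing the feature vectors would break the bound.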
Using the upper bound property of the FS, we now propose a generic two-step Protocol SSDD-FS. Algorithm 2 shows Protocol SSDD-FS. As shown in the protocol, SSDD-FS maintains -dimensional and and -dimensional and of SSDD-Base. In addition, Alice and Bob share matrix and an matrix of SSDD-Base. Lines through of SSDD-FS are the first step of discarding nonsimilar -dimensional vectors in the -dimensional space. First, Lines through securely compute the scalar product for -dimensional vectors and . Except for using -dimensional vectors instead of -dimensional vectors, these steps are the same as those of SSDD-Base. The only difference from SSDD-Base is that Bob additionally sends to Alice in Line for computing . In Line , Alice computes Δ (=) using (4). Next, Alice computes an upper bound function of the FS, , in Line , and in Line , we apply the filtering process by comparing the upper bound () and the given tolerance (). If the upper bound is less than the tolerance, that is, if , the actual cosine similarity will also be less than the tolerance, and we do not need to compute it in the next -dimensional space. That is, if , we can skip Line of the second step. Thus, Line is executed only if -dimensional vectors of are not filtered out by the upper bound. Finally, in Line , we compute the actual cosine similarity for using SSDD-Base.
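The control flow of the two steps can be sketched in plain Python as follows. This is a non-secure simulation under stated assumptions: the scalar products are computed in the clear instead of through the shared random matrices, Δ is assumed to be the squared Euclidean distance of the feature vectors expanded as |u'|² + |v'|² − 2(u' · v') (which would explain why Bob sends one extra value to Alice), and the bound is assumed to be 1 − Δ/2. The names `dot` and `two_step_check` are hypothetical.

```python
def dot(u, v):
    """Plain scalar product; a secure scalar product would be used in SSDD."""
    return sum(a * b for a, b in zip(u, v))

def two_step_check(u, v, indexes, tolerance):
    """Return (is_similar, step) for unit-length document vectors u and v.

    indexes holds the dimensions chosen by the feature selection.
    """
    feat_u = [u[i] for i in indexes]
    feat_v = [v[i] for i in indexes]

    # Filtering step: squared distance of the feature vectors, expanded so
    # that only the two squared norms and one scalar product are needed.
    delta = dot(feat_u, feat_u) + dot(feat_v, feat_v) - 2.0 * dot(feat_u, feat_v)
    bound = 1.0 - delta / 2.0
    if bound < tolerance:
        # The true cosine similarity cannot reach the tolerance.
        return False, "filtered"

    # Postprocessing step: full-dimensional similarity (SSDD-Base in the paper).
    return dot(u, v) >= tolerance, "postprocessed"
```

For example, two orthogonal unit vectors compared on the dimensions where they differ are discarded in the filtering step, so the expensive full-dimensional computation is never executed for that pair.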
We note here how SSDD-FS improves the performance compared with SSDD-Base depending on how many -dimensional vectors are discarded in the first step. This filtering effect depends largely on the discriminative power of the feature selection, that is, the efficiency of the FS. In other words, if the FS largely exploits the filtering effect, SSDD-FS can reduce the computation and communication overhead from to . Based on this observation, we need to maximize the filtering effect of the FS, which can be seen as a problem of how we choose dimensions from dimensions for maximizing the discriminative power of the FS. Therefore, we propose efficient FS variants and their SSDD protocols in Section 4.
4. Discriminative Feature Selections for the Two-Step Protocol
In this section, we propose four methods to implement the FS of Protocol SSDD-FS. Figure 2 shows the procedure of SSDD-FS including the feature selection step. As shown in the figure, we first obtain and from and , respectively, through the feature selection, which should also be applied securely. As mentioned in Section 1, we presented the RP, LF, GF, and HF for the feature selection method, and in this section we describe how they work in detail. In Figure 2, the secure feature selection corresponds to Line of Protocol SSDD-FS, and the other two steps correspond to the first (Lines through ) and second (Line ) steps, respectively.
4.1. RP: Random Projection
An RP is the easiest way of implementing the FS, and it randomly selects dimensions from dimensions. We can consider two different methods for applying the RP to SSDD-FS. The first selects dimensions dynamically for each document pair , and the second first determines dimensions and then uses these predetermined dimensions for all document pairs.
To use the first RP method, Alice and Bob should share indexes, , of randomly selected dimensions for each before starting the first step of SSDD-FS. This sharing process can be implemented as Alice randomly selecting dimensions and sending their indexes to Bob or as Alice and Bob sharing the same seed of a random function. That is, we can implement the first RP method by modifying Line of Protocol SSDD-FS as Lines (1-1) through (1-3) of Algorithm 3.
The second RP method uses the same dimensions for all pairs. We can easily implement this method because Alice and Bob share the same indexes only once before starting SSDD-FS. These first and second RP methods do not disclose any values of Alice’s and Bob’s document vectors and thus are considered secure. In addition, these two methods have the same effect in selecting dimensions randomly. Thus, we use the second method, which we call SSDD-RP to differentiate it from SSDD-FS, because it is much simpler than the first.
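A minimal sketch of the second RP method is given below, assuming that Alice and Bob agree on a seed in advance and derive the same indexes locally. The function names and the seed value are hypothetical, and Python's `random` module is used only for illustration; it is not a cryptographically secure generator.

```python
import random

def rp_indexes(n, k, seed):
    """Pick k distinct dimension indexes out of n from a shared seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), k))

def project(vec, indexes):
    """Keep only the predetermined dimensions of a document vector."""
    return [vec[i] for i in indexes]

# Both parties run the same call once, before the protocol starts.
alice_indexes = rp_indexes(n=1000, k=10, seed=2024)
bob_indexes = rp_indexes(n=1000, k=10, seed=2024)
assert alice_indexes == bob_indexes
```

Since the indexes depend only on the shared seed and not on any document, no term frequency of either party influences the selection.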
4.2. LF: Local Frequency
SSDD-RP proposed in Section 4.1 has a problem of exploiting only a slight filtering effect during the first filtering step. This low filtering effect is due to the RP choosing features without any consideration of the characteristics of the document vectors. According to real experiments, SSDD-RP shows a very slight improvement in the SSDD performance compared with SSDD-Base. To solve the problem of SSDD-RP and enlarge the filtering effect, in this paper, we consider how frequent each term is in the document or document set; that is, we use the term frequency (TF) (in this paper, we use TF for simplicity, but we can also use TF-IDF (term frequency-inverse document frequency) instead of TF. Using which frequency among TF, TF-IDF, and other feature frequencies is orthogonal to our approach, and we use TF for easy understanding of the proposed concept). In general, we use the TF concept as follows: we first compute the number of occurrences (that is, the frequency) of each term throughout the whole dataset and then choose the highly frequent dimensions. We call this selection method document frequency (DF). We consider the TF (or DF) in SSDD-FS because if we select the highly frequent dimensions, we can obtain relatively small upper bounds by the relatively large of (2), and accordingly, we can largely exploit the filtering effect.
As a feature selection using term frequencies, we first consider how frequent each term is in an individual document rather than the whole document set; that is, we first propose the feature selection of exploiting the locality of each document. More precisely, for a pair of documents , the locality-based selection chooses highly frequent dimensions in Alice’s current vector . This selection is based on the simple intuition that, even without considering whole vectors of the document set, the current vector itself will have a significant influence on the upper bound . In this selection, we can instead use Bob’s vector rather than Alice’s vector as the current vector, or we can also use both Alice’s and Bob’s vectors and . Using , however, incurs additional communication overhead, and thus, in this paper we consider a simple method of using Alice’s as the current vector. We call this selection method the local frequency (LF) because it considers individual (that is, local) documents rather than whole documents, and we denote the protocol of applying the LF to SSDD-FS as SSDD-LF. SSDD-LF exploits the locality by selecting dimensions for each document at every start time. Algorithm 4 shows how we implement SSDD-LF by modifying Line of SSDD-FS of Algorithm 2. In Line (1-2), Alice first selects the top- frequent dimensions from her current vector. She sends those indexes of the selected dimensions to Bob in Line (1-3). Thus, they can share the same indexes and obtain -dimensional feature vectors using the same indexes in Line (1-4).
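The selection performed by Alice in SSDD-LF can be sketched as follows. This is an illustrative sketch only: the function name `lf_indexes` and the sample vector are assumptions, and `heapq.nlargest` is just one convenient way of picking the most frequent dimensions.

```python
import heapq

def lf_indexes(current, k):
    """Return the indexes of the k largest term frequencies in Alice's current vector."""
    top = heapq.nlargest(k, range(len(current)), key=lambda i: current[i])
    return sorted(top)

def project(vec, indexes):
    """Build the feature vector from the shared indexes."""
    return [vec[i] for i in indexes]

# Hypothetical current vector of Alice; dimensions 2 and 4 are the most frequent.
current = [3, 0, 7, 1, 5]
shared = lf_indexes(current, 2)   # Alice sends only these indexes to Bob
alice_features = project(current, shared)
```

Only the index list `shared` leaves Alice's side, not the term frequencies themselves; what it does reveal is which dimensions are frequent in her current vector.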
We now analyze the computation and communication overhead of the feature selection in SSDD-LF. As shown in Algorithm 4, for each vector , Alice chooses the top frequent dimensions from dimensions of , and communicates with Bob to share those indexes. First, Alice needs the additional computation overhead of to select the top frequent dimensions from the current -dimensional vector. Second, Alice and Bob need the additional communication overhead to share the indexes. However, this communication process can be applied with Line of SSDD-Base in Algorithm 1; that is, Alice can send indexes together with the encrypted vector to Bob. The amount of indexes is much smaller than that of the -dimensional vector, and the overhead of indexes can be negligible. Thus, we can say that SSDD-LF causes the computation overhead of , but the communication overhead can be ignored. In particular, we compare each vector of Alice with a large number of vectors (∈) of Bob, and thus, the computation overhead of can also be ignored as a preprocessing step.
Another point to consider in SSDD-LF is whether its feature selection process is secure. That is, there should be no privacy disclosure when Alice selects indexes and shares them with Bob. Fortunately, Alice sends only indexes to Bob rather than entry values of , and the sensitive values are not disclosed during the selection process. Unfortunately, however, the information regarding which dimensions are frequent in is revealed to Bob. If a user cannot be allowed even this limited disclosure of information, we recommend using the previous SSDD-RP or the next SSDD-GF or SSDD-HF as a more secure protocol.
4.3. GF: Global Frequency
SSDD-LF described above has a problem in considering only Alice’s current vector and ignoring all other vectors of Bob. Due to this problem, SSDD-LF exploits the filtering effect for only a part of Bob’s vectors but does not do so for most of the other vectors. To overcome this problem, in this section, we propose another feature selection that uses the whole vector whose elements represent the number of documents containing the corresponding term. Unlike the LF focusing on the current vector only, it considers all document vectors and has the characteristics of globality. We call this feature selection the global frequency (GF) and denote the GF-based secure protocol as SSDD-GF. Actually, the GF is the same as the DF, which has been widely used as the representative feature selection, and it works as follows. First, let be a whole vector, where is the number of documents containing the th term; that is, is the DF value of the th term. Then, to reduce the number of dimensions from to , the GF simply selects dimensions whose DF values are larger than those of the other dimensions. We can obtain the whole vector by scanning all of the document vectors a single time. The traditional DF constructs the whole vector based on the assumption that all document vectors are maintained in a single computer. In SSDD, however, the document vectors are distributed to Alice and Bob, each of whom does not want to provide their own vectors to the other. Thus, to use the GF in SSDD, we first need to present a secure protocol for constructing the whole vector from the document vectors stored in a distributive manner by Alice and Bob.
Algorithm 5 shows Protocol SecureDF that securely constructs a whole vector from Alice’s and Bob’s document vectors and obtains frequent dimensions from . In Lines through , Alice and Bob compute their own whole vectors independently. That is, Alice computes her own whole vector from her own document set , and Bob obtains from . In Lines and , they share those whole vectors and with each other. In Lines through , they then compute the aggregated whole vector from those vectors. After obtaining the whole vector , Alice and Bob can select -frequent dimensions from . We note that Alice sends to Bob in Line , and Bob sends to Alice in Line . Vectors and , however, are not the exact values of the document vectors but simple statistics, and thus we can say that SecureDF does not reveal any privacy of individual documents. The computation and communication complexities of SecureDF are merely and , respectively. In addition, SecureDF can be seen as a preprocessing step that is executed only once for all document vectors of Alice and Bob. Thus, its complexity can be negligible compared with the complexity of SSDD-Base.
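The data flow of SecureDF can be sketched in plain Python as follows. The sketch assumes that the aggregated whole vector is the element-wise sum of the two local whole vectors and that a term counts as contained in a document when its frequency is nonzero; both are assumptions, and all function names and sample data are hypothetical.

```python
import heapq

def local_whole_vector(docs):
    """For each dimension, count the documents whose term frequency is nonzero."""
    n = len(docs[0])
    return [sum(1 for doc in docs if doc[i] > 0) for i in range(n)]

def aggregate(whole_a, whole_b):
    """Combine the two exchanged whole vectors (assumed: element-wise sum)."""
    return [a + b for a, b in zip(whole_a, whole_b)]

def gf_indexes(whole, k):
    """Select the k dimensions with the highest document frequency."""
    return sorted(heapq.nlargest(k, range(len(whole)), key=lambda i: whole[i]))

# Hypothetical document sets; only the counts below are exchanged.
alice_docs = [[1, 0, 2], [0, 0, 1]]
bob_docs = [[3, 0, 0], [1, 1, 1]]

whole_alice = local_whole_vector(alice_docs)   # [1, 0, 2]
whole_bob = local_whole_vector(bob_docs)       # [2, 1, 1]
whole = aggregate(whole_alice, whole_bob)      # [3, 1, 3]
indexes = gf_indexes(whole, 2)
```

Each party sees only per-term document counts of the other party, not the term frequencies of any individual document.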
We now describe SSDD-GF, which exploits SecureDF as the feature selection. Algorithm 6 shows how we modify Line of Algorithm 2 to convert SSDD-FS into SSDD-GF. In Line (1-0), we first apply SecureDF to obtain the whole vector and determine the indexes of its most frequent dimensions. For the current high-dimensional vectors, Alice and Bob then obtain the low-dimensional feature vectors using the determined indexes. As shown in Algorithm 6, the current vectors and even their term frequencies are not disclosed to each other, and thus, we can say that SSDD-GF is a secure protocol of SSDD.
4.4. HF: Hybrid Frequency
The LF and GF have the following different characteristics from the viewpoint of a filtering effect. First, the LF considers Alice’s current vector only, and thus, the filtering effect will be large for only a portion of Bob’s vectors whose TF patterns differ greatly from the current vector; however, this effect is exploited less for most of the other vectors. In other words, the LF can exploit a better filtering effect than the GF when Alice’s current vector differs significantly from the whole vector in the TF patterns. Second, the GF considers the whole vector without considering the current vector, and it can thus exploit the filtering effect relatively evenly on many of Bob’s document vectors. That is, the GF can exploit a better filtering effect than the LF when Alice’s current vector has similar characteristics to the whole vector in the TF patterns.
To take advantage of both the locality of LF and the globality of GF, we now propose a hybrid feature selection, called hybrid frequency (HF). That is, the HF uses the current vector for exploiting the locality of LF and at the same time also uses the whole vector for exploiting the globality of GF. We then present an advanced secure protocol, SSDD-HF, by applying the HF to the SSDD-FS. Simply speaking, the HF compares the current and whole vectors and selects the feature dimensions whose differences are larger than those of the other dimensions. In more detail, we select feature dimensions that have one of the following two characteristics: dimensions that frequently occur in Alice’s current vector but seldom occur in the whole vector (i.e., whose values are relatively large in the current vector but relatively small in the whole vector) or, on the contrary, the dimensions that seldom occur in Alice’s current vector but frequently occur in the whole vector. This is because the larger is (that is, the difference between the values of the selected feature dimension), the smaller is or, in other words, the larger of (2) is, which exploits the larger filtering effect.
However, we cannot directly compare Alice’s current vector and the whole vector using SecureDF. The reason is that represents “frequencies of terms” in a single vector, whereas represents the “frequencies of documents” containing those terms. That is, the meaning of the frequencies in differs from that of , and thus, their scales are also different. To resolve this problem, before comparing the two vectors and , we first normalize them using their mean () and standard deviation (). More precisely, we first normalize and as and by (5), and we next obtain the difference vector . We then select the largest dimensions from and use them as the features of SSDD-HF.
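The normalization and selection of the HF can be sketched as follows. The sketch assumes that the normalization of (5) is the usual z-score, (x − mean) / standard deviation, with the population standard deviation, and that "largest" refers to the absolute difference, so that both directions described above (frequent locally but rare globally, and the reverse) are captured. These choices and all names are assumptions for illustration.

```python
import heapq
import statistics

def z_normalize(vec):
    """Z-score normalization (assumes the vector is not constant)."""
    mean = statistics.fmean(vec)
    std = statistics.pstdev(vec)
    return [(x - mean) / std for x in vec]

def difference_vector(current, whole):
    """Absolute difference of the normalized current and whole vectors."""
    norm_current = z_normalize(current)
    norm_whole = z_normalize(whole)
    return [abs(c - w) for c, w in zip(norm_current, norm_whole)]

def hf_indexes(current, whole, k):
    """Select the k dimensions where the two normalized vectors differ most."""
    diff = difference_vector(current, whole)
    return sorted(heapq.nlargest(k, range(len(diff)), key=lambda i: diff[i]))

# Hypothetical vectors: term 0 is frequent locally but rare globally,
# and term 1 is absent locally but common globally.
current = [5, 0, 1, 0]
whole = [1, 9, 2, 8]
selected = hf_indexes(current, whole, 2)
```

With these sample vectors, the two selected dimensions are exactly the ones on which the current document deviates most from the overall document frequencies, which is the situation in which a large distance, and hence a small upper bound, is expected.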
Algorithm 7 shows how we modify Line of SSDD-FS in Algorithm 2 to implement SSDD-HF. First, as in SSDD-GF, Line (1-0) constructs by executing SecureDF. Next, in Lines (1-2) and (1-3), we normalize the current and whole vectors and obtain the difference vector from those normalized vectors. Finally, in Lines (1-4) through (1-6), Alice chooses dimensions from and shares those dimensions with Bob. That is, Lines (1-4) through (1-6) are the same as Lines (1-2) through (1-4) of SSDD-LF in Algorithm 4 except that SSDD-LF uses the current vector whereas SSDD-HF uses the difference vector .
The overhead of the feature selection in SSDD-HF can be seen as a summation of those in SSDD-LF and SSDD-GF. That is, like SSDD-GF, it has the overhead of applying SecureDF to obtain the whole vector , and at the same time, like SSDD-LF, it has the overhead of choosing the largest dimensions from the -dimensional difference vector . Such overhead, however, can be negligible for the following reasons: as we described with SSDD-GF earlier, SecureDF having computation and communication complexities of and can be seen as a preprocessing step executed only once for all document vectors, and its overhead can be negligible throughout the whole SSDD process; as we described for SSDD-LF earlier, the computation complexity of choosing dimensions from an -dimensional vector can be ignored because it can also be seen as a preprocessing step. One more notable point is that SSDD-HF is a secure protocol like SSDD-GF because it uses SecureDF and the difference vector, which are secure and do not disclose any original values or sensitive indexes of individual vectors.
5. Performance Evaluation
5.1. Experimental Data and Environment
In this section, we empirically evaluate the feature selection-based SSDD protocols. We used three datasets obtained from the document sets of the UCI repository. The first dataset consists of KOS blog entries collected from dailykos.com (KOS in short). KOS consists of 3,430 documents with 6,906 different terms (dimensions) and has a total of 467,714 terms. The second dataset contains full papers published for the Neural Information Processing Systems Conference (NIPS in short). NIPS consists of 1,500 documents with 12,419 different terms and has about 1.9 million terms in total. The third dataset contains email messages of Enron (EMAILS in short). EMAILS consists of 39,861 emails with 28,102 different terms and has about 6.4 million terms in total.
We experimented with five SSDD protocols: SSDD-Base as the basic one and the four proposed protocols, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF. In each experiment, we measured the elapsed time of executing SSDD for each protocol. In the first experiment, we varied the number of dimensions for a fixed tolerance, where the number of dimensions means the number of features selected by the feature selection. In the second experiment, we varied the tolerance for a fixed number of dimensions. For these two experiments, we used KOS and NIPS, which have relatively small numbers of documents compared with EMAILS. The third experiment, on the other hand, was conducted to test the scalability of each protocol, and thus we used EMAILS because it is much larger than KOS and NIPS.
The hardware platform is an HP ProLiant ML110 G7 workstation equipped with an E31220 3.10 GHz Intel Xeon Quad Core CPU, 16 GB of RAM, and a 250 GB HDD. CentOS 6.5 Linux was used as the software platform, and all of the protocols were implemented in C. We ran SSDD on a single machine using a local loop for network communication. We use the local loop to intentionally exclude the network speed, because different network speeds or environments may largely distort the actual execution time of each protocol. We measured the execution time spent for Alice to send each document to Bob and securely identify its similarity. More precisely, we stored the whole dataset in Bob and selected ten query documents for Alice. We then executed each SSDD protocol for these ten query documents and used the sum of their execution times as the experimental result.
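The measurement procedure above (run each protocol once per query document and report the sum) can be sketched as follows. This is only an illustration in Python, not the C harness used in the experiments; `total_elapsed` and `placeholder_protocol` are hypothetical names, and the placeholder does no secure computation at all.

```python
import time


def total_elapsed(protocol, queries, dataset):
    """Sum the wall-clock time of running `protocol` once per query document."""
    total = 0.0
    for query in queries:
        start = time.perf_counter()
        protocol(query, dataset)
        total += time.perf_counter() - start
    return total


def placeholder_protocol(query, dataset):
    """Stand-in for an SSDD protocol: a plain dot product against every document."""
    return [sum(q * d for q, d in zip(query, doc)) for doc in dataset]


# Toy data: ten query vectors for Alice and a small dataset stored at Bob.
dataset = [[1, 0, 2], [0, 3, 1], [2, 2, 0]]
queries = [[i, 1, 0] for i in range(10)]
elapsed = total_elapsed(placeholder_protocol, queries, dataset)
print(f"sum over {len(queries)} queries: {elapsed:.6f} s")
```

Summing over a fixed set of queries, rather than reporting a single run, reduces the influence of any one unusually cheap or expensive query on the comparison between protocols.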
5.2. Experimental Results
Figure 3 shows the experimental results for KOS. First, in Figure 3(a), we set the tolerance to 0.80 and varied the number of dimensions over 70, 210, 350, 490, and 640, which correspond to 1%, 3%, 5%, 7%, and 9% of the number of KOS dimensions. In the figure, the x-axis shows the number of (selected) dimensions, and the y-axis shows the actual execution time. Note that the y-axis is on a log scale.
Figure 3: Experimental results for KOS. (a) Different numbers of dimensions (tolerance = 0.80). (b) Different tolerances (number of dimensions = 70).
Figure 3(a) shows that all of the proposed protocols significantly outperform the basic SSDD-Base. Even SSDD-RP, which selects features randomly, beats SSDD-Base by exploiting the filtering effect in the first step of the two-step protocol. Next, SSDD-GF shows a better performance than SSDD-RP because it uses DF to select the features that occur frequently throughout the whole dataset. For SSDD-RP and SSDD-GF, we note that the execution time decreases as the number of dimensions increases. This is because the more dimensions we use, the larger the filtering effect we can exploit. SSDD-LF, which uses the locality of the current vector, also outperforms both SSDD-RP and SSDD-Base. Note that SSDD-LF is better than SSDD-GF below dimension 350 (about 5% of the number of dimensions) and worse than SSDD-GF at dimension 350 and beyond. This crossover point of 350 was observed in the actual experiment. The crossover occurs because only a small number of dimensions have a significant influence on the locality of the current vector. Finally, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF, shows the best performance for all dimensions. In Figure 3(a), we note that the execution times of SSDD-LF and SSDD-HF slightly increase as the number of dimensions increases. The reason is that, as the number of dimensions increases, the filtering effect grows relatively slowly, whereas the overhead of obtaining a current/difference vector and choosing dimensions from that vector grows relatively quickly.
Second, in Figure 3(b), we set the number of dimensions to 70 (1% of the total dimensions) and vary the tolerance from 0.95 to 0.75 in decrements of 0.05. Note that the closer the tolerance is to 1.0, the stronger the similarity we require. As shown in Figure 3(b), all of the proposed protocols significantly improve the performance compared with SSDD-Base. In particular, SSDD-LF and SSDD-HF, which exploit the locality, show a better performance than the other two proposed protocols. We note here that as the tolerance decreases, the execution times of all proposed protocols gradually increase. This is because the smaller the tolerance we use, the more documents qualify as similar. In summary, Figure 3 shows that the proposed SSDD-LF and SSDD-HF outperform SSDD-Base by up to 726.6 and 9,858 times, respectively.
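The effect of the tolerance on the number of qualifying documents can be seen in a small plaintext example. This sketch computes cosine similarity directly on unprotected vectors, so it shows only the similarity criterion and not the secure protocol; the helper names and toy vectors are hypothetical, and treating a similarity equal to the tolerance as a match is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def count_similar(query, documents, tolerance):
    """Count documents whose similarity to the query reaches the tolerance."""
    return sum(1 for d in documents if cosine_similarity(query, d) >= tolerance)


# Toy vectors, not taken from KOS, NIPS, or EMAILS.
query = [1, 1, 0, 0]
documents = [
    [1, 1, 0, 0],
    [2, 2, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]
tolerances = [0.95, 0.90, 0.85, 0.80, 0.75]
counts = [count_similar(query, documents, t) for t in tolerances]
print(list(zip(tolerances, counts)))
```

The count never decreases as the tolerance is lowered, so a lower tolerance leaves more candidate documents to be handled, which is consistent with the increase in execution time reported for Figure 3(b).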
Figure 4 shows the experimental results for NIPS. As in Figure 3 for KOS, we measure the execution time of SSDD by varying the number of dimensions and the tolerance. In Figure 4(a), we set the tolerance to 0.80 and increase the number of dimensions from 120 (1%) to 600 (5%) in steps of 120 (1%), where 120 is about 1% of the total of 12,419 dimensions. Next, in Figure 4(b), we set the number of dimensions to 120 and decrease the tolerance from 0.95 to 0.75 in steps of 0.05. The experimental results in Figures 4(a) and 4(b) show a trend very similar to that in Figures 3(a) and 3(b). That is, all of the proposed protocols significantly outperform SSDD-Base, and SSDD-HF shows the best performance. In Figure 4, SSDD-HF improves the performance compared with SSDD-Base by up to 16,620 times.
Figure 4: Experimental results for NIPS. (a) Different numbers of dimensions (tolerance = 0.80). (b) Different tolerances (number of dimensions = 120).
Figure 5 shows the results of the scalability test using the large-volume, high-dimensional dataset, EMAILS. We set the tolerance and the number of dimensions to 0.80 and 70, respectively, and increase the number of documents (emails) from 40 (0.1%) to 39,861 (100%) in tenfold steps. In this experiment, we exclude the results of SSDD-Base, SSDD-RP, and SSDD-GF for the case of 39,861 documents owing to their excessive execution time. As shown in Figure 5, as with the results for KOS and NIPS, our feature selection-based protocols outperform SSDD-Base in all cases, and in particular, SSDD-LF and SSDD-HF show the best performance regardless of the number of documents. We also note that all of the proposed protocols show a pseudo-linear trend in the number of documents (note that both the x- and y-axes are on log scales). That is, the execution time of the protocols grows pseudo-linearly with the number of documents, and we can state that they are excellent in terms of both scalability and performance.
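One way to check a trend on log-log axes is to fit the slope of log(time) against log(number of documents); a slope close to 1 indicates roughly linear growth. The following sketch assumes this least-squares reading of "pseudo-linear". The function name `loglog_slope` is hypothetical, and the timing values in the usage are synthetic placeholders generated from a linear formula, not measurements from the paper.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log10(time) against log10(size)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Document counts follow the tenfold steps used for EMAILS.
sizes = [40, 400, 4000, 39861]
# Synthetic placeholder timings (exactly linear in size), NOT measured results.
synthetic_times = [0.002 * n for n in sizes]
slope = loglog_slope(sizes, synthetic_times)
print(round(slope, 3))
```

With real measurements in place of the synthetic values, a fitted slope noticeably above 1 would indicate super-linear growth, and a slope near 2 would indicate quadratic growth.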
In this paper, we presented an efficient method for significantly reducing the computation and communication overhead of secure similar document detection. The contributions of the paper can be summarized as follows. First, we thoroughly analyzed the previous one-step protocol and pointed out that it incurs a serious performance overhead for high-dimensional document vectors. Second, to alleviate the overhead, we presented the feature selection-based two-step protocol and formally proved its correctness. Third, to improve the filtering efficiency of the two-step protocol, we proposed four feature selections: RP for selecting features randomly, LF for exploiting the locality of a current vector, GF for exploiting the globality of all document vectors, and HF for considering both the locality and the globality. Fourth, for each feature selection, we presented a formal protocol and analyzed its security and complexity. Fifth, through experiments on three real datasets, we showed that all of the proposed protocols significantly outperformed the base protocol, and in particular, the HF-based secure protocol improved the performance by three to four orders of magnitude. As future work, we will consider two issues: the use of feature extraction (feature creation) instead of feature selection for dimensionality reduction and the use of homomorphic encryption rather than a random matrix for a secure scalar product.
The authors declare that they have no competing interests.
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (no. R7117-16-0214, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams).
D. Sorokina, J. Gehrke, S. Warner, and P. Ginsparg, “Plagiarism detection in arXiv,” in Proceedings of the 6th IEEE International Conference on Data Mining, pp. 1070–1075, Hong Kong, December 2006.
R. Agrawal and R. Srikant, “Privacy-preserving data mining,” SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 439–450, 2000.
P. R. Bhaladhare and D. C. Jinwala, “Novel approaches for privacy preserving data mining in k-anonymity model,” Journal of Information Science and Engineering, vol. 32, no. 1, pp. 63–78, 2016.
S. Buyrukbilen and S. Bakiras, “Secure similar document detection with simhash,” in Proceedings of the Workshop on VLDB-Secure Data Management (SDM '13), pp. 61–75, Trento, Italy, August 2013.
Y. Peng, G. Kou, Y. Shi, and Z. Chen, “Privacy-preserving data mining for medical data: application of data partition methods,” in Communications and Discoveries from Multidisciplinary Data, vol. 123 of Studies in Computational Intelligence, pp. 331–340, Springer, Berlin, Germany, 2008.
B. Pinkas, “Cryptography techniques for privacy-preserving data mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12–19, 2002.
E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, pp. 245–250, ACM SIGKDD, San Francisco, Calif, USA, August 2001.
Y.-S. Moon, H.-S. Kim, S.-P. Kim, and E. Bertino, “Publishing time-series data under preservation of privacy and distance orders,” in Proceedings of the 21st International Conference on Database and Expert Systems Applications, Part II, pp. 17–31, Bilbao, Spain, August 2010.
W. Jiang and B. K. Samanthula, “N-gram based secure similar document detection,” in Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy, pp. 239–246, Richmond, Va, USA, July 2011.
B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen, “On secure scalar product computation for privacy-preserving data mining,” in Proceedings of the 7th Annual International Conference in Information Security & Cryptology, pp. 104–120, Seoul, Republic of Korea, December 2004.
Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 412–420, Nashville, Tenn, USA, July 1997.
W. Han, J. Lee, Y. Moon, S. Hwang, and H. Yu, “A new approach for processing ranked subsequence matching based on ranked union,” in Proceedings of the ACM SIGMOD International Conference on Management of data (SIGMOD '11), pp. 457–468, Athens, Greece, June 2011.
Bag of Words Data Sets, UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.