Research Article  Open Access
Efficient Two-Step Protocol and Its Discriminative Feature Selections in Secure Similar Document Detection
Abstract
Recently, the risk of information disclosure has been increasing significantly. Accordingly, privacy-preserving data mining (PPDM) is being actively studied to obtain accurate mining results while preserving data privacy. We here focus on secure similar document detection (SSDD), which identifies similar documents of two parties while neither party discloses its own sensitive documents to the other party. In this paper, we propose an efficient two-step protocol that exploits a feature selection as a lower-dimensional transformation, and we present discriminative feature selections to maximize the performance of the protocol. The proposed protocol consists of two steps: the filtering step and the postprocessing step. For the feature selection, we first consider the simplest one, random projection (RP), and propose its two-step solution, SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF, which selects a few dimensions locally frequent in the current querying vector, and SSDD-GF, which selects ones globally frequent in the set of all document vectors. We finally propose a hybrid one, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed two-step protocol significantly outperforms the previous one-step protocol by three or four orders of magnitude.
1. Introduction
Similar document detection is the problem of finding similar documents of two parties, Alice and Bob, and has been widely used in the version management of files, copyright protection, and plagiarism detection [1]. Secure similar document detection (SSDD) [2] has been recently introduced to identify similar documents while preserving the privacy of each party’s documents, as shown in Figure 1. That is, SSDD finds similar document pairs whose cosine similarity [3] exceeds the given tolerance while not disclosing document vectors to the other party. SSDD is a typical example of privacy-preserving data mining (PPDM) [4–6] and has the following applications:
(i) Detection of double submissions [2, 7]: in two or more conferences that do not allow double submissions, SSDD finds the double-submitted papers while not disclosing the papers to the other conference(s).
(ii) Detection of insurance fraud [2]: in an insurance fraud detection system, SSDD searches for similar accident cases of two or more insurance companies while not providing sensitive or private cases to the other company or companies.
(iii) Sharing of similar symptom cases [8]: doctors or patients of a hospital want to find similar symptom cases with the prescriptions or medical treatment histories of other hospitals without disclosing each other’s medical or health information.
Jiang et al. [2] proposed a novel solution for SSDD by exploiting secure multiparty computations (SMCs) [9, 10] in a semihonest model. Their solution preserves the privacy of two parties by using the secure scalar product in computing the cosine similarity between document vectors. They suggested the use of the random matrix or homomorphic encryption methods [9], which are representative methods of the secure scalar product. In this paper, we use the SSDD solution using the random matrix method by Jiang et al. as the base protocol, which we call SSDD-Base. However, SSDD-Base has a critical problem in that it incurs severe computation and communication overhead. Letting Alice’s and Bob’s document sets be $U$ and $V$, respectively, SSDD-Base requires $|U||V|$ secure scalar products. In many cases, the dimension $n$ of the document vectors reaches tens of thousands or even hundreds of thousands, and SSDD-Base incurs a very high complexity that grows with both $n$ and $|U||V|$, which is not practical for supporting a large volume of document databases. In particular, if there are many parties or frequent changes in document databases, the overhead becomes much more critical.
To alleviate the computation and communication overhead of SSDD-Base, in this paper, we present a two-step protocol that exploits the feature selection of a lower-dimensional transformation. The feature selection transforms high-dimensional document vectors into low-dimensional feature vectors, and in general, it selects tens to hundreds of dimensions from thousands to tens of thousands of dimensions. We call this feature selection FS. Representative FS includes a random projection (RP) [11], the document frequency (DF), and a linear discriminant analysis (LDA) [12]. In this paper, we use RP and DF because they are known to be simple but efficient feature selections. To devise a two-step protocol, we need to find an upper bound of the cosine similarity for the filtering process. Thus, we first present an upper bound of FS and formally prove its correctness. Using the upper bound property of FS, we then propose a generic two-step protocol, called SSDD-FS. The proposed SSDD-FS works as follows: in the first filtering step, it converts $n$-dimensional vectors into $f$ ($\ll n$)-dimensional vectors and applies the secure protocol to the $f$-dimensional vectors to filter out nonsimilar $n$-dimensional vectors; in the second postprocessing step, it applies the base protocol SSDD-Base to the nonfiltered $n$-dimensional vectors. In the filtering step, SSDD-FS prunes many nonsimilar high-dimensional vectors by comparing low-dimensional vectors at a much lower cost (because $f \ll n$), and thus, it significantly improves the performance compared with SSDD-Base.
For an efficient SSDD-FS, the FS should be highly discriminative; that is, the FS should filter out as many high-dimensional vectors as possible if they are nonsimilar. In this paper, we analyze SSDD protocols in detail and propose four different techniques as the discriminative FS implementation. An RP can first be considered the easiest way of implementing the FS. The RP randomly selects $f$ dimensions from $n$ dimensions. An RP is easy, but its filtering effect will be very low owing to the randomness. To solve this problem, we exploit the DF, which selects the feature dimensions based on the frequencies in all document vectors. In particular, by referring to the DF concept, we present three variants of DF, called the local frequency (LF), global frequency (GF), and hybrid frequency (HF). First, the LF considers the term frequencies of Alice’s current querying vector (which we call the current vector) and selects high-frequency dimensions from the current vector. The LF focuses on the locality, which means that considering the current vector only might be sufficient to decrease the upper bound of the cosine similarity. Second, the GF means the DF itself; that is, the GF counts the number of documents containing each term (dimension), constructs a frequency vector from those counts (which we call the whole vector), and selects high-frequency dimensions from the whole vector. The GF focuses on the globality because it considers all document vectors. To implement the GF, however, we need a secure protocol for obtaining the whole vector from both Alice’s and Bob’s document sets. For this, we propose a protocol, SecureDF, as a secure implementation of the DF. Third, the HF takes advantage of both the locality of the LF and the globality of the GF. The HF computes a difference vector between the current and whole vectors and selects high-valued dimensions from the difference vector.
This is because the HF tries to maximize the value difference between Alice’s and Bob’s vectors for each selected dimension and eventually decreases the upper bound of the cosine similarity. Table 1 summarizes these four feature selections and their corresponding SSDD protocols, SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF, which are described in Section 4.

In this paper, we empirically evaluate the base protocol, SSDD-Base, and our four SSDD-FS protocols (SSDD-RP, SSDD-LF, SSDD-GF, and SSDD-HF) using various datasets. The experimental results show that the SSDD-FS protocols significantly outperform SSDD-Base. This means that the proposed two-step protocols effectively prune a large number of nonsimilar document vectors early in the filtering step. In particular, SSDD-HF, which takes advantage of both the locality of SSDD-LF and the globality of SSDD-GF, shows the best performance. Compared with SSDD-Base, SSDD-HF reduces the execution time of SSDD by three or four orders of magnitude.
The rest of this paper is organized as follows. Section 2 explains related work and background of the research. Section 3 presents the FS-based two-step protocol, SSDD-FS, and proves its correctness. Section 4 introduces four novel feature selections, RP, LF, GF, and HF, and proposes their corresponding secure protocols. Section 5 explains experimental results on various datasets. We finally summarize and conclude the paper in Section 6.
2. Related Work and Background
With the increasing need to preserve the privacy of big data, there have been many efforts on PPDM [13]. PPDM solutions can be classified into four categories: data perturbation, anonymization, distributed privacy preservation, and privacy preservation of mining results [14]. Among these categories, secure multiparty computation (SMC) has been used in distributed privacy preservation. As a recent secure multiparty computation of principal component analysis (PCA), Won et al. [15] proposed S-PCA, which computes PCA without disclosing the sensitive data of individual nodes. SSDD can also be regarded as an application of distributed privacy preservation. Shah and Joshi [16] summarized PPDM technologies in a distributed environment. Representative SMCs include secure comparison, secure sum, and secure scalar product, and SSDD is an interesting application of the secure scalar product.
Jiang et al. [2] proposed a novel solution for SSDD by exploiting SMCs in a semihonest model. Their solution preserves the privacy of two parties by using the secure scalar product in computing the cosine similarity between document vectors. They also proposed $n$-gram based SSDD protocols that are effective in detecting local similarity, unlike the existing vector space model, which is effective in detecting global similarity [17]. Buyrukbilen and Bakiras [7] introduced a solution based on simhash to reduce high computational and communication costs. Simhash is a dimensionality reduction technique that encodes all the document terms and their frequencies into a fixed-size bit vector, which makes it possible to reduce the similarity calculation to a secure XOR computation between two bit vectors.
We use the cosine similarity as the basic operation of similar document detection. The cosine similarity of two $n$-dimensional vectors $u$ and $v$ is computed as $\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$, where $u \cdot v$ is the scalar product of $u$ and $v$; that is, $u \cdot v = \sum_{i=1}^{n} u_i v_i$. If we can compute $u \cdot v$ securely in two parties, we can also compute $\cos(u, v)$ securely. There are two representative methods for a secure scalar product [2]. The first is a random matrix method, where two parties share the same random matrix, which they use to securely compute the scalar product. The second is the homomorphic encryption method [18], where two parties use a homomorphic probability key system for the secure computation of scalar products. In this paper, we use the random matrix method because it is more efficient than the homomorphic encryption method; however, the homomorphic encryption method can be used instead for the protocols discussed later. Without loss of generality, we assume that vectors $u$ and $v$ are normalized to size 1. That is, $\|u\| = \|v\| = 1$, and thus, simply $\cos(u, v) = u \cdot v$.
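As a plain (non-secure) illustration of the last point, the following minimal Python sketch checks that, once two vectors are normalized to size 1, their cosine similarity equals their scalar product. The helper names (`scalar_product`, `normalize`, `cosine_similarity`) and the sample vectors are assumptions made for illustration only and do not appear in the protocol.

```python
import math

def scalar_product(u, v):
    """Plain scalar product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def normalize(vec):
    """Scale a vector to size (Euclidean norm) 1."""
    norm = math.sqrt(scalar_product(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]

def cosine_similarity(u, v):
    """General cosine similarity for vectors of any size."""
    norms = math.sqrt(scalar_product(u, u)) * math.sqrt(scalar_product(v, v))
    return scalar_product(u, v) / norms

# Hypothetical term-frequency vectors (illustrative values only).
u_raw = [3.0, 0.0, 4.0]
v_raw = [1.0, 2.0, 2.0]
u, v = normalize(u_raw), normalize(v_raw)

# After normalization, the scalar product alone gives the cosine similarity.
print(cosine_similarity(u_raw, v_raw), scalar_product(u, v))
```

This is why a secure scalar product is sufficient for the secure computation of the cosine similarity in the rest of the paper.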
Algorithm 1 shows the protocol of SSDD-Base, the recent SSDD solution by Jiang et al. [2]. SSDD-Base uses a random matrix method for secure scalar products, where Alice and Bob share the same random matrix and securely determine whether two vectors $u$ and $v$ are similar. For the correctness and a detailed description of Protocol SSDD-Base, readers are referred to [2]. In SSDD, we apply SSDD-Base for each pair of document vectors. More formally, if $U$ and $V$ are sets of document vectors owned by Alice and Bob, respectively, we apply SSDD-Base for each pair $(u, v)$, where $u \in U$ and $v \in V$. As we mentioned in Section 1, however, SSDD-Base incurs severe computation and communication overhead, which becomes very serious if there are several parties or if a large number of documents change dynamically. To alleviate this critical overhead, in this paper, we discuss a two-step solution for SSDD.

In text and time-series mining, many lower-dimensional transformations have been proposed to solve the dimensionality curse problem [19–21] of high-dimensional vectors. We can classify lower-dimensional transformations into feature extractions and feature selections [22]. First, the feature extraction creates a few new features from an original high-dimensional vector. Representative examples of feature extractions include latent semantic indexing (LSI) [23], locality preserving indexing (LPI), discrete Fourier transform (DFT) [24], discrete wavelet transform (DWT), and piecewise aggregate approximation (PAA) [25]. In contrast, the feature selection selects a few discriminative features from the original (or transformed) high-dimensional vectors. Representative examples of feature selections include RP, DF, LDA, and principal component analysis (PCA) [11, 12]. In this paper, we use RP and DF with the appropriate variations. This is because RP and DF are much simpler than other transformations, and accordingly, they are easily applied to SSDD with low complexity; on the other hand, LSI, LPI, LDA, and PCA may provide very accurate feature vectors, but they are too complex to be applied to SSDD. For a detailed description of lower-dimensional transformations for text mining, readers are referred to [22, 23].
For a detailed explanation of the problems and solutions of data perturbation and anonymization, readers are referred to survey papers [4, 13].
3. Feature Selection Based Secure Two-Step Protocol
We use the FS for the secure two-step protocol. To transform an $n$-dimensional vector into an $f$-dimensional vector, the FS chooses $f$ random or highly frequent dimensions from $n$ dimensions, and thus, its transformation process is very simple. In this section, we first assume that the FS can select $f$ dimensions from $n$ dimensions in a secure manner, and we then propose a secure two-step protocol of SSDD using the secure FS.
To use a lower-dimensional transformation $T$ for SSDD, we need to find an upper bound function $UB_T$ that satisfies (1):
$$\cos(u, v) \le UB_T(\bar{u}, \bar{v}), \qquad (1)$$
where $\bar{u}$ and $\bar{v}$ are $f$-dimensional feature vectors transformed from $n$-dimensional vectors, $u$ and $v$, respectively, by the transformation $T$. In (1), $u = (u_1, \ldots, u_n)$, $v = (v_1, \ldots, v_n)$, $\bar{u} = (\bar{u}_1, \ldots, \bar{u}_f)$, and $\bar{v} = (\bar{v}_1, \ldots, \bar{v}_f)$. The transformation $T$ should satisfy (1) because the SSDD using $T$ should not incur any false dismissal; this requirement is analogous to the lower bound property of the Euclidean distances, known through Parseval’s theorem, in time-series matching [20]. To obtain an upper bound of the lower-dimensional transformation $T$, we first define an upper bound of $T$ as follows.
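Definition 1. If a lower-dimensional transformation $T$ transforms $n$-dimensional vectors, $u$ and $v$, to $f$-dimensional vectors, $\bar{u}$ and $\bar{v}$, respectively, we define the upper bound function of $T$, denoted by $UB_T(\bar{u}, \bar{v})$, through
$$UB_T(\bar{u}, \bar{v}) = 1 - \frac{D^2(\bar{u}, \bar{v})}{2}, \qquad (2)$$
where $D^2(\bar{u}, \bar{v})$ is the squared Euclidean distance between $\bar{u}$ and $\bar{v}$; that is, $D^2(\bar{u}, \bar{v}) = \sum_{i=1}^{f} (\bar{u}_i - \bar{v}_i)^2$.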
In this paper, we want to use the FS as a lower-dimensional transformation $T$, and thus, we formally prove that the upper bound function of the FS satisfies (1), the upper bound property of the cosine similarity.
Theorem 2. If a feature selection FS transforms $n$-dimensional vectors, $u$ and $v$, to $f$-dimensional vectors, $\bar{u}$ and $\bar{v}$, respectively, $UB_{FS}(\bar{u}, \bar{v})$ is an upper bound of $\cos(u, v)$; that is, (3) holds:
$$\cos(u, v) \le UB_{FS}(\bar{u}, \bar{v}). \qquad (3)$$
Proof. Equation (3) is simply derived from (2), and we omit the detailed proof.
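For readers who want the omitted step, the derivation can be sketched as follows. The sketch assumes that $u$ and $v$ are normalized to size 1, that the upper bound function of (2) has the form $1 - D^2/2$, and that $S$ (a symbol introduced here only for illustration) denotes the set of $f$ indexes chosen by the FS, so that the feature vectors $\bar{u}$ and $\bar{v}$ keep exactly the entries of $u$ and $v$ at the indexes in $S$.

```latex
\begin{aligned}
D^2(u, v) &= \sum_{i=1}^{n} (u_i - v_i)^2
           = \|u\|^2 + \|v\|^2 - 2\,u \cdot v
           = 2 - 2\cos(u, v), \\
D^2(\bar{u}, \bar{v}) &= \sum_{i \in S} (u_i - v_i)^2
           \le \sum_{i=1}^{n} (u_i - v_i)^2 = D^2(u, v), \\
\cos(u, v) &= 1 - \frac{D^2(u, v)}{2}
           \le 1 - \frac{D^2(\bar{u}, \bar{v})}{2}
           = UB_{FS}(\bar{u}, \bar{v}).
\end{aligned}
```

The middle inequality holds because a feature selection only drops nonnegative terms from the sum; it would not hold in general for a transformation that creates new features.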
Using the upper bound property of the FS, we now propose a generic two-step protocol, SSDD-FS, which is shown in Algorithm 2. As shown in the protocol, SSDD-FS maintains the $f$-dimensional vectors $\bar{u}$ and $\bar{v}$ in addition to the $n$-dimensional vectors $u$ and $v$ of SSDD-Base. Likewise, Alice and Bob share a random matrix for the $f$-dimensional space in addition to the random matrix of SSDD-Base for the $n$-dimensional space. The first step of SSDD-FS discards nonsimilar $n$-dimensional vectors in the $f$-dimensional space. It first securely computes the scalar product $\bar{u} \cdot \bar{v}$ for the $f$-dimensional vectors $\bar{u}$ and $\bar{v}$. Except for using $f$-dimensional vectors instead of $n$-dimensional vectors, these steps are the same as those of SSDD-Base. The only difference from SSDD-Base is that Bob additionally sends $\|\bar{v}\|^2$ to Alice for computing $D^2(\bar{u}, \bar{v})$. Alice then computes $\Delta$ ($= D^2(\bar{u}, \bar{v})$) using
$$\Delta = \|\bar{u}\|^2 + \|\bar{v}\|^2 - 2\,\bar{u} \cdot \bar{v}. \qquad (4)$$
Next, Alice computes the upper bound function of the FS, $UB_{FS}(\bar{u}, \bar{v})$, and we apply the filtering process by comparing the upper bound with the given tolerance ($\epsilon$). If the upper bound is less than the tolerance, that is, if $UB_{FS}(\bar{u}, \bar{v}) < \epsilon$, the actual cosine similarity will also be less than the tolerance, and we do not need to compute it in the $n$-dimensional space. That is, if $UB_{FS}(\bar{u}, \bar{v}) < \epsilon$, we can skip the second step. Thus, the second step is executed only if the $n$-dimensional vectors of $(u, v)$ are not filtered out by the upper bound. Finally, in the second step, we compute the actual cosine similarity for $(u, v)$ using SSDD-Base.
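The control flow of the two steps can be sketched in plain Python as follows. This is only a non-secure simulation under stated assumptions: both vectors are held in one process, every scalar product is computed in the clear (in the protocol it would be a secure scalar product), the upper bound is taken to be 1 minus half the squared Euclidean distance of the feature vectors, and a pair is reported as similar when its cosine similarity is at least the tolerance. The function names, the sample vectors, and the tolerance 0.8 are hypothetical.

```python
import math

def dot(u, v):
    """Plain scalar product (stands in for the secure scalar product)."""
    return sum(a * b for a, b in zip(u, v))

def normalize(vec):
    """Scale a vector to size 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def select(vec, indexes):
    """Feature selection: keep only the entries at the shared indexes."""
    return [vec[i] for i in indexes]

def upper_bound(u_fs, v_fs):
    """Upper bound of the cosine similarity from the feature vectors.

    delta is the squared Euclidean distance of the feature vectors,
    obtained from the two squared norms and one scalar product.
    """
    delta = dot(u_fs, u_fs) + dot(v_fs, v_fs) - 2.0 * dot(u_fs, v_fs)
    return 1.0 - delta / 2.0

def two_step_check(u, v, indexes, eps):
    """Return (is_similar, step), where step names the step that decided."""
    # Step 1 (filtering): work only in the low-dimensional feature space.
    if upper_bound(select(u, indexes), select(v, indexes)) < eps:
        return False, "filtered"
    # Step 2 (postprocessing): full-dimensional comparison for survivors.
    return dot(u, v) >= eps, "postprocessed"

# Hypothetical unit-length document vectors; indexes 0 and 1 are the features.
u = normalize([5.0, 0.0, 1.0, 0.0])
v_far = normalize([0.0, 5.0, 0.0, 1.0])
v_near = normalize([5.0, 0.0, 1.0, 1.0])
indexes = [0, 1]

print(two_step_check(u, v_far, indexes, 0.8))
print(two_step_check(u, v_near, indexes, 0.8))
```

In this toy run the first pair is decided in the feature space alone, while the second pair passes the filter and is decided by the full-dimensional comparison, which is the only case where the cost of the base protocol is paid.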

We note here that how much SSDD-FS improves the performance compared with SSDD-Base depends on how many $n$-dimensional vectors are discarded in the first step. This filtering effect depends largely on the discriminative power of the feature selection, that is, the efficiency of the FS. In other words, if the FS largely exploits the filtering effect, SSDD-FS can reduce the computation and communication overhead from that of the $n$-dimensional space to that of the $f$-dimensional space. Based on this observation, we need to maximize the filtering effect of the FS, which can be seen as a problem of how we choose $f$ dimensions from $n$ dimensions for maximizing the discriminative power of the FS. Therefore, we propose efficient FS variants and their SSDD protocols in Section 4.
4. Discriminative Feature Selections for the Two-Step Protocol
In this section, we propose four methods to implement the FS of Protocol SSDD-FS. Figure 2 shows the procedure of SSDD-FS including the feature selection step. As shown in the figure, we first obtain $\bar{u}$ and $\bar{v}$ from $u$ and $v$, respectively, through the feature selection, which should also be applied securely. As mentioned in Section 1, we present the RP, LF, GF, and HF as the feature selection methods, and in this section we describe how they work in detail. In Figure 2, the secure feature selection corresponds to the feature selection line of Protocol SSDD-FS, and the other two steps correspond to its first (filtering) and second (postprocessing) steps, respectively.
4.1. RP: Random Projection
An RP is the easiest way of implementing the FS, and it randomly selects $f$ dimensions from $n$ dimensions. We can consider two different methods for applying the RP to SSDD-FS. The first selects $f$ dimensions dynamically for each document pair $(u, v)$, and the second first determines $f$ dimensions and then uses these predetermined dimensions for all document pairs.
To use the first RP method, Alice and Bob should share $f$ indexes, $i_1, \ldots, i_f$, of randomly selected dimensions for each $(u, v)$ before starting the first step of SSDD-FS. This sharing process can be implemented as Alice randomly selecting $f$ dimensions and sending their indexes to Bob or as Alice and Bob sharing the same seed of a random function. That is, we can implement the first RP method by modifying the feature selection line of Protocol SSDD-FS as Lines (11) through (13) of Algorithm 3.

The second RP method uses the same $f$ dimensions for all $(u, v)$ pairs. We can easily implement this method because Alice and Bob share the same $f$ indexes only once before starting SSDD-FS. These first and second RP methods do not disclose any values of Alice’s and Bob’s document vectors and thus are considered secure. In addition, these two methods have the same effect in selecting $f$ dimensions randomly. Thus, we use the second method, which we call SSDD-RP to differentiate it from SSDD-FS, because it is much simpler than the first.
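The shared-seed variant of the index sharing can be sketched as follows. The sketch assumes that Alice and Bob have agreed on an integer seed beforehand and that both run the same Python implementation; the function name `rp_indexes` and the seed value are hypothetical, and Python's `random` module is used only for illustration (it is not a cryptographically secure generator).

```python
import random

def rp_indexes(n, f, seed):
    """Pick f distinct dimension indexes out of n from a shared seed.

    Both parties call this with the same (n, f, seed) and obtain the same
    indexes without exchanging any values of their document vectors.
    """
    rng = random.Random(seed)           # private generator, no global state
    return sorted(rng.sample(range(n), f))

def select(vec, indexes):
    """Project a vector onto the chosen dimensions."""
    return [vec[i] for i in indexes]

# Hypothetical setting: 10 dimensions reduced to 3 with a pre-agreed seed.
SHARED_SEED = 2017
alice_indexes = rp_indexes(10, 3, SHARED_SEED)
bob_indexes = rp_indexes(10, 3, SHARED_SEED)
print(alice_indexes == bob_indexes)
```

Calling `rp_indexes` once before the protocol starts corresponds to the second RP method; calling it with a fresh, agreed-upon seed for each document pair would correspond to the first.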
4.2. LF: Local Frequency
SSDD-RP proposed in Section 4.1 has the problem of exploiting only a slight filtering effect during the first filtering step. This low filtering effect is due to the RP choosing features without any consideration of the characteristics of the document vectors. In our experiments, SSDD-RP shows only a very slight improvement in the SSDD performance compared with SSDD-Base. To solve the problem of SSDD-RP and enlarge the filtering effect, in this paper, we consider how frequent each term is in the document or document set; that is, we use the term frequency (TF). (In this paper, we use TF for simplicity, but TF-IDF (term frequency-inverse document frequency) can be used instead of TF. The choice among TF, TF-IDF, and other feature frequencies is orthogonal to our approach, and we use TF for easy understanding of the proposed concept.) In general, we use the TF concept as follows: we first compute the number of occurrences (that is, the frequency) of each term throughout the whole dataset and then choose the highly frequent dimensions. We call this selection method document frequency (DF). We consider the TF (or DF) in SSDD-FS because if we select the highly frequent dimensions, we can obtain relatively small upper bounds owing to the relatively large $D^2(\bar{u}, \bar{v})$ of (2), and accordingly, we can largely exploit the filtering effect.
As a feature selection using term frequencies, we first consider how frequent each term is in an individual document rather than the whole document set; that is, we first propose a feature selection that exploits the locality of each document. More precisely, for a pair of documents $(u, v)$, the locality-based selection chooses $f$ highly frequent dimensions in Alice’s current vector $u$. This selection is based on the simple intuition that, even without considering whole vectors of the document set, the current vector itself will have a significant influence on the upper bound $UB_{FS}(\bar{u}, \bar{v})$. In this selection, we can instead use Bob’s vector $v$ rather than Alice’s vector $u$ as the current vector, or we can also use both Alice’s and Bob’s vectors $u$ and $v$. Using $v$, however, incurs additional communication overhead, and thus, in this paper we consider a simple method of using Alice’s $u$ as the current vector. We call this selection method the local frequency (LF) because it considers individual (that is, local) documents rather than whole documents, and we denote the protocol of applying the LF to SSDD-FS as SSDD-LF. SSDD-LF exploits the locality by selecting $f$ dimensions anew for each document $u$ each time the protocol starts for that document. Algorithm 4 shows how we implement SSDD-LF by modifying the feature selection line of SSDD-FS of Algorithm 2. In Line (12), Alice first selects the top $f$ frequent dimensions from her current vector. She sends the indexes of the selected dimensions to Bob in Line (13). Thus, they can share the same indexes and obtain $f$-dimensional feature vectors using the same indexes in Line (14).
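The LF selection itself can be sketched as follows, as a minimal plain-Python illustration rather than the secure protocol: Alice picks the indexes of her top frequent dimensions, and both parties project their vectors onto those indexes. The function names and the sample term-frequency vectors are hypothetical, and ties are broken in favor of the lower index, which is an assumption of this sketch.

```python
import heapq

def top_f_indexes(vec, f):
    """Indexes of the f largest entries of vec, returned in ascending order.

    heapq.nlargest keeps the earlier index when two entries are equal.
    """
    return sorted(heapq.nlargest(f, range(len(vec)), key=lambda i: vec[i]))

def select(vec, indexes):
    """Feature vector: entries of vec at the shared indexes."""
    return [vec[i] for i in indexes]

# Hypothetical term-frequency vectors over six terms.
alice_current = [0, 7, 2, 0, 5, 1]
bob_vector = [1, 0, 3, 4, 0, 2]

# Alice selects the indexes locally and would send only these to Bob.
shared_indexes = top_f_indexes(alice_current, 3)

alice_features = select(alice_current, shared_indexes)
bob_features = select(bob_vector, shared_indexes)
print(shared_indexes, alice_features, bob_features)
```

Only `shared_indexes` crosses the party boundary in this sketch; the entry values stay local, which mirrors the disclosure discussed for SSDD-LF.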

We now analyze the computation and communication overhead of the feature selection in SSDD-LF. As shown in Algorithm 4, for each vector $u$, Alice chooses the top $f$ frequent dimensions from the $n$ dimensions of $u$ and communicates with Bob to share those indexes. First, Alice incurs additional computation overhead to select the top $f$ frequent dimensions from the current $n$-dimensional vector. Second, Alice and Bob incur additional communication overhead to share the $f$ indexes. However, this communication can be combined with the step of SSDD-Base in Algorithm 1 in which Alice sends her encrypted vector; that is, Alice can send the $f$ indexes together with the encrypted vector to Bob. The $f$ indexes are much smaller than the $n$-dimensional vector, and their overhead is negligible. Thus, we can say that SSDD-LF incurs the computation overhead of the selection, but the communication overhead can be ignored. In particular, we compare each vector $u$ of Alice with a large number of vectors $v$ ($\in V$) of Bob, and thus, the computation overhead of the selection can also be ignored as a preprocessing step.
Another point to consider in SSDD-LF is whether its feature selection process is secure. That is, there should be no privacy disclosure when Alice selects $f$ indexes and shares them with Bob. Fortunately, Alice sends only indexes to Bob rather than entry values of $u$, and the sensitive values are not disclosed during the selection process. Unfortunately, however, the information regarding which dimensions are frequent in $u$ is revealed to Bob. If a user cannot allow even this limited disclosure of information, we recommend using the previous SSDD-RP or the following SSDD-GF or SSDD-HF as a more secure protocol.
4.3. GF: Global Frequency
SSDD-LF described above has the problem of considering only Alice’s current vector and ignoring all vectors of Bob. Due to this problem, SSDD-LF exploits the filtering effect for only a part of Bob’s vectors but does not do so for most of the other vectors. To overcome this problem, in this section, we propose another feature selection that uses the whole vector, whose elements represent the number of documents containing the corresponding term. Unlike the LF focusing on the current vector only, it considers all document vectors and has the characteristics of globality. We call this feature selection the global frequency (GF) and denote the GF-based secure protocol as SSDD-GF. Actually, the GF is the same as the DF, which has been widely used as the representative feature selection, and it works as follows. First, let $w = (w_1, \ldots, w_n)$ be a whole vector, where $w_i$ is the number of documents containing the $i$th term; that is, $w_i$ is the DF value of the $i$th term. Then, to reduce the number of dimensions from $n$ to $f$, the GF simply selects the $f$ dimensions whose DF values are larger than those of the other dimensions. We can obtain the whole vector by scanning all of the document vectors a single time. The traditional DF constructs the whole vector based on the assumption that all document vectors are maintained in a single computer. In SSDD, however, the document vectors are distributed to Alice and Bob, each of whom does not want to provide their own vectors to the other. Thus, to use the GF in SSDD, we first need to present a secure protocol for constructing the whole vector from the document vectors stored in a distributed manner by Alice and Bob.
Algorithm 5 shows Protocol SecureDF, which securely constructs a whole vector $w$ from Alice’s and Bob’s document vectors and obtains $f$ frequent dimensions from $w$. First, Alice and Bob compute their own whole vectors independently. That is, Alice computes her own whole vector $w_A$ from her own document set $U$, and Bob obtains $w_B$ from $V$. Next, they share those whole vectors $w_A$ and $w_B$ with each other. They then compute the aggregated whole vector $w$ from those vectors. After obtaining the whole vector $w$, Alice and Bob can select $f$ frequent dimensions from $w$. We note that Alice sends $w_A$ to Bob, and Bob sends $w_B$ to Alice. Vectors $w_A$ and $w_B$, however, are not the exact values of the document vectors but simple statistics, and thus we can say that SecureDF does not reveal any privacy of individual documents. The computation and communication complexities of SecureDF are merely linear in the size of the document sets (each party scans its own vectors once) and in the number of dimensions $n$ (each party sends one $n$-dimensional vector), respectively. In addition, SecureDF can be seen as a preprocessing step that is executed only once for all document vectors of Alice and Bob. Thus, its complexity is negligible compared with the complexity of SSDD-Base.
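The data flow of SecureDF can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the protocol itself: both parties are simulated in one process, the exchange of the per-party whole vectors is a plain function call, and the aggregated whole vector is assumed to be the element-wise sum of the two per-party vectors. All function names and the sample document sets are hypothetical.

```python
import heapq

def document_frequency(docs, n):
    """Per-party whole vector: entry i counts the documents containing term i."""
    w = [0] * n
    for doc in docs:
        for i, tf in enumerate(doc):
            if tf > 0:
                w[i] += 1
    return w

def aggregate(w_a, w_b):
    """Aggregated whole vector (assumed here to be the element-wise sum)."""
    return [a + b for a, b in zip(w_a, w_b)]

def top_f_indexes(vec, f):
    """Indexes of the f largest entries, in ascending index order."""
    return sorted(heapq.nlargest(f, range(len(vec)), key=lambda i: vec[i]))

# Hypothetical term-frequency vectors over four terms.
alice_docs = [[2, 0, 1, 0], [1, 0, 0, 3]]
bob_docs = [[0, 0, 4, 1], [1, 1, 1, 0], [5, 0, 0, 0]]

# Each party scans only its own documents.
w_alice = document_frequency(alice_docs, 4)
w_bob = document_frequency(bob_docs, 4)

# Only the count vectors are exchanged, never the document vectors.
w = aggregate(w_alice, w_bob)
print(w, top_f_indexes(w, 2))
```

Because both parties end up with the same aggregated vector, each can derive the same frequent indexes locally without any further communication.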

We now describe SSDD-GF, which exploits SecureDF as the feature selection. Algorithm 6 shows how we modify the feature selection line of Algorithm 2 to convert SSDD-FS into SSDD-GF. In Line (10), we first apply SecureDF to obtain the whole vector $w$ and determine the $f$ indexes that are most frequent in $w$. For the current $n$-dimensional vectors $u$ and $v$, Alice and Bob obtain the $f$-dimensional vectors $\bar{u}$ and $\bar{v}$ using the determined indexes. As shown in Algorithm 6, the current vectors and even their term frequencies are not disclosed to each other, and thus, we can say that SSDD-GF is a secure protocol of SSDD.

4.4. HF: Hybrid Frequency
The LF and GF have the following different characteristics from the viewpoint of a filtering effect. First, the LF considers Alice’s current vector only, and thus, the filtering effect will be large for only a portion of Bob’s vectors whose TF patterns differ greatly from the current vector; however, this effect is exploited less for most of the other vectors. In other words, the LF can exploit a better filtering effect than the GF when Alice’s current vector differs significantly from the whole vector in the TF patterns. Second, the GF considers the whole vector without considering the current vector, and it can thus exploit the filtering effect relatively evenly on many of Bob’s document vectors. That is, the GF can exploit a better filtering effect than the LF when Alice’s current vector has similar characteristics to the whole vector in the TF patterns.
To take advantage of both the locality of LF and the globality of GF, we now propose a hybrid feature selection, called hybrid frequency (HF). That is, the HF uses the current vector for exploiting the locality of LF and at the same time also uses the whole vector for exploiting the globality of GF. We then present an advanced secure protocol, SSDD-HF, by applying the HF to SSDD-FS. Simply speaking, the HF compares the current and whole vectors and selects the feature dimensions whose differences are larger than those of the other dimensions. In more detail, we select feature dimensions that have one of the following two characteristics: dimensions that frequently occur in Alice’s current vector but seldom occur in the whole vector (i.e., whose values are relatively large in the current vector but relatively small in the whole vector) or, on the contrary, dimensions that seldom occur in Alice’s current vector but frequently occur in the whole vector. This is because the larger $|u_i - v_i|$ is (that is, the difference between the values of the selected feature dimension), the smaller $UB_{FS}(\bar{u}, \bar{v})$ is or, in other words, the larger $D^2(\bar{u}, \bar{v})$ of (2) is, which yields a larger filtering effect.
However, we cannot directly compare Alice’s current vector $u$ and the whole vector $w$ obtained by SecureDF. The reason is that $u$ represents the “frequencies of terms” in a single vector, whereas $w$ represents the “frequencies of documents” containing those terms. That is, the meaning of the frequencies in $u$ differs from that of $w$, and thus, their scales are also different. To resolve this problem, before comparing the two vectors $u$ and $w$, we first normalize them using their mean ($\mu$) and standard deviation ($\sigma$). More precisely, we first normalize $u$ and $w$ as $\hat{u}$ and $\hat{w}$ by (5), where $x$ stands for either $u$ or $w$:
$$\hat{x}_i = \frac{x_i - \mu(x)}{\sigma(x)}, \qquad (5)$$
and we next obtain the difference vector $d$, where $d_i = |\hat{u}_i - \hat{w}_i|$. We then select the $f$ largest dimensions from $d$ and use them as the features of SSDD-HF.
Algorithm 7 shows how we modify the feature selection line of SSDD-FS in Algorithm 2 to implement SSDD-HF. First, as in SSDD-GF, Line (10) constructs $w$ by executing SecureDF. Next, in Lines (12) and (13), we normalize the current and whole vectors and obtain the difference vector $d$ from those normalized vectors. Finally, in Lines (14) through (16), Alice chooses $f$ dimensions from $d$ and shares those dimensions with Bob. That is, Lines (14) through (16) are the same as Lines (12) through (14) of SSDD-LF in Algorithm 4 except that SSDD-LF uses the current vector $u$ whereas SSDD-HF uses the difference vector $d$.
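The HF selection can be sketched in plain Python as follows. The sketch makes several assumptions that should be read as illustrative choices: the normalization is a z-score using the population standard deviation, the difference vector holds absolute differences of the normalized entries (so that both "frequent locally, rare globally" and the opposite case score high), and a constant vector is mapped to all zeros. The function names and sample vectors are hypothetical.

```python
import heapq
import statistics

def z_normalize(vec):
    """Rescale a vector to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(vec)
    sigma = statistics.pstdev(vec)
    if sigma == 0:
        return [0.0] * len(vec)      # constant vector: no dimension stands out
    return [(x - mu) / sigma for x in vec]

def difference_vector(current, whole):
    """Absolute difference of the normalized current and whole vectors."""
    u_hat, w_hat = z_normalize(current), z_normalize(whole)
    return [abs(a - b) for a, b in zip(u_hat, w_hat)]

def hf_indexes(current, whole, f):
    """Indexes of the f largest entries of the difference vector."""
    d = difference_vector(current, whole)
    return sorted(heapq.nlargest(f, range(len(d)), key=lambda i: d[i]))

# Hypothetical data: term frequencies of Alice's current document and
# document frequencies aggregated over both parties.
current = [9, 0, 1, 0, 2]
whole = [1, 8, 2, 1, 9]

print(hf_indexes(current, whole, 2))
```

In this toy data, term 0 is frequent in the current document but rare overall, and term 1 is the opposite, so these two dimensions have the largest normalized differences and are the ones selected.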

The overhead of the feature selection in SSDD-HF can be seen as the sum of those in SSDD-LF and SSDD-GF. That is, like SSDD-GF, it has the overhead of applying SecureDF to obtain the whole vector $w$, and at the same time, like SSDD-LF, it has the overhead of choosing the $f$ largest dimensions from the $n$-dimensional difference vector $d$. Such overhead, however, is negligible for the following reasons: as we described with SSDD-GF earlier, SecureDF can be seen as a preprocessing step executed only once for all document vectors, and its overhead is negligible throughout the whole SSDD process; as we described for SSDD-LF earlier, the computation overhead of choosing $f$ dimensions from an $n$-dimensional vector can be ignored because it can also be seen as a preprocessing step. One more notable point is that SSDD-HF is a secure protocol like SSDD-GF because it uses SecureDF and the difference vector, which are secure and do not disclose any original values or sensitive indexes of individual vectors.
5. Performance Evaluation
5.1. Experimental Data and Environment
In this section, we empirically evaluate the feature selection-based SSDD protocols. We used three datasets obtained from the document sets of the UCI repository [26]. The first dataset consists of KOS blog entries collected from dailykos.com (KOS in short). KOS consists of 3,430 documents with 6,906 different terms (dimensions) and has a total of 467,714 terms. The second dataset contains full papers published for the Neural Information Processing Systems Conference (NIPS in short). NIPS consists of 1,500 documents with 12,419 different terms and has about 1.9 million terms in total. The third dataset contains email messages of Enron (EMAILS in short). EMAILS consists of 39,861 emails with 28,102 different terms and has about 6.4 million terms in total.
We experimented with five SSDD protocols: SSDDBase as the basic one and the four proposed protocols, SSDDRP, SSDDLF, SSDDGF, and SSDDHF. In each experiment, we measured the elapsed time of executing SSDD for each protocol. In the first experiment, we varied the number of dimensions for a fixed tolerance, where the number of dimensions is the number of features selected by the feature selection. In the second experiment, we varied the tolerance for a fixed number of dimensions. For these two experiments, we used KOS and NIPS, which have relatively small numbers of documents compared with EMAILS. The third experiment, in contrast, was conducted to test the scalability of each protocol, and thus we used EMAILS because it is much larger than KOS and NIPS.
The hardware platform is an HP ProLiant ML110 G7 workstation equipped with an Intel Xeon E3-1220 3.10 GHz quad-core CPU, 16 GB of RAM, and a 250 GB HDD. CentOS 6.5 Linux was used as the software platform, and all of the protocols were implemented in C. We ran SSDD on a single machine using a local loop for network communication. We use the local loop to intentionally exclude the network speed, because different network speeds or environments may largely distort the actual execution time of each protocol. We measured the execution time spent for Alice to send each document to Bob and securely identify its similarity. More precisely, we stored the whole dataset at Bob and selected ten query documents for Alice. We then executed each SSDD protocol for these ten query documents and used the sum of their execution times as the experimental result.
5.2. Experimental Results
Figure 3 shows the experimental results for KOS. First, in Figure 3(a), we set the tolerance to 0.80 and varied the number of dimensions over 70, 210, 350, 490, and 640, which correspond to about 1%, 3%, 5%, 7%, and 9% of the number of KOS dimensions. In the figure, the horizontal axis shows the number of (selected) dimensions, and the vertical axis shows the actual execution time. Note that the execution-time axis is on a log scale.
Figure 3: Experimental results for KOS. (a) Different numbers of dimensions (tolerance = 0.80). (b) Different tolerances (number of dimensions = 70).
Figure 3(a) shows that all of the proposed protocols significantly outperform the basic SSDDBase. Even SSDDRP, which selects features randomly, beats SSDDBase by exploiting the filtering effect in the first step of the two-step protocol. Next, SSDDGF performs better than SSDDRP because it uses DF to select the features that occur frequently throughout the whole dataset. For SSDDRP and SSDDGF, the execution time decreases as the number of dimensions increases, because the more dimensions we use, the larger the filtering effect we can exploit. SSDDLF, which uses the locality of the current vector, also outperforms both SSDDRP and SSDDBase. Note that SSDDLF is better than SSDDGF below 350 dimensions (about 5% of the total dimensions) and worse than SSDDGF at 350 dimensions and beyond. This crossover point of 350 dimensions was observed empirically; it arises because only a small number of dimensions have a significant influence on the locality of the current vector. Finally, SSDDHF, which takes advantage of both SSDDLF and SSDDGF, shows the best performance for all dimensions. In Figure 3(a), we note that the execution times of SSDDLF and SSDDHF slightly increase as the number of dimensions increases. The reason is that, as the number of dimensions increases, the filtering effect grows relatively slowly, whereas the overhead of obtaining a current/difference vector and choosing dimensions from that vector grows relatively quickly.
Second, in Figure 3(b), we set the number of dimensions to 70 (1% of the total dimensions) and vary the tolerance from 0.95 to 0.75 in steps of 0.05. Note that the closer the tolerance is to 1.0, the stricter the similarity criterion becomes. As shown in Figure 3(b), all of the proposed protocols significantly improve the performance compared with SSDDBase. In particular, SSDDLF and SSDDHF, which exploit the locality, perform better than the other two proposed protocols. We note here that as the tolerance decreases, the execution times of all proposed protocols gradually increase. This is because the smaller the tolerance we use, the more similar documents we obtain. In summary, in Figure 3 the proposed SSDDLF and SSDDHF outperform SSDDBase by up to 726.6 and 9,858 times, respectively.
Figure 4 shows the experimental results for NIPS. As in Figure 3 for KOS, we measure the execution time of SSDD by varying the number of dimensions and the tolerance. In Figure 4(a), we set the tolerance to 0.80 and increase the number of dimensions from 120 (1%) to 600 (5%) in steps of 120 (1%), where 120 is about 1% of the total of 12,419 dimensions. Next, in Figure 4(b), we set the number of dimensions to 120 and decrease the tolerance from 0.95 to 0.75 in steps of 0.05. The experimental results in Figures 4(a) and 4(b) show trends very similar to those in Figures 3(a) and 3(b). That is, all of the proposed protocols significantly outperform SSDDBase, and SSDDHF shows the best performance. In Figure 4, SSDDHF improves the performance compared with SSDDBase by up to 16,620 times.
Figure 4: Experimental results for NIPS. (a) Different numbers of dimensions (tolerance = 0.80). (b) Different tolerances (number of dimensions = 120).
Figure 5 shows the results of the scalability test using the large-volume, high-dimensional dataset, EMAILS. We set the tolerance and number of dimensions to 0.80 and 70, respectively, and increase the number of documents (emails) from 40 (0.1%) to 39,861 (100%) in tenfold steps. In this experiment, we exclude the results of SSDDBase, SSDDRP, and SSDDGF for the case of 39,861 documents owing to the excessive execution time. As shown in Figure 5, as with the results for KOS and NIPS, our feature selection-based protocols outperform SSDDBase in all cases, and in particular, SSDDLF and SSDDHF show the best performance regardless of the number of documents. We also note that all of the proposed protocols show a pseudo-linear trend in the number of documents (note that both axes are on log scales). That is, the execution time of the protocols grows pseudo-linearly with the number of documents, which indicates that they perform well in terms of both scalability and performance.
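One way to check a pseudo-linear trend on a log-log plot is to fit the slope of log(time) against log(number of documents); a slope close to 1 indicates roughly linear growth. The sketch below assumes an ordinary least-squares fit, and the sizes and timings in the usage lines are synthetic values generated from a formula, not measurements from this paper. The function name is hypothetical.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic illustration only (hypothetical values, not experimental results).
sizes = [40, 400, 4000, 40000]
linear_times = [0.002 * n for n in sizes]         # grows linearly with size
quadratic_times = [0.002 * n * n for n in sizes]  # grows quadratically with size
linear_slope = loglog_slope(sizes, linear_times)        # close to 1
quadratic_slope = loglog_slope(sizes, quadratic_times)  # close to 2
```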
6. Conclusions
In this paper, we presented an efficient method for significantly reducing the computation and communication overhead in secure similar document detection. The contributions of the paper can be summarized as follows. First, we thoroughly analyzed the previous one-step protocol and pointed out that it incurs a serious performance overhead for high-dimensional document vectors. Second, to alleviate the overhead, we presented the feature selection-based two-step protocol and formally proved its correctness. Third, to improve the filtering efficiency of the two-step protocol, we proposed four feature selections: the RP for selecting features randomly, the LF for exploiting the locality of a current vector, the GF for exploiting the globality of all document vectors, and the HF for considering both the locality and globality. Fourth, for each feature selection, we presented a formal protocol and analyzed its security and complexity. Fifth, through experiments on three real datasets, we showed that all of the proposed protocols significantly outperformed the base protocol, and in particular, the HF-based secure protocol improved the performance by three to four orders of magnitude. As future work, we will consider two issues: the use of feature extraction (feature creation) instead of feature selection for dimensionality reduction and the use of homomorphic encryption rather than a random matrix for a secure scalar product.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (no. R7117160214, Development of an Intelligent Sampling and Filtering Techniques for Purifying Data Streams).
References
[1] D. Sorokina, J. Gehrke, S. Warner, and P. Ginsparg, “Plagiarism detection in arXiv,” in Proceedings of the 6th IEEE International Conference on Data Mining, pp. 1070–1075, Hong Kong, December 2006.
[2] W. Jiang, M. Murugesan, C. Clifton, and L. Si, “Similar document detection with limited information disclosure,” in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE '08), pp. 735–743, Cancun, Mexico, April 2008.
[3] A. Stavrianou, P. Andritsos, and N. Nicoloyannis, “Overview and semantic issues of text mining,” SIGMOD Record, vol. 36, no. 3, pp. 23–34, 2007.
[4] C. C. Aggarwal and P. S. Yu, “Privacy-preserving data mining: a survey,” Handbook of Database Security: Applications and Trends, pp. 431–460, 2008.
[5] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 439–450, 2000.
[6] P. R. Bhaladhare and D. C. Jinwala, “Novel approaches for privacy preserving data mining in k-anonymity model,” Journal of Information Science and Engineering, vol. 32, no. 1, pp. 63–78, 2016.
[7] S. Buyrukbilen and S. Bakiras, “Secure similar document detection with simhash,” in Proceedings of the Workshop on VLDBSecure Data Management (SDM '13), pp. 61–75, Trento, Italy, August 2013.
[8] Y. Peng, G. Kou, Y. Shi, and Z. Chen, “Privacy-preserving data mining for medical data: application of data partition methods,” in Communications and Discoveries from Multidisciplinary Data, vol. 123 of Studies in Computational Intelligence, pp. 331–340, Springer, Berlin, Germany, 2008.
[9] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, “Tools for privacy preserving distributed data mining,” ACM SIGKDD Explorations Newsletter, vol. 4, no. 2, pp. 28–34, 2002.
[10] B. Pinkas, “Cryptography techniques for privacy-preserving data mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 12–19, 2002.
[11] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, pp. 245–250, ACM SIGKDD, San Francisco, Calif, USA, August 2001.
[12] D. Cai, X. He, and J. Han, “SRDA: an efficient algorithm for large-scale discriminant analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 1–12, 2008.
[13] E. Bertino, D. Lin, and W. Jiang, “A survey of quantification of privacy preserving data mining algorithms,” in Privacy-Preserving Data Mining, C. C. Aggarwal and P. S. Yu, Eds., vol. 34, pp. 183–205, Kluwer Academic, Norwell, Mass, USA, 2008.
[14] Y.-S. Moon, H.-S. Kim, S.-P. Kim, and E. Bertino, “Publishing time-series data under preservation of privacy and distance orders,” in Proceedings of the 21st International Conference on Database and Expert Systems Applications, Part II, pp. 17–31, Bilbao, Spain, August 2010.
[15] H.-S. Won, S.-P. Kim, S. Lee, M.-J. Choi, and Y.-S. Moon, “Secure principal component analysis in multiple distributed nodes,” Security and Communication Networks, vol. 9, no. 14, pp. 2348–2358, 2016.
[16] M. Shah and H. D. Joshi, “Privacy preserving data mining techniques in a distributed environment,” International Journal of Computer Applications, vol. 94, no. 6, pp. 21–27, 2014.
[17] W. Jiang and B. K. Samanthula, “N-gram based secure similar document detection,” in Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy, pp. 239–246, Richmond, Va, USA, July 2011.
[18] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen, “On secure scalar product computation for privacy-preserving data mining,” in Proceedings of the 7th Annual International Conference in Information Security & Cryptology, pp. 104–120, Seoul, Republic of Korea, December 2004.
[19] S. Berchtold, C. Böhm, and H.-P. Kriegel, “The pyramid-technique,” ACM SIGMOD Record, vol. 27, no. 2, pp. 142–153, 1998.
[20] Y.-S. Moon, B.-S. Kim, M. S. Kim, and K.-Y. Whang, “Scaling-invariant boundary image matching using time-series matching techniques,” Data and Knowledge Engineering, vol. 69, no. 10, pp. 1022–1042, 2010.
[21] Y.-S. Moon and W.-K. Loh, “Triangular inequality-based rotation-invariant boundary image matching for smart devices,” Multimedia Systems, vol. 21, no. 1, pp. 15–28, 2014.
[22] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[23] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 412–420, Nashville, Tenn, USA, July 1997.
[24] Y.-S. Moon and B. S. Lee, “Safe MBR-transformation in similar sequence matching,” Information Sciences, vol. 270, pp. 28–40, 2014.
[25] W. Han, J. Lee, Y. Moon, S. Hwang, and H. Yu, “A new approach for processing ranked subsequence matching based on ranked union,” in Proceedings of the ACM SIGMOD International Conference on Management of data (SIGMOD '11), pp. 457–468, Athens, Greece, June 2011.
[26] Bag of Words Data Sets, UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.
Copyright
Copyright © 2017 SangPil Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.