Security and Communication Networks
Volume 2017, Article ID 6841216, 12 pages
https://doi.org/10.1155/2017/6841216
Research Article

Efficient Two-Step Protocol and Its Discriminative Feature Selections in Secure Similar Document Detection

1 Department of Computer Science, Kangwon National University, 1 Kangwondaehak-gil, Chuncheon-si, Gangwon 24341, Republic of Korea
2 Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Republic of Korea

Correspondence should be addressed to Yang-Sae Moon; ysmoon@kangwon.ac.kr

Received 27 July 2016; Revised 31 January 2017; Accepted 6 February 2017; Published 28 March 2017

Academic Editor: Kai Rannenberg

Copyright © 2017 Sang-Pil Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The risk of information disclosure has been increasing significantly in recent years. Accordingly, privacy-preserving data mining (PPDM) is being actively studied to obtain accurate mining results while preserving data privacy. We focus here on secure similar document detection (SSDD), which identifies similar documents held by two parties without either party disclosing its own sensitive documents to the other. In this paper, we propose an efficient two-step protocol that uses feature selection as a lower-dimensional transformation, and we present discriminative feature selections that maximize the performance of the protocol. The proposed protocol consists of two steps: a filtering step and a postprocessing step. For the feature selection, we first consider the simplest method, random projection (RP), and propose its two-step solution, SSDD-RP. We then present two discriminative feature selections and their solutions: SSDD-LF, which selects a few dimensions that are locally frequent in the current query vector, and SSDD-GF, which selects dimensions that are globally frequent in the set of all document vectors. Finally, we propose a hybrid solution, SSDD-HF, which takes advantage of both SSDD-LF and SSDD-GF. We empirically show that the proposed two-step protocol outperforms the previous one-step protocol by three or four orders of magnitude.
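
The filter-then-postprocess idea can be illustrated with a minimal plaintext sketch. It is not the protocol itself: it omits every cryptographic component, and it assumes Euclidean distance over term-weight vectors, a tolerance `eps`, and a reading of "frequent" as "largest weight", none of which the abstract specifies. All function names are hypothetical. Under these assumptions, a distance computed over a subset of dimensions never exceeds the full distance, so the filtering step can discard documents without losing any true match.

```python
import math

def select_local_frequent(query, k):
    """Assumed LF-style choice: the k dimensions with the largest weight in the query."""
    return sorted(range(len(query)), key=lambda i: query[i], reverse=True)[:k]

def select_global_frequent(docs, k):
    """Assumed GF-style choice: the k dimensions with the largest total weight over all documents."""
    totals = [sum(column) for column in zip(*docs)]
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)[:k]

def distance(u, v, dims=None):
    """Euclidean distance, restricted to the dimensions in dims when given."""
    dims = range(len(u)) if dims is None else dims
    return math.sqrt(sum((u[i] - v[i]) ** 2 for i in dims))

def two_step_search(query, docs, dims, eps):
    """Return (matches, candidates) for tolerance eps.

    Step 1 (filtering): keep documents whose reduced distance is within eps.
    The reduced distance is a lower bound of the full distance, so no
    document within eps of the query is discarded here.
    Step 2 (postprocessing): verify the candidates with the full distance.
    """
    candidates = [j for j, d in enumerate(docs) if distance(query, d, dims) <= eps]
    matches = [j for j in candidates if distance(query, docs[j]) <= eps]
    return matches, candidates

# Toy data, invented for illustration only.
query = [3, 0, 1, 0]
docs = [[3, 0, 1, 0], [0, 4, 0, 0], [3, 0, 0, 2]]
dims = select_local_frequent(query, k=2)
matches, candidates = two_step_search(query, docs, dims, eps=1.5)
```

In this toy run the filtering step prunes one of the three documents using only two dimensions, and the postprocessing step removes the remaining false candidate. In a secure setting, the savings would come from running the expensive secure computation on full-dimensional vectors only for the surviving candidates.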