Scientific Programming
Volume 2017, Article ID 4504589, 11 pages
Research Article

Information-Balance-Aware Approximated Summarization of Data Provenance

Department of Computer Science and Technology, Tsinghua University, Beijing, China

Correspondence should be addressed to Jisheng Pei; pjs07@mails.tsinghua.edu.cn

Received 28 February 2017; Revised 21 April 2017; Accepted 2 May 2017; Published 12 September 2017

Academic Editor: Chi-Hung Chi

Copyright © 2017 Jisheng Pei and Xiaojun Ye. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Extracting useful knowledge from data provenance has been challenging because provenance information is often too voluminous for users to understand. Recently, it has been proposed that provenance can be summarized by grouping semantically related provenance annotations to obtain a concise representation. Users may specify their intended use of the provenance data in terms of provisioning, and the summarization can then be optimized for smaller size and for a smaller distance between the provisioning results derived from the summary and those derived from the original provenance. However, beyond the intended provisioning use, we observe that more specific and diverse user requirements can be expressed and taken into account in the summarization process by assigning importance weights to provenance elements. Moreover, we introduce the information balance index (IBI), an entropy-based measure, to dynamically evaluate the amount of information retained by the summary and to check how well it suits user requirements. We present an alternative provenance summarization algorithm that supports manipulation of information balance. Case studies and experiments show that, during the summarization process, information balance can be effectively steered towards user-defined goals, and that requirement-driven variants of provenance summaries can be produced to support a range of interesting scenarios.