Scientific Programming
Volume 2017 (2017), Article ID 3072813, 9 pages
https://doi.org/10.1155/2017/3072813
Research Article

Cross-Checking Multiple Data Sources Using Multiway Join in MapReduce

National Technical University of Athens, Athens, Greece

Correspondence should be addressed to Zaid Momani; zed_momany@yahoo.com

Received 27 May 2017; Revised 30 August 2017; Accepted 27 September 2017; Published 20 November 2017

Academic Editor: Marco Aldinucci

Copyright © 2017 Foto Afrati et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. F. T. Juster and J. P. Smith, “Improving the quality of economic data: lessons from the HRS and AHEAD,” Journal of the American Statistical Association, vol. 92, no. 440, pp. 1268–1278, 1997.
  2. J. W. Graham, “Missing data analysis: making it work in the real world,” Annual Review of Psychology, vol. 60, pp. 549–576, 2009.
  3. A. C. Acock, “Working with missing values,” Journal of Marriage and Family, vol. 67, no. 4, pp. 1012–1028, 2005.
  4. A. Holzinger, M. Dehmer, and I. Jurisica, “Knowledge discovery and interactive data mining in bioinformatics—state-of-the-art, future challenges and research directions,” BMC Bioinformatics, vol. 15, no. 6, pp. 1–9, 2014.
  5. S. F. Messner, “Exploring the consequences of erratic data reporting for cross-national research on homicide,” Journal of Quantitative Criminology, vol. 8, no. 2, pp. 155–173, 1992.
  6. A. M. Wood, I. R. White, and S. G. Thompson, “Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals,” Clinical Trials, vol. 1, no. 4, pp. 368–376, 2004.
  7. J. W. Grzymala-Busse and M. Hu, “A comparison of several approaches to missing attribute values in data mining,” in Rough Sets and Current Trends in Computing, pp. 378–385, Springer, 2001.
  8. B. Padmanabhan, Z. Zheng, and S. O. Kimbrough, “Personalization from incomplete data: what you don't know can hurt,” in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 154–163, ACM, August 2001.
  9. M. Magnani, “Techniques for dealing with missing data in knowledge discovery tasks,” Obtido, vol. 15, no. 1, article 2007, 2004.
  10. X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava, “Truth finding on the deep web: is the problem solved?” Proceedings of the VLDB Endowment, vol. 6, no. 2, pp. 97–108, 2012.
  11. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 39, no. 1, pp. 1–38, 1977.
  12. X. L. Dong, L. Berti-Equille, and D. Srivastava, “Integrating conflicting data: the role of source dependence,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 550–561, 2009.
  13. F. Afrati, Z. Momani, and N. Stasinopoulos, “Cross-checking data sources in MapReduce,” in New Trends in Databases and Information Systems, vol. 539 of Communications in Computer and Information Science, pp. 165–174, Springer International Publishing, Cham, Switzerland, 2015.
  14. F. N. Afrati, D. Delorey, M. Pasumansky, and J. D. Ullman, “Storing and querying tree-structured records in Dremel,” Proceedings of the VLDB Endowment, vol. 7, no. 12, pp. 1131–1142, 2014.
  15. S. Melnik, A. Gubarev, J. J. Long et al., “Dremel: Interactive analysis of web-scale datasets,” Communications of the ACM, vol. 54, no. 6, pp. 114–123, 2011.
  16. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  17. F. N. Afrati and J. D. Ullman, “Optimizing joins in a map-reduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '10), pp. 99–110, ACM, March 2010.
  18. C. Doulkeridis and K. Nørvåg, “A survey of large-scale analytical query processing in MapReduce,” The VLDB Journal, vol. 23, no. 3, pp. 355–380, 2014.
  19. R. Vernica, M. J. Carey, and C. Li, “Efficient parallel set-similarity joins using MapReduce,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 495–506, ACM, June 2010.
  20. Y. Kim and K. Shim, “Parallel top-k similarity join algorithms using MapReduce,” in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 510–521, IEEE, April 2012.
  21. A. Metwally and C. Faloutsos, “V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors,” Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 704–715, 2012.
  22. R. Baraglia, G. De Francisci Morales, and C. Lucchese, “Document similarity self-join with MapReduce,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 731–736, IEEE, December 2010.
  23. Y. N. Silva, J. M. Reed, and L. M. Tsosie, “MapReduce-based similarity join for metric spaces,” in Proceedings of the 1st International Workshop on Cloud Intelligence, p. 3, ACM, August 2012.
  24. F. N. Afrati, A. D. Sarma, D. Menestrina, A. Parameswaran, and J. D. Ullman, “Fuzzy joins using MapReduce,” in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 498–509, IEEE, Washington, DC, USA, April 2012.
  25. X. L. Dong, E. Gabrilovich, G. Heitz et al., “From data fusion to knowledge fusion,” in Proceedings of the VLDB Endowment, vol. 7, no. 10, pp. 881–892, June 2014.
  26. L. Kolb and E. Rahm, “Parallel entity resolution with dedoop,” Datenbank-Spektrum, vol. 13, no. 1, pp. 23–32, 2013.
  27. L. Kolb, A. Thor, and E. Rahm, “Don't match twice: redundancy-free similarity computation with MapReduce,” in Proceedings of the 2nd Workshop on Data Analytics in the Cloud, pp. 1–5, ACM, June 2013.
  28. H. Garcia-Molina, J. D. Ullman, and J. Widom, Database Systems—The Complete Book, Pearson Education, 2nd edition, 2009.
  29. Oracle, Class Random Java Documentation, https://docs.oracle.com/javase/7/docs/api/java/util/Random.html.
  30. F. N. Afrati, A. D. Sarma, A. Rajaraman, P. Rule, S. Salihoglu, and J. Ullman, “Anchor-points algorithms for hamming and edit distances using mapreduce,” in Proceedings of the 17th International Conference on Database Theory (ICDT '14), pp. 4–14, Athens, Greece, March 2014.
  31. U.S. General Services Administration, U.S. government's open data, 2013, http://www.data.gov/.