The Scientific World Journal
Volume 2014, Article ID 135641, 13 pages
http://dx.doi.org/10.1155/2014/135641
Review Article

Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

Parnia Samimi and Sri Devi Ravana
Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia

Received 30 December 2013; Accepted 8 April 2014; Published 19 May 2014

Academic Editors: L. Li, L. Sanchez, and F. Yu

Copyright © 2014 Parnia Samimi and Sri Devi Ravana. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

  1. E. M. Voorhees, “The philosophy of information retrieval evaluation,” in Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370, Springer, 2002.
  2. C. Cleverdon, “The Cranfield tests on index language devices,” Aslib Proceedings, vol. 19, no. 6, pp. 173–194, 1967.
  3. S. I. Moghadasi, S. D. Ravana, and S. N. Raman, “Low-cost evaluation techniques for information retrieval systems: a review,” Journal of Informetrics, vol. 7, no. 2, pp. 301–312, 2013.
  4. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley, 2011.
  5. J. Howe, “The rise of crowdsourcing,” Wired Magazine, vol. 14, no. 6, pp. 1–4, 2006.
  6. Y. Zhao and Q. Zhu, “Evaluation on crowdsourcing research: current status and future direction,” Information Systems Frontiers, 2012.
  7. R. Munro, S. Bethard, V. Kuperman et al., “Crowdsourcing and language studies: the new generation of linguistic data,” in Proceedings of the Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 122–130, Association for Computational Linguistics, 2010.
  8. V. Ambati, S. Vogel, and J. Carbonell, “Active learning and crowd-sourcing for machine translation,” in Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), pp. 2169–2174, 2010.
  9. C. Callison-Burch, “Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 1, pp. 286–295, Association for Computational Linguistics, August 2009.
  10. K. T. Stolee and S. Elbaum, “Exploring the use of crowdsourcing to support empirical studies in software engineering,” in Proceedings of the 4th International Symposium on Empirical Software Engineering and Measurement (ESEM '10), no. 35, ACM, September 2010.
  11. D. R. Choffnes, F. E. Bustamante, and Z. Ge, “Crowdsourcing service-level network event monitoring,” in Proceedings of the ACM SIGCOMM 2010 Conference (SIGCOMM '10), pp. 387–398, ACM, September 2010.
  12. A. Brew, D. Greene, and P. Cunningham, “Using crowdsourcing and active learning to track sentiment in online media,” in Proceedings of the European Conference on Artificial Intelligence (ECAI '10), pp. 145–150, 2010.
  13. R. Holley, “Crowdsourcing and social engagement: potential, power and freedom for libraries and users,” 2009.
  14. D. C. Brabham, “Crowdsourcing the public participation process for planning projects,” Planning Theory, vol. 8, no. 3, pp. 242–262, 2009.
  15. O. Alonso, D. E. Rose, and B. Stewart, “Crowdsourcing for relevance evaluation,” ACM SIGIR Forum, vol. 42, no. 2, pp. 9–15, 2008.
  16. CrowdFlower, http://www.crowdflower.com/.
  17. G. Paolacci, J. Chandler, and P. G. Ipeirotis, “Running experiments on Amazon Mechanical Turk,” Judgment and Decision Making, vol. 5, no. 5, pp. 411–419, 2010.
  18. W. Mason and S. Suri, “Conducting behavioral research on Amazon's Mechanical Turk,” Behavior Research Methods, vol. 44, no. 1, pp. 1–23, 2012.
  19. Y. Pan and E. Blevis, “A survey of crowdsourcing as a means of collaboration and the implications of crowdsourcing for interaction design,” in Proceedings of the 12th International Conference on Collaboration Technologies and Systems (CTS '11), pp. 397–403, May 2011.
  20. O. Alonso and S. Mizzaro, “Using crowdsourcing for TREC relevance assessment,” Information Processing and Management, vol. 48, no. 6, pp. 1053–1066, 2012.
  21. J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
  22. J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
  23. K. Krippendorff, “Estimating the reliability, systematic error and random error of interval data,” Educational and Psychological Measurement, vol. 30, no. 1, pp. 61–70, 1970.
  24. C. Eickhoff and A. P. de Vries, “Increasing cheat robustness of crowdsourcing tasks,” Information Retrieval, vol. 16, no. 2, pp. 121–137, 2013.
  25. G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling, “Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 205–214, ACM, July 2011.
  26. O. Alonso, “Implementing crowdsourcing-based relevance experimentation: an industrial perspective,” Information Retrieval, vol. 16, no. 2, pp. 101–120, 2013.
  27. M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, “Quality control in crowdsourcing systems: issues and directions,” IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013.
  28. O. Alonso and R. Baeza-Yates, “Design and implementation of relevance assessments using crowdsourcing,” in Advances in Information Retrieval, pp. 153–164, Springer, 2011.
  29. S. Khanna, A. Ratan, J. Davis, and W. Thies, “Evaluating and improving the usability of Mechanical Turk for low-income workers in India,” in Proceedings of the 1st ACM Symposium on Computing for Development (DEV '10), no. 12, ACM, December 2010.
  30. A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with Mechanical Turk,” in Proceedings of the 26th Annual CHI Conference on Human Factors in Computing Systems (CHI '08), pp. 453–456, ACM, April 2008.
  31. M. Hirth, T. Hoßfeld, and P. Tran-Gia, “Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms,” Mathematical and Computer Modelling, vol. 57, no. 11-12, pp. 2918–2932, 2013.
  32. G. Kazai, J. Kamps, and N. Milic-Frayling, “An analysis of human factors and label accuracy in crowdsourcing relevance judgments,” Information Retrieval, vol. 16, no. 2, pp. 138–178, 2013.
  33. D. E. Difallah, G. Demartini, and P. Cudré-Mauroux, “Mechanical cheat: spamming schemes and adversarial techniques on crowdsourcing platforms,” in Proceedings of the CrowdSearch Workshop, Lyon, France, 2012.
  34. L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler, “Reputation systems for open collaboration,” Communications of the ACM, vol. 54, no. 8, pp. 81–87, 2011.
  35. G. Kazai, J. Kamps, and N. Milic-Frayling, “The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2583–2586, ACM, 2012.
  36. L. Hammon and H. Hippner, “Crowdsourcing,” Wirtschaftsinformatik, vol. 54, no. 3, pp. 165–168, 2012.
  37. J. Ross, L. Irani, M. S. Silberman, A. Zaldivar, and B. Tomlinson, “Who are the crowdworkers? Shifting demographics in Mechanical Turk,” in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 2863–2872, ACM, April 2010.
  38. P. Ipeirotis, “Demographics of Mechanical Turk,” 2010.
  39. G. Kazai, “In search of quality in crowdsourcing for search engine evaluation,” in Advances in Information Retrieval, pp. 165–176, Springer, 2011.
  40. M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso, “An evaluation framework for plagiarism detection,” in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 997–1005, Association for Computational Linguistics, August 2010.
  41. W. Mason and D. J. Watts, “Financial incentives and the performance of crowds,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 2, pp. 100–108, 2010.
  42. J. Heer and M. Bostock, “Crowdsourcing graphical perception: using Mechanical Turk to assess visualization design,” in Proceedings of the 28th Annual CHI Conference on Human Factors in Computing Systems (CHI '10), pp. 203–212, ACM, April 2010.
  43. C. Grady and M. Lease, “Crowdsourcing document relevance assessment with Mechanical Turk,” in Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 172–179, Association for Computational Linguistics, 2010.
  44. J. Chen, N. Menezes, J. Bradley, and T. A. North, “Opportunities for crowdsourcing research on Amazon Mechanical Turk,” Human Factors, vol. 5, no. 3, 2011.
  45. A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, “CrowdForge: crowdsourcing complex work,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology (UIST '11), pp. 43–52, ACM, October 2011.
  46. P. Clough, M. Sanderson, J. Tang, T. Gollins, and A. Warner, “Examining the limits of crowdsourcing for relevance assessment,” IEEE Internet Computing, vol. 17, no. 4, pp. 32–38, 2013.
  47. O. Scekic, H. L. Truong, and S. Dustdar, “Modeling rewards and incentive mechanisms for social BPM,” in Business Process Management, pp. 150–155, Springer, 2012.
  48. S. P. Dow, B. Bunge, T. Nguyen, S. R. Klemmer, A. Kulkarni, and B. Hartmann, “Shepherding the crowd: managing and providing feedback to crowd workers,” in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1669–1674, ACM, May 2011.
  49. T. W. Malone, R. Laubacher, and C. Dellarocas, “Harnessing crowds: mapping the genome of collective intelligence,” MIT Sloan School Working Paper 4732-09, 2009.
  50. G. Kazai, J. Kamps, and N. Milic-Frayling, “Worker types and personality traits in crowdsourcing relevance labels,” in Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 1941–1944, ACM, October 2011.
  51. J. Vuurens, A. P. de Vries, and C. Eickhoff, “How much spam can you take? an analysis of crowdsourcing results to increase accuracy,” in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR ’11), pp. 21–26, 2011.
  52. J. B. P. Vuurens and A. P. de Vries, “Obtaining high-quality relevance judgments using crowdsourcing,” IEEE Internet Computing, vol. 16, no. 5, pp. 20–27, 2012.
  53. W. Tang and M. Lease, “Semi-supervised consensus labeling for crowdsourcing,” in Proceedings of the SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR '11), 2011.
  54. J. Le, A. Edmonds, V. Hester, and L. Biewald, “Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution,” in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 21–26, 2010.
  55. R. M. McCreadie, C. Macdonald, and I. Ounis, “Crowdsourcing a news query classification dataset,” in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 31–38, 2010.
  56. T. Xia, C. Zhang, J. Xie, and T. Li, “Real-time quality control for crowdsourcing relevance evaluation,” in Proceedings of the 3rd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC '12), pp. 535–539, 2012.
  57. G. Zuccon, T. Leelanupab, S. Whiting, E. Yilmaz, J. M. Jose, and L. Azzopardi, “Crowdsourcing interactions: using crowdsourcing for evaluating interactive information retrieval systems,” Information Retrieval, vol. 16, no. 2, pp. 267–305, 2013.
  58. D. Zhu and B. Carterette, “An analysis of assessor behavior in crowdsourced preference judgments,” in Proceedings of the SIGIR Workshop on Crowdsourcing for Search Evaluation, pp. 17–20, 2010.
  59. L. Von Ahn, M. Blum, N. J. Hopper et al., “CAPTCHA: using hard AI problems for security,” in Advances in Cryptology—EUROCRYPT 2003, pp. 294–311, Springer, 2003.
  60. L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, “reCAPTCHA: human-based character recognition via web security measures,” Science, vol. 321, no. 5895, pp. 1465–1468, 2008.
  61. V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? Improving data quality and data mining using multiple, noisy labelers,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pp. 614–622, ACM, August 2008.
  62. P. Welinder and P. Perona, “Online crowdsourcing: rating annotators and obtaining cost-effective labels,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops (CVPRW '10), pp. 25–32, June 2010.
  63. P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang, “Repeated labeling using multiple noisy labelers,” Data Mining and Knowledge Discovery, vol. 28, no. 2, pp. 402–441, 2014.
  64. M. Hosseini, I. J. Cox, N. Milić-Frayling, G. Kazai, and V. Vinay, “On aggregating labels from multiple crowd workers to infer relevance of documents,” in Advances in Information Retrieval, pp. 182–194, Springer, 2012.
  65. P. G. Ipeirotis, F. Provost, and J. Wang, “Quality management on Amazon Mechanical Turk,” in Proceedings of the Human Computation Workshop (HCOMP '10), pp. 64–67, July 2010.
  66. B. Carpenter, “Multilevel Bayesian models of categorical data annotation,” 2008.
  67. V. C. Raykar, S. Yu, L. H. Zhao et al., “Learning from crowds,” The Journal of Machine Learning Research, vol. 11, pp. 1297–1322, 2010.
  68. R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, Association for Computational Linguistics, October 2008.
  69. A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer error-rates using the EM algorithm,” Applied Statistics, vol. 28, pp. 20–28, 1979.
  70. R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” Advances in Neural Information Processing Systems, vol. 20, pp. 1257–1264, 2008.
  71. H. J. Jung and M. Lease, “Inferring missing relevance judgments from crowd workers via probabilistic matrix factorization,” in Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1095–1096, ACM, 2012.
  72. H. J. Jung and M. Lease, “Improving quality of crowdsourced labels via probabilistic matrix factorization,” in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.
  73. A. J. Quinn and B. B. Bederson, “Human computation: a survey and taxonomy of a growing field,” in Proceedings of the 29th Annual CHI Conference on Human Factors in Computing Systems (CHI '11), pp. 1403–1412, ACM, May 2011.
  74. A. Kulkarni, M. Can, and B. Hartmann, “Collaboratively crowdsourcing workflows with Turkomatic,” in Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW '12), pp. 1003–1012, February 2012.
  75. M. Soleymani and M. Larson, “Crowdsourcing for affective annotation of video: development of a viewer-reported boredom corpus,” in Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Search Evaluation (CSE '10), pp. 4–8, 2010.