A Dataset for Evaluating Query Suggestion Algorithms in Information Retrieval

Published in: Proceedings of the 27th International Conference on Software, Telecommunications and Computer Networks (SoftCOM 2019), Split, Croatia, September 19-21, 2019, pp. 1-6. DOI: 10.23919/SOFTCOM.2019.8903906.

Authors

Ioan Bădărînză, Adrian Sterca, Darius Bufnea
Department of Computer Science, Faculty of Mathematics and Computer Science, Babeș-Bolyai University of Cluj-Napoca, Romania

Copyright

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

This paper presents a dataset for evaluating query suggestion algorithms in textual information retrieval. The dataset is public and offered free of charge to the information retrieval research community. The data was gathered in an experiment that lasted more than two months and involved 119 users, mainly faculty students. The dataset contains the web browsing history and query history (queries submitted to the Google search engine) of all these users. The data is indexed in a database and is downloadable as a database dump. The dataset is well suited for evaluating general query suggestion algorithms on their own (in a standalone manner) or against Google's MPC (Most Popular Completion) query suggestion algorithm. At the same time, the dataset supports building and testing personalized query suggestion algorithms that take the user's context/profile into account when computing query suggestions.
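The MPC baseline mentioned above (Most Popular Completion) ranks the candidate completions of a typed prefix by how frequently each completed query appears in a query log. A minimal sketch of this idea follows; the function name and the sample log are illustrative only and are not taken from the dataset's actual schema.

```python
from collections import Counter

def mpc_suggest(query_log, prefix, k=3):
    """Most Popular Completion: return the k past queries starting
    with the typed prefix, ordered by submission frequency."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(k)]

# Illustrative query log (not drawn from the dataset itself).
log = ["weather today", "weather tomorrow", "weather today",
       "web browser", "weather today"]
print(mpc_suggest(log, "wea"))  # ['weather today', 'weather tomorrow']
```

A dataset such as the one described here lets this frequency-based baseline be compared against personalized algorithms that also weigh the individual user's browsing history.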

Key words

dataset; query suggestion; information retrieval; search engine.

BibTeX bib file

bada2019.bib

References

  1. J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Clustering user queries of a search engine, Proceedings of the 10th International Conference on World Wide Web ser. WWW ’01, pp. 162-168, 2001.
  2. B. J. Jansen, A. Spink, T. Saracevic, Real life real users and real needs: A study and analysis of user queries on the web, Inf. Process. Manage., vol. 36, no. 2, pp. 207-227, Jan. 2000.
  3. H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma, Probabilistic query expansion using query logs, Proceedings of the 11th International Conference on World Wide Web ser. WWW ’02, pp. 325-332, 2002.
  4. M. Sanderson, Ambiguous queries: Test collections need more sense, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval ser. SIGIR ’08, pp. 499-506, 2008.
  5. L. Li, H. Deng, A. Dong, Y. Chang, H. Zha, R. Baeza-Yates, Analyzing user’s sequential behavior in query auto-completion via Markov processes, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval ser. SIGIR ’15, pp. 123-132, 2015.
  6. Z. Bar-Yossef, N. Kraus, Context-sensitive query auto-completion, Proceedings of the 20th International Conference on World Wide Web ser. WWW ’11, pp. 107-116, 2011.
  7. C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2009.
  8. D. D. Lewis, Y. Yang, T. G. Rose, F. Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res., vol. 5, pp. 361-397, Dec. 2004.
  9. G. Pass, A. Chowdhury, C. Torgeson, A picture of search, Proceedings of the 1st International Conference on Scalable Information Systems ser. InfoScale ’06, 2006.
  10. Text REtrieval Conference (TREC) data, May 2019, [online] Available: https://trec.nist.gov/data.html.
  11. The ClueWeb09 dataset, May 2019, [online] Available: http://lemurproject.org/clueweb09/.
  12. The ClueWeb12 dataset, May 2019, [online] Available: http://lemurproject.org/clueweb12/.
  13. Kaggle: Your home for data science, May 2019, [online] Available: https://www.kaggle.com/.
  14. Statcounter global stats: Browser market share worldwide, May 2019, [online] Available: http://gs.statcounter.com/browser-market-share.
  15. W3counter: Web browser market share trends, May 2019, [online] Available: https://www.w3counter.com/trends.
  16. Statista: Search engine market share worldwide, May 2019, [online] Available: https://www.statista.com/statistics/216573/worldwide-market-share-of-search-engines.
  17. Statcounter global stats: Desktop search engine market share worldwide, May 2019, [online] Available: http://gs.statcounter.com/search-engine-market-share/desktop/worldwide.
  18. Search engine market share, May 2019, [online] Available: https://netmarketshare.com/search-engine-market-share.aspx.
  19. V. Niculescu, D. Bufnea, A. Sterca, MPI scaling up for powerlist based parallel programs, Proceedings of the 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2019), pp. 199-204, February 13-15, 2019.
  20. Wordart, May 2019, [online] Available: https://wordart.com/6i76j7rcw0bd/word-art.
  21. J.-Y. Jiang, Y.-Y. Ke, P.-Y. Chien, P.-J. Cheng, Learning user reformulation behavior for query auto-completion, Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval ser. SIGIR ’14, pp. 445-454, 2014.
  22. I. Bădărînză, A. Sterca, F. M. Boian, Using the user’s recent browsing history for personalized query suggestions, 2018 26th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), pp. 1-6, Sep. 2018.
