Measuring and Visualizing the Scrappiness Level of a Website

Published in Proceedings of the 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing – SYNASC 2017, pp. 304-311, DOI: 10.1109/SYNASC.2017.00057, September 21-24, 2017, Timișoara, Romania

Cite as

Full paper

Measuring and Visualizing the Scrappiness Level of a Website

Authors

Darius Bufnea, Diana Șotropa
Department of Computer Science, Faculty of Mathematics and Computer Science,
Babeş-Bolyai University of Cluj-Napoca

Copyright

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

Scraper sites are questionable-quality websites that copy their content, partially or entirely, from other websites and sometimes gain higher search-engine ranking and popularity to the detriment of the original websites. Misleading a user toward a scraper site almost always implies an unhappy, time-consuming user experience, the scraper site being an unnecessary link in the user’s navigation path. In this paper we present a method through which one can numerically measure and quantify the scrappiness level of a website, as well as visually display this level. At the same time, this paper aims to alert the web and research communities to this type of website and to urge action against it.

Key words

scraper site; scrappiness level; link spam; web spam detection; content spam; document similarity; search engine; web search.
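Among the keywords above is document similarity, one building block of any scrappiness measure. As a rough illustration only (not the authors' actual method), the character n-gram overlap between two pages can be turned into a similarity score in [0, 1], where values near 1 suggest heavily copied content:

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    text = " ".join(text.lower().split())  # collapse whitespace, lowercase
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity of two texts' n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

original = "Scraper sites copy their content from other websites."
scraped = "Scraper sites copy content from other sites."
print(round(similarity(original, scraped), 2))
```

The n-gram size and the Jaccard measure are illustrative choices here; the paper itself weighs several similarity approaches (see the survey and n-gram references below).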

BibTeX bib file

bufnea-halita-2017.bib

References

  1. C. Castillo et al., A reference collection for web spam, ACM SIGIR Forum, vol. 40, no. 2. ACM, 2006, pp. 11-24.
  2. D. S. Evans, The online advertising industry: Economics, evolution, and privacy, The journal of economic perspectives, vol. 23, no. 3, pp. 37-60, 2009.
  3. M. Najork, Web spam detection, Encyclopedia of Database Systems. Springer, 2009, pp. 3520-3523.
  4. N. Spirin and J. Han, Survey on web spam detection: principles and algorithms, ACM SIGKDD Explorations Newsletter, vol. 13, no. 2, pp. 50-64, 2012.
  5. R. Patel, Z. Qiu, and C. T. Kwok, Classifying sites as low quality sites, Apr. 7 2015, US Patent 9,002,832.
  6. Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, Combating web spam with trustrank, Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 2004, pp. 576-587.
  7. M. Erdélyi, A. Garzó, and A. A. Benczúr, Web spam classification: a few features worth more, Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality. ACM, 2011, pp. 27-34.
  8. M. Daiyan, S. K. Tiwari, and M. A. Alam, Mining product reviews for spam detection using supervised technique, International Journal of Emerging Technology and Advanced Engineering, vol. 4, no. 8, pp. 619-623, 2014.
  9. G.-G. Geng, Q. Li, and X. Zhang, Link based small sample learning for web spam detection, Proceedings of the 18th international conference on World wide web. ACM, 2009, pp. 1185-1186.
  10. C. Wei et al., Fighting against web spam: a novel propagation method based on click-through data, Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012, pp. 395-404.
  11. C. P. Bharatbhai and K. M. Patel, Analysis of spam link detection algorithm based on hyperlinks, IFRSA International Journal of Data Warehousing & Mining, vol. 4, pp. 67-72, 2014.
  12. L. Araujo and J. Martinez-Romo, Web spam detection: new classification features based on qualified link analysis and language models, IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp. 581-590, 2010.
  13. D. Fetterly, M. Manasse, and M. Najork, Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages, Proceedings of the 7th International Workshop on the Web and Databases: collocated with ACM SIGMOD/PODS 2004. ACM, 2004, pp. 1-6.
  14. Y. Liu et al., Identifying web spam with the wisdom of the crowds, ACM Transactions on the Web (TWEB), vol. 6, no. 1, pp. 2:1-2:30, 2012.
  15. J. Beel and B. Gipp, On the robustness of Google Scholar against spam, Proceedings of the 21st ACM Conference on Hypertext and Hypermedia. ACM, 2010, pp. 297-298.
  16. D. Haliță and D. Bufnea, A study regarding inter domain linked documents similarity and their consequent bounce rate, Studia Universitatis Babeș-Bolyai, Informatica, vol. 59, no. 1, 2014.
  17. N. Poggi, J. L. Berral, T. Moreno, R. Gavalda, and J. Torres, Automatic detection and banning of content stealing bots for e-commerce, NIPS 2007 workshop on machine learning in adversarial environments for computer security, 2007.
  18. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, Detecting spam web pages through content analysis, Proceedings of the 15th international conference on World Wide Web. ACM, 2006, pp. 83-92.
  19. European Commission, Digital Single Market: The EU copyright legislation, https://ec.europa.eu/digital-single-market/en/eu-copyright-legislation, Last visited on 25.05.2017.
  20. G. Gan, C. Ma, and J. Wu, Data clustering: theory, algorithms, and applications. Siam, 2007, vol. 20.
  21. W. H. Gomaa and A. A. Fahmy, A survey of text similarity approaches, International Journal of Computer Applications, vol. 68, no. 13, 2013.
  22. G. Kondrak, N-gram similarity and distance, International Symposium on String Processing and Information Retrieval. Springer, 2005, pp. 115-126.
  23. Squid: Optimising web delivery, http://www.squid-cache.org/, Last visited on 25.05.2017.
