A New Language Independent Strategy for Clickbait Detection

Published in Proceedings of the 28th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), pp. 1-6, DOI: 10.23919/SoftCOM50211.2020.9238342, September 17-19, 2020, Hvar, Croatia.

Cite as

Full paper

A New Language Independent Strategy for Clickbait Detection

Authors

Claudia Ioana Coste, Darius Bufnea, Virginia Niculescu
Department of Computer Science, Faculty of Mathematics and Computer Science, Babeș-Bolyai University of Cluj-Napoca, Romania

Copyright

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

Clickbait is a bad habit of today’s web publishers, who resort to this technique to deceive web visitors and increase their page views and advertising revenue. Clickbait incidence is also an indicator of fake news, so clickbait detection is a means of fighting the spread of false information. Recently, both the research community and the major actors on the WWW scene, such as social networks and search engines, have turned their attention to this negative phenomenon, which is increasingly present in our everyday browsing experience. Detection techniques are usually based on intelligent classifiers, and feature selection is also of great importance. This paper contributes to clickbait analysis and detection by presenting a new language-independent strategy that considers only general features that are not language-specific. This approach is justified by the need for a higher level of abstraction in clickbait detection, which allows it to be used for articles written in different languages. A complex experiment on a real sample dataset was conducted, and the results obtained are compared with those of the most relevant previous work.
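
To make the idea of features that are not tied to any one language more concrete, the following is a minimal Python sketch. The function name `language_independent_features` and every feature in it are illustrative assumptions chosen for this example; they are not the feature set used in the paper, which the abstract does not enumerate. The sketch relies only on surface properties of a headline (lengths, digits, punctuation, capitalization), so it needs no dictionary, tagger, or language model.

```python
def language_independent_features(headline: str) -> dict:
    """Compute surface features of a headline (hypothetical feature set).

    None of these features depends on the vocabulary or grammar of a
    particular language; they are assumptions made for illustration only.
    """
    words = headline.split()
    letters = [c for c in headline if c.isalpha()]
    return {
        # Overall size of the headline
        "char_count": len(headline),
        "word_count": len(words),
        "avg_word_length": (
            sum(len(w) for w in words) / len(words) if words else 0.0
        ),
        # Numbers and emphatic punctuation
        "digit_count": sum(c.isdigit() for c in headline),
        "exclamation_count": headline.count("!"),
        "question_count": headline.count("?"),
        # Share of alphabetic characters that are uppercase
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        # Whether the headline opens with a digit (e.g. a listicle count)
        "starts_with_digit": headline[:1].isdigit(),
    }


if __name__ == "__main__":
    example = "10 Secrets You Won't Believe!"
    for name, value in language_independent_features(example).items():
        print(f"{name}: {value}")
```

A feature dictionary of this kind could be turned into a numeric vector and passed to any classifier; which classifier and which features work best is an empirical question that this sketch does not address.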

Key words

clickbait detection; features; intelligent classifier; natural language; accuracy.

BibTeX bib file

softcom2020-clickbait.bib

