Comparison of Data Models For Unsupervised Twitter Sentiment Analysis

S. Limboi

doi:10.24193/subbi.2022.2.05

S. Limboi, Department of Computer Science, Babes-Bolyai University, 1, M. Kogalniceanu Street, 400084, Cluj-Napoca, Romania

Abstract

Identifying the sentiment of collected tweets is a challenging and interesting task. Mining and defining relevant features that improve the quality of a classification system is equally crucial. The data modeling phase is fundamental to the whole process because it can reveal hidden information in the textual input. This paper defines two models built on Twitter-specific concepts: a hashtag-based representation and a text-based one. The models are compared and integrated into an unsupervised system that groups tweets by sentiment label (positive or negative). Moreover, word-embedding techniques (TF-IDF and frequency vectors) convert the representations into the numeric input required by the clustering methods.
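As an illustration of the two data models and the two vectorizations named above, the following is a minimal sketch and not the paper's implementation. It assumes that the hashtag-based representation keeps only the hashtags of a tweet and that the text-based one keeps the remaining words; the toy tweets, the function names, and the idf variant ln(N / df) are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy tweets; they are not taken from the paper's datasets.
TWEETS = [
    "Loving the new phone #happy #love",
    "Worst service ever #angry #fail",
    "Such a great day #happy",
]


def hashtag_tokens(tweet):
    """Hashtag-based representation (assumed): only the hashtags, without '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]


def text_tokens(tweet):
    """Text-based representation (assumed): the words that are not hashtags."""
    without_tags = re.sub(r"#\w+", " ", tweet)
    return re.findall(r"[a-z']+", without_tags.lower())


def frequency_vectors(docs):
    """Map each tokenized document to raw term counts over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight raw counts by idf = ln(N / df), one common TF-IDF variant."""
    vocab, counts = frequency_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]


hashtag_docs = [hashtag_tokens(t) for t in TWEETS]
text_docs = [text_tokens(t) for t in TWEETS]
vocab, freq = frequency_vectors(hashtag_docs)
_, tfidf = tfidf_vectors(hashtag_docs)
```

Either numeric matrix (`freq` or `tfidf`) could then be passed to a clustering method; terms that occur in many tweets receive a lower TF-IDF weight than in the frequency vectors.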

The experimental results show good values for the Silhouette and Davies-Bouldin measures in the unsupervised setting. A detailed investigation considers several factors (dataset, clustering method, data representation, and word embeddings) to identify the best setup for detecting sentiment in Twitter messages. The analysis and conclusions indicate that these first results can serve as a basis for more complex experiments.
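For readers unfamiliar with the two evaluation measures, the sketch below computes them from their standard definitions with Euclidean distance. It is an illustrative assumption of how such scores can be obtained, not the paper's evaluation code; the function names and the toy points are hypothetical. Silhouette lies in [-1, 1] and higher is better, while for Davies-Bouldin lower is better.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster.
        own = [euclidean(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: coefficient is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index; requires at least two distinct centroids."""
    clusters = sorted(set(labels))
    members = {c: [p for p, l in zip(points, labels) if l == c] for c in clusters}
    centroids = {c: [sum(col) / len(pts) for col in zip(*pts)]
                 for c, pts in members.items()}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {c: sum(euclidean(p, centroids[c]) for p in pts) / len(pts)
               for c, pts in members.items()}
    worst = [
        max((scatter[c] + scatter[d]) / euclidean(centroids[c], centroids[d])
            for d in clusters if d != c)
        for c in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = [0, 0, 1, 1]  # grouping that matches the geometry
bad = [0, 1, 0, 1]   # grouping that cuts across it
```

On these toy points, the `good` labeling scores a higher Silhouette and a lower Davies-Bouldin value than the `bad` one, which is the direction of improvement used when comparing setups.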
