Comparison of Data Models For Unsupervised Twitter Sentiment Analysis
Abstract
Identifying the sentiment of collected tweets is a challenging and interesting task, and mining relevant features that improve the quality of a classification system is crucial. The data modeling phase is fundamental to the whole process, since it can reveal hidden information in the textual inputs. This paper defines two models built on Twitter-specific concepts: a hashtag-based representation and a text-based one. The models are compared and integrated into an unsupervised system that groups tweets by sentiment label (positive or negative). Word-embedding techniques (TF-IDF and frequency vectors) convert the representations into the numeric input required by the clustering methods.
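The two representations and the two vectorizations named above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`hashtag_tokens`, `text_tokens`, `frequency_vectors`, `tfidf_vectors`), the regular expressions, the choice to strip hashtags, mentions and URLs from the text-based view, the unsmoothed `log(N / df)` weighting, and the sample tweets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def hashtag_tokens(tweet):
    """Hashtag-based view (assumed): lowercased hashtags without the '#'."""
    return [h.lower() for h in re.findall(r"#(\w+)", tweet)]


def text_tokens(tweet):
    """Text-based view (assumed): words left after dropping URLs, mentions, hashtags."""
    cleaned = re.sub(r"https?://\S+|[@#]\w+", " ", tweet)
    return re.findall(r"[a-z']+", cleaned.lower())


def frequency_vectors(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Return the vocabulary and TF-IDF vectors using tf * log(N / df)."""
    vocab, freq = frequency_vectors(docs)
    n_docs = len(docs)
    # Document frequency is at least 1 because the vocabulary comes from the docs.
    df = [sum(1 for row in freq if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in freq]


# Made-up tweets, used only to exercise the functions.
tweets = [
    "Loving the new phone! #happy #love",
    "Worst service ever, so disappointed #angry",
    "What a great day #happy",
]

hashtag_docs = [hashtag_tokens(t) for t in tweets]
text_docs = [text_tokens(t) for t in tweets]

hashtag_vocab, hashtag_freq = frequency_vectors(hashtag_docs)
_, hashtag_tfidf = tfidf_vectors(hashtag_docs)
text_vocab, text_tfidf = tfidf_vectors(text_docs)
```

Either matrix (`hashtag_freq`, `hashtag_tfidf`, `text_tfidf`) could then be passed to a clustering method with two clusters, one per sentiment label.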
The experimental results show good values for the Silhouette and Davies-Bouldin measures in the unsupervised setting. A detailed investigation considers several factors (dataset, clustering method, data representation, and word embedding) to identify the setup that best improves the quality of sentiment detection in Twitter messages. The analysis and conclusions show that these first results can serve as a basis for more complex experiments.
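For readers unfamiliar with the two measures, the following sketch computes them from their standard definitions with Euclidean distance. The function names and the toy points are assumptions for illustration, not the paper's data or code. A higher Silhouette value (closer to 1) and a lower Davies-Bouldin value indicate more compact, better separated clusters.

```python
import math


def _group(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return groups


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    groups = _group(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = groups[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in groups.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index (needs at least 2 clusters with distinct centroids)."""
    groups = _group(points, labels)
    centroids = {
        label: [sum(coord) / len(pts) for coord in zip(*pts)]
        for label, pts in groups.items()
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in pts) / len(pts)
        for label, pts in groups.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in groups
            if j != i
        )
        for i in groups
    ]
    return sum(worst) / len(worst)


# Toy 2-D points standing in for vectorized tweets (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # pairs distant points

good_sil = silhouette(points, good_labels)
bad_sil = silhouette(points, bad_labels)
good_db = davies_bouldin(points, good_labels)
bad_db = davies_bouldin(points, bad_labels)
```

On this toy data the well-separated labeling scores a higher Silhouette and a lower Davies-Bouldin value than the mismatched one, which is the direction in which both measures are read.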
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
When the article is accepted for publication, I, as the author and representative of the coauthors, hereby agree to transfer to Studia Universitatis Babes-Bolyai, Series Informatica, all rights, including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the author specifically retains: the right to make further copies of all or part of the published article for my use in classroom teaching; the right to reuse all or part of this material in a review or in a textbook of which I am the author; the right to make copies of the published work for internal distribution within the institution that employs me.