Word and Punctuation N-Gram Features in Romanian Authorship Attribution
Abstract
This study addresses authorship attribution (AA) for Romanian texts using N-gram features, with an emphasis on semantically independent representations. Whereas character N-grams have been studied previously, this work extends the exploration to word and part-of-speech (POS) N-grams, as well as to combinations involving punctuation, closed-class words, and filtered content words. Using the ROST corpus, we evaluate six supervised learning algorithms, averaging the results over multiple runs to ensure robustness. Our experiments show that Artificial Neural Networks (ANN) consistently achieve the highest performance, with word unigrams augmented by punctuation reaching an average macro-accuracy of 0.93. Semantically independent features, such as closed-class words and POS tags substituted for nouns and verbs, yield small further improvements. These findings highlight the effectiveness of carefully designed N-gram features for Romanian AA and suggest that semantically independent representations can complement traditional lexical approaches.
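To make the feature families named above concrete, the following minimal Python sketch extracts word N-grams with punctuation kept as separate tokens, and a masked variant that keeps only closed-class words and punctuation. It is an illustration under stated assumptions, not the pipeline used in the study: the tokenizer regex, the tiny `CLOSED_CLASS` list, the `<W>` placeholder, and all function names are hypothetical, and the masking step is a simplified stand-in for the POS-based replacement of nouns and verbs, which would require a POS tagger.

```python
import re
from collections import Counter

# Assumed tokenizer: a word (Unicode letters/digits, optionally hyphenated)
# or a single punctuation mark. Not the tokenizer used in the study.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

# Tiny illustrative sample of Romanian closed-class words (hypothetical list).
CLOSED_CLASS = {"și", "dar", "în", "de", "la", "cu", "pe", "un", "o", "să", "nu"}


def tokenize(text, keep_punctuation=True):
    """Lowercase the text and split it into word and punctuation tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    if not keep_punctuation:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens


def mask_content_words(tokens, placeholder="<W>"):
    """Keep closed-class words and punctuation; replace every other word."""
    return [
        t if t in CLOSED_CLASS or not re.match(r"\w", t) else placeholder
        for t in tokens
    ]


def ngram_counts(tokens, n=1):
    """Count contiguous token N-grams, each represented as a tuple."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


text = "Ana are mere, dar nu are pere."
tokens = tokenize(text)

# Word unigrams with punctuation marks counted as features of their own.
unigrams = ngram_counts(tokens, n=1)

# Bigrams over a representation with content words masked out.
masked_bigrams = ngram_counts(mask_content_words(tokens), n=2)
```

The resulting counters can be turned into per-document feature vectors (for example, relative frequencies over a fixed vocabulary) before being passed to a classifier. Representing punctuation as ordinary tokens lets the same N-gram routine serve both the lexical and the masked variants.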

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Transfer of copyright agreement: When the article is accepted for publication, I, as the author and the representative of the coauthors, hereby agree to transfer to Studia Universitatis Babes-Bolyai, Series Informatica, all rights, including those pertaining to electronic forms and transmissions, under existing copyright laws, except for the following, which the authors specifically retain: the authors can use the material however they want as long as it fits the NC ND terms of the license. The authors have all rights for reuse according to the below license.