TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Authors: Mihai Nadǎş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran
Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end…

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

IEEE Access
Authors: Mihai Nadǎş, Laura Dioşan, Andreea Tomescu
Abstract: This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data…

Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study (2025)

Authors: Mihai Nadǎş, Laura Dioşan
Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini…

Machine-Learning-Based Approaches for Multi-Level Sentiment Analysis of Romanian Reviews (2024)

Mathematics
Authors: Anamaria Briciu, A. Călin, Diana-Lucia Miholca, Cristiana Moroz-Dubenco, Vladiela Petrașcu, George Dascălu
Abstract: Sentiment analysis has increasingly gained significance in commercial settings, driven by the rising impact of reviews on purchase decision-making in recent years. This research conducts a thorough examination of the suitability of machine learning and deep learning approaches for…

A Lexicon-based Feature for Twitter Sentiment Analysis (2022)

International Conference on Intelligent Computer Communication and Processing
Authors: Sergiu Limboi, L. Dioşan
Abstract: Twitter Sentiment Analysis poses several challenges due to the platform’s features (e.g., short messages, colloquial style). People want to express their ideas related to personality, events, or breaking news. Social media is one of the fastest ways to express opinions, and research…

Advances in Clickbait and Fake News Detection Using New Language-independent Strategies (2021)

Journal of Communications Software and Systems
Authors: C. Coste, Darius Bufnea
Abstract: Online publishers rely on different techniques to trap web visitors, clickbait being one such technique. Besides being a bad habit, clickbait is also a strong indicator for fake news spreading. Its presence in online media leads to an overall bad browsing experience…