TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and…

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances…

Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study (2025)

Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini 1.0 Pro, Meta’s Llama 2 and…

Machine-Learning-Based Approaches for Multi-Level Sentiment Analysis of Romanian Reviews (2024)

Sentiment analysis has increasingly gained significance in commercial settings, driven by the rising impact of reviews on purchase decision-making in recent years. This research conducts a thorough examination of the suitability of machine learning and deep learning approaches for sentiment analysis, using Romanian reviews as a case study, with the aim of gaining insights…

A Lexicon-based Feature for Twitter Sentiment Analysis (2022)

Twitter sentiment analysis poses several challenges due to the platform’s features (e.g., short messages and a colloquial style). People want to express their ideas related to personality, events, or breaking news. Social media is one of the fastest ways to express opinions, and research directions have been developed to analyze the polarity of written messages. Very…

Advances in Clickbait and Fake News Detection Using New Language-independent Strategies (2021)

Online publishers rely on different techniques to trap web visitors, clickbait being one such technique. Besides being a bad habit, clickbait is also a strong indicator of the spread of fake news. Its presence in online media leads to an overall bad browsing experience for the web consumer. Recently, big players on the Internet scene, such…