TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Authors Mihai Nadǎş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran Abstract Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end…

Building Large-Scale English-Romanian Literary Translation Resources with Open Models (2025)

Authors Mihai Nadǎş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran Abstract Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TinyFabulist Translation Framework (TF2), a unified framework for dataset…

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

IEEE Access Authors Mihai Nadǎş, Laura Dioşan, Andreea Tomescu Abstract This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data…

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)

arXiv.org Authors Mihai Nadǎş, Laura Dioşan, Andrei Piscoran, Andreea Tomescu Abstract Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by…

UICVD: A Computer Vision UI Dataset for Training RPA Agents (2024)

International Conference on Evaluation of Novel Approaches to Software Engineering Authors Madalina Dicu, Adrian Sterca, Camelia Chira, R. Orghidan Abstract This paper introduces the UICVD Dataset, a novel resource fostering advancements in Robotic Process Automation (RPA) and Computer Vision. The paper focuses on recognizing UI (User Interface) components of a web application which is…