TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Authors: Mihai Nadăş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran
Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end…

Building Large-Scale English-Romanian Literary Translation Resources with Open Models (2025)

Authors: Mihai Nadăş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran
Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, translation by small open models remains an open problem. We contribute to this ongoing research by introducing the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset…

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

IEEE Access
Authors: Mihai Nadăş, Laura Dioşan, Andreea Tomescu
Abstract: This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data…

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)

arXiv.org
Authors: Mihai Nadăş, Laura Dioşan, Andrei Piscoran, Andreea Tomescu
Abstract: Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by…
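The abstract is truncated before it names the generator, so the following is only a hedged sketch of what templated fable generation at this scale could look like: a combinatorial prompt (character, trait, moral) expanded and sent to a chat model. The template fields, model name, and client are illustrative assumptions, not the TF1-EN-3M pipeline itself.

```python
# Hypothetical sketch of templated fable generation; prompt fields and
# model name are assumptions, not the paper's actual pipeline.
from itertools import product
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # assumes an API key or local server is configured

CHARACTERS = ["a fox", "a tortoise"]
TRAITS = ["greedy", "patient"]
MORALS = ["honesty pays", "haste makes waste"]

PROMPT = (
    "Write a short moral fable featuring {character} who is {trait}. "
    "End with the explicit moral: '{moral}'."
)

for character, trait, moral in product(CHARACTERS, TRAITS, MORALS):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper targets open models
        messages=[{"role": "user",
                   "content": PROMPT.format(character=character,
                                            trait=trait, moral=moral)}],
        temperature=0.9,  # high temperature for narrative diversity
    )
    print(resp.choices[0].message.content)
```

The combinatorial expansion is what lets a small set of template slots scale to millions of distinct prompts while keeping every story paired with an explicit moral.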

Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost (2025)

arXiv.org
Authors: Mihai Nadăş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran
Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, translation by small open models remains an open problem. We contribute to this ongoing research by introducing the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for…

Evaluating Large Language Models for Diacritic Restoration in Romanian Texts: A Comparative Study (2025)

Authors: Mihai Nadăş, Laura Dioşan
Abstract: Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI’s GPT-3.5, GPT-4, GPT-4o, Google’s Gemini…
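As a rough illustration of the scoring such an evaluation implies, the snippet below compares a model's restored output against a reference at the character level, counting only positions where the reference carries a Romanian diacritic. The metric and helper are assumptions for illustration, not the study's reported protocol.

```python
# Hedged sketch: character-level diacritic restoration accuracy.
# Includes both comma-below and legacy cedilla forms of s and t.
ROMANIAN_DIACRITICS = set("ăâîșşțţĂÂÎȘŞȚŢ")

def diacritic_accuracy(restored: str, reference: str) -> float:
    """Fraction of reference diacritic positions restored correctly.
    Assumes the two strings are aligned character by character."""
    pairs = [(r, g) for r, g in zip(restored, reference)
             if g in ROMANIAN_DIACRITICS]
    if not pairs:
        return 1.0  # nothing to restore
    return sum(r == g for r, g in pairs) / len(pairs)

print(diacritic_accuracy("Stiinta si invatare", "Știința și învățare"))  # 0.0
print(diacritic_accuracy("Știința și învățare", "Știința și învățare"))  # 1.0
```

Scoring only diacritic-bearing positions keeps the metric from being inflated by the majority of characters that never needed restoration in the first place.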

LLM Output Compliance with Handcrafted Linguistic Features: An Experiment (2025)

International Conference on Agents and Artificial Intelligence
Authors: Andrei Olar
Abstract: Can we control the writing style of large language models (LLMs) by specifying desired linguistic features? We address this question by investigating the impact of handcrafted linguistic feature (HLF) instructions on LLM-generated text. Our experiment evaluates various state-of-the-art LLMs using prompts incorporating HLF…
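To make the idea of HLF instructions concrete, here is a minimal sketch: a prompt that pins down two easily measurable features, paired with a checker that verifies the generated text against them. The specific features and thresholds are illustrative assumptions, not the experiment's actual feature set.

```python
# Illustrative sketch only: a prompt carrying handcrafted linguistic
# feature (HLF) instructions, plus a compliance check on the output.
import re

HLF_PROMPT = (
    "Write a product description. Constraints: "
    "average sentence length between 10 and 15 words; "
    "no sentence longer than 20 words."
)

def sentence_lengths(text: str) -> list[int]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def complies(text: str) -> bool:
    """Check the output against the same thresholds the prompt specifies."""
    lengths = sentence_lengths(text)
    if not lengths:
        return False
    avg = sum(lengths) / len(lengths)
    return 10 <= avg <= 15 and max(lengths) <= 20

sample = ("This compact lamp brightens any desk with warm, even light. "
          "Its base folds flat so it travels easily in a bag.")
print(complies(sample))  # True: average 10.5 words, longest sentence 11
```

Measuring compliance programmatically, rather than by eye, is what turns "did the model follow the style instruction?" into a quantifiable experiment.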

Using ChatGPT for Malicious Web Links Detection (2024)

International Conference on Web Information Systems and Technologies
Authors: Thomas Kaisser, C. Coste
Abstract: Over the last years, the Internet has come to dominate most businesses and industries. These rapid advances have also led to the dangerous growth of specialized threats designed to outsmart everyday users and to collect personal data for financial gain. One of the most relevant attacks…
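The approach the title suggests can be sketched as a zero-shot classification call: the URL, optionally with a few simple lexical features, is placed in a prompt and the model returns a label. The prompt wording, model name, and features below are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of zero-shot malicious-URL triage with a chat model.
# Prompt, model choice, and features are illustrative assumptions only.
from urllib.parse import urlparse
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured

def classify_url(url: str) -> str:
    host = urlparse(url).hostname or ""
    # A few cheap lexical cues often used in URL triage.
    features = (f"host={host}, length={len(url)}, "
                f"digits={sum(c.isdigit() for c in url)}")
    prompt = ("Classify this URL as MALICIOUS or BENIGN.\n"
              f"URL: {url}\nLexical features: {features}\n"
              "Answer with one word.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for classification
    )
    return resp.choices[0].message.content.strip()

print(classify_url("http://paypa1-login.example.com/verify?id=123"))
```

Supplying the lexical features alongside the raw URL gives the model explicit signals (look-alike hostnames, unusual length) it might otherwise have to infer.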