TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)
Authors: Mihai Nadăş, Laura Dioşan, Andreea Tomescu, Andrei Piscoran

Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end…