synthetic data generation

TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and…

Building Large-Scale English-Romanian Literary Translation Resources with Open Models (2025)

Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, translation by small open models remains an open problem. We contribute to this ongoing research by introducing the TinyFabulist Translation Framework (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translations, centred on…

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances…

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a…

Matching Apictorial Puzzle Pieces Using Deep Learning (2024)

Finding matches between puzzle pieces is a difficult problem relevant to applications that involve restoring broken objects. The main difficulty comes from the similarity of the puzzle pieces and the very small difference between a pair of pieces that almost match and one that does. The proposed solution is based on deep learning models…

UICVD: A Computer Vision UI Dataset for Training RPA Agents (2024)

This paper introduces the UICVD Dataset, a novel resource fostering advancements in Robotic Process Automation (RPA) and Computer Vision. The paper focuses on recognizing UI (User Interface) components of a web application, a task that is less well known in computer vision than recognizing real objects in images. This dataset derives from…