TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction (2026)

Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and…
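The abstract names tokenizer design as the first stage of the pipeline. A minimal sketch of what that stage could look like, assuming a byte-level BPE tokenizer trained with the Hugging Face tokenizers library (the corpus file and vocabulary size below are illustrative, not values from the paper):

```python
# Minimal sketch: training a byte-level BPE tokenizer on a Romanian corpus.
# The file name and hyperparameters are illustrative assumptions, not details
# taken from the TF3-RO-50M paper.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["ro_corpus.txt"],          # hypothetical plain-text corpus
    vocab_size=32_000,                # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)
tokenizer.save_model("tokenizer-ro")  # writes vocab.json and merges.txt
```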

Building Large-Scale English-Romanian Literary Translation Resources with Open Models (2025)

Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, literary translation with small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation, centred on…
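As a rough illustration of the dataset-creation step such a framework covers, the sketch below translates a line of English prose into Romanian with a small open model via Hugging Face transformers; the checkpoint and example text are assumptions, not TF2's actual setup:

```python
# Minimal sketch: producing EN->RO translations with a small open model.
# The checkpoint is an illustrative assumption; TF2's actual models and
# prompts are described in the paper, not here.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ro")

english_line = "The fox praised the crow's voice, and the cheese fell."
result = translator(english_line, max_length=128)
print(result[0]["translation_text"])
```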

Using Community Detection in Adolescent Media Multitasking Research. An Exploratory Study (2025)

In this exploratory study, we applied a community detection approach from complex network analysis to examine the temperamental and executive-functioning profiles of media multitaskers in early adolescence. Media multitasking is particularly intense in adolescence (Smahel et al., 2020), with implications for short- and long-term functioning (van der Schuur et al., 2015, 2020). Temperament and…
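For readers unfamiliar with the technique, the sketch below runs greedy modularity community detection (via networkx) on a small correlation network of hypothetical temperament and executive-functioning measures; the variables, threshold, and data are invented for illustration and do not reproduce the study's network:

```python
# Minimal sketch: community detection on a network of temperament/executive-
# functioning variables. Variables, threshold, and data are hypothetical;
# the study's actual network construction may differ.
from networkx.algorithms.community import greedy_modularity_communities
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))                 # 200 participants, 6 measures
labels = ["EC", "NA", "SU", "WM", "INH", "SHF"]  # hypothetical scale names

corr = np.corrcoef(data, rowvar=False)
G = nx.Graph()
G.add_nodes_from(labels)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if abs(corr[i, j]) > 0.05:               # assumed edge threshold
            G.add_edge(labels[i], labels[j], weight=abs(corr[i, j]))

communities = greedy_modularity_communities(G, weight="weight")
for k, comm in enumerate(communities):
    print(f"Community {k}: {sorted(comm)}")
```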

Evaluating Deep Learning Models for Cross-Platform UI Component Detection: A Study Across Web, Desktop, and Mobile Interfaces (2025)

User interfaces look different across web, desktop, and mobile platforms — not just in layout, but in how buttons, icons, and text appear. This makes it hard for deep learning models trained on one platform to accurately detect UI components on another. In this paper, we evaluate the cross-domain generalization of three modern object…
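Evaluating such detectors rests on matching predicted boxes to ground truth by intersection-over-union (IoU); the standard computation, shown here as a minimal sketch with made-up box coordinates:

```python
# Minimal sketch: intersection-over-union (IoU), the standard overlap measure
# used to match predicted UI-component boxes to ground truth when evaluating
# object detectors. Boxes are (x_min, y_min, x_max, y_max); a detection is
# typically counted correct when IoU exceeds a threshold such as 0.5.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted button box vs. its ground-truth annotation.
print(iou((10, 10, 110, 50), (20, 12, 115, 48)))  # ~0.78
```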

Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment or even substitute for real-world datasets, particularly in scenarios where labeled data is scarce, expensive, or sensitive. This paper surveys recent advances…
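A minimal sketch of the basic recipe the survey covers, prompting an instruction-tuned LLM for labeled synthetic examples; the model checkpoint, prompt, and output format are illustrative assumptions rather than a method from any one surveyed paper:

```python
# Minimal sketch: generating synthetic labeled training examples with an
# instruction-tuned LLM. Model choice, prompt, and label set are illustrative
# assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompt = (
    "Write one short customer review and label it POSITIVE or NEGATIVE.\n"
    "Format: <review> ||| <label>\n"
)
out = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.9)
text = out[0]["generated_text"][len(prompt):]
review, _, label = text.partition("|||")
print(review.strip(), "->", label.strip())
```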

Unveiling Hybrid Cyclomatic Complexity: A Comprehensive Analysis and Evaluation as an Integral Feature in Automatic Defect Prediction Models (2025)

The complex software systems developed today require assessment of their quality and proneness to errors. Reducing code complexity is a never-ending problem, especially given the fast pace of modern software development. The industry therefore needs a method to determine the quality of a software system, the degree of difficulty in developing new functionalities,…
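For context, classic McCabe cyclomatic complexity can be counted as one plus the number of decision points in a function; the hybrid metric studied in the paper builds on this base. A minimal sketch for Python source, using only the standard ast module:

```python
# Minimal sketch: McCabe's cyclomatic complexity for a Python function,
# counted as 1 + the number of decision points in its AST. The paper's
# "hybrid" metric combines further factors; this shows only the classic base.
import ast

DECISIONS = (ast.If, ast.For, ast.While, ast.And, ast.Or,
             ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISIONS) for node in ast.walk(tree))

code = """
def classify(x):
    if x < 0:
        return "negative"
    for _ in range(3):
        if x % 2 == 0 and x > 10:
            return "big even"
    return "other"
"""
print(cyclomatic_complexity(code))  # 5: base 1 + if + for + if + and
```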

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a…
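The template itself is truncated above, so the sketch below only illustrates the general idea of combinatorial, slot-filled prompts for fable generation; the slot names and values are hypothetical:

```python
# Minimal sketch: building structured fable-generation prompts from slot
# combinations. Slot names and values are illustrative assumptions; the
# paper's actual template is truncated in the abstract above.
from itertools import product

characters = ["a proud fox", "a patient tortoise"]
settings   = ["an old orchard", "a frozen river"]
morals     = ["pride invites a fall", "patience outlasts speed"]

TEMPLATE = (
    "Write a short moral fable featuring {char} in {setting}. "
    "End with the explicit lesson: {moral}."
)

prompts = [
    TEMPLATE.format(char=c, setting=s, moral=m)
    for c, s, m in product(characters, settings, morals)
]
print(len(prompts))   # 8 combinations from 2 x 2 x 2 slots
print(prompts[0])
```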

Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost (2025)

Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, literary translation with small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation, centred…
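Low-resource translation comparisons of this kind are typically reported with corpus-level metrics such as BLEU and chrF; a minimal scoring sketch with sacreBLEU, using toy Romanian sentences rather than TF2 data:

```python
# Minimal sketch: scoring system translations against references with
# sacreBLEU's corpus BLEU and chrF. The sentences are toy placeholders;
# the paper's exact metric suite may differ.
import sacrebleu

hypotheses = ["Vulpea a lăudat vocea corbului."]
references = [["Vulpea a lăudat glasul corbului."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```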