TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a…

PyResolveMetrics: A Standards-Compliant and Efficient Approach to Entity Resolution Metrics (2024)

Entity resolution, the process of determining whether multiple data records refer to the same real-world entity, is crucial across many domains, including education. Assessing its quality is vital given its extensive practical applications in fields such as analytics, personalized learning, and academic integrity. With Python emerging as the predominant programming language in these areas,…