{"id":1261,"date":"2026-01-25T19:34:19","date_gmt":"2026-01-25T19:34:19","guid":{"rendered":"https:\/\/www.cs.ubbcluj.ro\/~meco\/tf1-en-3m-three-million-synthetic-moral-fables-for-training-small-open-language-models-2025\/"},"modified":"2026-02-01T12:07:20","modified_gmt":"2026-02-01T12:07:20","slug":"tf1-en-3m-three-million-synthetic-moral-fables-for-training-small-open-language-models-2025","status":"publish","type":"post","link":"https:\/\/www.cs.ubbcluj.ro\/~meco\/tf1-en-3m-three-million-synthetic-moral-fables-for-training-small-open-language-models-2025\/","title":{"rendered":"TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)"},"content":{"rendered":"<div class=\"entry-content\">\n<p>arXiv.org<\/p>\n<h2>Authors<\/h2>\n<p>Mihai Nad\u01ce\u015f, Laura Dio\u015fan, Andrei Piscoran, Andreea Tomescu<\/p>\n<h2>Abstract<\/h2>\n<p>Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -&gt;trait -&gt;setting -&gt;conflict -&gt;resolution -&gt;moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (&lt;24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.<\/p>\n<h2>Citation<\/h2>\n<pre class=\"wp-block-preformatted\">@Inproceedings{Nad\u01ce\u015f2025TF1EN3MTM,\n author = {Mihai Nad\u01ce\u015f and Laura Dio\u015fan and Andrei Piscoran and Andreea Tomescu},\n booktitle = {arXiv.org},\n title = {TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models},\n year = {2025}\n}<\/pre>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character ->trait ->setting ->conflict ->resolution ->moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. 
## Citation

```bibtex
@inproceedings{Nadas2025TF1EN3MTM,
  author    = {Mihai Nadăș and Laura Dioșan and Andrei Piscoran and Andreea Tomescu},
  booktitle = {arXiv.org},
  title     = {TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models},
  year      = {2025}
}
```