Synthetic Data Generation Using Large Language Models: Advances in Text and Code (2025)

Abstract

Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, they can significantly augment or even substitute for real-world datasets, particularly when labeled data is scarce, expensive, or sensitive. This survey reviews recent advances in leveraging LLMs to create synthetic text and code, highlighting key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We examine how these methods enrich low-resource tasks (e.g., classification and question answering) and facilitate code-centric applications (e.g., instruction tuning, code translation, and bug repair) through automated verification of functional correctness. Alongside the potential benefits of cost-effectiveness, broad coverage, and controllable diversity, we discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and the risk of bias amplification. Proposed mitigation strategies range from filtering and weighting synthetic outputs to reinforcement learning with execution feedback in code domains. We conclude by outlining open research directions, such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, underscoring the growing importance of LLM-generated synthetic data in accelerating AI development while emphasizing ethical and quality safeguards.
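To make the text-side pipeline concrete, here is a minimal sketch of prompt-based generation followed by a filtering pass, the first of the mitigation strategies the abstract names. Everything in it is an illustrative assumption rather than the specific pipeline from the paper: call_llm is a hypothetical stub standing in for any chat-completion API, and the sentiment task, prompt template, and length filter are placeholders.

import json
import random

# Assumed toy task: binary sentiment classification.
LABELS = ["positive", "negative"]

PROMPT = (
    "Write one short product review expressing a {label} sentiment. "
    "Return only the review text."
)

def call_llm(prompt: str) -> str:
    # Hypothetical stub so the sketch runs end to end; swap in a real
    # chat-completion client from your provider here.
    return "Placeholder review text generated for: " + prompt[:40]

def generate_examples(n: int) -> list[dict]:
    # Prompt-based generation: sample a target label, then ask the
    # model to produce a matching example.
    examples = []
    for _ in range(n):
        label = random.choice(LABELS)
        text = call_llm(PROMPT.format(label=label))
        examples.append({"text": text.strip(), "label": label})
    return examples

def keep(example: dict) -> bool:
    # Toy quality filter: reject empty or implausibly short/long
    # outputs. Real pipelines also deduplicate and score candidates
    # with a classifier or with the LLM itself.
    return 10 <= len(example["text"]) <= 500

if __name__ == "__main__":
    synthetic = [ex for ex in generate_examples(100) if keep(ex)]
    with open("synthetic_train.jsonl", "w") as f:
        for ex in synthetic:
            f.write(json.dumps(ex) + "\n")
    print(f"kept {len(synthetic)} of 100 generated examples")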
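On the code side, "automated verification of functional correctness" typically means accepting a generated snippet only if it passes paired unit tests; the same signal can drive reinforcement learning with execution feedback. The sketch below illustrates that filtering idea under the same caveat: passes_tests and the add example are hypothetical illustrations, and running untrusted model output with exec() is unsafe outside a sandbox.

def passes_tests(candidate_src: str, test_src: str) -> bool:
    # Execute the candidate and its assert-based tests in a fresh
    # namespace; accept the sample only if nothing raises.
    # WARNING: exec() on model output is unsafe outside a sandbox.
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the paired unit tests
        return True
    except Exception:
        return False

if __name__ == "__main__":
    candidate = "def add(a, b):\n    return a + b\n"
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(passes_tests(candidate, tests))  # True: keep this sample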

Citation

@article{Nadas2025SyntheticDG,
  author  = {Mihai Nadǎş and Laura Dioşan and Andreea Tomescu},
  journal = {IEEE Access},
  title   = {Synthetic Data Generation Using Large Language Models: Advances in Text and Code},
  year    = {2025}
}
