Synthetic tabular data generation is increasingly essential in machine learning, supporting downstream applications when real-world, high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs) and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific datasets with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often generate distributionally drifted data with localized redundancy, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) extracts symbolic if-then rules from interpretable models and embeds them into prompts to explicitly guide the generation process toward the domain-specific distribution, and (ii) applies dual-granularity filtering that mitigates over-sampling patterns while preserving rare but informative samples to reduce localized redundancy. Extensive experiments on diverse benchmarks demonstrate that ReFine provides robust downstream utility, achieving a top-tier average rank across datasets and data regimes, with an average relative improvement of 7.48% in extreme low-data regimes.
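As a rough illustration of component (i), the sketch below turns a small decision tree into if-then rules and embeds them in a generation prompt. It is not the authors' implementation. The nested-dict tree, the column names (`age`, `income`, `label`), and the helpers `extract_rules` and `build_prompt` are all hypothetical, and the abstract does not say which interpretable models or prompt format ReFine uses.

```python
# Hypothetical toy tree standing in for a fitted interpretable model.
# Internal nodes test `feature <= threshold`; leaves hold a class label.
TOY_TREE = {
    "feature": "age",
    "threshold": 40,
    "left": {
        "feature": "income",
        "threshold": 30000,
        "left": "no",
        "right": "yes",
    },
    "right": "yes",
}


def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path and render it as one if-then rule."""
    if not isinstance(node, dict):  # leaf: emit the accumulated conditions
        cond = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {cond} THEN label = {node}"]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + (f"{feature} <= {threshold}",))
    right = extract_rules(node["right"], conditions + (f"{feature} > {threshold}",))
    return left + right


def build_prompt(rules, examples, n_rows):
    """Assemble a plain-text prompt: task, rules, then reference rows."""
    lines = [f"Generate {n_rows} new rows that follow these rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("Reference rows:")
    lines += [f"- {row}" for row in examples]
    return "\n".join(lines)


RULES = extract_rules(TOY_TREE)
PROMPT = build_prompt(RULES, ["age=35, income=52000, label=yes"], n_rows=5)
print(PROMPT)
```

Each rule corresponds to one leaf, so the prompt states the decision boundaries explicitly instead of leaving the LLM to infer them from a handful of reference rows.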
Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
- Year
- 2025
Attribution
- Abstract & full text
- arxiv.org/abs/2509.09960