Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes

Year: 2025
Abstract & full text: arxiv.org/abs/2509.09960

Abstract

Synthetic tabular data generation is increasingly essential in machine learning, supporting downstream applications when real-world, high-quality tabular data is insufficient. Existing tabular generation approaches, such as generative adversarial networks (GANs) and fine-tuned Large Language Models (LLMs), typically require sufficient reference data, limiting their effectiveness in domain-specific datasets with scarce records. While prompt-based LLMs offer flexibility without parameter tuning, they often generate distributionally drifted data with localized redundancy, leading to degradation in downstream task performance. To overcome these issues, we propose ReFine, a framework that (i) extracts symbolic if-then rules from interpretable models and embeds them into prompts to explicitly guide the generation process toward the domain-specific distribution, and (ii) applies dual-granularity filtering that mitigates over-sampling patterns while preserving rare but informative samples to reduce localized redundancy. Extensive experiments on diverse benchmarks demonstrate that ReFine provides robust downstream utility, achieving a top-tier average rank across datasets and data regimes, with an average relative improvement of 7.48% in extreme low-data regimes.
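
To make component (i) concrete, the following is a minimal sketch of turning an interpretable model into if-then rules and placing them in a generation prompt. It is an illustration under stated assumptions, not the ReFine implementation: the toy tree `TREE`, the column names, the rule syntax, the prompt wording, and the helpers `extract_rules` and `build_prompt` are all hypothetical, since the abstract does not specify which interpretable model or prompt template is used.

```python
# Hypothetical stand-in for a fitted interpretable model: a tiny decision tree
# stored as nested dicts. Internal nodes hold a feature and a threshold;
# leaves hold a predicted label. All names and values are made up.
TREE = {
    "feature": "age",
    "threshold": 40,
    "left": {"label": "low_risk"},
    "right": {
        "feature": "bmi",
        "threshold": 30.0,
        "left": {"label": "low_risk"},
        "right": {"label": "high_risk"},
    },
}


def extract_rules(node, conditions=()):
    """Walk every root-to-leaf path and return one if-then rule per leaf."""
    if "label" in node:
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN target = {node['label']}"]
    feat, thr = node["feature"], node["threshold"]
    # Left branch means feature <= threshold, right branch means feature > threshold.
    left = extract_rules(node["left"], conditions + (f"{feat} <= {thr}",))
    right = extract_rules(node["right"], conditions + (f"{feat} > {thr}",))
    return left + right


def build_prompt(rules, columns, n_rows):
    """Embed the extracted rules in a plain-text generation prompt (assumed wording)."""
    rule_block = "\n".join(f"- {r}" for r in rules)
    return (
        f"Generate {n_rows} synthetic rows with columns: {', '.join(columns)}.\n"
        "Each row should be consistent with these rules:\n"
        f"{rule_block}\n"
        "Return the rows as CSV without a header."
    )


rules = extract_rules(TREE)
prompt = build_prompt(rules, ["age", "bmi", "target"], n_rows=5)
print(prompt)
```

In practice the rules would come from a model fitted on the scarce reference records rather than a hand-written tree; the point of the sketch is only that each root-to-leaf path becomes one explicit constraint the LLM sees in its prompt.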