LLMTabBench: Evaluating LLMs on Binary Tabular Classification From Zero to Few Shots

Year: 2026
Hosting: Full text hosted (CC-BY-4.0)

Abstract & full text: arxiv.org/abs/2605.24417 (CC-BY-4.0)

Abstract

Supervised classification on tabular data remains a central machine learning task, but its dependence on large labeled datasets limits its applicability in data-scarce settings. Few-shot methods such as TabPFN achieve strong performance through large-scale synthetic pretraining, yet still require labeled context examples. Large Language Models (LLMs) offer a more flexible alternative through zero- and few-shot in-context learning from task descriptions, but their behavior on tabular data remains inconsistent. We introduce LLMTabBench, a benchmark for evaluating LLMs on tabular classification under low-data conditions. The benchmark studies how LLM prior knowledge interacts with task descriptions and few-shot examples, and how performance changes with increasing data complexity across real-world and controlled synthetic datasets. We find that LLMs can be highly competitive in zero-shot settings, sometimes outperforming models given few-shot examples. However, additional examples may conflict with prior knowledge, thereby degrading performance. We also observe a complexity threshold at which LLM performance declines and few-shot examples become less useful. These results clarify key limits of in-context learning for tabular data and inform the deployment of LLMs in low-data regimes.
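
To make the zero- and few-shot setting concrete, the following is a minimal sketch of how a tabular row might be serialized into a prompt for binary classification. It is not the benchmark's actual prompt format, which the abstract does not specify. The function names (`serialize_row`, `build_prompt`), the "Input:"/"Label:" template, and the feature names in the usage note are all hypothetical choices made for illustration.

```python
from typing import Mapping, Sequence, Tuple

# A row maps feature names to values (hypothetical representation).
Row = Mapping[str, object]


def serialize_row(row: Row) -> str:
    """Turn one tabular row into a flat 'name = value' string."""
    return ", ".join(f"{name} = {value}" for name, value in row.items())


def build_prompt(
    task_description: str,
    query: Row,
    examples: Sequence[Tuple[Row, int]] = (),
) -> str:
    """Assemble a prompt from a task description, optional labeled
    examples (few-shot), and the row to classify.

    With no examples this yields a zero-shot prompt; the template
    itself is an assumption, not the format used by LLMTabBench.
    """
    lines = [task_description.strip(), "Answer with 0 or 1.", ""]
    for features, label in examples:
        lines.append(f"Input: {serialize_row(features)}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query row comes last, with the label left open for the model.
    lines.append(f"Input: {serialize_row(query)}")
    lines.append("Label:")
    return "\n".join(lines)
```

Calling `build_prompt("Predict whether income exceeds 50K.", {"age": 39, "education": "Bachelors"})` gives a zero-shot prompt, while passing a short list of `(row, label)` pairs as `examples` gives the few-shot variant. Keeping the task description separate from the examples makes it easy to vary each independently, which is the kind of interaction between prior knowledge, description, and examples that the abstract describes.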