We develop a systematic benchmark set to test the generalization of state-of-the-art large language models on broader problems beyond linguistic tasks and evaluate it on a systematic progression of GPT models (GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini). Using well-known simple games like Tic-Tac-Toe, Connect Four, and Battleship, all encoded in ASCII, we test their strategic capabilities and spatial reasoning. To probe generalization, we introduce three new games: LEGO Connect Language (LCL) for spatial logic, a shape recognition game, and Guess-the-SMILES (GtS), an advanced spatial logic benchmark in chemistry. Results show that, despite proficiency in standard benchmarks, GPT models perform poorly in these games, failing to anticipate losing moves, play correctly, or recognize spatial relationships. Except for Tic-Tac-Toe and GtS, gameplay performance does not improve systematically across successive model versions (GPT-3.5, GPT-4, GPT-4o). GPT-4 succeeds in shape recognition, but all models consistently struggle with LCL and GtS. This suggests that while GPT models can emulate conversational proficiency and basic rule comprehension, they have limited cognitive flexibility and generalization in strategy and spatial reasoning. Our findings, highlighted with our benchmark suite (ChildPlay GitHub Repository), caution against claims of emergent intelligence in GPT models, which appear more specialized than general.
Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay
A benchmark suite tests GPT models' generalization in strategic and spatial reasoning beyond linguistic tasks, revealing limited performance in specific games and chemistry-related benchmarks.
- Year: 2024
- Venue: arXiv 2024
- Authors: 3
Attribution
- Abstract & full text: arxiv.org/abs/2407.11068v4
- TL;DR: Semantic Scholar