logic
Google Research
23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
BigBenchHard (BBH) evaluation environment with Chain-of-Thought
Prime Community
BIG-Bench + BBH implementation
OpenOrca Team
An open reproduction of Microsoft's Orca recipe: FLAN prompts paired with GPT-4 chain-of-thought completions that teach reasoning by imitation.
A heavily deduplicated, GPT-4-only slice of OpenOrca that delivers similar downstream quality at one-third the size.
by avg parsed score across evals here