logic

Evals testing this capability

Google Research

23 challenging multi-step reasoning tasks drawn from BIG-Bench, on which prior language models did not outperform the average human rater.

BigBenchHard (BBH) evaluation environment with Chain-of-Thought

Prime Community

BIG-Bench + BBH implementation

OpenOrca Team

An open reproduction of Microsoft's Orca recipe: FLAN prompts paired with GPT-4 chain-of-thought completions, used to teach reasoning by imitation.

OpenOrca Team

A heavily deduplicated, GPT-4-only slice of OpenOrca that delivers similar downstream quality at one-third the size.

Bar chart (logic): 21 models ranked by average parsed score across the evals listed here. Highest value: Internlm2 5 20B Chat at 74.7.