Samuel R. Bowman

Debating with More Persuasive LLMs Leads to More Truthful Answers

arXiv 2024

Alignment faking in large language models

arXiv 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

Language Models Learn to Mislead Humans via RLHF

arXiv 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

arXiv 2024

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

COLM

Improving Code Generation by Training with Natural Language Feedback

arXiv 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

arXiv 2023

Pretraining Language Models with Human Preferences

arXiv 2023

Debate Helps Supervise Unreliable Experts

arXiv 2023

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

arXiv 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

Discovering Language Model Behaviors with Model-Written Evaluations

arXiv 2022

Instruction Induction: From Few Examples to Natural Language Task Descriptions

arXiv 2022

SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

arXiv 2022

Constitutional AI: Harmlessness from AI Feedback

arXiv 2022