Ethan Perez

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

arXiv 2024

Language Models Learn to Mislead Humans via RLHF

arXiv 2024

Best-of-N Jailbreaking

arXiv 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers

arXiv 2024

Alignment faking in large language models

arXiv 2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

arXiv 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection

arXiv 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

arXiv 2024

Pretraining Language Models with Human Preferences

arXiv 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

arXiv 2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

arXiv 2023

Improving Code Generation by Training with Natural Language Feedback

arXiv 2023

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

arXiv 2023

Training Language Models with Language Feedback at Scale

arXiv 2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

arXiv 2022

Discovering Language Model Behaviors with Model-Written Evaluations

arXiv 2022

Few-shot Adaptation Works with UnpredicTable Data

arXiv 2022

Constitutional AI: Harmlessness from AI Feedback

arXiv 2022