
Ethan Perez

Papers
22

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

22

Inverse Scaling in Test-Time Compute

arXiv 2025

2025

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

arXiv 2024

2024

Debating with More Persuasive LLMs Leads to More Truthful Answers

arXiv 2024

2024

Alignment faking in large language models

arXiv 2024

2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

arXiv 2024

2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

arXiv 2024

2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

2024

Best-of-N Jailbreaking

arXiv 2024

2024

Language Models Learn to Mislead Humans via RLHF

arXiv 2024

2024

Looking Inward: Language Models Can Learn About Themselves by Introspection

arXiv 2024

2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

arXiv 2024

2024

Pretraining Language Models with Human Preferences

arXiv 2023

2023

Improving Code Generation by Training with Natural Language Feedback

arXiv 2023

2023

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

arXiv 2023

2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

arXiv 2023

2023

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

arXiv 2023

2023

Training Language Models with Language Feedback at Scale

arXiv 2023

2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

arXiv 2022

2022

Discovering Language Model Behaviors with Model-Written Evaluations

arXiv 2022

2022

Constitutional AI: Harmlessness from AI Feedback

arXiv 2022

2022

Few-shot Adaptation Works with UnpredicTable Data

arXiv 2022

2022

FiLM: Visual Reasoning with a General Conditioning Layer

arXiv 2017

2017

Affiliations

No known affiliations.

Frequent co-authors

10 (from 22 papers)