Attribution
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
arXiv 2024
Debating with More Persuasive LLMs Leads to More Truthful Answers
AI Control: Improving Safety Despite Intentional Subversion
arXiv 2023
from 3 papers
Ansh Radhakrishnan
Buck Shlegeris
Ethan Perez
Ryan Greenblatt
Samuel R. Bowman
Adam Jermyn
Akbir Khan
Amanda Askell
researcher
Carson Denison
Cem Anil