Buck Shlegeris
5 papers
Authored papers
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
arXiv 2024
Alignment faking in large language models
arXiv 2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
arXiv 2024
AI Control: Improving Safety Despite Intentional Subversion
arXiv 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
arXiv 2022
Affiliations
No known affiliations.
Frequent co-authors
10 from 5 papers
Carson Denison
David Duvenaud
Ethan Perez
Evan Hubinger
Jared Kaplan (co-founder / Chief Science Officer)
Monte MacDiarmid
Ryan Greenblatt
Samuel R. Bowman
Fabien Roger
Fazl Barez