Buck Shlegeris

Papers: 5

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

arXiv 2024

Alignment faking in large language models

arXiv 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

AI Control: Improving Safety Despite Intentional Subversion

arXiv 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

arXiv 2022

Affiliations

No known affiliations.

Frequent co-authors

from 5 papers

Carson Denison

David Duvenaud

Ethan Perez

Evan Hubinger

Jared Kaplan (co-founder / Chief Science Officer)

Monte MacDiarmid

Ryan Greenblatt

Samuel R. Bowman

Fabien Roger

Fazl Barez