Attribution
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
arXiv 2024
Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
arXiv 2023
from 2 papers
Caden Juang
Garrett Baker
Rohan Subramani
Sam Wang
Severin Field