Attribution
Alignment faking in large language models
arXiv 2024
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
arXiv 2018
from 2 papers
Akbir Khan
Benjamin Wright
Buck Shlegeris
Carson Denison
Chongli Qin
David Duvenaud
Ethan Perez
Evan Hubinger
Fabien Roger
Jack Chen