Attribution
Eliciting Secret Knowledge from Language Models
arXiv 2025
Improving Alignment and Robustness with Circuit Breakers
arXiv 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
from 3 papers
Andy Zou
founder
Dan Hendrycks
director
Justin Wang
Long Phan
researcher
Maxwell Lin
Alice Gatti
Andy Zhou
Arthur Conmy
Bartosz Cywiński
Bhrugu Bharathi