Attribution
Refusal in Language Models Is Mediated by a Single Direction
arXiv 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Evaluating Superhuman Models with Consistency Checks
arXiv 2023
from 3 papers
Florian Tramer
Aaquib Syed
Ahmed Salem
Andy Arditi
Chenhao Li
Dragos Albastroiu
Edoardo Debenedetti
Giovanni Cherubin
Javier Rando
Lea Schönherr