Daniel Paleka

3 papers

Authored papers

Refusal in Language Models Is Mediated by a Single Direction

arXiv 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

arXiv 2024

Evaluating Superhuman Models with Consistency Checks

arXiv 2023

No known affiliations.

Co-authors (from 3 papers)

Florian Tramer

Aaquib Syed

Ahmed Salem

Andy Arditi

Chenhao Li

Dragos Albastroiu

Edoardo Debenedetti

Giovanni Cherubin

Javier Rando

Lea Schönherr