Dylan Hadfield-Menell

Attribution

4 papers

Authored papers

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

arXiv 2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

arXiv 2024

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

arXiv 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

arXiv 2023

No known affiliations.

Co-authors from 4 papers

Stephen Casper

Abhay Sheshadri

Aengus Lynch

Aidan Ewart

Asa Cooper Stickland

Cindy Wu

Ethan Perez

Gatlen Culp

Henry Sleight

Jacob Andreas