Dylan Hadfield-Menell
4 papers
Authored papers
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
arXiv 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
arXiv 2024
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
arXiv 2023
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
arXiv 2023
Affiliations
No known affiliations.
Frequent co-authors
10 from 4 papers