Aidan Ewart

2 papers

Authored papers

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

arXiv 2024

Sparse Autoencoders Find Highly Interpretable Features in Language Models

arXiv 2023

No known affiliations.

Co-authors (from 2 papers)

Abhay Sheshadri

Aengus Lynch

Asa Cooper Stickland

Cindy Wu

Dylan Hadfield-Menell

Ethan Perez

Henry Sleight

Hoagy Cunningham

Lee Sharkey

Logan Riggs