Stephen Casper
6 papers
Authored papers
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
arXiv 2025
Obfuscated Activations Bypass LLM Latent-Space Defenses
arXiv 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
arXiv 2024
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
arXiv 2024
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
arXiv 2023
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
arXiv 2023
Affiliations
No known affiliations.
Frequent co-authors
10 from 6 papers