Stephen Casper

Papers: 6

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

arXiv 2025

Obfuscated Activations Bypass LLM Latent-Space Defenses

arXiv 2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

arXiv 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

arXiv 2024

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

arXiv 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

arXiv 2023

Affiliations

No known affiliations.

Frequent co-authors

from 6 papers

Dylan Hadfield-Menell

Abhay Sheshadri

Aengus Lynch

Aidan Ewart

Alex Serrano

Asa Cooper Stickland

Carlos Guestrin

Cindy Wu

Erik Jenner

Ethan Perez