Samuel Marks
- Papers: 10
Authored papers
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
arXiv 2026
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
arXiv 2025
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
arXiv 2025
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
arXiv 2025
Eliciting Secret Knowledge from Language Models
arXiv 2025
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
arXiv 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv 2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
arXiv 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
arXiv 2024
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
arXiv 2024