Samuel Marks

Papers: 10

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

arXiv 2026

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

arXiv 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

arXiv 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

arXiv 2025

Eliciting Secret Knowledge from Language Models

arXiv 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

arXiv 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

arXiv 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

arXiv 2024

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

arXiv 2024

Affiliations

No known affiliations.

Frequent co-authors

from 10 papers

Adam Karvonen

4 shared papers

Can Rager

4 shared papers

Neel Nanda

David Bau

Aaron Mueller

Arnab Sen Sharma

Arthur Conmy

Bartosz Cywiński

Caden Juang

Helena Casademunt