Fazl Barez
- Papers: 10
Authored papers (10)
Precise In-Parameter Concept Erasure in Large Language Models
arXiv 2025
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
arXiv 2024
Best-of-N Jailbreaking
arXiv 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
arXiv 2024
Towards Interpreting Visual Information Processing in Vision-Language Models
arXiv 2024
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
arXiv 2024
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
arXiv 2024
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
arXiv 2023
Understanding Addition in Transformers
arXiv 2023
Interpreting Learned Feedback Patterns in Large Language Models
arXiv 2023
Frequent co-authors
10 from 10 papers
David Krueger
Ethan Perez
Philip Torr
Buck Shlegeris
Carson Denison
Clement Neo
David Duvenaud
Evan Hubinger
Jared Kaplan
co-founder / Chief Science Officer
Luke Marks