Fazl Barez

Papers: 10

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Precise In-Parameter Concept Erasure in Large Language Models

arXiv 2025

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

arXiv 2024

Towards Interpreting Visual Information Processing in Vision-Language Models

arXiv 2024

Best-of-N Jailbreaking

arXiv 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

arXiv 2024

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

arXiv 2024

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders

arXiv 2024

Interpreting Learned Feedback Patterns in Large Language Models

arXiv 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

arXiv 2023

Understanding Addition in Transformers

arXiv 2023

Affiliations

No known affiliations.

Frequent co-authors

from 10 papers

David Krueger

Ethan Perez

Philip Torr

Buck Shlegeris

Carson Denison

Clement Neo

David Duvenaud

Evan Hubinger

Jared Kaplan

co-founder / Chief Science Officer

2 shared papers

Luke Marks

2 shared papers