Nora Belrose

Papers: 9

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

Partially Rewriting a Transformer in Natural Language

arXiv 2025

Refusal in LLMs is an Affine Function

arXiv 2024

Understanding Gradient Descent through the Training Jacobian

arXiv 2024

Balancing Label Quantity and Quality for Scalable Elicitation

arXiv 2024

Does Transformer Interpretability Transfer to RNNs?

arXiv 2024

LEACE: Perfect linear concept erasure in closed form

NeurIPS 2023 11

Eliciting Latent Predictions from Transformers with the Tuned Lens

arXiv 2023

Eliciting Latent Knowledge from Quirky Language Models

arXiv 2023

Adversarial Policies Beat Superhuman Go AIs

arXiv 2022

Affiliations

No known affiliations.

Frequent co-authors

from 9 papers

Adam Scherlis

2 shared papers

Alex Mallen

2 shared papers

Gonçalo Paulo

2 shared papers

Stella Biderman

founder

Thomas Marshall

Adam Gleave

Danny Halawi

David Schneider-Joseph

1 shared paper

Edward Raff

1 shared paper

Igor Ostrovsky

1 shared paper